Production Best Practices and Troubleshooting in Kubernetes: Complete Real-Time Enterprise Guide
Running Kubernetes in production is very different from learning Kubernetes in a local lab. In development, a simple Pod, Deployment, or Service may be enough. But in production, applications must be secure, scalable, observable, cost-efficient, and highly available.
A production Kubernetes cluster must handle real users, traffic spikes, application failures, node failures, security threats, bad deployments, storage problems, and networking issues. This is why every Kubernetes engineer must understand both best practices and troubleshooting workflows.
Why Kubernetes Production Best Practices Matter
Kubernetes is powerful, but it does not automatically make applications production-ready. A poorly configured cluster can still suffer from:
- Pod crashes
- Slow applications
- Security leaks
- Unexpected downtime
- High cloud bills
- Failed deployments
- Data loss
- Difficult troubleshooting
Production readiness means building the system in a way that prevents problems and helps teams recover quickly when issues happen.
Real-Time Banking Example
Imagine a banking platform running on Kubernetes. It contains payment APIs, authentication services, account services, fraud detection services, notification services, and databases.
If Kubernetes is not production-ready:
- Payment Pods may crash during high traffic
- Database credentials may leak through poor Secrets management
- Wrong Network Policies may block transaction APIs
- No monitoring may delay incident detection
- No rollback plan may extend downtime
For banking, even a few minutes of downtime can create financial loss and customer trust issues.
Real-Time E-Commerce Example
During a festival sale, an e-commerce platform may receive 10x or 50x normal traffic.
Production best practices help the platform:
- Scale Pods automatically using HPA
- Add nodes using Cluster Autoscaler
- Prevent bad Pods from receiving traffic using readiness probes
- Detect errors using Prometheus and Grafana
- Roll back failed releases quickly
- Protect payment and database services using Network Policies
Production Kubernetes Architecture
[ Users ]
|
v
[ Ingress Controller ]
|
v
[ Application Services ]
|
+--> ConfigMaps
+--> Secrets
+--> Persistent Volumes
+--> Network Policies
+--> HPA
+--> Monitoring
+--> Logging
+--> RBAC
A production Kubernetes system is not just Deployments and Services. It is a combination of reliability, security, scaling, monitoring, and governance.
1. Resource Management Best Practices
Every production workload should define CPU and memory requests and limits.
Why?
Without resource management, one bad application can consume too much CPU or memory and affect other applications running on the same node.
Example
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Meaning
- requests help Kubernetes schedule Pods properly
- limits prevent containers from consuming unlimited resources
Resource Management Flow
Pod Created
|
v
Scheduler Checks Requests
|
v
Pod Placed on Suitable Node
|
v
Container Starts
|
v
Limits Prevent Resource Abuse
Common Resource Issues
- Pods stuck in Pending because requests are too high
- Pods killed with OOMKilled because memory limit is too low
- Application slow because CPU is throttled
- Nodes overloaded because limits are missing
2. Namespace and Quota Best Practices
Use namespaces to separate environments and teams.
[ Kubernetes Cluster ]
|
+-- dev
+-- qa
+-- staging
+-- production
+-- monitoring
Apply ResourceQuotas and LimitRanges to prevent one namespace from consuming all cluster resources.
ResourceQuota Example
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
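When a quota covers CPU and memory requests and limits, Pods that omit those values are rejected in that namespace, so a LimitRange that supplies defaults is a common companion. The following is a minimal sketch only; the name `prod-defaults` and every value are illustrative assumptions, not recommendations.

```yaml
# Hypothetical LimitRange for the production namespace (all values are assumptions)
apiVersion: v1
kind: LimitRange
metadata:
  name: prod-defaults
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:        # used when a container declares no requests
        cpu: 250m
        memory: 256Mi
      default:               # used when a container declares no limits
        cpu: 500m
        memory: 512Mi
      max:                   # largest values a single container may declare
        cpu: "2"
        memory: 2Gi
```

With this in place, containers that declare nothing still count against the quota with predictable values.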
3. Security Best Practices
Security must be planned from the beginning. Production Kubernetes security should include multiple layers.
Important Security Controls
- RBAC with least privilege
- Kubernetes Secrets or external secret managers
- Network Policies
- Pod Security Standards
- Image vulnerability scanning
- Private container registries
- Audit logs
- mTLS using service mesh when required
RBAC Best Practice
Never give unnecessary cluster-admin access.
Bad Practice
verbs: ["*"]
resources: ["*"]
Good Practice
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
Give users and service accounts only the permissions they need.
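The "good practice" rule above only takes effect once it sits inside a Role and is bound to a subject. A minimal sketch of that wiring might look like the following; the names `pod-reader`, `pod-reader-binding`, and the `log-viewer` service account are hypothetical.

```yaml
# Namespaced, read-only access to Pods and their logs (names are assumptions)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]                  # "" is the core API group
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: log-viewer                 # hypothetical service account
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```

Using a Role and RoleBinding rather than their Cluster-scoped counterparts keeps the permission limited to one namespace.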
Secrets Best Practice
Do not hardcode sensitive information in YAML files or source code.
Sensitive data includes:
- Database passwords
- JWT secrets
- Payment gateway keys
- Cloud credentials
- OAuth secrets
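Use Kubernetes Secrets, Vault, AWS Secrets Manager, Azure Key Vault, or Sealed Secrets. Note that Kubernetes Secrets are only base64-encoded by default, not encrypted, so enable encryption at rest and restrict access to them with RBAC.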
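To illustrate keeping a password out of the manifest itself, a container can read it from a Secret at runtime. This is a sketch under assumptions: the Secret `db-credentials`, its key `password`, the Pod name, and the image reference are all hypothetical.

```yaml
# The manifest references the Secret by name; the password never appears here
apiVersion: v1
kind: Pod
metadata:
  name: account-service
  namespace: production
spec:
  containers:
    - name: app
      image: registry.example.com/account-service:1.4.2   # assumed image
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials   # assumed Secret, created separately
              key: password          # assumed key inside that Secret
```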
4. Network Security Best Practices
By default, Kubernetes allows all Pod-to-Pod communication: any Pod can reach any other Pod in the cluster until a Network Policy selects it. This is not ideal for production.
Use Network Policies to allow only required communication.
[ Frontend ] ---> [ Backend ] ---> [ Database ]
Blocked:
[ Random Pod ] ---X---> [ Database ]
This reduces lateral movement risk if one Pod is compromised.
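The Backend-to-Database rule in the diagram could be expressed roughly as follows. The labels `app: database` and `app: backend` and port 5432 are assumptions for illustration, and the policy is only enforced when the cluster's network plugin supports Network Policies.

```yaml
# Only Pods labelled app=backend may open connections to the database Pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-allow-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database            # Pods this policy protects (assumed label)
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend     # the only allowed client (assumed label)
      ports:
        - protocol: TCP
          port: 5432           # assumed database port
```

Once a policy selects the database Pods, all other ingress to them, including from the "Random Pod" in the diagram, is denied.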
5. Reliability Best Practices
Production applications should use:
- Liveness probes
- Readiness probes
- Startup probes
- Rolling updates
- PodDisruptionBudgets
- Replica distribution across nodes
Readiness Probe Example
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
Readiness probes prevent traffic from reaching Pods that are not ready.
Liveness Probe Example
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
Liveness probes restart stuck or unhealthy containers.
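Startup probes appear in the list above but have no example. For slow-starting applications, a startup probe holds off the liveness and readiness probes until the application has come up. A sketch, reusing the liveness endpoint from above and assuming a five-minute startup budget:

```yaml
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 attempts x 10s = up to 300s to start (assumed budget)
```

This tends to be safer than a very large initialDelaySeconds on the liveness probe, because the liveness probe takes over as soon as startup succeeds.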
6. Deployment Best Practices
Use rolling updates for safer deployments.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
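With maxSurge: 1 and maxUnavailable: 0, Kubernetes creates one new Pod at a time and removes an old Pod only after the new one is Ready, so serving capacity never drops below the desired replica count during the rollout.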
Rollback Command
kubectl rollout undo deployment/payment-service
Rollback should always be part of the production deployment plan.
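The fragments shown so far (resources, readiness probe, rollout strategy) all live inside a Deployment. The sketch below shows one way they could fit together; the image reference, labels, replica count, and revisionHistoryLimit are assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  revisionHistoryLimit: 5            # old ReplicaSets kept for rollout undo
  selector:
    matchLabels:
      app: payment-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: payment-service         # must match spec.selector
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:2.3.1   # pinned tag, assumed
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
```

Pinning the image to a specific tag means the rollback command returns to a known image rather than whatever a moving tag currently points to.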
7. Autoscaling Best Practices
Use HPA for Pod scaling and Cluster Autoscaler for node scaling.
Traffic Spike
|
v
HPA Adds Pods
|
v
Cluster Autoscaler Adds Nodes if Needed
Without autoscaling, applications may fail during peak traffic.
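As an illustration of the Pod-scaling half of this flow, an HPA targeting a Deployment might be declared as below. The target name, replica bounds, and the 70% threshold are assumptions; CPU utilization is measured relative to the containers' CPU requests, and a metrics source such as Metrics Server must be running.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service      # assumed Deployment name
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of requested CPU (assumed target)
```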
8. Observability Best Practices
A production Kubernetes cluster should include monitoring, logging, and alerting.
Recommended Stack
- Prometheus for metrics
- Grafana for dashboards
- Loki or ELK for logs
- Alertmanager for alerts
- Jaeger or OpenTelemetry for tracing
Golden Signals to Monitor
| Signal | Meaning |
|---|---|
| Latency | How long requests take |
| Traffic | How many requests are coming |
| Errors | How many requests fail |
| Saturation | How full the system is |
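Alertmanager routes alerts, but the alert conditions themselves are defined as Prometheus alerting rules. The rule below is a sketch only: it assumes kube-state-metrics is installed (it exposes `kube_pod_container_status_restarts_total`), and the threshold, window, and severity label are arbitrary illustrative choices.

```yaml
# Prometheus rule file (assumes kube-state-metrics is being scraped)
groups:
  - name: kubernetes-pods
    rules:
      - alert: PodRestartingFrequently
        # More than 3 container restarts within the last 15 minutes
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```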
9. Backup and Disaster Recovery
Production systems should have backup plans for:
- etcd
- Databases
- Persistent Volumes
- Critical configuration
- Secrets
Backups must be tested regularly. A backup that cannot be restored is not useful.
10. GitOps Best Practices
Use Git as the source of truth for Kubernetes configuration.
Tools like Argo CD and Flux help keep clusters synchronized with Git.
Git Repository
|
v
Argo CD / Flux
|
v
Kubernetes Cluster
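With Argo CD, the link between a Git path and a cluster namespace is itself a Kubernetes object. A minimal sketch follows; the repository URL, path, and application name are hypothetical, and it assumes Argo CD is installed in the `argocd` namespace.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-manifests.git   # hypothetical repo
    targetRevision: main
    path: apps/payment-service                                     # hypothetical path
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made in the cluster
```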
Troubleshooting Kubernetes Problems
Troubleshooting Kubernetes requires a systematic approach. Do not guess. Start from the symptom and move layer by layer.
General Troubleshooting Flow
Problem Reported
|
v
Check Pods
|
v
Check Events
|
v
Check Logs
|
v
Check Services
|
v
Check ConfigMaps / Secrets
|
v
Check Nodes
|
v
Check Network Policies
|
v
Find Root Cause
Issue 1: Pod Stuck in Pending
A Pod stays in Pending when Kubernetes cannot schedule it.
Possible Causes
- Insufficient CPU or memory
- Node selector mismatch
- Taints without tolerations
- PVC not bound
- Cluster Autoscaler issue
Commands
kubectl describe pod pod-name
kubectl get events
kubectl get nodes
kubectl describe node node-name
kubectl get pvc
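Several of the causes above trace back to scheduling fields in the Pod spec. The sketch below marks the fields worth comparing against `kubectl describe node` output; every label, taint, and value here is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pending-example
spec:
  nodeSelector:
    disktype: ssd              # some node must carry this exact label
  tolerations:
    - key: dedicated           # must match the taint on the target nodes
      operator: Equal
      value: payments
      effect: NoSchedule
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # assumed image
      resources:
        requests:
          cpu: "250m"          # some node must have this much CPU unreserved
          memory: "256Mi"
```

If no node satisfies all of these at once, the Pod stays Pending and the scheduler records the reason in the Pod's events.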
Issue 2: ImagePullBackOff
ImagePullBackOff means Kubernetes cannot pull the container image.
Possible Causes
- Wrong image name
- Wrong image tag
- Private registry authentication issue
- Image does not exist
- Network problem
Commands
kubectl describe pod pod-name
kubectl get secret
kubectl create secret docker-registry regcred --docker-server=<registry-url> --docker-username=<username> --docker-password=<password>
Issue 3: CrashLoopBackOff
CrashLoopBackOff means the container starts, crashes, and Kubernetes keeps restarting it.
Possible Causes
- Application startup failure
- Missing environment variables
- Wrong ConfigMap or Secret
- Database connection failure
- Bad liveness probe
- Application exception
Commands
kubectl logs pod-name
kubectl logs pod-name --previous
kubectl describe pod pod-name
kubectl get events
CrashLoopBackOff Debugging Flow
Pod Restarting
|
v
Check Current Logs
|
v
Check Previous Logs
|
v
Check Environment Variables
|
v
Check ConfigMaps and Secrets
|
v
Check Probe Configuration
|
v
Fix Application or Config
Issue 4: Service Not Reachable
If a Service is not reachable, the problem may be in labels, endpoints, ports, or network policies.
Commands
kubectl get svc
kubectl describe svc service-name
kubectl get endpoints
kubectl get pods --show-labels
kubectl exec -it pod-name -- curl http://service-name
Common Causes
- Service selector does not match Pod labels
- Target port is wrong
- Pods are not ready
- Network Policy blocks traffic
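The first two causes are easiest to spot by reading the Service next to the Pod template. A sketch of a Service with the two fields that must line up; the names, label, and ports are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
spec:
  selector:
    app: payment-service     # must equal the labels on the Pods
  ports:
    - name: http
      port: 80               # port that clients of the Service connect to
      targetPort: 8080       # must equal the port the container listens on
```

If the selector matches no ready Pods, `kubectl get endpoints` shows the Service with no addresses.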
Issue 5: DNS Not Working
If services cannot resolve names, check CoreDNS.
Commands
kubectl get pods -n kube-system
kubectl logs -n kube-system deployment/coredns
kubectl exec -it pod-name -- nslookup service-name
Symptoms
- UnknownHostException
- Service name resolution failure
- Intermittent communication errors
Issue 6: OOMKilled
OOMKilled means the container was terminated by the kernel's out-of-memory killer, most commonly because it used more memory than its limit.
Commands
kubectl describe pod pod-name
kubectl top pod pod-name
kubectl logs pod-name --previous
Fix Options
- Increase memory limit
- Fix memory leak
- Optimize application memory usage
- Scale horizontally
Issue 7: Node NotReady
A node becomes NotReady when Kubernetes cannot confirm its health.
Possible Causes
- Kubelet issue
- Disk pressure
- Memory pressure
- Network issue
- Container runtime issue
Commands
kubectl get nodes
kubectl describe node node-name
kubectl top nodes
journalctl -u kubelet
Node Maintenance Flow
Node Problem Detected
|
v
Cordon Node
|
v
Drain Node
|
v
Fix Node
|
v
Uncordon Node
Commands
kubectl cordon node-name
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data
kubectl uncordon node-name
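The drain step evicts Pods, and evictions respect PodDisruptionBudgets, which were listed earlier among the reliability practices. A sketch of a budget that keeps at least two replicas running during a drain; the name, label, and count are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production
spec:
  minAvailable: 2              # evictions wait while this would be violated
  selector:
    matchLabels:
      app: payment-service     # assumed Pod label
```

A budget that can never be satisfied, for example minAvailable equal to the replica count, will block the drain.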
Issue 8: Ingress Not Working
Possible Causes
- Ingress Controller not running
- DNS not pointing correctly
- TLS certificate issue
- Wrong backend service
- Wrong service port
Commands
kubectl get ingress
kubectl describe ingress ingress-name
kubectl get pods -n ingress-nginx
kubectl logs pod-name -n ingress-nginx
kubectl get svc
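Most of the causes above map to a specific field in the Ingress object. In the sketch below, the host, TLS Secret, ingress class, and backend Service are all hypothetical; it assumes an NGINX ingress controller registered under the class name `nginx`.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop-ingress
  namespace: production
spec:
  ingressClassName: nginx          # must match an installed controller's class
  tls:
    - hosts:
        - shop.example.com
      secretName: shop-tls         # TLS Secret must exist in this namespace
  rules:
    - host: shop.example.com       # DNS must point at the controller's address
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-service   # must be an existing Service
                port:
                  number: 80            # must be a port that Service exposes
```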
Issue 9: HPA Not Scaling
Possible Causes
- Metrics Server not installed
- No CPU requests defined
- Target utilization too high
- Max replicas reached
- Pods not becoming ready
Commands
kubectl get hpa
kubectl describe hpa hpa-name
kubectl top pods
kubectl top nodes
Issue 10: Persistent Volume Problems
Possible Causes
- PVC stuck Pending
- No matching PV
- StorageClass missing
- Access mode mismatch
- Cloud storage provisioning issue
Commands
kubectl get pvc
kubectl describe pvc pvc-name
kubectl get pv
kubectl get storageclass
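When a claim is stuck, comparing it with the output of `kubectl get storageclass` and `kubectl get pv` usually narrows things down. A sketch of a claim with the fields to check; the claim name, class name `standard`, and size are assumptions.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce              # must be supported by the volume type
  storageClassName: standard     # must name an existing StorageClass
  resources:
    requests:
      storage: 20Gi              # a matching PV must be at least this large
```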
Production Incident Response Flow
Alert Triggered
|
v
Identify Impact
|
v
Check Dashboards
|
v
Check Logs
|
v
Check Recent Deployments
|
v
Rollback if Needed
|
v
Fix Root Cause
|
v
Document Incident
Real-Time Incident Example
A new payment service version is deployed. After deployment:
- Error rate increases
- Payment latency increases
- Pods restart frequently
Action Plan
- Check Grafana dashboard
- Check payment service logs
- Check rollout history
- Roll back the deployment
- Analyze failed version
- Fix issue and redeploy safely
Useful Kubernetes Debug Commands
kubectl get pods -A
kubectl describe pod pod-name
kubectl logs pod-name
kubectl logs pod-name --previous
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl get svc
kubectl get endpoints
kubectl top pods
kubectl top nodes
kubectl rollout status deployment/app
kubectl rollout undo deployment/app
Production Readiness Checklist
- Requests and limits configured
- HPA configured for important services
- Cluster Autoscaler enabled
- Liveness/readiness/startup probes configured
- RBAC follows least privilege
- Secrets are managed securely
- Network Policies are applied
- Monitoring and logging are configured
- Backups are tested
- Rollbacks are documented
- Ingress TLS is configured
- CI/CD pipeline includes testing
- Pod disruption budgets are configured for critical workloads
Common Production Mistakes
1. No Resource Limits
Can cause node instability.
2. No Readiness Probe
Traffic may reach unready Pods.
3. Using the latest Image Tag
Creates unpredictable deployments, because the tag can point to a different image each time it is pulled.
4. No Monitoring
Issues are detected late.
5. Weak RBAC
Users may get excessive permissions.
6. No Backup Testing
Backups may fail during real recovery.
Interview Questions
Q1: What are Kubernetes production best practices?
Use resource requests and limits, configure probes, enforce RBAC, manage Secrets securely, apply Network Policies, enable monitoring/logging, use autoscaling, and plan backups and rollbacks.
Q2: How do you troubleshoot CrashLoopBackOff?
Check logs, previous logs, Pod events, environment variables, ConfigMaps, Secrets, and probe configuration.
Q3: How do you troubleshoot Service connectivity issues?
Check Service selectors, endpoints, Pod labels, target ports, readiness status, DNS, and Network Policies.
Q4: What causes OOMKilled?
The container exceeded its memory limit and was terminated by the Linux kernel's OOM killer; Kubernetes reports the termination reason as OOMKilled.
Q5: Why are readiness probes important?
They prevent traffic from reaching Pods that are not ready to serve requests.
Advanced Interview Questions
Q1: How do you handle a bad production deployment?
Check rollout status, review logs and metrics, pause or roll back the deployment, identify the root cause, and redeploy after fixing.
Q2: What is the difference between liveness and readiness probes?
Liveness checks whether a container should be restarted. Readiness checks whether a Pod should receive traffic.
Q3: Why use Network Policies?
To restrict Pod-to-Pod traffic and improve microservices security.
Q4: Why is observability important?
It helps detect, diagnose, and resolve production issues quickly.
Q5: Why should backups be tested?
Because untested backups may fail during real disaster recovery.
Recommended Learning Path
- Kubernetes Pods
- Kubernetes Deployments
- Health Probes
- Resource Management
- Horizontal Pod Autoscaler
- Monitoring and Logging
- RBAC
- Network Policies
- CI/CD Pipelines
Summary
Production Kubernetes requires more than basic YAML files. It requires resource control, security, observability, automation, backup planning, and systematic troubleshooting.
Best practices prevent many issues before they happen. Troubleshooting skills help teams recover quickly when incidents occur.
For banking, e-commerce, healthcare, SaaS, fintech, and enterprise systems, Kubernetes production readiness directly impacts performance, reliability, security, and customer trust.
Mastering production best practices and troubleshooting makes developers, DevOps engineers, cloud engineers, and platform engineers much stronger in real-world Kubernetes operations.