Published: 2026-06-01 • Updated: 2026-07-05

Production Best Practices and Troubleshooting in Kubernetes: Complete Real-Time Enterprise Guide

Running Kubernetes in production is very different from learning Kubernetes in a local lab. In development, a simple Pod, Deployment, or Service may be enough. But in production, applications must be secure, scalable, observable, cost-efficient, and highly available.

A production Kubernetes cluster must handle real users, traffic spikes, application failures, node failures, security threats, bad deployments, storage problems, and networking issues. This is why every Kubernetes engineer must understand both best practices and troubleshooting workflows.

This guide covers resource management, security, reliability, observability, governance, and common troubleshooting areas, with real-time examples, production debugging flows, enterprise recommendations, and interview-ready explanations.


Why Kubernetes Production Best Practices Matter

Kubernetes is powerful, but it does not automatically make applications production-ready. If the cluster is poorly configured, the applications running on it can still suffer from:

  • Pod crashes
  • Slow applications
  • Security leaks
  • Unexpected downtime
  • High cloud bills
  • Failed deployments
  • Data loss
  • Difficult troubleshooting

Production readiness means building the system in a way that prevents problems and helps teams recover quickly when issues happen.


Real-Time Banking Example

Imagine a banking platform running on Kubernetes. It contains payment APIs, authentication services, account services, fraud detection services, notification services, and databases.

If Kubernetes is not production-ready:

  • Payment Pods may crash during high traffic
  • Database credentials may leak through poor Secrets management
  • Wrong Network Policies may block transaction APIs
  • No monitoring may delay incident detection
  • No rollback plan may extend downtime

For banking, even a few minutes of downtime can create financial loss and customer trust issues.


Real-Time E-Commerce Example

During a festival sale, an e-commerce platform may receive 10x or 50x normal traffic.

Production best practices help the platform:

  • Scale Pods automatically using HPA
  • Add nodes using Cluster Autoscaler
  • Prevent bad Pods from receiving traffic using readiness probes
  • Detect errors using Prometheus and Grafana
  • Roll back failed releases quickly
  • Protect payment and database services using Network Policies

Production Kubernetes Architecture

[ Users ]
   |
   v
[ Ingress Controller ]
   |
   v
[ Application Services ]
   |
   +--> ConfigMaps
   +--> Secrets
   +--> Persistent Volumes
   +--> Network Policies
   +--> HPA
   +--> Monitoring
   +--> Logging
   +--> RBAC

A production Kubernetes system is not just Deployments and Services. It is a combination of reliability, security, scaling, monitoring, and governance.


1. Resource Management Best Practices

Every production workload should define CPU and memory requests and limits.

Why?

Without resource management, one bad application can consume too much CPU or memory and affect other applications running on the same node.

Example

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"

  limits:
    cpu: "500m"
    memory: "512Mi"

Meaning

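  • requests tell the scheduler how much CPU and memory to reserve when it places the Pod on a node
  • limits cap usage at runtime: CPU above the limit is throttled, and a container that exceeds its memory limit is killed (OOMKilled)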

Resource Management Flow

Pod Created
   |
   v
Scheduler Checks Requests
   |
   v
Pod Placed on Suitable Node
   |
   v
Container Starts
   |
   v
Limits Prevent Resource Abuse

Common Resource Issues

  • Pods stuck in Pending because requests are too high
  • Pods killed with OOMKilled because memory limit is too low
  • Application slow because CPU is throttled
  • Nodes overloaded because limits are missing

2. Namespace and Quota Best Practices

Use namespaces to separate environments and teams.

[ Kubernetes Cluster ]
   |
   +-- dev
   +-- qa
   +-- staging
   +-- production
   +-- monitoring

Apply ResourceQuotas and LimitRanges to prevent one namespace from consuming all cluster resources.

ResourceQuota Example

apiVersion: v1
kind: ResourceQuota

metadata:
  name: prod-quota
  namespace: production

spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
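
LimitRange Example

The section above also mentions LimitRanges. Here is a minimal sketch of one for the same production namespace. The object name and all values are illustrative assumptions, not recommendations. When a ResourceQuota restricts requests and limits, Pods that omit them are rejected, so a LimitRange that supplies defaults is a common companion.

```yaml
# Hypothetical LimitRange; the name and every value are assumptions for illustration.
apiVersion: v1
kind: LimitRange
metadata:
  name: prod-default-limits
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:        # used when a container does not set requests
        cpu: "250m"
        memory: "256Mi"
      default:               # used when a container does not set limits
        cpu: "500m"
        memory: "512Mi"
      max:                   # largest limit a single container may ask for
        cpu: "2"
        memory: "2Gi"
```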

3. Security Best Practices

Security must be planned from the beginning. Production Kubernetes security should include multiple layers.

Important Security Controls

  • RBAC with least privilege
  • Kubernetes Secrets or external secret managers
  • Network Policies
  • Pod Security Standards
  • Image vulnerability scanning
  • Private container registries
  • Audit logs
  • mTLS using service mesh when required

RBAC Best Practice

Never give unnecessary cluster-admin access.

Bad Practice

verbs: ["*"]
resources: ["*"]

Good Practice

resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]

Give users and service accounts only the permissions they need.
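
The good-practice rule above can be placed in a complete Role and bound to a service account. This is a minimal sketch; the names pod-reader, pod-reader-binding, and log-viewer are hypothetical.

```yaml
# Read-only access to Pods and their logs in one namespace (names are assumptions).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]                      # "" is the core API group
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Grants the Role to a single service account instead of a broad group.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: log-viewer
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```

You can verify the result with kubectl auth can-i list pods -n production --as=system:serviceaccount:production:log-viewer.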


Secrets Best Practice

Do not hardcode sensitive information in YAML files or source code.

Sensitive data includes:

  • Database passwords
  • JWT secrets
  • Payment gateway keys
  • Cloud credentials
  • OAuth secrets

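Use Kubernetes Secrets, Vault, AWS Secrets Manager, Azure Key Vault, or Sealed Secrets. Keep in mind that Kubernetes Secrets are only base64-encoded and are stored unencrypted in etcd by default, so enable encryption at rest and restrict access to them with RBAC.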
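
As an illustration, a container can read a password from a Secret at runtime instead of carrying it in the manifest. This is only a sketch: the Pod name, image, Secret name payment-db-credentials, and key password are all assumed, and the Secret itself would be created separately.

```yaml
# Hypothetical Pod; image, Secret name, and key are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: payment-service-example
  namespace: production
spec:
  containers:
    - name: payment-service
      image: registry.example.com/payment-service:1.4.2
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:                    # value is injected from the Secret
              name: payment-db-credentials
              key: password
```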


4. Network Security Best Practices

By default, Kubernetes allows all Pod-to-Pod communication inside the cluster. This is not ideal for production.

Use Network Policies to allow only required communication.

[ Frontend ] ---> [ Backend ] ---> [ Database ]

Blocked:
[ Random Pod ] ---X---> [ Database ]

This reduces lateral movement risk if one Pod is compromised.
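
The diagram above could be enforced roughly as follows. This sketch assumes the Pods carry the labels app: backend and app: database, that the database listens on port 5432, and that the cluster's network plugin enforces Network Policies. All of these are assumptions.

```yaml
# Step 1: deny all incoming traffic to every Pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}            # empty selector = all Pods in the namespace
  policyTypes:
    - Ingress
---
# Step 2: allow only backend Pods to reach the database Pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database          # assumed label on database Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend   # assumed label on backend Pods
      ports:
        - protocol: TCP
          port: 5432         # assumed database port
```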


5. Reliability Best Practices

Production applications should use:

  • Liveness probes
  • Readiness probes
  • Startup probes
  • Rolling updates
  • PodDisruptionBudgets
  • Replica distribution across nodes
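
PodDisruptionBudget Example

For the PodDisruptionBudgets item in the list above, here is a minimal sketch. It assumes a payment-service workload labeled app: payment-service that runs at least three replicas. The names and the number are illustrative.

```yaml
# Keeps at least 2 matching Pods running during voluntary disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service   # assumed Pod label
```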

Readiness Probe Example

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Readiness probes prevent traffic from reaching Pods that are not ready.


Liveness Probe Example

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

Liveness probes restart stuck or unhealthy containers.
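
Startup Probe Example

Startup probes are listed above but not shown. For a slow-starting application, a sketch could look like this. The path is reused from the liveness example, and the timing values are assumptions.

```yaml
# Container-level fragment, placed next to livenessProbe and readinessProbe.
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 x 10s = 300s for startup
```

Until the startup probe succeeds, Kubernetes does not run the liveness and readiness probes, so a slow start does not trigger restarts.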


6. Deployment Best Practices

Use rolling updates for safer deployments.

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

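With maxSurge: 1 and maxUnavailable: 0, Kubernetes starts one new Pod and waits for it to become ready before removing an old one, so serving capacity never drops below the desired replica count during the rollout.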


Rollback Command

kubectl rollout undo deployment/payment-service

Rollback should always be part of the production deployment plan.


7. Autoscaling Best Practices

Use HPA for Pod scaling and Cluster Autoscaler for node scaling.

Traffic Spike
   |
   v
HPA Adds Pods
   |
   v
Cluster Autoscaler Adds Nodes if Needed

Without autoscaling, applications may fail during peak traffic.
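
HPA Example

A minimal HPA sketch for a Deployment named payment-service is shown here. The name, replica counts, and the 70% target are assumptions. CPU-based scaling also needs Metrics Server and CPU requests on the Pods.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service        # assumed Deployment name
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # percent of the Pods' CPU requests
```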


8. Observability Best Practices

A production Kubernetes cluster should include monitoring, logging, and alerting.

Recommended Stack

  • Prometheus for metrics
  • Grafana for dashboards
  • Loki or ELK for logs
  • Alertmanager for alerts
  • Jaeger or OpenTelemetry for tracing

Golden Signals to Monitor

Signal       Meaning
Latency      How long requests take
Traffic      How many requests are coming
Errors       How many requests fail
Saturation   How full the system is
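
Alert Rule Example

As a sketch of alerting on the Errors signal, a Prometheus rule file could look like this. The metric http_requests_total and its status label are assumed names, so use whatever your application actually exposes. The 5% threshold is also only an example.

```yaml
# Prometheus alerting rule file (metric and label names are assumptions).
groups:
  - name: golden-signals-example
    rules:
      - alert: HighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m                  # condition must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are failing"
```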

9. Backup and Disaster Recovery

Production systems should have backup plans for:

  • etcd
  • Databases
  • Persistent Volumes
  • Critical configuration
  • Secrets

Backups must be tested regularly. A backup that cannot be restored is not useful.


10. GitOps Best Practices

Use Git as the source of truth for Kubernetes configuration.

Tools like Argo CD and Flux help keep clusters synchronized with Git.

Git Repository
   |
   v
Argo CD / Flux
   |
   v
Kubernetes Cluster
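
Argo CD Application Example

With Argo CD, the flow above is usually described by an Application resource. This is a sketch only: the repository URL, path, and names are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-manifests.git  # assumed repository
    targetRevision: main
    path: apps/payment-service/production                        # assumed folder
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made in the cluster
```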

Troubleshooting Kubernetes Problems

Troubleshooting Kubernetes requires a systematic approach. Do not guess. Start from the symptom and move layer by layer.


General Troubleshooting Flow

Problem Reported
   |
   v
Check Pods
   |
   v
Check Events
   |
   v
Check Logs
   |
   v
Check Services
   |
   v
Check ConfigMaps / Secrets
   |
   v
Check Nodes
   |
   v
Check Network Policies
   |
   v
Find Root Cause

Issue 1: Pod Stuck in Pending

A Pod stays in Pending when Kubernetes cannot schedule it.

Possible Causes

  • Insufficient CPU or memory
  • Node selector mismatch
  • Taints without tolerations
  • PVC not bound
  • Cluster Autoscaler issue

Commands

kubectl describe pod pod-name

kubectl get events

kubectl get nodes

kubectl describe node node-name

kubectl get pvc
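
When checking the output of these commands, compare it with the scheduling fields in the Pod spec. The sketch below shows where those fields live. The label disktype: ssd, the taint dedicated=payments:NoSchedule, and the image are made-up examples.

```yaml
# Hypothetical Pod showing the fields that most often cause Pending.
apiVersion: v1
kind: Pod
metadata:
  name: scheduling-example
spec:
  nodeSelector:
    disktype: ssd              # some node must carry this exact label
  tolerations:
    - key: "dedicated"         # must match the taint on the target nodes
      operator: "Equal"
      value: "payments"
      effect: "NoSchedule"
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      resources:
        requests:              # must fit into the free capacity of one node
          cpu: "250m"
          memory: "256Mi"
```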

Issue 2: ImagePullBackOff

ImagePullBackOff means Kubernetes cannot pull the container image.

Possible Causes

  • Wrong image name
  • Wrong image tag
  • Private registry authentication issue
  • Image does not exist
  • Network problem

Commands

kubectl describe pod pod-name

kubectl get secret

kubectl create secret docker-registry regcred --docker-server=registry-server --docker-username=username --docker-password=password
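
Creating the registry secret is not enough on its own. The Pod must also reference it. Here is a minimal sketch, assuming the secret is named regcred as above and using a hypothetical private image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-image-example
spec:
  imagePullSecrets:
    - name: regcred            # must exist in the same namespace as the Pod
  containers:
    - name: app
      image: registry.example.com/team/app:1.0.0   # assumed private image
```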

Issue 3: CrashLoopBackOff

CrashLoopBackOff means the container starts, crashes, and Kubernetes keeps restarting it.

Possible Causes

  • Application startup failure
  • Missing environment variables
  • Wrong ConfigMap or Secret
  • Database connection failure
  • Bad liveness probe
  • Application exception

Commands

kubectl logs pod-name

kubectl logs pod-name --previous

kubectl describe pod pod-name

kubectl get events

CrashLoopBackOff Debugging Flow

Pod Restarting
   |
   v
Check Current Logs
   |
   v
Check Previous Logs
   |
   v
Check Environment Variables
   |
   v
Check ConfigMaps and Secrets
   |
   v
Check Probe Configuration
   |
   v
Fix Application or Config

Issue 4: Service Not Reachable

If a Service is not reachable, the problem may be in labels, endpoints, ports, or network policies.

Commands

kubectl get svc

kubectl describe svc service-name

kubectl get endpoints

kubectl get pods --show-labels

kubectl exec -it pod-name -- curl http://service-name

Common Causes

  • Service selector does not match Pod labels
  • Target port is wrong
  • Pods are not ready
  • Network Policy blocks traffic
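
The first two causes are easier to spot when the Service and the Deployment are read side by side. In this sketch the names, image, and ports are assumptions. The comments mark the fields that have to line up.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment-service       # must match the Pod template labels below
  ports:
    - port: 80                 # port that clients call
      targetPort: 8080         # port the application listens on inside the Pod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service   # the Service selector looks for this label
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:1.4.2
          ports:
            - containerPort: 8080
```

If the labels do not match, kubectl get endpoints shows no addresses for the Service.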

Issue 5: DNS Not Working

If services cannot resolve names, check CoreDNS.

Commands

kubectl get pods -n kube-system

kubectl logs -n kube-system deployment/coredns

kubectl exec -it pod-name -- nslookup service-name

Symptoms

  • UnknownHostException
  • Service name resolution failure
  • Intermittent communication errors

Issue 6: OOMKilled

OOMKilled means the container exceeded its memory limit and was terminated by the Linux kernel's out-of-memory (OOM) killer.

Commands

kubectl describe pod pod-name

kubectl top pod pod-name

kubectl logs pod-name --previous

Fix Options

  • Increase memory limit
  • Fix memory leak
  • Optimize application memory usage
  • Scale horizontally

Issue 7: Node NotReady

A node becomes NotReady when Kubernetes cannot confirm its health.

Possible Causes

  • Kubelet issue
  • Disk pressure
  • Memory pressure
  • Network issue
  • Container runtime issue

Commands

kubectl get nodes

kubectl describe node node-name

kubectl top nodes

journalctl -u kubelet

Node Maintenance Flow

Node Problem Detected
   |
   v
Cordon Node
   |
   v
Drain Node
   |
   v
Fix Node
   |
   v
Uncordon Node

Commands

kubectl cordon node-name

kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

kubectl uncordon node-name

Issue 8: Ingress Not Working

Possible Causes

  • Ingress Controller not running
  • DNS not pointing correctly
  • TLS certificate issue
  • Wrong backend service
  • Wrong service port

Commands

kubectl get ingress

kubectl describe ingress ingress-name

kubectl get pods -n ingress-nginx

kubectl logs pod-name -n ingress-nginx

kubectl get svc
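
For comparison with what kubectl describe ingress reports, here is a minimal Ingress sketch. The host, TLS secret, class name nginx, and backend Service are assumptions. The backend name and port must match an existing Service in the same namespace.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-ingress
  namespace: production
spec:
  ingressClassName: nginx          # must match an installed Ingress Controller
  tls:
    - hosts:
        - pay.example.com
      secretName: pay-example-tls  # TLS Secret in the same namespace
  rules:
    - host: pay.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-service   # must be an existing Service
                port:
                  number: 80            # must be a port exposed by that Service
```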

Issue 9: HPA Not Scaling

Possible Causes

  • Metrics Server not installed
  • No CPU requests defined
  • Target utilization too high
  • Max replicas reached
  • Pods not becoming ready

Commands

kubectl get hpa

kubectl describe hpa hpa-name

kubectl top pods

kubectl top nodes

Issue 10: Persistent Volume Problems

Possible Causes

  • PVC stuck Pending
  • No matching PV
  • StorageClass missing
  • Access mode mismatch
  • Cloud storage provisioning issue

Commands

kubectl get pvc

kubectl describe pvc pvc-name

kubectl get pv

kubectl get storageclass
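
A PVC that binds correctly usually looks like the sketch below. The claim name, the StorageClass name standard, and the size are assumptions. The storageClassName must appear in the output of kubectl get storageclass, and the access mode must be supported by that storage.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data
  namespace: production
spec:
  storageClassName: standard   # assumed StorageClass name
  accessModes:
    - ReadWriteOnce            # mounted read-write by a single node
  resources:
    requests:
      storage: 20Gi
```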

Production Incident Response Flow

Alert Triggered
   |
   v
Identify Impact
   |
   v
Check Dashboards
   |
   v
Check Logs
   |
   v
Check Recent Deployments
   |
   v
Roll Back if Needed
   |
   v
Fix Root Cause
   |
   v
Document Incident

Real-Time Incident Example

A new payment service version is deployed. After deployment:

  • Error rate increases
  • Payment latency increases
  • Pods restart frequently

Action Plan

  1. Check Grafana dashboard
  2. Check payment service logs
  3. Check rollout history
  4. Roll back the deployment
  5. Analyze failed version
  6. Fix issue and redeploy safely

Useful Kubernetes Debug Commands

kubectl get pods -A

kubectl describe pod pod-name

kubectl logs pod-name

kubectl logs pod-name --previous

kubectl get events --sort-by=.metadata.creationTimestamp

kubectl get svc

kubectl get endpoints

kubectl top pods

kubectl top nodes

kubectl rollout status deployment/app

kubectl rollout undo deployment/app

Production Readiness Checklist

  • Requests and limits configured
  • HPA configured for important services
  • Cluster Autoscaler enabled
  • Liveness/readiness/startup probes configured
  • RBAC follows least privilege
  • Secrets are managed securely
  • Network Policies are applied
  • Monitoring and logging are configured
  • Backups are tested
  • Rollbacks are documented
  • Ingress TLS is configured
  • CI/CD pipeline includes testing
  • Pod disruption budgets are configured for critical workloads

Common Production Mistakes

1. No Resource Limits

Can cause node instability.

2. No Readiness Probe

Traffic may reach unready Pods.

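3. Using the latest Image Tag

The latest tag can point to a different image on each pull, so deployments and rollbacks become unpredictable. Pin a specific version tag instead.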

4. No Monitoring

Issues are detected late.

5. Weak RBAC

Users may get excessive permissions.

6. No Backup Testing

Backups may fail during real recovery.


Interview Questions

Q1: What are Kubernetes production best practices?

Use resource requests and limits, configure probes, enforce RBAC, manage Secrets securely, apply Network Policies, enable monitoring/logging, use autoscaling, and plan backups and rollbacks.

Q2: How do you troubleshoot CrashLoopBackOff?

Check logs, previous logs, Pod events, environment variables, ConfigMaps, Secrets, and probe configuration.

Q3: How do you troubleshoot Service connectivity issues?

Check Service selectors, endpoints, Pod labels, target ports, readiness status, DNS, and Network Policies.

Q4: What causes OOMKilled?

The container exceeded its memory limit and was terminated by the Linux kernel's OOM killer. The kubelet then restarts it according to the Pod's restart policy.

Q5: Why are readiness probes important?

They prevent traffic from reaching Pods that are not ready to serve requests.


Advanced Interview Questions

Q1: How do you handle a bad production deployment?

Check rollout status, review logs and metrics, pause or roll back the deployment, identify the root cause, and redeploy after fixing.

Q2: What is the difference between liveness and readiness probes?

Liveness checks whether a container should be restarted. Readiness checks whether a Pod should receive traffic.

Q3: Why use Network Policies?

To restrict Pod-to-Pod traffic and improve microservices security.

Q4: Why is observability important?

It helps detect, diagnose, and resolve production issues quickly.

Q5: Why should backups be tested?

Because untested backups may fail during real disaster recovery.


Summary

Production Kubernetes requires more than basic YAML files. It requires resource control, security, observability, automation, backup planning, and systematic troubleshooting.

Best practices prevent many issues before they happen. Troubleshooting skills help teams recover quickly when incidents occur.

For banking, e-commerce, healthcare, SaaS, fintech, and enterprise systems, Kubernetes production readiness directly impacts performance, reliability, security, and customer trust.

Mastering production best practices and troubleshooting makes developers, DevOps engineers, cloud engineers, and platform engineers much stronger in real-world Kubernetes operations.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
