Production Best Practices and Troubleshooting in Kubernetes: Complete Real-Time Enterprise Guide

Running Kubernetes in production is very different from learning Kubernetes in a local lab. In development, a simple Pod, Deployment, or Service may be enough. But in production, applications must be secure, scalable, observable, cost-efficient, and highly available.

A production Kubernetes cluster must handle real users, traffic spikes, application failures, node failures, security threats, bad deployments, storage problems, and networking issues. This is why every Kubernetes engineer must understand both best practices and troubleshooting workflows.

This guide covers resource management, security, reliability, observability, governance, and common troubleshooting areas. It includes real-time examples, production debugging flows, enterprise recommendations, and interview-ready explanations.

Why Kubernetes Production Best Practices Matter

Kubernetes is powerful, but it does not automatically make applications production-ready. If the cluster is poorly configured, Kubernetes can still suffer from:

Pod crashes
Slow applications
Security leaks
Unexpected downtime
High cloud bills
Failed deployments
Data loss
Difficult troubleshooting

Production readiness means building the system in a way that prevents problems and helps teams recover quickly when issues happen.

Real-Time Banking Example

Imagine a banking platform running on Kubernetes. It contains payment APIs, authentication services, account services, fraud detection services, notification services, and databases.

If Kubernetes is not production-ready:

Payment Pods may crash during high traffic
Database credentials may leak through poor Secrets management
Wrong Network Policies may block transaction APIs
No monitoring may delay incident detection
No rollback plan may extend downtime

For banking, even a few minutes of downtime can create financial loss and customer trust issues.

Real-Time E-Commerce Example

During a festival sale, an e-commerce platform may receive 10x or 50x normal traffic.

Production best practices help the platform:

Scale Pods automatically using HPA
Add nodes using Cluster Autoscaler
Prevent bad Pods from receiving traffic using readiness probes
Detect errors using Prometheus and Grafana
Roll back failed releases quickly
Protect payment and database services using Network Policies

Production Kubernetes Architecture

[ Users ]
   |
   v
[ Ingress Controller ]
   |
   v
[ Application Services ]
   |
   +--> ConfigMaps
   +--> Secrets
   +--> Persistent Volumes
   +--> Network Policies
   +--> HPA
   +--> Monitoring
   +--> Logging
   +--> RBAC

A production Kubernetes system is not just Deployments and Services. It is a combination of reliability, security, scaling, monitoring, and governance.

1. Resource Management Best Practices

Every production workload should define CPU and memory requests and limits.

Why?

Without resource management, one bad application can consume too much CPU or memory and affect other applications running on the same node.

Example

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"

  limits:
    cpu: "500m"
    memory: "512Mi"

Meaning

requests are the amounts the scheduler reserves when it places a Pod on a node
limits cap usage at runtime: CPU above the limit is throttled, and a container that exceeds its memory limit is OOMKilled

Resource Management Flow

Pod Created
   |
   v
Scheduler Checks Requests
   |
   v
Pod Placed on Suitable Node
   |
   v
Container Starts
   |
   v
Limits Prevent Resource Abuse

Common Resource Issues

Pods stuck in Pending because requests are too high
Pods killed with OOMKilled because memory limit is too low
Application slow because CPU is throttled
Nodes overloaded because limits are missing

2. Namespace and Quota Best Practices

Use namespaces to separate environments and teams.

[ Kubernetes Cluster ]
   |
   +-- dev
   +-- qa
   +-- staging
   +-- production
   +-- monitoring

Apply ResourceQuotas and LimitRanges to prevent one namespace from consuming all cluster resources.

ResourceQuota Example

apiVersion: v1
kind: ResourceQuota

metadata:
  name: prod-quota
  namespace: production

spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
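
LimitRange Example

The section mentions LimitRanges but shows only a ResourceQuota. The sketch below is one possible LimitRange for the same namespace. The name prod-defaults and all values are assumptions chosen to match the earlier resources example, not recommendations. When a quota restricts requests and limits, Pods that omit them are rejected, so namespace defaults like these are commonly paired with a quota.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: prod-defaults        # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: "250m"
        memory: "256Mi"
      default:               # applied when a container sets no limits
        cpu: "500m"
        memory: "512Mi"
      max:                   # illustrative upper bound per container
        cpu: "2"
        memory: "2Gi"
```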

3. Security Best Practices

Security must be planned from the beginning. Production Kubernetes security should include multiple layers.

Important Security Controls

RBAC with least privilege
Kubernetes Secrets or external secret managers
Network Policies
Pod Security Standards
Image vulnerability scanning
Private container registries
Audit logs
mTLS using service mesh when required

RBAC Best Practice

Never give unnecessary cluster-admin access.

Bad Practice

verbs: ["*"]
resources: ["*"]

Good Practice

resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]

Give users and service accounts only the permissions they need.
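
Role and RoleBinding Example

The good-practice rule above is only a fragment. A minimal sketch of how it could sit inside a namespaced Role and be bound to a service account is shown below. The names pod-reader, pod-reader-binding, and log-viewer are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader                 # hypothetical name
  namespace: production
rules:
  - apiGroups: [""]                # "" is the core API group
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding         # hypothetical name
  namespace: production
subjects:
  - kind: ServiceAccount
    name: log-viewer               # hypothetical service account
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```

Because this is a Role and not a ClusterRole, the permissions apply only inside the production namespace. The result can be checked with kubectl auth can-i list pods -n production --as=system:serviceaccount:production:log-viewer.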

Secrets Best Practice

Do not hardcode sensitive information in YAML files or source code.

Sensitive data includes:

Database passwords
JWT secrets
Payment gateway keys
Cloud credentials
OAuth secrets

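Use Kubernetes Secrets, Vault, AWS Secrets Manager, Azure Key Vault, or Sealed Secrets. Kubernetes Secret values are only base64-encoded, not encrypted, so enable encryption at rest and restrict access to Secrets with RBAC.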
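
Secret Reference Example

One way to keep a password out of the manifest is to reference an existing Secret from the container spec. This sketch assumes a Secret named payment-db-credentials with a key DB_PASSWORD was already created outside of Git. The Secret name, key, and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  namespace: production
spec:
  containers:
    - name: app
      image: registry.example.com/payment-service:1.4.2   # hypothetical image
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: payment-db-credentials   # assumed to exist already
              key: DB_PASSWORD
```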

4. Network Security Best Practices

By default, Kubernetes allows all Pod-to-Pod communication inside the cluster until a Network Policy selects a Pod. This is not ideal for production.

Use Network Policies to allow only required communication.

[ Frontend ] ---> [ Backend ] ---> [ Database ]

Blocked:
[ Random Pod ] ---X---> [ Database ]

This reduces lateral movement risk if one Pod is compromised.
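
Network Policy Example

The diagram above can be sketched as two policies: a default deny for all ingress in the namespace, then an explicit allow from the backend to the database. The labels app: backend and app: database and port 5432 (PostgreSQL) are assumptions. Network Policies are enforced only when the cluster's network plugin supports them.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress       # hypothetical name
  namespace: production
spec:
  podSelector: {}                  # empty selector = every Pod in the namespace
  policyTypes:
    - Ingress                      # no ingress rules listed, so all ingress is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-database  # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database                # assumed label on database Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend         # assumed label on backend Pods
      ports:
        - protocol: TCP
          port: 5432               # assumed database port
```

A similar allow rule from the frontend to the backend would be needed to complete the chain.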

5. Reliability Best Practices

Production applications should use:

Liveness probes
Readiness probes
Startup probes
Rolling updates
PodDisruptionBudgets
Replica distribution across nodes

Readiness Probe Example

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Readiness probes prevent traffic from reaching Pods that are not ready.

Liveness Probe Example

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

When a liveness probe keeps failing, the kubelet restarts the container. This recovers stuck or unhealthy containers.
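
Startup Probe Example

Startup probes are listed above without an example. For a slow-starting application, a sketch like the following may help. It reuses the liveness path from the example above, and the thresholds are assumptions. Liveness and readiness checks do not run until the startup probe succeeds, so a slow start is not mistaken for a hang.

```yaml
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # allows up to 30 x 5s = 150s for startup
```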

6. Deployment Best Practices

Use rolling updates for safer deployments.

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

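With maxSurge: 1 and maxUnavailable: 0, Kubernetes creates one extra Pod at a time and removes an old Pod only after the new one is Ready, so serving capacity does not drop during the update.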

Rollback Command

kubectl rollout undo deployment/payment-service

Rollback should always be part of the production deployment plan.
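
Deployment Example

The strategy fragment above belongs inside a Deployment spec. A minimal sketch of a complete manifest follows. The image, registry, labels, replica count, and revisionHistoryLimit are assumptions. The readiness probe is the one from the reliability section, because the rolling update waits for it before removing old Pods.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  revisionHistoryLimit: 5          # old ReplicaSets kept for rollout undo
  selector:
    matchLabels:
      app: payment-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: payment-service       # must match spec.selector
    spec:
      containers:
        - name: app
          image: registry.example.com/payment-service:1.4.2   # pinned tag, hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
```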

7. Autoscaling Best Practices

Use HPA for Pod scaling and Cluster Autoscaler for node scaling.

Traffic Spike
   |
   v
HPA Adds Pods
   |
   v
Cluster Autoscaler Adds Nodes if Needed

Without autoscaling, applications may fail during peak traffic.
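
HPA Example

A minimal HPA sketch for the flow above, assuming a Deployment named payment-service and a 70% CPU target. The replica bounds and target value are illustrative. CPU utilization is measured against the container's CPU requests, so the target Pods need requests defined and the cluster needs a metrics source such as Metrics Server.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service          # assumed Deployment name
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of requested CPU, averaged over Pods
```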

8. Observability Best Practices

A production Kubernetes cluster should include monitoring, logging, and alerting.

Recommended Stack

Prometheus for metrics
Grafana for dashboards
Loki or ELK for logs
Alertmanager for alerts
Jaeger or OpenTelemetry for tracing

Golden Signals to Monitor

Signal        Meaning
Latency       How long requests take
Traffic       How many requests are coming
Errors        How many requests fail
Saturation    How full the system is
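
Alert Rule Example

The Errors signal can be turned into a Prometheus alerting rule. The sketch below assumes the application exposes a counter named http_requests_total with a status label and is scraped under the job name payment-service. Both are assumptions, and the 5% threshold is illustrative.

```yaml
groups:
  - name: payment-service-alerts
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="payment-service"}[5m])) > 0.05
        for: 5m                    # condition must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "payment-service 5xx error rate above 5% for 5 minutes"
```

Alertmanager then routes the fired alert to the on-call channel based on labels such as severity.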

9. Backup and Disaster Recovery

Production systems should have backup plans for:

etcd
Databases
Persistent Volumes
Critical configuration
Secrets

Backups must be tested regularly. A backup that cannot be restored is not useful.

10. GitOps Best Practices

Use Git as the source of truth for Kubernetes configuration.

Tools like Argo CD and Flux help keep clusters synchronized with Git.

Git Repository
   |
   v
Argo CD / Flux
   |
   v
Kubernetes Cluster
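
Argo CD Application Example

With Argo CD, the flow above is usually declared as an Application resource. This is a sketch only. The repository URL, path, and branch are hypothetical, and it assumes Argo CD is installed in the argocd namespace.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-manifests.git   # hypothetical repository
    targetRevision: main
    path: apps/payment-service/production                         # hypothetical path
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made in the cluster
```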

Troubleshooting Kubernetes Problems

Troubleshooting Kubernetes requires a systematic approach. Do not guess. Start from the symptom and move layer by layer.

General Troubleshooting Flow

Problem Reported
   |
   v
Check Pods
   |
   v
Check Events
   |
   v
Check Logs
   |
   v
Check Services
   |
   v
Check ConfigMaps / Secrets
   |
   v
Check Nodes
   |
   v
Check Network Policies
   |
   v
Find Root Cause

Issue 1: Pod Stuck in Pending

A Pod stays in Pending when Kubernetes cannot schedule it.

Possible Causes

Insufficient CPU or memory
Node selector mismatch
Taints without tolerations
PVC not bound
Cluster Autoscaler issue

Commands

kubectl describe pod pod-name

kubectl get events

kubectl get nodes

kubectl describe node node-name

kubectl get pvc

Issue 2: ImagePullBackOff

ImagePullBackOff means the kubelet failed to pull the container image and is waiting longer between each retry.

Possible Causes

Wrong image name
Wrong image tag
Private registry authentication issue
Image does not exist
Network problem

Commands

kubectl describe pod pod-name

kubectl get secret

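kubectl create secret docker-registry regcred --docker-server=registry-server --docker-username=registry-user --docker-password=registry-password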
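
Creating the registry Secret is not enough on its own. The Pod must reference it, and the Secret must live in the same namespace as the Pod. A minimal sketch, assuming a hypothetical private image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
spec:
  imagePullSecrets:
    - name: regcred                # the docker-registry Secret created above
  containers:
    - name: app
      image: registry.example.com/payment-service:1.4.2   # hypothetical private image
```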

Issue 3: CrashLoopBackOff

CrashLoopBackOff means the container starts, crashes, and Kubernetes keeps restarting it.

Possible Causes

Application startup failure
Missing environment variables
Wrong ConfigMap or Secret
Database connection failure
Bad liveness probe
Application exception

Commands

kubectl logs pod-name

kubectl logs pod-name --previous

kubectl describe pod pod-name

kubectl get events

CrashLoopBackOff Debugging Flow

Pod Restarting
   |
   v
Check Current Logs
   |
   v
Check Previous Logs
   |
   v
Check Environment Variables
   |
   v
Check ConfigMaps and Secrets
   |
   v
Check Probe Configuration
   |
   v
Fix Application or Config

Issue 4: Service Not Reachable

If a Service is not reachable, the problem may be in labels, endpoints, ports, or network policies.

Commands

kubectl get svc

kubectl describe svc service-name

kubectl get endpoints

kubectl get pods --show-labels

kubectl exec -it pod-name -- curl http://service-name

Common Causes

Service selector does not match Pod labels
Target port is wrong
Pods are not ready
Network Policy blocks traffic
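
Selector and Port Matching Example

The first two causes are easiest to see side by side. In this sketch the names, label, and ports are assumptions. The Service selector must equal the Pod template labels, and targetPort must equal the port the container listens on. If either differs, kubectl get endpoints shows no addresses for the Service.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment-service           # must match the Pod labels below
  ports:
    - port: 80                     # port clients use: http://payment-service
      targetPort: 8080             # must match containerPort below
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service       # label the Service selects on
    spec:
      containers:
        - name: app
          image: registry.example.com/payment-service:1.4.2   # hypothetical image
          ports:
            - containerPort: 8080
```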

Issue 5: DNS Not Working

If services cannot resolve names, check CoreDNS.

Commands

kubectl get pods -n kube-system

kubectl logs -n kube-system deployment/coredns

kubectl exec -it pod-name -- nslookup service-name

Symptoms

UnknownHostException
Service name resolution failure
Intermittent communication errors

Issue 6: OOMKilled

OOMKilled means the container used more memory than its limit.

Commands

kubectl describe pod pod-name

kubectl top pod pod-name

kubectl logs pod-name --previous

Fix Options

Increase memory limit
Fix memory leak
Optimize application memory usage
Scale horizontally

Issue 7: Node NotReady

A node becomes NotReady when Kubernetes cannot confirm its health.

Possible Causes

Kubelet issue
Disk pressure
Memory pressure
Network issue
Container runtime issue

Commands

kubectl get nodes

kubectl describe node node-name

kubectl top nodes

journalctl -u kubelet

Node Maintenance Flow

Node Problem Detected
   |
   v
Cordon Node
   |
   v
Drain Node
   |
   v
Fix Node
   |
   v
Uncordon Node

Commands

kubectl cordon node-name

kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

kubectl uncordon node-name
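
PodDisruptionBudget Example

kubectl drain evicts Pods, and evictions respect PodDisruptionBudgets. A sketch like the one below keeps at least two replicas of a critical service running while a node is drained. The name, label, and minAvailable value are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb        # hypothetical name
  namespace: production
spec:
  minAvailable: 2                  # evictions are refused below this count
  selector:
    matchLabels:
      app: payment-service         # assumed Pod label
```

If the budget cannot be satisfied, for example with only two replicas and minAvailable: 2, the drain waits instead of evicting, so the replica count should be higher than minAvailable.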

Issue 8: Ingress Not Working

Possible Causes

Ingress Controller not running
DNS not pointing correctly
TLS certificate issue
Wrong backend service
Wrong service port

Commands

kubectl get ingress

kubectl describe ingress ingress-name

kubectl get pods -n ingress-nginx

kubectl logs pod-name -n ingress-nginx

kubectl get svc

Issue 9: HPA Not Scaling

Possible Causes

Metrics Server not installed
No CPU requests defined (utilization targets are calculated as a percentage of requests)
Target utilization too high
Max replicas reached
Pods not becoming ready

Commands

kubectl get hpa

kubectl describe hpa hpa-name

kubectl top pods

kubectl top nodes

Issue 10: Persistent Volume Problems

Possible Causes

PVC stuck Pending
No matching PV
StorageClass missing
Access mode mismatch
Cloud storage provisioning issue

Commands

kubectl get pvc

kubectl describe pvc pvc-name

kubectl get pv

kubectl get storageclass
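
PVC Example

Several of the causes above come from fields in the claim itself. In this sketch the claim name, size, and the StorageClass name standard are assumptions. The storageClassName must appear in the output of kubectl get storageclass, and the access mode must be one the storage backend supports.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data             # hypothetical name
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce                # mounted read-write by a single node
  storageClassName: standard       # assumed StorageClass; must exist in the cluster
  resources:
    requests:
      storage: 20Gi
```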

Production Incident Response Flow

Alert Triggered
   |
   v
Identify Impact
   |
   v
Check Dashboards
   |
   v
Check Logs
   |
   v
Check Recent Deployments
   |
   v
Roll Back if Needed
   |
   v
Fix Root Cause
   |
   v
Document Incident

Real-Time Incident Example

A new payment service version is deployed. After deployment:

Error rate increases
Payment latency increases
Pods restart frequently

Action Plan

Check Grafana dashboard
Check payment service logs
Check rollout history
Roll back deployment
Analyze failed version
Fix issue and redeploy safely

Useful Kubernetes Debug Commands

kubectl get pods -A

kubectl describe pod pod-name

kubectl logs pod-name

kubectl logs pod-name --previous

kubectl get events --sort-by=.metadata.creationTimestamp

kubectl get svc

kubectl get endpoints

kubectl top pods

kubectl top nodes

kubectl rollout status deployment/app

kubectl rollout undo deployment/app

Production Readiness Checklist

Requests and limits configured
HPA configured for important services
Cluster Autoscaler enabled
Liveness/readiness/startup probes configured
RBAC follows least privilege
Secrets are managed securely
Network Policies are applied
Monitoring and logging are configured
Backups are tested
Rollbacks are documented
Ingress TLS is configured
CI/CD pipeline includes testing
Pod disruption budgets are configured for critical workloads

Common Production Mistakes

1. No Resource Limits

Can cause node instability.

2. No Readiness Probe

Traffic may reach unready Pods.

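3. Using the latest Image Tag

Creates unpredictable deployments, because the tag can point to a different image each time it is pulled. Pin a specific version tag or image digest.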

4. No Monitoring

Issues are detected late.

5. Weak RBAC

Users may get excessive permissions.

6. No Backup Testing

Backups may fail during real recovery.

Interview Questions

Q1: What are Kubernetes production best practices?

Use resource requests and limits, configure probes, enforce RBAC, manage Secrets securely, apply Network Policies, enable monitoring/logging, use autoscaling, and plan backups and rollbacks.

Q2: How do you troubleshoot CrashLoopBackOff?

Check logs, previous logs, Pod events, environment variables, ConfigMaps, Secrets, and probe configuration.

Q3: How do you troubleshoot Service connectivity issues?

Check Service selectors, endpoints, Pod labels, target ports, readiness status, DNS, and Network Policies.

Q4: What causes OOMKilled?

The container exceeded its memory limit and was killed by the Linux kernel's OOM killer. Kubernetes reports the termination reason as OOMKilled.

Q5: Why are readiness probes important?

They prevent traffic from reaching Pods that are not ready to serve requests.

Advanced Interview Questions

Q1: How do you handle a bad production deployment?

Check rollout status, review logs and metrics, pause or roll back the deployment, identify the root cause, and redeploy after fixing.

Q2: What is the difference between liveness and readiness probes?

Liveness checks whether a container should be restarted. Readiness checks whether a Pod should receive traffic.

Q3: Why use Network Policies?

To restrict Pod-to-Pod traffic and improve microservices security.

Q4: Why is observability important?

It helps detect, diagnose, and resolve production issues quickly.

Q5: Why should backups be tested?

Because untested backups may fail during real disaster recovery.

Summary

Production Kubernetes requires more than basic YAML files. It requires resource control, security, observability, automation, backup planning, and systematic troubleshooting.

Best practices prevent many issues before they happen. Troubleshooting skills help teams recover quickly when incidents occur.

For banking, e-commerce, healthcare, SaaS, fintech, and enterprise systems, Kubernetes production readiness directly impacts performance, reliability, security, and customer trust.

Mastering production best practices and troubleshooting makes developers, DevOps engineers, cloud engineers, and platform engineers much stronger in real-world Kubernetes operations.