Published: 2026-06-01 • Updated: 2026-07-05

Horizontal Pod Autoscaling (HPA) in Kubernetes: Complete Real-Time Production Guide

Modern cloud-native applications face unpredictable traffic patterns. Some applications may receive very low traffic during normal hours but experience massive spikes during sales events, banking transactions, viral campaigns, or seasonal traffic peaks.

Manually scaling Pods during these situations is slow, risky, and inefficient.

Kubernetes solves this problem using Horizontal Pod Autoscaling (HPA).

HPA automatically increases or decreases the number of Pods based on real-time workload metrics such as:

  • CPU utilization
  • Memory usage
  • Custom metrics
  • External metrics

This guide covers HPA fundamentals along with:

  • Real-world banking examples
  • E-commerce traffic scaling examples
  • Production architecture diagrams
  • Metrics Server explanation
  • Cluster Autoscaler integration
  • Scaling algorithms
  • Advanced scaling policies
  • Custom metrics usage
  • Production troubleshooting
  • Common scaling mistakes
  • Enterprise best practices
  • Interview-focused explanations

Why Is Autoscaling Important?

Applications rarely receive constant traffic.

Examples:

  • E-commerce traffic spikes during sales
  • Banking traffic increases during salary dates
  • Food delivery apps spike during lunch hours
  • Streaming platforms spike during live events
  • Travel websites spike during holidays

Without autoscaling:

  • Applications may become slow
  • Pods may crash
  • Users may experience downtime
  • Infrastructure costs may increase unnecessarily

Real-Time E-Commerce Example

Suppose an e-commerce website normally handles:

  • 5,000 users during regular hours

But during a flash sale:

  • Traffic suddenly jumps to 500,000 users

Without autoscaling:

  • Frontend Pods overload
  • Checkout APIs become slow
  • Payments fail
  • Users abandon carts

With HPA:

  • Kubernetes automatically creates more Pods
  • Traffic distributes across replicas
  • Performance remains stable

What is Horizontal Pod Autoscaler?

Horizontal Pod Autoscaler automatically scales the number of Pod replicas based on observed metrics.

HPA works with:

  • Deployments
  • ReplicaSets
  • StatefulSets

It adjusts replicas dynamically according to workload demand.


Simple Understanding

Traffic            | HPA Action
Traffic increases  | Scale up Pods
Traffic decreases  | Scale down Pods

HPA Workflow


Users Send Requests
         |
         v
Pods Handle Traffic
         |
         v
CPU/Memory Usage Increases
         |
         v
Metrics Server Collects Metrics
         |
         v
HPA Evaluates Metrics
         |
         +-- High Usage ---> Add More Pods
         |
         +-- Low Usage ---> Remove Pods

Real-Time Banking Example

Suppose a banking application processes:

  • UPI payments
  • Fund transfers
  • Account balance checks
  • Bill payments

During salary dates or festival shopping periods:

  • Transaction volume increases massively

Without HPA:

  • Payment APIs become overloaded
  • Transaction latency increases
  • Timeouts occur
  • Customers face failures

With HPA:

  • Additional Pods are created automatically
  • Load distributes across Pods
  • Response time improves
  • System remains stable

Metrics Server in Kubernetes

HPA depends on metrics.

For CPU and memory, Kubernetes uses Metrics Server, which collects resource usage for nodes and Pods from the kubelet on each node and serves it through the Metrics API.


Metrics Flow Diagram


Pods Running
      |
      v
Kubelet Collects Metrics
      |
      v
Metrics Server Aggregates Metrics
      |
      v
HPA Reads Metrics
      |
      v
Scaling Decision Taken

Installing Metrics Server

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Verify:

kubectl top pods

kubectl top nodes

Basic HPA YAML Example

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler

metadata:
  name: webapp-hpa

spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp

  minReplicas: 2
  maxReplicas: 10

  metrics:
  - type: Resource

    resource:
      name: cpu

      target:
        type: Utilization
        averageUtilization: 70

Understanding Important Fields

Field               | Purpose
scaleTargetRef      | Target Deployment or StatefulSet
minReplicas         | Minimum number of Pods
maxReplicas         | Maximum number of Pods
averageUtilization  | Target CPU utilization percentage

How HPA Calculates Scaling

Suppose:

  • Target CPU = 70%
  • Current CPU = 140%
  • Current replicas = 2

HPA calculates:

Desired Replicas = ceil(Current Replicas × (Current CPU / Target CPU))

= ceil(2 × (140 / 70))

= 4 Pods

Kubernetes scales to 4 replicas. The result is always rounded up, so a fractional value such as 2.6 becomes 3 Pods.
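
The calculation above can be sketched in a few lines of Python. This is a simplified illustration, not the controller's actual code: the function name and parameters are hypothetical, and it assumes the result is rounded up, clamped to minReplicas and maxReplicas, and left unchanged when the usage ratio is within 10% of the target (the documented default tolerance).

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA replica calculation (illustrative sketch only)."""
    ratio = current_value / target_value

    # Assumption: small deviations from the target are ignored so that
    # the replica count does not flap around the target value.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    # Round up so the workload is never under-provisioned.
    desired = math.ceil(current_replicas * ratio)

    # Never go outside the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 140, 70))                   # 4
print(desired_replicas(2, 90, 70))                    # 3
print(desired_replicas(10, 20, 70, min_replicas=2))   # 3
```

The real controller also accounts for Pods that are not yet ready or have missing metrics, which this sketch leaves out.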


Scale Up Example


Current Pods: 2
CPU Usage: 90%
Target CPU: 70%

HPA Decision:
ceil(2 × 90 / 70) = ceil(2.57) = 3 Pods

Scale Down Example


Current Pods: 10
CPU Usage: 20%
Target CPU: 70%

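HPA Decision:
ceil(10 × 20 / 70) = ceil(2.86) = 3 Pods (never below minReplicas), applied only after the scale-down stabilization window (300 seconds by default)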

HPA Scaling Flow


Traffic Spike
      |
      v
CPU Usage Increases
      |
      v
Metrics Server Detects High Usage
      |
      v
HPA Calculates Desired Replicas
      |
      v
Deployment Scales Up
      |
      v
More Pods Handle Traffic

Real-Time Video Streaming Example

Suppose a streaming platform broadcasts a live cricket final.

Traffic suddenly increases:

  • Millions of users start streaming simultaneously

HPA automatically scales:

  • Video APIs
  • Authentication services
  • Recommendation services
  • Chat services

This helps avoid outages during peak demand.


HPA with Memory Metrics

HPA can also scale using memory usage.

Memory Metric Example

metrics:
- type: Resource

  resource:
    name: memory

    target:
      type: Utilization
      averageUtilization: 80

CPU vs Memory Scaling

Metric  | Best For
CPU     | Web applications, APIs
Memory  | Java apps, caching systems

Custom Metrics in HPA

Production systems often require scaling using business metrics.

Examples:

  • Queue length
  • Request latency
  • Kafka lag
  • Requests per second
  • Active users

Real-Time Queue Processing Example

Suppose an order processing service consumes messages from Kafka.

During high traffic:

  • Kafka queue grows rapidly

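A custom metric such as queue length can then drive scaling. With a target of 1000 messages per Pod, HPA adds replicas whenever the average queue length per Pod rises above 1000.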
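
As a rough sketch of how a queue-based target translates into a replica count, the Python below divides the total backlog by a per-Pod target. The function name, the backlog figures, and the replica bounds are hypothetical, and the sketch ignores the tolerance and stabilization logic that the real controller applies.

```python
import math


def replicas_for_backlog(total_backlog, target_per_pod,
                         min_replicas, max_replicas):
    """Replicas needed so each Pod handles at most target_per_pod messages."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")

    # Average-value targets reduce to: total metric / target per Pod, rounded up.
    desired = math.ceil(total_backlog / target_per_pod)

    # Keep the result inside the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4500 queued messages with a target of 1000 per Pod
print(replicas_for_backlog(4500, 1000, min_replicas=2, max_replicas=10))  # 5
```

Because the target is expressed per Pod, the same setting keeps working as the backlog grows: a larger queue simply yields more replicas, up to maxReplicas.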


HPA with Prometheus

Advanced HPA setups commonly use:

  • Prometheus
  • Prometheus Adapter

This enables custom metric-based autoscaling.


HPA + Cluster Autoscaler

HPA scales Pods.

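But what if the cluster nodes do not have enough capacity for the new Pods? Those Pods stay in the Pending state.

This is where Cluster Autoscaler becomes important: it adds nodes when Pods cannot be scheduled.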


Complete Autoscaling Workflow


Traffic Increases
       |
       v
HPA Creates More Pods
       |
       v
Cluster Has No Free Resources
       |
       v
Cluster Autoscaler Adds New Nodes
       |
       v
New Pods Scheduled Successfully

Real-Time Production Architecture


Users
  |
  v
Ingress Controller
  |
  v
Frontend Deployment
  |
  v
HPA Monitors CPU
  |
  +-- Scale Up During Peak Traffic
  |
  +-- Scale Down During Off Hours

Scaling Policies

Aggressive scaling may cause:

  • Frequent Pod creation/removal
  • Application instability
  • Increased infrastructure cost

Kubernetes provides:

  • Stabilization windows
  • Scaling policies

to avoid scaling thrashing.


Example Scaling Behavior

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300

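With this setting, HPA scales down only to the highest replica count recommended during the previous 300 seconds, so a brief dip in load does not remove Pods immediately.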
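
The effect of a scale-down stabilization window can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the class name and timestamps are hypothetical, and the logic only models the rule that the highest recommendation seen inside the window wins.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keeps recent recommendations and returns the highest one in the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended replicas)

    def stabilize(self, now, recommendation):
        self._history.append((now, recommendation))

        # Drop recommendations that are older than the window.
        cutoff = now - self.window_seconds
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()

        # Scale down only as far as the highest recent recommendation.
        return max(rec for _, rec in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.stabilize(0, 10))    # 10
print(stabilizer.stabilize(60, 4))    # 10 (the earlier 10 is still in the window)
print(stabilizer.stabilize(301, 3))   # 4  (the 10 has expired)
```

A higher recommendation passes through immediately, because it is itself the maximum in the window. Only reductions are delayed, which is what prevents thrashing when load briefly dips.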


Why Does HPA Need Resource Requests?

HPA calculates utilization as:

Actual usage ÷ Requested resources

Without CPU requests:

  • HPA cannot compute CPU utilization for those Pods and will not scale on that metric
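
To make the dependency on requests concrete, the Python sketch below computes CPU utilization across a set of Pods as total usage divided by total requests. The helper names and the Pod dictionaries are hypothetical, and the quantity parser handles only millicore ("200m") and plain core ("0.5") values.

```python
def parse_cpu_millicores(quantity):
    """Convert a CPU quantity such as "200m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def cpu_utilization_percent(pods):
    """Utilization = total CPU usage / total CPU requests, as a percentage."""
    total_usage = 0
    total_request = 0
    for pod in pods:
        if pod.get("request") is None:
            # Without a request there is no denominator for this Pod.
            raise ValueError(f"pod {pod['name']} has no CPU request")
        total_usage += parse_cpu_millicores(pod["usage"])
        total_request += parse_cpu_millicores(pod["request"])
    return total_usage * 100 // total_request


pods = [
    {"name": "webapp-1", "usage": "150m", "request": "200m"},
    {"name": "webapp-2", "usage": "130m", "request": "200m"},
]
print(cpu_utilization_percent(pods))  # 70
```

The same usage against a smaller request yields a higher percentage, which is why poorly chosen requests lead to scaling that is too early or too late.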

Deployment Example with Requests

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"

  limits:
    cpu: "500m"
    memory: "512Mi"

Common Mistakes

1. Metrics Server Not Installed

HPA cannot function without metrics.

2. No Resource Requests

HPA cannot calculate utilization without requests, so it does not scale on that metric.

3. Very Low minReplicas

Sudden traffic spikes may overwhelm applications.

4. Extremely High maxReplicas

Infrastructure costs may increase unexpectedly.

5. Ignoring Memory-Based Scaling

Some applications are memory-intensive rather than CPU-intensive.


Production Troubleshooting Commands

kubectl get hpa

kubectl describe hpa webapp-hpa

kubectl top pods

kubectl top nodes

kubectl get deployment

kubectl describe deployment webapp

Real-Time Production Failure Example

Suppose:

  • Application becomes slow during traffic spike

Possible Causes

  • Metrics Server missing
  • Incorrect CPU requests
  • maxReplicas too low
  • Cluster lacks nodes
  • Readiness probes failing

Troubleshooting Flow


Traffic Spike
      |
      v
Application Slow
      |
      v
Check HPA Status
      |
      v
Check Metrics Server
      |
      v
Check CPU Requests
      |
      v
Check Node Capacity
      |
      v
Verify Scaling Events

Best Practices

  • Always define resource requests
  • Monitor HPA behavior continuously
  • Use realistic min/max replicas
  • Test scaling under load
  • Combine HPA with Cluster Autoscaler
  • Configure readiness probes so new Pods receive traffic only once they can serve it
  • Avoid aggressive scaling policies
  • Use custom metrics for advanced workloads

Interview Questions

Q1: What is Horizontal Pod Autoscaler?

HPA automatically scales Pod replicas based on metrics such as CPU or memory usage.

Q2: What metrics does HPA use?

CPU, memory, custom metrics, and external metrics.

Q3: Difference between HPA and VPA?

HPA scales replicas horizontally, while VPA adjusts CPU and memory resources vertically.

Q4: Why does HPA require Metrics Server?

Metrics Server provides CPU and memory usage data required for scaling decisions.

Q5: What is Cluster Autoscaler?

Cluster Autoscaler automatically adds or removes Kubernetes nodes based on resource demand.


Interview Trap Questions

Can HPA work without resource requests?

Not properly. CPU utilization calculations depend on requests.

Does HPA instantly create Pods?

No. Scaling takes some time depending on image pull and startup duration.

Can HPA scale StatefulSets?

Yes, HPA supports StatefulSets.

Does HPA replace Cluster Autoscaler?

No. HPA scales Pods while Cluster Autoscaler scales nodes.


Summary

Horizontal Pod Autoscaler is one of the most important Kubernetes features for building scalable, cost-efficient, and resilient applications.

HPA automatically adjusts the number of Pods based on real-time demand, helping applications handle traffic spikes while reducing infrastructure waste during low traffic periods.

Modern enterprises rely heavily on HPA for:

  • Banking platforms
  • E-commerce systems
  • Streaming platforms
  • AI workloads
  • SaaS applications
  • Cloud-native microservices

Understanding HPA deeply helps developers and DevOps engineers design highly scalable production-ready Kubernetes architectures confidently.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
