ArgoCD Sync Waves and Resource Hooks: The Ultimate Enterprise Orchestration Guide

Master the advanced orchestration capabilities of ArgoCD. Learn how to design, implement, and debug complex deployment workflows using Sync Waves, Resource Hooks, and Custom Health Checks in mission-critical enterprise environments.

1. Introduction to GitOps Orchestration Challenges
2. What You Will Learn
3. Prerequisites
4. Core Concepts: Sync Waves vs. Resource Hooks
5. Deep Dive: ArgoCD Sync Waves
6. Deep Dive: ArgoCD Resource Hooks
7. Architectural Workflows & Lifecycle States
8. Combining Waves and Hooks: The Orchestration Matrix
9. Production-Grade Code Examples & Implementation
10. Custom Health Checks for Sync Wave Progression
11. Enterprise-Scale Production Scenarios
12. Common Pitfalls and Anti-Patterns
13. Troubleshooting and Debugging Guide
14. Monitoring and Observability
15. Security and RBAC for Hooks
16. Technical Interview Questions & Answers
17. Frequently Asked Questions (FAQs)
18. Summary & Next Steps

1. Introduction to GitOps Orchestration Challenges

In a declarative GitOps world, Kubernetes manages the eventual state of resources. However, Kubernetes does not natively understand the logical dependencies between your application components. It treats a database schema migration, a Redis cache cluster, a microservices deployment, and an ingress route as independent API objects to be reconciled concurrently.

This concurrent reconciliation leads to what platform engineers call the "Startup Race Condition." For example, if your application backend starts before the database schema migration Job completes, the application pods will crash, trigger back-offs, and potentially degrade cluster performance or cause cascading failures. While Kubernetes initContainers can check for database availability, they cannot easily orchestrate complex, multi-stage, cluster-wide deployment workflows, such as running smoke tests, purging CDN caches, or notifying Slack channels only upon deployment success or failure.

ArgoCD solves this fundamental architectural limitation through two powerful mechanisms: Sync Waves and Resource Hooks. Together, they allow you to transform a chaotic, concurrent application sync into a highly structured, ordered, and resilient deployment pipeline directly from your Git repository.

2. What You Will Learn

This masterclass lesson provides a comprehensive, production-grade guide to mastering ArgoCD Sync Waves and Resource Hooks. By the end of this guide, you will be able to:

Design complex, multi-step deployment pipelines using Sync Waves.
Implement transactional database migrations and smoke tests using PreSync, Sync, PostSync, and SyncFail Hooks.
Manage the lifecycle of ephemeral hook pods using Hook Deletion Policies to prevent resource clutter and security vulnerabilities.
Write custom Lua-based health checks in ArgoCD to ensure complex Custom Resource Definitions (CRDs) block or allow sync progression correctly.
Troubleshoot and debug stuck sync phases using the ArgoCD CLI, UI, and controller logs.
Configure Prometheus metrics and alerts to monitor Hook and Sync Wave execution times and failures in production.

3. Prerequisites

To fully benefit from this advanced guide, you should have a solid foundation in the following areas:

Kubernetes Administration: Deep familiarity with Pods, Jobs, Deployments, Services, RBAC (ServiceAccounts, Roles, RoleBindings), and Custom Resource Definitions (CRDs).
ArgoCD Fundamentals: Understanding of the ArgoCD Application CRD, sync policies (manual and automated), and basic reconciliation loops. If you need a refresher, please refer to our previous lesson on ArgoCD Application Controller Internals.
GitOps Workflows: Experience managing Kubernetes manifests in Git repositories.
Command Line Tools: Access to a Kubernetes cluster with kubectl and the argocd CLI installed.

4. Core Concepts: Sync Waves vs. Resource Hooks

What is the difference between ArgoCD Sync Waves and Resource Hooks?

ArgoCD Sync Waves are integer annotations (e.g., argocd.argoproj.io/sync-wave: "5") applied to manifests that dictate the linear order in which resources are applied to the Kubernetes cluster. ArgoCD applies resources from the lowest wave number to the highest, waiting for all resources in wave N to reach a "Healthy" state before moving to wave N+1.

ArgoCD Resource Hooks are annotations (e.g., argocd.argoproj.io/hook: PreSync) that trigger ephemeral tasks (typically Kubernetes Jobs) at specific phases of the synchronization lifecycle (such as before, during, or after the main resource application, or upon sync failure). While Sync Waves order the deployment of desired-state resources, Hooks execute transient actions to coordinate external systems or perform validation tasks.

5. Deep Dive: ArgoCD Sync Waves

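By default, when ArgoCD synchronizes an application, every resource belongs to wave 0 and is applied in a single pass, ordered only by resource kind and name (for example, Namespaces before workloads). That ordering does not wait for one resource to become healthy before the next is applied. If you have resources that must be ready before others can successfully initialize, you can assign them to different Sync Waves.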

How Sync Waves Work

ArgoCD parses the argocd.argoproj.io/sync-wave annotation on every resource manifest. This annotation accepts any valid positive or negative integer (e.g., "-5", "0", "100"). If a resource does not have this annotation, it defaults to a wave value of 0.

The synchronization process follows a strict execution flow:

ArgoCD groups all resources in the Application by their wave number.
ArgoCD sorts these waves in ascending numerical order (e.g., -10 runs before -5, which runs before 0, which runs before 5).
ArgoCD applies the resources belonging to the lowest wave.
ArgoCD pauses the synchronization loop and monitors the status of the applied resources. It waits until all resources in the current wave are reported as Healthy by the controller.
Once the current wave is fully Healthy, ArgoCD moves to the next wave and repeats the process.
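
The grouping and ordering steps above can be sketched in a few lines of Python. This is a simplified model for illustration, not ArgoCD's actual implementation: the `wave_of` and `plan_waves` helpers, the sample resource names, and the dictionary shape of each manifest are assumptions.

```python
from collections import defaultdict

WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"


def wave_of(resource: dict) -> int:
    """Return the sync wave of a manifest, defaulting to 0 when unannotated."""
    annotations = resource.get("metadata", {}).get("annotations") or {}
    return int(annotations.get(WAVE_ANNOTATION, "0"))


def plan_waves(resources: list[dict]) -> list[tuple[int, list[str]]]:
    """Group resource names by wave and return the waves in ascending order."""
    groups: dict[int, list[str]] = defaultdict(list)
    for res in resources:
        groups[wave_of(res)].append(res["metadata"]["name"])
    return sorted(groups.items())


# Hypothetical manifests, reduced to the fields the sketch needs.
manifests = [
    {"metadata": {"name": "frontend", "annotations": {WAVE_ANNOTATION: "5"}}},
    {"metadata": {"name": "namespace-config", "annotations": {WAVE_ANNOTATION: "-10"}}},
    {"metadata": {"name": "api"}},  # no annotation, so it lands in wave 0
    {"metadata": {"name": "database", "annotations": {WAVE_ANNOTATION: "-5"}}},
]

for wave, names in plan_waves(manifests):
    print(f"wave {wave}: {names}")
```

In the real controller each wave is additionally gated on resource health before the next one is applied; the sketch only covers the ordering.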

The Importance of Resource Health

It is critical to understand that Sync Waves rely entirely on ArgoCD's capability to assess resource health. For standard Kubernetes objects (like Deployments, StatefulSets, DaemonSets, and PersistentVolumeClaims), ArgoCD has built-in health assessment logic. For example, a Deployment is only "Healthy" when its replica count matches the desired replica count and all replicas are running and ready.

If a resource in Wave 1 never becomes "Healthy" (for instance, a Deployment with a failing liveness probe that stays "Progressing" or turns "Degraded"), ArgoCD will not apply any resources in Wave 2 or higher. The sync operation does not move forward until the resource recovers or the operation fails, times out, or is terminated. This prevents broken configurations from cascading through your stack.

6. Deep Dive: ArgoCD Resource Hooks

While Sync Waves are excellent for ordering persistent resources, enterprise deployments often require executing transient tasks that should not remain as long-running workloads in the cluster. This is where Resource Hooks come in.

Supported Hook Phases

ArgoCD supports several hook phases, defined by the argocd.argoproj.io/hook annotation:

Hook Phase	Execution Timing	Common Use Cases
`PreSync`	Executes before any manifests in the application are applied.	Database backups, schema migrations, pre-flight infrastructure checks, pausing external alerts.
`Sync`	Executes after `PreSync` hooks complete, concurrently with the main application resources.	Complex orchestration steps that must run side-by-side with the main deployment.
`PostSync`	Executes after all main application resources have successfully reached a "Healthy" state.	Smoke testing, integration testing, cache warming, Slack/Teams notifications, CDN cache invalidation.
`SyncFail`	Executes only when the synchronization operation fails (e.g., a resource fails to become healthy or a hook fails).	Sending critical failure alerts, automated rollbacks, clean-up operations, capturing debug logs.
`Skip`	Tells ArgoCD to skip applying this resource during a sync.	Preserving specific historical manifests, manual intervention templates.

Hook Deletion Policies

Because hooks typically create Kubernetes Jobs, Pods, or custom tasks, these resources will remain in the cluster after execution, consuming namespace quota and cluttering your observability tools. To manage this lifecycle automatically, ArgoCD provides the argocd.argoproj.io/hook-delete-policy annotation.

You can configure one or more of the following deletion policies:

HookSucceeded: ArgoCD deletes the hook resource automatically if it completes successfully (e.g., a Job completes with exit code 0). This is the standard policy for successful migrations and smoke tests.
HookFailed: ArgoCD deletes the hook resource if it fails. Warning: In production, you may want to avoid this policy for critical jobs so that engineers can inspect the failing pod logs to debug issues.
BeforeHookCreation: ArgoCD deletes any existing hook resource of the same name before creating a new one. This is highly recommended for continuous delivery pipelines, as it guarantees that subsequent sync runs will not fail due to a "Resource already exists" error.

Multiple deletion policies can be combined using a comma-separated list, for example:

metadata:
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
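
To make the combination rules concrete, here is a minimal Python sketch of how a comma-separated policy annotation maps to deletion decisions. The `should_delete_hook` function and its event names are hypothetical and only model the behavior described above; the fallback to BeforeHookCreation for an empty annotation follows what the ArgoCD documentation describes as the default.

```python
def should_delete_hook(policy_annotation: str, event: str) -> bool:
    """Decide whether a hook resource is deleted for a given lifecycle event.

    event is one of "before_creation", "succeeded", or "failed".
    """
    policies = {p.strip() for p in policy_annotation.split(",") if p.strip()}
    if not policies:
        # Assumption based on the ArgoCD docs: no policy means BeforeHookCreation.
        policies = {"BeforeHookCreation"}
    trigger = {
        "before_creation": "BeforeHookCreation",
        "succeeded": "HookSucceeded",
        "failed": "HookFailed",
    }[event]
    return trigger in policies


policy = "BeforeHookCreation,HookSucceeded"
print(should_delete_hook(policy, "succeeded"))        # True
print(should_delete_hook(policy, "failed"))           # False
print(should_delete_hook(policy, "before_creation"))  # True
```

With this combination a failed hook is kept until the next sync recreates it, which leaves its pod logs available for debugging.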

7. Architectural Workflows & Lifecycle States

To design resilient deployment pipelines, you must understand the exact sequence of events that the ArgoCD Application Controller executes during a sync operation. Below are detailed ASCII flowcharts illustrating these lifecycles.

The Complete Sync Wave and Hook Lifecycle Flow

+-----------------------------------------------------------------+
|                    Sync Operation Triggered                     |
+-----------------------------------------------------------------+
                                 |
                                 v
+-----------------------------------------------------------------+
| 1. Execute PreSync Hooks (ordered by sync wave, lowest first)   |
|    - Wait for each PreSync Hook to complete successfully        |
+-----------------------------------------------------------------+
                                 |
                                 v
+-----------------------------------------------------------------+
| 2. Execute Main Sync Phase (ordered by sync wave, lowest first) |
|    - Apply manifests for Wave N                                 |
|    - Wait for Wave N resources to become HEALTHY                |
|    - Move to Wave N+1                                           |
+-----------------------------------------------------------------+
                                 |
        +------------------------+------------------------+
        |                                                 |
        | (All Waves Healthy)                             | (Any Wave Fails / Timeouts)
        v                                                 v
+----------------------------------+            +----------------------------------+
| 3. Execute PostSync Hooks        |            | 3. Execute SyncFail Hooks        |
|    - Ordered by Sync Waves       |            |    - Trigger alerts/rollbacks    |
|    - Wait for completion         |            +----------------------------------+
+----------------------------------+                              |
        |                                                         v
        v                                               +----------------------------------+
+----------------------------------+                    | Sync Operation Marked: FAILED    |
| Sync Operation Marked: SUCCESS   |                    +----------------------------------+
+----------------------------------+
                                 |
                                 v
+-----------------------------------------------------------------+
| 4. Apply Hook Deletion Policies                                 |
|    - Clean up Jobs/Pods based on HookSucceeded/HookFailed       |
+-----------------------------------------------------------------+

Internal Micro-Step of a Single Sync Wave

This diagram zooms into how ArgoCD processes a single wave (Wave N) during the "Main Sync Phase":

   +-------------------------------------------------------+
   |              Enter Sync Wave N Reconcile              |
   +-------------------------------------------------------+
                               |
                               v
   +-------------------------------------------------------+
   | Apply all Kubernetes Manifests assigned to Wave N     |
   +-------------------------------------------------------+
                               |
                               v
                     [Reconciliation Loop]
                               |
                               v
            /-------------------------------------\
           / Is every resource in Wave N Healthy?  \
           \                                       /
            \-------------------------------------/
                            /     \
                    YES    /       \   NO (Progressing / Degraded)
                          /         \
                         v           v
   +---------------------------+   +-----------------------------------+
   | Proceed to Sync Wave N+1  |   | Check Sync Timeout Limit          |
   +---------------------------+   +-----------------------------------+
                                             |
                                    /-----------------\
                                   /  Has sync timed   \
                                  /      out yet?       \
                                  \                     /
                                   \-------------------/
                                       /           \
                               YES    /             \  NO (Keep Waiting)
                                     /               \
                                    v                 v
                     +-------------------+    +----------------------+
                     | Trigger SyncFail  |    | Re-evaluate Health   |
                     | Hook Execution    |    | after delay interval |
                     +-------------------+    +----------------------+

8. Combining Waves and Hooks: The Orchestration Matrix

One of the most powerful, yet least understood, features of ArgoCD is that Sync Waves and Resource Hooks can be combined on the same resource. This allows you to orchestrate complex dependencies within a specific hook phase.

For example, if you have three different PreSync tasks, you do not have to run them concurrently. By assigning them different Sync Waves, you can run them sequentially:

PreSync Hook with Wave -5: Disables alerts in your monitoring system (e.g., Datadog, Prometheus/Alertmanager).
PreSync Hook with Wave -2: Performs a snapshot backup of your PostgreSQL database.
PreSync Hook with Wave 0: Executes the database schema migration scripts.

ArgoCD respects this matrix across all execution phases. The controller processes the phases in order (PreSync -> Sync -> PostSync), and within *each* of those phases, it executes resources sorted by their Sync Wave values.

The Complete Execution Matrix

Execution Order	Resource Type	Hook Phase Annotation	Sync Wave Annotation	Description
1	Job	`PreSync`	`-5`	First PreSync Hook (e.g., Silence Monitoring Alerts).
2	Job	`PreSync`	`-2`	Second PreSync Hook (e.g., DB Snapshot Backup).
3	Job	`PreSync`	`0`	Third PreSync Hook (e.g., Schema Migration).
4	Namespace / ConfigMap	None (Standard)	`-10`	Core infrastructure resources applied before workloads.
5	Deployment (Database)	None (Standard)	`-5`	Database workloads applied and verified healthy.
6	Deployment (API App)	None (Standard)	`0`	Core Application backend deployed.
7	Deployment (Frontend)	None (Standard)	`5`	Frontend UI deployed after the backend API is ready.
8	Job	`PostSync`	`1`	First PostSync Hook (e.g., Smoke test critical API endpoints).
9	Job	`PostSync`	`10`	Second PostSync Hook (e.g., Re-enable alerts, send Slack success notification).
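
The matrix boils down to a two-level sort: first by phase, then by wave within the phase. The Python sketch below reproduces the table's order from shuffled input. The resource names are hypothetical labels for the rows above, and the `execution_order` helper is an illustration rather than ArgoCD code.

```python
PHASE_ORDER = {"PreSync": 0, "Sync": 1, "PostSync": 2}


def execution_order(resources: list[dict]) -> list[str]:
    """Sort by phase first, then by wave; resources without a hook run in Sync."""
    ordered = sorted(
        resources,
        key=lambda r: (PHASE_ORDER[r.get("hook", "Sync")], r.get("wave", 0)),
    )
    return [r["name"] for r in ordered]


# Rows of the matrix, deliberately shuffled.
resources = [
    {"name": "frontend", "wave": 5},
    {"name": "smoke-test", "hook": "PostSync", "wave": 1},
    {"name": "schema-migration", "hook": "PreSync", "wave": 0},
    {"name": "namespace-config", "wave": -10},
    {"name": "notify-success", "hook": "PostSync", "wave": 10},
    {"name": "silence-alerts", "hook": "PreSync", "wave": -5},
    {"name": "api", "wave": 0},
    {"name": "db-backup", "hook": "PreSync", "wave": -2},
    {"name": "database", "wave": -5},
]

for position, name in enumerate(execution_order(resources), start=1):
    print(position, name)
```

Note that the `namespace-config` resource at wave -10 still runs after every PreSync hook: a wave number never moves a resource into an earlier phase.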

9. Production-Grade Code Examples & Implementation

Below is a complete, production-ready scenario demonstrating a multi-tier application deployment. The scenario includes a ServiceAccount with fine-grained RBAC, a database migration job (PreSync), a backend deployment (Sync Wave 0), a smoke-test job (PostSync), and a failure-notification job (SyncFail).

1. RBAC and ServiceAccount for Hook Execution

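Hooks often need to interact with the Kubernetes API (e.g., to query other resources, delete temporary files, or trigger restarts). We define a dedicated ServiceAccount and Role with minimal privileges. Because the entire PreSync phase runs before any regular resource is applied, a sync-wave annotation alone would not create these objects in time for the migration Job on a first deployment, so they are declared as PreSync hooks in an earlier wave than the Job.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd-hook-executor
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-20" # Earliest PreSync wave, so it exists before the hook Jobs run
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argocd-hook-executor-role
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-20"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argocd-hook-executor-binding
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-20"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argocd-hook-executor-role
subjects:
  - kind: ServiceAccount
    name: argocd-hook-executor
    namespace: production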

2. PreSync Hook: Database Schema Migration Job

This Job runs before the application deployment starts. If the migration fails (exit code non-zero), the application deployment is blocked, preventing database corruption or app crashes.

---
apiVersion: batch/v1
kind: Job
metadata:
  name: prod-db-migration
  namespace: production
  annotations:
    # Mark this as a PreSync Hook
    argocd.argoproj.io/hook: PreSync
    # Execute this migration in wave -2 of the PreSync phase
    argocd.argoproj.io/sync-wave: "-2"
    # Recreate the job if it exists from a previous run, and clean up upon success
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  activeDeadlineSeconds: 300 # Prevent job from hanging indefinitely
  backoffLimit: 2            # Fail quickly to block the sync loop if things are broken
  template:
    spec:
      serviceAccountName: argocd-hook-executor
      restartPolicy: Never
      containers:
        - name: db-migrator
          image: postgres:15-alpine
          command:
            - /bin/sh
            - -c
            - |
              echo "Starting database migration..."
              # Simulate database connectivity and schema update checks
              for i in $(seq 1 5); do
                echo "Applying schema patch $i/5..."
                sleep 2
              done
              echo "Database schema migration completed successfully!"
          env:
            - name: DB_HOST
              value: "postgres-service.production.svc.cluster.local"
            - name: DB_USER
              value: "app_admin"
          resources:
            limits:
              cpu: "200m"
              memory: "256Mi"
            requests:
              cpu: "100m"
              memory: "128Mi"

3. Main Application Workload (Sync Wave 0)

This is the standard application Deployment. It will only be applied after the prod-db-migration Job completes successfully.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-api
  namespace: production
  annotations:
    # Belongs to the main Sync phase, wave 0 (applied after PreSync completes)
    argocd.argoproj.io/sync-wave: "0"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: product-api
  template:
    metadata:
      labels:
        app: product-api
    spec:
      containers:
        - name: api-server
          image: nginx:alpine
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 15
            periodSeconds: 10
          resources:
            limits:
              cpu: "500m"
              memory: "512Mi"
            requests:
              cpu: "250m"
              memory: "256Mi"

4. PostSync Hook: Integration / Smoke Test Job

This Job runs after the product-api Deployment has fully rolled out and all 3 replicas are 100% healthy.

---
apiVersion: batch/v1
kind: Job
metadata:
  name: prod-smoke-test
  namespace: production
  annotations:
    # Run after the sync phase completes and all resources are healthy
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "1"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  activeDeadlineSeconds: 120
  backoffLimit: 1
  template:
    spec:
      serviceAccountName: argocd-hook-executor
      restartPolicy: Never
      containers:
        - name: curl-tester
          # The plain alpine image does not ship curl; use an image that includes it
          image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            - |
              echo "Starting smoke tests against product-api..."
              # Query the internal service endpoint
              STATUS_CODE=$(curl -s -o /dev/null -w "%{http_code}" http://product-api.production.svc.cluster.local:80/)
              if [ "$STATUS_CODE" -eq 200 ]; then
                echo "Smoke test passed with status code $STATUS_CODE!"
                exit 0
              else
                echo "Smoke test failed! Received status code: $STATUS_CODE"
                exit 1
              fi
          resources:
            limits:
              cpu: "100m"
              memory: "128Mi"
            requests:
              cpu: "50m"
              memory: "64Mi"

5. SyncFail Hook: Notification and Alert Dispatcher

If any resource fails to sync, or if the prod-db-migration or prod-smoke-test Jobs fail, ArgoCD immediately aborts the pipeline and triggers this SyncFail Hook.

---
apiVersion: batch/v1
kind: Job
metadata:
  name: prod-sync-failure-handler
  namespace: production
  annotations:
    # Triggered only when the sync operation fails
    argocd.argoproj.io/hook: SyncFail
    # Clean up before running, but preserve failing pod for post-mortem analysis
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  activeDeadlineSeconds: 180
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: argocd-hook-executor
      restartPolicy: Never
      containers:
        - name: alert-dispatcher
          image: alpine:latest
          command:
            - /bin/sh
            - -c
            - |
              echo "CRITICAL: ArgoCD Synchronization failed for application 'product-suite'!"
              echo "Dispatching webhook alert to PagerDuty/Slack..."
              # Simulate API call to notification engine
              # curl -X POST -H 'Content-type: application/json' --data '{"text":"Deployment Failed!"}' $SLACK_WEBHOOK_URL
              echo "Alert successfully dispatched."
          resources:
            limits:
              cpu: "100m"
              memory: "128Mi"
            requests:
              cpu: "50m"
              memory: "64Mi"

10. Custom Health Checks for Sync Wave Progression

Because Sync Waves rely strictly on resources reaching a "Healthy" state, custom resources (CRDs) can easily undermine your pipeline. If ArgoCD does not know how to evaluate the health of a third-party or custom-built CRD, it has no health status to wait on and moves on as soon as the resource is applied, so later waves start before the resource is actually ready. Conversely, a health check that never reports "Healthy" leaves the resource permanently Progressing, which blocks your waves until the sync fails or is terminated.

To solve this, you can configure custom Lua health assessments inside the argocd-cm ConfigMap (or within the Application Controller settings).

Configuring a Lua Health Check for a Custom Resource

Below is an example of adding a health check for a custom database operator resource (MyDatabaseInstance) inside the argocd-cm manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.databases.example.com_MyDatabaseInstance: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.phase == "Running" then
        hs.status = "Healthy"
        hs.message = "Database is fully synchronized and running."
        return hs
      end
      if obj.status.phase == "Failed" then
        hs.status = "Degraded"
        -- obj.status.reason may be unset; concatenating nil would raise a Lua error
        hs.message = "Database initialization failed: " .. (obj.status.reason or "unknown")
        return hs
      end
    end
    hs.status = "Progressing"
    hs.message = "Database is provisioning schema or allocating storage..."
    return hs

By defining this Lua script, ArgoCD will monitor the .status.phase of your custom database resource. If it is assigned to Sync Wave 1, ArgoCD will successfully block Wave 2 from starting until the database operator sets the status phase to "Running".
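
Because a faulty health check can stall or bypass a wave, it helps to exercise each branch before shipping the script. One low-tech option is a Python mirror of the same decision logic that can be unit-tested locally. The `database_health` function below is a hand-written port for illustration (it also guards against a missing `reason`), not something ArgoCD generates or runs.

```python
def database_health(obj: dict) -> dict:
    """Python mirror of the Lua health check for MyDatabaseInstance."""
    status = obj.get("status")
    if status is not None:
        phase = status.get("phase")
        if phase == "Running":
            return {
                "status": "Healthy",
                "message": "Database is fully synchronized and running.",
            }
        if phase == "Failed":
            reason = status.get("reason") or "unknown"
            return {
                "status": "Degraded",
                "message": f"Database initialization failed: {reason}",
            }
    # No status yet, or an intermediate phase: keep the wave waiting.
    return {
        "status": "Progressing",
        "message": "Database is provisioning schema or allocating storage...",
    }


print(database_health({})["status"])                                  # Progressing
print(database_health({"status": {"phase": "Running"}})["status"])    # Healthy
print(database_health({"status": {"phase": "Failed"}})["message"])
```

The first case matters most in practice: a freshly created custom resource usually has no `status` at all, and the check must report Progressing rather than fail.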

11. Enterprise-Scale Production Scenarios

In large-scale enterprise environments, basic tutorials fall short. Below are three real-world deployment patterns designed for high availability, compliance, and zero-downtime operations.

Scenario A: Zero-Downtime Blue/Green Database Migrations

When running high-traffic APIs, you cannot take the database offline during a migration. However, running a schema migration concurrently with old application pods can break them (e.g., if you rename or delete a column). The solution is a multi-step, backwards-compatible rollout using Sync Waves.

+-------------------------------------------------------------------+
|  Wave -10: Run PreSync DB Migration Job (Add columns/nullable)    |
+-------------------------------------------------------------------+
                                 |
                                 v
+-------------------------------------------------------------------+
|  Wave 0: Deploy New App Version (Writes to both old/new columns)  |
+-------------------------------------------------------------------+
                                 |
                                 v
+-------------------------------------------------------------------+
|  Wave 5: Run PostSync Cleanup Job (Drop old columns/constraints)  |
+-------------------------------------------------------------------+

Step 1 (PreSync, Wave -10): The Database Migration Job applies *only* additive changes (e.g., adding a new nullable column). This ensures the currently running v1.0.0 application pods do not crash.
Step 2 (Sync, Wave 0): ArgoCD rolls out the v1.1.0 Application Deployment. These pods can read and write to both the old and new columns.
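Step 3 (PostSync, Wave 5): A final cleanup Job runs to migrate old data and drop the deprecated column. Because PostSync starts only after the new Deployment is Healthy, no v1.0.0 pods that still read the old column remain. Keep in mind that once the column is dropped, rolling back to v1.0.0 is no longer possible, so some teams defer this step to a later release.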

Scenario B: Automated Rollback via Argo Rollouts and SyncFail Hooks

While standard Kubernetes Deployments support rolling updates, they lack advanced progressive delivery features like canary rollouts. By combining ArgoCD Sync Waves with Argo Rollouts and SyncFail Hooks, you can build an automated, self-healing deployment pipeline:

Wave 0: Deploy the Rollout resource. The rollout begins routing 10% of traffic to the canary version.
Wave 5 (PostSync): A smoke test Job executes simulated user traffic against the canary endpoint.
If the smoke test fails, the Job exits with status code 1.
ArgoCD marks the sync as failed and immediately executes the SyncFail Hook.
The SyncFail Job triggers an API call to abort the rollout (e.g., kubectl argo rollouts abort product-api), instantly routing 100% of traffic back to the stable version.

Scenario C: Clean CDN and Cache Invalidation

Deploying static frontend assets to a Kubernetes cluster often results in users receiving stale cached versions from CDNs (like Cloudflare, Akamai, or AWS CloudFront). You can use a PostSync Hook to automate cache clearing:

Wave 0 (Sync): Deploy the updated Frontend pods and services.
Wave 1 (Sync): Update the Ingress controller or Gateway API configuration.
Wave 5 (PostSync): Run a Job that uses the AWS or Cloudflare CLI to invalidate the edge cache:
```
aws cloudfront create-invalidation --distribution-id E1234567890 --paths "/*"
```
This ensures that users immediately receive the new frontend assets the moment the deployment finishes.

12. Common Pitfalls and Anti-Patterns

Avoid these common design mistakes when working with Sync Waves and Resource Hooks in production environments:

1. The Infinite Progressing Loop (Deadlock)

The Scenario: You place a PreSync Hook Job in Wave 0, and your application Deployment in Wave 0. The Job depends on a ConfigMap or Secret that is a regular resource assigned to Wave 5 of the main Sync phase.

The Consequence: Waves only order resources within a phase, and the entire PreSync phase runs before any regular resource is applied. The Job therefore starts before Wave 5 is applied, and its pod cannot start because the Secret does not exist. The application Deployment is blocked, and the sync stalls until the Job fails or times out. Always ensure the foundational resources a hook consumes (Secrets, ConfigMaps, ServiceAccounts) either already exist in the cluster or are themselves hooks in the same phase, assigned to a wave *lower* than the hook that consumes them.

2. Omitting the Hook Deletion Policy

If you do not specify an argocd.argoproj.io/hook-delete-policy, current ArgoCD releases fall back to BeforeHookCreation: the previous Job is deleted only just before the next run, so the finished Job and its Pods linger in the cluster between syncs. On older releases without this default, a leftover Job with the same name can cause the next sync to fail with:
Error: jobs.batch "my-hook-job" already exists.
Declare BeforeHookCreation explicitly, and add HookSucceeded where appropriate, to ensure clean, repeatable sync operations.

3. High Backoff Limits on Hook Jobs

By default, Kubernetes Jobs have a backoffLimit of 6. If your hook job is failing due to a configuration error, Kubernetes will retry it 6 times with exponential back-off (10s, 20s, 40s, and so on, capped at six minutes), which can take 10–15 minutes once pod run time is included. For that whole period the sync operation stays in a running state and later phases cannot start. For Hook Jobs, set backoffLimit: 1 or 2 to fail fast and trigger alert mechanisms quickly.
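
The arithmetic behind that estimate can be checked with a few lines of Python. The sketch assumes the back-off schedule documented for Kubernetes Jobs (10 seconds, doubling per retry, capped at six minutes) and counts only the waiting time between attempts; container run time and scheduling latency come on top. The `worst_case_backoff_seconds` helper is a hypothetical name.

```python
def worst_case_backoff_seconds(backoff_limit: int, base_s: int = 10, cap_s: int = 360) -> int:
    """Sum the back-off delays before each retry: 10s, 20s, 40s, ... capped at 6 minutes."""
    return sum(min(base_s * 2 ** attempt, cap_s) for attempt in range(backoff_limit))


print(worst_case_backoff_seconds(6))  # 630 seconds, about 10.5 minutes
print(worst_case_backoff_seconds(2))  # 30 seconds
print(worst_case_backoff_seconds(1))  # 10 seconds
```

Dropping the limit from 6 to 2 cuts the pure waiting time from roughly ten minutes to half a minute, which is why a low limit suits hooks that gate a deployment.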

4. Mixing Server-Side Apply with Hooks

When using Server-Side Apply (SSA) in ArgoCD, ensure your hook definitions do not contain fields that conflict with other managers. Because hook resources are often ephemeral, conflicts can lead to partial applications or validation failures that block the sync controller.

13. Troubleshooting and Debugging Guide

When a deployment is stuck or failing, follow this systematic guide to isolate and resolve the issue.

Step 1: Identify the Stuck Wave or Hook via CLI

Use the argocd app get command to view the live synchronization status and identify which resource is blocking progression:

argocd app get product-suite --refresh

Look for the STATUS and HEALTH columns in the resource table at the end of the output. A resource that is still Progressing is your blocker; cross-check its sync-wave annotation in Git to see which wave it is holding up. The listing below is illustrative and abridged (the MESSAGE column is omitted):

GROUP  KIND        NAMESPACE   NAME                 STATUS     HEALTH       HOOK
batch  Job         production  prod-db-migration    Succeeded               PreSync
apps   Deployment  production  product-api          Synced     Progressing
       Service     production  product-api-service  Synced     Healthy

In this example, the database migration succeeded, but the product-api deployment is stuck in Progressing, blocking any PostSync hooks from starting.
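
When an Application contains dozens of resources, scanning the table by eye gets tedious. A minimal sketch of automating the check in Python follows. It assumes the JSON produced by `argocd app get <app> -o json` exposes per-resource health under `status.resources[].health.status`, as the Application CRD does; the `find_blockers` helper and the sample data are hypothetical.

```python
def find_blockers(app: dict) -> list[str]:
    """Return 'Kind/name: health' for every resource that reports a non-Healthy status."""
    blockers = []
    for res in app.get("status", {}).get("resources", []):
        health = (res.get("health") or {}).get("status")
        # Resources without a health assessment (e.g. ConfigMaps) are skipped.
        if health is not None and health != "Healthy":
            blockers.append(f'{res["kind"]}/{res["name"]}: {health}')
    return blockers


# Hand-written sample shaped like the Application status; not real CLI output.
sample_app = {
    "status": {
        "resources": [
            {"kind": "Deployment", "name": "product-api", "health": {"status": "Progressing"}},
            {"kind": "Service", "name": "product-api-service", "health": {"status": "Healthy"}},
            {"kind": "ConfigMap", "name": "product-api-config"},
        ]
    }
}

print(find_blockers(sample_app))  # ['Deployment/product-api: Progressing']
```

To run it against a live Application, parse the CLI's JSON output with `json.load` and pass the resulting dictionary to `find_blockers`.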

Step 2: Inspect the Stuck Resource

Query Kubernetes directly to find out why the resource is not healthy. For a Deployment, check the replica status and event logs:

kubectl describe deployment product-api -n production

Common issues include:

ImagePullBackOff: The container image tag is incorrect or the registry credentials are missing.
Failing Liveness/Readiness Probes: The application is crashing or not starting within the allocated time.
Insufficient CPU/Memory: The cluster does not have enough allocatable resources to schedule the new pods.

Step 3: Debugging Failing Hook Jobs

If a Hook Job failed and was deleted due to a HookFailed policy, you can temporarily remove that policy in Git so the next run preserves the pod for debugging. Once preserved, retrieve the logs of the failed hook pod:

kubectl logs -l job-name=prod-db-migration -n production --tail=100

Step 4: Force-Terminating a Stuck Sync

If a sync is stuck and blocking other operations, you can terminate it safely using the ArgoCD CLI:

argocd app terminate-op product-suite

This command instructs the ArgoCD controller to stop waiting for resource health and abort the active synchronization, which marks the sync operation as failed.

14. Monitoring and Observability

To operate ArgoCD reliably at enterprise scale, you must monitor the performance and failure rates of your Sync Waves and Hooks. ArgoCD exports detailed Prometheus metrics natively.

Key Prometheus Metrics to Track

Metric Name	Type	Description
`argocd_app_sync_total`	Counter	Total number of application synchronization operations. Partition by the `phase` label (for example `Succeeded`, `Failed`, `Error`).
`argocd_app_reconcile`	Histogram	The time taken to reconcile application state, in seconds. Useful for detecting slow sync loops.
`argocd_app_k8s_request_total`	Counter	Number of Kubernetes API requests executed during application reconciliation.

Sample Prometheus Alert Rules

Below is a Prometheus Alert
