ArgoCD Sync Waves and Resource Hooks: The Ultimate Enterprise Orchestration Guide
Master the advanced orchestration capabilities of ArgoCD. Learn how to design, implement, and debug complex deployment workflows using Sync Waves, Resource Hooks, and Custom Health Checks in mission-critical enterprise environments.
Table of Contents
- 1. Introduction to GitOps Orchestration Challenges
- 2. What You Will Learn
- 3. Prerequisites
- 4. Core Concepts: Sync Waves vs. Resource Hooks
- 5. Deep Dive: ArgoCD Sync Waves
- 6. Deep Dive: ArgoCD Resource Hooks
- 7. Architectural Workflows & Lifecycle States
- 8. Combining Waves and Hooks: The Orchestration Matrix
- 9. Production-Grade Code Examples & Implementation
- 10. Custom Health Checks for Sync Wave Progression
- 11. Enterprise-Scale Production Scenarios
- 12. Common Pitfalls and Anti-Patterns
- 13. Troubleshooting and Debugging Guide
- 14. Monitoring and Observability
- 15. Security and RBAC for Hooks
- 16. Technical Interview Questions & Answers
- 17. Frequently Asked Questions (FAQs)
- 18. Summary & Next Steps
1. Introduction to GitOps Orchestration Challenges
In a declarative GitOps world, Kubernetes manages the eventual state of resources. However, Kubernetes does not natively understand the logical dependencies between your application components. It treats a database schema migration, a Redis cache cluster, a microservices deployment, and an ingress route as independent API objects to be reconciled concurrently.
This concurrent reconciliation leads to what platform engineers call the "Startup Race Condition." For example, if your application backend starts before the database schema migration Job completes, the application pods will crash, trigger back-offs, and potentially degrade cluster performance or cause cascading failures. While Kubernetes initContainers can check for database availability, they cannot easily orchestrate complex, multi-stage, cluster-wide deployment workflows—such as running smoke tests, clearing external CDNs, or notifying Slack channels only upon deployment success or failure.
ArgoCD solves this fundamental architectural limitation through two powerful mechanisms: Sync Waves and Resource Hooks. Together, they allow you to transform a chaotic, concurrent application sync into a highly structured, ordered, and resilient deployment pipeline directly from your Git repository.
2. What You Will Learn
This masterclass lesson provides a comprehensive, production-grade guide to mastering ArgoCD Sync Waves and Resource Hooks. By the end of this guide, you will be able to:
- Design complex, multi-step deployment pipelines using Sync Waves.
- Implement transactional database migrations and smoke tests using PreSync, Sync, PostSync, and SyncFail Hooks.
- Manage the lifecycle of ephemeral hook pods using Hook Deletion Policies to prevent resource clutter and security vulnerabilities.
- Write custom Lua-based health checks in ArgoCD to ensure complex Custom Resource Definitions (CRDs) block or allow sync progression correctly.
- Troubleshoot and debug stuck sync phases using the ArgoCD CLI, UI, and controller logs.
- Configure Prometheus metrics and alerts to monitor Hook and Sync Wave execution times and failures in production.
3. Prerequisites
To fully benefit from this advanced guide, you should have a solid foundation in the following areas:
- Kubernetes Administration: Deep familiarity with Pods, Jobs, Deployments, Services, RBAC (ServiceAccounts, Roles, RoleBindings), and Custom Resource Definitions (CRDs).
- ArgoCD Fundamentals: Understanding of the ArgoCD Application CRD, sync policies (manual and automated), and basic reconciliation loops. If you need a refresher, please refer to our previous lesson on ArgoCD Application Controller Internals.
- GitOps Workflows: Experience managing Kubernetes manifests in Git repositories.
- Command Line Tools: Access to a Kubernetes cluster with kubectl and the argocd CLI installed.
4. Core Concepts: Sync Waves vs. Resource Hooks
What is the difference between ArgoCD Sync Waves and Resource Hooks?
ArgoCD Sync Waves are integer annotations (e.g., argocd.argoproj.io/sync-wave: "5") applied to manifests that dictate the linear order in which resources are applied to the Kubernetes cluster. ArgoCD applies resources from the lowest wave number to the highest, waiting for all resources in wave N to reach a "Healthy" state before moving to wave N+1.
ArgoCD Resource Hooks are annotations (e.g., argocd.argoproj.io/hook: PreSync) that trigger ephemeral tasks (typically Kubernetes Jobs) at specific phases of the synchronization lifecycle (such as before, during, or after the main resource application, or upon sync failure). While Sync Waves order the deployment of desired-state resources, Hooks execute transient actions to coordinate external systems or perform validation tasks.
5. Deep Dive: ArgoCD Sync Waves
By default, when ArgoCD synchronizes an application, it applies all manifests in a single, parallel batch. If you have resources that must exist before others can successfully initialize, you can assign them to different Sync Waves.
How Sync Waves Work
ArgoCD parses the argocd.argoproj.io/sync-wave annotation on every resource manifest. This annotation accepts any valid positive or negative integer (e.g., "-5", "0", "100"). If a resource does not have this annotation, it defaults to a wave value of 0.
The synchronization process follows a strict execution flow:
- ArgoCD groups all resources in the Application by their wave number.
- ArgoCD sorts these waves in ascending numerical order (e.g., -10 runs before -5, which runs before 0, which runs before 5).
- ArgoCD applies the resources belonging to the lowest wave.
- ArgoCD pauses the synchronization loop and monitors the status of the applied resources. It waits until all resources in the current wave are reported as Healthy by the controller.
- Once the current wave is fully Healthy, ArgoCD moves to the next wave and repeats the process.
The Importance of Resource Health
It is critical to understand that Sync Waves rely entirely on ArgoCD's capability to assess resource health. For standard Kubernetes objects (like Deployments, StatefulSets, DaemonSets, and PersistentVolumeClaims), ArgoCD has built-in health assessment logic. For example, a Deployment is only "Healthy" when its replica count matches the desired replica count and all replicas are running and ready.
If a resource in Wave 1 does not become "Healthy", ArgoCD will not apply any resources in Wave 2 or higher. A resource that stays "Progressing" (for instance, a Deployment whose pods never pass their readiness probe) keeps the sync waiting, by default with no time limit, while a resource that turns "Degraded" causes the sync operation to fail. Either way, broken configurations are prevented from cascading through your stack.
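As a minimal sketch of the ordering described in this section, the two manifests below place a ConfigMap in wave -1 and the Deployment that reads it in wave 1. The names (app-settings, demo-api), the demo namespace, and the nginx image are illustrative assumptions and not part of any real application.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-settings              # hypothetical name
  namespace: demo                 # hypothetical namespace
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # applied before wave 0 and wave 1
data:
  LOG_LEVEL: "info"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-api                  # hypothetical name
  namespace: demo
  annotations:
    argocd.argoproj.io/sync-wave: "1"    # applied only after earlier waves are healthy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-api
  template:
    metadata:
      labels:
        app: demo-api
    spec:
      containers:
        - name: api
          image: nginx:alpine     # placeholder image
          envFrom:
            - configMapRef:
                name: app-settings
```

Annotation values are strings in Kubernetes, which is why the wave number is quoted.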
6. Deep Dive: ArgoCD Resource Hooks
While Sync Waves are excellent for ordering persistent resources, enterprise deployments often require executing transient tasks that should not remain as long-running workloads in the cluster. This is where Resource Hooks come in.
Supported Hook Phases
ArgoCD supports several hook phases, defined by the argocd.argoproj.io/hook annotation:
| Hook Phase | Execution Timing | Common Use Cases |
|---|---|---|
| PreSync | Executes before any manifests in the application are applied. | Database backups, schema migrations, pre-flight infrastructure checks, pausing external alerts. |
| Sync | Executes after PreSync hooks complete, concurrently with the main application resources. | Complex orchestration steps that must run side-by-side with the main deployment. |
| PostSync | Executes after all main application resources have successfully reached a "Healthy" state. | Smoke testing, integration testing, cache warming, Slack/Teams notifications, CDN cache invalidation. |
| SyncFail | Executes only when the synchronization operation fails (e.g., a resource fails to become healthy or a hook fails). | Sending critical failure alerts, automated rollbacks, clean-up operations, capturing debug logs. |
| Skip | Tells ArgoCD to skip applying this resource during a sync. | Preserving specific historical manifests, manual intervention templates. |
Hook Deletion Policies
Because hooks typically create Kubernetes Jobs, Pods, or custom tasks, these resources will remain in the cluster after execution, consuming namespace quota and cluttering your observability tools. To manage this lifecycle automatically, ArgoCD provides the argocd.argoproj.io/hook-delete-policy annotation.
You can configure one or more of the following deletion policies:
- HookSucceeded: ArgoCD deletes the hook resource automatically if it completes successfully (e.g., a Job completes with exit code 0). This is the standard policy for successful migrations and smoke tests.
- HookFailed: ArgoCD deletes the hook resource if it fails. Warning: In production, you may want to avoid this policy for critical jobs so that engineers can inspect the failing pod logs to debug issues.
- BeforeHookCreation: ArgoCD deletes any existing hook resource of the same name before creating a new one. This is highly recommended for continuous delivery pipelines, as it guarantees that subsequent sync runs will not fail due to a "Resource already exists" error.
Multiple deletion policies can be combined using a comma-separated list, for example:
metadata:
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
7. Architectural Workflows & Lifecycle States
To design resilient deployment pipelines, you must understand the exact sequence of events that the ArgoCD Application Controller executes during a sync operation. Below are detailed ASCII flowcharts illustrating these lifecycles.
The Complete Sync Wave and Hook Lifecycle Flow
+-----------------------------------------------------------------+
|                    Sync Operation Triggered                     |
+-----------------------------------------------------------------+
                                 |
                                 v
+-----------------------------------------------------------------+
| 1. Execute PreSync Hooks (ordered by Sync Wave, ascending)      |
|    - Wait for each PreSync Hook to complete successfully        |
+-----------------------------------------------------------------+
                                 |
                                 v
+-----------------------------------------------------------------+
| 2. Execute Main Sync Phase (ordered by Sync Wave, ascending)    |
|    - Apply manifests for Wave N                                 |
|    - Wait for Wave N resources to become HEALTHY                |
|    - Move to Wave N+1                                           |
+-----------------------------------------------------------------+
                                 |
        +------------------------+------------------------+
        |                                                 |
        | (All Waves Healthy)                             | (Any Wave Fails / Times Out)
        v                                                 v
+----------------------------------+   +----------------------------------+
| 3. Execute PostSync Hooks        |   | 3. Execute SyncFail Hooks        |
|    - Ordered by Sync Waves       |   |    - Trigger alerts/rollbacks    |
|    - Wait for completion         |   +----------------------------------+
+----------------------------------+                    |
                 |                                      v
                 v                     +----------------------------------+
+----------------------------------+   | Sync Operation Marked: FAILED    |
| Sync Operation Marked: SUCCESS   |   +----------------------------------+
+----------------------------------+
                 |
                 v
+-----------------------------------------------------------------+
| 4. Apply Hook Deletion Policies                                 |
|    - Clean up Jobs/Pods based on HookSucceeded/HookFailed       |
+-----------------------------------------------------------------+
Internal Micro-Step of a Single Sync Wave
This diagram zooms into how ArgoCD processes a single wave (Wave N) during the "Main Sync Phase":
+-------------------------------------------------------+
|              Enter Sync Wave N Reconcile              |
+-------------------------------------------------------+
                            |
                            v
+-------------------------------------------------------+
|   Apply all Kubernetes Manifests assigned to Wave N   |
+-------------------------------------------------------+
                            |
                            v
                  [Reconciliation Loop]
                            |
                            v
         /-------------------------------------\
        / Is every resource in Wave N Healthy?  \
        \                                       /
         \-------------------------------------/
                   /                 \
              YES /                   \ NO (Progressing / Degraded)
                 /                     \
                v                       v
+---------------------------+   +-----------------------------------+
| Proceed to Sync Wave N+1  |   | Check Sync Timeout Limit          |
+---------------------------+   +-----------------------------------+
                                                  |
                                         /-----------------\
                                        /  Has sync timed   \
                                       /      out yet?       \
                                       \                     /
                                        \-------------------/
                                           /             \
                                      YES /               \ NO (Keep Waiting)
                                         /                 \
                                        v                   v
                              +-------------------+   +----------------------+
                              | Trigger SyncFail  |   | Re-evaluate Health   |
                              | Hook Execution    |   | after delay interval |
                              +-------------------+   +----------------------+
8. Combining Waves and Hooks: The Orchestration Matrix
One of the most powerful, yet least understood, features of ArgoCD is that Sync Waves and Resource Hooks can be combined on the same resource. This allows you to orchestrate complex dependencies within a specific hook phase.
For example, if you have three different PreSync tasks, you do not have to run them concurrently. By assigning them different Sync Waves, you can run them sequentially:
- PreSync Hook with Wave -5: Disables alerts in your monitoring system (e.g., Datadog, Prometheus/Alertmanager).
- PreSync Hook with Wave -2: Performs a snapshot backup of your PostgreSQL database.
- PreSync Hook with Wave 0: Executes the database schema migration scripts.
ArgoCD respects this matrix across all execution phases. The controller processes the phases in order (PreSync -> Sync -> PostSync), and within *each* of those phases, it executes resources sorted by their Sync Wave values.
The Complete Execution Matrix
| Execution Order | Resource Type | Hook Phase Annotation | Sync Wave Annotation | Description |
|---|---|---|---|---|
| 1 | Job | PreSync | -5 | First PreSync Hook (e.g., Silence Monitoring Alerts). |
| 2 | Job | PreSync | -2 | Second PreSync Hook (e.g., DB Snapshot Backup). |
| 3 | Job | PreSync | 0 | Third PreSync Hook (e.g., Schema Migration). |
| 4 | Namespace / ConfigMap | None (Standard) | -10 | Core infrastructure resources applied before workloads. |
| 5 | Deployment (Database) | None (Standard) | -5 | Database workloads applied and verified healthy. |
| 6 | Deployment (API App) | None (Standard) | 0 | Core Application backend deployed. |
| 7 | Deployment (Frontend) | None (Standard) | 5 | Frontend UI deployed after the backend API is ready. |
| 8 | Job | PostSync | 1 | First PostSync Hook (e.g., Smoke test critical API endpoints). |
| 9 | Job | PostSync | 10 | Second PostSync Hook (e.g., Re-enable alerts, send Slack success notification). |
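The first two rows of the matrix could be declared roughly as follows. This is a sketch only: the Job names, the busybox image, and the echo commands are placeholders standing in for real alert-silencing and backup tooling.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: silence-alerts            # hypothetical name
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-5"   # runs first within the PreSync phase
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: silence
          image: busybox:1.36     # placeholder image
          command: ["sh", "-c", "echo 'silencing alerts (placeholder)'"]
---
apiVersion: batch/v1
kind: Job
metadata:
  name: db-snapshot               # hypothetical name
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-2"   # starts only after the wave -5 hook has succeeded
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: snapshot
          image: busybox:1.36     # placeholder image
          command: ["sh", "-c", "echo 'taking database snapshot (placeholder)'"]
```

Both Jobs share the PreSync phase, so only their wave values decide which one runs first.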
9. Production-Grade Code Examples & Implementation
Below is a complete, production-ready scenario demonstrating a multi-tier application deployment. The scenario includes a ServiceAccount with fine-grained RBAC, a database migration job (PreSync), a backend deployment (Sync Wave 0), a smoke-test job (PostSync), and a failure-notification job (SyncFail).
1. RBAC and ServiceAccount for Hook Execution
Hooks often need to interact with the Kubernetes API (e.g., to query other resources or trigger restarts). We define a dedicated ServiceAccount and Role with minimal privileges. Note that sync waves only order resources within a phase: because these three objects carry no hook annotation, they are applied in the Sync phase, after all PreSync hooks have run. For the PreSync migration Job to use this ServiceAccount on the very first sync, the ServiceAccount must already exist in the cluster (for example, created by a separate bootstrap Application), or these objects must themselves be annotated as PreSync hooks in an earlier wave.
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd-hook-executor
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "-20" # Earliest wave of the Sync phase; PreSync hooks still run before it
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argocd-hook-executor-role
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "-20"
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argocd-hook-executor-binding
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "-20"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argocd-hook-executor-role
subjects:
  - kind: ServiceAccount
    name: argocd-hook-executor
    namespace: production
2. PreSync Hook: Database Schema Migration Job
This Job runs before the application deployment starts. If the migration fails (exit code non-zero), the application deployment is blocked, preventing database corruption or app crashes.
---
apiVersion: batch/v1
kind: Job
metadata:
  name: prod-db-migration
  namespace: production
  annotations:
    # Mark this as a PreSync Hook
    argocd.argoproj.io/hook: PreSync
    # Execute this migration in wave -2 of the PreSync phase
    argocd.argoproj.io/sync-wave: "-2"
    # Recreate the job if it exists from a previous run, and clean up upon success
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  activeDeadlineSeconds: 300 # Prevent job from hanging indefinitely
  backoffLimit: 2 # Fail quickly to block the sync loop if things are broken
  template:
    spec:
      serviceAccountName: argocd-hook-executor
      restartPolicy: Never
      containers:
        - name: db-migrator
          image: postgres:15-alpine
          command:
            - /bin/sh
            - -c
            - |
              echo "Starting database migration..."
              # Simulate database connectivity and schema update checks
              for i in $(seq 1 5); do
                echo "Applying schema patch $i/5..."
                sleep 2
              done
              echo "Database schema migration completed successfully!"
          env:
            - name: DB_HOST
              value: "postgres-service.production.svc.cluster.local"
            - name: DB_USER
              value: "app_admin"
          resources:
            limits:
              cpu: "200m"
              memory: "256Mi"
            requests:
              cpu: "100m"
              memory: "128Mi"
3. Main Application Workload (Sync Wave 0)
This is the standard application Deployment. It will only be applied after the prod-db-migration Job completes successfully.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-api
  namespace: production
  annotations:
    # Belongs to the main Sync phase, wave 0 (applied after PreSync completes)
    argocd.argoproj.io/sync-wave: "0"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: product-api
  template:
    metadata:
      labels:
        app: product-api
    spec:
      containers:
        - name: api-server
          image: nginx:alpine
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 15
            periodSeconds: 10
          resources:
            limits:
              cpu: "500m"
              memory: "512Mi"
            requests:
              cpu: "250m"
              memory: "256Mi"
4. PostSync Hook: Integration / Smoke Test Job
This Job runs after the product-api Deployment has fully rolled out and all 3 replicas are 100% healthy.
---
apiVersion: batch/v1
kind: Job
metadata:
  name: prod-smoke-test
  namespace: production
  annotations:
    # Run after the sync phase completes and all resources are healthy
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "1"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  activeDeadlineSeconds: 120
  backoffLimit: 1
  template:
    spec:
      serviceAccountName: argocd-hook-executor
      restartPolicy: Never
      containers:
        - name: curl-tester
          # alpine:latest does not include curl, so use an image that ships it
          image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            - |
              echo "Starting smoke tests against product-api..."
              # Query the internal service endpoint
              STATUS_CODE=$(curl -s -o /dev/null -w "%{http_code}" http://product-api.production.svc.cluster.local:80/)
              # String comparison avoids a shell error if curl prints nothing
              if [ "$STATUS_CODE" = "200" ]; then
                echo "Smoke test passed with status code $STATUS_CODE!"
                exit 0
              else
                echo "Smoke test failed! Received status code: $STATUS_CODE"
                exit 1
              fi
          resources:
            limits:
              cpu: "100m"
              memory: "128Mi"
            requests:
              cpu: "50m"
              memory: "64Mi"
5. SyncFail Hook: Notification and Alert Dispatcher
If any resource fails to sync, or if the prod-db-migration or prod-smoke-test Jobs fail, ArgoCD immediately aborts the pipeline and triggers this SyncFail Hook.
---
apiVersion: batch/v1
kind: Job
metadata:
  name: prod-sync-failure-handler
  namespace: production
  annotations:
    # Triggered only when the sync operation fails
    argocd.argoproj.io/hook: SyncFail
    # Clean up before running, but preserve failing pod for post-mortem analysis
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  activeDeadlineSeconds: 180
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: argocd-hook-executor
      restartPolicy: Never
      containers:
        - name: alert-dispatcher
          image: alpine:latest
          command:
            - /bin/sh
            - -c
            - |
              echo "CRITICAL: ArgoCD Synchronization failed for application 'product-suite'!"
              echo "Dispatching webhook alert to PagerDuty/Slack..."
              # Simulate API call to notification engine
              # curl -X POST -H 'Content-type: application/json' --data '{"text":"Deployment Failed!"}' $SLACK_WEBHOOK_URL
              echo "Alert successfully dispatched."
          resources:
            limits:
              cpu: "100m"
              memory: "128Mi"
            requests:
              cpu: "50m"
              memory: "64Mi"
10. Custom Health Checks for Sync Wave Progression
Because Sync Waves rely strictly on resources reaching a "Healthy" state, custom resources need special attention. If ArgoCD has no health check for a third-party or custom-built kind, it reports no health status for it and treats the resource as done as soon as it is applied, so the wave moves on before the resource is actually ready. Conversely, a health check that never returns Healthy leaves the resource Progressing and blocks all later waves.
To solve this, you can configure custom Lua health assessments inside the argocd-cm ConfigMap.
Configuring a Lua Health Check for a Custom Resource
Below is an example of adding a health check for a custom database operator resource (MyDatabaseInstance) inside the argocd-cm manifest:
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.databases.example.com_MyDatabaseInstance: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.phase == "Running" then
        hs.status = "Healthy"
        hs.message = "Database is fully synchronized and running."
        return hs
      end
      if obj.status.phase == "Failed" then
        hs.status = "Degraded"
        -- Fall back to a default text: concatenating a nil reason raises a Lua error
        hs.message = "Database initialization failed: " .. (obj.status.reason or "unknown reason")
        return hs
      end
    end
    hs.status = "Progressing"
    hs.message = "Database is provisioning schema or allocating storage..."
    return hs
By defining this Lua script, ArgoCD will monitor the .status.phase of your custom database resource. If it is assigned to Sync Wave 1, ArgoCD will successfully block Wave 2 from starting until the database operator sets the status phase to "Running".
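To illustrate, the custom resource and a dependent object could be annotated as below. The apiVersion databases.example.com/v1alpha1, the spec field, and the resource names are hypothetical; what matters is that the group and kind match the key used in argocd-cm.

```yaml
apiVersion: databases.example.com/v1alpha1   # hypothetical API version
kind: MyDatabaseInstance
metadata:
  name: orders-db                 # hypothetical name
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # the Lua health check gates progress past this wave
spec:
  storage: 20Gi                   # hypothetical field
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: orders-db-clients         # hypothetical object that depends on the database
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "2"   # applied only once orders-db reports Healthy
data:
  DB_NAME: orders
```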
11. Enterprise-Scale Production Scenarios
In large-scale enterprise environments, basic tutorials fall short. Below are three real-world deployment patterns designed for high availability, compliance, and zero-downtime operations.
Scenario A: Zero-Downtime Blue/Green Database Migrations
When running high-traffic APIs, you cannot take the database offline during a migration. However, running a schema migration concurrently with old application pods can break them (e.g., if you rename or delete a column). The solution is a multi-step, backwards-compatible rollout using Sync Waves.
+-------------------------------------------------------------------+
| Wave -10: Run PreSync DB Migration Job (Add columns/nullable)     |
+-------------------------------------------------------------------+
                                  |
                                  v
+-------------------------------------------------------------------+
| Wave 0: Deploy New App Version (Writes to both old/new columns)   |
+-------------------------------------------------------------------+
                                  |
                                  v
+-------------------------------------------------------------------+
| Wave 5: Run PostSync Cleanup Job (Drop old columns/constraints)   |
+-------------------------------------------------------------------+
- Step 1 (PreSync, Wave -10): The Database Migration Job applies *only* additive changes (e.g., adding a new nullable column). This ensures the currently running v1.0.0 application pods do not crash.
- Step 2 (Sync, Wave 0): ArgoCD rolls out the v1.1.0 Application Deployment. These pods can read and write to both the old and new columns.
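- Step 3 (PostSync, Wave 5): A final cleanup Job runs to migrate old data and drop the deprecated column. This step is only safe once no running pod depends on the old column, and it removes the option of rolling back to v1.0.0 without a reverse migration, so some teams defer it to a later release.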
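A cleanup Job for Step 3 might look like the sketch below. The table and column names, the db-credentials Secret, and its url key are assumptions made for illustration; the Job uses the same postgres:15-alpine image as the migration example because it ships the psql client.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: drop-legacy-column        # hypothetical name
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "5"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  activeDeadlineSeconds: 300
  backoffLimit: 0                 # a destructive step should not be retried blindly
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cleanup
          image: postgres:15-alpine
          command:
            - /bin/sh
            - -c
            - |
              # Table and column names are placeholders.
              psql "$DATABASE_URL" -v ON_ERROR_STOP=1 \
                -c 'ALTER TABLE customers DROP COLUMN IF EXISTS legacy_name;'
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials   # hypothetical Secret holding a PostgreSQL connection URI
                  key: url
```

Using IF EXISTS keeps the statement idempotent, so a repeated sync does not fail on an already dropped column.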
Scenario B: Automated Rollback via Argo Rollouts and SyncFail Hooks
While standard Kubernetes Deployments support rolling updates, they lack advanced progressive delivery features like canary rollouts. By combining ArgoCD Sync Waves with Argo Rollouts and SyncFail Hooks, you can build an automated, self-healing deployment pipeline:
- Wave 0: Deploy the Rollout resource. The rollout begins routing 10% of traffic to the canary version.
- Wave 5 (PostSync): A smoke test Job executes simulated user traffic against the canary endpoint.
- If the smoke test fails, the Job exits with status code 1.
- ArgoCD marks the sync as failed and immediately executes the SyncFail Hook.
- The SyncFail Job triggers an API call to abort the rollout (e.g., kubectl argo rollouts abort product-api), instantly routing 100% of traffic back to the stable version.
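The final step could be expressed as a SyncFail hook along these lines. Treat it as a sketch under stated assumptions: the image reference is a placeholder for any image that ships kubectl together with the Argo Rollouts plugin, and the rollout-aborter ServiceAccount is hypothetical and would need RBAC permissions on rollouts.argoproj.io resources.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: abort-canary-rollout      # hypothetical name
  namespace: production
  annotations:
    argocd.argoproj.io/hook: SyncFail
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  activeDeadlineSeconds: 120
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: rollout-aborter   # hypothetical; must be allowed to modify Rollout objects
      restartPolicy: Never
      containers:
        - name: abort
          # Placeholder image: substitute one that contains kubectl and the Argo Rollouts plugin.
          image: example.com/tools/kubectl-argo-rollouts:stable
          command: ["kubectl", "argo", "rollouts", "abort", "product-api", "-n", "production"]
```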
Scenario C: Clean CDN and Cache Invalidation
Deploying static frontend assets to a Kubernetes cluster often results in users receiving stale cached versions from CDNs (like Cloudflare, Akamai, or AWS CloudFront). You can use a PostSync Hook to automate cache clearing:
- Wave 0 (Sync): Deploy the updated Frontend pods and services.
- Wave 1 (Sync): Update the Ingress controller or Gateway API configuration.
- Wave 5 (PostSync): Run a Job that uses the AWS or Cloudflare CLI to invalidate the edge cache:
aws cloudfront create-invalidation --distribution-id E1234567890 --paths "/*"
This ensures that users immediately receive the new frontend assets the moment the deployment finishes.
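Wrapped in a PostSync hook, the CloudFront invalidation might look like the following sketch. The Job name and the cdn-invalidator ServiceAccount are hypothetical, and the example assumes the pod obtains AWS credentials through that ServiceAccount (for instance via IAM Roles for Service Accounts).

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: invalidate-cdn-cache      # hypothetical name
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "5"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  activeDeadlineSeconds: 120
  backoffLimit: 1
  template:
    spec:
      serviceAccountName: cdn-invalidator   # hypothetical; assumed to provide AWS credentials
      restartPolicy: Never
      containers:
        - name: invalidate
          image: amazon/aws-cli:latest      # the image entrypoint is the aws binary
          args:
            - cloudfront
            - create-invalidation
            - --distribution-id
            - E1234567890
            - --paths
            - "/*"
```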
12. Common Pitfalls and Anti-Patterns
Avoid these common design mistakes when working with Sync Waves and Resource Hooks in production environments:
1. The Infinite Progressing Loop (Deadlock)
The Scenario: You place a PreSync Hook Job in Wave 0, and your application Deployment in Wave 0. The Job depends on a ConfigMap or Secret that is assigned to Wave 5.
The Consequence: Sync waves only order resources within a phase, and the entire PreSync phase runs before any Sync-phase resource is applied, whatever its wave number. The Job therefore starts before the Secret exists, its pod cannot start, and the application Deployment is never applied; the sync stays blocked until the Job fails or times out. Make sure the configuration a PreSync hook consumes (Secrets, ConfigMaps, Namespaces) already exists before the sync starts, or declare it as a PreSync hook in a wave *lower* than the Job that consumes it.
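One way to make such a configuration object available to a PreSync Job is to declare the object itself as a PreSync hook in an earlier wave. The sketch below assumes hypothetical names (migration-settings, run-migration) and a placeholder busybox command.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: migration-settings        # hypothetical name
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-5"   # created before the Job that consumes it
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
data:
  MIGRATION_TARGET: "latest"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: run-migration             # hypothetical name
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "0"    # same phase, later wave than the ConfigMap
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation,HookSucceeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox:1.36     # placeholder image
          command: ["sh", "-c", "echo \"migrating to $MIGRATION_TARGET (placeholder)\""]
          envFrom:
            - configMapRef:
                name: migration-settings
```

Because the ConfigMap uses only BeforeHookCreation, it is recreated at the start of each sync and stays in the cluster afterwards.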
2. Omitting the Hook Deletion Policy
Recent ArgoCD versions fall back to BeforeHookCreation when no argocd.argoproj.io/hook-delete-policy is specified. On older versions, or when the policy lists only HookSucceeded or HookFailed, a finished Job and its Pods can remain in the cluster. On the next Git commit and sync, the leftover Job of the same name can cause the sync to fail with an error such as:
Error: jobs.batch "my-hook-job" already exists.
Include BeforeHookCreation explicitly to ensure clean, repeatable sync operations.
3. High Backoff Limits on Hook Jobs
By default, Kubernetes Jobs have a backoffLimit of 6. If your hook job is failing due to a configuration error, Kubernetes will retry it up to 6 times with exponential back-off, which can take 10–15 minutes. During that time the sync operation stays in a running state and no failure alert fires. For Hook Jobs, set backoffLimit: 1 or 2 to fail fast and trigger alert mechanisms quickly.
4. Mixing Server-Side Apply with Hooks
When using Server-Side Apply (SSA) in ArgoCD, ensure your hook definitions do not contain fields that conflict with other managers. Because hook resources are often ephemeral, conflicts can lead to partial applications or validation failures that block the sync controller.
13. Troubleshooting and Debugging Guide
When a deployment is stuck or failing, follow this systematic guide to isolate and resolve the issue.
Step 1: Identify the Stuck Wave or Hook via CLI
Use the argocd app get command to view the live synchronization status and identify which resource is blocking progression:
argocd app get product-suite --refresh
Look for the STATUS and HEALTH columns in the output. If you see a resource in a Progressing state with a wave annotation, that is your blocker. The excerpt below is simplified and illustrative; the exact columns depend on your ArgoCD version:
NAME                  KIND        STATUS  HEALTH       OWNER   RULES
prod-db-migration     Job         Synced  Healthy      <none>  (PreSync, Wave -2)
product-api           Deployment  Synced  Progressing  <none>  (Sync, Wave 0)
product-api-service   Service     Synced  Healthy      <none>  (Sync, Wave 0)
In this example, the database migration succeeded, but the product-api deployment is stuck in Progressing, blocking any PostSync hooks from starting.
Step 2: Inspect the Stuck Resource
Query Kubernetes directly to find out why the resource is not healthy. For a Deployment, check the replica status and event logs:
kubectl describe deployment product-api -n production
Common issues include:
- ImagePullBackOff: The container image tag is incorrect or the registry credentials are missing.
- Failing Liveness/Readiness Probes: The application is crashing or not starting within the allocated time.
- Insufficient CPU/Memory: The cluster does not have enough allocatable resources to schedule the new pods.
Step 3: Debugging Failing Hook Jobs
If a Hook Job failed and was deleted due to a HookFailed policy, you can temporarily remove that policy in Git to preserve the pod for debugging. Once preserved, retrieve the logs of the failed hook pod:
kubectl logs -l job-name=prod-db-migration -n production --tail=100
Step 4: Force-Terminating a Stuck Sync
If a sync is stuck and blocking other operations, you can terminate it safely using the ArgoCD CLI:
argocd app terminate-op product-suite
This command instructs the ArgoCD controller to stop waiting for resource health and abort the active synchronization, marking the sync operation as Failed.
14. Monitoring and Observability
To operate ArgoCD reliably at enterprise scale, you must monitor the performance and failure rates of your Sync Waves and Hooks. ArgoCD exports detailed Prometheus metrics natively.
Key Prometheus Metrics to Track
| Metric Name | Type | Description |
|---|---|---|
| argocd_app_sync_total | Counter | Total number of application synchronization operations. Partition by the phase label to separate succeeded from failed syncs. |
| argocd_app_reconcile | Histogram | The time taken to reconcile application state, in seconds. Useful for detecting slow sync loops. |
| argocd_app_k8s_request_duration_seconds | Histogram | The response time of the Kubernetes API server during sync operations. |
Sample Prometheus Alert Rules
Below is a Prometheus Alert