ArgoCD & GitOps Masterclass - Lesson 2
Understanding Declarative Infrastructure and Kubernetes
An enterprise-grade, deep-dive exploration of the declarative paradigm, the Kubernetes control plane, custom resource definitions (CRDs), server-side apply mechanics, and the architectural foundation of GitOps.
What is Declarative Infrastructure?
Declarative Infrastructure is a system management paradigm where you define the desired state of your system (the "what") in static configuration files, and a continuous reconciliation engine automatically executes the steps required to reach and maintain that state. This stands in direct contrast to imperative infrastructure, where you write scripts or execute sequential commands (the "how") to provision and configure resources.
In Kubernetes and GitOps, declarative infrastructure is the foundation. It allows tools like ArgoCD to continuously compare the desired state stored in Git with the live state running in your cluster and, when automated sync is enabled, to correct drift without human intervention.
What You Will Learn
- The architectural and operational differences between Imperative and Declarative systems.
- How the Kubernetes Control Plane acts as a highly distributed declarative state machine.
- The internal mechanics of the Kubernetes Reconciliation Loop (Observe, Analyze, Act).
- How Custom Resource Definitions (CRDs) extend the declarative API to arbitrary domain models.
- The deep mechanics of Server-Side Apply (SSA) and field ownership in Kubernetes.
- How GitOps reconcilers like ArgoCD scale the Kubernetes controller pattern to external Git repositories.
- Production-grade troubleshooting workflows for state drift, schema conflicts, and optimistic concurrency failures.
Prerequisites
Before proceeding with this lesson, you should have a solid understanding of:
- Basic Kubernetes concepts (Pods, Deployments, Services, Namespaces) as covered in Lesson 1: Introduction to GitOps.
- The command-line interface tool kubectl.
- Basic YAML syntax and structure.
Table of Contents
- 1. The Paradigm Shift: Imperative vs. Declarative Infrastructure
- 2. Inside the Kubernetes Declarative Control Plane
- 3. Deep Dive: The Reconciliation Loop (The Heart of GitOps)
- 4. Server-Side Apply (SSA) and Field Ownership
- 5. Custom Resource Definitions (CRDs): Extending the Declarative Model
- 6. The GitOps Connection: Scaling Reconciliation to Git
- 7. Enterprise Patterns & Best Practices
- 8. Troubleshooting & Operational Scenarios
- 9. Monitoring & Observability of Declarative State
- 10. Advanced Technical Interview Questions
- 11. Frequently Asked Questions (FAQs)
- 12. Summary & Next Steps
1. The Paradigm Shift: Imperative vs. Declarative Infrastructure
To understand why GitOps has become the industry standard for cloud-native continuous delivery, we must first analyze the fundamental shift from imperative scripting to declarative state engines.
The Imperative Paradigm: "Do This, Then Do That"
In an imperative world, operators write scripts or run sequential CLI commands to reach a target state. Think of standard Bash scripts, Ansible playbooks (when not carefully designed for idempotency), or direct AWS CLI invocations. You are telling the system how to build the infrastructure.
Consider this imperative shell script designed to scale a deployment and update its image:
#!/usr/bin/env bash
# Imperative script to update our application
set -euo pipefail
echo "Scaling deployment to 5 replicas..."
kubectl scale deployment/payment-service --replicas=5
echo "Updating application image..."
kubectl set image deployment/payment-service payment=payment-service:v2.1.0
echo "Verifying rollout status..."
kubectl rollout status deployment/payment-service
While this script appears simple, it poses severe operational challenges at enterprise scale:
- Lack of Idempotency: If the script fails halfway through (e.g., due to a network timeout during the image update), running it again may cause unexpected side effects or fail outright depending on the state of the cluster.
- State Drift Vulnerability: If an engineer manually scales the deployment down to 2 replicas via the CLI an hour later, the system has drifted. The script has no mechanism to continuously enforce the "5 replicas" rule; it only executed that command once.
- No Single Source of Truth: The actual desired state of the system is scattered across multiple scripts, Jenkins pipeline definitions, and the minds of the operations team.
The Declarative Paradigm: "This is My Desired State"
In a declarative world, you write a document (typically YAML or JSON) that fully describes what the final state of the infrastructure should look like. You hand this document to a controller, and the controller figures out how to make the live infrastructure match your document. You are describing what you want, not how to get there.
Here is the declarative equivalent of the above operation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment
          image: payment-service:v2.1.0
          ports:
            - containerPort: 8080
When you apply this manifest using a declarative engine, the engine performs the following analysis:
- It reads the manifest and queries the live system to see if payment-service exists.
- If it does not exist, it creates it with 5 replicas and the v2.1.0 image.
- If it does exist, it compares the current configuration with the desired configuration. If the live system currently has 3 replicas running v2.0.0, it calculates the minimum delta and updates the resource to match the new manifest (scaling up by 2 and performing a rolling update of the container image).
- Most importantly, if someone manually scales the deployment down to 2 replicas later, the engine detects this variance during its next reconciliation run and scales it back up to 5 automatically.
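The decision process described above can be sketched in a few lines of Go. This is a minimal illustration, not how any real controller is implemented: the DeploymentState type and the Plan function are hypothetical names, and a Deployment is reduced to three fields.

```go
package main

import "fmt"

// DeploymentState is a hypothetical, heavily simplified view of a Deployment.
type DeploymentState struct {
	Exists   bool
	Replicas int
	Image    string
}

// Plan compares desired and live state and returns the actions needed to
// converge. An empty result means the system already matches the manifest.
func Plan(desired, live DeploymentState) []string {
	if !live.Exists {
		return []string{fmt.Sprintf("create with %d replicas of %s", desired.Replicas, desired.Image)}
	}
	var actions []string
	if live.Replicas != desired.Replicas {
		actions = append(actions, fmt.Sprintf("scale %d -> %d", live.Replicas, desired.Replicas))
	}
	if live.Image != desired.Image {
		actions = append(actions, fmt.Sprintf("roll image %s -> %s", live.Image, desired.Image))
	}
	return actions
}

func main() {
	desired := DeploymentState{Exists: true, Replicas: 5, Image: "payment-service:v2.1.0"}
	live := DeploymentState{Exists: true, Replicas: 3, Image: "payment-service:v2.0.0"}
	for _, action := range Plan(desired, live) {
		fmt.Println(action)
	}
	// Planning against an already converged system yields no actions,
	// which is what makes repeated runs safe.
	fmt.Println(len(Plan(desired, desired)))
}
```

Because Plan is a pure function of the two states, running it repeatedly is harmless, which is the idempotency property the imperative script lacks.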
Detailed Comparison Matrix
| Feature | Imperative Paradigm | Declarative Paradigm |
|---|---|---|
| Primary Focus | The sequence of steps (How) | The target final state (What) |
| Idempotency | Must be manually coded into scripts with complex logic | Built-in by design at the engine level |
| Drift Correction | None; scripts run once and terminate | Continuous; loops constantly monitor and fix drift |
| Self-Healing | No; requires external monitoring and manual intervention | Yes; the reconciliation loop automatically heals the system |
| Auditability | Low; state is spread across historical execution logs | High; state is represented as version-controlled code |
2. Inside the Kubernetes Declarative Control Plane
Kubernetes is not just a container orchestrator; it is a highly optimized, distributed declarative state engine. To understand how GitOps controllers interact with it, we must analyze the internal components of the Kubernetes control plane and how they manage state.
The Architecture of Declarative State Tracking
The Kubernetes control plane consists of several key components that cooperate to process, store, and enforce declarative configurations. The diagram below illustrates how a declarative manifest moves through the control plane:
+----------------------------------------------------------------------------------+
|                             KUBERNETES CONTROL PLANE                             |
|                                                                                  |
|  +------------------+     Authentication      +-------------------------------+  |
|  | kubectl / GitOps | ======================> |        kube-apiserver         |  |
|  | (Manifest YAML)  |      & Validation       | (Declarative REST API Gateway)|  |
|  +------------------+                         +-------------------------------+  |
|                                                              ||                  |
|                                                              || Persist Desired  |
|                                                              || State            |
|                                                              \/                  |
|  +------------------+ Watch (via API Server)  +-------------------------------+  |
|  | kube-controller- | <=====================> |             etcd              |  |
|  |     manager      |                         | (Distributed Consensus Store) |  |
|  | (Reconciliation) |                         +-------------------------------+  |
|  +------------------+                                                            |
|           ||                                                                     |
|           || Issue Commands to Match Desired State                               |
|           \/                                                                     |
|  +----------------------------------------------------------------------------+  |
|  |                                 DATA PLANE                                 |  |
|  |                                                                            |  |
|  |        +------------------------+        +------------------------+        |  |
|  |        |         Node 1         |        |         Node 2         |        |  |
|  |        |  +------------------+  |        |  +------------------+  |        |  |
|  |        |  |     kubelet      |  |        |  |     kubelet      |  |        |  |
|  |        |  +------------------+  |        |  +------------------+  |        |  |
|  |        |  | Container Runtime|  |        |  | Container Runtime|  |        |  |
|  |        |  +------------------+  |        |  +------------------+  |        |  |
|  |        +------------------------+        +------------------------+        |  |
|  +----------------------------------------------------------------------------+  |
+----------------------------------------------------------------------------------+
Key Control Plane Components
1. The API Server (kube-apiserver)
The API Server is the front door to the Kubernetes control plane. It exposes a declarative RESTful API. When you submit a YAML file, the API Server does not immediately spin up containers. Instead, it performs the following sequence of operations:
- Authentication & Authorization: Verifies who you are (using OIDC, certificates, or tokens) and whether you have permission to perform the action (via RBAC).
- Mutating Admission Webhooks: Modifies the incoming request if necessary (e.g., injecting default values, adding sidecar containers, or inserting corporate-mandated labels).
- Schema Validation: Ensures the submitted YAML complies exactly with the OpenAPI schema defined for that resource type.
- Validating Admission Webhooks: Performs complex validation logic that cannot be expressed via schema alone (e.g., preventing a deployment from using a deprecated registry, or checking resource quota compliance).
- Persistence: Writes the validated, normalized resource definition to the backing store.
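The ordering of these steps matters: mutation runs before validation, and nothing is persisted unless every earlier stage accepts the request. The sketch below models that ordering in plain Go. It is an illustration only; the Request type, the Process function, the injected label, and the registry name are all hypothetical and do not correspond to real kube-apiserver code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Request is a hypothetical, heavily simplified API object.
type Request struct {
	User     string
	Image    string
	Replicas int
	Labels   map[string]string
}

// stage is one step of the request pipeline.
type stage struct {
	name string
	run  func(*Request) error
}

// Process runs a request through the stages in the order listed above and
// persists it to store only if every stage accepts it.
func Process(req Request, store map[string]Request, key string) error {
	stages := []stage{
		{"authentication/authorization", func(r *Request) error {
			if r.User == "" {
				return errors.New("no credentials presented")
			}
			return nil
		}},
		{"mutating admission", func(r *Request) error {
			if r.Labels == nil {
				r.Labels = map[string]string{}
			}
			r.Labels["cost-center"] = "platform" // hypothetical mandated label
			return nil
		}},
		{"schema validation", func(r *Request) error {
			if r.Replicas < 0 {
				return errors.New("replicas must be non-negative")
			}
			return nil
		}},
		{"validating admission", func(r *Request) error {
			if strings.HasPrefix(r.Image, "legacy.example.com/") { // hypothetical policy
				return errors.New("image comes from a deprecated registry")
			}
			return nil
		}},
		{"persistence", func(r *Request) error {
			store[key] = *r
			return nil
		}},
	}
	for _, s := range stages {
		if err := s.run(&req); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	store := map[string]Request{}
	good := Request{User: "alice", Image: "registry.example.com/payment:v2.1.0", Replicas: 5}
	fmt.Println(Process(good, store, "production/payment-service"))
	bad := Request{User: "alice", Image: "legacy.example.com/payment:v1", Replicas: 5}
	fmt.Println(Process(bad, store, "production/legacy"))
	fmt.Println(len(store))
}
```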
2. The Distributed Consensus Store (etcd)
etcd is a strongly consistent, distributed key-value store that implements the Raft consensus algorithm. It serves as the single source of truth for the entire cluster's live and desired state. In a declarative system, state durability and consistency are paramount. If etcd reports that 3 replicas exist, the rest of the system operates under the assumption that this is the absolute truth. The API Server is the only component allowed to talk directly to etcd.
3. The Controller Manager (kube-controller-manager)
The Controller Manager is a daemon that embeds the core control loops shipped with Kubernetes. A controller is a non-terminating loop that regulates the state of the system. Examples include the Deployment Controller, Namespace Controller, and StatefulSet Controller. These controllers watch the state of the cluster through the API Server's watch APIs and make changes attempting to move the current state towards the desired state.
3. Deep Dive: The Reconciliation Loop (The Heart of GitOps)
The core mechanism of declarative infrastructure is the Reconciliation Loop. This loop is a continuous, self-correcting cycle that can be mathematically expressed as:
f(Desired State, Actual State) -> Action to minimize difference
The Three Phases of Reconciliation
Every controller in Kubernetes, as well as ArgoCD itself, executes a loop consisting of three distinct phases:
+--------------------------------------------+
|                                            |
|                  OBSERVE                   |
|       Query API Server & Live System       |
|                                            |
+---------------------+----------------------+
                      |
                      | State Data
                      \/
+--------------------------------------------+
|                                            |
|                  ANALYZE                   |
|     Calculate Delta: Desired vs Actual     |
|                                            |
+---------------------+----------------------+
                      |
                      | Calculated Delta
                      \/
+--------------------------------------------+
|                                            |
|                    ACT                     |
|      Execute Changes to Align States       |
|                                            |
+---------------------+----------------------+
                      |
                      +----------------------> Loop Continues (Infinite)
- Observe: The controller queries the current state of the resource it is managing. It does this by listening to the Kubernetes API Server's streaming HTTP watch API, which provides real-time updates on resource modifications, creations, and deletions.
- Analyze: The controller compares the desired state (specified in the resource's spec block) with the actual state (reported in the resource's status block or gathered directly from the infrastructure, such as running containers or cloud provider APIs).
- Act: If a discrepancy (drift) is found, the controller executes API calls to bring the actual state in line with the desired state. This might involve creating a pod, deleting an orphaned service, or calling an external cloud API to provision a load balancer.
Reconciliation in Go: A Conceptual Implementation
To demystify how this works under the hood, let's look at a simplified, production-style Go code block representing how a custom controller reconciles a resource using the popular controller-runtime library:
package controllers

import (
	"context"
	"time"

	"github.com/go-logr/logr"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	appv1 "github.com/enterprise/gitops-operator/api/v1"
)

// driftCheckInterval is how often a healthy resource is re-checked for
// external drift. The value is illustrative.
const driftCheckInterval = 5 * time.Minute

// DatabaseReconciler reconciles a DatabaseInstance object
type DatabaseReconciler struct {
	client.Client
	Log    logr.Logger
	Scheme *runtime.Scheme
}

// Reconcile is the core reconciliation loop function
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("databaseinstance", req.NamespacedName)

	// 1. OBSERVE: Fetch the desired state from the API Server
	var dbInstance appv1.DatabaseInstance
	if err := r.Get(ctx, req.NamespacedName, &dbInstance); err != nil {
		if apierrors.IsNotFound(err) {
			// Resource was deleted. Its spec is no longer readable here, so
			// cleanup of external resources is normally handled with a finalizer.
			log.Info("DatabaseInstance resource deleted; nothing to reconcile")
			return ctrl.Result{}, nil
		}
		log.Error(err, "Unable to fetch DatabaseInstance")
		return ctrl.Result{}, err
	}
	log.Info("Observed Desired State", "Engine", dbInstance.Spec.Engine, "StorageGB", dbInstance.Spec.StorageGB)

	// 2. OBSERVE & ANALYZE: Query actual state of the physical database
	actualDBExists, actualStorage, err := r.checkExternalDatabaseStatus(&dbInstance)
	if err != nil {
		log.Error(err, "Failed to inspect physical database state")
		return ctrl.Result{}, err
	}

	// 3. ACT: Reconcile discrepancies
	if !actualDBExists {
		log.Info("Database does not exist. Creating physical database instance...")
		if err := r.createExternalDatabase(&dbInstance); err != nil {
			log.Error(err, "Failed to provision database")
			return ctrl.Result{}, err
		}
		// Requeue so the next pass observes the newly created database
		return ctrl.Result{Requeue: true}, nil
	}

	if actualStorage < dbInstance.Spec.StorageGB {
		log.Info("Detected storage drift. Scaling physical database storage...", "Current", actualStorage, "Desired", dbInstance.Spec.StorageGB)
		if err := r.scaleDatabaseStorage(&dbInstance, dbInstance.Spec.StorageGB); err != nil {
			log.Error(err, "Failed to scale database storage")
			return ctrl.Result{}, err
		}
	}

	// Update Status to reflect the actual state
	dbInstance.Status.Phase = "Ready"
	dbInstance.Status.ActiveStorageGB = dbInstance.Spec.StorageGB
	if err := r.Status().Update(ctx, &dbInstance); err != nil {
		log.Error(err, "Failed to update DatabaseInstance status")
		return ctrl.Result{}, err
	}

	// Requeue periodically to check for external drift
	return ctrl.Result{RequeueAfter: driftCheckInterval}, nil
}
Optimistic Concurrency Control (OCC)
In a highly concurrent, distributed system like Kubernetes, multiple controllers or users might try to update the same resource simultaneously. To prevent overwriting updates, Kubernetes uses Optimistic Concurrency Control (OCC).
Every Kubernetes resource contains a metadata field called resourceVersion. This is an opaque string managed by etcd. When you read a resource, the API Server returns its current resourceVersion. When you attempt to write an update back to the API Server, your request must include this resourceVersion. If another actor modified the resource in the millisecond between your read and write, the resourceVersion in etcd will have changed, and the API Server will reject your update with a 409 Conflict error.
The controller is then expected to fetch the latest version of the resource, re-apply its logic, and try the write operation again.
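The read, compare, and retry cycle can be modeled without a cluster. The following sketch uses a hypothetical in-memory Store whose Update rejects stale writes; it simplifies resourceVersion to an integer counter, whereas in Kubernetes it is an opaque string that clients must not interpret.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the API Server's 409 Conflict response.
var ErrConflict = errors.New("409 Conflict: the object has been modified")

// Object is a hypothetical resource. ResourceVersion is an int here only
// to keep the sketch short.
type Object struct {
	ResourceVersion int
	Replicas        int
}

// Store is a hypothetical in-memory stand-in for the API Server and etcd.
type Store struct{ obj Object }

// Get returns a copy of the current object, including its version.
func (s *Store) Get() Object { return s.obj }

// Update accepts the write only if it was based on the current version.
func (s *Store) Update(o Object) error {
	if o.ResourceVersion != s.obj.ResourceVersion {
		return ErrConflict
	}
	o.ResourceVersion++
	s.obj = o
	return nil
}

// UpdateWithRetry re-reads the object and re-applies mutate until the
// write succeeds or the attempts are exhausted.
func UpdateWithRetry(s *Store, attempts int, mutate func(*Object)) error {
	for i := 0; i < attempts; i++ {
		o := s.Get()
		mutate(&o)
		err := s.Update(o)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return ErrConflict
}

func main() {
	s := &Store{obj: Object{ResourceVersion: 1, Replicas: 3}}

	stale := s.Get() // client A reads version 1
	_ = UpdateWithRetry(s, 3, func(o *Object) { o.Replicas = 5 }) // client B writes first

	stale.Replicas = 4
	fmt.Println(s.Update(stale)) // client A's write is rejected as stale

	// Client A recovers by re-reading and re-applying its change.
	fmt.Println(UpdateWithRetry(s, 3, func(o *Object) { o.Replicas = 4 }))
	fmt.Println(s.Get().Replicas)
}
```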
4. Server-Side Apply (SSA) and Field Ownership
Historically, the client-side tool kubectl was responsible for calculating the differences between your local YAML file and the live cluster state. It did this using a complex patching mechanism called Strategic Merge Patch, saving the last applied configuration in a massive annotation: kubectl.kubernetes.io/last-applied-configuration.
This approach had major drawbacks, especially for declarative CD engines like ArgoCD. If multiple systems (e.g., an automated horizontal pod autoscaler and a GitOps delivery engine) modified different fields of the same resource, they would constantly overwrite each other's changes. To solve this, Kubernetes introduced Server-Side Apply (SSA).
How Server-Side Apply Works
With Server-Side Apply, the logic of merging and patching resources moves from the client (e.g., your laptop or ArgoCD) to the kube-apiserver. A client opts in by sending a PATCH request with the Content-Type header set to application/apply-patch+yaml and a fieldManager query parameter that names the client, or by running kubectl apply --server-side. The API Server then tracks exactly which client (known as a Field Manager) owns which fields of the resource.
This tracking is stored directly in the resource's metadata under the managedFields block. Let's look at an example of how this metadata looks on a live Pod:
apiVersion: v1
kind: Pod
metadata:
  name: payment-processor
  namespace: production
  managedFields:
    - manager: argo-cd
      operation: Apply
      apiVersion: v1
      time: "2023-10-27T14:32:00Z"
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:containers:
            k:{"name":"processor"}:
              .: {}
              f:image: {}
              f:ports:
                k:{"containerPort":8080}:
                  .: {}
                  f:containerPort: {}
    - manager: kubelet
      operation: Update
      subresource: status
      apiVersion: v1
      time: "2023-10-27T14:35:00Z"
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:phase: {}
          f:podIP: {}
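To make the wire format of a server-side apply concrete, the sketch below builds (but does not send) such a request with Go's standard net/http package. The NewApplyRequest helper and the API server address are hypothetical, and authentication is omitted; the PATCH verb, the application/apply-patch+yaml content type, and the fieldManager and force query parameters are the parts defined by the Kubernetes API.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// NewApplyRequest builds a server-side apply request for a Deployment.
// It only constructs the request; sending it would additionally require
// credentials and a trusted TLS configuration.
func NewApplyRequest(apiServer, namespace, name, fieldManager string, force bool, manifest string) (*http.Request, error) {
	query := url.Values{}
	query.Set("fieldManager", fieldManager) // identifies the owner of the applied fields
	if force {
		query.Set("force", "true") // take ownership of conflicting fields
	}
	target := fmt.Sprintf("%s/apis/apps/v1/namespaces/%s/deployments/%s?%s",
		apiServer, namespace, name, query.Encode())

	req, err := http.NewRequest(http.MethodPatch, target, strings.NewReader(manifest))
	if err != nil {
		return nil, err
	}
	// This content type is what marks the PATCH as a server-side apply.
	req.Header.Set("Content-Type", "application/apply-patch+yaml")
	return req, nil
}

func main() {
	manifest := "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: payment-service\n"
	req, err := NewApplyRequest("https://kube.example.invalid", "production", "payment-service", "argo-cd", false, manifest)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(req.Header.Get("Content-Type"))
}
```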
Field Conflicts and Resolution
Because Kubernetes knows who owns what, it can intelligently prevent systems from stepping on each other's toes. Let's trace a common enterprise conflict scenario:
+-----------------------+                +-----------------------+
|    GitOps (ArgoCD)    |                |   HPA (Autoscaler)    |
|  Manager: "argo-cd"   |                |  Manager: "kube-hpa"  |
+-----------+-----------+                +-----------+-----------+
            |                                        |
            | Sets spec.replicas = 3                 | Sets spec.replicas = 10
            |                                        |
            \/                                       \/
+----------------------------------------------------------------+
|                         kube-apiserver                         |
|                                                                |
|  1. ArgoCD applies replicas=3.                                 |
|     - Field "spec.replicas" owner set to "argo-cd".            |
|                                                                |
|  2. HPA attempts to scale replicas to 10.                      |
|     - Replicas field is owned by "argo-cd".                    |
|     - Conflict occurs!                                         |
|                                                                |
|  3. Resolution:                                                |
|     - HPA forces ownership of "spec.replicas" field.           |
|     - Owner becomes "kube-hpa".                                |
|     - ArgoCD sees the change on its next sync.                 |
+----------------------------------------------------------------+
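If Manager B applies a manifest that changes a field owned by Manager A, the API Server rejects the request with a 409 Conflict error unless Manager B explicitly sets the force flag. If the apply is forced, Manager B takes ownership of the field and Manager A loses it. The API Server does not actively notify Manager A; Manager A discovers the change during its next reconciliation sync, when it either observes the new value or hits a conflict of its own while re-applying the field. Conflicts are raised only for apply requests: a plain update changes the value and takes ownership without a conflict check.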
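The ownership rule for apply requests can be sketched as a small in-memory model. This is a simplification under stated assumptions: the Object type and its Apply method are hypothetical, field paths are flat strings, and a field has a single owner, whereas the real API Server allows managers that apply identical values to share ownership.

```go
package main

import "fmt"

// Object is a hypothetical resource with per-field ownership tracking.
type Object struct {
	Values map[string]string // field path -> value
	Owners map[string]string // field path -> field manager
}

// Apply sets the given fields on behalf of manager. It fails if any field
// is owned by a different manager with a different value, unless force is set.
func (o *Object) Apply(manager string, fields map[string]string, force bool) error {
	// First pass: detect conflicts before changing anything.
	for path, value := range fields {
		owner, owned := o.Owners[path]
		if owned && owner != manager && o.Values[path] != value && !force {
			return fmt.Errorf("conflict: %q is owned by %q", path, owner)
		}
	}
	// Second pass: write the values and record the new owner.
	for path, value := range fields {
		o.Values[path] = value
		o.Owners[path] = manager
	}
	return nil
}

func main() {
	obj := &Object{Values: map[string]string{}, Owners: map[string]string{}}

	fmt.Println(obj.Apply("argo-cd", map[string]string{"spec.replicas": "3"}, false))
	fmt.Println(obj.Apply("kube-hpa", map[string]string{"spec.replicas": "10"}, false)) // rejected
	fmt.Println(obj.Apply("kube-hpa", map[string]string{"spec.replicas": "10"}, true))  // forced
	fmt.Println(obj.Owners["spec.replicas"], obj.Values["spec.replicas"])
}
```

After the forced apply, a further unforced apply of replicas=3 by "argo-cd" would itself be rejected, which is why a GitOps engine is usually configured to leave such fields alone.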
For GitOps engines like ArgoCD, this is a game-changer. It allows us to configure ArgoCD to ignore fields that are dynamically managed by in-cluster controllers (like spec.replicas managed by an HPA) while still maintaining strict declarative control over other fields (like container images, env vars, and security contexts).
5. Custom Resource Definitions (CRDs): Extending the Declarative Model
One of the primary reasons Kubernetes became the foundation for modern cloud platforms is its extensibility. You are not limited to using built-in resources like Pods and Services. You can define your own domain-specific resources using Custom Resource Definitions (CRDs).
When you register a CRD, you are teaching the Kubernetes API Server how to parse, validate, and store a brand-new declarative API object. Once registered, users can interact with your custom resource using standard tools like kubectl and GitOps controllers like ArgoCD.
Anatomy of a Production-Grade CRD
Let's examine a complete, production-grade CRD that defines a declarative database instance. This CRD includes OpenAPI v3 validation schemas, custom printer columns for kubectl get, and subresources for status tracking.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databaseinstances.database.enterprise.io
spec:
  group: database.enterprise.io
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required:
                - engine
                - version
                - storageGB
              properties:
                engine:
                  type: string
                  enum:
                    - postgresql
                    - mysql
                    - redis
                version:
                  type: string
                  pattern: '^[0-9]+(\.[0-9]+)*$'
                storageGB:
                  type: integer
                  minimum: 10
                  maximum: 5000
                backup:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                    retentionDays:
                      type: integer
                      minimum: 1
            status:
              type: object
              properties:
                phase:
                  type: string
                activeStorageGB:
                  type: integer
                connectionEndpoint:
                  type: string
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Engine
          type: string
          jsonPath: .spec.engine
        - name: Version
          type: string
          jsonPath: .spec.version
        - name: Status
          type: string
          jsonPath: .status.phase
  scope: Namespaced
  names:
    plural: databaseinstances
    singular: databaseinstance
    kind: DatabaseInstance
    shortNames:
      - dbi
Deploying a Custom Resource (CR) Instance
Once the CRD is applied to the cluster, the API Server exposes a new REST endpoint: /apis/database.enterprise.io/v1alpha1/namespaces/{namespace}/databaseinstances. We can now submit declarative manifests of our custom type:
apiVersion: database.enterprise.io/v1alpha1
kind: DatabaseInstance
metadata:
  name: billing-db
  namespace: production
spec:
  engine: postgresql
  version: "15.4"
  storageGB: 200
  backup:
    enabled: true
    retentionDays: 30
When this manifest is applied, the API Server validates it against the OpenAPI schema defined in the CRD. If a user tries to set storageGB: 5 (which is below the minimum of 10) or engine: oracle (which is not in the allowed enum), the API Server will reject the request with a validation error before it ever reaches the database controller.
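The checks the API Server derives from this schema are simple to state in code. The sketch below hand-writes the enum, pattern, and range rules from the CRD above as a Go function; the DatabaseInstanceSpec struct and Validate function are hypothetical names, and the real API Server evaluates the OpenAPI schema generically rather than through code like this.

```go
package main

import (
	"fmt"
	"regexp"
)

// DatabaseInstanceSpec mirrors the required fields of the CRD's spec schema.
type DatabaseInstanceSpec struct {
	Engine    string
	Version   string
	StorageGB int
}

// versionPattern is the same regular expression used in the CRD schema.
var versionPattern = regexp.MustCompile(`^[0-9]+(\.[0-9]+)*$`)

// Validate returns one message per violated schema rule.
func Validate(s DatabaseInstanceSpec) []string {
	var problems []string
	switch s.Engine {
	case "postgresql", "mysql", "redis":
		// allowed by the enum
	default:
		problems = append(problems, fmt.Sprintf("spec.engine: unsupported value %q", s.Engine))
	}
	if !versionPattern.MatchString(s.Version) {
		problems = append(problems, fmt.Sprintf("spec.version: %q does not match the version pattern", s.Version))
	}
	if s.StorageGB < 10 || s.StorageGB > 5000 {
		problems = append(problems, fmt.Sprintf("spec.storageGB: %d is outside the range 10 to 5000", s.StorageGB))
	}
	return problems
}

func main() {
	valid := DatabaseInstanceSpec{Engine: "postgresql", Version: "15.4", StorageGB: 200}
	fmt.Println(len(Validate(valid)))

	invalid := DatabaseInstanceSpec{Engine: "oracle", Version: "15.4", StorageGB: 5}
	for _, problem := range Validate(invalid) {
		fmt.Println(problem)
	}
}
```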
6. The GitOps Connection: Scaling Reconciliation to Git
Now that we have explored how Kubernetes handles declarative state internally, we can understand the core architectural premise of GitOps: extending the reconciliation loop outside the Kubernetes cluster to a Git repository.
The GitOps Paradigm
In standard Kubernetes operations, the desired state is applied manually or via CI scripts using kubectl apply. In the GitOps paradigm, we introduce a Git repository as the Single Source of Truth (SSOT) for our desired state, and we place a specialized controller (like ArgoCD) inside the cluster.
+--------------------------------------------------------------------------------+
|                                THE GITOPS CYCLE                                |
|                                                                                |
|  +-------------------+   Poll / Webhook    +--------------------------------+  |
|  |  Git Repository   | ==================> |       ArgoCD Controller        |  |
|  |  (Desired State)  |                     | (Continuous Git Reconciliation)|  |
|  +-------------------+                     +--------------------------------+  |
|            ^                                               ||                  |
|            |                                               || Compares Git     |
|            | Pull Request Audit                            || vs Live Cluster  |
|            |                                               \/                  |
|  +-------------------+                     +--------------------------------+  |
|  | Developer / Infra |                     |       Kubernetes Cluster       |  |
|  |     Engineer      |                     |          (Live State)          |  |
|  +-------------------+                     +--------------------------------+  |
+--------------------------------------------------------------------------------+
ArgoCD runs its own reconciliation loop that wraps around the Kubernetes API Server:
- Observe Git: ArgoCD polls or receives webhooks from your Git repository (GitHub, GitLab, Bitbucket) and parses the declarative manifests (raw YAML, Helm charts, or Kustomize targets). This is the Target Desired State.
- Observe Cluster: ArgoCD queries the Kubernetes API Server to fetch the current live configuration of all managed resources. This is the Actual Live State.
- Analyze: ArgoCD calculates the diff between Git and the Cluster. If the states match, the application is marked as Synced. If they differ, the application is marked as OutOfSync.
- Act: Depending on its configuration, ArgoCD will either alert operators of the drift (Manual Sync mode) or invoke the Kubernetes API Server to apply the changes from Git, forcing the cluster back into alignment (Auto-Sync mode).
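The four steps above reduce to a comparison followed by an optional write. The sketch below models that with plain maps; it is not ArgoCD's implementation. The Manifests type and the SyncStatus and Reconcile functions are hypothetical, manifests are reduced to opaque strings, and pruning of resources that exist only in the cluster is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Manifests maps a resource key to its rendered manifest, simplified to a string.
type Manifests map[string]string

// SyncStatus compares Git against the cluster and lists drifted resources.
func SyncStatus(git, live Manifests) (string, []string) {
	var drifted []string
	for key, want := range git {
		if got, ok := live[key]; !ok || got != want {
			drifted = append(drifted, key)
		}
	}
	sort.Strings(drifted) // stable output regardless of map iteration order
	if len(drifted) == 0 {
		return "Synced", nil
	}
	return "OutOfSync", drifted
}

// Reconcile reports drift and, when autoSync is true, overwrites the live
// state with the state from Git before reporting the final status.
func Reconcile(git, live Manifests, autoSync bool) string {
	status, drifted := SyncStatus(git, live)
	if status == "OutOfSync" && autoSync {
		for _, key := range drifted {
			live[key] = git[key]
		}
		status, _ = SyncStatus(git, live)
	}
	return status
}

func main() {
	git := Manifests{"Deployment/payment-service": "replicas: 5"}
	live := Manifests{"Deployment/payment-service": "replicas: 2"}

	fmt.Println(Reconcile(git, live, false)) // manual mode: drift is only reported
	fmt.Println(Reconcile(git, live, true))  // auto-sync: drift is corrected
	fmt.Println(live["Deployment/payment-service"])
}
```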
Why GitOps Requires Declarative Configurations
It is structurally impossible to implement GitOps with imperative configurations. If your Git repository contained a series of shell scripts, a GitOps controller would have no safe way to calculate a diff, determine if the system has drifted, or automatically resolve discrepancies. Declarative manifests are mathematically diffable, versionable, and highly auditable, making them the only format suitable for GitOps pipelines.
7. Enterprise Patterns & Best Practices
Operating declarative systems at enterprise scale requires strict adherence to architectural patterns that ensure security, reliability, and maintainability.
1. Decoupling Configuration and Code
Never store your declarative application manifests in the same Git repository as your application source code. Keep them in separate repositories. This separation provides several critical benefits:
- Security (Least Privilege): Developers may have permission to commit code and trigger CI pipelines, but only platform engineers or automated release managers should have write access to production environment configuration repositories.
- Build Performance: Changing a resource limit or replica count in a manifest should not trigger a 20-minute container image build and test suite run. It should merely trigger a GitOps sync.
- Clean Audit Trails: The Git history of your deployment repository represents a clean timeline of environment state changes, unpolluted by code commits, branch merges, and test runs.
2. Embracing Immutable Infrastructure
In a declarative world, you should treat your infrastructure and application instances as immutable. Never modify running containers or cluster resources directly via kubectl edit or kubectl exec. If a change is required, modify the declarative manifest in Git, commit it, merge it, and let the GitOps controller roll out the update. This guarantees that your cluster can be completely recreated from scratch using only the contents of your Git repositories in the event of a disaster.
3. Structuring Multi-Environment Configurations
Enterprise platforms must support multiple environments (development, staging, production) without duplicating massive amounts of YAML code. To achieve this, use configuration management tools like Kustomize or Helm within your declarative pipeline.
Here is a recommended enterprise directory structure using Kustomize:
infrastructure-gitops/
└── apps/
    └── payment-service/
        ├── base/
        │   ├── deployment.yaml
        │   ├── service.yaml
        │   └── kustomization.yaml
        └── environments/
            ├── development/
            │   ├── replica-patch.yaml
            │   └── kustomization.yaml
            └── production/
                ├── replica-patch.yaml
                ├── resources-patch.yaml
                └── kustomization.yaml
In this structure, the base directory contains the core declarative manifests that are common across all environments. The environments/production directory contains only the specific patches (e.g., scaling up replicas, setting higher CPU/memory limits) and overlays unique to production. ArgoCD is configured to point to the environment-specific directories, dynamically rendering the final declarative manifests before applying them to the target clusters.
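The idea of layering a small patch over a shared base can be illustrated with a recursive map merge. This is only a sketch of the concept: the Merge function is hypothetical and much simpler than Kustomize's actual patching, which also understands lists, strategic merge keys, and JSON patches.

```go
package main

import "fmt"

// Merge returns a new map in which values from patch override values from
// base. Nested maps are merged recursively; everything else is replaced.
func Merge(base, patch map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for key, value := range base {
		out[key] = value
	}
	for key, patchValue := range patch {
		baseMap, baseIsMap := out[key].(map[string]any)
		patchMap, patchIsMap := patchValue.(map[string]any)
		if baseIsMap && patchIsMap {
			out[key] = Merge(baseMap, patchMap)
		} else {
			out[key] = patchValue
		}
	}
	return out
}

func main() {
	// Shared base manifest (reduced to the fields relevant here).
	base := map[string]any{
		"kind": "Deployment",
		"spec": map[string]any{"replicas": 1, "image": "payment-service:v2.1.0"},
	}
	// Production overlay: only the fields that differ from the base.
	productionPatch := map[string]any{
		"spec": map[string]any{"replicas": 5},
	}

	rendered := Merge(base, productionPatch)
	fmt.Println(rendered["spec"])
	fmt.Println(base["spec"]) // the base is left unchanged
}
```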
8. Troubleshooting & Operational Scenarios
Even in robust declarative systems, failures can occur due to misconfigurations, schema violations, or complex controller interactions. Below are common real-world failure scenarios and their step-by-step resolution playbooks.
Scenario A: The Infinite Reconciliation Loop (Flapping State)
Symptom: ArgoCD shows that an application is constantly toggling between Synced and OutOfSync. Looking at the diff, a specific field (e.g., a replica count or an annotation) keeps changing back and forth every few seconds.
Root Cause: This occurs when there is a conflict between your declarative manifest in Git and an in-cluster dynamic controller (like a Horizontal Pod Autoscaler or an admission webhook). Git says the replica count should be 3, so ArgoCD applies 3. A second later, the HPA controller decides the cluster needs 10 replicas due to high load, so it updates the replica count to 10. ArgoCD detects this drift from Git, updates it back to 3, and the cycle repeats infinitely.
Resolution Playbook:
- Identify the conflicting field by examining the live diff in the ArgoCD UI or running:
    kubectl get deployment payment-service -o yaml --show-managed-fields
  Check the metadata.managedFields block to see which controllers are editing the field. Recent kubectl versions hide this block unless --show-managed-fields is passed.
- Configure ArgoCD to ignore the specific field being mutated by the in-cluster controller. In your ArgoCD Application manifest, add an ignoreDifferences block:
    spec:
      ignoreDifferences:
        - group: apps
          kind: Deployment
          name: payment-service
          jsonPointers:
            - /spec/replicas
- Apply the updated Application manifest. ArgoCD will no longer report that field as drift. Because ignoreDifferences affects only the diff calculation, also either remove spec.replicas from the manifest in Git or enable the RespectIgnoreDifferences=true sync option, so that a sync does not reset the value set by the in-cluster controller.
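Conceptually, ignoring a field means removing it from both sides before comparing them. The sketch below shows that idea with generic maps. It is an assumption-laden illustration rather than ArgoCD's code: InSync, removePointer, and clone are hypothetical helpers, and the pointer handling supports only object keys (no array indices or escaped characters).

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

// clone deep-copies a document via a JSON round trip so the inputs stay untouched.
func clone(doc map[string]any) map[string]any {
	raw, _ := json.Marshal(doc)
	var out map[string]any
	_ = json.Unmarshal(raw, &out)
	return out
}

// removePointer deletes the field addressed by a simple JSON pointer such
// as /spec/replicas. Missing intermediate objects are ignored.
func removePointer(doc map[string]any, pointer string) {
	parts := strings.Split(strings.TrimPrefix(pointer, "/"), "/")
	current := doc
	for _, part := range parts[:len(parts)-1] {
		next, ok := current[part].(map[string]any)
		if !ok {
			return
		}
		current = next
	}
	delete(current, parts[len(parts)-1])
}

// InSync reports whether git and live are equal once ignored fields are removed.
func InSync(git, live map[string]any, ignore []string) bool {
	g, l := clone(git), clone(live)
	for _, pointer := range ignore {
		removePointer(g, pointer)
		removePointer(l, pointer)
	}
	return reflect.DeepEqual(g, l)
}

func main() {
	git := map[string]any{"spec": map[string]any{"replicas": 3, "image": "payment-service:v2.1.0"}}
	live := map[string]any{"spec": map[string]any{"replicas": 10, "image": "payment-service:v2.1.0"}}

	fmt.Println(InSync(git, live, nil))                        // HPA change counts as drift
	fmt.Println(InSync(git, live, []string{"/spec/replicas"})) // replicas ignored
}
```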
Scenario B: CRD Schema Validation Failures
Symptom: When trying to apply a custom resource manifest, the API Server returns an error similar to:
error: ValidationError(DatabaseInstance.spec): unknown field "backupInterval" in io.enterprise.database.v1alpha1.DatabaseInstance.spec
Root Cause: The custom resource manifest contains a field that is not defined or is defined incorrectly in the CustomResourceDefinition's OpenAPI v3 validation schema.
Resolution Playbook:
- Inspect the registered schema in your cluster using kubectl explain:
    kubectl explain databaseinstances.spec
  This will show you all valid fields and their expected data types.
- If the field is missing from the explanation but should be there, you must update the CRD definition to include the field under spec.versions[].schema.openAPIV3Schema.
- If the field is simply misspelled in your custom resource manifest, correct the spelling in your Git repository and push the change to trigger a clean reconciliation.
Scenario C: Optimistic Concurrency Conflict (409 Conflict)
Symptom: Your automation scripts or custom controllers are logging errors like:
Operation cannot be fulfilled on databaseinstances.database.enterprise.io "billing-db": the object has been modified; please apply your changes to the latest version and try again
Root Cause: The controller attempted to write an update to the resource using an outdated resourceVersion. Another client modified the resource in the background during the controller's reconciliation execution.
Resolution Playbook:
- If you are writing custom controllers, ensure your code implements a retry-on-conflict mechanism. The client-go library provides a helper function for this:
    import "k8s.io/client-go/util/retry"

    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // 1. Fetch the latest version of the resource
        err := r.Get(ctx, req.NamespacedName, &dbInstance)
        if err != nil {
            return err
        }
        // 2. Make your modifications
        dbInstance.Spec.StorageGB = 300
        // 3. Attempt to update
        return r.Update(ctx, &dbInstance)
    })
- If you are using
kubectl