Published: 2026-06-01 • Updated: 2026-07-05

ArgoCD & GitOps Masterclass: Lesson 2

Understanding Declarative Infrastructure and Kubernetes

An enterprise-grade, deep-dive exploration of the declarative paradigm, the Kubernetes control plane, custom resource definitions (CRDs), server-side apply mechanics, and the architectural foundation of GitOps.


What You Will Learn

  • The architectural and operational differences between Imperative and Declarative systems.
  • How the Kubernetes Control Plane acts as a highly distributed declarative state machine.
  • The internal mechanics of the Kubernetes Reconciliation Loop (Observe, Analyze, Act).
  • How Custom Resource Definitions (CRDs) extend the declarative API to arbitrary domain models.
  • The deep mechanics of Server-Side Apply (SSA) and field ownership in Kubernetes.
  • How GitOps reconcilers like ArgoCD scale the Kubernetes controller pattern to external Git repositories.
  • Production-grade troubleshooting workflows for state drift, schema conflicts, and optimistic concurrency failures.

Prerequisites

Before proceeding with this lesson, you should have a solid understanding of:

  • Basic Kubernetes concepts (Pods, Deployments, Services, Namespaces) as covered in Lesson 1: Introduction to GitOps.
  • The command-line interface tool kubectl.
  • Basic YAML syntax and structure.


1. The Paradigm Shift: Imperative vs. Declarative Infrastructure

To understand why GitOps has become the industry standard for cloud-native continuous delivery, we must first analyze the fundamental shift from imperative scripting to declarative state engines.

The Imperative Paradigm: "Do This, Then Do That"

In an imperative world, operators write scripts or run sequential CLI commands to reach a target state. Think of standard Bash scripts, Ansible playbooks (when not carefully designed for idempotency), or direct AWS CLI invocations. You are telling the system how to build the infrastructure.

Consider this imperative shell script designed to scale a deployment and update its image:

#!/usr/bin/env bash
# Imperative script to update our application
set -euo pipefail

echo "Scaling deployment to 5 replicas..."
kubectl scale deployment/payment-service --replicas=5

echo "Updating application image..."
kubectl set image deployment/payment-service payment=payment-service:v2.1.0

echo "Verifying rollout status..."
kubectl rollout status deployment/payment-service

While this script appears simple, it poses severe operational challenges at enterprise scale:

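  • No Guaranteed Idempotency: If the script fails halfway through (e.g., due to a network timeout during the image update), nothing records which steps completed. A rerun is only safe when every command happens to be repeatable. The two kubectl commands above are, but a step such as kubectl create fails outright with an AlreadyExists error on the second run.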
  • State Drift Vulnerability: If an engineer manually scales the deployment down to 2 replicas via the CLI an hour later, the system has drifted. The script has no mechanism to continuously enforce the "5 replicas" rule; it only executed that command once.
  • No Single Source of Truth: The actual desired state of the system is scattered across multiple scripts, Jenkins pipeline definitions, and the minds of the operations team.

The Declarative Paradigm: "This is My Desired State"

In a declarative world, you write a document (typically YAML or JSON) that fully describes what the final state of the infrastructure should look like. You hand this document to a controller, and the controller figures out how to make the live infrastructure match your document. You are describing what you want, not how to get there.

Here is the declarative equivalent of the above operation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
      - name: payment
        image: payment-service:v2.1.0
        ports:
        - containerPort: 8080

When you apply this manifest using a declarative engine, the engine performs the following analysis:

  1. It reads the manifest and queries the live system to see if payment-service exists.
  2. If it does not exist, it creates it with 5 replicas and the v2.1.0 image.
  3. If it does exist, it compares the current configuration with the desired configuration. If the live system currently has 3 replicas running v2.0.0, it calculates the minimum delta and updates the resource to match the new manifest (scaling up by 2 and performing a rolling update of the container image).
  4. Most importantly, if someone manually scales the deployment down to 2 replicas later, the engine detects this variance during its next reconciliation run and scales it back up to 5 automatically.
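The analysis steps above can be sketched in a few lines of Go. This is a minimal illustration, not how any real engine is implemented: the DeploymentState type and the plan function are hypothetical names, and the "actions" are plain strings rather than API calls.

```go
package main

import "fmt"

// DeploymentState is a hypothetical, heavily reduced view of a Deployment.
type DeploymentState struct {
	Replicas int
	Image    string
}

// plan compares the desired state with the live state and returns the
// actions needed to converge. live == nil means the resource does not exist.
func plan(desired DeploymentState, live *DeploymentState) []string {
	if live == nil {
		return []string{fmt.Sprintf("create with %d replicas of %s", desired.Replicas, desired.Image)}
	}
	var actions []string
	if live.Replicas != desired.Replicas {
		actions = append(actions, fmt.Sprintf("scale %d -> %d", live.Replicas, desired.Replicas))
	}
	if live.Image != desired.Image {
		actions = append(actions, fmt.Sprintf("roll image %s -> %s", live.Image, desired.Image))
	}
	return actions // empty when the live state already matches
}

func main() {
	desired := DeploymentState{Replicas: 5, Image: "payment-service:v2.1.0"}

	// Step 2: the resource does not exist yet.
	fmt.Println(plan(desired, nil))
	// Step 3: the resource exists but is behind on replicas and image.
	fmt.Println(plan(desired, &DeploymentState{Replicas: 3, Image: "payment-service:v2.0.0"}))
	// Step 4: someone scaled it down by hand; only the replica count drifted.
	fmt.Println(plan(desired, &DeploymentState{Replicas: 2, Image: "payment-service:v2.1.0"}))
}
```

Running the same function repeatedly against a converged state returns no actions, which is what makes the approach idempotent.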

Detailed Comparison Matrix

Feature          | Imperative Paradigm                                      | Declarative Paradigm
-----------------+----------------------------------------------------------+------------------------------------------------------------
Primary Focus    | The sequence of steps (How)                              | The target final state (What)
Idempotency      | Must be manually coded into scripts with complex logic   | Built-in by design at the engine level
Drift Correction | None; scripts run once and terminate                     | Continuous; loops constantly monitor and fix drift
Self-Healing     | No; requires external monitoring and manual intervention | Yes; the reconciliation loop automatically heals the system
Auditability     | Low; state is spread across historical execution logs    | High; state is represented as version-controlled code

2. Inside the Kubernetes Declarative Control Plane

Kubernetes is not just a container orchestrator; it is a highly optimized, distributed declarative state engine. To understand how GitOps controllers interact with it, we must analyze the internal components of the Kubernetes control plane and how they manage state.

The Architecture of Declarative State Tracking

The Kubernetes control plane consists of several key components that cooperate to process, store, and enforce declarative configurations. The diagram below illustrates how a declarative manifest moves through the control plane:

+-----------------------------------------------------------------------------------+
|                            KUBERNETES CONTROL PLANE                               |
|                                                                                   |
|  +------------------+      Authentication      +-------------------------------+  |
|  |  kubectl / GitOps|  =====================>  |          kube-apiserver       |  |
|  |  (Manifest YAML) |      & Validation        | (Declarative REST API Gateway)|  |
|  +------------------+                          +-------------------------------+  |
|                                                            ||                     |
|                                                            || Persist Desired     |
|                                                            || State               |
|                                                            \/                     |
|  +------------------+   Read Live State        +-------------------------------+  |
|  | kube-controller- | <======================> |             etcd              |  |
|  |    manager       |                          |  (Distributed Consensus Store)|  |
|  | (Reconciliation) |                          +-------------------------------+  |
|  +------------------+                                                             |
|           ||                                                                      |
|           || Issue Commands to Match Desired State                                 |
|           \/                                                                      |
|  +-----------------------------------------------------------------------------+  |
|  |                                  DATA PLANE                                 |  |
|  |                                                                             |  |
|  |    +------------------------+             +------------------------+        |  |
|  |    |         Node 1         |             |         Node 2         |        |  |
|  |    |  +------------------+  |             |  +------------------+  |        |  |
|  |    |  |     kubelet      |  |             |  |     kubelet      |  |        |  |
|  |    |  +------------------+  |             |  +------------------+  |        |  |
|  |    |  | Container Runtime|  |             |  | Container Runtime|  |        |  |
|  |    |  +------------------+  |             |  +------------------+  |        |  |
|  |    +------------------------+             +------------------------+        |  |
|  +-----------------------------------------------------------------------------+  |
+-----------------------------------------------------------------------------------+
    

Key Control Plane Components

1. The API Server (kube-apiserver)

The API Server is the front door to the Kubernetes control plane. It exposes a declarative RESTful API. When you submit a YAML file, the API Server does not immediately spin up containers. Instead, it performs the following sequence of operations:

  • Authentication & Authorization: Verifies who you are (using OIDC, certificates, or tokens) and whether you have permission to perform the action (via RBAC).
  • Mutating Admission Webhooks: Modifies the incoming request if necessary (e.g., injecting default values, adding sidecar containers, or inserting corporate-mandated labels).
  • Schema Validation: Ensures the submitted YAML complies exactly with the OpenAPI schema defined for that resource type.
  • Validating Admission Webhooks: Performs complex validation logic that cannot be expressed via schema alone (e.g., preventing a deployment from using a deprecated registry, or checking resource quota compliance).
  • Persistence: Writes the validated, normalized resource definition to the backing store.
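The ordering of these stages matters: mutation happens before validation, and nothing is persisted unless every earlier stage passes. The sketch below models that ordering with a toy pipeline. Everything in it is an assumption made for illustration: the Request type, the stage rules (the allowed user, the injected label, the registry prefix) and the in-memory map standing in for the backing store are invented and do not mirror the real kube-apiserver code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Request is a hypothetical, stripped-down API request for a Deployment.
type Request struct {
	Name     string
	User     string
	Labels   map[string]string
	Replicas int
	Image    string
}

// Stage is one step of the request pipeline.
type Stage struct {
	Name string
	Run  func(*Request) error
}

// process runs the stages in order and stops at the first failure.
func process(req *Request, stages []Stage) error {
	for _, s := range stages {
		if err := s.Run(req); err != nil {
			return fmt.Errorf("%s: %w", s.Name, err)
		}
	}
	return nil
}

// pipeline returns the stages in the order listed above.
// The rules inside each stage are invented for this example.
func pipeline(store map[string]Request) []Stage {
	return []Stage{
		{Name: "authn/authz", Run: func(r *Request) error {
			if r.User != "platform-admin" {
				return errors.New("forbidden")
			}
			return nil
		}},
		{Name: "mutating admission", Run: func(r *Request) error {
			if r.Labels == nil {
				r.Labels = map[string]string{}
			}
			if _, ok := r.Labels["cost-center"]; !ok {
				r.Labels["cost-center"] = "unassigned" // injected default
			}
			return nil
		}},
		{Name: "schema validation", Run: func(r *Request) error {
			if r.Replicas < 0 {
				return errors.New("spec.replicas must be >= 0")
			}
			return nil
		}},
		{Name: "validating admission", Run: func(r *Request) error {
			if !strings.HasPrefix(r.Image, "registry.example.invalid/") {
				return errors.New("image registry not allowed")
			}
			return nil
		}},
		{Name: "persistence", Run: func(r *Request) error {
			store[r.Name] = *r
			return nil
		}},
	}
}

func main() {
	store := map[string]Request{}
	good := &Request{Name: "payment-service", User: "platform-admin", Replicas: 5,
		Image: "registry.example.invalid/payment-service:v2.1.0"}
	bad := &Request{Name: "legacy", User: "platform-admin", Replicas: 1, Image: "docker.io/legacy:1"}

	fmt.Println(process(good, pipeline(store)), store["payment-service"].Labels)
	fmt.Println(process(bad, pipeline(store)), len(store))
}
```

The rejected request never reaches the persistence stage, so the store still holds only the first object.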

2. The Distributed Consensus Store (etcd)

etcd is a strongly consistent, distributed key-value store that implements the Raft consensus algorithm. It serves as the single source of truth for the entire cluster's live and desired state. In a declarative system, state durability and consistency are paramount. If etcd reports that 3 replicas exist, the rest of the system operates under the assumption that this is the absolute truth. The API Server is the only component allowed to talk directly to etcd.

3. The Controller Manager (kube-controller-manager)

The Controller Manager is a daemon that embeds the core control loops shipped with Kubernetes. A controller is a non-terminating loop that regulates the state of the system. Examples include the Deployment Controller, Namespace Controller, and StatefulSet Controller. These controllers watch the state of the cluster through the API Server's watch APIs and make changes attempting to move the current state towards the desired state.


3. Deep Dive: The Reconciliation Loop (The Heart of GitOps)

The core mechanism of declarative infrastructure is the Reconciliation Loop. This loop is a continuous, self-correcting cycle that can be mathematically expressed as:

f(Desired State, Actual State) -> Action to minimize difference

The Three Phases of Reconciliation

Every controller in Kubernetes, as well as ArgoCD itself, executes a loop consisting of three distinct phases:

       +--------------------------------------------+
       |                                            |
       |                  OBSERVE                   |
       |      Query API Server & Live System        |
       |                                            |
       +---------------------+----------------------+
                             |
                             | State Data
                             \/
       +--------------------------------------------+
       |                                            |
       |                  ANALYZE                   |
       |     Calculate Delta: Desired vs Actual     |
       |                                            |
       +---------------------+----------------------+
                             |
                             | Calculated Delta
                             \/
       +--------------------------------------------+
       |                                            |
       |                    ACT                     |
       |     Execute Changes to Align States        |
       |                                            |
       +---------------------+----------------------+
                             |
                             +----------------------+ Loop Continues (Infinite)
    
  1. Observe: The controller queries the current state of the resource it is managing. It does this by listening to the Kubernetes API Server's streaming HTTP watch API, which provides real-time updates on resource modifications, creations, and deletions.
  2. Analyze: The controller compares the desired state (specified in the resource's spec block) with the actual state (reported in the resource's status block or gathered directly from the infrastructure, such as running containers or cloud provider APIs).
  3. Act: If a discrepancy (drift) is found, the controller executes API calls to bring the actual state in line with the desired state. This might involve creating a pod, deleting an orphaned service, or calling an external cloud API to provision a load balancer.

Reconciliation in Go: A Conceptual Implementation

To demystify how this works under the hood, let's look at a simplified, production-style Go code block representing how a custom controller reconciles a resource using the popular controller-runtime library:

package controllers

import (
	"context"
	"time"

	"github.com/go-logr/logr"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	appv1 "github.com/enterprise/gitops-operator/api/v1"
)

// resyncInterval controls how often the external database is re-checked for
// drift. The value is illustrative; tune it to your provider's rate limits.
const resyncInterval = 5 * time.Minute

// DatabaseReconciler reconciles a DatabaseInstance object.
// The helper methods checkExternalDatabaseStatus, createExternalDatabase and
// scaleDatabaseStorage wrap the cloud provider API and are omitted here.
type DatabaseReconciler struct {
	client.Client
	Log    logr.Logger
	Scheme *runtime.Scheme
}

// Reconcile is the core reconciliation loop function
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("databaseinstance", req.NamespacedName)

	// 1. OBSERVE: Fetch the desired state from the API Server
	var dbInstance appv1.DatabaseInstance
	if err := r.Get(ctx, req.NamespacedName, &dbInstance); err != nil {
		if apierrors.IsNotFound(err) {
			// The object is already gone, so there is nothing left to reconcile.
			// Cleanup of the external database belongs in a finalizer, which
			// runs while the object still exists.
			log.Info("DatabaseInstance not found; assuming it was deleted")
			return ctrl.Result{}, nil
		}
		log.Error(err, "Unable to fetch DatabaseInstance")
		return ctrl.Result{}, err
	}

	log.Info("Observed Desired State", "Engine", dbInstance.Spec.Engine, "StorageGB", dbInstance.Spec.StorageGB)

	// 2. OBSERVE & ANALYZE: Query actual state of the physical database
	actualDBExists, actualStorage, err := r.checkExternalDatabaseStatus(&dbInstance)
	if err != nil {
		log.Error(err, "Failed to inspect physical database state")
		return ctrl.Result{}, err
	}

	// 3. ACT: Reconcile discrepancies
	if !actualDBExists {
		log.Info("Database does not exist. Creating physical database instance...")
		if err := r.createExternalDatabase(&dbInstance); err != nil {
			log.Error(err, "Failed to provision database")
			return ctrl.Result{}, err
		}
		// Requeue so the next pass observes the freshly created database.
		return ctrl.Result{Requeue: true}, nil
	}

	activeStorage := actualStorage
	if actualStorage < dbInstance.Spec.StorageGB {
		log.Info("Detected storage drift. Scaling physical database storage...", "Current", actualStorage, "Desired", dbInstance.Spec.StorageGB)
		if err := r.scaleDatabaseStorage(&dbInstance, dbInstance.Spec.StorageGB); err != nil {
			log.Error(err, "Failed to scale database storage")
			return ctrl.Result{}, err
		}
		activeStorage = dbInstance.Spec.StorageGB
	}

	// Update Status to reflect the actual state, not merely the desired one
	dbInstance.Status.Phase = "Ready"
	dbInstance.Status.ActiveStorageGB = activeStorage
	if err := r.Status().Update(ctx, &dbInstance); err != nil {
		log.Error(err, "Failed to update DatabaseInstance status")
		return ctrl.Result{}, err
	}

	// Requeue periodically to check for external drift
	return ctrl.Result{RequeueAfter: resyncInterval}, nil
}

Optimistic Concurrency Control (OCC)

In a highly concurrent, distributed system like Kubernetes, multiple controllers or users might try to update the same resource simultaneously. To prevent overwriting updates, Kubernetes uses Optimistic Concurrency Control (OCC).

Every Kubernetes resource contains a metadata field called resourceVersion. Clients must treat it as an opaque string; the API Server derives it from etcd's revision counter. When you read a resource, the API Server returns its current resourceVersion. When you write an update back to the API Server, your request carries the resourceVersion you read. If another actor modified the resource between your read and your write, the stored resourceVersion will have changed, and the API Server will reject your update with a 409 Conflict error.

The controller is then expected to fetch the latest version of the resource, re-apply its logic, and try the write operation again.
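The read, compare, write, retry cycle can be illustrated with a toy in-memory store. This is a sketch under simplifying assumptions: the Store type is a hypothetical stand-in for the API Server plus etcd, and resourceVersion is modeled as an integer counter even though real clients must treat it as an opaque string.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict plays the role of the API Server's 409 Conflict response.
var ErrConflict = errors.New("409 Conflict: the object has been modified")

// Object is a toy resource: one field plus a version counter.
type Object struct {
	ResourceVersion int
	Replicas        int
}

// Store is a hypothetical in-memory stand-in for the API Server and etcd.
type Store struct{ current Object }

// Get returns a copy of the stored object, including its version.
func (s *Store) Get() Object { return s.current }

// Update succeeds only if the caller's copy carries the latest version.
func (s *Store) Update(o Object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// updateWithRetry re-reads the object and re-applies the mutation after
// each conflict, up to the given number of attempts.
func updateWithRetry(s *Store, attempts int, mutate func(*Object)) error {
	for i := 0; i < attempts; i++ {
		obj := s.Get()
		mutate(&obj)
		err := s.Update(obj)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return ErrConflict
}

func main() {
	store := &Store{current: Object{ResourceVersion: 1, Replicas: 3}}

	stale := store.Get() // controller A reads version 1
	other := store.Get() // controller B reads version 1 as well

	other.Replicas = 4
	fmt.Println("B:", store.Update(other)) // B writes first and wins

	stale.Replicas = 5
	fmt.Println("A:", store.Update(stale)) // A still holds version 1 and is rejected

	err := updateWithRetry(store, 3, func(o *Object) { o.Replicas = 5 })
	fmt.Println("A retry:", err, store.Get())
}
```

The key design point is that the mutation is expressed as a function of the freshly read object, so each retry starts from the latest state instead of replaying a stale copy.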


4. Server-Side Apply (SSA) and Field Ownership

Historically, the client-side tool kubectl was responsible for calculating the differences between your local YAML file and the live cluster state. It did this using a complex patching mechanism called Strategic Merge Patch, saving the last applied configuration in a massive annotation: kubectl.kubernetes.io/last-applied-configuration.

This approach had major drawbacks, especially for declarative CD engines like ArgoCD. If multiple systems (e.g., an automated horizontal pod autoscaler and a GitOps delivery engine) modified different fields of the same resource, they would constantly overwrite each other's changes. To solve this, Kubernetes introduced Server-Side Apply (SSA).

How Server-Side Apply Works

With Server-Side Apply, the logic of merging and patching resources moves from the client (e.g., your laptop or ArgoCD) to the kube-apiserver. A client opts in by sending a PATCH request with the Content-Type header application/apply-patch+yaml and a fieldManager query parameter, or by running kubectl apply --server-side. The API Server then tracks exactly which client (known as a Field Manager) owns which fields of the resource.

This tracking is stored directly in the resource's metadata under the managedFields block. Let's look at an example of how this metadata looks on a live Pod:

apiVersion: v1
kind: Pod
metadata:
  name: payment-processor
  namespace: production
  managedFields:
  - manager: argo-cd
    operation: Apply
    apiVersion: v1
    time: "2023-10-27T14:32:00Z"
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:containers:
          k:{"name":"processor"}:
            .: {}
            f:image: {}
            f:ports:
              k:{"containerPort":8080}:
                .: {}
                f:containerPort: {}
  - manager: kubelet
    operation: Update
    subresource: status
    apiVersion: v1
    time: "2023-10-27T14:35:00Z"
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:phase: {}
        f:podIP: {}
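To make the wire format of a Server-Side Apply call concrete, the sketch below builds (but never sends) such a request with Go's standard library. The PATCH verb, the application/apply-patch+yaml content type and the fieldManager and force query parameters are part of the Kubernetes API; the host name, the manager name and the helper function newApplyRequest are placeholders invented for this example. Authentication and TLS setup are left out entirely.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newApplyRequest builds a Server-Side Apply request without sending it.
// baseURL and resourcePath are supplied by the caller; the values used in
// main are placeholders.
func newApplyRequest(baseURL, resourcePath, fieldManager string, force bool, manifest string) (*http.Request, error) {
	q := url.Values{}
	q.Set("fieldManager", fieldManager) // identifies who will own the applied fields
	if force {
		q.Set("force", "true") // take over fields owned by other managers
	}
	req, err := http.NewRequest(http.MethodPatch, baseURL+resourcePath+"?"+q.Encode(), strings.NewReader(manifest))
	if err != nil {
		return nil, err
	}
	// This content type is what tells the API Server to treat the PATCH as an apply.
	req.Header.Set("Content-Type", "application/apply-patch+yaml")
	return req, nil
}

func main() {
	manifest := "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: payment-service\n"
	req, err := newApplyRequest(
		"https://kubernetes.example.invalid",
		"/apis/apps/v1/namespaces/production/deployments/payment-service",
		"example-manager", false, manifest)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
	fmt.Println(req.URL.RawQuery)
	fmt.Println(req.Header.Get("Content-Type"))
}
```

In practice you would rarely assemble this by hand; kubectl and the Kubernetes client libraries construct the same request for you.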

Field Conflicts and Resolution

Because Kubernetes knows who owns what, it can intelligently prevent systems from stepping on each other's toes. Let's trace a common enterprise conflict scenario:

  +-----------------------+              +-----------------------+
  |  GitOps (ArgoCD)      |              |  HPA (Autoscaler)     |
  |  Manager: "argo-cd"   |              |  Manager: "kube-hpa"  |
  +-----------+-----------+              +-----------+-----------+
              |                                      |
              | Sets spec.replicas = 3               | Sets spec.replicas = 10
              |                                      |
              \/                                     \/
  +--------------------------------------------------------------+
  |                     kube-apiserver                           |
  |                                                              |
  |  1. ArgoCD applies replicas=3.                               |
  |     - Field "spec.replicas" owner set to "argo-cd".          |
  |                                                              |
  |  2. HPA scales replicas to 10 with a plain Update.           |
  |     - Updates never raise SSA conflicts.                     |
  |     - Owner of "spec.replicas" becomes "kube-hpa".           |
  |                                                              |
  |  3. ArgoCD re-applies replicas=3.                            |
  |     - "spec.replicas" is now owned by "kube-hpa".            |
  |     - Conflict, unless ArgoCD forces the apply.              |
  |     - Alternative: omit replicas from Git.                   |
  +--------------------------------------------------------------+
    

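If Manager B applies a different value to a field owned by Manager A, the API Server rejects the request with a 409 conflict error unless Manager B explicitly sets the force flag. If the apply is forced, Manager B takes ownership of the field. Conflicts are only raised for Apply requests: an ordinary update (PUT or a non-apply PATCH) that changes a field takes ownership silently. In either case Manager A is not actively notified; it discovers the change when its next apply of that field runs into a conflict.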

For GitOps engines like ArgoCD, this is a game-changer. It allows us to configure ArgoCD to ignore fields that are dynamically managed by in-cluster controllers (like spec.replicas managed by an HPA) while still maintaining strict declarative control over other fields (like container images, env vars, and security contexts).
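The ownership rules can be modeled with two maps: one for field values and one for field owners. The sketch below is a deliberately simplified assumption of how this behaves, not the API Server's implementation: fields are flat string keys, the manager names are made up, and shared ownership (two managers applying the same value) is not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the conflict error returned on a rejected apply.
var ErrConflict = errors.New("conflict: field is owned by another manager")

// Resource tracks field values and, per field, the manager that owns it.
type Resource struct {
	Fields map[string]string
	Owners map[string]string
}

func NewResource() *Resource {
	return &Resource{Fields: map[string]string{}, Owners: map[string]string{}}
}

// Apply mimics a Server-Side Apply request: changing a field owned by a
// different manager is rejected unless force is set.
func (r *Resource) Apply(manager string, fields map[string]string, force bool) error {
	// First pass: detect conflicts before changing anything.
	for k, v := range fields {
		owner, owned := r.Owners[k]
		if owned && owner != manager && r.Fields[k] != v && !force {
			return fmt.Errorf("%w: %s (owner %q)", ErrConflict, k, owner)
		}
	}
	// Second pass: write values and record ownership.
	for k, v := range fields {
		r.Fields[k] = v
		r.Owners[k] = manager
	}
	return nil
}

// Update mimics a non-apply write: changed fields move to the writer
// without any conflict check.
func (r *Resource) Update(manager string, fields map[string]string) {
	for k, v := range fields {
		if r.Fields[k] != v {
			r.Fields[k] = v
			r.Owners[k] = manager
		}
	}
}

func main() {
	dep := NewResource()
	full := map[string]string{"spec.replicas": "3", "image": "payment-service:v2.1.0"}

	_ = dep.Apply("gitops", full, false)
	dep.Update("autoscaler", map[string]string{"spec.replicas": "10"})

	// Re-applying the full manifest now collides with the autoscaler's field.
	fmt.Println(dep.Apply("gitops", full, false))

	// Leaving replicas out of the applied manifest avoids the conflict.
	err := dep.Apply("gitops", map[string]string{"image": "payment-service:v2.1.0"}, false)
	fmt.Println(err, dep.Fields["spec.replicas"], dep.Owners["spec.replicas"])
}
```

The last call shows why omitting a dynamically managed field from the applied manifest is often preferable to forcing: each manager keeps the fields it actually cares about.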


5. Custom Resource Definitions (CRDs): Extending the Declarative Model

One of the primary reasons Kubernetes became the foundation for modern cloud platforms is its extensibility. You are not limited to using built-in resources like Pods and Services. You can define your own domain-specific resources using Custom Resource Definitions (CRDs).

When you register a CRD, you are teaching the Kubernetes API Server how to parse, validate, and store a brand-new declarative API object. Once registered, users can interact with your custom resource using standard tools like kubectl and GitOps controllers like ArgoCD.

Anatomy of a Production-Grade CRD

Let's examine a complete, production-grade CRD that defines a declarative database instance. This CRD includes OpenAPI v3 validation schemas, custom printer columns for kubectl get, and subresources for status tracking.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databaseinstances.database.enterprise.io
spec:
  group: database.enterprise.io
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required:
                - engine
                - version
                - storageGB
              properties:
                engine:
                  type: string
                  enum:
                    - postgresql
                    - mysql
                    - redis
                version:
                  type: string
                  pattern: '^[0-9]+(\.[0-9]+)*$'
                storageGB:
                  type: integer
                  minimum: 10
                  maximum: 5000
                backup:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                    retentionDays:
                      type: integer
                      minimum: 1
            status:
              type: object
              properties:
                phase:
                  type: string
                activeStorageGB:
                  type: integer
                connectionEndpoint:
                  type: string
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Engine
          type: string
          jsonPath: .spec.engine
        - name: Version
          type: string
          jsonPath: .spec.version
        - name: Status
          type: string
          jsonPath: .status.phase
  scope: Namespaced
  names:
    plural: databaseinstances
    singular: databaseinstance
    kind: DatabaseInstance
    shortNames:
      - dbi

Deploying a Custom Resource (CR) Instance

Once the CRD is applied to the cluster, the API Server exposes a new REST endpoint: /apis/database.enterprise.io/v1alpha1/namespaces/{namespace}/databaseinstances. We can now submit declarative manifests of our custom type:

apiVersion: database.enterprise.io/v1alpha1
kind: DatabaseInstance
metadata:
  name: billing-db
  namespace: production
spec:
  engine: postgresql
  version: "15.4"
  storageGB: 200
  backup:
    enabled: true
    retentionDays: 30

When this manifest is applied, the API Server validates it against the OpenAPI schema defined in the CRD. If a user tries to set storageGB: 5 (which is below the minimum of 10) or engine: oracle (which is not in the allowed enum), the API Server will reject the request with a validation error before it ever reaches the database controller.
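The checks the API Server performs here come straight from the schema: an enum, a regular expression and a numeric range. The following sketch re-implements those three rules by hand purely to show what is being enforced. The DatabaseSpec struct and validate function are hypothetical; in a real cluster the API Server evaluates the OpenAPI schema itself and you write no such code.

```go
package main

import (
	"fmt"
	"regexp"
)

// DatabaseSpec mirrors the required fields of the CRD's spec block.
type DatabaseSpec struct {
	Engine    string
	Version   string
	StorageGB int
}

var (
	// Same values as the enum in the CRD schema.
	allowedEngines = map[string]bool{"postgresql": true, "mysql": true, "redis": true}
	// Same pattern as the version field in the CRD schema.
	versionPattern = regexp.MustCompile(`^[0-9]+(\.[0-9]+)*$`)
)

// validate returns one message per violated schema rule.
func validate(s DatabaseSpec) []string {
	var errs []string
	if !allowedEngines[s.Engine] {
		errs = append(errs, fmt.Sprintf("spec.engine: unsupported value %q", s.Engine))
	}
	if !versionPattern.MatchString(s.Version) {
		errs = append(errs, fmt.Sprintf("spec.version: %q does not match the required pattern", s.Version))
	}
	if s.StorageGB < 10 || s.StorageGB > 5000 {
		errs = append(errs, fmt.Sprintf("spec.storageGB: %d is outside the range 10..5000", s.StorageGB))
	}
	return errs
}

func main() {
	valid := DatabaseSpec{Engine: "postgresql", Version: "15.4", StorageGB: 200}
	invalid := DatabaseSpec{Engine: "oracle", Version: "15.4", StorageGB: 5}

	fmt.Println(len(validate(valid)))
	for _, msg := range validate(invalid) {
		fmt.Println(msg)
	}
}
```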


6. The GitOps Connection: Scaling Reconciliation to Git

Now that we have explored how Kubernetes handles declarative state internally, we can understand the core architectural premise of GitOps: extending the reconciliation loop outside the Kubernetes cluster to a Git repository.

The GitOps Paradigm

In standard Kubernetes operations, the desired state is applied manually or via CI scripts using kubectl apply. In the GitOps paradigm, we introduce a Git repository as the Single Source of Truth (SSOT) for our desired state, and we place a specialized controller (like ArgoCD) inside the cluster.

+---------------------------------------------------------------------------------+
|                                 THE GITOPS CYCLE                                |
|                                                                                 |
|  +-------------------+       Git Push       +--------------------------------+  |
|  |  Git Repository   |  ==================> |        ArgoCD Controller       |  |
|  |  (Desired State)  |                      | (Continuous Git Reconciliation)|  |
|  +-------------------+                      +--------------------------------+  |
|           ^                                                 ||                  |
|           |                                                 || Compares Git     |
|           | Pull Request Audit                              || vs Live Cluster  |
|           |                                                 \/                  |
|  +-------------------+                      +--------------------------------+  |
|  | Developer / Infra |                      |        Kubernetes Cluster      |  |
|  |      Engineer     |                      |          (Live State)          |  |
|  +-------------------+                      +--------------------------------+  |
+---------------------------------------------------------------------------------+
    

ArgoCD runs its own reconciliation loop that wraps around the Kubernetes API Server:

  1. Observe Git: ArgoCD polls or receives webhooks from your Git repository (GitHub, GitLab, Bitbucket) and parses the declarative manifests (raw YAML, Helm charts, or Kustomize targets). This is the Target Desired State.
  2. Observe Cluster: ArgoCD queries the Kubernetes API Server to fetch the current live configuration of all managed resources. This is the Actual Live State.
  3. Analyze: ArgoCD calculates the diff between Git and the Cluster. If the states match, the application is marked as Synced. If they differ, the application is marked as OutOfSync.
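  4. Act: Depending on its sync policy, ArgoCD will either report the application as OutOfSync and wait for an operator to trigger a sync (manual sync), or invoke the Kubernetes API Server to apply the changes from Git automatically (automated sync). Reverting changes made directly in the cluster additionally requires the selfHeal option of the automated sync policy.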
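The four steps above can be condensed into a small comparison routine. This is a conceptual sketch only and does not reflect ArgoCD's internals: manifests are flattened into maps keyed by JSON-pointer-style paths, the ignored-path set is a hypothetical stand-in for fields you choose to leave to in-cluster controllers, and "applying" is just a map write.

```go
package main

import (
	"fmt"
	"sort"
)

// drift lists the field paths where the live state differs from Git,
// skipping any path the operator chose to ignore.
func drift(git, live map[string]string, ignored map[string]bool) []string {
	var paths []string
	for path, want := range git {
		if ignored[path] {
			continue
		}
		if got, ok := live[path]; !ok || got != want {
			paths = append(paths, path)
		}
	}
	sort.Strings(paths)
	return paths
}

// reconcile performs one pass: compare, then either report or apply.
func reconcile(git, live map[string]string, ignored map[string]bool, autoSync bool) string {
	paths := drift(git, live, ignored)
	if len(paths) == 0 {
		return "Synced"
	}
	if !autoSync {
		return "OutOfSync" // wait for an operator to trigger the sync
	}
	for _, p := range paths {
		live[p] = git[p] // stand-in for applying the manifest to the cluster
	}
	return "Synced"
}

func main() {
	const imagePath = "/spec/template/spec/containers/0/image"
	git := map[string]string{"/spec/replicas": "3", imagePath: "payment-service:v2.1.0"}
	live := map[string]string{"/spec/replicas": "10", imagePath: "payment-service:v2.0.0"}

	// Manual mode: drift is reported but not corrected.
	fmt.Println(reconcile(git, live, nil, false))

	// Automated mode, with the replica count left to an in-cluster controller.
	ignored := map[string]bool{"/spec/replicas": true}
	fmt.Println(reconcile(git, live, ignored, true), live["/spec/replicas"], live[imagePath])
}
```

Note that the sketch only inspects fields present in Git. Fields that exist only in the live object, such as server-side defaults, are not treated as drift.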

Why GitOps Requires Declarative Configurations

It is structurally impossible to implement GitOps with imperative configurations. If your Git repository contained a series of shell scripts, a GitOps controller would have no safe way to calculate a diff, determine if the system has drifted, or automatically resolve discrepancies. Declarative manifests are mathematically diffable, versionable, and highly auditable, making them the only format suitable for GitOps pipelines.


7. Enterprise Patterns & Best Practices

Operating declarative systems at enterprise scale requires strict adherence to architectural patterns that ensure security, reliability, and maintainability.

1. Decoupling Configuration and Code

Never store your declarative application manifests in the same Git repository as your application source code. Keep them in separate repositories. This separation provides several critical benefits:

  • Security (Least Privilege): Developers may have permission to commit code and trigger CI pipelines, but only platform engineers or automated release managers should have write access to production environment configuration repositories.
  • Build Performance: Changing a resource limit or replica count in a manifest should not trigger a 20-minute container image build and test suite run. It should merely trigger a GitOps sync.
  • Clean Audit Trails: The Git history of your deployment repository represents a clean timeline of environment state changes, unpolluted by code commits, branch merges, and test runs.

2. Embracing Immutable Infrastructure

In a declarative world, you should treat your infrastructure and application instances as immutable. Never modify running containers or cluster resources directly via kubectl edit or kubectl exec. If a change is required, modify the declarative manifest in Git, commit it, merge it, and let the GitOps controller roll out the update. This guarantees that your cluster can be completely recreated from scratch using only the contents of your Git repositories in the event of a disaster.

3. Structuring Multi-Environment Configurations

Enterprise platforms must support multiple environments (development, staging, production) without duplicating massive amounts of YAML code. To achieve this, use configuration management tools like Kustomize or Helm within your declarative pipeline.

Here is a recommended enterprise directory structure using Kustomize:

infrastructure-gitops/
├── apps/
│   └── payment-service/
│       ├── base/
│       │   ├── deployment.yaml
│       │   ├── service.yaml
│       │   └── kustomization.yaml
│       └── environments/
│           ├── development/
│           │   ├── replica-patch.yaml
│           │   └── kustomization.yaml
│           └── production/
│               ├── replica-patch.yaml
│               ├── resources-patch.yaml
│               └── kustomization.yaml

In this structure, the base directory contains the core declarative manifests that are common across all environments. The environments/production directory contains only the specific patches (e.g., scaling up replicas, setting higher CPU/memory limits) and overlays unique to production. ArgoCD is configured to point to the environment-specific directories, dynamically rendering the final declarative manifests before applying them to the target clusters.
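Conceptually, rendering an environment means layering a small patch over the shared base. The sketch below shows that idea with a recursive map merge. It is an approximation for illustration only: Kustomize's real patch semantics (for example, merging list entries by name) are considerably richer, and the merge function and sample values here are hypothetical.

```go
package main

import "fmt"

// merge returns a new map containing base overlaid with patch.
// Nested maps are merged recursively; any other value in patch replaces
// the corresponding value in base. Neither input is modified.
func merge(base, patch map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, pv := range patch {
		baseChild, baseIsMap := out[k].(map[string]interface{})
		patchChild, patchIsMap := pv.(map[string]interface{})
		if baseIsMap && patchIsMap {
			out[k] = merge(baseChild, patchChild)
		} else {
			out[k] = pv
		}
	}
	return out
}

func main() {
	// Stand-in for base/deployment.yaml.
	base := map[string]interface{}{
		"kind": "Deployment",
		"spec": map[string]interface{}{
			"replicas": 1,
			"image":    "payment-service:v2.1.0",
		},
	}
	// Stand-in for environments/production/replica-patch.yaml.
	productionPatch := map[string]interface{}{
		"spec": map[string]interface{}{"replicas": 5},
	}

	rendered := merge(base, productionPatch)
	fmt.Println(rendered["spec"])
	fmt.Println(base["spec"]) // the base stays untouched for other environments
}
```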


8. Troubleshooting & Operational Scenarios

Even in robust declarative systems, failures can occur due to misconfigurations, schema violations, or complex controller interactions. Below are common real-world failure scenarios and their step-by-step resolution playbooks.

Scenario A: The Infinite Reconciliation Loop (Flapping State)

Symptom: ArgoCD shows that an application is constantly toggling between Synced and OutOfSync. Looking at the diff, a specific field (e.g., a replica count or an annotation) keeps changing back and forth every few seconds.

Root Cause: This occurs when there is a conflict between your declarative manifest in Git and an in-cluster dynamic controller (like a Horizontal Pod Autoscaler or an admission webhook). Git says the replica count should be 3, so ArgoCD applies 3. A second later, the HPA controller decides the cluster needs 10 replicas due to high load, so it updates the replica count to 10. ArgoCD detects this drift from Git, updates it back to 3, and the cycle repeats infinitely.

Resolution Playbook:

  1. Identify the conflicting field by examining the live diff in the ArgoCD UI or running:
    kubectl get deployment payment-service -o yaml
    Check the metadata.managedFields block to see which controllers are editing the field.
  2. Configure ArgoCD to ignore the specific field being mutated by the in-cluster controller. In your ArgoCD Application manifest, add an ignoreDifferences block:
    spec:
      ignoreDifferences:
      - group: apps
        kind: Deployment
        name: payment-service
        jsonPointers:
        - /spec/replicas
  3. Apply the updated Application manifest. ArgoCD will now allow the in-cluster controller to manage that specific field without triggering a sync operation.

Scenario B: CRD Schema Validation Failures

Symptom: When trying to apply a custom resource manifest, the API Server returns an error similar to:

error: ValidationError(DatabaseInstance.spec): unknown field "backupInterval" in io.enterprise.database.v1alpha1.DatabaseInstance.spec

Root Cause: The custom resource manifest contains a field that is not defined or is defined incorrectly in the CustomResourceDefinition's OpenAPI v3 validation schema.

Resolution Playbook:

  1. Inspect the registered schema in your cluster using kubectl explain:
    kubectl explain databaseinstances.spec
    This will show you all valid fields and their expected data types.
  2. If the field is missing from the explanation but should be there, you must update the CRD definition to include the field under spec.versions[].schema.openAPIV3Schema.
  3. If the field is simply misspelled in your custom resource manifest, correct the spelling in your Git repository and push the change to trigger a clean reconciliation.

Scenario C: Optimistic Concurrency Conflict (409 Conflict)

Symptom: Your automation scripts or custom controllers are logging errors like:

Operation cannot be fulfilled on databaseinstances.database.enterprise.io "billing-db": the object has been modified; please apply your changes to the latest version and try again

Root Cause: The controller attempted to write an update to the resource using an outdated resourceVersion. Another client modified the resource in the background during the controller's reconciliation execution.

Resolution Playbook:

  1. If you are writing custom controllers, ensure your code implements a retry-on-conflict mechanism. The client-go library provides a helper function for this:
    import "k8s.io/client-go/util/retry"
    
    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // 1. Fetch the latest version of the resource
        err := r.Get(ctx, req.NamespacedName, &dbInstance)
        if err != nil {
            return err
        }
        
        // 2. Make your modifications
        dbInstance.Spec.StorageGB = 300
        
        // 3. Attempt to update
        return r.Update(ctx, &dbInstance)
    })
  2. If you are using kubectl

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
