AWS DevOps Masterclass: Containerization Basics with Amazon ECR and ECS

A comprehensive, production-grade guide to containerizing, storing, and orchestrating enterprise applications at scale using Amazon Elastic Container Registry (ECR) and Amazon Elastic Container Service (ECS).

Introduction to Container Orchestration on AWS
What You Will Learn
Prerequisites
Enterprise Architecture Overview
Deep Dive: Amazon Elastic Container Registry (ECR)
Deep Dive: Amazon Elastic Container Service (ECS)
ECS Networking Modes Explained
Infrastructure as Code (IaC) with Terraform
CI/CD Pipeline Integration with GitHub Actions
Production Best Practices & Security Hardening
Monitoring, Logging, and Observability
Troubleshooting and Debugging Guide
Technical Interview Questions & Answers
Frequently Asked Questions (FAQs)
Summary and Next Steps

Introduction to Container Orchestration on AWS

In modern cloud-native engineering, containerization is the foundation of scalable, predictable, and isolated application delivery. While running a single Docker container on a local machine is straightforward, running thousands of containers across a resilient, distributed infrastructure requires a robust container orchestration engine.

What is Amazon ECS? Amazon Elastic Container Service (ECS) is a highly scalable, high-performance container orchestration service that allows you to run, stop, and manage Docker containers on a cluster. ECS eliminates the need for you to install, operate, and scale your own container orchestration infrastructure.

What is Amazon ECR? Amazon Elastic Container Registry (ECR) is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. ECR is integrated with ECS, simplifying your development-to-production workflow.

Quick Definition:
Amazon ECS is AWS's opinionated, highly integrated container orchestrator that manages the lifecycle of containerized applications. Its two main launch types are AWS Fargate (a serverless compute engine where AWS manages the underlying hosts) and the EC2 launch type (where you manage and scale your own fleet of EC2 instances); a third, External, covers on-premises servers through ECS Anywhere. Amazon ECR acts as the secure, private storage repository for the Docker images that ECS pulls to run these workloads.

For enterprise workloads, choosing between ECS and Kubernetes (EKS) often comes down to operational complexity. ECS offers deep integration with AWS-native services (such as IAM, CloudWatch, Route 53, and Application Load Balancers) without the administrative overhead of managing Kubernetes control planes, custom resource definitions (CRDs), and complex networking plugins.

What You Will Learn

By the end of this comprehensive guide, you will be able to:

Design and deploy highly secure, private Amazon ECR registries with advanced lifecycle policies and vulnerability scanning.
Architect an Amazon ECS cluster utilizing both AWS Fargate and EC2 capacity providers.
Draft production-ready Dockerfiles using multi-stage builds to minimize attack surface and image size.
Configure ECS Task Definitions with explicit IAM task roles, security groups, log configurations, and secrets management.
Deploy ECS Services with Application Load Balancers (ALBs), auto-scaling policies, and rolling update strategies.
Provision the entire container infrastructure using declarative Terraform code.
Build an automated CI/CD pipeline using GitHub Actions to build, push, and deploy containerized applications.
Troubleshoot common production failures such as container crash loops, ELB health check failures, and IAM permission bottlenecks.

Prerequisites

To successfully implement the patterns in this guide, you should have:

An active AWS Account with administrative privileges or permissions to create IAM Roles, VPCs, ECR Repositories, ECS Clusters, and ALBs.
Local installation of the AWS CLI v2, configured with valid credentials.
Docker Desktop or Docker Engine installed locally for building and testing container images.
Terraform CLI (v1.5.0 or later) installed for Infrastructure as Code provisioning.
A basic understanding of network concepts (subnets, route tables, security groups, and load balancers). You can review our VPC Architecture and Networking Guide to refresh your knowledge.

Enterprise Architecture Overview

A production-grade ECS architecture requires a multi-Availability Zone (AZ) design. Containers should run in private subnets so that they are never directly reachable from the public internet. Ingress traffic is strictly controlled via an internet-facing Application Load Balancer (ALB) located in public subnets, which routes traffic to the ECS tasks.

The diagram below illustrates the complete architecture, including the secure image storage layer (ECR), the serverless execution layer (Fargate), secure secrets retrieval, and private network communication via VPC Endpoints.

+-----------------------------------------------------------------------------------------------------------------------+
|                                                      AWS Cloud                                                        |
|                                                                                                                       |
|  +-----------------------------------------------------------------------------------------------------------------+  |
|  |                                                 VPC (10.0.0.0/16)                                               |  |
|  |                                                                                                                 |  |
|  |    +----------------------------------------- Public Subnets (AZ1 & AZ2) ----------------------------------+    |  |
|  |    |                                                                                                       |    |  |
|  |    |      +------------------------+                                       +------------------------+      |    |  |
|  |    |      |  NAT Gateway (AZ1)     |                                       |  NAT Gateway (AZ2)     |      |    |  |
|  |    |      +-----------+------------+                                       +-----------+------------+      |    |  |
|  |    |                  |                                                                |                   |    |  |
|  |    |                  v                                                                v                   |    |  |
|  |    |      +-----------------------------------------------------------------------------------------+      |    |  |
|  |    |      |                         Application Load Balancer (ALB) - Public                        |      |    |  |
|  |    |      +--------------------------------------------+--------------------------------------------+      |    |  |
|  |    +---------------------------------------------------|---------------------------------------------------+    |  |
|  |                                                        |                                                        |  |
|  |    +---------------------------------------- Private Subnets (AZ1 & AZ2) ---------------------------------+    |  |
|  |    |                                                   |                                                   |    |  |
|  |    |       +-------------------------------------------v-------------------------------------------+       |    |  |
|  |    |       |                                    ECS Cluster (Fargate)                              |       |    |  |
|  |    |       |                                                                                       |       |    |  |
|  |    |       |   +------------------------------------+             +------------------------------------+   |    |  |
|  |    |       |   |         Private Subnet AZ1         |             |         Private Subnet AZ2         |   |    |  |
|  |    |       |   |                                    |             |                                    |   |    |  |
|  |    |       |   |  +------------------------------+  |             |  +------------------------------+  |   |    |  |
|  |    |       |   |  |     ECS Task (Container)     |  |             |  |     ECS Task (Container)     |  |   |    |  |
|  |    |       |   |  |   - App (Port 8080)          |  |             |  |   - App (Port 8080)          |  |   |    |  |
|  |    |       |   |  |   - Private IP: 10.0.1.45    |  |             |  |   - Private IP: 10.0.2.112   |  |   |    |  |
|  |    |       |   |  +--------------+---------------+  |             |  +--------------+---------------+  |   |    |  |
|  |    |       |   +-----------------|------------------+             +-----------------|------------------+   |    |  |
|  |    |       +---------------------|--------------------------------------------------|----------------------+   |    |  |
|  |    |                             |                                                  |                      |    |  |
|  |    |                             |          +----------------------------+          |                      |    |  |
|  |    |                             +--------->| VPC Endpoints (PrivateLink)|<--------+                      |    |  |
|  |    |                                        | - ECR API / ECR DKR        |                                 |    |  |
|  |    |                                        | - CloudWatch Logs / S3     |                                 |    |  |
|  |    |                                        +--------------+-------------+                                 |    |  |
|  |    +-------------------------------------------------------|-----------------------------------------------+    |  |
|  +------------------------------------------------------------|----------------------------------------------------+  |
|                                                               |                                                       |
|                                                               v                                                       |
|  +----------------------------------+           +-------------+------------+           +---------------------------+  |
|  |       Amazon ECR Registry        |           |   AWS Secrets Manager    |           |    Amazon CloudWatch      |  |
|  |  - Secure Container Images       |           |  - Database Credentials  |           |  - Container Insights     |  |
|  |  - KMS Encrypted & Scanned       |           |  - API Keys              |           |  - Log Streams            |  |
|  +----------------------------------+           +--------------------------+           +---------------------------+  |
+-----------------------------------------------------------------------------------------------------------------------+

In this design:

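VPC Endpoints (AWS PrivateLink): Ensure that even if NAT Gateways fail or are omitted for cost/security optimization, Fargate tasks can still pull images from ECR, stream logs to CloudWatch, and fetch secrets from Secrets Manager without traversing the public internet. The ECR API, ECR DKR, CloudWatch Logs, and Secrets Manager endpoints are interface endpoints built on PrivateLink; S3, which stores the ECR image layers, is reached through a gateway endpoint.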
Separation of Concerns: The ALB is the only resource exposed to the public internet. It performs SSL termination and forwards traffic to the backend Fargate tasks via target groups using private IP addresses.
Multi-AZ Redundancy: ECS automatically distributes tasks across multiple availability zones to maintain high availability in the event of an AZ-level outage.
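
The multi-AZ behaviour described above can be pictured with a small sketch. This is a simplified round-robin model, not the actual ECS scheduler; the function name spread_tasks and the zone names are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import cycle


def spread_tasks(desired_count: int, zones: list) -> Counter:
    """Assign tasks to zones round-robin so per-zone counts differ by at most one."""
    if not zones:
        raise ValueError("at least one Availability Zone is required")
    # Start every zone at zero so empty zones still show up in the result.
    placement = Counter({zone: 0 for zone in zones})
    zone_cycle = cycle(zones)
    for _ in range(desired_count):
        placement[next(zone_cycle)] += 1
    return placement


if __name__ == "__main__":
    # Hypothetical zone names; five tasks spread over two zones.
    print(spread_tasks(5, ["us-east-1a", "us-east-1b"]))
```

If one zone becomes unavailable, calling the function again with the remaining zones shows how the same desired count would be rebalanced onto the surviving zones.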

Deep Dive: Amazon Elastic Container Registry (ECR)

Amazon ECR is more than just a storage bucket for Docker images. It is an enterprise-grade registry that provides integrated vulnerability scanning, image immutability, fine-grained access control via AWS IAM, and cross-region replication.

Private vs. Public Repositories

ECR supports both public and private repositories. Public repositories (listed in the Amazon ECR Public Gallery at gallery.ecr.aws and pulled from public.ecr.aws) are globally accessible and ideal for open-source projects. Private repositories require AWS authentication via IAM and are designed for internal proprietary applications.

Image Security and KMS Encryption

By default, ECR encrypts images at rest using Amazon S3-managed encryption keys (SSE-S3). For strict compliance environments (HIPAA, PCI-DSS, FedRAMP), you should configure ECR to use Customer Managed Keys (CMK) stored in AWS Key Management Service (KMS). This ensures you have full audit control over who can decrypt and pull the container images.

Vulnerability Scanning

ECR offers two levels of vulnerability scanning:

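Basic Scanning: Scans images on push (or on demand) against a database of known CVEs. It was originally powered by the open-source Clair engine, which AWS has been replacing with its own native scanner. There is no additional charge for basic scanning, although how often the same image can be rescanned is rate-limited.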
Enhanced Scanning: Integrated with Amazon Inspector, this provides continuous scanning of your repository images. It automatically scans images when pushed and continuously monitors them for new CVEs (Common Vulnerabilities and Exposures) as database definitions update.

Lifecycle Policies

As your CI/CD pipelines build images on every commit, storage costs can escalate. ECR Lifecycle Policies allow you to define rules to automatically clean up old, unused, or untagged images.

Here is an example of an ECR Lifecycle Policy that expires untagged images once they are more than 7 days old and keeps only the 30 most recent images whose tags start with release-:

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images older than 7 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 7
      },
      "action": {
        "type": "expire"
      }
    },
    {
      "rulePriority": 2,
      "description": "Keep only the last 30 release images",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["release-"],
        "countType": "imageCountMoreThan",
        "countNumber": 30
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
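
To make the effect of the two rules above concrete, here is a minimal sketch that applies the same selection logic to an in-memory list of images. It is an illustration under simplifying assumptions, not how ECR evaluates policies internally; the Image class, the images_to_expire function, and the sample digests are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Image:
    digest: str
    pushed_at: datetime
    tags: list = field(default_factory=list)


def images_to_expire(images, now, untagged_max_age_days=7,
                     release_prefix="release-", keep_releases=30):
    """Return the digests that the two rules above would mark for expiry."""
    expired = set()

    # Rule 1: untagged images pushed more than N days ago.
    cutoff = now - timedelta(days=untagged_max_age_days)
    for image in images:
        if not image.tags and image.pushed_at < cutoff:
            expired.add(image.digest)

    # Rule 2: among images carrying a "release-" tag, keep only the newest N.
    releases = [
        image for image in images
        if any(tag.startswith(release_prefix) for tag in image.tags)
    ]
    releases.sort(key=lambda image: image.pushed_at, reverse=True)
    for image in releases[keep_releases:]:
        expired.add(image.digest)

    return expired


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    images = [
        Image("sha256:old-untagged", now - timedelta(days=10)),
        Image("sha256:new-untagged", now - timedelta(days=1)),
    ]
    # 32 hypothetical release images; release-0 is the oldest.
    images += [
        Image(f"sha256:release-{i}", now - timedelta(days=40 - i), [f"release-{i}"])
        for i in range(32)
    ]
    print(sorted(images_to_expire(images, now)))
```

Images with tags that do not match the release- prefix (for example a dev tag) are untouched by either rule, which is worth remembering when estimating storage costs.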

Practical Hands-on: Pushing Your First Image

Let's walk through the exact steps to build a secure, multi-stage Node.js application, create an ECR repository, authenticate, and push the image.

Step 1: The Production-Grade Dockerfile

We use a multi-stage Docker build to ensure our final production image contains only the runtime dependencies, reducing the attack surface and download size.

# --- Build Stage ---
FROM node:18-alpine AS builder
WORKDIR /usr/src/app

# Copy dependency manifests
COPY package*.json ./

# Install ALL dependencies (including devDependencies for compilation)
RUN npm ci

# Copy application source
COPY . .

# Run build step (e.g., compile TypeScript)
RUN npm run build

# Prune dev dependencies to keep production image light
# (--omit=dev replaces the deprecated --production flag)
RUN npm prune --omit=dev

# --- Production Stage ---
FROM node:18-alpine
WORKDIR /usr/src/app

# Set Node environment to production
ENV NODE_ENV=production

# Copy only runtime dependencies and compiled build artifacts from builder stage
COPY --from=builder /usr/src/app/node_modules ./node_modules
COPY --from=builder /usr/src/app/dist ./dist
COPY --from=builder /usr/src/app/package*.json ./

# Run as a non-root user for security hardening
USER node

# Expose the application port
EXPOSE 8080

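# Define container health check (honored by Docker; ECS ignores image-level
# HEALTHCHECK and only uses the healthCheck block in the container definition)
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1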

CMD ["node", "dist/index.js"]

Step 2: Authenticate and Push to ECR

Execute the following shell commands to provision the repository, log in, tag, and push your image. Replace 123456789012 with your AWS Account ID and us-east-1 with your target region.

# Set environment variables
AWS_ACCOUNT_ID="123456789012"
AWS_REGION="us-east-1"
REPO_NAME="enterprise-app"

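# 1. Create the ECR repository with scan-on-push and KMS encryption enabled.
#    Without a kmsKey argument, ECR uses the AWS managed key (aws/ecr);
#    append ,kmsKey=<key-arn> to use a customer managed key instead.
aws ecr create-repository \
    --repository-name "${REPO_NAME}" \
    --image-scanning-configuration scanOnPush=true \
    --encryption-configuration encryptionType=KMS \
    --region "${AWS_REGION}"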

# 2. Authenticate the local Docker daemon to your private ECR registry
aws ecr get-login-password --region ${AWS_REGION} | \
    docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

# 3. Build the Docker image locally
docker build -t ${REPO_NAME}:latest .

# 4. Tag the image with the ECR repository URI
docker tag ${REPO_NAME}:latest ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:v1.0.0
docker tag ${REPO_NAME}:latest ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest

# 5. Push the images to Amazon ECR
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:v1.0.0
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest
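
The tag and push commands above all rely on the same registry URI layout. The sketch below builds and parses that layout so it can be validated in scripts before calling Docker. It assumes the standard amazonaws.com domain used in the commands above, and the function names are hypothetical.

```python
import re

# Layout used above: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account_id>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[a-z0-9][a-z0-9._/-]*):(?P<tag>[A-Za-z0-9_][A-Za-z0-9_.-]*)$"
)


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build a private ECR image URI and fail fast on malformed parts."""
    uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
    if _ECR_URI.match(uri) is None:
        raise ValueError(f"not a valid ECR image URI: {uri}")
    return uri


def parse_ecr_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid ECR image URI: {uri}")
    return match.groupdict()


if __name__ == "__main__":
    uri = ecr_image_uri("123456789012", "us-east-1", "enterprise-app", "v1.0.0")
    print(uri)
    print(parse_ecr_image_uri(uri))
```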

Deep Dive: Amazon Elastic Container Service (ECS)

Amazon ECS is built on a highly optimized, state-driven control plane. Understanding its component hierarchy is crucial for designing stable deployments.

The ECS Object Hierarchy

Cluster: A logical grouping of tasks or services. Clusters can run tasks on EC2 instances, AWS Fargate, or external on-premises servers (ECS Anywhere).
Task Definition: A blueprint (JSON format) that describes one or more containers (up to 10) that make up your application. It defines parameters such as CPU, memory, Docker images, port mappings, storage volumes, and log configurations.
Task: The instantiation of a Task Definition within a cluster. Think of a Task Definition as the "Class" and a Task as the running "Object Instance".
Service: The scheduler that maintains the desired count of tasks simultaneously in an ECS cluster. If a task fails, the service scheduler replaces it automatically. It also integrates with Load Balancers, Service Discovery (Cloud Map), and Service Meshes (App Mesh).
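
The way a Service keeps the desired count can be thought of as a reconciliation step: compare the desired state with what is running and act on the difference. The sketch below is a deliberately simplified model of that idea, not the real scheduler; the reconcile function and the task IDs are hypothetical.

```python
def reconcile(desired_count: int, running_tasks: list) -> dict:
    """Compare desired and actual state and return the actions to take."""
    if desired_count < 0:
        raise ValueError("desired_count must not be negative")
    running = list(running_tasks)
    if len(running) < desired_count:
        # Too few tasks: launch replacements to close the gap.
        return {"launch": desired_count - len(running), "stop": []}
    # Enough or too many tasks: stop any surplus beyond the desired count.
    return {"launch": 0, "stop": running[desired_count:]}


if __name__ == "__main__":
    # One of three desired tasks survived a failure, so two are launched.
    print(reconcile(3, ["task-a"]))
    # The desired count was scaled down to one, so the surplus task is stopped.
    print(reconcile(1, ["task-a", "task-b"]))
```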

AWS Fargate vs. EC2 Launch Type

The choice of launch type dictates your operational overhead and cost structure:

Feature	AWS Fargate (Serverless)	EC2 Launch Type (Self-Managed Instances)
Infrastructure Management	None. AWS provisions, configures, and scales the underlying compute.	High. You manage the EC2 instances, including patching, OS updates, the Docker daemon, and the ECS container agent.
Isolation Model	Hypervisor-level isolation. Each task runs in its own dedicated virtual machine.	OS-level isolation. Multiple tasks share the same EC2 host instance.
Billing Model	Pay per second for the vCPU and memory (GB) allocated to each running task.	Pay for the underlying EC2 instances, regardless of container utilization.
Scaling Speed	No instance provisioning lag; task start time depends mainly on image pull and application startup.	Requires an Auto Scaling group to launch new EC2 instances when the cluster runs out of capacity.
Customization	Limited. No access to the host OS, custom kernels, or raw block devices.	Full root access to the host OS. Supports custom AMIs, SSH access, and daemon-scheduled services.

Task Execution Role vs. Task Role

One of the most common configuration mistakes in ECS is confusing the two IAM roles assigned to a Task Definition:

Task Execution Role (execution_role_arn): This role is used by the ECS Agent (the underlying system worker) before the container starts. It grants permission to pull images from Amazon ECR, stream logs to CloudWatch Logs, and retrieve secrets from AWS Secrets Manager or Systems Manager Parameter Store.
Task Role (task_role_arn): This role is used by the application inside your container once it is running. For example, if your Node.js application needs to read files from an S3 bucket or write items to a DynamoDB table, you must grant those permissions to the Task Role.
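
One way to keep the two roles apart is to sort every permission your workload needs into one bucket or the other. The sketch below does this using the six actions granted by the AmazonECSTaskExecutionRolePolicy managed policy; everything else is assumed to belong to the Task Role. The function name is hypothetical, and permissions for secret injection (which also go on the execution role) are left out for brevity.

```python
# Actions in the AmazonECSTaskExecutionRolePolicy managed policy:
# pulling images from ECR and writing to CloudWatch Logs.
EXECUTION_ROLE_ACTIONS = frozenset({
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
})


def split_permissions(required_actions) -> dict:
    """Sort IAM actions into the role that should carry them."""
    actions = set(required_actions)
    return {
        "execution_role": sorted(actions & EXECUTION_ROLE_ACTIONS),
        "task_role": sorted(actions - EXECUTION_ROLE_ACTIONS),
    }


if __name__ == "__main__":
    needed = ["ecr:BatchGetImage", "logs:PutLogEvents", "s3:GetObject", "dynamodb:PutItem"]
    print(split_permissions(needed))
```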

Decoupling Secrets Management

Hardcoding credentials (database passwords, API keys) inside Dockerfiles or Task Definitions is a critical security vulnerability. ECS provides native integration with AWS Secrets Manager and SSM Parameter Store.

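When you reference the ARN of a secret in the Task Definition, the ECS Agent retrieves the value when the container starts and injects it as an environment variable. Only the ARN appears in the Task Definition, so the secret value itself is never stored in the image or in the Task Definition metadata. Because the value is resolved once at startup, a rotated secret is only picked up by newly launched tasks.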
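
As an illustration, the sketch below assembles a container definition whose secrets list maps environment variable names to secret ARNs through the name and valueFrom fields. The helper name, the image URI, and the ARN are placeholders invented for this example.

```python
import json


def container_definition_with_secrets(name: str, image: str, secrets: dict) -> dict:
    """Build a container definition that references secrets by ARN only."""
    return {
        "name": name,
        "image": image,
        "essential": True,
        # Each entry names the environment variable and points at the secret ARN;
        # the secret value itself never appears in the definition.
        "secrets": [
            {"name": env_name, "valueFrom": arn}
            for env_name, arn in sorted(secrets.items())
        ],
    }


if __name__ == "__main__":
    definition = container_definition_with_secrets(
        name="enterprise-app",
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/enterprise-app:v1.0.0",
        secrets={
            # Placeholder ARN for illustration only.
            "DB_PASSWORD": "arn:aws:secretsmanager:us-east-1:123456789012:secret:example-db-password",
        },
    )
    print(json.dumps(definition, indent=2))
```

For this to work, the Task Execution Role needs permission to read the referenced secrets, since the retrieval happens before the application starts.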

ECS Networking Modes Explained

The networking mode specified in the Task Definition determines how containers communicate with each other, with the host, and with external networks.

1. `awsvpc` Mode (Fargate Standard)

This is the recommended mode for modern ECS deployments and the only mode supported on Fargate (on EC2, Linux tasks fall back to bridge mode unless you specify otherwise). In this mode, every running task is allocated its own Elastic Network Interface (ENI) and a dedicated private IP address from your VPC subnet.

Security: You can apply standard AWS Security Groups directly to each individual task, controlling inbound and outbound traffic at the container level.
Port Mapping: No port conflicts. Multiple tasks running the same application can listen on port 8080 on the same host, as they each have unique IP addresses.
Performance: Simplifies load balancing, as the ALB routes traffic directly to the task's private IP.

2. `bridge` Mode (EC2 Only)

This mode utilizes Docker's built-in virtual network bridge on the host EC2 instance.

Port Mapping: You must map a host port to the container port. If you use static host ports (e.g., mapping host port 80 to container port 80), you can only run one instance of that task per EC2 host.
Dynamic Port Mapping: To run multiple instances, you set the host port to 0. ECS automatically assigns a port from the host's ephemeral port range (typically 32768 and above) and registers this dynamic port with the Application Load Balancer target group.
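
The difference between static and dynamic host ports can be sketched with a toy model of a single EC2 host. This is an illustration only: the class names are hypothetical, and the ephemeral range used here is an assumed default that varies with the host's kernel settings.

```python
import random


class PortConflictError(RuntimeError):
    """Raised when a requested host port cannot be reserved."""


class Ec2Host:
    def __init__(self, ephemeral_ports=range(32768, 61000), seed=None):
        self._ephemeral_ports = ephemeral_ports  # assumed range, see lead-in
        self._mappings = {}                      # host port -> container port
        self._rng = random.Random(seed)

    def place_task(self, container_port: int, host_port: int = 0) -> int:
        """Reserve a host port for one task and return it (0 means dynamic)."""
        if host_port == 0:
            free = [p for p in self._ephemeral_ports if p not in self._mappings]
            if not free:
                raise PortConflictError("no free ephemeral ports left")
            host_port = self._rng.choice(free)
        elif host_port in self._mappings:
            raise PortConflictError(f"host port {host_port} is already in use")
        self._mappings[host_port] = container_port
        return host_port


if __name__ == "__main__":
    host = Ec2Host(seed=42)
    # Dynamic mapping: two copies of the same task coexist on one host.
    print(host.place_task(80), host.place_task(80))
    # Static mapping: the first task gets port 80, a second one would conflict.
    print(host.place_task(80, host_port=80))
```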

3. `host` Mode (EC2 Only)

The container skips Docker's network virtualization and binds directly to the host's network interface.

Performance: Offers the highest network performance as there is no virtualization overhead.
Drawback: Severe port conflict limitations. You cannot run multiple tasks of the same container on a single EC2 instance if they bind to the same port.

4. `none` Mode

The container has no external network connectivity. This is used for highly secure, isolated batch processing tasks that do not require access to the internet or other VPC resources.

Infrastructure as Code (IaC) with Terraform

To build a repeatable, audit-compliant infrastructure, we will use Terraform to provision our network dependencies, ECR repository, Application Load Balancer, ECS Fargate Cluster, Task Definitions with secure IAM configurations, and the ECS Service.

Create a file named main.tf and populate it with the following baseline configuration. To stay short, it deploys into the default VPC and serves plain HTTP; the inline comments mark the subnet settings to change for production.

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

# --- Variables ---
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type    = string
  default = "production"
}

variable "app_name" {
  type    = string
  default = "enterprise-app"
}

# --- VPC & Networking (Data Sources) ---
# For brevity, we reuse the account's default VPC instead of creating a new one.
# The data source below returns every subnet in that VPC (public by default).
data "aws_vpc" "default" {
  default = true
}

data "aws_subnets" "default_public" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

# --- Security Groups ---
resource "aws_security_group" "alb" {
  name        = "${var.app_name}-alb-sg"
  description = "Allow inbound HTTP traffic to ALB"
  vpc_id      = data.aws_vpc.default.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Environment = var.environment
  }
}

resource "aws_security_group" "ecs_tasks" {
  name        = "${var.app_name}-tasks-sg"
  description = "Allow inbound traffic from ALB only"
  vpc_id      = data.aws_vpc.default.id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Environment = var.environment
  }
}

# --- Amazon ECR Repository ---
resource "aws_ecr_repository" "app" {
  name                 = var.app_name
  image_tag_mutability = "IMMUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }

  encryption_configuration {
    encryption_type = "KMS"
  }

  tags = {
    Environment = var.environment
  }
}

# --- Application Load Balancer ---
resource "aws_lb" "main" {
  name               = "${var.app_name}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = data.aws_subnets.default_public.ids

  tags = {
    Environment = var.environment
  }
}

resource "aws_lb_target_group" "app" {
  name        = "${var.app_name}-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = data.aws_vpc.default.id
  target_type = "ip" # Required for awsvpc network mode

  health_check {
    path                = "/health"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 3
    unhealthy_threshold = 3
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# --- ECS Cluster ---
resource "aws_ecs_cluster" "main" {
  name = "${var.app_name}-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

# --- IAM Roles for ECS ---
# 1. Task Execution Role
resource "aws_iam_role" "ecs_execution_role" {
  name = "${var.app_name}-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "ecs-tasks.amazonaws.com" }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_execution" {
  role       = aws_iam_role.ecs_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# 2. Task Role
resource "aws_iam_role" "ecs_task_role" {
  name = "${var.app_name}-task-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "ecs-tasks.amazonaws.com" }
      }
    ]
  })
}

# Custom policy for Task Role to read from S3 (Example)
resource "aws_iam_policy" "task_custom_policy" {
  name        = "${var.app_name}-task-policy"
  description = "Permissions required by application containers at runtime"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:ListBucket"
        ]
        # Example only: restrict this to specific bucket and object ARNs in production
        Resource = "*"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_task_custom" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = aws_iam_policy.task_custom_policy.arn
}

# --- CloudWatch Log Group ---
resource "aws_cloudwatch_log_group" "ecs" {
  name              = "/ecs/${var.app_name}"
  retention_in_days = 30
}

# --- ECS Task Definition ---
resource "aws_ecs_task_definition" "app" {
  family                   = var.app_name
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256" # 0.25 vCPU
  memory                   = "512" # 512 MB
  execution_role_arn       = aws_iam_role.ecs_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_task_role.arn

  container_definitions = jsonencode([
    {
      name      = var.app_name
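      # Pin an explicit version tag: the repository above uses IMMUTABLE tags,
      # so a floating "latest" tag could not be re-pushed after the first upload.
      image     = "${aws_ecr_repository.app.repository_url}:v1.0.0"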
      essential = true
      portMappings = [
        {
          containerPort = 8080
          hostPort      = 8080
          protocol      = "tcp"
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.ecs.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "web"
        }
      }
    }
  ])
}

# --- ECS Service ---
resource "aws_ecs_service" "main" {
  name            = var.app_name
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  launch_type     = "FARGATE"
  desired_count   = 2

  network_configuration {
    subnets          = data.aws_subnets.default_public.ids # Replace with private subnets in production with NAT Gateways
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = true                               # Set to false if deploying to private subnets
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = var.app_name
    container_port   = 8080
  }

  # Prevent Terraform from resetting desired count if auto-scaling is active
  lifecycle {
    ignore_changes = [desired_count]
  }

  depends_on = [aws_lb_listener.http]
}

# --- Outputs ---
output "alb_dns_name" {
  value       = aws_lb.main.dns_name
  description = "Public URL to access the application"
}

output "ecr_repository_url" {
  value       = aws_ecr_repository.app.repository_url
  description = "The URL of the ECR repository"
}

CI/CD Pipeline Integration with GitHub Actions

In a DevOps-centric enterprise, manual deployments are an anti-pattern. The code below demonstrates a complete GitHub Actions CI/CD workflow. On every push to the main branch, this pipeline:

Authenticates to AWS using OpenID Connect (OIDC) - avoiding long-lived credentials.
Builds the Docker image.
Pushes the image to Amazon ECR.
Updates the ECS Task Definition with the new image tag.
Deploys the updated Task Definition to the ECS Service using a rolling update strategy.

Create a file at .github/workflows/deploy.yml:

name: Deploy to Amazon ECS

on:
  push:
    branches:
      - main

permissions:
  id-token: write # Required for requesting the JWT to assume AWS IAM Role via OIDC
  contents: read  # Required for checkout

env:
  AWS_REGION: us-east-1
  ROLE_TO_ASSUME: arn:aws:iam::123456789012:role/github-actions-ecs-deploy-role
  ECR_REPOSITORY: enterprise-app
  ECS_SERVICE: enterprise-app
  ECS_CLUSTER: enterprise-app-cluster
  ECS_TASK_DEFINITION: .aws/task-definition.json # Store your task definition template in this path
  CONTAINER_NAME: enterprise-app

jobs:
  deploy:
    name: Build, Push, and Deploy
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Source Code
        uses: actions/checkout@v4

      - name: Configure AWS Credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ env.ROLE_TO_ASSUME }}
          aws-region: ${{ env.AWS_REGION }}
          audience: sts.amazonaws.com

      - name: Log in to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build, Tag, and Push Image to Amazon ECR
        id: build-image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          # Build the Docker image
          docker build -t "$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" .
          # Push the image to Amazon ECR
          docker push "$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG"
          # Output the image URI for the next step
          echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> "$GITHUB_OUTPUT"

      - name: Fill in the new image ID in the Amazon ECS task definition
        id: render-task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ${{ env.ECS_TASK_DEFINITION }}
          container-name: ${{ env.CONTAINER_NAME }}
          image: ${{ steps.build-image.outputs.image }}

      - name: Deploy Amazon ECS Task Definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.render-task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
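To make the "render" step less of a black box, the following is a minimal Python sketch of the idea behind it: load the task definition template, swap the image of one named container, and leave everything else untouched. It is an illustration of the concept, not the action's actual implementation. The template contents and the short image tag abc1234 are hypothetical.

```python
import copy
import json


def render_task_definition(task_def: dict, container_name: str, image_uri: str) -> dict:
    """Return a copy of task_def with the named container's image replaced."""
    rendered = copy.deepcopy(task_def)  # never mutate the template itself
    matched = False
    for container in rendered.get("containerDefinitions", []):
        if container.get("name") == container_name:
            container["image"] = image_uri
            matched = True
    if not matched:
        raise ValueError(f"no container named {container_name!r} in task definition")
    return rendered


# Hypothetical template standing in for .aws/task-definition.json
template = {
    "family": "enterprise-app",
    "containerDefinitions": [
        {"name": "enterprise-app", "image": "placeholder", "essential": True}
    ],
}

rendered = render_task_definition(
    template,
    "enterprise-app",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/enterprise-app:abc1234",
)
print(json.dumps(rendered, indent=2))
```

Raising an error when no container matches mirrors a common failure in practice: a CONTAINER_NAME value that does not match the name in the template.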

Production Best Practices & Security Hardening

Operating containerized workloads at enterprise scale requires strict adherence to security and operational design principles.

1. Enforce Non-Root Execution

Containers run as the root user by default, so a compromised process starts with root privileges inside the container and has a shorter path to escalating onto the host. Always run the application as a dedicated non-root user, set either with the Dockerfile USER directive or with the user parameter of the container definition in the ECS task definition.

2. Read-Only Root Filesystem

Prevent runtime modification of application binaries by enabling a read-only root filesystem:

"readonlyRootFilesystem": true

If write access is required, mount temporary writable storage locations such as /tmp or dedicated ephemeral volumes.
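The first two practices can be checked before deployment. The sketch below, assuming a task definition already loaded as a Python dict, reports containers that do not set a non-root user or a read-only root filesystem. The container names web and worker are hypothetical, and the check only sees the task definition: an image whose Dockerfile already sets USER would still be flagged here.

```python
def hardening_findings(task_def: dict) -> list[str]:
    """List containers lacking a non-root user or a read-only root filesystem."""
    findings = []
    for container in task_def.get("containerDefinitions", []):
        name = container.get("name", "<unnamed>")
        # "user" may be "user", "user:group", "uid", or "uid:gid"; keep the user part.
        user = str(container.get("user", "")).split(":")[0]
        if user in ("", "0", "root"):
            findings.append(f"{name}: no non-root 'user' set in the task definition")
        if container.get("readonlyRootFilesystem") is not True:
            findings.append(f"{name}: root filesystem is writable")
    return findings


# Hypothetical task definition with one hardened and one unhardened container
task_def = {
    "containerDefinitions": [
        {"name": "web", "user": "1000:1000", "readonlyRootFilesystem": True},
        {"name": "worker"},
    ]
}

for finding in hardening_findings(task_def):
    print(finding)
```

A check like this could run as an early step in the deployment pipeline so that an unhardened task definition fails the build instead of reaching the cluster.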

3. Clean Up Your Images

Large Docker images increase deployment times and consume additional storage. To keep images lean and secure:

Use minimal base images such as alpine, distroless, or slim variants.
Remove unnecessary packages and build tools.
Leverage multi-stage Docker builds.
Exclude development artifacts using a .dockerignore file.
Regularly rebuild images to receive security patches.

Example .dockerignore

.git
.github
node_modules
coverage
*.log
.env
README.md

4. Enable Image Immutability

Prevent accidental overwrites of production images by enabling immutable tags in Amazon ECR.

aws ecr put-image-tag-mutability \
  --repository-name enterprise-app \
  --image-tag-mutability IMMUTABLE

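Once enabled, a push that reuses an existing tag is rejected with an ImageTagAlreadyExistsException, so every tag permanently identifies one image and deployments stay reproducible.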

5. Apply Least-Privilege IAM

Never grant wildcard permissions such as:

{
  "Action": "*",
  "Resource": "*"
}

Instead, scope permissions to only the AWS resources required by the application.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders"
    }
  ]
}
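As a rough illustration of how the two policies above differ, the following Python sketch flags Allow statements that use a wildcard action or resource. It is a simplified check under stated assumptions (it ignores NotAction, conditions, and partial wildcards inside ARNs) and is not a substitute for a tool such as IAM Access Analyzer.

```python
def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements whose Action or Resource is a wildcard."""

    def as_list(value):
        # IAM accepts either a single string or a list for these fields.
        return value if isinstance(value, list) else [value]

    flagged = []
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        wild_action = any(a == "*" or a.endswith(":*") for a in actions)
        wild_resource = "*" in resources
        if wild_action or wild_resource:
            flagged.append(statement)
    return flagged


overly_broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}
scoped = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        }
    ],
}

print(len(wildcard_statements(overly_broad)), "wildcard statement(s) in the broad policy")
print(len(wildcard_statements(scoped)), "wildcard statement(s) in the scoped policy")
```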

6. Encrypt Data Everywhere

Ensure encryption is enabled for:

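Amazon ECR repositories (AES-256 by default, or AWS KMS keys for tighter control)
ECS task secrets (stored in AWS Secrets Manager or SSM Parameter Store rather than plain environment variables)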
CloudWatch Logs
Amazon S3 buckets
Amazon RDS databases
Network traffic using TLS 1.2+

7. Use Private Subnets

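Production ECS tasks should run in private subnets without public IP addresses. Outbound connectivity should be provided through:

NAT gateways, for general internet egress
VPC interface endpoints (powered by AWS PrivateLink), for private access to AWS services such as ECR, CloudWatch Logs, and Secrets Manager
An S3 gateway endpoint, because ECR stores image layers in Amazon S3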

8. Configure Auto Scaling

ECS Service Auto Scaling ensures applications automatically respond to changes in demand.

Scale on CPU utilization
Scale on memory utilization
Scale on ALB request count per target
Scale on custom CloudWatch metrics
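CPU and memory scaling are usually expressed as target tracking policies. The sketch below builds the configuration document for one, using the predefined ECS metric types ECSServiceAverageCPUUtilization and ECSServiceAverageMemoryUtilization. The 60 percent target and the cooldown values are assumptions chosen for illustration, not recommendations.

```python
import json

PREDEFINED_METRICS = {
    "cpu": "ECSServiceAverageCPUUtilization",
    "memory": "ECSServiceAverageMemoryUtilization",
}


def target_tracking_config(
    metric: str,
    target_percent: float,
    scale_out_cooldown: int = 60,   # assumed value: seconds to wait after scaling out
    scale_in_cooldown: int = 300,   # assumed value: seconds to wait after scaling in
) -> dict:
    """Build a target tracking configuration for an ECS service metric."""
    if metric not in PREDEFINED_METRICS:
        raise ValueError(f"metric must be one of {sorted(PREDEFINED_METRICS)}")
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range (0, 100]")
    return {
        "TargetValue": float(target_percent),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": PREDEFINED_METRICS[metric]
        },
        "ScaleOutCooldown": scale_out_cooldown,
        "ScaleInCooldown": scale_in_cooldown,
    }


cpu_policy = target_tracking_config("cpu", 60)
print(json.dumps(cpu_policy, indent=2))
```

The resulting JSON has the shape expected by the --target-tracking-scaling-policy-configuration option of aws application-autoscaling put-scaling-policy. A longer scale-in cooldown than scale-out cooldown is a common choice because removing capacity too eagerly tends to cause oscillation.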

Monitoring, Logging, and Observability

Observability is critical for operating containerized workloads in production. ECS integrates natively with CloudWatch and can export telemetry to AWS X-Ray, OpenTelemetry collectors, Prometheus, and Grafana.

CloudWatch Container Insights

Container Insights provides cluster-level and task-level visibility.

aws ecs update-cluster-settings \
  --cluster enterprise-app-cluster \
  --settings name=containerInsights,value=enabled

Metrics collected include:

CPU utilization
Memory utilization
Network throughput
Running task count
Service deployment status

Centralized Logging

Configure ECS tasks to send application logs directly to CloudWatch Logs.

{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/enterprise-app",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "web"
    }
  }
}
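The awslogs driver forwards whatever the container writes to stdout and stderr, so the application only needs to log to the console. A minimal sketch, assuming a Python service, that emits one JSON object per line so the entries are easy to filter in CloudWatch Logs Insights. The logger name and field names are illustrative choices.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def build_logger(name: str = "enterprise-app", stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace handlers so repeated calls do not duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid a second copy via the root logger
    return logger


logger = build_logger()
logger.info("order %s created", 42)
```

Writing to stdout instead of a file keeps the container stateless and works unchanged with a read-only root filesystem.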

Distributed Tracing with AWS X-Ray

AWS X-Ray enables end-to-end request tracing across microservices.

Client
  ↓
Application Load Balancer
  ↓
ECS Service A
  ↓
ECS Service B
  ↓
Amazon RDS

This helps identify latency bottlenecks and service dependencies.

Prometheus and Grafana Integration

Enterprise teams often integrate ECS with Prometheus and Grafana to visualize application and infrastructure metrics. Metrics commonly scraped and charted include, for example:

http_requests_total
http_request_duration_seconds
container_cpu_usage_seconds_total
container_memory_usage_bytes

Troubleshooting and Debugging Guide

Issue: ECS Tasks Stuck in PENDING

Possible causes:

Insufficient CPU or memory capacity in the cluster
Invalid subnet configuration
Security group restrictions
ECR connectivity issues

Inspect the service's recent events for placement or networking errors:

aws ecs describe-services \
  --cluster enterprise-app-cluster \
  --services enterprise-app
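The describe-services response includes an events list per service, which is usually the quickest place to see why tasks are not starting. The sketch below, assuming the JSON output has been loaded into a Python dict, pulls out the events that look like failures. The marker phrases and the sample messages are illustrative assumptions, abbreviated from the kind of text ECS emits, not captured output.

```python
FAILURE_MARKERS = ("unable to", "failed")  # assumed phrases; extend as needed


def failure_events(describe_output: dict, limit: int = 5) -> list[str]:
    """Return up to `limit` service event messages that look like failures."""
    messages = []
    for service in describe_output.get("services", []):
        for event in service.get("events", []):
            message = event.get("message", "")
            if any(marker in message.lower() for marker in FAILURE_MARKERS):
                messages.append(message)
    return messages[:limit]


# Illustrative sample shaped like the describe-services response
sample = {
    "services": [
        {
            "serviceName": "enterprise-app",
            "events": [
                {"message": "(service enterprise-app) has reached a steady state."},
                {
                    "message": "(service enterprise-app) was unable to place a task "
                    "because no container instance met all of its requirements."
                },
            ],
        }
    ]
}

for message in failure_events(sample):
    print(message)
```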

Issue: Cannot Pull Container Image

Common reasons include:

Missing ECR permissions
Incorrect image tag
Missing VPC endpoints
Network ACL restrictions

Issue: ALB Health Check Failures

Verify:

Correct health endpoint path
Security group configuration
Container port mapping
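Application startup completes within the service's health check grace period (healthCheckGracePeriodSeconds)

A minimal Spring Boot health endpoint for the ALB to probe, declared inside a @RestController class:

// Returns HTTP 200 with the body "UP" once the application is serving requests
@GetMapping("/health")
public ResponseEntity<String> health() {
    return ResponseEntity.ok("UP");
}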