AWS DevOps Masterclass: Containerization Basics with Amazon ECR and ECS
A comprehensive, production-grade guide to containerizing, storing, and orchestrating enterprise applications at scale using Amazon Elastic Container Registry (ECR) and Amazon Elastic Container Service (ECS).
Table of Contents
- Introduction to Container Orchestration on AWS
- What You Will Learn
- Prerequisites
- Enterprise Architecture Overview
- Deep Dive: Amazon Elastic Container Registry (ECR)
- Deep Dive: Amazon Elastic Container Service (ECS)
- ECS Networking Modes Explained
- Infrastructure as Code (IaC) with Terraform
- CI/CD Pipeline Integration with GitHub Actions
- Production Best Practices & Security Hardening
- Monitoring, Logging, and Observability
- Troubleshooting and Debugging Guide
- Technical Interview Questions & Answers
- Frequently Asked Questions (FAQs)
- Summary and Next Steps
Introduction to Container Orchestration on AWS
In modern cloud-native engineering, containerization is the foundation of scalable, predictable, and isolated application delivery. While running a single Docker container on a local machine is straightforward, running thousands of containers across a resilient, distributed infrastructure requires a robust container orchestration engine.
What is Amazon ECS? Amazon Elastic Container Service (ECS) is a highly scalable, high-performance container orchestration service that allows you to run, stop, and manage Docker containers on a cluster. ECS eliminates the need for you to install, operate, and scale your own container orchestration infrastructure.
What is Amazon ECR? Amazon Elastic Container Registry (ECR) is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. ECR is integrated with ECS, simplifying your development-to-production workflow.
Quick Definition:
Amazon ECS is AWS's opinionated, highly integrated container orchestrator that manages the lifecycle of containerized applications. Its two main launch types are AWS Fargate (a serverless compute engine where AWS manages the underlying hosts) and the EC2 launch type (where you manage and scale your own fleet of EC2 instances). Amazon ECR acts as the secure, private storage repository for the Docker images that ECS pulls to run these workloads.
For enterprise workloads, choosing between ECS and Kubernetes (EKS) often comes down to operational complexity. ECS offers deep integration with AWS-native services (such as IAM, CloudWatch, Route 53, and Application Load Balancers) without the administrative overhead of managing Kubernetes control planes, custom resource definitions (CRDs), and complex networking plugins.
What You Will Learn
By the end of this comprehensive guide, you will be able to:
- Design and deploy highly secure, private Amazon ECR registries with advanced lifecycle policies and vulnerability scanning.
- Architect an Amazon ECS cluster utilizing both AWS Fargate and EC2 capacity providers.
- Draft production-ready Dockerfiles using multi-stage builds to minimize attack surface and image size.
- Configure ECS Task Definitions with explicit IAM task roles, security groups, log configurations, and secrets management.
- Deploy ECS Services with Application Load Balancers (ALBs), auto-scaling policies, and rolling update strategies.
- Provision the entire container infrastructure using declarative Terraform code.
- Build an automated CI/CD pipeline using GitHub Actions to build, push, and deploy containerized applications.
- Troubleshoot common production failures such as container crash loops, ALB health check failures, and IAM permission errors.
Prerequisites
To successfully implement the patterns in this guide, you should have:
- An active AWS Account with administrative privileges or permissions to create IAM Roles, VPCs, ECR Repositories, ECS Clusters, and ALBs.
- Local installation of the AWS CLI v2, configured with valid credentials.
- Docker Desktop or Docker Engine installed locally for building and testing container images.
- Terraform CLI (v1.5.0 or later) installed for Infrastructure as Code provisioning.
- A basic understanding of network concepts (subnets, route tables, security groups, and load balancers). You can review our VPC Architecture and Networking Guide to refresh your knowledge.
Enterprise Architecture Overview
A production-grade ECS architecture requires a multi-Availability Zone (AZ) design. Containers must run within private subnets, completely isolated from the public internet. Ingress traffic is strictly controlled via an internet-facing Application Load Balancer (ALB) located in public subnets, which routes traffic to the ECS tasks.
The diagram below illustrates the complete architecture, including the secure image storage layer (ECR), the serverless execution layer (Fargate), secure secrets retrieval, and private network communication via VPC Endpoints.
+----------------------------------------------------------------------------+
| AWS Cloud                                                                  |
|  +----------------------------------------------------------------------+  |
|  | VPC (10.0.0.0/16)                                                    |  |
|  |  +--- Public Subnets (AZ1 & AZ2) ---------------------------------+  |  |
|  |  | NAT Gateway (AZ1)               NAT Gateway (AZ2)              |  |  |
|  |  | Application Load Balancer (ALB), internet-facing               |  |  |
|  |  +----------------------------------------------------------------+  |  |
|  |                                  |                                   |  |
|  |                                  v                                   |  |
|  |  +--- Private Subnets (AZ1 & AZ2) --------------------------------+  |  |
|  |  | ECS Cluster (Fargate)                                          |  |  |
|  |  | ECS Task (AZ1)                  ECS Task (AZ2)                 |  |  |
|  |  | - App (Port 8080)               - App (Port 8080)              |  |  |
|  |  | - Private IP: 10.0.1.45         - Private IP: 10.0.2.112       |  |  |
|  |  |                               v                                |  |  |
|  |  | VPC Endpoints (PrivateLink)                                    |  |  |
|  |  | - ECR API / ECR DKR                                            |  |  |
|  |  | - CloudWatch Logs / S3                                         |  |  |
|  |  +----------------------------------------------------------------+  |  |
|  +----------------------------------------------------------------------+  |
|                                     v                                      |
| Amazon ECR Registry       AWS Secrets Manager       Amazon CloudWatch      |
| - Secure Container Images - Database Credentials    - Container Insights   |
| - KMS Encrypted & Scanned - API Keys                - Log Streams          |
+----------------------------------------------------------------------------+
In this design:
- VPC Endpoints (AWS PrivateLink): Ensure that even if NAT Gateways fail or are omitted for cost/security optimization, Fargate tasks can securely pull images from ECR, stream logs to CloudWatch, and fetch secrets from Secrets Manager without traversing the public internet.
- Separation of Concerns: The ALB is the only resource exposed to the public internet. It performs TLS termination and forwards traffic to the backend Fargate tasks via target groups using private IP addresses.
- Multi-AZ Redundancy: The ECS service scheduler spreads tasks across the Availability Zones of the subnets you configure, maintaining high availability in the event of an AZ-level outage.
Deep Dive: Amazon Elastic Container Registry (ECR)
Amazon ECR is more than just a storage bucket for Docker images. It is an enterprise-grade registry that provides integrated vulnerability scanning, image immutability, fine-grained access control via AWS IAM, and cross-region replication.
Private vs. Public Repositories
ECR supports both public and private repositories. Public repositories (listed in the Amazon ECR Public Gallery at gallery.ecr.aws and pulled from public.ecr.aws) are globally accessible and ideal for open-source projects. Private repositories require AWS authentication via IAM and are designed for internal proprietary applications.
Image Security and KMS Encryption
By default, ECR encrypts images at rest using Amazon S3-managed encryption keys (SSE-S3). For strict compliance environments (HIPAA, PCI-DSS, FedRAMP), you should configure ECR to use Customer Managed Keys (CMK) stored in AWS Key Management Service (KMS). This ensures you have full audit control over who can decrypt and pull the container images.
Vulnerability Scanning
ECR offers two levels of vulnerability scanning:
- Basic Scanning: Originally powered by the open-source Clair engine, this scans an image for known CVEs when it is pushed or on demand. It is included at no additional charge, subject to limits on how often an image can be rescanned.
- Enhanced Scanning: Integrated with Amazon Inspector, this provides continuous scanning of your repository images. It automatically scans images when pushed and continuously monitors them for new CVEs (Common Vulnerabilities and Exposures) as database definitions update.
Lifecycle Policies
As your CI/CD pipelines build images on every commit, storage costs can escalate. ECR Lifecycle Policies allow you to define rules to automatically clean up old, unused, or untagged images.
Here is an example of an ECR Lifecycle Policy that expires untagged images more than 7 days after they were pushed and keeps only the 30 most recent images tagged with the release- prefix:
{
"rules": [
{
"rulePriority": 1,
"description": "Expire untagged images older than 7 days",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 7
},
"action": {
"type": "expire"
}
},
{
"rulePriority": 2,
"description": "Keep only the last 30 release images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["release-"],
"countType": "imageCountMoreThan",
"countNumber": 30
},
"action": {
"type": "expire"
}
}
]
}
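To make the rule semantics concrete, the following is a simplified Python model of how the two rules above select images. It is a sketch, not ECR's actual evaluation engine: the `Image` class and `images_to_expire` function are hypothetical names, and the model assumes that rules run in priority order and that an image matched by an earlier rule is not reconsidered by a later one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Image:
    digest: str
    pushed_at: datetime
    tags: tuple = ()


def images_to_expire(images, now, untagged_max_age_days=7, keep_release_count=30):
    """Return the digests the two example rules would expire (simplified model)."""
    expired = set()

    # Rule 1: untagged images pushed more than N days ago.
    cutoff = now - timedelta(days=untagged_max_age_days)
    for img in images:
        if not img.tags and img.pushed_at < cutoff:
            expired.add(img.digest)

    # Rule 2: among images carrying a "release-" tag, keep only the newest N.
    releases = [
        img for img in images
        if img.digest not in expired
        and any(tag.startswith("release-") for tag in img.tags)
    ]
    releases.sort(key=lambda img: img.pushed_at, reverse=True)
    for img in releases[keep_release_count:]:
        expired.add(img.digest)

    return expired


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    sample = [Image("old-untagged", now - timedelta(days=10))]
    print(images_to_expire(sample, now))
```

Images with other tag prefixes (for example `dev-`) match neither rule in this model and are therefore retained indefinitely, which is worth keeping in mind when choosing tag conventions.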
Practical Hands-on: Pushing Your First Image
Let's walk through the exact steps to build a multi-stage image for a Node.js application, create an ECR repository, authenticate, and push the image.
Step 1: The Production-Grade Dockerfile
We use a multi-stage Docker build to ensure our final production image contains only the runtime dependencies, reducing the attack surface and download size.
# --- Build Stage ---
FROM node:18-alpine AS builder
WORKDIR /usr/src/app
# Copy dependency manifests
COPY package*.json ./
# Install ALL dependencies (including devDependencies for compilation)
RUN npm ci
# Copy application source
COPY . .
# Run build step (e.g., compile TypeScript)
RUN npm run build
# Prune dev dependencies to keep production image light
# (--omit=dev replaces the deprecated --production flag in current npm releases)
RUN npm prune --omit=dev
# --- Production Stage ---
FROM node:18-alpine
WORKDIR /usr/src/app
# Set Node environment to production
ENV NODE_ENV=production
# Copy only runtime dependencies and compiled build artifacts from builder stage
COPY --from=builder /usr/src/app/node_modules ./node_modules
COPY --from=builder /usr/src/app/dist ./dist
COPY --from=builder /usr/src/app/package*.json ./
# Run as a non-root user for security hardening
USER node
# Expose the application port
EXPOSE 8080
# Define container health check (used by Docker locally; ECS only acts on a healthCheck set in the task definition)
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
CMD ["node", "dist/index.js"]
Step 2: Authenticate and Push to ECR
Execute the following shell commands to provision the repository, log in, tag, and push your image. Replace 123456789012 with your AWS Account ID and us-east-1 with your target region.
# Set environment variables
AWS_ACCOUNT_ID="123456789012"
AWS_REGION="us-east-1"
REPO_NAME="enterprise-app"
# 1. Create the ECR repository with image scanning and KMS encryption enabled
aws ecr create-repository \
--repository-name ${REPO_NAME} \
--image-scanning-configuration scanOnPush=true \
--encryption-configuration encryptionType=KMS \
--region ${AWS_REGION}
# 2. Authenticate the local Docker daemon to your private ECR registry
aws ecr get-login-password --region ${AWS_REGION} | \
docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
# 3. Build the Docker image locally
docker build -t ${REPO_NAME}:latest .
# 4. Tag the image with the ECR repository URI
docker tag ${REPO_NAME}:latest ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:v1.0.0
docker tag ${REPO_NAME}:latest ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest
# 5. Push the images to Amazon ECR
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:v1.0.0
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest
Deep Dive: Amazon Elastic Container Service (ECS)
Amazon ECS is built on a highly optimized, state-driven control plane. Understanding its component hierarchy is crucial for designing stable deployments.
The ECS Object Hierarchy
- Cluster: A logical grouping of tasks or services. Clusters can run tasks on EC2 instances, AWS Fargate, or external on-premises servers (ECS Anywhere).
- Task Definition: A blueprint (JSON format) that describes one or more containers (up to 10) that make up your application. It defines parameters such as CPU, memory, Docker images, port mappings, storage volumes, and log configurations.
- Task: The instantiation of a Task Definition within a cluster. Think of a Task Definition as the "Class" and a Task as the running "Object Instance".
- Service: The scheduler that keeps a specified number of tasks (the desired count) running simultaneously in an ECS cluster. If a task fails, the service scheduler replaces it automatically. It also integrates with Load Balancers, Service Discovery (Cloud Map), and Service Meshes (App Mesh).
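The relationship between a Service, its desired count, and its tasks can be illustrated with a toy reconciliation loop. This is only a conceptual sketch: `ServiceScheduler` and its methods are hypothetical names, and the real ECS scheduler also handles placement, deployments, and health checks, which are ignored here.

```python
from itertools import count


class ServiceScheduler:
    """Toy model of an ECS Service keeping `desired_count` tasks running."""

    def __init__(self, task_definition, desired_count):
        self.task_definition = task_definition
        self.desired_count = desired_count
        self.running = {}  # task id -> task definition the task was started from
        self._ids = count(1)

    def reconcile(self):
        """Start or stop tasks until the running count matches the desired count."""
        started, stopped = [], []
        while len(self.running) < self.desired_count:
            task_id = f"task-{next(self._ids)}"
            self.running[task_id] = self.task_definition
            started.append(task_id)
        while len(self.running) > self.desired_count:
            task_id, _ = self.running.popitem()
            stopped.append(task_id)
        return started, stopped

    def task_failed(self, task_id):
        """Simulate a crashed task disappearing from the cluster."""
        self.running.pop(task_id, None)


if __name__ == "__main__":
    service = ServiceScheduler("enterprise-app:1", desired_count=2)
    print(service.reconcile())    # two tasks are started
    service.task_failed("task-1")
    print(service.reconcile())    # one replacement task is started
```

Each task here is an instance of the same task definition, mirroring the "Class" and "Object Instance" analogy above.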
AWS Fargate vs. EC2 Launch Type
The choice of launch type dictates your operational overhead and cost structure:
| Feature | AWS Fargate (Serverless) | EC2 Launch Type (Self-Managed Instances) |
|---|---|---|
| Infrastructure Management | None. AWS provisions, configures, and scales the virtual machines. | High. You manage the EC2 instances, patching, OS updates, and Docker agents. |
| Isolation Model | Hypervisor-level isolation. Each task runs in its own dedicated VM. | OS-level isolation. Multiple tasks share the same EC2 host instance. |
| Billing Model | Pay per vCPU and GB memory per second allocated to the running task. | Pay for the underlying EC2 instances, regardless of container utilization. |
| Scaling Speed | Fast. There is no instance provisioning lag, although image pull time still applies. | Requires Auto Scaling Groups to spin up new EC2 instances if the cluster runs out of capacity. |
| Customization | Limited. Cannot access host OS, custom kernels, or mount raw block devices. | Full root access to host OS. Support for custom AMIs, SSH access, and daemon services that run one task per instance. |
Task Execution Role vs. Task Role
One of the most common configuration mistakes in ECS is confusing the two IAM roles assigned to a Task Definition:
- Task Execution Role (execution_role_arn): This role is used by the ECS Agent (the underlying system worker) before the container starts. It grants permission to pull images from Amazon ECR, stream logs to CloudWatch Logs, and retrieve secrets from AWS Secrets Manager or Systems Manager Parameter Store.
- Task Role (task_role_arn): This role is used by the application inside your container once it is running. For example, if your Node.js application needs to read files from an S3 bucket or write items to a DynamoDB table, you must grant those permissions to the Task Role.
Decoupling Secrets Management
Hardcoding credentials (database passwords, API keys) inside Dockerfiles or Task Definitions is a critical security vulnerability. ECS provides native integration with AWS Secrets Manager and SSM Parameter Store.
When you reference the ARN of a secret in the Task Definition, the ECS Agent retrieves the value at task startup (using the Task Execution Role) and injects it as an environment variable into the container. The Task Definition stores only the ARN, never the secret value itself.
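As an illustration, the sketch below builds a container definition in Python that references secrets by ARN through the `secrets` list (`name` and `valueFrom` are the field names used in ECS container definitions). The helper name `container_definition` and both ARNs are hypothetical placeholders, not values from this guide.

```python
import json


def container_definition(name, image, secrets):
    """Build a container definition whose secrets are referenced by ARN only."""
    return {
        "name": name,
        "image": image,
        "essential": True,
        # Each entry maps an environment variable name to the ARN of a secret;
        # the secret value itself never appears in the task definition.
        "secrets": [
            {"name": env_name, "valueFrom": arn}
            for env_name, arn in sorted(secrets.items())
        ],
    }


definition = container_definition(
    name="enterprise-app",
    image="123456789012.dkr.ecr.us-east-1.amazonaws.com/enterprise-app:v1.0.0",
    secrets={
        # Hypothetical ARNs, for illustration only.
        "DB_PASSWORD": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password",
        "API_KEY": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/api-key",
    },
)

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

For this to work at runtime, the Task Execution Role would also need permission to read the referenced secrets (for example `secretsmanager:GetSecretValue` and `ssm:GetParameters`).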
ECS Networking Modes Explained
The networking mode specified in the Task Definition determines how containers communicate with each other, with the host, and with external networks.
1. awsvpc Mode (Fargate Standard)
This is the recommended mode for modern ECS deployments and the only mode supported on Fargate (on EC2, Linux tasks fall back to bridge mode when no mode is specified). In this mode, every running task is allocated its own Elastic Network Interface (ENI) and a dedicated private IP address from your VPC subnet.
- Security: You can apply standard AWS Security Groups directly to each individual task, controlling inbound and outbound traffic at the container level.
- Port Mapping: No port conflicts. Multiple tasks running the same application can listen on port 8080 on the same host, as they each have unique IP addresses.
- Performance: Simplifies load balancing, as the ALB routes traffic directly to the task's private IP.
2. bridge Mode (EC2 Only)
This mode utilizes Docker's built-in virtual network bridge on the host EC2 instance.
- Port Mapping: You must map a host port to the container port. If you use static host ports (e.g., mapping host port 80 to container port 80), you can only run one instance of that task per EC2 host.
- Dynamic Port Mapping: To run multiple instances, you set the host port to 0. ECS automatically assigns a random high-number port (e.g., 32768 to 65535) on the host, and registers this dynamic port with the Application Load Balancer target group.
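The difference between static and dynamic host ports can be sketched with a small simulation. This is a toy model rather than ECS's real allocator: `Ec2Host` and `place_task` are hypothetical names, and the port range simply reuses the example range quoted above.

```python
import random


class Ec2Host:
    """Toy model of host-port allocation for bridge-mode tasks on one EC2 instance."""

    def __init__(self, ephemeral_range=(32768, 65535), seed=None):
        self.ephemeral_range = ephemeral_range
        self.used_ports = set()
        self._rng = random.Random(seed)

    def place_task(self, container_port, host_port=0):
        """Reserve a host port for a task; host_port=0 requests a dynamic port."""
        if host_port == 0:
            low, high = self.ephemeral_range
            free = [p for p in range(low, high + 1) if p not in self.used_ports]
            if not free:
                raise RuntimeError("no free ephemeral ports left on this host")
            host_port = self._rng.choice(free)
        elif host_port in self.used_ports:
            raise RuntimeError(f"host port {host_port} is already in use")
        self.used_ports.add(host_port)
        # This (host_port, container_port) pair is what would be registered
        # with the load balancer target group.
        return host_port, container_port


if __name__ == "__main__":
    host = Ec2Host(seed=1)
    print(host.place_task(container_port=8080))  # dynamic host port
    print(host.place_task(container_port=8080))  # a different dynamic host port
```

Placing a second task with the same static host port raises an error in this model, which mirrors the one-task-per-host limitation of static mappings.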
3. host Mode (EC2 Only)
The container bypasses the Docker host's network isolation and maps directly to the host's network interface.
- Performance: Offers the highest network performance as there is no virtualization overhead.
- Drawback: Severe port conflict limitations. You cannot run multiple tasks of the same container on a single EC2 instance if they bind to the same port.
4. none Mode
The container has no external network connectivity. This is used for highly secure, isolated batch processing tasks that do not require access to the internet or other VPC resources.
Infrastructure as Code (IaC) with Terraform
To build a repeatable, auditable infrastructure, we will use Terraform to provision the security groups, ECR repository, Application Load Balancer, ECS Fargate cluster, Task Definition with separate IAM roles, and the ECS Service. To keep the example short, it reuses the account's default VPC rather than creating the private-subnet network shown in the architecture overview, and it does not define auto-scaling policies.
Create a file named main.tf and populate it with the following baseline configuration.
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = var.aws_region
}
# --- Variables ---
variable "aws_region" {
type = string
default = "us-east-1"
}
variable "environment" {
type = string
default = "production"
}
variable "app_name" {
type = string
default = "enterprise-app"
}
# --- VPC & Networking (Data Sources) ---
# For brevity, we reuse the account's default VPC instead of creating a new one.
# Default VPC subnets are public, so this data source returns public subnets only.
data "aws_vpc" "default" {
default = true
}
data "aws_subnets" "default_public" {
filter {
name = "vpc-id"
values = [data.aws_vpc.default.id]
}
}
# --- Security Groups ---
resource "aws_security_group" "alb" {
name = "${var.app_name}-alb-sg"
description = "Allow inbound HTTP/HTTPS traffic to ALB"
vpc_id = data.aws_vpc.default.id
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Environment = var.environment
}
}
resource "aws_security_group" "ecs_tasks" {
name = "${var.app_name}-tasks-sg"
description = "Allow inbound traffic from ALB only"
vpc_id = data.aws_vpc.default.id
ingress {
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Environment = var.environment
}
}
# --- Amazon ECR Repository ---
resource "aws_ecr_repository" "app" {
name = var.app_name
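image_tag_mutability = "IMMUTABLE" # A tag can be pushed only once, so deploy unique tags (e.g., the commit SHA) instead of re-pushing "latest"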
image_scanning_configuration {
scan_on_push = true
}
encryption_configuration {
encryption_type = "KMS"
}
tags = {
Environment = var.environment
}
}
# --- Application Load Balancer ---
resource "aws_lb" "main" {
name = "${var.app_name}-alb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = data.aws_subnets.default_public.ids
tags = {
Environment = var.environment
}
}
resource "aws_lb_target_group" "app" {
name = "${var.app_name}-tg"
port = 8080
protocol = "HTTP"
vpc_id = data.aws_vpc.default.id
target_type = "ip" # Required for awsvpc network mode
health_check {
path = "/health"
protocol = "HTTP"
matcher = "200"
interval = 30
timeout = 5
healthy_threshold = 3
unhealthy_threshold = 3
}
}
resource "aws_lb_listener" "http" {
load_balancer_arn = aws_lb.main.arn
port = "80"
protocol = "HTTP"
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.app.arn
}
}
# --- ECS Cluster ---
resource "aws_ecs_cluster" "main" {
name = "${var.app_name}-cluster"
setting {
name = "containerInsights"
value = "enabled"
}
}
# --- IAM Roles for ECS ---
# 1. Task Execution Role
resource "aws_iam_role" "ecs_execution_role" {
name = "${var.app_name}-execution-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "ecs-tasks.amazonaws.com" }
}
]
})
}
resource "aws_iam_role_policy_attachment" "ecs_execution" {
role = aws_iam_role.ecs_execution_role.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
# 2. Task Role
resource "aws_iam_role" "ecs_task_role" {
name = "${var.app_name}-task-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "ecs-tasks.amazonaws.com" }
}
]
})
}
# Custom policy for the Task Role: read-only S3 access (example).
# In production, replace Resource = "*" with the specific bucket ARNs the application needs.
resource "aws_iam_policy" "task_custom_policy" {
name = "${var.app_name}-task-policy"
description = "Permissions required by application containers at runtime"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:ListBucket"
]
Resource = "*"
}
]
})
}
resource "aws_iam_role_policy_attachment" "ecs_task_custom" {
role = aws_iam_role.ecs_task_role.name
policy_arn = aws_iam_policy.task_custom_policy.arn
}
# --- CloudWatch Log Group ---
resource "aws_cloudwatch_log_group" "ecs" {
name = "/ecs/${var.app_name}"
retention_in_days = 30
}
# --- ECS Task Definition ---
resource "aws_ecs_task_definition" "app" {
family = var.app_name
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "256" # 0.25 vCPU
memory = "512" # 512 MB
execution_role_arn = aws_iam_role.ecs_execution_role.arn
task_role_arn = aws_iam_role.ecs_task_role.arn
container_definitions = jsonencode([
{
name = var.app_name
image = "${aws_ecr_repository.app.repository_url}:latest"
essential = true
portMappings = [
{
containerPort = 8080
hostPort = 8080
protocol = "tcp"
}
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.ecs.name
"awslogs-region" = var.aws_region
"awslogs-stream-prefix" = "web"
}
}
}
])
}
# --- ECS Service ---
resource "aws_ecs_service" "main" {
name = var.app_name
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.app.arn
launch_type = "FARGATE"
desired_count = 2
network_configuration {
subnets = data.aws_subnets.default_public.ids # Replace with private subnets in production with NAT Gateways
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = true # Set to false if deploying to private subnets
}
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = var.app_name
container_port = 8080
}
# Prevent Terraform from resetting desired count if auto-scaling is active
lifecycle {
ignore_changes = [desired_count]
}
depends_on = [aws_lb_listener.http]
}
# --- Outputs ---
output "alb_dns_name" {
value = aws_lb.main.dns_name
description = "Public URL to access the application"
}
output "ecr_repository_url" {
value = aws_ecr_repository.app.repository_url
description = "The URL of the ECR repository"
}
CI/CD Pipeline Integration with GitHub Actions
In a DevOps-centric enterprise, manual deployments are an anti-pattern. The code below demonstrates a complete GitHub Actions CI/CD workflow. On every push to the main branch, this pipeline:
- Authenticates to AWS using OpenID Connect (OIDC) - avoiding long-lived credentials.
- Builds the Docker image.
- Pushes the image to Amazon ECR.
- Updates the ECS Task Definition with the new image tag.
- Deploys the updated Task Definition to the ECS Service using a rolling update strategy.
Create a file at .github/workflows/deploy.yml:
name: Deploy to Amazon ECS
on:
  push:
    branches:
      - main
permissions:
  id-token: write # Required for requesting the JWT to assume AWS IAM Role via OIDC
  contents: read # Required for checkout
env:
  AWS_REGION: us-east-1
  ROLE_TO_ASSUME: arn:aws:iam::123456789012:role/github-actions-ecs-deploy-role
  ECR_REPOSITORY: enterprise-app
  ECS_SERVICE: enterprise-app
  ECS_CLUSTER: enterprise-app-cluster
  ECS_TASK_DEFINITION: .aws/task-definition.json # Store your task definition template in this path
  CONTAINER_NAME: enterprise-app
jobs:
  deploy:
    name: Build, Push, and Deploy
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Source Code
        uses: actions/checkout@v3
      - name: Configure AWS Credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ env.ROLE_TO_ASSUME }}
          aws-region: ${{ env.AWS_REGION }}
          audience: sts.amazonaws.com
      - name: Log in to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1
      - name: Build, Tag, and Push Image to Amazon ECR
        id: build-image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          # Build the docker container
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          # Push the image to Amazon ECR
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          # Output the image URI for the next step
          echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT
      - name: Fill in the new image ID in the Amazon ECS task definition
        id: render-task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ${{ env.ECS_TASK_DEFINITION }}
          container-name: ${{ env.CONTAINER_NAME }}
          image: ${{ steps.build-image.outputs.image }}
      - name: Deploy Amazon ECS Task Definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render-task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
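Conceptually, the render step in this workflow takes the task definition template and swaps in the freshly pushed image URI for the named container. The Python sketch below mimics that idea only; `render_task_definition` is a hypothetical helper and does not reproduce the action's actual implementation.

```python
import copy


def render_task_definition(task_definition, container_name, image):
    """Return a copy of the task definition with the named container's image replaced."""
    rendered = copy.deepcopy(task_definition)
    matches = [
        container
        for container in rendered.get("containerDefinitions", [])
        if container.get("name") == container_name
    ]
    if not matches:
        raise ValueError(f"no container named {container_name!r} in task definition")
    for container in matches:
        container["image"] = image
    return rendered


if __name__ == "__main__":
    template = {
        "family": "enterprise-app",
        "containerDefinitions": [{"name": "enterprise-app", "image": "placeholder"}],
    }
    # The image URI below uses a made-up commit SHA as its tag.
    new_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/enterprise-app:abc123"
    print(render_task_definition(template, "enterprise-app", new_image))
```

Tagging each build with the commit SHA, as the workflow does, means every deployment registers a new task definition revision that points at exactly one image.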
Production Best Practices & Security Hardening
Operating containerized workloads at enterprise scale requires strict adherence to security and operational design principles.
1. Enforce Non-Root Execution
Containers run as the root user by default. If a container is compromised, an attacker may attempt privilege escalation. Always configure a dedicated non-root user using the Dockerfile USER directive or ECS Task Definition configuration.
2. Read-Only Root Filesystem
Prevent runtime modification of application binaries by enabling a read-only root filesystem:
"readonlyRootFilesystem": true
If write access is required, mount temporary writable storage locations such as /tmp or dedicated ephemeral volumes.
3. Clean Up Your Images
Large Docker images increase deployment times and consume additional storage. To keep images lean and secure:
- Use minimal base images such as alpine, distroless, or slim variants.
- Remove unnecessary packages and build tools.
- Leverage multi-stage Docker builds.
- Exclude development artifacts using a .dockerignore file.
- Regularly rebuild images to receive security patches.
Example .dockerignore
.git
.github
node_modules
coverage
*.log
.env
README.md
4. Enable Image Immutability
Prevent accidental overwrites of production images by enabling immutable tags in Amazon ECR.
aws ecr put-image-tag-mutability \
--repository-name enterprise-app \
--image-tag-mutability IMMUTABLE
Once enabled, image tags cannot be overwritten, ensuring deployment integrity.
5. Apply Least-Privilege IAM
Never grant wildcard permissions such as:
{
"Action": "*",
"Resource": "*"
}
Instead, scope permissions to only the AWS resources required by the application.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem"
],
"Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders"
}
]
}
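A simple automated check can catch the most obvious violations before a policy is deployed. The sketch below is a minimal illustration with a hypothetical `wildcard_findings` helper: it flags only Allow statements whose Action or Resource is the literal `*`, and does not detect partial wildcards such as `s3:*`.

```python
def wildcard_findings(policy):
    """Return (statement_index, key) pairs where an Allow statement uses a bare '*'."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # IAM allows a single statement object instead of a list.
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = statement.get(key, [])
            if isinstance(values, str):
                values = [values]
            if "*" in values:
                findings.append((index, key))
    return findings


if __name__ == "__main__":
    risky = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
    }
    print(wildcard_findings(risky))
```

A check like this could run in CI against the policy documents in a repository, failing the build when the returned list is not empty.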
6. Encrypt Data Everywhere
Ensure encryption is enabled for:
- Amazon ECR repositories (KMS encryption)
- ECS task secrets
- CloudWatch Logs
- Amazon S3 buckets
- Amazon RDS databases
- Network traffic using TLS 1.2+
7. Use Private Subnets
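Production ECS tasks should execute inside private subnets without public IP addresses. Outbound access should go through:
- NAT Gateways, for destinations on the public internet
- VPC interface endpoints (AWS PrivateLink), for AWS services such as ECR, CloudWatch Logs, and Secrets Manager
- An S3 gateway endpoint, which ECR image layer downloads rely on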
8. Configure Auto Scaling
ECS Service Auto Scaling ensures applications automatically respond to changes in demand.
- Scale on CPU utilization
- Scale on memory utilization
- Scale on request count
- Scale on custom CloudWatch metrics
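As a sketch of the first option, the Python function below assembles the request body for a CPU-based target tracking policy in the shape expected by the Application Auto Scaling PutScalingPolicy API. The function name, the 70% target, and the cooldown values are assumptions chosen for illustration, not recommendations from this guide.

```python
import json


def cpu_target_tracking_policy(cluster, service, target_percent=70.0,
                               scale_out_cooldown=60, scale_in_cooldown=300):
    """Build a target tracking policy that scales an ECS service on average CPU."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be greater than 0 and at most 100")
    return {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_percent),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            # Cooldowns are in seconds; scaling in more slowly than scaling out
            # is a common way to avoid flapping.
            "ScaleOutCooldown": scale_out_cooldown,
            "ScaleInCooldown": scale_in_cooldown,
        },
    }


if __name__ == "__main__":
    policy = cpu_target_tracking_policy("enterprise-app-cluster", "enterprise-app")
    print(json.dumps(policy, indent=2))
```

The service must first be registered as a scalable target (with minimum and maximum task counts) before a policy like this can be attached.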
Monitoring, Logging, and Observability
Observability is critical for operating containerized workloads in production. ECS integrates natively with CloudWatch, and can export traces and metrics to AWS X-Ray, OpenTelemetry, Prometheus, and Grafana through sidecar collectors.
CloudWatch Container Insights
Container Insights provides cluster-level and task-level visibility.
aws ecs update-cluster-settings \
--cluster enterprise-app-cluster \
--settings name=containerInsights,value=enabled
Metrics collected include:
- CPU utilization
- Memory utilization
- Network throughput
- Running task count
- Service deployment status
Centralized Logging
Configure ECS tasks to send application logs directly to CloudWatch Logs.
{
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/enterprise-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "web"
}
}
}
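With the awslogs options above, each container writes to a log stream named prefix/container-name/task-id. The helper below is a small sketch for locating the stream that belongs to a given task; the function names and the task ARN in the example are hypothetical.

```python
def task_id_from_arn(task_arn):
    """Extract the task ID, which is the final path segment of an ECS task ARN."""
    return task_arn.rsplit("/", 1)[-1]


def log_stream_name(stream_prefix, container_name, task_arn):
    """Compose the CloudWatch Logs stream name used by the awslogs driver."""
    return f"{stream_prefix}/{container_name}/{task_id_from_arn(task_arn)}"


if __name__ == "__main__":
    # Hypothetical task ARN, for illustration only.
    arn = (
        "arn:aws:ecs:us-east-1:123456789012:task/"
        "enterprise-app-cluster/0123456789abcdef0123456789abcdef"
    )
    print(log_stream_name("web", "enterprise-app", arn))
```

Knowing this naming scheme makes it easier to jump from a failed task in the ECS console to the matching stream in the /ecs/enterprise-app log group.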
Distributed Tracing with AWS X-Ray
AWS X-Ray enables end-to-end request tracing across microservices.
Client
↓
Application Load Balancer
↓
ECS Service A
↓
ECS Service B
↓
Amazon RDS
Prometheus and Grafana Integration
Enterprise teams often integrate ECS with Prometheus and Grafana to visualize application and infrastructure metrics.
Commonly tracked metrics include:
- http_requests_total
- http_request_duration_seconds
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
Troubleshooting and Debugging Guide
Issue: ECS Tasks Stuck in PENDING
Possible causes:
- Insufficient CPU or memory allocation
- Invalid subnet configuration
- Security group restrictions
- ECR connectivity issues
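Start by reviewing the service's recent events, which usually name the placement or networking error:
aws ecs describe-services \
--cluster enterprise-app-cluster \
--services enterprise-app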
Issue: Cannot Pull Container Image
Common reasons include:
- Missing ECR permissions
- Incorrect image tag
- Missing VPC endpoints
- Network ACL restrictions
Issue: ALB Health Check Failures
Verify:
- Correct health endpoint path
- Security group configuration
- Container port mapping
- Application startup completion
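The health endpoint should return HTTP 200 as soon as the application can serve traffic. For example, in a Spring Boot service:
@GetMapping("/health")
public ResponseEntity<String> health() {
    return ResponseEntity.ok("UP");
}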
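When debugging health check failures, it helps to estimate how long a failing target has before the load balancer marks it unhealthy. The sketch below uses a rough approximation (interval multiplied by the unhealthy threshold) with the values from the Terraform target group earlier in this guide (30-second interval, threshold of 3). The function names are hypothetical, and real timing also depends on timeouts and on when the first check fires.

```python
def unhealthy_after_seconds(interval, unhealthy_threshold):
    """Approximate time for consecutive failed checks to mark a target unhealthy."""
    return interval * unhealthy_threshold


def needs_grace_period(startup_seconds, interval, unhealthy_threshold):
    """True if the app starts too slowly to pass a check before being marked unhealthy."""
    return startup_seconds >= unhealthy_after_seconds(interval, unhealthy_threshold)


if __name__ == "__main__":
    window = unhealthy_after_seconds(interval=30, unhealthy_threshold=3)
    print(f"Approximate unhealthy window: {window} seconds")
    print("Slow-starting app needs a grace period:",
          needs_grace_period(startup_seconds=120, interval=30, unhealthy_threshold=3))
```

If the application routinely needs longer than this window to start, configuring a health check grace period on the ECS service is one way to stop tasks from being replaced before they are ready.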