☁️ AWS Interview Questions

40 in-depth questions covering EC2, S3, Lambda, VPC, IAM, RDS, DynamoDB, CloudFormation, CloudFront, Auto Scaling, security, cost optimization, and performance — with theory, real configs, real-world scenarios, and common mistakes.

40 questions · 5 levels · 6 answer sections · 240 total answers
01 What is AWS Global Infrastructure? Explain Regions, Availability Zones, and Edge Locations. basic

AWS Global Infrastructure is the physical foundation of all AWS services, organized into three tiers:

Regions:

  • A geographic area (e.g., us-east-1, ap-south-1) containing multiple data centers.
  • Each Region is completely independent — data does not replicate between Regions unless you configure it.
  • Choose a Region based on: latency (proximity to users), compliance (data residency laws), service availability (not all services are in all Regions), and cost (pricing varies by Region).

Availability Zones (AZs):

  • Each Region has at least three AZs (us-east-1 has six). Each AZ is one or more physically separate data centers within the Region.
  • Connected by low-latency, high-bandwidth private fiber (< 2ms latency between AZs).
  • Designed for fault isolation — separate power, cooling, networking. A fire/flood in one AZ doesn't affect others.
  • Multi-AZ deployment is the foundation of high availability on AWS.
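
To make the fault-isolation point concrete, here is a minimal Python sketch. It is illustrative only: the function names and instance counts are assumptions made for this example, not AWS APIs. It spreads instances round-robin across AZs and reports what fraction of capacity survives the loss of one AZ.

```python
def spread_across_azs(instance_count, azs):
    """Assign instances to AZs round-robin and return a per-AZ count."""
    if not azs:
        raise ValueError("at least one AZ is required")
    placement = {az: 0 for az in azs}
    for i in range(instance_count):
        placement[azs[i % len(azs)]] += 1
    return placement


def surviving_capacity(placement, failed_az):
    """Fraction of instances still running after one AZ fails."""
    total = sum(placement.values())
    if total == 0:
        return 0.0
    return (total - placement.get(failed_az, 0)) / total


if __name__ == "__main__":
    # Hypothetical fleet: 6 instances over three AZs vs. all in one AZ.
    multi_az = spread_across_azs(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
    single_az = spread_across_azs(6, ["us-east-1a"])
    print(multi_az, surviving_capacity(multi_az, "us-east-1a"))
    print(single_az, surviving_capacity(single_az, "us-east-1a"))
```

With everything in one AZ the surviving fraction is zero; with three AZs, two thirds of the fleet keeps serving traffic while the failed AZ recovers.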

Edge Locations:

  • 400+ locations worldwide used by CloudFront (CDN), Route 53 (DNS), and AWS WAF.
  • Cache content close to end users for lower latency.
  • Separate from Regions — there are many more Edge Locations than Regions.
# ── List all Regions ──
aws ec2 describe-regions --query "Regions[].RegionName" --output table

# ── List AZs in current Region ──
aws ec2 describe-availability-zones \
    --query "AvailabilityZones[].{Zone:ZoneName,State:State,Type:ZoneType}" \
    --output table

# Output:
# | Zone           | State      | Type              |
# | us-east-1a     | available  | availability-zone |
# | us-east-1b     | available  | availability-zone |
# | us-east-1c     | available  | availability-zone |

# ── CloudFormation: Multi-AZ deployment ──
# Resources:
#   MyVPC:
#     Type: AWS::EC2::VPC
#     Properties:
#       CidrBlock: 10.0.0.0/16
#
#   SubnetAZ1:
#     Type: AWS::EC2::Subnet
#     Properties:
#       VpcId: !Ref MyVPC
#       CidrBlock: 10.0.1.0/24
#       AvailabilityZone: !Select [0, !GetAZs ""]
#
#   SubnetAZ2:
#     Type: AWS::EC2::Subnet
#     Properties:
#       VpcId: !Ref MyVPC
#       CidrBlock: 10.0.2.0/24
#       AvailabilityZone: !Select [1, !GetAZs ""]

# ── Check Edge Location count ──
# As of 2026: 400+ Edge Locations, 13+ Regional Edge Caches
# Edge Locations are used by CloudFront, Route 53, AWS Shield, WAF

# ── Python boto3: Get AZs programmatically ──
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# azs = ec2.describe_availability_zones()
# for az in azs["AvailabilityZones"]:
#     print(f"{az['ZoneName']} - {az['State']}")

A startup deployed their entire application in a single AZ (us-east-1a). When that AZ experienced a network partition, the app was completely down for 4 hours. After the incident, they redesigned for Multi-AZ: the database moved to RDS Multi-AZ (automatic failover), web servers spanned two AZs behind an ALB, and S3 (inherently multi-AZ) stored static assets. The next AZ outage caused zero downtime.

Always deploy across multiple AZs for high availability — it's the #1 AWS architecture principle. Regions are for geographic reach and compliance. AZs are for fault tolerance within a Region. Edge Locations are for caching content close to users. Choose your Region based on latency, compliance, services, and cost.
⚠️ Common Mistake
// ❌ Single-AZ deployment: single point of failure
// EC2 in us-east-1a only
// RDS in us-east-1a only (Single-AZ)
// AZ outage = complete application downtime
// No automatic failover, manual recovery needed

// ✅ Multi-AZ deployment: fault tolerant
// EC2 in us-east-1a AND us-east-1b behind ALB
// RDS Multi-AZ: automatic failover to standby in another AZ
// S3: inherently stores data across 3+ AZs
// AZ outage = traffic shifts to healthy AZ automatically
🔁 Follow-Up Question

What is the difference between a Region, an AZ, and a Local Zone? When would you use Local Zones or Wavelength Zones?

02 What are EC2 instance types? Explain families, sizing, and burstable vs fixed performance. basic

EC2 (Elastic Compute Cloud) provides virtual servers. Instance types define the hardware profile — CPU, memory, storage, and networking.

Instance family naming: m7g.xlarge = m (family) + 7 (generation) + g (Graviton/ARM) + xlarge (size).
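
The naming scheme can be sketched as a small parser. This is an illustrative Python example, not an AWS API: `parse_instance_type` and the regular expression are assumptions that cover the common `family + generation + attributes . size` shape, and unusual names (for example the high-memory `u-` types) simply return `None`.

```python
import re
from typing import NamedTuple, Optional

# family letters, generation digits, optional attribute letters, then ".size"
_PATTERN = re.compile(
    r"^(?P<family>[a-z]+)(?P<generation>\d+)(?P<attributes>[a-z-]*)\.(?P<size>[a-z0-9-]+)$"
)


class InstanceType(NamedTuple):
    family: str       # e.g. "m" (general purpose)
    generation: int   # e.g. 7
    attributes: str   # e.g. "g" (Graviton); empty if none
    size: str         # e.g. "xlarge"


def parse_instance_type(name: str) -> Optional[InstanceType]:
    """Split an instance type name into its parts, or return None if it does not fit."""
    match = _PATTERN.match(name.strip().lower())
    if match is None:
        return None
    return InstanceType(
        match["family"], int(match["generation"]), match["attributes"], match["size"]
    )


if __name__ == "__main__":
    for name in ("m7g.xlarge", "t3.medium", "i4i.large"):
        print(name, parse_instance_type(name))
```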

Key families:

  • T3/T4g — Burstable. Earns CPU credits when idle, spends them during spikes. Cheapest. Good for dev/test, low-traffic web servers.
  • M7i/M7g — General Purpose. Balanced CPU/memory. Good for most workloads (web apps, small databases).
  • C7i/C7g — Compute Optimized. High CPU-to-memory ratio. Good for batch processing, ML inference, gaming servers.
  • R7i/R7g — Memory Optimized. High memory-to-CPU ratio. Good for in-memory caches (Redis), real-time analytics.
  • I4i — Storage Optimized. High IOPS NVMe SSDs. Good for databases, data warehouses.
  • P5/G5 — Accelerated Computing. GPUs for ML training, video encoding, HPC.

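Burstable (T-family): Uses a CPU credit system. Below baseline → earns credits. Above baseline → spends credits. When credits run out, the instance is either throttled to baseline (standard mode) or keeps bursting and is charged per vCPU-hour for the surplus (unlimited mode). T2 launches in standard mode by default; T3, T3a, and T4g launch in unlimited mode by default. Example baseline: t3.medium is 20% per vCPU.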
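
The credit mechanics can be approximated with a short simulation. This is a simplified Python sketch under stated assumptions: one credit equals one vCPU at 100% for one minute, accounting is done in whole hours, and the defaults (2 vCPUs, 20% baseline, 576-credit cap) are meant to resemble a t3.medium. The function name is hypothetical, and the real service meters credits at finer granularity.

```python
def simulate_cpu_credits(hourly_cpu_percent, vcpus=2, baseline_percent=20,
                         start_balance=0.0, max_balance=576.0):
    """Return a list of (credit_balance, exhausted) tuples, one per hour.

    exhausted=True means the hour demanded more credits than were available,
    i.e. a standard-mode instance would have been throttled to baseline.
    """
    earned_per_hour = vcpus * baseline_percent * 60 / 100
    balance = start_balance
    timeline = []
    for cpu_percent in hourly_cpu_percent:
        spent = vcpus * cpu_percent * 60 / 100
        balance += earned_per_hour - spent
        exhausted = balance < 0
        # The balance can neither go negative nor exceed the accrual cap.
        balance = min(max(balance, 0.0), max_balance)
        timeline.append((balance, exhausted))
    return timeline


if __name__ == "__main__":
    # Hypothetical load: one quiet hour at 10% CPU, then two busy hours at 50%.
    for hour, (balance, exhausted) in enumerate(
        simulate_cpu_credits([10, 50, 50], start_balance=30), start=1
    ):
        print(f"hour {hour}: balance={balance:.0f} exhausted={exhausted}")
```

Running at 10% banks credits (24 earned, 12 spent per hour in this model), while sustained 50% load drains 36 credits per hour, which is why a long spike eventually hits zero.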

Graviton (g suffix): ARM-based processors designed by AWS. Up to 40% better price/performance than comparable x86 instances. Software must be available for arm64, and not all of it is.

# ── List available instance types in a Region ──
aws ec2 describe-instance-types \
    --filters "Name=current-generation,Values=true" \
    --query "InstanceTypes[].{Type:InstanceType,vCPUs:VCpuInfo.DefaultVCpus,MemMiB:MemoryInfo.SizeInMiB}" \
    --output table | head -20

# ── Launch an EC2 instance ──
aws ec2 run-instances \
    --image-id ami-0abcdef1234567890 \
    --instance-type m7g.large \
    --key-name my-key \
    --subnet-id subnet-abc123 \
    --security-group-ids sg-abc123 \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=WebServer}]"

# ── Check T3 CPU credit balance ──
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
    --start-time 2026-05-29T00:00:00Z \
    --end-time 2026-05-30T00:00:00Z \
    --period 3600 --statistics Average

# ── CloudFormation: EC2 with instance type ──
# Resources:
#   WebServer:
#     Type: AWS::EC2::Instance
#     Properties:
#       InstanceType: m7g.large    # Graviton for cost savings
#       ImageId: ami-0abcdef1234567890
#       SubnetId: !Ref PrivateSubnet
#       SecurityGroupIds:
#         - !Ref WebSG
#       # CreditSpecification is valid for T-family only; omit it for m7g.large
#       # CreditSpecification:
#       #   CPUCredits: unlimited
#       Tags:
#         - Key: Name
#           Value: WebServer

# ── Instance type comparison (common choices) ──
# t3.medium:  2 vCPU,  4 GB RAM  — $0.0416/hr  (burstable, dev/test)
# m7g.large:  2 vCPU,  8 GB RAM  — $0.0816/hr  (general, Graviton)
# c7g.large:  2 vCPU,  4 GB RAM  — $0.0725/hr  (compute-heavy)
# r7g.large:  2 vCPU, 16 GB RAM  — $0.1008/hr  (memory-heavy)

A team ran their production API on t3.large instances in standard credit mode. During a traffic spike, the CPU credit balance dropped to zero and the instances were throttled to the 30% baseline; response times jumped from 50ms to 2 seconds. They switched to m7g.large (fixed performance, Graviton), which was only 15% more expensive but provided consistent CPU. For their dev environment, they kept T3 with unlimited mode enabled, which stayed cost-effective with no throttling risk.

Use T-family (burstable) for dev/test and low-traffic workloads. Use M-family (general purpose) for production web apps. Use C-family for CPU-intensive work. Use R-family for memory-intensive work. Choose Graviton (g suffix) for up to 40% better price/performance when your software runs on arm64. Always check CPU credit balance on T-family instances.
⚠️ Common Mistake
// ❌ Running production on T3 without understanding bursting
// T3.large baseline: 30% CPU, fine most of the time
// Traffic spike → CPU credits exhausted → throttled to 30%
// Response times 10x slower, users see timeouts
// "Unlimited mode" not enabled → hard throttle at 0 credits

// ✅ Use fixed-performance instances for production
// m7g.large: consistent CPU, no credit system
// Or T3 with CreditSpecification: unlimited (pay for bursts)
// Monitor: CPUCreditBalance alarm when < 100 credits
// Right-size: use AWS Compute Optimizer recommendations
🔁 Follow-Up Question

What is AWS Compute Optimizer and how does it recommend the right instance type?

03 What are S3 storage classes? Explain when to use each and how lifecycle policies work. basic

Amazon S3 offers multiple storage classes optimized for different access patterns and cost requirements:

  • S3 Standard — frequently accessed data. 99.99% availability, 11 nines durability. Highest cost per GB but no retrieval fees.
  • S3 Intelligent-Tiering — automatically moves objects between tiers based on access patterns. Small monthly monitoring fee. Best when access patterns are unpredictable.
  • S3 Standard-IA (Infrequent Access): data accessed less than once a month. Lower storage cost, but a per-GB retrieval fee. Minimum storage charge of 30 days. Minimum billable object size: 128 KB.
  • S3 One Zone-IA — same as IA but stored in a single AZ. 20% cheaper. Use for re-creatable data (thumbnails, transcoded media).
  • S3 Glacier Instant Retrieval — archive data needing millisecond access (quarterly reports). Cheapest with instant access.
  • S3 Glacier Flexible Retrieval — archive with retrieval in minutes to hours (1-5 min expedited, 3-5 hr standard, 5-12 hr bulk).
  • S3 Glacier Deep Archive — cheapest storage. Retrieval in 12-48 hours. For compliance archives, 7-10 year retention.

Lifecycle Policies automate transitions between classes and expiration of objects based on age or other criteria.
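
As a rough mental model of how transition rules resolve, the following Python sketch maps an object's age to the storage class a rule would have placed it in. It is an illustration under assumptions: `storage_class_for_age` is a hypothetical helper, ages are whole days since creation, and it ignores details of the real service such as asynchronous rule execution, filters, and object versions.

```python
def storage_class_for_age(age_days, transitions, expiration_days=None,
                          initial_class="STANDARD"):
    """Return the storage class for an object of the given age.

    transitions: iterable of (days, storage_class) pairs, as in a lifecycle rule.
    Returns None once the object is past the expiration age (deleted).
    """
    if expiration_days is not None and age_days >= expiration_days:
        return None
    current = initial_class
    # Apply transitions from youngest to oldest; the last one reached wins.
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


if __name__ == "__main__":
    # Hypothetical rule: IA after 30 days, Glacier Instant after 90, Deep Archive after 365.
    rule = [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]
    for age in (10, 45, 120, 400, 3000):
        print(age, storage_class_for_age(age, rule, expiration_days=2555))
```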

# ── Upload with specific storage class ──
aws s3 cp backup.tar.gz s3://my-bucket/backups/ \
    --storage-class GLACIER_IR

# ── Check current storage class of an object ──
aws s3api head-object --bucket my-bucket --key data/report.csv \
    --query "StorageClass"

# ── CloudFormation: S3 Lifecycle Policy ──
# Resources:
#   DataBucket:
#     Type: AWS::S3::Bucket
#     Properties:
#       BucketName: my-data-bucket
#       LifecycleConfiguration:
#         Rules:
#           - Id: TransitionToIA
#             Status: Enabled
#             Transitions:
#               # After 30 days → Infrequent Access
#               - TransitionInDays: 30
#                 StorageClass: STANDARD_IA
#               # After 90 days → Glacier Instant
#               - TransitionInDays: 90
#                 StorageClass: GLACIER_IR
#               # After 365 days → Deep Archive
#               - TransitionInDays: 365
#                 StorageClass: DEEP_ARCHIVE
#             ExpirationInDays: 2555  # Delete after 7 years
#
#           - Id: CleanupIncompleteUploads
#             Status: Enabled
#             AbortIncompleteMultipartUpload:
#               DaysAfterInitiation: 7  # Clean up failed uploads

# ── AWS CLI: Set lifecycle policy ──
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-bucket \
    --lifecycle-configuration file://lifecycle.json

# lifecycle.json example:
# {
#   "Rules": [{
#     "ID": "ArchiveOldLogs",
#     "Status": "Enabled",
#     "Filter": { "Prefix": "logs/" },
#     "Transitions": [
#       { "Days": 30, "StorageClass": "STANDARD_IA" },
#       { "Days": 90, "StorageClass": "GLACIER" }
#     ],
#     "Expiration": { "Days": 365 }
#   }]
# }

# ── Storage class cost comparison (us-east-1, per GB/month) ──
# Standard:           $0.023
# Intelligent-Tier:   $0.023 (+ $0.0025/1000 objects monitoring)
# Standard-IA:        $0.0125  (+ $0.01/GB retrieval)
# One Zone-IA:        $0.01    (+ $0.01/GB retrieval)
# Glacier Instant:    $0.004   (+ $0.03/GB retrieval)
# Glacier Flexible:   $0.0036  (+ $0.01-$0.03/GB retrieval)
# Deep Archive:       $0.00099 (+ $0.02/GB retrieval)

A healthcare company stored 50TB of patient records in S3 Standard, costing $1,150/month. Analysis showed 90% of records were accessed only during the first 30 days. They implemented a lifecycle policy: Standard for 30 days → Standard-IA for 30-90 days → Glacier Instant Retrieval after 90 days. Monthly cost dropped to $320, a 72% reduction. Compliance-required 7-year retention records moved to Deep Archive at roughly $1/TB/month ($0.00099/GB).
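
The storage arithmetic behind a scenario like this can be checked with a few lines of Python. This is a sketch under assumptions: the per-GB prices are the us-east-1 figures from the comparison table above, 1 TB is treated as 1,000 GB, and only storage is counted (retrieval, request, monitoring, and transition fees are left out). `monthly_storage_cost` is a hypothetical helper name.

```python
# Per-GB monthly storage prices, copied from the comparison table above.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "ONEZONE_IA": 0.01,
    "GLACIER_IR": 0.004,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(gb_by_class):
    """Sum storage-only cost in USD for a {storage_class: gigabytes} mapping."""
    total = sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())
    return round(total, 2)


if __name__ == "__main__":
    # 50 TB kept entirely in S3 Standard.
    print(monthly_storage_cost({"STANDARD": 50_000}))
```

Plugging in your own split of gigabytes per class shows how much of the bill a lifecycle policy can move to cheaper tiers before retrieval fees are taken into account.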

Use S3 Standard for frequently accessed data, Intelligent-Tiering when patterns are unknown, Standard-IA for monthly access, and Glacier tiers for archives. Set lifecycle policies to automate transitions — most cost savings come from this. Always clean up incomplete multipart uploads (they cost money). One Zone-IA is only for re-creatable data.
⚠️ Common Mistake
// ❌ Storing everything in S3 Standard forever
// 50TB in Standard = $1,150/month
// 90% of data is rarely accessed after 30 days
// No lifecycle policy = paying premium for cold data
// Incomplete multipart uploads accumulate silently

// ✅ Lifecycle policy optimizes costs automatically
// Days 0-30: Standard ($0.023/GB), active data
// Days 30-90: Standard-IA ($0.0125/GB), infrequent
// Days 90+: Glacier Instant ($0.004/GB), archive
// 50TB cost drops from $1,150 to $320/month
// + AbortIncompleteMultipartUpload: 7 days
🔁 Follow-Up Question

What is S3 Intelligent-Tiering and when does it make more sense than manual lifecycle policies?

04 How does IAM work in AWS? Explain Users, Groups, Roles, and Policies. basic

IAM (Identity and Access Management) controls who (authentication) can do what (authorization) in your AWS account.

Core components:

  • Users — individual identities with long-term credentials (password + access keys). Map to a person or application. Best practice: minimize IAM users, use IAM Identity Center (SSO) instead.
  • Groups — collections of users. Attach policies to groups, not individual users. E.g., "Developers" group, "DBAdmins" group.
  • Roles — temporary credentials assumed by users, services, or accounts. No long-term credentials. EC2 instances, Lambda functions, and cross-account access use roles. Most important IAM concept.
  • Policies — JSON documents that define permissions. Attached to users, groups, or roles.

Policy structure: Effect (Allow/Deny) + Action (e.g., s3:GetObject) + Resource (ARN of the resource). An explicit Deny always overrides any Allow, and anything not explicitly allowed is denied by default.

Policy types:

  • AWS Managed — predefined by AWS (e.g., AmazonS3ReadOnlyAccess).
  • Customer Managed — created by you, reusable across entities.
  • Inline — embedded directly in a single user/group/role. Avoid when possible.

Least Privilege Principle: Grant only the minimum permissions needed. Start with zero permissions and add as needed.
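
The evaluation order (start from an implicit deny, let an Allow grant access, let an explicit Deny override everything) can be sketched in Python. This is a deliberately simplified model, not the real IAM engine: `evaluate` is a hypothetical function, wildcard matching uses `fnmatch`, and conditions, principals, permission boundaries, SCPs, and resource-based policies are all ignored.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """Policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def _matches(statement, action, resource):
    # Action names are compared case-insensitively; resource ARNs are not.
    action_hit = any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(statement["Action"])
    )
    resource_hit = any(
        fnmatchcase(resource, pattern) for pattern in _as_list(statement["Resource"])
    )
    return action_hit and resource_hit


def evaluate(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request."""
    decision = "ImplicitDeny"  # least privilege: nothing is allowed by default
    for statement in statements:
        if not _matches(statement, action, resource):
            continue
        if statement["Effect"] == "Deny":
            return "ExplicitDeny"  # an explicit Deny wins immediately
        decision = "Allow"
    return decision


if __name__ == "__main__":
    # Statements mirroring a read-only S3 policy with a bucket-deletion guard.
    policy = [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]},
        {"Effect": "Deny", "Action": "s3:DeleteBucket", "Resource": "*"},
    ]
    print(evaluate(policy, "s3:GetObject", "arn:aws:s3:::my-bucket/report.csv"))
    print(evaluate(policy, "s3:DeleteBucket", "arn:aws:s3:::my-bucket"))
    print(evaluate(policy, "s3:PutObject", "arn:aws:s3:::my-bucket/report.csv"))
```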

# ── IAM Policy JSON structure ──
# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Sid": "AllowS3ReadOnly",
#       "Effect": "Allow",
#       "Action": [
#         "s3:GetObject",
#         "s3:ListBucket"
#       ],
#       "Resource": [
#         "arn:aws:s3:::my-bucket",
#         "arn:aws:s3:::my-bucket/*"
#       ]
#     },
#     {
#       "Sid": "DenyDeleteBucket",
#       "Effect": "Deny",
#       "Action": "s3:DeleteBucket",
#       "Resource": "*"
#     }
#   ]
# }

# ── Create an IAM Role for EC2 ──
aws iam create-role --role-name EC2-S3-Reader \
    --assume-role-policy-document \
    '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# Attach a managed policy
aws iam attach-role-policy --role-name EC2-S3-Reader \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Create instance profile (required for EC2)
aws iam create-instance-profile --instance-profile-name EC2-S3-Reader
aws iam add-role-to-instance-profile \
    --instance-profile-name EC2-S3-Reader \
    --role-name EC2-S3-Reader

# ── CloudFormation: IAM Role for Lambda ──
# Resources:
#   LambdaExecutionRole:
#     Type: AWS::IAM::Role
#     Properties:
#       AssumeRolePolicyDocument:
#         Version: "2012-10-17"
#         Statement:
#           - Effect: Allow
#             Principal:
#               Service: lambda.amazonaws.com
#             Action: sts:AssumeRole
#       ManagedPolicyArns:
#         - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
#       Policies:
#         - PolicyName: DynamoDBAccess
#           PolicyDocument:
#             Version: "2012-10-17"
#             Statement:
#               - Effect: Allow
#                 Action:
#                   - dynamodb:GetItem
#                   - dynamodb:PutItem
#                   - dynamodb:Query
#                 Resource: !GetAtt MyTable.Arn

A developer stored AWS access keys in their application code and pushed it to a public GitHub repo. Within 20 minutes, crypto miners had spun up 50 expensive GPU instances. The fix: rotated all credentials, enabled MFA, switched to IAM Roles (no access keys needed for EC2/Lambda), and enabled AWS CloudTrail to audit all API calls. They also set up billing alarms and AWS Organizations SCPs to restrict instance types.

Use IAM Roles instead of access keys whenever possible — roles provide temporary credentials that auto-rotate. Attach policies to groups, not users. Follow least privilege — start with no permissions and add only what's needed. Enable MFA for all human users. Never hardcode credentials — use roles for EC2, Lambda, ECS. Deny always overrides Allow.
⚠️ Common Mistake
// ❌ Hardcoded access keys in application code
// const aws = require("aws-sdk");
// aws.config.update({
//   accessKeyId: "AKIAIOSFODNN7EXAMPLE",      // Leaked!
//   secretAccessKey: "wJalrXUtnFEMI/K7MDENG"  // Leaked!
// });
// Keys in source code → pushed to GitHub → compromised in minutes

// ✅ Use IAM Roles: no credentials in code
// EC2: attach instance profile with IAM Role
// Lambda: assign execution role
// ECS: task role
// SDK auto-discovers credentials from the role
// const s3 = new AWS.S3();  // No credentials needed!
// Temporary creds are rotated automatically before they expire
🔁 Follow-Up Question

What is the difference between IAM Roles and IAM Identity Center (SSO)? When do you use each?

05 What is a VPC? Explain subnets, route tables, Internet Gateway, and NAT Gateway. basic

A VPC (Virtual Private Cloud) is your isolated virtual network in AWS. You control the IP range, subnets, routing, and security.

Key components:

  • CIDR Block — the IP address range of your VPC (e.g., 10.0.0.0/16 = 65,536 IPs). Cannot be changed after creation (but you can add secondary CIDRs).
  • Subnets — subdivisions of the VPC CIDR, each in a single AZ. Two types:
    • Public subnet: has a route to an Internet Gateway. A resource also needs a public IP (auto-assigned or an Elastic IP) to be reachable from the internet.
    • Private subnet — no direct internet access. Resources communicate via NAT Gateway or VPC endpoints.
  • Route Tables — rules that determine where traffic goes. Each subnet is associated with one route table.
    • Public subnet route: 0.0.0.0/0 → igw-xxx (Internet Gateway).
    • Private subnet route: 0.0.0.0/0 → nat-xxx (NAT Gateway).
  • Internet Gateway (IGW) — allows resources in public subnets to reach the internet (and be reached from the internet). One per VPC.
  • NAT Gateway — allows resources in private subnets to reach the internet (for updates, API calls) but prevents inbound connections. Deployed in a public subnet. Costs: hourly + per-GB data processed.
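
Subnet planning is mostly CIDR arithmetic, which Python's standard `ipaddress` module handles directly. The sketch below is illustrative: `plan_subnets` is a hypothetical helper, and it relies on the fact that AWS reserves five addresses in every subnet, so a /24 offers 251 usable IPs rather than 256.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def plan_subnets(vpc_cidr, new_prefix, count):
    """Return the first `count` subnets of the VPC as (cidr, usable_ips) pairs."""
    vpc = ipaddress.ip_network(vpc_cidr)
    plan = []
    for subnet in vpc.subnets(new_prefix=new_prefix):
        if len(plan) == count:
            break
        plan.append((str(subnet), subnet.num_addresses - AWS_RESERVED_PER_SUBNET))
    return plan


if __name__ == "__main__":
    print(ipaddress.ip_network("10.0.0.0/16").num_addresses)
    for cidr, usable in plan_subnets("10.0.0.0/16", 24, 4):
        print(cidr, usable)
```

A common layout gives each AZ one public and one private subnet from such a plan, leaving plenty of unused /24 blocks for later growth.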
# ── CloudFormation: Complete VPC setup ──
# Resources:
#   VPC:
#     Type: AWS::EC2::VPC
#     Properties:
#       CidrBlock: 10.0.0.0/16
#       EnableDnsSupport: true
#       EnableDnsHostnames: true
#       Tags: [{Key: Name, Value: MyVPC}]
#
#   InternetGateway:
#     Type: AWS::EC2::InternetGateway
#   AttachGateway:
#     Type: AWS::EC2::VPCGatewayAttachment
#     Properties:
#       VpcId: !Ref VPC
#       InternetGatewayId: !Ref InternetGateway
#
#   PublicSubnet1:
#     Type: AWS::EC2::Subnet
#     Properties:
#       VpcId: !Ref VPC
#       CidrBlock: 10.0.1.0/24
#       AvailabilityZone: !Select [0, !GetAZs ""]
#       MapPublicIpOnLaunch: true
#
#   PrivateSubnet1:
#     Type: AWS::EC2::Subnet
#     Properties:
#       VpcId: !Ref VPC
#       CidrBlock: 10.0.10.0/24
#       AvailabilityZone: !Select [0, !GetAZs ""]
#
#   NatGateway:
#     Type: AWS::EC2::NatGateway
#     Properties:
#       SubnetId: !Ref PublicSubnet1  # NAT lives in PUBLIC subnet
#       AllocationId: !GetAtt NatEIP.AllocationId
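#   NatEIP:
#     Type: AWS::EC2::EIP
#     Properties:
#       Domain: vpc
#
#   PublicRouteTable:
#     Type: AWS::EC2::RouteTable
#     Properties:
#       VpcId: !Ref VPC
#   PublicRoute:
#     Type: AWS::EC2::Route
#     DependsOn: AttachGateway  # IGW must be attached before the route is created
#     Properties:
#       RouteTableId: !Ref PublicRouteTable
#       DestinationCidrBlock: 0.0.0.0/0
#       GatewayId: !Ref InternetGateway  # → Internet
#   PublicSubnet1RouteTableAssociation:
#     Type: AWS::EC2::SubnetRouteTableAssociation
#     Properties:
#       SubnetId: !Ref PublicSubnet1
#       RouteTableId: !Ref PublicRouteTable
#
#   PrivateRouteTable:
#     Type: AWS::EC2::RouteTable
#     Properties:
#       VpcId: !Ref VPC
#   PrivateRoute:
#     Type: AWS::EC2::Route
#     Properties:
#       RouteTableId: !Ref PrivateRouteTable
#       DestinationCidrBlock: 0.0.0.0/0
#       NatGatewayId: !Ref NatGateway  # → NAT (outbound only)
#   PrivateSubnet1RouteTableAssociation:
#     Type: AWS::EC2::SubnetRouteTableAssociation
#     Properties:
#       SubnetId: !Ref PrivateSubnet1
#       RouteTableId: !Ref PrivateRouteTable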

# ── AWS CLI: Create VPC ──
aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
    --tag-specifications "ResourceType=vpc,Tags=[{Key=Name,Value=MyVPC}]"

A company put their database EC2 instance in a public subnet with a public IP. A port scan found the open MySQL port and brute-forced the weak password. After the breach, they redesigned: databases moved to private subnets (no public IP, no IGW route). Application servers in private subnets accessed the internet via NAT Gateway for package updates. Only the ALB sat in public subnets. NAT Gateway cost ($0.045/hr + $0.045/GB) was trivial compared to the breach cost.

Place databases and application servers in private subnets — never expose them to the internet. Use public subnets only for load balancers, bastion hosts, and NAT Gateways. NAT Gateway provides outbound-only internet for private resources. Always use at least 2 AZs with one public and one private subnet each. A well-designed VPC is the foundation of cloud security.
⚠️ Common Mistake
// ❌ Database in a public subnet with public IP
// EC2 (MySQL) in public subnet → public IP 54.x.x.x
// Security Group allows 0.0.0.0/0 on port 3306
// Anyone on the internet can attempt to connect
// Brute force → data breach → front-page news

// ✅ Database in private subnet, no public access
// RDS in private subnet → no public IP, no IGW route
// Security Group: allow 3306 only from app server SG
// App servers in private subnet → outbound via NAT Gateway
// ALB in public subnet → only port 443 exposed
// Zero attack surface from the internet
🔁 Follow-Up Question

What is a VPC endpoint and how does it avoid NAT Gateway costs for AWS service access?

06 What is the difference between Security Groups and NACLs? How do they work together? basic

AWS provides two layers of network security that work together:

Security Groups (SGs) — instance-level firewall:

  • Stateful — if you allow inbound traffic, the response is automatically allowed outbound (and vice versa).
  • Attached to ENIs (network interfaces) on EC2, RDS, Lambda VPC, etc.
  • Allow rules only — no explicit deny. Anything not allowed is implicitly denied.
  • Can reference other Security Groups as source/destination (e.g., "allow traffic from the ALB SG").
  • All rules evaluated together — if any rule allows, traffic passes.
  • Default: all outbound allowed, all inbound denied.

NACLs (Network Access Control Lists) — subnet-level firewall:

  • Stateless — you must explicitly allow both inbound AND outbound (including ephemeral ports for responses).
  • Applied to subnets — affects all resources in the subnet.
  • Allow AND deny rules — can explicitly block specific IPs.
  • Rules evaluated in order by rule number — first match wins.
  • Default NACL: allows all traffic. Custom NACL: denies all by default.

Evaluation order: Inbound traffic hits NACL first → then Security Group. Outbound: Security Group first → then NACL.
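
The two evaluation models can be contrasted in a short Python sketch. It is a simplified illustration under assumptions: rules are plain tuples rather than AWS API objects, only inbound TCP by source IP and port is modelled, the function names are hypothetical, and statefulness (return traffic) is not represented.

```python
import ipaddress


def nacl_decision(rules, source_ip, port):
    """NACL: evaluate rules in ascending rule number; the first match wins.

    rules: list of (rule_number, cidr, from_port, to_port, action).
    Falls through to an implicit deny when nothing matches.
    """
    ip = ipaddress.ip_address(source_ip)
    for _, cidr, from_port, to_port, action in sorted(rules):
        if ip in ipaddress.ip_network(cidr) and from_port <= port <= to_port:
            return action
    return "deny"


def security_group_allows(rules, source_ip, port):
    """Security Group: allow-only rules, and any single matching rule is enough.

    rules: list of (cidr, from_port, to_port).
    """
    ip = ipaddress.ip_address(source_ip)
    return any(
        ip in ipaddress.ip_network(cidr) and low <= port <= high
        for cidr, low, high in rules
    )


def inbound_allowed(nacl_rules, sg_rules, source_ip, port):
    """Inbound traffic must pass the subnet NACL first, then the Security Group."""
    return (
        nacl_decision(nacl_rules, source_ip, port) == "allow"
        and security_group_allows(sg_rules, source_ip, port)
    )


if __name__ == "__main__":
    nacl = [
        (100, "0.0.0.0/0", 443, 443, "allow"),
        (50, "203.0.113.0/24", 0, 65535, "deny"),  # lower number, checked first
    ]
    sg = [("0.0.0.0/0", 443, 443)]
    print(inbound_allowed(nacl, sg, "203.0.113.7", 443))
    print(inbound_allowed(nacl, sg, "198.51.100.9", 443))
```

The deny rule only works because its number is lower than the allow rule's; a Security Group alone could not express it at all.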

# ── Security Group: Web server ──
aws ec2 create-security-group \
    --group-name WebSG --description "Web server SG" \
    --vpc-id vpc-abc123

# Allow HTTPS from anywhere
aws ec2 authorize-security-group-ingress \
    --group-id sg-abc123 \
    --protocol tcp --port 443 --cidr 0.0.0.0/0

# Allow app traffic from ALB Security Group (SG reference!)
aws ec2 authorize-security-group-ingress \
    --group-id sg-abc123 \
    --protocol tcp --port 8080 \
    --source-group sg-alb456  # ← Reference another SG

# ── NACL: Block a specific IP range ──
aws ec2 create-network-acl-entry \
    --network-acl-id acl-abc123 \
    --rule-number 50 --protocol tcp \
    --port-range From=0,To=65535 \
    --cidr-block 203.0.113.0/24 \
    --rule-action deny --ingress

# Allow HTTPS inbound (rule 100 — evaluated after rule 50)
aws ec2 create-network-acl-entry \
    --network-acl-id acl-abc123 \
    --rule-number 100 --protocol tcp \
    --port-range From=443,To=443 \
    --cidr-block 0.0.0.0/0 \
    --rule-action allow --ingress

# Allow ephemeral ports outbound (NACL is STATELESS!)
aws ec2 create-network-acl-entry \
    --network-acl-id acl-abc123 \
    --rule-number 100 --protocol tcp \
    --port-range From=1024,To=65535 \
    --cidr-block 0.0.0.0/0 \
    --rule-action allow --egress

# ── Key comparison ──
# Feature          | Security Group      | NACL
# Level            | Instance (ENI)      | Subnet
# Stateful?        | Yes                 | No
# Rules            | Allow only          | Allow + Deny
# Evaluation       | All rules together  | Ordered by rule #
# Default inbound  | Deny all            | Allow all (default NACL)
# SG references    | Yes                 | No (CIDR only)

A web app was under a DDoS attack from a specific IP range (203.0.113.0/24). Security Groups couldn't help because they don't have deny rules — they could only allow legitimate traffic. The team added a NACL deny rule (rule number 50, lower than the allow rules) to block the attacking IP range at the subnet level. Traffic from those IPs was dropped before reaching any EC2 instance. For ongoing protection, they also enabled AWS Shield and WAF.

Security Groups are your primary defense — stateful, instance-level, reference other SGs. Use NACLs as a second layer for explicit deny rules (blocking IPs, subnets). NACLs are stateless — remember to allow ephemeral ports (1024-65535) for return traffic. Reference SGs instead of CIDR blocks when possible — it's more maintainable and scales with Auto Scaling.
⚠️ Common Mistake
// ❌ Forgetting NACLs are stateless
// NACL: Allow inbound TCP 443 ✅
// NACL: No outbound rule for ephemeral ports ❌
// Result: HTTPS request arrives but response is blocked!
// Client sees: connection timeout (no response)
// SGs are stateful so developers forget NACLs aren't

// ✅ NACL: Allow both directions
// Inbound: Allow TCP 443 from 0.0.0.0/0
// Outbound: Allow TCP 1024-65535 to 0.0.0.0/0 (ephemeral)
// Or keep the default NACL (allows all) and rely on SGs
// Use NACLs only when you need explicit DENY rules
🔁 Follow-Up Question

Can you use Security Group references across VPCs? What about across accounts with VPC Peering?

07 What are EBS volume types? Explain IOPS, throughput, and when to use each type. basic

EBS (Elastic Block Store) provides persistent block storage volumes for EC2 instances. Different volume types optimize for different workloads:

SSD-backed (random I/O):

  • gp3 (General Purpose SSD) — baseline 3,000 IOPS + 125 MB/s throughput, independently scalable to 16,000 IOPS and 1,000 MB/s. Best default choice. 20% cheaper than gp2.
  • gp2 (previous gen) — IOPS scales with volume size (3 IOPS/GB, burst to 3,000). Being replaced by gp3.
  • io2 Block Express — provisioned IOPS up to 256,000 IOPS. Sub-millisecond latency. For critical databases (Oracle, SAP HANA). 99.999% durability (vs 99.8-99.9% for others).

HDD-backed (sequential I/O):

  • st1 (Throughput Optimized HDD) — max 500 MB/s throughput. For big data, data warehouses, log processing. Cannot be a boot volume.
  • sc1 (Cold HDD) — cheapest EBS. Max 250 MB/s. For infrequently accessed data. Cannot be a boot volume.

Key concepts:

  • IOPS = Input/Output Operations Per Second — measures random read/write speed (databases).
  • Throughput = MB/s — measures sequential read/write speed (big data, streaming).
  • gp3 advantage: you can provision IOPS and throughput independently of volume size (unlike gp2).
  • Snapshots: point-in-time backups stored in S3. Incremental — only changed blocks are saved.
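
The relationship between volume size, IOPS, and throughput is simple arithmetic, sketched here in Python. The helper names are hypothetical; the gp2 formula uses the 3 IOPS/GB rule described above together with gp2's 100 IOPS floor and 16,000 IOPS ceiling, and the throughput helper just multiplies IOPS by I/O size.

```python
def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, never below 100 and never above 16,000."""
    return max(100, min(3 * size_gib, 16_000))


def throughput_mib_s(iops, io_size_kib):
    """Throughput implied by a given IOPS rate at a fixed I/O size."""
    return iops * io_size_kib / 1024


if __name__ == "__main__":
    for size in (10, 100, 1000, 10_000):
        print(f"gp2 {size} GiB -> {gp2_baseline_iops(size)} baseline IOPS")
    # 3,000 IOPS (the gp3 baseline) at 16 KiB per operation.
    print(throughput_mib_s(3000, 16), "MiB/s")
```

This makes the gp3 advantage visible: on gp2 the only way to raise baseline IOPS is to buy more gigabytes, while gp3 lets you set IOPS and throughput as separate parameters.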
# ── Create a gp3 volume with custom IOPS ──
aws ec2 create-volume \
    --volume-type gp3 \
    --size 500 \
    --iops 10000 \
    --throughput 500 \
    --availability-zone us-east-1a \
    --tag-specifications "ResourceType=volume,Tags=[{Key=Name,Value=AppData}]"

# ── Modify existing volume (no downtime!) ──
aws ec2 modify-volume \
    --volume-id vol-abc123 \
    --volume-type gp3 \
    --iops 10000 \
    --throughput 500 \
    --size 1000

# ── CloudFormation: EBS volume for database ──
# Resources:
#   DatabaseVolume:
#     Type: AWS::EC2::Volume
#     Properties:
#       VolumeType: io2
#       Iops: 50000
#       Size: 500        # 500 GB
#       AvailabilityZone: !GetAtt DBInstance.AvailabilityZone
#       Encrypted: true
#       KmsKeyId: !Ref MyKMSKey
#       Tags:
#         - Key: Name
#           Value: DatabaseVolume

# ── Create a snapshot (backup) ──
aws ec2 create-snapshot \
    --volume-id vol-abc123 \
    --description "Pre-upgrade backup" \
    --tag-specifications "ResourceType=snapshot,Tags=[{Key=Name,Value=PreUpgrade}]"

# ── Volume type comparison ──
# Type | Max IOPS  | Max Throughput | Use Case
# gp3  | 16,000    | 1,000 MB/s     | Default choice, boot volumes
# io2  | 256,000   | 4,000 MB/s     | Critical databases
# st1  | 500       | 500 MB/s       | Big data, logs (sequential)
# sc1  | 250       | 250 MB/s       | Cold storage (cheapest)

# ── Monitor IOPS usage ──
aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS \
    --metric-name VolumeReadOps \
    --dimensions Name=VolumeId,Value=vol-abc123 \
    --start-time 2026-05-29T00:00:00Z \
    --end-time 2026-05-30T00:00:00Z \
    --period 300 --statistics Sum

A PostgreSQL database on gp2 (100GB = 300 baseline IOPS) suffered from burst credit exhaustion during batch processing. Response times spiked when credits ran out. Migrating to gp3 with 10,000 provisioned IOPS (independent of size) solved the problem at a fraction of the cost of growing a gp2 volume to reach the same IOPS (about 3,334 GB at 3 IOPS/GB): gp3's base price is 20% lower than gp2, and they paid only for the IOPS they needed. The volume modification was done live with no downtime using modify-volume.

Use gp3 as your default — it's cheaper than gp2 and lets you provision IOPS independently of size. Use io2 only for critical databases needing >16,000 IOPS or 99.999% durability. Use st1 for sequential workloads (big data, logs). Always encrypt volumes with KMS. You can modify volume type, size, and IOPS live — no downtime needed.
⚠️ Common Mistake
// ❌ Using gp2 and relying on burst credits
// 100 GB gp2 = 300 baseline IOPS (3 IOPS/GB)
// Burst to 3,000 IOPS, but credits deplete quickly
// Batch job runs → credits exhausted → throttled to 300 IOPS
// Database queries take 10x longer during batch processing

// ✅ Switch to gp3 with provisioned IOPS
// gp3: 3,000 baseline IOPS included (any size)
// Provision up to 16,000 IOPS independently
// No credit system, consistent performance 24/7
// 20% cheaper base price than gp2
// aws ec2 modify-volume --volume-type gp3 --iops 10000 (live!)
🔁 Follow-Up Question

What are EBS Multi-Attach and io2 Block Express? When would you use them?

08 How does Route 53 work? Explain DNS routing policies and health checks. basic

Route 53 is AWS's managed DNS service. It provides domain registration, DNS routing, and health checking.

Hosted Zones:

  • Public Hosted Zone — resolves domain names from the internet (e.g., www.example.com → ALB IP).
  • Private Hosted Zone — resolves names only within your VPC (e.g., db.internal → RDS private IP).

Routing Policies:

  • Simple — one record, one or more values. No health checks. Good for single resources.
  • Weighted — distribute traffic by percentage (e.g., 90% to v1, 10% to v2). Great for canary deployments.
  • Latency-based — routes to the Region with lowest latency for the user. Best for multi-region apps.
  • Failover — primary/secondary setup. If primary health check fails → routes to secondary. Active-passive DR.
  • Geolocation — routes based on user's geographic location (continent, country). For compliance, localization.
  • Geoproximity — routes based on geographic distance with bias to shift traffic between regions.
  • Multivalue Answer — returns multiple healthy IPs (up to 8). Client-side load balancing with health checks.
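
To make the weighted policy concrete, here is a small sketch of how weights translate into traffic share: each record receives its weight divided by the sum of all weights. The record names (`v1`, `v2`) and helper functions are hypothetical and only simulate the selection locally; they do not call Route 53.

```python
import random


def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of DNS responses each weighted record should receive."""
    total = sum(weights.values())
    if total == 0:
        # When every weight is 0, Route 53 treats all records as equally likely.
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}


def pick_record(weights: dict[str, int], rng: random.Random) -> str:
    """Simulate one weighted DNS answer (local simulation only)."""
    names = list(weights)
    shares = traffic_share(weights)
    return rng.choices(names, weights=[shares[n] for n in names], k=1)[0]


if __name__ == "__main__":
    canary = {"v1": 90, "v2": 10}  # hypothetical canary split
    print(traffic_share(canary))
    rng = random.Random(42)
    sample = [pick_record(canary, rng) for _ in range(10_000)]
    print("observed v2 share:", sample.count("v2") / len(sample))
```

Because resolvers cache answers for the record's TTL, the observed split in production only approaches these shares over many clients and some time.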

Health Checks: Route 53 monitors endpoint health (HTTP/HTTPS/TCP). Unhealthy records are removed from DNS responses. Can trigger CloudWatch alarms.

Alias Records: Route 53-specific feature — point to AWS resources (ALB, CloudFront, S3) without a CNAME. Free of charge. Works at the zone apex (example.com, not just www.example.com).

# ── Create a Hosted Zone ──
aws route53 create-hosted-zone \
    --name example.com \
    --caller-reference "2026-05-30"

# ── Create an Alias record pointing to ALB ──
# aws route53 change-resource-record-sets \
#     --hosted-zone-id Z1234567890 \
#     --change-batch '{
#       "Changes": [{
#         "Action": "CREATE",
#         "ResourceRecordSet": {
#           "Name": "www.example.com",
#           "Type": "A",
#           "AliasTarget": {
#             "HostedZoneId": "Z35SXDOTRQ7X7K",
#             "DNSName": "my-alb-1234567890.us-east-1.elb.amazonaws.com",
#             "EvaluateTargetHealth": true
#           }
#         }
#       }]
#     }'

# ── Health Check ──
aws route53 create-health-check --caller-reference "web-hc-2026" \
    --health-check-config \
    Type=HTTPS,FullyQualifiedDomainName=www.example.com,\
Port=443,ResourcePath=/health,RequestInterval=30,FailureThreshold=3

# ── CloudFormation: Failover routing ──
# Resources:
#   PrimaryRecord:
#     Type: AWS::Route53::RecordSet
#     Properties:
#       HostedZoneId: !Ref MyHostedZone
#       Name: api.example.com
#       Type: A
#       AliasTarget:
#         HostedZoneId: !GetAtt PrimaryALB.CanonicalHostedZoneID
#         DNSName: !GetAtt PrimaryALB.DNSName
#         EvaluateTargetHealth: true
#       Failover: PRIMARY
#       SetIdentifier: primary
#       HealthCheckId: !Ref PrimaryHealthCheck
#
#   SecondaryRecord:
#     Type: AWS::Route53::RecordSet
#     Properties:
#       HostedZoneId: !Ref MyHostedZone
#       Name: api.example.com
#       Type: A
#       AliasTarget:
#         HostedZoneId: !GetAtt SecondaryALB.CanonicalHostedZoneID
#         DNSName: !GetAtt SecondaryALB.DNSName
#       Failover: SECONDARY
#       SetIdentifier: secondary

# ── Routing policy comparison ──
# Policy       | Use Case                    | Health Check?
# Simple       | Single resource             | No
# Weighted     | Canary, A/B testing         | Yes
# Latency      | Multi-region, lowest ping   | Yes
# Failover     | Active-passive DR           | Yes (primary)
# Geolocation  | Compliance, localization    | Yes
# Geoproximity | Distance-based with bias    | Yes
# Multivalue   | Client-side load balancing  | Yes

A global SaaS app was deployed in us-east-1 and eu-west-1. Initially, all users hit us-east-1, and European users experienced 200ms latency. After switching to Route 53 latency-based routing with health checks, European users were automatically routed to eu-west-1 (30ms latency). When eu-west-1 had an outage, health checks marked it unhealthy after three consecutive failures (about 90 seconds at a 30-second request interval), and Route 53 automatically routed all traffic to us-east-1 with no manual intervention.

Use Alias records (free) to point to AWS resources — they work at the zone apex. Use latency-based routing for multi-region apps. Use failover routing for DR. Always attach health checks to routing policies — Route 53 only removes unhealthy records if health checks are configured. Weighted routing is perfect for canary deployments (send 5% to new version).
⚠️ Common Mistake
// ❌ Failover routing WITHOUT health checks
// Primary: us-east-1 ALB (no health check)
// Secondary: eu-west-1 ALB
// us-east-1 goes down → Route 53 doesn't know!
// All traffic still goes to the dead primary
// Health checks are REQUIRED for failover to work

// ✅ Failover with health checks
// Primary: us-east-1 ALB + health check on /health
// Secondary: eu-west-1 ALB (no health check needed)
// us-east-1 goes down → health check fails (3 consecutive)
// Route 53 removes primary from DNS → traffic to secondary
// Recovery: primary comes back → health check passes → traffic returns
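
The failover behaviour described above can be modelled as a consecutive-result counter. The sketch below is a simplified, single-checker model (real Route 53 aggregates results from many checkers in different locations); the function names are illustrative assumptions, not a Route 53 API.

```python
def replay_health_checks(results: list[bool], failure_threshold: int = 3) -> bool:
    """Return the final health status after replaying check results.

    The endpoint starts healthy. Status flips only after `failure_threshold`
    consecutive results that disagree with the current status.
    """
    healthy = True
    streak = 0  # consecutive results contradicting the current status
    for ok in results:
        if ok == healthy:
            streak = 0
        else:
            streak += 1
            if streak >= failure_threshold:
                healthy = ok
                streak = 0
    return healthy


def failover_answer(primary_healthy: bool) -> str:
    """Which failover record is served for a given primary status."""
    return "primary" if primary_healthy else "secondary"


def detection_seconds(request_interval: int, failure_threshold: int) -> int:
    """Approximate time for one checker to declare an endpoint unhealthy."""
    return request_interval * failure_threshold


if __name__ == "__main__":
    outage = [True, False, False, False]
    print(failover_answer(replay_health_checks(outage)))
    print(detection_seconds(request_interval=30, failure_threshold=3), "seconds")
```

A single failed check does not trigger failover in this model, which mirrors why a short blip on the primary does not shift traffic. Clients also keep using cached answers until the record's TTL expires.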
🔁 Follow-Up Question

What is the difference between a CNAME record and an Alias record? Why can't you use CNAME at the zone apex?

09 What is the difference between ALB, NLB, and CLB? When do you use each? intermediate

AWS offers three types of Elastic Load Balancers (ELB):

Application Load Balancer (ALB) — Layer 7 (HTTP/HTTPS):

  • Routes based on URL path (/api/* → backend, /images/* → static), hostname (api.example.com vs www.example.com), HTTP headers, and query strings.
  • Supports WebSocket, HTTP/2, gRPC.
  • Target types: EC2 instances, IP addresses, Lambda functions, containers (ECS/EKS).
  • Built-in features: sticky sessions, authentication (Cognito/OIDC), request/response modification, WAF integration.
  • Best for: web applications, microservices, container-based architectures.

Network Load Balancer (NLB) — Layer 4 (TCP/UDP/TLS):

  • Routes based on IP + port only. Does not inspect HTTP content.
  • Ultra-low latency (on the order of microseconds, versus milliseconds for ALB).
  • Handles millions of requests/sec with static IP addresses.
  • Supports TLS termination, TCP pass-through, and UDP.
  • Preserves client source IP (ALB replaces it with its own).
  • Best for: TCP/UDP services, gaming, IoT, extreme performance, static IPs.

Classic Load Balancer (CLB) — Layer 4 + basic Layer 7 (legacy):

  • Do not use for new projects. AWS recommends migrating to ALB or NLB.
  • Limited feature set. No path-based routing, no host-based routing.
# ── CloudFormation: ALB with path-based routing ──
# Resources:
#   ALB:
#     Type: AWS::ElasticLoadBalancingV2::LoadBalancer
#     Properties:
#       Type: application
#       Scheme: internet-facing
#       Subnets: [!Ref PublicSubnet1, !Ref PublicSubnet2]
#       SecurityGroups: [!Ref ALBSG]
#
#   Listener:
#     Type: AWS::ElasticLoadBalancingV2::Listener
#     Properties:
#       LoadBalancerArn: !Ref ALB
#       Port: 443
#       Protocol: HTTPS
#       Certificates:
#         - CertificateArn: !Ref SSLCert
#       DefaultActions:
#         - Type: forward
#           TargetGroupArn: !Ref WebTG
#
#   APIRule:
#     Type: AWS::ElasticLoadBalancingV2::ListenerRule
#     Properties:
#       ListenerArn: !Ref Listener
#       Priority: 10
#       Conditions:
#         - Field: path-pattern
#           Values: ["/api/*"]
#       Actions:
#         - Type: forward
#           TargetGroupArn: !Ref APITG
#
#   WebTG:
#     Type: AWS::ElasticLoadBalancingV2::TargetGroup
#     Properties:
#       VpcId: !Ref VPC
#       Port: 80
#       Protocol: HTTP
#       TargetType: instance
#       HealthCheckPath: /health
#       HealthCheckIntervalSeconds: 15
#       HealthyThresholdCount: 2
#       UnhealthyThresholdCount: 3

# ── AWS CLI: Create NLB with static IP ──
aws elbv2 create-load-balancer \
    --name my-nlb \
    --type network \
    --subnets subnet-aaa subnet-bbb

# NLB gets one static IP per AZ
# Useful for: DNS whitelisting, firewall rules, clients that can't do DNS

# ── Comparison ──
# Feature         | ALB              | NLB              | CLB
# Layer           | 7 (HTTP/HTTPS)   | 4 (TCP/UDP)      | 4 + basic 7
# Latency         | ~ms              | ~μs              | ~ms
# Path routing    | ✅               | ❌               | ❌
# WebSocket       | ✅               | ✅ (TCP)         | ❌
# Static IP       | ❌ (use GA)      | ✅               | ❌
# Lambda target   | ✅               | ❌               | ❌
# Client IP       | X-Forwarded-For  | Preserved        | X-Forwarded-For
# WAF             | ✅               | ❌               | ❌
# Cost            | $$               | $$               | $

A microservices app needed path-based routing (/api → API service, /auth → auth service, / → frontend) with WAF protection. ALB handled this perfectly with listener rules. Later, they added a real-time gaming service that needed TCP connections with ultra-low latency and static IPs for firewall whitelisting. An NLB was added for that service. Both load balancers ran in parallel — ALB for HTTP traffic, NLB for TCP.

Use ALB for web apps and microservices — it has path/host routing, WAF, and Lambda targets. Use NLB for TCP/UDP, ultra-low latency, static IPs, or non-HTTP protocols. Never use CLB for new projects. ALB replaces client IP (use X-Forwarded-For header). NLB preserves client IP natively.
⚠️ Common Mistake
// ❌ Using NLB when you need path-based routing
// NLB only sees IP + port, so it can't route /api/* vs /web/*
// All traffic goes to the same target group
// Need separate NLBs per service (expensive, complex)
// Can't attach WAF to NLB

// ✅ Use ALB for HTTP routing
// /api/* → API target group (ECS Fargate)
// /auth/* → Auth target group
// /* → Frontend target group
// One ALB, multiple listener rules, WAF attached
// Use NLB only when you need Layer 4 or static IPs
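
The listener-rule behaviour above (lowest priority number wins, default action last) can be sketched locally. This is an approximation under stated assumptions: `fnmatchcase` stands in for ALB's `*` and `?` wildcards (it also accepts `[...]`, which ALB does not), and the target group names are hypothetical.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group): hypothetical values for illustration.
RULES = [
    (20, "/auth/*", "auth-tg"),
    (10, "/api/*", "api-tg"),
]
DEFAULT_TARGET = "frontend-tg"  # the listener's default action


def route(path: str, rules=RULES, default: str = DEFAULT_TARGET) -> str:
    """Return the target group for a request path.

    Rules are evaluated in ascending priority order; the first match wins.
    Matching is case-sensitive, as ALB path patterns are.
    """
    for _priority, pattern, target in sorted(rules):
        if fnmatchcase(path, pattern):
            return target
    return default


if __name__ == "__main__":
    for path in ["/api/orders", "/auth/login", "/index.html", "/api"]:
        print(f"{path} -> {route(path)}")
```

Note that `/api` without a trailing segment falls through to the default in this sketch, because the pattern `/api/*` requires the slash. In practice that is a common reason to add a second pattern for the bare path.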
🔁 Follow-Up Question

What is cross-zone load balancing? How does it differ between ALB and NLB?

10 How does EC2 Auto Scaling work? Explain launch templates, scaling policies, and cooldowns. intermediate

EC2 Auto Scaling automatically adjusts the number of EC2 instances based on demand, ensuring availability and cost optimization.

Components:

  • Launch Template — defines the instance configuration (AMI, instance type, key pair, security groups, user data). Replaces the older Launch Configuration. Supports versioning.
  • Auto Scaling Group (ASG) — manages the fleet. Defines min, max, and desired capacity. Spans multiple AZs for HA.
  • Scaling Policies — rules that trigger scaling actions.

Scaling Policy Types:

  • Target Tracking — simplest. Set a target (e.g., "keep average CPU at 50%"). ASG adds/removes instances to maintain the target. Recommended for most cases.
  • Step Scaling — define steps: if CPU > 70% add 2 instances, if CPU > 90% add 4. More control than target tracking.
  • Scheduled — scale at specific times (e.g., scale up at 9 AM, down at 6 PM). For predictable traffic patterns.
  • Predictive — uses ML to forecast traffic and pre-scale. Combines with target tracking.

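Cooldown: After a simple scaling action, the ASG waits (default 300 seconds) before acting again, which prevents rapid scaling oscillation. Target tracking and step scaling policies rely on an instance warmup period instead of the cooldown.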

Health Checks: ASG uses EC2 status checks (default) or ELB health checks. Unhealthy instances are terminated and replaced.
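
As a mental model for target tracking, the capacity needed to bring a metric back to its target is roughly proportional to how far the metric is from that target. The sketch below is a simplified approximation, not the algorithm AWS actually runs; the function name and parameters are illustrative.

```python
import math


def proportional_desired_capacity(
    current_capacity: int,
    metric_value: float,
    target_value: float,
    min_size: int,
    max_size: int,
) -> int:
    """Estimate desired capacity so the average metric returns to target.

    Assumes load spreads evenly, so the metric scales inversely with the
    number of instances. The result is clamped to the ASG's min/max.
    """
    raw = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # 3 instances at 80% average CPU, target 50%, ASG bounds 2..10
    print(proportional_desired_capacity(3, 80.0, 50.0, 2, 10))
```

The clamp matters as much as the formula: if the estimate exceeds max size, the group stops scaling and the metric stays above target, which is a signal to revisit the max.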

# ── Create Launch Template ──
aws ec2 create-launch-template \
    --launch-template-name WebServerTemplate \
    --version-description "v1" \
    --launch-template-data '{
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m7g.large",
        "KeyName": "my-key",
        "SecurityGroupIds": ["sg-abc123"],
        "UserData": "BASE64_ENCODED_USER_DATA"
    }'

# User data script (base64-encode before passing):
# #!/bin/bash
# yum update -y
# yum install -y httpd
# systemctl start httpd

# ── Create Auto Scaling Group ──
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name WebASG \
    --launch-template LaunchTemplateName=WebServerTemplate,Version=\$Latest \
    --min-size 2 --max-size 10 --desired-capacity 3 \
    --vpc-zone-identifier "subnet-aaa,subnet-bbb" \
    --target-group-arns arn:aws:elasticloadbalancing:...:targetgroup/WebTG/... \
    --health-check-type ELB \
    --health-check-grace-period 300

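# ── Target Tracking Policy (recommended) ──
# Note: EC2 Auto Scaling target tracking does not accept ScaleInCooldown /
# ScaleOutCooldown (those belong to Application Auto Scaling). Use
# --estimated-instance-warmup to control how soon new instances count.
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name WebASG \
    --policy-name CPUTargetTracking \
    --policy-type TargetTrackingScaling \
    --estimated-instance-warmup 60 \
    --target-tracking-configuration '{
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "DisableScaleIn": false
    }'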

# ── Scheduled Scaling (predictable traffic) ──
aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name WebASG \
    --scheduled-action-name ScaleUpMorning \
    --recurrence "0 9 * * MON-FRI" \
    --desired-capacity 8

aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name WebASG \
    --scheduled-action-name ScaleDownEvening \
    --recurrence "0 18 * * MON-FRI" \
    --desired-capacity 3

# ── CloudFormation: ASG with Target Tracking ──
# Resources:
#   ASG:
#     Type: AWS::AutoScaling::AutoScalingGroup
#     Properties:
#       LaunchTemplate:
#         LaunchTemplateId: !Ref LaunchTemplate
#         Version: !GetAtt LaunchTemplate.LatestVersionNumber
#       MinSize: 2
#       MaxSize: 10
#       DesiredCapacity: 3
#       VPCZoneIdentifier: [!Ref SubnetA, !Ref SubnetB]
#       TargetGroupARNs: [!Ref WebTG]
#       HealthCheckType: ELB
#       HealthCheckGracePeriod: 300

An e-commerce site ran 4 EC2 instances 24/7 — overprovisioned for 80% of the day, underprovisioned during flash sales. After implementing Auto Scaling with target tracking (CPU target 50%) + scheduled scaling (pre-scale for known sales), the fleet ranged from 3 instances at night to 15 during sales. Monthly EC2 cost dropped 40% while performance improved during peaks.

Use target tracking for most workloads — it's simple and effective. Set a CPU target of 40-60% for headroom. Use scheduled scaling for predictable patterns. Set different cooldowns: short for scale-out (60s, react fast) and longer for scale-in (300s, avoid thrashing). Always use ELB health checks, not just EC2 status checks. Span at least 2 AZs.
⚠️ Common Mistake
// ❌ Min=1, single AZ, EC2 health check only
// One instance in one AZ: no redundancy
// App crash → EC2 status "running" → ASG thinks it's healthy
// AZ outage → all capacity lost, no other AZ to scale into
// Scale-out cooldown = 300s → 5 minutes to react to traffic spike

// ✅ Min=2, multi-AZ, ELB health check
// 2+ instances across 2+ AZs: always available
// ELB health check: /health endpoint fails → instance replaced
// Scale-out cooldown: 60s → react in 1 minute
// Scale-in cooldown: 300s → avoid premature termination
// health-check-grace-period: 300s → let new instances warm up
🔁 Follow-Up Question

What is a warm pool in Auto Scaling? How does it reduce scale-out latency?

11 What is the difference between RDS Multi-AZ and Read Replicas? When do you use Aurora? intermediate

RDS (Relational Database Service) provides managed databases (MySQL, PostgreSQL, SQL Server, Oracle, MariaDB). Two key high-availability features:

Multi-AZ Deployment:

  • Creates a synchronous standby replica in another AZ.
  • Purpose: high availability and failover — NOT for read scaling.
  • Automatic failover in 60-120 seconds if primary fails (AZ outage, hardware failure, patching).
  • Standby is not accessible for reads (standby only).
  • Same endpoint — DNS automatically switches to standby on failover.

Read Replicas:

  • Creates asynchronous copies for read scaling.
  • Up to 15 Aurora Replicas per cluster. For RDS, up to 15 Read Replicas per source for MySQL, MariaDB, and PostgreSQL, and 5 for Oracle and SQL Server.
  • Each replica has its own endpoint — application must direct read traffic to replicas.
  • Can be in the same Region, cross-Region, or cross-account.
  • Can be promoted to standalone database (for migration or DR).
  • Replication lag: typically seconds, but can increase under heavy write load.

Amazon Aurora:

  • AWS-designed, cloud-native database (MySQL/PostgreSQL compatible).
  • Storage: auto-scales up to 128 TB, replicated 6 ways across 3 AZs.
  • AWS claims up to 5x the throughput of standard MySQL and 3x that of standard PostgreSQL.
  • Aurora Replicas share the same storage — near-zero replication lag.
  • Aurora Serverless v2: auto-scales compute (ACUs) based on load. Pay per second.
# ── Create RDS Multi-AZ instance ──
aws rds create-db-instance \
    --db-instance-identifier my-db \
    --db-instance-class db.r7g.large \
    --engine postgres \
    --master-username dbadmin \
    --master-user-password "****" \
    --allocated-storage 100 \
    --multi-az \
    --storage-encrypted \
    --vpc-security-group-ids sg-abc123 \
    --db-subnet-group-name my-db-subnets

# ── Create Read Replica ──
aws rds create-db-instance-read-replica \
    --db-instance-identifier my-db-replica \
    --source-db-instance-identifier my-db \
    --db-instance-class db.r7g.large \
    --availability-zone us-east-1b

# ── Create Aurora Cluster ──
aws rds create-db-cluster \
    --db-cluster-identifier my-aurora \
    --engine aurora-postgresql \
    --engine-version 15.4 \
    --master-username dbadmin \
    --master-user-password "****" \
    --vpc-security-group-ids sg-abc123 \
    --db-subnet-group-name my-db-subnets \
    --storage-encrypted

# Add Aurora instances (writer + reader)
aws rds create-db-instance \
    --db-instance-identifier my-aurora-writer \
    --db-cluster-identifier my-aurora \
    --db-instance-class db.r7g.large \
    --engine aurora-postgresql

aws rds create-db-instance \
    --db-instance-identifier my-aurora-reader \
    --db-cluster-identifier my-aurora \
    --db-instance-class db.r7g.large \
    --engine aurora-postgresql

# ── Aurora endpoints ──
# Writer endpoint: my-aurora.cluster-xxxx.us-east-1.rds.amazonaws.com
# Reader endpoint: my-aurora.cluster-ro-xxxx.us-east-1.rds.amazonaws.com
# Reader endpoint auto-load-balances across all Aurora Replicas

# ── Comparison ──
# Feature        | Multi-AZ          | Read Replica      | Aurora
# Purpose        | High availability | Read scaling      | Both
# Replication    | Synchronous       | Asynchronous      | Shared storage
# Failover       | Automatic (1-2m)  | Manual promotion  | Automatic (30s)
# Read traffic   | No (standby)      | Yes (own endpoint)| Yes (reader EP)
# Cross-Region   | No                | Yes               | Yes (Global DB)

A SaaS app hit a database bottleneck: the primary PostgreSQL RDS instance was at 90% CPU with read-heavy analytics queries competing with transactional writes. They created 2 Read Replicas and directed analytics queries to the replica endpoints. Primary CPU dropped to 35%. They also enabled Multi-AZ for the primary to survive AZ failures. Six months later, they migrated to Aurora PostgreSQL and got auto-scaling storage, 30-second failover (vs 120 seconds), a single reader endpoint, and near-zero replication lag.

Multi-AZ is for availability (automatic failover), Read Replicas are for read scaling (offload queries). You can use both together. Aurora gives you both plus better performance and auto-scaling storage. Use Aurora Reader endpoint for automatic read load balancing. Consider Aurora Serverless v2 for variable workloads to avoid paying for idle capacity.
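
Since each RDS Read Replica has its own endpoint, the application has to decide where each query goes. A minimal sketch of that read/write split is below; the class, the endpoint hostnames, and the "starts with SELECT" heuristic are all illustrative assumptions rather than a production-ready router.

```python
from itertools import cycle


class EndpointRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, writer: str, readers: list[str]):
        self.writer = writer
        # Round-robin over replicas; fall back to the writer if there are none.
        self._readers = cycle(readers) if readers else None

    def endpoint_for(self, sql: str) -> str:
        """Pick an endpoint using a naive read/write classification."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._readers is not None:
            return next(self._readers)
        return self.writer


if __name__ == "__main__":
    router = EndpointRouter(
        writer="my-db.example.internal",              # hypothetical hostname
        readers=["my-db-replica-1.example.internal",  # hypothetical hostnames
                 "my-db-replica-2.example.internal"],
    )
    for query in ["SELECT * FROM orders", "INSERT INTO orders VALUES (1)",
                  "select count(*) from orders"]:
        print(query, "->", router.endpoint_for(query))
```

Because replication is asynchronous, reads that must see a just-committed write should still go to the writer. With Aurora, the single reader endpoint makes the round-robin part unnecessary.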
⚠️ Common Mistake
// ❌ Using Multi-AZ standby for read scaling
// Multi-AZ standby is NOT accessible for reads
// All queries still hit the primary → CPU at 90%
// "But I enabled Multi-AZ, why isn't it faster?"
// Multi-AZ = failover, NOT performance

// ✅ Read Replicas for scaling + Multi-AZ for availability
// Primary (Multi-AZ): handles all writes
// Read Replica 1: analytics queries
// Read Replica 2: reporting dashboard
// Primary CPU: 35% (was 90%)
// Or use Aurora: shared storage + reader endpoint + auto-failover
🔁 Follow-Up Question

What is Aurora Global Database? How does it provide cross-region disaster recovery?

12 How does AWS Lambda work? Explain cold starts, concurrency, layers, and execution model. intermediate

AWS Lambda runs your code without provisioning servers. You pay only for compute time consumed.

Execution model:

  1. Event triggers Lambda (API Gateway, S3, SQS, CloudWatch, etc.).
  2. Lambda creates an execution environment (container) with your code + runtime.
  3. Your handler function runs and returns a response.
  4. The environment is frozen and kept warm for a while for potential reuse (the duration is not documented or guaranteed; it is often several minutes).
  5. Next invocation may reuse the warm environment (warm start) or create a new one (cold start).

Cold Start:

  • Time to create a new execution environment: download code, start runtime, run initialization.
  • Adds 100ms-2s+ latency depending on runtime (Python/Node fastest, Java/C# slowest) and package size.
  • Mitigations: Provisioned Concurrency (pre-warm environments), SnapStart (Java snapshot), smaller packages, keep functions warm.

Concurrency:

  • Each concurrent invocation uses one execution environment.
  • Account limit: 1,000 concurrent executions (default, can be increased).
  • Reserved Concurrency: guarantees capacity for a function (but limits it too).
  • Provisioned Concurrency: pre-creates warm environments — no cold starts. Costs money even when idle.
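
A quick way to reason about these limits is the standard estimate: concurrency is roughly requests per second multiplied by average duration in seconds. The helper below is a sketch of that arithmetic; the function names and the 1,000 default are taken from the limits listed above, and nothing here calls the Lambda API.

```python
import math

DEFAULT_ACCOUNT_LIMIT = 1_000  # default concurrent executions per Region


def required_concurrency(requests_per_second: float, avg_duration_s: float) -> int:
    """Estimate the execution environments needed at steady state."""
    # round() guards against float noise before taking the ceiling
    return math.ceil(round(requests_per_second * avg_duration_s, 9))


def fits_account_limit(needed: int, reserved_elsewhere: int = 0,
                       account_limit: int = DEFAULT_ACCOUNT_LIMIT) -> bool:
    """Check whether the estimate fits in the unreserved account pool."""
    return needed <= account_limit - reserved_elsewhere


if __name__ == "__main__":
    needed = required_concurrency(requests_per_second=200, avg_duration_s=0.5)
    print("estimated concurrency:", needed)
    print("fits default limit:", fits_account_limit(needed, reserved_elsewhere=100))
```

The same estimate helps size Provisioned Concurrency: provisioning far above it pays for idle environments, while provisioning below it means the overflow still sees cold starts.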

Layers: Shared code/libraries packaged separately. Up to 5 layers per function. Useful for common dependencies (numpy, SDK, custom utils).

Limits: 15 minutes max timeout, 10 GB memory, 250 MB deployment package (unzipped), 512 MB /tmp storage (configurable to 10 GB).

# ── Create a Lambda function ──
aws lambda create-function \
    --function-name ProcessOrder \
    --runtime python3.12 \
    --handler app.handler \
    --role arn:aws:iam::123456789012:role/LambdaExecRole \
    --zip-file fileb://function.zip \
    --timeout 30 \
    --memory-size 512 \
    --environment Variables="{DB_HOST=mydb.cluster-xxx.rds.amazonaws.com}"

# ── Python Lambda handler ──
# import json
# import boto3
#
# # Initialization code runs ONCE per cold start (reused on warm starts)
# dynamodb = boto3.resource("dynamodb")
# table = dynamodb.Table("Orders")
#
# def handler(event, context):
#     """Triggered by API Gateway POST /orders"""
#     body = json.loads(event["body"])
#     table.put_item(Item={
#         "orderId": body["id"],
#         "amount": body["amount"],
#         "status": "pending"
#     })
#     return {
#         "statusCode": 201,
#         "body": json.dumps({"message": "Order created"})
#     }

# ── Set Provisioned Concurrency (no cold starts) ──
aws lambda put-provisioned-concurrency-config \
    --function-name ProcessOrder \
    --qualifier prod \
    --provisioned-concurrent-executions 50

# ── Set Reserved Concurrency (limit + guarantee) ──
aws lambda put-function-concurrency \
    --function-name ProcessOrder \
    --reserved-concurrent-executions 100

# ── Create a Layer (shared dependencies) ──
# zip -r layer.zip python/  # python/lib/python3.12/site-packages/...
aws lambda publish-layer-version \
    --layer-name common-utils \
    --zip-file fileb://layer.zip \
    --compatible-runtimes python3.12

# Attach layer to function
aws lambda update-function-configuration \
    --function-name ProcessOrder \
    --layers arn:aws:lambda:us-east-1:123456789012:layer:common-utils:1

# ── CloudFormation: Lambda + API Gateway ──
# Resources:
#   ProcessOrderFn:
#     Type: AWS::Lambda::Function
#     Properties:
#       FunctionName: ProcessOrder
#       Runtime: python3.12
#       Handler: app.handler
#       Code:
#         S3Bucket: my-deployment-bucket
#         S3Key: function.zip
#       MemorySize: 512
#       Timeout: 30
#       Role: !GetAtt LambdaRole.Arn
#       Environment:
#         Variables:
#           TABLE_NAME: !Ref OrdersTable

A payment processing Lambda had 2-second cold starts on Java runtime with a 50MB package. P99 latency was 3 seconds — unacceptable for checkout. The team applied three fixes: (1) switched to Python for the API handler (cold start dropped to 200ms), (2) moved heavy shared libraries to a Layer (reduced package size), (3) enabled Provisioned Concurrency with 50 instances during business hours. P99 dropped to 80ms.

Initialize SDK clients and DB connections outside the handler (reused on warm starts). Use Python/Node for latency-sensitive functions (fastest cold starts). Use Provisioned Concurrency for production APIs that can't tolerate cold starts. Keep packages small — use Layers for shared dependencies. Set reserved concurrency to prevent a runaway function from consuming your entire account limit.
⚠️ Common Mistake
// ❌ Initializing SDK clients inside the handler
// def handler(event, context):
//     dynamodb = boto3.resource("dynamodb")  # Created EVERY invocation!
//     table = dynamodb.Table("Orders")       # Connection overhead each time
//     ...
// Even on warm starts, you waste 50-100ms creating new connections

// ✅ Initialize outside the handler: reused on warm starts
// dynamodb = boto3.resource("dynamodb")  # Created once per cold start
// table = dynamodb.Table("Orders")       # Reused for all subsequent calls
//
// def handler(event, context):
//     table.put_item(...)                # Reuses existing connection
//     ...
// Warm start: 0ms initialization overhead
🔁 Follow-Up Question

What is Lambda SnapStart and how does it differ from Provisioned Concurrency?

13 How do you secure an S3 bucket? Explain bucket policies, Block Public Access, encryption, and presigned URLs. intermediate

S3 security operates at multiple layers:

Block Public Access (account or bucket level):

  • Four settings that override any policy or ACL that would make a bucket/object public.
  • Always enable all four at the account level unless you specifically need public access (like a static website).
  • This is the #1 S3 security setting — prevents accidental public exposure.

Bucket Policies (resource-based):

  • JSON policies attached to the bucket. Control who can access the bucket and its objects.
  • Can grant cross-account access, require encryption, restrict by IP, enforce HTTPS.

IAM Policies (identity-based):

  • Attached to IAM users/roles. Define what S3 actions they can perform.
  • Within the same account, access is granted if either the IAM policy or the bucket policy allows it and neither explicitly denies it. For cross-account access, both must allow.
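
How identity-based and resource-based policies combine can be sketched as a small decision function. This is a deliberately simplified model: it ignores SCPs, permission boundaries, session policies, ACLs, and Block Public Access, and the function names are illustrative rather than part of any AWS SDK. Each argument is the set of effects ("Allow" or "Deny") that matched the request in that policy.

```python
def same_account_decision(iam_effects: set[str], bucket_effects: set[str]) -> str:
    """Same-account request: one Allow is enough unless anything denies."""
    effects = iam_effects | bucket_effects
    if "Deny" in effects:
        return "Deny"      # an explicit deny always wins
    if "Allow" in effects:
        return "Allow"     # either policy may grant access
    return "Deny"          # implicit deny: nothing matched


def cross_account_decision(iam_effects: set[str], bucket_effects: set[str]) -> str:
    """Cross-account request: both sides must allow, and nothing may deny."""
    if "Deny" in iam_effects | bucket_effects:
        return "Deny"
    if "Allow" in iam_effects and "Allow" in bucket_effects:
        return "Allow"
    return "Deny"


if __name__ == "__main__":
    # IAM role allows s3:GetObject, bucket policy is silent
    print("same account: ", same_account_decision({"Allow"}, set()))
    print("cross account:", cross_account_decision({"Allow"}, set()))
```

The explicit-deny branch is what makes the "deny non-HTTPS" and "deny unencrypted uploads" bucket policy statements effective regardless of what any IAM policy allows.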

Encryption:

  • SSE-S3 — AWS manages keys. Simplest. Default for new buckets.
  • SSE-KMS — AWS KMS manages keys. Audit trail in CloudTrail. Key rotation. Cross-account control.
  • SSE-C — you provide the key with each request. You manage key storage.
  • Client-side — encrypt before uploading. Maximum control.

Presigned URLs: temporary URLs that grant time-limited access to private objects. Generated by the server, shared with clients. No AWS credentials needed by the client.

# ── Enable Block Public Access (account level) ──
aws s3control put-public-access-block \
    --account-id 123456789012 \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,\
BlockPublicPolicy=true,RestrictPublicBuckets=true

# ── Bucket Policy: Enforce HTTPS + encryption ──
# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Sid": "DenyHTTP",
#       "Effect": "Deny",
#       "Principal": "*",
#       "Action": "s3:*",
#       "Resource": [
#         "arn:aws:s3:::my-bucket",
#         "arn:aws:s3:::my-bucket/*"
#       ],
#       "Condition": {
#         "Bool": { "aws:SecureTransport": "false" }
#       }
#     },
#     {
#       "Sid": "DenyUnencrypted",
#       "Effect": "Deny",
#       "Principal": "*",
#       "Action": "s3:PutObject",
#       "Resource": "arn:aws:s3:::my-bucket/*",
#       "Condition": {
#         "StringNotEquals": {
#           "s3:x-amz-server-side-encryption": "aws:kms"
#         }
#       }
#     }
#   ]
# }

# ── Enable default encryption ──
aws s3api put-bucket-encryption \
    --bucket my-bucket \
    --server-side-encryption-configuration '{
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:123:key/abc-123"
            },
            "BucketKeyEnabled": true
        }]
    }'

# ── Generate Presigned URL (temporary access) ──
aws s3 presign s3://my-bucket/reports/q4-2025.pdf \
    --expires-in 3600  # 1 hour

# Python boto3:
# s3 = boto3.client("s3")
# url = s3.generate_presigned_url(
#     "get_object",
#     Params={"Bucket": "my-bucket", "Key": "reports/q4.pdf"},
#     ExpiresIn=3600  # seconds
# )
# print(url)  # Share this URL — no AWS credentials needed

# ── Enable versioning (protection against accidental deletes) ──
aws s3api put-bucket-versioning \
    --bucket my-bucket \
    --versioning-configuration Status=Enabled

# ── Enable access logging ──
aws s3api put-bucket-logging --bucket my-bucket \
    --bucket-logging-status '{
        "LoggingEnabled": {
            "TargetBucket": "my-logs-bucket",
            "TargetPrefix": "s3-access-logs/"
        }
    }'

A company's S3 bucket containing customer PII was found publicly accessible — a developer had set a bucket policy with Principal: * to test and forgot to remove it. Block Public Access was not enabled. After the incident: (1) Block Public Access enabled at the account level for all buckets, (2) SCPs in AWS Organizations prevented any user from disabling it, (3) SSE-KMS encryption enforced via bucket policy, (4) S3 Access Analyzer enabled to detect any future public or cross-account access.

Enable Block Public Access at the account level — it's the single most important S3 security setting. Enforce encryption with bucket policies (deny unencrypted uploads). Use SSE-KMS for audit trails and key management. Use presigned URLs for temporary access to private objects. Enable versioning to protect against accidental deletes. Enable S3 Access Analyzer to detect overly permissive policies.
⚠️ Common Mistake
// ❌ Bucket policy with Principal: "*" (public access)
// {
//   "Effect": "Allow",
//   "Principal": "*",   ← Anyone on the internet!
//   "Action": "s3:GetObject",
//   "Resource": "arn:aws:s3:::customer-data/*"
// }
// Customer PII accessible to the entire internet
// Block Public Access not enabled → policy takes effect

// ✅ Block Public Access + specific principal
// Account-level Block Public Access: ALL FOUR enabled
// Bucket policy: specific IAM roles only
// {
//   "Principal": {"AWS": "arn:aws:iam::123:role/AppRole"},
//   "Action": "s3:GetObject"
// }
// + SSE-KMS encryption required
// + Deny non-HTTPS requests
// + S3 Access Analyzer monitoring
🔁 Follow-Up Question

What is S3 Access Analyzer and how does it detect unintended public or cross-account access?

14 How does CloudWatch work? Explain metrics, alarms, logs, and CloudWatch Logs Insights. intermediate

CloudWatch is AWS's monitoring and observability service with four key components:

Metrics:

  • Time-series data points from AWS services (CPU, network, disk) and your applications (custom metrics).
  • Standard resolution: 1-minute granularity. Note that EC2 basic monitoring (free) publishes at 5-minute intervals; 1-minute detailed monitoring costs extra.
  • High resolution: up to 1-second intervals (custom metrics, extra cost).
  • Namespaces: AWS/EC2, AWS/RDS, AWS/Lambda, or custom (e.g., MyApp/Orders).
  • Dimensions: key-value pairs to filter metrics (InstanceId, LoadBalancer, FunctionName).

Alarms:

  • Watch a metric and trigger actions when thresholds are crossed.
  • States: OK, ALARM, and INSUFFICIENT_DATA (an alarm can move between any of them).
  • Actions: SNS notification, Auto Scaling policy, EC2 action (stop/terminate/reboot), Lambda invocation.
  • Composite Alarms: combine multiple alarms with AND/OR logic to reduce alarm noise.
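
The way an alarm combines period, evaluation periods, and threshold can be sketched as follows. This is a simplified model assuming a GreaterThanThreshold comparison where every one of the last N datapoints must breach (no "M out of N" setting, no missing-data handling); the function name is illustrative.

```python
def alarm_state(datapoints: list[float], threshold: float,
                evaluation_periods: int) -> str:
    """Evaluate a GreaterThanThreshold alarm over the most recent periods.

    `datapoints` holds one aggregated value per period, oldest first.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"  # not enough periods to decide yet
    if all(value > threshold for value in recent):
        return "ALARM"              # every evaluated period breached
    return "OK"


if __name__ == "__main__":
    # Average CPU per 5-minute period, threshold 80, 2 evaluation periods
    print(alarm_state([70.0, 85.0, 90.0], threshold=80, evaluation_periods=2))
    print(alarm_state([85.0, 70.0], threshold=80, evaluation_periods=2))
    print(alarm_state([85.0], threshold=80, evaluation_periods=2))
```

With a 300-second period and 2 evaluation periods, the metric has to stay above the threshold for about 10 minutes before the alarm fires, which is the trade-off between alert speed and noise.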

Logs:

  • Log Groups: container for log streams (e.g., /aws/lambda/ProcessOrder).
  • Log Streams: individual sources (each Lambda container, each EC2 instance).
  • Agents: CloudWatch Agent (EC2), automatic (Lambda, ECS).
  • Retention: configurable 1 day to 10 years (or never expire). Default: never expire (costly!).

Logs Insights: a purpose-built query language with pipe-separated commands (fields, filter, stats, sort, limit) for searching and analyzing logs. Much faster than manual searching.

# ── Put custom metric ──
aws cloudwatch put-metric-data \
    --namespace "MyApp/Orders" \
    --metric-name OrderCount \
    --value 1 \
    --unit Count \
    --dimensions Environment=prod,Service=checkout

# ── Python boto3: Custom metric ──
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_data(
#     Namespace="MyApp/Orders",
#     MetricData=[{
#         "MetricName": "OrderProcessingTime",
#         "Value": 245.5,
#         "Unit": "Milliseconds",
#         "Dimensions": [
#             {"Name": "Service", "Value": "checkout"},
#             {"Name": "Environment", "Value": "prod"}
#         ]
#     }]
# )

# ── Create alarm: average CPU > 80% for 10 minutes (2 consecutive 5-minute periods) ──
aws cloudwatch put-metric-alarm \
    --alarm-name HighCPU \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=InstanceId,Value=i-abc123 \
    --alarm-actions arn:aws:sns:us-east-1:123:ops-alerts

# ── CloudWatch Logs Insights: Find errors ──
# fields @timestamp, @message
# | filter @message like /ERROR|Exception/
# | sort @timestamp desc
# | limit 50

# ── Logs Insights: P99 latency by API path ──
# fields @timestamp, path, latency
# | stats percentile(latency, 99) as p99,
#         percentile(latency, 50) as p50,
#         count() as requests
#   by path
# | sort p99 desc

# ── Set log retention (default is "never expire"!) ──
aws logs put-retention-policy \
    --log-group-name /aws/lambda/ProcessOrder \
    --retention-in-days 30

# ── CloudFormation: Alarm + SNS ──
# Resources:
#   HighCPUAlarm:
#     Type: AWS::CloudWatch::Alarm
#     Properties:
#       AlarmName: HighCPU
#       MetricName: CPUUtilization
#       Namespace: AWS/EC2
#       Statistic: Average
#       Period: 300
#       EvaluationPeriods: 2
#       Threshold: 80
#       ComparisonOperator: GreaterThanThreshold
#       AlarmActions:
#         - !Ref OpsAlertsTopic
#       Dimensions:
#         - Name: AutoScalingGroupName
#           Value: !Ref ASG
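
The alarm settings above are easy to misread: with the default evaluation, the metric has to breach the threshold in every one of the consecutive evaluation periods, so the shortest breach that can trip the alarm is period × evaluation periods. The Python sketch below models only that rule. The function names and the simplified treatment of missing data are assumptions for illustration, not CloudWatch's actual evaluation engine, which also supports M-out-of-N datapoints and configurable missing-data handling.

```python
from typing import Optional, Sequence


def alarm_state(datapoints: Sequence[Optional[float]],
                threshold: float,
                evaluation_periods: int) -> str:
    """Return the alarm state for the most recent `evaluation_periods` datapoints.

    Each datapoint is one period's aggregated statistic (for example a
    300-second Average). None stands for a period with no data.
    """
    recent = list(datapoints)[-evaluation_periods:]
    # Simplification: any gap in the window counts as insufficient data.
    if len(recent) < evaluation_periods or any(p is None for p in recent):
        return "INSUFFICIENT_DATA"
    # GreaterThanThreshold: every period in the window must breach.
    if all(p > threshold for p in recent):
        return "ALARM"
    return "OK"


def minimum_breach_seconds(period: int, evaluation_periods: int) -> int:
    """Shortest sustained breach that can move the alarm to ALARM."""
    return period * evaluation_periods


if __name__ == "__main__":
    cpu_averages = [45.0, 91.0, 62.0, 85.0, 88.0]  # one value per 5-minute period
    print(alarm_state(cpu_averages, threshold=80, evaluation_periods=2))
    print(minimum_breach_seconds(period=300, evaluation_periods=2))
```

With `--period 300` and `--evaluation-periods 2`, a single 5-minute spike (the 91.0 above) never reaches ALARM on its own; two consecutive breaching periods do.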

A team had no log retention policy, so its log groups grew to 2 TB over 18 months. At $0.03/GB-month that was about $60/month in storage, and the bill kept growing as logs accumulated. Most logs were older than 30 days and never looked at. Setting retention to 30 days across all log groups reduced storage costs by 95%. They also created a Logs Insights dashboard for real-time error monitoring, replacing the old manual grep-through-console approach.

Always set log retention policies — the default "never expire" is expensive. Use Composite Alarms to reduce alert fatigue. Publish custom metrics for business KPIs (order count, revenue, error rate). Use Logs Insights for fast log analysis — it's much better than the old filter pattern approach. Consider CloudWatch Contributor Insights for top-N analysis (top IPs, top error codes).
⚠️ Common Mistake
// ❌ No log retention → unlimited storage costs
// /aws/lambda/ProcessOrder: 500 GB (18 months of logs)
// /aws/lambda/SendEmail:    200 GB
// /aws/ecs/WebApp:          1.3 TB
// Total: 2 TB × $0.03/GB = $60/month, growing every day
// Nobody reads logs older than a week

// ✅ Set retention on ALL log groups
// aws logs put-retention-policy --log-group-name <group> --retention-in-days 30
// Production critical: 90 days
// Development: 7 days
// Archive to S3 Glacier if long-term compliance needed
// Automate: use a Lambda to set retention on new log groups
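
One way to automate the retention rule is a scheduled sweep that finds log groups with no retention setting and applies one. The sketch below is a minimal version under stated assumptions: the "prod"/"dev" substring convention and the function names are hypothetical, and `logs_client` is expected to behave like `boto3.client("logs")`. The client is passed in so the logic can be exercised without AWS credentials.

```python
from typing import Any, Dict, List

# Retention tiers from the guidance above. Matching on "prod"/"dev"
# substrings is an assumed naming convention, not an AWS feature.
DEFAULT_DAYS = 30
PROD_DAYS = 90
DEV_DAYS = 7


def pick_retention(log_group_name: str) -> int:
    """Choose a retention period from the log group's name."""
    name = log_group_name.lower()
    if "prod" in name:
        return PROD_DAYS
    if "dev" in name:
        return DEV_DAYS
    return DEFAULT_DAYS


def enforce_retention(logs_client: Any) -> List[Dict[str, Any]]:
    """Set a retention policy on every log group that has none.

    Returns the list of changes made, for logging or auditing.
    """
    changes: List[Dict[str, Any]] = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page.get("logGroups", []):
            # "retentionInDays" is absent when the group never expires.
            if "retentionInDays" in group:
                continue
            days = pick_retention(group["logGroupName"])
            logs_client.put_retention_policy(
                logGroupName=group["logGroupName"], retentionInDays=days
            )
            changes.append({"logGroup": group["logGroupName"], "days": days})
    return changes


# Possible Lambda entry point (assumes boto3 is present in the runtime):
# def handler(event, context):
#     import boto3
#     return enforce_retention(boto3.client("logs"))
```

Groups that already have a retention value are left alone, so a team that deliberately chose a longer period is not overridden.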
🔁 Follow-Up Question

What is CloudWatch Contributor Insights and how does it help identify top contributors to operational issues?

15 What is the difference between SQS, SNS, and EventBridge? When do you use each? intermediate

AWS provides three core messaging/event services for decoupling architectures:

SQS (Simple Queue Service) — message queue:

  • Pull-based: consumers poll the queue for messages.
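  • Point-to-point: each message is meant to be processed by a single consumer; the visibility timeout hides an in-flight message from other consumers.
  • Standard Queue: at-least-once delivery (duplicates are possible), best-effort ordering, nearly unlimited throughput.
  • FIFO Queue: exactly-once processing via deduplication, strict ordering within a message group (300 msg/sec, or 3,000 with batching).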
  • Retention: 1 minute to 14 days (default 4 days).
  • Dead Letter Queue (DLQ): failed messages go here after X retries.
  • Best for: decoupling, work queues, buffering, rate limiting.

SNS (Simple Notification Service) — pub/sub:

  • Push-based: SNS pushes messages to all subscribers.
  • Fan-out: one message → multiple subscribers (SQS, Lambda, HTTP, email, SMS).
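  • No message persistence: SNS retries failed deliveries, but once retries are exhausted the message is dropped unless the subscription has a DLQ or the subscriber is an SQS queue that buffers it.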
  • Best for: fan-out pattern, notifications, alerts, event broadcasting.

EventBridge — event bus:

  • Event-driven architecture: routes events based on content (rules with patterns).
  • Integrates with 200+ AWS services and SaaS partners (Zendesk, Shopify, Datadog).
  • Schema Registry: auto-discovers event schemas for type safety.
  • Scheduling: cron/rate-based triggers (EventBridge is the successor to CloudWatch Events).
  • Best for: event-driven microservices, SaaS integrations, complex routing rules.
# ── SQS: Create queue + send message ──
aws sqs create-queue --queue-name OrderQueue

aws sqs send-message \
    --queue-url https://sqs.us-east-1.amazonaws.com/123/OrderQueue \
    --message-body '{"orderId":"123","amount":99.99}'

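# Receive + process + delete (Python boto3)
# sqs = boto3.client("sqs")
# resp = sqs.receive_message(QueueUrl=url, MaxNumberOfMessages=10, WaitTimeSeconds=20)  # long polling
# for msg in resp.get("Messages", []):   # "Messages" is absent when the queue is empty
#     process(msg["Body"])
#     sqs.delete_message(QueueUrl=url, ReceiptHandle=msg["ReceiptHandle"])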

# ── SNS → SQS Fan-out pattern ──
# SNS Topic: "OrderEvents"
#   ├── SQS: InventoryQueue (update stock)
#   ├── SQS: EmailQueue (send confirmation)
#   ├── SQS: AnalyticsQueue (track metrics)
#   └── Lambda: FraudCheck (real-time)
#
# One publish → 4 subscribers process independently

aws sns create-topic --name OrderEvents
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-1:123:OrderEvents \
    --protocol sqs \
    --notification-endpoint arn:aws:sqs:us-east-1:123:InventoryQueue

aws sns publish \
    --topic-arn arn:aws:sns:us-east-1:123:OrderEvents \
    --message '{"orderId":"123","status":"placed"}'

# ── EventBridge: Content-based routing ──
# Rule: Route "order.placed" events to Lambda
aws events put-rule \
    --name ProcessNewOrders \
    --event-pattern '{
        "source": ["com.myapp.orders"],
        "detail-type": ["OrderPlaced"],
        "detail": {
            "amount": [{"numeric": [">", 100]}]
        }
    }'

aws events put-targets --rule ProcessNewOrders \
    --targets "Id=1,Arn=arn:aws:lambda:us-east-1:123:function:ProcessOrder"

# ── Comparison ──
# Feature      | SQS              | SNS             | EventBridge
# Model        | Queue (pull)     | Pub/Sub (push)  | Event Bus (push)
# Consumers    | 1 per message    | Many (fan-out)  | Many (rules)
# Retention    | Up to 14 days    | None            | Archive + replay (opt-in)
# Ordering     | FIFO available   | FIFO available  | Not guaranteed
# Routing      | None             | Topic filter    | Content-based rules
# Best for     | Work queues      | Fan-out/alerts  | Event architecture

An e-commerce app had a monolithic order handler that sent emails, updated inventory, charged payments, and logged analytics — all synchronously. Any failure caused the entire order to fail. They decoupled it: order service publishes to SNS topic → SNS fans out to 4 SQS queues (email, inventory, payment, analytics). Each queue is processed independently by its own Lambda. If email service is down, orders still complete — email queue buffers messages for retry.

Use SQS when you need a buffer/work queue with one consumer. Use SNS for fan-out (one event → many consumers). Use SNS+SQS combo for reliable fan-out (SNS pushes to SQS queues so messages aren't lost). Use EventBridge when you need content-based routing rules or SaaS integrations. Always configure Dead Letter Queues (DLQ) for failed SQS messages.
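
The DLQ advice only works if consumers delete a message after it has been processed successfully, and not before. The sketch below shows that pattern under stated assumptions: `drain_once` and the `process` callback are hypothetical names, and `sqs_client` is expected to behave like `boto3.client("sqs")`. A message whose handler raises is left on the queue, becomes visible again after the visibility timeout, and moves to the DLQ once the queue's redrive policy limit (`maxReceiveCount`) is reached.

```python
from typing import Any, Callable, Dict


def drain_once(sqs_client: Any, queue_url: str,
               process: Callable[[str], None]) -> Dict[str, int]:
    """Receive one batch, process each message, delete only on success."""
    resp = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS maximum per receive call
        WaitTimeSeconds=20,      # long polling: fewer empty responses
    )
    stats = {"processed": 0, "failed": 0}
    # The "Messages" key is absent when the queue is empty.
    for msg in resp.get("Messages", []):
        try:
            process(msg["Body"])
        except Exception:
            # Do not delete: SQS redelivers after the visibility timeout,
            # and the redrive policy eventually routes it to the DLQ.
            stats["failed"] += 1
            continue
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
        )
        stats["processed"] += 1
    return stats
```

Because standard queues deliver at least once, the `process` callback should be idempotent: the same order message may arrive twice.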
⚠️ Common Mistake
// ❌ Synchronous processing — one failure breaks everything // function processOrder(order) { // chargePayment(order); // Fails → entire order fails // updateInventory(order); // Never runs // sendEmail(order); // Never runs // logAnalytics(order); // Never runs // } // Tightly coupled — can't scale services independently
// ✅ SNS + SQS fan-out — decoupled, resilient // publishToSNS("OrderPlaced", order); // → SQS PaymentQueue → PaymentLambda // → SQS InventoryQueue → InventoryLambda // → SQS EmailQueue → EmailLambda (can retry if down) // → SQS AnalyticsQueue → AnalyticsLambda // Each service scales and fails independently // DLQ catches any failed processing for manual review
🔁 Follow-Up Question

What is the SNS+SQS fan-out pattern? Why is it preferred over direct SNS → Lambda fan-out?

16 How does DynamoDB work? Explain partition keys, sort keys, GSIs, LSIs, and capacity modes. intermediate

DynamoDB is a fully managed NoSQL key-value and document database. Single-digit millisecond latency at any scale.

Primary Key Design:

  • Partition Key (PK) only — simple primary key. PK determines the physical partition where data is stored.
  • Partition Key + Sort Key (SK) — composite primary key. Same PK = same partition, SK orders items within. Enables range queries.
  • Key design is the most critical DynamoDB decision — it determines query patterns and performance.

Secondary Indexes:

  • GSI (Global Secondary Index): different partition key + optional sort key. Separate throughput. Eventually consistent. Up to 20 per table.
  • LSI (Local Secondary Index): same partition key, different sort key. Shares table throughput. Strongly consistent option. Must be created at table creation. Up to 5 per table.

Capacity Modes:

  • On-Demand: pay-per-request. Auto-scales instantly. Best for unpredictable or new workloads. More expensive per-request.
  • Provisioned: you set Read/Write Capacity Units (RCUs/WCUs). Cheaper at steady-state. Use Auto Scaling. 1 WCU = 1 write/sec (up to 1 KB). 1 RCU = 1 strongly consistent read/sec (up to 4 KB) or 2 eventually consistent.
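
The capacity unit rules above reduce to simple arithmetic: round the item size up to the next 1 KB (writes) or 4 KB (reads), multiply by the request rate, and halve reads that are eventually consistent. The following Python sketch works through that calculation. The function names are illustrative, and transactional operations, which consume more capacity, are not modeled.

```python
import math


def write_capacity_units(item_size_bytes: int, writes_per_second: float) -> int:
    """WCUs needed: 1 WCU = one write per second of an item up to 1 KB."""
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return math.ceil(units_per_write * writes_per_second)


def read_capacity_units(item_size_bytes: int, reads_per_second: float,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: 1 RCU = one strongly consistent read per second of up to 4 KB.

    Eventually consistent reads cost half as much.
    """
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total /= 2
    return math.ceil(total)


if __name__ == "__main__":
    # 3.5 KB items written 10 times per second: each write rounds up to 4 WCUs.
    print(write_capacity_units(3500, 10))
    # 6 KB items read 10 times per second: each read rounds up to 2 RCUs.
    print(read_capacity_units(6000, 10, strongly_consistent=True))
    print(read_capacity_units(6000, 10, strongly_consistent=False))
```

The rounding matters for cost: a 1.1 KB item consumes 2 WCUs per write, twice as much as a 1 KB item, which is one reason to keep items small.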

Single-Table Design: store multiple entity types in one table using generic PK/SK names. Reduces joins (which DynamoDB doesn't support).

# ── Create DynamoDB table with composite key ──
aws dynamodb create-table \
    --table-name Orders \
    --key-schema \
        AttributeName=customerId,KeyType=HASH \
        AttributeName=orderId,KeyType=RANGE \
    --attribute-definitions \
        AttributeName=customerId,AttributeType=S \
        AttributeName=orderId,AttributeType=S \
        AttributeName=status,AttributeType=S \
    --billing-mode PAY_PER_REQUEST \
    --global-secondary-indexes '[{
        "IndexName": "StatusIndex",
        "KeySchema": [
            {"AttributeName": "status", "KeyType": "HASH"},
            {"AttributeName": "orderId", "KeyType": "RANGE"}
        ],
        "Projection": {"ProjectionType": "ALL"}
    }]'

# ── Write an item ──
aws dynamodb put-item --table-name Orders --item '{
    "customerId": {"S": "CUST-001"},
    "orderId": {"S": "ORD-2026-001"},
    "amount": {"N": "99.99"},
    "status": {"S": "shipped"},
    "items": {"L": [{"S": "Widget"}, {"S": "Gadget"}]}
}'

# ── Query: Get all orders for a customer ──
aws dynamodb query --table-name Orders \
    --key-condition-expression "customerId = :cid" \
    --expression-attribute-values '{":cid": {"S": "CUST-001"}}'

# ── Query: Get orders in an ID range (BETWEEN on the sort key) ──
aws dynamodb query --table-name Orders \
    --key-condition-expression "customerId = :cid AND orderId BETWEEN :start AND :end" \
    --expression-attribute-values '{
        ":cid": {"S": "CUST-001"},
        ":start": {"S": "ORD-2026-001"},
        ":end": {"S": "ORD-2026-100"}
    }'

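# ── Query GSI: Get all shipped orders ──
# "status" is a DynamoDB reserved word, so it must be aliased with a # placeholder
aws dynamodb query --table-name Orders \
    --index-name StatusIndex \
    --key-condition-expression "#s = :s" \
    --expression-attribute-names '{"#s": "status"}' \
    --expression-attribute-values '{":s": {"S": "shipped"}}'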

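# ── Python boto3: Single-table design pattern ──
# import boto3
# from decimal import Decimal   # the resource API requires Decimal, not float, for numbers
# table = boto3.resource("dynamodb").Table("AppTable")   # "AppTable" is a placeholder name
# table.put_item(Item={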
#     "PK": "CUSTOMER#C001",
#     "SK": "ORDER#2026-01-15#O001",
#     "type": "order",
#     "amount": Decimal("99.99")
# })
# table.put_item(Item={
#     "PK": "CUSTOMER#C001",
#     "SK": "PROFILE",
#     "type": "customer",
#     "name": "Alice",
#     "email": "alice@example.com"
# })

An app used a customer email as the partition key. One enterprise customer had 5 million records — creating a "hot partition" that throttled the entire table. The fix: changed PK to customerId (UUID, even distribution) + SK for time-based ordering. Added a GSI on email for lookup queries. Hot partition problem disappeared. They also switched from Provisioned to On-Demand mode during migration since traffic was unpredictable.

Choose a high-cardinality partition key to avoid hot partitions (UUID > email > status). Use composite keys (PK+SK) for hierarchical data and range queries. Use GSIs for alternative query patterns. Start with On-Demand mode, switch to Provisioned once traffic stabilizes. Design your key schema based on access patterns, not data relationships — DynamoDB is not a relational database.
⚠️ Common Mistake
// ❌ Low-cardinality partition key → hot partition
// PK: "status" (only 5 values: pending, shipped, delivered, ...)
// 80% of orders are "delivered" → one partition gets 80% of traffic
// That partition throttles → reads/writes fail with ProvisionedThroughputExceededException
// Table has 10,000 WCUs but one partition maxes at 1,000 WCUs

// ✅ High-cardinality partition key + GSI for queries
// PK: customerId (UUID, millions of unique values)
// SK: orderId (for range queries within a customer)
// GSI: status → orderId (query by status when needed)
// Traffic distributes evenly across all partitions
// No hot partition, so the full table throughput is usable
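
The effect of key cardinality can be seen in a small simulation. The sketch below hashes partition key values onto a fixed number of partitions and reports the share of requests landing on the busiest one. It is only an illustration: DynamoDB's internal hash function and partition count are not public, so SHA-256 and 10 partitions are stand-in assumptions, and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable


def partition_for(key: str, partitions: int) -> int:
    """Map a partition key value to a partition (SHA-256 as a stand-in hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def hottest_share(keys: Iterable[str], partitions: int = 10) -> float:
    """Fraction of all requests that land on the busiest partition."""
    counts = Counter(partition_for(k, partitions) for k in keys)
    return max(counts.values()) / sum(counts.values())


if __name__ == "__main__":
    # 10,000 requests keyed by status: 80% of them are "delivered".
    by_status = ["delivered"] * 8000 + ["shipped"] * 1000 + ["pending"] * 1000
    # 10,000 requests keyed by a distinct customer ID each.
    by_customer = [f"CUST-{i:05d}" for i in range(10000)]
    print(f"status as PK:     busiest partition gets {hottest_share(by_status):.0%}")
    print(f"customerId as PK: busiest partition gets {hottest_share(by_customer):.0%}")
```

With the status key, at least 80% of requests hit one partition however many partitions exist, because identical key values always hash to the same place. With distinct customer IDs the load spreads close to evenly.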
🔁 Follow-Up Question

What is DynamoDB single-table design? Why is it recommended over multiple tables?

17 What is the difference between ECS and EKS? When do you use Fargate vs EC2 launch type? intermediate

AWS offers two container orchestration services:

ECS (Elastic Container Service):

  • AWS-native container orchestration. Simpler, tightly integrated with AWS services.
  • Uses Task Definitions (JSON) to define containers — image, CPU, memory, ports, env vars, logging.
  • Services: manage desired count, rolling updates, load balancing.
  • Deep ALB integration, CloudWatch logging, IAM task roles.
  • Best for: teams that want simplicity and are AWS-centric.

EKS (Elastic Kubernetes Service):

  • Managed Kubernetes control plane. Uses standard K8s APIs, manifests, tools (kubectl, Helm).
  • Portable — same manifests work on any Kubernetes cluster (GKE, AKS, on-prem).
  • Larger ecosystem — thousands of K8s tools, operators, CRDs.
  • More complex to operate. More expensive ($0.10/hr for control plane).
  • Best for: teams with K8s expertise, multi-cloud, complex orchestration needs.

Compute options (ECS calls these launch types; EKS uses Fargate profiles or EC2 node groups):

  • Fargate (serverless): AWS manages the underlying EC2 instances. You define CPU/memory per task. No patching, no capacity planning. Pay per task.
  • EC2: you manage EC2 instances in an ASG. More control (GPU, custom AMI, lower cost for steady-state). You handle patching, scaling.
# ── ECS Task Definition (simplified) ──
# {
#   "family": "web-app",
#   "networkMode": "awsvpc",
#   "requiresCompatibilities": ["FARGATE"],
#   "cpu": "512",
#   "memory": "1024",
#   "executionRoleArn": "arn:aws:iam::123:role/ecsTaskExecutionRole",
#   "taskRoleArn": "arn:aws:iam::123:role/appTaskRole",
#   "containerDefinitions": [{
#     "name": "web",
#     "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
#     "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
#     "logConfiguration": {
#       "logDriver": "awslogs",
#       "options": {
#         "awslogs-group": "/ecs/web-app",
#         "awslogs-region": "us-east-1",
#         "awslogs-stream-prefix": "web"
#       }
#     },
#     "environment": [
#       {"name": "DB_HOST", "value": "mydb.cluster-xxx.rds.amazonaws.com"}
#     ]
#   }]
# }

# ── Create ECS Fargate Service ──
aws ecs create-service \
    --cluster my-cluster \
    --service-name web-service \
    --task-definition web-app:1 \
    --desired-count 3 \
    --launch-type FARGATE \
    --network-configuration '{
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "securityGroups": ["sg-abc123"],
            "assignPublicIp": "DISABLED"
        }
    }' \
    --load-balancers '{
        "targetGroupArn": "arn:aws:elasticloadbalancing:...",
        "containerName": "web",
        "containerPort": 8080
    }'

# ── EKS: Deploy with kubectl ──
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: web-app
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: web
#   template:
#     metadata:
#       labels:
#         app: web          # must match spec.selector.matchLabels
#     spec:
#       containers:
#         - name: web
#           image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest
#           ports:
#             - containerPort: 8080
#           resources:
#             requests:
#               cpu: "256m"
#               memory: "512Mi"

# ── Comparison ──
# Feature     | ECS             | EKS             | Fargate         | EC2 Launch
# Complexity  | Low             | High            | Lowest          | Medium
# Portability | AWS only        | Multi-cloud     | -               | -
# Cost        | No control plane| $0.10/hr/cluster| Pay per task    | Pay per instance
# GPU         | Yes (EC2)       | Yes (EC2)       | No              | Yes
# Scaling     | Auto (service)  | HPA/Karpenter   | Per task        | ASG

A startup began with ECS Fargate — zero infrastructure management, deployed in a day. As they grew to 50+ microservices and hired Kubernetes engineers, they migrated to EKS for the richer ecosystem (Helm charts, service mesh, custom operators). They kept Fargate as the compute layer for EKS (EKS on Fargate) for dev/test environments, and used EC2 managed node groups for production (better cost control, GPU support for ML services).

Choose ECS if you want simplicity and are AWS-only. Choose EKS if you need Kubernetes portability, have K8s expertise, or need the K8s ecosystem. Use Fargate for most workloads — no patching, no capacity planning. Use EC2 launch type only when you need GPUs, custom AMIs, or cost optimization at scale. Start with Fargate, move to EC2 when the cost difference justifies the operational overhead.
⚠️ Common Mistake
// ❌ Using EKS without Kubernetes expertise
// Team of 3 devs → chose EKS because "Kubernetes is the future"
// Months spent learning K8s concepts, RBAC, networking, Helm
// Production incident → nobody understood pod scheduling
// EKS control plane: $72/month even when idle
// Could have shipped the same app on ECS Fargate in 1 day

// ✅ Match tool to team capability
// Small team, AWS-centric → ECS + Fargate
//   - Deploy in hours, not weeks
//   - ALB integration built-in
//   - IAM task roles, CloudWatch logs: native
//
// Large team, K8s expertise, multi-cloud → EKS
//   - Full K8s ecosystem (Istio, ArgoCD, Karpenter)
//   - Same manifests on AWS, GCP, Azure
🔁 Follow-Up Question

What is AWS App Runner and how does it compare to ECS Fargate for simple web applications?

18 What is CloudFormation? Explain stacks, drift detection, change sets, and nested stacks. intermediate

CloudFormation is AWS's Infrastructure as Code (IaC) service. You define resources in YAML/JSON templates, and CloudFormation provisions and manages them.

Core concepts:

  • Template: YAML/JSON file describing resources, parameters, outputs, mappings, and conditions.
  • Stack: a collection of AWS resources created from a template. Managed as a single unit — create, update, or delete the entire stack.
  • Stack operations: Create → Update → Delete. On failure, automatic rollback to the previous state.

Key features:

  • Change Sets: preview changes before applying. Shows what will be added, modified, or deleted. Prevents surprises (like accidentally replacing a database).
  • Drift Detection: detects when actual resource configuration differs from the template (someone manually changed a security group in the console). Reports "IN_SYNC" or "DRIFTED".
  • Nested Stacks: reusable template components. A "parent" stack includes "child" stacks. DRY principle for common infrastructure (VPC, security groups).
  • Stack Sets: deploy stacks across multiple accounts and Regions from a single template. For Organizations-wide infrastructure.

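Deletion Policy: controls what happens to a resource when it is removed from the template or its stack is deleted. Options: Delete (the default for most resource types), Retain (keep resource), Snapshot (create backup before deleting, for resource types that support snapshots such as RDS and EBS). A resource that is replaced during a stack update is governed by the separate UpdateReplacePolicy attribute, which accepts the same options.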

# ── CloudFormation Template (YAML) ──
# AWSTemplateFormatVersion: "2010-09-09"
# Description: Web application infrastructure
#
# Parameters:
#   Environment:
#     Type: String
#     AllowedValues: [dev, staging, prod]
#     Default: dev
#   InstanceType:
#     Type: String
#     Default: m7g.large
#
# Conditions:
#   IsProd: !Equals [!Ref Environment, prod]
#
# Resources:
#   VPC:
#     Type: AWS::EC2::VPC
#     Properties:
#       CidrBlock: 10.0.0.0/16
#       Tags:
#         - Key: Name
#           Value: !Sub "${Environment}-vpc"
#
#   Database:
#     Type: AWS::RDS::DBInstance
#     DeletionPolicy: Snapshot        # ← Take snapshot before delete!
#     UpdateReplacePolicy: Snapshot   # ← Also snapshot if an update replaces the DB
#     Properties:
#       DBInstanceClass: !If [IsProd, db.r7g.xlarge, db.t4g.medium]
#       Engine: postgres
#       MultiAZ: !If [IsProd, true, false]
#       StorageEncrypted: true
#
# Outputs:
#   VPCId:
#     Value: !Ref VPC
#     Export:
#       Name: !Sub "${Environment}-VPCId"

# ── Create stack ──
aws cloudformation create-stack \
    --stack-name my-app-prod \
    --template-body file://template.yaml \
    --parameters ParameterKey=Environment,ParameterValue=prod \
    --capabilities CAPABILITY_IAM

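# ── Create Change Set (preview before update) ──
# UsePreviousValue keeps Environment=prod; an omitted parameter falls back to its template default (dev)
aws cloudformation create-change-set \
    --stack-name my-app-prod \
    --change-set-name update-instance-type \
    --template-body file://template-v2.yaml \
    --parameters ParameterKey=Environment,UsePreviousValue=true \
                 ParameterKey=InstanceType,ParameterValue=m7g.xlarge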

# Review changes
aws cloudformation describe-change-set \
    --stack-name my-app-prod \
    --change-set-name update-instance-type

# Execute if safe
aws cloudformation execute-change-set \
    --stack-name my-app-prod \
    --change-set-name update-instance-type

# ── Drift Detection ──
aws cloudformation detect-stack-drift --stack-name my-app-prod
# Wait, then check results:
aws cloudformation describe-stack-resource-drifts \
    --stack-name my-app-prod \
    --stack-resource-drift-status-filters MODIFIED DELETED

# ── Nested Stack (reuse VPC template) ──
# Resources:
#   NetworkStack:
#     Type: AWS::CloudFormation::Stack
#     Properties:
#       TemplateURL: https://s3.amazonaws.com/templates/vpc.yaml
#       Parameters:
#         Environment: !Ref Environment

A team updated their CloudFormation template and ran an update without a Change Set. The update replaced their production RDS instance (a "replacement" update because they changed the engine version incorrectly). Data was lost because no UpdateReplacePolicy was set, so the old instance was deleted without a snapshot (Delete is the default on replacement). After the incident: (1) mandatory Change Sets for all production updates, (2) DeletionPolicy: Snapshot and UpdateReplacePolicy: Snapshot on all databases and EBS volumes, (3) weekly drift detection to catch manual console changes.

Always use Change Sets before updating production stacks, since they show exactly what will change. Set DeletionPolicy: Snapshot and UpdateReplacePolicy: Snapshot on databases and EBS volumes. Run drift detection regularly to catch manual changes. Use nested stacks for reusable infrastructure (VPC, security). Use StackSets for multi-account/multi-region deployments. CloudFormation's automatic rollback on failure is a safety net, but Change Sets prevent the problem in the first place.
⚠️ Common Mistake
// ❌ Direct stack update without Change Set
// aws cloudformation update-stack ... (runs immediately!)
// Changed DB engine version → CloudFormation REPLACES the DB
// Replace = create new + delete old → all data lost!
// UpdateReplacePolicy: Delete (default) → no snapshot taken
// No way to recover

// ✅ Change Set → Review → Execute
// 1. Create Change Set (no changes applied yet)
// 2. Review: "Action: Modify, Replacement: True, Resource: Database" ← RED FLAG!
// 3. Delete the change set and modify the template instead
// 4. Add DeletionPolicy: Snapshot and UpdateReplacePolicy: Snapshot to all databases
// 5. Also enable RDS automated backups as an additional safety net
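
The review step can be partly automated by scanning the change set for removals and replacements before anyone executes it. The sketch below takes a dictionary shaped like the `describe-change-set` response (`Changes[].ResourceChange` with `Action`, `Replacement`, `LogicalResourceId`, `ResourceType`). The function name and the choice of what counts as risky are assumptions for illustration; a real pipeline would likely narrow the check to stateful resource types.

```python
from typing import Any, Dict, List


def risky_changes(change_set: Dict[str, Any]) -> List[str]:
    """Return a description of every change that removes or may replace a resource."""
    flagged: List[str] = []
    for change in change_set.get("Changes", []):
        rc = change.get("ResourceChange", {})
        action = rc.get("Action")
        replacement = rc.get("Replacement")  # "True", "False" or "Conditional"
        is_replacement = action == "Modify" and replacement in ("True", "Conditional")
        if action == "Remove" or is_replacement:
            flagged.append(
                f"{rc.get('LogicalResourceId')} ({rc.get('ResourceType')}): "
                f"{action}, Replacement={replacement}"
            )
    return flagged


if __name__ == "__main__":
    # Hand-written sample in the shape of a describe-change-set response.
    sample = {
        "Changes": [
            {"Type": "Resource", "ResourceChange": {
                "Action": "Modify", "Replacement": "True",
                "LogicalResourceId": "Database",
                "ResourceType": "AWS::RDS::DBInstance"}},
            {"Type": "Resource", "ResourceChange": {
                "Action": "Modify", "Replacement": "False",
                "LogicalResourceId": "VPC",
                "ResourceType": "AWS::EC2::VPC"}},
        ]
    }
    for line in risky_changes(sample):
        print("REVIEW:", line)
```

A CI job could fail the deployment whenever this list is non-empty and require a manual approval to proceed.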
🔁 Follow-Up Question

What are the differences between CloudFormation and Terraform? When would you choose one over the other?

19 What is the difference between VPC Peering, Transit Gateway, and PrivateLink? When do you use each? advanced

AWS provides three ways to connect VPCs and services privately:

VPC Peering:

  • Direct, one-to-one connection between two VPCs using private IPs.
  • Works cross-account and cross-Region.
  • No single point of failure — uses AWS backbone, not the internet.
  • Non-transitive: VPC-A peers with VPC-B, VPC-B peers with VPC-C → A cannot reach C through B.
  • CIDR ranges cannot overlap.
  • Best for: small number of VPCs (< 10). N VPCs need N×(N-1)/2 peering connections (full mesh).

Transit Gateway (TGW):

  • Hub-and-spoke network — one TGW connects hundreds of VPCs, VPNs, and Direct Connect.
  • Transitive routing: all attached VPCs can communicate through the TGW.
  • Supports route tables for segmentation (prod VPCs can't reach dev VPCs).
  • Cross-Region peering between Transit Gateways.
  • Best for: large organizations with many VPCs. Replaces the full mesh of VPC Peering.

PrivateLink (VPC Endpoints):

  • Expose a specific service (not the whole VPC) to other VPCs or accounts.
  • Interface Endpoint: ENI with private IP in your VPC → accesses AWS services (S3, DynamoDB, SQS) or your own services privately.
  • Gateway Endpoint: route table entry for S3 and DynamoDB only (free). This is a separate VPC endpoint type that does not use PrivateLink.
  • No VPC CIDR overlap issues. Unidirectional: consumer → provider.
  • Best for: SaaS providers exposing services, accessing AWS services without NAT Gateway.
# ── VPC Peering ──
aws ec2 create-vpc-peering-connection \
    --vpc-id vpc-aaa \
    --peer-vpc-id vpc-bbb \
    --peer-owner-id 987654321012  # Cross-account

# Accept in the other account
aws ec2 accept-vpc-peering-connection \
    --vpc-peering-connection-id pcx-abc123

# Add routes in BOTH VPCs
aws ec2 create-route --route-table-id rtb-aaa \
    --destination-cidr-block 10.1.0.0/16 \
    --vpc-peering-connection-id pcx-abc123

# ── Transit Gateway ──
aws ec2 create-transit-gateway \
    --description "Central Hub" \
    --options DefaultRouteTableAssociation=enable,DefaultRouteTablePropagation=enable

# Attach VPCs
aws ec2 create-transit-gateway-vpc-attachment \
    --transit-gateway-id tgw-abc123 \
    --vpc-id vpc-aaa \
    --subnet-ids subnet-aaa1 subnet-aaa2

# All attached VPCs can now communicate through TGW
# Use TGW route tables for segmentation

# ── Gateway Endpoint: Access S3 without NAT (route-table based, not PrivateLink) ──
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-aaa \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-private  # Free! No NAT needed for S3

# ── PrivateLink: Interface Endpoint for SQS ──
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-aaa \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.sqs \
    --subnet-ids subnet-private1 \
    --security-group-ids sg-endpoint

# ── Comparison ──
# Feature        | VPC Peering   | Transit Gateway  | PrivateLink
# Topology       | 1:1           | Hub-and-spoke    | Service endpoint
# Transitive     | No            | Yes              | N/A
# Scale          | < 10 VPCs     | Hundreds of VPCs | Per-service
# Overlap CIDRs  | No            | No               | Yes
# Cross-Region   | Yes           | Yes (peering)    | Yes (if enabled)
# Cost           | Data transfer | $0.05/hr per attachment + data | $0.01/hr per endpoint per AZ + data

A company with 5 VPCs started with VPC Peering (10 peering connections for full mesh). When they grew to 30 VPCs, managing 435 peering connections was impossible. They migrated to Transit Gateway — one hub connecting all VPCs. They segmented traffic using TGW route tables: production VPCs could reach shared services but not development VPCs. They also added Gateway Endpoints for S3 and DynamoDB, saving $2,000/month in NAT Gateway data processing fees.

Use VPC Peering for a few VPCs (< 10) — it's simpler and cheaper. Use Transit Gateway for many VPCs — hub-and-spoke with route table segmentation. Use PrivateLink/VPC Endpoints to access AWS services without NAT Gateway (S3 Gateway Endpoint is free). Remember: VPC Peering is non-transitive, Transit Gateway is transitive. PrivateLink solves CIDR overlap issues.
⚠️ Common Mistake
// ❌ Full mesh VPC Peering at scale
// 30 VPCs → 30 × 29 / 2 = 435 peering connections
// Each needs routes in BOTH VPCs = 870 route entries
// Adding VPC #31 → 30 more peering connections
// Non-transitive: can't route through intermediate VPCs
// Operational nightmare: one wrong route = connectivity issue

// ✅ Transit Gateway: hub-and-spoke
// 30 VPCs → 30 attachments to 1 Transit Gateway
// Adding VPC #31 → 1 new attachment
// TGW route tables: prod ↔ shared-services, dev ↔ shared-services
// prod ✗ dev (isolated by route tables)
// One place to manage all routing (single pane of glass)
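
The scaling argument is plain combinatorics: a full mesh of N VPCs needs N×(N-1)/2 peering connections, each with a route on both sides, while a Transit Gateway needs one attachment per VPC. The short Python sketch below reproduces the figures used in this answer; the function names are illustrative.

```python
def peering_connections(vpcs: int) -> int:
    """Peering connections needed for a full mesh of `vpcs` VPCs."""
    return vpcs * (vpcs - 1) // 2


def peering_route_entries(vpcs: int) -> int:
    """Minimum route entries: one in each of the two VPCs per connection."""
    return 2 * peering_connections(vpcs)


def tgw_attachments(vpcs: int) -> int:
    """Transit Gateway attachments: one per VPC."""
    return vpcs


if __name__ == "__main__":
    for n in (5, 30, 31):
        print(f"{n} VPCs: {peering_connections(n)} peerings, "
              f"{peering_route_entries(n)} routes, "
              f"{tgw_attachments(n)} TGW attachments")
```

Peering grows quadratically and Transit Gateway linearly, which is why the crossover point in practice is around ten VPCs.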
🔁 Follow-Up Question

What is AWS Direct Connect and when would you use it instead of VPN over the internet?

20 How do you design a multi-region architecture on AWS? Explain active-active vs active-passive. advanced

Multi-region architecture deploys your application across two or more AWS Regions for low latency, disaster recovery, or compliance.

Active-Passive (Pilot Light / Warm Standby):

  • Primary Region: handles all traffic.
  • Secondary Region: infrastructure exists but receives no traffic until failover.
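  • Pilot Light: only critical components running (database replica). Compute scaled to zero. Cheapest, but the slower of the two to recover because compute must be provisioned and scaled before taking traffic (tens of minutes to hours).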
  • Warm Standby: reduced-capacity copy of production. Can take traffic within minutes.
  • Use Route 53 Failover routing to switch DNS on primary failure.

Active-Active:

  • Both Regions serve live traffic simultaneously.
  • Use Route 53 Latency-based routing — users go to the nearest Region.
  • Requires data replication: DynamoDB Global Tables, Aurora Global Database, S3 Cross-Region Replication.
  • Conflict resolution: "last writer wins" (DynamoDB) or application-level logic.
  • Most resilient but most complex and expensive.
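
"Last writer wins" is worth seeing concretely, because it means one of two concurrent writes is silently discarded. The Python sketch below is a minimal model of that rule: it keeps the version with the later timestamp. The `Version` type, its field names, and the Region-name tie-break are assumptions for illustration, not DynamoDB's internal implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # time of the write
    region: str        # Region that accepted the write


def last_writer_wins(a: Version, b: Version) -> Version:
    """Resolve a conflict by keeping the most recent write.

    Ties on timestamp are broken by Region name (an assumed rule) so that
    every replica picks the same winner regardless of argument order.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


if __name__ == "__main__":
    us = Version("cart=[book]", timestamp_ms=1_700_000_000_100, region="us-east-1")
    eu = Version("cart=[pen]", timestamp_ms=1_700_000_000_250, region="eu-west-1")
    winner = last_writer_wins(us, eu)
    # The us-east-1 write is dropped even though it was accepted locally.
    print(winner.region, winner.value)
```

If losing an update is unacceptable (for example, merging two shopping carts), the application needs its own merge logic or should route all writes for a given item to a single Region.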

Key services for multi-region:

  • DynamoDB Global Tables: multi-region, multi-master. Sub-second replication.
  • Aurora Global Database: 1 primary Region (read/write) + up to 5 secondary Regions (read-only, < 1s lag). Failover in < 1 minute.
  • S3 Cross-Region Replication: async copy of objects to another Region.
  • CloudFront: global CDN, caches in 400+ Edge Locations.
# ── DynamoDB Global Table (active-active database) ──
aws dynamodb create-table \
    --table-name UserSessions \
    --key-schema AttributeName=userId,KeyType=HASH \
    --attribute-definitions AttributeName=userId,AttributeType=S \
    --billing-mode PAY_PER_REQUEST \
    --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES

# Add replica in eu-west-1
aws dynamodb update-table --table-name UserSessions \
    --replica-updates '[{"Create":{"RegionName":"eu-west-1"}}]'

# ── Aurora Global Database ──
# The engine is inherited from the source cluster, so --engine is not passed here
aws rds create-global-cluster \
    --global-cluster-identifier my-global-db \
    --source-db-cluster-identifier arn:aws:rds:us-east-1:123:cluster:my-aurora

# Add secondary Region
aws rds create-db-cluster \
    --db-cluster-identifier my-aurora-secondary \
    --engine aurora-postgresql \
    --global-cluster-identifier my-global-db \
    --region eu-west-1

# ── S3 Cross-Region Replication (versioning must be enabled on both buckets) ──
aws s3api put-bucket-replication --bucket source-bucket \
    --replication-configuration '{
        "Role": "arn:aws:iam::123:role/S3ReplicationRole",
        "Rules": [{
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": ""},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::destination-bucket-eu",
                "StorageClass": "STANDARD_IA"
            }
        }]
    }'

# ── Route 53: Latency-based routing (active-active) ──
# US users → us-east-1 ALB (30ms)
# EU users → eu-west-1 ALB (25ms)
# Health checks on both → automatic failover if one Region fails

# ── Route 53: Failover routing (active-passive) ──
# Primary: us-east-1 ALB + health check
# Secondary: eu-west-1 ALB (standby)
# Primary fails → DNS switches to secondary

# ── Architecture diagram ──
# Active-Active:
# Users → Route 53 (Latency) → us-east-1 ALB / eu-west-1 ALB
#                                    ↓                ↓
#                              Aurora Primary    Aurora Secondary
#                                    ←  replication  →
#                              DynamoDB Global Table (both write)

A fintech company required < 100ms latency globally and < 1 minute RTO (Recovery Time Objective). They deployed active-active in us-east-1 and eu-west-1: DynamoDB Global Tables for session data (multi-master, sub-second replication), Aurora Global Database for transactional data (primary in us-east-1, read-only secondary in eu-west-1 with < 1s lag). Route 53 latency-based routing sent users to the nearest Region. During a us-east-1 outage, Aurora Global Database promoted the eu-west-1 secondary to primary in 45 seconds — users experienced only brief read-only mode.

Active-passive is simpler and cheaper — use for DR with longer RTO (minutes to hours). Active-active provides lowest latency and fastest failover — use when global latency or RPO near zero is required. Data replication is the hardest part — choose the right strategy (DynamoDB Global Tables for multi-master, Aurora Global for single-master). Always test failover regularly. Use CloudFront for static content caching regardless of architecture.
⚠️ Common Mistake
// ❌ Multi-region without testing failover
// "We deployed to two Regions so we're resilient"
// But... IAM policies reference us-east-1 specific ARNs
// Application has hardcoded Region in config files
// Database failover never tested → takes 30 minutes (not 1 minute)
// DNS TTL: 300 seconds → clients cache old Region for 5 minutes

// ✅ Multi-region with regular failover testing
// Monthly "Game Day": simulate Region failure
// Region-agnostic config: use environment variables, not hardcoded Regions
// Route 53 TTL: 60 seconds (faster DNS propagation)
// Aurora Global DB failover tested: 45 seconds (verified)
// Runbook documented: step-by-step failover procedure
// Monitoring: cross-region health dashboard
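
The effect of DNS TTL on failover time can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: it assumes a Route 53 health check with a 30-second (standard) or 10-second (fast) request interval and a failure threshold of 3, and it ignores database promotion time and resolvers that do not honor TTLs. The function name is hypothetical.

```python
def worst_case_failover_seconds(check_interval: int, failure_threshold: int, dns_ttl: int) -> int:
    """Time to detect the failure plus the time clients may keep the old DNS answer cached."""
    detection = check_interval * failure_threshold
    return detection + dns_ttl


slow = worst_case_failover_seconds(check_interval=30, failure_threshold=3, dns_ttl=300)
fast = worst_case_failover_seconds(check_interval=10, failure_threshold=3, dns_ttl=60)
print(slow, fast)  # 390 90
```

Under these assumptions, lowering the TTL and using fast health checks cuts the DNS part of the outage from about 6.5 minutes to 1.5 minutes.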
🔁 Follow-Up Question

What is the difference between RPO and RTO? How do you choose a DR strategy based on these requirements?

21 How do AWS Organizations and Service Control Policies (SCPs) work? Explain OU structure and guardrails. advanced

AWS Organizations centrally manages multiple AWS accounts as a single unit.

Key concepts:

  • Management Account (root): creates the organization, manages billing, applies policies. Should NOT run workloads.
  • Member Accounts: individual AWS accounts for workloads, environments, or teams.
  • Organizational Units (OUs): hierarchical grouping of accounts (like folders). SCPs cascade down.

Recommended OU structure:

  • Security OU: log archive account, security tooling account (GuardDuty, Config).
  • Infrastructure OU: shared services (DNS, networking, CI/CD).
  • Workloads OU: prod, staging, dev sub-OUs with separate accounts.
  • Sandbox OU: experimentation accounts with spending limits.

Service Control Policies (SCPs):

  • Guardrails — define the maximum permissions for accounts in an OU.
  • SCPs don't grant permissions — they restrict what IAM policies can do.
  • Applied to OUs or accounts. Cascade to all child OUs/accounts.
  • The management account is never affected by SCPs.
  • Common SCPs: deny Region access (restrict to specific Regions), deny root user actions, deny disabling CloudTrail/GuardDuty, deny public S3.
# ── SCP: Deny all Regions except us-east-1 and eu-west-1 ──
# Global services (IAM, Route 53, CloudFront) are served from us-east-1,
# so keeping us-east-1 in the allow list avoids locking them out.
# {
#   "Version": "2012-10-17",
#   "Statement": [{
#     "Sid": "DenyOtherRegions",
#     "Effect": "Deny",
#     "Action": "*",
#     "Resource": "*",
#     "Condition": {
#       "StringNotEquals": {
#         "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
#       },
#       "ArnNotLike": {
#         "aws:PrincipalArn": [
#           "arn:aws:iam::*:role/OrganizationAdmin"
#         ]
#       }
#     }
#   }]
# }

# ── SCP: Prevent disabling CloudTrail ──
# {
#   "Version": "2012-10-17",
#   "Statement": [{
#     "Sid": "ProtectCloudTrail",
#     "Effect": "Deny",
#     "Action": [
#       "cloudtrail:StopLogging",
#       "cloudtrail:DeleteTrail",
#       "cloudtrail:UpdateTrail"
#     ],
#     "Resource": "*"
#   }]
# }

# ── SCP: Prevent weakening S3 Block Public Access ──
# (Assumes Block Public Access is already turned on for the account.)
# {
#   "Version": "2012-10-17",
#   "Statement": [{
#     "Sid": "DenyChangingS3BlockPublicAccess",
#     "Effect": "Deny",
#     "Action": [
#       "s3:PutAccountPublicAccessBlock",
#       "s3:PutBucketPublicAccessBlock"
#     ],
#     "Resource": "*"
#   }]
# }

# ── Create OU structure ──
aws organizations create-organizational-unit \
    --parent-id r-xxxx --name "Security"
aws organizations create-organizational-unit \
    --parent-id r-xxxx --name "Workloads"
aws organizations create-organizational-unit \
    --parent-id ou-workloads --name "Production"
aws organizations create-organizational-unit \
    --parent-id ou-workloads --name "Development"

# ── Attach SCP to an OU ──
aws organizations attach-policy \
    --policy-id p-abc123 \
    --target-id ou-workloads

# ── OU hierarchy ──
# Root
# ├── Security OU (log archive, security tools)
# ├── Infrastructure OU (networking, CI/CD)
# ├── Workloads OU
# │   ├── Production OU (SCP: deny risky actions)
# │   ├── Staging OU
# │   └── Development OU (SCP: restrict instance types)
# └── Sandbox OU (SCP: Region restrict; AWS Budgets: spending limit)

A company had all 200 of its developers sharing a single AWS account. A junior developer accidentally deleted a production DynamoDB table. After the incident, they moved to AWS Organizations: separate accounts for prod, staging, dev, and security. SCPs on the Production OU prevented deleting databases, disabling CloudTrail, or launching instances in unauthorized Regions. A Sandbox OU let developers experiment freely with a $100/month budget. A dedicated security account centralized CloudTrail logs from all accounts.

Use AWS Organizations to separate environments into different accounts — it's the strongest blast-radius isolation. SCPs are guardrails, not permissions — they limit what IAM can do. Always protect CloudTrail, GuardDuty, and S3 Block Public Access with SCPs. The management account should be empty (no workloads) since SCPs don't affect it. Use AWS Control Tower to automate this setup.
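
The "guardrails, not permissions" rule can be sketched as a small evaluation function. This is a deliberately simplified model and not the real IAM policy engine: it ignores resources, conditions, and permission boundaries, and it matches action names case-sensitively. The function names and the dictionary layout are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(action, patterns):
    """True if the action matches any pattern such as 's3:*' or '*'."""
    return any(fnmatchcase(action, p) for p in patterns)


def is_allowed(action, iam_allow, scp_levels):
    """scp_levels holds one dict per level from the root down to the account,
    for example {"allow": ["*"], "deny": ["dynamodb:DeleteTable"]}."""
    for level in scp_levels:
        if matches(action, level.get("deny", [])):
            return False  # an explicit deny at any level wins
        if not matches(action, level.get("allow", [])):
            return False  # every level must allow the action
    return matches(action, iam_allow)  # SCPs never grant; IAM must still allow


root = {"allow": ["*"]}
prod_ou = {"allow": ["*"], "deny": ["dynamodb:DeleteTable", "cloudtrail:StopLogging"]}
admin_iam = ["*"]  # an administrator's IAM policy

print(is_allowed("dynamodb:DeleteTable", admin_iam, [root, prod_ou]))  # False
print(is_allowed("dynamodb:PutItem", admin_iam, [root, prod_ou]))      # True
print(is_allowed("s3:GetObject", [], [root, prod_ou]))                 # False
```

The last call shows why SCPs are not permissions: the SCPs allow everything except two actions, yet a principal with no IAM allow still gets nothing.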
⚠️ Common Mistake
// ❌ All teams in one AWS account
// 200 developers sharing one account
// Junior dev: aws dynamodb delete-table --table-name UserData
// Production table deleted: 4 hours of downtime
// No blast-radius isolation: one mistake affects everything
// No audit: which team caused the $10K bill spike?

// ✅ Multi-account with Organizations + SCPs
// Prod account: SCP denies dynamodb:DeleteTable, rds:DeleteDBInstance
// Dev account: separate, so experiments don't touch prod
// Security account: centralized CloudTrail, GuardDuty
// Sandbox account: SCP limits to t3.small; AWS Budgets tracks the $100 budget
// Blast radius: mistake in dev → zero prod impact
🔁 Follow-Up Question

What is AWS Control Tower and how does it automate multi-account setup with Organizations?

22 How does API Gateway work? Explain REST API vs HTTP API, throttling, caching, and usage plans. advanced

API Gateway is a fully managed service for creating, publishing, and managing APIs at any scale.

API Types:

  • REST API: full-featured. Supports API keys, usage plans, resource policies, request/response transformation, caching, WAF, private APIs. ~$3.50/million requests.
  • HTTP API: simpler, faster, cheaper. Supports JWT authorization, CORS. No caching, no usage plans, no WAF. ~$1.00/million requests. 70% cheaper.
  • WebSocket API: real-time two-way communication (chat, gaming, notifications).

Throttling:

  • Account-level: 10,000 requests/sec across all APIs (soft limit).
  • Stage-level: configurable per API stage (e.g., prod vs dev).
  • Method-level: throttle specific routes (e.g., POST /orders at 100 req/sec).
  • Usage Plans + API Keys: rate limit per customer (free tier: 100 req/day, paid: 10,000 req/day).
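
API Gateway throttling follows the token bucket model, which is why every limit comes as a rate plus a burst. The class below is a minimal local sketch of that model, not API Gateway's implementation; `TokenBucket` and its parameters are names chosen for this example, and time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate            # tokens added per second (steady-state limit)
        self.burst = burst          # bucket capacity (largest spike that is absorbed)
        self.tokens = float(burst)  # the bucket starts full
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill according to the elapsed time, capped at the burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a gateway would answer 429 Too Many Requests here


bucket = TokenBucket(rate=5, burst=10)
spike = [bucket.allow(now=0.0) for _ in range(12)]
print(spike.count(True))      # 10
print(bucket.allow(now=1.0))  # True
```

With `rateLimit=5` and `burstLimit=10`, a client can send 10 requests at once, after which it is held to 5 requests per second.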

Caching (REST API only):

  • Caches API responses for a configurable TTL (300 seconds default).
  • Reduces backend calls and improves latency.
  • Cache size: 0.5 GB to 237 GB. Costs $0.02-$3.80/hr.

Authorization: IAM, Cognito User Pools, Lambda Authorizer (custom logic), JWT (HTTP API).

# ── CloudFormation: REST API with Lambda backend ──
# Resources:
#   MyAPI:
#     Type: AWS::ApiGateway::RestApi
#     Properties:
#       Name: OrderAPI
#       Description: Order management API
#
#   OrdersResource:
#     Type: AWS::ApiGateway::Resource
#     Properties:
#       RestApiId: !Ref MyAPI
#       ParentId: !GetAtt MyAPI.RootResourceId
#       PathPart: orders
#
#   PostOrder:
#     Type: AWS::ApiGateway::Method
#     Properties:
#       RestApiId: !Ref MyAPI
#       ResourceId: !Ref OrdersResource
#       HttpMethod: POST
#       AuthorizationType: COGNITO_USER_POOLS
#       AuthorizerId: !Ref CognitoAuth
#       Integration:
#         Type: AWS_PROXY
#         IntegrationHttpMethod: POST
#         Uri: !Sub "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${OrderFn.Arn}/invocations"

# ── HTTP API (simpler, cheaper) ──
aws apigatewayv2 create-api \
    --name OrderAPI \
    --protocol-type HTTP \
    --target arn:aws:lambda:us-east-1:123:function:ProcessOrder

# ── Enable caching (REST API only) ──
aws apigateway update-stage \
    --rest-api-id abc123 \
    --stage-name prod \
    --patch-operations \
    op=replace,path=/cacheClusterEnabled,value=true \
    op=replace,path=/cacheClusterSize,value=0.5

# ── Usage Plan + API Key (per-customer rate limiting) ──
aws apigateway create-usage-plan --name "FreeTier" \
    --throttle burstLimit=10,rateLimit=5 \
    --quota limit=1000,period=MONTH \
    --api-stages apiId=abc123,stage=prod

aws apigateway create-api-key --name "customer-001" --enabled
aws apigateway create-usage-plan-key \
    --usage-plan-id plan123 \
    --key-id key123 \
    --key-type API_KEY

# ── Comparison ──
# Feature        | REST API              | HTTP API
# Cost           | $3.50/million         | $1.00/million
# Caching        | ✅                    | ❌
# Usage Plans    | ✅                    | ❌
# WAF            | ✅                    | ❌
# Transformation | ✅ (VTL templates)    | ❌
# Auth           | IAM, Cognito, Lambda  | JWT, IAM, Lambda
# Private API    | ✅                    | ❌
# Performance    | ~29ms overhead        | ~10ms overhead

A SaaS company needed to expose their API to customers with different pricing tiers. They used REST API with Usage Plans: free tier (100 requests/day, 5 req/sec burst), basic ($29/month, 10K requests/day), and enterprise ($299/month, 100K requests/day). API caching with 60-second TTL reduced Lambda invocations by 70% for read-heavy endpoints. They added a Lambda Authorizer to validate JWT tokens and inject tenant context into requests.

Use HTTP API for simple Lambda/HTTP integrations — it's 70% cheaper and faster. Use REST API when you need caching, usage plans, WAF, request transformation, or private APIs. Always set throttling to protect your backend. Use API keys with usage plans for per-customer rate limiting. Cache GET responses aggressively to reduce costs. Lambda Authorizer is the most flexible auth option.
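
The price gap is easy to check with the per-million figures quoted above. This is only a rough estimate for a hypothetical volume of 50 million requests per month; it ignores volume discounts, data transfer, and cache charges.

```python
REST_PER_MILLION = 3.50  # USD, figure quoted above
HTTP_PER_MILLION = 1.00  # USD, figure quoted above


def monthly_request_cost(requests: int, price_per_million: float) -> float:
    return requests / 1_000_000 * price_per_million


requests = 50_000_000  # hypothetical monthly volume
rest = monthly_request_cost(requests, REST_PER_MILLION)
http = monthly_request_cost(requests, HTTP_PER_MILLION)
saving = 1 - http / rest
print(f"REST ${rest:.2f}, HTTP ${http:.2f}, saving {saving:.0%}")
# REST $175.00, HTTP $50.00, saving 71%
```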
⚠️ Common Mistake
// ❌ No throttling: Lambda concurrency exhausted
// API Gateway: no throttle limits set
// Bot sends 50,000 requests/second
// Lambda concurrency limit (1000) reached instantly
// All legitimate users get 429 Too Many Requests
// Lambda costs spike to $500 in one hour

// ✅ Multi-layer throttling
// API Gateway stage: 5,000 req/sec (account protection)
// Method level: POST /orders → 100 req/sec
// Usage Plans: per-customer limits
// Lambda: reserved concurrency = 200 (protect account limit)
// WAF: rate-based rule that blocks IPs exceeding 2,000 req/5min
// Caching: GET endpoints cached for 60s (reduce backend load)
🔁 Follow-Up Question

What is a Lambda Authorizer and how does it compare to Cognito User Pools for API authorization?

23 What is ElastiCache? Explain Redis vs Memcached, caching strategies, and cluster modes. advanced

ElastiCache is a managed in-memory data store for caching, session management, and real-time analytics.

Redis vs Memcached:

  • Redis:
    • Rich data structures: strings, lists, sets, sorted sets, hashes, streams, geospatial.
    • Persistence: snapshots (RDB) and append-only file (AOF).
    • Replication: primary-replica with automatic failover (Multi-AZ).
    • Pub/Sub, Lua scripting, transactions.
    • Cluster mode: shards data across up to 500 nodes (up to 340 TB).
    • Best for: most use cases — caching, sessions, leaderboards, rate limiting, real-time analytics.
  • Memcached:
    • Simple key-value only. Multi-threaded (better per-node throughput for simple operations).
    • No persistence, no replication, no failover.
    • Node failure = data loss.
    • Best for: simple caching where data loss is acceptable. Horizontally scalable (add/remove nodes).

Caching strategies:

  • Lazy Loading (Cache-Aside): application checks cache → miss → reads from DB → writes to cache. Pros: only caches what's needed. Con: initial miss penalty, stale data possible.
  • Write-Through: every write goes to cache AND DB. Pros: cache always fresh. Con: write latency, caches unused data.
  • Write-Behind: write to cache → async write to DB later. Fastest writes but risk of data loss if cache node fails.
  • TTL (Time-to-Live): always set TTL — prevents stale data and manages memory.
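
The difference between write-through and write-behind is mostly about when the database sees the write. The sketch below uses plain dictionaries as stand-ins for Redis and the database, so the class names and the `flush` method are illustrative assumptions only; TTLs are left out to keep the contrast visible.

```python
class WriteThroughCache:
    """Write-through: every write updates the database and the cache together."""

    def __init__(self, db: dict):
        self.db = db
        self.cache = {}

    def put(self, key, value):
        self.db[key] = value     # 1. durable write first
        self.cache[key] = value  # 2. then refresh the cache

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)  # fall back to lazy loading on a miss
        if value is not None:
            self.cache[key] = value
        return value


class WriteBehindCache(WriteThroughCache):
    """Write-behind: writes land in the cache and reach the database on flush()."""

    def __init__(self, db: dict):
        super().__init__(db)
        self.pending = {}

    def put(self, key, value):
        self.cache[key] = value
        self.pending[key] = value  # lost if the cache node dies before flush()

    def flush(self):
        self.db.update(self.pending)
        self.pending.clear()


database = {}
cache = WriteBehindCache(database)
cache.put("product:1", {"price": 10})
print(database)  # {}
cache.flush()
print(database)  # {'product:1': {'price': 10}}
```

The window between `put` and `flush` is exactly the data-loss risk mentioned for write-behind.
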
# ── Create Redis cluster (cluster mode disabled) ──
aws elasticache create-replication-group \
    --replication-group-id my-redis \
    --replication-group-description "App Cache" \
    --engine redis \
    --cache-node-type cache.r7g.large \
    --num-cache-clusters 3 \
    --multi-az-enabled \
    --automatic-failover-enabled \
    --at-rest-encryption-enabled \
    --transit-encryption-enabled \
    --cache-subnet-group-name my-cache-subnets \
    --security-group-ids sg-redis

# ── Create Redis cluster (cluster mode enabled, sharding) ──
# Cluster mode is selected through the parameter group (this one assumes Redis 7.x)
aws elasticache create-replication-group \
    --replication-group-id my-redis-cluster \
    --replication-group-description "Sharded Cache" \
    --engine redis \
    --cache-node-type cache.r7g.large \
    --num-node-groups 3 \
    --replicas-per-node-group 2 \
    --automatic-failover-enabled \
    --cache-parameter-group-name default.redis7.cluster.on

# ── Python: Lazy Loading (Cache-Aside) pattern ──
# import redis, json, boto3
#
# r = redis.Redis(host="my-redis.xxx.cache.amazonaws.com", port=6379, ssl=True)
# dynamodb = boto3.resource("dynamodb")
# table = dynamodb.Table("Products")
#
# def get_product(product_id):
#     # 1. Check cache
#     cached = r.get(f"product:{product_id}")
#     if cached:
#         return json.loads(cached)  # Cache HIT
#
#     # 2. Cache MISS — read from DB
#     response = table.get_item(Key={"productId": product_id})
#     product = response.get("Item")
#
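#     # 3. Write to cache with TTL
#     #    (default=str because DynamoDB returns numbers as Decimal, which json cannot serialize)
#     if product:
#         r.setex(f"product:{product_id}", 3600, json.dumps(product, default=str))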
#
#     return product
#
# def update_product(product_id, data):
#     # Update the DB, then invalidate the cached copy (the next read reloads it)
#     table.put_item(Item=data)
#     r.delete(f"product:{product_id}")  # Invalidate, not update

# ── Comparison ──
# Feature        | Redis             | Memcached
# Data types     | Rich (lists,sets) | String only
# Persistence    | Yes (RDB+AOF)     | No
# Replication    | Yes (Multi-AZ)    | No
# Failover       | Automatic         | None (data lost)
# Pub/Sub        | Yes               | No
# Cluster mode   | Yes (sharding)    | Yes (hash-based)
# Threading      | Single-threaded   | Multi-threaded
# Max data       | 340 TB (cluster)  | Per-node only

A product catalog API had 500ms response times due to repeated DynamoDB queries. They added ElastiCache Redis with lazy loading: cache product data with 1-hour TTL. Cache hit rate reached 95% within a day — average response time dropped to 5ms. For flash sales, they used write-through caching to pre-warm the cache before the sale started. Redis sorted sets powered a real-time leaderboard for a gamified promotion with zero additional database load.

Choose Redis over Memcached in 95% of cases — it has persistence, replication, failover, and rich data structures. Use lazy loading for read-heavy workloads. Always set TTL to prevent stale data and memory exhaustion. Use cluster mode for data > 100GB or throughput > 100K ops/sec. Place ElastiCache in private subnets with encryption in transit and at rest.
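
Cluster mode decides which shard owns a key by hashing it into one of 16,384 slots. The sketch below follows the key-to-slot rule from the Redis Cluster specification (CRC16 of the key modulo 16384, with hash tags); `key_slot` is a name chosen for this example, and client libraries normally do this for you.

```python
import binascii

SLOTS = 16384  # fixed number of hash slots in a Redis cluster


def key_slot(key: str) -> int:
    data = key.encode()
    # Hash tags: if the key has a non-empty {...} section, only that part is hashed,
    # which lets related keys land on the same shard.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16 variant Redis uses.
    return binascii.crc_hqx(data, 0) % SLOTS


print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))  # True
```

Keys that share a hash tag can be used together in multi-key operations, because they are guaranteed to live in the same slot.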
⚠️ Common Mistake
// ❌ No TTL: stale data + memory exhaustion
// r.set("product:123", json.dumps(product))  // No TTL!
// Product price changes in DB → cache still has old price
// Users see wrong prices for hours/days
// Cache grows forever → OOM (Out of Memory) → evictions
// Random keys evicted → unpredictable cache behavior

// ✅ Always set TTL + invalidate on updates
// r.setex("product:123", 3600, json.dumps(product))  // 1hr TTL
// On product update:
// table.put_item(Item=updated_product)
// r.delete("product:123")  // Invalidate immediately
// TTL is a safety net: even if invalidation fails,
// stale data expires within 1 hour
// Set maxmemory-policy to allkeys-lru for graceful eviction
🔁 Follow-Up Question

What is ElastiCache Serverless and how does it differ from provisioned clusters?

24 How do KMS and Secrets Manager work? Explain envelope encryption, key rotation, and secrets management. advanced

AWS KMS (Key Management Service) manages encryption keys for encrypting your data across AWS services.

Key types:

  • AWS Managed Keys (aws/s3, aws/ebs): AWS creates and rotates them. Free. Limited control.
  • Customer Managed Keys (CMK): you create, manage, define policies. $1/month + $0.03/10K API calls. Full control over rotation, deletion, cross-account access.
  • AWS Owned Keys: used internally by AWS services. Not visible in your account.

Envelope Encryption:

  1. KMS generates a Data Key (plaintext + encrypted copy).
  2. Your app encrypts data with the plaintext data key (locally, fast).
  3. Stores the encrypted data key alongside the ciphertext.
  4. Plaintext data key is discarded from memory.
  5. To decrypt: send encrypted data key to KMS → get plaintext data key → decrypt data locally.
  6. Benefit: KMS never sees your bulk data. It only encrypts and decrypts the small data key (32 bytes for AES-256; direct KMS encryption is capped at 4 KB). Faster, cheaper.

Key Rotation: optional automatic rotation for symmetric customer managed keys (yearly by default once enabled). Old key material is kept for decryption of previously encrypted data. No re-encryption needed.

Secrets Manager:

  • Stores database passwords, API keys, tokens, certificates securely.
  • Automatic rotation: rotates secrets on a schedule (e.g., every 30 days) using a Lambda function.
  • Built-in rotation for RDS, Redshift, DocumentDB credentials.
  • SDKs retrieve secrets at runtime — no secrets in code or config files.
# ── Create a KMS key ──
aws kms create-key \
    --description "Application data encryption" \
    --key-usage ENCRYPT_DECRYPT \
    --origin AWS_KMS

# Enable automatic key rotation
aws kms enable-key-rotation --key-id abc-123-def

# ── Envelope encryption with boto3 ──
# import boto3, base64
# from cryptography.fernet import Fernet
#
# kms = boto3.client("kms")
#
# # 1. Generate data key
# response = kms.generate_data_key(
#     KeyId="alias/my-app-key",
#     KeySpec="AES_256"
# )
# plaintext_key = response["Plaintext"]
# encrypted_key = response["CiphertextBlob"]
#
# # 2. Encrypt data locally (fast, no KMS API call)
# f = Fernet(base64.urlsafe_b64encode(plaintext_key[:32]))
# ciphertext = f.encrypt(b"sensitive data here")
#
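# # 3. Store encrypted_key + ciphertext together
# # 4. Drop the plaintext key as soon as possible. Note that del only removes
# #    this reference; Python does not guarantee the bytes are wiped from memory.
# del plaintext_key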
#
# # To decrypt:
# # 1. Send encrypted_key to KMS → get plaintext key
# # 2. Decrypt ciphertext locally with plaintext key

# ── Secrets Manager: Store a database password ──
aws secretsmanager create-secret \
    --name prod/myapp/db-password \
    --description "Production DB credentials" \
    --secret-string '{"username":"admin","password":"SuperS3cret!","host":"mydb.cluster-xxx.rds.amazonaws.com","port":"5432","dbname":"appdb"}'

# ── Enable automatic rotation (every 30 days) ──
aws secretsmanager rotate-secret \
    --secret-id prod/myapp/db-password \
    --rotation-lambda-arn arn:aws:lambda:us-east-1:123:function:SecretRotation \
    --rotation-rules AutomaticallyAfterDays=30

# ── Python: Retrieve secret at runtime ──
# import boto3, json, psycopg2
#
# def get_db_credentials():
#     client = boto3.client("secretsmanager")
#     response = client.get_secret_value(SecretId="prod/myapp/db-password")
#     return json.loads(response["SecretString"])
#
# creds = get_db_credentials()
# conn = psycopg2.connect(
#     host=creds["host"],
#     user=creds["username"],
#     password=creds["password"],
#     dbname=creds["dbname"]
# )

# ── KMS Key Policy: Cross-account access ──
# {
#   "Statement": [{
#     "Sid": "AllowCrossAccountDecrypt",
#     "Effect": "Allow",
#     "Principal": {"AWS": "arn:aws:iam::987654321:root"},
#     "Action": ["kms:Decrypt", "kms:DescribeKey"],
#     "Resource": "*"
#   }]
# }

A company stored database passwords in environment variables on EC2 instances. When an instance was compromised, the attacker found credentials in /proc/self/environ. After the incident: (1) passwords moved to Secrets Manager with automatic 30-day rotation, (2) application retrieves credentials at runtime via SDK, (3) EC2 instance profile has secretsmanager:GetSecretValue permission only for its own secrets, (4) all data at rest encrypted with customer-managed KMS keys (audit trail in CloudTrail). A leaked credential is now useless within 30 days.

Never store secrets in code, environment variables, or config files — use Secrets Manager. Enable automatic rotation for database passwords. Use KMS customer-managed keys when you need audit trails (CloudTrail), cross-account access, or custom key policies. Understand envelope encryption — it's how AWS services encrypt your data efficiently. Enable key rotation for compliance.
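
Fetching the secret on every request adds latency and API cost, so applications usually cache it for a short time. The sketch below is one hand-rolled way to do that, under the assumption that a 5-minute cache is acceptable; `SecretCache` is a hypothetical helper, and the fetch function is injected so the example runs without AWS access. In real code it would wrap `get_secret_value`.

```python
import json
import time


class SecretCache:
    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch   # callable: secret_id -> JSON string
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}    # secret_id -> (expires_at, parsed secret)

    def get(self, secret_id):
        now = self._clock()
        entry = self._entries.get(secret_id)
        if entry and entry[0] > now:
            return entry[1]   # still fresh: no API call
        secret = json.loads(self._fetch(secret_id))
        self._entries[secret_id] = (now + self._ttl, secret)
        return secret

    def invalidate(self, secret_id):
        """Call this after an authentication failure, e.g. right after a rotation."""
        self._entries.pop(secret_id, None)


# Stand-in for:
#   boto3.client("secretsmanager").get_secret_value(SecretId=sid)["SecretString"]
def fake_fetch(secret_id):
    return '{"username": "admin", "password": "example-only"}'


cache = SecretCache(fake_fetch)
print(cache.get("prod/myapp/db-password")["username"])  # admin
```

Keeping the TTL short and invalidating on login failure lets the application pick up a rotated password without a restart.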
⚠️ Common Mistake
// ❌ Secrets in environment variables
// export DB_PASSWORD="SuperS3cret!"
// # Visible in: /proc/self/environ, docker inspect, CloudFormation outputs
// # Never rotated: same password for 3 years
// # Compromised once → attacker has permanent access
// # No audit trail: who accessed the password?

// ✅ Secrets Manager with automatic rotation
// creds = secretsmanager.get_secret_value(SecretId="prod/db")
// # Encrypted at rest with KMS
// # Auto-rotated every 30 days, so the old password stops working
// # CloudTrail logs every access (who, when, from where)
// # IAM policy: only this Lambda/EC2 role can read this secret
// # Compromised? Rotate immediately and the attacker is locked out
🔁 Follow-Up Question

What is the difference between Secrets Manager and Systems Manager Parameter Store for storing secrets?

25 How does AWS CI/CD work? Explain CodePipeline, CodeBuild, and CodeDeploy. advanced

AWS provides a fully managed CI/CD pipeline using three services:

CodePipeline — orchestrator:

  • Defines the stages of your pipeline: Source → Build → Test → Deploy.
  • Integrates with GitHub, CodeCommit, S3 (source), CodeBuild (build/test), CodeDeploy, ECS, Lambda, CloudFormation (deploy).
  • Triggers automatically on code push.
  • Supports manual approval stages (for production deployments).

CodeBuild — build/test service:

  • Fully managed build environment. No servers to manage.
  • Uses a buildspec.yml file defining phases: install, pre_build, build, post_build.
  • Supports any language/framework. Uses Docker containers for builds.
  • Produces artifacts (JAR, ZIP, Docker image) stored in S3 or ECR.
  • Pay per build minute.

CodeDeploy — deployment service:

  • Deploys to EC2, ECS, Lambda, or on-premises.
  • Deployment strategies:
    • In-place (EC2): update instances one by one. Downtime risk.
    • Blue/Green (EC2/ECS): create new environment → switch traffic → terminate old. Zero downtime.
    • Canary (Lambda/ECS): send 10% traffic to new version → wait → shift 100%.
    • Linear (Lambda/ECS): shift traffic in equal increments every N minutes.
  • Automatic rollback: if CloudWatch alarms fire during deployment → auto-rollback.
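
The canary and linear strategies differ only in the shape of the traffic schedule. The helper functions below are hypothetical and simply print that schedule as (minute, percent on the new version) pairs; CodeDeploy itself is configured with predefined or custom deployment configurations rather than code like this.

```python
def canary_schedule(canary_percent: int, wait_minutes: int):
    """Shift a small slice first, then everything after the wait."""
    return [(0, canary_percent), (wait_minutes, 100)]


def linear_schedule(step_percent: int, interval_minutes: int):
    """Shift traffic in equal steps until the new version takes 100%."""
    schedule, percent, minute = [], 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


print(canary_schedule(10, 5))   # [(0, 10), (5, 100)]
print(linear_schedule(25, 10))  # [(0, 25), (10, 50), (20, 75), (30, 100)]
```

A CloudWatch alarm firing at any point in the schedule is what triggers the automatic rollback, so a longer schedule gives the alarm more time to catch a bad release.
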
# ── buildspec.yml (CodeBuild) ──
# version: 0.2
# phases:
#   install:
#     runtime-versions:
#       nodejs: 20
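#   pre_build:
#     commands:
#       - npm ci
#       - echo "Running tests..."
#       - npm test
#       # Log in to ECR so the docker push in post_build is authorized
#       # (assumes $ECR_REPO is the full repository URI: registry host, slash, repository name)
#       - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin ${ECR_REPO%%/*}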
#   build:
#     commands:
#       - npm run build
#       - echo "Building Docker image..."
#       - docker build -t $ECR_REPO:$CODEBUILD_RESOLVED_SOURCE_VERSION .
#   post_build:
#     commands:
#       - docker push $ECR_REPO:$CODEBUILD_RESOLVED_SOURCE_VERSION
#       - printf '{"ImageURI":"%s"}' $ECR_REPO:$CODEBUILD_RESOLVED_SOURCE_VERSION > imageDetail.json
# artifacts:
#   files:
#     - imageDetail.json
#     - appspec.yaml
#     - taskdef.json

# ── appspec.yml (CodeDeploy for ECS Blue/Green) ──
# version: 0.0
# Resources:
#   - TargetService:
#       Type: AWS::ECS::Service
#       Properties:
#         TaskDefinition: <TASK_DEFINITION>
#         LoadBalancerInfo:
#           ContainerName: "web"
#           ContainerPort: 8080
# Hooks:
#   - BeforeInstall: "LambdaFunctionToValidateBeforeInstall"
#   - AfterAllowTestTraffic: "LambdaFunctionToValidateTestTraffic"
#   - AfterAllowTraffic: "LambdaFunctionToValidateAfterTraffic"

# ── CloudFormation: CodePipeline ──
# Resources:
#   Pipeline:
#     Type: AWS::CodePipeline::Pipeline
#     Properties:
#       # RoleArn and ArtifactStore are also required; omitted here for brevity
#       Stages:
#         - Name: Source
#           Actions:
#             - Name: GitHub
#               ActionTypeId:
#                 Category: Source
#                 Owner: AWS
#                 Provider: CodeStarSourceConnection
#                 Version: "1"
#               Configuration:
#                 ConnectionArn: !Ref GitHubConnection
#                 FullRepositoryId: "myorg/myapp"
#                 BranchName: main
#               OutputArtifacts: [{Name: SourceOutput}]
#
#         - Name: Build
#           Actions:
#             - Name: BuildAndTest
#               ActionTypeId:
#                 Category: Build
#                 Owner: AWS
#                 Provider: CodeBuild
#                 Version: "1"
#               Configuration:
#                 ProjectName: !Ref BuildProject
#               InputArtifacts: [{Name: SourceOutput}]
#               OutputArtifacts: [{Name: BuildOutput}]
#
#         - Name: Approval
#           Actions:
#             - Name: ManualApproval
#               ActionTypeId:
#                 Category: Approval
#                 Owner: AWS
#                 Provider: Manual
#                 Version: "1"
#
#         - Name: Deploy
#           Actions:
#             - Name: DeployToECS
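#               ActionTypeId:
#                 Category: Deploy
#                 Owner: AWS
#                 Provider: CodeDeployToECS   # Blue/Green through CodeDeploy
#                 Version: "1"
#               Configuration:
#                 ApplicationName: !Ref CodeDeployApp        # hypothetical resource name
#                 DeploymentGroupName: !Ref DeploymentGroup  # hypothetical resource name
#                 TaskDefinitionTemplateArtifact: BuildOutput   # reads taskdef.json
#                 AppSpecTemplateArtifact: BuildOutput          # reads appspec.yaml
#                 Image1ArtifactName: BuildOutput               # reads imageDetail.json
#                 Image1ContainerName: IMAGE1_NAME
#               InputArtifacts: [{Name: BuildOutput}]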

A team deployed to production by SSH-ing into servers and running git pull. Deployment took 2 hours, had no rollback mechanism, and caused 30 minutes of downtime every release. They implemented CodePipeline: push to main → CodeBuild runs tests + builds Docker image → pushes to ECR → CodeDeploy does Blue/Green ECS deployment. Deployment time dropped to 15 minutes with zero downtime. A bad deployment auto-rolled back when the ALB health check alarm fired in CloudWatch.

Use CodePipeline to orchestrate the full CI/CD flow. Use CodeBuild for building and testing — no servers to manage. Use CodeDeploy Blue/Green for zero-downtime deployments. Always add manual approval stages for production. Configure automatic rollback with CloudWatch alarms. Store all pipeline config in source control (buildspec.yml, appspec.yml).
⚠️ Common Mistake
// ❌ No automated rollback
// CodeDeploy: in-place deployment to all instances at once
// Bad code deployed → all instances serving errors
// No health checks → CodeDeploy reports "success"
// Manual rollback: 45 minutes to redeploy previous version
// Users impacted for entire duration

// ✅ Blue/Green with automatic rollback
// CodeDeploy: Blue/Green ECS deployment
// New task set (green) starts → ALB sends 10% traffic
// CloudWatch alarm: 5xx error rate > 1% → auto-rollback
// Traffic shifts back to blue (old version) in < 1 minute
// Failed deployment: zero user impact
// Successful: green becomes primary, blue terminated
🔁 Follow-Up Question

How does CodePipeline compare to GitHub Actions or Jenkins for AWS-based CI/CD?

26 How does CloudFront work? Explain OAC, Lambda@Edge, cache behaviors, and invalidation. advanced

CloudFront is AWS's Content Delivery Network (CDN) with 400+ Edge Locations worldwide.

How it works:

  1. User requests content (e.g., https://cdn.example.com/image.jpg).
  2. Request goes to the nearest Edge Location.
  3. Cache hit: return cached content immediately (< 10ms).
  4. Cache miss: Edge Location fetches from Origin (S3, ALB, custom HTTP server), caches it, returns to user.

Origins: S3 bucket, ALB, API Gateway, custom HTTP server, MediaStore.

OAC (Origin Access Control):

  • Replaces the older OAI (Origin Access Identity).
  • Ensures S3 bucket is only accessible through CloudFront, not directly.
  • S3 bucket policy allows only the CloudFront distribution's OAC.

Cache Behaviors:

  • Rules that match URL patterns (e.g., /api/*, /images/*, /static/*).
  • Each behavior can have a different origin, TTL, and caching policy.
  • /api/* → ALB (no caching), /static/* → S3 (cache 1 year).
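
Behavior matching can be pictured as "first pattern that matches wins, otherwise the default". The sketch below approximates that with Python's `fnmatchcase`, which is an assumption for illustration: it treats `*` and `?` the way CloudFront path patterns do and is case-sensitive, but it also gives `[...]` a special meaning that CloudFront does not. The origin names mirror the template in this section.

```python
from fnmatch import fnmatchcase

# Evaluated in the listed order; the first match wins.
BEHAVIORS = [
    ("/api/*", "ALBOrigin"),
    ("/static/*", "S3Origin"),
]
DEFAULT_ORIGIN = "S3Origin"  # the default behavior catches everything else


def pick_origin(path: str) -> str:
    for pattern, origin in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin
    return DEFAULT_ORIGIN


print(pick_origin("/api/orders/42"))  # ALBOrigin
print(pick_origin("/static/app.js"))  # S3Origin
print(pick_origin("/API/orders"))     # S3Origin
```

The last call falls through to the default because matching is case-sensitive, which is a common source of "why is my API response cached" surprises.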

Lambda@Edge / CloudFront Functions:

  • Run code at the edge on viewer request/response (both options) or origin request/response (Lambda@Edge only).
  • CloudFront Functions: lightweight JavaScript (< 1ms), viewer events only, for header manipulation, URL rewrites, redirects.
  • Lambda@Edge: full Lambda (up to 5s on viewer events, up to 30s on origin events), for auth, A/B testing, dynamic content generation.

Invalidation: removes cached content before TTL expires. Use sparingly (costs $0.005/path after 1,000 free). Better: use versioned file names (/css/style.v2.css).
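
Versioned file names are usually generated at build time from a hash of the file content. The sketch below shows one possible scheme (an assumption; bundlers each have their own) together with the invalidation arithmetic from the prices quoted above. Both function names are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the extension: style.css -> style.<hash>.css."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


def invalidation_cost(paths_this_month: int, free_paths: int = 1000, price: float = 0.005) -> float:
    """Monthly invalidation charge in USD for the given number of paths."""
    return max(0, paths_this_month - free_paths) * price


old = versioned_name("/css/style.css", b"body { color: black; }")
new = versioned_name("/css/style.css", b"body { color: navy; }")
print(old != new)                          # True
print(f"${invalidation_cost(3000):.2f}")   # $10.00
```

Because a changed file gets a new URL, the old object can keep its one-year TTL and nothing needs to be invalidated; only the HTML that references the asset needs a short TTL.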

# ── CloudFormation: CloudFront + S3 with OAC ──
# Resources:
#   Distribution:
#     Type: AWS::CloudFront::Distribution
#     Properties:
#       DistributionConfig:
#         Origins:
#           - Id: S3Origin
#             DomainName: !GetAtt AssetsBucket.RegionalDomainName
#             OriginAccessControlId: !GetAtt OAC.Id
#             S3OriginConfig:
#               OriginAccessIdentity: ""
#           - Id: ALBOrigin
#             DomainName: !GetAtt ALB.DNSName
#             CustomOriginConfig:
#               OriginProtocolPolicy: https-only
#         DefaultCacheBehavior:
#           TargetOriginId: S3Origin
#           ViewerProtocolPolicy: redirect-to-https
#           CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6  # CachingOptimized
#           Compress: true
#         CacheBehaviors:
#           - PathPattern: "/api/*"
#             TargetOriginId: ALBOrigin
#             ViewerProtocolPolicy: https-only
#             AllowedMethods: [GET, HEAD, OPTIONS, PUT, PATCH, POST, DELETE]  # default is GET/HEAD only
#             CachePolicyId: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad  # CachingDisabled
#         Aliases: [cdn.example.com]
#         ViewerCertificate:
#           AcmCertificateArn: !Ref SSLCert
#           SslSupportMethod: sni-only
#
#   OAC:
#     Type: AWS::CloudFront::OriginAccessControl
#     Properties:
#       OriginAccessControlConfig:
#         Name: S3OAC
#         OriginAccessControlOriginType: s3
#         SigningBehavior: always
#         SigningProtocol: sigv4

# ── S3 Bucket Policy allowing only CloudFront OAC ──
# {
#   "Statement": [{
#     "Effect": "Allow",
#     "Principal": {"Service": "cloudfront.amazonaws.com"},
#     "Action": "s3:GetObject",
#     "Resource": "arn:aws:s3:::my-bucket/*",
#     "Condition": {
#       "StringEquals": {
#         "AWS:SourceArn": "arn:aws:cloudfront::123:distribution/E1234"
#       }
#     }
#   }]
# }

# ── Invalidation ──
aws cloudfront create-invalidation \
    --distribution-id E1234ABCDEF \
    --paths "/index.html" "/css/*"

# ── CloudFront Function: Add security headers ──
# function handler(event) {
#   var response = event.response;
#   response.headers["x-frame-options"] = {value: "DENY"};
#   response.headers["x-content-type-options"] = {value: "nosniff"};
#   response.headers["strict-transport-security"] = {
#     value: "max-age=63072000; includeSubdomains; preload"
#   };
#   return response;
# }

An e-commerce site served images directly from S3 — first-time load took 800ms for users in Asia (S3 bucket in us-east-1). After adding CloudFront: first request still went to origin (cache miss), but subsequent requests from the same region returned in < 20ms (cache hit). They set up cache behaviors: /static/* → S3 with 1-year TTL, /api/* → ALB with no caching, /* → S3 (default). OAC blocked direct S3 access. Monthly S3 request costs dropped 80% because CloudFront absorbed the traffic.
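
The routing in this scenario can be modelled as an ordered list of path patterns with a default fallback. The sketch below is an approximation under stated assumptions: it uses Python's `fnmatchcase` to stand in for CloudFront's `*` and `?` wildcards, and the origin names and caching labels are illustrative, taken from the example above rather than from a real distribution.

```python
from fnmatch import fnmatchcase

# Ordered (path pattern, origin, caching) entries; first match wins.
BEHAVIORS = [
    ("/static/*", "S3Origin", "cache 1 year"),
    ("/api/*", "ALBOrigin", "no caching"),
]
# The default behavior applies when no path pattern matches.
DEFAULT_BEHAVIOR = ("*", "S3Origin", "default caching")


def pick_behavior(path: str):
    """Return the first behavior whose pattern matches, else the default."""
    for pattern, origin, caching in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return pattern, origin, caching
    return DEFAULT_BEHAVIOR


for request_path in ("/static/img/logo.png", "/api/orders/42", "/index.html"):
    print(request_path, "->", pick_behavior(request_path))
```

Order matters in this model: a broad pattern placed first would shadow the more specific ones after it.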

Use CloudFront for all public-facing content — even if your origin is in one Region. Use OAC to restrict S3 access to CloudFront only. Set long TTLs for static assets and use versioned file names instead of invalidation. Use cache behaviors to route different paths to different origins. CloudFront Functions for simple header manipulation, Lambda@Edge for complex logic.
⚠️ Common Mistake
// ❌ Using invalidation instead of versioned file names
// Deploy new CSS → aws cloudfront create-invalidation --paths "/css/*"
// Costs money: $0.005/path after 1,000 free
// Takes 5-15 minutes to propagate to all Edge Locations
// Some users still see old CSS until invalidation completes
// Doing 50 deploys/day → invalidation costs add up

// ✅ Versioned file names: instant, free, reliable
// /css/style.abc123.css (hash in filename)
// New deploy → /css/style.def456.css (different hash)
// CloudFront sees new URL → fetches from origin
// Old version stays cached (harmless, expires naturally)
// Zero invalidation cost, zero propagation delay
// Build tools (Vite, Webpack) do this automatically
🔁 Follow-Up Question

What is the difference between CloudFront Functions and Lambda@Edge? When would you use one over the other?

27 What is the difference between Step Functions, SQS, and EventBridge for workflow orchestration? advanced

AWS provides three patterns for coordinating distributed systems:

Step Functions — orchestration (centralized control):

  • Visual state machine that coordinates Lambda, ECS, Glue, SQS, and 200+ AWS services.
  • States: Task, Choice (if/else), Parallel, Map (loop), Wait, Pass, Fail, Succeed.
  • Built-in error handling: Retry with exponential backoff, Catch for fallback paths.
  • Maintains execution state — you can see exactly where a workflow is at any time.
  • Standard Workflows: up to 1 year, exactly-once, auditable ($0.025/1K transitions).
  • Express Workflows: up to 5 minutes, at-least-once (asynchronous), high-volume ($1.00 per million requests, plus duration charges).
  • Best for: complex, stateful workflows — order processing, ETL pipelines, human approval flows.
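
To make "Retry with exponential backoff" concrete, the sketch below computes the wait before each retry attempt. It assumes the Amazon States Language defaults (IntervalSeconds 1, BackoffRate 2.0, MaxAttempts 3) and ignores optional fields such as MaxDelaySeconds and jitter; the function name `retry_delays` is hypothetical.

```python
def retry_delays(interval_seconds: float = 1,
                 backoff_rate: float = 2.0,
                 max_attempts: int = 3) -> list:
    """Seconds waited before each retry: interval * backoff_rate ** n."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


print(retry_delays())            # waits before retry 1, 2 and 3
print(retry_delays(2, 1.5, 4))   # a gentler curve with one extra attempt
```

Once the attempts are exhausted, the error falls through to the state's Catch block, if one is defined.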

SQS — choreography (decoupled, point-to-point):

  • Simple queue — no workflow state, no branching, no parallel execution.
  • Consumer processes messages independently. Dead Letter Queue for failures.
  • Best for: simple task queues, decoupling, buffering between services.

EventBridge — choreography (event-driven, many-to-many):

  • Events route to targets based on content rules. Loose coupling — producers don't know about consumers.
  • No workflow state. Each target acts independently.
  • Best for: event-driven architectures, SaaS integrations, decoupled microservices.

Orchestration vs Choreography: orchestration = one central coordinator (Step Functions). Choreography = services react to events independently (SQS/EventBridge).

# ── Step Functions: Order processing workflow (ASL) ──
# {
#   "StartAt": "ValidateOrder",
#   "States": {
#     "ValidateOrder": {
#       "Type": "Task",
#       "Resource": "arn:aws:lambda:...:ValidateOrderFn",
#       "Next": "CheckInventory",
#       "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 1, "MaxAttempts": 3, "BackoffRate": 2.0}],
#       "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "OrderFailed"}]
#     },
#     "CheckInventory": {
#       "Type": "Task",
#       "Resource": "arn:aws:lambda:...:CheckInventoryFn",
#       "Next": "ProcessPayment"
#     },
#     "ProcessPayment": {
#       "Type": "Task",
#       "Resource": "arn:aws:lambda:...:ProcessPaymentFn",
#       "Next": "ParallelNotifications"
#     },
#     "ParallelNotifications": {
#       "Type": "Parallel",
#       "Branches": [
#         {"StartAt": "SendEmail", "States": {"SendEmail": {"Type": "Task", "Resource": "arn:aws:lambda:...:SendEmailFn", "End": true}}},
#         {"StartAt": "UpdateAnalytics", "States": {"UpdateAnalytics": {"Type": "Task", "Resource": "arn:aws:lambda:...:AnalyticsFn", "End": true}}}
#       ],
#       "Next": "OrderComplete"
#     },
#     "OrderComplete": {"Type": "Succeed"},
#     "OrderFailed": {"Type": "Fail", "Error": "OrderProcessingFailed"}
#   }
# }

# ── Create State Machine ──
aws stepfunctions create-state-machine \
    --name OrderProcessing \
    --definition file://workflow.json \
    --role-arn arn:aws:iam::123:role/StepFunctionsRole

# ── Start execution ──
aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:us-east-1:123:stateMachine:OrderProcessing \
    --input '{"orderId":"ORD-001","amount":99.99}'

# ── Comparison ──
# Feature       | Step Functions      | SQS            | EventBridge
# Pattern       | Orchestration       | Choreography   | Choreography
# State         | Full state machine  | None           | None
# Branching     | Choice, Parallel    | None           | Rule-based routing
# Error handling| Retry + Catch       | DLQ            | DLQ
# Visibility    | Visual workflow     | Queue depth    | Events log
# Max duration  | 1 year (Standard)   | 14 days retain | 24h retry window
# Cost model    | Per transition      | Per message    | Per event

An ETL pipeline was built with chained Lambda functions triggering each other via SQS. When step 3 of 7 failed, there was no way to retry from step 3 — the entire pipeline had to restart from step 1 (re-processing 2 hours of work). After migrating to Step Functions, each step was a state with built-in retry (3 attempts, exponential backoff) and a catch block for notification. A failure at step 3 retried automatically, and the visual console showed exactly where the workflow was stuck.

Use Step Functions when your workflow has sequential steps, branching, parallel execution, or needs retry/error handling. Use SQS for simple point-to-point decoupling. Use EventBridge for event-driven, loosely coupled architectures. Don't over-orchestrate — if services are truly independent, choreography (EventBridge) is simpler. Use Express Workflows for high-volume, short-duration processing.
⚠️ Common Mistake
// ❌ Chaining Lambdas via SQS for sequential workflow
// Lambda A → SQS → Lambda B → SQS → Lambda C → SQS → Lambda D
// Step C fails → message goes to DLQ
// No way to retry from step C only
// No visibility: where is the order in the pipeline?
// Adding error handling = custom code in every Lambda
// Total Lambda timeouts: can exceed 15 min limit

// ✅ Step Functions for sequential workflows
// ValidateOrder → CheckInventory → ProcessPayment → Notify
// Step C fails → auto-retry 3 times with backoff
// After 3 retries → Catch → send to error handler
// Visual console: see exactly where each order is
// Built-in: parallel branches, conditional logic, wait states
// Total workflow: up to 1 year (Standard), not limited by Lambda timeout
🔁 Follow-Up Question

What are Step Functions Express Workflows and when should you use them instead of Standard Workflows?

28 How do AWS WAF and AWS Shield work? Explain rule groups, rate limiting, and DDoS protection. advanced

AWS WAF (Web Application Firewall) protects web applications from common web exploits at Layer 7.

Components:

  • Web ACL: the main resource. Contains rules that are evaluated in order. Associated with CloudFront, ALB, API Gateway, or AppSync.
  • Rules:
    • Regular rules: match conditions (IP, header, body, URI) → Allow, Block, Count, or CAPTCHA.
    • Rate-based rules: block IPs exceeding a threshold (e.g., > 2,000 requests in 5 minutes).
  • Rule Groups:
    • AWS Managed Rules: pre-built by AWS — Core Rule Set (CRS), SQL injection, XSS, bad bots, known bad inputs.
    • Marketplace Rules: from third-party vendors (F5, Fortinet, Imperva).
    • Custom Rule Groups: your own rules for application-specific logic.
  • WCUs (Web ACL Capacity Units): each rule costs WCUs. Web ACL limit: 5,000 WCUs.
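
The idea behind a rate-based rule can be illustrated with a per-IP sliding window. This is a toy model, not how AWS WAF is implemented (the real service evaluates rates periodically and approximately); the class name `RateBasedRule` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Block an IP once it exceeds `limit` requests in the trailing window."""

    def __init__(self, limit: int = 2000, window_seconds: int = 300):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def evaluate(self, ip: str, now: float) -> str:
        hits = self._hits[ip]
        # Drop requests that have fallen out of the trailing window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        hits.append(now)
        return "BLOCK" if len(hits) > self.limit else "ALLOW"


rule = RateBasedRule(limit=3, window_seconds=300)
decisions = [rule.evaluate("203.0.113.7", t) for t in (0, 1, 2, 3)]
print(decisions)                         # fourth request in the window is blocked
print(rule.evaluate("198.51.100.9", 3))  # other IPs are counted separately
print(rule.evaluate("203.0.113.7", 400)) # window has passed, so allowed again
```

Counting per key (here the source IP) is what keeps one noisy client from affecting everyone else.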

AWS Shield — DDoS protection:

  • Shield Standard: free, automatic. Protects against Layer 3/4 DDoS (SYN floods, UDP reflection). Applied to all AWS resources.
  • Shield Advanced: $3,000/month (1-year commitment). Layer 3/4/7 protection, real-time visibility, Shield Response Team (SRT, formerly the DDoS Response Team or DRT), cost protection (credits for scaling charges during an attack), health-based detection.

# ── Create Web ACL with managed rules ──
aws wafv2 create-web-acl \
    --name MyAppWAF \
    --scope REGIONAL \
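    --default-action Allow={} \
    --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=MyAppWAF \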
    --rules '[
        {
            "Name": "AWSManagedRulesCommonRuleSet",
            "Priority": 1,
            "OverrideAction": {"None": {}},
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesCommonRuleSet"
                }
            },
            "VisibilityConfig": {
                "SampledRequestsEnabled": true,
                "CloudWatchMetricsEnabled": true,
                "MetricName": "CommonRuleSet"
            }
        },
        {
            "Name": "RateLimitRule",
            "Priority": 2,
            "Action": {"Block": {}},
            "Statement": {
                "RateBasedStatement": {
                    "Limit": 2000,
                    "AggregateKeyType": "IP"
                }
            },
            "VisibilityConfig": {
                "SampledRequestsEnabled": true,
                "CloudWatchMetricsEnabled": true,
                "MetricName": "RateLimit"
            }
        }
    ]'

# ── CloudFormation: WAF for ALB ──
# Resources:
#   WebACL:
#     Type: AWS::WAFv2::WebACL
#     Properties:
#       Name: MyAppWAF
#       Scope: REGIONAL
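#       DefaultAction: {Allow: {}}
#       VisibilityConfig: {SampledRequestsEnabled: true, CloudWatchMetricsEnabled: true, MetricName: MyAppWAF}
#       Rules:
#         - Name: AWSManagedRulesCommonRuleSet
#           Priority: 1
#           OverrideAction: {None: {}}
#           VisibilityConfig: {SampledRequestsEnabled: true, CloudWatchMetricsEnabled: true, MetricName: CommonRuleSet}
#           Statement:
#             ManagedRuleGroupStatement:
#               VendorName: AWS
#               Name: AWSManagedRulesCommonRuleSet
#         - Name: SQLiProtection
#           Priority: 2
#           OverrideAction: {None: {}}
#           VisibilityConfig: {SampledRequestsEnabled: true, CloudWatchMetricsEnabled: true, MetricName: SQLiProtection}
#           Statement:
#             ManagedRuleGroupStatement:
#               VendorName: AWS
#               Name: AWSManagedRulesSQLiRuleSet
#         - Name: RateLimit
#           Priority: 3
#           Action: {Block: {}}
#           VisibilityConfig: {SampledRequestsEnabled: true, CloudWatchMetricsEnabled: true, MetricName: RateLimit}
#           Statement:
#             RateBasedStatement:
#               Limit: 2000
#               AggregateKeyType: IP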
#
#   WebACLAssociation:
#     Type: AWS::WAFv2::WebACLAssociation
#     Properties:
#       ResourceArn: !Ref ALB
#       WebACLArn: !GetAtt WebACL.Arn

# ── Common AWS Managed Rule Groups ──
# AWSManagedRulesCommonRuleSet — general protection (XSS, SSRF, etc.)
# AWSManagedRulesSQLiRuleSet — SQL injection
# AWSManagedRulesKnownBadInputsRuleSet — log4j, Spring exploits
# AWSManagedRulesBotControlRuleSet — bot detection (paid add-on: monthly fee plus per-request charges)
# AWSManagedRulesATPRuleSet — account takeover protection

An e-commerce site was hit by a credential stuffing attack — bots trying thousands of username/password combinations on the login page. They deployed WAF with: (1) AWS Managed Rules Common Rule Set (blocked common exploits), (2) Rate-based rule at 100 requests per 5 minutes per IP on /login, (3) Bot Control managed rule to detect automated tools. Login abuse dropped by 99%. They also enabled Shield Advanced for their peak sales events — during a DDoS attack, the DDoS Response Team helped mitigate it within minutes.

Start with AWS Managed Rules (Common Rule Set + SQLi); they cover most attack vectors. Add rate-based rules for login and API endpoints. Use Count mode first to test rules before blocking. Send WAF logs to CloudWatch Logs, S3, or Kinesis Data Firehose for analysis. Shield Standard is free and automatic; Shield Advanced is for high-value targets that need response-team support and cost protection. Always deploy WAF in front of public-facing ALBs and CloudFront.
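
When a rule runs in Count mode, the useful question is which requests it would have blocked. The sketch below tallies counted matches per rule and URI from already-parsed log records. It assumes the `nonTerminatingMatchingRules` and `httpRequest.uri` field names of the WAF log format as I understand it, so check them against your own logs; the sample records and the rule name `BlockSelectInBody` are invented for illustration.

```python
from collections import Counter

# Hand-written sample records (hypothetical), shaped like parsed WAF log entries.
SAMPLE_RECORDS = [
    {"httpRequest": {"uri": "/search"},
     "nonTerminatingMatchingRules": [{"ruleId": "BlockSelectInBody", "action": "COUNT"}]},
    {"httpRequest": {"uri": "/search"},
     "nonTerminatingMatchingRules": [{"ruleId": "BlockSelectInBody", "action": "COUNT"}]},
    {"httpRequest": {"uri": "/login"},
     "nonTerminatingMatchingRules": []},
]


def counted_matches(records) -> Counter:
    """Tally (rule, uri) pairs for rules that matched in COUNT mode."""
    tally = Counter()
    for record in records:
        uri = record.get("httpRequest", {}).get("uri", "?")
        for rule in record.get("nonTerminatingMatchingRules", []):
            if rule.get("action") == "COUNT":
                tally[(rule.get("ruleId", "?"), uri)] += 1
    return tally


for (rule_id, uri), hits in counted_matches(SAMPLE_RECORDS).most_common():
    print(f"{rule_id} would have blocked {hits} request(s) to {uri}")
```

A rule that mostly matches an ordinary endpoint such as a search page is a likely false positive and worth tightening before switching it to Block.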
⚠️ Common Mistake
// ❌ Deploying WAF rules in Block mode without testing
// New WAF rule: block requests with "select" in body
// Legitimate users searching for "select a product" → BLOCKED!
// E-commerce search broken → revenue loss
// WAF rules can have false positives, so test first!

// ✅ Test in Count mode, then switch to Block
// 1. Deploy rule in COUNT mode (logs but doesn't block)
// 2. Analyze WAF logs for 1-2 weeks
// 3. Review sampled requests: any false positives?
// 4. Adjust rule conditions to reduce false positives
// 5. Switch to BLOCK mode with confidence
// Use WAF logging → S3 → Athena for analysis
🔁 Follow-Up Question

What is AWS Firewall Manager and how does it centrally manage WAF rules across multiple accounts?

29 What is the AWS Well-Architected Framework? Explain the 6 pillars and how to apply them. experienced

The AWS Well-Architected Framework provides best practices for building secure, high-performing, resilient, and efficient cloud architectures. It has 6 pillars:

1. Operational Excellence:

  • Run and monitor systems, continuously improve processes.
  • Key practices: IaC (CloudFormation/Terraform), CI/CD pipelines, runbooks, observability (CloudWatch, X-Ray), small frequent changes, anticipate failure.

2. Security:

  • Protect data, systems, and assets. Principle of least privilege.
  • Key practices: IAM roles (not users), encryption at rest and in transit, MFA, security groups, WAF, GuardDuty, detective controls, incident response plan.

3. Reliability:

  • Recover from failures, meet demand. Design for failure.
  • Key practices: Multi-AZ, Auto Scaling, health checks, circuit breakers, backups, DR testing, chaos engineering.

4. Performance Efficiency:

  • Use computing resources efficiently as demand changes.
  • Key practices: right-sizing (Compute Optimizer), serverless (Lambda, Fargate), caching (ElastiCache, CloudFront), global infrastructure, performance testing.

5. Cost Optimization:

  • Avoid unnecessary costs. Pay only for what you use.
  • Key practices: Reserved Instances, Savings Plans, Spot, right-sizing, lifecycle policies, Cost Explorer, budgets and alerts.

6. Sustainability (newest pillar):

  • Minimize environmental impact of cloud workloads.
  • Key practices: Graviton (energy-efficient), serverless, efficient code, data retention policies, Region selection (renewable energy).

# ── Well-Architected Tool: Create a workload review ──
aws wellarchitected create-workload \
    --workload-name "E-Commerce Platform" \
    --description "Production e-commerce application" \
    --environment PRODUCTION \
    --lenses wellarchitected \
    --aws-regions us-east-1

# ── Answer a question in the review ──
aws wellarchitected update-answer \
    --workload-id w-abc123 \
    --lens-alias wellarchitected \
    --question-id "ops-1" \
    --selected-choices "ops_1_aws_cloud_ops_1" \
    --notes "We use CloudFormation for all infrastructure"

# ── Pillar checklist (key questions per pillar) ──
# OPERATIONAL EXCELLENCE:
# □ Do you use IaC for all infrastructure?
# □ Do you have a CI/CD pipeline with automated testing?
# □ Do you have runbooks for common operational tasks?
# □ Do you have observability (metrics, logs, traces)?
#
# SECURITY:
# □ Is MFA enabled for all human users?
# □ Are IAM roles used instead of access keys?
# □ Is encryption enabled at rest and in transit?
# □ Are security groups following least privilege?
# □ Is CloudTrail enabled for audit logging?
#
# RELIABILITY:
# □ Are workloads deployed across multiple AZs?
# □ Is Auto Scaling configured with health checks?
# □ Do you have automated backups and tested restores?
# □ Is there a disaster recovery plan with defined RPO/RTO?
#
# PERFORMANCE EFFICIENCY:
# □ Are instances right-sized (Compute Optimizer)?
# □ Is caching used (CloudFront, ElastiCache)?
# □ Are you using serverless where appropriate?
#
# COST OPTIMIZATION:
# □ Are Reserved Instances/Savings Plans used for steady-state?
# □ Are unused resources identified and removed?
# □ Are S3 lifecycle policies configured?
# □ Are budgets and billing alerts set up?
#
# SUSTAINABILITY:
# □ Are you using Graviton instances (AWS cites up to 60% less energy for the same performance)?
# □ Are you using serverless to minimize idle resources?

A fintech startup completed a Well-Architected Review that surfaced 15 high-risk issues (HRIs). Top findings: (1) no Multi-AZ for the database (a single point of failure), (2) no encryption on S3 buckets, (3) IAM users with admin access and no MFA, (4) no budget alerts, which led to an $8K surprise bill, (5) no DR plan. Over 3 months, they remediated all HRIs: RDS Multi-AZ, S3 SSE-KMS, IAM roles with least privilege, billing alerts, and quarterly DR drills. Their next review had zero HRIs.

Use the Well-Architected Tool to conduct periodic reviews — it surfaces risks you might miss. Focus on high-risk issues (HRIs) first. The pillars often involve trade-offs: higher reliability = higher cost, better security = more operational complexity. Design decisions should explicitly state which pillar is being prioritized. Run reviews quarterly and after major architecture changes.
⚠️ Common Mistake
// ❌ Treating Well-Architected as a one-time checklist
// "We did the review 2 years ago, so we're fine"
// Architecture changed 10 times since then
// New services added without pillar review
// Security posture degraded over time (new IAM users, open SGs)
// Cost crept up 300% without anyone noticing

// ✅ Continuous Well-Architected practice
// Quarterly reviews with the Well-Architected Tool
// Each new service: security + reliability review before launch
// Cost review monthly: right-sizing, unused resources
// DR drill quarterly: verify failover actually works
// Track HRI count over time → trending down = improving
🔁 Follow-Up Question

What are Well-Architected Lenses and how do they extend the framework for specific workloads (serverless, SaaS, ML)?

30 How do you optimize AWS costs? Explain Reserved Instances, Savings Plans, Spot Instances, and right-sizing. experienced

AWS cost optimization is a continuous process with multiple strategies:

Pricing Models:

  • On-Demand: pay per second/hour. No commitment. Full price. Use for variable, short-term workloads.
  • Reserved Instances (RIs): 1 or 3-year commitment for a specific instance type in a specific Region. Up to 72% discount. Standard (fixed type) or Convertible (can change type).
  • Savings Plans: 1 or 3-year commitment to a dollar amount per hour of compute. More flexible than RIs. Applies to EC2, Lambda, and Fargate. Recommended over RIs for most cases.
  • Spot Instances: unused EC2 capacity at up to 90% discount. Can be interrupted with 2-minute notice. Best for batch processing, CI/CD, data analysis, fault-tolerant workloads.

Right-Sizing:

  • Use AWS Compute Optimizer to identify over-provisioned instances.
  • Common finding: instances running at 5-10% CPU → downsize to save 50%.
  • Review monthly. Right-size before buying RIs/Savings Plans.

Other strategies:

  • S3 Lifecycle Policies: move to cheaper tiers automatically.
  • Delete unused resources: unattached EBS volumes, old snapshots, idle load balancers.
  • Graviton instances: up to 40% better price/performance.
  • Cost Explorer + Budgets: visibility and alerting.

# ── Buy a Savings Plan ──
# The term (1 or 3 years) and payment option are fixed by the chosen offering ID
aws savingsplans create-savings-plan \
    --savings-plan-offering-id offering-abc123 \
    --commitment 10.00

# ── Request Spot Instances ──
aws ec2 run-instances \
    --instance-type m7g.xlarge \
    --instance-market-options MarketType=spot \
    --count 5 \
    --image-id ami-abc123 \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Purpose,Value=BatchProcessing}]"

# ── Right-sizing: Get Compute Optimizer recommendations ──
aws compute-optimizer get-ec2-instance-recommendations \
    --filters name=Finding,values=Overprovisioned \
    --query "instanceRecommendations[].{Instance:instanceArn,Current:currentInstanceType,Recommended:recommendationOptions[0].instanceType,Savings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value}"

# ── Find unused EBS volumes ──
aws ec2 describe-volumes \
    --filters Name=status,Values=available \
    --query "Volumes[].{VolumeId:VolumeId,Size:Size,Created:CreateTime}" \
    --output table

# ── Set up billing alarm (billing metrics are published only in us-east-1) ──
aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name BillingAlarm \
    --metric-name EstimatedCharges \
    --namespace AWS/Billing \
    --statistic Maximum --period 21600 \
    --evaluation-periods 1 --threshold 1000 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123:billing-alerts \
    --dimensions Name=Currency,Value=USD

# ── Cost comparison (m7g.large, us-east-1) ──
# Pricing Model    | $/hour  | Monthly (730hr) | Savings
# On-Demand        | $0.0816 | $59.57          | 0%
# 1yr Savings Plan | $0.0518 | $37.81          | 37%
# 3yr Savings Plan | $0.0332 | $24.24          | 59%
# Spot Instance    | ~$0.024 | ~$17.52         | ~70%
#
# ── When to use each ──
# Steady-state production → Savings Plan (1yr or 3yr)
# Variable production     → On-Demand + Auto Scaling
# Batch processing        → Spot Instances (fault-tolerant)
# Dev/Test                → Spot or On-Demand + shutdown schedule
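
The monthly and savings columns in the comparison above follow directly from the hourly rates, and the same arithmetic shows why right-sizing should come before committing. The sketch below reuses the rates from the table (treat them as a snapshot, since prices change); the helper names are hypothetical.

```python
HOURS_PER_MONTH = 730
ON_DEMAND_RATE = 0.0816  # m7g.large, us-east-1, from the table above

RATES = {
    "On-Demand": 0.0816,
    "1yr Savings Plan": 0.0518,
    "3yr Savings Plan": 0.0332,
    "Spot (approx.)": 0.024,
}


def monthly_cost(rate: float) -> float:
    """Cost of running one instance for a full month at this hourly rate."""
    return round(rate * HOURS_PER_MONTH, 2)


def savings_pct(rate: float) -> int:
    """Discount relative to On-Demand, as a whole percentage."""
    return round((1 - rate / ON_DEMAND_RATE) * 100)


def break_even_utilization(committed_rate: float) -> float:
    """Fraction of hours the instance must run for a commitment to beat On-Demand.

    A Savings Plan is billed every hour whether or not it is used.
    """
    return committed_rate / ON_DEMAND_RATE


for model, rate in RATES.items():
    print(f"{model:18} ${monthly_cost(rate):6.2f}/month  {savings_pct(rate)}% saved")
print(f"1yr plan breaks even at {break_even_utilization(0.0518):.1%} utilization")
```

If the instance would run fewer hours than the break-even share, or is likely to be downsized, the commitment costs more than staying On-Demand.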

A company spent $50K/month on AWS. Cost optimization analysis found: (1) 40% of EC2 instances were running at < 10% CPU → right-sized, saving $8K, (2) 200 unattached EBS volumes → deleted, saving $1.5K, (3) S3 lifecycle policies → saved $3K, (4) Compute Savings Plans for steady-state workloads → saved $12K, (5) Spot Instances for nightly batch jobs → saved $4K. Total monthly savings: $28.5K (57% reduction). They set up Cost Explorer dashboards and $40K monthly budget alerts.

Right-size first, then commit (Savings Plans). Delete unused resources regularly (volumes, snapshots, IPs, load balancers). Use Spot for batch/CI/CD. Use S3 lifecycle policies. Use Graviton for 40% better price/performance. Set budget alerts to catch cost anomalies early. Cost optimization is a monthly habit, not a one-time project.
⚠️ Common Mistake
// ❌ Buying Reserved Instances before right-sizing
// 20 × m5.4xlarge RIs purchased (3-year, $180K commitment)
// After purchase: Compute Optimizer says "use m7g.large"
// 16 vCPUs provisioned but workload needs 2 vCPUs → 87.5% waste
// RI is non-refundable (Standard), so you are stuck for 3 years
// $180K spent on 8x more compute than needed

// ✅ Right-size first, then commit
// 1. Run Compute Optimizer to identify true resource needs
// 2. Resize instances (m5.4xlarge → m7g.large)
// 3. Monitor for 2-4 weeks to verify the new sizes work
// 4. Buy Savings Plans (not RIs) for more flexibility
// 5. Use Convertible RIs if uncertain about future instance types
// Savings Plans apply across instance families, which lowers risk
🔁 Follow-Up Question

What is the difference between Reserved Instances and Savings Plans? Which should you choose?

31 What are the AWS disaster recovery strategies? Explain RPO, RTO, and the four DR architectures. experienced

Disaster Recovery (DR) ensures business continuity when infrastructure fails. Two key metrics define your DR requirements:

RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. If RPO = 1 hour, you can afford to lose up to 1 hour of data.

RTO (Recovery Time Objective): Maximum acceptable downtime. If RTO = 15 minutes, the system must be back online within 15 minutes.

Four DR strategies (ordered by cost and recovery time):

  • 1. Backup & Restore: cheapest. Backups stored in S3/Glacier. On disaster: restore from backup, provision infrastructure. RPO: hours. RTO: hours. Cost: very low.
  • 2. Pilot Light: critical components running (database replica), compute scaled to zero. On disaster: start compute, scale up. RPO: minutes. RTO: tens of minutes. Cost: low.
  • 3. Warm Standby: scaled-down copy of production always running. On disaster: scale up to full production capacity. RPO: seconds. RTO: minutes. Cost: medium.
  • 4. Multi-Site Active-Active: full production in 2+ Regions handling live traffic. On disaster: traffic shifts to surviving Region. RPO: near-zero. RTO: near-zero. Cost: high (2x+ infrastructure).

Choosing a strategy: based on business impact of downtime. A banking app (RTO < 1 min) needs active-active. A reporting dashboard (RTO < 4 hrs) can use pilot light.
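
The selection rule, "pick the cheapest strategy that still meets your RPO and RTO", can be sketched as a lookup over the four strategies. The numeric RPO/RTO figures below are illustrative assumptions chosen to fit the rough ranges above ("hours", "minutes", "seconds"), not AWS guarantees; real values depend on your architecture and should come from DR drills.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo_seconds: float  # assumed worst-case data loss
    rto_seconds: float  # assumed worst-case recovery time


# Ordered cheapest first. All figures are illustrative assumptions.
STRATEGIES = [
    Strategy("Backup & Restore", rpo_seconds=24 * 3600, rto_seconds=8 * 3600),
    Strategy("Pilot Light", rpo_seconds=5 * 60, rto_seconds=30 * 60),
    Strategy("Warm Standby", rpo_seconds=10, rto_seconds=10 * 60),
    Strategy("Multi-Site Active-Active", rpo_seconds=1, rto_seconds=30),
]


def cheapest_strategy(rpo_seconds: float, rto_seconds: float):
    """Return the first (cheapest) strategy meeting both targets, else None."""
    for s in STRATEGIES:
        if s.rpo_seconds <= rpo_seconds and s.rto_seconds <= rto_seconds:
            return s.name
    return None


print(cheapest_strategy(rpo_seconds=3600, rto_seconds=4 * 3600))  # reporting dashboard
print(cheapest_strategy(rpo_seconds=60, rto_seconds=15 * 60))     # 15-minute RTO service
print(cheapest_strategy(rpo_seconds=5, rto_seconds=60))           # banking app
```

A result of `None` means no listed strategy meets the targets under these assumptions, which usually signals that the targets or the architecture need revisiting.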

# ── Strategy 1: Backup & Restore ──
# Automated backups:
aws rds modify-db-instance \
    --db-instance-identifier mydb \
    --backup-retention-period 7 \
    --preferred-backup-window "03:00-04:00"

# S3 cross-region backup
aws s3 sync s3://prod-data s3://dr-backup-eu --storage-class STANDARD_IA

# To recover: create new infrastructure from CloudFormation,
# restore DB from latest snapshot
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier mydb-restored \
    --db-snapshot-identifier rds:mydb-2026-05-30-03-00

# ── Strategy 2: Pilot Light ──
# Aurora read replica in DR Region (always running)
# ASG with min=0, max=10 in DR Region (no instances running)
# Route 53 failover with health checks

# On disaster: scale up ASG
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name DR-WebServers \
    --min-size 3 --desired-capacity 6

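# Promote the secondary Aurora cluster to primary
# (--allow-data-loss marks this as an unplanned failover; omit it for a planned switchover)
aws rds failover-global-cluster \
    --global-cluster-identifier my-global-db \
    --target-db-cluster-identifier arn:aws:rds:eu-west-1:123:cluster:my-aurora-secondary \
    --allow-data-loss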

# ── Strategy 3: Warm Standby ──
# Scaled-down infra in DR Region always running
# Primary: 10 instances, DR: 2 instances
# Route 53 weighted: 100% primary, 0% DR (or failover)

# ── Strategy 4: Active-Active ──
# DynamoDB Global Tables + Aurora Global DB
# Route 53 latency-based routing
# Both Regions serve live traffic simultaneously

# ── DR Strategy comparison ──
# Strategy      | RPO           | RTO           | Cost
# Backup/Restore| Hours         | Hours         | $
# Pilot Light   | Minutes       | 10-30 min     | $$
# Warm Standby  | Seconds       | Minutes       | $$$
# Active-Active | Near-zero     | Near-zero     | $$$$

# ── Key AWS services for DR ──
# Data: Aurora Global DB, DynamoDB Global Tables, S3 CRR
# Compute: ASG, Launch Templates, AMIs (copy cross-region)
# DNS: Route 53 failover/latency routing + health checks
# IaC: CloudFormation StackSets (deploy to multiple Regions)

A healthcare SaaS application needed < 15-minute RTO and < 1-minute RPO for compliance. They implemented warm standby: Aurora Global Database with a secondary cluster in eu-west-1 (replication lag typically under 1 second, well inside the RPO), a scaled-down ECS service (2 tasks vs 10 in production), and Route 53 failover routing with health checks. Quarterly DR drills proved they could fail over in 8 minutes. During an actual us-east-1 outage, the automated failover completed in 6 minutes, within their 15-minute RTO commitment.

Define RPO and RTO based on business requirements and cost tolerance. Always test your DR plan — an untested DR plan is not a plan. Automate failover as much as possible (Route 53 health checks, Aurora Global DB failover). Keep DR infrastructure as code so you can redeploy reliably. Start with the cheapest strategy that meets your RPO/RTO and upgrade as the business grows.
⚠️ Common Mistake
// ❌ DR plan exists on paper but never tested
// "We have a DR runbook, 47 pages long"
// Last tested: never (or 2 years ago)
// Actual disaster: Step 12 references a deleted IAM role
// Step 23: "login to Jenkins", but Jenkins is in the failed Region
// Recovery takes 6 hours instead of the planned 30 minutes
// RTO promise to customers: 1 hour → SLA breached

// ✅ Regular DR drills with automated runbooks
// Quarterly "Game Day": simulate Region failure
// Automated: Route 53 failover + Aurora promotion + ASG scale-up
// Measure actual RTO: 8 minutes (within 15-min target)
// Document findings: "IAM role missing in DR Region → fixed"
// Chaos engineering: randomly inject failures in production
// DR infrastructure managed by CloudFormation StackSets
🔁 Follow-Up Question

What is AWS Elastic Disaster Recovery (DRS) and how does it simplify DR for on-premises workloads?

32 What are AWS Landing Zone and Control Tower? How do you set up a secure multi-account environment? experienced

AWS Control Tower automates the setup and governance of a secure, multi-account AWS environment (a "landing zone").

What it provides:

  • Account Factory: automated account provisioning with predefined configurations (VPC, subnets, IAM, logging). Create new accounts in minutes via Service Catalog.
  • Guardrails (Controls):
    • Preventive: SCPs that prevent non-compliant actions (e.g., "disallow public S3 buckets").
    • Detective: AWS Config rules that detect non-compliance (e.g., "S3 bucket without encryption").
    • Proactive: CloudFormation hooks that block non-compliant resource creation before deployment.
  • Dashboard: centralized view of compliance across all accounts.
  • Log Archive Account: centralized CloudTrail and Config logs (immutable).
  • Audit Account: security team access to all accounts for investigation.
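
The difference between preventive and detective controls is mostly about when the check runs. The toy model below illustrates that timing only: real preventive controls are SCPs and real detective controls are AWS Config rules, and the request and resource dictionaries here are hypothetical shapes invented for the example.

```python
def preventive_check(request: dict) -> bool:
    """Run BEFORE the action: return True if the request may proceed."""
    makes_bucket_public = (
        request["action"] == "s3:PutBucketAcl" and request.get("acl") == "public-read"
    )
    return not makes_bucket_public


def detective_scan(resources: list) -> list:
    """Run AFTER the fact: return IDs of existing unencrypted buckets."""
    return [
        r["id"]
        for r in resources
        if r["type"] == "s3-bucket" and not r.get("encrypted", False)
    ]


# Preventive: the non-compliant request is denied and never takes effect.
print(preventive_check({"action": "s3:PutBucketAcl", "acl": "public-read"}))
print(preventive_check({"action": "s3:PutObject"}))

# Detective: non-compliant resources already exist and are flagged for follow-up.
inventory = [
    {"id": "logs-bucket", "type": "s3-bucket", "encrypted": True},
    {"id": "scratch-bucket", "type": "s3-bucket", "encrypted": False},
]
print(detective_scan(inventory))
```

Preventive controls stop the change outright, while detective controls leave the resource in place and report it, which is why the two are usually combined.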

Landing Zone Architecture:

  • Management Account: Organizations root. Billing. No workloads.
  • Log Archive Account: centralized CloudTrail, Config, VPC Flow Logs. S3 bucket with object lock.
  • Audit/Security Account: GuardDuty admin, Security Hub, IAM Access Analyzer.
  • Shared Services Account: Transit Gateway, DNS, CI/CD pipelines.
  • Workload Accounts: separate accounts per environment (prod, staging, dev) and per team.

# ── Enable Control Tower (Console steps shown; a CreateLandingZone API also exists) ──
# 1. Go to AWS Control Tower in the management account
# 2. Set up landing zone → creates:
#    - Log Archive account
#    - Audit account
#    - Security OU, Sandbox OU
#    - 20+ mandatory guardrails

# ── Account Factory: Create a new workload account ──
# Via Service Catalog or CLI:
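# pa-abc123 is a placeholder for the Account Factory provisioning artifact ID
aws servicecatalog provision-product \
    --product-name "AWS Control Tower Account Factory" \
    --provisioning-artifact-id pa-abc123 \
    --provisioned-product-name "team-alpha-prod" \
    --provisioning-parameters \
    Key=AccountName,Value=team-alpha-prod \
    Key=AccountEmail,Value=team-alpha-prod@company.com \
    Key=SSOUserFirstName,Value=Admin \
    Key=SSOUserLastName,Value=User \
    Key=SSOUserEmail,Value=admin@company.com \
    Key=ManagedOrganizationalUnit,Value="Workloads/Production"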

# ── Guardrails (Controls) ──
# Mandatory (always on):
# - Disallow changes to CloudTrail configuration
# - Disallow changes to AWS Config rules
# - Disallow deletion of log archive

# Strongly recommended:
# - Enable encryption for EBS volumes
# - Disallow public S3 buckets
# - Disallow internet access for RDS instances
# - Enable MFA for root user

# ── Enable additional guardrails ──
aws controltower enable-control \
    --control-identifier arn:aws:controltower:us-east-1::control/AWS-GR_ENCRYPTED_VOLUMES \
    --target-identifier arn:aws:organizations::123:ou/o-xxx/ou-xxx

# ── Landing Zone structure ──
# Root (Management Account — billing only)
# ├── Security OU
# │   ├── Log Archive Account (CloudTrail, Config, immutable S3)
# │   └── Audit Account (GuardDuty, Security Hub, IAM Analyzer)
# ├── Infrastructure OU
# │   └── Shared Services (Transit Gateway, DNS, CI/CD)
# ├── Workloads OU
# │   ├── Production OU
# │   │   ├── team-alpha-prod
# │   │   └── team-beta-prod
# │   ├── Staging OU
# │   └── Development OU
# └── Sandbox OU (experimentation, budget limits)

# ── Customizations for Control Tower (CfCT) ──
# Deploy additional CloudFormation stacks to new accounts:
# - VPC with standard subnets
# - Security groups baseline
# - IAM roles for CI/CD
# - CloudWatch alarm baseline
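
The Account Factory call above passes its inputs as Key/Value pairs. A minimal sketch of assembling and checking that list before provisioning might look like the following. The helper name `account_factory_parameters` is hypothetical; the six keys are the parameters Account Factory asks for, and the example values are placeholders.

```python
from typing import Dict, List

# Parameters the Account Factory product expects for a new account.
REQUIRED_KEYS = (
    "AccountName",
    "AccountEmail",
    "ManagedOrganizationalUnit",
    "SSOUserEmail",
    "SSOUserFirstName",
    "SSOUserLastName",
)


def account_factory_parameters(values: Dict[str, str]) -> List[Dict[str, str]]:
    """Return the Key/Value list, failing early if anything is missing."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise ValueError(f"missing Account Factory parameters: {missing}")
    return [{"Key": key, "Value": values[key]} for key in REQUIRED_KEYS]


if __name__ == "__main__":
    params = account_factory_parameters({
        "AccountName": "team-alpha-prod",
        "AccountEmail": "team-alpha-prod@company.com",
        "ManagedOrganizationalUnit": "Workloads/Production",
        "SSOUserEmail": "admin@company.com",
        "SSOUserFirstName": "Admin",      # placeholder
        "SSOUserLastName": "User",        # placeholder
    })
    for pair in params:
        print(pair)
```

The resulting list has the shape that a Service Catalog provisioning request takes for its parameters, so validating it up front avoids a failed provisioning run caused by a forgotten field.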

A growing startup with 3 teams needed separate AWS environments. They set up Control Tower: 8 accounts (mgmt, log archive, audit, shared services, team-a-prod, team-a-dev, team-b-prod, team-b-dev). Account Factory created each account in 15 minutes with standardized VPCs, IAM roles, and guardrails. Detective guardrails caught a developer who created an unencrypted RDS instance in prod — automatically flagged in the dashboard. Centralized CloudTrail logs in the log archive account made security audits trivial.

Use Control Tower for any organization with 3+ AWS accounts — it automates what would take weeks manually. Account Factory ensures every new account is compliant from day one. Guardrails prevent and detect non-compliance continuously. Keep the management account empty (no workloads). Centralize logs in a dedicated account with immutable storage. Use Customizations for Control Tower (CfCT) to add your own baseline stacks.
⚠️ Common Mistake
// ❌ Manual multi-account setup
// Creating accounts manually in Console
// Each account has different VPC design, IAM roles, security groups
// No centralized logging: each account logs locally
// No guardrails: developers can create public RDS instances
// New account setup: 2 days of manual configuration
// No consistency: "why does team B's account look different?"

// ✅ Control Tower automated landing zone
// Account Factory: new account in 15 minutes
// Standardized: same VPC, IAM, security baseline every time
// Guardrails: 20+ mandatory + custom controls
// Centralized: CloudTrail + Config → Log Archive (immutable)
// Dashboard: see compliance status across all accounts
// CfCT: custom CloudFormation for company-specific baseline
🔁 Follow-Up Question

What is AWS Config and how does it complement Control Tower for continuous compliance monitoring?

33 How do you implement observability on AWS? Explain X-Ray, Container Insights, and distributed tracing. experienced

Observability = Metrics + Logs + Traces. The ability to understand system behavior from external outputs.

AWS X-Ray — distributed tracing:

  • Traces requests across multiple services (API Gateway → Lambda → DynamoDB → SQS → Lambda).
  • Generates a Service Map: visual graph of service dependencies with latency and error rates.
  • Traces: end-to-end path of a request with timing for each segment.
  • Sampling: traces only a subset of requests to manage cost. The default rule records the first request each second plus 5% of any additional requests.
  • Integrates with: Lambda, API Gateway, ECS, EKS, EC2, Elastic Beanstalk.
  • Use X-Ray SDK in your code to add custom subsegments and annotations.
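
X-Ray's default sampling rule records the first request each second and then 5% of the rest. The sketch below simulates that decision locally so the behavior is easier to reason about. It is not the SDK's implementation; the class name `DefaultRuleSampler` and the injected `roll` value (a random number in [0, 1)) are assumptions made for illustration and testability.

```python
class DefaultRuleSampler:
    """Reservoir of N traces per second, then a fixed rate for the remainder."""

    def __init__(self, reservoir_per_second: int = 1, fixed_rate: float = 0.05):
        self.reservoir_per_second = reservoir_per_second
        self.fixed_rate = fixed_rate
        self._second = None   # the wall-clock second currently being counted
        self._taken = 0       # reservoir slots used in that second

    def should_sample(self, now: float, roll: float) -> bool:
        """Decide for one request arriving at time `now` (epoch seconds)."""
        second = int(now)
        if second != self._second:
            # New second: the reservoir refills.
            self._second = second
            self._taken = 0
        if self._taken < self.reservoir_per_second:
            self._taken += 1
            return True
        # Reservoir exhausted: fall back to the fixed percentage.
        return roll < self.fixed_rate


if __name__ == "__main__":
    import random

    sampler = DefaultRuleSampler()
    # 1,000 requests spread evenly over 10 seconds.
    decisions = [sampler.should_sample(i / 100, random.random()) for i in range(1000)]
    print(f"sampled {sum(decisions)} of {len(decisions)} requests")
```

At low traffic almost every request is traced; at high traffic the sampled share approaches the fixed rate, which is why trace volume (and cost) grows slowly as load increases.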

CloudWatch Container Insights:

  • Collects and aggregates metrics from ECS and EKS: CPU, memory, disk, network per cluster, service, task, pod.
  • Pre-built dashboards for container performance.
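  • Collection: on ECS, enabling the cluster setting provides cluster, service, and task metrics, and the CloudWatch agent adds instance-level metrics for the EC2 launch type. On EKS, the CloudWatch agent runs as a DaemonSet for metrics and Fluent Bit ships the logs.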

CloudWatch Application Insights: auto-detects application components and sets up monitoring for .NET, Java, and SQL Server workloads.

Amazon Managed Grafana: managed Grafana for custom dashboards. Sources: CloudWatch, X-Ray, Prometheus, OpenSearch.

Amazon Managed Service for Prometheus: managed Prometheus-compatible monitoring for Kubernetes metrics. Works with EKS.

# ── Enable X-Ray for Lambda ──
aws lambda update-function-configuration \
    --function-name ProcessOrder \
    --tracing-config Mode=Active

# ── Python: X-Ray SDK for custom subsegments ──
# from aws_xray_sdk.core import xray_recorder, patch_all
# patch_all()  # Auto-instrument boto3, requests, etc.
#
# @xray_recorder.capture("process_payment")
# def process_payment(order):
#     # Add annotation for filtering in X-Ray console
#     xray_recorder.current_subsegment().put_annotation("orderId", order["id"])
#     xray_recorder.current_subsegment().put_annotation("amount", order["amount"])
#     # ... payment logic
#     return {"status": "success"}
#
# def handler(event, context):
#     # The context manager closes the subsegment even if validate() raises
#     with xray_recorder.in_subsegment("validate_input"):
#         order = validate(event)  # validate() is application code, not shown
#
#     result = process_payment(order)
#     return result

# ── Enable Container Insights for ECS ──
aws ecs update-cluster-settings \
    --cluster my-cluster \
    --settings name=containerInsights,value=enabled

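# ── EKS: Install CloudWatch agent for Container Insights ──
# Option 1: the CloudWatch Observability EKS add-on
# aws eks create-addon --cluster-name my-cluster \
#     --addon-name amazon-cloudwatch-observability
#
# Option 2: the older quickstart manifest (fill in its cluster-name and
# Region placeholders before applying):
# kubectl apply -f https://raw.githubusercontent.com/aws-samples/\
# amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/\
# deployment-mode/daemonSet/container-insights-monitoring/quickstart/\
# cwagent-fluentd-quickstart.yaml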

# ── X-Ray: Get Service Map ──
aws xray get-service-graph \
    --start-time 2026-05-30T00:00:00Z \
    --end-time 2026-05-30T23:59:59Z

# ── Observability stack architecture ──
# Metrics:  CloudWatch Metrics + Container Insights → Grafana
# Logs:     CloudWatch Logs + Fluent Bit → Logs Insights
# Traces:   X-Ray + SDK instrumentation → Service Map
#
# The three pillars together:
# "API latency increased" → (Metrics: P99 spike)
# "Which requests?" → (Traces: X-Ray shows DB calls slow)
# "What error?" → (Logs: CloudWatch Logs Insights query)

# ── CloudWatch dashboard (IaC) ──
# Resources:
#   ObservabilityDashboard:
#     Type: AWS::CloudWatch::Dashboard
#     Properties:
#       DashboardName: AppOverview
#       DashboardBody: !Sub |
#         {"widgets": [
#           {"type": "metric", "properties": {
#             "metrics": [["AWS/Lambda","Duration","FunctionName","ProcessOrder"]],
#             "region": "${AWS::Region}",
#             "title": "Lambda Duration"
#           }},
#           {"type": "metric", "properties": {
#             "metrics": [["AWS/Lambda","Errors","FunctionName","ProcessOrder"]],
#             "region": "${AWS::Region}",
#             "title": "Lambda Errors"
#           }}
#         ]}

A microservices team had 15 services on ECS. When users reported slow checkout, the team spent 3 hours checking each service's logs individually. After enabling X-Ray with SDK instrumentation and Container Insights, they could: (1) see the Service Map showing the checkout flow: API → OrderService → PaymentService → DynamoDB, (2) find the bottleneck: PaymentService → DynamoDB query taking 2 seconds (should be 10ms), (3) drill into the trace and see a missing GSI causing a table scan. Total debugging time: 5 minutes.

Implement all three pillars: metrics (CloudWatch), logs (CloudWatch Logs), and traces (X-Ray). Instrument your code with X-Ray SDK for custom subsegments and annotations. Enable Container Insights for ECS/EKS — it provides per-task/pod metrics automatically. Use the Service Map to understand service dependencies. Use annotations for filtering traces (by orderId, customerId, etc.).
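
To make the "find the bottleneck from a trace" step concrete, here is a small sketch that walks a trace shaped like a simplified X-Ray segment document (`name`, `start_time`, `end_time`, nested `subsegments`) and reports the slowest leaf call. The function name `slowest_leaf` and the sample timings are illustrative assumptions, not output from a real trace.

```python
from typing import Dict, List, Tuple


def slowest_leaf(segment: Dict, path: Tuple[str, ...] = ()) -> Tuple[float, Tuple[str, ...]]:
    """Return (duration_seconds, call_path) of the slowest leaf under `segment`."""
    here = path + (segment["name"],)
    children: List[Dict] = segment.get("subsegments", [])
    if not children:
        return (segment["end_time"] - segment["start_time"], here)
    # Recurse into every child and keep the longest leaf.
    return max(slowest_leaf(child, here) for child in children)


# Hypothetical trace for the checkout flow described above.
EXAMPLE_TRACE = {
    "name": "API", "start_time": 0.0, "end_time": 2.5,
    "subsegments": [{
        "name": "OrderService", "start_time": 0.125, "end_time": 2.375,
        "subsegments": [
            {"name": "SQS", "start_time": 0.125, "end_time": 0.25},
            {
                "name": "PaymentService", "start_time": 0.25, "end_time": 2.375,
                "subsegments": [
                    {"name": "DynamoDB", "start_time": 0.25, "end_time": 2.25},
                ],
            },
        ],
    }],
}

if __name__ == "__main__":
    duration, call_path = slowest_leaf(EXAMPLE_TRACE)
    print(" -> ".join(call_path), f"took {duration:.3f}s")
```

This is the same reasoning the Service Map and trace timeline support visually: follow the request down to the deepest call that accounts for most of the elapsed time.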
⚠️ Common Mistake
// ❌ Only metrics, no traces: can't find the bottleneck
// CloudWatch shows: Lambda P99 = 5 seconds (too slow!)
// But Lambda calls 4 other services... which one is slow?
// Check Service A logs: looks fine
// Check Service B logs: looks fine
// Check Service C logs: looks fine
// 3 hours later: it was Service D → DynamoDB table scan
// Without traces, debugging distributed systems is guesswork

// ✅ X-Ray traces + metrics + logs = full picture
// CloudWatch: Lambda P99 = 5 seconds (alert!)
// X-Ray Service Map: PaymentService → DynamoDB = 4.8 seconds
// X-Ray Trace: DynamoDB Query on table "Orders", no GSI, full scan
// CloudWatch Logs: "Scan consumed 5000 RCUs" (confirmation)
// Fix: add GSI → DynamoDB query drops to 5ms
// Total debugging: 5 minutes (not 3 hours)
🔁 Follow-Up Question

What is OpenTelemetry (OTEL) and how does AWS Distro for OpenTelemetry (ADOT) work with X-Ray?

34 How do you build a data lake on AWS? Explain S3, Glue, Athena, and Lake Formation. experienced

A data lake stores structured, semi-structured, and unstructured data in its raw form for analytics and ML.

Key components:

  • Amazon S3 — the storage layer. Data stored in open formats (Parquet, ORC, JSON, CSV). Organized by zones:
    • Raw/Landing: original data as received (JSON, CSV).
    • Cleansed/Processed: cleaned, validated, converted to Parquet.
    • Curated/Analytics: aggregated, enriched, ready for queries.
  • AWS Glue — ETL (Extract, Transform, Load):
    • Glue Crawlers: auto-discover schema and populate the Glue Data Catalog.
    • Glue Jobs: serverless Spark/Python ETL to transform data.
    • Glue Data Catalog: centralized metadata store (like Apache Hive Metastore).
  • Amazon Athena — serverless SQL query engine:
    • Query S3 data directly using SQL. No infrastructure to manage.
    • Pay per query ($5/TB scanned). Use Parquet + partitioning to minimize cost.
    • Integrates with Glue Data Catalog for table definitions.
  • AWS Lake Formation — governance:
    • Centralized permissions for the data lake (row-level, column-level, cell-level security).
    • Simplifies data sharing across accounts.
    • Data lineage and audit trails.
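
The zone layout depends on writing objects under Hive-style `key=value` partition folders so that Athena and Glue can prune by date. A minimal sketch of building such a prefix follows; the function name `partition_prefix` is hypothetical and the zone and dataset names are just examples.

```python
from datetime import date


def partition_prefix(zone: str, dataset: str, day: date) -> str:
    """Build a Hive-style partition prefix such as .../year=2026/month=05/day=30/."""
    return (
        f"{zone}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


if __name__ == "__main__":
    print(partition_prefix("processed", "orders", date(2026, 5, 30)))
```

Zero-padding the month and day keeps the prefixes lexicographically sorted and matches string filters such as `month='05'`.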
# ── Data Lake S3 structure ──
# s3://my-data-lake/
# ├── raw/                    # Landing zone (original data)
# │   ├── orders/2026/05/30/  # Partitioned by date
# │   │   └── orders.json
# │   └── clickstream/2026/05/30/
# │       └── events.json
# ├── processed/              # Cleaned, Parquet format
# │   └── orders/year=2026/month=05/day=30/
# │       └── part-00000.snappy.parquet
# └── curated/                # Analytics-ready
#     └── daily_revenue/year=2026/month=05/
#         └── revenue.parquet

# ── Glue Crawler: Auto-discover schema ──
aws glue create-crawler \
    --name orders-crawler \
    --role GlueServiceRole \
    --database-name datalake \
    --targets '{"S3Targets":[{"Path":"s3://my-data-lake/processed/orders/"}]}'

aws glue start-crawler --name orders-crawler
# Creates table "orders" in Glue Data Catalog with schema

# ── Athena: Query S3 data with SQL ──
aws athena start-query-execution \
    --query-string "SELECT product, SUM(amount) as revenue
                    FROM datalake.orders
                    WHERE year='2026' AND month='05'
                    GROUP BY product
                    ORDER BY revenue DESC
                    LIMIT 10" \
    --result-configuration OutputLocation=s3://athena-results/

# ── Glue ETL Job (PySpark) ──
# import sys
# from awsglue.transforms import *
# from awsglue.context import GlueContext
# from pyspark.context import SparkContext
#
# sc = SparkContext()
# glueContext = GlueContext(sc)
#
# # Read from raw zone
# raw_df = glueContext.create_dynamic_frame.from_catalog(
#     database="datalake", table_name="raw_orders"
# )
#
# # Transform: clean, filter, flatten
# cleaned = raw_df.filter(lambda x: x["amount"] > 0)
#
# # Write to processed zone in Parquet (partitioned)
# glueContext.write_dynamic_frame.from_options(
#     frame=cleaned,
#     connection_type="s3",
#     connection_options={
#         "path": "s3://my-data-lake/processed/orders/",
#         "partitionKeys": ["year", "month", "day"]
#     },
#     format="parquet"
# )

# ── Lake Formation: Grant permissions ──
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::123:role/AnalystRole \
    --resource '{"Table":{"DatabaseName":"datalake","Name":"orders"}}' \
    --permissions SELECT

# ── Cost optimization ──
# Raw JSON: Athena scans 100 GB → $0.50/query
# Parquet + Snappy: scans 10 GB → $0.05/query (90% cheaper!)
# Add partitioning: scans 1 GB → $0.005/query (99% cheaper!)
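
The cost figures above follow directly from the per-terabyte price. The sketch below reproduces the arithmetic, assuming $5 per TB scanned (pricing varies by Region), decimal units to match the round numbers in the text, and a 10 MB minimum charge per query. The function name `athena_query_cost` is hypothetical.

```python
PRICE_PER_TB_USD = 5.0          # assumption: the $5/TB figure used in the text
BYTES_PER_TB = 10 ** 12         # assumption: decimal TB, to match the round numbers
MIN_BILLED_BYTES = 10 * 10 ** 6 # queries are billed for at least 10 MB


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD of one query that scans `bytes_scanned` bytes."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    GB = 10 ** 9
    for label, scanned in [
        ("raw JSON", 100 * GB),
        ("Parquet + Snappy", 10 * GB),
        ("Parquet + partitioning", 1 * GB),
    ]:
        print(f"{label:<24} ${athena_query_cost(scanned):.4f} per query")
```

Because cost is linear in bytes scanned, anything that shrinks the scan (columnar format, compression, partition pruning) reduces the bill by the same factor.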

A retail company had data in 15 siloed databases. They built a data lake: Kinesis Firehose streamed clickstream data to S3 raw zone, Glue ETL jobs transformed and converted to Parquet in the processed zone, Glue Crawlers kept the Data Catalog updated, Athena powered a QuickSight dashboard for business analytics. Lake Formation enforced column-level security — marketing could see purchase data but not PII. Athena query costs dropped 95% by switching from CSV to partitioned Parquet.

Store data in S3 using Parquet format with Snappy compression — reduces query costs by 90%+ compared to CSV/JSON. Partition data by common query dimensions (date, region). Use Glue Crawlers to keep the Data Catalog current. Use Lake Formation for fine-grained access control (row/column level). Organize in zones: raw → processed → curated. Athena is serverless and costs $5/TB — minimize scans with partitioning.
⚠️ Common Mistake
// ❌ Storing data as unpartitioned CSV in S3
// All order data in one file: s3://lake/orders.csv (100 GB)
// Athena query: SELECT * WHERE date = '2026-05-30'
// Scans ALL 100 GB to find 1 day of data → $0.50/query
// 100 analysts × 50 queries/day = $2,500/day in Athena costs
// CSV is row-based → entire row scanned even for 1 column

// ✅ Parquet + partitioning + compression
// s3://lake/orders/year=2026/month=05/day=30/data.snappy.parquet
// Athena: WHERE year='2026' AND month='05' AND day='30'
// Scans only 100 MB (partition pruning + columnar) → $0.0005/query
// 100 analysts × 50 queries/day = $2.50/day (1000x cheaper!)
// Parquet: columnar → only reads columns in SELECT
🔁 Follow-Up Question

What is the difference between Athena, Redshift, and EMR for data analytics? When do you use each?

35 What are the AWS migration strategies? Explain the 7 Rs and AWS migration services. experienced

The 7 Rs of migration define strategies for moving workloads to AWS:

  • 1. Retire: decommission applications that are no longer needed. Saves cost immediately. Typically 10-20% of a portfolio.
  • 2. Retain: keep on-premises for now. Not ready to migrate (compliance, dependency, technical debt). Revisit later.
  • 3. Rehost (Lift & Shift): move as-is to AWS (EC2). No code changes. Fastest migration. Use AWS Application Migration Service (MGN) for automated rehosting. Good for quick wins and getting out of data center leases.
  • 4. Relocate: move to AWS with minimal changes (e.g., VMware Cloud on AWS, container lift-and-shift to ECS).
  • 5. Replatform (Lift & Optimize): make minimal changes for cloud benefits. Examples: database → RDS, app server → Elastic Beanstalk, Windows → Linux. Low effort, meaningful improvement.
  • 6. Repurchase (Drop & Shop): switch to a SaaS alternative. On-prem CRM → Salesforce, on-prem email → Microsoft 365, on-prem HR → Workday.
  • 7. Refactor/Re-architect: redesign the application to be cloud-native. Monolith → microservices, serverless, containers. Most effort but most benefit. Use for strategic applications that need to scale.
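
The seven strategies amount to a triage over each application's profile. The sketch below encodes one possible ordering of those checks. The attribute names and the priority order are assumptions made for illustration; a real assessment weighs many more factors.

```python
from dataclasses import dataclass


@dataclass
class AppProfile:
    name: str
    in_use: bool = True
    blocked_by_compliance: bool = False   # cannot move yet
    saas_alternative: bool = False        # a SaaS product can replace it
    strategic: bool = False               # worth redesigning as cloud-native
    managed_service_fit: bool = False     # e.g. a database that maps to RDS
    runs_on_vmware: bool = False


def choose_strategy(app: AppProfile) -> str:
    """Pick one of the 7 Rs; the first matching rule wins."""
    if not app.in_use:
        return "Retire"
    if app.blocked_by_compliance:
        return "Retain"
    if app.saas_alternative:
        return "Repurchase"
    if app.strategic:
        return "Refactor"
    if app.managed_service_fit:
        return "Replatform"
    if app.runs_on_vmware:
        return "Relocate"
    return "Rehost"


if __name__ == "__main__":
    portfolio = [
        AppProfile("legacy-reporting", in_use=False),
        AppProfile("crm", saas_alternative=True),
        AppProfile("orders-db", managed_service_fit=True),
        AppProfile("intranet"),
    ]
    for app in portfolio:
        print(f"{app.name:<18} {choose_strategy(app)}")
```

Rehost is the fallback here on purpose: it is the cheapest path for workloads with no special characteristics.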

Migration services:

  • AWS MGN (Application Migration Service): automated lift-and-shift. Continuous replication → cutover with minimal downtime.
  • AWS DMS (Database Migration Service): migrate databases to RDS/Aurora/DynamoDB. Supports heterogeneous migration (Oracle → PostgreSQL) using SCT (Schema Conversion Tool).
  • AWS Migration Hub: central dashboard to track migration progress across tools.
# ── AWS DMS: Migrate Oracle to Aurora PostgreSQL ──
# 1. Create replication instance
aws dms create-replication-instance \
    --replication-instance-identifier oracle-to-aurora \
    --replication-instance-class dms.r5.large \
    --allocated-storage 100

# 2. Create source endpoint (Oracle)
aws dms create-endpoint \
    --endpoint-identifier oracle-source \
    --endpoint-type source \
    --engine-name oracle \
    --server-name oracle.onprem.company.com \
    --port 1521 \
    --username dms_user \
    --password "****" \
    --database-name ORCL

# 3. Create target endpoint (Aurora PostgreSQL)
aws dms create-endpoint \
    --endpoint-identifier aurora-target \
    --endpoint-type target \
    --engine-name aurora-postgresql \
    --server-name myaurora.cluster-xxx.rds.amazonaws.com \
    --port 5432 \
    --username dms_user \
    --password "****" \
    --database-name appdb

# 4. Create migration task (full load + CDC)
aws dms create-replication-task \
    --replication-task-identifier full-migration \
    --source-endpoint-arn arn:aws:dms:...:endpoint:oracle-source \
    --target-endpoint-arn arn:aws:dms:...:endpoint:aurora-target \
    --replication-instance-arn arn:aws:dms:...:rep:oracle-to-aurora \
    --migration-type full-load-and-cdc \
    --table-mappings file://table-mappings.json

# ── 7 Rs decision matrix ──
# Application Profile            | Strategy     | AWS Tool
# Legacy, unused                 | Retire       | N/A
# Not ready / compliance hold    | Retain       | N/A
# Standard VM workload           | Rehost       | MGN (lift-and-shift)
# VMware workloads               | Relocate     | VMware Cloud on AWS
# Database to managed            | Replatform   | DMS + RDS/Aurora
# COTS software available as SaaS| Repurchase   | SaaS vendor
# Strategic app, needs scale     | Refactor     | Containers, Lambda, etc.

# ── MGN: Application Migration Service ──
# 1. Install replication agent on source server
# 2. Agent replicates to AWS (continuous block-level replication)
# 3. Test: launch test instance from replicated data
# 4. Cutover: launch final instance, update DNS
# Minimal downtime: only the final cutover (minutes)

# ── Migration Hub: Track progress ──
# The CLI namespace is `mgh`, and --next-update-seconds is required
aws mgh notify-migration-task-state \
    --progress-update-stream my-migration \
    --migration-task-name "Oracle DB Migration" \
    --task Status=IN_PROGRESS \
    --update-date-time 2026-05-30T12:00:00Z \
    --next-update-seconds 300
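
The DMS replication task above reads `table-mappings.json`, which the walkthrough does not show. As a rough sketch, a file that includes every table in the chosen schemas could be generated like this. The schema name `APP` and the helper names are hypothetical; the rule fields follow the DMS selection-rule format.

```python
import json
from typing import Iterable


def selection_rule(rule_id: int, schema: str, table: str = "%") -> dict:
    """One DMS selection rule that includes matching tables ('%' is a wildcard)."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": f"include-{schema}-{rule_id}",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
    }


def table_mappings(schemas: Iterable[str]) -> str:
    """Return the JSON document for a table-mappings file."""
    rules = [selection_rule(i, schema) for i, schema in enumerate(schemas, start=1)]
    return json.dumps({"rules": rules}, indent=2)


if __name__ == "__main__":
    # "APP" is a placeholder schema name.
    print(table_mappings(["APP"]))
```

Writing the output to `table-mappings.json` gives the file that `--table-mappings file://table-mappings.json` expects; transformation rules (for example, lower-casing Oracle identifiers for PostgreSQL) would go in the same `rules` array.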

A company migrated 200 applications from their data center to AWS over 18 months. Assessment phase: 30 applications retired (unused), 20 repurchased (moved to SaaS), 100 rehosted with MGN (2-4 weeks each), 30 replatformed (databases to RDS, middleware to managed services), 20 refactored to serverless/containers (strategic apps). DMS migrated 15 databases including a critical Oracle-to-Aurora PostgreSQL migration with zero downtime using Change Data Capture (CDC). Total data center cost savings: $2.5M/year.

Start with assessment — categorize every application into one of the 7 Rs. Rehost first for quick wins (get out of the data center). Replatform databases to RDS/Aurora for immediate operational benefits. Refactor only strategic applications where cloud-native will provide significant business value. Use DMS for database migration with CDC for near-zero downtime. Track everything in Migration Hub.
⚠️ Common Mistake
// ❌ Trying to refactor everything at once
// "Let's rewrite all 200 apps as microservices on Kubernetes"
// 18 months later: 3 apps refactored, 197 still on-premises
// Data center lease expired: emergency lift-and-shift
// Budget exhausted on 3 apps that didn't need refactoring
// Most apps worked fine as VMs and didn't need Kubernetes

// ✅ Right strategy for each application
// Retire: 30 apps (15%), immediate cost savings
// Rehost: 100 apps (50%), MGN, done in 3 months
// Replatform: 30 apps (15%), DB → RDS, done in 6 months
// Repurchase: 20 apps (10%), SaaS replacements
// Refactor: 20 apps (10%), strategic only, parallel track
// Data center exit: 6 months (not 18)
// Refactoring continues after exit with no deadline pressure
🔁 Follow-Up Question

How does AWS Schema Conversion Tool (SCT) work for heterogeneous database migrations?

36 How do you optimize EC2 performance? Explain placement groups, enhanced networking, and instance store. performance

EC2 performance optimization involves networking, storage, and instance placement:

Placement Groups:

  • Cluster: instances packed close together inside a single AZ. Lowest network latency and highest per-flow throughput (a single flow can reach 10 Gbps, versus 5 Gbps outside the group). For HPC and tightly coupled workloads.
  • Spread: each instance on different hardware. Max 7 instances per AZ. For critical instances that must not fail together.
  • Partition: instances grouped into partitions on separate racks. For large distributed systems (Hadoop, Cassandra, Kafka). Up to 7 partitions per AZ.
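
The spread limit of 7 running instances per AZ per group has a practical consequence for capacity planning. The sketch below works out how many spread placement groups a fleet needs; the function name `spread_groups_needed` is hypothetical.

```python
SPREAD_MAX_PER_AZ = 7  # running instances per AZ in one spread placement group


def spread_groups_needed(instances: int, azs: int) -> int:
    """Number of spread placement groups needed for `instances` across `azs` AZs."""
    if instances < 0 or azs < 1:
        raise ValueError("instances must be >= 0 and azs must be >= 1")
    per_group = SPREAD_MAX_PER_AZ * azs
    # Integer ceiling division.
    return -(-instances // per_group)


if __name__ == "__main__":
    for count, zones in [(7, 1), (8, 1), (21, 3), (22, 3)]:
        print(f"{count} instances over {zones} AZ(s): "
              f"{spread_groups_needed(count, zones)} group(s)")
```

When the answer is more than one or two groups, a partition placement group is usually the better fit, since it isolates racks without a per-instance cap.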

Enhanced Networking:

  • ENA (Elastic Network Adapter): up to 200 Gbps network bandwidth. SR-IOV (hardware-level virtualization bypass). Lower latency, higher PPS (packets per second).
  • EFA (Elastic Fabric Adapter): OS-bypass networking for HPC and ML. Enables MPI and NCCL communication. For GPU clusters (P5, G5).
  • Most current-gen instances have ENA enabled by default.

Instance Store:

  • NVMe SSDs physically attached to the host. Ephemeral — data lost on stop/terminate.
  • Extremely fast: up to 7.5 million IOPS (i4i.metal), microsecond latency.
  • Use for: temp data, caches, buffers, scratch space, data replicated elsewhere.
  • Available on specific instance types (i4i, c5d, m5d, r5d).

Nitro System: custom hardware + lightweight hypervisor. Nearly all CPU/memory available to the instance. Better security (hardware root of trust) and performance.

# ── Create a Cluster Placement Group ──
aws ec2 create-placement-group \
    --group-name hpc-cluster \
    --strategy cluster

# Launch instances in the cluster
aws ec2 run-instances \
    --instance-type c7gn.16xlarge \
    --placement GroupName=hpc-cluster \
    --count 8 \
    --image-id ami-abc123

# ── Verify Enhanced Networking (ENA) ──
aws ec2 describe-instances \
    --instance-ids i-abc123 \
    --query "Reservations[].Instances[].EnaSupport"
# Output: true

# ── Instance Store: Check available NVMe drives ──
# lsblk   (on the instance)
# NAME    SIZE  TYPE
# nvme0n1 1.9T  disk    ← Instance store (ephemeral!)
# nvme1n1 1.9T  disk    ← Instance store
# nvme2n1 500G  disk    ← EBS root volume

# ── Instance Store: Format and mount ──
# mkfs.xfs /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/scratch
# Warning: data lost on stop/terminate!

# ── CloudFormation: Cluster Placement Group ──
# Resources:
#   HPCPlacementGroup:
#     Type: AWS::EC2::PlacementGroup
#     Properties:
#       Strategy: cluster
#
#   HPCInstance:
#     Type: AWS::EC2::Instance
#     Properties:
#       InstanceType: c7gn.16xlarge
#       ImageId: ami-abc123
#       PlacementGroupName: !Ref HPCPlacementGroup
#       NetworkInterfaces:
#         - DeviceIndex: "0"
#           SubnetId: !Ref HPCSubnet
#           GroupSet: [!Ref HPCSG]
#       # EFA: set InterfaceType: efa on the network interface of an
#       # AWS::EC2::LaunchTemplate and launch the instance from that template

# ── Performance comparison ──
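# Networking (rough, illustrative figures; actual numbers depend on instance type):
# Standard:  up to 25 Gbps, ~100μs latency
# ENA:       up to 200 Gbps, ~25μs latency
# EFA:       up to 400 Gbps, ~5μs latency (OS bypass)
# Cluster PG: lowest inter-instance latency (instances placed close together in one AZ)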
#
# Storage:
# gp3 EBS:      16,000 IOPS, ~1ms latency
# io2 EBS:      256,000 IOPS, ~sub-ms latency
# Instance Store: 7.5M IOPS (i4i.metal), ~μs latency

A genomics company ran HPC workloads processing DNA sequences. Initial setup: 16 × c5.4xlarge in random placement → inter-node MPI communication took 200μs, job completed in 8 hours. After optimization: 16 × c7gn.16xlarge in a cluster placement group with EFA enabled → inter-node latency dropped to 5μs, job completed in 2.5 hours. They used instance store NVMe for scratch data (4x faster than EBS) and EBS only for final results that needed persistence.

Use cluster placement groups + EFA for HPC and ML training — the latency reduction can cut job times by 60%+. Use instance store for temporary high-IOPS data (caches, scratch space) — but always replicate important data elsewhere. Use spread placement groups for critical instances that need hardware-level isolation. Graviton (c7g, m7g) instances offer best price/performance for most workloads.
⚠️ Common Mistake
// ❌ Storing important data on instance store
// Application writes database to /mnt/scratch (instance store)
// Instance stopped for maintenance → ALL DATA LOST
// Instance store is EPHEMERAL: not backed up, no snapshots
// "But it was so fast!" Yes, and also gone forever

// ✅ Instance store for temp/cache, EBS for persistent data
// /mnt/scratch (instance store) → temp files, caches, buffers
// /data (EBS gp3) → database, application data
// Replicate instance store data to S3 periodically
// Or use instance store as a local cache layer in front of EBS
// Design for instance store data being disposable
🔁 Follow-Up Question

What is the difference between ENA and EFA? When would you use EFA over ENA?

37 How do you optimize S3 performance? Explain multipart upload, Transfer Acceleration, and request rate. performance

S3 is designed for high throughput, but optimization is still important for large-scale workloads:

Request Rate Performance:

  • S3 supports 5,500 GET/HEAD and 3,500 PUT/DELETE requests per second per prefix.
  • A prefix is the leading part of the object key, up to a delimiter: in s3://bucket/prefix1/object.jpg the key is prefix1/object.jpg and the prefix is prefix1/.
  • To scale beyond these limits, distribute objects across prefixes.
  • S3 automatically partitions prefixes that receive high request rates (no manual intervention needed since 2018).
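
One way to spread objects across prefixes is to derive a short prefix from a hash of the key. The sketch below shows that idea along with a quick estimate of how many prefixes a target read rate implies. The function names are hypothetical, and the 4-character prefix length is an arbitrary choice.

```python
import hashlib

GET_LIMIT_PER_PREFIX = 5500  # GET/HEAD requests per second per prefix


def hashed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash prefix to an object key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


def prefixes_needed(target_gets_per_second: int,
                    per_prefix_limit: int = GET_LIMIT_PER_PREFIX) -> int:
    """Minimum number of prefixes to sustain the target GET rate."""
    return -(-target_gets_per_second // per_prefix_limit)  # ceiling division


if __name__ == "__main__":
    for name in ("images/img001.jpg", "images/img002.jpg"):
        print(hashed_key(name))
    print("prefixes for 20,000 GET/s:", prefixes_needed(20_000))
```

The trade-off is that hashed prefixes make listing objects in natural order harder, so date-based prefixes are often preferred when the access pattern is already spread over time.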

Multipart Upload:

  • Upload large objects in parallel parts. Each part is uploaded independently.
  • Mandatory for objects > 5 GB. Recommended for objects > 100 MB.
  • Benefits: parallel uploads (faster), retry individual parts (resilient), pause/resume.
  • Part size: 5 MB to 5 GB. Maximum 10,000 parts.
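
The part-size and part-count limits interact: a part size that works for a 5 GB file can exceed 10,000 parts for a much larger one. The sketch below picks a part size that respects both limits. The function name `plan_parts` and the 25 MiB default are assumptions for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

MIN_PART = 5 * MIB      # smallest allowed part (the last part may be smaller)
MAX_PART = 5 * GIB      # largest allowed part
MAX_PARTS = 10_000      # maximum number of parts per upload


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def plan_parts(object_size: int, preferred_part: int = 25 * MIB):
    """Return (part_size, part_count) satisfying the multipart limits."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size if the preferred size would need more than 10,000 parts.
    part_size = max(preferred_part, MIN_PART, ceil_div(object_size, MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object is too large for a single multipart upload")
    return part_size, ceil_div(object_size, part_size)


if __name__ == "__main__":
    for size in (5 * GIB, 1024 * GIB):
        part, count = plan_parts(size)
        print(f"{size / GIB:.0f} GiB -> {count} parts of {part / MIB:.1f} MiB")
```

High-level SDK transfer helpers make a similar adjustment automatically; the calculation matters mainly when driving the multipart API by hand.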

S3 Transfer Acceleration:

  • Uses CloudFront Edge Locations to accelerate uploads from distant locations.
  • Client uploads to the nearest Edge Location → AWS backbone → S3 bucket.
  • 50-500% improvement for long-distance uploads (e.g., Asia → us-east-1).
  • Adds $0.04-0.08/GB cost. Use the speed comparison tool to verify benefit.

S3 Select / Glacier Select:

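  • Filter data server-side using SQL. Only transfer the rows/columns you need.
  • Transfers far less data than downloading whole objects; AWS cites up to 400% faster queries. Note that S3 Select is no longer offered to new customers (existing users can keep using it); Athena covers the same use case.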

Byte-Range Fetches: download specific byte ranges of an object in parallel. Useful for large files where you need only a portion.
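
Byte-range fetches work by sending a standard HTTP `Range` header with each GET. A minimal sketch of splitting an object into ranges for parallel download follows; the function name `byte_ranges` is hypothetical, and the ranges are inclusive on both ends, as the header format requires.

```python
from typing import List


def byte_ranges(object_size: int, chunk_size: int) -> List[str]:
    """Split [0, object_size) into Range header values of at most chunk_size bytes."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + chunk_size, object_size) - 1  # inclusive end offset
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


if __name__ == "__main__":
    # A 10-byte object fetched in 4-byte chunks.
    print(byte_ranges(10, 4))
```

Each string can be passed as the range of a separate GET request, with the responses written to the matching offsets of the output file.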

# ── Multipart Upload (AWS CLI does this automatically for large files) ──
aws s3 cp large-file.zip s3://my-bucket/
# --expected-size is only needed when streaming a large upload from stdin

# ── Manual multipart with boto3 (for custom control) ──
# import boto3
# from boto3.s3.transfer import TransferConfig
#
# s3 = boto3.client("s3")
# config = TransferConfig(
#     multipart_threshold=100 * 1024 * 1024,  # 100 MB
#     multipart_chunksize=25 * 1024 * 1024,   # 25 MB per part
#     max_concurrency=10                       # 10 parallel uploads
# )
# s3.upload_file("large-file.zip", "my-bucket", "large-file.zip", Config=config)

# ── Enable Transfer Acceleration ──
aws s3api put-bucket-accelerate-configuration \
    --bucket my-bucket \
    --accelerate-configuration Status=Enabled

# Upload using accelerated endpoint
aws s3 cp large-file.zip s3://my-bucket/ \
    --endpoint-url https://s3-accelerate.amazonaws.com

# ── S3 Select: Query CSV without downloading ──
# CSV fields are strings, so cast before a numeric comparison.
# The final argument is the local file that receives the results.
aws s3api select-object-content \
    --bucket my-bucket \
    --key data/sales.csv \
    --expression "SELECT s.product, s.amount FROM s3object s WHERE CAST(s.amount AS FLOAT) > 1000" \
    --expression-type SQL \
    --input-serialization '{"CSV":{"FileHeaderInfo":"USE"}}' \
    --output-serialization '{"JSON":{}}' \
    output.json

# ── Python: S3 Select ──
# response = s3.select_object_content(
#     Bucket="my-bucket",
#     Key="data/sales.csv",
#     Expression="SELECT * FROM s3object s WHERE s.region = 'US'",
#     ExpressionType="SQL",
#     InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
#     OutputSerialization={"JSON": {}}
# )
# for event in response["Payload"]:
#     if "Records" in event:
#         print(event["Records"]["Payload"].decode())

# ── Prefix distribution for high request rates ──
# Bad: all objects under one prefix
# s3://bucket/images/img001.jpg  → 1 prefix, limited to 5,500 GET/s
# s3://bucket/images/img002.jpg
#
# Good: distribute across prefixes (use hash or date)
# s3://bucket/a1b2/images/img001.jpg  → prefix "a1b2"
# s3://bucket/c3d4/images/img002.jpg  → prefix "c3d4"
# Or: s3://bucket/2026/05/30/img001.jpg

# ── Performance tuning checklist ──
# ✅ Objects > 100 MB: use multipart upload
# ✅ Cross-continent uploads: enable Transfer Acceleration
# ✅ Need only subset of data: use S3 Select
# ✅ > 5,500 GET/s: distribute across multiple prefixes
# ✅ Large downloads: use byte-range fetches (parallel)
# ✅ Analytics: use Parquet (columnar) instead of CSV/JSON

A media company ingested 10TB of video files daily from studios in Asia to an S3 bucket in us-east-1. Upload speeds: 50 Mbps (bottlenecked by internet path). After enabling Transfer Acceleration, speeds jumped to 300 Mbps (6x improvement) because data went to the nearest Edge Location in Tokyo and then traveled over the AWS backbone. Combined with multipart upload (25 MB chunks, 20 concurrent), total ingest time dropped from 18 hours to 3 hours.

Use multipart upload for any file > 100 MB — it's parallel, resumable, and faster. Enable Transfer Acceleration for cross-continent uploads — measure the benefit with the speed comparison tool. Use S3 Select to filter data server-side when you need only a subset. For high request rates, S3 auto-scales but distributing across prefixes helps. Use Parquet for analytics data — S3 Select and Athena can read only the columns needed.
⚠️ Common Mistake
// ❌ Single-threaded upload of 5 GB file
// s3.put_object(Body=open("5gb-file.zip", "rb"), ...)
// Single HTTP stream → 15 minutes to upload
// Network glitch at 4.8 GB → start over from scratch
// put_object has 5 GB limit anyway → fails for larger files

// ✅ Multipart upload with parallel parts
// config = TransferConfig(
//     multipart_threshold=100 * 1024 * 1024,
//     multipart_chunksize=25 * 1024 * 1024,
//     max_concurrency=10
// )
// s3.upload_file("5gb-file.zip", bucket, key, Config=config)
// 10 parallel streams → 3 minutes to upload
// Part 15 fails → retry only part 15 (not the whole file)
// Works for files up to 5 TB
🔁 Follow-Up Question

What is S3 Express One Zone and how does it provide single-digit millisecond latency for S3 access?

38 How do you optimize DynamoDB performance? Explain hot keys, partition strategies, DAX, and adaptive capacity. performance

DynamoDB performance depends on partition key design and access patterns:

Hot Partitions:

  • Each partition handles up to 3,000 RCUs and 1,000 WCUs.
  • If a single partition key gets disproportionate traffic, that partition becomes "hot" → throttling even with available capacity on other partitions.
  • Common causes: using date, status, or country as PK (low cardinality), popular items (viral product, trending post).
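
The per-partition limits are expressed in capacity units, so item size determines how many requests per second a single key can actually absorb. The sketch below applies the standard unit definitions (one RCU per 4 KB strongly consistent read, half that for eventually consistent, one WCU per 1 KB written). The function names are hypothetical.

```python
import math

PARTITION_RCU_LIMIT = 3000
PARTITION_WCU_LIMIT = 1000


def read_capacity_units(item_bytes: int, strongly_consistent: bool = True) -> float:
    """RCUs consumed by reading one item of the given size."""
    units = max(1, math.ceil(item_bytes / 4096))
    return units if strongly_consistent else units / 2


def write_capacity_units(item_bytes: int) -> int:
    """WCUs consumed by writing one item of the given size."""
    return max(1, math.ceil(item_bytes / 1024))


if __name__ == "__main__":
    size = 6 * 1024  # a 6 KB item
    rcu = read_capacity_units(size)
    wcu = write_capacity_units(size)
    print(f"6 KB item: {rcu} RCU per read, {wcu} WCU per write")
    print(f"one partition: ~{PARTITION_RCU_LIMIT // rcu} reads/s, "
          f"~{PARTITION_WCU_LIMIT // wcu} writes/s")
```

Larger items lower the request rate at which a single partition key starts to throttle, which is one reason hot keys show up sooner than expected.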

Solutions for hot keys:

  • Write sharding: append a random suffix to the partition key (e.g., "product#123#shard3"). Distributes writes across multiple partitions. Requires scatter-gather reads.
  • Composite keys: use a high-cardinality PK (userId, orderId) instead of low-cardinality (status, date).
  • Adaptive Capacity: DynamoDB automatically redistributes throughput to hot partitions (enabled by default). Helps but doesn't solve extreme cases.
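
A random suffix spreads writes but forces every read to query all shards. An alternative is a calculated suffix: derive the shard from a stable attribute so that a lookup for a known item goes straight to one shard. The sketch below shows the key derivation only; the function names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
from typing import List

SHARD_COUNT = 10


def shard_for(item_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map an item id to a shard number deterministically."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def sharded_pk(base_key: str, item_id: str, shard_count: int = SHARD_COUNT) -> str:
    """Partition key for writing or reading one specific item."""
    return f"{base_key}#{shard_for(item_id, shard_count)}"


def all_shard_keys(base_key: str, shard_count: int = SHARD_COUNT) -> List[str]:
    """Every partition key to query when the full set is needed (scatter-gather)."""
    return [f"{base_key}#{n}" for n in range(shard_count)]


if __name__ == "__main__":
    print(sharded_pk("trending_products", "P001"))
    print(all_shard_keys("trending_products"))
```

Changing `SHARD_COUNT` later remaps existing items, so the count is best chosen with headroom and treated as fixed.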

DAX (DynamoDB Accelerator):

  • Fully managed, in-memory write-through cache for DynamoDB.
  • Microsecond latency for reads (vs milliseconds for DynamoDB).
  • API-compatible: swap the DynamoDB client for the DAX client and point it at the cluster endpoint; the read and write calls stay the same.
  • Best for read-heavy workloads with repeated access to the same items.

Other optimizations:

  • Projection: only read the attributes you need (ProjectionExpression).
  • BatchGetItem: read up to 100 items in a single API call (16 MB max).
  • Parallel Scan: divide a full table scan across multiple threads/segments.
  • TTL: automatically delete expired items — free, no WCU cost.
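
Because BatchGetItem accepts at most 100 keys per call, a longer key list has to be split first. A minimal sketch of that chunking step follows; the function name `batches` is hypothetical, and the keys use the low-level attribute-value format.

```python
from typing import Iterator, List, Sequence

BATCH_GET_LIMIT = 100  # maximum keys per BatchGetItem call


def batches(keys: Sequence[dict], size: int = BATCH_GET_LIMIT) -> Iterator[List[dict]]:
    """Yield consecutive slices of `keys`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(keys), size):
        yield list(keys[start:start + size])


if __name__ == "__main__":
    keys = [{"productId": {"S": f"P{i:03d}"}} for i in range(250)]
    print([len(batch) for batch in batches(keys)])
```

Each batch becomes one request; any keys returned under `UnprocessedKeys` in the response should be retried, ideally with backoff.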
# ── Write Sharding: Distribute hot keys ──
# Problem: PK = "trending_products" gets 10,000 WCU/s
# Solution: shard across N keys

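# import random
# import boto3
# from boto3.dynamodb.conditions import Key
#
# table = boto3.resource("dynamodb").Table("Trending")  # table name is a placeholder
# SHARD_COUNT = 10
#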
# def write_trending(product_data):
#     shard = random.randint(0, SHARD_COUNT - 1)
#     table.put_item(Item={
#         "PK": f"trending_products#{shard}",  # Distributed!
#         "SK": product_data["productId"],
#         "data": product_data
#     })
#
# def read_all_trending():
#     """Scatter-gather across all shards"""
#     items = []
#     for shard in range(SHARD_COUNT):
#         response = table.query(
#             KeyConditionExpression=Key("PK").eq(f"trending_products#{shard}")
#         )
#         items.extend(response["Items"])
#     return items

# ── DAX: In-memory cache ──
aws dax create-cluster \
    --cluster-name my-dax \
    --node-type dax.r5.large \
    --replication-factor 3 \
    --iam-role-arn arn:aws:iam::123:role/DAXRole \
    --subnet-group-name my-dax-subnets \
    --security-group-ids sg-dax

# Python: Switch from DynamoDB to DAX (minimal code change)
# import amazondax
#
# # Before (DynamoDB direct):
# # dynamodb = boto3.resource("dynamodb")
#
# # After (DAX, API compatible):
# # "dax://" on port 8111 is the unencrypted endpoint; a cluster created with TLS
# # endpoint encryption uses "daxs://" on port 9111 instead.
# dax = amazondax.AmazonDaxClient.resource(endpoint_url="dax://my-dax.xxx.dax-clusters.us-east-1.amazonaws.com:8111")
# table = dax.Table("Products")
# response = table.get_item(Key={"productId": "P001"})
# # Microsecond response on a cache hit

# ── Efficient queries ──
# Only read attributes you need (save RCUs)
aws dynamodb query \
    --table-name Orders \
    --key-condition-expression "customerId = :cid" \
    --projection-expression "orderId, amount, #s" \
    --expression-attribute-names '{"#s": "status"}' \
    --expression-attribute-values '{":cid": {"S": "CUST-001"}}'

# ── BatchGetItem: Read up to 100 items at once ──
aws dynamodb batch-get-item --request-items '{
    "Products": {
        "Keys": [
            {"productId": {"S": "P001"}},
            {"productId": {"S": "P002"}},
            {"productId": {"S": "P003"}}
        ],
        "ProjectionExpression": "productId, productName, price"
    }
}'

# ── Enable TTL (auto-delete expired items) ──
aws dynamodb update-time-to-live \
    --table-name Sessions \
    --time-to-live-specification Enabled=true,AttributeName=expiresAt

# ── Performance metrics to monitor ──
# ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits
# ThrottledRequests (should be 0!)
# SuccessfulRequestLatency (P99 should be < 10ms)
# AccountProvisionedReadCapacityUtilization
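
The TTL attribute has to be a Number holding the expiry time as a Unix epoch timestamp in seconds. A minimal sketch of writing and checking such a value follows; the `sessionId` attribute, the one-day lifetime, and the helper names are assumptions for illustration, while `expiresAt` matches the attribute enabled above.

```python
import time
from typing import Optional

SESSION_TTL_SECONDS = 24 * 60 * 60  # illustrative: sessions expire after one day


def session_item(session_id: str, ttl_seconds: int = SESSION_TTL_SECONDS,
                 now: Optional[float] = None) -> dict:
    """Build a session item whose expiresAt is an epoch-seconds integer."""
    now = time.time() if now is None else now
    return {"sessionId": session_id, "expiresAt": int(now) + ttl_seconds}


def is_expired(item: dict, now: Optional[float] = None) -> bool:
    """TTL deletion is asynchronous, so readers should filter expired items themselves."""
    now = time.time() if now is None else now
    return item["expiresAt"] <= int(now)


item = session_item("s1", now=1_700_000_000)
print(item["expiresAt"])  # 1700086400
```

Expired items can stay readable for a while before DynamoDB removes them, which is why the read path checks `is_expired` rather than trusting the delete to have happened.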

A social media app used "trending" as a partition key for popular posts. During viral events, this single key received 50,000 reads/sec — far exceeding the per-partition limit. Throttling caused the trending feed to fail. The fix: write sharding with 20 shards (trending#0 through trending#19) distributed the load evenly. They also added DAX for the trending feed — cache hit rate was 99.5%, reducing DynamoDB RCU consumption by 200x and dropping read latency from 5ms to 50μs.
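
The 200x figure follows directly from the hit rate: only cache misses reach the table. A small sketch of that arithmetic, using the numbers from the scenario (the function name is hypothetical):

```python
def reads_reaching_table(reads_per_second: int, hit_rate_percent: float) -> float:
    """Reads per second that miss the cache and fall through to DynamoDB."""
    if not 0 <= hit_rate_percent <= 100:
        raise ValueError("hit_rate_percent must be between 0 and 100")
    return reads_per_second * (100 - hit_rate_percent) / 100


misses = reads_reaching_table(50_000, 99.5)
print(misses)           # 250.0 reads/sec still hit DynamoDB
print(50_000 / misses)  # 200.0, the 200x reduction
```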

Design partition keys for uniform distribution — avoid low-cardinality keys (date, status, country). Use write sharding for known hot keys. Use DAX for read-heavy workloads with repeated access patterns. Always use ProjectionExpression to read only needed attributes. Enable TTL for session/cache data — it's free (no WCU cost). Monitor ThrottledRequests and set alarms at > 0.
⚠️ Common Mistake
// ❌ Scanning entire table to find items
// response = table.scan(
//     FilterExpression=Attr("status").eq("active")
// )
// Scan reads EVERY item in the table (millions of items); the filter is applied after the read
// Consumes massive RCUs → throttles other operations
// 10M items × 1KB = 10 GB scan = 2,500,000 RCUs (strongly consistent; 1,250,000 if eventually consistent)
// Gets slower as table grows: O(n) operation

// ✅ Query with GSI: reads only matching items
// GSI: PK=status, SK=createdAt
// response = table.query(
//     IndexName="StatusIndex",
//     KeyConditionExpression=Key("status").eq("active")
// )
// Reads only "active" items (100 items, not 10M)
// Fast, efficient, doesn't grow with table size
// Cost: ~13 RCUs (100 KB read, eventually consistent as all GSI reads are) instead of millions
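
The RCU gap between the two approaches can be estimated from the data volume read. DynamoDB sums the sizes of the items a Scan or Query reads and rounds up to 4 KB blocks, with eventually consistent reads costing half. The sketch below is an approximation that ignores per-page rounding; the function name is hypothetical.

```python
import math

RCU_BLOCK_BYTES = 4 * 1024  # one RCU covers a strongly consistent read of up to 4 KB


def read_capacity_units(total_bytes: int, strongly_consistent: bool = False) -> int:
    """Approximate RCUs for a Scan or Query that reads `total_bytes` of item data."""
    blocks = math.ceil(total_bytes / RCU_BLOCK_BYTES)
    return blocks if strongly_consistent else math.ceil(blocks / 2)


scan_bytes = 10_000_000 * 1024   # 10M items of 1 KB each
query_bytes = 100 * 1024         # 100 matching items of 1 KB each

print(read_capacity_units(scan_bytes, strongly_consistent=True))  # 2500000
print(read_capacity_units(scan_bytes))                            # 1250000
print(read_capacity_units(query_bytes))                           # 13
```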
🔁 Follow-Up Question

What is DynamoDB Global Tables and how does it handle write conflicts in multi-region deployments?

39 How do you optimize Lambda performance? Explain cold starts, SnapStart, Provisioned Concurrency, and memory tuning. performance

Lambda performance optimization focuses on cold starts, memory/CPU allocation, and code efficiency:

Cold Start Deep Dive:

  • What happens: download code → start runtime → run init code → execute handler.
  • Duration by runtime: Python/Node (~200-500ms), .NET (~400-800ms), Java (~1-3 seconds).
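  • Factors: package size (larger = slower), runtime, number of dependencies, and the amount of init code. VPC attachment used to add seconds of ENI setup per cold start; since the 2019 networking change (shared ENIs created when the function is configured), it adds little.
  • When: first invocation, after scale-up to new concurrent environments, after a code or configuration update, and after an idle period (not a documented fixed value; commonly observed in the range of minutes).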
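
One way to observe cold starts from inside a function is a module-level flag: module scope runs once per execution environment, so the flag is only true on the first invocation. This is a minimal sketch, not an AWS-provided mechanism; the returned field names are illustrative.

```python
import time

_INIT_STARTED = time.monotonic()
# Expensive setup (SDK clients, config loading) would run here, once per environment.
_is_cold_start = True  # module scope survives between invocations in the same environment


def handler(event, context):
    """Report whether this invocation was the first one in its environment."""
    global _is_cold_start
    cold = _is_cold_start
    _is_cold_start = False
    return {
        "coldStart": cold,
        "secondsSinceInit": round(time.monotonic() - _INIT_STARTED, 3),
    }


# Local simulation: two calls in one process behave like one cold and one warm invocation.
first = handler({}, None)
second = handler({}, None)
print(first["coldStart"], second["coldStart"])  # True False
```

Logging the flag alongside latency makes it possible to separate cold-start P99 from warm P99 in metrics.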

Mitigation strategies:

  • Provisioned Concurrency: pre-warm N execution environments. No cold starts. Pay even when idle. Best for APIs with consistent latency requirements.
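  • SnapStart: snapshot of the initialized execution environment → restore from snapshot on cold start. Can cut a Java cold start from about 3s to a few hundred ms. Launched for Java and later extended to supported Python and .NET runtimes; no extra charge on Java, while Python and .NET incur snapshot caching and restore charges. Cannot be combined with Provisioned Concurrency on the same version.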
  • Smaller packages: remove unused dependencies. Use Layers for shared code. Use tree-shaking (Node/TypeScript).
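  • VPC only when needed: attach a function to a VPC only if it must reach private resources. A VPC function needs a NAT Gateway or VPC endpoints to reach AWS services; prefer VPC endpoints over NAT for that traffic.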
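
Sizing Provisioned Concurrency usually starts from the relationship concurrency = requests per second × average duration in seconds. The sketch below applies it and then adds headroom for a utilization target; the 70% target and the traffic numbers are assumptions for illustration.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Concurrent executions needed: request rate × average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def provisioned_for_target(requests_per_second: float, avg_duration_seconds: float,
                           target_utilization: float = 0.7) -> int:
    """Provisioned environments so that steady-state utilization sits at the target."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    needed = requests_per_second * avg_duration_seconds
    return math.ceil(needed / target_utilization)


print(required_concurrency(100, 0.25))    # 25
print(provisioned_for_target(100, 0.25))  # 36
```

Faster handlers need fewer warm environments for the same traffic, so memory tuning and Provisioned Concurrency cost are linked.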

Memory Tuning:

  • Lambda allocates CPU proportional to memory. 1,769 MB = 1 full vCPU.
  • CPU-bound functions benefit from more memory (even if they don't use the extra RAM).
  • Use AWS Lambda Power Tuning (open-source tool) to find the optimal memory setting — often the cheapest AND fastest option is higher memory.
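
The trade-off can be checked by hand: duration cost is memory in GB × duration in seconds × the per-GB-second price. The sketch below assumes an x86 price of $0.0000166667 per GB-second (verify against current pricing) and ignores the per-request fee; the durations are example values for a CPU-bound function.

```python
FULL_VCPU_MB = 1_769                # memory setting that maps to one full vCPU
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 duration price; check current pricing


def vcpu_share(memory_mb: int) -> float:
    """Approximate vCPU allocation for a memory setting."""
    return memory_mb / FULL_VCPU_MB


def duration_cost(memory_mb: int, duration_ms: float,
                  price: float = PRICE_PER_GB_SECOND) -> float:
    """Duration charge for one invocation, excluding the request fee."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price


for memory_mb, duration_ms in [(128, 3200), (1024, 420)]:
    share = round(vcpu_share(memory_mb), 2)
    print(memory_mb, share, f"{duration_cost(memory_mb, duration_ms):.7f}")
```

With these example durations the two settings cost almost the same per invocation, while the larger one finishes about 7.6 times sooner.
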
# ── Set Provisioned Concurrency ──
# First, publish a version (can't set PC on $LATEST)
aws lambda publish-version \
    --function-name ProcessOrder \
    --description "v1"

aws lambda put-provisioned-concurrency-config \
    --function-name ProcessOrder \
    --qualifier 1 \
    --provisioned-concurrent-executions 50

# ── Enable SnapStart (Java shown; also available on supported Python and .NET runtimes) ──
aws lambda update-function-configuration \
    --function-name JavaOrderProcessor \
    --snap-start ApplyOn=PublishedVersions

# Publish a version to trigger snapshot creation
aws lambda publish-version --function-name JavaOrderProcessor

# ── Auto Scaling for Provisioned Concurrency ──
aws application-autoscaling register-scalable-target \
    --service-namespace lambda \
    --resource-id function:ProcessOrder:prod \
    --scalable-dimension lambda:function:ProvisionedConcurrency \
    --min-capacity 10 --max-capacity 200

aws application-autoscaling put-scaling-policy \
    --service-namespace lambda \
    --resource-id function:ProcessOrder:prod \
    --scalable-dimension lambda:function:ProvisionedConcurrency \
    --policy-name LambdaPCAutoScaling \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 0.7,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
        }
    }'

# ── Lambda Power Tuning results (example) ──
# Cost = memory (GB) × duration (s) × $0.0000166667 per GB-second (x86 rate; request fee excluded)
# Memory (MB) | Duration (ms) | Cost ($)  | Notes
# 128         | 3,200         | 0.0000067 | CPU throttled, very slow
# 512         | 850           | 0.0000071 | Better, still CPU-bound
# 1024        | 420           | 0.0000070 | Sweet spot!
# 1769        | 250           | 0.0000072 | Full vCPU, diminishing returns
# 3008        | 240           | 0.0000118 | Minimal improvement, more expensive
# Optimal: 1024 MB, about 7.6x faster than 128 MB for roughly 5% more cost

# ── Cold start reduction tips ──
# 1. Keep deployment package small
#    zip -r function.zip handler.py  # Not the entire virtualenv
#    Use Layers for shared dependencies
#
# 2. Initialize outside the handler
#    import boto3
#    dynamodb = boto3.resource("dynamodb")  # Init once!
#    table = dynamodb.Table("Orders")       # Reused on warm starts
#
#    def handler(event, context):
#        return table.get_item(Key={"id": event["id"]})
#
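# 3. Avoid unnecessary imports
#    # "from boto3 import client" still loads the whole boto3 package, so it saves nothing
#    # Instead: drop unused libraries, and import rarely used heavy modules
#    # inside the code path that needs them
#
# 4. Use arm64 (Graviton): about 20% lower duration price, often faster
#    Architecture is set with update-function-code; the package must be arm64-compatible
aws lambda update-function-code \
    --function-name ProcessOrder \
    --zip-file fileb://function.zip \
    --architectures arm64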

A Java-based API had 3-second cold starts on Lambda, and P99 latency was 3.5 seconds. The team applied three optimizations: (1) SnapStart reduced cold start to 200ms at no extra charge, (2) Lambda Power Tuning found 1,536 MB as optimal, with duration dropping from 800ms to 200ms at the same cost, (3) switching to arm64 (Graviton2) lowered cost further (arm64 duration pricing is about 20% below x86). Final P99: 250ms. For the payment endpoint (zero cold start tolerance), they added Provisioned Concurrency with auto-scaling, and P99 dropped to 50ms.

Run Lambda Power Tuning to find optimal memory: more memory = more CPU, often much faster for about the same cost. Use SnapStart to cut cold starts (no extra charge for Java; for example from 3s to 200ms). Use Provisioned Concurrency only for endpoints that can't tolerate any cold start. Keep packages small and drop unused dependencies. Initialize SDK clients outside the handler. Use arm64 (Graviton) for about 20% lower duration pricing. Attach functions to a VPC only when they need private resources.
⚠️ Common Mistake
// ❌ Using minimum memory (128 MB) to "save money"
// Lambda at 128 MB: 1/14th of a vCPU
// CPU-bound function: 3,200ms duration
// Cost: 0.125 GB × 3.2s × $0.0000166667 = $0.0000067 per invocation
// Users experience 3-second API responses
// "Why is Lambda so slow?" It's CPU-starved!

// ✅ Right-size memory with Power Tuning
// Lambda at 1024 MB: ~0.58 vCPU
// Same function: 420ms duration (7.6x faster!)
// Cost: 1 GB × 0.42s × $0.0000166667 = $0.0000070 per invocation
// Nearly the SAME COST but 7.6x faster!
// More memory = more CPU = faster execution
// The best-value config is often NOT the lowest memory
🔁 Follow-Up Question

How does Lambda Graviton2 (arm64) performance compare to x86? What compatibility issues exist?

40 How do you optimize AWS network performance? Explain VPC endpoints, Global Accelerator, and data transfer costs. performance

Network performance optimization on AWS involves latency, throughput, and data transfer cost management:

VPC Endpoints (reduce latency + cost):

  • Gateway Endpoint: for S3 and DynamoDB. Routes traffic through AWS backbone instead of the internet/NAT. Free. No bandwidth limit.
  • Interface Endpoint (PrivateLink): for 100+ AWS services. Creates an ENI in each subnet (AZ) you select. About $0.01/hr per AZ + $0.01/GB processed. Private, and avoids the NAT Gateway hop.
  • Using VPC endpoints instead of NAT Gateway for AWS service access saves the NAT Gateway data processing fee ($0.045/GB).
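
Interface Endpoints carry an hourly fee, so they only beat NAT processing above a certain monthly volume. The sketch below estimates that break-even point from the prices quoted above, assuming the hourly fee applies per AZ, a 730-hour month, and that the NAT Gateway itself stays in place for other traffic (so its hourly fee is not saved). All names are illustrative.

```python
HOURS_PER_MONTH = 730
ENDPOINT_HOURLY_PER_AZ = 0.01  # $/hr per AZ (assumed per-AZ billing)
ENDPOINT_PER_GB = 0.01         # $/GB processed by the endpoint
NAT_PER_GB = 0.045             # $/GB processed by the NAT Gateway


def interface_endpoint_monthly(gb: float, az_count: int) -> float:
    """Monthly cost of one Interface Endpoint spread across `az_count` AZs."""
    return ENDPOINT_HOURLY_PER_AZ * az_count * HOURS_PER_MONTH + ENDPOINT_PER_GB * gb


def nat_processing_monthly(gb: float) -> float:
    """Monthly NAT Gateway data processing charge for the same traffic."""
    return NAT_PER_GB * gb


def break_even_gb(az_count: int) -> float:
    """Monthly GB above which the endpoint is cheaper than NAT processing."""
    fixed = ENDPOINT_HOURLY_PER_AZ * az_count * HOURS_PER_MONTH
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)


print(round(break_even_gb(2)))  # about 417 GB/month for two AZs
```

Gateway Endpoints for S3 and DynamoDB have no such threshold because they carry no hourly or per-GB fee.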

AWS Global Accelerator:

  • Provides 2 static anycast IPs that route traffic to the nearest AWS edge location → AWS backbone → your application.
  • Reduces internet hops → lower latency, more consistent performance.
  • Automatic failover between Regions in < 30 seconds.
  • Works with ALB, NLB, EC2, and Elastic IP endpoints.
  • $0.025/hr + $0.015-0.035/GB (premium data transfer).
  • vs CloudFront: CloudFront caches content. Global Accelerator optimizes the network path (no caching). Use GA for non-HTTP (TCP/UDP) or when caching isn't useful (dynamic APIs).

Data Transfer Costs (often the surprise on AWS bills):

  • Inbound: free (data into AWS).
  • Same AZ: free (using private IP).
  • Cross-AZ: $0.01/GB each way ($0.02/GB round trip).
  • Cross-Region: $0.02/GB (varies by Region pair).
  • Internet outbound: $0.09/GB (first 10 TB, then tiered). CloudFront is cheaper ($0.085/GB).
  • NAT Gateway processing: $0.045/GB (on top of data transfer!).
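
A rough monthly estimate can be built from these per-GB rates. The sketch below hard-codes the figures listed above as assumptions (actual rates vary by Region and volume tier) and uses hypothetical path names.

```python
RATES_PER_GB = {
    "inbound": 0.0,
    "same_az": 0.0,
    "cross_az_each_way": 0.01,
    "cross_region": 0.02,
    "internet_out": 0.09,
    "nat_processing": 0.045,
}


def monthly_transfer_cost(gb_by_path: dict) -> dict:
    """Return the cost per traffic path, rounded to cents."""
    unknown = set(gb_by_path) - set(RATES_PER_GB)
    if unknown:
        raise KeyError(f"unknown traffic path(s): {sorted(unknown)}")
    return {path: round(gb * RATES_PER_GB[path], 2) for path, gb in gb_by_path.items()}


# 10 TB of cross-AZ traffic is billed in both directions, hence 20,000 GB.
usage = {"cross_az_each_way": 20_000, "internet_out": 5_000, "nat_processing": 2_000}
breakdown = monthly_transfer_cost(usage)
total = round(sum(breakdown.values()), 2)
print(breakdown, total)
```
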
# ── Create Gateway Endpoint for S3 (FREE!) ──
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-abc123 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-private1 rtb-private2 \
    --vpc-endpoint-type Gateway

# ── Create Interface Endpoint for SQS ──
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-abc123 \
    --service-name com.amazonaws.us-east-1.sqs \
    --vpc-endpoint-type Interface \
    --subnet-ids subnet-private1 subnet-private2 \
    --security-group-ids sg-endpoint \
    --private-dns-enabled

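# ── Global Accelerator ──
# The Global Accelerator API is served from us-west-2, so pass --region us-west-2
aws globalaccelerator create-accelerator \
    --name my-app-accelerator \
    --ip-address-type IPV4 \
    --enabled \
    --region us-west-2

# Add listener
aws globalaccelerator create-listener \
    --accelerator-arn arn:aws:globalaccelerator::123:accelerator/abc \
    --port-ranges FromPort=443,ToPort=443 \
    --protocol TCP \
    --region us-west-2

# Add one endpoint group per Region (a group only holds endpoints from its own Region)
aws globalaccelerator create-endpoint-group \
    --listener-arn arn:aws:globalaccelerator::123:accelerator/abc/listener/def \
    --endpoint-group-region us-east-1 \
    --endpoint-configurations \
    EndpointId=arn:aws:elasticloadbalancing:us-east-1:123:loadbalancer/app/my-alb/xxx \
    --region us-west-2

aws globalaccelerator create-endpoint-group \
    --listener-arn arn:aws:globalaccelerator::123:accelerator/abc/listener/def \
    --endpoint-group-region eu-west-1 \
    --endpoint-configurations \
    EndpointId=arn:aws:elasticloadbalancing:eu-west-1:123:loadbalancer/app/my-alb-eu/xxx \
    --region us-west-2
# Clients are routed to the nearest healthy Region; use --traffic-dial-percentage
# on a group to shift traffic away from that Region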

# ── Data transfer cost optimization ──
# Scenario: Lambda in private subnet calls S3 via NAT Gateway
# Without VPC endpoint:
#   Lambda → NAT Gateway → S3 public endpoint
#   Cost: $0.045/GB NAT processing (same-Region S3 traffic has no internet egress charge)
#
# With S3 Gateway Endpoint:
#   Lambda → VPC Endpoint → S3 (AWS backbone)
#   Cost: $0.00/GB (free!)
#   Savings: 100% of the per-GB NAT processing fee

# ── Monitor data transfer ──
# End date is exclusive, so End=2026-06-01 covers all of May
aws ce get-cost-and-usage \
    --time-period Start=2026-05-01,End=2026-06-01 \
    --granularity MONTHLY \
    --filter '{"Dimensions":{"Key":"USAGE_TYPE","Values":["DataTransfer-Out-Bytes"]}}' \
    --metrics BlendedCost

# ── Architecture: Minimize cross-AZ data transfer ──
# Bad: Web server in AZ-a, Redis in AZ-b
#   Every cache call: $0.01/GB cross-AZ each way
#   10TB/month cache traffic = $200/month in data transfer alone
#
# Good: Co-locate in same AZ, or accept cross-AZ for HA
#   Use private IPs (not public IPs — public IPs go via IGW, cost more)

# ── Comparison: Global Accelerator vs CloudFront ──
# Feature        | Global Accelerator   | CloudFront
# Caching        | No                   | Yes
# Static IPs     | Yes (2 anycast)      | No
# Protocol       | TCP, UDP             | HTTP, HTTPS, WebSocket
# Use case       | Dynamic APIs, gaming | Static content, web apps
# Failover       | < 30 seconds         | Origin failover
# Cost           | $0.025/hr + data     | Data transfer + requests

A company's monthly AWS bill showed $15,000 in data transfer costs. Investigation found: (1) Lambda functions calling S3/DynamoDB through NAT Gateway — $8,000 in NAT processing fees. Adding S3 Gateway Endpoint (free) and DynamoDB Gateway Endpoint (free) eliminated the NAT fees. (2) Public IP communication between instances in the same VPC — $3,000. Switching to private IPs reduced this to $500 (cross-AZ only). (3) API traffic from Europe going through us-east-1 — added Global Accelerator for 30% latency improvement and better routing. Total savings: $10,500/month.

Always create Gateway Endpoints for S3 and DynamoDB — they're free and eliminate NAT Gateway fees. Use Interface Endpoints for other AWS services accessed frequently from private subnets. Use Global Accelerator for multi-region dynamic content (TCP/UDP), CloudFront for cacheable HTTP content. Monitor data transfer costs — they're often the hidden surprise on AWS bills. Use private IPs between instances, use same-AZ placement when HA isn't needed.
⚠️ Common Mistake
// ❌ AWS service calls going through NAT Gateway
// Lambda → NAT Gateway → S3 public endpoint
// 100 TB/month of S3 traffic through NAT:
// NAT processing: 100 TB × $0.045/GB = $4,500/month
// NAT hourly: $0.045/hr × 730 = $33/month
// Total: $4,533/month just to reach S3!
// NAT Gateway is for internet access, not AWS services

// ✅ VPC Gateway Endpoint for S3 (FREE)
// Lambda → VPC Endpoint → S3 (AWS backbone)
// Cost: $0/month (100 TB or 1 PB, still free!)
// Also faster: no NAT hop, direct backbone routing
// 5 minutes to set up, saves thousands per month
//
// Do the same for DynamoDB (also free Gateway Endpoint)
// Use Interface Endpoints for SQS, SNS, KMS, etc.
🔁 Follow-Up Question

How does AWS PrivateLink work for exposing your own services to other VPCs and accounts?

Written and reviewed by the FreeBytes Editorial Team · Last updated: June 2026