Cloud Architecture Patterns

Architecture is about trade-offs. Every pattern has a cost — in complexity, latency, or dollars. Choose patterns that match your actual reliability, scalability, and recovery requirements, not theoretical maximums.

High Availability (HA) Patterns

High availability design targets continuous operation by eliminating single points of failure. In cloud environments, HA is achieved through redundancy across Availability Zones — independent data centers within a region with separate power, cooling, and networking.

Active-Active Multi-AZ

All instances across all AZs simultaneously serve production traffic. A load balancer distributes requests and performs health checks. When an AZ becomes unavailable, the load balancer automatically stops routing to unhealthy targets — with no human intervention required.

# Terraform: Active-Active ALB across 3 AZs (AWS)
resource "aws_lb" "prod_alb" {
  name               = "prod-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = [
    aws_subnet.public_1a.id,
    aws_subnet.public_1b.id,
    aws_subnet.public_1c.id
  ]

  enable_deletion_protection = true
  enable_http2               = true
  idle_timeout               = 60

  access_logs {
    bucket  = aws_s3_bucket.logs.id
    prefix  = "alb"
    enabled = true
  }
}

resource "aws_lb_target_group" "app" {
  name        = "prod-app-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.prod.id
  target_type = "ip"

  health_check {
    enabled             = true
    path                = "/health"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }

  deregistration_delay = 30

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
    enabled         = false
  }
}

resource "aws_autoscaling_group" "app" {
  name                = "prod-app-asg"
  vpc_zone_identifier = [
    aws_subnet.private_1a.id,
    aws_subnet.private_1b.id,
    aws_subnet.private_1c.id
  ]
  min_size            = 3
  max_size            = 30
  desired_capacity    = 3
  health_check_type   = "ELB"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }
}

resource "aws_autoscaling_policy" "scale_out" {
  name                   = "scale-out"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

Health check design: Your /health endpoint must respond within the load balancer's timeout window and should check actual application health (database connectivity, cache, critical dependencies), not just return HTTP 200 unconditionally.
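
A minimal sketch of such an endpoint handler, in Python. The helper names (check_database, check_cache) are hypothetical placeholders for real dependency probes; the 4-second budget leaves headroom under the 5-second health check timeout configured above.

```python
# Hypothetical dependency-aware health handler. Each probe runs with a
# deadline so the endpoint always answers within the LB timeout.
import concurrent.futures

def check_database():
    return True   # placeholder: e.g. run "SELECT 1" against the primary

def check_cache():
    return True   # placeholder: e.g. PING against Redis

def health(timeout_seconds=4.0):
    """Return (status_code, per-dependency results) for /health."""
    checks = {"database": check_database, "cache": check_cache}
    results = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, fut in futures.items():
            try:
                results[name] = bool(fut.result(timeout=timeout_seconds))
            except Exception:
                results[name] = False   # slow or failing dependency -> unhealthy
    ok = all(results.values())
    return (200 if ok else 503), results
```

Returning 503 on any failed probe makes the target fail the ALB health check, so traffic shifts away automatically.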

Disaster Recovery (DR)

RTO and RPO Definitions

RTO — Recovery Time Objective

The maximum acceptable time for restoring a system after a disaster. Measured from the moment of failure to the moment the system is operational again. An RTO of 4 hours means the business can tolerate up to 4 hours of downtime.

RPO — Recovery Point Objective

The maximum acceptable amount of data loss measured in time. An RPO of 1 hour means the business can tolerate losing up to 1 hour of data. Drives your backup frequency and replication strategy.
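
The relationship between backup interval and RPO can be sketched as a simple check (an illustrative sketch, not a sizing tool):

```python
# Worst-case data loss with periodic backups equals the backup interval:
# the disaster strikes just before the next backup completes.
def worst_case_data_loss_minutes(backup_interval_minutes: int) -> int:
    return backup_interval_minutes

def meets_rpo(backup_interval_minutes: int, rpo_minutes: int) -> bool:
    return worst_case_data_loss_minutes(backup_interval_minutes) <= rpo_minutes

# Hourly backups satisfy a 1-hour RPO but not a 15-minute RPO:
assert meets_rpo(60, rpo_minutes=60) is True
assert meets_rpo(60, rpo_minutes=15) is False
```

A 15-minute RPO therefore forces either more frequent backups or continuous replication.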

4 DR Strategies — Cost vs Recovery Time

Strategy | RTO | RPO | Relative Cost | Description
Backup and Restore | Hours – Days | Hours – Days | $ (lowest) | Periodic backups stored in durable storage; restore from backup when disaster strikes. No standby infrastructure.
Pilot Light | 30 min – 2 hrs | Minutes | $$ | Core components (DB) running in the DR region at minimal scale; the app tier is off. Start it and scale on failover.
Warm Standby | Minutes | Seconds – Minutes | $$$ | Scaled-down but fully functional copy always running in the DR region. Scale up and switch DNS on failover.
Multi-Site Active-Active | Near zero | Near zero | $$$$ (highest) | Full capacity in multiple regions simultaneously. Traffic always distributed; failure of one region is handled transparently.

Pilot Light — Implementation Example

# Pilot Light: DR region has RDS read replica (always on)
# App servers are off; AMI is pre-built and ready

# Primary region (ap-southeast-1): full stack running
# DR region (us-east-1): only RDS read replica + pre-built AMI

# Step 1: RDS read replica in DR region (cross-region)
resource "aws_db_instance" "dr_replica" {
  provider                = aws.us_east_1
  identifier              = "prod-db-dr-replica"
  replicate_source_db     = "arn:aws:rds:ap-southeast-1:123456789012:db:prod-db"
  instance_class          = "db.t3.medium"  # Small — scale up on failover
  publicly_accessible     = false
  auto_minor_version_upgrade = true
  skip_final_snapshot     = false
}

# Step 2: On failover — promote replica to standalone
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --region us-east-1

# Step 3: Launch ASG with pre-built AMI pointing to new DB endpoint
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-app-asg \
  --min-size 3 --max-size 20 --desired-capacity 3 \
  --region us-east-1

# Step 4: Update Route 53 to point to DR region ALB
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234ABCD \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "dr-alb-us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

Multi-Region Architecture

Data Replication: Synchronous vs Asynchronous

Synchronous Replication

Guarantee: A write is not acknowledged until it is committed in all regions. Zero data loss (RPO = 0).
Cost: Write latency increases by the cross-region network round-trip time (typically 50–200 ms).
Use for: Financial transactions, inventory, anything requiring strong consistency.
Examples: Cloud Spanner (global), RDS Multi-AZ deployments (synchronous standby within a region).

Asynchronous Replication

Guarantee: Write acknowledged at the primary immediately; replicated to the secondary in the background. Small data loss window (RPO = seconds to minutes).
Cost: Replication lag; a failover can lose the most recent writes, and replica reads may be stale. Write latency is unaffected.
Use for: Read replicas for reporting, log archival, analytics, non-critical data.
Examples: RDS cross-region read replicas, Aurora Global Database, S3 cross-region replication.
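
The trade-off can be put in back-of-envelope numbers. All figures below are illustrative assumptions, not measurements:

```python
# Synchronous: every write pays at least one cross-region round trip,
# because it is acknowledged only after the remote region commits.
def sync_write_latency_ms(local_commit_ms: float, cross_region_rtt_ms: float) -> float:
    return local_commit_ms + cross_region_rtt_ms

# Asynchronous: writes inside the replication-lag window are lost
# if the primary region fails before they ship to the secondary.
def async_data_loss_on_failover(replication_lag_s: float, writes_per_s: float) -> float:
    return replication_lag_s * writes_per_s

assert sync_write_latency_ms(5.0, 150.0) == 155.0          # ~155 ms per write
assert async_data_loss_on_failover(2.0, 500.0) == 1000.0   # ~1000 writes at risk
```

Synchronous replication converts data-loss risk into per-write latency; asynchronous replication does the reverse.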

Traffic Routing — Route 53 Failover

# Active-Passive failover with health checks
# Primary: ap-southeast-1; Secondary: us-east-1 (failover target)

# Create health check for primary
aws route53 create-health-check \
  --caller-reference $(date +%s)-primary \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "primary-alb.ap-southeast-1.elb.amazonaws.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3,
    "MeasureLatency": true,
    "Regions": ["ap-southeast-1", "us-east-1", "eu-west-1"]
  }'

# Primary record (Failover = PRIMARY)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234ABCD \
  --change-batch '{
    "Changes": [
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "Primary-SGP",
          "Failover": "PRIMARY",
          "HealthCheckId": "abc-health-check-id",
          "AliasTarget": {
            "HostedZoneId": "Z35SXDOTRQ7X7K",
            "DNSName": "primary-alb.ap-southeast-1.elb.amazonaws.com",
            "EvaluateTargetHealth": true
          }
        }
      },
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "Secondary-USE1",
          "Failover": "SECONDARY",
          "AliasTarget": {
            "HostedZoneId": "Z35SXDOTRQ7X7K",
            "DNSName": "secondary-alb.us-east-1.elb.amazonaws.com",
            "EvaluateTargetHealth": true
          }
        }
      }
    ]
  }'

Microservices vs Monolith

When to Use Each

Factor | Monolith | Microservices
Team size | Small (1–8 engineers) | Large (multiple autonomous teams)
Domain clarity | Not yet defined | Well-defined bounded contexts
Deployment frequency | Weekly or less | Multiple times per day per service
Scaling requirements | Uniform across the app | Independent per service
Technology diversity | Single stack | Best tool per service
Operational maturity | Low overhead | Requires service mesh, observability, per-service CI/CD

Decomposition Patterns

Decompose by Domain (Domain-Driven Design)

Identify bounded contexts using DDD. Each bounded context becomes a service with its own data store and API. Example: a retail system decomposes into Order, Inventory, Catalog, Customer, Payment, Notification services.

Decompose by Team (Conway's Law)

Align service boundaries with team boundaries. Each team owns one or more services end-to-end — from code to deployment to on-call. Avoids inter-team coordination overhead for routine changes.

Decompose by Data

Services that need to scale independently or have different data consistency requirements get their own database. Avoids shared databases that create tight coupling and deployment dependencies between services.

Event-Driven Architecture

Event-driven architecture decouples producers from consumers through an event broker. Producers publish events without knowing who will consume them. Consumers subscribe to event types and process them independently.
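
The decoupling can be sketched in-process with a toy broker (real systems use a durable broker such as SNS/SQS or Pub/Sub, as configured below; all names here are illustrative):

```python
# Toy event broker: producers publish without knowing the consumers;
# every subscriber to an event type receives its own copy (fan-out).
from collections import defaultdict

class Broker:
    def __init__(self):
        self._subs = defaultdict(list)   # event type -> list of handlers

    def subscribe(self, event_type, handler):
        self._subs[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subs[event_type]:
            handler(payload)             # real brokers deliver asynchronously and durably

broker = Broker()
fulfillment, notifications = [], []
broker.subscribe("order_created", fulfillment.append)
broker.subscribe("order_created", notifications.append)
broker.publish("order_created", {"order_id": "o-1"})
assert fulfillment == notifications == [{"order_id": "o-1"}]
```

The producer's code never changes when a new consumer subscribes; that is the core of the pattern.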

# AWS SQS + SNS Fan-out pattern
# SNS topic → multiple SQS queues (fan-out)
# Each queue → Lambda (or EKS pod) for processing

# Create SNS topic
aws sns create-topic --name prod-order-events

# Create SQS queues for different consumers
aws sqs create-queue --queue-name order-fulfillment-queue \
  --attributes '{
    "VisibilityTimeout": "300",
    "MessageRetentionPeriod": "86400",
    "ReceiveMessageWaitTimeSeconds": "20",
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:ap-southeast-1:123456789012:order-fulfillment-dlq\",\"maxReceiveCount\":\"3\"}"
  }'

aws sqs create-queue --queue-name order-notification-queue

# Subscribe queues to SNS topic (fan-out)
aws sns subscribe \
  --topic-arn arn:aws:sns:ap-southeast-1:123456789012:prod-order-events \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:ap-southeast-1:123456789012:order-fulfillment-queue \
  --attributes '{"FilterPolicy":"{\"event_type\":[\"order_created\",\"order_updated\"]}"}'

# GCP Pub/Sub equivalent
gcloud pubsub topics create prod-order-events \
  --project=prod-backend-123456

gcloud pubsub subscriptions create order-fulfillment-sub \
  --topic=prod-order-events \
  --ack-deadline=300 \
  --message-retention-duration=86400s \
  --dead-letter-topic=projects/prod-backend-123456/topics/order-events-dlq \
  --max-delivery-attempts=5 \
  --filter='attributes.event_type = "order_created"' \
  --project=prod-backend-123456

Event Sourcing and CQRS

Event Sourcing

Store all changes to application state as a sequence of immutable events, rather than just the current state. The current state is derived by replaying the event log. Provides a complete audit trail, enables time-travel debugging, and supports event replay for building new projections.

Example stores: EventStoreDB, Kafka (as log), DynamoDB Streams, GCP Firestore with event collection.
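
A minimal sketch of the idea, with a hypothetical bank-account aggregate: state is never stored directly, only derived by folding over the immutable log.

```python
# Event sourcing in miniature: current state = fold(apply, events).
from dataclasses import dataclass

@dataclass
class Account:
    balance: int = 0

def apply(state: Account, event: dict) -> Account:
    # Pure function: (state, event) -> new state; events are never mutated.
    if event["type"] == "deposited":
        return Account(state.balance + event["amount"])
    if event["type"] == "withdrawn":
        return Account(state.balance - event["amount"])
    return state

def replay(events) -> Account:
    state = Account()
    for e in events:
        state = apply(state, e)
    return state

log = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
]
assert replay(log).balance == 70
# Time travel: replaying a prefix gives the state "as of" that point.
assert replay(log[:1]).balance == 100
```

New projections (e.g. a fraud-detection view) are built the same way: replay the full log through a different apply function.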

CQRS — Command Query Responsibility Segregation

Separate the write model (Commands) from the read model (Queries). Commands mutate state; queries read from an optimized read store. Allows independent scaling and optimization of reads and writes. Often combined with Event Sourcing — events update multiple read models.

Example: Write to PostgreSQL (normalized); publish event; consumer updates Elasticsearch for search queries and Redis for dashboard queries.
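
A sketch of that split, with in-memory stand-ins (a list for the write-side log, a dict for the Redis/Elasticsearch read model; all names are illustrative):

```python
# CQRS in miniature: commands append events on the write side;
# a projector maintains a denormalized read model for queries.
events: list = []          # write side: append-only event log
order_totals: dict = {}    # read side: optimized for dashboard lookups

def handle_place_order(order_id: str, amount: int) -> None:
    # Command: validate, then record the fact.
    if amount <= 0:
        raise ValueError("amount must be positive")
    event = {"type": "order_placed", "order_id": order_id, "amount": amount}
    events.append(event)
    project(event)         # in production this runs as an async consumer

def project(event: dict) -> None:
    # Query-side projection; eventually consistent in a real system.
    if event["type"] == "order_placed":
        order_totals[event["order_id"]] = event["amount"]

handle_place_order("o-1", 250)
assert order_totals["o-1"] == 250
```

Because the projector is just another event consumer, you can add a second read model (search, reporting) without touching the command side.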

Serverless Architecture

Serverless shifts operational responsibility — you provide code, the cloud manages execution, scaling, and infrastructure. True serverless means pay-per-invocation with automatic scaling from zero to thousands of concurrent executions.

Cold Start Mitigation

# AWS Lambda — provisioned concurrency keeps initialized environments warm
# (no cold starts up to the provisioned count)
aws lambda put-provisioned-concurrency-config \
  --function-name prod-api-handler \
  --qualifier prod \
  --provisioned-concurrent-executions 20

# Lambda SnapStart — near-zero cold starts (originally Java-only; newer
# Python and .NET runtimes are also supported)
aws lambda update-function-configuration \
  --function-name prod-java-handler \
  --snap-start ApplyOn=PublishedVersions

# Minimize cold start time:
# 1. Use smaller deployment packages (only necessary dependencies)
# 2. Prefer fast-starting runtimes (Node.js, Python) over the JVM for latency-sensitive functions
# 3. Initialize SDK clients OUTSIDE the handler (reused across warm invocations)
# 4. Use Lambda layers for shared dependencies

# Example: good Lambda handler structure
import boto3
import os

# Initialize outside handler — reused across warm invocations
s3_client = boto3.client("s3")
SECRET_VALUE = os.environ.get("SECRET_VALUE")  # read once at init; for live secrets, prefer the Secrets Manager Lambda extension

def handler(event, context):
    # Business logic only — no SDK init, no secrets fetch
    bucket = event["bucket"]
    key = event["key"]
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read().decode("utf-8")

EventBridge — Event Bus for Serverless Orchestration

# Create EventBridge rule to trigger Lambda on schedule
aws events put-rule \
  --name prod-daily-report \
  --schedule-expression "cron(0 2 * * ? *)" \
  --state ENABLED \
  --description "Trigger daily report generation at 02:00 UTC"

aws events put-targets \
  --rule prod-daily-report \
  --targets '[{
    "Id": "report-lambda",
    "Arn": "arn:aws:lambda:ap-southeast-1:123456789012:function:generate-report",
    "Input": "{\"type\":\"daily\",\"format\":\"pdf\"}"
  }]'

# Pattern-based rule (react to specific events)
aws events put-rule \
  --name on-ec2-state-change \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": { "state": ["terminated"] }
  }' \
  --state ENABLED

Container Architecture Patterns

Sidecar Pattern

# Sidecar: auxiliary container that extends the main container without changing it
# Example below: a Fluent Bit log-shipping sidecar. Service meshes such as Istio
# inject an Envoy proxy sidecar (mTLS, observability, traffic management) the same way.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
  - name: app
    image: myapp:v1.2.3
    ports:
    - containerPort: 8080
    resources:
      requests: { cpu: "250m", memory: "256Mi" }
      limits:   { cpu: "500m", memory: "512Mi" }
  - name: log-shipper          # Sidecar for log forwarding
    image: fluent/fluent-bit:latest   # pin a specific version in production
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
    env:
    - name: FLUENTBIT_OUTPUT_HOST
      value: "loki.monitoring.svc.cluster.local"
  volumes:
  - name: app-logs
    emptyDir: {}

Ambassador Pattern

# Ambassador: proxy that handles network concerns on behalf of the main container
# Useful for: connection pooling, circuit breaking, retry logic, protocol translation
apiVersion: v1
kind: Pod
metadata:
  name: app-with-ambassador
spec:
  containers:
  - name: app
    image: myapp:v1.2.3
    # App connects to localhost:5432 — ambassador handles the real connection
    env:
    - name: DB_HOST
      value: "localhost"
    - name: DB_PORT
      value: "5432"
  - name: cloud-sql-proxy      # Ambassador to Cloud SQL
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.9
    args:
    - "--structured-logs"
    - "--port=5432"
    - "prod-project:asia-southeast1:prod-db"
    securityContext:
      runAsNonRoot: true
    resources:
      requests: { cpu: "100m", memory: "64Mi" }
      limits:   { cpu: "200m", memory: "128Mi" }

Adapter Pattern

# Adapter: transforms output of main container to a standard interface
# Example: legacy app emits logs in custom format; adapter normalizes to JSON
apiVersion: v1
kind: Pod
metadata:
  name: app-with-adapter
spec:
  containers:
  - name: legacy-app
    image: legacy-app:v2
    # Writes logs to /var/log/app/app.log in proprietary format
    volumeMounts:
    - name: log-volume
      mountPath: /var/log/app
  - name: log-adapter           # Adapter: transforms log format
    image: log-normalizer:latest
    # Reads proprietary format, writes JSON to stdout for Fluentbit
    command: ["/bin/sh", "-c"]
    args: ["tail -f /var/log/app/app.log | ./normalize-to-json"]
    volumeMounts:
    - name: log-volume
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: log-volume
    emptyDir: {}

12-Factor App Methodology

The 12-Factor App provides a methodology for building scalable, maintainable software-as-a-service applications. Each factor maps directly to cloud-native deployment practices.

# | Factor | Cloud Mapping
1 | Codebase — one codebase tracked in VCS, many deploys | Git repository; different branches/tags → different environments
2 | Dependencies — explicitly declared, isolated | Dockerfile with pinned package versions; no system-level dependencies assumed
3 | Config — stored in the environment, not code | Kubernetes ConfigMaps/Secrets; AWS Parameter Store; GCP Secret Manager; never commit secrets
4 | Backing services — treated as attached resources | DB URL in an env var; swap RDS for Aurora by changing a variable, no code change
5 | Build, release, run — strict separation | CI builds the image; CD creates the release (image + config); Kubernetes runs it
6 | Processes — stateless, share-nothing | Containers are ephemeral; state goes to RDS/Redis/S3, never local disk
7 | Port binding — self-contained via port binding | Container exposes port 8080; a Kubernetes Service routes traffic to it
8 | Concurrency — scale out via the process model | Horizontal Pod Autoscaler; ASG; Cloud Run max-instances
9 | Disposability — fast startup, graceful shutdown | Containers start in <5 s; handle SIGTERM with connection draining; preStop lifecycle hooks
10 | Dev/prod parity — keep environments similar | Same Docker image from dev → staging → prod; Terraform workspaces for environment configs
11 | Logs — treat as event streams | Write to stdout/stderr; Fluent Bit/Fluentd ships to CloudWatch Logs / Cloud Logging
12 | Admin processes — run as one-off processes | Kubernetes Jobs for DB migrations; kubectl exec or Cloud Run Jobs for admin tasks
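
Factor 9 (disposability) is the one most often gotten wrong in practice. A minimal sketch of graceful SIGTERM handling for a worker process (the job loop is illustrative):

```python
# Trap SIGTERM so Kubernetes rolling updates drain work instead of dropping it:
# stop taking new work, finish the in-flight unit, then exit.
import signal

shutting_down = False

def on_sigterm(signum, frame):
    global shutting_down
    shutting_down = True   # flag checked between units of work

signal.signal(signal.SIGTERM, on_sigterm)

def drain(pending_jobs):
    # Finish the current job, then stop; never abandon in-flight work.
    completed = []
    for job in pending_jobs:
        completed.append(job)
        if shutting_down:
            break
    return completed
```

Pair this with a preStop hook and a terminationGracePeriodSeconds longer than the worst-case job duration.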

Well-Architected Review Process

A Well-Architected Review (WAR) is a structured assessment of your workload against proven architectural best practices. Conduct reviews at key lifecycle points: before go-live, after significant changes, and annually for production systems.

AWS Well-Architected Tool

# Create a workload in AWS Well-Architected Tool
aws wellarchitected create-workload \
  --workload-name "prod-payment-service" \
  --description "Production payment processing microservice" \
  --environment PRODUCTION \
  --aws-regions ap-southeast-1 \
  --account-ids 123456789012 \
  --lenses "wellarchitected" "serverless" \
  --review-owner "[email protected]" \
  --architectural-design "https://confluence.example.com/arch/payment"

# List workloads
aws wellarchitected list-workloads --query 'WorkloadSummaries[*].[WorkloadName,WorkloadId,RiskCounts]'

# Get lens review and high-risk items
aws wellarchitected get-lens-review \
  --workload-id abc123def456 \
  --lens-alias wellarchitected

Review Checklist

Operational Excellence

  • Is infrastructure defined as code? (Terraform / CloudFormation)
  • Are deployments automated with CI/CD pipelines?
  • Are runbooks documented for common operational tasks?
  • Is there a defined and tested process for rollbacks?

Security

  • Is least-privilege applied to all IAM roles and service accounts?
  • Is data encrypted at rest and in transit?
  • Are secrets managed centrally (Secrets Manager / Secret Manager)?
  • Is CloudTrail / Cloud Audit Logs enabled and centrally aggregated?
  • Is a WAF in front of public-facing endpoints?

Reliability

  • Are single points of failure eliminated? (Multi-AZ for all stateful services)
  • Are auto-scaling policies defined and tested?
  • Has a DR test been performed within the last 6 months?
  • Are RTO and RPO defined, documented, and achievable?

Cost Optimization

  • Are all resources tagged with cost allocation tags?
  • Are right-sizing recommendations from Compute Optimizer / Recommender applied?
  • Are Reserved Instances or Committed Use Discounts in place for stable workloads?
  • Are lifecycle policies configured for S3 / Cloud Storage?

Terraform: Multi-AZ VPC — AWS and GCP

AWS Multi-AZ VPC with Public/Private Subnets

# variables.tf
variable "aws_region"    { default = "ap-southeast-1" }
variable "vpc_cidr"      { default = "10.10.0.0/16" }
variable "environment"   { default = "prod" }
variable "project_name"  { default = "myapp" }

# main.tf
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket = "terraform-state-123456789012"
    key    = "prod/vpc/terraform.tfstate"
    region = "ap-southeast-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ap-southeast-1:123456789012:key/abc123"
    dynamodb_table = "terraform-state-lock"
  }
}

provider "aws" { region = var.aws_region }

data "aws_availability_zones" "available" { state = "available" }

locals {
  azs = slice(data.aws_availability_zones.available.names, 0, 3)
  common_tags = {
    Environment = var.environment
    Project     = var.project_name
    ManagedBy   = "terraform"
  }
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = merge(local.common_tags, { Name = "${var.project_name}-${var.environment}-vpc" })
}

# Public subnets — one per AZ
resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index + 1)
  availability_zone       = local.azs[count.index]
  map_public_ip_on_launch = false
  tags = merge(local.common_tags, {
    Name = "${var.project_name}-public-${local.azs[count.index]}"
    Type = "public"
    "kubernetes.io/role/elb" = "1"
  })
}

# Private subnets — one per AZ
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 11)
  availability_zone = local.azs[count.index]
  tags = merge(local.common_tags, {
    Name = "${var.project_name}-private-${local.azs[count.index]}"
    Type = "private"
    "kubernetes.io/role/internal-elb" = "1"
  })
}

# Isolated subnets (databases — no internet route)
resource "aws_subnet" "isolated" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 21)
  availability_zone = local.azs[count.index]
  tags = merge(local.common_tags, {
    Name = "${var.project_name}-isolated-${local.azs[count.index]}"
    Type = "isolated"
  })
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags   = merge(local.common_tags, { Name = "${var.project_name}-igw" })
}

# Elastic IPs for NAT Gateways
resource "aws_eip" "nat" {
  count  = 3
  domain = "vpc"
  tags   = merge(local.common_tags, { Name = "${var.project_name}-nat-eip-${count.index + 1}" })
}

# NAT Gateways — one per AZ for HA
resource "aws_nat_gateway" "main" {
  count         = 3
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.main]
  tags          = merge(local.common_tags, { Name = "${var.project_name}-nat-${local.azs[count.index]}" })
}

# Route tables
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
  tags = merge(local.common_tags, { Name = "${var.project_name}-public-rt" })
}

resource "aws_route_table" "private" {
  count  = 3
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
  tags = merge(local.common_tags, { Name = "${var.project_name}-private-rt-${count.index + 1}" })
}

resource "aws_route_table" "isolated" {
  vpc_id = aws_vpc.main.id
  # No routes to internet — isolated
  tags = merge(local.common_tags, { Name = "${var.project_name}-isolated-rt" })
}

# Route table associations
resource "aws_route_table_association" "public" {
  count          = 3
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = 3
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

resource "aws_route_table_association" "isolated" {
  count          = 3
  subnet_id      = aws_subnet.isolated[count.index].id
  route_table_id = aws_route_table.isolated.id
}

# VPC Endpoints — S3 and DynamoDB (free Gateway type)
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = concat(
    [aws_route_table.public.id],
    aws_route_table.private[*].id
  )
  tags = merge(local.common_tags, { Name = "${var.project_name}-s3-endpoint" })
}

# outputs.tf
output "vpc_id"              { value = aws_vpc.main.id }
output "public_subnet_ids"   { value = aws_subnet.public[*].id }
output "private_subnet_ids"  { value = aws_subnet.private[*].id }
output "isolated_subnet_ids" { value = aws_subnet.isolated[*].id }
output "nat_gateway_ips"     { value = aws_eip.nat[*].public_ip }

GCP Multi-AZ (Multi-Zone) VPC with Public/Private Subnets

# GCP Terraform — custom VPC with regional subnets
# Note: GCP subnets are regional; VM placement into zones is done at VM level

terraform {
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  backend "gcs" {
    bucket = "terraform-state-prod-123456"
    prefix = "prod/vpc"
  }
}

variable "project_id"   { default = "prod-backend-123456" }
variable "region"       { default = "asia-southeast1" }
variable "environment"  { default = "prod" }

provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_compute_network" "main" {
  name                    = "${var.environment}-vpc"
  auto_create_subnetworks = false
  mtu                     = 1460
  routing_mode            = "REGIONAL"
  project                 = var.project_id
}

# Public subnet (for GKE, Cloud NAT, load balancers)
resource "google_compute_subnetwork" "public" {
  name                     = "${var.environment}-public-subnet"
  network                  = google_compute_network.main.id
  region                   = var.region
  ip_cidr_range            = "10.10.0.0/20"
  private_ip_google_access = true
  project                  = var.project_id

  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

# Private subnet for GKE with secondary ranges for pods/services
resource "google_compute_subnetwork" "private_gke" {
  name                     = "${var.environment}-gke-subnet"
  network                  = google_compute_network.main.id
  region                   = var.region
  ip_cidr_range            = "10.10.16.0/20"
  private_ip_google_access = true
  project                  = var.project_id

  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.20.0.0/16"
  }
  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.30.0.0/20"
  }

  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

# Isolated subnet for Cloud SQL (private services)
resource "google_compute_subnetwork" "isolated" {
  name                     = "${var.environment}-isolated-subnet"
  network                  = google_compute_network.main.id
  region                   = var.region
  ip_cidr_range            = "10.10.32.0/24"
  private_ip_google_access = true
  project                  = var.project_id
}

# Cloud Router and Cloud NAT (outbound internet for private VMs)
resource "google_compute_router" "main" {
  name    = "${var.environment}-router"
  network = google_compute_network.main.id
  region  = var.region
  project = var.project_id
}

resource "google_compute_router_nat" "main" {
  name                               = "${var.environment}-nat"
  router                             = google_compute_router.main.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"
  project                            = var.project_id

  subnetwork {
    name                    = google_compute_subnetwork.private_gke.id
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }

  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }
}

# Firewall rules — deny all by default, allow selectively
resource "google_compute_firewall" "deny_all_ingress" {
  name      = "${var.environment}-deny-all-ingress"
  network   = google_compute_network.main.id
  project   = var.project_id
  direction = "INGRESS"
  priority  = 65534

  deny { protocol = "all" }
  source_ranges = ["0.0.0.0/0"]
}

resource "google_compute_firewall" "allow_internal" {
  name      = "${var.environment}-allow-internal"
  network   = google_compute_network.main.id
  project   = var.project_id
  direction = "INGRESS"
  priority  = 1000

  allow { protocol = "tcp" }
  allow { protocol = "udp" }
  allow { protocol = "icmp" }
  source_ranges = ["10.10.0.0/16", "10.20.0.0/16", "10.30.0.0/20"]
}

resource "google_compute_firewall" "allow_iap_ssh" {
  name      = "${var.environment}-allow-iap-ssh"
  network   = google_compute_network.main.id
  project   = var.project_id
  direction = "INGRESS"
  priority  = 1000

  allow { protocol = "tcp"; ports = ["22"] }
  # IAP's IP range — allow SSH only through Identity-Aware Proxy
  source_ranges = ["35.235.240.0/20"]
}

output "vpc_id"                { value = google_compute_network.main.id }
output "public_subnet_id"      { value = google_compute_subnetwork.public.id }
output "private_gke_subnet_id" { value = google_compute_subnetwork.private_gke.id }
output "nat_ip"                { value = google_compute_router_nat.main.name }

Architecture review reminder: Review the terraform plan output before every apply in production. Use separate state files per environment and per component (VPC, EKS, apps). Enable state locking to prevent concurrent modifications: a DynamoDB table for the S3 backend on AWS; the GCS backend locks state automatically.