Cost Optimization Best Practices

Optimization order matters: Start with rightsizing and waste elimination (free, immediate savings), then purchase commitments (Reserved Instances / Savings Plans), then architectural optimizations (Spot instances, storage tiering). Never buy commitments before rightsizing — you lock in inefficiency.

1. Rightsizing

Rightsizing means selecting the correct resource size for your actual workload. Over-provisioning is endemic in cloud environments — teams request large instances "just in case" and never revisit them. Industry surveys consistently estimate that roughly 30% of cloud spend is wasted, much of it on over-provisioned compute.

Identifying Oversized Instances

Use native cloud tools to find rightsizing opportunities automatically:

# AWS Compute Optimizer — query recommendations via CLI
aws compute-optimizer get-ec2-instance-recommendations \
  --region us-east-1 \
  --query 'instanceRecommendations[?finding == `OVER_PROVISIONED`].{
    Instance: instanceArn,
    Current: currentInstanceType,
    Recommended: recommendationOptions[0].instanceType,
    MonthlySavings: recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }' \
  --output table

# Example output:
# Instance           | Current     | Recommended | MonthlySavings
# i-0abc123...prod   | m5.2xlarge  | m5.large    | $127.44
# i-0def456...api    | c5.4xlarge  | c5.xlarge   | $203.76
# i-0ghi789...worker | r5.xlarge   | r5.large    | $89.20

# GCP Recommender — list rightsizing recommendations
gcloud recommender recommendations list \
  --recommender=google.compute.instance.MachineTypeRecommender \
  --project=my-project \
  --location=us-central1-a \
  --format='table(name,stateInfo.state,primaryImpact.costProjection.cost.units,content.operationGroups)'

Rightsizing Workflow

  1. Collect 2–4 weeks of CPU, memory, and network utilization data — short windows miss weekly patterns
  2. Set thresholds: Flag instances with CPU <20% p90 AND memory <40% p90 as rightsizing candidates
  3. Validate with the owning team: Some instances have intentional headroom (burst capacity, maintenance jobs)
  4. Test in staging first: Downsize in staging, run load tests, then apply to production
  5. Apply during low-traffic windows: Resize requires a stop/start cycle for EC2
  6. Track savings: Record the before/after cost impact for reporting
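Step 2's threshold test is easy to encode. A sketch follows; the p90 inputs would come from CloudWatch GetMetricStatistics with ExtendedStatistics=['p90'] over the Step 1 window, and the 20%/40% cutoffs are the ones above (the helper name is illustrative, not a standard tool):

```python
# Illustrative rightsizing candidate filter (Step 2 of the workflow).
# p90 values come from 2-4 weeks of utilization data (Step 1).

def is_rightsizing_candidate(cpu_p90: float, mem_p90: float,
                             cpu_threshold: float = 20.0,
                             mem_threshold: float = 40.0) -> bool:
    """Flag an instance only when BOTH CPU and memory p90 are under threshold.

    Using p90 instead of the average avoids flagging instances that are
    mostly idle but have real recurring peaks (weekly batch jobs, etc.).
    """
    return cpu_p90 < cpu_threshold and mem_p90 < mem_threshold

candidates = [
    ("i-0abc", 12.0, 35.0),   # low CPU and memory: candidate
    ("i-0def", 12.0, 85.0),   # memory-bound: do NOT downsize
    ("i-0ghi", 75.0, 30.0),   # CPU peaks: do NOT downsize
]
flagged = [i for i, cpu, mem in candidates if is_rightsizing_candidate(cpu, mem)]
print(flagged)  # ['i-0abc']
```

Requiring both metrics to be low is what keeps memory-bound instances (common with JVM workloads) from being flagged on CPU alone.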

Auto-Remediation for Dev Environments

# Lambda function to auto-rightsize dev instances (Python)
# Triggered weekly by EventBridge

import boto3
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    ec2 = boto3.client('ec2', region_name='us-east-1')
    cw = boto3.client('cloudwatch')

    # Find all dev instances
    instances = ec2.describe_instances(Filters=[
        {'Name': 'tag:Environment', 'Values': ['dev']},
        {'Name': 'instance-state-name', 'Values': ['running']}
    ])

    # Instance family downsizing map (one size step at a time)
    downsize_map = {
        't3.2xlarge': 't3.xlarge',
        't3.xlarge':  't3.large',
        'm5.2xlarge': 'm5.xlarge',
        'm5.xlarge':  'm5.large',
    }

    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id   = instance['InstanceId']
            current_type  = instance['InstanceType']
            target_type   = downsize_map.get(current_type)

            if not target_type:
                continue

            # Check average CPU over a rolling 14-day window
            now = datetime.now(timezone.utc)
            metrics = cw.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=now - timedelta(days=14),
                EndTime=now,
                Period=14 * 24 * 3600,  # one datapoint covering the window
                Statistics=['Average']
            )
            avg_cpu = metrics['Datapoints'][0]['Average'] if metrics['Datapoints'] else 100

            if avg_cpu < 20:
                print(f"Rightsizing {instance_id}: {current_type} → {target_type} (avg CPU: {avg_cpu:.1f}%)")
                ec2.stop_instances(InstanceIds=[instance_id])
                # modify_instance_attribute fails while the instance is running,
                # so wait for the stop to complete first
                ec2.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])
                ec2.modify_instance_attribute(
                    InstanceId=instance_id,
                    Attribute='instanceType',
                    Value=target_type
                )
                ec2.start_instances(InstanceIds=[instance_id])

2. Reserved Instances & Committed Use Discounts

After rightsizing, purchasing commitments is the highest-ROI FinOps action for stable workloads. The key rule: commit to what you know you will always run. Cover your minimum 24/7 baseline with reservations; let on-demand or Spot cover variable burst traffic.
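That baseline rule can be computed directly from hourly usage data (Cost Explorer or CUR exports). A minimal sketch, assuming you already have hourly instance counts; the function name and the 80% coverage factor are illustrative:

```python
# Sketch: derive a commitment baseline from hourly usage samples.
# `hourly_instance_counts` would come from CUR / Cost Explorer data.

def commitment_baseline(hourly_instance_counts, coverage=0.8):
    """Commit to a fraction of the minimum observed hourly usage.

    The minimum over the lookback window is the floor you always run;
    covering only ~80% of it leaves headroom if the fleet shrinks
    during the commitment term.
    """
    floor = min(hourly_instance_counts)
    return int(floor * coverage)

# A week of hourly counts: 40 instances at night, bursting to 100 by day
samples = [40] * 60 + [100] * 50 + [60] * 58
print(commitment_baseline(samples))  # 32
```

The key property: burst traffic (the 100-instance peaks) never influences the commitment, only the always-on floor does.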

AWS Reserved Instances

  • Standard RI: Locked to specific instance family, size, OS, and region. Highest discount (up to 72%). Not exchangeable.
  • Convertible RI: Can exchange for different instance types/sizes during term. Lower discount (up to 54%). Best for evolving architectures.
  • 1-year vs 3-year: 3-year saves ~40% more but increases commitment risk. Use 1-year for services with uncertain lifetime.
  • Payment: All Upfront > Partial Upfront > No Upfront. All Upfront gives maximum discount; use it only when cash flow allows.

GCP Committed Use Discounts

  • Resource-based CUDs: Commit to a specific amount of vCPU and RAM in a region. Up to 57% discount for 3-year.
  • Spend-based CUDs: Commit to a minimum dollar spend per hour. Applies to Cloud SQL, VMware Engine, and some other services.
  • Flexibility: CUDs apply across all machine types in a family — more flexible than AWS Standard RIs.
  • Sustained Use Discounts: Remember GCP auto-applies up to 30% for always-on instances before CUDs are even needed.
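The sustained use mechanism is worth seeing numerically. A sketch of the classic N1 tier schedule, where each successive quarter of the month is billed at a lower incremental rate (verify the tiers for your machine family; newer families discount less):

```python
# GCP sustained use discount, classic N1 schedule: each successive
# quarter of the month runs at a lower fraction of the base rate.
TIERS = [1.0, 0.8, 0.6, 0.4]  # incremental rate per 25% of the month

def sud_effective_rate(fraction_of_month: float) -> float:
    """Blended billing rate for a VM running this fraction of the month."""
    billed = 0.0
    remaining = fraction_of_month
    for rate in TIERS:
        chunk = min(remaining, 0.25)
        billed += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return billed / fraction_of_month

print(f"{(1 - sud_effective_rate(1.0)) * 100:.0f}% discount")  # 30% discount for a full month
```

A VM running only the first quarter of the month gets no discount at all, which is why SUDs reward always-on workloads specifically.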

Analyzing Before Purchasing RIs

# Before purchasing any RI, analyze your coverage and utilization
# Use AWS Cost Explorer RI recommendations

aws ce get-reservation-purchase-recommendation \
  --service "Amazon EC2" \
  --lookback-period-in-days SIXTY_DAYS \
  --term-in-years ONE_YEAR \
  --payment-option PARTIAL_UPFRONT \
  --account-scope PAYER \
  --query 'Recommendations[*].{
    InstanceType: RecommendationDetails[0].InstanceDetails.EC2InstanceDetails.InstanceType,
    Platform: RecommendationDetails[0].InstanceDetails.EC2InstanceDetails.Platform,
    RecommendedCount: RecommendationDetails[0].RecommendedNumberOfInstancesToPurchase,
    MonthlySavings: RecommendationSummary.TotalEstimatedMonthlySavingsAmount,
    Coverage: RecommendationSummary.CurrentAverageCoverage
  }' \
  --output table

# Coverage target: 70-80% of baseline compute
# Anything above 80% risks wasted RIs if workloads change

# Check current RI utilization (avoid buying more if utilization is <80%)
aws ce get-reservation-utilization \
  --time-period Start=2025-02-01,End=2025-03-01 \
  --query 'Total.{Utilization: UtilizationPercentage, Savings: NetSavings}'

3. AWS Savings Plans

Savings Plans are more flexible than RIs — you commit to a $/hour compute spend level rather than specific instance types. This makes them easier to manage as your infrastructure evolves.

Compute Savings Plans (up to 66% off)

Applies automatically to EC2 (any instance family, size, region, OS, tenancy), Fargate, and Lambda. The most flexible option. Ideal if you are migrating between instance families or running containers.

EC2 Instance Savings Plans (up to 72% off)

Committed to a specific EC2 instance family in a specific region (e.g., m5 in us-east-1). Higher discount than Compute Savings Plans. Applies regardless of size, OS, or tenancy within that family.

SageMaker Savings Plans (up to 64% off)

Applies to SageMaker instance usage for training, real-time inference, and batch transform. Essential if you have stable ML workloads.

Savings Plans Purchase Strategy

# Step 1: Identify your stable compute baseline (minimum hourly spend)
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days SIXTY_DAYS \
  --query 'SavingsPlansPurchaseRecommendation.{
    HourlyCommitment: SavingsPlansPurchaseRecommendationDetails[0].HourlyCommitmentToPurchase,
    MonthlySavings: SavingsPlansPurchaseRecommendationSummary.EstimatedMonthlySavingsAmount,
    Coverage: SavingsPlansPurchaseRecommendationSummary.CurrentAverageCoveragePercentage
  }'

# Step 2: Review coverage vs utilization ratio
# Purchase in tranches: buy 50% of recommended, evaluate, then buy more
# This prevents over-commitment if workloads shrink

# Step 3: Check Savings Plans utilization monthly
aws ce get-savings-plans-utilization \
  --time-period Start=2025-02-01,End=2025-03-01 \
  --query 'Total.{Utilization: UtilizationPercentage, NetSavings: Savings.NetSavings}'
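The tranche approach in Step 2 can be made mechanical. A hypothetical helper; the function, the 95% utilization gate, and the 50% split are illustrative policy, not an AWS API:

```python
# Sketch of the tranche strategy: start at 50% of the recommended
# hourly commitment and only add more once what you already own is
# demonstrably well used. All thresholds here are illustrative.

def next_tranche(recommended_hourly: float, owned_hourly: float,
                 utilization_pct: float) -> float:
    """Return the additional $/hr Savings Plan commitment to buy this cycle."""
    if owned_hourly == 0:
        return round(recommended_hourly * 0.5, 2)  # first tranche: 50%
    if utilization_pct < 95:
        return 0.0                                 # digest what you own first
    remaining = max(recommended_hourly - owned_hourly, 0)
    return round(remaining * 0.5, 2)               # then halve the gap again

print(next_tranche(10.0, 0.0, 0))      # 5.0  (initial 50% tranche)
print(next_tranche(10.0, 5.0, 99.2))   # 2.5  (well utilized: buy half the gap)
print(next_tranche(10.0, 5.0, 88.0))   # 0.0  (underutilized: pause purchases)
```

Because each tranche halves the remaining gap, coverage converges toward the recommendation without ever overshooting it.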

4. Spot / Preemptible Instances

Spot Instances (AWS) and Spot VMs (GCP, formerly Preemptible VMs) offer the largest discounts available in cloud computing — typically 60–90% below on-demand. The tradeoff: they can be reclaimed on short notice (two minutes on AWS, 30 seconds on GCP) when the cloud provider needs the capacity back.

Ideal Spot Workloads

  • Batch processing jobs (ETL, data transformation, report generation)
  • ML/AI model training (checkpoint your model state every epoch)
  • CI/CD build runners and test executors
  • Stateless web tier nodes behind a load balancer (with graceful shutdown)
  • Development and staging environments (tolerate interruptions)
  • Video transcoding and image processing pipelines

Spot Interruption Handling (AWS EC2)

#!/bin/bash
# /etc/spot-interruption-handler.sh
# Runs as a background service on every Spot instance

while true; do
  # AWS provides 2-minute notice via instance metadata.
  # Use IMDSv2 (session token); IMDSv1 is disabled on hardened instances.
  TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/spot/instance-action)

  if [ "$HTTP_STATUS" = "200" ]; then
    TERMINATION_TIME=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/spot/instance-action | jq -r '.time')

    echo "Spot termination notice received at $(date). Scheduled for: $TERMINATION_TIME"

    # 1. Deregister from load balancer (if applicable)
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)
    aws elbv2 deregister-targets \
      --target-group-arn "$TARGET_GROUP_ARN" \
      --targets "Id=$INSTANCE_ID"

    # 2. Complete or drain current work
    # (application-specific — emit a SIGTERM to your app)
    kill -TERM $(pgrep -f my-worker-process)

    # 3. Checkpoint any stateful work to S3
    aws s3 sync /tmp/checkpoints/ "s3://my-checkpoints/${INSTANCE_ID}/"

    # 4. Drain SQS messages back to queue (if applicable)
    # Your worker should handle SIGTERM to ack or reject current message

    echo "Graceful shutdown complete"
    break
  fi
  sleep 5
done

Mixed Instance Groups (EKS Node Groups)

# Terraform: EKS node group with Spot + On-Demand mix
resource "aws_eks_node_group" "mixed" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "mixed-spot-ondemand"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  capacity_type = "SPOT"  # Primary: Spot instances

  instance_types = [
    "m5.large", "m5.xlarge", "m4.large", "m4.xlarge",
    "t3.large", "t3.xlarge",  # Multiple types = more Spot availability
  ]

  scaling_config {
    desired_size = 10
    min_size     = 3
    max_size     = 50
  }

  # Use On-Demand for critical system pods (taint Spot nodes)
  taint {
    key    = "node.kubernetes.io/capacity-type"
    value  = "spot"
    effect = "NO_SCHEDULE"
  }

  labels = {
    "node.kubernetes.io/capacity-type" = "spot"
  }
}

# Separate On-Demand node group for critical workloads
resource "aws_eks_node_group" "ondemand_critical" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "ondemand-critical"
  node_role_arn   = aws_iam_role.node.arn   # required argument
  subnet_ids      = var.private_subnet_ids  # required argument
  capacity_type   = "ON_DEMAND"
  instance_types  = ["m5.large"]

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 10
  }
}

5. Storage Optimization

S3 Intelligent-Tiering and Lifecycle Policies

# Terraform: S3 bucket with comprehensive lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "data_lifecycle" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "intelligent-tiering-for-active-data"
    status = "Enabled"
    filter {}  # empty filter = applies to all objects (a filter or prefix is required)

    transition {
      days          = 0
      storage_class = "INTELLIGENT_TIERING"
    }
  }

  rule {
    id     = "archive-old-logs"
    status = "Enabled"
    filter { prefix = "logs/" }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"    # Infrequent Access after 30 days
    }
    transition {
      days          = 90
      storage_class = "GLACIER"        # Glacier after 90 days (~$0.004/GB/mo)
    }
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"   # Deep Archive after 1 year (~$0.00099/GB/mo)
    }
    expiration {
      days = 2555  # Delete after 7 years (compliance requirement)
    }
  }

  rule {
    id     = "delete-incomplete-multipart"
    status = "Enabled"

    abort_incomplete_multipart_upload { days_after_initiation = 7 }
  }

  rule {
    id     = "clean-old-versions"
    status = "Enabled"

    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "GLACIER"
    }
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}

EBS Volume Cleanup Script

#!/usr/bin/env python3
# Find and report unattached EBS volumes (potential waste)

import boto3
import csv
from datetime import datetime

def find_unattached_ebs():
    ec2 = boto3.client('ec2', region_name='us-east-1')

    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]  # Not attached
    )['Volumes']

    # Approximate $/GB-month by volume type (us-east-1)
    cost_per_gb = {'gp3': 0.08, 'gp2': 0.10, 'io1': 0.125, 'st1': 0.045, 'sc1': 0.025}

    total_waste = 0
    report = []

    for vol in volumes:
        vol_type = vol['VolumeType']
        size_gb  = vol['Size']
        monthly  = size_gb * cost_per_gb.get(vol_type, 0.10)
        total_waste += monthly

        tags = {t['Key']: t['Value'] for t in vol.get('Tags', [])}

        report.append({
            'VolumeId': vol['VolumeId'],
            'Size':     f"{size_gb} GB",
            'Type':     vol_type,
            'Region':   'us-east-1',
            'Team':     tags.get('Team', 'UNTAGGED'),
            'Created':  vol['CreateTime'].strftime('%Y-%m-%d'),
            'MonthlyWaste': f"${monthly:.2f}"
        })

    print(f"\nTotal unattached EBS volumes: {len(volumes)}")
    print(f"Total monthly waste: ${total_waste:,.2f}")
    print(f"\nTop wasteful volumes:")
    for r in sorted(report, key=lambda x: float(x['MonthlyWaste'][1:]), reverse=True)[:10]:
        print(f"  {r['VolumeId']}: {r['Size']} {r['Type']} - {r['MonthlyWaste']}/mo (Team: {r['Team']})")

    # Write CSV for team review (skip when nothing was found)
    if report:
        with open('unattached-ebs-report.csv', 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=report[0].keys())
            writer.writeheader()
            writer.writerows(report)

if __name__ == '__main__':
    find_unattached_ebs()

6. Kubernetes Cost Optimization

Kubernetes clusters are complex cost centers — costs are driven by node provisioning, pod scheduling efficiency, namespace resource quotas, and persistent storage. These optimizations typically yield 30–50% savings on K8s infrastructure.
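The scheduling-efficiency point is concrete: a pod costs you what it requests, because requests are what the scheduler reserves on a node, regardless of actual usage. A rough allocation sketch (node price and pod sizes are illustrative):

```python
# Rough per-pod cost allocation: a pod is charged for what it
# *requests*, since requests are what the scheduler reserves.
# Node price is illustrative (m5.large on-demand, us-east-1).

NODE_HOURLY_COST = 0.096      # m5.large
NODE_CPU_MILLICORES = 2000    # 2 vCPU

def pod_hourly_cost(requested_millicores: int) -> float:
    return NODE_HOURLY_COST * requested_millicores / NODE_CPU_MILLICORES

# A pod requesting 500m but using only 50m still costs the 500m share:
print(f"${pod_hourly_cost(500) * 730:.2f}/month")  # $17.52/month
print(f"${pod_hourly_cost(50) * 730:.2f}/month")   # $1.75/month if rightsized
```

Multiplied across hundreds of pods, tightening requests is where most of the 30–50% K8s savings comes from; the autoscalers below then reclaim the freed node capacity.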

Cluster Autoscaler Configuration

# cluster-autoscaler Helm values (EKS)
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set extraArgs.scale-down-enabled=true \
  --set extraArgs.scale-down-delay-after-add=10m \
  --set extraArgs.scale-down-unneeded-time=10m \
  --set extraArgs.scale-down-utilization-threshold=0.5 \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set extraArgs.expander=least-waste  # Pack nodes efficiently before scaling out

Karpenter — Next-Generation Node Autoscaling

# karpenter/provisioner.yaml — intelligent node provisioning
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Allow Spot and On-Demand, prefer Spot
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - m5.large
        - m5.xlarge
        - m5.2xlarge
        - m4.large
        - m4.xlarge
        - t3.large
        - t3.xlarge
  # Consolidation: remove underutilized and empty nodes automatically.
  # In v1alpha5 this is mutually exclusive with ttlSecondsAfterEmpty,
  # so no empty-node TTL is set here.
  consolidation:
    enabled: true
  # TTL: rotate nodes after 30 days (security, patch compliance)
  ttlSecondsUntilExpired: 2592000
  limits:
    resources:
      cpu: "1000"
      memory: 4000Gi
  provider:
    instanceProfile: KarpenterNodeInstanceProfile
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster
    tags:
      ManagedBy: karpenter
      Environment: prod

Namespace Resource Quotas

# Apply resource quotas per namespace to prevent runaway costs
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    persistentvolumeclaims: "20"
    requests.storage: 500Gi
    count/pods: "100"
---
# LimitRange: default requests/limits for pods without explicit settings
apiVersion: v1
kind: LimitRange
metadata:
  name: team-backend-limits
  namespace: team-backend
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4"
        memory: "8Gi"

7. Network Cost Optimization

Hidden cost alert: Data transfer is one of the most overlooked cloud costs. AWS charges $0.09/GB for data leaving a region to the internet, and $0.01/GB in each direction for cross-AZ traffic ($0.02/GB effective). In high-throughput architectures, transfer costs can rival compute costs.
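A back-of-envelope estimator makes these rates tangible. The rates below are illustrative us-east-1 list prices; check current pricing before relying on them:

```python
# Back-of-envelope data transfer estimator (us-east-1 list prices,
# illustrative; internet egress is tiered and drops at high volume).

RATES = {
    "internet_egress": 0.09,  # $/GB, region -> internet (first tiers)
    "cross_az":        0.02,  # $/GB effective: $0.01 charged in each direction
    "cross_region":    0.02,  # $/GB, e.g. us-east-1 -> us-west-2
    "same_az":         0.00,  # free over private IPs
}

def monthly_transfer_cost(gb_by_path: dict) -> float:
    return sum(gb * RATES[path] for path, gb in gb_by_path.items())

# Chatty microservices spread across AZs add up fast:
cost = monthly_transfer_cost({"cross_az": 50_000, "internet_egress": 2_000})
print(f"${cost:,.2f}/month")  # $1,180.00/month
```

Note the cross-AZ line dominates here, which is why AZ-aware routing (topology-aware hints in Kubernetes, AZ-affine service discovery) is often a bigger win than tuning egress.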

NAT Gateway vs VPC Endpoints

# VPC endpoints eliminate NAT Gateway data transfer costs for AWS services
# NAT Gateway: $0.045/GB processed + $0.045/hour = expensive for S3/DynamoDB traffic

# Terraform: Gateway endpoint for S3 (free; keeps S3 traffic off the NAT GW)
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id

  tags = { Name = "s3-endpoint" }
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}

# Interface endpoints for ECR (eliminates NAT GW costs for container pulls)
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoint.id]
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoint.id]
}

# ROI calculation (illustrative):
# NAT GW processing: 10TB/month × $0.045/GB = $450/month
# Gateway endpoints (S3, DynamoDB): free; the full $450 is saved
# Interface endpoints (ECR): ~$0.01/hr per AZ plus $0.01/GB processed,
# still far cheaper than routing the same traffic through the NAT GW

8. Database Cost Optimization

Aurora Serverless v2

# Aurora Serverless v2 scales down to its configured ACU floor when idle;
# ideal for dev/staging (recent engine versions also support a 0 ACU minimum)
resource "aws_rds_cluster" "aurora_serverless" {
  cluster_identifier      = "my-app-db"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  engine_mode             = "provisioned"  # Serverless v2 uses provisioned mode
  database_name           = "myapp"
  master_username         = "admin"
  manage_master_user_password = true

  serverlessv2_scaling_configuration {
    min_capacity = 0.5   # 0.5 ACU floor (~$0.06/hr at $0.12/ACU-hr in us-east-1)
    max_capacity = 16    # Scales up under load
  }

  deletion_protection = true
  skip_final_snapshot = false
}

resource "aws_rds_cluster_instance" "aurora_instance" {
  cluster_identifier = aws_rds_cluster.aurora_serverless.id
  instance_class     = "db.serverless"  # Special class for Serverless v2
  engine             = aws_rds_cluster.aurora_serverless.engine
}

# Cost comparison:
# db.r6g.large (always on): ~$175/month
# Aurora Serverless v2 idling at the 0.5 ACU floor: ~$44/month (less with a 0 ACU minimum)
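Whether Serverless v2 actually beats a provisioned instance depends on average ACU consumption. A quick breakeven sketch, assuming us-east-1 list prices ($0.12/ACU-hr, ~$175/month for db.r6g.large):

```python
# Breakeven sketch: Aurora Serverless v2 vs an always-on instance.
# Prices are us-east-1 list prices; verify before deciding.

ACU_HOURLY = 0.12          # $/ACU-hour
R6G_LARGE_MONTHLY = 175.0  # approx. always-on db.r6g.large
HOURS_PER_MONTH = 730

def serverless_monthly(avg_acu: float) -> float:
    return avg_acu * ACU_HOURLY * HOURS_PER_MONTH

# Dev cluster idling at the 0.5 ACU floor:
print(f"${serverless_monthly(0.5):.2f}/month")  # $43.80/month, well under $175

# Average ACU level at which serverless matches the provisioned instance
breakeven = R6G_LARGE_MONTHLY / (ACU_HOURLY * HOURS_PER_MONTH)
print(f"breakeven ~{breakeven:.1f} avg ACUs")  # ~2.0
```

Above roughly 2 average ACUs sustained around the clock, the provisioned instance wins; below it, serverless does, which is why the pattern fits dev/staging and spiky production workloads.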

9. Automated Cost Governance

Budget Alerts in Terraform

# AWS Budget with SNS alerts (Terraform)
resource "aws_budgets_budget" "team_monthly" {
  for_each = var.teams

  name         = "${each.key}-monthly-budget"
  budget_type  = "COST"
  limit_amount = each.value.budget
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name = "TagKeyValue"
    # Budgets tag filters use the form "user:<TagKey>$<TagValue>";
    # format() avoids HCL's $${...} escape producing a literal string
    values = [format("user:Team$%s", each.key)]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [each.value.owner_email]
    subscriber_sns_topic_arns  = [aws_sns_topic.finops.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["[email protected]", each.value.owner_email]
    subscriber_sns_topic_arns  = [aws_sns_topic.finops.arn]
  }
}

Auto-Stop Dev Environments (EventBridge Scheduler)

# Stop dev EC2 instances nightly at 8PM local time, restart at 8AM
# Note: the EC2 StopInstances API takes explicit InstanceIds and does not
# accept tag filters, so each schedule targets a small Lambda
# (aws_lambda_function.stop_dev_instances, not shown) that resolves the
# tag filter and calls stop_instances on the matching IDs.
resource "aws_scheduler_schedule" "stop_dev" {
  name                         = "stop-dev-instances"
  schedule_expression          = "cron(0 20 * * ? *)"  # 8PM daily, in the timezone below
  schedule_expression_timezone = "Asia/Ho_Chi_Minh"

  flexible_time_window { mode = "OFF" }

  target {
    arn      = aws_lambda_function.stop_dev_instances.arn
    role_arn = aws_iam_role.scheduler.arn
    input = jsonencode({
      Filters = [
        { Name = "tag:Environment", Values = ["dev"] },
        { Name = "instance-state-name", Values = ["running"] }
      ]
    })
  }
}

resource "aws_scheduler_schedule" "start_dev" {
  name                         = "start-dev-instances"
  schedule_expression          = "cron(0 8 ? * MON-FRI *)"  # 8AM weekdays, in the timezone below
  schedule_expression_timezone = "Asia/Ho_Chi_Minh"

  flexible_time_window { mode = "OFF" }

  target {
    arn      = aws_lambda_function.start_dev_instances.arn  # not shown; resolves the tag filter
    role_arn = aws_iam_role.scheduler.arn
    input = jsonencode({
      Filters = [{ Name = "tag:Environment", Values = ["dev"] }]
    })
  }
}

# Savings: dev instances running 12h/day instead of 24h = 50% cost reduction
# Example: 20 × t3.medium ($0.0416/hr) × 12h saved = $9.98/day = ~$300/month
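The savings arithmetic above generalizes to any fleet. A small estimator; note that stopped instances still accrue EBS storage charges, which this deliberately ignores:

```python
# Generalizes the schedule-savings estimate: stopped hours are unbilled
# for EC2 compute (attached EBS volumes still accrue storage charges).

def schedule_savings(count: int, hourly_rate: float,
                     hours_off_per_day: float, days: int = 30) -> float:
    return count * hourly_rate * hours_off_per_day * days

# 20 x t3.medium ($0.0416/hr on-demand), off 12h/day:
print(f"${schedule_savings(20, 0.0416, 12):,.2f}/month")  # $299.52/month
```

Adding full weekend shutdowns (the start schedule only runs weekdays) pushes the effective reduction past the 50% shown here.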

10. FinOps Reporting

Weekly Cost Report Script (Python)

#!/usr/bin/env python3
# weekly-cost-report.py — Generate and send weekly cost summary to Slack

import boto3
import json
import urllib3
import os
from datetime import datetime, timedelta

def generate_weekly_report():
    ce = boto3.client('ce', region_name='us-east-1')

    today = datetime.today()
    week_start = (today - timedelta(days=7)).strftime('%Y-%m-%d')
    week_end   = today.strftime('%Y-%m-%d')
    prev_start = (today - timedelta(days=14)).strftime('%Y-%m-%d')
    prev_end   = week_start

    def get_cost_by_tag(start, end, group_key='Team'):
        resp = ce.get_cost_and_usage(
            TimePeriod={'Start': start, 'End': end},
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'TAG', 'Key': group_key}]
        )
        # Sum across daily periods so a week that spans a month boundary
        # is not silently truncated to a single billing period
        costs = {}
        for period in resp['ResultsByTime']:
            for g in period['Groups']:
                team = g['Keys'][0].replace(f'{group_key}$', '')
                costs[team] = costs.get(team, 0.0) + float(g['Total']['UnblendedCost']['Amount'])
        return costs

    this_week  = get_cost_by_tag(week_start, week_end)
    last_week  = get_cost_by_tag(prev_start, prev_end)

    total_this  = sum(this_week.values())
    total_last  = sum(last_week.values())
    pct_change  = ((total_this - total_last) / total_last * 100) if total_last else 0

    arrow = ":chart_with_upwards_trend:" if pct_change > 5 else ":chart_with_downwards_trend:" if pct_change < -5 else ":heavy_minus_sign:"

    lines = [f"*Team*\t\t*This Week*\t*Last Week*\t*Change*"]
    for team, cost in sorted(this_week.items(), key=lambda x: -x[1]):
        prev = last_week.get(team, 0)
        chg  = ((cost - prev) / prev * 100) if prev else 0
        chg_str = f"+{chg:.1f}%" if chg > 0 else f"{chg:.1f}%"
        lines.append(f"{team:<16}\t${cost:,.2f}\t${prev:,.2f}\t{chg_str}")

    message = {
        "blocks": [
            {"type": "header", "text": {"type": "plain_text", "text": f"Weekly Cloud Cost Report — {week_start} to {week_end}"}},
            {"type": "section", "fields": [
                {"type": "mrkdwn", "text": f"*This Week Total:*\n${total_this:,.2f}"},
                {"type": "mrkdwn", "text": f"*Last Week Total:*\n${total_last:,.2f}"},
                {"type": "mrkdwn", "text": f"*Week-over-Week:*\n{arrow} {pct_change:+.1f}%"},
            ]},
            {"type": "section", "text": {"type": "mrkdwn", "text": "```\n" + "\n".join(lines) + "\n```"}},
            {"type": "section", "text": {"type": "mrkdwn",
                "text": "_View full dashboard: _"}}
        ]
    }

    http = urllib3.PoolManager()
    http.request('POST', os.environ['SLACK_WEBHOOK_URL'],
                 body=json.dumps(message).encode('utf-8'),
                 headers={'Content-Type': 'application/json'})
    print(f"Report sent. Total: ${total_this:,.2f} ({pct_change:+.1f}% WoW)")

if __name__ == '__main__':
    generate_weekly_report()

FinOps maturity check: If you have implemented rightsizing, purchase commitments at 70%+ coverage, Spot instances for batch workloads, S3 lifecycle policies, and automated dev environment shutdown — you are operating at "Walk" maturity. The next step to "Run" is automated chargeback, CI/CD cost gates with Infracost, and FinOps OKRs embedded in engineering quarterly planning.