AWS Cost Optimization

AWS provides a rich set of purchasing options, managed tools, and operational controls to reduce spend. This guide covers the key techniques applicable to EC2, S3, EBS, RDS, and data transfer, along with practical governance tooling.

AWS Pricing Models Deep Dive

Choosing the right pricing model for each workload is the single most impactful cost decision you will make on AWS.

On-Demand

Pay per second (Linux) or per hour (Windows) with no commitment. Best for unpredictable workloads, new applications being evaluated, and short-term projects. Serves as the baseline price from which all discounts are measured.

Reserved Instances — Standard

1-year or 3-year commitment to a specific instance type, OS, tenancy, and region. Up to 72% discount vs On-Demand. Payment options: All Upfront (deepest discount), Partial Upfront (moderate discount), No Upfront (smallest discount but no capital outlay). Cannot be exchanged for a different instance family; modifications are limited to attributes such as Availability Zone and, for Linux with default tenancy, instance size within the same family.

Reserved Instances — Convertible

Same commitment structure as Standard RIs but with the ability to exchange for a different instance family, OS, or tenancy during the commitment period. Up to 66% discount. The flexibility premium over Standard RI is typically 6–8 percentage points.

Compute Savings Plans

Commit to a consistent spend ($/hour) for 1 or 3 years. Applies automatically to EC2 (any family, size, region, OS), AWS Fargate, and Lambda. Up to 66% discount. The most flexible commitment — ideal for organizations that frequently change instance families or migrate workloads between regions.

EC2 Instance Savings Plans

Commit to a specific EC2 instance family in a specific region. Up to 72% discount, matching Standard RI savings. Flexible within the family — covers any size, OS, and tenancy. Does not apply to Fargate or Lambda.

Spot Instances

Use spare AWS capacity at discounts of 60–90%+ vs On-Demand. The instance can be reclaimed with 2-minute notice. Best for fault-tolerant, stateless, or batch workloads. Use Spot Fleet or Auto Scaling groups with mixed instances policy to maintain capacity across multiple instance types and AZs.

EC2 Rightsizing with AWS Compute Optimizer

AWS Compute Optimizer uses machine learning to analyze 14 days of CloudWatch metrics and produce rightsizing recommendations for EC2 instances. It classifies each recommendation as Over-provisioned, Under-provisioned, or Optimized, with an estimated monthly savings and a performance risk indicator.

Key CloudWatch Metrics to Analyze

Metric Namespace What It Tells You
CPUUtilization AWS/EC2 Hypervisor-level CPU usage; available without the CloudWatch agent
mem_used_percent CWAgent Memory utilization — requires the CloudWatch agent to be installed
NetworkIn / NetworkOut AWS/EC2 Network throughput; helps identify network-bound instances
DiskReadOps / DiskWriteOps AWS/EC2 IOPS consumed; relevant for storage-optimized instance selection
EBSReadBytes / EBSWriteBytes AWS/EC2 EBS throughput; useful for gp3 vs io2 selection
# Retrieve Compute Optimizer EC2 recommendations via CLI
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned \
  --output table \
  --query 'instanceRecommendations[*].{
    Instance:instanceArn,
    Current:currentInstanceType,
    Recommended:recommendationOptions[0].instanceType,
    EstimatedSavings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }'

Reserved Instance Strategies

1-Year vs 3-Year Commitment

A 3-year Standard RI delivers roughly 10–15 percentage points more discount than a 1-year RI. However, locking in a 3-year commitment for a workload that may change significantly within that period erodes the value. As a rule of thumb:

  • Use 3-year for core, stable infrastructure (database servers, domain controllers, always-on application tiers) that will not change instance family within the commitment window.
  • Use 1-year for workloads that may change size or family — or use Convertible RIs to retain the ability to exchange.
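The 1-year vs 3-year tradeoff can also be framed as a break-even calculation: an RI is billed for every hour of the term, so it only wins if the instance actually runs enough hours. The sketch below uses illustrative placeholder prices, not current AWS rates.

```python
# Break-even utilization for a Reserved Instance vs On-Demand.
# All prices here are illustrative placeholders, not current AWS rates.

HOURS_PER_YEAR = 8760

def ri_break_even(on_demand_hourly: float, ri_effective_hourly: float) -> float:
    """Fraction of the year an instance must run for the RI to be cheaper.

    The RI is paid for every hour of the term regardless of usage, so it wins
    once usage_hours * on_demand_hourly exceeds HOURS_PER_YEAR * ri_effective_hourly.
    """
    return ri_effective_hourly / on_demand_hourly

# Example: m5.xlarge-style pricing (placeholder numbers)
on_demand = 0.192            # $/hr On-Demand
ri_1yr = 0.192 * (1 - 0.40)  # ~40% discount
ri_3yr = 0.192 * (1 - 0.55)  # ~55% discount

print(f"1-yr RI breaks even at {ri_break_even(on_demand, ri_1yr):.0%} utilization")
print(f"3-yr RI breaks even at {ri_break_even(on_demand, ri_3yr):.0%} utilization")
```

With a 40% discount, an RI pays off once the instance runs more than 60% of the year; the deeper the discount, the lower the utilization needed to justify the commitment.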

Payment Options

Payment Option Typical Discount (1yr Standard) Capital Requirement Best For
All Upfront ~40% High (full year paid today) Organizations with capital budget and strong cost discipline
Partial Upfront ~38% Medium (50% upfront) Balance between cash flow and discount depth
No Upfront ~33% None Opex-constrained teams; still ~33% cheaper than On-Demand

Convertible RI Exchange

To exchange a Convertible RI, the new RI must have equal or greater value than the RI being surrendered. AWS will apply a pro-rated credit for any remaining value. Use exchanges when:

  • You need to move to a newer generation (e.g., m5 to m6i) for better price/performance.
  • The workload has grown and needs a larger instance type within the same family.
  • You need to change the operating system (e.g., Windows to Linux, or RHEL to Amazon Linux).
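The equal-or-greater-value rule reduces to simple arithmetic. The sketch below uses hypothetical prices and is simplified relative to AWS's actual hour-by-hour proration of upfront fees; it only illustrates the rule.

```python
# Simplified Convertible RI exchange math (hypothetical prices).
# AWS's real calculation prorates upfront fees to the hour; this sketch
# works in whole remaining hours to show the equal-or-greater-value rule.

def remaining_value(effective_hourly: float, remaining_hours: int) -> float:
    """List value left on an RI for the rest of its term."""
    return effective_hourly * remaining_hours

def exchange_true_up(old_hourly: float, new_hourly: float, remaining_hours: int) -> float:
    """Extra amount owed so the new RI's value >= the surrendered RI's value.

    Returns 0 when the old RI already covers the new one (no refunds are given).
    """
    old_val = remaining_value(old_hourly, remaining_hours)
    new_val = remaining_value(new_hourly, remaining_hours)
    return max(0.0, new_val - old_val)

# Exchange a $0.08/hr RI for a $0.11/hr RI with 4,000 hours left on the term
print(f"True-up owed: ${exchange_true_up(0.08, 0.11, 4000):.2f}")
```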

Savings Plans Strategy

Compute Savings Plans vs EC2 Instance Savings Plans

Attribute Compute Savings Plans EC2 Instance Savings Plans
Max discount Up to 66% Up to 72%
Scope EC2 (any family/region/OS) + Fargate + Lambda EC2 in a specific instance family and region only
Flexibility Highest — no instance family or region lock-in Medium — locked to family and region; flexible on size/OS/tenancy
Recommendation Use for Fargate, Lambda, or when you expect to change EC2 families Use when you are confident the instance family and region are stable

Recommended approach: Start with Compute Savings Plans to cover a conservative baseline (e.g., 70% of your typical EC2 On-Demand spend). Once usage patterns stabilize, supplement with EC2 Instance Savings Plans for the largest, most stable instance families to capture the additional discount.
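One conservative way to size that baseline commitment is to commit below the observed trough of hourly spend, so the plan is fully utilized even at the quietest hour. A minimal sketch (synthetic data; in practice export hourly spend from Cost Explorer or the CUR):

```python
# Size a Savings Plans commitment from hourly On-Demand spend history.
# Uses synthetic data; in practice you would export hourly spend from
# Cost Explorer or the Cost and Usage Report. Committing near the trough,
# not the average, keeps the plan fully utilized.

def commitment_for_coverage(hourly_spend: list[float], fraction: float = 0.70) -> float:
    """Commit to `fraction` of the minimum observed hourly spend.

    The minimum is the spend you hit even at the quietest hour, so a
    commitment at or below it is always fully utilized.
    """
    return min(hourly_spend) * fraction

# One synthetic day: $40/hr overnight trough, $100/hr daytime peak
spend = [40.0] * 8 + [100.0] * 16
print(f"Commit: ${commitment_for_coverage(spend):.2f}/hour")
```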

Spot Instances in Practice

Spot Fleet and Mixed Instances Policy

Use a mixed instances policy in your Auto Scaling group to diversify across multiple instance types and AZs. This reduces the probability of all Spot capacity being reclaimed simultaneously.

# Auto Scaling group with mixed instances policy (AWS CLI)
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-mixed-asg \
  --min-size 2 \
  --max-size 20 \
  --desired-capacity 6 \
  --vpc-zone-identifier "subnet-aaa,subnet-bbb,subnet-ccc" \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "m5a.xlarge"},
        {"InstanceType": "m6i.xlarge"},
        {"InstanceType": "m6a.xlarge"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 20,
      "SpotAllocationStrategy": "capacity-optimized"
    }
  }'
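To sanity-check the InstancesDistribution above, the sketch below computes how a given desired capacity splits between On-Demand and Spot. Rounding the On-Demand share up is an assumption here; verify against your ASG's observed behavior.

```python
import math

# Split desired capacity into On-Demand and Spot per an ASG
# InstancesDistribution. Rounding the On-Demand share up is an
# assumption; verify against the ASG's actual behavior.

def capacity_split(desired: int, od_base: int, od_pct_above_base: int) -> tuple[int, int]:
    above_base = max(0, desired - od_base)
    od_above = math.ceil(above_base * od_pct_above_base / 100)
    on_demand = min(desired, od_base) + od_above
    return on_demand, desired - on_demand

# The policy above: base 2 On-Demand, 20% On-Demand above base
for desired in (2, 6, 20):
    od, spot = capacity_split(desired, od_base=2, od_pct_above_base=20)
    print(f"desired={desired}: {od} On-Demand, {spot} Spot")
```

At the desired capacity of 6 from the command above, this yields 3 On-Demand and 3 Spot instances.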

Interruption Handling User Data Script

#!/bin/bash
# /etc/spot-interruption-handler.sh
# Polls instance metadata for Spot interruption notice every 5 seconds.
# When detected: drains application, flushes queue, deregisters from ALB.

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

while true; do
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/spot/termination-time)

  if [ "$HTTP_CODE" -eq 200 ]; then
    echo "Spot interruption notice received. Starting graceful shutdown..."

    # Stop accepting new requests
    systemctl stop nginx

    # Deregister from ALB target group (TARGET_GROUP_ARN must be set in the environment)
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)

    aws elbv2 deregister-targets \
      --target-group-arn "$TARGET_GROUP_ARN" \
      --targets Id="$INSTANCE_ID"

    # Flush any in-flight work (application-specific)
    /usr/local/bin/app-graceful-shutdown.sh

    break
  fi

  sleep 5
done

S3 Cost Optimization

Storage Classes

Storage Class Use Case Min Storage Duration Retrieval Latency Approx. Cost/GB/mo
S3 Standard Frequently accessed data None Milliseconds $0.023
S3 Standard-IA Infrequently accessed, rapid retrieval 30 days Milliseconds $0.0125
S3 One Zone-IA Non-critical, infrequent, single AZ 30 days Milliseconds $0.01
S3 Intelligent-Tiering Unknown or changing access patterns None Milliseconds (frequent/infrequent tiers) $0.023 + $0.0025/1K objects monitoring fee
S3 Glacier Instant Retrieval Archive with millisecond access once/quarter 90 days Milliseconds $0.004
S3 Glacier Flexible Retrieval Archive, 1–5 min to 5–12 hr retrieval 90 days Minutes to hours $0.0036
S3 Glacier Deep Archive Long-term retention (compliance, cold backup) 180 days 12–48 hours $0.00099
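Standard-IA's lower storage price comes with a per-GB retrieval fee, so the right class depends on access frequency. A rough break-even sketch using the storage prices from the table plus an assumed $0.01/GB Standard-IA retrieval fee:

```python
# Monthly cost per GB: S3 Standard vs Standard-IA, as a function of how
# many times the data is fully read per month. Storage prices from the
# table above; the $0.01/GB Standard-IA retrieval fee is an assumed figure.

STANDARD_GB = 0.023
STANDARD_IA_GB = 0.0125
IA_RETRIEVAL_GB = 0.01  # assumed retrieval fee, $/GB

def monthly_cost_per_gb(reads_per_month: float, storage_class: str) -> float:
    if storage_class == "STANDARD":
        return STANDARD_GB
    if storage_class == "STANDARD_IA":
        return STANDARD_IA_GB + reads_per_month * IA_RETRIEVAL_GB
    raise ValueError(storage_class)

# IA wins only while retrieval fees stay under the storage saving:
# break-even near (0.023 - 0.0125) / 0.01, i.e. about one full read per month
for reads in (0, 1, 2):
    std = monthly_cost_per_gb(reads, "STANDARD")
    ia = monthly_cost_per_gb(reads, "STANDARD_IA")
    print(f"{reads} reads/mo: Standard ${std:.4f}/GB, Standard-IA ${ia:.4f}/GB")
```

This is why Intelligent-Tiering is the safer default when access patterns are unknown: it avoids paying retrieval fees on data that turns out to be hot.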

S3 Lifecycle Policy Example

{
  "Rules": [
    {
      "ID": "archive-old-logs",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555
      },
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "NoncurrentDays": 90,
          "StorageClass": "GLACIER_IR"
        }
      ],
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 365
      }
    }
  ]
}

# Apply lifecycle policy to an S3 bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle-policy.json

EBS Cost Optimization

gp2 to gp3 Migration

gp3 volumes are up to 20% cheaper than gp2 at the same size, while delivering a guaranteed baseline of 3,000 IOPS and 125 MB/s throughput (vs gp2's burst model). Additional IOPS and throughput on gp3 are purchased separately, but most workloads do not need to exceed the free baseline.
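The 20% saving can be checked against list prices. The sketch below uses commonly cited us-east-1 rates; treat them as assumptions and confirm against the current AWS price list.

```python
# gp2 vs gp3 monthly cost. Prices are commonly cited us-east-1 list
# rates and may be stale; confirm against the current AWS price list.

GP2_GB = 0.10            # $/GB-month
GP3_GB = 0.08            # $/GB-month
GP3_EXTRA_IOPS = 0.005   # $/provisioned IOPS-month above the 3,000 baseline
GP3_EXTRA_TPUT = 0.04    # $/(MB/s)-month above the 125 MB/s baseline

def gp2_monthly(size_gb: int) -> float:
    return size_gb * GP2_GB

def gp3_monthly(size_gb: int, iops: int = 3000, throughput_mbs: int = 125) -> float:
    extra_iops = max(0, iops - 3000) * GP3_EXTRA_IOPS
    extra_tput = max(0, throughput_mbs - 125) * GP3_EXTRA_TPUT
    return size_gb * GP3_GB + extra_iops + extra_tput

size = 500
print(f"500 GB gp2: ${gp2_monthly(size):.2f}/mo")
print(f"500 GB gp3: ${gp3_monthly(size):.2f}/mo (baseline IOPS/throughput)")
print(f"500 GB gp3 @ 6000 IOPS: ${gp3_monthly(size, iops=6000):.2f}/mo")
```

Note the crossover: a gp3 volume with heavily provisioned extra IOPS can cost more than the equivalent gp2, so check provisioned IOPS before bulk migration.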

# Identify all gp2 volumes in a region
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,State:State}' \
  --output table

# Modify a single gp2 volume to gp3 (no downtime required)
aws ec2 modify-volume \
  --volume-id vol-0abcdef1234567890 \
  --volume-type gp3

# Bulk migrate all gp2 volumes using a loop
for vol_id in $(aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].VolumeId' \
  --output text); do
  echo "Migrating $vol_id to gp3..."
  aws ec2 modify-volume --volume-id "$vol_id" --volume-type gp3
done

Snapshot Cleanup

# List snapshots older than 90 days owned by your account
# (GNU date shown; on macOS use: date -u -v-90d +%Y-%m-%d)
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%d)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='${CUTOFF}'].{ID:SnapshotId,Date:StartTime,Size:VolumeSize}" \
  --output table

# Delete a snapshot
aws ec2 delete-snapshot --snapshot-id snap-0abcdef1234567890
Warning: Before deleting EBS snapshots, verify that no AMIs depend on them (aws ec2 describe-images --filters Name=block-device-mapping.snapshot-id,Values=snap-xxx). Deleting a snapshot referenced by an AMI will break the AMI.

Data Transfer Costs

Data transfer charges are often underestimated and can account for 10–20% of total AWS spend for data-intensive applications. Understanding the pricing tiers is essential for architecture decisions.

Transfer Path Approximate Cost Notes
Same AZ (EC2 to EC2, private IP) Free Must use private IP within the same AZ
Same region, different AZ $0.01/GB (both directions) Each GB transferred incurs cost in both send and receive ($0.02/GB total)
Same region via public/Elastic IP $0.01/GB (both directions) Even within same AZ — avoid using public IPs for internal traffic
Cross-region (AWS backbone) ~$0.02/GB Route-dependent; US to EU is ~$0.02/GB
Internet egress (first 10 TB/mo) $0.09/GB Waived for CloudFront origin fetch
Internet egress via CloudFront ~$0.0085–0.085/GB Tiered; much cheaper than direct EC2 egress at scale
S3 to EC2 (same region) Free No charge for S3 data transfer within the same region to EC2
VPC Endpoint (Gateway type) to S3/DynamoDB Free Removes NAT Gateway data processing charges for S3/DynamoDB traffic

How to Minimize Data Transfer Costs

  • Place resources that communicate frequently in the same AZ and use private IPs.
  • Use VPC Gateway Endpoints for S3 and DynamoDB to eliminate NAT Gateway charges on that traffic.
  • Use CloudFront to cache and serve static content — CloudFront egress pricing is lower than EC2 direct egress.
  • Use S3 Transfer Acceleration only when uploading from distant geographies; it adds cost for closer regions.
  • For cross-region replication, evaluate whether the data truly needs to be replicated or whether a single-region read replica with regional routing would suffice.
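The bullet points above translate directly into dollars. A sketch estimating monthly cross-AZ chatter cost at the table's $0.01/GB-per-direction rate (an assumption to verify for your region):

```python
# Estimate monthly cross-AZ transfer cost for service-to-service traffic.
# Uses the $0.01/GB-per-direction rate from the table above; verify the
# rate for your region.

PER_GB_EACH_DIRECTION = 0.01

def cross_az_monthly_cost(gb_per_day: float) -> float:
    """Both sender and receiver are billed, so each GB costs $0.02 in total."""
    return gb_per_day * 30 * PER_GB_EACH_DIRECTION * 2

# A chatty microservice pair exchanging 200 GB/day across AZs
print(f"${cross_az_monthly_cost(200):.2f}/month")
```

Even modest inter-service traffic adds up, which is why co-locating chatty services in one AZ (with a failover story) is often worth the architectural discussion.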

RDS Cost Optimization

Reserved Instances for RDS

RDS supports Reserved Instances with 1-year and 3-year terms. For production databases running 24/7, a 1-year No Upfront RI typically saves ~30% versus On-Demand. A 3-year All Upfront RI can save up to 60%. Purchase RIs for your largest, most stable database instances first.

Multi-AZ Only for Production

Multi-AZ RDS deployments are ~2x the cost of Single-AZ. Enforce a policy: Multi-AZ only for production environments. Development and staging databases should be Single-AZ. Use environment tags in Terraform to enforce this automatically.

Aurora Serverless v2 for Variable Workloads

Aurora Serverless v2 scales in fine-grained increments (0.5 ACU steps), and recent engine versions support automatic pause to 0 ACUs when idle (older versions have a 0.5 ACU floor). For workloads with highly variable traffic — such as internal tools used only during business hours, or development databases — Aurora Serverless v2 can cut costs by 50–70% compared to a provisioned Aurora cluster running 24/7.
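The 50–70% claim can be sanity-checked with ACU math. The sketch below assumes a $0.12 per ACU-hour rate and a $0.52/hr provisioned instance price; both are illustrative and region-dependent.

```python
# Aurora Serverless v2 vs provisioned cost for a business-hours workload.
# The $0.12/ACU-hour rate and the $0.52/hr provisioned instance price are
# assumed illustrative figures; check current regional pricing.

ACU_HOUR = 0.12
HOURS_PER_MONTH = 730

def serverless_monthly(busy_hours_per_day: float, busy_acus: float,
                       idle_acus: float = 0.5) -> float:
    busy = busy_hours_per_day * 30 * busy_acus
    idle = (24 - busy_hours_per_day) * 30 * idle_acus
    return (busy + idle) * ACU_HOUR

# Provisioned: one mid-size instance 24/7 at an assumed $0.52/hr
provisioned = 0.52 * HOURS_PER_MONTH

# Internal tool: ~4 ACUs for 10 business hours/day, 0.5 ACU floor otherwise
serverless = serverless_monthly(busy_hours_per_day=10, busy_acus=4)
print(f"Provisioned: ${provisioned:.0f}/mo, Serverless v2: ${serverless:.0f}/mo")
print(f"Savings: {1 - serverless / provisioned:.0%}")
```

Under these assumptions the serverless cluster lands at roughly half the provisioned cost; the saving grows if the engine version supports pausing to 0 ACUs overnight.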

# Create an Aurora Serverless v2 cluster
aws rds create-db-cluster \
  --db-cluster-identifier my-serverless-cluster \
  --engine aurora-mysql \
  --engine-version 8.0.mysql_aurora.3.04.0 \
  --serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=16 \
  --db-subnet-group-name my-subnet-group \
  --vpc-security-group-ids sg-0abc12345 \
  --master-username admin \
  --master-user-password "$(aws secretsmanager get-secret-value \
    --secret-id db-master-password --query SecretString --output text)"

# The cluster has no compute until at least one db.serverless instance is added
aws rds create-db-instance \
  --db-instance-identifier my-serverless-instance-1 \
  --db-cluster-identifier my-serverless-cluster \
  --db-instance-class db.serverless \
  --engine aurora-mysql

AWS Cost Explorer

AWS Cost Explorer provides interactive cost and usage data visualization. The following filters and saved reports are most useful for ongoing cost management.

Report / Filter Purpose Key Settings
Monthly costs by service Identify which services are driving the most spend Group by: Service; Time: monthly; 6-month view
Daily spend trend Detect anomalies and unexpected spikes Group by: Service; Time: daily; 30-day view; compare to prior period
Costs by tag (Environment) Understand spend per environment (prod / staging / dev) Group by: Tag:Environment; Time: monthly
RI Coverage report Measure what % of eligible On-Demand hours are covered by RIs Coverage > 80% is a healthy target; below 60% indicates opportunity
Savings Plans Coverage Measure Savings Plans utilization and coverage rate Target > 90% utilization; coverage gap shows On-Demand overspend
Data Transfer breakdown Identify large data transfer charges by usage type Filter by Service: EC2-Other; Group by: Usage Type

AWS Budgets

AWS Budgets allows you to set custom cost and usage budgets and receive alerts via email or SNS when thresholds are breached.

# Create a monthly cost budget with alert at 80% and 100% via AWS CLI
aws budgets create-budget \
  --account-id "$(aws sts get-caller-identity --query Account --output text)" \
  --budget '{
    "BudgetName": "monthly-total-budget",
    "BudgetLimit": {
      "Amount": "5000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "[email protected]"
        },
        {
          "SubscriptionType": "SNS",
          "Address": "arn:aws:sns:ap-southeast-1:123456789012:cost-alerts"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "[email protected]"
        }
      ]
    }
  ]'

Automated Cost Governance: Lambda to Stop Idle Instances

The following Lambda function (Python) runs on a schedule and stops EC2 instances that have had average CPU utilization below 5% for the past 7 days, unless they carry a DoNotStop=true tag.

import boto3
import datetime

ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
SNS_TOPIC_ARN = 'arn:aws:sns:ap-southeast-1:123456789012:cost-alerts'

def get_avg_cpu(instance_id, days=7):
    """Return average CPUUtilization for the past N days."""
    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(days=days)
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start,
        EndTime=end,
        Period=86400 * days,  # single data point over the full period
        Statistics=['Average']
    )
    datapoints = resp.get('Datapoints', [])
    return datapoints[0]['Average'] if datapoints else None


def lambda_handler(event, context):
    paginator = ec2.get_paginator('describe_instances')
    stopped = []

    for page in paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    ):
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                iid = instance['InstanceId']
                tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}

                # Skip instances tagged DoNotStop=true
                if tags.get('DoNotStop', '').lower() == 'true':
                    continue

                avg_cpu = get_avg_cpu(iid)
                if avg_cpu is not None and avg_cpu < 5.0:
                    print(f"Stopping idle instance {iid} (avg CPU: {avg_cpu:.2f}%)")
                    ec2.stop_instances(InstanceIds=[iid])
                    stopped.append({'id': iid, 'name': tags.get('Name', ''), 'cpu': avg_cpu})

    if stopped:
        sns = boto3.client('sns')
        message = "Stopped idle EC2 instances (avg CPU < 5% over 7 days):\n"
        for s in stopped:
            message += f"  - {s['id']} ({s['name']}): {s['cpu']:.2f}%\n"
        sns.publish(TopicArn=SNS_TOPIC_ARN, Subject='AWS Cost: Idle Instances Stopped', Message=message)

    return {'stopped': len(stopped), 'instances': stopped}

Cost Allocation Tags

Tags are the foundation of cost accountability on AWS. Activate tags as Cost Allocation Tags in the Billing console so they appear in Cost Explorer and billing reports.

Recommended Tagging Strategy

Tag Key Example Values Purpose
Environment prod, staging, dev, sandbox Separate production from non-production spend
Team platform, backend, data, security Chargeback / showback to business units
Project customer-portal, data-lake, auth-service Attribute costs to products or initiatives
CostCenter CC-1001, CC-2003 Align with finance chargeback codes
ManagedBy terraform, cloudformation, manual Track IaC coverage; identify manually managed resources

# Tag multiple EC2 instances at once
aws ec2 create-tags \
  --resources i-0abc123 i-0def456 i-0ghi789 \
  --tags \
    Key=Environment,Value=prod \
    Key=Team,Value=platform \
    Key=Project,Value=customer-portal \
    Key=CostCenter,Value=CC-1001 \
    Key=ManagedBy,Value=terraform

# Find EC2 instances missing the 'Environment' tag
# (no --tag-filters here: filtering by Key=Environment would return only
# resources that already have the tag)
aws resourcegroupstaggingapi get-resources \
  --resource-type-filters ec2:instance \
  --query 'ResourceTagMappingList[?Tags[?Key==`Environment`] == `[]`].ResourceARN' \
  --output text

Enforcement tip: Use the AWS Config managed rule required-tags to detect resources missing mandatory tags. Pair it with a remediation workflow (for example, an SSM Automation document that publishes to an SNS topic) that notifies the resource owner's team, including a direct link to the resource in the AWS console.