AWS Cost Optimization
AWS Pricing Models Deep Dive
Choosing the right pricing model for each workload is the single most impactful cost decision you will make on AWS.
On-Demand
Pay as you go with no commitment, billed per second with a 60-second minimum for most operating systems (some commercial distributions still bill hourly). Best for unpredictable workloads, new applications being evaluated, and short-term projects. Serves as the baseline price from which all discounts are measured.
Reserved Instances — Standard
1-year or 3-year commitment to a specific instance type, OS, tenancy, and region. Up to 72% discount vs On-Demand. Payment options: All Upfront (deepest discount), Partial Upfront (moderate discount), No Upfront (smallest discount but no capital outlay). Cannot exchange for a different instance family.
Reserved Instances — Convertible
Same commitment structure as Standard RIs but with the ability to exchange for a different instance family, OS, or tenancy during the commitment period. Up to 66% discount. The flexibility premium over Standard RI is typically 6–8 percentage points.
Compute Savings Plans
Commit to a consistent spend ($/hour) for 1 or 3 years. Applies automatically to EC2 (any family, size, region, OS), AWS Fargate, and Lambda. Up to 66% discount. The most flexible commitment — ideal for organizations that frequently change instance families or migrate workloads between regions.
EC2 Instance Savings Plans
Commit to a specific EC2 instance family in a specific region. Up to 72% discount, matching Standard RI savings. Flexible within the family — covers any size, OS, and tenancy. Does not apply to Fargate or Lambda.
Spot Instances
Use spare AWS capacity at discounts of 60–90%+ vs On-Demand. The instance can be reclaimed with 2-minute notice. Best for fault-tolerant, stateless, or batch workloads. Use Spot Fleet or Auto Scaling groups with mixed instances policy to maintain capacity across multiple instance types and AZs.
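The trade-off across these models comes down to simple arithmetic: a commitment bills every hour whether the instance runs or not, while On-Demand bills only the hours used. A minimal sketch, using a hypothetical On-Demand rate (not an AWS list price), makes the break-even point explicit:

```python
# Sketch: break-even utilization for a commitment vs On-Demand.
# The $0.192/hr rate is a hypothetical illustration, not a quoted AWS price.

def effective_hourly(od_rate: float, discount: float) -> float:
    """Hourly rate after applying a commitment discount (0.66 = 66% off)."""
    return od_rate * (1 - discount)

def break_even_utilization(discount: float) -> float:
    """Fraction of hours an instance must run for a commitment to beat
    On-Demand. A commitment bills every hour; On-Demand bills used hours:
        od_rate * (1 - discount) * 1.0 = od_rate * utilization
    =>  utilization = 1 - discount
    """
    return 1 - discount

od = 0.192  # hypothetical On-Demand $/hr
print(f"Compute SP (66% off):  ${effective_hourly(od, 0.66):.4f}/hr")
print(f"Standard RI (72% off): ${effective_hourly(od, 0.72):.4f}/hr")
# A 1-year No Upfront RI at ~33% off only pays for itself if the
# instance runs more than ~67% of the time:
print(f"Break-even utilization at 33% discount: {break_even_utilization(0.33):.0%}")
```

The same logic explains why commitments should only cover your always-on baseline, with Spot or On-Demand absorbing the variable portion.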
EC2 Rightsizing with AWS Compute Optimizer
AWS Compute Optimizer uses machine learning to analyze 14 days of CloudWatch metrics and produce rightsizing recommendations for EC2 instances. It classifies each recommendation as Over-provisioned, Under-provisioned, or Optimized, with an estimated monthly savings and a performance risk indicator.
Key CloudWatch Metrics to Analyze
| Metric | Namespace | What It Tells You |
|---|---|---|
| CPUUtilization | AWS/EC2 | Hypervisor-level CPU usage; available without the CloudWatch agent |
| mem_used_percent | CWAgent | Memory utilization; requires the CloudWatch agent to be installed |
| NetworkIn / NetworkOut | AWS/EC2 | Network throughput; helps identify network-bound instances |
| DiskReadOps / DiskWriteOps | AWS/EC2 | IOPS consumed; relevant for storage-optimized instance selection |
| EBSReadBytes / EBSWriteBytes | AWS/EC2 | EBS throughput; useful for gp3 vs io2 selection |
# Retrieve Compute Optimizer EC2 recommendations via CLI
aws compute-optimizer get-ec2-instance-recommendations \
--filters name=Finding,values=Overprovisioned \
--output table \
--query 'instanceRecommendations[*].{
Instance:instanceArn,
Current:currentInstanceType,
Recommended:recommendationOptions[0].instanceType,
EstimatedSavings:recommendationOptions[0].estimatedMonthlySavings.value
}'
Reserved Instance Strategies
1-Year vs 3-Year Commitment
A 3-year Standard RI delivers roughly 10–15 percentage points more discount than a 1-year RI. However, locking in a 3-year commitment for a workload that may change significantly within that period erodes the value. As a rule of thumb:
- Use 3-year for core, stable infrastructure (database servers, domain controllers, always-on application tiers) that will not change instance family within the commitment window.
- Use 1-year for workloads that may change size or family — or use Convertible RIs to retain the ability to exchange.
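The rule of thumb above can be made quantitative: if there is a real chance the workload is retired or re-platformed mid-term, the committed spend is still owed, which eats into the deeper 3-year discount. A rough expected-value sketch (all discounts and probabilities are illustrative assumptions, not AWS figures):

```python
# Sketch: expected savings of a commitment when the workload might be
# retired early. Discounts and retirement probabilities are assumptions.

def expected_savings(discount: float, years: int, p_retire_per_year: float,
                     annual_od_cost: float) -> float:
    """Expected net savings vs On-Demand over the term, assuming the
    committed spend is still owed after the workload is retired."""
    p_alive = 1.0
    total = 0.0
    committed_annual = annual_od_cost * (1 - discount)
    for _ in range(years):
        # savings accrue only while the workload is still running
        total += p_alive * (annual_od_cost - committed_annual)
        # committed cost is still owed if the workload was retired
        total -= (1 - p_alive) * committed_annual
        p_alive *= (1 - p_retire_per_year)
    return total

od = 10_000.0  # hypothetical annual On-Demand cost
print(expected_savings(0.40, 1, 0.2, od))  # 1-year at 40% off
print(expected_savings(0.55, 3, 0.2, od))  # 3-year at 55% off
```

Running the numbers with your own retirement risk is a better guide than defaulting to the headline 3-year discount.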
Payment Options
| Payment Option | Typical Discount (1yr Standard) | Capital Requirement | Best For |
|---|---|---|---|
| All Upfront | ~40% | High (full year paid today) | Organizations with capital budget and strong cost discipline |
| Partial Upfront | ~38% | Medium (50% upfront) | Balance between cash flow and discount depth |
| No Upfront | ~33% | None | Opex-constrained teams; still ~33% cheaper than On-Demand |
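Whether All Upfront is worth the cash outlay depends on your cost of capital. A minimal present-value sketch, using the approximate discounts from the table and an assumed 12% annual cost of capital:

```python
# Sketch: compare RI payment options with a simple cost-of-capital
# adjustment. The 12% annual rate and dollar figures are illustrative.

def effective_cost(upfront: float, monthly: float,
                   annual_capital_rate: float = 0.12) -> float:
    """Present value of one year of payments: upfront paid today plus
    twelve monthly payments discounted at a monthly rate."""
    r = annual_capital_rate / 12
    pv_monthly = sum(monthly / (1 + r) ** m for m in range(1, 13))
    return upfront + pv_monthly

annual_od = 12_000.0  # hypothetical annual On-Demand cost
all_upfront = effective_cost(annual_od * 0.60, 0)       # ~40% off, paid today
no_upfront = effective_cost(0, annual_od * 0.67 / 12)   # ~33% off, paid monthly
print(f"All Upfront PV: ${all_upfront:,.0f}")
print(f"No Upfront PV:  ${no_upfront:,.0f}")
```

Even after discounting future payments, All Upfront usually wins on pure cost; No Upfront wins when cash is constrained or the capital earns more elsewhere.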
Convertible RI Exchange
To exchange a Convertible RI, the new RI must have equal or greater value than the RI being surrendered. AWS will apply a pro-rated credit for any remaining value. Use exchanges when:
- You need to move to a newer generation (e.g., m5 to m6i) for better price/performance.
- The workload has grown and needs a larger instance type within the same family.
- You need to change the operating system (e.g., RHEL to Amazon Linux).
Savings Plans Strategy
Compute Savings Plans vs EC2 Instance Savings Plans
| Attribute | Compute Savings Plans | EC2 Instance Savings Plans |
|---|---|---|
| Max discount | Up to 66% | Up to 72% |
| Scope | EC2 (any family/region/OS) + Fargate + Lambda | EC2 in a specific instance family and region only |
| Flexibility | Highest — no instance family or region lock-in | Medium — locked to family and region; flexible on size/OS/tenancy |
| Recommendation | Use for Fargate, Lambda, or when you expect to change EC2 families | Use when you are confident the instance family and region are stable |
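The recommendation row of the table reduces to a two-question decision. A small helper that transcribes it directly (this is a reading of the table above, not an AWS API):

```python
# Sketch: the Savings Plans decision rule from the table, as code.

def choose_savings_plan(uses_fargate_or_lambda: bool,
                        family_and_region_stable: bool) -> str:
    """Compute SP trades discount depth for flexibility; EC2 Instance SP
    trades flexibility for the deeper (RI-matching) discount."""
    if uses_fargate_or_lambda or not family_and_region_stable:
        return "Compute Savings Plan (up to 66%)"
    return "EC2 Instance Savings Plan (up to 72%)"

print(choose_savings_plan(uses_fargate_or_lambda=True, family_and_region_stable=True))
print(choose_savings_plan(uses_fargate_or_lambda=False, family_and_region_stable=True))
```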
Spot Instances in Practice
Spot Fleet and Mixed Instances Policy
Use a mixed instances policy in your Auto Scaling group to diversify across multiple instance types and AZs. This reduces the probability of all Spot capacity being reclaimed simultaneously.
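The diversification benefit can be sketched numerically. If each Spot capacity pool (instance type x AZ combination) has an independent probability p of being reclaimed in a given window, the chance of losing every pool at once is p^n. Independence is a simplifying assumption; real pools within a family are somewhat correlated:

```python
# Sketch: probability that ALL Spot pools are reclaimed simultaneously,
# assuming independent reclamation with probability p per pool.

def p_all_pools_reclaimed(p: float, pools: int) -> float:
    return p ** pools

for n in (1, 4, 12):  # e.g. 4 instance types x 3 AZs = 12 pools
    print(f"{n:>2} pools: {p_all_pools_reclaimed(0.10, n):.2e}")
```

Even under weaker assumptions, spreading across four types and three AZs makes simultaneous total loss far less likely than pinning one type in one AZ.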
# Auto Scaling group with mixed instances policy (AWS CLI)
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name my-mixed-asg \
--min-size 2 \
--max-size 20 \
--desired-capacity 6 \
--vpc-zone-identifier "subnet-aaa,subnet-bbb,subnet-ccc" \
--mixed-instances-policy '{
"LaunchTemplate": {
"LaunchTemplateSpecification": {
"LaunchTemplateName": "my-launch-template",
"Version": "$Latest"
},
"Overrides": [
{"InstanceType": "m5.xlarge"},
{"InstanceType": "m5a.xlarge"},
{"InstanceType": "m6i.xlarge"},
{"InstanceType": "m6a.xlarge"}
]
},
"InstancesDistribution": {
"OnDemandBaseCapacity": 2,
"OnDemandPercentageAboveBaseCapacity": 20,
"SpotAllocationStrategy": "capacity-optimized"
}
}'
Interruption Handling User Data Script
#!/bin/bash
# /etc/spot-interruption-handler.sh
# Polls instance metadata for Spot interruption notice every 5 seconds.
# When detected: drains application, flushes queue, deregisters from ALB.
# Assumes TARGET_GROUP_ARN is exported in the environment before launch.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
while true; do
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
-H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/spot/termination-time)
if [ "$HTTP_CODE" -eq 200 ]; then
echo "Spot interruption notice received. Starting graceful shutdown..."
# Stop accepting new requests
systemctl stop nginx
# Deregister from ALB target group
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id)
aws elbv2 deregister-targets \
--target-group-arn "$TARGET_GROUP_ARN" \
--targets Id="$INSTANCE_ID"
# Flush any in-flight work (application-specific)
/usr/local/bin/app-graceful-shutdown.sh
break
fi
sleep 5
done
S3 Cost Optimization
Storage Classes
| Storage Class | Use Case | Min Storage Duration | Retrieval Latency | Approx. Cost/GB/mo |
|---|---|---|---|---|
| S3 Standard | Frequently accessed data | None | Milliseconds | $0.023 |
| S3 Standard-IA | Infrequently accessed, rapid retrieval | 30 days | Milliseconds | $0.0125 |
| S3 One Zone-IA | Non-critical, infrequent, single AZ | 30 days | Milliseconds | $0.01 |
| S3 Intelligent-Tiering | Unknown or changing access patterns | None | Milliseconds (frequent/infrequent tiers) | $0.023 + $0.0025/1K objects monitoring fee |
| S3 Glacier Instant Retrieval | Archive accessed roughly once a quarter, needs millisecond retrieval | 90 days | Milliseconds | $0.004 |
| S3 Glacier Flexible Retrieval | Archive, 1–5 min to 5–12 hr retrieval | 90 days | Minutes to hours | $0.0036 |
| S3 Glacier Deep Archive | Long-term retention (compliance, cold backup) | 180 days | 12–48 hours | $0.00099 |
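At scale, the per-GB differences in the table compound quickly. A minimal calculator using the table's ballpark us-east-1 prices (storage only; retrieval, request, and monitoring fees are excluded for simplicity):

```python
# Sketch: monthly storage cost by class, using the approximate per-GB
# prices from the table above. Verify against current S3 pricing.

PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(gb: float, storage_class: str) -> float:
    return gb * PRICE_PER_GB[storage_class]

for cls in PRICE_PER_GB:
    print(f"{cls:<12} 10 TB/mo: ${monthly_cost(10_240, cls):,.2f}")
```

The gap between Standard and Deep Archive is roughly 23x, which is why lifecycle transitions (next section) matter so much for log and backup data.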
S3 Lifecycle Policy Example
{
"Rules": [
{
"ID": "archive-old-logs",
"Status": "Enabled",
"Filter": {
"Prefix": "logs/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER_IR"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 2555
},
"NoncurrentVersionTransitions": [
{
"NoncurrentDays": 30,
"StorageClass": "STANDARD_IA"
},
{
"NoncurrentDays": 90,
"StorageClass": "GLACIER_IR"
}
],
"NoncurrentVersionExpiration": {
"NoncurrentDays": 365
}
}
]
}
# Apply lifecycle policy to an S3 bucket
aws s3api put-bucket-lifecycle-configuration \
--bucket my-bucket \
--lifecycle-configuration file://lifecycle-policy.json
EBS Cost Optimization
gp2 to gp3 Migration
gp3 volumes are up to 20% cheaper than gp2 at the same size, while delivering a guaranteed baseline of 3,000 IOPS and 125 MB/s throughput (vs gp2's burst model). Additional IOPS and throughput on gp3 are purchased separately, but most workloads do not need to exceed the free baseline.
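The fleet-wide impact of the migration is easy to estimate. A sketch using commonly cited us-east-1 list prices ($0.10/GB-month for gp2, $0.08/GB-month for gp3); treat the figures as assumptions and check current EBS pricing:

```python
# Sketch: monthly savings from a gp2 -> gp3 migration at assumed prices.
# Ignores gp3 add-on IOPS/throughput, which most workloads don't need.

GP2_PER_GB = 0.10
GP3_PER_GB = 0.08

def gp3_savings(size_gb: int, volumes: int = 1) -> float:
    return (GP2_PER_GB - GP3_PER_GB) * size_gb * volumes

# e.g. 50 volumes of 500 GB each
print(f"Monthly savings: ${gp3_savings(500, volumes=50):,.2f}")
```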
# Identify all gp2 volumes in a region
aws ec2 describe-volumes \
--filters Name=volume-type,Values=gp2 \
--query 'Volumes[*].{ID:VolumeId,Size:Size,State:State}' \
--output table
# Modify a single gp2 volume to gp3 (no downtime required)
aws ec2 modify-volume \
--volume-id vol-0abcdef1234567890 \
--volume-type gp3
# Bulk migrate all gp2 volumes using a loop
for vol_id in $(aws ec2 describe-volumes \
--filters Name=volume-type,Values=gp2 \
--query 'Volumes[*].VolumeId' \
--output text); do
echo "Migrating $vol_id to gp3..."
aws ec2 modify-volume --volume-id "$vol_id" --volume-type gp3
done
Snapshot Cleanup
# List snapshots older than 90 days owned by your account
# (GNU date; on macOS use: date -v-90d +%Y-%m-%d)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<=\`$(date -d '90 days ago' +%Y-%m-%d)\`].{ID:SnapshotId,Date:StartTime,Size:VolumeSize}" \
  --output table
# Delete a snapshot
aws ec2 delete-snapshot --snapshot-id snap-0abcdef1234567890
Before deleting, check whether the snapshot is referenced by an AMI (aws ec2 describe-images --filters Name=block-device-mapping.snapshot-id,Values=snap-xxx). Deleting a snapshot referenced by an AMI will break the AMI.
Data Transfer Costs
Data transfer charges are often underestimated and can account for 10–20% of total AWS spend for data-intensive applications. Understanding the pricing tiers is essential for architecture decisions.
| Transfer Path | Approximate Cost | Notes |
|---|---|---|
| Same AZ (EC2 to EC2, private IP) | Free | Must use private IP within the same AZ |
| Same region, different AZ | $0.01/GB (both directions) | Each GB transferred incurs cost in both send and receive ($0.02/GB total) |
| Same region via public/Elastic IP | $0.01/GB (both directions) | Even within same AZ — avoid using public IPs for internal traffic |
| Cross-region (AWS backbone) | ~$0.02/GB | Route-dependent; US to EU is ~$0.02/GB |
| Internet egress (first 10 TB/mo) | $0.09/GB | Waived for CloudFront origin fetch |
| Internet egress via CloudFront | ~$0.0085–0.085/GB | Tiered; much cheaper than direct EC2 egress at scale |
| S3 to EC2 (same region) | Free | No charge for S3 data transfer within the same region to EC2 |
| VPC Endpoint (Gateway type) to S3/DynamoDB | Free | Removes NAT Gateway data processing charges for S3/DynamoDB traffic |
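The rates in the table look small until you multiply by monthly volume. A quick calculator using the approximate rates above (real internet egress tiers down after 10 TB; this uses the flat first-tier rate):

```python
# Sketch: monthly data transfer cost at the approximate rates from the
# table above. Rates are ballpark figures, not quoted AWS prices.

def cross_az_cost(gb_per_month: float) -> float:
    # $0.01/GB each direction => $0.02/GB round trip
    return gb_per_month * 0.02

def internet_egress_cost(gb_per_month: float) -> float:
    # flat $0.09/GB first-tier sketch; real pricing tiers down after 10 TB
    return gb_per_month * 0.09

print(f"5 TB/mo cross-AZ chatter: ${cross_az_cost(5_120):,.2f}")
print(f"5 TB/mo internet egress:  ${internet_egress_cost(5_120):,.2f}")
```

A chatty microservice pair split across AZs can quietly cost more per month than the instances themselves are saved by rightsizing.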
How to Minimize Data Transfer Costs
- Place resources that communicate frequently in the same AZ and use private IPs.
- Use VPC Gateway Endpoints for S3 and DynamoDB to eliminate NAT Gateway charges on that traffic.
- Use CloudFront to cache and serve static content — CloudFront egress pricing is lower than EC2 direct egress.
- Use S3 Transfer Acceleration only when uploading from distant geographies; it adds cost for closer regions.
- For cross-region replication, evaluate whether the data truly needs to be replicated or whether a single-region read replica with regional routing would suffice.
RDS Cost Optimization
Reserved Instances for RDS
RDS supports Reserved Instances with 1-year and 3-year terms. For production databases running 24/7, a 1-year No Upfront RI typically saves ~30% versus On-Demand. A 3-year All Upfront RI can save up to 60%. Purchase RIs for your largest, most stable database instances first.
Multi-AZ Only for Production
Multi-AZ RDS deployments are ~2x the cost of Single-AZ. Enforce a policy: Multi-AZ only for production environments. Development and staging databases should be Single-AZ. Use environment tags in Terraform to enforce this automatically.
Aurora Serverless v2 for Variable Workloads
Aurora Serverless v2 scales in fine-grained increments (0.5 ACU steps) and, on recent engine versions, can pause down to zero ACUs during idle periods. For workloads with highly variable traffic, such as internal tools used only during business hours or development databases, Aurora Serverless v2 can cut costs by 50–70% compared to a provisioned Aurora cluster running 24/7.
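The savings claim is straightforward to sanity-check with ACU-hour math. The rates below ($0.12/ACU-hour and the provisioned $/hr figure) are illustrative assumptions; verify against current Aurora pricing for your region:

```python
# Sketch: business-hours Aurora Serverless v2 vs a 24/7 provisioned
# instance. All rates are assumptions, not quoted AWS prices.

ACU_HOUR = 0.12  # assumed $/ACU-hour

def serverless_monthly(acus_busy: float, busy_hours: float,
                       acus_idle: float, idle_hours: float) -> float:
    return ACU_HOUR * (acus_busy * busy_hours + acus_idle * idle_hours)

# ~176 business hours/month at 8 ACUs, 0.5 ACU floor the rest (554 hrs)
sv2 = serverless_monthly(8, 176, 0.5, 554)
provisioned = 0.52 * 730  # hypothetical $/hr for a comparable instance, 24/7
print(f"Serverless v2: ${sv2:,.2f}/mo  vs provisioned: ${provisioned:,.2f}/mo")
```

The gap widens further if the cluster can pause to zero ACUs overnight instead of holding the 0.5 ACU floor.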
# Create an Aurora Serverless v2 cluster
aws rds create-db-cluster \
--db-cluster-identifier my-serverless-cluster \
--engine aurora-mysql \
--engine-version 8.0.mysql_aurora.3.04.0 \
--serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=16 \
--db-subnet-group-name my-subnet-group \
--vpc-security-group-ids sg-0abc12345 \
--master-username admin \
--master-user-password "$(aws secretsmanager get-secret-value \
--secret-id db-master-password --query SecretString --output text)"
AWS Cost Explorer
AWS Cost Explorer provides interactive cost and usage data visualization. The following filters and saved reports are most useful for ongoing cost management.
| Report / Filter | Purpose | Key Settings |
|---|---|---|
| Monthly costs by service | Identify which services are driving the most spend | Group by: Service; Time: monthly; 6-month view |
| Daily spend trend | Detect anomalies and unexpected spikes | Group by: Service; Time: daily; 30-day view; compare to prior period |
| Costs by tag (Environment) | Understand spend per environment (prod / staging / dev) | Group by: Tag:Environment; Time: monthly |
| RI Coverage report | Measure what % of eligible On-Demand hours are covered by RIs | Coverage > 80% is a healthy target; below 60% indicates opportunity |
| Savings Plans Coverage | Measure Savings Plans utilization and coverage rate | Target > 90% utilization; coverage gap shows On-Demand overspend |
| Data Transfer breakdown | Identify large data transfer charges by usage type | Filter by Service: EC2-Other; Group by: Usage Type |
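These reports can also be pulled programmatically. The "monthly costs by service" view maps onto the Cost Explorer GetCostAndUsage API; the sketch below builds the request parameters (the date range is hypothetical), which can then be passed to boto3.client('ce').get_cost_and_usage(**params) in an account with Cost Explorer enabled:

```python
# Sketch: request parameters for the "monthly costs by service" report,
# matching the Cost Explorer GetCostAndUsage API shape.

def monthly_by_service_params(start: str, end: str) -> dict:
    return {
        "TimePeriod": {"Start": start, "End": end},  # ISO dates, end exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }

params = monthly_by_service_params("2025-01-01", "2025-07-01")
print(params["Granularity"], params["GroupBy"][0]["Key"])
```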
AWS Budgets
AWS Budgets allows you to set custom cost and usage budgets and receive alerts via email or SNS when thresholds are breached.
# Create a monthly cost budget with alert at 80% and 100% via AWS CLI
aws budgets create-budget \
--account-id "$(aws sts get-caller-identity --query Account --output text)" \
--budget '{
"BudgetName": "monthly-total-budget",
"BudgetLimit": {
"Amount": "5000",
"Unit": "USD"
},
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{
"SubscriptionType": "EMAIL",
"Address": "[email protected]"
},
{
"SubscriptionType": "SNS",
"Address": "arn:aws:sns:ap-southeast-1:123456789012:cost-alerts"
}
]
},
{
"Notification": {
"NotificationType": "FORECASTED",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{
"SubscriptionType": "EMAIL",
"Address": "[email protected]"
}
]
}
]'
Automated Cost Governance: Lambda to Stop Idle Instances
The following Lambda function (Python) runs on a schedule and stops EC2 instances that have had average CPU utilization below 5% for the past 7 days, unless they carry a DoNotStop=true tag.
import boto3
import datetime
ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
SNS_TOPIC_ARN = 'arn:aws:sns:ap-southeast-1:123456789012:cost-alerts'
def get_avg_cpu(instance_id, days=7):
"""Return average CPUUtilization for the past N days."""
    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(days=days)
resp = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start,
EndTime=end,
Period=86400 * days, # single data point over the full period
Statistics=['Average']
)
datapoints = resp.get('Datapoints', [])
return datapoints[0]['Average'] if datapoints else None
def lambda_handler(event, context):
paginator = ec2.get_paginator('describe_instances')
stopped = []
for page in paginator.paginate(
Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
):
for reservation in page['Reservations']:
for instance in reservation['Instances']:
iid = instance['InstanceId']
tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
# Skip instances tagged DoNotStop=true
if tags.get('DoNotStop', '').lower() == 'true':
continue
avg_cpu = get_avg_cpu(iid)
if avg_cpu is not None and avg_cpu < 5.0:
print(f"Stopping idle instance {iid} (avg CPU: {avg_cpu:.2f}%)")
ec2.stop_instances(InstanceIds=[iid])
stopped.append({'id': iid, 'name': tags.get('Name', ''), 'cpu': avg_cpu})
if stopped:
sns = boto3.client('sns')
message = "Stopped idle EC2 instances (avg CPU < 5% over 7 days):\n"
for s in stopped:
message += f" - {s['id']} ({s['name']}): {s['cpu']:.2f}%\n"
sns.publish(TopicArn=SNS_TOPIC_ARN, Subject='AWS Cost: Idle Instances Stopped', Message=message)
return {'stopped': len(stopped), 'instances': stopped}
Cost Allocation Tags
Tags are the foundation of cost accountability on AWS. Activate tags as Cost Allocation Tags in the Billing console so they appear in Cost Explorer and billing reports.
Recommended Tagging Strategy
| Tag Key | Example Values | Purpose |
|---|---|---|
| Environment | prod, staging, dev, sandbox | Separate production from non-production spend |
| Team | platform, backend, data, security | Chargeback / showback to business units |
| Project | customer-portal, data-lake, auth-service | Attribute costs to products or initiatives |
| CostCenter | CC-1001, CC-2003 | Align with finance chargeback codes |
| ManagedBy | terraform, cloudformation, manual | Track IaC coverage; identify manually managed resources |
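Tag coverage is worth measuring continuously, not just fixing once. A sketch that computes per-key coverage from a resource inventory; the input shape mirrors what resourcegroupstaggingapi get-resources returns (ResourceARN plus a list of {Key, Value} tags), and the sample data is hypothetical:

```python
# Sketch: per-key tag coverage across a resource inventory.

REQUIRED_KEYS = {"Environment", "Team", "Project", "CostCenter", "ManagedBy"}

def tag_coverage(resources: list[dict]) -> dict[str, float]:
    """Fraction of resources carrying each required tag key."""
    counts = {k: 0 for k in REQUIRED_KEYS}
    for res in resources:
        keys = {t["Key"] for t in res.get("Tags", [])}
        for k in REQUIRED_KEYS & keys:
            counts[k] += 1
    n = len(resources) or 1
    return {k: counts[k] / n for k in sorted(counts)}

inventory = [  # hypothetical sample resources
    {"ResourceARN": "arn:aws:ec2:...:instance/i-1",
     "Tags": [{"Key": "Environment", "Value": "prod"},
              {"Key": "Team", "Value": "platform"}]},
    {"ResourceARN": "arn:aws:ec2:...:instance/i-2", "Tags": []},
]
print(tag_coverage(inventory))
```

Feeding this from a scheduled get-resources export gives a coverage trend you can put on a dashboard next to spend.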
# Tag multiple EC2 instances at once
aws ec2 create-tags \
--resources i-0abc123 i-0def456 i-0ghi789 \
--tags \
Key=Environment,Value=prod \
Key=Team,Value=platform \
Key=Project,Value=customer-portal \
Key=CostCenter,Value=CC-1001 \
Key=ManagedBy,Value=terraform
# Find EC2 instances missing the 'Environment' tag
# (no --tag-filters here: that flag would restrict results to resources
# that HAVE the tag, defeating the purpose)
aws resourcegroupstaggingapi get-resources \
  --resource-type-filters ec2:instance \
  --query 'ResourceTagMappingList[?Tags[?Key==`Environment`] == `[]`].ResourceARN' \
  --output text
For continuous enforcement, use the AWS Config managed rule required-tags to detect resources missing mandatory tags. Pair it with an auto-remediation action that sends a notification to the resource owner's team Slack channel, including a direct link to the resource in the AWS console.