Chaos Engineering
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in its ability to withstand turbulent conditions in production. Rather than waiting for failures to occur, chaos engineering proactively discovers weaknesses before they become outages.
Principles of Chaos Engineering
1. Define a Steady State
Identify a measurable output that represents normal system behavior. Use SLIs — request success rate, p99 latency, throughput. The experiment validates that steady state is maintained under failure conditions.
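A steady-state baseline can be captured directly from your monitoring system before any fault is injected. A minimal sketch using promtool against Prometheus (the endpoint and metric names are assumptions; substitute your own SLI queries):
# Success-rate SLI: ratio of non-5xx requests over the last 5 minutes
promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# p99 latency SLI
promtool query instant http://localhost:9090 \
  'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'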
2. Hypothesize Steady State Continues
Form a testable hypothesis: "If we terminate 30% of pods in the payment service, the overall checkout success rate will remain above 99.5%." Clear hypothesis = clear pass/fail criteria.
3. Vary Real-World Events
Inject failures that mirror actual incidents: pod crashes, network latency, disk full, CPU saturation, dependency timeouts, AZ failures, DNS resolution failures, certificate expiry.
4. Run Experiments in Production
Staging environments don't reflect production load, data size, or interaction patterns. Start with staging, but production is the ultimate validation. Begin with blast radius = 1 pod, 1% traffic.
5. Automate Experiments Continuously
Manual GameDays are valuable but infrequent. Integrate chaos experiments into CI/CD pipelines. Run lightweight experiments continuously in staging; scheduled experiments in production.
6. Minimize Blast Radius
Always have an abort mechanism. Start small — kill 1 pod before killing 30%. Expand scope only after confidence is established. Never run experiments during peak traffic or deploys.
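An abort mechanism can be as simple as a metric alarm wired into the experiment's stop condition. A minimal sketch of the kind of CloudWatch alarm referenced by the FIS templates later in this page (namespace, dimensions, and threshold are illustrative assumptions):
aws cloudwatch put-metric-alarm \
  --alarm-name chaos-abort-alarm \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/web-api/1234567890abcdef \
  --statistic Sum --period 60 --evaluation-periods 1 \
  --threshold 50 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching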
Chaos Maturity Model
Level 1 — Manual GameDays
Planned, manually executed failure scenarios. Team assembles, injects failure, observes response. Valuable for initial discovery but infrequent and labor-intensive.
- Monthly or quarterly scheduled sessions
- Manual kubectl delete pod, kill -9, iptables DROP (see the sketch below)
- Documented playbooks and hypotheses
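A sketch of what those manual injections typically look like (namespaces, labels, IPs, and ports are placeholders; agree on targets and revert steps before running anything):
# Kill one pod and watch Kubernetes recreate it
POD=$(kubectl get pods -n staging -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod -n staging "$POD"
# Add latency on a node interface (run on the node, requires root); revert with 'tc qdisc del dev eth0 root'
tc qdisc add dev eth0 root netem delay 200ms
# Black-hole traffic to a dependency; revert by replacing -A with -D
iptables -A OUTPUT -p tcp -d 10.0.12.34 --dport 5432 -j DROP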
Level 2 — Automated Experiments
Chaos experiments defined as code, executed automatically on a schedule or triggered by CI/CD. Results reported to dashboards and Slack.
- Chaos Mesh / LitmusChaos / AWS FIS experiments as YAML
- Scheduled runs in staging (nightly) and production (weekly)
- Automated pass/fail based on SLO thresholds
Level 3 — Continuous Chaos
Low-level failures injected continuously in production (Netflix Chaos Monkey model). System must self-heal. Engineering culture fully embraces failure as learning.
- Random pod termination always enabled in production
- Chaos integrated into every deployment pipeline
- Reliability metrics tracked as engineering KPIs
Tools Comparison
| Tool | Type | K8s Native | Fault Types | Cost |
|---|---|---|---|---|
| Chaos Mesh | Open Source | Yes (CRDs) | Pod, Network, Stress, HTTP, IO, Time | Free |
| LitmusChaos | Open Source | Yes (CRDs) | Pod, Node, Network, AWS/GCP faults | Free / Enterprise |
| AWS FIS | Managed AWS service | Via EKS actions | EC2, EKS, RDS, ECS, API throttling | Pay per action |
| Gremlin | Commercial SaaS | Yes | Resource, Network, State, Application | Paid |
| Chaos Monkey | Open Source | No (EC2/ASG) | Instance termination only | Free |
AWS Fault Injection Service (FIS)
Experiment Template — EC2 Instance Termination
# fis-terminate-instances.json
{
"description": "Terminate 30% of EC2 instances in ASG to test auto-recovery",
"targets": {
"MyInstances": {
"resourceType": "aws:ec2:instance",
"resourceTags": {
"Environment": "staging",
"Service": "web-api"
},
"selectionMode": "PERCENT(30)"
}
},
"actions": {
"TerminateInstances": {
"actionId": "aws:ec2:terminate-instances",
"parameters": {},
"targets": {
"Instances": "MyInstances"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:ap-southeast-1:123456789:alarm/chaos-abort-alarm"
}
],
"roleArn": "arn:aws:iam::123456789:role/FISExperimentRole",
"tags": {
"Purpose": "chaos-engineering"
}
}
Experiment Template — AZ Outage Simulation
# fis-az-outage.json — Stop all instances in ap-southeast-1a
{
"description": "Simulate AZ failure: stop all instances in ap-southeast-1a",
"targets": {
"AZInstances": {
"resourceType": "aws:ec2:instance",
"filters": [
{
"path": "Placement.AvailabilityZone",
"values": ["ap-southeast-1a"]
},
{
"path": "State.Name",
"values": ["running"]
}
],
"selectionMode": "ALL"
}
},
"actions": {
"StopInstances": {
"actionId": "aws:ec2:stop-instances",
"parameters": {
"startInstancesAfterDuration": "PT10M"
},
"targets": { "Instances": "AZInstances" }
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:ap-southeast-1:123456789:alarm/error-rate-critical"
}
],
"roleArn": "arn:aws:iam::123456789:role/FISExperimentRole"
}
FIS CLI Commands
# Create experiment template
aws fis create-experiment-template \
--cli-input-json file://fis-terminate-instances.json \
--region ap-southeast-1
# List templates
aws fis list-experiment-templates
# Start experiment
aws fis start-experiment \
--experiment-template-id EXT1234567890abcdef \
--region ap-southeast-1
# Monitor experiment
aws fis get-experiment --id EXP1234567890abcdef
# Stop experiment (abort)
aws fis stop-experiment --id EXP1234567890abcdef
# View experiment history
aws fis list-experiments \
  --query "experiments[?experimentTemplateId=='EXT1234567890abcdef']"
Chaos Mesh on Kubernetes
Installation
# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
# Install Chaos Mesh (with dashboard)
helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace=chaos-mesh \
--create-namespace \
--set dashboard.create=true \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock \
--version 2.6.3
# Verify installation
kubectl get pods -n chaos-mesh
PodChaos — Pod Kill & Failure
# Kill one random pod in payment-service (Chaos Mesh 2.x runs recurring
# kills through the Schedule CRD shown later in this page)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-payment-pod
  namespace: chaos-testing
spec:
  action: pod-kill # pod-kill | pod-failure | container-kill
  mode: one # one | all | fixed | fixed-percent | random-max-percent
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
---
# Inject pod failure for 5 minutes (pod stays but is in failed state)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: payment-pod-failure
namespace: chaos-testing
spec:
action: pod-failure
mode: fixed-percent
value: "30" # Fail 30% of matching pods
duration: "5m"
selector:
namespaces: [production]
labelSelectors:
app: payment-service
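Chaos Mesh objects can be inspected and aborted like any other Kubernetes resource. A minimal sketch for the experiments above:
# Check injection status and recorded events
kubectl get podchaos -n chaos-testing
kubectl describe podchaos payment-pod-failure -n chaos-testing
# Abort early by deleting the chaos object (Chaos Mesh restores pod-failure targets)
kubectl delete podchaos payment-pod-failure -n chaos-testing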
NetworkChaos — Latency, Loss, Partition
# Add 100ms ± 20ms latency between order-service and inventory-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: order-to-inventory-latency
namespace: chaos-testing
spec:
action: delay
mode: all
selector:
namespaces: [production]
labelSelectors:
app: order-service
delay:
latency: "100ms"
correlation: "25"
jitter: "20ms"
direction: to
target:
mode: all
selector:
namespaces: [production]
labelSelectors:
app: inventory-service
duration: "10m"
---
# 20% packet loss between services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-packet-loss
namespace: chaos-testing
spec:
action: loss
mode: all
selector:
namespaces: [production]
labelSelectors:
app: frontend
loss:
loss: "20"
correlation: "25"
direction: both
duration: "5m"
---
# Network partition — isolate payment-service from all outbound traffic
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: payment-isolation
namespace: chaos-testing
spec:
action: partition
mode: all
selector:
namespaces: [production]
labelSelectors:
app: payment-service
direction: both
duration: "2m"
StressChaos — CPU & Memory Pressure
# CPU stress — saturate 2 CPU cores at 90% for 10 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: cpu-stress-api
namespace: chaos-testing
spec:
mode: one
selector:
namespaces: [production]
labelSelectors:
app: api-gateway
stressors:
cpu:
workers: 2 # Number of CPU workers
load: 90 # CPU load percentage
duration: "10m"
---
# Memory stress — consume 512MB for 5 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: memory-stress-api
namespace: chaos-testing
spec:
mode: one
selector:
namespaces: [production]
labelSelectors:
app: api-gateway
stressors:
memory:
workers: 1
size: "512MB"
duration: "5m"
HTTPChaos — Inject HTTP Faults
# Inject 500ms delay into 50% of HTTP responses from user-service
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
name: user-service-delay
namespace: chaos-testing
spec:
mode: all
selector:
namespaces: [production]
labelSelectors:
app: user-service
target: Response
port: 8080
path: "/api/*"
delay: "500ms"
percent: 50
duration: "5m"
---
# Abort 10% of requests with HTTP 503
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
name: user-service-abort
namespace: chaos-testing
spec:
mode: all
selector:
namespaces: [production]
labelSelectors:
app: user-service
target: Response
port: 8080
abort: true
percent: 10
duration: "3m"
Schedule — Recurring Chaos
# Run pod-kill experiment every day at 2am (staging)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
name: daily-pod-chaos
namespace: chaos-testing
spec:
schedule: "0 2 * * *" # Cron: 2am daily
historyLimit: 5
type: PodChaos
podChaos:
action: pod-kill
mode: one
selector:
namespaces: [staging]
labelSelectors:
chaos-enabled: "true"
LitmusChaos
Installation
# Install LitmusChaos via Helm
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install chaos litmuschaos/litmus \
--namespace=litmus \
--create-namespace \
--set portal.frontend.service.type=LoadBalancer
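A quick check that the control plane came up and where the portal is exposed:
kubectl get pods -n litmus
kubectl get svc -n litmus   # the LoadBalancer frontend service serves the ChaosCenter UI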
ChaosEngine Example — Pod Delete
# Install pod-delete experiment
kubectl apply -f \
https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/pod-delete/experiment.yaml
---
# ChaosEngine — define target and experiment parameters
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: payment-chaos
namespace: production
spec:
appinfo:
appns: production
applabel: app=payment-service
appkind: deployment
chaosServiceAccount: litmus-admin
monitoring: true
jobCleanUpPolicy: retain
experiments:
- name: pod-delete
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "60" # seconds
- name: CHAOS_INTERVAL
value: "10" # kill a pod every 10s
- name: FORCE
value: "false" # graceful termination
- name: PODS_AFFECTED_PERC
value: "33" # 33% of pods
GameDay Planning
GameDay Template
1. Hypothesis
If [failure condition],
then [expected system behavior],
because [the mechanism that should protect us],
and we will verify by [measurement / SLI].
Example: "If the primary PostgreSQL instance is unavailable for 2 minutes, then checkout requests will continue to succeed at >99%, because the application uses connection retry with a 30s timeout and the replica will be promoted within 60s, and we will verify by monitoring the checkout_success_rate SLI."
2. Pre-Game Checklist
- ✅ Monitoring dashboard open (Grafana, CloudWatch)
- ✅ Blast radius defined and agreed (max impact scope)
- ✅ Rollback/abort procedure documented and tested
- ✅ Stakeholders notified (not during peak hours)
- ✅ On-call engineer on standby
- ✅ Steady-state baseline captured (current SLI values)
- ✅ Stop condition configured (alarm that halts experiment)
Real GameDay Scenario: Primary Database Failure
# Scenario: "What happens when the primary RDS instance goes down?"
# Step 1: Capture steady state
curl -s https://api.myapp.com/health | jq .
# Checkout success rate: 99.95%
# P95 latency: 180ms
# Step 2: Start monitoring
# Open Grafana dashboard: checkout_success_rate, db_connection_errors
# Step 3: Inject failure — reboot primary RDS (AWS FIS or manual)
aws rds reboot-db-instance \
--db-instance-identifier prod-postgres-primary \
--force-failover # Triggers automatic failover to replica
# Step 4: Observe (2-minute observation window)
# Expected: brief connection errors during failover (~30-60s)
# Expected: automatic failover to replica
# Expected: success rate recovers to >99% within 90s
# Step 5: Record results
# Actual failover time: 47 seconds
# Success rate dip: 94% for 47s (violated SLO!)
# Root cause: connection pool not configured for retry
# Step 6: Document findings and action items
# Action 1: Configure db connection pool with retry (deadline: 1 week)
# Action 2: Add read replica health check to load balancer
# Action 3: Set RDS Multi-AZ for automatic failover
Post-GameDay Report Template
# GameDay Report — YYYY-MM-DD
# Experiment: Primary database failover
# Team: Platform + Backend
## Hypothesis
If primary RDS fails, checkout success rate stays above 99%
because application retries with 30s timeout.
## Results
| Metric | Baseline | During Chaos | Recovery |
|-------------------|----------|--------------|----------|
| Success rate | 99.95% | 94% | 99.9% |
| P95 latency | 180ms | 8200ms | 190ms |
| Failover duration | N/A | 47 seconds | N/A |
## Hypothesis: FAILED (success rate dropped below 99%)
## Root Causes
1. Connection pool has no retry — first request to new primary fails
2. Application timeout (5s) shorter than failover time (47s)
## Action Items
1. [P0] Add connection retry with backoff — @backend-team — due 2024-04-01
2. [P1] Increase db connection timeout to 60s — @backend-team — due 2024-04-01
3. [P2] Add chaos experiment to CI/CD pipeline — @platform-team — due 2024-04-15
## What Went Well
- Monitoring detected the issue immediately
- Automatic RDS failover worked as expected
- Team communicated effectively during the experiment
Chaos Engineering in CI/CD
# .github/workflows/chaos-staging.yml
name: Chaos Experiment — Staging
on:
schedule:
- cron: "0 3 * * 1-5" # Weekdays at 3am
workflow_dispatch:
inputs:
experiment:
description: "Experiment name"
required: true
default: "pod-kill"
jobs:
chaos-experiment:
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- name: Configure kubectl
run: echo "${{ secrets.STAGING_KUBECONFIG }}" | base64 -d > kubeconfig
- name: Capture steady state
env:
KUBECONFIG: ./kubeconfig
run: |
          SUCCESS_RATE=$(kubectl exec -n monitoring deploy/prometheus \
            -- promtool query instant http://localhost:9090 \
            'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
            | awk '{print $(NF-1)}')  # output looks like "{} => 0.9995 @[ts]"; keep the value
          echo "Baseline success rate: $SUCCESS_RATE"
          echo "BASELINE=$SUCCESS_RATE" >> $GITHUB_ENV
- name: Apply chaos experiment
env:
KUBECONFIG: ./kubeconfig
run: |
kubectl apply -f chaos/staging/pod-kill-experiment.yaml
echo "Chaos experiment started — waiting 5 minutes..."
sleep 300
- name: Validate SLO during chaos
env:
KUBECONFIG: ./kubeconfig
run: |
          SUCCESS_RATE=$(kubectl exec -n monitoring deploy/prometheus \
            -- promtool query instant http://localhost:9090 \
            'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
            | awk '{print $(NF-1)}')
          echo "Success rate during chaos: $SUCCESS_RATE"
          # Fail the step if the success rate dropped below 99%
          python3 -c "
          rate = float('$SUCCESS_RATE')
          if rate < 0.99:
              print(f'SLO VIOLATION: {rate:.4f} < 0.99')
              exit(1)
          print(f'SLO OK: {rate:.4f}')
          "
- name: Cleanup experiment
if: always()
env:
KUBECONFIG: ./kubeconfig
run: |
kubectl delete -f chaos/staging/pod-kill-experiment.yaml --ignore-not-found
- name: Post results to Slack
if: always()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "Chaos Experiment completed: ${{ job.status }}",
"blocks": [{
"type": "section",
"text": { "type": "mrkdwn",
"text": "*Chaos Experiment*: pod-kill staging\n*Result*: ${{ job.status }}\n*Run*: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" }
}]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
Safety Guardrails
- Always configure a stop condition (CloudWatch alarm / Prometheus alert) that automatically aborts the experiment if impact exceeds threshold
- Never run chaos experiments during deployments, peak traffic hours, or maintenance windows
- Always have a human abort button — someone watching dashboards with authority to stop immediately
- Never run in production without first proving in staging
- Communicate to all stakeholders before running production experiments
Getting Started Path
- Kill 1 pod of a non-critical service in staging
- Verify auto-recovery works (Kubernetes restarts it)
- Add retry logic where needed
- Graduate to multi-pod, then network faults, then AZ failures
Building a Chaos Engineering Culture
- Start with leadership buy-in — frame chaos as risk reduction, not risk creation
- Blameless environment — findings expose system weaknesses, not human failures
- Track reliability improvements — measure MTTR before/after each experiment cycle
- Celebrate learnings — a failed hypothesis is a successful experiment (you learned something)
- Publish results — share GameDay reports across engineering; build institutional knowledge
- Include chaos in oncall — oncall engineers run monthly GameDays to stay sharp
Next Steps
- Reliability Patterns — Circuit breakers, retries, and capacity planning
- Incident Management — Respond effectively when chaos becomes real
- SRE Overview — SLOs, error budgets, and the Four Golden Signals