Chaos Engineering

Chaos Engineering — deliberately inject failures into systems to build confidence in their ability to withstand turbulent, real-world conditions.

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in its ability to withstand turbulent conditions in production. Rather than waiting for failures to occur, chaos engineering proactively discovers weaknesses before they become outages.

Principles of Chaos Engineering

1. Define a Steady State

Identify a measurable output that represents normal system behavior. Use SLIs — request success rate, p99 latency, throughput. The experiment validates that steady state is maintained under failure conditions.
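
For example, the baseline can be captured directly from Prometheus before an experiment — a minimal sketch, assuming Prometheus is reachable at http://prometheus:9090 and exposes an http_requests_total counter (both names are illustrative):

# Sketch: capture a steady-state SLI (request success rate) from Prometheus
# (Prometheus URL and metric names are illustrative assumptions)
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq -r '.data.result[0].value[1]'
# Example output: 0.9995  — this number becomes the steady-state baseline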

2. Hypothesize Steady State Continues

Form a testable hypothesis: "If we terminate 30% of pods in the payment service, the overall checkout success rate will remain above 99.5%." Clear hypothesis = clear pass/fail criteria.

3. Vary Real-World Events

Inject failures that mirror actual incidents: pod crashes, network latency, disk full, CPU saturation, dependency timeouts, AZ failures, DNS resolution failures, certificate expiry.

4. Run Experiments in Production

Staging environments don't reflect production load, data size, or interaction patterns. Start with staging, but production is the ultimate validation. Begin with blast radius = 1 pod, 1% traffic.

5. Automate Experiments Continuously

Manual GameDays are valuable but infrequent. Integrate chaos experiments into CI/CD pipelines. Run lightweight experiments continuously in staging; scheduled experiments in production.

6. Minimize Blast Radius

Always have an abort mechanism. Start small — kill 1 pod before killing 30%. Expand scope only after confidence is established. Never run experiments during peak traffic or deploys.

Chaos Maturity Model

Level 1 — Manual GameDays

Planned, manually executed failure scenarios. Team assembles, injects failure, observes response. Valuable for initial discovery but infrequent and labor-intensive.

  • Monthly or quarterly scheduled sessions
  • Manual kubectl delete pod, kill -9, iptables DROP (sketched below)
  • Documented playbooks and hypotheses
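
A minimal sketch of those manual injections (pod name, process name, and port are placeholders):

# Sketch: typical Level 1 manual fault injections (names/ports are placeholders)
kubectl delete pod payment-service-7d9f8-abcde -n production   # kill one pod
kill -9 "$(pgrep -f payment-worker | head -1)"                  # kill a process on a VM
iptables -A OUTPUT -p tcp --dport 5432 -j DROP                  # drop outbound DB traffic
# Remove the rule after the observation window:
iptables -D OUTPUT -p tcp --dport 5432 -j DROP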

Level 2 — Automated Experiments

Chaos experiments defined as code, executed automatically on a schedule or triggered by CI/CD. Results reported to dashboards and Slack.

  • Chaos Mesh / LitmusChaos / AWS FIS experiments defined as code (YAML / JSON)
  • Scheduled runs in staging (nightly) and production (weekly)
  • Automated pass/fail based on SLO thresholds

Level 3 — Continuous Chaos

Small-scale, low-impact failures injected continuously in production (Netflix Chaos Monkey model). System must self-heal. Engineering culture fully embraces failure as learning.

  • Random pod termination always enabled in production
  • Chaos integrated into every deployment pipeline
  • Reliability metrics tracked as engineering KPIs

Tools Comparison

| Tool         | Type            | K8s Native      | Fault Types                           | Cost              |
|--------------|-----------------|-----------------|---------------------------------------|-------------------|
| Chaos Mesh   | Open Source     | Yes (CRDs)      | Pod, Network, Stress, HTTP, IO, Time  | Free              |
| LitmusChaos  | Open Source     | Yes (CRDs)      | Pod, Node, Network, AWS/GCP faults    | Free / Enterprise |
| AWS FIS      | Managed SaaS    | Via EKS actions | EC2, EKS, RDS, ECS, API throttling    | Pay per action    |
| Gremlin      | Commercial SaaS | Yes             | Resource, Network, State, Application | Paid              |
| Chaos Monkey | Open Source     | No (EC2/ASG)    | Instance termination only             | Free              |

AWS Fault Injection Simulator (FIS)

Experiment Template — EC2 Instance Termination

# fis-terminate-instances.json
{
  "description": "Terminate 30% of EC2 instances in ASG to test auto-recovery",
  "targets": {
    "MyInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "staging",
        "Service": "web-api"
      },
      "selectionMode": "PERCENT(30)"
    }
  },
  "actions": {
    "TerminateInstances": {
      "actionId": "aws:ec2:terminate-instances",
      "parameters": {},
      "targets": {
        "Instances": "MyInstances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:ap-southeast-1:123456789:alarm/chaos-abort-alarm"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISExperimentRole",
  "tags": {
    "Purpose": "chaos-engineering"
  }
}
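
The stopConditions block above references a CloudWatch alarm that aborts the experiment automatically when impact exceeds a threshold. A sketch of creating such an alarm — the metric, namespace, and threshold are illustrative; wire it to your own SLI:

# Sketch: create the abort alarm referenced in stopConditions
# (metric, namespace, and threshold are illustrative assumptions)
aws cloudwatch put-metric-alarm \
  --alarm-name chaos-abort-alarm \
  --namespace "MyApp" \
  --metric-name 5xxErrorRate \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --region ap-southeast-1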

Experiment Template — AZ Outage Simulation

# fis-az-outage.json — Stop all instances in ap-southeast-1a
{
  "description": "Simulate AZ failure: stop all instances in ap-southeast-1a",
  "targets": {
    "AZInstances": {
      "resourceType": "aws:ec2:instance",
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["ap-southeast-1a"]
        },
        {
          "path": "State.Name",
          "values": ["running"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT10M"
      },
      "targets": { "Instances": "AZInstances" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:ap-southeast-1:123456789:alarm/error-rate-critical"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISExperimentRole"
}
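
Both templates reference roleArn — an IAM role that FIS assumes to act on your resources. A minimal sketch of creating it, with an inline policy scoped to the EC2 actions these experiments use (adjust to your account and resources):

# Sketch: create the FIS execution role referenced by roleArn
aws iam create-role \
  --role-name FISExperimentRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "fis.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }]
  }'

# Inline policy with only the EC2 actions used above (illustrative; scope to your resources)
aws iam put-role-policy \
  --role-name FISExperimentRole \
  --policy-name fis-ec2-actions \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["ec2:TerminateInstances", "ec2:StopInstances", "ec2:StartInstances", "ec2:DescribeInstances"],
      "Resource": "*"
    }]
  }'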

FIS CLI Commands

# Create experiment template
aws fis create-experiment-template \
  --cli-input-json file://fis-terminate-instances.json \
  --region ap-southeast-1

# List templates
aws fis list-experiment-templates

# Start experiment
aws fis start-experiment \
  --experiment-template-id EXT1234567890abcdef \
  --region ap-southeast-1

# Monitor experiment
aws fis get-experiment --id EXP1234567890abcdef

# Stop experiment (abort)
aws fis stop-experiment --id EXP1234567890abcdef

# View experiment history
aws fis list-experiments \
  --filter "experimentTemplateId=EXT1234567890abcdef"

Chaos Mesh on Kubernetes

Installation

# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh (with dashboard)
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=chaos-mesh \
  --create-namespace \
  --set dashboard.create=true \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --version 2.6.3

# Verify installation
kubectl get pods -n chaos-mesh

PodChaos — Pod Kill & Failure

# Kill one random pod in payment-service (one-shot run)
# Note: the in-spec scheduler/cron field was removed in Chaos Mesh 2.x —
# for recurring kills, wrap this in a Schedule resource (see "Schedule — Recurring Chaos" below)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-payment-pod
  namespace: chaos-testing
spec:
  action: pod-kill         # pod-kill | pod-failure | container-kill
  mode: one                # one | all | fixed | fixed-percent | random-max-percent
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service

---
# Inject pod failure for 5 minutes (pod stays but is in failed state)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-failure
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: fixed-percent
  value: "30"              # Fail 30% of matching pods
  duration: "5m"
  selector:
    namespaces: [production]
    labelSelectors:
      app: payment-service
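
To run and observe these experiments — a short sketch (the manifest filename is illustrative):

# Apply, inspect, and abort a PodChaos experiment (filename is illustrative)
kubectl apply -f pod-kill.yaml
kubectl get podchaos -n chaos-testing
kubectl describe podchaos payment-pod-failure -n chaos-testing   # shows injection events
kubectl delete podchaos payment-pod-failure -n chaos-testing     # abort / clean up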

NetworkChaos — Latency, Loss, Partition

# Add 100ms ± 20ms latency between order-service and inventory-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-to-inventory-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: order-service
  delay:
    latency: "100ms"
    correlation: "25"
    jitter: "20ms"
  direction: to
  target:
    mode: all
    selector:
      namespaces: [production]
      labelSelectors:
        app: inventory-service
  duration: "10m"
# 20% packet loss between services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-packet-loss
  namespace: chaos-testing
spec:
  action: loss
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: frontend
  loss:
    loss: "20"
    correlation: "25"
  direction: both
  duration: "5m"

---
# Network partition — isolate payment-service from all outbound traffic
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-isolation
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: payment-service
  direction: both
  duration: "2m"

StressChaos — CPU & Memory Pressure

# CPU stress — saturate 2 CPU cores at 90% for 10 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: api-gateway
  stressors:
    cpu:
      workers: 2          # Number of CPU workers
      load: 90            # CPU load percentage
  duration: "10m"

---
# Memory stress — consume 512MB for 5 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-api
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: api-gateway
  stressors:
    memory:
      workers: 1
      size: "512MB"
  duration: "5m"

HTTPChaos — Inject HTTP Faults

# Inject 500ms delay into 50% of HTTP responses from user-service
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: user-service-delay
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: user-service
  target: Response
  port: 8080
  path: "/api/*"
  delay: "500ms"
  percent: 50
  duration: "5m"

---
# Abort 10% of requests with HTTP 503
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: user-service-abort
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: user-service
  target: Response
  port: 8080
  abort: true
  percent: 10
  duration: "3m"

Schedule — Recurring Chaos

# Run pod-kill experiment every day at 2am (staging)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-pod-chaos
  namespace: chaos-testing
spec:
  schedule: "0 2 * * *"   # Cron: 2am daily
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [staging]
      labelSelectors:
        chaos-enabled: "true"
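
Schedules (like other Chaos Mesh objects) can be paused and resumed with the pause annotation — a sketch:

# Sketch: pause and resume a running schedule
kubectl annotate schedule daily-pod-chaos -n chaos-testing \
  experiment.chaos-mesh.org/pause=true
# Resume by removing the annotation
kubectl annotate schedule daily-pod-chaos -n chaos-testing \
  experiment.chaos-mesh.org/pause-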

LitmusChaos

Installation

# Install LitmusChaos via Helm
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

helm install chaos litmuschaos/litmus \
  --namespace=litmus \
  --create-namespace \
  --set portal.frontend.service.type=LoadBalancer

ChaosEngine Example — Pod Delete

# Install pod-delete experiment
kubectl apply -f \
  https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/pod-delete/experiment.yaml

---
# ChaosEngine — define target and experiment parameters
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  monitoring: true
  jobCleanUpPolicy: retain
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"         # seconds
            - name: CHAOS_INTERVAL
              value: "10"         # kill a pod every 10s
            - name: FORCE
              value: "false"      # graceful termination
            - name: PODS_AFFECTED_PERC
              value: "33"         # 33% of pods

GameDay Planning

GameDay Template

1. Hypothesis

If [failure condition],
then [expected system behavior],
because [the mechanism that should protect us],
and we will verify by [measurement / SLI].

Example: "If the primary PostgreSQL instance is unavailable for 2 minutes, then checkout requests will continue to succeed at >99%, because the application uses connection retry with a 30s timeout and the replica will be promoted within 60s, and we will verify by monitoring the checkout_success_rate SLI."

2. Pre-Game Checklist

  • ✅ Monitoring dashboard open (Grafana, CloudWatch)
  • ✅ Blast radius defined and agreed (max impact scope)
  • ✅ Rollback/abort procedure documented and tested
  • ✅ Stakeholders notified (not during peak hours)
  • ✅ On-call engineer on standby
  • ✅ Steady-state baseline captured (current SLI values)
  • ✅ Stop condition configured (alarm that halts experiment)

Real GameDay Scenario: Primary Database Failure

# Scenario: "What happens when the primary RDS instance goes down?"

# Step 1: Capture steady state
curl -s https://api.myapp.com/health | jq .
# Checkout success rate: 99.95%
# P95 latency: 180ms

# Step 2: Start monitoring
# Open Grafana dashboard: checkout_success_rate, db_connection_errors

# Step 3: Inject failure — reboot primary RDS (AWS FIS or manual)
aws rds reboot-db-instance \
  --db-instance-identifier prod-postgres-primary \
  --force-failover   # Forces failover to the standby (requires Multi-AZ)

# Step 4: Observe (2-minute observation window)
# Expected: brief connection errors during failover (~30-60s)
# Expected: automatic failover to replica
# Expected: success rate recovers to >99% within 90s

# Step 5: Record results
# Actual failover time: 47 seconds
# Success rate dip: 94% for 47s (violated SLO!)
# Root cause: connection pool not configured for retry

# Step 6: Document findings and action items
# Action 1: Configure db connection pool with retry (deadline: 1 week)
# Action 2: Add read replica health check to load balancer
# Action 3: Set RDS Multi-AZ for automatic failover

Post-GameDay Report Template

# GameDay Report — YYYY-MM-DD
# Experiment: Primary database failover
# Team: Platform + Backend

## Hypothesis
If primary RDS fails, checkout success rate stays above 99%
because application retries with 30s timeout.

## Results
| Metric            | Baseline | During Chaos | Recovery |
|-------------------|----------|--------------|----------|
| Success rate      | 99.95%   | 94%          | 99.9%    |
| P95 latency       | 180ms    | 8200ms       | 190ms    |
| Failover duration | N/A      | 47 seconds   | N/A      |

## Hypothesis: FAILED (success rate dropped below 99%)

## Root Causes
1. Connection pool has no retry — first request to new primary fails
2. Application timeout (5s) shorter than failover time (47s)

## Action Items
1. [P0] Add connection retry with backoff — @backend-team — due 2024-04-01
2. [P1] Increase db connection timeout to 60s — @backend-team — due 2024-04-01
3. [P2] Add chaos experiment to CI/CD pipeline — @platform-team — due 2024-04-15

## What Went Well
- Monitoring detected the issue immediately
- Automatic RDS failover worked as expected
- Team communicated effectively during the experiment

Chaos Engineering in CI/CD

# .github/workflows/chaos-staging.yml

name: Chaos Experiment — Staging

on:
  schedule:
    - cron: "0 3 * * 1-5"   # Weekdays at 3am
  workflow_dispatch:
    inputs:
      experiment:
        description: "Experiment name"
        required: true
        default: "pod-kill"

jobs:
  chaos-experiment:
    runs-on: ubuntu-latest
    environment: staging

    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        run: echo "${{ secrets.STAGING_KUBECONFIG }}" | base64 -d > kubeconfig

      - name: Capture steady state
        env:
          KUBECONFIG: ./kubeconfig
        run: |
          # promtool needs the Prometheus server URL; its output line looks like
          # "{} => 0.9995 @[timestamp]", so take the third field to get the bare number
          SUCCESS_RATE=$(kubectl exec -n monitoring deploy/prometheus \
            -- promtool query instant http://localhost:9090 \
            'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
            | awk '{print $3}')
          echo "Baseline success rate: $SUCCESS_RATE"
          echo "BASELINE=$SUCCESS_RATE" >> $GITHUB_ENV

      - name: Apply chaos experiment
        env:
          KUBECONFIG: ./kubeconfig
        run: |
          kubectl apply -f chaos/staging/pod-kill-experiment.yaml
          echo "Chaos experiment started — waiting 5 minutes..."
          sleep 300

      - name: Validate SLO during chaos
        env:
          KUBECONFIG: ./kubeconfig
        run: |
          # Same query as the baseline step: pass the server URL and extract the bare value
          SUCCESS_RATE=$(kubectl exec -n monitoring deploy/prometheus \
            -- promtool query instant http://localhost:9090 \
            'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
            | awk '{print $3}')
          echo "Success rate during chaos: $SUCCESS_RATE"
          # Fail if below 99%
          python3 -c "
          rate = float('$SUCCESS_RATE')
          if rate < 0.99:
              print(f'SLO VIOLATION: {rate:.4f} < 0.99')
              exit(1)
          print(f'SLO OK: {rate:.4f}')
          "

      - name: Cleanup experiment
        if: always()
        env:
          KUBECONFIG: ./kubeconfig
        run: |
          kubectl delete -f chaos/staging/pod-kill-experiment.yaml --ignore-not-found

      - name: Post results to Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Chaos Experiment completed: ${{ job.status }}",
              "blocks": [{
                "type": "section",
                "text": { "type": "mrkdwn",
                  "text": "*Chaos Experiment*: pod-kill staging\n*Result*: ${{ job.status }}\n*Run*: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" }
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

⚠️ Safety Rules — Never Skip These:
  • Always configure a stop condition (CloudWatch alarm / Prometheus alert) that automatically aborts the experiment if impact exceeds threshold
  • Never run chaos experiments during deployments, peak traffic hours, or maintenance windows
  • Always have a human abort button — someone watching dashboards with authority to stop immediately
  • Never run in production without first proving in staging
  • Communicate to all stakeholders before running production experiments

💡 Getting Started: Don't start with complex network partitions. Start simple:
  1. Kill 1 pod of a non-critical service in staging
  2. Verify auto-recovery works (Kubernetes restarts it)
  3. Add retry logic where needed
  4. Graduate to multi-pod, then network faults, then AZ failures

Building a Chaos Engineering Culture

✅ Organizational Principles:
  • Start with leadership buy-in — frame chaos as risk reduction, not risk creation
  • Blameless environment — findings expose system weaknesses, not human failures
  • Track reliability improvements — measure MTTR before/after each experiment cycle
  • Celebrate learnings — a failed hypothesis is a successful experiment (you learned something)
  • Publish results — share GameDay reports across engineering; build institutional knowledge
  • Include chaos in on-call — on-call engineers run monthly GameDays to stay sharp

Next Steps