Chaos Engineering

Chaos Engineering — deliberately inject failures into systems to build confidence in their ability to withstand turbulent, real-world conditions.

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in its ability to withstand turbulent conditions in production. Rather than waiting for failures to occur, chaos engineering proactively discovers weaknesses before they become outages.

Principles of Chaos Engineering

1. Define a Steady State

Identify a measurable output that represents normal system behavior. Use SLIs — request success rate, p99 latency, throughput. The experiment validates that steady state is maintained under failure conditions.
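
For example, the baseline can be captured directly from Prometheus before an experiment — a minimal sketch, assuming Prometheus is reachable at http://prometheus:9090 and exposes an http_requests_total counter (both names are illustrative):

# Sketch: capture a steady-state SLI (request success rate) from Prometheus
# (Prometheus URL and metric names are illustrative assumptions)
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq -r '.data.result[0].value[1]'
# Example output: 0.9995  — this number becomes the steady-state baseline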

2. Hypothesize Steady State Continues

Form a testable hypothesis: "If we terminate 30% of pods in the payment service, the overall checkout success rate will remain above 99.5%." Clear hypothesis = clear pass/fail criteria.

3. Vary Real-World Events

Inject failures that mirror actual incidents: pod crashes, network latency, disk full, CPU saturation, dependency timeouts, AZ failures, DNS resolution failures, certificate expiry.

4. Run Experiments in Production

Staging environments don't reflect production load, data size, or interaction patterns. Start with staging, but production is the ultimate validation. Begin with blast radius = 1 pod, 1% traffic.

5. Automate Experiments Continuously

Manual GameDays are valuable but infrequent. Integrate chaos experiments into CI/CD pipelines. Run lightweight experiments continuously in staging; scheduled experiments in production.

6. Minimize Blast Radius

Always have an abort mechanism. Start small — kill 1 pod before killing 30%. Expand scope only after confidence is established. Never run experiments during peak traffic or deploys.

Chaos Maturity Model

Level 1 — Manual GameDays

Planned, manually executed failure scenarios. Team assembles, injects failure, observes response. Valuable for initial discovery but infrequent and labor-intensive.

  • Monthly or quarterly scheduled sessions
  • Manual kubectl delete pod, kill -9, iptables DROP (sketched below)
  • Documented playbooks and hypotheses
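
A minimal sketch of those manual injections (pod name, process name, and port are placeholders):

# Sketch: typical Level 1 manual fault injections (names/ports are placeholders)
kubectl delete pod payment-service-7d9f8-abcde -n production   # kill one pod
kill -9 "$(pgrep -f payment-worker | head -1)"                  # kill a process on a VM
iptables -A OUTPUT -p tcp --dport 5432 -j DROP                  # drop outbound DB traffic
# Remove the rule after the observation window:
iptables -D OUTPUT -p tcp --dport 5432 -j DROP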

Level 2 — Automated Experiments

Chaos experiments defined as code, executed automatically on a schedule or triggered by CI/CD. Results reported to dashboards and Slack.

  • Chaos Mesh / LitmusChaos / AWS FIS experiments defined as code (YAML / JSON)
  • Scheduled runs in staging (nightly) and production (weekly)
  • Automated pass/fail based on SLO thresholds

Level 3 — Continuous Chaos

Small-scale, low-impact failures injected continuously in production (Netflix Chaos Monkey model). System must self-heal. Engineering culture fully embraces failure as learning.

  • Random pod termination always enabled in production
  • Chaos integrated into every deployment pipeline
  • Reliability metrics tracked as engineering KPIs

Tools Comparison

| Tool         | Type            | K8s Native      | Fault Types                           | Cost              |
|--------------|-----------------|-----------------|---------------------------------------|-------------------|
| Chaos Mesh   | Open Source     | Yes (CRDs)      | Pod, Network, Stress, HTTP, IO, Time  | Free              |
| LitmusChaos  | Open Source     | Yes (CRDs)      | Pod, Node, Network, AWS/GCP faults    | Free / Enterprise |
| AWS FIS      | Managed SaaS    | Via EKS actions | EC2, EKS, RDS, ECS, API throttling    | Pay per action    |
| Gremlin      | Commercial SaaS | Yes             | Resource, Network, State, Application | Paid              |
| Chaos Monkey | Open Source     | No (EC2/ASG)    | Instance termination only             | Free              |

AWS Fault Injection Simulator (FIS)

Experiment Template — EC2 Instance Termination

# fis-terminate-instances.json
{
  "description": "Terminate 30% of EC2 instances in ASG to test auto-recovery",
  "targets": {
    "MyInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "staging",
        "Service": "web-api"
      },
      "selectionMode": "PERCENT(30)"
    }
  },
  "actions": {
    "TerminateInstances": {
      "actionId": "aws:ec2:terminate-instances",
      "parameters": {},
      "targets": {
        "Instances": "MyInstances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:ap-southeast-1:123456789:alarm/chaos-abort-alarm"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISExperimentRole",
  "tags": {
    "Purpose": "chaos-engineering"
  }
}
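
The stopConditions block above references a CloudWatch alarm that aborts the experiment automatically when impact exceeds a threshold. A sketch of creating such an alarm — the metric, namespace, and threshold are illustrative; wire it to your own SLI:

# Sketch: create the abort alarm referenced in stopConditions
# (metric, namespace, and threshold are illustrative assumptions)
aws cloudwatch put-metric-alarm \
  --alarm-name chaos-abort-alarm \
  --namespace "MyApp" \
  --metric-name 5xxErrorRate \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --region ap-southeast-1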

Experiment Template — AZ Outage Simulation

# fis-az-outage.json — Stop all instances in ap-southeast-1a
{
  "description": "Simulate AZ failure: stop all instances in ap-southeast-1a",
  "targets": {
    "AZInstances": {
      "resourceType": "aws:ec2:instance",
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["ap-southeast-1a"]
        },
        {
          "path": "State.Name",
          "values": ["running"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "StopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT10M"
      },
      "targets": { "Instances": "AZInstances" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:ap-southeast-1:123456789:alarm/error-rate-critical"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISExperimentRole"
}
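
Both templates reference roleArn — an IAM role that FIS assumes to act on your resources. A minimal sketch of creating it, with an inline policy scoped to the EC2 actions these experiments use (adjust to your account and resources):

# Sketch: create the FIS execution role referenced by roleArn
aws iam create-role \
  --role-name FISExperimentRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "fis.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }]
  }'

# Inline policy with only the EC2 actions used above (illustrative; scope to your resources)
aws iam put-role-policy \
  --role-name FISExperimentRole \
  --policy-name fis-ec2-actions \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["ec2:TerminateInstances", "ec2:StopInstances", "ec2:StartInstances", "ec2:DescribeInstances"],
      "Resource": "*"
    }]
  }'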

FIS CLI Commands

# Create experiment template
aws fis create-experiment-template \
  --cli-input-json file://fis-terminate-instances.json \
  --region ap-southeast-1

# List templates
aws fis list-experiment-templates

# Start experiment
aws fis start-experiment \
  --experiment-template-id EXT1234567890abcdef \
  --region ap-southeast-1

# Monitor experiment
aws fis get-experiment --id EXP1234567890abcdef

# Stop experiment (abort)
aws fis stop-experiment --id EXP1234567890abcdef

# View experiment history
aws fis list-experiments \
  --filter "experimentTemplateId=EXT1234567890abcdef"

Chaos Mesh on Kubernetes

Installation

# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh (with dashboard)
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=chaos-mesh \
  --create-namespace \
  --set dashboard.create=true \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --version 2.6.3

# Verify installation
kubectl get pods -n chaos-mesh

PodChaos — Pod Kill & Failure

# Kill one random pod in payment-service (one-shot run)
# Note: the in-spec scheduler/cron field was removed in Chaos Mesh 2.x —
# for recurring kills, wrap this in a Schedule resource (see "Schedule — Recurring Chaos" below)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-payment-pod
  namespace: chaos-testing
spec:
  action: pod-kill         # pod-kill | pod-failure | container-kill
  mode: one                # one | all | fixed | fixed-percent | random-max-percent
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service

---
# Inject pod failure for 5 minutes (pod stays but is in failed state)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-failure
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: fixed-percent
  value: "30"              # Fail 30% of matching pods
  duration: "5m"
  selector:
    namespaces: [production]
    labelSelectors:
      app: payment-service
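
To run and observe these experiments — a short sketch (the manifest filename is illustrative):

# Apply, inspect, and abort a PodChaos experiment (filename is illustrative)
kubectl apply -f pod-kill.yaml
kubectl get podchaos -n chaos-testing
kubectl describe podchaos payment-pod-failure -n chaos-testing   # shows injection events
kubectl delete podchaos payment-pod-failure -n chaos-testing     # abort / clean up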

NetworkChaos — Latency, Loss, Partition

# Add 100ms ± 20ms latency between order-service and inventory-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-to-inventory-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: order-service
  delay:
    latency: "100ms"
    correlation: "25"
    jitter: "20ms"
  direction: to
  target:
    mode: all
    selector:
      namespaces: [production]
      labelSelectors:
        app: inventory-service
  duration: "10m"
# 20% packet loss between services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-packet-loss
  namespace: chaos-testing
spec:
  action: loss
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: frontend
  loss:
    loss: "20"
    correlation: "25"
  direction: both
  duration: "5m"

---
# Network partition — isolate payment-service from all outbound traffic
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-isolation
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: payment-service
  direction: both
  duration: "2m"

StressChaos — CPU & Memory Pressure

# CPU stress — saturate 2 CPU cores at 90% for 10 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: api-gateway
  stressors:
    cpu:
      workers: 2          # Number of CPU workers
      load: 90            # CPU load percentage
  duration: "10m"

---
# Memory stress — consume 512MB for 5 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-api
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: api-gateway
  stressors:
    memory:
      workers: 1
      size: "512MB"
  duration: "5m"

HTTPChaos — Inject HTTP Faults

# Inject 500ms delay into 50% of HTTP responses from user-service
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: user-service-delay
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: user-service
  target: Response
  port: 8080
  path: "/api/*"
  delay: "500ms"
  percent: 50
  duration: "5m"

---
# Abort 10% of requests with HTTP 503
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: user-service-abort
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: user-service
  target: Response
  port: 8080
  abort: true
  percent: 10
  duration: "3m"

Schedule — Recurring Chaos

# Run pod-kill experiment every day at 2am (staging)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-pod-chaos
  namespace: chaos-testing
spec:
  schedule: "0 2 * * *"   # Cron: 2am daily
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [staging]
      labelSelectors:
        chaos-enabled: "true"
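
Schedules (like other Chaos Mesh objects) can be paused and resumed with the pause annotation — a sketch:

# Sketch: pause and resume a running schedule
kubectl annotate schedule daily-pod-chaos -n chaos-testing \
  experiment.chaos-mesh.org/pause=true
# Resume by removing the annotation
kubectl annotate schedule daily-pod-chaos -n chaos-testing \
  experiment.chaos-mesh.org/pause-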

LitmusChaos

Installation

# Install LitmusChaos via Helm
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

helm install chaos litmuschaos/litmus \
  --namespace=litmus \
  --create-namespace \
  --set portal.frontend.service.type=LoadBalancer

ChaosEngine Example — Pod Delete

# Install pod-delete experiment
kubectl apply -f \
  https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/pod-delete/experiment.yaml

---
# ChaosEngine — define target and experiment parameters
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  monitoring: true
  jobCleanUpPolicy: retain
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"         # seconds
            - name: CHAOS_INTERVAL
              value: "10"         # kill a pod every 10s
            - name: FORCE
              value: "false"      # graceful termination
            - name: PODS_AFFECTED_PERC
              value: "33"         # 33% of pods

GameDay Planning

GameDay Template

1. Hypothesis

If [failure condition],
then [expected system behavior],
because [the mechanism that should protect us],
and we will verify by [measurement / SLI].

Example: "If the primary PostgreSQL instance is unavailable for 2 minutes, then checkout requests will continue to succeed at >99%, because the application uses connection retry with a 30s timeout and the replica will be promoted within 60s, and we will verify by monitoring the checkout_success_rate SLI."

2. Pre-Game Checklist

  • ✅ Monitoring dashboard open (Grafana, CloudWatch)
  • ✅ Blast radius defined and agreed (max impact scope)
  • ✅ Rollback/abort procedure documented and tested
  • ✅ Stakeholders notified (not during peak hours)
  • ✅ On-call engineer on standby
  • ✅ Steady-state baseline captured (current SLI values)
  • ✅ Stop condition configured (alarm that halts experiment)

Real GameDay Scenario: Primary Database Failure

# Scenario: "What happens when the primary RDS instance goes down?"

# Step 1: Capture steady state
curl -s https://api.myapp.com/health | jq .
# Checkout success rate: 99.95%
# P95 latency: 180ms

# Step 2: Start monitoring
# Open Grafana dashboard: checkout_success_rate, db_connection_errors

# Step 3: Inject failure — reboot primary RDS (AWS FIS or manual)
aws rds reboot-db-instance \
  --db-instance-identifier prod-postgres-primary \
  --force-failover   # Forces failover to the standby (requires Multi-AZ)

# Step 4: Observe (2-minute observation window)
# Expected: brief connection errors during failover (~30-60s)
# Expected: automatic failover to replica
# Expected: success rate recovers to >99% within 90s

# Step 5: Record results
# Actual failover time: 47 seconds
# Success rate dip: 94% for 47s (violated SLO!)
# Root cause: connection pool not configured for retry

# Step 6: Document findings and action items
# Action 1: Configure db connection pool with retry (deadline: 1 week)
# Action 2: Add read replica health check to load balancer
# Action 3: Set RDS Multi-AZ for automatic failover

Post-GameDay Report Template

# GameDay Report — YYYY-MM-DD
# Experiment: Primary database failover
# Team: Platform + Backend

## Hypothesis
If primary RDS fails, checkout success rate stays above 99%
because application retries with 30s timeout.

## Results
| Metric            | Baseline | During Chaos | Recovery |
|-------------------|----------|--------------|----------|
| Success rate      | 99.95%   | 94%          | 99.9%    |
| P95 latency       | 180ms    | 8200ms       | 190ms    |
| Failover duration | N/A      | 47 seconds   | N/A      |

## Hypothesis: FAILED (success rate dropped below 99%)

## Root Causes
1. Connection pool has no retry — first request to new primary fails
2. Application timeout (5s) shorter than failover time (47s)

## Action Items
1. [P0] Add connection retry with backoff — @backend-team — due 2024-04-01
2. [P1] Increase db connection timeout to 60s — @backend-team — due 2024-04-01
3. [P2] Add chaos experiment to CI/CD pipeline — @platform-team — due 2024-04-15

## What Went Well
- Monitoring detected the issue immediately
- Automatic RDS failover worked as expected
- Team communicated effectively during the experiment

Chaos Engineering in CI/CD

# .github/workflows/chaos-staging.yml

name: Chaos Experiment — Staging

on:
  schedule:
    - cron: "0 3 * * 1-5"   # Weekdays at 3am
  workflow_dispatch:
    inputs:
      experiment:
        description: "Experiment name"
        required: true
        default: "pod-kill"

jobs:
  chaos-experiment:
    runs-on: ubuntu-latest
    environment: staging

    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        run: echo "${{ secrets.STAGING_KUBECONFIG }}" | base64 -d > kubeconfig

      - name: Capture steady state
        env:
          KUBECONFIG: ./kubeconfig
        run: |
          # promtool needs the Prometheus server URL; its output line looks like
          # "{} => 0.9995 @[timestamp]", so take the third field to get the bare number
          SUCCESS_RATE=$(kubectl exec -n monitoring deploy/prometheus \
            -- promtool query instant http://localhost:9090 \
            'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
            | awk '{print $3}')
          echo "Baseline success rate: $SUCCESS_RATE"
          echo "BASELINE=$SUCCESS_RATE" >> $GITHUB_ENV

      - name: Apply chaos experiment
        env:
          KUBECONFIG: ./kubeconfig
        run: |
          kubectl apply -f chaos/staging/pod-kill-experiment.yaml
          echo "Chaos experiment started — waiting 5 minutes..."
          sleep 300

      - name: Validate SLO during chaos
        env:
          KUBECONFIG: ./kubeconfig
        run: |
          # Same query as the baseline step: pass the server URL and extract the bare value
          SUCCESS_RATE=$(kubectl exec -n monitoring deploy/prometheus \
            -- promtool query instant http://localhost:9090 \
            'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
            | awk '{print $3}')
          echo "Success rate during chaos: $SUCCESS_RATE"
          # Fail if below 99%
          python3 -c "
          rate = float('$SUCCESS_RATE')
          if rate < 0.99:
              print(f'SLO VIOLATION: {rate:.4f} < 0.99')
              exit(1)
          print(f'SLO OK: {rate:.4f}')
          "

      - name: Cleanup experiment
        if: always()
        env:
          KUBECONFIG: ./kubeconfig
        run: |
          kubectl delete -f chaos/staging/pod-kill-experiment.yaml --ignore-not-found

      - name: Post results to Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Chaos Experiment completed: ${{ job.status }}",
              "blocks": [{
                "type": "section",
                "text": { "type": "mrkdwn",
                  "text": "*Chaos Experiment*: pod-kill staging\n*Result*: ${{ job.status }}\n*Run*: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" }
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

⚠️ Safety Rules — Never Skip These:
  • Always configure a stop condition (CloudWatch alarm / Prometheus alert) that automatically aborts the experiment if impact exceeds threshold
  • Never run chaos experiments during deployments, peak traffic hours, or maintenance windows
  • Always have a human abort button — someone watching dashboards with authority to stop immediately
  • Never run in production without first proving in staging
  • Communicate to all stakeholders before running production experiments

💡 Getting Started: Don't start with complex network partitions. Start simple:
  1. Kill 1 pod of a non-critical service in staging
  2. Verify auto-recovery works (Kubernetes restarts it)
  3. Add retry logic where needed
  4. Graduate to multi-pod, then network faults, then AZ failures

Building a Chaos Engineering Culture

✅ Organizational Principles:
  • Start with leadership buy-in — frame chaos as risk reduction, not risk creation
  • Blameless environment — findings expose system weaknesses, not human failures
  • Track reliability improvements — measure MTTR before/after each experiment cycle
  • Celebrate learnings — a failed hypothesis is a successful experiment (you learned something)
  • Publish results — share GameDay reports across engineering; build institutional knowledge
  • Include chaos in on-call — on-call engineers run monthly GameDays to stay sharp

Next Steps