Infrastructure Playbooks
Each runbook follows the standard structure: Symptom, Root Cause, Immediate Actions, Verification, and Prevention. All commands are written for a production Kubernetes cluster on AWS/GCP with standard tooling (kubectl, awscli, gcloud).
RB-0001 — Kubernetes Pod CrashLoopBackOff
Symptom
Pod status shows CrashLoopBackOff. The pod starts, crashes immediately, and Kubernetes restarts it in an exponentially increasing back-off cycle. Users may see 503 errors if the affected pod is part of a load-balanced service.
Common Root Causes
- Out-of-Memory (OOM) kill — container exceeds its memory limit
- Missing environment variable or ConfigMap / Secret reference
- Bad container image (corrupted layer, wrong tag, non-existent registry path)
- Application startup failure (misconfiguration, failed DB connection on boot)
- Liveness probe misconfigured — probe kills healthy containers
Immediate Actions
# 1. Identify the affected pod(s)
kubectl get pods -n <namespace> | grep CrashLoopBackOff
# 2. Describe the pod — look at Events section at the bottom
kubectl describe pod <pod-name> -n <namespace>
# 3. Read the current logs (may be truncated if container exited quickly)
kubectl logs <pod-name> -n <namespace>
# 4. Read logs from the previous container instance (most useful for CrashLoop)
kubectl logs <pod-name> -n <namespace> --previous
# 5. Check recent events in the namespace for broader context
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -30
Diagnosis & Resolution by Cause
OOM Kill
In kubectl describe pod output, look for OOMKilled in the last state:
# Confirm OOM kill
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State"
# Expected: Reason: OOMKilled
# Check current resource usage on the node
kubectl top pod <pod-name> -n <namespace>
# Resolution: increase memory limit in the Deployment manifest
kubectl edit deployment <deployment-name> -n <namespace>
# Under resources.limits.memory, increase the value (e.g., 512Mi → 1Gi)
# Or patch directly without editor
kubectl patch deployment <deployment-name> -n <namespace> \
--type='json' \
-p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"}]'
Missing ConfigMap / Secret
# Logs will typically show: "env var X not set" or similar
# Verify referenced ConfigMap exists
kubectl get configmap <configmap-name> -n <namespace>
# Verify referenced Secret exists
kubectl get secret <secret-name> -n <namespace>
# If missing, create the ConfigMap from a file
kubectl create configmap <configmap-name> --from-file=config.yaml -n <namespace>
# Or create the Secret
kubectl create secret generic <secret-name> \
--from-literal=DB_PASSWORD='s3cur3p@ss' -n <namespace>
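If it is not obvious which ConfigMap or Secret the workload references, the rendered Deployment spec can be searched directly; a minimal sketch, using the same placeholders as above:
# Find all ConfigMap / Secret references declared by the containers
kubectl get deployment <deployment-name> -n <namespace> -o yaml \
| grep -B2 -A4 -i "configMapKeyRef\|secretKeyRef\|configMapRef\|secretRef"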
Bad Container Image
# Check image pull status in events
kubectl describe pod <pod-name> -n <namespace> | grep -i "image\|pull\|back-off"
# Roll back to the last known good image tag
kubectl set image deployment/<deployment-name> \
<container-name>=<registry>/<image>:<last-good-tag> \
-n <namespace>
# Monitor rollout
kubectl rollout status deployment/<deployment-name> -n <namespace>
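Misconfigured Liveness Probe
If the application is healthy but keeps getting restarted, the liveness probe may be too aggressive. A minimal check-and-mitigate sketch; the initialDelaySeconds value is an illustrative assumption, and the JSON patch assumes the field already exists in the manifest (use op "add" otherwise):
# Inspect the current probe configuration
kubectl get deployment <deployment-name> -n <namespace> \
-o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'
# Look for probe failures in pod events
kubectl describe pod <pod-name> -n <namespace> | grep -i "liveness\|unhealthy"
# Give the application more time to start before the probe kicks in
kubectl patch deployment <deployment-name> -n <namespace> \
--type='json' \
-p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":60}]'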
Verification
Run kubectl get pods -n <namespace> and confirm the pod status is Running with a stable restart count. Check application logs with kubectl logs <pod-name> -n <namespace> for absence of errors.
RB-0002 — Kubernetes Node NotReady
Symptom
One or more nodes show NotReady status in kubectl get nodes. Pods on the affected node may be evicted or stuck in Terminating or Unknown state.
Common Root Causes
- Disk pressure — node disk usage exceeds eviction threshold
- Memory pressure — node memory fully exhausted
- Network partition — kubelet cannot communicate with the API server
- kubelet crash or hang
- Container runtime (containerd / Docker) failure
Immediate Actions
# 1. Identify NotReady nodes
kubectl get nodes
# 2. Get detailed node conditions
kubectl describe node <node-name>
# Look at: Conditions section (DiskPressure, MemoryPressure, PIDPressure, Ready)
# Look at: Events section at the bottom
# 3. Prevent new pods from being scheduled on the affected node
kubectl cordon <node-name>
# 4. If node needs maintenance — drain it (evicts all pods gracefully)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 5. SSH into the node to investigate (example for AWS EC2 with SSM)
aws ssm start-session --target <instance-id>
Diagnosis on the Node
# Check kubelet status
sudo systemctl status kubelet
# Restart kubelet if it has crashed
sudo systemctl restart kubelet
# Check kubelet logs
sudo journalctl -u kubelet -n 100 --no-pager
# Check container runtime
sudo systemctl status containerd
sudo systemctl restart containerd
# Check disk usage
df -h
du -sh /var/log/* | sort -hr | head -20
# Check memory
free -m
# Check network connectivity to API server
curl -k https://<api-server-endpoint>/healthz
Recovery
# After fixing the underlying issue, uncordon the node
kubectl uncordon <node-name>
# Verify node is Ready
kubectl get nodes
# Verify pods are rescheduled and running
kubectl get pods -A | grep -v Running | grep -v Completed
Verification
All nodes report Ready in kubectl get nodes. No pods are stuck in Unknown or Terminating. Cluster-level metrics (CPU, memory, disk) return to the normal baseline.
RB-0003 — High CPU / Memory on Instance
Symptom
CPU usage sustained above 90% or memory above 95% on a VM / EC2 instance / Compute Engine instance. Application may be slow, timing out, or returning 5xx errors. Alert fires from CloudWatch / Stackdriver / Prometheus node exporter.
Immediate Actions
# SSH into the instance
# Identify the top CPU consumers (interactive, refreshes every 2s)
top -c
# Non-interactive snapshot — useful for scripting and logging
ps aux --sort=-%cpu | head -20
# Check memory usage
free -m
ps aux --sort=-%mem | head -20
# Check load average (1m, 5m, 15m)
uptime
# Full resource snapshot with extended process info
htop # (if installed)
# Check for runaway processes by specific user or service
ps -u www-data -o pid,ppid,%cpu,%mem,comm --sort=-%cpu | head -10
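If no single process stands out, a broader look at per-CPU utilization, I/O wait, and run-queue pressure can help distinguish CPU saturation from disk or memory trouble; a minimal sketch, assuming the sysstat tools are installed:
# Per-CPU utilization and iowait, five 1-second samples
mpstat -P ALL 1 5
# Device-level I/O saturation (high %util or await points at disk, not CPU)
iostat -xz 1 5
# Run queue length, context switches, swap activity
vmstat 1 5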
Resolution
# Option A: Kill a runaway process (use SIGTERM first, then SIGKILL)
kill -15 <PID>
# Wait 10 seconds, then if still running:
kill -9 <PID>
# Option B: Restart the problematic service
sudo systemctl restart <service-name>
# Option C: Scale out — add more instances via autoscaling group
# AWS: manually trigger a scale-out
aws autoscaling set-desired-capacity \
--auto-scaling-group-name <asg-name> \
--desired-capacity <current+1>
# GCP: resize managed instance group
gcloud compute instance-groups managed resize <mig-name> \
--size=<new-size> --zone=<zone>
# Option D: For burstable instances (AWS T-series), change to a non-burstable
# instance type (the instance must be stopped before the type can be changed)
aws ec2 modify-instance-attribute --instance-id <id> \
--instance-type '{"Value": "m5.large"}'
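Alternatively for T-series instances, switching to unlimited CPU credits avoids a stop/start cycle; a minimal sketch (the instance ID is a placeholder, and unlimited mode may incur additional charges):
aws ec2 modify-instance-credit-specification \
--instance-credit-specifications 'InstanceId=<id>,CpuCredits=unlimited'
# Confirm the change
aws ec2 describe-instance-credit-specifications --instance-ids <id>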
RB-0004 — Database Connection Pool Exhausted
Symptom
Application returns errors such as FATAL: remaining connection slots are reserved (PostgreSQL) or Too many connections (MySQL). Database metrics show connections at or near max_connections. New application pods fail to start because they cannot acquire a DB connection.
Immediate Actions
# Connect to PostgreSQL and inspect active connections
psql -h <host> -U <admin-user> -d <database>
-- View current connections grouped by state and application
SELECT state, application_name, count(*)
FROM pg_stat_activity
GROUP BY state, application_name
ORDER BY count DESC;
-- View long-running queries (potential connection holders)
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 minutes'
ORDER BY duration DESC;
-- Terminate connections that have been idle for more than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';
-- View max_connections setting
SHOW max_connections;
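For MySQL (the Too many connections case from the symptom), the equivalent checks look roughly like this; a minimal sketch:
-- View current connections and what they are doing
SHOW PROCESSLIST;
-- View the connection limit and current usage
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
-- Terminate a specific connection using its Id from SHOW PROCESSLIST
KILL <connection-id>;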
Connection Pooler — PgBouncer
# Check PgBouncer status
psql -h <pgbouncer-host> -p 6432 -U pgbouncer pgbouncer
-- View pool statistics
SHOW POOLS;
-- View client statistics
SHOW CLIENTS;
-- View server connections
SHOW SERVERS;
-- Reload PgBouncer config without full restart
RELOAD;
# Restart PgBouncer pod in Kubernetes
kubectl rollout restart deployment/pgbouncer -n <namespace>
Application-Side Config Check
# Check connection pool settings in application config
# For Python (SQLAlchemy) — typical settings to review:
# pool_size=5, max_overflow=10, pool_timeout=30
# For Node.js (pg / knex):
# min: 2, max: 10
# Emergency: reduce pool size per pod to free up connections
kubectl set env deployment/<app> DB_POOL_MAX=3 -n <namespace>
# Scale down replicas temporarily to reduce connection pressure
kubectl scale deployment/<app> --replicas=2 -n <namespace>
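On the database side, a per-role connection cap and an idle-in-transaction timeout provide a backstop while the application pools are being fixed; a minimal sketch (the role name and limit values are assumptions):
-- Cap connections for the application role
ALTER ROLE <app-role> CONNECTION LIMIT 50;
-- Automatically close sessions left idle inside a transaction (PostgreSQL 9.6+)
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();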
Verification
Connection count in pg_stat_activity drops well below max_connections. Application errors clear and new requests succeed. Monitor pg_stat_activity for 5 minutes to confirm stability.
Prevention
Use a connection pooler such as PgBouncer, set per-application pool limits, and size max_connections appropriately for instance memory (rule of thumb: 100 connections per GB of RAM for PostgreSQL).
RB-0005 — TLS Certificate Expired
Symptom
Users see browser SSL errors (ERR_CERT_DATE_INVALID). Load balancer health checks fail on HTTPS. Monitoring alerts fire on certificate expiry check. curl to the endpoint returns SSL certificate problem: certificate has expired.
Immediate Actions — Diagnose
# Check certificate expiry date directly from the endpoint
echo | openssl s_client -servername <domain> -connect <domain>:443 2>/dev/null \
| openssl x509 -noout -dates
# Check certificate details (subject, issuer, SANs)
echo | openssl s_client -servername <domain> -connect <domain>:443 2>/dev/null \
| openssl x509 -noout -text | grep -A2 "Subject:\|Issuer:\|Not After\|DNS:"
# Check Kubernetes TLS secret expiry (cert-manager managed)
kubectl get certificate -A
kubectl describe certificate <cert-name> -n <namespace>
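If TLS is terminated at an ingress or load balancer, it also helps to inspect the certificate stored in the Kubernetes TLS Secret itself; a minimal sketch (the secret name is a placeholder):
kubectl get secret <tls-secret-name> -n <namespace> -o jsonpath='{.data.tls\.crt}' \
| base64 -d | openssl x509 -noout -dates -subject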
Resolution — cert-manager (Kubernetes)
# Allow cert-manager to serve a temporary self-signed certificate while re-issuance
# is in progress (note: this annotation alone does not force a renewal)
kubectl annotate certificate <cert-name> -n <namespace> \
cert-manager.io/issue-temporary-certificate="true"
# Delete the managed Secret to force cert-manager to re-issue
kubectl delete secret <tls-secret-name> -n <namespace>
# cert-manager will automatically recreate it
# Check certificate request status
kubectl get certificaterequest -n <namespace>
kubectl describe certificaterequest <cr-name> -n <namespace>
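If the cert-manager CLI (cmctl) is installed, renewal can also be requested and tracked directly; a minimal sketch:
# Request immediate renewal of the Certificate
cmctl renew <cert-name> -n <namespace>
# Check issuance progress
cmctl status certificate <cert-name> -n <namespace>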
Resolution — Manual Let's Encrypt (Certbot)
# Renew all certificates managed by certbot
sudo certbot renew --force-renewal
# Renew a specific domain
sudo certbot certonly --force-renewal -d <domain>
# Reload web server / ingress after renewal
sudo systemctl reload nginx
# or
sudo systemctl reload apache2
# Copy renewed cert to Kubernetes secret
kubectl create secret tls <tls-secret-name> \
--cert=/etc/letsencrypt/live/<domain>/fullchain.pem \
--key=/etc/letsencrypt/live/<domain>/privkey.pem \
-n <namespace> --dry-run=client -o yaml | kubectl apply -f -
Verification
Run the openssl s_client command again and confirm Not After is a future date (typically 90 days for Let's Encrypt, up to about 13 months for commercial CAs). The browser shows a valid padlock.
RB-0006 — S3 / GCS Bucket Accidentally Made Public
Symptom
Security scanning tool (AWS Macie, GCP SCC, or third-party CSPM) reports a public bucket. CloudTrail / Cloud Audit Logs show a PutBucketAcl or SetIamPolicy event granting public access. Data may have been exposed.
Immediate Actions — AWS S3
# Step 1: Block all public access at bucket level IMMEDIATELY
aws s3api put-public-access-block \
--bucket <bucket-name> \
--public-access-block-configuration \
"BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
# Step 2: Verify public access is blocked
aws s3api get-public-access-block --bucket <bucket-name>
# Step 3: Remove any bucket ACL grants to AllUsers or AuthenticatedUsers
aws s3api put-bucket-acl --bucket <bucket-name> --acl private
# Step 4: Check current bucket policy for public statements
aws s3api get-bucket-policy --bucket <bucket-name> --query Policy --output text | python3 -m json.tool
# Step 5: Remove the bucket policy if it grants public access
aws s3api delete-bucket-policy --bucket <bucket-name>
# Step 6: Check CloudTrail for who made the change and when
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=<bucket-name> \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--query 'Events[?contains(EventName, `Bucket`) || contains(EventName, `Acl`) || contains(EventName, `Policy`)]'
Immediate Actions — GCP GCS
# Step 1: Remove allUsers and allAuthenticatedUsers from bucket IAM policy
gcloud storage buckets remove-iam-policy-binding gs://<bucket-name> \
--member=allUsers --role=roles/storage.objectViewer
gcloud storage buckets remove-iam-policy-binding gs://<bucket-name> \
--member=allAuthenticatedUsers --role=roles/storage.objectViewer
# Step 2: Verify current IAM policy
gcloud storage buckets get-iam-policy gs://<bucket-name>
# Step 3: Enable uniform bucket-level access to prevent ACL-based public access
gcloud storage buckets update gs://<bucket-name> --uniform-bucket-level-access
# Step 4: Check Cloud Audit Logs for the change
gcloud logging read \
'protoPayload.methodName=("storage.buckets.setIamPolicy" OR "storage.buckets.update") AND resource.labels.bucket_name="<bucket-name>"' \
--limit=20 --format=json
Prevention
Enable account-level S3 Block Public Access and the GCS organization policy constraint storage.publicAccessPrevention. Implement SCPs / Org Policies to deny any action that grants public access. Enable AWS Macie or GCP SCC to continuously scan for public buckets. Add automated remediation via AWS Config Rules.
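A minimal sketch of the account- and org-level guardrails mentioned above; the account and organization IDs are placeholders, and the gcloud constraint syntax may vary by gcloud version:
# AWS: block public access for the entire account, not just one bucket
aws s3control put-public-access-block --account-id <account-id> \
--public-access-block-configuration \
"BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
# GCP: enforce public access prevention org-wide
gcloud resource-manager org-policies enable-enforce \
storage.publicAccessPrevention --organization=<org-id>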
RB-0007 — Disk Space Critical
Symptom
Alert fires when disk usage exceeds 85% (warning) or 95% (critical). Application may fail to write logs, create temporary files, or write to database WAL. In severe cases the OS becomes unresponsive.
Immediate Actions — Identify Space Consumers
# Check disk usage across all filesystems
df -h
# Find the largest directories on the root filesystem
du -sh /var/log/* 2>/dev/null | sort -hr | head -20
du -sh /tmp/* 2>/dev/null | sort -hr | head -10
du -sh /home/* 2>/dev/null | sort -hr | head -10
# Find large files (>100MB) recursively
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -hr | head -20
# Check Docker / container storage
docker system df # If Docker is installed
crictl images # If containerd is the runtime
Resolution — Free Up Space
# Rotate / compress old logs immediately
sudo logrotate -f /etc/logrotate.conf
# Remove old rotated logs
sudo find /var/log -name "*.gz" -mtime +7 -delete
sudo find /var/log -name "*.log.[0-9]*" -mtime +3 -delete
# Clean up journal logs (keep last 100MB)
sudo journalctl --vacuum-size=100M
sudo journalctl --vacuum-time=3d
# Clean apt/yum package cache
sudo apt-get clean # Debian/Ubuntu
sudo yum clean all # RHEL/CentOS
# Remove Docker unused images, containers, volumes
docker system prune -f
docker image prune -a -f
# Remove core dumps if present
sudo find / -name "core" -type f -delete 2>/dev/null
# Truncate (not delete) an actively-written log file
sudo truncate -s 0 /var/log/<application>/<large-log-file>.log
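If df -h still reports high usage after files were deleted, a process may be holding deleted files open; a minimal sketch, assuming lsof is installed:
# List open-but-deleted files (link count 0) that still consume space
sudo lsof +L1 | head -20
# Restart the offending service so the space is actually released
sudo systemctl restart <service-name>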
Resolution — Extend Volume
# AWS EBS — extend volume online
# Step 1: Resize the EBS volume in AWS Console or CLI
aws ec2 modify-volume --volume-id vol-xxxxxxxx --size <new-size-gb>
# Step 2: Wait for optimizing state
aws ec2 describe-volumes-modifications --volume-id vol-xxxxxxxx
# Step 3: Grow the partition (for xfs filesystem)
sudo growpart /dev/nvme0n1 1
sudo xfs_growfs -d /
# Step 3 alternative (for ext4 filesystem)
sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1
# GCP Persistent Disk — extend volume online
gcloud compute disks resize <disk-name> --size=<new-size>GB --zone=<zone>
# Then resize the filesystem (same growpart / resize2fs commands as above)
Verification
Run df -h and confirm disk usage is below 70%. Application logs are being written successfully. No filesystem-full errors appear in /var/log/syslog or journalctl.
Prevention
Configure logrotate for all application logs. Enable automatic EBS volume extension via AWS CloudWatch alarms + Lambda. For Kubernetes, set emptyDir.sizeLimit and enable ephemeral storage limits on pods to prevent a single pod from exhausting node disk (a sketch follows below).
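A minimal sketch of the pod-level ephemeral storage limits mentioned above; the size values are illustrative assumptions:
# Set ephemeral-storage requests/limits on an existing Deployment
kubectl set resources deployment/<app> -n <namespace> \
--requests=ephemeral-storage=512Mi --limits=ephemeral-storage=2Gi
# Verify the limits were applied
kubectl get deployment/<app> -n <namespace> \
-o jsonpath='{.spec.template.spec.containers[0].resources}'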