Infrastructure Playbooks
Each runbook follows the standard structure: Symptom, Root Cause, Immediate Actions, Verification, and Prevention. All commands are written for a production Kubernetes cluster on AWS/GCP with standard tooling (kubectl, awscli, gcloud).
RB-0001 — Kubernetes Pod CrashLoopBackOff
Symptom
Pod status shows CrashLoopBackOff. The pod starts, crashes immediately, and Kubernetes restarts it in an exponentially increasing back-off cycle. Users may see 503 errors if the affected pod is part of a load-balanced service.
Common Root Causes
- Out-of-Memory (OOM) kill — container exceeds its memory limit
- Missing environment variable or ConfigMap / Secret reference
- Bad container image (corrupted layer, wrong tag, non-existent registry path)
- Application startup failure (misconfiguration, failed DB connection on boot)
- Liveness probe misconfigured — probe kills healthy containers
Immediate Actions
# 1. Identify the affected pod(s)
kubectl get pods -n <namespace> | grep CrashLoopBackOff
# 2. Describe the pod — look at Events section at the bottom
kubectl describe pod <pod-name> -n <namespace>
# 3. Read the current logs (may be truncated if container exited quickly)
kubectl logs <pod-name> -n <namespace>
# 4. Read logs from the previous container instance (most useful for CrashLoop)
kubectl logs <pod-name> -n <namespace> --previous
# 5. Check recent events in the namespace for broader context
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -30
Diagnosis & Resolution by Cause
OOM Kill
In kubectl describe pod output, look for OOMKilled in the last state:
# Confirm OOM kill
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State"
# Expected: Reason: OOMKilled
# Check current resource usage on the node
kubectl top pod <pod-name> -n <namespace>
# Resolution: increase memory limit in the Deployment manifest
kubectl edit deployment <deployment-name> -n <namespace>
# Under resources.limits.memory, increase the value (e.g., 512Mi → 1Gi)
# Or patch directly without editor
kubectl patch deployment <deployment-name> -n <namespace> \
--type='json' \
-p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"}]'
Missing ConfigMap / Secret
# Logs will typically show: "env var X not set" or similar
# Verify referenced ConfigMap exists
kubectl get configmap <configmap-name> -n <namespace>
# Verify referenced Secret exists
kubectl get secret <secret-name> -n <namespace>
# If missing, create the ConfigMap from a file
kubectl create configmap <configmap-name> --from-file=config.yaml -n <namespace>
# Or create the Secret
kubectl create secret generic <secret-name> \
--from-literal=DB_PASSWORD='s3cur3p@ss' -n <namespace>
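If it is not obvious which ConfigMap or Secret the workload references, the rendered Deployment spec can be searched directly; a minimal sketch, using the same placeholders as above:
# Find all ConfigMap / Secret references declared by the containers
kubectl get deployment <deployment-name> -n <namespace> -o yaml \
| grep -B2 -A4 -i "configMapKeyRef\|secretKeyRef\|configMapRef\|secretRef"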
Bad Container Image
# Check image pull status in events
kubectl describe pod <pod-name> -n <namespace> | grep -i "image\|pull\|back-off"
# Roll back to the last known good image tag
kubectl set image deployment/<deployment-name> \
<container-name>=<registry>/<image>:<last-good-tag> \
-n <namespace>
# Monitor rollout
kubectl rollout status deployment/<deployment-name> -n <namespace>
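Misconfigured Liveness Probe
If the application is healthy but keeps getting restarted, the liveness probe may be too aggressive. A minimal check-and-mitigate sketch; the initialDelaySeconds value is an illustrative assumption, and the JSON patch assumes the field already exists in the manifest (use op "add" otherwise):
# Inspect the current probe configuration
kubectl get deployment <deployment-name> -n <namespace> \
-o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'
# Look for probe failures in pod events
kubectl describe pod <pod-name> -n <namespace> | grep -i "liveness\|unhealthy"
# Give the application more time to start before the probe kicks in
kubectl patch deployment <deployment-name> -n <namespace> \
--type='json' \
-p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":60}]'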
Verification
Run kubectl get pods -n <namespace> and confirm the pod status is Running with a stable restart count. Check application logs with kubectl logs <pod-name> -n <namespace> for absence of errors.
RB-0002 — Kubernetes Node NotReady
Symptom
One or more nodes show NotReady status in kubectl get nodes. Pods on the affected node may be evicted or stuck in Terminating or Unknown state.
Common Root Causes
- Disk pressure — node disk usage exceeds eviction threshold
- Memory pressure — node memory fully exhausted
- Network partition — kubelet cannot communicate with the API server
- kubelet crash or hang
- Container runtime (containerd / Docker) failure
Immediate Actions
# 1. Identify NotReady nodes
kubectl get nodes
# 2. Get detailed node conditions
kubectl describe node <node-name>
# Look at: Conditions section (DiskPressure, MemoryPressure, PIDPressure, Ready)
# Look at: Events section at the bottom
# 3. Prevent new pods from being scheduled on the affected node
kubectl cordon <node-name>
# 4. If node needs maintenance — drain it (evicts all pods gracefully)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 5. SSH into the node to investigate (example for AWS EC2 with SSM)
aws ssm start-session --target <instance-id>
Diagnosis on the Node
# Check kubelet status
sudo systemctl status kubelet
# Restart kubelet if it has crashed
sudo systemctl restart kubelet
# Check kubelet logs
sudo journalctl -u kubelet -n 100 --no-pager
# Check container runtime
sudo systemctl status containerd
sudo systemctl restart containerd
# Check disk usage
df -h
du -sh /var/log/* | sort -hr | head -20
# Check memory
free -m
# Check network connectivity to API server
curl -k https://<api-server-endpoint>/healthz
Recovery
# After fixing the underlying issue, uncordon the node
kubectl uncordon <node-name>
# Verify node is Ready
kubectl get nodes
# Verify pods are rescheduled and running
kubectl get pods -A | grep -v Running | grep -v Completed
Verification
All nodes report Ready in kubectl get nodes. No pods are stuck in Unknown or Terminating. Cluster-level metrics (CPU, memory, disk) return to the normal baseline.
RB-0003 — High CPU / Memory on Instance
Symptom
CPU usage sustained above 90% or memory above 95% on a VM / EC2 instance / Compute Engine instance. Application may be slow, timing out, or returning 5xx errors. Alert fires from CloudWatch / Stackdriver / Prometheus node exporter.
Immediate Actions
# SSH into the instance
# Identify the top CPU consumers (interactive, refreshes every 2s)
top -c
# Non-interactive snapshot — useful for scripting and logging
ps aux --sort=-%cpu | head -20
# Check memory usage
free -m
ps aux --sort=-%mem | head -20
# Check load average (1m, 5m, 15m)
uptime
# Full resource snapshot with extended process info
htop # (if installed)
# Check for runaway processes by specific user or service
ps -u www-data -o pid,ppid,%cpu,%mem,comm --sort=-%cpu | head -10
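If no single process stands out, a broader look at per-CPU utilization, I/O wait, and run-queue pressure can help distinguish CPU saturation from disk or memory trouble; a minimal sketch, assuming the sysstat tools are installed:
# Per-CPU utilization and iowait, five 1-second samples
mpstat -P ALL 1 5
# Device-level I/O saturation (high %util or await points at disk, not CPU)
iostat -xz 1 5
# Run queue length, context switches, swap activity
vmstat 1 5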
Resolution
# Option A: Kill a runaway process (use SIGTERM first, then SIGKILL)
kill -15 <PID>
# Wait 10 seconds, then if still running:
kill -9 <PID>
# Option B: Restart the problematic service
sudo systemctl restart <service-name>
# Option C: Scale out — add more instances via autoscaling group
# AWS: manually trigger a scale-out
aws autoscaling set-desired-capacity \
--auto-scaling-group-name <asg-name> \
--desired-capacity <current+1>
# GCP: resize managed instance group
gcloud compute instance-groups managed resize <mig-name> \
--size=<new-size> --zone=<zone>
# Option D: For burstable instances (AWS T-series), change to a non-burstable
# instance type (the instance must be stopped before the type can be changed)
aws ec2 modify-instance-attribute --instance-id <id> \
--instance-type '{"Value": "m5.large"}'
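Alternatively for T-series instances, switching to unlimited CPU credits avoids a stop/start cycle; a minimal sketch (the instance ID is a placeholder, and unlimited mode may incur additional charges):
aws ec2 modify-instance-credit-specification \
--instance-credit-specifications 'InstanceId=<id>,CpuCredits=unlimited'
# Confirm the change
aws ec2 describe-instance-credit-specifications --instance-ids <id>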
RB-0004 — Database Connection Pool Exhausted
Symptom
Application returns errors such as FATAL: remaining connection slots are reserved (PostgreSQL) or Too many connections (MySQL). Database metrics show connections at or near max_connections. New application pods fail to start because they cannot acquire a DB connection.
Immediate Actions
# Connect to PostgreSQL and inspect active connections
psql -h <host> -U <admin-user> -d <database>
-- View current connections grouped by state and application
SELECT state, application_name, count(*)
FROM pg_stat_activity
GROUP BY state, application_name
ORDER BY count DESC;
-- View long-running queries (potential connection holders)
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 minutes'
ORDER BY duration DESC;
-- Terminate connections that have been idle for more than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';
-- View max_connections setting
SHOW max_connections;
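For MySQL (the Too many connections case from the symptom), the equivalent checks look roughly like this; a minimal sketch:
-- View current connections and what they are doing
SHOW PROCESSLIST;
-- View the connection limit and current usage
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
-- Terminate a specific connection using its Id from SHOW PROCESSLIST
KILL <connection-id>;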
Connection Pooler — PgBouncer
# Check PgBouncer status
psql -h <pgbouncer-host> -p 6432 -U pgbouncer pgbouncer
-- View pool statistics
SHOW POOLS;
-- View client statistics
SHOW CLIENTS;
-- View server connections
SHOW SERVERS;
-- Reload PgBouncer config without full restart
RELOAD;
# Restart PgBouncer pod in Kubernetes
kubectl rollout restart deployment/pgbouncer -n <namespace>
Application-Side Config Check
# Check connection pool settings in application config
# For Python (SQLAlchemy) — typical settings to review:
# pool_size=5, max_overflow=10, pool_timeout=30
# For Node.js (pg / knex):
# min: 2, max: 10
# Emergency: reduce pool size per pod to free up connections
kubectl set env deployment/<app> DB_POOL_MAX=3 -n <namespace>
# Scale down replicas temporarily to reduce connection pressure
kubectl scale deployment/<app> --replicas=2 -n <namespace>
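On the database side, a per-role connection cap and an idle-in-transaction timeout provide a backstop while the application pools are being fixed; a minimal sketch (the role name and limit values are assumptions):
-- Cap connections for the application role
ALTER ROLE <app-role> CONNECTION LIMIT 50;
-- Automatically close sessions left idle inside a transaction (PostgreSQL 9.6+)
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();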
Verification
Connection count in pg_stat_activity drops well below max_connections. Application errors clear and new requests succeed. Monitor pg_stat_activity for 5 minutes to confirm stability.
Prevention
Use a connection pooler such as PgBouncer, set per-application pool limits, and size max_connections appropriately for instance memory (rule of thumb: 100 connections per GB of RAM for PostgreSQL).
RB-0005 — TLS Certificate Expired
Symptom
Users see browser SSL errors (ERR_CERT_DATE_INVALID). Load balancer health checks fail on HTTPS. Monitoring alerts fire on certificate expiry check. curl to the endpoint returns SSL certificate problem: certificate has expired.
Immediate Actions — Diagnose
# Check certificate expiry date directly from the endpoint
echo | openssl s_client -servername <domain> -connect <domain>:443 2>/dev/null \
| openssl x509 -noout -dates
# Check certificate details (subject, issuer, SANs)
echo | openssl s_client -servername <domain> -connect <domain>:443 2>/dev/null \
| openssl x509 -noout -text | grep -A2 "Subject:\|Issuer:\|Not After\|DNS:"
# Check Kubernetes TLS secret expiry (cert-manager managed)
kubectl get certificate -A
kubectl describe certificate <cert-name> -n <namespace>
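If TLS is terminated at an ingress or load balancer, it also helps to inspect the certificate stored in the Kubernetes TLS Secret itself; a minimal sketch (the secret name is a placeholder):
kubectl get secret <tls-secret-name> -n <namespace> -o jsonpath='{.data.tls\.crt}' \
| base64 -d | openssl x509 -noout -dates -subject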
Resolution — cert-manager (Kubernetes)
# Allow cert-manager to serve a temporary self-signed certificate while re-issuance
# is in progress (note: this annotation alone does not force a renewal)
kubectl annotate certificate <cert-name> -n <namespace> \
cert-manager.io/issue-temporary-certificate="true"
# Delete the managed Secret to force cert-manager to re-issue
kubectl delete secret <tls-secret-name> -n <namespace>
# cert-manager will automatically recreate it
# Check certificate request status
kubectl get certificaterequest -n <namespace>
kubectl describe certificaterequest <cr-name> -n <namespace>
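If the cert-manager CLI (cmctl) is installed, renewal can also be requested and tracked directly; a minimal sketch:
# Request immediate renewal of the Certificate
cmctl renew <cert-name> -n <namespace>
# Check issuance progress
cmctl status certificate <cert-name> -n <namespace>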
Resolution — Manual Let's Encrypt (Certbot)
# Renew all certificates managed by certbot
sudo certbot renew --force-renewal
# Renew a specific domain
sudo certbot certonly --force-renewal -d <domain>
# Reload web server / ingress after renewal
sudo systemctl reload nginx
# or
sudo systemctl reload apache2
# Copy renewed cert to Kubernetes secret
kubectl create secret tls <tls-secret-name> \
--cert=/etc/letsencrypt/live/<domain>/fullchain.pem \
--key=/etc/letsencrypt/live/<domain>/privkey.pem \
-n <namespace> --dry-run=client -o yaml | kubectl apply -f -
Verification
Run the openssl s_client command again and confirm Not After is a future date (typically 90 days for Let's Encrypt, up to about 13 months for commercial CAs). The browser shows a valid padlock.
RB-0006 — S3 / GCS Bucket Accidentally Made Public
Symptom
Security scanning tool (AWS Macie, GCP SCC, or third-party CSPM) reports a public bucket. CloudTrail / Cloud Audit Logs show a PutBucketAcl or SetIamPolicy event granting public access. Data may have been exposed.
Immediate Actions — AWS S3
# Step 1: Block all public access at bucket level IMMEDIATELY
aws s3api put-public-access-block \
--bucket <bucket-name> \
--public-access-block-configuration \
"BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
# Step 2: Verify public access is blocked
aws s3api get-public-access-block --bucket <bucket-name>
# Step 3: Remove any bucket ACL grants to AllUsers or AuthenticatedUsers
aws s3api put-bucket-acl --bucket <bucket-name> --acl private
# Step 4: Check current bucket policy for public statements
aws s3api get-bucket-policy --bucket <bucket-name> --query Policy --output text | python3 -m json.tool
# Step 5: Remove the bucket policy if it grants public access
aws s3api delete-bucket-policy --bucket <bucket-name>
# Step 6: Check CloudTrail for who made the change and when
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=<bucket-name> \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--query 'Events[?contains(EventName, `Bucket`) || contains(EventName, `Acl`) || contains(EventName, `Policy`)]'
Immediate Actions — GCP GCS
# Step 1: Remove allUsers and allAuthenticatedUsers from bucket IAM policy
gcloud storage buckets remove-iam-policy-binding gs://<bucket-name> \
--member=allUsers --role=roles/storage.objectViewer
gcloud storage buckets remove-iam-policy-binding gs://<bucket-name> \
--member=allAuthenticatedUsers --role=roles/storage.objectViewer
# Step 2: Verify current IAM policy
gcloud storage buckets get-iam-policy gs://<bucket-name>
# Step 3: Enable uniform bucket-level access to prevent ACL-based public access
gcloud storage buckets update gs://<bucket-name> --uniform-bucket-level-access
# Step 4: Check Cloud Audit Logs for the change
gcloud logging read \
'protoPayload.methodName=("storage.buckets.setIamPolicy" OR "storage.buckets.update") AND resource.labels.bucket_name="<bucket-name>"' \
--limit=20 --format=json
Prevention
Enable account-level S3 Block Public Access and the GCS organization policy constraint storage.publicAccessPrevention. Implement SCPs / Org Policies to deny any action that grants public access. Enable AWS Macie or GCP SCC to continuously scan for public buckets. Add automated remediation via AWS Config Rules.
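A minimal sketch of the account- and org-level guardrails mentioned above; the account and organization IDs are placeholders, and the gcloud constraint syntax may vary by gcloud version:
# AWS: block public access for the entire account, not just one bucket
aws s3control put-public-access-block --account-id <account-id> \
--public-access-block-configuration \
"BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
# GCP: enforce public access prevention org-wide
gcloud resource-manager org-policies enable-enforce \
storage.publicAccessPrevention --organization=<org-id>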
RB-0007 — Disk Space Critical
Symptom
Alert fires when disk usage exceeds 85% (warning) or 95% (critical). Application may fail to write logs, create temporary files, or write to database WAL. In severe cases the OS becomes unresponsive.
Immediate Actions — Identify Space Consumers
# Check disk usage across all filesystems
df -h
# Find the largest directories on the root filesystem
du -sh /var/log/* 2>/dev/null | sort -hr | head -20
du -sh /tmp/* 2>/dev/null | sort -hr | head -10
du -sh /home/* 2>/dev/null | sort -hr | head -10
# Find large files (>100MB) recursively
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -hr | head -20
# Check Docker / container storage
docker system df # If Docker is installed
crictl images # If containerd is the runtime
Resolution — Free Up Space
# Rotate / compress old logs immediately
sudo logrotate -f /etc/logrotate.conf
# Remove old rotated logs
sudo find /var/log -name "*.gz" -mtime +7 -delete
sudo find /var/log -name "*.log.[0-9]*" -mtime +3 -delete
# Clean up journal logs (keep last 100MB)
sudo journalctl --vacuum-size=100M
sudo journalctl --vacuum-time=3d
# Clean apt/yum package cache
sudo apt-get clean # Debian/Ubuntu
sudo yum clean all # RHEL/CentOS
# Remove Docker unused images, containers, volumes
docker system prune -f
docker image prune -a -f
# Remove core dumps if present
sudo find / -name "core" -type f -delete 2>/dev/null
# Truncate (not delete) an actively-written log file
sudo truncate -s 0 /var/log/<application>/<large-log-file>.log
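If df -h still reports high usage after files were deleted, a process may be holding deleted files open; a minimal sketch, assuming lsof is installed:
# List open-but-deleted files (link count 0) that still consume space
sudo lsof +L1 | head -20
# Restart the offending service so the space is actually released
sudo systemctl restart <service-name>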
Resolution — Extend Volume
# AWS EBS — extend volume online
# Step 1: Resize the EBS volume in AWS Console or CLI
aws ec2 modify-volume --volume-id vol-xxxxxxxx --size <new-size-gb>
# Step 2: Wait for optimizing state
aws ec2 describe-volumes-modifications --volume-id vol-xxxxxxxx
# Step 3: Grow the partition (for xfs filesystem)
sudo growpart /dev/nvme0n1 1
sudo xfs_growfs -d /
# Step 3 alternative (for ext4 filesystem)
sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1
# GCP Persistent Disk — extend volume online
gcloud compute disks resize <disk-name> --size=<new-size>GB --zone=<zone>
# Then resize the filesystem (same growpart / resize2fs commands as above)
Verification
Run df -h and confirm disk usage is below 70%. Application logs are being written successfully. No filesystem-full errors appear in /var/log/syslog or journalctl.
Prevention
Configure logrotate for all application logs. Enable automatic EBS volume extension via AWS CloudWatch alarms + Lambda. For Kubernetes, set emptyDir.sizeLimit and enable ephemeral storage limits on pods to prevent a single pod from exhausting node disk (a sketch follows below).
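A minimal sketch of the pod-level ephemeral storage limits mentioned above; the size values are illustrative assumptions:
# Set ephemeral-storage requests/limits on an existing Deployment
kubectl set resources deployment/<app> -n <namespace> \
--requests=ephemeral-storage=512Mi --limits=ephemeral-storage=2Gi
# Verify the limits were applied
kubectl get deployment/<app> -n <namespace> \
-o jsonpath='{.spec.template.spec.containers[0].resources}'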