Incident Management
Incident Lifecycle
Every incident follows the same lifecycle: detect, triage and assign severity, mitigate, resolve, and learn through a post-mortem. Having a defined process reduces cognitive load during high-stress situations and ensures nothing is skipped.
Severity Levels
SEV1 — Critical
Definition: Complete service outage affecting all or a majority of users. Revenue impact per minute is material. Safety or data integrity may be at risk.
Examples: Production database down, payment processing completely unavailable, authentication service down (login impossible), DDoS causing full outage.
Response time: Acknowledge within 5 minutes, incident commander appointed within 10 minutes.
Communication: Status page update within 15 minutes. Executive notification within 30 minutes. Updates every 15 minutes.
Escalation: Page primary on-call → secondary on-call (5 min no-ack) → engineering lead (15 min) → VP Engineering (30 min).
SEV2 — High
Definition: Significant degradation affecting a substantial portion of users or a critical business function. Some users can work around the issue.
Examples: Search returning errors for 20% of users, payment latency 10x normal (timeouts for some users), dashboard charts not loading, API rate limiting at lower-than-expected thresholds.
Response time: Acknowledge within 15 minutes during business hours, 30 minutes off-hours.
Communication: Status page update if user-visible. Team Slack notification. Manager awareness. Updates every 30 minutes.
SEV3 — Medium
Definition: Partial degradation or failure affecting a minority of users or a non-critical feature. Reliable workarounds exist.
Examples: Email notifications delayed by 10 minutes, PDF export feature broken for some file types, analytics data delayed by 2 hours, specific browser compatibility issue.
Response time: Acknowledge within 2 hours. Resolve within 24 hours.
Communication: Slack notification in engineering channel. Ticket created and tracked.
SEV4 — Low
Definition: Minor issue with minimal user impact. The issue is cosmetic, performance remains within the acceptable range, or only internal tools are affected.
Examples: UI misalignment in admin panel, log verbosity too high causing storage growth, documentation link broken, slow report generation (within SLO).
Response time: Handled during next business day. No on-call paging.
Communication: Jira ticket. Resolved in next sprint planning.
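Severity assignment happens under pressure, so it helps to encode a first-pass mapping from impact to severity rather than debating it live. Below is a minimal sketch in Python; the impact fields and thresholds are illustrative assumptions meant to mirror the definitions above, not an official rubric.
from dataclasses import dataclass

@dataclass
class Impact:
    # Rough inputs gathered during triage; field names are illustrative assumptions
    pct_users_affected: float       # 0.0 - 1.0
    critical_function_down: bool    # payments, auth, or another core path fully unavailable
    internal_only: bool             # only internal tools affected

def classify_severity(impact: Impact) -> str:
    # First-pass suggestion only; the Incident Commander can always override
    if impact.critical_function_down or impact.pct_users_affected >= 0.5:
        return "SEV1"
    if impact.pct_users_affected >= 0.2:
        return "SEV2"
    if impact.internal_only or impact.pct_users_affected < 0.01:
        return "SEV4"
    return "SEV3"

# Example: search errors for 20% of users (the SEV2 example above)
print(classify_severity(Impact(pct_users_affected=0.20, critical_function_down=False, internal_only=False)))
# -> SEV2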
Incident Commander Role
The Incident Commander (IC) is the single decision-maker during an active incident. The IC does not do the technical work of fixing the problem — they coordinate the people who do. During high-stress incidents, clear command structure prevents chaos.
IC Responsibilities
- Declare the incident and assign severity immediately upon joining the war room
- Assign roles: Primary responder (owns diagnosis/fix), Communications Lead (status page + stakeholders), Scribe (takes timeline notes)
- Maintain communication cadence: Brief updates every 15 minutes (SEV1), every 30 minutes (SEV2) (a reminder sketch follows this list)
- Control the war room: Prevent rabbit holes, refocus when team gets stuck, call timeouts on unproductive approaches
- Make rollback decision: IC has authority to order a rollback at any time, overriding engineering judgment
- Declare resolution: When service is confirmed restored, IC formally closes the incident
- Hand off: If incident spans shifts, IC formally hands off to a new IC with a verbal and written briefing
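The communication cadence is the first thing to slip once diagnosis gets absorbing, so some teams automate the reminder. Below is a minimal sketch that posts to a Slack incoming webhook; the webhook URL and message text are assumptions.
import time
import requests

# Assumption: an incoming-webhook URL pointing at the war-room channel
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/replace-me"

def remind_ic(severity="SEV1", interval_minutes=15):
    # Ping the war room on the severity's cadence until the process is stopped
    while True:
        time.sleep(interval_minutes * 60)
        requests.post(WEBHOOK_URL, timeout=10, json={
            "text": f":alarm_clock: {severity} cadence check: IC, post a status update "
                    f"and refresh the status page (every {interval_minutes} min)."
        })

# remind_ic("SEV1", 15)   # SEV1: every 15 minutes
# remind_ic("SEV2", 30)   # SEV2: every 30 minutes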
On-Call Best Practices
PagerDuty Escalation Policy (Terraform Config)
# pagerduty-terraform/main.tf
resource "pagerduty_escalation_policy" "payment_api" {
name = "Payment API Escalation Policy"
num_loops = 3 # Try full escalation 3 times before giving up
rule {
escalation_delay_in_minutes = 5 # Page primary. After 5 min no-ack → next rule
target {
type = "schedule_reference"
id = pagerduty_schedule.primary_oncall.id
}
}
rule {
escalation_delay_in_minutes = 10 # Page secondary. After 10 min → next rule
target {
type = "schedule_reference"
id = pagerduty_schedule.secondary_oncall.id
}
}
rule {
escalation_delay_in_minutes = 15 # Page engineering lead
target {
type = "user_reference"
id = pagerduty_user.engineering_lead.id
}
}
}
# On-call schedule: follow-the-sun rotation
resource "pagerduty_schedule" "primary_oncall" {
name = "Payment API Primary On-Call"
time_zone = "Asia/Ho_Chi_Minh"
layer {
name = "APAC Team"
start = "2025-01-01T00:00:00+07:00"
rotation_turn_length_seconds = 604800 # 1 week
users = [
pagerduty_user.engineer_1.id,
pagerduty_user.engineer_2.id,
pagerduty_user.engineer_3.id,
pagerduty_user.engineer_4.id,
]
restriction {
type = "weekly_restriction"
duration_seconds = 57600 # 16 hours (8AM - midnight)
start_time_of_day = "08:00:00"
start_day_of_week = 1 # Monday
}
}
}
# Alert routing: route by service + severity
resource "pagerduty_service" "payment_api" {
name = "Payment API"
escalation_policy = pagerduty_escalation_policy.payment_api.id
alert_creation = "create_alerts_and_incidents"
alert_grouping = "time"
alert_grouping_timeout = 300 # Group alerts within 5-minute window
incident_urgency_rule {
type = "use_support_hours"
during_support_hours {
type = "constant"
urgency = "high"
}
outside_support_hours {
type = "constant"
urgency = "low" # Low urgency = no phone call, only push notification
}
}
support_hours {
type = "fixed_time_per_day"
time_zone = "Asia/Ho_Chi_Minh"
start_time = "08:00:00"
end_time = "20:00:00"
days_of_week = [1, 2, 3, 4, 5] # Monday - Friday
}
}
Alert Fatigue Reduction
- Alert on symptoms, not causes: "Users are seeing errors" > "CPU is at 80%"
- Every alert must be actionable: If the on-call can't do anything about it in the next hour, it should not wake them up
- Set minimum durations: Alert must fire for 5+ minutes before paging — transient spikes should not wake anyone
- Weekly alert review: Any alert that fired more than 3 times in a week without causing real user impact should be silenced or have its threshold raised
- Alert categorization: Page (immediate action needed, call on-call), Ticket (needs attention in 24h, create Jira), Log (informational, no action)
- Track MTTA (Mean Time to Acknowledge): If MTTA is consistently >5 minutes, alert routing or severity is wrong
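MTTA tracking needs nothing more than the alert and acknowledgment timestamps already recorded by the paging tool. Below is a minimal sketch that computes it from exported timestamp pairs; the input format is an assumption.
from datetime import datetime
from statistics import mean

# Assumption: (alert_time, ack_time) pairs exported from the paging tool as ISO-8601
acks = [
    ("2025-02-28T14:32:00", "2025-02-28T14:37:00"),
    ("2025-02-27T03:11:40", "2025-02-27T03:15:55"),
]

def mtta_minutes(pairs):
    # Mean time from alert firing to on-call acknowledgment, in minutes
    return mean(
        (datetime.fromisoformat(ack) - datetime.fromisoformat(alert)).total_seconds() / 60
        for alert, ack in pairs
    )

m = mtta_minutes(acks)
print(f"MTTA: {m:.1f} min")
if m > 5:
    print("MTTA above the 5-minute SEV1 target: review alert routing and urgency settings")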
Incident Communication
Status Page Protocol (Statuspage.io)
# Statuspage API integration — auto-update via Lambda/Cloud Function
import requests
import os
STATUSPAGE_API_KEY = os.environ['STATUSPAGE_API_KEY']
PAGE_ID = os.environ['STATUSPAGE_PAGE_ID']
COMPONENT_ID = os.environ['PAYMENT_COMPONENT_ID']
def create_incident(title, body, component_status='degraded_performance'):
    """component_status: operational | degraded_performance | partial_outage | major_outage"""
    response = requests.post(
        f'https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents',
        headers={'Authorization': f'OAuth {STATUSPAGE_API_KEY}'},
        json={
            'incident': {
                'name': title,
                'status': 'investigating',  # investigating | identified | monitoring | resolved
                'body': body,
                'components': {COMPONENT_ID: component_status},
                'component_ids': [COMPONENT_ID],
                'deliver_notifications': True,
            }
        }
    )
    response.raise_for_status()  # fail loudly if the status page update itself failed
    incident_id = response.json()['id']
    print(f"Incident created: https://status.company.com/incidents/{incident_id}")
    return incident_id

def update_incident(incident_id, status, body):
    response = requests.patch(
        f'https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents/{incident_id}',
        headers={'Authorization': f'OAuth {STATUSPAGE_API_KEY}'},
        json={
            'incident': {
                'status': status,
                'body': body,
                'components': {COMPONENT_ID: 'operational' if status == 'resolved' else 'degraded_performance'},
            }
        }
    )
    response.raise_for_status()

# Usage during incident:
incident_id = create_incident(
    title='Payment Processing Degradation',
    body='We are investigating reports of payment failures. Our team is actively working on this.',
    component_status='partial_outage'
)

# 15 minutes later:
update_incident(incident_id, 'identified',
                'We have identified an issue with our payment processor integration. '
                'We are implementing a fix and expect resolution within 30 minutes.')

# On resolution:
update_incident(incident_id, 'resolved',
                'This incident has been resolved. Payment processing is fully operational. '
                'We apologize for any inconvenience.')
Slack War-Room Protocol
# War-room channel convention: #inc-YYYYMMDD-short-description
# Example: #inc-20250228-payment-outage
## Pinned message template (post immediately on incident declaration):
---
:rotating_light: *INCIDENT DECLARED — SEV1* :rotating_light:
*Time:* 14:32 UTC
*Incident Commander:* @nguyenvan.a
*Primary Responder:* @tran.b
*Scribe:* @le.c
*Impact:* Payment processing failing for ~30% of users
*Status:* Investigating
*Status Page:* https://status.company.com/incidents/xyz
*Next Update:* 14:47 UTC (15 min)
---
## Communication cadence (SEV1):
# T+0: "Incident declared. Investigating. IC is @nguyenvan.a"
# T+15: "Update: Identified high error rate on payment-api pods. Checking DB."
# T+30: "Update: DB connection pool saturation confirmed. Scaling DB connections."
# T+45: "Update: DB scaled. Error rate dropping. Monitoring."
# T+60: "RESOLVED: Full service restored at 15:32 UTC. Post-mortem scheduled."
## War-room rules:
# 1. Only IC and designated roles speak — no free-for-all
# 2. No blame, no "I told you so" — solve first, learn after
# 3. All decisions go through IC
# 4. Use threads for sub-investigations — keep main channel clean
# 5. Record every action taken in chronological order (scribe's job)
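Creating the channel and posting the pinned template is mechanical, and scripting it keeps the IC from typing boilerplate under pressure. Below is a minimal sketch against the Slack Web API (conversations.create, chat.postMessage, pins.add); the bot token, scopes, and argument values are assumptions.
import os
from datetime import datetime, timezone
import requests

SLACK_TOKEN = os.environ["SLACK_BOT_TOKEN"]  # assumed: bot token with channels:manage, chat:write, pins:write
HEADERS = {"Authorization": f"Bearer {SLACK_TOKEN}"}

def open_war_room(short_desc, severity, ic, impact, status_url):
    # Create #inc-YYYYMMDD-<short-desc>, post the pinned template, and pin it
    now = datetime.now(timezone.utc)
    name = f"inc-{now:%Y%m%d}-{short_desc}"

    resp = requests.post("https://slack.com/api/conversations.create",
                         headers=HEADERS, json={"name": name}).json()
    channel_id = resp["channel"]["id"]

    template = (
        f":rotating_light: *INCIDENT DECLARED — {severity}* :rotating_light:\n"
        f"*Time:* {now:%H:%M} UTC\n"
        f"*Incident Commander:* {ic}\n"
        f"*Impact:* {impact}\n"
        f"*Status:* Investigating\n"
        f"*Status Page:* {status_url}\n"
        f"*Next Update:* in 15 min"
    )
    msg = requests.post("https://slack.com/api/chat.postMessage",
                        headers=HEADERS, json={"channel": channel_id, "text": template}).json()
    requests.post("https://slack.com/api/pins.add",
                  headers=HEADERS, json={"channel": channel_id, "timestamp": msg["ts"]})
    return channel_id

# open_war_room("payment-outage", "SEV1", "@nguyenvan.a",
#               "Payment processing failing for ~30% of users",
#               "https://status.company.com/incidents/xyz")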
Blameless Post-Mortem
A blameless post-mortem is a structured learning exercise, not a tribunal. The goal is to understand what happened and why, and to build systemic improvements that prevent recurrence. Blame is counterproductive: engineers who fear blame hide problems, avoid risk, and stop reporting near-misses.
Post-Mortem Template
# Post-Mortem: [Brief Description of Incident]
# Date: 2025-02-28
# Authors: [Names]
# Severity: SEV1
# Status: DRAFT / IN REVIEW / PUBLISHED
## Summary
A 47-minute payment processing outage affected approximately 28% of users between
14:32 and 15:19 UTC on February 28, 2025. The root cause was database connection
pool exhaustion triggered by a deployment that introduced a connection leak.
## Impact
- Users affected: ~28% (estimated 14,000 users)
- Failed payment attempts: 2,847
- Estimated revenue impact: $47,200
- Error budget consumed: 0.22% of monthly budget (43% of weekly allowance)
- SLO: Availability SLO breached during incident window
## Timeline (all times UTC)
| Time | Event |
|-------|-------|
| 14:28 | Deployment v2.3.1 pushed to production (payment-api) |
| 14:31 | Error rate begins rising (not yet alert threshold) |
| 14:32 | PagerDuty alert: HighErrorRate > 5% fires. On-call paged. |
| 14:37 | On-call @tran.b acknowledges. Incident declared SEV1. |
| 14:38 | IC @nguyenvan.a appointed. War room opened: #inc-20250228-payment-outage |
| 14:42 | Status page updated: "Investigating payment processing issues" |
| 14:48 | Database connection pool exhaustion identified via Grafana |
| 14:52 | Decision: rollback to v2.3.0 |
| 14:55 | Rollback completed. Error rate begins dropping. |
| 15:10 | Error rate below 0.1%. Monitoring period begins. |
| 15:19 | All metrics nominal. Incident resolved. Status page updated. |
## Root Cause
A code change in v2.3.1 introduced a database connection leak in the payment
transaction retry logic. When a payment timed out, the retry path failed to
close the database connection before creating a new one. Under load, this
exhausted the 50-connection pool within 4 minutes of deployment.
## Contributing Factors
1. **Load testing gap:** The staging environment uses a connection pool of 10, which obscured the leak:
   the small pool exhausted before load tests completed, and the failures were written off as a known
   staging capacity limit rather than investigated as a leak.
2. **Monitoring gap:** We monitor connection pool utilization but had no alert for
"connection acquisition time > threshold" as an early warning.
3. **Code review:** The connection leak was visible in the reviewed diff but subtle
   (a missing `defer conn.Close()` in one error path); no automated linter caught it.
4. **Deployment timing:** Deployment occurred at 14:28, a historically high-traffic
period. Pre-deployment traffic checks were not performed.
## What Went Well
- PagerDuty alert fired within 1 minute of error rate exceeding threshold
- Incident Commander assumed control within 5 minutes
- Rollback decision was made decisively at the 20-minute mark
- Status page updated promptly; stakeholders were kept informed
## Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| Add linter rule to detect missing defer conn.Close() | Engineering | P1 | 2025-03-07 |
| Update staging to match production connection pool size | Platform | P1 | 2025-03-05 |
| Add alert: DB connection acquisition time > 500ms | SRE | P1 | 2025-03-05 |
| Add pre-deployment traffic check step to CI/CD pipeline | DevOps | P2 | 2025-03-14 |
| Add connection pool metrics to deployment rollout dashboard | SRE | P2 | 2025-03-14 |
## Facilitation Anti-Patterns to Avoid
# ❌ "Why did @tran.b not catch this in code review?"
# ❌ "This should have been obvious"
# ❌ "We need to be more careful" (too vague to act on)
# ✓ "What process change would have caught this?"
# ✓ "How do we make this category of error impossible or immediately visible?"
Incident Metrics
Mean Time Metrics
| Metric | Definition | Formula | Target |
|---|---|---|---|
| MTTD | Mean Time To Detect | Time from incident start to first alert | <2 min for SEV1, <10 min for SEV2 |
| MTTA | Mean Time To Acknowledge | Time from alert to on-call acknowledgment | <5 min for SEV1, <15 min for SEV2 |
| MTTR | Mean Time To Recover/Resolve | Time from incident start to service restoration | <30 min for SEV1, <2h for SEV2 |
| MTTF | Mean Time To Failure | Average uptime from recovery of one incident to the start of the next | Increasing quarter-over-quarter |
| MTBF | Mean Time Between Failures | MTTF + MTTR (start of one incident to start of the next) | Increasing quarter-over-quarter (higher = fewer incidents) |
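All of these can be computed from an incident log that records when each incident started, was detected, acknowledged, and resolved. Below is a minimal sketch over an assumed in-memory record, using the post-mortem timeline above as sample data.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    started_at: datetime       # impact began (often back-filled during the post-mortem)
    detected_at: datetime      # first alert fired
    acknowledged_at: datetime  # on-call acknowledged
    resolved_at: datetime      # service restored

def minutes(a, b):
    return (b - a).total_seconds() / 60

def summarize(incidents):
    return {
        "MTTD_min": mean(minutes(i.started_at, i.detected_at) for i in incidents),
        "MTTA_min": mean(minutes(i.detected_at, i.acknowledged_at) for i in incidents),
        "MTTR_min": mean(minutes(i.started_at, i.resolved_at) for i in incidents),
    }

# Sample data taken from the post-mortem timeline above
ts = lambda hhmm: datetime.fromisoformat(f"2025-02-28T{hhmm}:00")
print(summarize([Incident(ts("14:31"), ts("14:32"), ts("14:37"), ts("15:19"))]))
# {'MTTD_min': 1.0, 'MTTA_min': 5.0, 'MTTR_min': 48.0}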
Incident Metrics Dashboard (PromQL)
# These queries work with Prometheus/Alertmanager when incidents are also tracked as metrics

# 1. Incident frequency by severity (last 30 days)
# Note: count_over_time counts samples, so this is proportional to total firing
# time rather than the number of distinct incidents; treat it as a rough proxy.
count_over_time(ALERTS{alertstate="firing", severity="critical"}[30d])

# 2. MTTR tracking via incident duration
# If you push incident open/close timestamps to Prometheus as gauges:
#   incident_open_time{severity="1"}  = Unix timestamp when incident opened
#   incident_close_time{severity="1"} = Unix timestamp when resolved
avg by (severity) (incident_close_time - incident_open_time) / 60
# Result in minutes

# 3. Error budget burn from incidents
# (actual downtime minutes / total minutes) / (1 - SLO)
# Assumes a counter incident_downtime_minutes_total that is incremented by each
# incident's duration on resolution; increase() then gives downtime in the window.
(
  sum(increase(incident_downtime_minutes_total[30d]))
  / (30 * 24 * 60)
) / (1 - 0.999)

# 4. Incident trend (are we improving?)
# Compare this month's critical firing time to last month's using offset
count_over_time(ALERTS{severity="critical", alertstate="firing"}[30d])
/ count_over_time(ALERTS{severity="critical", alertstate="firing"}[30d] offset 30d)
# Ratio < 1 means less critical firing time this month than last month: good

# 5. On-call load: alerts per person per week
# ALERTS carries no per-person label; this assumes your alert rules attach a
# team (or rotation) label that maps onto the on-call schedule.
count by (team) (
  count_over_time(ALERTS{alertstate="firing"}[7d])
)
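Query 2 assumes the open/close timestamps actually reach Prometheus. One way to get them there is the Pushgateway, pushed by whatever tooling opens and closes incidents. Below is a minimal sketch using prometheus_client; the Pushgateway address and job name are assumptions.
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway:9091"  # assumed address

def record_incident(severity, opened_ts, closed_ts=None):
    # Push incident open/close timestamps so the PromQL above can compute duration
    registry = CollectorRegistry()
    opened = Gauge("incident_open_time", "Unix timestamp when the incident opened",
                   ["severity"], registry=registry)
    opened.labels(severity=severity).set(opened_ts)
    if closed_ts is not None:
        closed = Gauge("incident_close_time", "Unix timestamp when the incident was resolved",
                       ["severity"], registry=registry)
        closed.labels(severity=severity).set(closed_ts)
    # A grouping key or per-incident job would avoid overwriting earlier incidents;
    # a single job is used here for brevity.
    push_to_gateway(PUSHGATEWAY, job="incident_tracker", registry=registry)

# record_incident("1", opened_ts=time.time())                          # on declaration
# record_incident("1", opened_ts=1740753120, closed_ts=time.time())    # on resolution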
Action Item Follow-Through
Post-mortems are only valuable if the action items are actually completed. Common failure modes: action items are created but never prioritized, no owner is assigned, or the due date passes without follow-up.
Action Item Tracking Process
- Every action item must have: Owner (specific person, not team), priority (P1/P2/P3), due date, and a Jira/GitHub ticket number
- P1 items (prevent recurrence of the exact same incident) are added to the current sprint within 24 hours
- P2 items are added to the next sprint during planning
- Monthly reliability review: Review all open post-mortem action items. Escalate any overdue P1 items to engineering leadership (see the query sketch after this list)
- Closure verification: Action item is not closed until the SRE team confirms the fix is in production and the risk is mitigated
- Effectiveness check: 30 days after a post-mortem, review whether the actions prevented recurrence
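The monthly review is easier when the overdue check is a query rather than a spreadsheet crawl. Below is a minimal sketch against the Jira Cloud search API; the "post-mortem" label convention, the P1 = Highest priority mapping, and the credentials are assumptions for illustration.
import os
import requests

JIRA_BASE = os.environ["JIRA_BASE_URL"]   # e.g. https://yourcompany.atlassian.net
JIRA_AUTH = (os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"])

def overdue_p1_action_items():
    # Post-mortem P1 action items that are past due and not done
    jql = ('labels = "post-mortem" AND priority = Highest '
           'AND duedate < now() AND statusCategory != Done')
    resp = requests.get(f"{JIRA_BASE}/rest/api/2/search",
                        params={"jql": jql, "fields": "summary,assignee,duedate"},
                        auth=JIRA_AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()["issues"]

for issue in overdue_p1_action_items():
    fields = issue["fields"]
    assignee = (fields.get("assignee") or {}).get("displayName", "UNASSIGNED")
    print(f'{issue["key"]}: {fields["summary"]} (owner: {assignee}, due: {fields["duedate"]})')
    # Escalate these to engineering leadership per the process above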