Incident Response Playbooks

Operational Readiness — Structured playbooks, runbooks, and templates for responding to infrastructure and security incidents with speed and consistency.

Runbook vs Playbook vs Postmortem

These three artefacts serve different purposes in the incident management lifecycle and are frequently confused. Understanding the distinction helps teams build the right documents for the right situations.

Runbook

A step-by-step, procedural guide for executing a specific, well-understood operational task. Runbooks are prescriptive: they assume the operator knows that they are in the right scenario and simply need to follow the numbered steps. Examples include restarting a service, rotating a certificate, or draining a Kubernetes node. Runbooks minimise cognitive load during high-stress situations.

Playbook

A strategic, decision-tree-based guide for responding to a class of incidents. Playbooks contain multiple runbooks, escalation paths, and conditional branches. They guide the responder through diagnosis and triage before pointing to the appropriate runbook. A security breach playbook, for example, covers detection, containment, eradication, and recovery phases — each phase may invoke specific runbooks.

Postmortem (Retrospective)

A blameless, structured retrospective document written after an incident is resolved. Its purpose is to capture what happened, why it happened, what the impact was, how it was mitigated, and — most importantly — what action items will prevent recurrence. Postmortems feed directly back into improving runbooks and playbooks.

Runbook Structure Template

Every runbook in this portfolio follows the same structure. This consistency allows any on-call engineer to pick up any runbook cold and execute it confidently.

# Runbook: [Short Descriptive Title]

## Metadata
- **Runbook ID**: RB-NNNN
- **Severity**: SEV1 | SEV2 | SEV3 | SEV4
- **Owner**: [Team Name]
- **Last Updated**: YYYY-MM-DD
- **Reviewed By**: [Name / Handle]

## Trigger Condition
Describe exactly what alert, symptom, or condition brings an operator to this runbook.
Example: PagerDuty alert "PodCrashLoopBackOff" fires for a pod in namespace `production`.

## Impact Assessment
- **User Impact**: [Describe visible customer/user impact]
- **Business Impact**: [Revenue, SLA, compliance risk]
- **Blast Radius**: [Which services / regions / tenants are affected]

## Immediate Actions (Time-bound: first 5 minutes)
1. Acknowledge the alert in PagerDuty / OpsGenie.
2. Post in #incidents Slack channel: "Investigating [alert name] — [your handle]".
3. Open the relevant dashboard: [link].
4. [First containment step].

## Diagnosis Steps
1. [Command or check to confirm the issue].
2. [Second diagnostic step with expected output].
3. [Branching: if X, go to Resolution A; if Y, go to Resolution B].

## Resolution Steps
### Resolution A: [Cause A]
1. [Step 1]
2. [Step 2]
3. [Verification step]

### Resolution B: [Cause B]
1. [Step 1]
2. [Verification step]

## Escalation Path
- **L1 (On-call SRE)**: Attempt resolution within 15 minutes.
- **L2 (Senior SRE / Tech Lead)**: Escalate if unresolved after 15 minutes or if SEV1.
- **L3 (Engineering Manager / Vendor Support)**: Escalate if L2 cannot resolve within 30 minutes.
- **Stakeholder Notification**: Notify [Product Manager / CTO] for SEV1/SEV2.

## Communication Template
See Section: Incident Communication Templates below.

## Post-Incident Actions
- [ ] Resolve the alert in PagerDuty / OpsGenie.
- [ ] Update the incident status page to "Resolved".
- [ ] Schedule postmortem within 48 hours (SEV1/SEV2).
- [ ] File action items as tickets in Jira.
- [ ] Update this runbook if steps were inaccurate or incomplete.
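
The runbook template above uses the `PodCrashLoopBackOff` alert in namespace `production` as its example trigger. As an illustration of how the Diagnosis Steps can be scripted rather than run by hand, the following Python sketch gathers the evidence an operator would paste into the incident thread. It assumes `kubectl` is installed and pointed at the affected cluster; the namespace and output handling are illustrative and should be adapted before use.

```python
# Illustrative sketch of the Diagnosis Steps for the example trigger above
# (PodCrashLoopBackOff in namespace `production`). Namespace and output
# handling are assumptions; adapt to your cluster before use.
import json
import subprocess

NAMESPACE = "production"  # assumed from the example trigger condition


def kubectl(*args: str) -> str:
    """Run a kubectl command and return its stdout, raising on failure."""
    return subprocess.run(
        ["kubectl", *args], capture_output=True, text=True, check=True
    ).stdout


def find_crashlooping_pods() -> list[str]:
    """Diagnosis step 1: confirm which pods are in CrashLoopBackOff."""
    pods = json.loads(kubectl("get", "pods", "-n", NAMESPACE, "-o", "json"))
    crashing = []
    for pod in pods["items"]:
        for status in pod["status"].get("containerStatuses", []):
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                crashing.append(pod["metadata"]["name"])
    return crashing


def collect_evidence(pod: str) -> None:
    """Diagnosis step 2: gather recent events and the crashed container's logs."""
    print(kubectl("describe", "pod", pod, "-n", NAMESPACE))
    # --previous shows the logs of the container instance that just crashed
    print(kubectl("logs", pod, "-n", NAMESPACE, "--previous", "--tail=50"))


if __name__ == "__main__":
    for name in find_crashlooping_pods():
        collect_evidence(name)
```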

Incident Severity Matrix

All incidents must be classified immediately upon detection. The severity level determines response time, stakeholder notification, and communication channel.

| Severity | Criteria | Response Time | Stakeholders | Communication Channel |
|----------|----------|---------------|--------------|------------------------|
| SEV1 — Critical | Full production outage; data loss or breach; all users affected; revenue impact exceeding agreed threshold per hour | Acknowledge within 5 min; resolution target 1 hour | CTO, VP Engineering, Product Manager, On-call SRE, Security (if breach) | #incident-sev1, status page, customer email, executive bridge call |
| SEV2 — High | Major feature degraded; significant portion of users affected; no known workaround; payment flows impaired | Acknowledge within 15 min; resolution target 4 hours | Engineering Manager, Product Manager, On-call SRE | #incident-sev2, status page update, internal Slack |
| SEV3 — Medium | Non-critical feature degraded; minority of users affected; workaround available; performance degradation within SLO | Acknowledge within 1 hour; resolution target next business day | Team Lead, On-call SRE | #incidents, internal Slack thread |
| SEV4 — Low | Cosmetic bug; single-user issue; no service impact; informational alert requiring investigation | Acknowledge within 4 hours; resolution target next sprint | Assigned engineer | Jira ticket, team standup |
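
Where paging and routing are automated, the matrix above can be encoded as data so that tooling and documentation stay in sync. A minimal Python sketch follows; the field names, the approximation of "next business day" and "next sprint" as fixed durations, and the lookup helper are assumptions rather than an existing internal API.

```python
# Minimal sketch: the severity matrix encoded as data, so paging and routing
# logic can reference one source of truth. Field names and the helper are
# illustrative assumptions, not an existing internal API.
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    ack_window: timedelta         # time allowed to acknowledge the alert
    resolution_target: timedelta  # target time to resolution (approximated below)
    stakeholders: tuple[str, ...]
    channel: str


SEVERITY_MATRIX = {
    "SEV1": SeverityPolicy(timedelta(minutes=5), timedelta(hours=1),
                           ("CTO", "VP Engineering", "Product Manager",
                            "On-call SRE", "Security (if breach)"),
                           "#incident-sev1"),
    "SEV2": SeverityPolicy(timedelta(minutes=15), timedelta(hours=4),
                           ("Engineering Manager", "Product Manager", "On-call SRE"),
                           "#incident-sev2"),
    "SEV3": SeverityPolicy(timedelta(hours=1), timedelta(days=1),   # "next business day"
                           ("Team Lead", "On-call SRE"), "#incidents"),
    "SEV4": SeverityPolicy(timedelta(hours=4), timedelta(days=14),  # "next sprint"
                           ("Assigned engineer",), "Jira ticket"),
}


def ack_deadline_minutes(severity: str) -> float:
    """Return the acknowledgement window in minutes for a given severity."""
    return SEVERITY_MATRIX[severity].ack_window.total_seconds() / 60
```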

On-Call Decision Tree

When an alert fires, the on-call engineer follows this decision tree before diving into diagnosis.

Step 1 — Alert Fires

An alert arrives via PagerDuty / OpsGenie. Acknowledge within the SLA window defined by severity. Open the alert and read the full context: affected service, region, metric value, alert rule.

Step 2 — Is it Actionable?

Ask: "Does this alert require a human to do something right now?" If No (e.g., flapping metric that auto-resolved, or a test alert), resolve it and file a ticket to tune the alert rule. If Yes, proceed to Step 3.

Step 3 — Is it a Known Issue?

Check the runbook index, the #incidents Slack channel history, and the ongoing incidents dashboard. If a matching runbook exists, jump directly to it. If this is a known ongoing incident already being handled, join the bridge and support the incident commander. If unknown, proceed to Step 4.

Step 4 — Apply Runbook

Locate the runbook that best matches the alert. Follow each numbered step precisely. Do not skip steps. Document every action taken in the incident Slack thread with timestamps. If the runbook resolves the issue, proceed to post-incident actions.

Step 5 — Escalate

If the runbook does not resolve the issue within the time defined by severity, or if you encounter a situation not covered by any runbook, escalate immediately. Do not wait until the SLA is breached. When escalating, provide: incident severity, timeline of events, steps already taken, current system state, and your root-cause hypothesis.
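
For teams that automate parts of triage, the decision tree can be expressed directly in code. The sketch below mirrors the five steps; every helper function is a hypothetical stub standing in for a human judgement or an internal integration (PagerDuty, the runbook index, Slack), not a real API.

```python
# Sketch of the on-call decision tree as code. Every helper below is a
# hypothetical stub; replace with real integrations or human judgement.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    service: str
    severity: str


# --- hypothetical stubs, not real APIs ------------------------------------
def acknowledge(alert: Alert) -> None: ...
def is_actionable(alert: Alert) -> bool: return True
def ongoing_incident_for(alert: Alert) -> Optional[str]: return None
def find_runbook(alert: Alert) -> Optional[str]: return None
def execute_runbook(runbook: str, alert: Alert) -> bool: return False
def file_alert_tuning_ticket(alert: Alert) -> None: ...
def join_bridge(incident: str) -> None: ...
def escalate(alert: Alert, context: str) -> None: ...
# ---------------------------------------------------------------------------


def handle_alert(alert: Alert) -> str:
    acknowledge(alert)                               # Step 1: ack within the SLA window
    if not is_actionable(alert):                     # Step 2: auto-resolved or test alert
        file_alert_tuning_ticket(alert)
        return "resolved-not-actionable"
    if (incident := ongoing_incident_for(alert)):    # Step 3: already being handled
        join_bridge(incident)
        return "joined-existing-incident"
    runbook = find_runbook(alert)                    # Step 3/4: matching runbook exists
    if runbook and execute_runbook(runbook, alert):
        return "resolved-via-runbook"
    escalate(alert, context="severity, timeline, steps taken, "
                            "current state, root-cause hypothesis")
    return "escalated"                               # Step 5: escalate before SLA breach
```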

Incident Communication Templates

Consistent, timely communication reduces stakeholder anxiety during incidents. Use these templates verbatim and fill in the placeholders.

Status Page Update Template

--- INVESTIGATING ---
Title: [Service Name] Degraded Performance
Date: YYYY-MM-DD HH:MM UTC
Status: Investigating

We are currently investigating reports of [brief description of impact, e.g.,
"elevated error rates on the API"]. Our engineering team is actively working
to diagnose the issue.

Affected services: [list services]
Affected regions: [list regions or "All regions"]

We will provide an update within [30 minutes | 1 hour].

--- UPDATE ---
Title: [Service Name] Degraded Performance — Update #1
Date: YYYY-MM-DD HH:MM UTC
Status: Identified

We have identified the root cause: [brief, non-technical description].
Our team is implementing a fix. We expect resolution by [estimated time] UTC.

--- RESOLVED ---
Title: [Service Name] — Resolved
Date: YYYY-MM-DD HH:MM UTC
Status: Resolved

This incident has been resolved. [Service name] is operating normally.

Root cause: [one sentence]
Duration: [HH:MM]
Impact: [number of users or percentage affected]

We apologise for the inconvenience. A full postmortem will be published within 5 business days.
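
Teams that publish many updates often generate them from code so the wording stays consistent with the template. A minimal Python sketch for the "Investigating" update is shown below; the function name and parameters are illustrative, and posting to an actual status page provider would replace the final print with the provider's API call.

```python
# Minimal sketch: filling the "Investigating" status page template from code.
# Function and parameter names are illustrative assumptions.
from datetime import datetime, timezone
from textwrap import dedent


def render_investigating_update(service: str, impact: str,
                                affected_services: list[str],
                                affected_regions: str,
                                next_update: str = "30 minutes") -> str:
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return dedent(f"""\
        Title: {service} Degraded Performance
        Date: {now}
        Status: Investigating

        We are currently investigating reports of {impact}. Our engineering
        team is actively working to diagnose the issue.

        Affected services: {", ".join(affected_services)}
        Affected regions: {affected_regions}

        We will provide an update within {next_update}.""")


if __name__ == "__main__":
    print(render_investigating_update(
        "Payments API", "elevated error rates on the API",
        ["Payments API", "Checkout"], "All regions"))
```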

Internal Slack Message Template

:rotating_light: *INCIDENT DECLARED* :rotating_light:
*Severity*: SEV[1|2|3]
*Service*: [Affected service name]
*Summary*: [One sentence describing what is broken]
*Impact*: [Who / what is affected and how many users]
*Started*: YYYY-MM-DD HH:MM UTC
*Incident Commander*: @[handle]
*Bridge*: [Zoom/Meet link or "async in this thread"]
*Runbook*: [link to runbook]
*Status page*: [link]

cc: @[engineering-manager] @[product-manager]

*Updates will be posted here every 30 minutes until resolved.*
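
Posting this message can be automated from the tool that declares the incident. The sketch below uses the official Slack Python client (`slack_sdk`); the bot token environment variable, channel naming convention, and field values are assumptions for illustration.

```python
# Sketch: posting the incident declaration message with the official Slack
# Python client (slack_sdk). Token env var, channel naming, and field values
# are illustrative assumptions; mrkdwn formatting mirrors the template above.
import os
from datetime import datetime, timezone

from slack_sdk import WebClient


def declare_incident(severity: str, service: str, summary: str,
                     impact: str, commander: str, runbook_url: str) -> None:
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # assumed env var
    started = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    text = (
        f":rotating_light: *INCIDENT DECLARED* :rotating_light:\n"
        f"*Severity*: {severity}\n"
        f"*Service*: {service}\n"
        f"*Summary*: {summary}\n"
        f"*Impact*: {impact}\n"
        f"*Started*: {started}\n"
        f"*Incident Commander*: @{commander}\n"
        f"*Runbook*: {runbook_url}\n\n"
        f"*Updates will be posted here every 30 minutes until resolved.*"
    )
    # chat_postMessage is the standard slack_sdk call for posting to a channel
    client.chat_postMessage(channel=f"#incident-{severity.lower()}", text=text)


# Example:
# declare_incident("SEV2", "checkout-api", "Elevated 5xx on /pay",
#                  "~20% of checkout attempts failing", "alice",
#                  "https://runbooks.example.com/RB-0042")
```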

Customer-Facing Email Template

Subject: [ACTION REQUIRED / Service Update] Issue affecting [Product Name] — YYYY-MM-DD

Dear [Customer Name / "Valued Customer"],

We are writing to inform you that we are currently experiencing an issue
affecting [Product Name / specific feature].

WHAT HAPPENED
[1–2 sentences describing the issue in plain, non-technical language.]

IMPACT TO YOU
[Describe what the customer may have experienced: errors, slow responses,
unavailability. Be specific about timeframe.]

WHAT WE ARE DOING
Our engineering team identified the root cause at [HH:MM UTC] and has
implemented a fix. [Service] has been fully restored as of [HH:MM UTC].

WHAT YOU SHOULD DO
[If action is needed: "Please clear your browser cache and log in again."]
[If no action needed: "No action is required on your part."]

We sincerely apologise for the disruption this may have caused. We are
conducting a full review to prevent recurrence.

If you have any questions or concerns, please contact support at [email/link].

Sincerely,
[Name]
[Title] — [Company Name]
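
If notifications are sent from internal tooling rather than a support platform, the template can be rendered and dispatched with the Python standard library. The sketch below uses `smtplib`; the SMTP relay, sender address, and placeholder wording are assumptions, and most teams would route this through their support or CRM system instead.

```python
# Sketch: rendering and sending the customer-facing email with the standard
# library. SMTP host, credentials, and sender address are assumptions.
import smtplib
from email.message import EmailMessage


def send_incident_email(recipient: str, product: str, summary: str,
                        impact: str, action_needed: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"[Service Update] Issue affecting {product}"
    msg["From"] = "status@example.com"            # assumed sender address
    msg["To"] = recipient
    msg.set_content(
        f"Dear Valued Customer,\n\n"
        f"We are writing to inform you that we recently experienced an issue "
        f"affecting {product}.\n\n"
        f"WHAT HAPPENED\n{summary}\n\n"
        f"IMPACT TO YOU\n{impact}\n\n"
        f"WHAT YOU SHOULD DO\n{action_needed}\n\n"
        f"We sincerely apologise for the disruption this may have caused.\n"
    )
    with smtplib.SMTP("smtp.example.com", 587) as smtp:   # assumed relay
        smtp.starttls()
        smtp.send_message(msg)
```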

Postmortem / Retrospective Template

Postmortems must be blameless. The goal is to improve systems and processes, not to assign fault to individuals.

# Postmortem: [Incident Title]

## Incident Summary
- **Incident ID**: INC-NNNN
- **Severity**: SEV[1|2|3]
- **Date**: YYYY-MM-DD
- **Duration**: [HH:MM] (HH:MM UTC — HH:MM UTC)
- **Incident Commander**: [Name]
- **Authors**: [Names]
- **Review Date**: YYYY-MM-DD
- **Status**: Draft | In Review | Approved

## Executive Summary
[2–3 sentences: what broke, how long, what was the impact, and the single
root cause. Write this last, for a non-technical audience.]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM      | Monitoring alert fires: [alert name] |
| HH:MM      | On-call engineer acknowledges alert |
| HH:MM      | Incident declared SEV[N] |
| HH:MM      | [Diagnosis step / finding] |
| HH:MM      | Root cause identified: [brief description] |
| HH:MM      | Mitigation applied: [action taken] |
| HH:MM      | Service restored / incident resolved |
| HH:MM      | Incident closed, postmortem scheduled |

## Root Cause Analysis
### What happened?
[Detailed technical description of the failure.]

### Why did it happen? (5 Whys)
1. Why? — [First level cause]
2. Why? — [Second level cause]
3. Why? — [Third level cause]
4. Why? — [Fourth level cause]
5. Why? — [Root cause]

### Contributing factors
- [Factor 1: e.g., missing alert coverage on downstream service]
- [Factor 2: e.g., runbook was outdated]
- [Factor 3: e.g., deploy happened during peak traffic without feature flag]

## Impact
- **Users affected**: [number / percentage]
- **Duration of impact**: [HH:MM]
- **SLO breach**: [Yes/No — if yes, specify error budget consumed]
- **Revenue impact**: [estimate if applicable]
- **Data integrity**: [Was any data lost or corrupted? Yes/No]

## What Went Well
- [e.g., Monitoring detected the issue before customer reports]
- [e.g., Team mobilised quickly and communication was clear]
- [e.g., Rollback procedure worked as expected]

## What Could Be Improved
- [e.g., Alert threshold was too sensitive / not sensitive enough]
- [e.g., Runbook RB-0042 had incorrect commands]
- [e.g., No runbook existed for this scenario]

## Action Items
| Action | Owner | Due Date | Priority | Ticket |
|--------|-------|----------|----------|--------|
| [Preventive action] | @[handle] | YYYY-MM-DD | P1 | JIRA-NNN |
| [Detection improvement] | @[handle] | YYYY-MM-DD | P2 | JIRA-NNN |
| [Runbook update] | @[handle] | YYYY-MM-DD | P2 | JIRA-NNN |
| [Process improvement] | @[handle] | YYYY-MM-DD | P3 | JIRA-NNN |

Best Practice: Schedule the postmortem meeting within 48 hours of incident resolution for SEV1/SEV2. All engineers involved in the incident should attend. The postmortem document should be published internally within 5 business days and shared externally (in a summarised form) within 10 business days for SEV1 incidents.
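
Action items lose value quickly if nobody checks on them. One lightweight option is a periodic job that flags overdue postmortem tickets; the sketch below assumes Jira Cloud's REST API v2 search endpoint with API-token authentication, and the label and environment variable names are illustrative.

```python
# Sketch: flagging overdue postmortem action items in Jira so the Action
# Items table does not go stale. Assumes Jira Cloud REST API v2 search with
# basic auth (email + API token); label and env var names are illustrative.
import os

import requests

JIRA_BASE = os.environ["JIRA_BASE_URL"]   # e.g. https://yourorg.atlassian.net
AUTH = (os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"])


def overdue_action_items() -> list[dict]:
    jql = ('labels = "postmortem-action-item" '
           'AND statusCategory != Done AND duedate < now()')
    resp = requests.get(
        f"{JIRA_BASE}/rest/api/2/search",
        params={"jql": jql, "fields": "summary,assignee,duedate"},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["issues"]


if __name__ == "__main__":
    for issue in overdue_action_items():
        fields = issue["fields"]
        assignee = (fields.get("assignee") or {}).get("displayName", "unassigned")
        print(f'{issue["key"]}: {fields["summary"]} '
              f'(due {fields["duedate"]}, {assignee})')
```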