The Incident Response Playbook
A practitioner’s framework for responding to cloud outages — the five-phase lifecycle, how to classify severity, who does what, and the comms that keep customers informed. Adapted from the NIST and Google SRE incident models.
The five-phase lifecycle
Every incident moves through the same arc. The goal of each phase is fixed; only the tactics change with the failure.
- Phase 1: Detect (0–5 min)
  Confirm the signal is real and customer-impacting.
  - Validate alert against independent telemetry
  - Declare an incident and assign a severity
  - Open the incident channel and bridge
- Phase 2: Triage (5–15 min)
  Size the blast radius and stand up the response team.
  - Page the Incident Commander and on-call SMEs
  - Scope affected regions, services and customers
  - Post the first public status update
- Phase 3: Mitigate (15–60 min)
  Stop the bleeding: restore service before root-causing.
  - Apply the fastest safe mitigation (failover, rollback, drain)
  - Throttle or shed load to protect healthy capacity
  - Re-confirm impact is trending down
- Phase 4: Resolve (1–4 hrs)
  Return to full service and verify recovery.
  - Roll forward the durable fix
  - Validate SLOs are back within target
  - Stand down the response, hand off to owners
- Phase 5: Learn (24–72 hrs)
  Turn the incident into durable resilience.
  - Publish a blameless post-incident review
  - File and prioritise corrective actions
  - Share an external RCA where customers were impacted
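The phase windows above can be encoded as data so that tooling can tell which phase an incident would normally be in at a given point. The following is a minimal sketch, not part of the playbook itself. It assumes every window is measured in minutes from detection and treated as half-open, and the names `Phase`, `PHASES` and `expected_phase` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    start_min: int  # minutes since detection when the phase typically begins
    end_min: int    # minutes since detection when the phase typically ends


# Windows copied from the lifecycle above, converted to minutes.
# Assumption: all windows are measured from the moment of detection.
PHASES = [
    Phase("Detect", 0, 5),
    Phase("Triage", 5, 15),
    Phase("Mitigate", 15, 60),
    Phase("Resolve", 60, 240),
    Phase("Learn", 24 * 60, 72 * 60),
]


def expected_phase(elapsed_min: float) -> Optional[Phase]:
    """Return the phase whose window covers elapsed_min, or None in a gap."""
    for phase in PHASES:
        if phase.start_min <= elapsed_min < phase.end_min:
            return phase
    return None
```

An incident that is 30 minutes old would be expected to be in Mitigate. A result of `None` means the elapsed time falls between the end of Resolve and the start of Learn.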
Severity matrix
Classify the incident the moment it’s declared — severity drives who gets paged and how often you communicate.
| Level | Impact | Response |
|---|---|---|
| SEV-1 (Critical) | Multi-region outage or data-loss risk; revenue-critical path down. | All-hands, exec paged, 15-min status cadence. |
| SEV-2 (Major) | Single-region or single-service outage with broad customer impact. | IC + on-call SMEs, 30-min status cadence. |
| SEV-3 (Minor) | Degraded performance or partial feature impact with a workaround. | On-call owner, hourly updates, business hours. |
| SEV-4 (Low) | Cosmetic or internal-only; no customer-visible degradation. | Tracked as a normal ticket, no incident bridge. |
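One way to make the matrix actionable is to keep it as a lookup table that drives paging and update reminders. The sketch below is illustrative only. The impact flags passed to `classify` are hypothetical inputs, the rule that separates SEV-2 from SEV-3 by the presence of a workaround is an assumption drawn from the table wording, and the business-hours restriction on SEV-3 is not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ResponsePolicy:
    label: str
    responders: str
    update_interval_min: Optional[int]  # None: tracked as a ticket, no status cadence


# Rows copied from the severity matrix above.
POLICIES = {
    1: ResponsePolicy("Critical", "all hands, exec paged", 15),
    2: ResponsePolicy("Major", "IC + on-call SMEs", 30),
    3: ResponsePolicy("Minor", "on-call owner", 60),
    4: ResponsePolicy("Low", "normal ticket", None),
}


def classify(*, multi_region: bool, data_loss_risk: bool, revenue_path_down: bool,
             customer_visible: bool, has_workaround: bool) -> int:
    """Map hypothetical impact flags to a SEV level (1 is most severe)."""
    if multi_region or data_loss_risk or revenue_path_down:
        return 1
    if not customer_visible:
        return 4
    # Assumption: customer-visible impact with a workaround is SEV-3,
    # and without a workaround it is SEV-2.
    return 3 if has_workaround else 2


def next_update_due(sev: int, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if no cadence applies."""
    interval = POLICIES[sev].update_interval_min
    return None if interval is None else last_update + timedelta(minutes=interval)
```

Keeping the cadence next to the severity label means a change to the matrix only has to be made in one place.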
Incident roles
Clear ownership prevents the two classic failure modes: everyone debugging, or no one deciding.
- Incident Commander: Owns the response and all decisions. Coordinates; does not debug.
- Operations Lead: Drives the technical investigation and applies mitigations.
- Communications Lead: Owns status-page updates and internal/exec stakeholder comms.
- Scribe: Keeps the timestamped timeline that feeds the post-incident review.
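A roster check at declaration time can catch unowned roles early. This is a small sketch under stated assumptions: the `roster_problems` helper and the roster shape (role name mapped to a person) are hypothetical, and the rule that the Incident Commander may not also be the Operations Lead is an inference from "does not debug" rather than something the playbook states outright.

```python
from typing import Dict, List

# Role names taken from the list above.
REQUIRED_ROLES = (
    "Incident Commander",
    "Operations Lead",
    "Communications Lead",
    "Scribe",
)


def roster_problems(roster: Dict[str, str]) -> List[str]:
    """Return human-readable problems with a role roster (empty list if none)."""
    problems = [f"unassigned: {role}" for role in REQUIRED_ROLES if not roster.get(role)]
    commander = roster.get("Incident Commander")
    # Assumption: the person coordinating should not also lead the debugging.
    if commander and commander == roster.get("Operations Lead"):
        problems.append("Incident Commander is also Operations Lead")
    return problems
```

An empty result means all four roles are filled and the coordinating and debugging responsibilities sit with different people.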
Communication templates
Pre-written, fill-in-the-blank updates so the Comms Lead never starts from a blank page mid-incident.
Initial — within 15 min
We are investigating reports of <impact> affecting <service> in <region>. Next update in 30 minutes.
Update — mitigation underway
We have identified a likely cause and are applying a mitigation. Some customers may still see <symptom>. Next update by <time>.
Resolved
The issue affecting <service> is resolved as of <time>. A full root-cause analysis will follow within 5 business days.
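The templates above use `<name>` placeholders, so they can be filled programmatically and checked for missing fields before anything is posted. The following is a minimal sketch: the `TEMPLATES` keys and the `render` helper are hypothetical names, and the template text is copied from the three updates above.

```python
import re

TEMPLATES = {
    "initial": (
        "We are investigating reports of <impact> affecting <service> "
        "in <region>. Next update in 30 minutes."
    ),
    "update": (
        "We have identified a likely cause and are applying a mitigation. "
        "Some customers may still see <symptom>. Next update by <time>."
    ),
    "resolved": (
        "The issue affecting <service> is resolved as of <time>. "
        "A full root-cause analysis will follow within 5 business days."
    ),
}

# Matches placeholders such as <service> or <time>.
PLACEHOLDER = re.compile(r"<(\w+)>")


def render(kind: str, **fields: str) -> str:
    """Fill a template, refusing to return text with unfilled placeholders."""
    template = TEMPLATES[kind]
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(fields))
    if missing:
        raise ValueError(f"missing fields for {kind!r}: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: fields[match.group(1)], template)
```

Raising on a missing field, rather than leaving the placeholder in place, keeps a half-filled update such as "affecting <service>" from reaching the status page.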
The one rule
Mitigate before you diagnose. Restoring service is always the first priority — root cause can wait until customers are back.