Framework · Operational Resilience

The Incident Response Playbook

A practitioner’s framework for responding to cloud outages: the five-phase lifecycle, severity classification, incident roles, and the communication templates that keep customers informed. Adapted from the NIST and Google SRE incident models.

The five-phase lifecycle

Every incident moves through the same arc. The goal of each phase is fixed; only the tactics change with the failure.

1. Detect
0–5 min
Confirm the signal is real and customer-impacting.
- Validate alert against independent telemetry
- Declare an incident and assign a severity
- Open the incident channel and bridge
2. Triage
5–15 min
Size the blast radius and stand up the response team.
- Page the Incident Commander and on-call SMEs
- Scope affected regions, services and customers
- Post the first public status update
3. Mitigate
15–60 min
Stop the bleeding — restore service before root-causing.
- Apply the fastest safe mitigation (failover, rollback, drain)
- Throttle or shed load to protect healthy capacity
- Re-confirm impact is trending down
4. Resolve
1–4 hrs
Return to full service and verify recovery.
- Roll forward the durable fix
- Validate SLOs are back within target
- Stand down the response, hand off to owners
5. Learn
24–72 hrs
Turn the incident into durable resilience.
- Publish a blameless post-incident review
- File and prioritise corrective actions
- Share an external RCA where customers were impacted
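
The target windows above can be encoded as data so that tooling can flag a response that is running behind. The following is a minimal sketch, not part of the playbook itself: the names `PHASE_WINDOWS` and `expected_phase` are hypothetical, and treating each window as half-open (start included, end excluded) is an assumption, since the lifecycle does not say which phase owns a boundary minute.

```python
from typing import Optional

# Target windows from the lifecycle above, in minutes since the first alert.
# Each window is treated as half-open: start <= elapsed < end (an assumption).
PHASE_WINDOWS = [
    ("Detect", 0, 5),
    ("Triage", 5, 15),
    ("Mitigate", 15, 60),
    ("Resolve", 60, 4 * 60),
    ("Learn", 24 * 60, 72 * 60),
]


def expected_phase(elapsed_minutes: float) -> Optional[str]:
    """Return the phase whose target window contains elapsed_minutes.

    Returns None for negative values and for times outside every window,
    such as the gap between Resolve (ends at 4 hrs) and Learn (starts at 24 hrs).
    """
    for name, start, end in PHASE_WINDOWS:
        if start <= elapsed_minutes < end:
            return name
    return None


if __name__ == "__main__":
    for minutes in (3, 5, 45, 120, 600, 24 * 60):
        print(minutes, expected_phase(minutes))
```

If an incident is still in Triage while `expected_phase` already reports Mitigate, that is a cue for the Incident Commander to escalate. The windows are targets, not hard deadlines.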

Severity matrix

Classify the incident the moment it’s declared — severity drives who gets paged and how often you communicate.

Level	Impact	Response
SEV-1 Critical	Multi-region outage or data-loss risk; revenue-critical path down.	All-hands, exec paged, 15-min status cadence.
SEV-2 Major	Single-region or single-service outage with broad customer impact.	IC + on-call SMEs, 30-min status cadence.
SEV-3 Minor	Degraded performance or partial feature impact with a workaround.	On-call owner, hourly updates, business hours.
SEV-4 Low	Cosmetic or internal-only; no customer-visible degradation.	Tracked as a normal ticket, no incident bridge.
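
One way to make classification repeatable is to encode the matrix as a small rule function. The sketch below is illustrative only: the `Impact` fields are hypothetical intake flags (the matrix does not define a schema), the rules are checked from most to least severe, and a narrow outage without broad impact is assumed to land in SEV-3. The workaround condition on SEV-3 is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    # Hypothetical intake fields; the playbook does not define a schema.
    multi_region: bool = False
    data_loss_risk: bool = False
    revenue_critical_down: bool = False
    service_or_region_down: bool = False
    broad_customer_impact: bool = False
    customer_visible_degradation: bool = False


# Status-update cadence in minutes, taken from the Response column.
# None means no incident cadence (SEV-4 is tracked as a normal ticket).
STATUS_CADENCE_MIN = {"SEV-1": 15, "SEV-2": 30, "SEV-3": 60, "SEV-4": None}


def classify(impact: Impact) -> str:
    """Return the first (most severe) level whose rule matches."""
    if impact.multi_region or impact.data_loss_risk or impact.revenue_critical_down:
        return "SEV-1"
    if impact.service_or_region_down and impact.broad_customer_impact:
        return "SEV-2"
    # Assumption: an outage without broad impact is handled as SEV-3.
    if impact.service_or_region_down or impact.customer_visible_degradation:
        return "SEV-3"
    return "SEV-4"


if __name__ == "__main__":
    level = classify(Impact(service_or_region_down=True, broad_customer_impact=True))
    print(level, STATUS_CADENCE_MIN[level])
```

Checking the most severe rule first means an incident that matches several rows is always assigned the higher level, which errs on the side of over-paging rather than under-paging.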

Incident roles

Clear ownership prevents the two classic failure modes: everyone debugging, or no one deciding.

Incident Commander: Owns the response and all decisions. Coordinates — does not debug.
Operations Lead: Drives the technical investigation and applies mitigations.
Communications Lead: Owns status-page updates and internal/exec stakeholder comms.
Scribe: Keeps the timestamped timeline that feeds the post-incident review.

Communication templates

Pre-written, fill-in-the-blank updates so the Comms Lead never starts from a blank page mid-incident.

Initial — within 15 min

We are investigating reports of <impact> affecting <service> in <region>. Next update by <time>.

Update — mitigation underway

We have identified a likely cause and are applying a mitigation. Some customers may still see <symptom>. Next update by <time>.

Resolved

The issue affecting <service> is resolved as of <time>. A full root-cause analysis will follow within 5 business days.
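
Templates like these are easy to fill programmatically, which also guards against publishing an update with a blank left in it. The following is a minimal sketch under stated assumptions: `TEMPLATES`, `PLACEHOLDER` and `render` are hypothetical names, and only two of the templates are shown. Each `<name>` in the text is treated as a required field.

```python
import re

# Two of the templates above, copied verbatim; <name> marks a blank to fill.
TEMPLATES = {
    "update": (
        "We have identified a likely cause and are applying a mitigation. "
        "Some customers may still see <symptom>. Next update by <time>."
    ),
    "resolved": (
        "The issue affecting <service> is resolved as of <time>. "
        "A full root-cause analysis will follow within 5 business days."
    ),
}

PLACEHOLDER = re.compile(r"<(\w+)>")


def render(template: str, **fields: str) -> str:
    """Fill every <name> placeholder; raise if any blank would be left unfilled."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(fields))
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: fields[m.group(1)], template)


if __name__ == "__main__":
    print(render(TEMPLATES["resolved"], service="object storage", time="14:05 UTC"))
```

Failing loudly on a missing field is a deliberate choice here: a status update that still reads `<service>` is worse than one that is a minute late.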

The one rule

Mitigate before you diagnose. Restoring service is always the first priority — root cause can wait until customers are back.