
The Incident Response Playbook

A practitioner’s framework for responding to cloud outages — the five-phase lifecycle, how to classify severity, who does what, and the comms that keep customers informed. Adapted from the NIST and Google SRE incident models.

The five-phase lifecycle

Every incident moves through the same arc. The goal of each phase is fixed; only the tactics change with the failure.

  1. Detect
    0–5 min

    Confirm the signal is real and customer-impacting.

    • Validate alert against independent telemetry
    • Declare an incident and assign a severity
    • Open the incident channel and bridge
  2. Triage
    5–15 min

    Size the blast radius and stand up the response team.

    • Page the Incident Commander and on-call SMEs
    • Scope affected regions, services and customers
    • Post the first public status update
  3. Mitigate
    15–60 min

    Stop the bleeding — restore service before root-causing.

    • Apply the fastest safe mitigation (failover, rollback, drain)
    • Throttle or shed load to protect healthy capacity
    • Re-confirm impact is trending down
  4. Resolve
    1–4 hrs

    Return to full service and verify recovery.

    • Roll forward the durable fix
    • Validate SLOs are back within target
    • Stand down the response, hand off to owners
  5. Learn
    24–72 hrs

    Turn the incident into durable resilience.

    • Publish a blameless post-incident review
    • File and prioritise corrective actions
    • Share an external RCA where customers were impacted
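The phase windows above can be written down as data, which makes it easy to check whether a response is running behind its target. The following is a minimal sketch, not part of the original framework: the names `Phase`, `PHASES`, `expected_phase` and `is_behind_schedule` are hypothetical, and treating each window's upper bound as exclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    start_min: int  # minutes since detection, inclusive
    end_min: int    # minutes since detection, exclusive (assumption)


# Target windows copied from the lifecycle above, converted to minutes.
PHASES = [
    Phase("Detect", 0, 5),
    Phase("Triage", 5, 15),
    Phase("Mitigate", 15, 60),
    Phase("Resolve", 60, 4 * 60),
    Phase("Learn", 24 * 60, 72 * 60),
]


def expected_phase(elapsed_min: float) -> Optional[Phase]:
    """Return the phase whose target window contains elapsed_min, if any."""
    for phase in PHASES:
        if phase.start_min <= elapsed_min < phase.end_min:
            return phase
    # Falls in the gap between Resolve and Learn, or past the Learn window.
    return None


def is_behind_schedule(current_phase: str, elapsed_min: float) -> bool:
    """True if the incident is still in a phase whose window has closed."""
    phase = next(p for p in PHASES if p.name == current_phase)
    return elapsed_min >= phase.end_min
```

For example, an incident still in Triage 20 minutes after detection would be flagged by `is_behind_schedule("Triage", 20)`. The windows are targets rather than hard limits, so a flag like this is a prompt for the Incident Commander, not an automatic escalation.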

Severity matrix

Classify the incident the moment it’s declared — severity drives who gets paged and how often you communicate.

Level | Impact | Response
SEV-1 (Critical) | Multi-region outage or data-loss risk; revenue-critical path down. | All-hands, exec paged, 15-min status cadence.
SEV-2 (Major) | Single-region or single-service outage with broad customer impact. | IC + on-call SMEs, 30-min status cadence.
SEV-3 (Minor) | Degraded performance or partial feature impact with a workaround. | On-call owner, hourly updates, business hours.
SEV-4 (Low) | Cosmetic or internal-only; no customer-visible degradation. | Tracked as a normal ticket, no incident bridge.
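One way to make the matrix actionable is to encode it as a lookup table, so paging and the update timer follow directly from the declared severity. The sketch below is illustrative only: `ResponsePolicy`, `POLICIES`, `classify` and `next_update_due` are hypothetical names, the boolean inputs to `classify` are one possible reading of the Impact column, and the business-hours restriction on SEV-3 is not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResponsePolicy:
    label: str
    responders: Tuple[str, ...]
    status_cadence_min: Optional[int]  # None means no scheduled updates


# Values taken from the Response column of the severity matrix.
POLICIES = {
    "SEV-1": ResponsePolicy("Critical", ("all-hands", "exec"), 15),
    "SEV-2": ResponsePolicy("Major", ("incident-commander", "on-call-sme"), 30),
    "SEV-3": ResponsePolicy("Minor", ("on-call-owner",), 60),
    "SEV-4": ResponsePolicy("Low", (), None),
}


def classify(*, outage: bool = False, regions_down: int = 0,
             data_loss_risk: bool = False, revenue_path_down: bool = False,
             customer_visible: bool = False) -> str:
    """Map observed impact to a severity level (assumed interpretation)."""
    if data_loss_risk or revenue_path_down or (outage and regions_down > 1):
        return "SEV-1"
    if outage:
        return "SEV-2"
    if customer_visible:
        return "SEV-3"
    return "SEV-4"


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None for ticket-only levels."""
    cadence = POLICIES[severity].status_cadence_min
    if cadence is None:
        return None
    return last_update + timedelta(minutes=cadence)
```

Keeping the policy in one table means a change to the cadence or paging rules is made in a single place, and the classification logic stays separate from the response it triggers.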

Incident roles

Clear ownership prevents the two classic failure modes: everyone debugging, or no one deciding.

Incident Commander
Owns the response and all decisions. Coordinates — does not debug.
Operations Lead
Drives the technical investigation and applies mitigations.
Communications Lead
Owns status-page updates and internal/exec stakeholder comms.
Scribe
Keeps the timestamped timeline that feeds the post-incident review.
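A roster check at declaration time is one way to catch both failure modes early. The sketch below is a hypothetical helper, not part of the original framework: `roster_problems` and the roster dictionary shape are assumptions, and treating "Incident Commander also acting as Operations Lead" as a problem is an interpretation of "coordinates, does not debug".

```python
REQUIRED_ROLES = (
    "Incident Commander",
    "Operations Lead",
    "Communications Lead",
    "Scribe",
)


def roster_problems(roster: dict) -> list:
    """Return human-readable problems with a role -> person mapping."""
    problems = []
    # Every role needs a named owner, otherwise no one is deciding.
    for role in REQUIRED_ROLES:
        if not roster.get(role):
            problems.append(f"{role} is unassigned")
    # Assumption: the IC should not also be the person doing the debugging.
    commander = roster.get("Incident Commander")
    if commander and commander == roster.get("Operations Lead"):
        problems.append("Incident Commander is also Operations Lead")
    return problems
```

An empty list means every role has an owner and the Incident Commander is free to coordinate.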

Communication templates

Pre-written, fill-in-the-blank updates so the Comms Lead never starts from a blank page mid-incident.

Initial — within 15 min

We are investigating reports of <impact> affecting <service> in <region>. Next update by <time>.

Update — mitigation underway

We have identified a likely cause and are applying a mitigation. Some customers may still see <symptom>. Next update by <time>.

Resolved

The issue affecting <service> is resolved as of <time>. A full root-cause analysis will follow within 5 business days.
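Filling the angle-bracket placeholders can be automated so that an update is never posted with a blank left in it. This is a minimal sketch assuming the `<name>` placeholder convention used above; `TEMPLATES`, `PLACEHOLDER` and `render` are hypothetical names, and only two of the templates are reproduced here.

```python
import re

# Two of the templates above, copied verbatim.
TEMPLATES = {
    "update": (
        "We have identified a likely cause and are applying a mitigation. "
        "Some customers may still see <symptom>. Next update by <time>."
    ),
    "resolved": (
        "The issue affecting <service> is resolved as of <time>. "
        "A full root-cause analysis will follow within 5 business days."
    ),
}

# Matches placeholders such as <service> or <time>.
PLACEHOLDER = re.compile(r"<(\w+)>")


def render(name: str, **fields: str) -> str:
    """Fill every placeholder in a template; fail loudly if one is missing."""
    def fill(match: re.Match) -> str:
        key = match.group(1)
        if key not in fields:
            raise KeyError(f"missing field: {key}")
        return str(fields[key])

    return PLACEHOLDER.sub(fill, TEMPLATES[name])
```

For instance, `render("resolved", service="API", time="14:05 UTC")` produces the Resolved message with both blanks filled. Raising on a missing field is a deliberate choice: a half-filled status update is worse than a short delay while the Comms Lead supplies the value.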

The one rule

Mitigate before you diagnose. Restoring service is always the first priority — root cause can wait until customers are back.