
Incident response runbook

Incidents run on a fixed cadence. This page is the reference — the step-by-step for the first 30 minutes plus the structure of what follows. For the pre-incident on-call posture (what qualifies as a page, the first 10 minutes) see On-call basics. This page picks up when you have triaged a confirmed P1 and need to stabilise, communicate, and close out.

Incident lifecycle

| Phase | Duration | Exit criterion |
| --- | --- | --- |
| 1. Identify | 0–5 min | Hypothesis for root cause or symptom |
| 2. Triage | 5–15 min | Impact scope; decision: mitigate vs dig |
| 3. Communicate | every 15 min | External status post; internal channel flowing |
| 4. Mitigate | 15–60 min | Customer impact ended |
| 5. Postmortem | within 5 business days | Written doc in docs/superpowers/plans/ |

Phase 1 — Identify (first 5 minutes)

Same as the first 10 minutes of a page (see On-call basics). By minute 5, you should be able to say one of:

  • “Cloud API is returning 5xx, deploy at HH:MM is the suspect.”
  • “Gateway X dropped Y cameras, cause unknown.”
  • “Customer Z cannot log in, cause unknown.”
  • “Something is wrong with live view across all orgs, cause unknown.”

“Cause unknown” is fine. “No hypothesis yet” is also fine — keep digging.

Phase 2 — Triage (minutes 5–15)

Decide between two paths:

  • Mitigate first. If there is a known revert (feature flag flip, deploy rollback, gateway restart), apply it. See Feature-flag rollout for the flag revert path. Cloud rollback pattern is in .claude/skills/novavms-cloud-deploy/SKILL.md.
  • Dig first. Only if mitigation would destroy evidence (a rollback clears the bad container’s logs). Snapshot logs via the commands in Release-pipeline monitoring before you restart anything.

Customer impact assessment:

  • How many orgs? Use Cross-org search or check aggregate error metrics — if a specific org is named, confirm via their org_id.
  • How many users per org? Rough estimate from the fleet metrics — no need to impersonate just for a count.
  • What is broken vs degraded? “Live view is black for Org A” is broken. “Live view takes 8 seconds to start in Org A” is degraded.

Phase 3 — Communicate (every 15 minutes while active)

Three audiences, three channels:

  1. Internal incident channel (#novavms-incident-live): every action, every decision. This is the audit trail if the postmortem becomes customer-facing.
  2. #novavms-oncall: one-line updates every 15 minutes for the rest of the team who are not live on the incident.
  3. Customer-facing status (status.novalien.com if one exists, else direct customer Slack/email): only after you have confirmed impact and have a working timeline. Never post speculation externally.

Silence for more than 15 minutes on any of these three is a bug in the incident process, even if you are heads-down. Post “still investigating, no change.”
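Making the "no change" update a one-liner lowers the cost of posting it on time. This sketch assumes an incoming-webhook URL for the on-call channel (`ONCALL_WEBHOOK_URL` is an assumption, not an existing variable):

```shell
# Hypothetical webhook for #novavms-oncall; set ONCALL_WEBHOOK_URL first.
# Keeps the 15-minute cadence cheap even while heads-down.
curl -sf -X POST -H 'Content-Type: application/json' \
  -d '{"text":"P1 <incident-id>: still investigating, no change."}' \
  "$ONCALL_WEBHOOK_URL"
```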

Phase 4 — Mitigate (minutes 15–60)

Apply the chosen mitigation. Confirm customer impact ended — do not rely on “looks fine from my side.” Cases:

  • Feature-flag revert — flip the state back to alpha (see Feature-flag rollout). The effect reaches customers within 60 seconds.
  • Cloud rollback — docker compose to the previous image tag on 100.70.175.62. Typical downtime 30–90 seconds. Use the pattern in the novavms-cloud-deploy skill.
  • Gateway restart — if one host, remote systemctl restart novavms-gateway. If fleet-wide, rolling restart via the pattern in the novavms-gateway-deploy skill.
  • Customer impersonation to fix a tenant-specific state — mint a token with reason: "P1 <incident-id> — <one-line>". Fix the state. End the impersonation immediately. Do not stay in the customer org to poke around.
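For the cloud-rollback case, the shape of the operation is roughly the following. This is a sketch only — the canonical pattern is in `.claude/skills/novavms-cloud-deploy/SKILL.md` — and the compose directory, image name, and `NOVAVMS_IMAGE_TAG` variable are all assumptions:

```shell
# Sketch of the cloud rollback; defer to novavms-cloud-deploy/SKILL.md.
# Assumes the compose file lives in /opt/novavms-cloud and selects its
# image via $NOVAVMS_IMAGE_TAG (both assumptions).
ssh deploy@100.70.175.62 <<'EOF'
cd /opt/novavms-cloud
# Second-newest local tag = the previous image (assumes old tags are kept).
PREV_TAG=$(docker images --format '{{.Tag}}' novavms/cloud | sed -n 2p)
NOVAVMS_IMAGE_TAG="$PREV_TAG" docker compose up -d    # expect 30–90 s downtime
EOF
```

Announce the rollback in the incident channel before running it, so the downtime window is on the timeline.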

Watch for 15 minutes after mitigation. Resolution = zero errors in the watched logs for 15 minutes straight.
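The 15-minutes-straight rule can be sketched as a watch loop. The unit name and error pattern below are assumptions — watch whatever log surfaced the incident:

```shell
# Post-mitigation watch: resolution = zero error lines for 15 min STRAIGHT.
# Unit name and grep pattern are assumptions; adapt to the incident.
END=$(( $(date +%s) + 15 * 60 ))
while [ "$(date +%s)" -lt "$END" ]; do
  if journalctl -u novavms-cloud --since "-30 seconds" | grep -qiE 'error'; then
    echo "errors still present — resetting the 15-minute clock"
    END=$(( $(date +%s) + 15 * 60 ))
  fi
  sleep 30
done
echo "15 minutes clean — mark resolved"
```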

Phase 5 — Postmortem (within 5 business days)

Every P1 gets a written postmortem in docs/superpowers/plans/YYYY-MM-DD-<incident-slug>.md. Structure (copy the most recent file as template):

  1. Summary — one paragraph, no jargon.
  2. Impact — which orgs, how long, how severe.
  3. Timeline — every significant event with UTC timestamps. Pull from #novavms-incident-live.
  4. Root cause — technical. Cite file paths and commit SHAs.
  5. What went well — genuinely, not performatively.
  6. What went wrong — process gaps, missed alerts, late communication.
  7. Action items — each with an owner and a date. Track in the team’s issue tracker.
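Copying the most recent file as a template, per the instruction above, is a one-liner. `SLUG` here is a placeholder for the real `<incident-slug>`:

```shell
# Start the postmortem from the most recent plan file (per the convention
# above). SLUG is a placeholder — use the real incident slug.
SLUG=example-incident
LATEST=$(ls -t docs/superpowers/plans/*.md | head -n 1)
cp "$LATEST" "docs/superpowers/plans/$(date +%F)-$SLUG.md"
```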

Postmortems are blameless. Name systems, commits, and processes — not people.

These paths are engineer-facing, not public. Reference them in the postmortem and the incident thread:

  • docs/superpowers/plans/2026-04-21-webrtc-allcodecs-STATUS.md — live-view codec pipeline rollout history.
  • docs/superpowers/plans/2026-04-14-gateway-reconnection-hardening.md — gateway reconnection hardening plan (relevant when gateways drop en masse).
  • docs/superpowers/plans/2026-04-09-sfu-keyframe-replay-and-contextmenu.md — SFU keyframe replay design.
  • docs/superpowers/plans/2026-04-12-mse-live-view-handover.md — MSE vs WebRTC transport handover.
  • docs/superpowers/plans/2026-03-30-connection-lifecycle-hardening.md — connection-lifecycle notes.
  • docs/reliability-reports/2026-04-17-session-record.md — end-to-end record of the reconnect-storm incident; canonical example of a well-run incident without a page.
  • .claude/skills/novavms-camera-troubleshoot/SKILL.md — if the incident is a single-camera symptom that escalated.
  • .claude/skills/novavms-webrtc-live-streaming/SKILL.md — if the incident touches live view, WebRTC, H.265, or MSE.

Escalation matrix

See On-call basics for the full PagerDuty chain. Escalate to engineering lead if you are still in Phase 2 (no hypothesis) after 20 minutes, or if mitigation has failed twice.

See also