# Incident response runbook
Incidents run on a fixed cadence. This page is the reference — the step-by-step for the first 30 minutes plus the structure of what follows. For the pre-incident on-call posture (what qualifies as a page, the first 10 minutes) see On-call basics. This page picks up when you have triaged a confirmed P1 and need to stabilise, communicate, and close out.
## Incident lifecycle
| Phase | Duration | Exit criterion |
|---|---|---|
| 1. Identify | 0–5 min | Hypothesis for root cause or symptom |
| 2. Triage | 5–15 min | Impact scope, decision: mitigate vs dig |
| 3. Communicate | every 15 min | External status post; internal channel flowing |
| 4. Mitigate | 15–60 min | Customer impact ended |
| 5. Postmortem | within 5 business days | Written doc in docs/superpowers/plans/ |
## Phase 1 — Identify (first 5 minutes)
Same as the first 10 minutes of a page (see On-call basics). By minute 5, you should be able to say one of:
- “Cloud API is returning 5xx, deploy at HH:MM is the suspect.”
- “Gateway X dropped Y cameras, cause unknown.”
- “Customer Z cannot log in, cause unknown.”
- “Something is wrong with live view across all orgs, cause unknown.”
“Cause unknown” is fine. “No hypothesis yet” is also fine — keep digging.
## Phase 2 — Triage (minutes 5–15)
Decide between two paths:
- Mitigate first. If there is a known revert (feature flag flip, deploy rollback, gateway restart), apply it. See Feature-flag rollout for the flag revert path. The cloud rollback pattern is in `.claude/skills/novavms-cloud-deploy/SKILL.md`.
- Dig first. Only if mitigation would destroy evidence (a rollback clears the bad container’s logs). Snapshot logs via the commands in Release-pipeline monitoring before you restart anything.
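Snapshotting evidence before a restart wipes it can be a one-liner kept in muscle memory. A minimal sketch, assuming the cloud stack runs under docker compose and the service is named `novavms-cloud` (the service name is an assumption):

```shell
# Dump a service's logs to a timestamped file before any restart or
# rollback destroys them. Service name "novavms-cloud" is an assumption.
snapshot_logs() {
  local service="$1" incident="$2"
  local out="/tmp/${incident}-${service}-$(date -u +%Y%m%dT%H%M%SZ).log"
  docker compose logs --no-color --timestamps "$service" > "$out" 2>&1
  echo "$out"   # print the path so it can be pasted into the incident channel
}

# Usage: snapshot_logs novavms-cloud P1-1234
```

Echoing the snapshot path makes it trivial to drop into the incident channel, which keeps the audit trail intact.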
Customer impact assessment:
- How many orgs? Use Cross-org search or check aggregate error metrics — if a specific org is named, confirm via their `org_id`.
- How many users per org? Rough estimate from the fleet metrics — no need to impersonate just for a count.
- What is broken vs degraded? “Live view is black for Org A” is broken. “Live view takes 8 seconds to start in Org A” is degraded.
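When the aggregate metrics are ambiguous, a grep over a recent error-log sample gives a quick org count. A sketch, assuming error lines carry an `org_id=<id>` field (the field name and log format are assumptions about our logs):

```shell
# Count distinct orgs appearing in error lines of a log sample.
# The org_id=<id> field format is an assumption.
count_impacted_orgs() {
  grep 'ERROR' "$1" | grep -o 'org_id=[^ ]*' | sort -u | wc -l
}

# Usage: count_impacted_orgs /tmp/cloud-errors.log
```

This deliberately counts only error lines, so a healthy org that merely appears in INFO traffic is not inflated into the impact number.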
## Phase 3 — Communicate (every 15 minutes while active)
Three audiences, three channels:
- Internal incident channel (`#novavms-incident-live`): every action, every decision. This is the audit trail if the postmortem becomes customer-facing.
- `#novavms-oncall`: one-line updates every 15 minutes for the rest of the team not live on the incident.
- Customer-facing status (`status.novalien.com` if one exists, else direct customer Slack/email): only after you have confirmed impact and have a working timeline. Never post speculation externally.
Silence for more than 15 minutes on any of these three is a bug in the incident process, even if you are heads-down. Post “still investigating, no change.”
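Making the "no change" post frictionless helps keep the cadence. A tiny formatter sketch; how the line actually gets posted (webhook, bot, paste) is left to the channel tooling:

```shell
# Format a cadence update with a UTC timestamp so the 15-minute
# rhythm is visible in the channel history.
format_update() {
  local incident="$1"; shift
  printf '[%s] %s: %s\n' "$(date -u +%H:%MZ)" "$incident" "$*"
}

# Usage: format_update P1-1234 still investigating, no change
```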
## Phase 4 — Mitigate (minutes 15–60)
Apply the chosen mitigation. Confirm customer impact ended — do not rely on “looks fine from my side.” Cases:
- Feature-flag revert — flip state back to alpha (see Feature-flag rollout). Effect is within 60 seconds for customers.
- Cloud rollback — `docker compose` to the previous image tag on `100.70.175.62`. Typical downtime 30–90 seconds. Use the pattern in the `novavms-cloud-deploy` skill.
- Gateway restart — if one host, remote `systemctl restart novavms-gateway`. If fleet-wide, rolling restart via the pattern in the `novavms-gateway-deploy` skill.
- Customer impersonation to fix a tenant-specific state — mint a token with `reason: "P1 <incident-id> — <one-line>"`. Fix the state. End the impersonation immediately. Do not stay in the customer org to poke around.
Watch for 15 minutes after mitigation. Resolution = zero errors in the watched logs for 15 minutes straight.
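The watch can be scripted rather than eyeballed. A sketch, assuming errors land in a single log file and match a plain `ERROR` marker (both assumptions); the window and poll interval default to the resolution rule but are parameters for testing:

```shell
# Watch a log for a full clean window. Any error fails fast so the
# operator knows the 15-minute clock has reset.
watch_clear() {
  local logfile="$1" window="${2:-900}" interval="${3:-30}"
  local deadline=$(( $(date +%s) + window ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if grep -q 'ERROR' "$logfile"; then
      echo "error seen: watch clock resets"
      return 1
    fi
    sleep "$interval"
  done
  echo "clean for the full window: resolved"
}

# Usage: watch_clear /var/log/novavms/cloud.log   # defaults: 900 s window, 30 s polls
```

Failing fast on the first error matches the rule above: any error during the watch restarts the 15-minute clock.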
## Phase 5 — Postmortem (within 5 business days)
Every P1 gets a written postmortem in `docs/superpowers/plans/YYYY-MM-DD-<incident-slug>.md`. Structure (copy the most recent file as template):
- Summary — one paragraph, no jargon.
- Impact — which orgs, how long, how severe.
- Timeline — every significant event with UTC timestamps. Pull from `#novavms-incident-live`.
- Root cause — technical. Cite file paths and commit SHAs.
- What went well — genuinely, not performatively.
- What went wrong — process gaps, missed alerts, late communication.
- Action items — each with an owner and a date. Track in the team’s issue tracker.
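The convention is to copy the most recent file as the template; when no good recent example exists, the skeleton can be scaffolded from the section list above. A sketch (the slug in the usage line is an example):

```shell
# Scaffold a blank postmortem with the standard sections. Normally
# you'd copy the most recent file instead; this is the fallback.
scaffold_postmortem() {
  local slug="$1"
  local f="docs/superpowers/plans/$(date -u +%F)-${slug}.md"
  mkdir -p "$(dirname "$f")"
  cat > "$f" <<'EOF'
# Postmortem: <title>

## Summary
## Impact
## Timeline (UTC)
## Root cause
## What went well
## What went wrong
## Action items
EOF
  echo "$f"
}

# Usage: scaffold_postmortem cloud-5xx-rollback
```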
Postmortems are blameless. Name systems, commits, and processes — not people.
## Internal runbook links (in-repo)
These paths are engineer-facing, not public. Reference them in the postmortem and the incident thread:
- `docs/superpowers/plans/2026-04-21-webrtc-allcodecs-STATUS.md` — live-view codec pipeline rollout history.
- `docs/superpowers/plans/2026-04-14-gateway-reconnection-hardening.md` — gateway reconnection hardening plan (relevant when gateways drop en masse).
- `docs/superpowers/plans/2026-04-09-sfu-keyframe-replay-and-contextmenu.md` — SFU keyframe replay design.
- `docs/superpowers/plans/2026-04-12-mse-live-view-handover.md` — MSE vs WebRTC transport handover.
- `docs/superpowers/plans/2026-03-30-connection-lifecycle-hardening.md` — connection-lifecycle notes.
- `docs/reliability-reports/2026-04-17-session-record.md` — end-to-end record of the reconnect-storm incident; canonical example of a well-run incident without a page.
- `.claude/skills/novavms-camera-troubleshoot/SKILL.md` — if the incident is a single-camera symptom that escalated.
- `.claude/skills/novavms-webrtc-live-streaming/SKILL.md` — if the incident touches live view, WebRTC, H.265, or MSE.
## Escalation matrix
See On-call basics for the full PagerDuty chain. Escalate to engineering lead if you are still in Phase 2 (no hypothesis) after 20 minutes, or if mitigation has failed twice.
## See also
- On-call basics — first 10 minutes.
- Release-pipeline monitoring — commands to run during triage.
- Audit expectations — what your incident actions produce in logs.
- Feature-flag rollout — mitigation via flag revert.