
Incident response runbook

Incidents run on a fixed cadence. This page is the reference — the step-by-step for the first 30 minutes plus the structure of what follows. For the pre-incident on-call posture (what qualifies as a page, the first 10 minutes) see On-call basics. This page picks up when you have triaged a confirmed P1 and need to stabilise, communicate, and close out.

Incident lifecycle

| Phase | Duration | Exit criterion |
| --- | --- | --- |
| 1. Identify | 0–5 min | Hypothesis for root cause or symptom |
| 2. Triage | 5–15 min | Impact scope; decision: mitigate vs dig |
| 3. Communicate | every 15 min | External status post; internal channel flowing |
| 4. Mitigate | 15–60 min | Customer impact ended |
| 5. Postmortem | within 5 business days | Written doc in docs/superpowers/plans/ |

Phase 1 — Identify (first 5 minutes)

Same as the first 10 minutes of a page (see On-call basics). By minute 5, you should be able to say one of:

  • “Cloud API is returning 5xx, deploy at HH:MM is the suspect.”
  • “Gateway X dropped Y cameras, cause unknown.”
  • “Customer Z cannot log in, cause unknown.”
  • “Something is wrong with live view across all orgs, cause unknown.”

“Cause unknown” is fine. “No hypothesis yet” is also fine — keep digging.

Phase 2 — Triage (minutes 5–15)

Decide between two paths:

  • Mitigate first. If there is a known revert (feature flag flip, deploy rollback, gateway restart), apply it. See Feature-flag rollout for the flag revert path. Cloud rollback pattern is in .claude/skills/novavms-cloud-deploy/SKILL.md.
  • Dig first. Only if mitigation would destroy evidence (a rollback clears the bad container’s logs). Snapshot logs via the commands in Release-pipeline monitoring before you restart anything.

Customer impact assessment:

  • How many orgs? Use Cross-org search or check aggregate error metrics — if a specific org is named, confirm via their org_id.
  • How many users per org? Rough estimate from the fleet metrics — no need to impersonate just for a count.
  • What is broken vs degraded? “Live view is black for Org A” is broken. “Live view takes 8 seconds to start in Org A” is degraded.

Phase 3 — Communicate (every 15 minutes while active)

Three audiences, three channels:

  1. Internal incident channel (#novavms-incident-live): every action, every decision. This is the audit trail if the postmortem becomes customer-facing.
  2. #novavms-oncall: one-line updates every 15 minutes for the rest of the team who are not live on the incident.
  3. Customer-facing status (status.novalien.com if one exists, else direct customer Slack/email): only after you have confirmed impact and have a working timeline. Never post speculation externally.

Silence for more than 15 minutes on any of these three is a bug in the incident process, even if you are heads-down. Post “still investigating, no change.”
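Making the "no change" update a one-liner lowers the cost of posting it on time. This sketch assumes an incoming-webhook URL for the on-call channel (`ONCALL_WEBHOOK_URL` is an assumption, not an existing variable):

```shell
# Hypothetical webhook for #novavms-oncall; set ONCALL_WEBHOOK_URL first.
# Keeps the 15-minute cadence cheap even while heads-down.
curl -sf -X POST -H 'Content-Type: application/json' \
  -d '{"text":"P1 <incident-id>: still investigating, no change."}' \
  "$ONCALL_WEBHOOK_URL"
```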

Phase 4 — Mitigate (minutes 15–60)

Apply the chosen mitigation. Confirm customer impact ended — do not rely on “looks fine from my side.” Cases:

  • Feature-flag revert — flip the state back to alpha (see Feature-flag rollout). The effect reaches customers within 60 seconds.
  • Cloud rollback — docker compose to the previous image tag on 100.70.175.62. Typical downtime 30–90 seconds. Use the pattern in the novavms-cloud-deploy skill.
  • Gateway restart — if one host, remote systemctl restart novavms-gateway. If fleet-wide, rolling restart via the pattern in the novavms-gateway-deploy skill.
  • Customer impersonation to fix a tenant-specific state — mint a token with reason: "P1 <incident-id> — <one-line>". Fix the state. End the impersonation immediately. Do not stay in the customer org to poke around.
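For the cloud-rollback case, the shape of the operation is roughly the following. This is a sketch only — the canonical pattern is in `.claude/skills/novavms-cloud-deploy/SKILL.md` — and the compose directory, image name, and `NOVAVMS_IMAGE_TAG` variable are all assumptions:

```shell
# Sketch of the cloud rollback; defer to novavms-cloud-deploy/SKILL.md.
# Assumes the compose file lives in /opt/novavms-cloud and selects its
# image via $NOVAVMS_IMAGE_TAG (both assumptions).
ssh deploy@100.70.175.62 <<'EOF'
cd /opt/novavms-cloud
# Second-newest local tag = the previous image (assumes old tags are kept).
PREV_TAG=$(docker images --format '{{.Tag}}' novavms/cloud | sed -n 2p)
NOVAVMS_IMAGE_TAG="$PREV_TAG" docker compose up -d    # expect 30–90 s downtime
EOF
```

Announce the rollback in the incident channel before running it, so the downtime window is on the timeline.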

Watch for 15 minutes after mitigation. Resolution = zero errors in the watched logs for 15 minutes straight.
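The 15-minutes-straight rule can be sketched as a watch loop. The unit name and error pattern below are assumptions — watch whatever log surfaced the incident:

```shell
# Post-mitigation watch: resolution = zero error lines for 15 min STRAIGHT.
# Unit name and grep pattern are assumptions; adapt to the incident.
END=$(( $(date +%s) + 15 * 60 ))
while [ "$(date +%s)" -lt "$END" ]; do
  if journalctl -u novavms-cloud --since "-30 seconds" | grep -qiE 'error'; then
    echo "errors still present — resetting the 15-minute clock"
    END=$(( $(date +%s) + 15 * 60 ))
  fi
  sleep 30
done
echo "15 minutes clean — mark resolved"
```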

Phase 5 — Postmortem (within 5 business days)

Every P1 gets a written postmortem in docs/superpowers/plans/YYYY-MM-DD-<incident-slug>.md. Structure (copy the most recent file as template):

  1. Summary — one paragraph, no jargon.
  2. Impact — which orgs, how long, how severe.
  3. Timeline — every significant event with UTC timestamps. Pull from #novavms-incident-live.
  4. Root cause — technical. Cite file paths and commit SHAs.
  5. What went well — genuinely, not performatively.
  6. What went wrong — process gaps, missed alerts, late communication.
  7. Action items — each with an owner and a date. Track in the team’s issue tracker.
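Copying the most recent file as a template, per the instruction above, is a one-liner. `SLUG` here is a placeholder for the real `<incident-slug>`:

```shell
# Start the postmortem from the most recent plan file (per the convention
# above). SLUG is a placeholder — use the real incident slug.
SLUG=example-incident
LATEST=$(ls -t docs/superpowers/plans/*.md | head -n 1)
cp "$LATEST" "docs/superpowers/plans/$(date +%F)-$SLUG.md"
```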

Postmortems are blameless. Name systems, commits, and processes — not people.

These paths are engineer-facing, not public. Reference them in the postmortem and the incident thread:

  • docs/superpowers/plans/2026-04-21-webrtc-allcodecs-STATUS.md — live-view codec pipeline rollout history.
  • docs/superpowers/plans/2026-04-14-gateway-reconnection-hardening.md — gateway reconnection hardening plan (relevant when gateways drop en masse).
  • docs/superpowers/plans/2026-04-09-sfu-keyframe-replay-and-contextmenu.md — SFU keyframe replay design.
  • docs/superpowers/plans/2026-04-12-mse-live-view-handover.md — MSE vs WebRTC transport handover.
  • docs/superpowers/plans/2026-03-30-connection-lifecycle-hardening.md — connection-lifecycle notes.
  • docs/reliability-reports/2026-04-17-session-record.md — end-to-end record of the reconnect-storm incident; canonical example of a well-run incident without a page.
  • .claude/skills/novavms-camera-troubleshoot/SKILL.md — if the incident is a single-camera symptom that escalated.
  • .claude/skills/novavms-webrtc-live-streaming/SKILL.md — if the incident touches live view, WebRTC, H.265, or MSE.

Escalation matrix

See On-call basics for the full PagerDuty chain. Escalate to engineering lead if you are still in Phase 2 (no hypothesis) after 20 minutes, or if mitigation has failed twice.

See also