On-call basics
Novalien on-call rotates weekly. One primary, one secondary. Handover is Monday 10:00 local on the #novavms-oncall Slack channel. The primary owns PagerDuty acknowledgements and is expected online within 10 minutes of a page, 24/7 during their week. Secondary takes over if primary does not ack within 15 minutes.
Rotation handover
Outgoing primary posts a handover note in #novavms-oncall containing:
- Open incidents and their state (link to postmortem drafts if any).
- Deploys in flight — anything in `alpha` or `beta` that the incoming primary inherits. See Feature-flag rollout.
- Known-noisy alerts to watch (and any temporary silences scheduled to lift during the next week).
- Customer orgs with `allow_platform_impersonation = false` set — these cannot be entered to help during the week without Owner action.
Incoming primary acknowledges with a :thumbsup: in-thread. No handover is considered complete without that ack.
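A handover note following the checklist above might look like this (every name, version, and time below is invented for illustration — this is a shape, not a mandated format):

```text
Handover: week of <date> → @<incoming-primary>
- Open incidents: none (last P1 closed Thu; postmortem draft linked in-channel)
- Deploys in flight: cloud release in beta — incoming primary inherits the promote decision
- Noisy alerts: gateway-heartbeat flaps on one org; silence lifts Tue 09:00
- Impersonation-locked orgs: two orgs with allow_platform_impersonation = false
```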
What “page-worthy” means (P1 criteria)
PagerDuty pages on P1 only. Everything else goes to #novavms-oncall-noise and is handled during business hours.
A P1 is any of:
- Cloud API (`novavms.novalien.com`) returns 5xx for more than 2 minutes on `/api/v1/health` or on authenticated reads.
- More than 10% of online gateways drop simultaneously (mass disconnection — usually our fault, not theirs).
- Ingest pipeline stops writing new events for more than 5 minutes across all orgs.
- A customer reports total loss of live view AND playback for an entire org.
- Any active security event — confirmed credential leak, suspicious `platform.impersonation_started` from an unknown actor, unexpected deletion of `platform_audit_log` rows (should be impossible — investigate).
Single-camera offline, one gateway offline, a slow search query — none of these are page-worthy. They are normal support tickets. See docs/reliability-reports/2026-04-17-session-record.md in the repo for what a non-P1 reconnect storm looks like and how it was handled without a page.
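The criteria above amount to a simple any-of predicate. The sketch below mirrors the list with the same thresholds; all field and function names are illustrative, not part of our actual tooling:

```python
from dataclasses import dataclass

@dataclass
class Signals:
    """Hypothetical snapshot of monitoring state; field names are illustrative."""
    api_5xx_seconds: int = 0            # continuous 5xx on /api/v1/health or authed reads
    gateway_drop_fraction: float = 0.0  # share of online gateways lost at once
    ingest_stall_seconds: int = 0       # no new events written, across all orgs
    org_total_loss: bool = False        # customer reports no live view AND no playback
    security_event: bool = False        # credential leak, rogue impersonation, audit-row deletion

def is_p1(s: Signals) -> bool:
    """Any single condition from the page-worthy list makes it a P1."""
    return (
        s.api_5xx_seconds > 120          # 5xx for more than 2 minutes
        or s.gateway_drop_fraction > 0.10  # more than 10% of online gateways
        or s.ingest_stall_seconds > 300    # ingest stopped for more than 5 minutes
        or s.org_total_loss
        or s.security_event
    )
```

A single offline camera or gateway sets none of these fields, so `is_p1(Signals())` stays False — exactly the "normal support ticket" case.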
First 10 minutes of a page
Do these in order. Do not skip steps because the cause looks obvious.
- Ack the page in PagerDuty. Stops the escalation timer. If you cannot engage in 15 minutes, re-assign to secondary — do not ack and then disappear.
- Declare in `#novavms-incident-live`. Post: incident start time, what paged, current hypothesis (“unknown” is a valid hypothesis). This channel is the audit trail if we end up writing a public postmortem.
- Check prod status: `curl -sf https://novavms.novalien.com/api/v1/health`. Check the cloud server via the pattern in `.claude/skills/novavms-cloud-deploy/SKILL.md` (Tailscale IP `100.70.175.62`, Windows native SSH). See Release-pipeline monitoring.
- Check recent deploys: `git log --since="24 hours ago"` on the cloud and gateway repos. A deploy in the last 2 hours is the prime suspect.
- Check the reliability-report directory for recent soak fails: `docs/reliability-reports/` in the repo — most recent file first.
- If customer-reported: ask for the `org_id` or the customer-owner email. Use Cross-org search to confirm the state before you impersonate. Do not impersonate speculatively — that’s an audit entry in their log.
- If you need to act inside a customer org: mint an impersonation token with `reason: "P1 incident <incident-id>"` and `ticket_ref: <PagerDuty incident ID>`. See How scoped impersonation works.
- Communicate outward every 15 minutes. Even “still investigating, no customer impact update” is a message. Silence is worse than admitting you don’t know yet.
At the 30-minute mark, if you are not clearly converging on a fix, page secondary and follow Incident response runbook.
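The prod-status step lends itself to a small poller that encodes the "5xx for more than 2 minutes" bar from the P1 criteria. A minimal sketch, not existing tooling: `fetch` is any callable returning an HTTP status code (it could wrap the `curl` above), and the injectable clock/sleep exist only so the logic is testable.

```python
import time
from typing import Callable

def sustained_5xx(
    fetch: Callable[[], int],
    window_s: float = 120,      # the 2-minute P1 threshold
    interval_s: float = 10,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every probe across `window_s` seconds is a 5xx.

    A single non-5xx response short-circuits to False: the outage is not
    sustained enough to meet the page-worthy bar (yet).
    """
    deadline = clock() + window_s
    while clock() < deadline:
        if fetch() < 500:
            return False
        sleep(interval_s)
    return True
```

In practice `fetch` might be `lambda: requests.get("https://novavms.novalien.com/api/v1/health", timeout=5).status_code` — hypothetical wiring, shown only to connect the sketch back to the curl check.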
PagerDuty escalation policy
Primary → Secondary (after 15 min of no ack) → Engineering lead → CTO. The CTO page has only fired twice in the product’s history; both were valid. If you are secondary and you page up to the lead, write a one-sentence reason in the PagerDuty note — “primary unreachable, customer-facing outage at 08:42 UTC” beats “escalating.”
See also
- Incident response runbook — beyond the first 10 minutes.
- Release-pipeline monitoring — post-deploy checks that prevent pages.
- In-repo: `docs/reliability-reports/` for historical incident data.