On-call basics

Novalien on-call rotates weekly. One primary, one secondary. Handover is Monday 10:00 local on the #novavms-oncall Slack channel. The primary owns PagerDuty acknowledgements and is expected online within 10 minutes of a page, 24/7 during their week. Secondary takes over if primary does not ack within 15 minutes.

Rotation handover

Outgoing primary posts a handover note in #novavms-oncall containing:

  1. Open incidents and their state (link to postmortem drafts if any).
  2. Deploys in flight — anything in alpha or beta that the incoming primary inherits. See Feature-flag rollout.
  3. Known-noisy alerts to watch (and any temporary silences scheduled to lift during the next week).
  4. Customer orgs currently set to allow_platform_impersonation = false — these orgs cannot be entered to help during the week unless a customer Owner re-enables impersonation.

Incoming primary acknowledges with a :thumbsup: in-thread. No handover is considered complete without that ack.
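A handover note covering the four points above might look like this (a sketch — the bracketed values are placeholders, not real examples):

```
Handover — week of <date>
1. Open incidents: <links + state; postmortem drafts if any>
2. Deploys in flight: <alpha/beta rollouts the incoming primary inherits>
3. Noisy alerts: <known-noisy alerts; silences lifting this week>
4. Impersonation-blocked orgs: <orgs with allow_platform_impersonation = false>
```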

What “page-worthy” means (P1 criteria)

PagerDuty pages on P1 only. Everything else goes to #novavms-oncall-noise and is handled during business hours.

A P1 is any of:

  • Cloud API (novavms.novalien.com) returns 5xx for more than 2 minutes on /api/v1/health or on authenticated reads.
  • More than 10% of online gateways drop simultaneously (mass disconnection — usually our fault, not theirs).
  • Ingest pipeline stops writing new events for more than 5 minutes across all orgs.
  • A customer reports total loss of live view AND playback for an entire org.
  • Any active security event — confirmed credential leak, suspicious platform.impersonation_started from an unknown actor, unexpected deletion of platform_audit_log rows (should be impossible — investigate).
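The first criterion has a precise shape: sustained 5xx, not a blip. A minimal sketch of that threshold logic (a hypothetical helper for illustration only — the real alert lives in whatever feeds PagerDuty):

```python
from typing import Iterable, Tuple

def breaches_p1(samples: Iterable[Tuple[float, int]], window: float = 120.0) -> bool:
    """True if health-check samples show 5xx continuously for more than
    `window` seconds (default 2 minutes, per the P1 criteria).

    `samples` are (unix_timestamp, http_status) pairs in time order.
    A single non-5xx sample resets the clock -- a blip is not a P1.
    """
    run_start = None
    for ts, status in samples:
        if 500 <= status < 600:
            if run_start is None:
                run_start = ts          # start of the current 5xx run
            if ts - run_start > window:
                return True             # sustained outage: page-worthy
        else:
            run_start = None            # recovered; reset the clock
    return False
```

With 15-second probes, three minutes of solid 500s trips the threshold; ninety seconds does not.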

Single-camera offline, one gateway offline, a slow search query — none of these are page-worthy. They are normal support tickets. See docs/reliability-reports/2026-04-17-session-record.md in the repo for what a non-P1 reconnect storm looks like and how it was handled without a page.

First 10 minutes of a page

Do these in order. Do not skip steps because the cause looks obvious.

  1. Ack the page in PagerDuty. This stops the escalation timer. If you cannot engage within 15 minutes, re-assign to secondary — do not ack and then disappear.
  2. Declare in #novavms-incident-live. Post: incident start time, what paged, current hypothesis (“unknown” is a valid hypothesis). This channel is the audit trail if we end up writing a public postmortem.
  3. Check prod status. curl -sf https://novavms.novalien.com/api/v1/health. Check the cloud server via the pattern in .claude/skills/novavms-cloud-deploy/SKILL.md (Tailscale IP 100.70.175.62, Windows native SSH). See Release-pipeline monitoring.
  4. Check recent deploys. git log --since=24.hours on the cloud and gateway repos. A deploy in the last 2 hours is the prime suspect.
  5. Check the reliability-reports directory. Any recent soak fails? docs/reliability-reports/ in the repo — most recent file first.
  6. If customer-reported: ask for the org_id or the customer-owner email. Use Cross-org search to confirm the state before you impersonate. Do not impersonate speculatively — that’s an audit entry in their log.
  7. If you need to act inside a customer org: mint an impersonation token with reason: "P1 incident <incident-id>" and ticket_ref: <PagerDuty incident ID>. See How scoped impersonation works.
  8. Communicate outward every 15 minutes. Even “still investigating, no customer impact update” is a message. Silence is worse than admitting you don’t know yet.
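Step 7's token fields can be composed mechanically so the audit trail stays consistent across incidents. A sketch (hypothetical helper — the actual minting call is documented in How scoped impersonation works):

```python
def impersonation_request(incident_id: str, pagerduty_id: str) -> dict:
    """Build the reason/ticket_ref fields for a P1 impersonation token.

    Field names and the reason format follow this runbook; pass the
    result to the minting flow from "How scoped impersonation works".
    """
    return {
        "reason": f"P1 incident {incident_id}",
        "ticket_ref": pagerduty_id,  # the PagerDuty incident ID
    }
```

Keeping the reason string uniform makes the customer-facing audit entries searchable later.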

At the 30-minute mark, if you are not clearly converging on a fix, page secondary and follow Incident response runbook.

PagerDuty escalation policy

Primary → Secondary (after 15 min of no ack) → Engineering lead → CTO. The CTO page has only fired twice in the product’s history; both were valid. If you are secondary and you page up to the lead, write a one-sentence reason in the PagerDuty note — “primary unreachable, customer-facing outage at 08:42 UTC” beats “escalating.”

See also