Release pipeline monitoring
Every deploy to novavms.novalien.com (HostHatch, Tailscale 100.70.175.62) and every gateway-agent push triggers a monitoring window. This page is the checklist for that window: what to run, where logs live, and when to page on-call. Most regressions surface in the first 30 minutes — if nothing is wrong by then, the deploy is usually stable.
For the deploy itself, use the novavms-cloud-deploy and novavms-gateway-deploy skills (in .claude/skills/). This page covers monitoring only.
Post-deploy health checklist (first 30 minutes)
Run in order. Each command is copy-pasteable; substitute placeholders only where indicated.
1. Cloud HTTP health
    curl -sf https://novavms.novalien.com/api/v1/health | jq
    # expect: {"status":"ok","db":"ok","go2rtc":"ok","version":"<git-sha>"}
If anything is not ok, do not proceed; escalate to on-call.
2. Frontend loads cleanly
    curl -sI https://novavms.novalien.com/ | head -5
    # expect: HTTP/2 200
Open the site in a private browser window. Confirm that /login renders and that the console shows no errors. See the novavms-browser-qa skill for automation.
3. Container logs — error grep on cloud
SSH via Tailscale (the same pattern your novavms-cloud-deploy skill uses):
    /c/Windows/System32/OpenSSH/ssh.exe root@100.70.175.62 \
      'docker logs --since 30m novavms-cloud 2>&1 | grep -E "level=error|panic|FATAL"'
A clean deploy has zero matches. Any match requires reading context.
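Reading context is easier when the lines around each match are printed with it. The following is a minimal sketch, not part of the deploy skills: `scan_errors` is a hypothetical helper name, and two lines of context on each side is an arbitrary choice. It reads a log on stdin, prints each match with its context, and returns non-zero when anything matched.

```shell
#!/usr/bin/env bash
# scan_errors: hypothetical helper, not an existing NovaVMS tool.
# Reads log text on stdin and applies the same pattern as the checklist grep.
# Prints each match with 2 lines of context and line numbers; returns 1 if
# anything matched, 0 if the log is clean.
scan_errors() {
  local pattern='level=error|panic|FATAL'
  local log
  log=$(cat)
  if printf '%s\n' "$log" | grep -q -E "$pattern"; then
    printf '%s\n' "$log" | grep -n -E -C 2 "$pattern"
    return 1
  fi
  echo "clean: no matches"
  return 0
}
```

To use it for this step, drop the remote grep and pipe the raw `docker logs --since 30m novavms-cloud 2>&1` output from the SSH command into `scan_errors`.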
4. Container logs — gateway error grep
Gateways log per-host. To sweep the fleet, iterate the gateway list from /platform/orgs (see Cross-org search):
    # per gateway (example pi-gateway):
    /c/Windows/System32/OpenSSH/ssh.exe pi-gateway \
      'sudo journalctl -u novavms-gateway --since "30 min ago" | grep -E "ERROR|panic"'
5. AI pipeline check
    /c/Windows/System32/OpenSSH/ssh.exe root@100.70.175.62 \
      'docker logs --since 30m novavms-cloud 2>&1 | grep -E "ai.provider|gemini|ollama" | tail -50'
Watch for provider_unavailable, quota_exceeded, or a spike in parse_error. These typically indicate an upstream provider change, not our bug.
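Counting the three markers makes a spike easier to spot than scanning 50 lines by eye. This is a sketch only: `tally_ai_errors` is a hypothetical helper, and it assumes the markers appear literally in the log lines. What counts as a spike is not defined here, so compare the numbers against a pre-deploy window of the same length.

```shell
#!/usr/bin/env bash
# tally_ai_errors: hypothetical helper, not an existing NovaVMS tool.
# Reads log text on stdin and prints, for each marker, the number of
# log lines that contain it (lines, not individual occurrences).
tally_ai_errors() {
  local log token count
  log=$(cat)
  for token in provider_unavailable quota_exceeded parse_error; do
    # grep -c exits non-zero when the count is 0, so mask the status.
    count=$(printf '%s\n' "$log" | grep -c -F "$token" || true)
    printf '%s=%s\n' "$token" "$count"
  done
}
```

Pipe the output of the step 5 command into `tally_ai_errors`, leaving off the `tail -50` so the whole 30-minute window is counted.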
6. WebRTC / SFU health (for releases touching live view)
    /c/Windows/System32/OpenSSH/ssh.exe root@100.70.175.62 \
      'docker logs --since 30m novavms-cloud 2>&1 | grep -E "session_start|extract-NAL|readyState" | tail -50'
See docs/superpowers/plans/2026-04-21-webrtc-allcodecs-STATUS.md for what a healthy-vs-regressing pattern looks like on the codec pipeline specifically.
7. Write a reliability report
If the deploy is non-trivial or touches a hot path, run a soak test and save the report as docs/reliability-reports/YYYY-MM-DD-HH-MM-<duration>.md. The format is documented by the existing files in that directory; the two most recent are canonical.
A report is mandatory for deploys that touch cloud streaming, gateway WebSocket, auth, or RBAC.
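The report filename can be built from the current time and the soak duration. This is a minimal sketch: `report_path` is a hypothetical helper, and using local time for the timestamp is an assumption, since this page does not say which time zone the filenames use.

```shell
#!/usr/bin/env bash
# report_path: hypothetical helper, not an existing NovaVMS tool.
# Usage: report_path <duration> [YYYY-MM-DD-HH-MM]
# Prints the report path in the YYYY-MM-DD-HH-MM-<duration>.md shape.
# The timestamp defaults to the current local time (an assumption).
report_path() {
  local duration="$1"
  local stamp="${2:-$(date +%Y-%m-%d-%H-%M)}"
  printf 'docs/reliability-reports/%s-%s.md\n' "$stamp" "$duration"
}
```

For example, `report_path 6h 2026-04-15-14-17` prints `docs/reliability-reports/2026-04-15-14-17-6h.md`.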
When to page on-call
Page immediately (P1, see On-call basics):
- /api/v1/health returns non-200 for more than 2 minutes.
- More than 10% of online gateways drop after the deploy.
- Any FATAL log line on cloud or any gateway.
- Customer-facing auth starts returning 5xx.
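The 10% gateway criterion is a strict "more than", which is easy to get wrong at the boundary. The sketch below shows the arithmetic only. `gateway_drop_exceeds` is a hypothetical helper, and this page does not specify how the before and after online counts are obtained, so they are taken as arguments.

```shell
#!/usr/bin/env bash
# gateway_drop_exceeds: hypothetical helper, not an existing NovaVMS tool.
# Usage: gateway_drop_exceeds <online_before> <online_after> [limit_pct]
# Returns 0 (page) when strictly more than limit_pct percent of the gateways
# that were online before the deploy have dropped, 1 otherwise, and 2 when
# the baseline is unusable. limit_pct defaults to 10.
gateway_drop_exceeds() {
  local before="$1" after="$2" limit_pct="${3:-10}"
  if [ "$before" -le 0 ]; then
    echo "no gateways were online before the deploy" >&2
    return 2
  fi
  local dropped=$(( before - after ))
  # Integer-only form of: dropped / before > limit_pct / 100
  if [ $(( dropped * 100 )) -gt $(( before * limit_pct )) ]; then
    return 0
  fi
  return 1
}
```

With 50 gateways online before the deploy, `gateway_drop_exceeds 50 44` succeeds (6 dropped, 12%), while `gateway_drop_exceeds 50 45` does not (exactly 10%).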
Escalate next business hour (not page-worthy):
- A single level=error that does not recur.
- One gateway flapping.
- Slow queries on non-critical endpoints.
Do not page:
- Noisy but already-silenced alerts.
- Expected restart loops during a staged rollout.
Reliability-report structure
Files in docs/reliability-reports/ follow this shape (see 2026-04-15-14-17-smoke.md for a short example, 2026-04-15-6h.md for a long one):
- Title includes timestamp and duration.
- Section 1: what was tested (cloud, gateways, frontend, SFU, each listed).
- Section 2: result summary (PASS/FAIL per subsystem).
- Section 3: per-subsystem detail — the commands run, the output excerpts, anomalies.
- Section 4: outstanding issues with links to incident threads or PR numbers.
When you start a report, copy the most recent short-duration file and replace content top-down. Don’t invent a new format.
See also
- On-call basics — what to do when this monitoring triggers a page.
- Feature-flag rollout — the soak period uses the commands on this page.
- Incident response runbook — if the checklist finds something broken.
- In-repo: docs/reliability-reports/ for historical context. .claude/skills/novavms-reliability-test/ for the automated runner.