Release pipeline monitoring

Every deploy to novavms.novalien.com (HostHatch, Tailscale 100.70.175.62) and every gateway-agent push triggers a monitoring window. This page is the checklist for that window: what to run, where logs live, and when to page on-call. Most regressions surface in the first 30 minutes — if nothing is wrong by then, the deploy is usually stable.

For the deploy itself, use the novavms-cloud-deploy and novavms-gateway-deploy skills (in .claude/skills/). This page covers monitoring only.

Post-deploy health checklist (first 30 minutes)

Run in order. Each command is copy-pasteable; substitute placeholders only where indicated.

1. Cloud HTTP health

curl -sf https://novavms.novalien.com/api/v1/health | jq
# expect: {"status":"ok","db":"ok","go2rtc":"ok","version":"<git-sha>"}

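If any of status, db, or go2rtc is not ok, or the command prints nothing (with -sf, curl produces no output on an HTTP error), do not proceed. Escalate to on-call.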

2. Frontend loads cleanly

curl -sI https://novavms.novalien.com/ | head -5
# expect: HTTP/2 200

Open the site in a private browser window. Confirm that /login renders and that the browser console shows no red (error-level) entries. See the novavms-browser-qa skill for automation.

3. Container logs — error grep on cloud

SSH via Tailscale (the same pattern the novavms-cloud-deploy skill uses):

/c/Windows/System32/OpenSSH/ssh.exe root@100.70.175.62 \
  'docker logs --since 30m novavms-cloud 2>&1 | grep -E "level=error|panic|FATAL"'

A clean deploy has zero matches. For any match, read the surrounding log lines before deciding whether it is a regression.
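The "zero matches" rule can be turned into an exit code, which is convenient when the check runs from a script. This is a minimal sketch: count_error_lines is a hypothetical name, and the pattern is the one from the command above.

```shell
# Hypothetical helper: reads log text on stdin, prints how many lines match
# the step 3 error pattern, and succeeds only when that count is zero.
count_error_lines() {
  # grep -c prints 0 and exits 1 when nothing matches; "|| true" keeps that
  # from aborting callers that run under "set -e".
  n="$(grep -cE 'level=error|panic|FATAL' || true)"
  echo "$n"
  [ "$n" -eq 0 ]
}

# Usage (needs Tailscale access, so it is left commented out):
#   /c/Windows/System32/OpenSSH/ssh.exe root@100.70.175.62 \
#     'docker logs --since 30m novavms-cloud 2>&1' | count_error_lines
```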

4. Container logs — gateway error grep

Gateways log per host. To sweep the fleet, iterate over the gateway list from /platform/orgs (see Cross-org search):

# per gateway (example pi-gateway):
/c/Windows/System32/OpenSSH/ssh.exe pi-gateway \
  'sudo journalctl -u novavms-gateway --since "30 min ago" | grep -E "ERROR|panic"'
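One way to sweep the fleet is a loop over a host list. The sketch below assumes the gateway list has already been exported to a text file with one SSH host per line. That file format, the function name sweep_gateways, and the SSH_BIN variable are all hypothetical; this page does not define how the list is exported.

```shell
# SSH binary: defaults to the Windows OpenSSH path used on this page.
SSH_BIN="${SSH_BIN:-/c/Windows/System32/OpenSSH/ssh.exe}"

# Hypothetical helper: $1 is a file with one gateway SSH host per line
# (assumed format). Prints one status line per host and returns non-zero
# if any host had matches or could not be checked.
sweep_gateways() {
  host_file="$1"
  bad=0
  while IFS= read -r host; do
    [ -n "$host" ] || continue
    rc=0
    # -n stops ssh from consuming the rest of the host list on stdin.
    "$SSH_BIN" -n "$host" \
      'sudo journalctl -u novavms-gateway --since "30 min ago" | grep -E "ERROR|panic"' \
      || rc=$?
    # The remote pipeline exits with grep's status: 0 = matches, 1 = none.
    case "$rc" in
      0) echo "$host: MATCHES (see lines above)"; bad=1 ;;
      1) echo "$host: clean" ;;
      *) echo "$host: check failed (exit $rc)"; bad=1 ;;
    esac
  done < "$host_file"
  return "$bad"
}
```

An unreachable gateway (ssh exits 255) is reported separately from a clean one, so a host that cannot be checked is never counted as healthy.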

5. AI pipeline check

/c/Windows/System32/OpenSSH/ssh.exe root@100.70.175.62 \
  'docker logs --since 30m novavms-cloud 2>&1 | grep -E "ai.provider|gemini|ollama" | tail -50'

Watch for provider_unavailable, quota_exceeded, or a spike in parse_error. These typically indicate an upstream provider change rather than a bug in our code.
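To make a spike easier to spot than it is in a 50-line tail, the three markers can be tallied. This is a sketch only: ai_error_tally is a hypothetical name, and it assumes each marker appears literally in the log line, as the text above suggests.

```shell
# Hypothetical helper: reads log text on stdin and prints how many lines
# mention each of the three markers named above.
ai_error_tally() {
  awk '
    /provider_unavailable/ { n["provider_unavailable"]++ }
    /quota_exceeded/       { n["quota_exceeded"]++ }
    /parse_error/          { n["parse_error"]++ }
    END {
      printf "provider_unavailable=%d quota_exceeded=%d parse_error=%d\n",
        n["provider_unavailable"], n["quota_exceeded"], n["parse_error"]
    }'
}

# Usage (needs Tailscale access, so it is left commented out):
#   /c/Windows/System32/OpenSSH/ssh.exe root@100.70.175.62 \
#     'docker logs --since 30m novavms-cloud 2>&1' | ai_error_tally
```

What counts as a spike is not defined on this page, so compare the numbers against a window from before the deploy.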

6. WebRTC / SFU health (for releases touching live view)

/c/Windows/System32/OpenSSH/ssh.exe root@100.70.175.62 \
  'docker logs --since 30m novavms-cloud 2>&1 | grep -E "session_start|extract-NAL|readyState" | tail -50'

See docs/superpowers/plans/2026-04-21-webrtc-allcodecs-STATUS.md for what healthy and regressing patterns look like on the codec pipeline.

7. Write a reliability report

If the deploy is non-trivial or touches a hot path, run a soak and save the report as docs/reliability-reports/YYYY-MM-DD-HH-MM-<duration>.md. The format is documented by the existing files in that directory; the two most recent are canonical.

A report is mandatory for deploys that touch cloud streaming, gateway WebSocket, auth, or RBAC.
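The file name pattern can be generated rather than typed. A minimal sketch follows: report_path is a hypothetical helper, and it uses the machine's local time, since this page does not say whether report timestamps are local or UTC.

```shell
# Hypothetical helper: prints a report path for the given duration label
# (for example "smoke" or "6h"), following YYYY-MM-DD-HH-MM-<duration>.md.
# Uses local time; switch to "date -u" if reports are meant to be in UTC.
report_path() {
  printf 'docs/reliability-reports/%s-%s.md\n' "$(date +%Y-%m-%d-%H-%M)" "$1"
}

# Usage:
#   report_path 6h
```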

When to page on-call

Page immediately (P1, see On-call basics):

/api/v1/health returns non-200 for more than 2 minutes.
More than 10% of online gateways drop after the deploy.
Any FATAL log line on cloud or any gateway.
Customer-facing auth starts returning 5xx.
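The 10% gateway criterion is simple arithmetic, sketched here as a check. The function name gateway_drop_is_p1 is hypothetical, and how the before and after counts are obtained is left to the caller; this page does not specify a source for them.

```shell
# Hypothetical helper: $1 = gateways online before the deploy,
# $2 = gateways online now. Succeeds (exit 0) when MORE than 10% dropped.
gateway_drop_is_p1() {
  before="$1"
  after="$2"
  dropped=$((before - after))
  # dropped / before > 10%  is the same as  dropped * 10 > before,
  # which avoids fractions in integer-only shell arithmetic.
  [ "$dropped" -gt 0 ] && [ $((dropped * 10)) -gt "$before" ]
}

# Usage:
#   gateway_drop_is_p1 50 44 && echo "page on-call (P1)"
```

With 50 gateways online before the deploy, losing 6 pages on-call, while losing exactly 5 (10%) does not, because the rule says "more than".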

Escalate during the next business hours (not page-worthy):

A single level=error that does not recur.
One gateway flapping.
Slow queries on non-critical endpoints.

Do not page:

Noisy but already-silenced alerts.
Expected restart loops during a staged rollout.

Reliability-report structure

Files in docs/reliability-reports/ follow this shape (see 2026-04-15-14-17-smoke.md for a short example, 2026-04-15-6h.md for a long one):

Title includes timestamp and duration.
Section 1: what was tested (cloud, gateways, frontend, SFU, each listed).
Section 2: result summary (PASS/FAIL per subsystem).
Section 3: per-subsystem detail — the commands run, the output excerpts, anomalies.
Section 4: outstanding issues with links to incident threads or PR numbers.

When you start a report, copy the most recent short-duration file and replace content top-down. Don’t invent a new format.
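Finding the file to copy can be sketched as below. The helper latest_report is hypothetical and relies on the date-prefixed names sorting chronologically. It cannot tell a short-duration report from a long one, so check the result by hand before copying.

```shell
# Hypothetical helper: prints the lexicographically last .md file in the
# reports directory ($1, defaulting to docs/reliability-reports). With
# date-prefixed names this is the most recent report.
latest_report() {
  dir="${1:-docs/reliability-reports}"
  latest=""
  # Glob results are sorted, so the last existing match wins.
  for f in "$dir"/*.md; do
    [ -e "$f" ] && latest="$f"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Usage (substitute the real timestamp and duration in the target name):
#   cp "$(latest_report)" docs/reliability-reports/YYYY-MM-DD-HH-MM-<duration>.md
```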