# Emergency LLM provider failover
Use this procedure when the primary LLM provider (currently Gemini generativelanguage.googleapis.com/v1beta) returns sustained errors (HTTP 5xx, rate-limit spikes, region outage) and AI event enrichment starts backing up in the queue. The failover flips the fleet to the secondary provider via platform config, notifies affected orgs, then restores primary when the provider reports green.
## Procedure
- Confirm primary is actually down. Open Grafana → `llm_provider_errors_total{provider="gemini"}` for the last 15 min. A provider-side blip does not justify a failover — the threshold is a > 30% error rate sustained for 10 minutes.
- Announce intent in `#novavms-incidents` so other on-call engineers do not roll back your change. Include the Grafana link and expected duration.
- Flip the provider flag. Edit `cloud/.env.prod` on the Avon-HostHatch host:

  ```sh
  ssh root@100.70.175.62 'cd /root/NovaVMS && \
    sed -i "s/^AI_PROVIDER=.*/AI_PROVIDER=openai/" cloud/.env.prod && \
    docker compose restart cloud'
  ```

  The cloud container restarts in < 30 seconds. The AI queue backlog starts draining against OpenAI.
- Verify new events are processing. `docker logs novavms-cloud --tail=50 | grep provider` should show `provider=openai model=gpt-4o-mini` on the next event.
- Notify affected orgs. Run `cmd/notify-llm-failover`:

  ```sh
  go run ./cmd/notify-llm-failover --since=<grafana_spike_start> \
    --message="Temporary AI provider change in effect — no action required"
  ```

  This emails every Owner and Admin in orgs that had at least one AI event queued during the spike window. Audit row: `platform.fleet_notice_sent`.
- Monitor primary. When `llm_provider_errors_total{provider="gemini"}` stays below 2% for 30 minutes, flip back: set `AI_PROVIDER=gemini`, restart, verify, re-announce.
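The flip-and-verify steps above can be rehearsed locally before touching the host. The sketch below is hypothetical: it applies the same `sed` substitution to a throwaway copy of the env file and checks a sample log line for the `provider=openai` marker. The file contents and the log-line format are assumptions for illustration, not captured from production.

```shell
# Dry run of the flag flip against a throwaway env file (assumed contents).
env_file=$(mktemp)
printf 'AI_PROVIDER=gemini\nAI_ENABLED=true\n' > "$env_file"

# Same substitution the runbook applies to cloud/.env.prod on the host.
sed -i 's/^AI_PROVIDER=.*/AI_PROVIDER=openai/' "$env_file"
flipped=$(grep '^AI_PROVIDER=' "$env_file")
echo "$flipped"   # AI_PROVIDER=openai

# Verify-style check against a sample log line (assumed format).
log_line='event=ai_enrichment provider=openai model=gpt-4o-mini'
status="still on primary"
case "$log_line" in
  *provider=openai*) status="failover active" ;;
esac
echo "$status"    # failover active

rm -f "$env_file"
```

Running the substitution on a scratch file first also catches quoting mistakes before they hit the production host.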
## Common variations
- Only one region is affected: the env flag is all-or-nothing for now. A per-org provider override exists (D27, per-camera provider field), but the platform-wide switch is faster in an outage. Accept the cost for the duration.
- Fallback also fails: disable AI enrichment entirely with `AI_ENABLED=false`. Motion alerts keep firing (R4 split pipeline — the immediate path does not depend on AI). Communicate the degraded mode to orgs.
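The R4 split-pipeline behaviour can be illustrated with a toy sketch. Everything below is hypothetical: the function name, flag semantics, and event strings are invented for the example. The point is only that the immediate alert path runs unconditionally, while enrichment is gated on `AI_ENABLED`.

```shell
# Toy model of the split pipeline (assumed flag semantics): the immediate
# motion-alert path fires regardless of AI_ENABLED; only enrichment is gated.
AI_ENABLED=false

handle_motion_event() {
  actions="motion_alert_sent"               # immediate path, no AI dependency
  if [ "$AI_ENABLED" = "true" ]; then
    actions="$actions ai_enrichment_queued" # async path, skipped when disabled
  fi
  echo "$actions"
}

result=$(handle_motion_event)
echo "$result"   # motion_alert_sent
```

With `AI_ENABLED=true` the same call would also emit `ai_enrichment_queued`, which is why flipping the flag back re-enables enrichment without touching the alert path.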
## Related
- Feature-flag rollout — the slower, per-org equivalent for non-emergency changes.
- Incident response runbook — wraps this procedure with the incident-command steps.
- Fleet metrics — watching the queue drain during recovery.