# Emergency LLM provider failover
Use this procedure when the primary LLM provider (currently Gemini generativelanguage.googleapis.com/v1beta) returns sustained errors (HTTP 5xx, rate-limit spikes, region outage) and AI event enrichment starts backing up in the queue. The failover flips the fleet to the secondary provider via platform config, notifies affected orgs, then restores primary when the provider reports green.
## Procedure
- Confirm primary is actually down. Open Grafana → `llm_provider_errors_total{provider="gemini"}` for the last 15 min. A provider-side blip does not justify a failover — the threshold is a > 30% error rate sustained for 10 minutes.
- Announce intent in `#novavms-incidents` so other on-call engineers do not roll back your change. Include the Grafana link and expected duration.
- Flip the provider flag. Edit `cloud/.env.prod` on the Avon-HostHatch host:

  ```sh
  ssh root@100.70.175.62 'cd /root/NovaVMS && \
    sed -i "s/^AI_PROVIDER=.*/AI_PROVIDER=openai/" cloud/.env.prod && \
    docker compose restart cloud'
  ```

  The cloud container restarts in < 30 seconds. The AI queue backlog starts draining against OpenAI.
- Verify new events are processing. `docker logs novavms-cloud --tail=50 | grep provider` should show `provider=openai model=gpt-4o-mini` on the next event.
- Notify affected orgs. Run `cmd/notify-llm-failover`:

  ```sh
  go run ./cmd/notify-llm-failover --since=<grafana_spike_start> \
    --message="Temporary AI provider change in effect — no action required"
  ```

  This emails every Owner and Admin in orgs that had at least one AI event queued during the spike window. Audit row: `platform.fleet_notice_sent`.
- Monitor primary. When `llm_provider_errors_total{provider="gemini"}` stays below 2% for 30 minutes, flip back: set `AI_PROVIDER=gemini`, restart, verify, re-announce.
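The flip-and-verify steps above can be rehearsed locally before touching the host. The sketch below is hypothetical: it applies the same `sed` substitution to a throwaway copy of the env file and checks a sample log line for the `provider=openai` marker. The file contents and the log-line format are assumptions for illustration, not captured from production.

```shell
# Dry run of the flag flip against a throwaway env file (assumed contents).
env_file=$(mktemp)
printf 'AI_PROVIDER=gemini\nAI_ENABLED=true\n' > "$env_file"

# Same substitution the runbook applies to cloud/.env.prod on the host.
sed -i 's/^AI_PROVIDER=.*/AI_PROVIDER=openai/' "$env_file"
flipped=$(grep '^AI_PROVIDER=' "$env_file")
echo "$flipped"   # AI_PROVIDER=openai

# Verify-style check against a sample log line (assumed format).
log_line='event=ai_enrichment provider=openai model=gpt-4o-mini'
status="still on primary"
case "$log_line" in
  *provider=openai*) status="failover active" ;;
esac
echo "$status"    # failover active

rm -f "$env_file"
```

Running the substitution on a scratch file first also catches quoting mistakes before they hit the production host.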
## Common variations
- Only one region is affected: the env flag is all-or-nothing for now. A per-org provider override exists (D27, per-camera provider field), but the platform-wide switch is faster in an outage. Accept the cost for the duration.
- Fallback also fails: disable AI enrichment entirely with `AI_ENABLED=false`. Motion alerts keep firing (R4 split pipeline — the immediate path does not depend on AI). Communicate the degraded mode to orgs.
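The R4 split-pipeline behaviour can be illustrated with a toy sketch. Everything below is hypothetical: the function name, flag semantics, and event strings are invented for the example. The point is only that the immediate alert path runs unconditionally, while enrichment is gated on `AI_ENABLED`.

```shell
# Toy model of the split pipeline (assumed flag semantics): the immediate
# motion-alert path fires regardless of AI_ENABLED; only enrichment is gated.
AI_ENABLED=false

handle_motion_event() {
  actions="motion_alert_sent"               # immediate path, no AI dependency
  if [ "$AI_ENABLED" = "true" ]; then
    actions="$actions ai_enrichment_queued" # async path, skipped when disabled
  fi
  echo "$actions"
}

result=$(handle_motion_event)
echo "$result"   # motion_alert_sent
```

With `AI_ENABLED=true` the same call would also emit `ai_enrichment_queued`, which is why flipping the flag back re-enables enrichment without touching the alert path.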
## Related
- Feature-flag rollout — the slower, per-org equivalent for non-emergency changes.
- Incident response runbook — wraps this procedure with the incident-command steps.
- Fleet metrics — watching the queue drain during recovery.