Why NovaVMS uses a single AI call (not two) for event analysis
Classic VMS AI pipelines split perception from judgment across two models: a detector finds objects in a frame, and a second model decides what those objects mean. NovaVMS tried that and chose differently. By default, one VLM call handles both (per R1-REV). Two-pass is still available per prompt pack, but it is no longer the default.
The two-pass approach we rejected
The original design called for two phases. Phase 1 was a vision model (a VLM, or in older pipelines a YOLO-style detector) that extracted what the camera saw — people, vehicles, packages, bounding boxes, descriptions. Phase 2 was a text-only LLM that took Phase 1’s output and applied judgment — severity, recommended action, which tags to emit, whether to fire an alert.
The split is appealing on paper. Each model is focused. Phase 2 becomes cheap to re-run when severity rules change. Judgment logic lives in text, so it is easy to edit and version.
The split fell short in practice for three reasons.
- Latency doubled. Two sequential API calls added 1.5 to 3 seconds per event. For operators watching the event feed, a three-second delay between trigger and tag is the difference between useful and irrelevant.
- Cost doubled. Two paid calls per event at Verkada-level volumes add up. A self-hoster running Gemini 2.5 Flash for a 50-camera site pays more than double per event, because Phase 1’s output must be re-sent as Phase 2’s input tokens on top of the second call itself.
- Modern VLMs are already good at judgment. Gemini 2.5 Flash and GPT-4V both emit structured severity, tags, and narrative text from a single multimodal prompt. The text-only second pass corrected their judgment only at the margins.
The one-pass approach we chose
A single VLM call now takes the event snapshot (or the multi-frame keyframe set, or the full clip for Gemini) plus the prompt pack, and returns everything in one response: tags, description, severity, entities, narrative, and the embedding text used for semantic search.
The flow:
Event trigger → clip uploaded → AI queue → VLM call (perception + judgment) → composite tags → embedding → store

The prompt pack tells the VLM what the org cares about. Output is a structured JSON schema the worker validates before persisting. There is no second round-trip.
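A minimal sketch of the worker-side validation step. The field names follow the response shape listed above (tags, description, severity, entities, narrative, embedding text); the exact NovaVMS schema and key names are assumptions here, not confirmed by this doc.

```python
import json

# Severity levels named in this doc's prompt-pack discussion.
ALLOWED_SEVERITIES = {"critical", "warning", "info"}
# Hypothetical key names mirroring the response fields described above.
REQUIRED_FIELDS = {"tags", "description", "severity",
                   "entities", "narrative", "embedding_text"}

def validate_vlm_response(raw: str) -> dict:
    """Parse and sanity-check one one-pass VLM response before persisting."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']!r}")
    if not isinstance(data["tags"], list):
        raise ValueError("tags must be a list")
    return data

# A well-formed single-pass response passes validation.
ok = validate_vlm_response(json.dumps({
    "tags": ["person", "package"],
    "description": "A courier leaves a package at the door.",
    "severity": "info",
    "entities": [{"type": "person", "bbox": [10, 20, 110, 220]}],
    "narrative": "Routine delivery.",
    "embedding_text": "courier package front door delivery",
}))
```

Rejecting a response at this boundary keeps a malformed or miscalibrated VLM output from ever reaching storage or alert rules.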
Trade-offs
Decision R1-REV records the cost-benefit explicitly.
- Latency. One-pass is roughly 2x faster end-to-end. p50 typically lands under 1.5 seconds from upload to stored tags with Gemini 2.5 Flash.
- Cost. Roughly 40% cheaper per event. One paid call instead of two, and no Phase-2 input-token overhead.
- Model constraints. One-pass requires a VLM capable enough to produce structured judgment. Gemini 2.0 Flash and above, GPT-4V, and Qwen-VL qualify. Older or smaller models may under-classify severity.
- Re-running judgment is harder. Under two-pass you could change severity rules and re-run Phase 2 only. Under one-pass, every re-run is a full VLM call. For teams that tune severity frequently, this is the real cost.
- One prompt does two jobs. The prompt pack author has to balance perception language (“describe what you see”) with judgment language (“classify as critical/warning/info”). A bad balance produces either vague descriptions or miscalibrated severity.
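The ~40% cost figure above can be sanity-checked with rough arithmetic. The token counts and per-token prices below are invented for the sketch, not NovaVMS or provider numbers; the point is only that re-sending Phase 1's output as Phase 2 input drives the gap.

```python
# Hypothetical pricing: $0.10 / 1M input tokens, $0.40 / 1M output tokens.
PRICE_IN = 0.10 / 1_000_000
PRICE_OUT = 0.40 / 1_000_000

def call_cost(tokens_in: int, tokens_out: int) -> float:
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

# One-pass: a single multimodal call returns perception and judgment together.
one_pass = call_cost(tokens_in=1000, tokens_out=300)

# Two-pass: Phase 1 perception, then Phase 2 re-ingests Phase 1's 300-token
# output plus a 500-token severity-rules prompt.
two_pass = call_cost(1000, 300) + call_cost(300 + 500, 150)

savings = 1 - one_pass / two_pass
print(f"one-pass ${one_pass:.6f}, two-pass ${two_pass:.6f}, savings {savings:.0%}")
```

With these illustrative numbers the savings land just under 40%, consistent with the figure recorded in R1-REV.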
When two-pass would be better
One-pass is the default, not the only option. Two-pass is still the right choice in three scenarios, and the pass_mode field on prompt packs exposes it.
- Batch reprocessing. If you are re-scoring a month of events against a new severity rule, two-pass lets you skip the expensive VLM phase and only re-run the cheap LLM phase. The two-pass code path exists in the codebase but batch reprocessing is currently stubbed; it is on the roadmap.
- Multi-frame routing. If you want Phase 1 to emit a signal (“there is a person”) that decides whether Phase 2 ever runs at all, two-pass saves cost at scale. Also currently stubbed.
- Regulated workloads. If compliance requires that judgment logic be expressible in plain text (not hidden inside a VLM prompt), two-pass keeps Phase 2 as a readable text-only LLM prompt. The text-only phase is auditable in a way that a VLM prompt is not.
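The scenarios above hinge on the pass_mode field. Only that field name appears in this doc; the pack structure, prompt keys, and function names below are hypothetical, sketched to show how a worker might branch between the default one-pass path and the two-pass path.

```python
def run_vlm(prompt: str, snapshot: bytes) -> dict:
    # Stub standing in for a real multimodal API call.
    return {"severity": "info", "tags": ["person"], "phase": "vlm"}

def run_llm(prompt: str, perception: dict) -> dict:
    # Stub standing in for a text-only judgment call over Phase 1 output.
    return {**perception, "phase": "llm"}

def analyze_event(pack: dict, snapshot: bytes) -> dict:
    """Route one event through one-pass or two-pass per the prompt pack."""
    mode = pack.get("pass_mode", "one_pass")  # one-pass is the default
    if mode == "two_pass":
        perception = run_vlm(pack["perception_prompt"], snapshot)  # Phase 1
        return run_llm(pack["judgment_prompt"], perception)        # Phase 2
    # Default path: a single VLM call does perception and judgment together.
    return run_vlm(pack["combined_prompt"], snapshot)

# Usage: the same worker entry point serves both modes.
default_pack = {"combined_prompt": "describe and classify"}
audit_pack = {"pass_mode": "two_pass",
              "perception_prompt": "describe what you see",
              "judgment_prompt": "classify as critical/warning/info"}
one = analyze_event(default_pack, b"jpeg-bytes")
two = analyze_event(audit_pack, b"jpeg-bytes")
```

Keeping the branch at the pack level means a regulated or batch-reprocessing pack can opt into two-pass without changing worker code.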
For the typical NovaVMS deployment — a few sites, a few dozen cameras, events arriving one at a time — one-pass wins on latency and cost without meaningful accuracy loss.
See also
- Configure the LLM model — how to select provider and model (admin-gated per D82)
- Configure prompt packs — how to tune prompt content (operator-gated per D82)
- Alert rule schema — fields the VLM populates and alert rules consume