Why NovaVMS uses a single AI call (not two) for event analysis
Classic VMS AI pipelines split perception from judgment across two models: a detector finds objects in a frame, and a second model decides what those objects mean. NovaVMS tried that and chose differently. By default, one VLM call handles both (per R1-REV). Two-pass is still available per prompt pack, but it is no longer the default.
The two-pass approach we rejected
The original design called for two phases. Phase 1 was a vision model (a VLM, or in older pipelines a YOLO-style detector) that extracted what the camera saw — people, vehicles, packages, bounding boxes, descriptions. Phase 2 was a text-only LLM that took Phase 1’s output and applied judgment — severity, recommended action, which tags to emit, whether to fire an alert.
The split is appealing on paper. Each model is focused. Phase 2 becomes cheap to re-run when severity rules change. Judgment logic lives in text, so it is easy to edit and version.
The split fell short in practice for three reasons.
- Latency doubled. Two sequential API calls added 1.5 to 3 seconds per event. For operators watching the event feed, a three-second delay between trigger and tag is the difference between useful and irrelevant.
- Cost doubled. Two paid calls per event at Verkada-level volumes add up. A self-hoster running Gemini 2.5 Flash for a 50-camera site pays more than double per event, because Phase 1’s output must be re-sent as Phase 2’s input tokens on top of the second call itself.
- Modern VLMs are already good at judgment. Gemini 2.5 Flash and GPT-4V both emit structured severity, tags, and narrative text from a single multimodal prompt. The text-only second pass corrected their judgment only at the margins.
The one-pass approach we chose
A single VLM call now takes the event snapshot (or the multi-frame keyframe set, or the full clip for Gemini) plus the prompt pack, and returns everything in one response: tags, description, severity, entities, narrative, and the embedding text used for semantic search.
The flow:
Event trigger → clip uploaded → AI queue → VLM call (perception + judgment) → composite tags → embedding → store

The prompt pack tells the VLM what the org cares about. Output is a structured JSON schema the worker validates before persisting. There is no second round-trip.
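A minimal sketch of the worker-side validation step. The field names follow the response shape listed above (tags, description, severity, entities, narrative, embedding text); the exact NovaVMS schema and key names are assumptions here, not confirmed by this doc.

```python
import json

# Severity levels named in this doc's prompt-pack discussion.
ALLOWED_SEVERITIES = {"critical", "warning", "info"}
# Hypothetical key names mirroring the response fields described above.
REQUIRED_FIELDS = {"tags", "description", "severity",
                   "entities", "narrative", "embedding_text"}

def validate_vlm_response(raw: str) -> dict:
    """Parse and sanity-check one one-pass VLM response before persisting."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']!r}")
    if not isinstance(data["tags"], list):
        raise ValueError("tags must be a list")
    return data

# A well-formed single-pass response passes validation.
ok = validate_vlm_response(json.dumps({
    "tags": ["person", "package"],
    "description": "A courier leaves a package at the door.",
    "severity": "info",
    "entities": [{"type": "person", "bbox": [10, 20, 110, 220]}],
    "narrative": "Routine delivery.",
    "embedding_text": "courier package front door delivery",
}))
```

Rejecting a response at this boundary keeps a malformed or miscalibrated VLM output from ever reaching storage or alert rules.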
Trade-offs
Decision R1-REV records the cost-benefit explicitly.
- Latency. One-pass is roughly 2x faster end-to-end. p50 typically lands under 1.5 seconds from upload to stored tags with Gemini 2.5 Flash.
- Cost. Roughly 40% cheaper per event. One paid call instead of two, and no Phase-2 input-token overhead.
- Model constraints. One-pass requires a VLM capable enough to produce structured judgment. Gemini 2.0 Flash and above, GPT-4V, and Qwen-VL qualify. Older or smaller models may under-classify severity.
- Re-running judgment is harder. Under two-pass you could change severity rules and re-run Phase 2 only. Under one-pass, every re-run is a full VLM call. For teams that tune severity frequently, this is the real cost.
- One prompt does two jobs. The prompt pack author has to balance perception language (“describe what you see”) with judgment language (“classify as critical/warning/info”). A bad balance produces either vague descriptions or miscalibrated severity.
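The ~40% cost figure above can be sanity-checked with rough arithmetic. The token counts and per-token prices below are invented for the sketch, not NovaVMS or provider numbers; the point is only that re-sending Phase 1's output as Phase 2 input drives the gap.

```python
# Hypothetical pricing: $0.10 / 1M input tokens, $0.40 / 1M output tokens.
PRICE_IN = 0.10 / 1_000_000
PRICE_OUT = 0.40 / 1_000_000

def call_cost(tokens_in: int, tokens_out: int) -> float:
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

# One-pass: a single multimodal call returns perception and judgment together.
one_pass = call_cost(tokens_in=1000, tokens_out=300)

# Two-pass: Phase 1 perception, then Phase 2 re-ingests Phase 1's 300-token
# output plus a 500-token severity-rules prompt.
two_pass = call_cost(1000, 300) + call_cost(300 + 500, 150)

savings = 1 - one_pass / two_pass
print(f"one-pass ${one_pass:.6f}, two-pass ${two_pass:.6f}, savings {savings:.0%}")
```

With these illustrative numbers the savings land just under 40%, consistent with the figure recorded in R1-REV.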
When two-pass would be better
One-pass is the default, not the only option. Two-pass is still the right choice in three scenarios, and the pass_mode field on prompt packs exposes it.
- Batch reprocessing. If you are re-scoring a month of events against a new severity rule, two-pass lets you skip the expensive VLM phase and only re-run the cheap LLM phase. The two-pass code path exists in the codebase but batch reprocessing is currently stubbed; it is on the roadmap.
- Multi-frame routing. If you want Phase 1 to emit a signal (“there is a person”) that decides whether Phase 2 ever runs at all, two-pass saves cost at scale. Also currently stubbed.
- Regulated workloads. If compliance requires that judgment logic be expressible in plain text (not hidden inside a VLM prompt), two-pass keeps Phase 2 as a readable text-only LLM prompt. The text-only phase is auditable in a way that a VLM prompt is not.
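The scenarios above hinge on the pass_mode field. Only that field name appears in this doc; the pack structure, prompt keys, and function names below are hypothetical, sketched to show how a worker might branch between the default one-pass path and the two-pass path.

```python
def run_vlm(prompt: str, snapshot: bytes) -> dict:
    # Stub standing in for a real multimodal API call.
    return {"severity": "info", "tags": ["person"], "phase": "vlm"}

def run_llm(prompt: str, perception: dict) -> dict:
    # Stub standing in for a text-only judgment call over Phase 1 output.
    return {**perception, "phase": "llm"}

def analyze_event(pack: dict, snapshot: bytes) -> dict:
    """Route one event through one-pass or two-pass per the prompt pack."""
    mode = pack.get("pass_mode", "one_pass")  # one-pass is the default
    if mode == "two_pass":
        perception = run_vlm(pack["perception_prompt"], snapshot)  # Phase 1
        return run_llm(pack["judgment_prompt"], perception)        # Phase 2
    # Default path: a single VLM call does perception and judgment together.
    return run_vlm(pack["combined_prompt"], snapshot)

# Usage: the same worker entry point serves both modes.
default_pack = {"combined_prompt": "describe and classify"}
audit_pack = {"pass_mode": "two_pass",
              "perception_prompt": "describe what you see",
              "judgment_prompt": "classify as critical/warning/info"}
one = analyze_event(default_pack, b"jpeg-bytes")
two = analyze_event(audit_pack, b"jpeg-bytes")
```

Keeping the branch at the pack level means a regulated or batch-reprocessing pack can opt into two-pass without changing worker code.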
For the typical NovaVMS deployment — a few sites, a few dozen cameras, events arriving one at a time — one-pass wins on latency and cost without meaningful accuracy loss.
See also
- Configure the LLM model — how to select provider and model (admin-gated per D82)
- Configure prompt packs — how to tune prompt content (operator-gated per D82)
- Alert rule schema — fields the VLM populates and alert rules consume