Stop paying twice to see the same pixels.
Video VLMs keep re-buying visual state that is already sitting in the stream. VLMaxxing asks a frozen model to reuse what survived, refresh what changed, and keep the expensive bits for when they matter. It is training-free, and the tested follow-up rows show no measurable accuracy drift.
What it actually does
- Most frames are reruns. The factory wall did not move; the cache says so. VLMaxxing makes the runtime remember that.
- After ingest, later questions on the same video reuse repaired visual state instead of rebuilding the expensive prefix.
- On fresh videos, sparse vision helps when timed vision work is actually skipped. C-CEILING keeps the speedup honest: a local win only counts for the share of wall-clock it removed.
- For live streams, the target is perception rate: buy fresh evidence around change, motion, text, object boundaries, and sensor disagreement.
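The reuse-and-refresh idea in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not VLMaxxing's runtime: `TileCache`, `tile_changed`, the flat tile layout, and the mean-absolute-difference threshold are all hypothetical names and choices made for the example, and `encode` stands in for whatever expensive vision work a real stack would run.

```python
from typing import Callable, Dict, List, Tuple

Tile = Tuple[int, ...]   # flattened pixel values for one tile (hypothetical layout)
Frame = List[Tile]       # a frame as a fixed-size list of tiles


def tile_changed(old: Tile, new: Tile, threshold: float) -> bool:
    """Mean absolute difference test; the threshold is an illustrative knob."""
    diff = sum(abs(a - b) for a, b in zip(old, new)) / len(new)
    return diff > threshold


class TileCache:
    """Keeps the last encoded state per tile and re-encodes only changed tiles."""

    def __init__(self, encode: Callable[[Tile], object], threshold: float = 4.0):
        self.encode = encode          # stand-in for the expensive vision step
        self.threshold = threshold
        self.pixels: Dict[int, Tile] = {}    # reference pixels at last encode
        self.state: Dict[int, object] = {}   # cached encoded state per tile
        self.encoded_tiles = 0               # how much fresh vision work was bought

    def update(self, frame: Frame) -> List[object]:
        for idx, tile in enumerate(frame):
            old = self.pixels.get(idx)
            # Compare against the pixels that produced the cached state,
            # so slow drift eventually crosses the threshold.
            if old is None or tile_changed(old, tile, self.threshold):
                self.state[idx] = self.encode(tile)
                self.pixels[idx] = tile
                self.encoded_tiles += 1
        return [self.state[i] for i in range(len(frame))]


cache = TileCache(encode=sum)  # toy "encoder": sum of pixel values
cache.update([(10, 10, 10, 10), (50, 50, 50, 50), (90, 90, 90, 90)])
cache.update([(10, 10, 10, 10), (50, 50, 50, 50), (200, 200, 200, 200)])
print(cache.encoded_tiles)  # 4: three tiles on ingest, one refreshed tile after
```

The counter is the part worth watching: the second frame buys one encode instead of three, which is the spend pattern the bullets describe.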
The Numbers Worth Clicking
| Lane | Teaser number | What it means | Guardrail |
|---|---|---|---|
| C-PERSIST | 14.90–35.92× | later questions reuse repaired same-video state | n=93 · 0 paired choice/correctness drift |
| C-STREAM target | 54.68 fps | warm 32-frame follow-up perception on a 26B-class stack | 19/21 rows >30 fps · first question pays once |
| C-VISION | 1.316× | fresh-video vision work skipped before the first answer | Gemma 32f short · n=20 · 0 drift/parse failures |
| C-CEILING | stage share | a local win only counts where wall-clock was actually skipped | separate first-query, follow-up, streaming, and routing rows |
Same numbers, separate workloads: first query, later questions on the same video, warm streaming views, and routing overlays each have their own denominator.
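One way to read the "stage share" guardrail is as an Amdahl-style bound: a speedup inside one stage moves end-to-end time only in proportion to that stage's share of wall-clock. The sketch below assumes that reading; it is not the project's own definition of C-CEILING, the function names are hypothetical, and the inputs are illustrative rather than measured.

```python
def end_to_end_speedup(stage_share: float, local_speedup: float) -> float:
    """Amdahl-style end-to-end speedup.

    stage_share: fraction of baseline wall-clock spent in the accelerated stage.
    local_speedup: how much faster that stage alone becomes.
    """
    if not 0.0 <= stage_share <= 1.0:
        raise ValueError("stage_share must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - stage_share) + stage_share / local_speedup)


def speedup_ceiling(stage_share: float) -> float:
    """Upper bound on end-to-end speedup if the stage cost dropped to zero."""
    if not 0.0 <= stage_share <= 1.0:
        raise ValueError("stage_share must be in [0, 1]")
    if stage_share == 1.0:
        return float("inf")
    return 1.0 / (1.0 - stage_share)


# Illustrative inputs only: a stage that is 30% of wall-clock, made 2x faster.
print(round(end_to_end_speedup(0.3, 2.0), 3))  # 1.176
print(round(speedup_ceiling(0.3), 3))          # 1.429
```

Under this reading, each workload (first query, follow-up, streaming, routing) needs its own stage share, which is why the rows keep separate denominators.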
Watch the Cache Spend
Orange shows where the runtime buys fresh visual evidence. The pattern is the point: moving regions get new spend, while stable and shifted structure keeps its cached state.
Why it matters
Continuous-video agents (computer use, screen recording, robotics, games, driving) need to notice the world changing while most of it does not. In the 26B scale-out bundle, warm follow-up turns hit 54.68 fps at the median on 32-frame rows, with 19/21 rows above 30 fps.
Streaming is the big target because the input rate keeps rising. The future interface is not just more frames; it is a state stream that carries what stayed put, what shifted, what surprised the codec, and which tiles deserve fresh vision now.
Where reuse gets interesting
Toward VLM-native codecs
Today’s codecs already know about motion, residual surprise, stable blocks, and prediction failure. VLMaxxing asks what happens when that structure becomes model state instead of being flattened back into RGB.
The next codec for models should speak in active tiles, object state, sensor time, uncertainty, text events, and cheap invalidation signals. More pixels is the boring interface; fresh evidence is the interesting one.
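To make the vocabulary above concrete, here is a rough sketch of what one update in such a state stream might carry and how a runtime could turn it into a refresh plan. Everything here is hypothetical: `StateUpdate`, its field names, `plan_refresh`, and the uncertainty cutoff are assumptions for illustration, not a proposed format or an existing codec API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

TileId = Tuple[int, int]  # (row, col) in a hypothetical tile grid


@dataclass
class StateUpdate:
    """One message in an imagined model-facing state stream."""
    sensor_time_ns: int
    active_tiles: Set[TileId] = field(default_factory=set)       # deserve fresh vision now
    invalidated_tiles: Set[TileId] = field(default_factory=set)  # cached state no longer trusted
    uncertainty: Dict[TileId, float] = field(default_factory=dict)  # 0.0 = sure, 1.0 = unknown
    text_events: List[str] = field(default_factory=list)         # e.g. newly visible text
    object_state: Dict[str, TileId] = field(default_factory=dict)  # object name -> anchor tile


def plan_refresh(
    update: StateUpdate,
    cached_tiles: Set[TileId],
    max_uncertainty: float = 0.5,
) -> Tuple[Set[TileId], Set[TileId]]:
    """Split tiles into (refresh, reuse) from the signals in one update."""
    refresh = set(update.active_tiles) | set(update.invalidated_tiles)
    # Tiles the stream is unsure about also get fresh evidence.
    refresh |= {t for t, u in update.uncertainty.items() if u > max_uncertainty}
    # Everything else already in the cache is kept as-is.
    reuse = set(cached_tiles) - refresh
    return refresh, reuse


update = StateUpdate(
    sensor_time_ns=1_000_000,
    active_tiles={(0, 0)},
    invalidated_tiles={(1, 1)},
    uncertainty={(2, 2): 0.9, (3, 3): 0.1},
)
refresh, reuse = plan_refresh(update, {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)})
print(sorted(refresh))  # [(0, 0), (1, 1), (2, 2)]
print(sorted(reuse))    # [(3, 3), (4, 4)]
```

The design choice being illustrated is that the consumer never touches RGB to decide what to spend on; the stream's own signals drive the plan.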