VLMaxxing by FrameMogging
Training-free anti-recomputation for video VLMs

Stop paying twice to see the same pixels.

Most of what a video VLM is told to ingest is evidence the stack already paid for. We reuse the model’s working state across turns and only refresh what actually changed — training-free, no measurable accuracy drift, on frozen open-weights models.

Routing-budget overlay across three clips. Orange highlights the per-frame fresh budget; static and shifted regions reuse the prior cache.
  • 54 fps: perception throughput on 32-frame follow-up turns (median, n = 21 paired rows; range 23.85–134.43 fps)
  • 18.7×: speedup vs cold-dense at 32 frames (median, paired; range 7.34×–44.17×)
  • 0 / 21: correctness drift (paired choice and correctness diffs; same prompt, frames, and seed)

What it actually does

  • Most of what a video VLM is asked to ingest is what it already ingested. The factory wall did not move; the cache says so.
  • Snapshot the model's working state right after it has seen the video, then reuse that snapshot across follow-up questions instead of re-ingesting the video each time (a minimal sketch follows this list).
  • Tested on Gemma 4 26B-A4B and Qwen 2.5-VL-7B-4bit, frozen weights, M5-class hardware. Every row is paired byte-for-byte against a cold-dense baseline.
  • Training-free. No new weights, no fine-tune, no distillation. The mechanism is in how the cache is borrowed across turns.
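
In cache terms, the turn structure is simple: pay for the video prefill once, then let each follow-up turn decode against a copy of that cache. Below is a minimal Python sketch of that shape; `KVCache`, `prefill_video`, and `answer` are illustrative stand-ins, not the project's actual interface.

```python
# Illustrative sketch of the snapshot-and-reuse turn structure.
# Stand-in names only: the real cache holds per-layer key/value tensors,
# but the shape of the loop is the point. Video prefill happens once,
# and every follow-up turn borrows a copy instead of re-ingesting frames.

from copy import deepcopy

class KVCache:
    """Toy per-chunk key/value store standing in for the real cache."""
    def __init__(self):
        self.entries = []                        # one (keys, values) pair per chunk

def prefill_video(frames):
    """Expensive pass: run every frame through the vision stack once."""
    cache = KVCache()
    for f in frames:
        cache.entries.append(("k:" + f, "v:" + f))   # placeholder for KV tensors
    return cache

def answer(question, cache):
    """Cheap pass: decode against an existing cache, no frame re-ingest."""
    return f"answer({question!r}) over {len(cache.entries)} cached chunks"

frames = [f"frame_{i:02d}" for i in range(32)]
snapshot = prefill_video(frames)                 # pay for the pixels once

for q in ["What color is the forklift?", "Did the door ever open?"]:
    turn_cache = deepcopy(snapshot)              # borrow the snapshot, don't rebuild it
    print(answer(q, turn_cache))
```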

Numbers

Model                  Frames / turn         Median speedup   Median fps   n
Gemma 4 26B-A4B        8                     9.11×            27.0         21
Gemma 4 26B-A4B        32                    18.7×            54.7         21
Qwen 2.5-VL-7B-4bit    20 · short/med/long   14.9–35.9×       –            93

All rows are paired against cold-dense baselines on the same frames, prompt, and decoding seed. Engineering caveats — including the cache-correctness boundary on mixed-attention models and the upstream mlx-vlm fix that closes it — are written up in the paper.
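
What "paired" means operationally: the cold-dense baseline and the cache-reuse path run on identical frames, prompt, and decoding seed, and the outputs are diffed directly. A hedged sketch of one such row, with `run_cold` and `run_warm` as hypothetical callables for the two paths, not the project's actual harness:

```python
# Sketch of one paired row, under the pairing discipline described above.
# `run_cold` and `run_warm` are hypothetical: the cold-dense baseline and
# the cache-reuse path, invoked on byte-identical inputs.

import time

def paired_row(frames, prompt, seed, run_cold, run_warm):
    t0 = time.perf_counter()
    cold_out = run_cold(frames, prompt, seed=seed)
    cold_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    warm_out = run_warm(frames, prompt, seed=seed)
    warm_s = time.perf_counter() - t0

    return {
        "speedup": cold_s / warm_s,      # medians over rows give the table above
        "fps": len(frames) / warm_s,     # perception throughput, follow-up turn
        "drift": cold_out != warm_out,   # any True would break the 0 / 21 stat
    }
```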

See it

Per-clip routing-budget overlays. Orange highlights what the runtime actually paid to re-ingest on each frame; everything else is reused state. A toy version of this routing is sketched after the clips.

TOMATO 0298
Routing-budget overlay on a TOMATO motion clip. Reused vs fresh blocks per frame.
VideoMME 267
Same overlay on a VideoMME slice — most blocks reuse, only the moving region refreshes.
VideoMME 380
Slow camera pan: shifted blocks dominate; the static background is fully cached.
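
For intuition, here is a toy version of the routing these overlays visualize, assuming the runtime diffs fixed-size pixel blocks against the previous frame and only re-ingests the ones that changed. The block size and threshold are invented, and shifted-block matching (the camera-pan case above) is omitted; this is not the project's actual router.

```python
# Toy block router behind the overlays: diff fixed-size blocks against the
# previous frame and mark changed ones as "fresh" (re-ingest); everything
# else reuses the prior cache. Block size and threshold are invented here,
# and shifted-block matching for camera pans is omitted for brevity.

import numpy as np

def route_blocks(prev, curr, block=16, tol=2.0):
    """Return a boolean grid: True = fresh (orange in the overlay), False = reuse."""
    h, w = curr.shape[:2]
    fresh = np.zeros((h // block, w // block), dtype=bool)
    for by in range(h // block):
        for bx in range(w // block):
            ys, xs = by * block, bx * block
            a = prev[ys:ys + block, xs:xs + block].astype(np.float32)
            b = curr[ys:ys + block, xs:xs + block].astype(np.float32)
            fresh[by, bx] = np.abs(a - b).mean() > tol
    return fresh

rng = np.random.default_rng(0)
prev = rng.integers(0, 255, (256, 256, 3), dtype=np.uint8)
curr = prev.copy()
curr[64:128, 64:128] = rng.integers(0, 255, (64, 64, 3), dtype=np.uint8)  # one moving region

grid = route_blocks(prev, curr)
print(f"fresh budget this frame: {grid.mean():.1%} of blocks")  # the rest is reused state
```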

Why it matters

Continuous-video agents — computer-use, screen recording, robotics — need to observe at 30 fps or higher. If the model re-ingests the entire scene on every decision, you cannot keep up. Reusing what already happened lets a 26B-class open-weights VLM perceive at 24–134 fps per follow-up turn, which is the throughput regime where 30 fps observation actually becomes tractable on a laptop.
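
As a back-of-envelope check using the medians from the Numbers table: the implied cold-dense rate sits an order of magnitude under a 30 fps observation budget, while the warm-turn rate clears it.

```python
# Back-of-envelope from the Gemma 4 26B-A4B, 32-frame medians above.
# The cold-dense rate is implied (warm fps divided by speedup), not
# separately reported here.

warm_fps = 54.7                  # median perception throughput, follow-up turns
speedup = 18.7                   # median speedup vs cold-dense, same rows
cold_fps = warm_fps / speedup    # ~2.9 fps implied for the cold-dense path

budget = 30.0                    # observation rate a continuous agent needs
print(f"cold-dense: ~{cold_fps:.1f} fps ({cold_fps / budget:.0%} of a 30 fps budget)")
print(f"cache-reuse: {warm_fps} fps ({warm_fps / budget:.1f}x the budget)")
```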

The bigger picture: most video pipelines hand the model dense pixels every frame and ask it to rediscover what didn't move. Almost everything in a real scene didn't move. Anti-recomputation is just the cache-side answer to that waste, and the saving grows every time the input rate goes up.

Links