VLMaxxing by FrameMogging

Training-free anti-recomputation for video VLMs

Stop paying twice to see the same pixels.

Most of what a video VLM is told to ingest is evidence the stack already paid for. We reuse the model’s working state across turns and only refresh what actually changed — training-free, no measurable accuracy drift, on frozen open-weights models.

Read the paper → · github.com/jfbastien/codec-through

Routing-budget overlay across three clips. Orange highlights the per-frame fresh budget; static and shifted regions reuse the prior cache.

54 fps: perception throughput, 32f follow-up turns (median, n=21 paired rows; range 23.85–134.43 fps)

18.7×: speedup vs cold-dense, 32f (median paired; range 7.34×–44.17×)

0 / 21: correctness drift (paired choice diffs and paired correctness diffs; same prompt, frames, and seed)

What it actually does

Most of what a video VLM is asked to ingest is what it already ingested. The factory wall did not move; the cache says so.
Snapshot the model's working state right after it has seen the video, then reuse that snapshot across follow-up questions instead of re-ingesting the video each time.
Tested on Gemma 4 26B-A4B and Qwen 2.5-VL-7B-4bit, frozen weights, M5-class hardware. Byte-paired against cold-dense baselines.
Training-free. No new weights, no fine-tune, no distillation. The mechanism is in how the cache is borrowed across turns.
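The snapshot-and-reuse idea above can be sketched with a toy model. This is a minimal illustration, not the project's implementation: `ToyState`, `ToyModel`, `cold_dense`, and `snapshot_reuse` are hypothetical names, the "working state" is a plain list standing in for something like a KV cache, and cost is counted in tokens ingested rather than measured in seconds.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyState:
    """Stand-in for a model's working state (for example, a KV cache)."""
    entries: list = field(default_factory=list)


class ToyModel:
    """Hypothetical model that counts every token it has to ingest."""

    def __init__(self):
        self.tokens_ingested = 0

    def ingest(self, state, tokens):
        # Each ingested token costs one unit of work and extends the state.
        for token in tokens:
            state.entries.append(token)
            self.tokens_ingested += 1
        return state

    def answer(self, state):
        # Toy "answer": depends only on the contents of the state.
        return tuple(state.entries)


def cold_dense(model, video_tokens, questions):
    """Baseline: re-ingest the whole video for every question."""
    answers = []
    for question in questions:
        state = model.ingest(ToyState(), video_tokens)
        state = model.ingest(state, question)
        answers.append(model.answer(state))
    return answers


def snapshot_reuse(model, video_tokens, questions):
    """Ingest the video once, then branch each turn from the snapshot."""
    snapshot = model.ingest(ToyState(), video_tokens)  # paid once
    answers = []
    for question in questions:
        state = copy.deepcopy(snapshot)  # the snapshot itself is never mutated
        state = model.ingest(state, question)
        answers.append(model.answer(state))
    return answers


video = [f"v{i}" for i in range(32)]
questions = [["q1a", "q1b"], ["q2a"]]

cold_model, reuse_model = ToyModel(), ToyModel()
cold_answers = cold_dense(cold_model, video, questions)
reuse_answers = snapshot_reuse(reuse_model, video, questions)
```

In this toy, both paths produce identical answers because each turn sees the same state either way; only the amount of ingestion work differs. Real caches add constraints this sketch ignores, such as the mixed-attention boundary mentioned in the caveats.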

Numbers

Model	Frames / turn	Median speedup	Median fps	n
Gemma 4 26B-A4B	8	9.11×	27.0	21
Gemma 4 26B-A4B	32	18.7×	54.7	21
Qwen 2.5-VL-7B-4bit	20 · short/med/long	14.9–35.9×	—	93

All rows are paired against cold-dense baselines on the same frames, prompt, and decoding seed. Engineering caveats — including the cache-correctness boundary on mixed-attention models and the upstream mlx-vlm fix that closes it — are written up in the paper.
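One way to read "paired" is that every row times the same clip, prompt, and seed twice, once cold-dense and once with reuse, and the summary statistics are medians over per-row ratios. The sketch below assumes that reading. The field names and `paired_summary` are hypothetical, and the three rows are made-up placeholders chosen for easy arithmetic, not measurements from the paper.

```python
from statistics import median


def paired_summary(rows):
    """Summarize paired cold-dense vs reuse runs, one dict per row."""
    # Per-row speedup: cold time divided by reuse time for the same input.
    speedups = [r["cold_s"] / r["reuse_s"] for r in rows]
    # Per-row perception throughput on the reuse path.
    fps = [r["frames"] / r["reuse_s"] for r in rows]
    # A choice diff: the two paths picked different options.
    choice_diffs = sum(r["cold_choice"] != r["reuse_choice"] for r in rows)
    # A correctness diff: one path was right and the other was wrong.
    correctness_diffs = sum(
        (r["cold_choice"] == r["answer"]) != (r["reuse_choice"] == r["answer"])
        for r in rows
    )
    return {
        "n": len(rows),
        "median_speedup": median(speedups),
        "median_fps": median(fps),
        "choice_diffs": choice_diffs,
        "correctness_diffs": correctness_diffs,
    }


# Placeholder rows (illustrative values only, NOT results from the paper).
rows = [
    {"frames": 32, "cold_s": 8.0, "reuse_s": 0.5,
     "cold_choice": "A", "reuse_choice": "A", "answer": "A"},
    {"frames": 32, "cold_s": 12.0, "reuse_s": 1.0,
     "cold_choice": "C", "reuse_choice": "C", "answer": "B"},
    {"frames": 32, "cold_s": 10.0, "reuse_s": 0.25,
     "cold_choice": "D", "reuse_choice": "D", "answer": "D"},
]

summary = paired_summary(rows)
```

Note that a median of per-row ratios is not the same as the ratio of two medians, so the speedup and fps columns cannot be derived from each other.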

See it

Per-clip routing-budget overlays. Orange highlights what the runtime actually paid to re-ingest on each frame; everything else is reused state.

TOMATO 0298

Routing-budget overlay on a TOMATO motion clip. Reused vs fresh blocks per frame.

VideoMME 267

Same overlay on a VideoMME slice — most blocks reuse, only the moving region refreshes.

VideoMME 380

Slow camera pan: shifted blocks dominate, the static background is fully cached.
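The static, shifted, and fresh labels in these overlays can be illustrated with a small block classifier. This is a sketch under strong assumptions and not the project's routing rule: frames are plain 2D lists of pixel values, blocks must match exactly, and "shifted" only covers a block that moved by exactly one block position horizontally or vertically. `get_block`, `route_blocks`, and `fresh_fraction` are hypothetical helpers.

```python
NEIGHBOR_OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def get_block(frame, by, bx, size):
    """Return the size x size block at block coordinates (by, bx)."""
    rows = frame[by * size:(by + 1) * size]
    return tuple(tuple(row[bx * size:(bx + 1) * size]) for row in rows)


def route_blocks(prev, curr, size=2):
    """Label each block of curr as 'static', 'shifted', or 'fresh'."""
    n_rows, n_cols = len(curr) // size, len(curr[0]) // size
    prev_blocks = {
        (by, bx): get_block(prev, by, bx, size)
        for by in range(n_rows)
        for bx in range(n_cols)
    }
    routing = []
    for by in range(n_rows):
        labels = []
        for bx in range(n_cols):
            block = get_block(curr, by, bx, size)
            if block == prev_blocks[(by, bx)]:
                labels.append("static")   # unchanged: reuse in place
            elif any(prev_blocks.get((by + dy, bx + dx)) == block
                     for dy, dx in NEIGHBOR_OFFSETS):
                labels.append("shifted")  # seen before at a neighboring block
            else:
                labels.append("fresh")    # new content: must be re-ingested
        routing.append(labels)
    return routing


def fresh_fraction(routing):
    """Share of blocks that need fresh compute (the orange part)."""
    labels = [label for row in routing for label in row]
    return labels.count("fresh") / len(labels)


prev_frame = [[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 4, 4],
              [3, 3, 4, 4]]
curr_frame = [[1, 1, 1, 1],
              [1, 1, 1, 1],
              [3, 3, 9, 9],
              [3, 3, 9, 9]]

routing = route_blocks(prev_frame, curr_frame)
```

Here only the bottom-right block counts as fresh. A practical router would need tolerance for noise and sub-block motion, which exact equality does not provide.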

Why it matters

Continuous-video agents — computer-use, screen recording, robotics — need to observe at 30 fps or higher. If the model re-ingests the entire scene on every decision, you cannot keep up. Reusing what already happened lets a 26B-class open-weights VLM perceive at 24–134 fps per follow-up turn, which is the throughput regime where 30 fps observation actually becomes tractable on a laptop.
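As a back-of-the-envelope check on that claim: a turn that ingests N frames keeps up with a live stream only if ingesting them takes no longer than the stream needs to produce them. The sketch below assumes perception is the only cost in a turn (it ignores decoding and everything else), and `turn_budget` is a hypothetical helper. The throughput figures plugged in are the 32-frame median and the low end of the range reported above.

```python
def turn_budget(frames_per_turn, perception_fps, stream_fps=30.0):
    """Compare time spent ingesting a turn with the time the stream allows.

    Returns (ingest_seconds, window_seconds, keeps_up).
    """
    ingest_seconds = frames_per_turn / perception_fps  # time to perceive the turn
    window_seconds = frames_per_turn / stream_fps      # time the stream takes to emit it
    return ingest_seconds, window_seconds, ingest_seconds <= window_seconds


# Median 32-frame throughput from the table (54.7 fps).
median_case = turn_budget(32, 54.7)

# Low end of the reported 32-frame range (23.85 fps).
low_case = turn_budget(32, 23.85)
```

Under these assumptions the median case fits inside a 30 fps window and the low end of the range does not, which is why the range matters as much as the median.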

The bigger picture: most video pipelines hand the model dense pixels every frame and ask it to rediscover what didn’t move. Almost everything in a real scene didn’t move. Anti-recomputation is just the cache-side answer to that waste — and it gets larger every time the input rate goes up.
