VLMaxxing through FrameMogging: training-free anti-recomputation for video VLMs

Stop paying twice to see the same pixels.

Video VLMs keep re-buying visual state that is already sitting in the stream. VLMaxxing asks a frozen model to reuse what survived, refresh what changed, and keep the expensive bits for when they matter: training-free, no measurable accuracy drift in the tested follow-up rows.

Routing-budget overlay across updated VLMaxxing clips. Orange is the per-frame fresh buy; static and shifted regions reuse prior visual state.

14.90–35.92×
Same-video follow-up speedup
adaptive repaired-cache breadth · n=93 · no measurable accuracy drift

54.68 fps
Warm 26B perception view
32-frame warm follow-up · 19/21 rows above 30 fps

0 / 93
Correctness drift
paired choice + correctness diffs · same-video follow-up breadth

What it actually does

  • Most frames are reruns. The factory wall did not move; the cache says so. VLMaxxing makes the runtime remember that.
  • After ingest, later questions on the same video reuse repaired visual state instead of rebuilding the expensive prefix.
  • On fresh videos, sparse vision helps when timed vision work is actually skipped. C-CEILING keeps the speedup honest.
  • For live streams, the target is perception rate: buy fresh evidence around change, motion, text, object boundaries, and sensor disagreement.
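The reuse-versus-refresh idea in the bullets above can be sketched as a per-tile routing pass. This is a minimal illustration, not the VLMaxxing implementation: the names `route_tiles` and `fresh_fraction`, the tile size, and the difference threshold are all assumptions made for the example, and it only separates static tiles from changed ones (detecting shifted regions would need a motion search that is omitted here).

```python
from typing import List

Frame = List[List[int]]  # grayscale frame as rows of pixel intensities


def tile_mean_abs_diff(prev: Frame, curr: Frame, top: int, left: int, size: int) -> float:
    """Mean absolute pixel difference inside one square tile."""
    total = 0
    count = 0
    for r in range(top, min(top + size, len(curr))):
        for c in range(left, min(left + size, len(curr[0]))):
            total += abs(curr[r][c] - prev[r][c])
            count += 1
    return total / count if count else 0.0


def route_tiles(prev: Frame, curr: Frame, tile: int = 2, threshold: float = 4.0) -> List[List[str]]:
    """Label each tile 'reuse' (keep cached state) or 'fresh' (buy new vision work).

    Hypothetical rule: a tile is fresh when its mean absolute difference
    against the previous frame exceeds `threshold`.
    """
    rows, cols = len(curr), len(curr[0])
    plan = []
    for top in range(0, rows, tile):
        plan_row = []
        for left in range(0, cols, tile):
            diff = tile_mean_abs_diff(prev, curr, top, left, tile)
            plan_row.append("fresh" if diff > threshold else "reuse")
        plan.append(plan_row)
    return plan


def fresh_fraction(plan: List[List[str]]) -> float:
    """Share of tiles that need fresh vision work for this frame."""
    labels = [label for row in plan for label in row]
    return labels.count("fresh") / len(labels)


# Toy 4x4 frame: only the bottom-right 2x2 tile changes between frames.
prev_frame = [[10] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
for r in (2, 3):
    for c in (2, 3):
        curr_frame[r][c] = 200

plan = route_tiles(prev_frame, curr_frame)
print(plan)                  # [['reuse', 'reuse'], ['reuse', 'fresh']]
print(fresh_fraction(plan))  # 0.25
```

In the toy frame, one tile out of four is marked fresh, so three quarters of the frame keeps its cached state. A real router would presumably add signals such as text changes and object boundaries on top of raw pixel difference.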

The Numbers Worth Clicking

| Lane | Teaser number | What it means | Guardrail |
| --- | --- | --- | --- |
| C-PERSIST | 14.90–35.92× | later questions reuse repaired same-video state | n=93 · 0 paired choice/correctness drift |
| C-STREAM target | 54.68 fps | warm 32-frame follow-up perception on a 26B-class stack | 19/21 rows >30 fps · first question pays once |
| C-VISION | 1.316× | fresh-video vision work skipped before the first answer | Gemma 32f short · n=20 · 0 drift/parse failures |
| C-CEILING | stage share | a local win only counts where wall-clock was actually skipped | separate first-query, follow-up, streaming, and routing rows |

Same numbers, separate workloads: first query, later questions on the same video, warm streaming views, and routing overlays each have their own denominator.
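One way to read the C-CEILING guardrail is as standard Amdahl-style arithmetic: a stage-local speedup only moves end-to-end time in proportion to that stage's share of wall-clock. The sketch below is an assumption about how to reason about it, not the project's accounting code; `end_to_end_speedup`, `ceiling`, and the example numbers are illustrative and do not come from the reported results.

```python
def end_to_end_speedup(stage_share: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when only one stage gets faster.

    stage_share   -- fraction of baseline wall-clock spent in the stage (0..1)
    local_speedup -- how much faster that stage alone becomes (> 0)
    """
    if not 0.0 <= stage_share <= 1.0:
        raise ValueError("stage_share must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - stage_share) + stage_share / local_speedup)


def ceiling(stage_share: float) -> float:
    """Upper bound on overall speedup if the stage were skipped entirely."""
    if not 0.0 <= stage_share <= 1.0:
        raise ValueError("stage_share must be in [0, 1]")
    if stage_share == 1.0:
        return float("inf")
    return 1.0 / (1.0 - stage_share)


# Illustrative numbers only: vision is half of first-query wall-clock
# and sparse routing makes that half twice as fast.
print(round(end_to_end_speedup(0.5, 2.0), 4))  # 1.3333
print(ceiling(0.5))                            # 2.0
```

Under these made-up numbers, a 2× local win on half the pipeline yields about 1.33× end to end, and no amount of vision skipping could exceed 2×. That is why each workload needs its own denominator.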

Watch the Cache Spend

Orange shows where the runtime buys fresh visual evidence. The pattern is the point: moving regions get new spend, while stable and shifted structure keeps its cached state.

TOMATO 0298
High-reuse clip: the background stays paid for; orange marks the fresh motion budget.
VideoMME 267
Boundary clip: more orange appears where the evidence actually moves.
VideoMME 380
Camera-pan clip: shifted blocks dominate, while stable structure remains reusable.

Why it matters

Continuous-video agents — computer-use, screen recording, robotics, games, driving — need to notice the world changing while most of it does not. In the 26B scale-out bundle, warm follow-up turns hit 54.68 fps at the median on 32-frame rows, with 19/21 rows above 30 fps.

Streaming is the big target because the input rate keeps rising. The future interface is not just more frames; it is a state stream that carries what stayed put, what shifted, what surprised the codec, and which tiles deserve fresh vision now.

Where reuse gets interesting

Factories + cameras
Stable walls, counters, and workspaces should stay cached while entrants and contacts refresh.
Screen + UI agents
Exact-copy regions, glyph changes, cursor motion, and scroll events need different freshness rules.
Robotics + VLA
Refresh the gripper, object, contact boundary, and goal zone; reuse the boring workspace.
Driving + drones
Pose, flow, borders, parallax, and occlusion decide where same-position reuse stops being safe.
Games + HUD streams
Stable HUD anchors and repeated interaction states are obvious cache candidates.
Sensor-fusion streams
Depth, IMU, events, object tracks, timestamps, and confidence can say where RGB is worth buying.

Toward VLM-native codecs

Today’s codecs already know about motion, residual surprise, stable blocks, and prediction failure. VLMaxxing asks what happens when that structure becomes model state instead of being flattened back into RGB.

The next codec for models should speak in active tiles, object state, sensor time, uncertainty, text events, and cheap invalidation signals. More pixels is the boring interface; fresh evidence is the interesting one.

Receipts