Decomposing Reasoning Efficiency

A workshop version of the efficiency accounting idea

Core question

When two models solve similar numbers of reasoning tasks, which one spends fewer generated tokens per solved instance?

A short route and a looping route arrive at the same basin while spending visibly different amounts of generated text.

Read the visual

Short route

Useful tokens move directly toward solved work.

Looping route

Extra text can circulate without buying correctness.

Budget funnel

Efficiency asks how much solved work each generated token buys.

Method visual

How to read it

Budget accounting

Efficiency asks how much solved work each generated token buys, not whether the visible trace merely looks short.

A stream of blue token beads passes through a funnel while useful green beads reach a basin and grey beads loop away.

Efficiency view

Solved work divided by generated text.

The metric is not a moral preference for short answers. It is a way to ask when additional reasoning text buys reliability, and when it only burns inference budget.

Correctness

Whether the final answer is valid under the benchmark extractor.

Generated length

The visible output-token budget spent by the model.

Efficiency

Correct answers produced per generated token, compared across models on a shared token budget.
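The three quantities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Attempt` record, the function names, and the per-1,000-token normalization are hypothetical choices made here for readability, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One model attempt on one task (hypothetical record layout)."""
    correct: bool          # final answer accepted by the benchmark extractor
    generated_tokens: int  # visible output tokens the model produced


def efficiency_per_1k(attempts):
    """Correct answers per 1,000 generated tokens (assumed normalization)."""
    total_tokens = sum(a.generated_tokens for a in attempts)
    if total_tokens == 0:
        return 0.0  # nothing generated, so nothing was bought
    solved = sum(1 for a in attempts if a.correct)
    return 1000.0 * solved / total_tokens


def tokens_per_solved(attempts):
    """Generated tokens spent per solved instance (the inverse view)."""
    solved = sum(1 for a in attempts if a.correct)
    if solved == 0:
        return float("inf")  # budget was spent but nothing was solved
    return sum(a.generated_tokens for a in attempts) / solved
```

Note that tokens from failed attempts stay in the denominator in this sketch, so text that circulates without reaching a correct answer lowers the score. Whether the original accounting treats failed attempts this way is an assumption.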

A direct green route and a long grey looping route start from the same basin and end at the same target while separate token streams show budget use.

Equivalent success can hide very different routes through the generated-token budget.

Main insight

Accuracy can tie while the budget story diverges.

Efficiency rankings can diverge from plain accuracy rankings.

Token budget matters most when models deliberate, self-correct, or loop.

Controlled tasks make it possible to ask where extra tokens buy accuracy and where they are wasted.
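A toy comparison can make the divergence between the two rankings concrete. The numbers below are invented purely for illustration and are not results from the paper; the model names and the `(correct, generated_tokens)` tuple layout are likewise hypothetical.

```python
def accuracy(results):
    """Fraction of attempts whose final answer was correct."""
    return sum(c for c, _ in results) / len(results)


def correct_per_1k_tokens(results):
    """Correct answers per 1,000 generated tokens (assumed normalization)."""
    total_tokens = sum(t for _, t in results)
    return 1000.0 * sum(c for c, _ in results) / total_tokens


# Made-up (correct, generated_tokens) pairs, one per task attempt.
runs = {
    "model_a": [(1, 900), (1, 1100), (1, 1000), (0, 1000)],  # 3 of 4 correct, 4000 tokens
    "model_b": [(1, 300), (1, 200), (0, 250), (0, 250)],     # 2 of 4 correct, 1000 tokens
}

# Rank the same runs twice: once by accuracy, once by efficiency.
by_accuracy = sorted(runs, key=lambda m: accuracy(runs[m]), reverse=True)
by_efficiency = sorted(runs, key=lambda m: correct_per_1k_tokens(runs[m]), reverse=True)

print("by accuracy:  ", by_accuracy)
print("by efficiency:", by_efficiency)
```

In this invented data the longer-winded model wins on accuracy while the terser one wins on correct answers per token, which is the kind of ranking flip the efficiency view is meant to surface.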

Diagnostic visual

A short green path and a long grey looping path reach the same blue target basin over a shared contour field.

What it separates

Same score, different route

Two systems can land on similar accuracy while one burns budget through loops and detours.

Takeaway

This workshop paper introduced the first compact version of the reasoning-efficiency framing: evaluate reasoning not only by correctness, but by how much generated output budget is spent to obtain it.
