Research project

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

An accounting protocol for diagnosing why reasoning is inefficient

Core question

When a reasoning model is inefficient, is it failing by truncation, by wrong completed answers, or by spending too much text relative to the work?

Poster-style diagnostic map where one black aggregate efficiency landscape decomposes into blue, red, orange, and green regimes.
One aggregate efficiency landscape separates into distinct regimes that call for different repairs.

01

Completion layer

Did the model finish inside the context and output budget?

02

Logic layer

Given completion, was the answer actually correct?

03

Overhead layer

How much extra text was spent relative to the work?
Method visual

How to read it

Layered accounting

Completion, logic, overhead, and workload coupling are checked as separate layers before interpreting efficiency.

A dark aggregate basin peels into four colored contour layers that separate the causes of inefficient reasoning.
Decomposition

Accounting protocol

Split one bad number into its causes.

The page visual is a peeled reasoning landscape: each layer asks a different diagnostic question before the final efficiency score is interpreted.

r_ctx

Completion

Did the model finish within the stated output budget?

r_logic

Conditional correctness

Given clean completion, did the model answer correctly?

VO

Mean overhead

How many generated tokens were spent per unit of task-implied work?
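The three layers above can be sketched as one accounting pass over per-task records. This is a minimal illustration, not the project's exact definitions: the field names, the `work_units` proxy, and the treatment of truncated runs are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """One model response. Field names are illustrative assumptions."""
    completed: bool    # finished inside the context and output budget
    correct: bool      # completed answer matched the reference
    tokens: int        # generated-token count
    work_units: float  # task-implied work proxy (assumed), e.g. reference solution length

def layer_metrics(records):
    """Return (r_ctx, r_logic, v_o): completion rate, conditional
    correctness on completed runs, and mean overhead per unit of work."""
    n = len(records)
    done = [r for r in records if r.completed]
    r_ctx = len(done) / n
    r_logic = sum(r.correct for r in done) / len(done) if done else 0.0
    v_o = (sum(r.tokens / r.work_units for r in done) / len(done)
           if done else float("inf"))
    return r_ctx, r_logic, v_o
```

Computing the layers in this order keeps the later numbers interpretable: `r_logic` is conditioned on clean completion, and overhead is only measured where the model actually finished.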
A dark aggregate reasoning basin splitting into blue, red, orange, and grey diagnostic basins across peeled paper layers.
A single bad efficiency number separates into context, logic, overhead, and balanced regimes.

Diagnostic regimes

Same accuracy, different repair strategy.

Efficiency and mean overhead rankings transfer more strongly than accuracy across several reasoning benchmarks.

Many low-efficiency models are logic-limited, while others are context-limited by truncation or verbosity-limited by overhead.

CogniLoad shows that long sequential context is the strongest efficiency stressor in the suite.

Diagnostic visual
Four colored topographic islands with black observation dots are connected by one faint shared accuracy band.

What it separates

Repair regimes

Context-limited, logic-limited, verbose, and balanced systems call for different interventions.
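The four regimes can be read as a triage rule over the three layer metrics. A rough sketch, where the cutoff values are illustrative assumptions rather than the project's calibrated thresholds:

```python
def regime(r_ctx, r_logic, v_o,
           ctx_min=0.9, logic_min=0.8, overhead_max=3.0):
    """Map layer metrics to a repair regime.
    Threshold defaults are illustrative assumptions."""
    if r_ctx < ctx_min:
        return "context-limited"    # repair: larger budget or earlier stopping
    if r_logic < logic_min:
        return "logic-limited"      # repair: better reasoning, not shorter text
    if v_o > overhead_max:
        return "verbosity-limited"  # repair: trim overhead per unit of work
    return "balanced"
```

Checking the layers in this order matters: a truncated model can look logic-limited, and a verbose model can look context-limited, if the earlier layers are not ruled out first.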

Takeaway

Beyond Accuracy turns token efficiency from a scalar leaderboard into a diagnostic protocol. It works for open and closed models because the core layer needs only correctness, truncation status, and generated-token counts.
