CogniLoad

Separating intrinsic load, extraneous load, and length demand in LLM reasoning

Core question

Can we disentangle how much task length, intrinsic difficulty, and distractor ratio each contribute to success or failure in multi-step reasoning?

A reasoning trajectory crosses length pressure, distractor fields, and local difficulty ridges before reaching the final state.

Read the visual

Intrinsic load

Contour ridges mark local steps where entities, attributes, and conditions interact.

Extraneous load

Distractor fields are valid statements that do not belong to the queried state path.

Length demand

The path gets longer, so the relevant state has to survive more sequential updates.
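To make the three ideas concrete, here is a minimal toy sketch of a state-tracking puzzle. It is not CogniLoad's actual puzzle format: the `Statement` class, the `final_state` helper, and the example names are all hypothetical, and real puzzles also involve conditions that this toy leaves out. It only shows how distractor statements stay valid without touching the queried state, and how a longer sequence means more updates to replay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Hypothetical toy statement: one attribute update for one entity."""
    entity: str
    attribute: str
    value: str


def final_state(statements, entity, attribute):
    """Replay statements in order and return the last value set for the query."""
    state = {}
    for s in statements:
        state[(s.entity, s.attribute)] = s.value
    return state.get((entity, attribute))


# Illustrative sequence: two statements on the queried path, two distractors.
puzzle = [
    Statement("Ana", "hat", "red"),      # on the queried path
    Statement("Ben", "hat", "green"),    # distractor: valid, but about Ben
    Statement("Ana", "hat", "blue"),     # later update overrides the first
    Statement("Ben", "shoes", "black"),  # distractor
]

answer = final_state(puzzle, "Ana", "hat")
relevant = [s for s in puzzle if s.entity == "Ana"]
print(answer, len(relevant), len(puzzle))
```

In this toy, adding statements raises length demand, adding statements about other entities raises extraneous load, and making each update depend on more entities and attributes would raise intrinsic load.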

Method visual

How to read it

Independent load controls

CogniLoad keeps sequence length, local difficulty, and distractor ratio separable before asking for the final state.

Three independent experimental controls feed a green state path through blue ridges, grey distractor particles, and a final basin.

Load dimensions

Factorial grid

Difficulty is a coordinate, not one number.

CogniLoad samples puzzles across three independent cognitive-load dimensions, then asks whether the final queried state can still be recovered.

Intrinsic cognitive load

Intrinsic difficulty

How many entities, attributes, and conditions interact inside each reasoning step.

d in {1, 3, 5, 7, 10}

Extraneous cognitive load

Distractor density

How much irrelevant but valid state information surrounds the queried path. Lower rho means denser distractors; intermediate rho creates the hardest filtering regime.

rho in {5, 10, 25, 50, 75, 90, 95}

Length / germane-demand proxy

Task length

How many sequential statements must be processed before the answer can be recovered.

N in {20, 50, 100, 250}
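The three level sets above can be crossed into a grid. The sketch below assumes every combination of levels is used as one cell, which the page implies but does not spell out; the variable names are illustrative.

```python
from itertools import product

# Level sets as listed above.
D_LEVELS = (1, 3, 5, 7, 10)                # intrinsic difficulty d
RHO_LEVELS = (5, 10, 25, 50, 75, 90, 95)   # distractor ratio rho
N_LEVELS = (20, 50, 100, 250)              # task length N

# Assumption: a full factorial design, one cell per (d, rho, N) combination.
grid = [
    {"d": d, "rho": rho, "N": n}
    for d, rho, n in product(D_LEVELS, RHO_LEVELS, N_LEVELS)
]

print(len(grid))  # 5 * 7 * 4 = 140 cells
```

Each cell is a coordinate, so a result can be reported per cell rather than as one blended number.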

What it reveals

A benchmark score becomes a failure fingerprint.

Blue sequential reasoning path crossing contour ridges and distractor particle clouds before falling into a final answer basin.

The same aggregate score can be read as a fingerprint of length pressure, local difficulty, and distractor interference.

Task length is the dominant stressor: moving from short to long sequences causes the largest degradation.

Distractor ratio produces a characteristic U-shaped response, with the hardest regime at intermediate ratios, because filtering and state updating pull in opposite directions.

Under high load, many failures are state-tracking mistakes rather than formatting errors or unsolvable instances.
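One way to read such a fingerprint is to average success along one dimension at a time. The sketch below is an assumption about how that could be done, not the benchmark's own analysis code; `marginal_success` is a hypothetical helper, and the records are made-up placeholders, not reported results.

```python
from collections import defaultdict


def marginal_success(records, dimension):
    """Success rate per level of one load dimension, pooled over the others."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        level = r[dimension]
        totals[level] += 1
        hits[level] += int(r["correct"])
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Placeholder records for illustration only; these are NOT benchmark results.
records = [
    {"d": 1, "rho": 50, "N": 20, "correct": True},
    {"d": 1, "rho": 50, "N": 250, "correct": True},
    {"d": 5, "rho": 50, "N": 20, "correct": True},
    {"d": 5, "rho": 50, "N": 250, "correct": False},
]

by_length = marginal_success(records, "N")
by_difficulty = marginal_success(records, "d")
print(by_length)
print(by_difficulty)
```

Comparing the three marginals for the same model shows which dimension drives its failures, which a single aggregate score hides.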

Diagnostic visual

One reasoning origin branches into three load regimes with blue true paths, green robust paths, and red wrong-state attractors.

What it separates

Failure fingerprints

A wrong answer can be traced to pressure from long context, local constraint ridges, or distractor interference.

Takeaway

CogniLoad is a synthetic natural-language reasoning benchmark grounded in cognitive load theory. It turns benchmark difficulty into three independently controlled load dimensions instead of one blended score.
