The Extraction Bottleneck

Large language models fail at topological grid puzzles — even frontier models solve fewer than a quarter of hard instances. The natural assumption is that topology is hard and LLMs can’t reason about it. TopoBench shows the assumption is wrong at the joint.

Analysis of 750 annotated reasoning traces identifies four failure patterns: premature commitment, constraint forgetting, repeated reasoning, and extraction difficulties. The critical finding: the bottleneck lies in extracting constraints from spatial representations, not in reasoning over them. When constraints are provided explicitly — parsed out of the grid and handed to the model as structured text — reasoning performance improves dramatically. The model can manipulate topological relationships once it has them. It cannot read them off a grid.
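To make the distinction concrete, here is a minimal sketch of what "parsing constraints out of the grid" could mean. The grid format and the constraint schema are illustrative assumptions, not TopoBench's actual representation: the idea is that a few lines of deterministic code do the perception step, and the model only ever sees the resulting structured facts.

```python
# Hypothetical sketch: convert a character grid into explicit adjacency
# facts before prompting, instead of passing the raw grid to the model.
# The grid encoding (one label per cell, "." for empty) is an assumption.

def extract_constraints(grid):
    """Parse a character grid into explicit adjacency facts.

    Returns tuples like ("adjacent", "A", "B") for every pair of distinct
    labels occupying horizontally or vertically neighboring cells.
    """
    constraints = set()
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            a = grid[r][c]
            if a == ".":  # empty cell carries no label
                continue
            # Check right and down neighbors; each pair is seen once.
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    b = grid[nr][nc]
                    if b != "." and b != a:
                        constraints.add(("adjacent", *sorted((a, b))))
    return sorted(constraints)

grid = [
    "AAB",
    "ACB",
    "CCB",
]
print(extract_constraints(grid))
# → [('adjacent', 'A', 'B'), ('adjacent', 'A', 'C'), ('adjacent', 'B', 'C')]
```

Handing the model the three tuples instead of the nine-character grid is exactly the intervention described above: the topological content is unchanged, but the spatial decoding step has been done for it.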

This is a perception failure, not a reasoning failure. The grid is a spatial encoding of logical constraints. Converting that encoding into the constraints it represents requires something closer to vision than to inference — pattern extraction from a structured spatial layout. LLMs process text sequentially; grids encode information in two-dimensional spatial relationships. The mismatch isn’t between the model and topology; it’s between the model and grids.

The through-claim: measuring “reasoning ability” with tasks that combine perception and reasoning tells you about whichever component is weaker. TopoBench was designed to test topological reasoning but actually tests spatial constraint extraction, because extraction fails first. The benchmark measures the bottleneck, not the target. This is general: any evaluation that bundles input parsing with the capability it claims to measure will systematically underestimate the capability by conflating it with the parsing step. The fix — tool-based constraint verification, adjusted grid formats — works precisely because it separates what the model can do from what it can’t, rather than grading the combination.
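The tool-based verification mentioned as a fix can be sketched the same way. Nothing below comes from TopoBench itself; the constraint type and checker interface are hypothetical. The point is that a deterministic checker, not the model, grades candidate solutions against the extracted constraints, so the model's reasoning is evaluated in isolation from its ability to read the grid.

```python
# Hypothetical sketch of tool-based constraint verification: the model
# proposes an assignment, and a deterministic checker reports violations.
# The "distinct" constraint type is an illustrative assumption.

def verify(assignment, constraints):
    """Return the list of violated constraints (empty means valid)."""
    violations = []
    for kind, x, y in constraints:
        if kind == "distinct" and assignment.get(x) == assignment.get(y):
            violations.append((kind, x, y))
    return violations

constraints = [("distinct", "A", "B"), ("distinct", "B", "C")]

print(verify({"A": 1, "B": 2, "C": 1}, constraints))
# → []
print(verify({"A": 1, "B": 1, "C": 2}, constraints))
# → [('distinct', 'A', 'B')]
```

A checker like this can also be put in the loop during generation, returning violations as feedback, which targets the constraint-forgetting failure pattern directly.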
