The Locked Inquiry

An agent that asks bad questions gets bad evidence. Bad evidence produces bad beliefs. Bad beliefs produce bad questions. The loop is self-reinforcing: ignorance about what to ask creates ignorance about what to believe, which creates ignorance about what to ask. Once locked, the agent cannot learn its way out because learning requires the information the lock prevents it from acquiring.

Zou et al. decompose active reasoning — the process of gathering information through sequential queries — into two components: action selection (choosing which questions to ask) and belief tracking (updating understanding from answers). Standard reinforcement learning with outcome-based rewards trains both simultaneously. The problem is that weakness in either component starves the other. A good question-asker with poor belief tracking can’t use the answers it receives. A good belief tracker with poor question selection never gets informative answers to track.
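The mutual starvation can be seen in a toy guessing game. The sketch below is an illustration, not the setup used by Zou et al.: the function `remaining_candidates`, its parameters, and the halving-question game are all assumptions made for this example. An agent tries to identify a hidden item with yes/no questions, and each component can be switched between a good and a poor version.

```python
def remaining_candidates(target, n_items, n_queries, good_questions, good_tracking):
    """Return how many hypotheses the agent still considers possible.

    Hypothetical toy model: `good_questions` stands in for action selection,
    `good_tracking` for belief tracking.
    """
    candidates = set(range(n_items))
    for _ in range(n_queries):
        ordered = sorted(candidates)
        if good_questions:
            # Informative query: "is the target in this half of my candidates?"
            query = set(ordered[: len(ordered) // 2])
        else:
            # Uninformative query: asks about nothing, so the answer is always "no".
            query = set()
        answer = target in query
        if good_tracking:
            # Keep only the hypotheses consistent with the answer.
            candidates = candidates & query if answer else candidates - query
        # A poor tracker discards the answer, so its belief never changes.
    return len(candidates)


if __name__ == "__main__":
    for good_questions in (True, False):
        for good_tracking in (True, False):
            left = remaining_candidates(11, 16, 4, good_questions, good_tracking)
            print(f"questions={good_questions!s:5} tracking={good_tracking!s:5} -> {left} left")
```

With 16 items and 4 questions, only the combination of good questions and good tracking narrows the field to a single candidate. Each of the other three combinations ends with all 16 still in play: a strong component paired with a weak one does no better than two weak ones.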

The resulting self-locking is not reward hacking or mode collapse. The agent isn’t gaming the objective or converging to a degenerate policy. It’s genuinely trying to improve but can’t access the training signal it needs because the training signal depends on the behavior it hasn’t learned yet. The improvement requires the improvement.

Their fix, injecting directional critiques that tell the agent roughly where to look, breaks the lock from outside. The critique doesn’t solve the problem; it provides enough information to escape the low-information basin. They report up to 60% improvement across 7 benchmarks, from a nudge rather than a redesign.
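To make the lock and the nudge concrete, here is a deliberately crude dynamical sketch. Everything in it is an assumption chosen for illustration: the two scalar skills, the product-shaped learning signal, the update rule, and the `critique` floor are not taken from the paper and say nothing about its actual training procedure or numbers.

```python
def train(question_skill, tracking_skill, steps, lr=0.5, critique=0.0):
    """Toy dynamics (hypothetical): both skills live in [0, 1].

    The learning signal is the product of the two skills, so each skill
    can only improve to the extent that the other already works.
    `critique` is an assumed external hint that puts a floor under the
    quality of the questions actually asked during training.
    """
    for _ in range(steps):
        effective_questions = max(question_skill, critique)
        signal = effective_questions * tracking_skill
        question_skill += lr * signal * (1.0 - question_skill)
        tracking_skill += lr * signal * (1.0 - tracking_skill)
    return question_skill, tracking_skill


if __name__ == "__main__":
    locked = train(0.01, 0.01, steps=100)
    nudged = train(0.01, 0.01, steps=100, critique=0.5)
    print("no critique:  ", locked)
    print("with critique:", nudged)
```

In this toy, an agent that starts with both skills at 0.01 is still below 0.05 on both after 100 steps, because the signal it learns from is the product of two tiny numbers. With the critique floor, the same agent ends above 0.9 on both. The hint never sets either skill directly; it only makes the early episodes informative enough for ordinary learning to take over.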

The through-claim: some learning failures are not optimization failures but information-access failures. The agent’s loss landscape may be perfectly well-shaped, the gradients perfectly informative — but the agent never reaches the region where those gradients exist because reaching it requires the skill the gradients would teach. The lock is not in the objective but in the coupling between exploration and competence. Breaking it requires external information injection — not more training, but different training.

