The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity


Recent generations of frontier language models have introduced Large Reasoning Models
(LRMs) that generate detailed thinking processes before providing answers. While these models
demonstrate improved performance on reasoning benchmarks, their fundamental capabilities,
scaling properties, and limitations remain insufficiently understood. Current evaluations primarily
focus on established mathematical and coding benchmarks, emphasizing final answer accuracy.
However, this evaluation paradigm often suffers from data contamination and does not provide insights
into the reasoning traces’ structure and quality. In this work, we systematically investigate these
gaps with the help of controllable puzzle environments that allow precise manipulation of
compositional complexity while maintaining consistent logical structures. This setup enables the analysis
of not only final answers but also the internal reasoning traces, offering insights into how LRMs
“think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs
face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a
counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then
declines despite having an adequate token budget. By comparing LRMs with their standard LLM
counterparts under equivalent inference compute, we identify three performance regimes: (1)
low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity
tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks
where both models experience complete collapse. We found that LRMs have limitations in exact
computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We
also investigate the reasoning traces in more depth, studying the patterns of explored solutions
and analyzing the models’ computational behavior, shedding light on their strengths, limitations,
and ultimately raising crucial questions about their true reasoning capabilities.

*Equal contribution.
†Work done during an internship at Apple.