
Apple’s latest AI research paper, “The Illusion of Thinking,” has been making waves for its blunt conclusion: even the most advanced Large Reasoning Models (LRMs) collapse on complex tasks. But not everyone agrees with that framing.
Today, Alex Lawsen, a researcher at Open Philanthropy, published a detailed rebuttal arguing that many of Apple’s most headline-grabbing findings boil down to experimental design flaws, not fundamental reasoning limits. The paper also credits Anthropic’s Claude Opus model as its co-author.
The rebuttal: Less “illusion of thinking,” more “illusion of evaluation”
Lawsen’s critique, aptly titled “The Illusion of the Illusion of Thinking,” doesn’t deny that today’s LRMs struggle with complex planning puzzles. But he argues that Apple’s paper confuses practical output constraints and flawed evaluation setups with actual reasoning failure.
Here are the three main issues Lawsen raises:
- Token budget limits were ignored in Apple’s interpretation: At the point where Apple claims models “collapse” on Tower of Hanoi puzzles with 8+ disks, models like Claude were already bumping up against their token output ceilings. Lawsen points to real outputs where the models explicitly state: “The pattern continues, but I’ll stop here to save tokens.”
- Impossible puzzles were counted as failures: Apple’s River Crossing test reportedly included unsolvable puzzle instances (for example, 6+ actor/agent pairs with a boat capacity that mathematically can’t transport everyone across the river under the given constraints). Lawsen calls attention to the fact that models were penalized for recognizing that and refusing to solve them.
- Evaluation scripts didn’t distinguish between reasoning failure and output truncation: Apple used automated pipelines that judged models solely on complete, enumerated move lists, even in cases where the task would exceed the token limit. Lawsen argues that this rigid evaluation unfairly classified partial or strategic outputs as total failures.
Alternative testing: Let the model write code instead
To back up his point, Lawsen reran a subset of the Tower of Hanoi tests using a different format: asking models to generate a recursive Lua function that prints the solution instead of exhaustively listing every move.
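Lawsen’s exact prompts and outputs aren’t reproduced here, but a solver of the kind he describes can be remarkably short. The sketch below is an illustrative example rather than his actual code: a minimal recursive Lua function that prints every move for an n-disk puzzle. Since an n-disk solution requires 2^n − 1 moves, a 15-disk puzzle takes 32,767 printed moves but only a few lines of code to describe.

```lua
-- Illustrative sketch, not Lawsen's actual code: a recursive
-- Tower of Hanoi solver that prints the full move sequence.
local function hanoi(n, from, to, via)
  if n == 0 then return end
  hanoi(n - 1, from, via, to)  -- clear the top n-1 disks onto the spare peg
  print(string.format("move disk %d from %s to %s", n, from, to))
  hanoi(n - 1, via, to, from)  -- stack them back on top of the moved disk
end

hanoi(15, "A", "C", "B")  -- prints all 2^15 - 1 = 32,767 moves
```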
The result? Models like Claude, Gemini, and OpenAI’s o3 had no trouble producing algorithmically correct solutions for 15-disk Hanoi problems, far beyond the complexity where Apple reported zero success.
Lawsen’s conclusion: When you remove artificial output constraints, LRMs seem perfectly capable of reasoning about high-complexity tasks. At least in terms of algorithm generation.
Why this debate matters
At first glance, this might sound like typical AI research nitpicking. But the stakes here are bigger than that. The Apple paper has been widely cited as proof that today’s LLMs fundamentally lack scalable reasoning ability, which, as I argued here, might not have been the fairest way to frame the study in the first place.
Lawsen’s rebuttal suggests the truth may be more nuanced: yes, LLMs struggle with long-form token enumeration under current deployment constraints, but their reasoning engines may not be as brittle as the original paper implies, or rather, as many said it implied.
Of course, none of this lets LRMs off the hook. Even Lawsen acknowledges that true algorithmic generalization remains a challenge, and his re-tests are still preliminary. He also lays out suggestions for what future work on the subject might want to focus on:
- Design evaluations that distinguish between reasoning capability and output constraints
- Verify puzzle solvability before evaluating model performance
- Use complexity metrics that reflect computational difficulty, not just solution length
- Consider multiple solution representations to separate algorithmic understanding from execution
The question isn’t whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing.
In other words, his core point is clear: before we declare reasoning dead on arrival, it may be worth double-checking the standards by which it’s being measured.
H/T: Fabrício Carraro.