Apple: The Illusion of Thinking

Jeremy Weaver · June 18, 2025

Apple’s new study shows that Large Reasoning Models excel only up to a point—then abruptly collapse—revealing surprising limits in algorithmic rigor and problem-solving stamina.


Why Apple Rebuilt the Benchmark

Standard math or coding tests often mask how AI “thinks.” Apple’s paper, “*[The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity](https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf)*,” swaps those datasets for controllable puzzle environments (e.g., Tower of Hanoi, River Crossing). By dialing complexity step-by-step, researchers exposed how Large Reasoning Models (LRMs) and baseline Large Language Models (LLMs) truly scale—or fail.
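
To make that complexity dial concrete, here is a minimal Python sketch (not the paper’s evaluation harness) of why Tower of Hanoi suits controlled scaling: the optimal solution length is 2^N − 1 for N disks, so each extra disk roughly doubles the reasoning required. The function name and signature are illustrative assumptions.

```python
# Illustrative sketch: dialing puzzle complexity with a single knob (disk count).
# Not Apple's harness; it only shows why Tower of Hanoi scales so predictably.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # move the top n-1 disks out of the way
        + [(src, dst)]                       # move the largest remaining disk
        + hanoi_moves(n - 1, aux, src, dst)  # restack the n-1 disks on top of it
    )

for n in range(1, 11):
    moves = hanoi_moves(n)
    # Optimal solution length is 2**n - 1, so the required "reasoning budget"
    # grows exponentially as the complexity dial is turned up.
    assert len(moves) == 2**n - 1
    print(f"{n} disks -> {len(moves)} moves")
```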

Three Distinct Performance Regimes

1. Low Complexity – Classic LLMs do just fine, sometimes besting LRMs.
2. Medium Complexity – LRMs pull ahead, showcasing richer chain-of-thought and better success rates.
3. High Complexity – Both model types crash; LRMs hit an abrupt accuracy wall despite ample token budgets.

This cliff suggests that today’s reasoning architectures can’t simply “think harder” once complexity passes a certain threshold.

The Counter-Intuitive Token Curve

One striking finding: as complexity rises, LRMs initially spend more tokens reasoning. Yet right before failure, token usage drops. Instead of grinding longer, the models appear to give up early, producing shorter—but still wrong—traces. Token allotment wasn’t the bottleneck; algorithmic precision was.

Overthinking, Underperforming

For easy puzzles, LRMs often land on a correct solution quickly—then continue exploring incorrect branches, a behavior the authors dub overthinking. At moderate levels, correct solutions emerge later in the chain-of-thought after false starts. Past the complexity cliff, no correct path appears at all.

Algorithms Don’t Save Them

Providing LRMs with the explicit Tower-of-Hanoi algorithm failed to prevent collapse. The models struggled to execute exact step sequences, underscoring a gap between verbal reasoning and deterministic computation.
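
The contrast the paper draws is with exact, deterministic execution, which controllable puzzles make easy to check. As a sketch of that idea, here is a minimal verifier that either accepts or rejects a proposed Tower-of-Hanoi move sequence; the function name and peg representation are illustrative assumptions, not the paper’s code.

```python
# Hedged sketch: an exact simulator/verifier for Tower of Hanoi move sequences.
# A proposed solution either executes legally to the goal state or it doesn't;
# there is no partial credit for plausible-sounding verbal reasoning.

def verify_hanoi(moves: list[tuple[str, str]], n: int) -> bool:
    """Return True if `moves` legally transfers n disks from peg A to peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # list ends are peg tops
    for src, dst in moves:
        if not pegs[src]:
            return False                       # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # illegal: larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks ended up on the goal peg

# A correct 3-disk solution passes; cutting one step makes it fail.
solution = [("A","C"), ("A","B"), ("C","B"), ("A","C"), ("B","A"), ("B","C"), ("A","C")]
print(verify_hanoi(solution, 3))       # True
print(verify_hanoi(solution[:-1], 3))  # False: one step short of the goal
```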

Rethinking Benchmarks and Trust

Apple’s work raises uncomfortable questions:
  • Benchmark Contamination – Popular datasets may leak training examples, inflating perceived capability.
  • Trace Insight – Without analyzing intermediate “thoughts,” we can’t see where models go off the rails.
  • Safety Implications – If models silently falter on complex logic, deploying them in high-stakes domains becomes risky.

Takeaways for Builders and Educators

  • Match Task to Complexity – Use LRMs for medium-complex workflows; validate exhaustively on harder problems.
  • Inspect Thought Traces – Logging reasoning paths helps detect overthinking and silent failure modes; a toy heuristic is sketched after this list.
  • Blend AI with Human Oversight – Platforms like [ibl.ai’s AI Mentor](https://ibl.ai/product/mentor-ai-higher-ed) can teach users to spot when automated reasoning veers off course, combining human judgment with machine speed.
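
As one hypothetical illustration of trace inspection, the sketch below measures how much of a logged chain-of-thought comes after the correct answer first appears, which is the signature of the overthinking behavior described earlier. The function name, inputs, and string-matching approach are assumptions for illustration; the paper analyzes traces with exact puzzle simulators, not text search.

```python
# Hypothetical heuristic (not the paper's method): flag "overthinking" when the
# known-correct answer appears early in a logged trace but reasoning continues.

def overthinking_ratio(trace: str, correct_answer: str) -> float | None:
    """Fraction of the trace produced AFTER the correct answer first appears.

    Returns None if the correct answer never shows up in the trace
    (the silent-failure case past the complexity cliff).
    """
    idx = trace.find(correct_answer)
    if idx == -1:
        return None
    first_hit_end = idx + len(correct_answer)
    return 1.0 - first_hit_end / len(trace)

# Example: the answer appears early, then exploration of wrong branches continues.
trace = "Try 12... no. It must be 27. But wait, maybe 31? Check 14... back to 27."
print(overthinking_ratio(trace, "27"))  # high ratio -> likely overthinking
print(overthinking_ratio(trace, "99"))  # None -> correct answer never reached
```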

Final Thoughts

“*The Illusion of Thinking*” reminds us that polished chain-of-thought does not equal genuine mastery. While LRMs push boundaries at moderate complexity, they still stumble on the very tasks that demand flawless logic. As the AI field races ahead, nuanced evaluation—anchored in problem complexity—will be crucial to separate true breakthroughs from mere illusions.