Apple: The Illusion of Thinking
Apple’s new study shows that Large Reasoning Models excel only up to a point—then abruptly collapse—revealing surprising limits in algorithmic rigor and problem-solving stamina.
Why Apple Rebuilt the Benchmark
Standard math or coding tests often mask how AI “thinks.” Apple’s paper, “*[The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity](https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf)*,” swaps those datasets for controllable puzzle environments (e.g., Tower of Hanoi, River Crossing). By dialing complexity up step by step, researchers exposed how Large Reasoning Models (LRMs) and baseline Large Language Models (LLMs) truly scale, and where they fail.
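To make the setup concrete, here is a minimal sketch of what a controllable Tower-of-Hanoi environment can look like, assuming complexity is a single dial (the number of disks) and answers are graded as explicit move sequences. The names and structure are illustrative, not Apple's actual harness.

```python
# Minimal sketch of a controllable puzzle environment in the spirit of the
# paper's setup. Complexity is one dial (number of disks), and a proposed
# move sequence can be checked exactly, step by step.
# Illustrative code only; not Apple's evaluation harness.

def hanoi_env(n_disks):
    """Initial state: all disks on peg 0, largest at the bottom."""
    return [list(range(n_disks, 0, -1)), [], []]

def apply_moves(state, moves):
    """Apply (src, dst) moves; return (final_pegs, index of first illegal move or None)."""
    pegs = [list(p) for p in state]
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:                       # moving from an empty peg
            return pegs, i
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:  # larger disk placed on a smaller one
            return pegs, i
        pegs[dst].append(pegs[src].pop())
    return pegs, None

def is_solved(pegs, n_disks):
    """Solved when every disk ends up on the last peg, largest at the bottom."""
    return pegs[2] == list(range(n_disks, 0, -1))

# Dial complexity up one disk at a time and grade each answer exactly.
for n in range(3, 11):
    candidate_moves = []  # would come from the model's parsed output
    final, first_bad = apply_moves(hanoi_env(n), candidate_moves)
    print(n, "solved" if first_bad is None and is_solved(final, n) else "failed")
```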
Three Distinct Performance Regimes

1. Low Complexity – Classic LLMs do just fine, sometimes besting LRMs.
2. Medium Complexity – LRMs pull ahead, showcasing richer chain-of-thought and better success rates.
3. High Complexity – Both model types crash; LRMs hit an abrupt accuracy wall despite ample token budgets.

This cliff suggests that today’s reasoning architectures can’t simply “think harder” once complexity passes a certain threshold.
The Counter-Intuitive Token Curve
One striking finding: as complexity rises, LRMs initially spend more tokens reasoning. Yet right before failure, token usage drops. Instead of grinding longer, the models appear to give up early, producing shorter, but still wrong, traces. Token allotment wasn’t the bottleneck; algorithmic precision was.
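Assuming each run's reasoning trace is logged next to its outcome, a small aggregation like the sketch below is enough to surface that curve; the record fields and whitespace token counting are stand-ins, not the paper's tooling.

```python
# Sketch of observing the token curve from logged runs. Whitespace splitting
# stands in for a real tokenizer; the record fields are illustrative.
from collections import defaultdict
from statistics import mean

runs = [
    # e.g. {"complexity": 4, "trace": "...chain of thought...", "correct": True}
]

tokens_by_level = defaultdict(list)
correct_by_level = defaultdict(list)
for run in runs:
    tokens_by_level[run["complexity"]].append(len(run["trace"].split()))
    correct_by_level[run["complexity"]].append(run["correct"])

for level in sorted(tokens_by_level):
    print(level,
          round(mean(tokens_by_level[level])),       # mean reasoning tokens
          round(mean(correct_by_level[level]), 2))   # success rate
# The reported pattern: mean tokens climb with complexity, then shrink right
# before accuracy collapses, even though the token budget is still available.
```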
Overthinking, Underperforming
For easy puzzles, LRMs often land on a correct solution quickly, then continue exploring incorrect branches, a behavior the authors dub overthinking. At moderate levels, correct solutions emerge later in the chain-of-thought after false starts. Past the complexity cliff, no correct path appears at all.
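One way to quantify overthinking is to note how far into the chain-of-thought the first correct candidate appears; the sketch below uses a placeholder parser, since the actual trace format and extraction step are not shown here.

```python
# Sketch of locating the first correct solution inside a reasoning trace.
# `extract_candidate_solutions` is a placeholder for whatever parsing the
# traces actually require; it is not from the paper.

def extract_candidate_solutions(trace):
    """Yield (position_fraction, move_list) pairs found in the trace."""
    return []  # e.g. parsed from intermediate "let me try..." segments

def first_correct_position(trace, check_solution):
    """How far into the trace (0.0-1.0) the first correct answer appears, or None."""
    for position, moves in extract_candidate_solutions(trace):
        if check_solution(moves):
            return position
    return None

# Easy puzzles: early correct answers followed by wasted exploration.
# Medium puzzles: correct answers surface late, after false starts.
# Past the cliff: first_correct_position returns None for every trace.
```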
Algorithms Don’t Save Them
Providing LRMs with the explicit Tower-of-Hanoi algorithm failed to prevent collapse. The models struggled to execute exact step sequences, underscoring a gap between verbal reasoning and deterministic computation.
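For scale, the textbook recursion is only a few lines, yet following it literally for n disks means emitting 2^n - 1 exact moves; this is the standard algorithm, not necessarily the verbatim prompt used in the study.

```python
# The standard recursive Tower of Hanoi procedure: a short, deterministic
# recipe whose faithful execution grows exponentially with disk count.

def hanoi_moves(n, src=0, dst=2, aux=1):
    """Return the optimal move list for n disks as (src, dst) peg pairs."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, aux, dst)     # park n-1 disks on the spare peg
            + [(src, dst)]                        # move the largest disk
            + hanoi_moves(n - 1, aux, dst, src))  # restack the n-1 disks on top

assert len(hanoi_moves(10)) == 2**10 - 1  # 1,023 exact steps, no room for drift
```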
Rethinking Benchmarks and Trust
Apple’s work raises uncomfortable questions:
- Benchmark Contamination – Popular datasets may leak training examples, inflating perceived capability.
- Trace Insight – Without analyzing intermediate “thoughts,” we can’t see where models go off the rails.
- Safety Implications – If models silently falter on complex logic, deploying them in high-stakes domains becomes risky.
Takeaways for Builders and Educators
- Match Task to Complexity – Use LRMs for medium-complexity workflows; validate exhaustively on harder problems.
- Inspect Thought Traces – Logging reasoning paths helps detect overthinking and silent failure modes; a minimal logging sketch follows this list.
- Blend AI with Human Oversight – Platforms like [ibl.ai’s AI Mentor](https://ibl.ai/product/mentor-ai-higher-ed) can teach users to spot when automated reasoning veers off course, combining human judgment with machine speed.
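A minimal version of that trace-inspection habit can look like the sketch below, where `call_model`, the validator, and the thresholds are illustrative assumptions rather than any particular vendor's API.

```python
# Sketch of logging reasoning traces alongside answers and flagging the two
# failure signatures discussed above: a trace that collapses to almost
# nothing, or a confident answer that fails an exact check.
import json
import time

def run_with_trace_audit(call_model, prompt, validate, min_trace_tokens=200):
    reasoning, answer = call_model(prompt)  # assumed to return (trace, final answer)
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "trace_tokens": len(reasoning.split()),
        "answer": answer,
        "passed_check": bool(validate(answer)),
    }
    record["suspect"] = (record["trace_tokens"] < min_trace_tokens
                         or not record["passed_check"])
    with open("reasoning_audit.jsonl", "a") as f:  # append-only audit log
        f.write(json.dumps(record) + "\n")
    return answer, record
```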
Final Thoughts
“*The Illusion of Thinking*” reminds us that polished chain-of-thought does not equal genuine mastery. While LRMs push boundaries at moderate complexity, they still stumble on the very tasks that demand flawless logic. As the AI field races ahead, nuanced evaluation, anchored in problem complexity, will be crucial to separate true breakthroughs from mere illusions.