When AI ‘Thinking’ Is Just Storytelling at Scale

Welp.
WWDC 2025 has concluded, and Apple finally pulled back the curtain on its long-teased AI push. Sort of.

In place of a fully upgraded Siri, we got a lot of “Apple Intelligence” demos: safe, sandboxed features that hardly anyone will use, plus the promise that Siri updates are “coming next year.” Conveniently, just before the event, Apple quietly released a technical paper: The Illusion of Thinking.

The paper is deeply revealing, though you’d never hear it from the keynote stage.

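Apple’s researchers built controlled puzzle environments to see how today’s reasoning models actually handle problem-solving as complexity scales. Notably, the models under test weren’t Apple’s own; they were frontier systems from other labs, including Claude 3.7 Sonnet Thinking and DeepSeek-R1. The researchers used puzzles like Tower of Hanoi, River Crossing, and Blocks World. Because the difficulty of these puzzles can be dialed up one notch at a time (one more disk, one more block), the setup shows exactly where the models’ behavior changes as problems get just a little harder.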
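To make “controlled puzzle environment” concrete, here is a minimal Python sketch of the idea for Tower of Hanoi. It is not the paper’s actual harness; the function names (solve_hanoi, is_valid_solution) and the peg numbering are my own assumptions for illustration. The point is that difficulty is a single knob (the number of disks), and any proposed answer can be checked mechanically by replaying its moves.

```python
from typing import List, Tuple

# Hypothetical move format: (from_peg, to_peg), pegs numbered 0, 1, 2.
Move = Tuple[int, int]


def solve_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Return the optimal move list for n disks using the classic recursion."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, dst, aux)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk to dst
        + solve_hanoi(n - 1, aux, src, dst)  # bring the smaller disks onto it
    )


def is_valid_solution(n: int, moves: List[Move]) -> bool:
    """Replay moves on a simulated board; reject illegal moves or unfinished states."""
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk at the bottom of peg 0
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False
        if not pegs[src]:
            return False  # nothing to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


if __name__ == "__main__":
    # One more disk roughly doubles the length of the shortest solution.
    for n in range(1, 11):
        moves = solve_hanoi(n)
        assert is_valid_solution(n, moves)
        print(f"{n} disks -> {len(moves)} moves")
```

The shortest solution for n disks takes 2^n - 1 moves, so each added disk roughly doubles the work. In a setup like this, a model’s answer would be parsed into a move list and passed to the checker, which is what lets you pinpoint the disk count at which accuracy falls apart.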

TL;DR?
At first, the models (called LRMs, short for Large Reasoning Models) spend more tokens and more reasoning steps as tasks get more complex. But past a certain point, they start doing the opposite: they think less, even though they still have token budget left to think more. The models abandon useful reasoning midway. They confuse themselves. And eventually, they collapse completely.

More complex puzzles? The models think less, not more.

It’s AI slop, just with more words.

This is what Apple politely calls “an illusion of thinking.” Chain-of-thought prompting makes the outputs look thoughtful, verbose, step-by-step, self-correcting, but that’s often surface polish. Underneath, the models remain pattern matchers. They narrate plausible-sounding failures.

And this is the same class of AI people dream about using for high-stakes decision-making: healthcare, law, even military risk assessment.

We want AI to conduct evaluations and decide who’s going to die?
I’m dubious.

That said, we should take Apple’s paper with a grain of salt too. It frames the limits of reasoning models in terms that suit Apple’s own narrative. Ask Anthropic or OpenAI, and they’d likely argue that scaling, fine-tuning, and data diversity can push these ceilings higher. The debate isn’t settled.

What this paper does reveal, intentionally or not, is why Apple’s Siri upgrade wasn’t center stage yet. The ceiling isn’t just a compute problem. It’s a reasoning problem. And for now, Apple seems content to admit it only through technical papers that 99.9% of users will never read.