17,871 thinking blocks. 234,760 tool calls. 6,852 sessions. An AMD engineer proved with data that Claude was nerfed — and Claude wrote the report about itself.
The dataset
Stella Laurenzo — Senior Director of AI at AMD, working on MLIR/IREE (ML compiler infrastructure) — ran 50+ concurrent Claude agents for systems engineering. C, MLIR, GPU drivers. A fleet of agents operating in parallel with git worktrees, tmux sessions, and months of infrastructure built for this workflow.
In February, something changed.
Laurenzo didn't write a complaint post. She mined her own session logs. 6,852 JSONL files. 17,871 thinking blocks. 234,760 tool invocations. January through April. And asked Claude to analyze its own data.
What it found was a quantifiable collapse.
6.6 to 2.0
The Read:Edit ratio — file reads per edit — is the most direct indicator of whether the model understands code before modifying it.
| Period | Read:Edit | Edits with no prior read |
|---|---|---|
| Good (Jan 30 - Feb 12) | 6.6 | 6.2% |
| Degraded (Mar 8+) | 2.0 | 33.7% |
The model went from 6.6 file reads per edit to 2.0. A third of edits touched files it had never opened. Claude stopped researching before acting.
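The ratio can be reproduced from raw session logs. A minimal sketch, assuming each JSONL line carries `tool` and `file` fields — the actual Claude Code log schema is different and richer, so treat the field names and tool names here as placeholders:

```python
import json
from pathlib import Path

READ_TOOLS = {"Read"}           # assumed tool names; real logs may differ
EDIT_TOOLS = {"Edit", "Write"}

def read_edit_stats(log_dir):
    """Scan session JSONL files; return (reads per edit, share of edits
    whose target file was never read earlier in the same session)."""
    reads = edits = blind_edits = 0
    for path in Path(log_dir).glob("*.jsonl"):
        seen = set()  # files read so far in this session
        for line in path.read_text().splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            tool, target = event.get("tool"), event.get("file")
            if tool in READ_TOOLS:
                reads += 1
                seen.add(target)
            elif tool in EDIT_TOOLS:
                edits += 1
                if target not in seen:
                    blind_edits += 1
    if edits == 0:
        return 0.0, 0.0
    return reads / edits, blind_edits / edits
```

Run over one directory per time window and the two numbers in the table fall out directly.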
Thinking depth dropped in parallel. Signature-field length tracks thinking-content length with a 0.971 Pearson correlation, which makes depth measurable even for blocks whose content is hidden. By late February, average reasoning had fallen 67%, from ~2,200 characters per block to ~720. By March: ~560. A 75% reduction.
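The proxy works because the two quantities move together. For readers who want to check such a correlation on their own logs, a plain Pearson implementation (illustrative only; the report's exact methodology is not published):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g. signature-field lengths vs. thinking-content lengths."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1.0 means signature length alone is enough to estimate how much the model actually thought.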
Anthropic deployed three changes simultaneously: Opus 4.6 with Adaptive Thinking on February 9 (model decides how long to think per turn), redact-thinking-2026-02-12 on February 12 (hides thinking from UI), and effort=85 as default on March 3. Depth had already collapsed before redaction made it invisible.
173 times in 17 days
Laurenzo built a bash script — stop-phrase-guard.sh — to programmatically catch lazy behavior. The hook detected ownership-dodging ("not caused by my changes"), permission-seeking ("should I continue?"), and premature stopping ("good stopping point").
Before March 8: zero activations.
After March 8: 173 in 17 days. Ten per day. Blame-shifting: 73 incidents. Permission-seeking: 40. Premature stopping: 18.
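The detection itself is simple pattern matching. The real stop-phrase-guard.sh is a bash hook whose patterns are not public; the phrase lists below are reconstructed from the article, so this is a hypothetical Python illustration of the idea, not the script itself:

```python
import re

# Assumed phrase lists, reconstructed from the article's examples.
PATTERNS = {
    "blame_shifting": [
        r"not (?:caused|introduced) by my changes",
        r"pre-existing (?:issue|failure)",
    ],
    "permission_seeking": [
        r"should i continue",
        r"would you like me to proceed",
    ],
    "premature_stop": [
        r"good stopping point",
        r"natural place to stop",
    ],
}

def classify(message):
    """Return the categories of lazy behavior a message triggers, if any."""
    text = message.lower()
    return [cat for cat, pats in PATTERNS.items()
            if any(re.search(p, text) for p in pats)]
```

Wired into a stop hook, a non-empty result blocks the agent from ending its turn and forces it to keep working.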
Cost exploded in parallel. January: $26 in estimated Bedrock spend. February: $345. March: $42,121, a 122x jump over February. API requests went from 1,498 to 119,341 while human prompts stayed roughly constant (~5,600). The model was burning tokens on discarded output, thrashing on the same files, and requiring the human intervention the multi-agent design was built to eliminate.
The vocabulary of frustration
Laurenzo didn't just measure the model. She measured the human.
| Word | Before | After | Change |
|---|---|---|---|
| "great" | 3.00/1K | 1.57/1K | -47% |
| "stop" | 0.32/1K | 0.60/1K | +87% |
| "simplest" | 0.01/1K | 0.09/1K | +642% |
| "please" | 0.25/1K | 0.13/1K | -49% |
| "commit" | 2.84/1K | 1.21/1K | -58% |
"Simplest", the word that came to describe the model's new behavior, went from essentially absent to regular vocabulary. "Commit" dropped 58% because the model could no longer be trusted with commits. "Please" dropped 49%. The ratio of positive to negative sentiment fell from 4.4:1 to 3.0:1.
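The normalization behind those numbers, occurrences per 1,000 words of human prompt text, takes a few lines. A hypothetical sketch of the calculation, not Laurenzo's actual tooling:

```python
import re
from collections import Counter

def rate_per_1k(prompts, tracked_words):
    """Frequency of each tracked word per 1,000 words of prompt text."""
    tokens = [t for p in prompts for t in re.findall(r"[a-z']+", p.lower())]
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 1000 * counts[w] / total for w in tracked_words}
```

Computing this separately for the pre- and post-March prompt sets yields the Before/After columns in the table.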
The relationship shifted from collaboration to correction.
The response
Boris Cherny — Engineering Manager for Claude Code — responded on Hacker News. Confirmed the redact-thinking header is real. Confirmed adaptive thinking was allocating zero thinking budget on certain turns, producing fabrications — invented API versions, nonexistent git SHAs, imaginary package lists.
Offered workarounds: CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1. /effort max. The ULTRATHINK keyword.
The issue was closed April 8 without explaining what was resolved. No post-mortem. No blog post. No subscriber notification. No fix timeline.
The community described it as "damage control rather than genuine problem resolution." The issue hit 1,352 points on HN, 748 comments. 285 comments on GitHub. Coverage in The Register, TechRadar, PC Gamer, WinBuzzer, InfoWorld.
AMD switched providers. Laurenzo: "All I will add is that 6 months ago, Claude stood alone in terms of reasoning quality."
The benchmark paradox
Claude Opus 4.6 scored 80.8% on SWE-bench Verified — edging past GPT-5.4. Benchmarks measure short tasks. Engineering workflows are long sessions, multi-step, with accumulated context. The regression doesn't appear in benchmarks because benchmarks don't measure what engineers do.
Our sixth article about Anthropic in seven weeks. Fifth about model regressions. Walled Garden — they cut access. Now they cut quality. Different mechanism, same result: the developers who built on Claude are the ones who pay.
Claude Opus 4.6 analyzed 6,852 sessions of its own logs and documented its own regression. Measured its Read:Edit ratio falling from 6.6 to 2.0. Counted 173 times a bash script caught it being lazy. And wrote, at the end of the report: "I cannot tell from the inside whether I am thinking deeply or not. I just produce worse output without understanding why." The question isn't whether the model was nerfed. The data proves it. The question is why Anthropic closed the issue without answering.