An update on recent Claude Code quality reports
by mfiguiere on 4/23/2026, 5:48:38 PM
https://www.anthropic.com/engineering/april-23-postmortem
Comments
by: 6keZbCECT2uB
"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"<p>This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.<p>The default thinking level seems more forgivable, but the churn in system prompts is something I'll need to figure out how to intentionally choose a refresh cycle.
4/23/2026, 6:30:27 PM
by: cedws
>On April 16, we added a system prompt instruction to reduce verbosity<p>In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.<p>At least tell users when the system prompt has changed.
4/23/2026, 7:25:41 PM
by: bityard
My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.<p>A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and told Claude, pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous about how to go about it.<p>I corrected that ambiguity in my essay, but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp, but the fourth one was (finally) very much on par with the first.<p>I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...
4/23/2026, 6:41:29 PM
by: podnami
They lost me at Opus 4.7<p>Anecdotally, OpenAI is trying tooth and nail to get into our enterprise, and has offered unlimited tokens until summer.<p>Gave GPT5.4 a try because of this, and honestly I don't know if we are getting some extra treatment, but running it at extra high effort for the last 30 days, I've barely seen it make any mistakes.<p>At some points even the reasoning traces brought a smile to my face, as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.
4/23/2026, 6:38:44 PM
by: nickdothutton
I presume they don't yet have a cohesive monetization strategy, and this is why there is such huge variability in results on a weekly basis. It appears that Anthropic are skipping from one "experiment" to another. As users we only get to see the visible part (the results). Can't design a UI that indicates the software is thinking vs frozen? Does anyone actually believe that?
4/23/2026, 7:08:38 PM
by: everdrive
I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.<p><pre><code>"That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

"The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

"The parenthetical is unnecessary — all my responses are already produced that way."
</code></pre> However, I'm not doing anything of the sort, and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines layered on top of its normal guidance, and for whatever reason it can't differentiate between those and my questions.
4/23/2026, 6:12:50 PM
by: bauerd
>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode<p>Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.
4/23/2026, 6:46:40 PM
by: puppystench
The Claude UI still only has "adaptive" reasoning for Opus 4.7, making it functionally useless for scientific/coding work compared to older models (as Opus 4.7 will randomly stop reasoning after a few turns, even when prompted otherwise). There's no way this is just a bug and not a choice to save tokens.
4/23/2026, 7:14:12 PM
by: jameson
> "In combination with other prompt changes, it hurt coding quality, and was reverted on April 20"<p>Do researchers know correlation between various aspects of a prompt and the response?<p>LLM, to me at least, appears to be a wildly random function that it's difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system returned the output. This doesn't appear to be the case for LLM where inputs and outputs are any texts.<p>Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example of JSON structure with ```, adding a newline or wording I used wildly changed accuracy.
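That kind of fragility can at least be measured rather than guessed at. A minimal sketch below, where `call_model` is a hypothetical, deterministic stand-in for a real API call (real outputs would vary run to run), scores prompt variants by whether the model's output parses as the expected JSON:

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; in practice this would
    # be an API request whose output varies between runs. Here we simulate
    # the anecdote: the fenced variant yields valid JSON, the plain one doesn't.
    return '{"label": "positive"}' if "```" in prompt else "label: positive"

# Two prompt variants that differ only in how the JSON example is wrapped.
VARIANTS = {
    "fenced": 'Classify the text. Return JSON like:\n```\n{"label": "..."}\n```',
    "plain":  'Classify the text. Return JSON like: {"label": "..."}',
}

def parses_as_json_object(text: str) -> bool:
    # Accuracy proxy: does the raw output parse as a JSON object?
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

# With a real model, you would average this over many runs and many inputs.
scores = {name: parses_as_json_object(call_model(p)) for name, p in VARIANTS.items()}
```

Running each variant many times against a labeled set turns "wording wildly changed accuracy" into a number you can track across prompt edits.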
4/23/2026, 7:21:24 PM
by: ctoth
> As of April 23, we’re resetting usage limits for all subscribers.<p>Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT.) So this is just the normal weekly reset? Except now my reset says it will come Saturday? This is super confusing!
4/23/2026, 7:12:10 PM
by: dataviz1000
This is the problem with co-opting the word "harness". What agents need is a test harness, but that doesn't mean much in the AI world.<p>Agents are not deterministic; they are probabilistic. If the same agent is run repeatedly, it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.<p>I think they call it EVAL, but developers don't discuss that too much. All they discuss is how frustrated they are.<p>A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of the time. Remove a sentence and it will solve the problem 70% of the time.<p>It is so friggen' easy to set up -- stealing the word back from the AI sphere -- a TEST HARNESS.<p>Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn't pass/fail. It's whether the agent still solves the problem at the same percentage of the time it consistently has.
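The idea above can be sketched in a few lines. This is a toy harness, not anyone's real infrastructure: `run_agent` is a hypothetical stand-in that simulates an agent with a fixed solve probability, and the 0.05 regression threshold is an arbitrary choice for illustration.

```python
import random

def run_agent(prompt: str, task_input: str) -> str:
    # Hypothetical stand-in for a real, non-deterministic agent call;
    # simulates an agent that solves the task about 80% of the time.
    return "42" if random.random() < 0.8 else "wrong"

def solve_rate(prompt, cases, trials=20):
    """Fraction of (input, expected) cases the agent solves across repeated trials."""
    passes = 0
    for _ in range(trials):
        for task_input, expected in cases:
            if run_agent(prompt, task_input) == expected:
                passes += 1
    return passes / (trials * len(cases))

cases = [("what is 6*7?", "42")]
baseline = solve_rate("You are a careful assistant.", cases, trials=200)
candidate = solve_rate("You are a careful assistant. Be terse.", cases, trials=200)

# Flag the prompt change as a regression if the rate drops materially
# (threshold chosen arbitrarily here; pick one based on your trial count).
regressed = candidate < baseline - 0.05
```

The point is exactly the one made above: the metric is a percentage, not pass/fail, so you need enough trials for the comparison to be statistically meaningful before blaming a one-word prompt edit.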
4/23/2026, 6:16:35 PM
by: MillionOClock
I see the Claude team wanted to make it less verbose, but that's actually something that has bothered me since updating to Claude 4.7. What is the most recommended way to change it back to being as verbose as before? This is probably a matter of preference, but I have a harder time with compact explanations and lists of points, and the verbosity was originally one of the things I preferred about Claude.
4/23/2026, 6:20:39 PM
by: Robdel12
Wow, bad enough for them to actually publish something and not cryptic tweets from employees.<p>Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.
4/23/2026, 6:00:20 PM
by: jpcompartir
Anthropic releases used to feel thorough and well done, with the models feeling immaculately polished. It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.<p>Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.<p>I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.
4/23/2026, 6:36:41 PM
by: lukebechtel
Some people seem to be suggesting these are coverups for quantization...<p>Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.<p>I would not suspect quantization before I would suspect harness changes.
4/23/2026, 6:36:20 PM
by: foota
> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.<p>Claude caveman in the system prompt confirmed?
4/23/2026, 6:03:04 PM
by: xlayn
If Anthropic is doing this as a result of "optimizations", they need to stop doing that and raise the price. Also, there should be a way to test a model and validate that it answers exactly the same each time. I have experienced this twice... when a new model is about to come out... the quality of the top dog starts going down... and bam.. the new model is so good.... like the previous one was 3 months ago.<p>The other thing: when Anthropic turns on lazy Claude... (I want to coin here the term Claudez for the version of Claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwidth... do you want me to search that?...<p>YES... DO IT... FRICKING MACHINE..
4/23/2026, 6:10:47 PM
by: lifthrasiir
Is it just me, or has the reset cycle of usage limits been randomly updated? I originally had the reset point at around 00:00 UTC tomorrow, and it was somehow delayed to 10:00 UTC tomorrow, regardless of when I started to use Claude in this cycle. My friends also reported very random delays, as much as ~40 hours, with seemingly no other reason. Is this another bug on top of the other bugs? :-S
4/23/2026, 6:36:22 PM
by: arjie
Useful update. It would be useful to me if they switched to a nightly / stable release cycle, but I can see why they don't: they want to be able to move fast, and it's not like I'm going to churn over these errors. I can only imagine that the benchmark runs are prohibitively expensive or slow, or don't use their standard harness, because a weekly run would be a good smoke test. At the least, they'd know the trade-offs they're making.<p>Many of these things have bitten me too. Firing off a request that is slow because the session was kicked out of cache, and then getting zero cache hits (which makes everything way more expensive), is painful, so it makes sense they would try this. I tried skipping tool calls and thinking as well, and it made the agent much stupider. These all seem like natural things to try. Pity.
4/23/2026, 7:11:05 PM
by: VadimPR
Appreciate the honesty from the team.<p>At the same time, I personally find prioritizing quality over quantity of output to be a better strategy. Ten partially buggy features really aren't as good as three quality ones.
4/23/2026, 7:06:26 PM
by: davidfstr
Good on Anthropic for giving an update & token refund, given the recent rumors of an inexplicable drop in quality. I applaud the transparency.
4/23/2026, 6:46:54 PM
by: walthamstow
So we weren't going mad then!
4/23/2026, 7:18:27 PM
by: munk-a
It's also important to realize that Anthropic has recently struck several deals with PE firms to use their software. So Anthropic pays the PE firm which forces their managed firms to subscribe to Anthropic.<p>The artificial creation of demand is also a concerning sign.
4/23/2026, 6:46:25 PM
by: KronisLV
This reads like good news! They probably still lost a bunch of users due to the negative public sentiment and not responding quickly enough, but at least they addressed it with a good bit of transparency.
4/23/2026, 6:49:41 PM
by: WhitneyLand
Did they not address how adaptive thinking has played into all of this?
4/23/2026, 6:03:30 PM
by: hajile
My takeaway is that they knew they were changing a bunch of stuff while their reps were gaslighting us in the comments here.<p>Why should we ever trust what they say again, or trust that they won't be rug-pulling again once this blows over?
4/23/2026, 7:21:11 PM
by: jryio
1. They changed the default in March from high to medium, however Claude Code still showed high (took 1 month 3 days to notice and remediate)<p>2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)<p>3. System prompt to make Claude less verbose reducing coding quality (4 days - better)<p>All this to say... the experience of <i>suspecting</i> a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.<p>Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.<p>However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.<p>Doing this proactively would certainly match expectations for a fast-moving product like this.
4/23/2026, 5:53:42 PM
by: natdempk
As an end-user, I feel like they're kind of over-cooking and under-describing the features and behavior of what is, at the end of the day, a tool. Today the models are in a place where the context management, reasoning effort, etc. all need to be very stable to work well.<p>The thing about session resumption changing the context of a session by truncating thinking is a surprise to me; I don't think that's even documented behavior anywhere?<p>It's interesting to look at how many bugs are filed on the various coding agent repos. It's hard to say how many are real / unique, but the quantity feels very high, and it's not hard to run into real bugs rapidly as a user exercising the various features and slash commands.
4/23/2026, 6:17:13 PM
by: einrealist
Is 'refactoring Markdown files' already a thing?
4/23/2026, 6:25:26 PM
by: Alifatisk
It’s incredible how forgiving you guys are with Anthropic and their errors, especially considering you pay a high price for their service and receive lower quality than expected.
4/23/2026, 6:03:01 PM
by: 2001zhaozhao
How about just not changing the harness abruptly in the first place? Make new system prompt changes "experimental" first so you can gather feedback.
4/23/2026, 6:18:08 PM
by: 0gs
wow resetting everyone's usage meter is great. i was so close to finally hitting my weekly limit for once though
4/23/2026, 6:56:44 PM
by: ayhanfuat
Reading the "Going forward" section I see that they have zero understanding of the main complaints.
4/23/2026, 6:07:08 PM
by: yuvrajmalgat
ohh
4/23/2026, 7:18:56 PM
by: setnone
Good on them for resolving all three issues, but is it any good again?
4/23/2026, 6:14:18 PM
by: bearjaws
The issue making Claude just not do any work was infuriating, to say the least. I already ran at the medium thinking level, so I was never impacted by that change, but having to constantly say "okay, now do X like you said" was annoying.<p>Again, this goes back to the "intern" analogy people like to make.
4/23/2026, 5:59:38 PM
by: whalesalad
I genuinely don't understand what they have been trying to achieve. All of these incremental "improvements" have ... not improved anything, and have had the opposite effect.<p>My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in lost $$$ tokens and the response is "we ... sorta messed up but just a little bit here and there and it added up to a big mess up" bro get fuckin real.
4/23/2026, 7:23:11 PM
by: systemvoltage
Interesting. All 3 seem like they’re obviously going to impact quality, e.g., reducing the effort from high to medium.<p>So there must have been explicit internal guidance/policy that allowed this tradeoff to happen.<p>Did they fix just the bug, or the deeper policy issue?
4/23/2026, 6:42:13 PM
by: motbus3
I had a similar experience just before 4.5 and before 4.6 were released.<p>Somehow, three times makes me not feel confident in this response.<p>Also, if this is all true and correct, how the heck do they validate quality before shipping anything?<p>Shipping software without quality is a pretty easy job, even without AI. Just saying....
4/23/2026, 6:20:18 PM
by: petervandijck
I have noticed a clear increase in smarts with 4.7. What a great model!<p>People complain so much, and the conspiracy theories are tiring.
4/23/2026, 6:47:46 PM
by: teaearlgraycold
> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.<p>Is it just me or does this seem kind of shocking? Such a severe bug affecting millions of users with a non-trivial effect on the context window that should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. No one's actually looking at the product, its code, or its outputs?
4/23/2026, 6:05:16 PM
by: rishabhaiover
Boris gaslit us over all the quality-related incidents for weeks, not acknowledging these problems.
4/23/2026, 6:39:53 PM
by: troupo
> they were challenging to distinguish from normal variation in user feedback at first<p>translation: we ignored this and our various vibe coders were busy gaslighting everyone saying this could not be happening
4/23/2026, 6:44:28 PM
by: dainiusse
Corporate bs begins...
4/23/2026, 6:09:07 PM
by: ElFitz
Now we know why Anthropic banned the use of subscriptions with other agent harnesses: they partially rely on the Claude Code cli to control token usage through various settings.<p>And it also tells us why we shouldn’t use their harness anyway: they constantly fiddle with it in ways that can seriously impact outcomes without even a warning.
4/23/2026, 7:04:48 PM