Hacker News Viewer

The highest quality codebase

by Gricha on 12/8/2025, 9:33:09 PM

https://gricha.dev/blog/the-highest-quality-codebase

Comments

by: xnorswap

Claude is really good at specific analysis, but really terrible at open-ended problems.

"Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.

"Hey claude, anything I could do to improve Y?", and it'll struggle beyond the basics that a linter might suggest.

It enthusiastically suggested a library for <work domain> and was all "*Recommended*" about it, but when I pointed out that the library had been considered and rejected because <issue>, it understood and wrote up why that library suffered from that issue and why it was therefore unsuitable.

There's a significant blind spot in current LLMs related to blue-sky thinking and creative problem solving. They can do structured problems very well, and they can transform unstructured data very well, but they can't deal with unstructured problems very well.

That may well change, so I don't want to embed that thought too deeply into my own priors, because the LLM space seems to evolve rapidly. I wouldn't want to find myself blind to the progress because I wrote it off for a class of problems.

But right now, the best way to help an LLM is to have a deep understanding of the problem domain yourself, and just leverage it to do the grunt work that you'd find boring.

12/11/2025, 4:02:03 PM


by: postalcoder

One of my favorite personal evals for LLMs is testing their stability as a reviewer.

The basic gist of it is to give the LLM some code to review and have it assign a grade multiple times. How much variance is there in the grade?

Then, prompt the same LLM to be a "critical" reviewer with the same code multiple times. How much does that average critical grade change?

A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" is a *major* positive signal for quality.

I've found that gpt-5.1 produces remarkably stable evaluations whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical, whereas gpt-5.1 stays directionally the same while tightening the screws.

You could also interpret these results as a proxy for obsequiousness.

Edit: One major part of the eval I left out is "can an LLM converge on an 'A'?" Let's say the LLM gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a *new context*, does the grade go up? Furthermore, can it eventually give itself an A, and consistently?

It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test this on Gemini 3.
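Roughly, the eval loop looks something like the sketch below. The `gradeOnce` helper is a stand-in for whatever LLM call you actually use (it should return a numeric grade); nothing here is tied to a specific provider API.

    // Reviewer-stability eval: grade the same code N times, once with a plain
    // prompt and once with a "critical" prompt, then compare variance and mean shift.
    type Grader = (code: string, critical: boolean) => Promise<number>;

    async function stabilityEval(gradeOnce: Grader, code: string, runs = 10) {
      const collect = async (critical: boolean) => {
        const grades: number[] = [];
        for (let i = 0; i < runs; i++) grades.push(await gradeOnce(code, critical));
        const mean = grades.reduce((a, b) => a + b, 0) / grades.length;
        const variance =
          grades.reduce((a, b) => a + (b - mean) ** 2, 0) / grades.length;
        return { mean, variance };
      };

      const plain = await collect(false);
      const critical = await collect(true);
      return {
        plain,
        critical,
        // Low variance and a small plain-vs-critical delta are the positive signals.
        criticalDelta: Math.abs(plain.mean - critical.mean),
      };
    }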

12/11/2025, 4:08:08 PM


by: elzbardico

LLMs have a strong bias towards generating code, because writing code is the default behavior from pre-training.

Removing code, renaming files, condensing, and other edits are mostly post-training stuff, supervised-learning behavior. You have armies of developers across the world making 17 to 35 dollars an hour solving tasks step by step, which are then basically used to generate prompt/response pairs of desired behavior for a lot of common development situations, adding desired output for things like tool calling, which is needed for things like deleting code.

A typical human task in post-training dataset generation would involve a scenario like: given this Dockerfile for a Python application, when we try to run pytest it fails with exception foo not found. The human will notice that package foo is not installed, change the requirements.txt file and write this down, then try pip install, and notice that the foo package requires a certain native library to be installed. The final output of this will be a response with the appropriate tool calls in a structured format.

Given that the amount of unsupervised learning is way bigger than the amount spent on fine-tuning for most models, it is no surprise that in any ambiguous situation the model will default to what it knows best.

More post-training will usually improve this, but the quality of the human-generated dataset will probably be the upper bound on output quality, not to mention the risk of overfitting if the foundation-model labs embrace SFT too enthusiastically.
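Purely as an illustration, one such prompt/response pair might be shaped something like this. The field names, tool names, and version number are invented for the example; actual lab formats differ.

    // Illustrative shape of one supervised example: the pytest/Dockerfile scenario
    // above, with the desired structured tool calls as the target output.
    const trainingExample = {
      prompt:
        "Given this Dockerfile for a Python app, `pytest` fails with " +
        "'ModuleNotFoundError: No module named foo'. Fix the build.",
      targetResponse: {
        reasoning:
          "Package foo is not in requirements.txt, and it needs a native library at build time.",
        toolCalls: [
          { tool: "edit_file", args: { path: "requirements.txt", append: "foo==1.2.3" } },
          { tool: "edit_file", args: { path: "Dockerfile", insert: "RUN apt-get install -y libfoo-dev" } },
          { tool: "run_command", args: { command: "pip install -r requirements.txt && pytest" } },
        ],
      },
    };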

12/11/2025, 4:46:25 PM


by: f311a

I like to ask LLMs to find problems or improvements in 1-2 files. They are pretty good at finding bugs, but for general code improvements, 50-60% of the edits are trash. They add completely unnecessary stuff. If you ask them to improve pretty well-written code, they rarely say it's good enough already.

For example, in a functional-style codebase, they will try to rewrite everything to a class. I have to adjust the prompt to list things that I'm not interested in. And some inexperienced people are trying to write better code by learning from such changes from LLMs...

12/11/2025, 3:53:09 PM


by: kderbyma

Yeah. I noticed Claude suffers when it reaches context overload - it's too opinionated, so it shortens its own context with decisions I would not ever make, yet I see it telling itself that the shortcuts are a good idea because the project is complex... then it gets into a loop where it second-guesses its own decisions and forgets the context, and then continues to spiral uncontrollably into deeper and deeper failures - often missing the obvious glitch and instead looking into imaginary land for answers - constantly diverting the solution from patching to completely rewriting...

I think it suffers from performance anxiety...

----

The only solution I have found is to rewrite the prompt from scratch, change the context myself, clear any "history or memories", and then try again.

I have even gone so far as to open nested folders in separate windows to "lock in" scope better.

As soon as I see the agent say "Wait, that doesn't make sense, let me review the code again", it's cooked.

12/10/2025, 1:59:31 AM


by: iambateman

The point he's making - that LLMs aren't ready for broadly unsupervised software development - is well made.

It still requires an exhausting amount of thought and energy to make the LLM go in the direction I want, which is to say in a direction which considers the code that is outside the current context window.

I suspect that we will not solve the context window problem for a long time. But we will see tremendous growth in "on-demand tooling" for things which do fit into a context window and for which we can let the AI "do whatever it wants."

For me, my work product needs to conform to existing design standards, and I can't figure out how to get Claude to not just wire up its own button styles.

But it's remarkable how - despite all of the nonsense - these tools remain an irreplaceable part of my work life.

12/11/2025, 4:32:16 PM


by: mbesto

While there are justifiable comments here about how LLMs behave, I want to point out something else:

There is no consensus on what constitutes a high-quality codebase.

Said differently - even if you asked 200 humans to do this same exercise, you would get 200 different outputs.

12/11/2025, 4:48:34 PM


by: samuelknight

This is an interesting experiment that we can summarize as "I gave a smart model a bad objective", with the key result at the end:

"...oh and the app still works, there's no new features, and just a few new bugs."

Nobody thinks that doing 200 improvement passes on a functioning code base is a good idea. The prompt tells the model that it is a principal engineer, then contradicts that role with the imperative "We need to improve the quality of this codebase". Determining when code needs to be improved is a responsibility of the principal engineer, but the prompt doesn't tell the model that it can decide the code is good enough. I think we would see different behavior if the prompt were changed to "Inspect the codebase, determine if we can do anything to improve code quality, then immediately implement it." If the model is smart enough, this will increasingly result in passes where the agent decides there is nothing left to do.

In my experience with CC I get great results when I ask an open-ended question about a large module and instruct it to come back to me with suggestions. Claude generates 5-10 suggestions and ranks them by impact. It's very low-effort from the developer's perspective and it can generate some good ideas.

12/11/2025, 5:19:32 PM


by: m101

This is a great example of there being no intelligence under the hood.

12/11/2025, 3:44:23 PM


by: jedberg

You know how when you hear how many engineers are working on a product, you think to yourself, "but I could do that with like three people!"? Now you know why they have so many people. Because they did this with their codebase, but with humans.

Or I should say, they kept hiring humans who needed something to do, and those humans basically did what this AI did.

12/11/2025, 5:42:57 PM


by: hazmazlaz

Well of course it produced bad results... it was given a bad prompt. Imagine how things would have turned out if you had given the same instructions to a skilled but naive contractor who contractually couldn't say no and couldn't question you. Probably pretty similar.

12/11/2025, 4:25:53 PM


by: dcchuck

I spent some time last night "over iterating" on a plan to do some refactoring in a large codebase.

I created the original plan with a very specific ask - create an abstraction to remove some tight coupling. Small problem that had a big surface area. The planning/brainstorming was great and I like the plan we came up with.

I then tried to use a prompt like OP's to improve it (as I said, large surface area, so I wanted to review it) - "Please review PLAN_DOC.md - is it a comprehensive plan for this project?". I'd run it -> get feedback -> give it back to Claude to improve the plan.

I (naively perhaps) expected this process to converge to a "perfect plan". At this point I think of it more like a probability tree where there's a chance of improving the plan, but a non-zero chance of getting off the rails. And once you go off the rails, you only veer further and further from the truth.

There are certainly problems where "throwing compute" at it and continuing to iterate with an LLM will work great. I would expect those to have firm success criteria. Providing definitions of quality would significantly improve the output here as well (or decrease the probability of going off the rails, I suppose). Otherwise Claude will confuse quality like we see here.

Shout out to OP for sharing their work and moving us forward.

12/11/2025, 4:08:32 PM


by: failuremode

> We went from around 700 to a whooping 5369 tests

> Tons of tests got added, but some tests that mattered the most (maestro e2e tests that validated the app still works) were forgotten.

I've seen many LLM proponents cite the number of tests as a positive signal.

This smells, to me, like people who tout lines of code.

When you are counting tests in the thousands, I think it's a negative signal.

You should be writing property-based tests rather than 'assert x=1', 'assert x=2', 'assert x=-1' and on and on.

If LLMs are incapable of acknowledging that, then add it to the long list of 'failure modes'.
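For contrast, a single property-based test covers a whole input space instead of an enumerated pile of asserts. A rough sketch using the fast-check library; the photo-metadata functions here are hypothetical stand-ins for whatever round-trippable code the app actually has.

    import fc from "fast-check";

    // Hypothetical round-trippable functions standing in for real app code.
    type PhotoMeta = { width: number; height: number; tag: string };
    const serializePhotoMeta = (m: PhotoMeta): string => JSON.stringify(m);
    const parsePhotoMeta = (s: string): PhotoMeta => JSON.parse(s);

    // One property instead of hundreds of hand-written cases:
    // for any metadata value, parse(serialize(x)) must equal x.
    fc.assert(
      fc.property(
        fc.record({
          width: fc.integer({ min: 1, max: 10_000 }),
          height: fc.integer({ min: 1, max: 10_000 }),
          tag: fc.string(),
        }),
        (meta) => {
          const roundTripped = parsePhotoMeta(serializePhotoMeta(meta));
          return (
            roundTripped.width === meta.width &&
            roundTripped.height === meta.height &&
            roundTripped.tag === meta.tag
          );
        },
      ),
    );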

12/11/2025, 9:31:40 PM


by: thomassmith65

With a good programmer, if they do multiple passes of a refactor, each pass makes the code more elegant, and the next pass easier to understand and further improve.

Claude has a bias to add lines of code to a project, rather than make it more concise. Consequently, each refactoring pass becomes more difficult to untangle, and harder to improve.

Ideally, in this experiment, only the first few passes would result in changes - mostly shrinking the project size - and from then on, Claude would change nothing, just like a very good programmer.

This is the biggest problem with developing with Claude, by far. Anthropic should laser-focus on fixing it.

12/11/2025, 7:56:28 PM


by: torginus

I've heard a very apt criticism of the current batch of LLMs:

*LLMs are incapable of reducing entropy in a code base*

I've always had this nagging feeling, but I think this really captures the essence of it succinctly.

12/11/2025, 4:47:41 PM


by: maddmann

lol 5000 tests. Agentic code tools have a significant bias to add versus remove/condense. This leads to a lot of bloat and orphaned code. Definitely something that still needs to be solved for by agentic tools.

12/11/2025, 3:57:28 PM


by: written-beyond

> I like Rust's result-handling system, I don't think it works very well if you try to bring it to the entire ecosystem that already is standardized on error throwing.

I disagree; it's very useful even in languages whose conventions are built around exception throwing. It's good enough to be the return type of the Promise.allSettled API.

The problem is that when I don't have a Result type, I end up approximating it anyway in other ways. For a quick project I'd stick with exceptions, but depending on my codebase I usually use the Go-style (ok, err) tuple (it's usually clunkier in TS though) or a Rust-style Result ok/err enum.
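A minimal sketch of both shapes in TypeScript; the names and the port-parsing example are just illustrative.

    // Rust-style discriminated-union Result.
    type Result<T, E = Error> =
      | { ok: true; value: T }
      | { ok: false; error: E };

    function parsePort(raw: string): Result<number, string> {
      const n = Number(raw);
      return Number.isInteger(n) && n > 0 && n < 65536
        ? { ok: true, value: n }
        : { ok: false, error: `invalid port: ${raw}` };
    }

    // Go-style tuple: clunkier in TS because both slots are always present.
    function parsePortGo(raw: string): [number | null, string | null] {
      const r = parsePort(raw);
      return r.ok ? [r.value, null] : [null, r.error];
    }

    const r = parsePort("8080");
    if (r.ok) console.log(r.value); // type narrowing forces the caller to handle both arms
    else console.error(r.error);

The discriminated union gets you exhaustive handling via type narrowing, which is the part the Go-style tuple can't express as cleanly in TS.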

12/8/2025, 10:25:43 PM


by: ttul

Have you tried writing into the AGENTS.md something like, "Always be on the lookout for dead code, copy-pasta, and other opportunities to optimize and trim the codebase in a sensible way"?

In my experience, adding this kind of instruction to the context window causes SOTA coding models to actually undertake that kind of optimization while development carries on. You can also periodically chuck your entire codebase into Gemini-3 (with its massive context window) and ask it to write a refactoring plan; then pass that refactoring plan back into your day-to-day coding environment such as Cursor or Codex and have it take a few turns working through the plan.

As with human coders, if you let them run wild "improving" things without specifically instructing them to also pay attention to bloat, bloat is precisely what you will get.

12/11/2025, 5:56:44 PM


by: bulletsvshumans

I think the prompt is a major source of the issue. "We need to improve the quality of this codebase" implicitly indicates that there is something wrong with the codebase. I would be curious to see if it would reach a point of convergence with a prompt that allowed for it. Something like "Improve the quality of this codebase, or tell me that it is already in an optimal state."

12/11/2025, 4:37:27 PM


by: minimaxir

About a year ago I wrote a blog post (HN discussion: https://news.ycombinator.com/item?id=42584400) experimenting with whether asking Claude to "write code better" repeatedly would indeed cause it to write better code, as determined by speed, since better code implies more efficient algorithms. I found that it did indeed work (at n=5 iterations), but additionally providing a system prompt improved it further.

Given what I've seen from Claude 4.5 Opus, I suspect the following test would be interesting: attempt to have Claude Code + Haiku/Sonnet/Opus implement and benchmark an algorithm with:

- no CLAUDE.md file

- a basic CLAUDE.md file

- an overly nuanced CLAUDE.md file

And then test both the algorithm speed and the number of turns it takes to hit that algorithm speed.

12/11/2025, 5:23:33 PM


by: blobbers

I'm curious if anyone has written a "Principal Engineer" agents.md or CLAUDE.md style file that yields better results than the 'junior dev' results people are seeing here.

I've worked on writing some as a data scientist, and I have gotten the basic Claude output to be much better; it makes some saner decisions, it validates and circles back to fix fits, etc.

12/11/2025, 7:15:35 PM


by: layer8

This makes me wonder what the result would be of having an AI turn a code base into literate-programming style, and have it iterate on that to improve the “literacy”.

12/11/2025, 8:41:39 PM


by: tracker1

On the Result<TR, TE> responses... I've seen this a few times. I think it works well in Rust and other languages that don't have the ability to "throw" baked in. However, when you bolt it onto a language that can implicitly throw, you're now doing twice the work, since you have to handle the explicit error result *and* the built-in thrown errors.

I worked in a C# codebase with Result responses all over the place, and it just really complicated every use case all around. Combined with Promises (TS) it's worse still.
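A sketch of that double handling in TypeScript; the `fetchUser` function and endpoint are made up for illustration.

    type Result<T, E = Error> = { ok: true; value: T } | { ok: false; error: E };

    // Hypothetical API that returns a Result...
    async function fetchUser(id: string): Promise<Result<{ name: string }, string>> {
      const res = await fetch(`/api/users/${id}`); // ...but fetch itself can still throw
      if (!res.ok) return { ok: false, error: `HTTP ${res.status}` };
      return { ok: true, value: await res.json() };
    }

    // The caller handles errors twice: once for the explicit Result, and once
    // for anything the runtime can throw anyway (network failures, rejected
    // promises, bugs inside fetchUser).
    async function showUser(id: string) {
      try {
        const r = await fetchUser(id);
        if (!r.ok) return console.error(r.error);
        console.log(r.value.name);
      } catch (e) {
        console.error("threw despite the Result type:", e);
      }
    }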

12/11/2025, 5:17:05 PM


by: bikeshaving

https://github.com/Gricha/macro-photo/blob/highest-quality/lib/logger.ts

The logger library which Claude created is actually pretty simple, highly approachable code, with utilities for logging the timings of async code and the ability to emit automatic performance warnings.

I have been using LogTape (https://logtape.org) for JavaScript logging, and the inherited, category-focused logging with different sinks has been pretty great.

12/11/2025, 4:20:51 PM


by: barbazoo

> I can sort of respect that the dependency list is pretty small, but at the cost of very unmaintainable 20k+ lines of utilities. I guess it really wanted to avoid supply-chain attacks.

> Some of them are really unnecessary and could be replaced with off the shelf solution

Lots of people would regard this as a good thing. Surely the LLM can't guess which kind you are.

12/11/2025, 6:30:26 PM


by: surprisetalk

This reflects my experience with human programmers. So many devs are taught to add layers of complexity in pursuit of "best practices". I think the LLM was trained to behave this way.

In my experience, Claude can actually clean up a repo rather nicely if you ask it to (1) shrink source code size (LOC or total bytes), (2) reduce dependencies, and (3) maintain integration tests.

12/11/2025, 4:46:59 PM


by: Hammershaft

Impressive that the app still works! Did not expect that.

12/11/2025, 4:15:28 PM


by: Havoc

My current favorite improvement strategy is:

1) Run multiple code analysis tools over it and have the LLM aggregate the findings with suggestions

2) Ask the LLM an open-ended question to list potential improvements, and pick by hand which ones I want

And usually repeat the process with a completely different model (i.e. one trained by a different company).

Any more than that and yeah, they end up going in circles.

12/11/2025, 4:57:24 PM


by: Bombthecat

Story of AI:

For instance - it created a hasMinimalEntropy function meant to "detect obviously fake keys with low character variety". I don't know why.
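The article doesn't show the function, but a "character variety" check like the one described would presumably look something like this; a guess for illustration, not the actual generated code.

    // Guess at a "minimal entropy" check for API keys: reject keys made of
    // too few distinct characters (e.g. "aaaaaaaaaaaaaaaa").
    function hasMinimalEntropy(key: string, minDistinctChars = 8): boolean {
      return new Set(key).size >= minDistinctChars;
    }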

12/11/2025, 7:04:42 PM


by: maerF0x0

I would love to see someone do a longitudinal study of the incident/error rate of a canary container in prod that is managed by Claude. Basically a control/experimental group setup to see who does better, the humans or the AI.

12/11/2025, 5:30:45 PM


by: WhitneyLand

It can be difficult to explain to management why AI can seem to work coding miracles in certain scenarios, yet still isn't always going to speed up development 10x, especially on an established code base.

Tangible examples like this seem like a useful way to show some of the limitations.

12/11/2025, 4:39:12 PM


by: websiteapi

You gotta be strategic about it. For example, for tests, tell it to use equivalence testing and to prove it, e.g. create a graph of permutations of arguments and their equivalences from the underlying code, and then use that to generate the tests.

Telling it to "do better" without any feedback is obviously going to go nowhere fast.
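A rough sketch of the idea, with an invented discount function standing in for real code: enumerate argument permutations, bucket them by the equivalence key implied by the underlying rules, and assert each bucket is internally consistent instead of hand-writing one assert per case.

    // Invented example: items only matters via the >=10 threshold, so many
    // argument permutations are equivalent and should produce the same result.
    function discount(isMember: boolean, items: number, coupon: "NONE" | "SAVE10"): number {
      let pct = 0;
      if (isMember) pct += 5;
      if (items >= 10) pct += 5;
      if (coupon === "SAVE10") pct += 10;
      return pct;
    }

    // Enumerate permutations of the arguments.
    const cases: Array<[boolean, number, "NONE" | "SAVE10"]> = [];
    for (const isMember of [true, false])
      for (const items of [1, 9, 10, 50])
        for (const coupon of ["NONE", "SAVE10"] as const)
          cases.push([isMember, items, coupon]);

    // Bucket by the equivalence key derived from the rules.
    const buckets = new Map<string, number[]>();
    for (const [m, n, c] of cases) {
      const key = `${m}|${n >= 10}|${c}`;
      const results = buckets.get(key) ?? [];
      results.push(discount(m, n, c));
      buckets.set(key, results);
    }

    // Every member of an equivalence class must agree.
    for (const [key, results] of buckets) {
      console.assert(new Set(results).size === 1, `class ${key} is not equivalent`);
    }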

12/11/2025, 3:50:32 PM


by: fauigerzigerk

What would happen if you gave the same task to 200 human contractors?

I suspect SLOC growth wouldn't be quite as dramatic, but things like converting everything to Rust's error-handling approach could easily happen.

12/11/2025, 5:18:40 PM


by: elzbardico

Funniest part:

> ...oh and the app still works, there's no new features, and *just a few new bugs.*

12/11/2025, 4:20:21 PM


by: whalesalad

I would love to see an experiment like this done with an arena of principal-engineer agents. Give each of them a unique personality: this one likes shiny new objects and is willing to deal with early-adopter pain, this one is a neckbeard who uses emacs as pid 1 and sends email via USB thumbdrive, and the third is a pragmatic middle-of-the-road person who can help be the glue between them. All decisions need to reach a quorum before continuing. Better yet: each agent runs on a completely different model from a different provider. 3 can be a knob you dial up to 5, 10, etc. Each of these agents can spawn sub-agents to reach out to specialists like a CSS expert or a DBA.

I think prompt engineering could help here a bit: adding some context on what a quality codebase is, removing everything that is not necessary, considering future maintainability (20k -> 84k lines is a smell). All of these are smells that a simple supervisor agent could have caught.

12/11/2025, 9:59:00 PM


by: orliesaurus

OK, serious question: what's the best "Code Review" skill/agent/prompt that I can use these days? Curious to see even paid options if anyone knows.

12/11/2025, 5:10:28 PM


by: keepamovin

This is actually a great idea. It's like those "AI resampled this image 10,000 times" or "JPEG iteratively compressed this picture 1 million times" experiments.

12/11/2025, 4:55:10 PM


by: gm678

"Core Functional Utilities: Identity function - returns its input unchanged." is one of my favorites from `lib/functional.ts`.

12/11/2025, 4:22:29 PM


by: phildougherty

Pasting this whole article into Claude Code: "improve my codebase taking this article into account"

12/11/2025, 4:48:23 PM


by: VikingCoder

You need to scroll the windows to see all the numbers. (Why??)

12/11/2025, 4:59:17 PM


by: g947o

When I ask coding agents to add tests, they often come up with something like this:

    const x = new NewClass();
    assert.ok(x instanceof NewClass);

So I am not at all surprised about Claude adding 5x tests, most of which are useless.

It's going to be fun to look back at this and see how much slop these coding agents created.

12/11/2025, 5:15:59 PM


by: thald

Interesting experiment. Looking at this I immediately thought of a similar experiment run by Google: AlphaEvolve. Throwing LLM compute at problems might work if the problem is well defined and the result can be objectively measured.

As for this experiment: what does quality even mean? Most human devs will have different opinions on it. If you asked 200 different devs (Claude starts from 0 after each iteration) to do the same, I have doubts the code would look much better.

I am also wondering what would happen if Claude had an option to just walk away from the code if it's "good enough". For each problem, most human devs run a cost->benefit equation in their head; only worthy ideas get realized. Claude does not do this, the cost of writing code is very low on its side, and the prompt does not allow any graceful exit :)

12/11/2025, 5:22:56 PM


by: simonw

The prompt was:

    Ultrathink. You're a principal engineer. Do not ask me any questions.
    We need to improve the quality of this codebase. Implement improvements
    to codebase quality.

I'm a little disappointed that Claude didn't eventually decide to start removing all of the cruft it had added, to improve the quality that way instead.

12/11/2025, 4:24:56 PM


by: pawelduda

Did it create 200 CODE_QUALITY_IMPROVEMENTS.md files by chance?

12/11/2025, 4:08:18 PM


by: GuB-42

It is something I noticed when talking to LLMs: if they don't get it right the first time, they probably never will, and if you really insist, the quality starts to degrade.

It is not unlike people, the difference being that if you ask someone the same thing 200 times, they will probably tell you to go fuck yourself, or, if unable to, turn to malicious compliance. These AIs will always be diligent. Or, a human may use the opportunity to educate themselves, but again, LLMs don't learn by doing; they have a distinct training phase that involves ingesting pretty much everything humanity has produced, and your little conversation will not have a significant effect, if any.

12/11/2025, 5:07:04 PM


by: 6LLvveMx2koXfwn

For all the bad code, havoc was most certainly not 'wrecked'; it may have been 'wreaked' though...

12/11/2025, 4:46:27 PM


by: mvanbaak

`--dangerously-skip-permissions` why?

12/11/2025, 4:55:07 PM


by: SKILNER

This strikes me as a very solid methodology for improving the results of all AI coding tools. I hope Anthropic, etc. take this up.

Rather than converging on optimal code (Occam's Razor for both maintainability and performance), they are just spewing code all over the scene. I've noticed that myself, of course, but this technique helps to magnify and highlight the problem areas.

It makes you wonder how much training material was/is available for code optimization relative to training material for just coding to meet functional requirements, and therefore what relative weight of code optimization is baked into the LLMs.

12/11/2025, 4:00:53 PM


by: jesse__

> This app is around 4-5 screens. The version "pre improving quality" was already pretty large. We are talking around 20k lines of TS

Fucking *yikes* dude. When's the last time it took you 4500 lines per screen, 9000 including the JSON data in the repo????? This is already absolute insanity.

I bet I could do this entire app in *easily* less than half, probably less than a tenth, of that.

12/11/2025, 5:37:26 PM


by: etamponi

Am I the only one that is surprised that the app still works?!

12/11/2025, 4:44:02 PM


by: stavros

Well, given it can't say "no, I think it's good enough now", you'll just get madness, no?

12/11/2025, 4:41:20 PM


by: smallpipe

The viewport of this website is quite infuriating. I have to scroll horizontally to see the `cloc` output, but there&#x27;s 3x the empty space on either side.

12/11/2025, 5:59:42 PM


by: lubesGordi

So now you know. You can get claude to write you a ton of unit tests and also improve your static typing situation. Now you can restrict your prompt!

12/11/2025, 6:03:11 PM


by: jcalvinowens

This really mirrors my experience trying to get LLMs to clean up kernel driver code, they seem utterly incapable of simplifying things.

12/11/2025, 6:06:59 PM


by: nadis

20K -> 84K lines of TS for a simple app is bananas. Much madness indeed! But also super interesting, thanks for sharing the experiment.

12/11/2025, 5:53:37 PM


by: guluarte

That's my experience with AI: most times it creates an over-engineered solution unless told to keep it simple.

12/11/2025, 4:50:51 PM


by: krupan

Just the headline sounds like a YouTube brain rot video title:

"I spent 200 days in the woods"

"I Google translated this 200 times"

"I hit myself with this golf club 200 times"

Is this really what Hacker News is for now?

12/11/2025, 4:27:50 PM