Claude always makes mistakes? These 12 rules bring the error rate down to 3%

2026/05/15 23:07

From 41% to 3%: Karpathy's four rules are not enough

Original title: Karpathy's 4 CLAUDE.md rules cut Claude errors from 41% to 11%
Original by: @Mnilax
Compiled by: Peggy, BlockBeats

Editor's note: In January 2026, Andrej Karpathy's rant about how Claude writes code drew attention to a file that looks small but is critical to the AI programming workflow: CLAUDE.md. Forrest Chang then distilled the complaints into four rules of conduct, attempting to curb the errors Claude most often makes when coding: silent assumptions, over-engineering, damage to unrelated code, and the lack of clear success criteria.

A few months later, Claude Code is no longer just "have the model write some code." As multi-step agents, hook-chain triggers, skill loading, and multi-repository collaboration become the norm, new failure modes have emerged: models drift out of control on long tasks, tests pass without the underlying logic being correct, migrations "complete" while silently skipping errors, and different code styles get mixed together.

The author of this piece spent six weeks testing across 30 repositories and added eight new rules to Karpathy's original four, attempting to cover AI programming's shift from single-shot completion to agent collaboration.

The following is the original text:

In late January 2026, Andrej Karpathy posted a tweet thread complaining about the way Claude writes code. He pointed to three typical problems: unstated assumptions, excessive complexity, and damage to unrelated code that should not have been touched.

Forrest Chang saw the thread, organized the complaints into four rules of conduct, put them in a single CLAUDE.md file, and published it on GitHub. The project earned 5,828 stars on its first day, was bookmarked 60,000 times in two weeks, and today sits at 120,000 stars, making it the fastest-growing single-file repository of 2026.

Then I spent six weeks testing it across 30 repositories.

The rules do work. On tasks they cover, an error rate that used to hover around 40% fell to under 3%. The problem is that the template was designed to fix the mistakes Claude made writing code in January.

By May 2026, the Claude Code ecosystem faced different problems: conflicts between agents, hook-chain triggers, skill-loading conflicts, and multi-step workflows falling apart.

So I added eight more rules. This piece walks through the full CLAUDE.md: why each rule earned its place, and the four places where the original Karpathy template silently breaks down.

If you want to skip the explanation and copy it directly, the full file is at the end of this article.

Why this matters

Claude Code's CLAUDE.md is the most underestimated file in the AI programming toolchain. Most developers make one of three mistakes:

First, treating it as a garbage can: stuffing every habit into it until it balloons past 4,000 tokens and the compliance rate drops to 30%.

Second, not using it at all, which costs roughly five times the tokens and loses consistency across sessions.

Third, copying a template and never touching it again. It may work for two weeks, but as the codebase changes, it fails without your knowledge.

Anthropic's official documentation is explicit: CLAUDE.md is essentially advisory. Claude follows it about 80% of the time, and once the file passes 200 lines, compliance drops significantly because the important rules drown in noise.

The Karpathy template solved exactly that problem: one file, 65 lines, 4 rules. That is the minimum baseline.

But the ceiling can be higher. With the eight rules below added, the file covers not only the code-writing problems Karpathy complained about in January 2026, but also the agent-orchestration problems that only surfaced in May 2026, problems that did not exist when the original template was written.

The original 4 rules

If you haven't seen Forrest Chang's repository, here is the baseline version:

Rule 1: Think before coding.

Don't make silent assumptions. State your assumptions and expose the trade-offs. Ask questions instead of guessing. When a simpler option exists, raise it proactively.

Rule 2: Simplicity first.
Write the minimum code that solves the problem. Don't add imagined features. Don't design abstraction layers for one-off code. If a senior engineer would call it too complex, simplify it.

Rule 3: Surgical changes.
Change only what must change. Don't "optimize" neighboring code, comments, or formatting. Don't rewrite what isn't broken. Match the existing style.

Rule 4: Goal-oriented execution.
Define the success criteria, then iterate until validation passes. Don't tell Claude every step; tell it what success looks like and let it iterate toward it.

In unsupervised Claude Code sessions, these four rules covered about 40% of the failure patterns I saw. The remaining 60% hide in the blind spots below.

My eight new rules, and why

Each rule comes from a real moment when Karpathy's original four proved insufficient. For each one, I give the rule, then the moment behind it.

Rule 5: Don't let the model do non-language work

Use Claude for: classification, drafting, summarizing, extracting information from unstructured text. Don't use Claude for: routing, retries, status-code handling, deterministic transforms. If a status code already answers the question, let ordinary code answer it.

Karpathy's rules don't cover this. So the model ends up deciding things that deterministic code should decide: whether to retry an API call, how to parse a message, when to escalate. The judgment then varies from week to week. What you get is a nondeterministic if-else billed at 0.003 per token.

The moment: a piece of code called Claude to "decide whether to retry on a 503." It worked fine for two weeks, then suddenly turned unstable, because the model had started treating the request body as context. The retry strategy became random, because the prompt itself was random.
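
For contrast, here is what the deterministic version of that retry decision looks like, as a minimal TypeScript sketch (fetchWithRetry is a hypothetical helper name, not from the original post):

```typescript
// A 503 already answers the retry question, so deterministic code answers it.
async function fetchWithRetry(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 503 || attempt === maxRetries) {
      return res; // success, a non-retryable status, or retries exhausted
    }
    // Exponential backoff: 1s, 2s, 4s, ...
    await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
  }
  throw new Error("unreachable"); // satisfies the type checker
}
```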

Rule 6: Set a hard token budget, no exceptions

Per-task budget: 4,000 tokens. Per-session budget: 30,000 tokens. If a task approaches its ceiling, summarize the current state, then restart. Don't push through. Explicitly surfacing a budget overrun beats silently burning past it.

A CLAUDE.md without budget limits is a blank check. Any loop can spiral out of control and dump 50,000 tokens of context. Models don't stop themselves.

The moment: a debugging session ran for 90 minutes. The model kept circling the same 8 KB error message, gradually forgetting which fixes it had already tried. Toward the end it was re-proposing ideas I had rejected 40 minutes earlier. With a token budget, the process would have been cut off at minute 12.
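
A minimal sketch of enforcing that budget in an agent loop (the limits come from the rule above; the counter and error wording are illustrative, not from the original post):

```typescript
// Hard budget guard: throw loudly instead of letting the loop drift.
const TASK_BUDGET = 4_000;
const SESSION_BUDGET = 30_000;

let taskTokens = 0;
let sessionTokens = 0;

// Call after every model response with the tokens it consumed.
function recordUsage(tokens: number): void {
  taskTokens += tokens;
  sessionTokens += tokens;
  if (taskTokens > TASK_BUDGET || sessionTokens > SESSION_BUDGET) {
    // Expose the overrun instead of pushing on: the caller should
    // summarize the current state and restart with a fresh context.
    throw new Error(
      `Token budget exceeded (task ${taskTokens}/${TASK_BUDGET}, ` +
        `session ${sessionTokens}/${SESSION_BUDGET}). Summarize and restart.`
    );
  }
}
```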

Rule 7: Expose conflicts, don't average them

If two existing patterns in the codebase contradict each other, don't blend them. Pick one, preferring the newer or better-tested pattern, state the reason, and flag the other for later cleanup. "Average code" that tries to satisfy both sets of rules at once is the worst outcome.

When two parts of a codebase conflict, Claude tries to please both sides, and the result is incoherent code.

The moment: one codebase had two error-handling patterns, try/catch around async/await in one half and a global error boundary in the other. Claude wrote new code using both. Errors got handled twice. It took me 30 minutes to work out why a mistake was being swallowed twice.
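
The "handled twice" failure looks roughly like this (a hypothetical reconstruction, not the actual code from that repository):

```typescript
// Anti-pattern: both error-handling styles at once. The local catch swallows
// the failure, so the global error boundary the rest of the codebase relies
// on never fires, and callers silently receive null.
async function loadUser(id: string): Promise<unknown | null> {
  try {
    const res = await fetch(`/api/users/${id}`);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return await res.json();
  } catch (err) {
    console.error(err); // handled here...
    return null;        // ...and hidden from the error boundary
  }
}
```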

Rule 8: Read first, then write

Before adding code to a file, read the file's exports, its direct callers, and any obviously related shared utility functions. If you don't understand why the existing code is organized the way it is, ask first instead of bolting things on. "This looks irrelevant to me" is the most dangerous sentence in a codebase.

Karpathy's "surgery modification" told Claude not to change the adjacent code. But it didn't tell Claude: understand the adjacent code first. Without this, Claude will write new codes that conflict with existing codes outside of 30 lines。

The moment: Claude added a function identical to an existing one, because it never read the original. Both functions did the same thing, but because of import order the new one shadowed the old one, which had been the de facto single source of truth for six months.
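
One hypothetical reconstruction of that shadowing mechanism (the module and function names are made up): when two modules export the same name and a barrel merges them by object spread, the later one silently wins.

```typescript
// format.ts held the original formatPrice for six months; money.ts is the
// duplicate added without reading first.
import * as format from "./format";
import * as money from "./money";

// Object spread: the later entry silently overwrites the earlier one,
// so utils.formatPrice now resolves to the new, subtly different copy.
export const utils = { ...format, ...money };
```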

Rule 9: Tests are not optional, but tests are not the goal

Every test has to encode "why this behavior matters," not just "what it does." A test like `expect(getUser()).toBe('John')` is worthless if the function actually returns a hard-coded ID. If you can't write a test that fails when the business logic changes, the function itself is wrong.

Karpathy's "target-oriented execution" suggests that testing can be a success criterion. But in practice, Claude used the word "test pass" as his only goal, so she wrote some code that could pass the subsurface test, but that would destroy everything else。

The moment: Claude wrote 12 tests for an authentication function, all passing. The authentication logic in production was broken. The tests only checked that the function "returned something," not that it returned the right thing. A function returning a constant would have passed them.
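
The difference between a vacuous test and a meaningful one, as a Jest-style sketch (authenticate is a hypothetical function under test):

```typescript
// Hypothetical function under test.
declare function authenticate(
  user: string,
  password: string
): Promise<{ token: string; userId: string }>;

// Vacuous: passes even if authenticate() returns a hard-coded constant.
test("authenticate returns something", async () => {
  expect(await authenticate("alice", "secret")).toBeDefined();
});

// Meaningful: encodes why the behavior matters, and fails when the logic breaks.
test("authenticate rejects a wrong password", async () => {
  await expect(authenticate("alice", "wrong-password")).rejects.toThrow();
});

test("authenticate binds the token to the requesting user", async () => {
  const { token, userId } = await authenticate("alice", "secret");
  expect(userId).toBe("alice");
  expect(token).not.toHaveLength(0);
});
```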

Rule 10: Long-running operations need checkpoints

In any multi-step task, summarize after each major step: what was done, what was tested, and what remains. Don't proceed from a state you can't restate to me. If you realize you've lost the thread, stop and restate the current state.

The Karpathy template assumes one-shot interactions. Real Claude Code work is usually multi-step: touching more than 20 files, building a feature in a single session, debugging across multiple turns. Without checkpoints, one wrong step can waste everything built after it.

The moment: a 6-step refactoring task went wrong at step 4. By the time I noticed, Claude had built steps 5 and 6 on top of the broken state. Unwinding it took longer than the task itself. With a checkpoint, step 4 would have surfaced the problem.
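
A checkpoint can be a few plain lines; the rule doesn't prescribe a format, but something like this sketch works (the contents are hypothetical):

```text
CHECKPOINT after step 4/6:
- Done: extracted PaymentService, updated 9 call sites
- Tested: PaymentService unit tests pass; integration suite not yet run
- Remaining: step 5 (delete legacy module), step 6 (update docs)
- Uncertain: two call sites in billing/ use a pattern I don't recognize
```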

Rule 11: Convention beats innovation

If the codebase uses snake_case and you prefer camelCase: use snake_case. If the codebase uses class components and you prefer hooks: use class components. Your dissenting opinion is a separate discussion. Inside a codebase, consistency beats personal preference. If you genuinely believe a convention is harmful, say so explicitly; don't fork the style in silence.

Claude loves to introduce its own taste into a well-established codebase. Even if the new code is written "better," introducing a second pattern is itself worse than either single pattern.

The moment: Claude introduced hooks into a React repository built on class components. The code ran. But it broke the repository's existing testing pattern, because those tests relied on componentDidMount. Deleting and rewriting it took half a day.

Rule 12: Fail visibly, never silently

If you can't confirm something was done, say so explicitly. If 30 records were silently skipped, don't say the migration is complete. If any test was skipped, don't say "tests pass." If you didn't verify the edge cases I asked about, don't say "it works." Default to exposing uncertainty, not hiding it.

Claude's most expensive failures are the ones that look like success. A function "runs" but returns the wrong data; a migration "completes" but skips 30 records; a test "passes" only because the assertion itself was wrong.

The moment: Claude reported a database migration as "successful." In fact it had silently skipped the 14% of records that triggered a constraint conflict. The skips went into a log but were never surfaced. Eleven days later, when the reporting numbers started drifting, we found the problem.
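
A sketch of the visible-failure version of that migration (migrateRecord and the record shapes are hypothetical):

```typescript
// Hypothetical per-record step that throws on a constraint conflict.
declare function migrateRecord(record: { id: string }): Promise<void>;

interface MigrationResult {
  migrated: number;
  skipped: { id: string; reason: string }[];
}

async function runMigration(records: { id: string }[]): Promise<MigrationResult> {
  const result: MigrationResult = { migrated: 0, skipped: [] };
  for (const record of records) {
    try {
      await migrateRecord(record);
      result.migrated++;
    } catch (err) {
      // Record the skip explicitly instead of silently continuing.
      result.skipped.push({ id: record.id, reason: String(err) });
    }
  }
  if (result.skipped.length > 0) {
    // Refuse to call this "success": surface the skips to the caller.
    console.error(
      `Migration incomplete: skipped ${result.skipped.length} of ${records.length} records.`
    );
  }
  return result;
}
```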

The results

Over six weeks I tracked the same set of 50 representative tasks across 30 repositories, under three configurations.

"Error rate" means the task's output needed correction or a rewrite to match the original intent. Errors included: silent wrong assumptions, over-engineering, unrelated damage, silent failures, convention violations, conflict-averaging, and missing checkpoints.

"Compliance rate" is the probability that, when a rule applies, Claude visibly follows it.

The genuinely interesting result is not just that the error rate dropped from 41% to 3%. More important: extending from 4 rules to 12 barely increased the compliance burden. Compliance slipped from 78% to 76%, while the error rate fell another 8 percentage points. The new rules cover failure modes the original four never addressed, so they don't compete for the same attention budget.

Where the Karpathy template quietly fails

Even leaving the new rules aside, the original 4-rule template falls short in at least four places.

First, long-running agent tasks.
Karpathy's rules target the moment Claude is writing code. But what happens when Claude runs a multi-step pipeline? The original template has no budget rule, no checkpoint rule, and no "fail visibly" rule, so pipelines slowly drift.

Second, consistency across codebases.
The template assumes a single default style. In a 12-service monorepo, Claude has to choose which style to match, and the original rules never say how. So it either picks at random or mixes several styles.

Third, test quality.
"Goal-oriented execution" treats "tests pass" as success without requiring the tests themselves to be meaningful. So Claude wrote tests that verified almost nothing, and they made it confident it was done.

Fourth, the gap between production and prototyping.
The same rule that prevents over-engineering in production code can slow prototyping down. A prototype phase sometimes genuinely needs 100 lines of exploratory scaffolding to find a direction. Karpathy's "simplicity first" triggers too easily on early-stage code.

The eight new rules are not meant to replace Karpathy's original four; they fill its gaps. The original template matches the single-shot code-writing scene of January 2026; by May 2026, Claude Code had moved into an agent-driven, multi-step, multi-repository collaborative environment with a different set of problems.

What didn't work

Before settling on these 12 rules, I tried a few other approaches.

Adding rules I saw on Reddit / X.
Most of them either restated Karpathy's original four in different words or were domain-specific rules that don't generalize, such as "always use Tailwind classes." All deleted.

Going past 12 rules.
I tested up to 18. Beyond 14, compliance fell from 76% to 52%. The ceiling is real. Past it, Claude starts pattern-matching on "there are rules here" instead of following each rule individually.

Rules that depend on specific tools.
For example, "always use eslint" lapses the moment a project has no eslint installed. I rewrote these in tool-independent terms: not "use eslint" but "follow the style the codebase enforces."

Using examples in CLAUDE.md instead of rules.
Examples consume far more context than rules: three examples cost roughly as much context as 10 rules, and Claude overfits to examples easily. Rules are abstract, examples are specific. Use rules.

"Be careful," "think carefully," "pay more attention."
These are noise. Compliance with such directives fell to about 30%, because they can't be tested. I replaced them with specific imperative rules, such as "state your assumptions."

Telling Claude to act like a senior engineer.
It doesn't work. Claude already believes it is acting like a senior engineer. The real question is not what it believes but whether it executes that way. Imperative rules narrow that gap; identity prompts don't.

The full 12 rules

Here is the complete version, ready to paste directly.

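A compact rendering, assembled from the twelve rules above (the exact wording of the published file may differ):

```markdown
# CLAUDE.md

## Rule 1: Think before coding
State assumptions, expose trade-offs, ask instead of guessing, raise simpler options proactively.

## Rule 2: Simplicity first
Write the minimum code that solves the problem. No imagined features, no abstraction layers for one-off code.

## Rule 3: Surgical changes
Change only what must change. Don't "optimize" neighboring code, comments, or formatting. Match the existing style.

## Rule 4: Goal-oriented execution
Define the success criteria, then iterate until validation passes.

## Rule 5: No non-language work
Use Claude for classification, drafting, summarizing, extraction. Routing, retries, status codes, and deterministic transforms belong in ordinary code.

## Rule 6: Hard token budgets
4,000 tokens per task, 30,000 per session. Near the ceiling: summarize state and restart. Surface overruns; never push through.

## Rule 7: Expose conflicts
When two existing patterns contradict, pick one (prefer the newer or better tested), say why, and flag the other for cleanup. Never average them.

## Rule 8: Read before writing
Before adding code to a file, read its exports, direct callers, and related shared utilities. Ask if the organization is unclear.

## Rule 9: Meaningful tests only
Every test encodes why the behavior matters. If it can't fail when business logic changes, rewrite it.

## Rule 10: Checkpoints on long tasks
After each major step, summarize what was done, tested, and remains. Never proceed from a state you can't restate.

## Rule 11: Convention over innovation
Match the codebase's existing style and patterns, even against your own preferences. Flag harmful conventions explicitly; never fork the style silently.

## Rule 12: Fail visibly
Never report success you haven't verified. Expose skipped records, skipped tests, and unverified edge cases by default.
```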

Save it as CLAUDE.md in the repository root. Below these rules, add project-specific ones: tech stack, test commands, known error patterns. Keep the whole file under 200 lines; compliance drops markedly past that.

How to install

Two steps are enough:

1. Append Karpathy's four base rules to your CLAUDE.md:
curl https://raw.githubusercontent.com/forrestchange/andrej-karpathy-skills/main/CLAUDE.md >> CLAUDE.md


2. Paste rules 5-12 below them.

Save the file in the repository root. Note the `>>` above: it matters, because it appends to your existing CLAUDE.md instead of overwriting the project-specific rules you have already written.

Mental model

CLAUDE.md is not a wish list. It is a behavioral contract that blocks the specific failure patterns you have observed.

Every rule should answer one question: what does it prevent?

Karpathy's four rules guard against the failure patterns he saw in January 2026: silent assumptions, over-engineering, collateral damage, weak success criteria. They are the foundation. Don't skip them.

My eight new rules guard against the failure patterns that emerged after May 2026: unbudgeted agent loops, multi-step tasks without checkpoints, tests that look like tests but never touch the critical logic, and silent failures packaged as silent successes. They are incremental patches.

Of course, mileage varies. If you never run multi-step tasks, rule 10 matters less to you. If your codebase has one uniform style already enforced by a linter, rule 11 is redundant. Read all 12, keep the rules that match failures that have actually bitten you, and delete the rest.

A six-rule CLAUDE.md tailored to your real failure patterns beats a twelve-rule version of which you never use six.

Concluding remarks

Karpathy's January 2026 thread was, at bottom, a complaint. Forrest Chang turned it into four rules. 120,000 developers eventually starred the result. And most of them are still using only those four rules today.

Models have advanced and the ecosystem has changed. Multi-step agents, hook-chain triggers, skill loading, multi-repository collaboration: none of this existed when Karpathy wrote that thread. The original four rules don't address these problems. They are not wrong, just incomplete.

Eight new rules. Six weeks, 30 repositories. Error rate down from 41% to 3%.

Rather than just bookmarking this article, paste these 12 rules into your CLAUDE.md tonight. If it saves you a week of Claude detours, feel free to share it.

© 2024 QQlink 研发团队. 保留所有权利.