What six months of AI-generated code taught me about plumbing
I haven’t written a line of code in six months.
That’s not exactly true. I’ve written configuration files. Instruction documents. Shell scripts that check whether an AI agent did what I asked. Rules about TypeScript, rules about security, rules about whether a button should be a Server Component or a Client Component. Reviews of reviews of code I didn’t write.
But the actual production code — none of it. An AI agent called Claude Code wrote every line. Three hundred and fifty-one pull requests between October and March. Every component, every hook, every edge function, every test. I just told it what to build.
The interesting part isn’t the AI. The interesting part is the plumbing.
My grandmother had a kitchen in Queens. Small. The counters were that flecked Formica you don’t see anymore. She cooked in that kitchen for forty years and the food was always good. Not because she was some kind of genius. Because the kitchen worked. Everything was where it needed to be. The knives were sharp. The stove worked right. She’d spent decades arranging things so the cooking part was easy.
That’s what I did with the AI. Except instead of arranging a kitchen, I built a pipeline. And instead of forty years, it took about five months.
Here’s the number that matters. In October, when I started, the automated code reviewer found 0.83 critical issues per pull request. Almost one serious problem in every PR. Type safety violations, security holes, things that would break in production.
By March, that number was 0.07. One critical issue for every fourteen PRs.
Nearly twelve times better. In code I didn’t write, reviewed by agents I built, governed by a pipeline I kept making more complicated.
The question is why. And the answer is boring. I think that’s the most important thing about it.
The pipeline started simple. A file called CLAUDE.md that told the agent what stack we were using. ESLint. A test framework. A CI check. That was it. The agent generated code, the code went up as a pull request, an automated reviewer checked it, and nearly everything had problems.
So I added rules. Five documents about what good code looks like. TypeScript should be strict. Use this function for authentication, not that one. Server Components by default.
Then I added workflow automation. A single command that creates a branch, commits, pushes, opens a PR, and goes back to master. Before that, the agent would get the git commands wrong about half the time.
Then parallel agents. Multiple coding agents working at the same time, each one required to pass a build check between waves.
Then plan review. Three agents reading the plan before any code got written. One of them existed only to catch over-engineering.
Then more. Reuse-first planning that searched the codebase before proposing new code. A PR resolution workflow for triaging review comments. A unified verify script. A rewrite of all the rules because I discovered that writing “NEVER” and “MUST” in the instruction file was less effective than just explaining why a rule existed.
By February, the pipeline had twelve steps. Brainstorming through production-verified testing. Each step existed because something had gone wrong without it.
There’s a thing that happens when you renovate a house. You tear out a wall and find that the previous owner ran the plumbing through a load-bearing beam. The water flows. Everything works. But it’s wrong in a way that only shows up later, when something shifts.
The agent has a version of this. It writes code that works but is wrong in ways that only show up under pressure.
The best example is error handling. The agent writes the happy path first. Always. If the plan says “fetch recipes from the API and display them,” the agent writes code that fetches recipes and displays them. Clean code. Passes tests.
But what happens when the network fails? When the user closes the page while the fetch is still running? When the authentication token expires mid-stream?
Nothing good. The agent doesn’t think about failure because failure isn’t in the plan. It writes code for the world where everything works. That world doesn’t exist.
Around fifty to a hundred issues in this category. Fire-and-forget fetch calls. Missing try-catch blocks. Streaming code that works until someone’s WiFi hiccups, and then falls over and stays down.
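Here’s roughly what the defensive version looks like. This is a sketch, not code from the project — the `Recipe` type, the `/api/recipes` endpoint, and the helper names are all invented for illustration:

```typescript
type Recipe = { id: string; title: string };

// Wrap any async loader so a failure yields a fallback value instead of
// an unhandled rejection. Deliberate aborts still propagate, because a
// user closing the page mid-fetch is not an error state.
async function loadWithFallback<T>(
  load: () => Promise<T>,
  fallback: T,
): Promise<T> {
  try {
    return await load();
  } catch (err) {
    if (err instanceof Error && err.name === 'AbortError') throw err;
    console.error('load failed, using fallback', err);
    return fallback;
  }
}

// The happy path the agent writes, plus the failure cases it skips.
function fetchRecipes(signal?: AbortSignal): Promise<Recipe[]> {
  return loadWithFallback(async () => {
    const res = await fetch('/api/recipes', { signal });
    if (!res.ok) throw new Error(`HTTP ${res.status}`); // non-2xx is a failure too
    return (await res.json()) as Recipe[];
  }, []);
}
```

The signal is the part the agent never adds on its own: create an AbortController when the page mounts, pass its signal in, call abort on unmount, and the fetch dies with the page instead of racing it.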
The testing gap bothers me the most.
I built a TDD enforcement skill. It’s thorough. Write a test, watch it fail, write the minimum code to pass, watch it pass, refactor. The whole cycle, enforced at every step. When the agent goes through this process, the tests are good.
The problem is I don’t always make it go through this process.
Claude Code has a built-in planning mode. Quick and easy. You describe what you want, it plans, it executes. Very convenient for small fixes. That built-in mode has no idea my project requires TDD. The rules mention it, sure. But without the TDD skill actively driving the session, the agent writes production code without tests and nothing stops it.
A hundred and fifty to two hundred and sixty testing issues across six months. The number never improved. Not once. Every other category got better. Testing stayed flat.
It stayed flat because of me. Every time I took the shortcut, used the quick path instead of the full pipeline, told myself the change was too small to bother with, another PR shipped without tests. The agent didn’t fail. I failed to use the tools I built.
That’s the real finding in the whole report. The pipeline works. It works really well. But only when you use it. The temptation to skip it for “just a quick fix” is the kind of thing that looks harmless until you see it repeated across three hundred and fifty-one pull requests.
There’s another problem the pipeline can’t solve. I didn’t see this one coming.
The project is a culinary platform. A kitchen mentor app. The product vision is that the interface should feel like opening a premium food magazine. I mean this literally. Typography, whitespace, image treatment, the way a recipe card looks when you scroll past it. Presentation matters as much as the food.
The agent can check that a component uses the right design token for a color. It can verify that a responsive breakpoint is set correctly. It can confirm that an element renders at the right size.
It cannot tell you whether a page feels like a food magazine.
Type safety improved ninety-eight percent because type-checking is binary. Code compiles or it doesn’t. Security improved ninety-five percent for the same reason. These are yes-or-no questions and the pipeline answers them well.
But visual presentation is different. Someone looks at a recipe card and either feels something or doesn’t. That category got worse only relative to everything else: the rest improved so much that the visual stuff became the dominant remaining problem.
The mechanical part is fixable. A lint rule can reject hardcoded colors. That’s plumbing. Should have built it months ago.
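The rule is a few lines. A sketch of what it might look like in an ESLint flat config — the file glob, selector regex, and message are mine, not the project’s:

```typescript
// eslint.config.ts — reject string literals that look like hex colors,
// pushing the agent toward design tokens instead.
export default [
  {
    files: ['src/**/*.{ts,tsx}'],
    rules: {
      'no-restricted-syntax': [
        'error',
        {
          // esquery selector: any literal matching a hex color pattern.
          selector: 'Literal[value=/^#[0-9a-fA-F]{3,8}$/]',
          message: 'Hardcoded hex color. Use a design token.',
        },
      ],
    },
  },
];
```

Binary check, binary answer. Exactly the kind of question the pipeline is good at.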
Whether the overall composition creates the right feeling, though. That’s a different question. No lint rule evaluates editorial tone. No automated test tells you if a page makes someone want to cook.
For a product where the visual experience is the product, that’s a real ceiling.
Something happened in December that confused me at first.
December had the highest issue rate of any month. 5.4 issues per pull request. Worse than October, when the pipeline was basically nothing. I’d added multi-agent review, parallel execution, all this infrastructure. And the numbers got worse.
Except they didn’t. December was when I built the Recipe Card Development Studio and the FullRecipe display system. The two most design-heavy features in the project. Fifty-one UI issues in that month alone, more than double any month before. The pipeline caught more because there was more to catch.
You miss this if you only look at numbers. A spike doesn’t always mean regression. Sometimes it means the pipeline is finding things it would have missed before.
There were three models over the six months. Sonnet 4.5, then Opus 4.5, then Opus 4.6. Each one better than the last.
When a new model showed up, the pipeline was already there. All the rules, all the modules, all the enforcement. The new model dropped in and immediately produced better results. No retraining. No migration. No setup period.
I’d been building the pipeline ahead of the models. Complexity that pushed the current model to its limits became the next model’s comfort zone. Things that barely worked with Sonnet ran easily with Opus. The pipeline and the model improved at the same time, and they compounded.
Near-zero time to value. The infrastructure you build today for a model that can barely handle it becomes the thing that makes the next model sing. If you wait until the model is ready, you lose the compounding.
Here’s what I know after six months and three hundred and fifty-one pull requests.
AI-generated code quality depends on two things. The model and the pipeline. Neither one alone explains the improvement.
October: basic rules, older model. Almost one critical per PR. March: twelve-step pipeline, better model. One critical for every fourteen PRs.
The improvement came from four things, in order of how well they work.
Hard constraints. Lint rules, build gates, type checking. Things that make wrong code fail to compile. The most effective because the agent literally cannot produce code that violates them.
Agent context. Modules and standards documents that make the right code easier to generate than the wrong code.
Review layers. Plan review before coding, code review after, testing after that. These catch what the constraints miss.
Workflow automation. Commands that handle the boring stuff so the agent doesn’t forget. Commit, push, open a PR, sync documentation.
Every category that improved has a hard constraint behind it. Every category that didn’t is missing one.
My grandmother’s kitchen. The Formica counters. The knives in the same drawer every time.
She didn’t think about the kitchen. She thought about the food. The kitchen was just good plumbing that let her cook.
That’s the pipeline. Good plumbing. The AI writes the code. I build the kitchen around it. Every month, the food gets a little better.
Except for plating. Plating is still hard.
Three hundred and fifty-one pull requests. Every component, every hook, every edge function. Written by an AI agent. Reviewed by twelve automated agents. Governed by a twelve-step pipeline.
The human’s job was the plumbing. And sometimes, apparently, remembering to use it.