Getting AI-Assisted Development to Green Without Breaking the Code

Repair loops, small diffs, test trust, and how to get CI back to green without trashing the codebase.

March 4, 2026 · 6 min read · By Alex Chernysh

Delivery · Testing · Workflow
Jump to section

  1. Green state is a trust condition
  2. The first diagnosis matters more than the fourth patch
  3. Small diffs are not aesthetic. They are a safety mechanism.
  4. Tests are not sacred, but they are not disposable
  5. AI helps most where structure exists
  6. Fast feedback beats one heroic repair session
  7. Reversibility is part of the design
  8. AI-generated code still inherits operational responsibility
  9. A disciplined green path
  What I would forbid
  Related reading
  Further reading

The dangerous version of AI-assisted development is not the one that fails loudly. It is the one that gets to green by quietly lowering the meaning of green.

Working rule

Use AI to shorten the path to a correct change, not to negotiate with reality until the CI lights become decorative.

Repair loop defaults

  • keep diffs small enough that a human can still reason about them
  • decide whether the code is wrong, the test is wrong, or both drifted from the intended behavior
  • stop treating green CI as success if trust in the suite is collapsing underneath it
  • keep ownership of contracts, failure modes, and reversibility with the engineer

Test until green

The system keeps changing files until the suite stops complaining.

  • tests are weakened casually
  • design intent drifts
  • the diff gets harder to review each round

Disciplined repair loop

The team diagnoses the mismatch before editing more files.

  • code, tests, and docs are reconciled deliberately
  • changes stay reversible
  • green still means something

1. Green state is a trust condition

A repository is green when:

  • the code builds
  • the tests are honest
  • the contracts still mean what the team thinks they mean
  • the diff is understandable enough to own later

That is why "all checks pass" is necessary and still not sufficient.

An AI tool can get a suite to pass in many ways. Some of them are useful. Some of them are a form of polite vandalism.

2. The first diagnosis matters more than the fourth patch

When a change breaks tests, there are usually three possibilities:

  1. the production code is wrong
  2. the tests are wrong
  3. both drifted away from the intended behavior

This sounds banal. It is also where most AI-assisted repair loops go wrong.

A model is excellent at proposing edits. It is less reliable at deciding which layer deserves to move unless the design intent is visible.

That is why the repair loop should begin with explicit diagnosis:

  • what behavior was intended?
  • which file expresses that intention most credibly?
  • is the failure about logic, contract, environment, or stale test assumptions?

Without that step, the agent starts bargaining with the suite.
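The diagnosis step can be made mechanical enough to run before any edit. A minimal sketch in Python; the decision rule and its categories are my own illustration of the three possibilities above, not the output of any real tool:

```python
def triage(intended_behavior: str, code_matches_intent: bool,
           test_matches_intent: bool) -> str:
    """Hypothetical triage helper: decide which layer is allowed to
    move before anything is edited. Inputs are human judgments."""
    if not intended_behavior:
        return "stop: state the intended behavior first"
    if code_matches_intent and not test_matches_intent:
        return "change the test"
    if test_matches_intent and not code_matches_intent:
        return "change the code"
    if not code_matches_intent and not test_matches_intent:
        return "design drift: reconcile both, deliberately"
    return "no mismatch: suspect environment or stale setup"
```

The useful part is not the function. It is that the first argument forces the intent to be written down before the agent is allowed to propose a patch.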

3. Small diffs are not aesthetic. They are a safety mechanism.

If a fix touches too many files at once, review quality drops and diagnosis gets worse.

In AI-assisted development, small diffs matter even more because the model can produce plausible bulk edits faster than a human can audit them.

I prefer a sequence like this:

  1. reproduce failure
  2. isolate cause
  3. patch one layer
  4. rerun checks immediately
  5. continue only if the failure class is actually resolved

That sounds slower. It is often faster because you do not spend the afternoon untangling a 14-file patch that solved two symptoms and introduced five others.
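The five-step sequence above can be sketched as a loop that does exactly one patch per round and re-verifies before continuing. A hedged sketch; `reproduce`, `isolate`, `patch`, and `run_checks` are stand-ins for whatever your project actually uses:

```python
def repair_loop(reproduce, isolate, patch, run_checks, max_rounds=5):
    """One diagnosed cause, one small patch, one immediate re-check
    per round; stop at green or when the round budget runs out."""
    for round_no in range(1, max_rounds + 1):
        failure = reproduce()
        if failure is None:
            return f"green after {round_no - 1} patch(es)"
        patch(isolate(failure))   # touch one layer only
        run_checks()              # rerun checks immediately, every round
    return "stop: split the remaining work into a new change"
```

The budget is the point: when the loop runs out, the answer is to split the work, not to widen the patch.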

4. Tests are not sacred, but they are not disposable

There is nothing noble about preserving a broken test forever.

There is also nothing disciplined about deleting or weakening a test because it is blocking momentum.

A healthier standard is:

  • change the test if the test encodes behavior the system should no longer have
  • change the code if the test correctly describes intended behavior
  • change both only when the design evolved and neither file fully reflects it anymore

The point is not to defend tests emotionally. The point is to keep the suite as a credible contract.
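Here is that standard in miniature. A hypothetical example: the product decided that prices should round half up instead of truncating, so the code is right, the old test encoded obsolete behavior, and the test is the layer that moves, with the rationale recorded where the next reader will see it:

```python
from decimal import Decimal, ROUND_HALF_UP

def display_price(amount: float) -> str:
    """Format a price with half-up rounding to two decimal places."""
    return str(Decimal(str(amount)).quantize(Decimal("0.01"),
                                             rounding=ROUND_HALF_UP))

def test_display_price():
    # Updated deliberately: the spec changed from truncation to
    # half-up rounding. The old assertion expected "19.99" here;
    # the TEST was wrong, not the code.
    assert display_price(19.995) == "20.00"
    assert display_price(2.5) == "2.50"
```

The comment is the contract repair. Without it, the next engineer cannot tell a deliberate behavior change from a test weakened to get to green.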

5. AI helps most where structure exists

AI tools are strongest when the work has enough local structure to constrain the move:

  • boilerplate
  • repetitive refactors
  • test scaffolding
  • migration mechanics
  • obvious consistency fixes

They are weaker when the task is mostly judgment:

  • deciding the contract
  • choosing the trade-off
  • defining the rollback plan
  • deciding what counts as “done”

That does not mean the tool is useless there. It means the engineer remains responsible for the frame.

6. Fast feedback beats one heroic repair session

The external research on software delivery (Accelerate, the DORA reports) has been saying roughly the same thing for years: smaller changes and faster feedback loops are healthier.

AI does not repeal this. It intensifies it.

When generation is cheap, the temptation is to defer judgment. The better move is the opposite:

  • run checks sooner
  • fail sooner
  • narrow sooner
  • revert sooner when necessary

The agent can help move faster. It should not convince you that verification has become optional.

7. Reversibility is part of the design

A good AI-assisted workflow makes rollback easy.

That means:

  • additive changes before invasive ones
  • clear file ownership
  • obvious commit boundaries
  • avoiding mixed-purpose diffs
  • preserving the ability to back out one move without losing the whole session

The codebase should not need a séance to understand what happened.
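At least one of these properties is checkable rather than aspirational: whether a diff mixes purposes. A sketch; the purpose buckets and path conventions are illustrative assumptions, not a real tool's output:

```python
def diff_purposes(changed_files):
    """Classify changed paths into coarse purpose buckets so a
    mixed-purpose diff can be flagged before commit. The buckets
    are illustrative; adapt them to your repo layout."""
    def bucket(path):
        if path.startswith("tests/"):
            return "tests"
        if path.endswith((".md", ".rst")):
            return "docs"
        return "code"
    return {bucket(p) for p in changed_files}

def is_single_purpose(changed_files):
    # Code plus its tests is one purpose; code plus docs plus
    # unrelated cleanup is a diff you cannot back out in one move.
    return len(diff_purposes(changed_files) - {"tests"}) <= 1
```

A check like this makes "obvious commit boundaries" a pre-commit question instead of a retrospective one.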

8. AI-generated code still inherits operational responsibility

The system does not care whether a regression came from a human, a code model, or an enthusiastic afternoon.

What matters later is:

  • who owns the failure mode
  • whether the logging was good enough
  • whether the contract remained legible
  • whether the rollback path exists

That is why I still like a human-in-the-loop framing for engineering work. Not because the model is weak. Because responsibility still needs a home.

9. A disciplined green path

If I wanted a reliable default for AI-assisted repair work, I would keep it close to this:

  1. reproduce the failure
  2. diagnose code vs test vs design drift
  3. patch the smallest plausible layer
  4. rerun type-check and tests immediately
  5. stop when the diff is no longer easy to audit
  6. split the remaining work instead of gambling on one more broad patch

That is not the most cinematic workflow. It is the one I trust.
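Step 5, "stop when the diff is no longer easy to audit," can be a concrete gate rather than a feeling. A sketch with thresholds I made up; tune them per team:

```python
def diff_is_auditable(files_changed: int, lines_changed: int,
                      max_files: int = 5, max_lines: int = 200) -> bool:
    """Illustrative audit gate. Beyond these (made-up) thresholds,
    split the work instead of applying one more broad patch."""
    return files_changed <= max_files and lines_changed <= max_lines
```

The exact numbers matter less than the existence of a threshold that the repair loop cannot negotiate with.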

What I would forbid

A few anti-patterns deserve a plain ban:

  • deleting tests without explicit rationale
  • merging broad AI-generated refactors that no one can explain
  • changing code and tests together without stating which layer was wrong
  • using green CI as proof that the design is now healthier

A passing suite is good news. It is not a philosophy.
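The first ban is mechanically detectable from a diff's file statuses. A sketch; the status tuples mirror the shape of `git diff --name-status` output, and the filename convention is an assumption about your layout:

```python
def deleted_tests(name_status):
    """Given (status, path) pairs shaped like `git diff --name-status`
    output, return test files that were deleted. The naming convention
    ('test' in the filename) is an assumption; adapt it to your repo."""
    return [path for status, path in name_status
            if status == "D" and "test" in path.rsplit("/", 1)[-1]]
```

A CI step that fails when this list is non-empty and no rationale is present turns the ban into a guardrail instead of a norm.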

Related reading

  • Spec-driven development for agent workflows
  • How to run LLM evals in production

Further reading

  • Accelerate and DORA research
  • Continuous Delivery

Part of the public notes on grounded AI systems, retrieval, evals, and delivery under real constraints.