The loud failures are easy. The dangerous version of AI-assisted development is the one where CI keeps going green because the team quietly lowered the meaning of green.
Test until green
The system keeps changing files until the suite stops complaining.
- tests are weakened casually
- design intent drifts
- the diff gets harder to review each round
Disciplined repair loop
The team diagnoses the mismatch before editing more files.
- code, tests, and docs are reconciled deliberately
- changes stay reversible
- green still means something
Green is a trust condition
A repository is green when the code builds, the tests are honest, the contracts mean what the team thinks they mean, and the diff is understandable enough to own next quarter. "All checks pass" is necessary, never sufficient.
An AI tool can get a suite to pass in many ways. A few of them are useful. The rest weaken the codebase quietly enough that nobody notices until November.
Diagnose before patching
When a change breaks tests, the usual options are: the production code is wrong, the test is wrong, or both drifted from the intended behavior. Banal. Also where most AI-assisted repair loops go wrong.
A model proposes edits fluently; it is less reliable at deciding which layer should move unless the design intent is visible. So the loop has to start with diagnosis. What behavior was intended? Which file expresses that intention most credibly? Is the failure about logic, contract, environment, or stale test assumptions?
Without that step, the agent starts bargaining with the suite.
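A minimal, hypothetical illustration of why the diagnosis has to come first. The names are invented, and the test is deliberately failing:

```python
def apply_discount(price: float, pct: float) -> float:
    # Hypothetical production code: binary floats plus round() mean
    # 2.675 is stored as 2.67499..., so round(2.675, 2) returns 2.67.
    return round(price * (1 - pct), 2)

def test_apply_discount():
    # Fails: apply_discount(2.675, 0.0) returns 2.67, not 2.68.
    assert apply_discount(2.675, 0.0) == 2.68

# Three honest diagnoses, three different moves:
#   - If the contract is "round money half up", the code is wrong:
#     switch to decimal.Decimal with ROUND_HALF_UP.
#   - If half-even rounding is the intended contract, the test is wrong.
#   - If nobody wrote the contract down, both drifted; settle the design first.
# The bargaining move is none of these:
#   assert abs(apply_discount(2.675, 0.0) - 2.68) < 0.02
# That gets the suite green and makes the contract meaningless.
```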
Small diffs are a safety mechanism
If a fix touches too many files, review quality drops and diagnosis gets worse. With AI, this matters more because the model produces plausible bulk edits faster than a human can audit them.
The sequence I prefer:
- reproduce failure
- isolate cause
- patch one layer
- rerun checks immediately
- continue only if the failure class is actually resolved
It looks slower on paper. In practice it is faster, because you skip the afternoon spent untangling a 14-file patch that solved two symptoms and introduced five others.
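Parts of that sequence can even be mechanized. A sketch, not doctrine: it assumes git and pytest are on PATH, and the file-count threshold and test scope are arbitrary, hypothetical choices.

```python
import subprocess

def diff_is_auditable(max_files: int = 5) -> bool:
    # Refuse to continue once the working diff is too broad to review.
    out = subprocess.run(
        ["git", "diff", "--name-only"], capture_output=True, text=True
    ).stdout
    return len([f for f in out.splitlines() if f.strip()]) <= max_files

def failure_class_resolved(scope: str) -> bool:
    # Rerun the narrowest relevant scope; -x stops at the first failure.
    return subprocess.run(["pytest", scope, "-x", "-q"]).returncode == 0

# After each single-layer patch:
if not diff_is_auditable():
    raise SystemExit("Diff too broad: split the work before patching again.")
if not failure_class_resolved("tests/test_pricing.py"):  # hypothetical scope
    raise SystemExit("Not resolved: go back to diagnosis, not to more edits.")
```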
Tests are a contract, not a relic
There is nothing noble about preserving a broken test forever, and nothing disciplined about deleting one because it blocks momentum.
The standard I use: change the test if it encodes behavior the system should no longer have. Change the code if the test correctly describes intended behavior. Change both only when neither file fully reflects the design anymore.
The point is to keep the suite as a contract you can still believe.
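When the test is the layer that moves, the rationale should move with it. A hypothetical sketch of a deliberate contract change, with invented names:

```python
def snap_to_nickel(cents: int) -> int:
    # New contract: round to the NEAREST nickel. v1 always rounded down.
    return 5 * round(cents / 5)

def test_snap_to_nickel_rounds_to_nearest():
    # Was: assert snap_to_nickel(108) == 105  (the v1 round-down contract)
    # Changed because the design changed, and the commit message says so;
    # the test did not quietly bend to make a diff go green.
    assert snap_to_nickel(108) == 110
```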
AI helps most where structure exists
The model is strongest where the work has enough local structure to constrain the move. Boilerplate. Repetitive refactors. Test scaffolding. Migration mechanics. Consistency fixes.
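Test scaffolding is a concrete instance of that local structure: the table of cases constrains what a generated test can do. A sketch, assuming pytest; the function and cases are invented:

```python
import pytest

def normalize_ws(s: str) -> str:
    # Hypothetical function under test: collapse runs of whitespace.
    return " ".join(s.split())

@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("a  b", "a b"),
        ("  a\tb \n", "a b"),
        ("", ""),
        ("one", "one"),
    ],
)
def test_normalize_ws(raw: str, expected: str) -> None:
    assert normalize_ws(raw) == expected
```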
It is weaker where the task is mostly judgment. Deciding the contract. Choosing the trade-off. Defining the rollback plan. Deciding what counts as "done."
The tool still has its uses there. The engineer remains responsible for the frame.
Fast feedback over heroic repair sessions
The DORA research has said this for years: smaller changes and faster feedback loops are healthier. AI does not repeal that finding, only intensifies it.
When generation is cheap, the temptation is to defer judgment. The better move is the opposite. Run checks sooner. Fail sooner. Narrow sooner. Revert sooner when necessary. The agent helps you move faster; it should not convince you that verification has become optional.
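"Narrow sooner" has mechanical support in most test runners. With pytest, for instance, `--lf` reruns only what failed last time; the `tests/` path here is hypothetical:

```python
import subprocess, sys

# --lf: only the tests that failed on the previous run; -x: stop at the first.
sys.exit(subprocess.run(["pytest", "--lf", "-x", "-q", "tests/"]).returncode)
```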
Reversibility belongs in the design
Rollback should be easy. That means additive changes before invasive ones, clear file ownership, obvious commit boundaries, no mixed-purpose diffs, and the ability to back out one move without losing the whole session.
The codebase should not require guesswork to understand what happened on Tuesday afternoon.
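"Additive before invasive" can be as plain as a flag. A hypothetical sketch; the function names and the environment variable are invented:

```python
import os

def round_price_v1(cents: int) -> int:
    # Existing behavior, left untouched by the change.
    return cents - (cents % 5)

def round_price_v2(cents: int) -> int:
    # New behavior added beside the old one, not edited into it.
    return 5 * round(cents / 5)

def round_price(cents: int) -> int:
    # Backing out is one flag flip or one small revert, not an archaeology dig.
    if os.environ.get("USE_V2_ROUNDING") == "1":
        return round_price_v2(cents)
    return round_price_v1(cents)
```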
Operational responsibility doesn't transfer
The system does not care whether a regression came from a human, a code model, or an enthusiastic Saturday. What matters six weeks later: who owns the failure mode, whether the logging was good enough, whether the contract remained legible, whether the rollback path exists.
I still want a human-in-the-loop framing here. The reason is not that the model is weak. It is that responsibility needs a home.
The default I trust
For AI-assisted repair work, this is what I keep coming back to:
- reproduce the failure
- diagnose code vs test vs design drift
- patch the smallest plausible layer
- rerun type-check and tests immediately (sketched after this list)
- stop when the diff is no longer easy to audit
- split the remaining work instead of gambling on one more broad patch
The shape is unglamorous. It is the one I trust.
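The rerun step, written as one fail-fast command. A sketch that assumes mypy and pytest are installed; the `src` and `tests` paths are hypothetical:

```python
import subprocess, sys

def verify(paths: tuple[str, ...] = ("src", "tests")) -> int:
    # Type-check first, then tests; stop at the first broken layer.
    for cmd in (["mypy", *paths], ["pytest", "tests", "-x", "-q"]):
        code = subprocess.run(cmd).returncode
        if code != 0:
            return code
    return 0

if __name__ == "__main__":
    sys.exit(verify())
```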
What I'd ban
- deleting tests without an explicit rationale
- merging broad AI-generated refactors no one can explain
- changing code and tests together without stating which layer was wrong
- treating green CI as evidence the design is healthier
A passing suite is good news. Not a philosophy.