Alex Chernysh · Agentic behaviorist · Tel Aviv

Getting AI-Assisted Development to Green Without Breaking the Code

Repair loops, small diffs, test trust, and how to get CI back to green without trashing the codebase.

March 4, 2026 · 5 min read
Delivery · Testing · Workflow

The loud failures are easy. The dangerous version of AI-assisted development is the one where CI keeps going green because the team quietly lowered the meaning of green.

Working rule

Use AI to shorten the path to a correct change. Stop using it once it starts negotiating with reality.

Repair loop defaults

  • keep diffs small enough that a human can still reason about them
  • decide whether the code is wrong, the test is wrong, or both drifted from the intended behavior
  • stop treating green CI as success if trust in the suite is collapsing underneath it
  • keep ownership of contracts, failure modes, and reversibility with the engineer

Test until green

The system keeps changing files until the suite stops complaining.

  • tests are weakened casually
  • design intent drifts
  • the diff gets harder to review each round

Disciplined repair loop

The team diagnoses the mismatch before editing more files.

  • code, tests, and docs are reconciled deliberately
  • changes stay reversible
  • green still means something

Green is a trust condition

A repository is green when the code builds, the tests are honest, the contracts mean what the team thinks they mean, and the diff is understandable enough to own next quarter. "All checks pass" is necessary, never sufficient.

An AI tool can get a suite to pass in many ways. A few of them are useful. The rest weaken the codebase quietly enough that nobody notices until November.

Diagnose before patching

When a change breaks tests, the usual options are: the production code is wrong, the test is wrong, or both drifted from the intended behavior. Banal. Also where most AI-assisted repair loops go wrong.

A model proposes edits well. It is less reliable at deciding which layer should move unless the design intent is visible. So the loop has to start with diagnosis. What behavior was intended? Which file expresses that intention most credibly? Is the failure about logic, contract, environment, or stale test assumptions?

Without that step, the agent starts bargaining with the suite.

Small diffs are a safety mechanism

If a fix touches too many files, review quality drops and diagnosis gets worse. With AI, this matters more because the model produces plausible bulk edits faster than a human can audit them.
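The "too many files" judgment can be made mechanical. A minimal sketch, assuming `numstat` is the tab-separated text produced by `git diff --numstat` (the thresholds are illustrative, not a standard):

```python
def diff_is_auditable(numstat: str, max_files: int = 5, max_lines: int = 200) -> bool:
    """Return True if the diff stays within human-review limits.

    `numstat` is the output of `git diff --numstat`:
    one "added<TAB>deleted<TAB>path" row per changed file.
    """
    files = 0
    lines = 0
    for row in numstat.strip().splitlines():
        added, deleted, _path = row.split("\t", 2)
        files += 1
        # binary files report "-" for both counts; treat them as zero text lines
        lines += int(added) if added != "-" else 0
        lines += int(deleted) if deleted != "-" else 0
    return files <= max_files and lines <= max_lines

small = "12\t3\tsrc/api.py\n40\t8\ttests/test_api.py"
print(diff_is_auditable(small))  # 2 files, 63 changed lines
```

A guard like this belongs in a pre-merge hook, not in the editor: the goal is to force a split before review starts, not to argue during it.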

The sequence I prefer:

  1. reproduce failure
  2. isolate cause
  3. patch one layer
  4. rerun checks immediately
  5. continue only if the failure class is actually resolved

It looks slower on paper. In practice it is faster, because you skip the afternoon spent untangling a 14-file patch that solved two symptoms and introduced five others.
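The sequence above can be sketched as a loop that refuses to continue when a round fails to shrink the set of failing checks. Everything here is a stand-in: `run_checks` would wrap your real test command, and `patches` are the single-layer edits from step 3.

```python
def repair_loop(patches, run_checks, max_rounds=5):
    """Apply one small patch per round; rerun checks immediately;
    stop as soon as the suite passes, and abort if a round does not
    actually resolve part of the failure class."""
    failing = run_checks()  # step 1: reproduce the failure
    for round_no, patch in enumerate(patches, start=1):
        if not failing:
            break
        if round_no > max_rounds:
            raise RuntimeError("too many rounds: split the work instead")
        patch()                          # step 3: patch one layer
        still_failing = run_checks()     # step 4: rerun immediately
        if len(still_failing) >= len(failing):
            # step 5: the failure class was not resolved; stop here
            raise RuntimeError("round did not shrink the failure set")
        failing = still_failing
    return failing

# Usage with toy stand-ins for a real codebase:
state = {"fixed": False}
def fix_one_layer(): state["fixed"] = True
def checks(): return [] if state["fixed"] else ["test_feature"]
print(repair_loop([fix_one_layer], checks))  # []
```

The abort conditions are the point: the loop encodes "stop and split" instead of "one more broad patch."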

Tests are a contract, not a relic

There is nothing noble about preserving a broken test forever, and nothing disciplined about deleting one because it blocks momentum.

The standard I use:

  • change the test if it encodes behavior the system should no longer have
  • change the code if the test correctly describes intended behavior
  • change both only when neither file fully reflects the design anymore

The point is to keep the suite as a contract you can still believe.
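As a toy decision table, the standard reduces to one question per artifact: does it still match the intended behavior? The function and its return labels are hypothetical, just a way to make the rule explicit.

```python
def layer_to_change(test_matches_intent: bool, code_matches_intent: bool) -> str:
    """Name the layer allowed to move, given which artifact still
    reflects the intended behavior."""
    if test_matches_intent and not code_matches_intent:
        return "code"     # the test correctly describes intended behavior
    if code_matches_intent and not test_matches_intent:
        return "test"     # the test encodes behavior the system should drop
    if not test_matches_intent and not code_matches_intent:
        return "both"     # neither file fully reflects the design anymore
    return "neither"      # green and honest: nothing to repair

print(layer_to_change(test_matches_intent=True, code_matches_intent=False))  # code
```

The hard work is answering the two inputs, not running the table; that answer is exactly the diagnosis step a repair agent tends to skip.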

AI helps most where structure exists

The model is strongest where the work has enough local structure to constrain the move. Boilerplate. Repetitive refactors. Test scaffolding. Migration mechanics. Consistency fixes.

It is weaker where the task is mostly judgment. Deciding the contract. Choosing the trade-off. Defining the rollback plan. Deciding what counts as "done."

The tool still has its uses there. The engineer remains responsible for the frame.

Fast feedback over heroic repair sessions

The DORA research has said this for years. Smaller changes and faster feedback loops are healthier. AI does not repeal the point; it intensifies it.

When generation is cheap, the temptation is to defer judgment. The better move is the opposite. Run checks sooner. Fail sooner. Narrow sooner. Revert sooner when necessary. The agent should help you move faster; it must not convince you that verification has become optional.

Reversibility belongs in the design

Rollback should be easy. That means additive changes before invasive ones, clear file ownership, obvious commit boundaries, no mixed-purpose diffs, and the ability to back out one move without losing the whole session.

The codebase should not require guesswork to understand what happened on Tuesday afternoon.

Operational responsibility doesn't transfer

The system does not care whether a regression came from a human, a code model, or an enthusiastic Saturday. What matters six weeks later: who owns the failure mode, whether the logging was good enough, whether the contract remained legible, whether the rollback path exists.

I still want a human-in-the-loop framing here. The reason is not that the model is weak. It is that responsibility needs a home.

The default I trust

For AI-assisted repair work, this is what I keep coming back to:

  1. reproduce the failure
  2. diagnose code vs test vs design drift
  3. patch the smallest plausible layer
  4. rerun type-check and tests immediately
  5. stop when the diff is no longer easy to audit
  6. split the remaining work instead of gambling on one more broad patch

The shape is unglamorous. It is the one I trust.
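Step 4, "rerun type-check and tests immediately," can be a tiny runner that stops at the first failing command. The `mypy` and `pytest` invocations in the comment are assumptions for a Python project; substitute your own toolchain.

```python
import subprocess

def run_checks(commands) -> bool:
    """Run each check command in order; stop at the first failure.

    `commands` is a list of argv lists,
    e.g. [["mypy", "src"], ["pytest", "-x", "-q"]] (assumed toolchain).
    """
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return False  # fail fast: do not run later checks on a broken base
    return True
```

Calling this after every single-layer patch, rather than after a batch of edits, is what keeps the diagnosis in step 2 honest.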

What I'd ban

  • deleting tests without an explicit rationale
  • merging broad AI-generated refactors no one can explain
  • changing code and tests together without stating which layer was wrong
  • treating green CI as evidence the design is healthier

A passing suite is good news. Not a philosophy.

Related reading

  • Spec-driven development for agent workflows
  • How to run LLM evals in production

Further reading

  • Accelerate and DORA research
  • Continuous Delivery
