Getting AI-Assisted Development to Green Without Breaking the Code

Repair loops, small diffs, test trust, and how to get CI back to green without trashing the codebase.

March 4, 2026 · 6 min read · By Alex Chernysh

Delivery · Testing · Workflow
Jump to section

  1. Green state is a trust condition
  2. The first diagnosis matters more than the fourth patch
  3. Small diffs are not aesthetic. They are a safety mechanism.
  4. Tests are not sacred, but they are not disposable
  5. AI helps most where structure exists
  6. Fast feedback beats one heroic repair session
  7. Reversibility is part of the design
  8. AI-generated code still inherits operational responsibility
  9. A disciplined green path
  What I would forbid
  Related reading
  Further reading

The dangerous version of AI-assisted development is not the one that fails loudly. It is the one that gets to green by quietly lowering the meaning of green.

Working rule

Use AI to shorten the path to a correct change, not to negotiate with reality until the CI lights become decorative.

Repair loop defaults

  • keep diffs small enough that a human can still reason about them
  • decide whether the code is wrong, the test is wrong, or both drifted from the intended behavior
  • stop treating green CI as success if trust in the suite is collapsing underneath it
  • keep ownership of contracts, failure modes, and reversibility with the engineer

Test until green

The system keeps changing files until the suite stops complaining.

  • tests are weakened casually
  • design intent drifts
  • the diff gets harder to review each round

Disciplined repair loop

The team diagnoses the mismatch before editing more files.

  • code, tests, and docs are reconciled deliberately
  • changes stay reversible
  • green still means something

1. Green state is a trust condition

A repository is green when:

  • the code builds
  • the tests are honest
  • the contracts still mean what the team thinks they mean
  • the diff is understandable enough to own later

That is why "all checks pass" is necessary and still not sufficient.

An AI tool can get a suite to pass in many ways. Some of them are useful. Some of them are a form of polite vandalism.

2. The first diagnosis matters more than the fourth patch

When a change breaks tests, there are usually three possibilities:

  1. the production code is wrong
  2. the tests are wrong
  3. both drifted away from the intended behavior

This sounds banal. It is also where most AI-assisted repair loops go wrong.

A model is excellent at proposing edits. It is less reliable at deciding which layer deserves to move unless the design intent is visible.

That is why the repair loop should begin with explicit diagnosis:

  • what behavior was intended?
  • which file expresses that intention most credibly?
  • is the failure about logic, contract, environment, or stale test assumptions?

Without that step, the agent starts bargaining with the suite.
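The diagnosis step can be made mechanical enough to run before any edit. A minimal sketch in Python; the decision rule and its categories are my own illustration of the three possibilities above, not the output of any real tool:

```python
def triage(intended_behavior: str, code_matches_intent: bool,
           test_matches_intent: bool) -> str:
    """Hypothetical triage helper: decide which layer is allowed to
    move before anything is edited. Inputs are human judgments."""
    if not intended_behavior:
        return "stop: state the intended behavior first"
    if code_matches_intent and not test_matches_intent:
        return "change the test"
    if test_matches_intent and not code_matches_intent:
        return "change the code"
    if not code_matches_intent and not test_matches_intent:
        return "design drift: reconcile both, deliberately"
    return "no mismatch: suspect environment or stale setup"
```

The useful part is not the function. It is that the first argument forces the intent to be written down before the agent is allowed to propose a patch.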

3. Small diffs are not aesthetic. They are a safety mechanism.

If a fix touches too many files at once, review quality drops and diagnosis gets worse.

In AI-assisted development, small diffs matter even more because the model can produce plausible bulk edits faster than a human can audit them.

I prefer a sequence like this:

  1. reproduce failure
  2. isolate cause
  3. patch one layer
  4. rerun checks immediately
  5. continue only if the failure class is actually resolved

That sounds slower. It is often faster because you do not spend the afternoon untangling a 14-file patch that solved two symptoms and introduced five others.
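The five-step sequence above can be sketched as a loop that does exactly one patch per round and re-verifies before continuing. A hedged sketch; `reproduce`, `isolate`, `patch`, and `run_checks` are stand-ins for whatever your project actually uses:

```python
def repair_loop(reproduce, isolate, patch, run_checks, max_rounds=5):
    """One diagnosed cause, one small patch, one immediate re-check
    per round; stop at green or when the round budget runs out."""
    for round_no in range(1, max_rounds + 1):
        failure = reproduce()
        if failure is None:
            return f"green after {round_no - 1} patch(es)"
        patch(isolate(failure))   # touch one layer only
        run_checks()              # rerun checks immediately, every round
    return "stop: split the remaining work into a new change"
```

The budget is the point: when the loop runs out, the answer is to split the work, not to widen the patch.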

4. Tests are not sacred, but they are not disposable

There is nothing noble about preserving a broken test forever.

There is also nothing disciplined about deleting or weakening a test because it is blocking momentum.

A healthier standard is:

  • change the test if the test encodes behavior the system should no longer have
  • change the code if the test correctly describes intended behavior
  • change both only when the design evolved and neither file fully reflects it anymore

The point is not to defend tests emotionally. The point is to keep the suite as a credible contract.
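Here is that standard in miniature. A hypothetical example: the product decided that prices should round half up instead of truncating, so the code is right, the old test encoded obsolete behavior, and the test is the layer that moves, with the rationale recorded where the next reader will see it:

```python
from decimal import Decimal, ROUND_HALF_UP

def display_price(amount: float) -> str:
    """Format a price with half-up rounding to two decimal places."""
    return str(Decimal(str(amount)).quantize(Decimal("0.01"),
                                             rounding=ROUND_HALF_UP))

def test_display_price():
    # Updated deliberately: the spec changed from truncation to
    # half-up rounding. The old assertion expected "19.99" here;
    # the TEST was wrong, not the code.
    assert display_price(19.995) == "20.00"
    assert display_price(2.5) == "2.50"
```

The comment is the contract repair. Without it, the next engineer cannot tell a deliberate behavior change from a test weakened to get to green.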

5. AI helps most where structure exists

AI tools are strongest when the work has enough local structure to constrain the move:

  • boilerplate
  • repetitive refactors
  • test scaffolding
  • migration mechanics
  • obvious consistency fixes

They are weaker when the task is mostly judgment:

  • deciding the contract
  • choosing the trade-off
  • defining the rollback plan
  • deciding what counts as “done”

That does not mean the tool is useless there. It means the engineer remains responsible for the frame.

6. Fast feedback beats one heroic repair session

The external research on software delivery (Accelerate, the DORA reports) has been saying roughly the same thing for years: smaller changes and faster feedback loops are healthier.

AI does not repeal this. It intensifies it.

When generation is cheap, the temptation is to defer judgment. The better move is the opposite:

  • run checks sooner
  • fail sooner
  • narrow sooner
  • revert sooner when necessary

The agent can help move faster. It should not convince you that verification has become optional.

7. Reversibility is part of the design

A good AI-assisted workflow makes rollback easy.

That means:

  • additive changes before invasive ones
  • clear file ownership
  • obvious commit boundaries
  • avoiding mixed-purpose diffs
  • preserving the ability to back out one move without losing the whole session

The codebase should not need a séance to understand what happened.
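At least one of these properties is checkable rather than aspirational: whether a diff mixes purposes. A sketch; the purpose buckets and path conventions are illustrative assumptions, not a real tool's output:

```python
def diff_purposes(changed_files):
    """Classify changed paths into coarse purpose buckets so a
    mixed-purpose diff can be flagged before commit. The buckets
    are illustrative; adapt them to your repo layout."""
    def bucket(path):
        if path.startswith("tests/"):
            return "tests"
        if path.endswith((".md", ".rst")):
            return "docs"
        return "code"
    return {bucket(p) for p in changed_files}

def is_single_purpose(changed_files):
    # Code plus its tests is one purpose; code plus docs plus
    # unrelated cleanup is a diff you cannot back out in one move.
    return len(diff_purposes(changed_files) - {"tests"}) <= 1
```

A check like this makes "obvious commit boundaries" a pre-commit question instead of a retrospective one.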

8. AI-generated code still inherits operational responsibility

The system does not care whether a regression came from a human, a code model, or an enthusiastic afternoon.

What matters later is:

  • who owns the failure mode
  • whether the logging was good enough
  • whether the contract remained legible
  • whether the rollback path exists

That is why I still like a human-in-the-loop framing for engineering work. Not because the model is weak. Because responsibility still needs a home.

9. A disciplined green path

If I wanted a reliable default for AI-assisted repair work, I would keep it close to this:

  1. reproduce the failure
  2. diagnose code vs test vs design drift
  3. patch the smallest plausible layer
  4. rerun type-check and tests immediately
  5. stop when the diff is no longer easy to audit
  6. split the remaining work instead of gambling on one more broad patch

That is not the most cinematic workflow. It is the one I trust.
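Step 5, "stop when the diff is no longer easy to audit," can be a concrete gate rather than a feeling. A sketch with thresholds I made up; tune them per team:

```python
def diff_is_auditable(files_changed: int, lines_changed: int,
                      max_files: int = 5, max_lines: int = 200) -> bool:
    """Illustrative audit gate. Beyond these (made-up) thresholds,
    split the work instead of applying one more broad patch."""
    return files_changed <= max_files and lines_changed <= max_lines
```

The exact numbers matter less than the existence of a threshold that the repair loop cannot negotiate with.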

What I would forbid

A few anti-patterns deserve a plain ban:

  • deleting tests without explicit rationale
  • merging broad AI-generated refactors that no one can explain
  • changing code and tests together without stating which layer was wrong
  • using green CI as proof that the design is now healthier

A passing suite is good news. It is not a philosophy.
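The first ban is mechanically detectable from a diff's file statuses. A sketch; the status tuples mirror the shape of `git diff --name-status` output, and the filename convention is an assumption about your layout:

```python
def deleted_tests(name_status):
    """Given (status, path) pairs shaped like `git diff --name-status`
    output, return test files that were deleted. The naming convention
    ('test' in the filename) is an assumption; adapt it to your repo."""
    return [path for status, path in name_status
            if status == "D" and "test" in path.rsplit("/", 1)[-1]]
```

A CI step that fails when this list is non-empty and no rationale is present turns the ban into a guardrail instead of a norm.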

Related reading

  • Spec-driven development for agent workflows
  • How to run LLM evals in production

Further reading

  • Accelerate and DORA research
  • Continuous Delivery

Part of the public notes on grounded AI systems, retrieval, evals, and delivery under real constraints.