Prompt Engineering: From Phrasing to Policy

Prompt design now means response formats, examples, tools, and eval loops — not incantations.

January 29, 2026 · 5 min read · By Alex Chernysh
Agents · Prompting

Prompt engineering used to be treated like copywriting with caffeine. It is closer to policy design: define the task, define the boundaries, show the pattern, and verify the result against reality. The magic phrase, regrettably, never arrived.

Why this shifted

OpenAI, Anthropic, and Google all converged on the same practical lesson: prompts matter, but only inside a wider system that includes tools, context, examples, and evaluation.

Prompting is one layer
Good output quality is usually the result of several layers working together.

What actually matters

  • the instruction layer defines role and boundaries
  • examples compress desired behavior faster than prose
  • retrieved context decides what can be said truthfully
  • response formats decide what can be checked
  • evals decide whether the prompt change was worth it

1. The prompt is not the product anymore

In modern systems, the prompt is one layer among several:

  • system instruction
  • user task framing
  • retrieved context
  • examples
  • tool configuration
  • output schema
  • evals and downstream checks

That is why prompt engineering now feels less like wordsmithing and more like operating a multi-layer interface contract.

Prompt as phrasing

The team tweaks wording until the answer sounds better.

  • style dominates the discussion
  • examples are treated as optional
  • output shape is vague
  • regressions are discovered late

Prompt as policy

The prompt becomes one governed layer in a larger system.

  • instructions are narrow and explicit
  • examples show the target behavior
  • schemas keep outputs machine-checkable
  • evals decide whether the change helped

2. Clear instructions still win

The most durable advice in the field is still the least glamorous: make the instructions clear and specific.

The model should know:

  • what role it plays
  • what the task is
  • what constraints matter
  • what the output should look like
  • what to do when the task is underspecified

Clarity still beats cleverness.
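As a concrete illustration, one instruction block that covers all five points might look like this; the wording and the `SYSTEM_INSTRUCTION` name are examples, not a recommended canonical phrasing:

```python
# Illustrative system instruction covering role, task, constraints,
# output shape, and underspecified-input behavior. The wording is an
# example, not a canonical recommendation.
SYSTEM_INSTRUCTION = """\
Role: You are a support summarizer for internal incident reports.
Task: Summarize the report in at most three sentences.
Constraints: Use only facts from the report; do not speculate.
Output: Plain text, no markdown, no preamble.
If underspecified: Ask one clarifying question instead of guessing.
"""

for line in SYSTEM_INSTRUCTION.splitlines():
    print(line)
```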

3. Few-shot examples are usually better than one more paragraph of prose

Strong examples compress the desired behavior into something visible.

They teach the model:

  • the target format
  • the expected level of brevity
  • refusal behavior
  • citation posture
  • style boundaries

If the examples are good enough, some of the instruction prose can shrink.
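A sketch of how such examples can be assembled ahead of the real question; the message dicts mirror common chat-API shapes but are not tied to any particular SDK, and the example contents are invented:

```python
# Hypothetical few-shot turns demonstrating refusal and citation posture.
FEW_SHOT = [
    {"role": "user", "content": "What caused the outage on May 3?"},
    {"role": "assistant",
     "content": "The retrieved context does not cover May 3, so I can't say."},
    {"role": "user", "content": "Summarize doc A on retry policy."},
    {"role": "assistant",
     "content": "Retries are capped at 3 with exponential backoff [doc A]."},
]

def build_messages(system: str, question: str) -> list[dict]:
    """Prepend system + examples so the model sees the pattern before the task."""
    return [{"role": "system", "content": system}, *FEW_SHOT,
            {"role": "user", "content": question}]

msgs = build_messages("Answer only from retrieved context.", "What is the SLA?")
print(len(msgs))  # system turn + 4 example turns + 1 real question
```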

4. Structure beats vibes

Anthropic's prompt-engineering docs still point toward the same practical set of techniques: clarity, examples, explicit structure, role prompting, chain-of-thought, and prompt chaining where it actually helps.

I like prompts that make each layer legible:

<role>You are a grounded assistant for production AI operations.</role>
<constraints>
- Use only retrieved context for factual claims.
- If support is missing, say so directly.
</constraints>
<context>[retrieved evidence]</context>
<task>[user question]</task>
<output_format>[exact shape]</output_format>

This does not make the model perfect. It does make failure easier to diagnose.
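The template above can also be filled programmatically so the fixed parts stay fixed; a minimal Python sketch, with the tag names and sample values purely illustrative:

```python
def render_prompt(context: str, task: str, output_format: str) -> str:
    """Fill the tag-delimited template; only the bracketed slots vary per request."""
    return (
        "<role>You are a grounded assistant for production AI operations.</role>\n"
        "<constraints>\n"
        "- Use only retrieved context for factual claims.\n"
        "- If support is missing, say so directly.\n"
        "</constraints>\n"
        f"<context>{context}</context>\n"
        f"<task>{task}</task>\n"
        f"<output_format>{output_format}</output_format>"
    )

prompt = render_prompt("Doc 7: p95 latency is 820 ms.",
                       "What is the p95 latency?",
                       "One sentence with a citation.")
print(prompt.count("<context>"))  # 1
```

Keeping the constraints in code rather than in per-request strings means a constraint change is a reviewable diff, not a silent drift.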

5. Tools and response formats matter as much as wording

Prompt quality is often decided outside the sentence layer.

In production systems, the real difference often comes from:

  • which tools are available
  • how narrowly their contracts are defined
  • whether the output schema is strict enough to validate
  • whether the model is allowed to abstain

This is one reason prompt conversations now overlap with product requirements and interface design. A prompt that looks elegant but sits on top of loose tools and vague outputs will still behave loosely.
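A schema strict enough to validate can be enforced with nothing more than the standard library; the field names and the citation rule below are illustrative assumptions, not a standard contract:

```python
import json

# A minimal output contract: the model must return this JSON shape or abstain.
# Field names are illustrative.
REQUIRED_FIELDS = {"answer": str, "citations": list, "abstained": bool}

def validate_output(raw: str) -> dict:
    """Parse and check the model's reply against the contract.
    Raises ValueError so the caller can retry or route to a fallback."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not JSON: {e}") from e
    for name, typ in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(name), typ):
            raise ValueError(f"bad or missing field: {name}")
    if not obj["abstained"] and not obj["citations"]:
        raise ValueError("non-abstaining answer must cite something")
    return obj

ok = validate_output(
    '{"answer": "p95 is 820 ms", "citations": ["doc7"], "abstained": false}')
print(ok["abstained"])  # False
```

Note that the abstain path is part of the contract: the model is explicitly allowed to say it cannot answer, and the validator treats that as a legal output.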

6. Query transformation belongs in retrieval design, not in prompt mysticism

There is a temptation to treat every retrieval or routing improvement as a prompt breakthrough.

That is usually a category mistake.

Rewrite, decomposition, step-back prompting, and similar techniques can help retrieval materially, but they should be evaluated as search-control moves with latency costs, not as proof that someone discovered a more magical wording style.

If a query transformation helps, keep it. Just keep the reason honest.
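Treating a transformation as a search-control move can be as simple as timing it alongside retrieval; in this sketch `step_back` and `retrieve` are toy stand-ins for a model call and a search backend:

```python
import time

def step_back(query: str) -> str:
    """Toy stand-in for an LLM step-back rewrite; a real system
    would call a model here and pay its latency."""
    return f"What general facts are needed to answer: {query}"

def retrieve(query: str) -> list[str]:
    """Toy retriever; stands in for your search backend."""
    return [f"doc matching '{query[:30]}'"]

def search_with_transform(query: str, transform=None):
    """Treat the transform as a search-control knob with a measured cost."""
    start = time.perf_counter()
    effective = transform(query) if transform else query
    hits = retrieve(effective)
    latency_ms = (time.perf_counter() - start) * 1000
    return hits, latency_ms

hits, cost = search_with_transform("why did checkout fail?", transform=step_back)
print(len(hits))  # 1
```

Logging the per-query cost next to the retrieval gain is what keeps the reason honest: the transform stays only while the eval numbers justify the latency.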

7. Prompt changes should be evaluated like code changes

OpenAI's current eval guidance makes the operational consequence obvious: if the output matters, prompt changes belong in the same measurement culture as code changes.

Every material prompt change should answer:

  • what behavior is intended to improve?
  • which eval set should move?
  • what new failure mode might appear?
  • what does worse now look like?

If the change cannot be measured, the team is usually arguing about taste.
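A minimal version of that measurement culture fits in a few lines: run both prompt variants over a fixed case set and gate on the pass rate. Everything here, including the fake `model`, is a stand-in for a real harness:

```python
# Minimal sketch of a prompt-change eval: run both variants over a fixed
# set and compare pass rates. `model` fakes an LLM call for the sketch.
CASES = [
    ("What is the retry cap?", "3"),
    ("What is the p95 latency?", "820 ms"),
]

def model(prompt: str, question: str) -> str:
    """Fake model; a real harness would call the LLM with `prompt`."""
    return {"What is the retry cap?": "3",
            "What is the p95 latency?": "820 ms"}[question]

def pass_rate(prompt: str) -> float:
    """Fraction of cases where the expected answer appears in the output."""
    hits = sum(expected in model(prompt, q) for q, expected in CASES)
    return hits / len(CASES)

old, new = pass_rate("prompt v1"), pass_rate("prompt v2")
print(new >= old)  # gate the rollout on the eval, not on taste
```

Even a tiny fixed case set like this turns "the new prompt feels better" into a number that can regress visibly.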

8. A good prompt does four jobs

The prompt is good when it does all four of these at once:

  1. narrows the task
  2. narrows the output shape
  3. narrows the model's freedom under uncertainty
  4. leaves enough flexibility for the useful part of the work

Miss one of those and the system drifts.

Related reading

  • Which query transformation techniques actually help RAG?
  • How to run LLM evals in production

Further reading

  • Anthropic prompt engineering overview
  • OpenAI: Evaluation best practices
  • OpenAI: Agents guide
  • Google Gemini prompting strategies


Part of the public notes on grounded AI systems, retrieval, evals, and delivery under real constraints.

  • LLM Product Safety Without Theater
  • Building Agentic AI Systems That Hold Up
  • Preventing Hallucinations in LLM Systems