AI systems that hold up in production.

Available unless I'm pretending not to be. Forward-deployed work. Retrieval, evals, agent infra.

How the work actually looks

What does a forward-deployed AI engineer actually do?

Forward-deployed means I work inside your repo and your stack, not over a Slack channel from the outside. I write and ship the code that closes the gap between a working demo and a system you can leave running. Most of the time that is retrieval grounding, eval coverage, tool boundaries, and the rollback path. Deliverables are commits and runbooks, not slide decks.

When does multi-agent orchestration matter and when is it overkill?

Worth it when a task crosses several tool surfaces, runs longer than a single model context window can hold, or needs a signed audit trail per step. Overkill when a single well-instrumented agent with strict tool boundaries already does the job. Bernstein is built for the first case. If your workflow is one prompt and one tool, you do not need an orchestrator; you need better evals.

Why does retrieval grounding fail in production even when demos look fine?

Demo queries are friendly. Production traffic is not. The retriever silently returns plausible-but-wrong context, the model writes confident prose around it, and nothing in your stack catches it. Fixes I usually ship: hybrid retrieval, evidence-first answer shape with page-level citations, and an eval set with adversarial queries that pin the failure modes you actually see, not the ones the framework imagines.
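To make "hybrid retrieval" concrete, here is a minimal sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion. This is an illustration, not necessarily the fusion method used on any given engagement. The function name, the document ids, and the constant k=60 are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in.
    k=60 is a commonly used default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, which is the point: a chunk that only one retriever likes has to earn its place.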

When does on-prem agent orchestration matter?

When the workload touches regulated data, an air-gapped network, or a customer-side LLM gateway you cannot route around. Bernstein runs file-based state, deterministic scheduling, and per-agent credential scoping inside your perimeter. No outbound calls you did not authorise. The same orchestrator runs on a laptop, in CI, and on a hardened VM, which is usually what compliance review wants to see.
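As a rough illustration of what per-agent credential scoping can mean, the sketch below gives each agent only the environment variables it has been explicitly granted. It is not Bernstein's implementation. The agent names, variable names, and the AGENT_SCOPES table are all hypothetical.

```python
# Hypothetical scope table: which environment variables each agent may see.
AGENT_SCOPES = {
    "retriever": {"PATH", "SEARCH_API_KEY"},
    "deployer": {"PATH", "DEPLOY_TOKEN"},
}


def env_for_agent(agent, base_env):
    """Return a copy of base_env restricted to the agent's allowed keys.

    Unknown agents are rejected rather than given a default scope,
    so a typo in an agent name cannot leak credentials.
    """
    if agent not in AGENT_SCOPES:
        raise PermissionError(f"no credential scope defined for agent {agent!r}")
    allowed = AGENT_SCOPES[agent]
    return {key: value for key, value in base_env.items() if key in allowed}


# Hypothetical parent environment holding every secret.
base_env = {
    "PATH": "/usr/bin",
    "SEARCH_API_KEY": "search-secret",
    "DEPLOY_TOKEN": "deploy-secret",
}

retriever_env = env_for_agent("retriever", base_env)
print(sorted(retriever_env))
```

A dictionary like retriever_env can be passed as the env argument of subprocess.run, so the agent process never inherits secrets it was not granted.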

What does eval-driven delivery look like in practice?

Every change ships with a gold set of inputs, a deterministic judge, and a fail-closed gate in CI. New failure modes get captured as eval cases before the fix lands, so the regression cannot come back silently. The orchestrator records every agent turn, so when a metric moves you can replay the exact run that moved it. No vibes-based releases.
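A minimal sketch of the gate described above, assuming an exact-match judge and a 100% pass threshold. The gold cases, the toy system, and every function name here are hypothetical; a real judge and gold set would be specific to the workload.

```python
# Hypothetical gold set: fixed inputs with known-good outputs.
GOLD_SET = [
    {"input": "ping", "expected": "pong"},
    {"input": "hello", "expected": "world"},
]


def exact_match_judge(output, expected):
    """Deterministic judge: the same inputs always give the same verdict."""
    return output.strip() == expected.strip()


def run_gate(system, gold_set, threshold=1.0):
    """Return (passed, pass_rate, failures).

    Fails closed: an empty gold set or an exception inside the system
    counts as a failure, never as a silent pass.
    """
    if not gold_set:
        return False, 0.0, []
    failures = []
    for case in gold_set:
        try:
            ok = exact_match_judge(system(case["input"]), case["expected"])
        except Exception:
            ok = False
        if not ok:
            failures.append(case["input"])
    pass_rate = 1 - len(failures) / len(gold_set)
    return pass_rate >= threshold, pass_rate, failures


def toy_system(text):
    """Stand-in for the system under test; raises KeyError on unknown input."""
    return {"ping": "pong"}[text]


passed, pass_rate, failures = run_gate(toy_system, GOLD_SET)
exit_code = 0 if passed else 1  # a CI wrapper would hand this to sys.exit
print(passed, pass_rate, failures, exit_code)
```

The design choice that matters is the default: anything the gate cannot evaluate blocks the release instead of waving it through.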

Writing

all notes

i'm spinning like a hamster in a wheel but if it's something crazy then alex at this website dot com
