
Forecasting Without Prophecy: a plain-text discipline

Why I leave the future to astrology and reach for reference classes, premortems, and calibration logs instead. Disciplined uncertainty in plain text.

May 2, 2026

I am a fire Aries ruled by Mars, and even I will not pretend the future is a thing you can read off a chart. Calibrated uncertainty does the work prediction promises. The difference shows up six months later, when you can still grade what you wrote. Same question whether the deadline is a deploy, a hiring call, a relocation, or a difficult conversation with a peer.

The frame for this post

Precise prediction belongs to astrology. The job here is narrower. Text-only procedures that take a messy world, anchor it in base rates, generate competing scenarios with explicit probabilities, and store the result in a form you can grade later. Same discipline whether the surface is a backend migration, a Q3 review, a visa application, or a friend who has gone quiet.

What disciplined forecasting looks like

  • a forecast you cannot grade six months later is a feeling in costume
  • start from a reference class, not from the story of your case
  • premortems are the cheapest decision-quality intervention I have ever run
  • abstaining counts. conformal prediction sets are the formal name for it
  • a forecast without a falsifier and a leading indicator is unfinished
  • the log is the only honest grading sheet
The minimum viable forecast loop: five steps that turn a question into something you can actually grade six months later.

The illusion of point estimates

The most common forecasting mistake I see is not bias. It is false specificity in the answer.

A senior engineer says "I'm 70% confident this ships by end of Q2." The number sounds disciplined. There is no scoring history attached. The same engineer said "70%" last quarter, and the quarter before, and three out of four ended up landing in different buckets. The "70%" is a feeling reformatted as a probability.

The same trap shows up well outside a deploy window. A friend is "pretty sure" the new sleep regimen holds through the work-week. A cousin is "fairly confident" the visa will clear in time. A founder is "70% sure" the round closes in six weeks. None have a forecast log behind them.

Point probabilities without a forecast log are theatre. They wear the costume of rigour (the decimal, the percentage sign), but the calibration that would make them rigorous is missing.

The same trap shows up further down the AI stack. A retrieval system reports score: 0.83 and the team treats it as ground truth. A model reports confidence: 0.91 and the team builds an approval flow on top of it. Neither number is calibrated against actual outcomes. They are surface forms of a habit that does not exist yet.

The fix is not "stop using numbers." The fix is ranges, not points, until you have a calibration log that earns the precision. Twenty-to-thirty-five percent is defensible. Twenty-seven percent without a log is a costume.
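
Until the log exists, the cheapest sanity check on any reported confidence is a reliability table: bin the numbers, compare each bin's mean confidence with its observed hit rate. A minimal sketch in Python, assuming you can pull (confidence, outcome) pairs from somewhere; every name here is illustrative, not a real API:

```python
from collections import defaultdict

def reliability_table(pairs, bins=5):
    """Group (confidence, outcome) pairs into equal-width bins and
    compare mean reported confidence against the observed hit rate.
    For a calibrated source the two columns roughly agree."""
    buckets = defaultdict(list)
    for conf, outcome in pairs:
        buckets[min(int(conf * bins), bins - 1)].append((conf, outcome))
    for b in sorted(buckets):
        rows = buckets[b]
        mean_conf = sum(c for c, _ in rows) / len(rows)
        hit_rate = sum(o for _, o in rows) / len(rows)
        print(f"bin {b}: n={len(rows):3d}  "
              f"mean conf={mean_conf:.2f}  observed={hit_rate:.2f}")

# A model that reports confidence: 0.91 but posts a 0.70 observed
# rate in that bin is exactly the uncalibrated case above.
```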

Reference classes before stories

The second most common mistake is starting from the inside view. The story of the project, the story of the relationship, the story of the deploy.

Reference-class forecasting is the corrective. The original framing comes from Kahneman and Lovallo, and was operationalised most aggressively by Bent Flyvbjerg on infrastructure megaprojects, where insiders consistently overstated success and external base rates told a quieter, more accurate story.

The procedure is short; a sketch in code follows the list:

  1. Name two to four classes the case belongs to. Not metaphorical classes. Observable ones, with countable outcomes. "Solo-founder consumer-SaaS launches with no paid acquisition." "First-time hires from an outside referral at a company under 30 people." "Friends who have gone quiet for ten days after a tense exchange." "Indie novel projects taken from outline to a finished draft within twelve months."
  2. Estimate the prior odds of the target outcome from those base rates. Use ranges.
  3. Adjust modestly (at most thirty or forty percentage points) only if your case-specific evidence is strong and distinctive.
  4. If no relevant reference class exists, your confidence drops automatically.
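
A minimal sketch of steps 1 to 3 in Python. The classes and numbers are placeholders, not real base rates, and the envelope rule (lowest low, highest high) is one conservative way to combine them, not a standard:

```python
def prior_range(reference_classes):
    """Step 2: combine base-rate ranges from several reference classes
    into one prior range. Taking the envelope (lowest low, highest high)
    is deliberately conservative: disagreement between classes should
    widen the prior, not narrow it."""
    lows = [lo for _, lo, _ in reference_classes]
    highs = [hi for _, _, hi in reference_classes]
    return min(lows), max(highs)

# Step 1: observable classes with countable outcomes.
# These base rates are placeholders, not real statistics.
classes = [
    ("solo-founder consumer-SaaS launches, no paid acquisition", 0.05, 0.15),
    ("side projects this team actually shipped to paying users", 0.10, 0.25),
]
lo, hi = prior_range(classes)
# Step 3: adjust modestly from here, only on strong, distinctive evidence.
print(f"prior: {lo:.0%}-{hi:.0%}")
```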

The discipline lives in the order. Build the prior before you tell yourself the story. Once the story is in your head, every reference class will start to look "different in our case" and the outside view gets rationalised away. I have done this to myself on a relocation, on a hiring call, and on whether a parent would actually visit in spring. Writing the prior down before the narrative is the only thing that has ever stopped it.

This is the same discipline that makes eval suites useful in LLM systems: pick the reference set first, then look at the system, not the other way round.

Premortems and falsifiers

A premortem is the cheapest decision-quality intervention I have ever run. The technique is associated with Gary Klein's 2007 HBR piece, but the underlying discipline is older: a deliberate inversion of the usual kickoff posture. It works on a relocation, a difficult conversation with a colleague, or the question of whether to stretch the emergency fund onto a new lease.

The procedure, in plain text:

  1. Set the scene: it is six months from now, and the project failed.
  2. Each participant writes down, alone, the strongest specific reason it failed.
  3. Read the answers out. Cluster them.
  4. Each cluster becomes a falsifier or a mitigation in the live plan.

Two effects compound. First, asking "why did it fail" generates more honest hypotheses than "what could go wrong" because the failure is now an established fact in the imagined timeline. Nobody is debating whether it might happen, only how. Second, the failure modes that survive clustering become falsifiers: observations that, if they happen, mean the plan is broken. Falsifiers convert vague risk into a leading indicator you can actually watch for.

This pairs well with how I run feature flags and staged rollouts in agentic systems. The flag's "off" criteria are usually written casually. They should be written as falsifiers. "If the regression rate exceeds 4% over two weekly cohorts, this rollout has failed and we revert." That sentence is forecastable. "We'll keep an eye on regressions" is not. The same shape works outside a codebase. "If the antibiotic course produces nausea on day three, I switch back to the GP" is a falsifier. "I'll see how I feel" is not.
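
Because the "off" criterion is a falsifier, it is mechanical enough to check in code. A sketch of the rollout example with hypothetical weekly cohort counts, reading "over two weekly cohorts" as both consecutive cohorts exceeding the threshold:

```python
def rollout_failed(cohorts, threshold=0.04, window=2):
    """The falsifier above, as code: the rollout has failed if the
    regression rate exceeds `threshold` in each of the last `window`
    weekly cohorts. Each cohort is (regressions, total users)."""
    if len(cohorts) < window:
        return False  # not enough evidence yet; keep watching
    return all(r / t > threshold for r, t in cohorts[-window:])

# Hypothetical cohorts: (regressions, users). The last two weeks
# run at 5% and 6%, so the falsifier trips and we revert.
weeks = [(3, 100), (5, 100), (6, 100)]
if rollout_failed(weeks):
    print("falsifier tripped: revert the flag")
```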

Abstention as a feature

The third common mistake is answering when the honest answer is "I don't know yet, and here is what would change that."

Abstention is treated as failure in most organisations and in most personal conversations. In disciplined forecasting it is a feature. Two reasons.

Calibration. A forecaster who abstains on cases that are genuinely underdetermined posts better Brier scores than one who answers everything with a 50% confidence shrug.

Decision quality. The ask "what evidence would resolve this?" reframes the situation from "what do I think?" to "what do I need to look at next?" That is the question that actually moves projects forward, and the question that quietly de-escalates most family arguments about hypothetical futures.

The technical analogue worth knowing is conformal prediction, surveyed accessibly in Angelopoulos and Bates' 2021 tutorial. The output is a set of labels guaranteed to contain the truth at least (1 − α) of the time, rather than a single label with a confidence. When the set has one element, you have a confident prediction. When the set has six, the model is honestly saying "I cannot distinguish among these without more evidence". The set size is the abstention signal.
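
For the mechanics, a minimal split-conformal sketch for classification, following the recipe in that tutorial. `cal_probs` are model probabilities on a held-out calibration set; the function names are illustrative, not a library's:

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: score each held-out example by
    1 - p(true label), then take the ceil((n+1)(1-alpha))/n quantile
    of the scores."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(probs, qhat):
    """Every label whose score 1 - p stays within the threshold. The set
    contains the true label at least (1 - alpha) of the time,
    marginally, under exchangeability."""
    return np.where(1.0 - probs <= qhat)[0]

# len(prediction_set(...)) == 1 is a confident answer; a large set
# is the abstention signal described above.
```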

You don't need conformal infrastructure to apply the principle. The principle: make the size of your answer track the size of your uncertainty. A short single-line forecast for a confident case. A two- or three-branch forecast for a moderately known case. An explicit "I abstain because X, Y, Z would resolve it" for the underdetermined case. This sits next to my preference for product safety without theatre. Refusing a question is sometimes the strongest answer the system has.

Calibration is a habit, not an event

A forecast is incomplete until it is graded.

The grading metric I keep coming back to is the Brier score, summarised on the Wikipedia page. Lower is better. Zero is perfect. The convenient property is that the score decomposes into calibration and resolution. You can be wrong because your probabilities do not match observed frequencies, or because your forecasts do not separate likely from unlikely cases. Two different fixes.

In practice, you do not need fancy infrastructure to track calibration. A four-column markdown table is enough:

| Date       | Question                                | Forecast    | Outcome | Notes                |
|------------|-----------------------------------------|-------------|---------|----------------------|
| 2026-03-01 | Will candidate X accept by 03-15?       | 35-50%      | yes     | accepted on 03-09    |
| 2026-03-04 | Will deploy be clean on 03-08?          | 60-75%      | no      | DB pool exhausted    |
| 2026-03-09 | Will my friend reply within 48 hours?   | 40-55%      | no      | replied on day 5     |
| 2026-03-12 | Will the landlord renew on same terms?  | 55-70%      | yes     | small CPI bump only  |
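
Grading that table is a few lines. A sketch that scores each entry with the Brier score, using the range midpoint as the forecast purely for scoring; the entries mirror the rows above:

```python
def brier(entries):
    """Mean Brier score over graded forecasts: mean of (p - outcome)^2.
    Lower is better, 0 is perfect. The range midpoint stands in for p,
    which is honest only because the log exists to grade it."""
    return sum(((lo + hi) / 2 - out) ** 2
               for (lo, hi), out in entries) / len(entries)

log = [
    ((0.35, 0.50), 1),  # candidate accepted
    ((0.60, 0.75), 0),  # deploy not clean
    ((0.40, 0.55), 0),  # friend replied on day 5, not within 48h
    ((0.55, 0.70), 1),  # landlord renewed
]
print(f"Brier: {brier(log):.3f}")  # track this month over month
```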

Two months of entries and you start to see the systematic biases. Overconfidence on a topic you "know". Underconfidence on a topic you are afraid of. Point estimates that hide a wide range. Ranges that hide a missing reference class. The biases that show up against deploys also show up against landlords, friends, and gigs that need to break even on the door.

I keep the log as a live file. New entries take less than a minute to write. The discipline lives in the reading, on a slow Sunday once a month, with last month's predictions next to last month's outcomes.

If that sounds tedious, consider the alternative: a version of you who never learns whether your forecasts are right. Public superforecasters, profiled in the Good Judgment Project, do score above average on fluid intelligence and active open-mindedness. But the strongest single predictor of breaking into the top 2% was perpetual updating, roughly three times more predictive than IQ. They keep score.

Scenario discipline in plain text

The single-story narrative is the most expensive default in informal forecasting. "I think they're going to ghost us." "I think the round will close in six weeks." "I think the strike will be over by Friday." A single hidden-motive story replaces the work of generating competing hypotheses.

The fix is a scenario table, generated once, with three to five branches that meaningfully compete:

| Scenario              | Probability range | Strongest evidence for           | Strongest evidence against  | Leading indicator                       |
|-----------------------|-------------------|----------------------------------|-----------------------------|-----------------------------------------|
| Status quo continues  | 30–45%            | track record of inaction         | recent change in incentives | no decision in next 14 days             |
| Cautious improvement  | 25–35%            | small visible gestures last week | history of regressions      | one substantive ask answered            |
| Escalation or rupture | 10–20%            | pattern of ultimatums            | calmer recent tone          | unilateral action by the other side     |
| Strategic distance    | 10–20%            | resources are clearly limited    | dependency on this thread   | reduced engagement, not reduced contact |
| External shock        | 5–10%             | three competitors moving         | sector quiet otherwise      | a third party makes the question moot   |

That same shape covers a regulator's calendar, a job search, a health regimen, or whether a quiet family thread reopens on its own. You change the rows; the columns stay.

Two rules make the table earn its keep.

Probability ranges, not single numbers, unless you have a forecast log to back the precision. The midpoints of the ranges should sum to somewhere near 100%. They will never sum to exactly 100 (they are ranges), but if the column sums to 50% the scenario set is incomplete, not the arithmetic.
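
The midpoint check is one line of arithmetic against the table above:

```python
# Midpoints of the five scenario ranges above.
midpoints = [37.5, 30.0, 15.0, 15.0, 7.5]
print(sum(midpoints))  # 105.0 -- near 100 is fine; near 50 means
                       # a missing branch, not bad arithmetic
```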

One falsifier per row. A scenario without a falsifier is a wish or a fear, not a scenario. The leading-indicator column does the work: it tells you what observation, if you saw it next week, would shift the probability of that branch up or down.

A nice consequence of plain text is that you can paste it into a thread, hand it to a colleague, or feed it to a model for a second opinion without an export step. The same plain-text discipline that makes spec-driven development survive context switches makes scenario tables survive them too.

Try it on something small

This entire discipline collapses if you only ever apply it to high-stakes, low-frequency questions. You don't get calibration data. You don't develop the muscle. You don't learn which biases are yours.

Start with something small enough to grade within two weeks. A Q3 review, an offer letter, a connecting flight, a Saturday gig that needs to break even on the door. Anything where the outcome lands before you forget you forecasted it.

One-week forecasting starter

  • pick three questions you genuinely don't know the answer to
  • write the forecast as a probability range, not a point
  • name the reference class for each one
  • write a falsifier and a leading indicator for each
  • log them in a four-column table you can reread
  • grade them when the outcomes land

If two of the three are wildly off after two weeks, the lesson is in the gap, not in the embarrassment. Reread the original entries. Which step did you skip? Did you start with a story instead of a reference class? Did you give a point estimate instead of a range? Did you forget the falsifier?

The future stays unpredictable. The job is to build a calmer, slightly more honest interface to a messy world, and to leave behind enough of an evidence trail that next year's version of you can grade this year's forecasts and learn something.

Related reading

  • Building agentic AI systems that hold up
  • LLM evals in production
  • Hallucination prevention in LLM products
  • Product safety without theatre
  • Spec-driven development
  • AI-assisted development from a green state
  • Building legal answering systems

Further reading

  • Gary Klein, "Performing a Project Premortem", Harvard Business Review, 2007
  • Daniel Kahneman & Dan Lovallo; Bent Flyvbjerg — reference-class forecasting
  • Anastasios Angelopoulos & Stephen Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification", arXiv, 2021
  • Brier score — calibration metric primer (Wikipedia)
  • Glenn W. Brier, "Verification of Forecasts Expressed in Terms of Probability", Monthly Weather Review, 1950
  • The Good Judgment Project — superforecaster research
