Retries must be stateless, or the loop learns to cheat

The first live end-to-end probe against the Bank of England MMC article produced a 9,340-character draft. Sonnet hadn’t inserted rate figures the minutes didn’t contain. Factually, it was clean.

The calibration gate blocked it.

That was the right outcome. A false-block (blocking a correct draft) costs one retry. A false-pass ships a factual error to a UK regulatory calculator, where it sits until someone notices. The gate being conservative was a signal the machinery was working. The problem was what happened next: the retry loop needed a doctrine, or it would calibrate itself into noise.

That was the start of two days working out what that doctrine should be.

The pipeline

ukcalculators composes articles from structured source material: regulatory publications, committee minutes, statistical releases from bodies like the BoE and FCA. X.8.2 is the JSON envelope composer: it sits between Sonnet (which drafts prose from a compose-from-blocks prompt) and the calibration gate (which verifies factual claims at the paragraph level). Its role is to package Sonnet’s output into a structured envelope with enough metadata that the gate can emit a verdict with a reason, not a binary yes/no.
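A minimal sketch of what such an envelope might look like, assuming a per-paragraph shape. The names here (Paragraph, Envelope, build_envelope, to_json, and the field names) are hypothetical and are not taken from the X.8.2 code; the point is only that each paragraph carries a ref and its source ids, so a verdict can name what it is about.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Paragraph:
    ref: str           # hypothetical paragraph identifier, e.g. "p1"
    text: str          # drafted prose for this paragraph
    source_ids: tuple  # ids of the source blocks the paragraph draws on


@dataclass(frozen=True)
class Envelope:
    article_id: str
    paragraphs: tuple


def build_envelope(article_id, drafted):
    """Package (ref, text, source_ids) triples into an envelope."""
    paras = tuple(Paragraph(ref, text, tuple(src)) for ref, text, src in drafted)
    return Envelope(article_id=article_id, paragraphs=paras)


def to_json(envelope):
    # asdict recurses into the nested dataclasses; tuples become JSON arrays.
    return json.dumps(asdict(envelope), sort_keys=True)
```

With paragraph refs in the envelope, a gate verdict can point at "p3" and say why, instead of returning a bare yes/no for the whole draft.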

Wiring it into draft.py and redraft.py was a two-commit change. The first live probes surfaced a pack threading bug in how the envelope was assembled, not in the gate’s logic. Both motivating articles, BoE and FCA, drafted correctly and blocked correctly once the threading issue was resolved. The machinery was stable.

The instability was in the doctrine around the retry loop.

Stateless re-asks

A blocked draft triggers a re-ask: here is the paragraph, here is the source, revise. The naive implementation carries the previous gate verdict forward. Sonnet sees what the gate said last time and optimises its response against that signal.

This is where the loop calibrates itself into noise.

Sonnet is good at satisfying a stated criterion. Given a verdict, it produces output that satisfies that criterion, even when the underlying factual problem persists and has merely been rephrased in a way the gate misses. The loop runs; the gate eventually passes something; the pass is not evidence of accuracy. It is evidence that the re-ask context gave Sonnet enough signal to satisfy the gate’s expressed preference. Those are different things.

The fix: each re-ask is stateless. Sonnet gets the paragraph stub and the source data, not the previous verdict. The gate runs on the fresh output against a pre-registered pass criterion. The loop has no feedback path to exploit.
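The stateless rule can be sketched as a loop in which the prompt builder has no parameter for a verdict, so nothing the gate said can reach the next prompt. This is an illustration under assumptions, not the code in redraft.py: Verdict, build_reask_prompt, run_retries, draft_fn and gate_fn are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


def build_reask_prompt(stub, source):
    # Deliberately takes no verdict argument: prior gate output cannot leak
    # into the prompt because there is no parameter to carry it.
    return (
        f"SOURCE:\n{source}\n\n"
        f"PARAGRAPH STUB:\n{stub}\n\n"
        "Revise the paragraph using only the source."
    )


def run_retries(stub, source, draft_fn, gate_fn, max_attempts=3):
    """Re-ask up to max_attempts times; every attempt sees the same prompt."""
    verdicts = []
    for _ in range(max_attempts):
        prompt = build_reask_prompt(stub, source)
        draft = draft_fn(prompt)
        verdict = gate_fn(draft, source)
        verdicts.append(verdict)  # kept for the audit log, never fed back
        if verdict.passed:
            return draft, verdicts
    return None, verdicts
```

The verdicts are still collected, because a human debugging the loop needs them. They just never travel back to the model.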

That rule came out of PR #39, where the lesson was made doctrine rather than just another test row. The directive on the wiring PR was explicit: retries must be stateless, and S-2’s pass criterion must be pre-registered before it runs. Make the lesson structural, not a one-off guard.

Pre-registered pass criteria

The second rule concerns where the pass criterion comes from.

S-2’s pass criterion (the criterion the gate uses on the second pass) must be registered before the run, not derived from what the first pass produced. If it is derived from the first pass, it anchors to the first pass’s output, and any miscalibration in the first pass propagates forward. Output that the gate mis-categorised on pass one yields a second-pass criterion that inherits the same mis-categorisation. The loop converges on the wrong target.

Pre-registration means the criterion exists before Sonnet drafts. The re-ask is tested against something set in advance. The gate cannot learn to accept the wrong output by being shown it once.
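One way to make pre-registration mechanical is to fingerprint the criterion when it is registered and refuse to run the gate with anything that no longer matches. The sketch below assumes that approach; PassCriterion, its fields, Registry and fingerprint are hypothetical and say nothing about what S-2’s real criterion contains.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PassCriterion:
    name: str
    max_unsupported_claims: int   # hypothetical field
    require_source_match: bool    # hypothetical field


def fingerprint(criterion):
    payload = json.dumps(asdict(criterion), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class Registry:
    """Criteria are registered once, before any draft exists."""

    def __init__(self):
        self._registered = {}

    def register(self, criterion):
        if criterion.name in self._registered:
            raise ValueError(f"{criterion.name} is already registered")
        digest = fingerprint(criterion)
        self._registered[criterion.name] = digest
        return digest

    def check(self, criterion):
        # Reject a criterion that was never registered, or that was
        # changed after registration.
        expected = self._registered.get(criterion.name)
        if expected is None or expected != fingerprint(criterion):
            raise RuntimeError(f"{criterion.name} does not match its registration")
        return criterion
```

A criterion loosened after seeing the first pass has a different fingerprint, so check raises before the gate ever runs against it.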

The scoped snapshot check

PR #40 closed a separate calibration debt. The BoE iter2 runs had produced two snapshot-mismatch verdicts, with the gate blocking paragraphs it should have passed. Both were triaged under the 15-minute discipline: find the root cause, close the deviation, move on.

Root cause: the snapshot check was running against article-level context, not paragraph-level context. A change in one paragraph was producing mismatch signals in adjacent ones. Scoping the check to the paragraph under test resolved both verdicts. That merged as c18e11a before the main wiring PR landed.
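The difference between the two scopes is easy to see with plain digests. This is a sketch assuming snapshots are content hashes keyed by paragraph ref; the function names are hypothetical and the real check in PR #40 may compare something richer than text.

```python
import hashlib


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def article_snapshot(paragraphs):
    # Article-level scope: one digest over every paragraph, in ref order.
    # Any edit anywhere changes it.
    return digest("\n\n".join(paragraphs[ref] for ref in sorted(paragraphs)))


def paragraph_snapshots(paragraphs):
    # Paragraph-level scope: one digest per paragraph ref.
    return {ref: digest(text) for ref, text in paragraphs.items()}


def mismatched_refs(before, after):
    """Return only the refs whose own content changed."""
    old, new = paragraph_snapshots(before), paragraph_snapshots(after)
    return sorted(ref for ref in new if old.get(ref) != new[ref])
```

Revising "p2" changes the article-level digest, which is all an article-scoped check can report. The paragraph-scoped comparison names "p2" and leaves "p1" alone.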

The test that checked nothing

Post-wiring housekeeping surfaced a different class of problem.

test_verify_py_not_modified_by_wiring_pr existed to confirm the wiring PR hadn’t touched verify.py. A reasonable constraint while the PR was in review. The moment PR #43 squash-merged, the diff it was checking was empty by definition. Every subsequent CI run would pass the test. It could never fail, because it was checking nothing.

A failing test stops you. A permanently green test that checks an empty condition tells you the suite is healthy while making no actual claim. That is the more dangerous state: not a lie that triggers, but a silence that accumulates. The test was removed. Replay hashes replaced it: artifact-level verification that the wiring is intact, rather than diff-level verification that a specific PR didn’t touch a specific file.
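A replay hash, in the sense assumed here, is a digest of the artifact the wiring produces from recorded inputs, compared against a digest recorded when the wiring was known to be good. The sketch uses hypothetical names (artifact_hash, replay_matches, compose_fn) and assumes the artifact is a JSON-serialisable dict.

```python
import hashlib
import json


def artifact_hash(envelope):
    # Canonical JSON (sorted keys, fixed separators) so the digest depends
    # on content, not on dict ordering or whitespace.
    canonical = json.dumps(envelope, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_matches(recorded_inputs, compose_fn, recorded_hash):
    """Re-run the composer on recorded inputs and compare artifact digests."""
    return artifact_hash(compose_fn(recorded_inputs)) == recorded_hash
```

Unlike the diff-based test, this check can fail after the merge: any later change that alters what the composer emits for the recorded inputs changes the digest.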

The gate stays clean

The gate’s logic was untouched through all of this. That separation matters.

The calibration changes went into the loop around the gate: the re-ask structure, the snapshot scope, the criterion registration. The gate itself verifies factual claims; that verification logic is the known quantity. The loop is where calibration debt accumulates, because the loop is where you interact with an LLM, and LLMs are capable of satisfying criteria. The gate stays clean by staying out of the loop’s feedback path.

X.8.4 added paragraph-level calibration: paragraph refs, sentence-level verification, and a citation-tail metadata exemption for the trailing cite syntax that the gate had previously treated as factual claims. Those went into the gate’s verification logic, not into the retry structure. The boundary held.
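The citation-tail exemption can be illustrated by splitting a paragraph into a body to verify and a tail to treat as metadata. The actual cite syntax is not shown here, so the "[cite:...]" marker below is purely an assumed stand-in, as are the names CITE_TAIL and split_citation_tail.

```python
import re

# ASSUMED syntax: one or more "[cite:...]" markers at the very end of a
# paragraph. The real trailing cite syntax may differ.
CITE_TAIL = re.compile(r"(?:\s*\[cite:[^\]]*\])+\s*$")


def split_citation_tail(paragraph):
    """Return (body, tail); only the body should be checked as factual claims."""
    match = CITE_TAIL.search(paragraph)
    if match is None:
        return paragraph, ""
    return paragraph[:match.start()], match.group().strip()
```

Only markers at the end of the paragraph are exempted. A marker in the middle of a sentence stays in the body and is verified along with everything else.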

What carries forward

Three rules, in order of leverage:

  • Each re-ask to Sonnet is stateless: no previous gate verdict in context.
  • Pass criteria are registered before the run they govern, not derived from what that run produces.
  • Snapshot checks are scoped to the paragraph under test, not the full article.

The fourth (shelf-life tests are removed at merge time, not left to silently atrophy) is housekeeping. But it is the kind of housekeeping that compounds. A test suite with permanently green vacuous assertions is a test suite you cannot trust at the edges.

The BoE MMC article that started this is still working through the retry loop. The gate blocking a clean draft was a good signal. A stateless loop running against a pre-registered criterion has no feedback path to game. It will either pass on genuine accuracy or keep blocking until the source alignment is fixed. Getting there from a false-block is two retries. Getting there from a loop that has learned to cheat the gate is a rewrite.
