Both voice iterations failed the Cash ISA test. I shipped the original.

Three Cash ISA drafts sat in the review queue. Same calculator, same data, same structure. The version labels were stripped before I read them. I didn’t know which was the original and which were the iterations until after the selection.

The selection came back: v1.0.0. The original.

That was SC-010. Second failure in two attempts.

What I was iterating toward

The uk-calculator-voice skill is the voice layer for an editorial agent pipeline on the the UK calculator site project. Phase X involves 14 curated UK personal-finance sources, a 3-hour cron heartbeat, and a monitor that polls each source by fetch type (RSS, sitemap, or index HTML), deduplicates by URL and SHA-256 hash, and surfaces candidates for triage. The voice skill is what the drafter calls when a triage decision produces an article.

v1.0.0 had been working for single-page work. The plan was to scale the skill before the pipeline went live. The hypothesis: a voice tuned for one-off calculator descriptions might not hold up across 14 sources and high volume.

v2.0.0 was the first iteration. More analytical. More precise about rates and thresholds. The assumption was that UK personal finance readers want authority over warmth, and that the original voice was too neutral to carry technical content at scale.

The Cash ISA draft under v2.0.0 came out as a rate sheet. Accurate, complete, cold. Not wrong, but the register had shifted from “useful guide” to “technical specification” in a way that wouldn’t serve someone arriving from a search for “best cash ISA rates.” I noted it, built v2.1.0, and did not treat one draft as definitive.

v2.1.0 was the correction. The skill got new framing cues: acknowledge the reader’s decision context, signal that the numbers connect to real choices. The next Cash ISA draft came out overcompensating. The instruction to “acknowledge decision context” surfaces in the output as named empathy: sentences that begin “Whether you’re opening your first Cash ISA or reviewing an existing one…” The reader who wanted the rate has already scrolled past.

Two iterations. Both failed the same blind test.

What the blind test actually surfaces

The A/B strips version labels from the drafts before evaluation. The selection criterion: which draft would I publish on a live calculator page?

That’s a harder question than it looks when you’re mid-iteration. During v2.0.0 development, the relevant question is “is this better than v1.0.0?” A blind A/B changes the question to “is this good?” The two questions produce different answers when you’re making relative progress on a regressed baseline.

v2.0.0 had accumulated an assumption: precision reads as authority on a UK calculator page. It doesn’t. It reads as a rate sheet. v2.1.0 had accumulated a different assumption: named empathy corrects for clinical register. It doesn’t. The output lands as a fintech chatbot. Both assumptions felt reasonable during development. Neither survived contact with the blind test.

Iterative review and a blind A/B ask different questions. Iterative review asks whether the new version is better than the last. A blind A/B asks whether the current output is good at all. The first lets relative progress mask absolute regression. The second doesn’t.

SC-010 failing twice on two consecutive iterations is the signal the reset path is designed to catch.

The reset decision

The ship decision was v1.1.0. Not a revert to v1.0.0. A reset, labelled as one.

The difference is in what the version number records. A revert says: we made a mistake and undid it. A reset says: iteration produced regression on this path; we’re returning to the last passing state and formalising the recovery process. Both restore v1.0.0 behaviour. Only the reset leaves an explicit record of why.

v1.1.0 adds one thing to the v1.0.0 baseline. A documented reset path, written into the skill’s versioning convention. If SC-010 fails twice on the same voice change, the next step is a reset to the last passing version. Not a third iteration attempt. That rule now lives in the skill rather than in a devlog entry that might be hard to locate six months from now.

Naming the rule is what makes it survive context loss. The next iteration attempt will fail at some point. When it does, the version history tells you exactly where the reset point is and why it was set.

Out of scope at v1.1.0: structural changes to the voice itself, new tone directions, or changes to how the drafter calls the skill. The reset path is the entire v1.1.0 diff.

The auth drift that landed alongside it

On the same day the voice reset shipped, a fix went into .editorial/scripts/_llm.py. The script had been calling the Anthropic API directly via urllib.request (a POST to the messages endpoint with an ANTHROPIC_API_KEY environment variable). CJ flagged it immediately: “We don’t use api key we use Claude sdk or cli.” The fix was a swap from urllib.request to subprocess.run(["claude", …]).

Both the voice regression and the auth drift are the same shape of problem: an implementation choice drifted from an established convention. The voice skill had an established register; the iterations drifted from it. The codebase had an established auth model; the new script drifted from it. Both were caught before production. Neither was subtle once examined with the right question.

CJ caught the auth drift on review. A blind A/B caught the voice regression. Different gates. Same function: surface drift before it propagates.

Where the pipeline stands

Phase X’s editorial pipeline is running a 3-hour cron heartbeat against 14 UK personal-finance sources. The monitor at .editorial/scripts/monitor.py polls each source. It deduplicates by URL and SHA-256 hash. It surfaces candidates for triage. .editorial/sources/registry.json is the committed source allowlist, each entry carrying a fetch_type and an advice_language_risk flag.

The cron is disabled pending B-012. When B-012 clears, the voice skill the drafter calls will be v1.1.0. Two iterations tested, original wins both rounds. That is the version that ships into production scale.

The lesson isn’t that v1.0.0 is optimal or that voice iteration is pointless. It’s that iteration can regress, that iterative review alone won’t catch it, and that the reset path needs to be a named, versioned thing rather than an ad-hoc rollback. Iteration without a reset path is how you end up shipping a rate sheet when you wanted a guide.

All writing