Stop retrying the prompt: what the brief ledger taught the image harness

The second session ran the harness with --n 2 to produce hero candidates for five posts: gmail-tractable, sender-reputation, six-sprints, classifier-is-the-product among them. The candidates came back. Some were wrong in exactly the way candidates from the first session had been wrong. Different brief, different post, same structural mistake. The brief had been adjusted the first time. The adjustment hadn’t survived.

That was the moment the wrong mental model became visible.

The harness

The image harness is a CLI that generates hero illustrations for captainrandom.co.uk. It reads a full post, constructs a visual-poet brief grounded in the post’s actual thesis, and calls a Gemini image model. One command; one brief; one set of candidates. The brief is the artefact: not a free-text prompt field but a structured specification.

The problem wasn’t the structure. It was what happened when a render came back wrong.

The retry reflex

The instinct is to adjust the prompt and run again. Free-text field, instant re-run, no visible cost. Tweak, retry, evaluate. If the image is better, move on.

That reflex produces a better image for this post, this time. It doesn’t produce a harness that gets better. Every adjustment is local. The insight lives in the session. The next batch run starts from the same prior. You re-learn the same lesson. Or you re-discover a model tendency you’d already worked around once and forgot.

The batch run confirmed it. Same mistake, different post. The first correction hadn’t been written anywhere it could be read.

The iteration loop

The third session formalised what was already forming. The method: iterate → review → learn. One post per cycle. Run the harness, read the output, extract the lesson, harden the instruction into the brief so it doesn’t need re-learning.

Two rules came out of this period. The story rule: the image should depict a moment, not a concept. It came from a batch of renders that were technically competent and looked like diagrams: accurate illustrations of an idea rather than images that created a scene. The lighting rule came from noticing that several candidates across unrelated posts shared the same flat overlit quality. Different briefs, same flaw. That’s a model tendency, not a prompt miss. You fix it once in the brief or you fix it every run.
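The "different briefs, same flaw" test can be made mechanical. What follows is a minimal sketch, not the harness's actual code: it assumes review notes are kept as a mapping from post slug to a set of flaw tags, and both the function name recurring_flaws and the tags themselves are hypothetical. A flaw that shows up across unrelated posts is a candidate for a ledger rule; a flaw seen once is more likely a prompt miss.

```python
from collections import Counter


def recurring_flaws(reviews: dict[str, set[str]], min_posts: int = 2) -> list[str]:
    """Return flaw tags seen in at least `min_posts` different posts.

    `reviews` maps a post slug to the set of flaw tags a human reviewer
    assigned to that post's candidates. The tag vocabulary is illustrative.
    """
    # Each post contributes a tag at most once, because its tags are a set.
    counts = Counter(flaw for flaws in reviews.values() for flaw in flaws)
    return sorted(flaw for flaw, seen in counts.items() if seen >= min_posts)


# Illustrative review notes, not real data from the sessions.
notes = {
    "post-a": {"flat-overlit", "diagram-like"},
    "post-b": {"flat-overlit"},
    "post-c": {"cropped-subject"},
}
print(recurring_flaws(notes))
```

With these made-up notes, only the tag shared by two posts is reported. The threshold is a judgment call; two posts is simply the lowest bar at which "same flaw, different brief" can be observed at all.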

The brief ledger is where both rules live. Not a doc, not a comment. Executable. The harness reads the ledger on every call and injects the constraints into the brief before the model sees it.

That’s the distinction that matters. A lesson encoded in the ledger is applied on every run, including after a week gap, including when the person running the harness didn’t learn the lesson themselves. A lesson that stays in a session note requires a human to remember to apply it. The ledger removes that requirement.
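To make "executable, not a doc" concrete, here is a minimal sketch of a ledger being read and injected on every call. It is an illustration under assumptions, not the harness's implementation: the JSON file format, the LedgerRule fields, and the function names load_ledger and apply_ledger are all hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LedgerRule:
    name: str          # short label, e.g. "story" or "lighting"
    instruction: str   # the sentence injected into the brief
    learned_from: str  # which session or post surfaced the lesson


def load_ledger(path: Path) -> list[LedgerRule]:
    """Read every rule from a JSON ledger file (hypothetical format:
    a list of objects with name, instruction, learned_from)."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    return [LedgerRule(**entry) for entry in raw]


def apply_ledger(brief: str, rules: list[LedgerRule]) -> str:
    """Append every ledger constraint to the brief before the model sees it."""
    if not rules:
        return brief
    constraints = "\n".join(f"- {rule.instruction}" for rule in rules)
    return f"{brief}\n\nConstraints:\n{constraints}"


# Illustrative rules paraphrasing the two lessons described above.
rules = [
    LedgerRule("story", "Depict a moment, not a concept.", "session 3"),
    LedgerRule("lighting", "Avoid flat, overlit scenes.", "session 3"),
]
print(apply_ledger("Hero illustration brief for the post.", rules))
```

The point of the shape is that apply_ledger sits in the code path: nobody has to remember the story rule for it to reach the model. The learned_from field is optional bookkeeping, useful when a later session has to correct an earlier rule.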

The third session also required a correction to an earlier rule. That’s expected. A brief instruction that holds for one post doesn’t always hold for ten. The iteration loop is how you find out.

Imagen and NB2

The second session had surfaced a constraint with the Imagen rewriter, an intermediate component in the pipeline. By the fourth session, there was a reason to evaluate whether the layer was still necessary.

Nano Banana 2 is gemini-3.1-flash-image, preview slug gemini-3.1-flash-image-preview. Run against the current brief config with the ledger constraints in place: 10 out of 10 clean renders. No rewriter, no intermediate transformation layer, no workarounds.

The rewriter was eliminated. Not deprecated. Removed. With the ledger constraints in place, NB2 was already rendering the brief format correctly as written. Keeping a transformation layer on top would be indirection without upside. The Imagen constraint was real; under NB2 it was irrelevant.

Out of scope: evaluating the rest of the current Gemini image line-up. NB2 was clean; the work stopped there.

The v1 backfill

Earlier posts on the site had been generated under v1-era config, before the ledger had accumulated its current constraints. The classifier cycle closed that gap. The last post on v1 imagery ran through the full current-config pipeline: fresh Pro brief, current ledger exclusions, NB2 render.

The run came back clean. A brief generated under the current config for a post already processed under the old config produced no new failure modes. No additional rules were needed. The ledger was stable for the current corpus.

Shipping it

The harness work had accumulated on a stale writing branch, around 30 commits behind origin/main. It had started as a writing-context branch and picked up harness infrastructure alongside post drafts, which was the wrong separation. Getting to a clean PR meant extracting harness code and assets from the writing history, rebasing onto fresh origin/main, and landing as a six-commit PR.

The result: hero illustrations are live on all 22 posts. The article pipeline’s post-draft hook is wired, so new articles coming through the pipeline get a harness run automatically after draft. The brief ledger’s constraints apply to every future post without any human remembering to use them.

That’s the point. The lessons from five sessions of iteration are in every future run. The harness is better not because anyone applies lessons manually. It’s better because the lessons are in the code path.

What this breaks

Two things worth naming.

First, the ledger accumulates. Each new constraint is tested against current output, but the interaction between constraints isn’t systematically tested. A new rule could suppress a correct output an existing rule was producing. The only check is running a batch and reviewing. There’s no automated regression test for image quality.

Second, the story rule and the lighting rule came from one person’s read of bad outputs at a specific time. They’re correct for captainrandom.co.uk’s visual language as it existed across these five sessions. That’s enough. It’s not a claim about universal failure modes in AI image generation.

Both are acceptable. The brief ledger’s value isn’t completeness. It’s that corrections compound: a mistake fixed once doesn’t repeat. That property survives both limitations.

The reflex worth building

The retry reflex is fast and locally satisfying. The brief-ledger reflex is slower (write down the lesson, encode it in the brief, verify it survives the next run) but it compounds. By the fifth session, the harness was generating clean candidates for posts that would have required several retries in the first session. Not because the model changed. Because the brief was better, and the better brief was built from corrections made durable rather than local.

That’s the property worth targeting in any system calling a generative model repeatedly. Not better prompts. Durable lessons.