Two correct cancellations, two SLA breaches, four manual ships

At 08:00:20Z on 13 June, the article pipeline publisher returned None. The 09:00 BST slot was already gone.

G-L3 (the duplicate-check gate) had correctly identified queue id=35 as a duplicate of an already-shipped article. The match was PR #105. The block was accurate. The pipeline had done exactly what it was supposed to do. The slot was still empty.

That was the first breach of the day.

What was supposed to happen

The editorial SLA for captainrandom.co.uk is one article at 09:00 BST every day. The mechanism: a scheduled row in dvlaw.db enters its publication window, G-L3 checks for duplicate topics and fingerprint collisions, the publisher commits the MDX to the repo, GitHub Actions handles the rest. The 06:00 BST digest had fired clean that morning (10 candidates posted to Telegram), so the queue going into 09:00 was not empty.

B-016 had shipped at 02:55 BST the same morning. That was the alias-map structural fix outstanding since 2026-06-12, when an integration test exposed that _derive_topic_key in duplicate_check.py was over-aggregating: pulling in platform-generic tokens as topic discriminators and incorrectly treating different articles as duplicates of the same topic. Approach B won: a hard _GENERIC_PLATFORM_TOKENS frozenset in _derive_topic_key. Shipped as the editorial pipeline PR #203 (735b42e), merged at 02:55 BST via subagent.

B-016 was the fix for a class of incorrect blocks. id=35 was not that class.

Direct replay of B-016’s corrected logic against id=35 returned blocked=False. The alias-map fix had no bearing on this row. The duplicate match against PR #105 was a separate, accurate finding: a genuine same-article collision, G-L3 doing its job. B-016 was working correctly. The 09:00 slot was still empty because nothing was in line to take id=35’s place.

What correct cancellation looks like at the slot level

When G-L3 blocks a row, the publisher exits. Nothing promotes into the vacant slot. The next scheduled row fires at its own window.

The pipeline’s correctness model is row-level: was this specific row processed accurately? The SLA’s correctness model is slot-level: did something publish at 09:00? These are different questions. The pipeline had no mechanism to translate between them.
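The two questions can be sketched as two separate predicates. This is an illustration under stated assumptions: RowOutcome, row_correct, and slot_filled are hypothetical names, and the should_block field stands in for ground truth the real pipeline only learns through replay.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RowOutcome:
    row_id: int
    blocked: bool        # what the gate decided for this row
    should_block: bool   # ground truth, e.g. established by a later replay


def row_correct(outcome: RowOutcome) -> bool:
    """Row-level question: did the gate make the right call on this row?"""
    return outcome.blocked == outcome.should_block


def slot_filled(outcomes: list[RowOutcome]) -> bool:
    """Slot-level question: did any row attempted for this slot publish?"""
    return any(not o.blocked for o in outcomes)


# The 09:00 slot: one row attempted, blocked for an accurate reason.
nine_am = [RowOutcome(row_id=35, blocked=True, should_block=True)]
```

Here row_correct(nine_am[0]) is True while slot_filled(nine_am) is False: every row-level check passes and the slot is still empty.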

Manual recovery filled the 09:00 gap. PR #115 (newsletter hotfixes, a draft already in the system) shipped by hand.

The second breach

id=37 cancelled at 11:00:08Z. The 12:00 BST slot.

Direct replay of B-016 against id=37 returned blocked=False. G-L3 was not the cause. The cancellation had a different origin: one hour before the window, not inside it. The 12:00 slot was empty before the publisher ran.

Manual recovery again. PR #118 (the idea park installer) shipped by hand.

Two breaches. Two different cancellation causes. Both times the duplicate-detection logic was verified correct. Both times the SLA required a human to close the gap.

What the day cost

Four articles shipped on 13 June. All manual.

  • 09:00 BST: PR #115 (newsletter hotfixes)
  • 12:00 BST: PR #118 (the idea park installer)
  • 15:00 BST: PR #120 (retries-must-be-stateless)

Each manual ship is an interrupt: diagnosing the cancellation, identifying a substitute draft, verifying it clears the gates, committing, confirming the deploy. Four of those in a single day is the better part of a sprint’s worth of time, none of it architectural.

The day also produced two sprints of architectural work (B-018 SLA recovery, B-019 citation policy) alongside the manual ships. That pairing is the pattern when a pipeline operates under a real SLA before its recovery mechanisms are complete. The human becomes the recovery mechanism.

B-020: substitute pool from future

The architectural gap is clear once both breaches are in view. A cancelled scheduled row leaves the slot empty. The queue contains future-scheduled rows: articles not yet due but already past triage. A substitute mechanism that promotes from the future pool when the scheduled row is cancelled covers the failure mode without requiring human judgement at interrupt time.

B-020 ships that: when a scheduled row is correctly cancelled at its window, the pipeline walks forward through the queue, identifies the next eligible row, and promotes it into the empty slot. The substitute inherits the window and runs through G-L3. If the substitute is also blocked, the pipeline recurses: it continues the walk and tries the next eligible row.
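The walk described above can be sketched as follows. This is not B-020’s implementation: QueueRow, pick_substitute, the status values, and the is_blocked callback (standing in for G-L3) are all hypothetical, and the sketch uses a loop for the retry on a blocked substitute.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class QueueRow:
    row_id: int
    scheduled_for: datetime
    status: str = "scheduled"


def pick_substitute(
    slot: datetime,
    queue: Iterable[QueueRow],
    is_blocked: Callable[[QueueRow], bool],
) -> Optional[QueueRow]:
    """Promote the earliest future-scheduled row that clears the gate.

    Returns a copy of that row with the vacant slot's window, or None
    if every future row is blocked or the future pool is empty.
    """
    future = sorted(
        (r for r in queue if r.status == "scheduled" and r.scheduled_for > slot),
        key=lambda r: r.scheduled_for,
    )
    for row in future:
        if not is_blocked(row):  # the substitute still has to pass the gate
            return replace(row, scheduled_for=slot, status="promoted")
    return None


# Usage with made-up rows: 09:00 BST on 13 June is 08:00Z.
utc = timezone.utc
slot = datetime(2026, 6, 13, 8, 0, tzinfo=utc)
queue = [
    QueueRow(103, datetime(2026, 6, 15, 8, 0, tzinfo=utc)),
    QueueRow(101, datetime(2026, 6, 14, 8, 0, tzinfo=utc)),
    QueueRow(102, datetime(2026, 6, 14, 11, 0, tzinfo=utc)),
]
picked = pick_substitute(slot, queue, is_blocked=lambda r: r.row_id == 101)
```

In this example the earliest future row (101) is blocked, so the walk moves on and promotes 102 into the vacant window. Returning None when nothing qualifies keeps the empty-pool case explicit, so a caller can still escalate to a human.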

B-020 was verified working against live queue state before the day closed. The substitute selection logic is the same eligibility check the manual recovery was running each time, now automated.

The asymmetry this exposes

There is a class of pipeline failure that is easy to miss: locally correct, globally incomplete. G-L3 made the right call on 13 June. B-016 shipped correctly and the replay verified it. The duplicate detection is working. None of that is the problem.

The problem is that correctness at the gate is necessary but not sufficient for correctness at the slot. The gate can reject a row for accurate reasons. If nothing happens next, the slot is empty. A reliable editorial SLA requires both: accurate cancellation logic and a recovery path that fires automatically when cancellation does.

Manual recovery works. It is, definitionally, not automatic. It does not compound into a process: each incident is diagnosed and resolved independently, context switched in from whatever else was running. What B-020 computes, a human was already computing four times over on 13 June. The question is whether that logic lives in the pipeline or in the person watching the queue.

What this is not

It is not an argument that the blocking logic should be loosened. The blocks on 13 June were accurate. B-019’s citation policy and B-016’s alias-map fix are both load-bearing. The pipeline should not have shipped either cancelled article. The SLA breach is not evidence that G-L3 is too strict.

It is an argument that a gate without a recovery path is half a system. One half is built and verified. The other half shipped on 13 June as B-020.
