I read the third article on the live site and the pacing was wrong. Not the logic, not the structure — the rhythm. Too many parenthetical pivots. I counted em-dashes as I went. The style guide caps em-dash density at ≤3 per 1,000 words across the whole article. The count was not close.
The article had cleared every automated check on its way to the site. So had the other two.
The pipeline that shipped them
Phase 9 of the article auto-pipeline shipped on 2026-06-11. The full chain: digest cron, Telegram triage, variant picker, Sonnet 4.6 drafter, AI-tells scrubber, article-review email, cPanel endpoint, auto-merge. Two button taps from article candidate to deployed MDX. Twelve PRs across the session, from #188 through #199, plus one incident response, confirmed on a live single-row test.
Part of that pipeline is a scrubber that runs on every Sonnet draft before it touches disk. It has two layers. The first strips known AI tells automatically: opening hedges (“Indeed,” “Notably,”), closing summary phrases (“In conclusion,”), rule-of-three rhythm patterns, “Let’s dive in.” The second layer flags what it can’t auto-fix. It writes a .tells.txt sidecar alongside the .mdx, one line per flag with a line number and a snippet. The sidecar is what I read in the review email before approving a draft.
Em-dash density is in the second layer. The rule, as written: flag any paragraph where the em-dash count reaches three or above.
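A minimal sketch of what a per-paragraph check like this could look like. This is not the scrubber's actual code: the function name, the blank-line paragraph splitting, and the sidecar line format are all assumptions made for illustration.

```python
EM_DASH = "\u2014"  # the em-dash character


def flag_dense_paragraphs(text: str, threshold: int = 3) -> list[str]:
    """Return one hypothetical sidecar line per paragraph at or above threshold."""
    flags = []
    paragraph, start = [], 0
    # A trailing empty line guarantees the last paragraph is flushed.
    for number, line in enumerate(text.split("\n") + [""], start=1):
        if line.strip():
            if not paragraph:
                start = number  # line number where this paragraph begins
            paragraph.append(line.strip())
            continue
        if paragraph:
            body = " ".join(paragraph)
            count = body.count(EM_DASH)
            if count >= threshold:
                # Assumed format: line number, rule name, short snippet.
                flags.append(f"{start}: em-dash x{count}: {body[:60]}")
            paragraph = []
    return flags


draft = (
    "One\u2014two\u2014three\u2014four.\n"
    "\n"
    "Calm paragraph\u2014only one.\n"
    "\n"
    "A\u2014b\u2014c."
)
flags = flag_dense_paragraphs(draft)
```

Only the first paragraph in `draft` gets a flag. The third has two em-dashes and passes, which is the behavior that matters for the rest of this story.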
What the scrubber was checking
That per-paragraph rule catches a real pattern. A paragraph containing three em-dashes is usually overloaded with parenthetical asides. It’s a sound heuristic. It just isn’t the same as the style-guide rule.
The style-guide budget is ≤3 em-dashes per 1,000 words, measured at article scope. A typical article can contain a dozen paragraphs each with two em-dashes and never once trip the per-paragraph flag. The sidecar stays clean. The pipeline reads a clean sidecar and ships the draft.
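The arithmetic, worked through. The 100 words per paragraph is an assumed figure chosen to keep the numbers round, not a measurement from any of the published articles, and `summarize` is a hypothetical helper.

```python
BUDGET_PER_1000 = 3        # style-guide budget, article scope
PARAGRAPH_THRESHOLD = 3    # scrubber rule, paragraph scope


def summarize(paragraphs: int, dashes_per_paragraph: int, words_per_paragraph: int) -> dict:
    """Compare the two rules on a uniform synthetic article."""
    total_dashes = paragraphs * dashes_per_paragraph
    total_words = paragraphs * words_per_paragraph
    density = total_dashes * 1000 / total_words
    # Every paragraph is identical, so either all of them trip the flag or none do.
    flagged = paragraphs if dashes_per_paragraph >= PARAGRAPH_THRESHOLD else 0
    return {
        "paragraph_flags": flagged,
        "density_per_1000": density,
        "over_budget": density > BUDGET_PER_1000,
    }


# A dozen paragraphs, two em-dashes each, 100 words each (assumed).
result = summarize(12, 2, 100)
print(result)
```

Under those assumptions the article carries 24 em-dashes in 1,200 words, or 20 per 1,000, with zero paragraph flags. The sidecar is clean and the article is well over budget.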
That is what happened with all three articles.
The three articles
The devlog entry from 2026-06-12 is unambiguous: all three articles published via the auto-pipeline ran well over the style-guide budget. I caught it from the live site, reading. Not from reviewing a report, not from inspecting a sidecar. Each article had generated a clean .tells.txt. Each arrived at the review-email checkpoint with no flags on em-dash density. None of them were within budget.
The gap wasn’t in the drafter. Sonnet 4.6 doesn’t know the style-guide budget. The model writes in a style that leans on em-dashes because em-dashes appear frequently in good technical writing, and good technical writing is a large fraction of its training data. No prompt instruction reliably holds article-scope density at ≤3 per 1,000 words over an arbitrary-length piece. The model doesn’t track running totals as it drafts; it produces the next sentence from context, not from a density budget.
The gap was in the post-processor. Right metric. Wrong scope.
What B-014 is
B-014 is a roadmap backlog item filed against the pipeline after the audit. The specific gap: the em-dash density check needs to operate at article scope, not paragraph scope, and its threshold needs to match the style-guide budget directly, not approximate it with a different number at a different granularity.
PR #202 closed it. The density check is now article-scope. Any draft whose overall density exceeds 3 per 1,000 words gets a flag in the .tells.txt sidecar, regardless of whether any individual paragraph was problematic. The three affected articles were voice-fixed by hand, one pass each, trimming or rewriting the surplus em-dashes.
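A sketch of an article-scope check in the same spirit. It is not the code from PR #202: the function names, the word-counting rule (whitespace tokens, with em-dashes treated as separators), and the flag text are assumptions.

```python
from typing import Optional

EM_DASH = "\u2014"
BUDGET_PER_1000 = 3.0  # the style-guide number, used directly


def em_dash_density(text: str) -> Optional[float]:
    """Em-dashes per 1,000 words over the whole text; None if there are no words."""
    # Treat the em-dash as a separator so "a\u2014b" counts as two words.
    words = len(text.replace(EM_DASH, " ").split())
    if words == 0:
        return None
    return text.count(EM_DASH) * 1000 / words


def article_flag(text: str, budget: float = BUDGET_PER_1000) -> Optional[str]:
    """Return a hypothetical sidecar line if the article is over budget, else None."""
    density = em_dash_density(text)
    if density is None or density <= budget:
        return None
    return f"article: em-dash density {density:.1f} per 1,000 words (budget {budget:g})"


at_budget = "word " * 994 + "a\u2014b c\u2014d e\u2014f"   # 1,000 words, 3 em-dashes
over_budget = "alpha beta\u2014gamma delta " * 250          # 1,000 words, 250 em-dashes
print(article_flag(at_budget))
print(article_flag(over_budget))
```

The comparison is `density <= budget` on purpose, so an article sitting exactly at 3 per 1,000 passes, matching the ≤3 wording of the style guide.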
Why the review email didn’t catch it
The review email exists so I can read each draft before it ships. It didn’t catch this. Worth being honest about why.
The sidecar was clean, which created a false prior. If the scrubber found nothing flaggable, the article looked fine. The review email presents the sidecar content alongside the draft; a clean sidecar reduces the incentive to read every sentence critically. This is a reasonable allocation of attention when the sidecar is accurate. It is a problem when the sidecar is clean because it wasn’t checking the right thing.
Em-dash density is also not visually salient in an email preview. What registers as a pacing problem when reading a rendered article on the live site is harder to notice when scanning a flowing email. I caught it on the third article while reading the live site. I had read all three in the review email and approved them.
Phase 9 had just shipped. The review step was new. The mental model for what the sidecar could and couldn’t catch hadn’t formed yet.
None of that changes the diagnosis. The fix isn’t “read the review email more carefully.” A human can’t reliably count em-dash density across a long-form article while reading for content. The fix is to make the post-processor accurate, so the sidecar tells me what I actually need to know.
The structural point
The pipeline has multiple gates between a Sonnet draft and a deployed article. The scrubber is one. The review email is another. CI checks are a third. All three ran on all three articles. All three passed.
Gates check what they were programmed to check. A per-paragraph em-dash count is a real check; it catches a real class of problem. It is not the same as the style-guide rule. When a gate implements a proxy for the rule rather than the rule itself, the LLM satisfies the gate systematically while violating the rule systematically. Not by intent. The model generates text from the distribution it was trained on; that distribution rewards em-dashes as a mark of good technical writing. A gate the model can’t see cannot constrain what it generates.
The post-processor is the only layer that closes the gap. A deterministic counter applied to the full article text after generation, compared against the exact budget number from the style guide. Not a paragraph-scope proxy for an article-scope budget. The constraint, implemented exactly.
Where this leaves the pipeline
The em-dash check is one rule. The .tells.txt sidecar has rules for generic positivity words (“powerful,” “seamless,” “comprehensive”), sentence-length uniformity, rule-of-three rhythm density, and others. Each was written at a point in time against a set of examples. Any of them could have the same scope mismatch as the em-dash check.
The only way to find out is to run the pipeline, read what ships, and check whether the sidecar reflects what you actually see.
B-014 is one item. The broader question it surfaces: how often does the scrubber’s implementation of a rule and the style guide’s definition of the same rule diverge? That answer is unknown until you check. The three-article incident is now a data point. The right response is to treat it as a prompt for a broader audit, not as a one-off correction.
That audit is pending. The devlog entry from 2026-06-12 named it.
