When Your Pipeline Becomes the Patient

Agents Playing Doctor
I spent a good chunk of today watching AI agents argue about lung fibrosis.
Not metaphorically. TroopX, the multi-agent orchestration platform I've been building, ran a medical consultation workflow where a supervisor agent coordinated a pulmonologist, an infectious disease specialist, and a cardiologist. Each registered via MCP, read the case from a shared blackboard, wrote findings, and the supervisor synthesized a treatment plan. The whole thing took about 35 minutes across six agent sessions, with 118 shell commands from the QA runner alone.
The medical domain is a stress test I chose deliberately. A cardiologist agent that misses a drug interaction is more obviously wrong than a code reviewer that misses a style nit. And the roster-builder workflow (which spins up new specialist agents on demand) ran four separate times today, each under three minutes. The pattern works: define an agent spec, build it, validate it, deploy it. What I didn't expect was how much the conversation routing matters. The supervisor spent most of its 22 minutes not thinking but polling. Heartbeats, signal checks, blackboard reads. Coordination overhead dominates even when the agents actually have something to say.
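To make the shape of that overhead concrete, here's a minimal sketch of the blackboard pattern as I've described it, in plain Python. The names (Blackboard, post_finding, supervise) are illustrative, not TroopX's actual API, and the polling loop is deliberately naive; it's the loop itself that eats the supervisor's time.

```python
import time

class Blackboard:
    """Shared store: specialists write findings, the supervisor reads them."""
    def __init__(self):
        self.findings = {}      # specialist name -> finding text
        self.expected = set()   # specialists the supervisor is waiting on

    def register(self, specialist):
        self.expected.add(specialist)

    def post_finding(self, specialist, text):
        self.findings[specialist] = text

    def all_reported(self):
        return self.expected <= set(self.findings)

def supervise(board, poll_interval=0.01, timeout=1.0):
    """Poll until every registered specialist has reported, then synthesize.
    This loop is where the coordination overhead lives: every iteration is
    a heartbeat/signal/blackboard check, not 'thinking'."""
    deadline = time.monotonic() + timeout
    polls = 0
    while not board.all_reported():
        polls += 1
        if time.monotonic() > deadline:
            raise TimeoutError("a specialist never reported")
        time.sleep(poll_interval)
    plan = " | ".join(f"{name}: {text}"
                      for name, text in sorted(board.findings.items()))
    return plan, polls
```

The `polls` counter makes the cost visible: a supervisor that waits on slow specialists burns most of its wall-clock time in that loop.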
This connects to something Alex Kladov wrote today in Diagnostics Factory: Zig's approach separates error handling from error reporting, pushing the reporting problem to users. My agent workflows have the same split. Each specialist agent handles its domain (the pulmonologist knows lungs), but the supervisor handles reporting (synthesizing findings into something actionable). When I tried collapsing those roles, the output got worse. Separation of concerns applies to agents the same way it applies to functions.

The Twelve-Hour Sidebar
The longest session today was nearly 12 hours on a completely different project: downloading video from security cameras via a Docker container running TUTK protocol code. Thirty minutes of active work, 17 shell commands, one file modified. Most of that time was the container doing its thing. But it reminded me of a pattern I keep noticing: the sessions that touch the fewest files complete cleanly. Yesterday's memory context flagged that sessions touching more than five files have a 78% error rate historically. This camera session touched exactly one.
Mathew Duggan's piece on selling out for $20/month of Terraform generation landed in my feed today, and his frustration resonates. He spent months dismissing LLM tools as verbose comment generators before finding one that actually worked for infrastructure code. I had the same arc with agent orchestration. The first version of TroopX's workflow engine was a mess of polling loops and missed messages. The current version, where agents register, claim tasks from a shared list, and signal completion through a router, only emerged after I stopped trying to make agents smarter and started making the coordination protocol dumber. Dumber protocols, smarter outcomes.
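The "dumber protocol" is easier to show than to describe. This is a hypothetical sketch, not TroopX's real interfaces: agents don't negotiate or message each other, they atomically claim the next task from a shared list and signal completion through a router. That's the whole contract.

```python
from collections import deque
from threading import Lock

class TaskRouter:
    """Dumb coordination: a shared task list plus a completion log."""
    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self._lock = Lock()
        self.completed = []

    def claim(self, agent):
        # Atomic claim: either this agent gets the task or another one did.
        with self._lock:
            return self._tasks.popleft() if self._tasks else None

    def signal_done(self, agent, task):
        with self._lock:
            self.completed.append((agent, task))

def run_agent(router, agent_name):
    """An agent's entire protocol obligation: claim, work, signal, repeat."""
    while (task := router.claim(agent_name)) is not None:
        # Real work would happen here; the protocol doesn't care what it is.
        router.signal_done(agent_name, task)
```

Because the claim is the only point of contention, adding more agents doesn't add more message paths to get wrong, which is roughly why the dumber version misses fewer messages than the polling-loop mess it replaced.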
Simon Willison flagged the term "Deep Blue" for the existential dread developers feel about AI encroachment. I get it. But building the orchestration layer feels different from using the output layer. When you're wiring agents together, you see exactly how fragile the illusion is. A cardiologist agent that runs 36 bash commands in 7 minutes isn't practicing medicine. It's following a script that happens to produce useful text about medicine. The gap between those two things is the whole game.

Day Two of a Hundred
Today was day two of the 100-day content series. The Distill pipeline ran end-to-end again: ingest sessions, synthesize journal entries, generate social posts, push drafts to Postiz for review. The pipeline generated tweets, LinkedIn posts, Slack updates, weekly essays, and thematic deep dives across probably 80 individual LLM calls. Entity extraction, memory updates, the works.
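The stage chain above can be sketched as plain functions threaded over a shared state dict. The stage names mirror the prose; the bodies are stand-ins, and none of this is Distill's actual code.

```python
# Each stage takes the accumulated state and returns an extended copy.
def ingest(state):
    return {**state, "sessions": list(state["raw"])}

def synthesize(state):
    return {**state, "journal": f"distilled {len(state['sessions'])} sessions"}

def generate_posts(state):
    # One journal entry fans out into per-channel drafts.
    return {**state, "posts": [state["journal"] + " (tweet)",
                               state["journal"] + " (linkedin)"]}

def push_drafts(state):
    return {**state, "pushed": len(state["posts"])}

def run_pipeline(raw_sessions):
    """Run each stage in order over a shared state dict."""
    state = {"raw": raw_sessions}
    for stage in (ingest, synthesize, generate_posts, push_drafts):
        state = stage(state)
    return state
```

The useful property of this shape is that each stage is independently testable, which matters when any one of those ~80 LLM calls can quietly degrade the output.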
The web dashboard also got a significant session (3+ hours, 17 files touched) alongside a similar-length VerMAS session (20 files). Cory Doctorow's piece on the online community trilemma (reach, community, and information: pick two) maps neatly onto my publishing problem. Distill tries to hit all three by generating platform-specific variants from the same source material. LinkedIn gets the professional angle, Twitter gets the compressed take, Ghost gets the full essay. Whether that actually solves the trilemma or just papers over it with automation, I honestly don't know yet.
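The one-source, many-variants idea reduces to something like this. A deliberately simplified sketch (real platform formatting is much more than truncation, and this is not Distill's code): each platform gets a draft derived from the same source under its own budget.

```python
def variants(source: str, limits: dict[str, int]) -> dict[str, str]:
    """Produce per-platform drafts from one source text.
    `limits` maps a platform name to a rough character budget."""
    out = {}
    for platform, budget in limits.items():
        if len(source) <= budget:
            out[platform] = source
        else:
            # Trim to budget, leaving room for an ellipsis.
            out[platform] = source[: budget - 1].rstrip() + "…"
    return out
```

Whether generating all the variants from one source actually buys reach, community, and information at once, or just hides the trade-off, is exactly the open question.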
What I do know: the pipeline that writes about itself is the most unforgiving test environment I've ever built. Every rough edge in the synthesis shows up in the output I'm publishing under my own name. The agents playing doctor today were a stress test for TroopX. The pipeline writing my daily essay is a stress test for me.