Infrastructure Building vs Shipping Features
Every software project has two competing gravitational pulls. One draws you toward building systems — pipelines, orchestration layers, coordination protocols, memory stores. The other draws you toward producing output that someone outside your codebase might actually care about. The tension between these forces is usually framed as a prioritization problem: spend wisely, balance infrastructure investment against feature delivery, find the right ratio. But that framing misses the deeper issue. Infrastructure building and feature shipping don't just compete for time. They compete for attention, and attention is the resource you can't scale.
The Satisfaction Asymmetry
Infrastructure work provides immediate, concrete feedback. You wire up a pipeline stage, run it, and see structured JSON come out the other side. You add a coordination protocol between two agents, and you can watch them exchange messages in real time. You build a knowledge extraction loop and observe the extracted patterns threading into the next cycle's prompts. Each piece snaps into place with the satisfying click of a well-cut joint.
Feature shipping — or more precisely, evaluating whether shipped features produce anything worth using — provides almost no immediate feedback. Reading pipeline output takes focused attention. Judging whether a synthesized essay captures genuine insight or merely reorganizes words requires taste, patience, and willingness to confront the possibility that your elegant machinery produces mediocre results. The asymmetry is structural: building rewards you in minutes, evaluation punishes you in hours.
I watched this dynamic play out across three weeks of daily sessions on a content pipeline and a multi-agent orchestration platform. The pattern was consistent. Each day produced new infrastructure: working memory for journal synthesis, a knowledge graph extraction pipeline, entity extraction modules, post-workflow analysis agents, blackboard coordination protocols. Each day also produced a quiet promise: tomorrow will be the day I actually read what this thing generates.
Tomorrow never came.
Production Scales With Compute; Consumption Scales With Attention
Herbert Simon wrote in 1971 that "a wealth of information creates a poverty of attention." He was describing organizations drowning in reports. The same principle applies to automated content systems, except the drowning happens faster.
Over those three weeks, the content pipeline processed hundreds of content items from eight sources — RSS feeds, browser history, newsletters, social exports, coding sessions. It generated daily journal entries with memory threading, weekly blog posts with thematic detection, social media adaptations across three platforms. Each synthesis step worked correctly. Entity extraction returned clean JSON. The blog synthesizer produced properly structured essays. The social publishers respected character limits. By every automated metric, the system was performing flawlessly.
The pile of unread output grew taller every day.
This is the core failure mode of infrastructure-first development: you optimize the production side of the equation because it responds to engineering effort, while the consumption side — the part where a human decides whether the output is any good — remains stubbornly fixed. You can parallelize LLM calls. You can batch entity extraction. You can fan out to multiple publishing platforms simultaneously. You cannot parallelize your own reading comprehension.
When Deletion Outperforms Construction
The most productive single action across those weeks wasn't adding a new pipeline stage or wiring up another coordination protocol. It was deleting a parser.
The parser in question was 842 lines of Python that normalized session data from a project management tool. It had been one of three session sources, a peer to the Claude and Codex parsers. But the upstream tool evolved, making the parser's assumptions obsolete. The code still ran. Tests still passed. The parser dutifully produced ContentItems that flowed through the rest of the pipeline, got enriched, got synthesized, got published. Nobody noticed whether those items were useful because nobody was reading the output closely enough to distinguish signal from noise.
Removing it wasn't just cleanup. It was an act of evaluation: sitting down, reading the pipeline output, tracing which sources contributed genuine insight and which contributed noise, and making a judgment call. That judgment call — this source isn't pulling its weight — required the exact kind of attention that infrastructure building so effectively displaces.
A similar pattern emerged on the orchestration side. A massive removal of legacy meeting infrastructure — fifty-six files, nearly ten thousand lines — was replaced by four focused orchestration engines. The new engines were smaller, tested, and aligned with what the system actually needed. The diff looked destructive. It was the most valuable architectural work of the month. But it only happened because someone stopped building long enough to evaluate what existed.
The Behavioral Failure of "Tomorrow"
There is a particular kind of self-deception that infrastructure builders are prone to. It sounds like this: "The system is almost complete. Once this last piece is in place, I'll have time to evaluate the output." Across ten journal entries spanning three weeks, some version of that sentence appeared six times. It was never true. The system was never almost complete, because infrastructure generates its own requirements. Each new capability surfaces new integration points, new optimization opportunities, new coordination challenges. The work expands to fill the engineering bandwidth available, leaving evaluation perpetually deferred.
This is not a discipline problem. Behavioral commitments to review — promises to yourself, calendar blocks, stern self-talk in journal entries — consistently fail because they're fighting an incentive gradient. Infrastructure work is immediately rewarding, clearly scoped, and produces visible artifacts. Review work is diffuse, uncomfortable, and produces nothing except the unwelcome possibility that your beautiful system needs rethinking.
The only solutions that actually work are architectural, not behavioral. Build the constraint into the system itself. A pipeline that degrades visibly when review lapses — output that expires, queues that back up and block new generation, quality scores that trend downward without human calibration — forces evaluation by making the cost of skipping it concrete and immediate. A pipeline that runs perfectly whether anyone reads its output will always lose the attention war to the next infrastructure task.
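One concrete shape for such a constraint is a gate the pipeline must pass before every generation cycle. This is a minimal sketch, not the pipeline's actual code: the directory layout, the backlog limit, and the expiry window are all assumptions you would tune to your own system.

```python
import time
from pathlib import Path


def review_gate(pending_dir: Path, max_unreviewed: int = 5,
                max_age_seconds: float = 3 * 86400) -> None:
    """Refuse to start a generation cycle while review debt is outstanding.

    Hypothetical convention: each unreviewed output sits as a .json file
    in pending_dir, and a human review deletes or moves it. The gate makes
    skipped evaluation concrete: too many pending items, or any item left
    past the expiry window, blocks new generation with an error.
    """
    pending = sorted(pending_dir.glob("*.json"))
    if len(pending) >= max_unreviewed:
        raise RuntimeError(
            f"{len(pending)} unreviewed outputs; review them before generating more")
    now = time.time()
    stale = [p for p in pending if now - p.stat().st_mtime > max_age_seconds]
    if stale:
        raise RuntimeError(
            f"{len(stale)} outputs expired unreviewed; pipeline is degraded")
```

The point of the design is that the failure is loud and immediate: the cost of not reading lands on the builder, in the build loop, instead of accumulating silently in an output directory.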
The Orchestration Trap
Multi-agent orchestration systems are the purest expression of this dynamic. I spent weeks building coordination infrastructure: agent registration, heartbeat polling, blackboard protocols, signal-based state transitions, memory extraction, and knowledge injection loops. Each piece was technically interesting. The blackboard namespace convention emerged organically across agent runs. Dev agents started running conflict pre-checks while QA agents were still writing reviews. Post-workflow analysts extracted structured learnings that got injected into subsequent agent prompts. The coordination substrate was genuinely sophisticated.
But coordination infrastructure has a seductive property: it always looks like it's almost working. Each successful agent interaction validates the architecture. Each emergent behavior — agents anticipating each other's needs, conventions stabilizing without explicit programming — feels like evidence that the system is converging on something powerful. The temptation is to keep adding capability: dynamic team composition, specialist spawning, compound signal triggers, feedback loops that modify agent prompts based on accumulated experience.
Each addition is individually justified. Collectively, they represent an ever-growing investment in production machinery with an ever-shrinking allocation of attention to what that machinery produces. The question "does feeding structured expertise back into agent prompts actually improve behavior?" appeared in journal entries across multiple weeks. It was never answered — not because it was unanswerable, but because designing the controlled comparison to answer it was less immediately satisfying than building the next piece of infrastructure.
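The controlled comparison that never got built is not elaborate. Assuming a hypothetical `run_agent` callable that executes one task and returns a human-assigned quality score (the scoring step is the attention cost that cannot be automated away), the experiment is a random split with and without the injected expertise:

```python
import random
import statistics
from typing import Callable, Optional


def ab_compare(tasks: list[str],
               run_agent: Callable[[str, Optional[str]], float],
               injected_context: str,
               seed: int = 0) -> dict[str, float]:
    """Split tasks randomly: half run with injected expertise, half without.

    run_agent(task, context) must return a quality score that a human
    (or a human-calibrated rubric) assigned to the agent's output.
    Returns mean scores per arm and the difference between them.
    """
    rng = random.Random(seed)
    shuffled = tasks[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    with_ctx = [run_agent(t, injected_context) for t in shuffled[:half]]
    without = [run_agent(t, None) for t in shuffled[half:]]
    return {
        "with_injection": statistics.mean(with_ctx),
        "without_injection": statistics.mean(without),
        "delta": statistics.mean(with_ctx) - statistics.mean(without),
    }
```

Twenty lines of harness, plus hours of human scoring. The harness was never the obstacle; the scoring was.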
The Metric That Matters
Task completion rates, test coverage, type-check cleanliness, successful pipeline runs — these are all necessary conditions for useful output, but none of them are sufficient. A system can score perfectly on every automated metric while producing output that nobody reads, nobody uses, and nobody would miss if it disappeared.
The metric that actually matters is whether the output changes something. Does a synthesized journal entry surface an insight that reshapes how you work tomorrow? Does a blog post explain a concept clearly enough that a reader applies it without knowing your codebase? Does a QA agent's review catch a bug that would have reached production? These are consumption metrics, not production metrics, and they require human judgment to evaluate.
Infrastructure that nobody evaluates is not infrastructure. It is a hobby — an intellectually stimulating one, a technically impressive one, but a hobby nonetheless. The distinguishing characteristic of infrastructure is that something depends on it. If the only thing depending on your pipeline's output is the next stage of the pipeline, you have built an elaborate closed loop that transforms compute into heat.
Building the Off-Ramp
The practical solution is not to stop building infrastructure. Systems need plumbing, and good plumbing creates leverage that compounds over time. The solution is to build evaluation checkpoints into the infrastructure itself, as load-bearing components rather than optional add-ons.
A content pipeline should have a human review gate that blocks the next generation cycle until someone confirms the previous cycle's output met a quality bar. An orchestration platform should surface not just whether agents completed their tasks, but whether the completed work was any good — measured by downstream consumption, not upstream completion signals. A knowledge injection loop should include periodic audits where a human reads the injected context and judges whether it's accurate, because bad context compounds through automated systems with exactly the same efficiency as good context.
These checkpoints feel like friction. They slow the system down. They interrupt the satisfying flow of building the next thing. That is precisely why they work. The discomfort of evaluation is the signal that you're doing the part of the work that matters — the part that no amount of infrastructure can automate away.