Deep Dive: Workflow Analysis
Every workflow system eventually faces the same question: how do you know the work actually happened?
This is not a philosophical puzzle. It is an engineering problem with concrete failure modes, measurable costs, and — if you get the architecture right — a surprisingly elegant solution. The challenge is that workflow coordination and workflow outcomes exist in two completely separate observability domains. The coordination layer sees messages, signals, and state transitions. The outcome layer sees files changed, tests passing, and commits landing. Most systems monitor one and assume the other.
That assumption is where things break.
Two Kinds of Blindness
Fred Brooks observed in The Mythical Man-Month that adding people to a project increases communication channels quadratically. His model assumed humans. Multi-agent workflows exhibit a different pattern: the overhead is fixed per agent, not quadratic, but the blindness is total where the coordination layer meets the artifact layer.
Consider what a workflow orchestrator actually sees. It sees that Agent A registered, received a task, sent a "done" signal, and that Agent B received that signal, ran its verification pass, and sent "approved." The orchestrator processed every message, tracked every state transition, managed every heartbeat. From its perspective, the workflow completed successfully.
Now consider what the orchestrator does not see. It does not see whether the files Agent A modified actually compile. It does not see whether the tests pass. It does not see whether the commit is clean or whether the working tree has uncommitted conflicts. The orchestrator has perfect visibility into its own domain and zero visibility into the domain that matters.
This is not a bug in any particular implementation. It is a structural property of layered systems. The coordination layer and the artifact layer speak different languages, operate on different timescales, and fail in different ways. Bridging them requires deliberate architectural work.
The Verification Gate
The fix is straightforward in concept and surprisingly compact in implementation. Before any agent can emit a completion signal, an independent verification step inspects actual artifact state: run the test suite, check the type checker, confirm the git working tree is clean. The signal is gated on ground truth, not on self-report.
In practice, this looks like a pre-signal check that runs a handful of shell commands and inspects their exit codes. Not their output — their exit codes. A type checker returns 0 or it doesn't. A test suite passes or it doesn't. Git reports a clean working tree or it doesn't. Binary signals from binary checks, with no natural-language parsing required.
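That pre-signal check can be sketched in a few lines. The specific commands below (mypy, pytest, the git flags) are illustrative assumptions for a Python repository; substitute whatever your stack uses. The essential property is that only exit codes are inspected, never output:

```python
import subprocess

# Assumed check commands for a Python repo; swap in your own toolchain.
DEFAULT_CHECKS = [
    ["mypy", "src"],                         # type checker: exit 0 means clean
    ["pytest", "-q"],                        # test suite: exit 0 means passing
    ["git", "diff", "--quiet"],              # exit 0 means no unstaged changes
    ["git", "diff", "--cached", "--quiet"],  # exit 0 means nothing staged
]

def verify_gate(checks: list = DEFAULT_CHECKS) -> bool:
    """Return True only if every check exits 0.

    Only exit codes are inspected; output is never parsed.
    """
    for cmd in checks:
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            return False  # any failing check blocks the completion signal
    return True
```

An agent would call `verify_gate()` immediately before emitting "done" and downgrade the signal to a failure report if it returns False.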
The economics here are asymmetric. The verification step adds seconds to a workflow that runs for minutes. Skipping it saves those seconds but risks a false completion that takes much longer to diagnose and repair. Every workflow that signals success without verification is carrying unpriced risk.
What the Traces Reveal
Once you have workflows running through coordinated dev-QA pairs with verification gates, you accumulate traces. These traces decompose into three layers: protocol traces (who sent what signal when), behavioral traces (what commands each agent ran, what files they touched), and outcome traces (what changed in the repository, what tests pass now that didn't before).
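The three layers can be made concrete as a trace schema. This is a minimal sketch; the field names here are assumptions for illustration, not a real protocol:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProtocolTrace:
    sender: str
    signal: str          # e.g. "done", "approved"
    timestamp: float

@dataclass
class BehavioralTrace:
    agent: str
    commands: list       # shell commands the agent ran
    files_touched: list

@dataclass
class OutcomeTrace:
    commits: list
    tests_newly_passing: list

@dataclass
class WorkflowTrace:
    protocol: list = field(default_factory=list)   # who signaled what, when
    behavior: list = field(default_factory=list)   # what agents actually did
    outcome: Optional[OutcomeTrace] = None         # what changed in the repo
```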
Most workflow monitoring systems capture only the first layer. That is like evaluating a restaurant by watching the waiters move between tables without ever tasting the food.
The behavioral layer is where diagnostic gold lives. A QA agent that runs 42 shell commands during a single verification pass is telling you something about its thoroughness. A dev agent that modifies six files and runs 22 validation commands before committing is telling you something about its confidence level. These numbers are not directly actionable in isolation, but patterns emerge across workflows.
The pattern I keep finding: the ratio between verification commands and code-changing commands predicts downstream revision rates. When QA agents front-load verification — running type checks, lint, targeted tests, the full suite, and commit-history checks — first-pass approval rates climb. When they spot-check with a single test run, revision cycles follow. The correlation is strong enough to use as a quality metric for the verification agents themselves.
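Computing that ratio from a behavioral trace is mechanical. The command prefixes below are assumptions for illustration; classify against whatever your agents actually run:

```python
# Hypothetical classification of commands; adjust prefixes to your stack.
VERIFY_PREFIXES = ("pytest", "mypy", "ruff", "git log", "git diff")
CHANGE_PREFIXES = ("git commit", "git apply", "patch")

def verification_ratio(commands: list) -> float:
    """Ratio of verification commands to code-changing commands
    in one agent's behavioral trace; higher tends to mean fewer
    downstream revision cycles."""
    verify = sum(cmd.startswith(VERIFY_PREFIXES) for cmd in commands)
    change = sum(cmd.startswith(CHANGE_PREFIXES) for cmd in commands)
    return verify / max(change, 1)  # avoid division by zero
```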
Proportional Ceremony
Not all tasks deserve the same coordination investment. A function that capitalizes words does not need the same workflow machinery as a cross-module refactor touching a dozen files.
The ceremony cost of multi-agent coordination is roughly fixed per agent per session. Registration, heartbeat establishment, task reading, blackboard setup, signal negotiation. Whether the task takes two minutes or twenty, the coordination overhead stays roughly constant. This means ceremony cost as a percentage of total effort is inversely proportional to task complexity.
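The arithmetic behind "inversely proportional" is simple. Assuming a fixed ceremony cost of about three minutes (an illustrative figure, not measured from the source):

```python
def ceremony_fraction(overhead_min: float, work_min: float) -> float:
    """Fixed coordination overhead as a fraction of total session time."""
    return overhead_min / (overhead_min + work_min)

# With an assumed 3-minute ceremony cost:
# a 2-minute task spends 60% of its session coordinating,
# a 20-minute task spends about 13%.
```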
The implication is a routing decision: tasks below a certain complexity threshold should bypass full orchestration entirely. A single agent with a clear spec finishes faster than a dev-QA pair on a trivial task, because the pair spends more time coordinating than working. Above the threshold, the pair consistently catches issues that a solo agent misses — parameter ordering bugs, missing default values, edge cases in string handling. The QA agent earns its overhead by finding real problems, not by performing review theater.
The threshold is not abstract. I track it empirically. Tasks that touch fewer than three files in a single module almost never benefit from a QA pass that a good test suite wouldn't catch anyway. Tasks that cross module boundaries or modify public interfaces almost always surface issues in the review cycle that would have shipped as bugs.
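The routing decision can be expressed as a small heuristic. The thresholds (fewer than three files, module boundaries, public interfaces) come from the observations above; the function shape and names are assumed:

```python
def route(files_touched: list, crosses_modules: bool,
          touches_public_api: bool) -> str:
    """Decide between solo-agent and dev-QA-pair orchestration."""
    if crosses_modules or touches_public_api:
        return "dev-qa-pair"   # review cycle reliably surfaces real bugs here
    if len(files_touched) < 3:
        return "solo-agent"    # ceremony would exceed the work itself
    return "dev-qa-pair"
```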
The Feedback Loop
After each workflow completes, a separate analyst agent reads the traces — the blackboard entries, the signals, the behavioral data — and extracts patterns. These patterns feed back into the agent context for subsequent runs, creating a closed loop: execute, observe, extract, inject.
This loop has a failure mode that took weeks to recognize. The analyst agent extracts patterns about coordination efficiency, about common implementation pitfalls, about review quality. Some of these patterns are useful: "type checker failures in this codebase usually involve missing Optional annotations" helps a dev agent avoid a specific class of error. But the analyst also extracts patterns about the extraction process itself. Patterns about how patterns are extracted. Observations about the observation layer. Left unchecked, the system spends increasing compute reflecting on its own reflections.
The fix is a classification gate with three criteria. An extracted pattern must be actionable by a non-analyst agent. It must be grounded in a specific technical context, not a meta-process observation. And it must be validated by successful application in a subsequent workflow. Patterns that fail any criterion get filtered. This drops a substantial fraction of what the analyst produces, but what survives actually improves subsequent workflows. Dev agents that receive targeted knowledge about a codebase's common failure modes write cleaner first-pass implementations. QA agents that receive calibrated review criteria produce fewer false-positive rejections.
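A sketch of that three-criterion gate, with an assumed pattern schema (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ExtractedPattern:
    text: str
    actionable_by: set     # agent roles that could apply the pattern
    grounded_context: str  # specific module/error class; "" if meta-process
    validated: bool        # confirmed useful in a subsequent workflow

def passes_gate(p: ExtractedPattern) -> bool:
    """Keep a pattern only if it is actionable by a non-analyst agent,
    grounded in a concrete technical context, and validated downstream."""
    actionable = bool(p.actionable_by - {"analyst"})
    grounded = bool(p.grounded_context)
    return actionable and grounded and p.validated
```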
The Step Function
The transition from "agents can coordinate" to "agents reliably complete real work" is not gradual. It is a step function. On one side, you have a coordination protocol that successfully routes messages, manages state transitions, and produces clean signal traces. On the other side, you have a system that ships code. The gap between them is entirely about connecting protocol success to artifact success.
Verification gates bridge this gap. They are small in implementation — a handful of shell commands checking exit codes — but large in effect. They convert a coordination system that knows what agents said into a production system that knows what agents did.
The broader lesson extends beyond multi-agent orchestration. Any system where one layer reports on behalf of another layer has this problem. CI pipelines that report "deployment successful" based on container startup without health checks. Monitoring systems that confirm services are running without confirming they are serving traffic correctly. The pattern is the same: a reporting layer with perfect visibility into its own domain and no visibility into the domain it claims to represent.
The fix is also the same. Gate the report on ground truth. Do not trust what components say about themselves. Inspect the actual artifacts, the actual state, the actual outcomes. The cost of inspection is low. The cost of false confidence is not.
Workflow analysis, ultimately, is not about watching workflows. It is about building workflows that watch themselves honestly — and knowing exactly where to stop trusting the self-report.