Deep Dive: Testing

The Test Suite You Run Three Times Tells You Nothing New

Every test run after the first is a question. The first run asks "does this work?" The second asks "did I imagine the first result?" The third asks something else entirely, something that has nothing to do with the code.

The testing literature loves to categorize by what tests cover: unit, integration, end-to-end, contract, mutation. Michael Feathers wrote the canonical framing in Working Effectively with Legacy Code: tests are a safety net that lets you change code with confidence. But that framing assumes the person running the tests is trying to change code. What happens when they're not?

The Three Modes

After watching hundreds of automated test sessions across two platforms over several months, I've come to see testing as operating in three distinct modes that look identical in telemetry but serve completely different purposes.

Regression testing is what the textbooks describe. You change code, you run tests, you confirm the change didn't break anything. The information produced scales with the scope of the change. This is the mode that justifies the entire testing infrastructure.

Exploratory testing happens when you're learning a codebase. You run the suite not to validate a change but to discover what the system does. The test names become documentation. The failure messages become architecture maps. A fresh pytest run on an unfamiliar project teaches you more about the system's boundaries than an hour of reading source files. This mode is genuinely valuable despite producing no code changes, because the information flows into the developer's mental model rather than into git.

Comfort testing is the third mode, and it's the expensive one. This is running a passing suite with no intervening code changes. The information delta is exactly zero. You already know the tests pass. Running them again confirms what was confirmed sixty seconds ago. Yet this mode accounts for a startling fraction of total test execution in practice.

Measuring the Invisible

The problem with comfort testing is that it's invisible in every standard metric. CI dashboards show green. Coverage reports stay at 95%. The build log records another successful run. By every conventional measure, the project looks healthy.

Session telemetry tells a different story. When I instrumented my development workflow to capture tool usage patterns, a clear signature emerged: clusters of test runs with no Edit or Write operations between them. The pattern was unmistakable. Three full pytest runs in the space of eleven minutes. No files modified. No code changed. The suite passed all three times, producing identical output each time.
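The signature is easy to detect mechanically. Here is a minimal sketch, assuming a simple event log where each tool invocation carries a name and a timestamp; the `Bash(pytest)` label and the `Event` shape are illustrative, not the actual telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class Event:
    tool: str      # e.g. "Read", "Edit", "Write", "Bash(pytest)"
    minute: float  # minutes since session start

def comfort_runs(events, test_tool="Bash(pytest)", change_tools=("Edit", "Write")):
    """Count test runs with no intervening change since the previous run."""
    comfort = 0
    changed = True  # the first run of a session carries information
    for ev in events:
        if ev.tool in change_tools:
            changed = True
        elif ev.tool == test_tool:
            if not changed:
                comfort += 1
            changed = False
    return comfort

# The eleven-minute pattern above: three full runs, no edits between them.
session = [Event("Bash(pytest)", 0.0), Event("Read", 3.0),
           Event("Bash(pytest)", 6.0), Event("Bash(pytest)", 11.0)]
print(comfort_runs(session))  # → 2: the second and third runs confirm nothing
```

The first run counts as informative; every subsequent run without an Edit or Write in between is a comfort run.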

The cost isn't just wall-clock time. Each comfort run occupies the same CI resources as a regression run. In an automated multi-agent development environment where dev and QA agents coordinate through structured protocols, the cost compounds further: every test invocation triggers signal exchanges, heartbeat checks, and status polling. A QA agent reviewing a three-line change runs the full suite, polls for results, writes findings to a shared store, signals completion. The ceremony is identical whether the test run taught anything or not.

Why It Happens

Comfort testing spikes when the next real task is ambiguous.

This is the finding I didn't expect. Looking at session data across weeks of development, the correlation wasn't between comfort testing and bug frequency, or comfort testing and code complexity. It was between comfort testing and task clarity. When the next step was obvious — fix this failing test, implement this specified function, wire up this endpoint — the test-edit-test cycle was tight and productive. When the next step was uncertain — "should I restructure the pipeline?" or "is this orchestration framework worth using?" — test runs filled the gap.

The mechanism is straightforward. Running tests produces immediate, concrete feedback. Green means good. Red means fixable. In the absence of a clear next action, this feedback loop substitutes for decision-making. It feels productive. The terminal is scrolling. The assertions are passing. Something is happening. But nothing is being learned.

I watched this pattern play out over a two-week stretch while building a multi-agent orchestration platform. The platform's test suite covered agent registration, message routing, and workflow state transitions. Every day I ran those tests. Every day they passed. Every day I wrote in my journal that tomorrow needed to be different. The tests became a ritual — proof that the system worked, running on repeat, while the harder question of whether the system was worth using went unanswered.

The Change-to-Run Ratio

The diagnostic I now use is simple: the change-to-run ratio. Count the number of meaningful code changes (edits, new files, deleted files) and divide by the number of test invocations. A healthy development session shows a ratio near 1:1 — roughly one test run per change. Exploratory sessions might show 0:1 (all runs, no changes), which is fine when you're learning. Comfort testing shows a ratio of 0:N where N keeps growing.
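The classification can be sketched in a few lines. The cutoffs below (0.8 for a healthy loop, three runs as the exploratory allowance) are my own illustrative thresholds, not values from the sessions described here:

```python
def classify_session(changes: int, runs: int) -> str:
    """Rough session classification from the change-to-run ratio (changes : runs)."""
    if runs == 0:
        return "no testing"
    ratio = changes / runs
    if ratio >= 0.8:
        return "regression"   # tight edit-test loop, near 1:1
    if changes == 0 and runs <= 3:
        return "exploratory"  # learning the codebase; a few runs is fine
    if changes == 0:
        return "comfort"      # 0:N with N growing
    return "mixed"

print(classify_session(5, 6))   # → regression (ratio ≈ 0.83)
print(classify_session(0, 2))   # → exploratory
print(classify_session(0, 12))  # → comfort
```

The point of the diagnostic is not the exact thresholds but the denominator: counting runs against changes makes 0:N visible where pass rates and coverage hide it.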

In one particularly stark example, an agent monitoring a two-minute implementation task ran 154 shell commands over an hour, almost all of them status checks and test re-runs. The developer agent finished the work in 120 seconds. The monitoring agent spent 3,600 seconds confirming, re-confirming, and re-re-confirming that the work was done. The change-to-run ratio for that workflow was approximately 1:77.

This isn't an AI-specific problem. It's a human problem that AI agents faithfully reproduce. The agent has the same instinct I do: when uncertain, verify. When still uncertain, verify again. The difference is that I eventually get bored and move on. The agent doesn't.

Ceremony and the Verification Surface

Testing in a coordinated multi-agent environment exposes a deeper structural issue: the gap between protocol verification and outcome verification.

A workflow orchestration system has perfect visibility into its own domain. It knows which agents registered, which messages were delivered, which signals fired. It can tell you with absolute certainty that Agent A sent a done signal at timestamp T. What it cannot tell you is whether Agent A's work was correct.

This creates a dangerous inversion. The coordination layer reports success — all signals received, all protocols followed, all ceremonies completed — while the actual artifact (the code, the test results, the git state) might be in any condition at all. Early in development, I observed agents signaling completion without having run the test suite. The protocol succeeded. The work hadn't.

The fix was adding artifact verification gates: a small shell script that checks pytest exit codes, mypy output, and git status before allowing a completion signal to propagate. Under fifty lines of bash. But those fifty lines bridge the gap between "the agent says it's done" and "the work is actually done." They convert testing from a self-reported metric into a mechanical constraint.
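The actual gate was a shell script; the sketch below re-expresses the same three checks in Python, separating the decision from the subprocess calls so the logic is testable. Treating a clean `git status --porcelain` output as the git criterion is my assumption:

```python
import subprocess

def run_rc(cmd):
    """Run a command; return its exit code and captured stdout."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout

def gate_passes(pytest_rc: int, mypy_rc: int, git_porcelain: str) -> bool:
    """The gate's decision: tests pass, types check, working tree is clean."""
    return pytest_rc == 0 and mypy_rc == 0 and git_porcelain.strip() == ""

if __name__ == "__main__":
    pytest_rc, _ = run_rc(["pytest", "-q"])
    mypy_rc, _ = run_rc(["mypy", "."])
    _, porcelain = run_rc(["git", "status", "--porcelain"])
    # Exit nonzero to block the completion signal from propagating.
    raise SystemExit(0 if gate_passes(pytest_rc, mypy_rc, porcelain) else 1)
```

Wiring the script's exit code into the signal path is what makes the constraint mechanical: an agent cannot report done unless the process that checks the artifacts agrees.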

Proportional Enforcement

The most counterintuitive lesson was that test rigor should vary with task complexity.

Running a full test suite with strict coverage thresholds against a three-line configuration change is not rigor. It's theater. The suite will pass because the change is too small to break anything. The coverage gate will pass because three lines can't meaningfully move coverage percentages. The QA review will approve because there's nothing substantive to flag. Every participant in the process — human or automated — knows the outcome before it begins. The value delivered is zero. The cost is the full ceremony.

Contrast this with a multi-file refactor touching integration boundaries. Here, the full suite earns its runtime. Coverage thresholds surface untested edge cases. QA review catches architectural issues the author missed. One session I observed involved three revision cycles on an event system enhancement, each round catching real gaps: missing coverage on state transitions, untested edge cases around lifecycle events, a threshold compliance issue. That session justified every second of test infrastructure.

The heuristic that emerged: file count as a proxy for test investment. Changes touching one or two files rarely benefit from full ceremony. Changes touching five or more files almost always do. The threshold isn't precise, but it doesn't need to be. It just needs to route most tasks to the right level of verification.
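The routing rule is small enough to write down directly. The two thresholds come from the heuristic above; the middle "standard" tier for three-to-four-file changes is my own interpolation:

```python
def verification_level(files_touched: int) -> str:
    """Route a change to a verification tier by file count."""
    if files_touched <= 2:
        return "light"     # targeted tests for the touched modules only
    if files_touched >= 5:
        return "full"      # full suite, coverage gate, QA review
    return "standard"      # in between: full suite, no extra ceremony

print(verification_level(2))  # → light
print(verification_level(7))  # → full
```

A proxy this crude would be unacceptable as a quality gate; as a router deciding how much ceremony a task deserves, imprecision at the boundary costs little.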

Tests as Behavioral Fingerprints

The most useful reframing I've found is treating test execution patterns as behavioral diagnostics rather than quality metrics.

A healthy session shows a characteristic rhythm: read, edit, test, read, edit, test. The test runs are interleaved with changes, each run validating the most recent modification. An unhealthy session shows a different rhythm: test, test, test, read, test, test. The test runs cluster together with no intervening edits. The developer (human or automated) is stuck.

This fingerprint is invisible in standard dashboards. Coverage is the same. Pass rates are the same. The build is green either way. But the process quality is radically different. The first pattern produces code with confidence. The second produces anxiety with passing tests.

I now track this rhythm as a first-class metric in my development workflow. When the change-to-run ratio drops below 0.5 for more than ten minutes, something has gone wrong. Either the task is unclear and needs decomposition, or the developer has lost the thread and needs to step back. Either way, the right response is not another test run. It's a decision about what to do next.
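A trailing-window check captures the trigger, as a simplification of "below 0.5 for more than ten minutes." Events here are bare `(minute, kind)` tuples; the real telemetry is richer:

```python
def stuck(events, window=10.0, threshold=0.5):
    """True if, over the trailing `window` minutes, the change-to-run
    ratio has fallen below `threshold`. Events: (minute, kind) tuples
    with kind in {"change", "test"}."""
    if not events:
        return False
    now = events[-1][0]
    recent = [kind for minute, kind in events if now - minute <= window]
    runs = recent.count("test")
    changes = recent.count("change")
    return runs > 0 and changes / runs < threshold

# One change and four runs in the last ten minutes: 1/4 < 0.5.
log = [(0.0, "change"), (1.0, "test"), (4.0, "test"),
       (7.0, "test"), (9.0, "test")]
print(stuck(log))  # → True
```

When the check fires, the useful output is not an alert but a prompt: decompose the task or step back.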

The Rule That Changes Everything

After weeks of watching this pattern — in my own sessions, in automated agent workflows, in multi-agent orchestration runs — I adopted a single rule: no test run without a preceding code change.

Exceptions exist. The first run on a new checkout. The run after pulling changes. The run to learn an unfamiliar codebase. But as a default, the rule eliminates comfort testing entirely. It converts testing from an ambient activity that fills dead time into a bounded phase that follows modification.
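The rule can be enforced mechanically with a thin wrapper around the test command. This is a sketch under stated assumptions: the `.last_test_fingerprint` stamp file is hypothetical, and fingerprinting via `git diff HEAD` ignores untracked files. A missing stamp covers the fresh-checkout exception, and a pull changes HEAD, so that exception falls out for free:

```python
import hashlib
import subprocess
import sys
from pathlib import Path

STAMP = Path(".last_test_fingerprint")  # hypothetical state file

def fingerprint() -> str:
    """Fingerprint the working tree: HEAD plus a hash of uncommitted diffs."""
    head = subprocess.run(["git", "rev-parse", "HEAD"],
                          capture_output=True, text=True).stdout.strip()
    diff = subprocess.run(["git", "diff", "HEAD"],
                          capture_output=True, text=True).stdout
    return head + ":" + hashlib.sha256(diff.encode()).hexdigest()

def should_run(current: str, last) -> bool:
    """The rule: run only if something changed since the last run.
    A missing stamp (last is None) counts as the first run on a checkout."""
    return last is None or current != last

if __name__ == "__main__":
    current = fingerprint()
    last = STAMP.read_text() if STAMP.exists() else None
    if not should_run(current, last):
        sys.exit("nothing changed since the last run; decide what to build next")
    STAMP.write_text(current)
    sys.exit(subprocess.run(["pytest", *sys.argv[1:]]).returncode)
```

Refusing the run is the whole point: the wrapper closes the escape hatch and hands the uncertainty back to the developer.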

The effect was immediate. Sessions got shorter. Decision-making happened faster because the test-run escape hatch was closed. When I couldn't run the suite for comfort, I had to actually decide what to build next. And the tests, when they did run, carried information — because something had changed since the last run.

The test suite is a tool. Like any tool, its value depends on when you pick it up. Run it after a change, and it's a safety net. Run it to learn, and it's documentation. Run it because you don't know what else to do, and it's a mirror showing you that the hard part isn't testing.

The hard part is knowing what to build next.