Sixty-Seven Shell Commands and the Feeling of Letting Go

Hands releasing a paper boat onto a still river at dusk

When the QA Agent Ran More Commands Than I Would Have

The QA agent ran 67 shell commands today verifying a single endpoint. Type checks, linting, targeted tests, commit history. I watched the log scroll by and realized: it was more thorough than I would have been.

That's the strange inflection point with multi-agent orchestration. You build the system, you wire the coordination protocol, you define the workflow states. Then at some point the agents start doing things you wouldn't have thought to do, and you find yourself wondering whether that's the whole point or a warning sign. Today it felt like the point.

I've been running TroopX workflows all day across VerMAS: dev-QA pairs, PMO-guided workflows, a product strategy session, even a DSL-mode version endpoint task. Each one follows the same pattern. Agents register via the router MCP, claim tasks from a shared blackboard, coordinate through signals, and complete their work in isolated git worktrees. The dev agent writes code, the QA agent verifies it, and afterward a post-workflow analyst extracts learnings that get fed back to agent memory for next time. The whole cycle has become mechanical. I used the word "mechanical" in my notes and then crossed it out, because mechanical implies mindless. This is more like practiced.
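That claim-and-signal loop can be sketched in a few lines. This is a hypothetical illustration, not TroopX's actual API: the class and names below are invented, and it assumes only the pattern described above, where agents claim open tasks atomically from a shared board and post signals when they finish.

```python
# Hypothetical sketch of the blackboard pattern described above.
# None of these names come from TroopX; they only illustrate the shape:
# agents claim open tasks atomically and signal completion on the board.
import threading


class Blackboard:
    """Shared task board; agents claim open tasks under a lock."""

    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._tasks = {t: "open" for t in tasks}  # task -> state
        self.signals = []  # append-only log of (agent, task, kind)

    def claim(self, agent):
        """Atomically claim the first open task, or None if nothing is open."""
        with self._lock:
            for task, state in self._tasks.items():
                if state == "open":
                    self._tasks[task] = f"claimed:{agent}"
                    return task
        return None

    def signal(self, agent, task, kind):
        """Record a completion (or revision) signal and update task state."""
        with self._lock:
            self._tasks[task] = kind
            self.signals.append((agent, task, kind))


board = Blackboard(["impl-version-endpoint", "qa-version-endpoint"])
task = board.claim("dev-1")          # dev agent claims the open impl task
board.signal("dev-1", task, "done")  # ...and signals completion when finished
```

The lock is doing the real work here: with multiple concurrent agents, claiming has to be atomic or two agents end up on the same task. The worktree isolation in the real system plays a similar role for the filesystem.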

The numbers from today: roughly a dozen workflow runs, touching four different orchestration modes (goal-directed, guided, autonomous, DSL). The longest session clocked in at nearly 37 hours of accumulated agent time on VerMAS alone. The QA agents collectively ran hundreds of shell commands. The dev agents modified dozens of files. And the whole thing worked well enough that the interesting question is no longer "does it work" but "where does it break."

Open journal beside a stack of older notebooks in window light

The Skills Paper and the Accidental Validation

Sean Goedecke published a piece today about LLM-generated skills: short explanatory prompts bundled with helper scripts. The finding that caught me: LLM-authored skills work, but only if you generate them after the task, not before. Skills written in advance by the LLM don't help. Skills extracted from successful completions do.

I read this and laughed, because that's exactly what the post-workflow analyst does in TroopX. After every workflow completes, it extracts "rich, specific, actionable learnings" for each participating agent. Those learnings get stored in agent memory. The next time a similar workflow runs, the agent recalls them. I didn't design this because I'd read the research. I designed it because it seemed obvious that reflection should come after action. But it's nice when the literature catches up to your instincts.
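The extract-then-recall cycle is simple enough to sketch. Everything below is hypothetical: the real analyst is an LLM agent reading a full workflow transcript, not a string filter, and these function names are invented for illustration. The point is the ordering, which matches the skills finding: learnings are extracted after completion, stored per agent, and recalled before the next similar run.

```python
# Minimal, hypothetical sketch of after-the-fact learning extraction.
# The real post-workflow analyst is an LLM; this stand-in just shows
# the cycle: complete -> extract -> store -> recall on the next run.
from collections import defaultdict

memory = defaultdict(list)  # agent name -> accumulated learnings


def extract_learnings(transcript):
    # Stand-in for the analyst agent: keep only lines flagged as lessons.
    return [line for line in transcript if line.startswith("LESSON:")]


def store(agent, learnings):
    memory[agent].extend(learnings)


def recall(agent):
    # Fed into the agent's context before its next similar workflow.
    return memory[agent]


transcript = ["ran pytest", "LESSON: run type checks before lint"]
store("qa-agent", extract_learnings(transcript))
```

Generating the "skill" before the task would mean calling `extract_learnings` on an empty transcript, which is exactly why skills written in advance don't help: there is nothing yet to extract from.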

Martin Fowler's piece on agentic email hit a different nerve. He describes people setting up LLM agents to read their email, draft replies, and sometimes respond autonomously. The framing is cautious: Fowler's good at that. But the pattern is identical to what I'm building. An agent registers, reads the context, acts on it, and signals completion. The domain (email vs. code) is almost incidental. The coordination protocol is the product.

Craftsman's workbench with tools arranged around a test mortise joint

What Jeff Geerling Gets Right and Wrong

Jeff Geerling wrote about AI destroying open source. The specific complaint: a flood of AI-generated slop PRs overwhelming maintainers. Joan Westenberg made the same point through a different lens, arguing for gatekeeping modeled on medieval guild structures. Both are right about the symptom and wrong about the prescription.

The problem with AI-generated contributions isn't that AI is involved. It's that there's no verification loop. Someone points an agent at a repo, the agent generates a PR, nobody checks whether it's good. My QA agent ran 67 commands to verify a single endpoint. That's the difference between an agent and a script: the agent has a quality bar, defined by the workflow, enforced by the orchestration layer. The guild model Westenberg describes is essentially what I'm building into the workflow DSL. Define the states, define the transitions, define what "done" means. Agents that don't meet the bar get revision signals.
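The gate that's missing from slop PRs can be sketched as a tiny state machine. These state names and the `review` function are invented for illustration, not the actual workflow DSL: the idea is just that "done" is a checked transition, not a default, and failing the bar routes back through revision rather than out the door.

```python
# Hypothetical sketch of a workflow quality gate: explicit states,
# explicit transitions, and "done" reachable only through review.
# Work that fails the bar gets a revision signal, not a merge.
ALLOWED = {
    "implementing": {"in_review"},
    "in_review": {"done", "needs_revision"},
    "needs_revision": {"in_review"},
}


def review(state, checks_passed):
    """QA transition: move in_review work to done or back to revision."""
    if state != "in_review":
        raise ValueError(f"cannot review from state {state!r}")
    nxt = "done" if checks_passed else "needs_revision"
    assert nxt in ALLOWED[state]  # transition must be declared in the DSL
    return nxt


state = review("in_review", checks_passed=False)  # QA bar not met
```

Note there is no edge from `implementing` straight to `done`. That missing edge is the whole argument: an agent without a verification loop is just a script with opinions.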

The open question, the one I keep coming back to, is scale. Today's tasks were deliberately simple: ping endpoints, version endpoints, config endpoints, metrics endpoints. The point was to validate the orchestration before scaling the complexity. Can this handle multi-file features? Database migrations? API integrations with external services? Can the worktree isolation hold up under three or four concurrent workflows?

I don't know yet. But 67 shell commands for one endpoint suggests the thoroughness is there. The question is whether thoroughness survives complexity, or whether it's a luxury of simple tasks.

Ibrahim Diallo wrote today about how we used to just sit with discomfort long enough for something to emerge. He's talking about thinking before ChatGPT. I'm building systems that sit with discomfort for me, running 67 commands because they don't know when to stop checking. Maybe that's the same thing. Maybe it's the opposite. I'll know when the tasks get harder.