Forty-Two Commands and the Case for Productive Friction

The QA Agent That Wouldn't Rubber-Stamp

I spent most of today running TroopX dev-QA workflow pairs. Not one or two. Twelve sessions across six different tasks: fixing the doctor health-check, adding CLI secrets management, a ready endpoint for the roster service, palindrome checkers, snake-to-camel converters, and a signal carry-forward bug that took two attempts to nail down. The dev agent writes code, the QA agent reviews it. Simple loop.

Here's what caught my attention: the QA verifier on the doctor-check fix ran 42 independent commands before signing off. Not pytest and done. It checked import paths, verified error messages matched the spec, tested edge cases the dev agent hadn't considered, confirmed the CLI output format, and probed three failure modes I wouldn't have thought to test manually. Forty-two commands providing genuine editorial friction.

The complex tasks took about 40 minutes per dev-QA cycle. The trivial ones (utility functions, string transformers) closed in under 5 minutes. That ratio feels right. The system is learning to match effort to complexity, or at least I'm learning to scope tasks so it can.
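The trivial end of that spectrum really is trivial. A snake-to-camel converter, one of the tasks that closed in under five minutes, is a handful of lines; this is my own sketch, not the dev agent's actual output:

```python
def snake_to_camel(name: str) -> str:
    """Convert snake_case to camelCase, e.g. "health_check" -> "healthCheck"."""
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)
```

The QA half of the cycle for something like this is mostly edge cases: empty strings, names with no underscores, leading underscores.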

Meanwhile the distill pipeline churned through its own work: entity extraction, journal synthesis, blog generation, social media adaptation, image prompt generation, memory extraction. I counted roughly 50 LLM calls in the content pipeline alone. The pipeline has reached the point where it generates more content per day than I could review in a sitting. Which raises a question I keep circling back to: does the machine that makes the content also need to be the machine that filters it?

Type Hints, Expert Generalists, and the Skill Inversion

Simon Willison wrote something today that stopped me mid-scroll. After 25 years of programming, he's coming around to type hints. Not because he suddenly likes ceremony, but because agents do the typing now. The cost of explicit types used to be iteration speed. When a coding agent handles the mechanical work, that cost drops to zero and the benefit (catching errors before they propagate) stays the same.
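The trade Willison describes is easy to see in miniature. In this hypothetical sketch (my own, not from his post), the annotations cost an agent nothing to write, and a checker like mypy flags a string-for-int mismatch at check time instead of letting it propagate into runtime code:

```python
def parse_port(raw: str) -> int:
    """Parse a port number from config text; raises ValueError on junk."""
    return int(raw)

def bind_address(host: str, port: int) -> str:
    """Build a host:port string for binding a socket."""
    return f"{host}:{port}"

# mypy would reject bind_address("0.0.0.0", "8080") before it ever runs;
# the typed path forces the conversion to happen explicitly:
addr = bind_address("0.0.0.0", parse_port("8080"))
```

The hints are pure ceremony when a human types them under deadline. When the agent types them, they're free error detection.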

This connects directly to something Martin Fowler said at the Thoughtworks retreat, which Willison also flagged: "LLMs are eating specialty skills. There will be less use of specialist front-end and back-end developers as the LLM-driving skills become more important than the details of platform usage." Fowler asks whether this elevates Expert Generalists or whether agents just code around the silos.

I think the answer is both, simultaneously, and that's the uncomfortable part. Today I had agents writing Python, TypeScript, YAML, and shell scripts across two projects. I didn't need to be an expert in any single one. But I needed to know enough about all of them to tell when the QA agent's 42 commands were actually testing the right things. The skill isn't writing code anymore. The skill is reading it fast enough to maintain quality at the speed agents produce it.

Jim Nielsen wrote about care in the context of AI and taste. Everyone claims taste is the supreme skill now. I think that's half right. Taste without friction is just aesthetics. The 42-command QA pass isn't tasteful. It's thorough. Thoroughness is the boring cousin of taste, and it matters more.

The Seven-Hour Session and the Scope Warning

The biggest session today was VerMAS: 7 hours, 54 files modified, 529 shell commands. That's the kind of session that, historically, has a 78% error rate. We established that metric back on February 14 when analyzing session data: anything touching more than 5 files is inherently riskier. Today's VerMAS session touched 54.

I've been sitting on the decision to build scope warnings into the pipeline as a proactive signal. Today made it concrete. When a session crosses the 5-file threshold, the journal entry should flag it. When it crosses 20, the blog synthesis should note the risk profile. Not to stop the work. To make the friction visible.

Paul Ford published an NYT op-ed about AI disruption arriving and being fun. I read it, nodded, and then went back to debugging why signals from a previous workflow were bleeding into the next one (the premature-signal-carry-forward bug, which took two dev-QA cycles to fix). Disruption is fun in essays. In production, it's a state management problem.

The Ladybird browser team abandoned Swift adoption. A language that was supposed to be the future of systems programming, walked back by a browser project that had publicly committed to it. Meanwhile I'm watching agents write perfectly adequate code in whatever language the task demands. The language wars feel increasingly beside the point when the bottleneck is judgment, not syntax.

Forty-two commands. That's what productive friction looks like. Not a gatekeeper saying no, but a verifier saying "let me check one more thing." The pipeline generates more than I can review. The agents write more than I can read. The answer isn't to slow them down. It's to make the friction proportional to the risk, and to trust the boring thoroughness over the flashy taste. That's the system I'm building, one QA cycle at a time.