Ninety Sessions and One Open Question

Layered nautical logbooks on a chart table, single brass lamp

The Day I Stopped Building and Started Operating

Today I ran over 90 agent sessions across TroopX. CEO agents, CTO agents, dev-QA pairs, content writers, outreach agents, post-workflow analysts. I did not write most of the code. I did not draft most of the content. I kicked off workflows and watched blackboard entries fill up, heartbeats pulse, signals fire.

This is the operational mode shift I've been circling for a week. Content and outreach workflows are now fully delegated to agent squads. Engineering bug fixes flow through dev-QA pairs that register, check the blackboard, do the work, signal done. A 25-minute session fixed ack_message returning 404s. A 17-minute session patched the dashboard workflow. A content squad spent 73 minutes producing a tutorial, a quick-start guide, and outreach posts, all without me touching a file. The coordination overhead keeps dropping because four primitives handle almost everything: register, heartbeat, check blackboard, signal done.
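Those four primitives are small enough to sketch. This is an illustrative shape only, with hypothetical names and an in-memory dict standing in for the real blackboard, not the actual TroopX API:

```python
import time
import uuid

class AgentCoordinator:
    """Sketch of the four coordination primitives:
    register, heartbeat, check blackboard, signal done.
    All names here are illustrative."""

    def __init__(self, store):
        self.store = store        # shared dict standing in for the blackboard
        self.agent_id = None

    def register(self, role):
        self.agent_id = f"{role}-{uuid.uuid4().hex[:8]}"
        self.store.setdefault("agents", {})[self.agent_id] = {
            "role": role, "last_seen": time.time()}
        return self.agent_id

    def heartbeat(self):
        # Keepalive plus pending-message count in one call.
        self.store["agents"][self.agent_id]["last_seen"] = time.time()
        inbox = self.store.get("inbox", {}).get(self.agent_id, [])
        return {"pending_messages": len(inbox)}

    def check_blackboard(self, topic):
        return self.store.get("blackboard", {}).get(topic, [])

    def signal_done(self, topic, summary):
        self.store.setdefault("blackboard", {}).setdefault(topic, []).append(
            {"agent": self.agent_id, "event": "done", "summary": summary})
```

The point of the shape: every primitive reads or writes shared state, so no primitive requires a conversation with another agent.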

The heartbeat mechanism turned out to do double duty. It's both a keepalive tracker and a pending-message notification system. Agents poll their heartbeat, get back a count of waiting messages, decide whether to check their inbox or keep working. The blackboard pattern is even better: shared state means agents arrive at conversations already knowing what happened. Dramatically shorter inter-agent chats. No more "can you summarize what you've done so far?" back-and-forth.
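The agent-side loop that makes the double duty work can be sketched like this. The stub classes are assumptions standing in for the real task and coordinator objects:

```python
import time

class StubTask:
    """Hypothetical stand-in for a unit of agent work."""
    def __init__(self, steps):
        self.remaining = steps
    def done(self):
        return self.remaining <= 0
    def step(self):
        self.remaining -= 1

class StubCoordinator:
    """Returns scripted pending-message counts; the real one polls a server."""
    def __init__(self, counts):
        self.counts = list(counts)
    def heartbeat(self):
        return {"pending_messages": self.counts.pop(0) if self.counts else 0}

def work_loop(coordinator, drain_inbox, task, poll_interval=0.0):
    """One heartbeat call refreshes liveness AND reports waiting messages,
    so the agent decides per-iteration: check inbox or keep working."""
    drained = 0
    while not task.done():
        status = coordinator.heartbeat()
        if status["pending_messages"] > 0:
            drain_inbox()
            drained += 1
        task.step()
        time.sleep(poll_interval)
    return drained
```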

Stone arch bridge over gorge with eroded foundation beside it

Whale Falls and Platform Risk

Andrew Nesbitt wrote today about whale falls: what happens when a large open source project dies. The ecosystem that depended on it adapts or collapses. I read it between workflow launches and couldn't stop thinking about it in agent terms. My whole system depends on Claude CLI subprocess calls. If Anthropic changes the API, changes pricing, changes model behavior: whale fall.

Ed Zitron published his Hater's Guide to Anthropic today too, which is a fun thing to read while you're literally building your entire infrastructure on their models. I disagree with most of his framing, but the dependency risk is real. The saving grace: the coordination layer (Temporal, the blackboard, the message router) is mine. The LLM is a swappable subprocess call. That's the theory. In practice, every prompt is tuned to Claude's quirks. Swapping would be a rewrite, not a refactor.
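The "swappable subprocess" theory looks roughly like this in code. The interface is the part that is genuinely mine; the CLI invocation and its flags are assumptions for illustration, and the prompt-tuning problem lives entirely inside whatever string gets passed in:

```python
import subprocess
from typing import Protocol

class LLMBackend(Protocol):
    """The coordination layer depends only on this one method,
    which is what makes the provider swappable in theory."""
    def complete(self, prompt: str) -> str: ...

class ClaudeCLIBackend:
    """Illustrative wrapper around a CLI subprocess call;
    the exact invocation is an assumption, not a verified command line."""
    def complete(self, prompt: str) -> str:
        result = subprocess.run(
            ["claude", "-p", prompt],  # hypothetical invocation
            capture_output=True, text=True, check=True)
        return result.stdout.strip()

class EchoBackend:
    """Trivial substitute backend, to show the swap point."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def run_step(backend: LLMBackend, prompt: str) -> str:
    # The workflow code never names a provider.
    return backend.complete(prompt)
```

The swap point is clean; the problem is that the prompt strings themselves encode one model's quirks, which is why swapping is a rewrite in practice.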

Karpathy's piece on "Claws" landed today as well. His framing of AI agents as extensions of human capability maps exactly to what I'm seeing. My agents don't make decisions. They execute workflows I designed, write to blackboards I read, produce content I review. The autonomy is in execution, not judgment. Ninety sessions and I still approved every merge.

Detailed propagation notes on a bench, empty garden plot beyond

The Learning Loop Problem

The most interesting thing I shipped today isn't code. It's a habit.

After every major workflow completes, a post-workflow analyst agent reads the blackboard entries, reads the signals, and extracts per-participant learnings. Then a reflection agent picks up those learnings and updates working memory. I ran dozens of these analysis-reflection pairs today. The system generates its own retrospectives.
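The analysis-reflection pair reduces to a two-stage fold over blackboard entries. In reality both stages are LLM sessions; this sketch, with assumed entry fields, shows only the data flow:

```python
def analyze_workflow(blackboard_entries):
    """Post-workflow analyst stage: group entries into
    per-participant learnings. Field names are assumptions."""
    learnings = {}
    for entry in blackboard_entries:
        learnings.setdefault(entry["agent"], []).append(entry["summary"])
    return learnings

def reflect(learnings, memory):
    """Reflection stage: fold the extracted learnings
    into each agent's working memory."""
    for agent, notes in learnings.items():
        memory.setdefault(agent, []).extend(notes)
    return memory
```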

Here's what bothers me. The agents remember. They write to memory files, store learnings via memory_remember, update skill documentation. But do they improve? The next time a dev agent spins up, does it behave differently because of what the reflection agent wrote last time?

I don't think so. Not yet.

Alex Kladov's post on wrapping code comments made me think about this distinction in miniature. Comments describe intent. Code describes behavior. My agents have plenty of documented intent sitting in markdown files and JSON stores. The behavioral change is harder to verify. An agent reads its memory at startup, sure. But "reading a file" and "changing behavior" are different things. I can write "always run tests before signaling done" in a memory file. Whether the agent actually does it depends on how that memory gets woven into the active prompt, how it competes with other instructions, whether the context window even has room for it.

This is my open question for the week: how do you close the loop between agents that remember and agents that actually improve? The post-workflow analysis extracts the right insights. The reflection sessions write them down in the right places. But the prompt that runs tomorrow needs to be shaped by what was learned today, not just informed by it. Right now that shaping is manual. I read the analysis, I update the skill files, I tweak the workflow definitions.
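The manual shaping step I keep doing by hand is, mechanically, a budget problem: learned memories compete with base instructions for context space. A minimal sketch of that competition, assuming a character budget and a newest-first priority:

```python
def build_prompt(base_instructions, memories, budget_chars=2000):
    """Sketch of memory-to-prompt shaping: keep the base instructions,
    then admit learnings newest-first until the budget runs out.
    The budget and ordering policy are assumptions, not the real system."""
    parts = [base_instructions]
    used = len(base_instructions)
    for note in reversed(memories):  # most recent learning first
        if used + len(note) > budget_chars:
            break                    # older learnings silently fall off
        parts.append(note)
        used += len(note)
    return "\n".join(parts)
```

Even this toy version makes the failure mode visible: "always run tests before signaling done" only changes behavior if it survives the cut.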

The agents are the hands. I'm still the nervous system. Ninety sessions today, and one question I can't delegate to any of them.