The Machine That Reads Itself

Brass ouroboros on a workshop bench with tools

I spent most of today watching my content pipeline eat its own tail. Not in a bad way. In the way where you build a thing to process information, and then the thing starts processing itself, and you realize the architecture is either elegant or deeply confused.

The distill project has an intake pipeline that pulls content from a dozen sources: RSS feeds, browser history, Substack, LinkedIn, Twitter, Reddit, YouTube, Gmail newsletters. Each source has its own parser that normalizes everything into a canonical ContentItem. Today I ran the full pipeline and watched it fire off something like 25 LLM calls in rapid succession, all doing classification and entity extraction. Each call lasted a few seconds. "Classify each content item below. Return ONLY valid JSON." Over and over, batched across the corpus.

Here's what's funny: most of those content items are articles about building exactly this kind of system. I'm reading about LLM pipelines, feeding those reads into an LLM pipeline, which then synthesizes a journal entry about LLM pipelines. At some point the snake swallows enough of its own tail that you have to ask whether the output is actually useful or just recursive navel-gazing.

I think it's useful. The classification step (intelligence.py) tags each item as tutorial, opinion, news, or reference, then extracts named entities: projects, people, technologies. That metadata is what makes the blog synthesis step possible later. Without it, the weekly post would just be "here's everything I read" with no structure. With it, the synthesizer can find threads across days and sources. A tutorial about embeddings from Monday connects to a reference doc about pgvector from Wednesday connects to the session where I actually wired up PgvectorStore in store.py.
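The "Return ONLY valid JSON" instruction is doing real work there, because models love to wrap JSON in code fences anyway. A defensive parse step, sketched here with made-up prompt text and response shape (the real intelligence.py surely differs), looks something like:

```python
import json

# Hypothetical prompt and schema, for illustration only.
CLASSIFY_PROMPT = (
    "Classify each content item below. Return ONLY valid JSON: "
    '[{"id": 1, "kind": "tutorial|opinion|news|reference", '
    '"entities": {"projects": [], "people": [], "technologies": []}}]'
)

def parse_classification(raw: str) -> list[dict]:
    """Parse the model's JSON reply, tolerating stray markdown fences."""
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)
```

That metadata, one kind tag plus three entity lists per item, is the scaffolding the synthesizer threads on later.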

Which brings me to one of those threads that emerged today, the kind that only surfaces when you let the machine chew through enough context.

Three hourglasses with different colored sand on a desk

Thematic Posts and the Twitter Problem

The pipeline generated a thematic deep-dive on Twitter today. Not Twitter the company, Twitter as a data source. The TwitterParser in src/intake/parsers/twitter.py is one of the more convoluted parsers because X's data export format is genuinely hostile. They give you JavaScript files, not JSON. The parser has to strip the window.YTD.tweet.part0 = prefix and then parse what's left. On top of that, there's a nitter RSS fallback for when you want live-ish data without the export dance.
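To make the hostility concrete: the export file is a JavaScript assignment, so you can't just json.loads it. The core move is to drop everything up to the assignment and parse the rest. A minimal sketch (the real TwitterParser handles multiple part files and more edge cases):

```python
import json

def parse_tweet_export(js_text: str) -> list[dict]:
    """X exports tweets as JavaScript, not JSON:

        window.YTD.tweet.part0 = [ {...}, {...} ]

    Split at the first '=' to discard the assignment prefix,
    then parse the remaining array as ordinary JSON.
    """
    _, _, payload = js_text.partition("=")
    return json.loads(payload)

sample = 'window.YTD.tweet.part0 = [{"tweet": {"id_str": "1", "full_text": "hi"}}]'
tweets = parse_tweet_export(sample)
```

Splitting at the first '=' rather than stripping a hardcoded prefix means part1, part2, and so on parse the same way.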

The thematic post took about 54 seconds to generate, which made it the longest single LLM call of the day. Everything else was under 6 seconds. That ratio feels right: the batch classification calls are cheap and parallelizable, the synthesis call is expensive but only happens once. If I were optimizing, the classification batches are the obvious target. Twenty-five calls at 3 seconds each is 75 seconds of serial LLM time. Batch them into fewer, larger calls and you'd cut that significantly. But the current batching in intelligence.py already groups items, so the wins might be marginal.
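The batching arithmetic is just chunking. A toy version, not what intelligence.py actually does, but the shape of the optimization:

```python
def batched(items: list, size: int) -> list[list]:
    """Split items into chunks of at most `size`, one LLM call per chunk."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Back-of-envelope: 25 items, one call each at ~3 s = ~75 s serial.
# Five items per call means 5 calls instead of 25.
calls = batched(list(range(25)), 5)
```

The tradeoff is the usual one: bigger batches mean fewer round trips but longer prompts and a bigger blast radius when the model returns malformed JSON for one item in the batch.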

The performance numbers matter here because they're the difference between a tool you use and a tool you think about using. Three minutes end-to-end means I can run the pipeline, look at the output, and still have momentum to act on what it surfaces. Ten minutes and I'm making coffee. Thirty minutes and I've moved on to something else entirely.

Mechanical loom weaving multiple threads into fabric

When the Pipeline Is the Product

Something I keep circling back to: distill started as a way to generate journal entries from coding sessions. Then it grew an intake pipeline for reading. Then a blog synthesizer. Then publishers for Ghost and Postiz. Now it has a web dashboard with Hono and React. The scope creep is real, but it's also the point. The product is the pipeline itself.

Most developer tools treat "what did I do today" and "what did I read today" as separate concerns. Separate apps, separate workflows, separate outputs. Distill's bet is that they're the same concern. The session where I debugged ContentType(None) crashing in PgvectorStore._row_to_item is content. The article about embeddings that prompted me to check my null handling is content. They belong in the same stream because they inform each other.

The context.py module in the intake package does exactly this partitioning: it takes the full list of ContentItem objects and splits them into session_items, seed_items, and content_items. Then the unified prompt in prompts.py blends all three into a single synthesis input. The LLM sees everything together. That's the whole trick.
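The partitioning itself is simple enough to sketch. This is my guess at the shape, not context.py's actual code; the item type and its kind field are hypothetical stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Item:
    # Hypothetical stand-in for distill's ContentItem.
    kind: str   # "session", "seed", or anything else ("content")
    title: str

def partition(items: list[Item]) -> tuple[list[Item], list[Item], list[Item]]:
    """Split the corpus the way context.py does: coding sessions,
    captured seed ideas, and everything else I read."""
    session_items = [i for i in items if i.kind == "session"]
    seed_items = [i for i in items if i.kind == "seed"]
    content_items = [i for i in items if i.kind not in ("session", "seed")]
    return session_items, seed_items, content_items
```

Three lists in, one blended prompt out: the LLM gets what I did, what I meant to write about, and what I read, all in one context window.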

Whether this actually produces better writing than just sitting down and typing is a question I'm not ready to answer. Today the pipeline processed 92 RSS feeds, classified a few dozen items, extracted entities, generated a thematic post, and updated the unified memory. All of that happened in under three minutes. Writing this paragraph took longer. But the paragraph is better because I had the structured output to look at while writing it. The pipeline isn't replacing the thinking. It's scaffolding for the thinking.

Tomorrow I want to look at the seed-to-blog handoff. There's a known issue where intake marks seeds as used before the blog step runs, which means the blog synthesizer never sees them. The workaround is to reset seeds and run blog separately, but that's the kind of thing that quietly becomes permanent if you don't fix it. And fixing it means going back into the same codebase the pipeline is already reading about, which means the snake gets another inch of tail.