▌ IAN'S AI THOUGHTSTREAM ▌
Tag

#tooling

13 posts

2026·07·01 18:28 / 3 MIN

Team-Wide Agentic Harness

Most of what I've learned about running AI agents lives on my own machine and nowhere else. The Linear-management skill, the sandbox conventions, the notes about how our releases work: all of it sits in my personal setup, invisible to the rest of the team. So I'm building a team-wide agentic harness, a checked-in repository of agent config, skills, and evergreen context that everyone can share, review, and improve.

Brown bags and checked-in skills

We've been running AI brown bag sessions, informal knowledge-transfer sessions where everyone trades tips on how they actually use agents day to day. A lot of what comes out of those is concrete and shareable. One I've been showing off is a Linear-management skill that reviews our queue, checks progress against the roadmap, organizes releases, and generates release notes tailored to specific customers.

Those are easy to share because they're files. You check them in and someone else can run them.

The parts that don't check in

But a big chunk of using agents well isn't a file. It's convention.

Most of us run agents in sandboxes. The most important rule there is to scope all the work into a single directory. You give the sandbox access to the directory you're working in and nothing outside of it, save a few exceptions. That has downstream consequences: temporary files go in a tmp directory, worktrees go in a worktrees subdirectory, and none of that gets checked in.
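The scoping rule can be sketched as a path check. This is a minimal illustration, not any sandbox's actual implementation: the function names are hypothetical, and the directory spellings `tmp` and `worktrees` are assumed from the conventions described above.

```python
from pathlib import Path

# Hypothetical names: these mirror the conventions described above,
# but the exact directory spellings are assumptions.
UNTRACKED_DIRS = ("tmp", "worktrees")


def is_in_scope(root: str, candidate: str) -> bool:
    """Return True if candidate resolves to root or to a path inside it."""
    root_path = Path(root).resolve()
    candidate_path = (root_path / candidate).resolve()
    return candidate_path == root_path or root_path in candidate_path.parents


def is_untracked(root: str, candidate: str) -> bool:
    """Return True if candidate sits under a directory that should not be checked in."""
    if not is_in_scope(root, candidate):
        return False
    root_path = Path(root).resolve()
    relative = (root_path / candidate).resolve().relative_to(root_path)
    return len(relative.parts) > 0 and relative.parts[0] in UNTRACKED_DIRS
```

Resolving both paths before comparing them means a relative path such as `../secrets` is rejected even though it starts from inside the working directory.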

A plans or notes directory helps too, a loosely organized bucket of agent output artifacts. You can search and read them with something like Obsidian.

The harness

I want to go a step further and check in an entire top-level directory. I call it the harness.

The idea came from The AI-Native Startup Handbook, though really it just codified something I was already doing. I check out repos and do all my work in one top-level directory. It isn't a monorepo. It's a top-level directory that everything about the company or the larger project can reach: multiple repos, research, notes, plans, skills. Once I looked at it as a unit, a lot of it turned out to be shareable.

The other important piece is evergreen content. Descriptions of the company, the product, and procedures we do often, like how releases work and how we use Linear as a team. Those live in an evergreen docs directory so agents have a grounding point, a place to start from where they already understand the product and the value we're delivering.

Why check it in at all

The strongest argument is simple: skills are code. A skill is a set of instructions an agent executes, and any code change should be reviewed. Treating the harness as a repo means it gets a pull request, a diff, and another set of eyes before it changes how everyone's agents behave.

I've been running all of this myself so far. It works for me. The next step is handing it to the team and seeing whether conventions that live comfortably in one person's head survive contact with everyone else's.

2026·06·15 20:32 / 2 MIN

Claude Code as a DevOps Platform

Render sent me a $496 bill last month, and that was the moment I went back to running my own box. SpaceMolt served 1.3 TB of traffic in May, all of it HTTPS MCP servers and WebSocket connections, and Render's bandwidth pricing turned that into $336 of overage on top of $144 for hosting and $15 in fees. The thing that made self-hosting viable again wasn't a cheaper VPS. It was that Claude Code now does the parts I used to dread.

How I ended up on managed hosting in the first place

Last year I got bit by React2Shell, the CVE-2025-55182 pre-auth RCE in React Server Components. The damage on my end was mostly innocuous, but getting exploited at all was enough. I stopped running a long-lived VPS for personal projects and moved everything onto free or nearly-free tiers of Vercel, Cloudflare, and Fly.io.

When SpaceMolt started, Render.com was the obvious pick. Heroku-like push-to-deploy, a clean interface, the tooling you'd expect from a modern cloud service. It was great right up until the traffic grew and the bandwidth limits got tight.

What changed: the agent does the ops work

A year ago I would have built all of this by hand. Hardening, firewalls, log shipping, metrics, Docker Compose, monitoring, backups. That's a meaningful chunk of a weekend, and then it's a meaningful chunk of every future weekend.

An agent like Claude Code only needs SSH. I grabbed a $44/mo box from Hetzner with unlimited bandwidth and more RAM and disk than I'll ever use, told Claude Code I was migrating SpaceMolt off Render, and it wrote and executed a nine-phase plan to provision the machine end to end: a full deploy and rollback process, log shipping to Betterstack, and monitoring with a local Netdata instance.

I'd never heard of Netdata before this. Per-second metrics, near-zero config, a web dashboard that auto-detects services and Docker containers. It's left me impressed.

Monitoring dashboard displaying system storage metrics with line graphs showing pressure trends over time and gauge charts for disk I/O operations and utilization rates

The runbooks are the real artifact

The research, the plans, and the runbooks all live in a private git repo I can hand to the dev team. That's the part that makes this feel different from the old "SSH in and hope you remember what you did" approach. The knowledge isn't in my head or buried in shell history. It's written down, versioned, and reproducible.

The effort of running a server went from a meaningful part of my life to roughly that of a hosted service. The bill went the other direction.

2026·06·12 19:32 / 2 MIN

Sticky Notes for Claude Code

Building the new North Pole Security site, I kept hitting the same friction: reviewing a page, then typing out a punch list of fixes for Claude Code. Every item needed a page name, a location, and enough context to be actionable. So I had Claude build me a point-and-click sticky note system instead, and now I shift-click on the page, type a note, and it gets fixed. Less typing, more pointing.

What it actually does

The idea was simple. Wouldn't it be nice to leave sticky notes on the page, the way you'd flag a printed mockup with a pen? In a single prompt, Claude Code had nearly the whole thing built.

Each note captures what it needs to be useful: x/y coordinates, window size, the CSS selector under the cursor, and, because Astro emits dev-mode HTML attributes, the source filename and line number. All of that gets compiled into a server-side JSON file. Then a single skill command, /address-feedback, runs through every note with subagents.
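The note record can be sketched roughly like this. The field names, the `StickyNote` class, and the `append_note` helper are all hypothetical; the post only lists what gets captured and says it ends up in a JSON file on the server.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class StickyNote:
    # Field names are hypothetical; the post only lists what is captured.
    text: str
    x: int
    y: int
    window_width: int
    window_height: int
    css_selector: str
    source_file: str
    source_line: int


def append_note(store: Path, note: StickyNote) -> int:
    """Append a note to the JSON file and return the new note count."""
    notes = json.loads(store.read_text()) if store.exists() else []
    notes.append(asdict(note))
    store.write_text(json.dumps(notes, indent=2))
    return len(notes)
```

Keeping the source file and line next to the selector is what would let a subagent go straight to the right place in the code instead of searching for it.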

Code review interface showing yellow sticky notes with feedback comments overlaid on a dark timeline displaying 2024 and 2025 project milestones

It works amazingly well. Fixing things is much faster, but the better part is collaboration. On a screenshare, when someone has feedback I shift-click, type their note, and if there's time I let Claude fix it while we keep talking.

Building your own tools is basically free now

This is part of a larger pattern: you build your own tools to become more efficient. That used to be a hard sell, because throwaway bespoke software was expensive. Most of us still carry that old cost around in our heads.

The calculation has changed. Spinning up a one-off tool is close to free, so the question of whether it's worth automating something tips toward yes far more often than it used to.

Chart showing time spent optimizing routine tasks versus time saved over five years, organized by task frequency and optimization effort - Credit: XKCD.com

The old xkcd math still holds, but the y-axis just got a lot cheaper.

difit does the same trick for diffs

Someone showed me difit recently, and it applies the same idea to code review. Instead of typing your feedback into Claude, you open the diff in a GitHub-style UI and leave comments right on the lines. Those comments get handed back as a prompt, so Claude knows exactly where each change goes.
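The idea of turning line comments into a prompt can be sketched as follows. This is not difit's actual output format, which I haven't checked; `LineComment` and `comments_to_prompt` are hypothetical names used only to illustrate the shape.

```python
from dataclasses import dataclass


@dataclass
class LineComment:
    path: str
    line: int
    body: str


def comments_to_prompt(comments: list[LineComment]) -> str:
    """Render review comments as one prompt, ordered by file and then by line."""
    ordered = sorted(comments, key=lambda c: (c.path, c.line))
    lines = ["Address the following review comments:"]
    for comment in ordered:
        lines.append(f"- {comment.path}:{comment.line}: {comment.body}")
    return "\n".join(lines)
```

Each bullet carries the file and line, so the agent doesn't have to guess which part of the diff a comment refers to.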

difit code diff viewer showing side-by-side comparison of CommentForm.tsx file with 62 files changed, highlighting CSS class name modifications in red and green

There's even a /difit-review skill for it. I'm going to try it right after I finish typing this.

One more Claude Code tip

If you aren't running /tui fullscreen, turn it on. Claude manages its own terminal interface instead of leaning on the terminal's, which makes scrollback and mouse clicks far less buggy and makes typing smoother. Run /tui with no argument to see which renderer is active.

2026·06·10 15:19 / 2 MIN

Printable One-Pagers with Claude

I made a Claude Code skill that prints one-page reference sheets in a classic Mac OS 1 aesthetic. A /print command takes either a note or the current conversation, lays it out as black-and-white HTML, and sends it to my Brother printer through headless Chrome. The Mac OS 1 styling isn't nostalgia for its own sake. Telling an LLM "make it look like Mac OS 1" reliably produces simple, structured, highly readable layouts, and that turns out to work as well on paper as on screen.

The idea came from Manuel Odendahl's Mac OS 1 aesthetic trick. He noticed that the prompt nudges models toward clean, high-contrast interfaces instead of the usual gradient soup. The same nudge applies to printouts.

Person holding a printed technical reference sheet with frequency table and specifications for amateur radio operations

There's some irony in printing out something that looks like a Mac OS 1 window. I'm fine with it.

Building the skill

The starting prompt was loose on purpose:

make a new skill, called /print

- print to my brother printer
- use either a note or the current conversation
- try to make sure it fits on a single page, or at least minimize pages
- what's the best way to do layout? i want a good black and white layout, like mac os 1 style. would /print make html first and then print using chrome? do the best thing

Opus 4.8 ran lpstat first and confirmed the Brother printer was actually connected, which was the right instinct. Then it veered off and started writing a Python script, so it needed one correction:

python? wtf, just use html so we can print it

After that it settled on the right shape. A shell script wraps the generated HTML in some preset styles, then fires a curl request at Playwright driving Chrome, telling it to open the page and print. No PDF intermediary, no rendering surprises, just the browser doing what the browser is good at.

What it's good for

The output is genuinely useful. Notes on talking to the ISS over ham radio. A frequency table. How to braise chicken thighs. The single-page constraint forces the layout to stay honest, and the black-and-white styling means it reads fine even on a cheap laser printer.

People around the house have started finding loose sheets of paper explaining how to contact space stations and how long to sear a thigh before it goes in the oven. Nobody has asked yet, but the answer is the same skill either way.

2026·06·08 18:20 / 2 MIN

Why I'm Still on Claude Code (for now)

Claude has me locked in for now, but only loosely. I trust exactly one coding agent, and it's Claude Code, and that trust is the only thing keeping me from shopping around.

I've used it exclusively since November or December of 2025. I'm on the $200/mo Claude Max plan, and I run it at near capacity most weeks, sometimes straight into the wall.

Riding the curve

February 2026 was the good part. Things clicked, and Claude Code felt like I had hired an intern who actually finished tasks.

Then April happened. The intern I thought I'd hired became intoxicated, forgetful, and a little belligerent. Same plan, same tools, much worse vibes. I kept using it anyway, partly out of stubbornness and partly because I'd already learned its tells.

I haven't spent real time in Claude Desktop, Claude Cowork, or Claude Design. They read as limited versions of the same thing. The CLI still reigns, sandboxed of course.

The contenders are real

This isn't a "nothing else is good" post. The market is loud right now.

  • Qwen 3.6 reportedly feels great for coding, and there's an open-weights line you can self-host.
  • GPT and Codex come up for Rust, which I'll probably be writing soon even though I'm not now.
  • GLM gets named for user interface work.
  • Pi keeps coming up as a sharp coding harness. It's deliberately minimal: no sub-agents, no plan mode, just a small core you extend with TypeScript and skills.

Codex in particular gets described as a refreshing kind of pedantic hardness, which sounds either great or exhausting depending on the day.

Why I'm still here

Trust, mostly. I know the weird edges of Claude Code and Opus. I have a gut feeling for when it'll reach for a skill (Superpowers, usually) and when it'll just do the thing I asked.

Standardization is the other half. My team at work is on Claude Code too, and I've mostly gotten everyone pointed the same direction. That means we can share skills without a translation layer.

Switching costs me that gut feel and that shared setup, all at once.

What I need is time. When I'm not blasting out a feature on a deadline, I'll take a breather and put Pi, Qwen, and Codex through real work instead of secondhand impressions. Until then, Claude Code has me in its tentacles.

2026·06·05 17:30 / 2 MIN

Personal AI Assistants Break in Teams

If you're building a personal AI assistant, build it for teams too. A week of running NanoClaw as the "head of growth" for SpaceMolt has made one thing clear: the tool is built for one human talking to one bot, and the moment a team shares it, the seams show.

We named our NanoClaw bot Molty and told it its job is to grow SpaceMolt, our MMORPG played by AI agents. Discord is how we talk to it. That integration needs constant fixing.

What's hooked up

Molty's job is wired together from a handful of channels and schedules:

  • DMs with me are owner level.
  • Anyone in our #dev-team channel can chat with it, and it starts a thread per conversation. I modified it to rename the thread to something relevant instead of a timestamp.
  • Hourly cleanup and review tasks.
  • Three research and deep-dive sessions a day, whatever it decides to work on.
  • A morning brief at 7am and a debrief at 5pm.
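One way to make missed jobs visible is to write the expected schedule down as data and compare it with what actually fired. The sketch below is not NanoClaw's configuration format. The brief times come from the list above, but the research-session hours are placeholders, since the list doesn't say when those run.

```python
from datetime import time

# Brief times come from the list above. The research hours are placeholders,
# since the list does not say when those sessions run.
DAILY_JOBS = {
    "morning brief": [time(7, 0)],
    "debrief": [time(17, 0)],
    "research session": [time(10, 0), time(13, 0), time(16, 0)],  # assumed hours
}


def jobs_due(hour: int) -> list[str]:
    """Return the jobs expected to fire at the top of the given hour (0-23)."""
    due = ["hourly cleanup and review"]  # runs every hour
    for name, times in DAILY_JOBS.items():
        if any(t.hour == hour and t.minute == 0 for t in times):
            due.append(name)
    return due
```

Calling `jobs_due(7)` returns the hourly task plus the morning brief, which gives a concrete list to check against the bot's logs.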

On paper that's a reasonable junior employee. In practice it's painfully unreliable.

The failure modes

Molty responds in DMs, in threads, and in the dev channel, with no consistency about which. It misses scheduled tasks. It sends me status updates in DM that belong in the channel, then pastes walls of text to the entire channel that belonged in a DM. Scheduled briefs don't always fire.

The worst part is the debugging. Every time I sit down with Claude to figure out what happened, Claude produces a different explanation. I can't tell whether the bug lives in NanoClaw, in Discord, in Claude, or somewhere else. It's a black box I feed prompts into and hope.

It feels like memory

Strip away the specifics and these all look like memory problems. Molty forgets to read Discord replies. It forgets its own notes. It forgets the separate memory system I built it, Mnemon. Sometimes CLAUDE.md seems to get ignored entirely, as if the instructions never loaded.

A team multiplies this. One person's DM context, another person's thread, the scheduled jobs running with no human in the loop. Each one is a separate thread of state the assistant has to hold, and holding state across all of them at once is exactly where it falls down.

Is this temporary?

Part of me wants to file this under early-days. A couple years ago we laughed at image models drawing hands with two thumbs, and at LLMs that couldn't add. Those got fixed. Maybe shared, multi-context reliability is the next thing that quietly stops being a problem.

The other part of me is tired of debugging a black box and is ready to write my own assistant, where at least the state lives somewhere I can read it.

2026·05·29 16:11 / 2 MIN

Giving Coding Agents Eyes

Coding agents that produce visual output need a way to look at what they made. For web work that means headless Chrome, and headless Chrome is genuinely painful to run from inside a sandboxed agent.

Chromium and Firefox both rely on Mach-O quirks, macOS entitlements, and Crashpad behavior that don't survive most sandboxes. I run my agents inside nono.sh profiles per project, and Chrome under that setup is a non-starter.

The workaround

Playwright runs fine outside the sandbox. So I run the Playwright MCP server there on a high port, and Claude's instructions tell it to always talk to that server:

$ npx @playwright/mcp@latest --headless --isolated --browser chrome --port 8931

The sandbox just needs to reach localhost:8931 and the visual-review loop works. Claude renders the local service, takes a screenshot, looks at it, iterates.
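A quick way to confirm the sandbox can actually reach that port is a plain TCP connection attempt. This is a minimal sketch using only the standard library; `port_is_reachable` is a hypothetical helper, and it only proves that something is listening, not that it is the Playwright MCP server.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run inside the sandbox, `port_is_reachable("localhost", 8931)` should come back `True` once the server from the command above is up.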

That mostly works. What it does not solve: stale processes, hanging Chrome instances, zombies. Every so often Chrome spins out and eats all 64 GB of RAM on my M4 MacBook Pro before I notice.

Lighter options

There has to be something simpler than babysitting a browser. Two things caught my eye recently.

Webwright from Microsoft Research gives the model a terminal and a workspace, and lets it write Playwright code that launches, inspects, and discards browser sessions. The output is a reusable script, not a chat transcript. It scores 60.1% on Odysseys against base GPT-5.4's 33.5%, which is a real jump.

obra/superpowers-chrome goes the other direction: a Claude Code plugin that drives Chrome directly via the DevTools Protocol, zero dependencies, no Playwright in the middle.

When you actually need real Chrome

Advanced bot fingerprinting is the case for keeping a full browser around. If the task is logging into a hostile site or completing a real-world flow, real Chrome with a real profile is the only thing that works.

But most of my use is smaller: render a local dev server, screenshot it, ask Claude if the layout looks right. For that, a 64 GB RAM-eating Chromium feels like the wrong shape of tool. I suspect this gets cleanly solved within a year, probably by something CDP-direct and disposable rather than a long-lived browser process I have to nanny.

2026·05·28 17:40 / 1 MIN

Ghost Pepper Wins for Dictation

I was wrong about Aqua Voice being the ceiling for fast dictation. Ghost Pepper is fantastic, and my Aqua subscription is cancelled. Ghost Pepper is free, MIT-licensed, 100% local (WhisperKit plus a small Qwen model for cleanup), and astoundingly fast on Apple Silicon.

The measure that matters is developer-speak. Saying "tilde slash dev" should produce ~/dev. Saying "eich mack or jay double-you tee" should produce "HMAC or JWT". Ghost Pepper gets both right, every time.

Ghost Pepper Settings window showing Models tab with language auto-detect, cleanup model selection, and list of available speech recognition runtime models with file sizes

Key bindings

The defaults ship as hold-Control to talk, but my muscle memory is from Aqua: right Option as push-to-talk. Reusing those keys worked fine. Aqua's double-tap-to-go-hands-free mode is the one feature I miss, and Ghost Pepper doesn't have it yet, so Shift+RightOpt is standing in. On my Keychron K2 the M1 macro key handles it nicely. Might take a swing at adding the double-tap toggle upstream.

The cleanup model is a little too honest

Aqua quietly filtered out coughs, keyboard noise, and other non-speech. Ghost Pepper does not. [keyboard clacking] and [snorts] have both shown up in my output, courtesy of Whisper's annotation habit leaking through the cleanup pass. Guess I'll have to be a little more civilized at the desk.
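Until the cleanup pass handles this, a post-processing filter could strip the bracketed annotations. The sketch below is not part of Ghost Pepper. It assumes annotations always appear as a single bracketed span, as in the two examples above.

```python
import re

# Matches a bracketed, non-speech annotation such as "[keyboard clacking]".
ANNOTATION = re.compile(r"\[[^\[\]]*\]")


def strip_annotations(transcript: str) -> str:
    """Remove bracketed annotations and collapse the whitespace they leave behind."""
    without_tags = ANNOTATION.sub(" ", transcript)
    return re.sub(r"\s+", " ", without_tags).strip()
```

The obvious tradeoff is that this also removes brackets you dictated on purpose, such as an array index, so it suits prose better than code.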

2026·05·27 17:24 / 1 MIN

Aqua Voice vs Ghost Pepper

Aqua Voice has been my daily driver for dictation for about a year, and it's the rare subscription that earns its keep. Eight dollars a month, fast, and genuinely accurate. The feature that sold me is "developer mode": say "the foo bar function" and it writes fooBar(). Say "tilde slash dev slash foo" and it writes ~/dev/foo. Built-in macOS and iOS dictation feels embarrassing by comparison.

AQUA app interface showing Dictionary feature with custom word entries like CodeRabbit, IP, and auth listed with remove options
Aqua typing assistant dashboard showing user "Ian" with 68,188 total words typed, 19 hours saved, and Level 6 Great Lake achievement status

68,188 words through it so far. The custom dictionary handles the proper nouns that would otherwise be a nightmare (CodeRabbit, auth, IP, the usual roster of jargon).

The one thing I don't love

Audio leaves my machine. How long is it kept? Where is it stored? The product keeps a history, and I don't want a history. Purely ephemeral recordings would be the ideal: capture, transcribe, forget.

A local-first contender

Ghost Pepper just landed on my radar. 100% local transcription, which solves the privacy question by construction. I haven't tried it yet, but it's next on the list.

The barrier to building this kind of tool is lower than it's ever been. Whisper is good, the wrapper patterns are well understood, and a solo developer can ship a credible local dictation app in a weekend. The hard part is the long tail: the edge cases, the latency under load, the developer-mode tricks, the dictionary, the stability when you're three hours into a workday and have forgotten the app exists. That long tail is what $8/month buys you. We'll see if Ghost Pepper closes the gap.

2026·05·21 17:46 / 2 MIN

Building a Second Brain with Obsidian and Claude

Obsidian sat on my "probably cult, probably skip" list for years. I finally tried it as a plain Markdown organizer and it's good at exactly that: hundreds of files, fast search, tags that actually work. The real unlock (sorry, the real reason to bother) is that Claude Code, running on the same machine and reachable over Tailscale, can read and write the whole vault. Searching got replaced by conversations with my notes.

Getting 15 years of notes in

The vault is around 450 notes pulled from three places:

  • gws, an unofficial Google Workspace CLI, for old Google Docs
  • Obsidian's Apple Notes importer for a couple dozen
  • Obsidian's Notion importer for many more

Bases, Obsidian's lightweight database view over frontmatter, turned out to be the surprise. My cooking recipes live in one folder with tags, and Bases gives me a filterable table on top of the same Markdown files. No separate app, no lock-in.
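The idea behind a view over frontmatter can be sketched in a few lines. This is not how Obsidian or Bases parses notes. Real frontmatter is YAML, and this hypothetical parser only handles flat `key: value` lines with tags written inline as `[a, b]`.

```python
def parse_frontmatter(markdown: str) -> dict[str, str]:
    """Parse flat 'key: value' pairs from a leading '---' frontmatter block."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter, so treat the block as absent


def has_tag(markdown: str, tag: str) -> bool:
    """Return True if the note's 'tags' field, written as [a, b], includes tag."""
    raw = parse_frontmatter(markdown).get("tags", "")
    tags = [t.strip() for t in raw.strip("[]").split(",")]
    return tag in tags
```

Filtering a folder of recipes then comes down to reading each file and keeping the ones where `has_tag(text, "dinner")` is true, with the Markdown files themselves left untouched.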

Claude Code as the interface

Claude Code stays open on my desktop, reachable from my laptop or phone via SSH over Tailscale. It has read/write access to the vault, so I can ask it to summarize old notes, cross-reference things, or just file something new in the right place.

Two browser tabs open side-by-side displaying project documentation: left tab shows Nethack Strategy notes with a checklist of items, right tab shows Beehiv API documentation with pagination and endpoint details

For research, I'll hand it a prompt like:

research what i need to do and it would cost to get a level 2 EV charger installed. ultrathink, be exhaustive, use subagents, do adversarial passes to test hypotheses and assumptions. save final report to Projects/Level 2 Charger

It spawns subagents, argues with itself, and drops a Markdown report in the right folder. I read it later in Obsidian on my phone.

Why not just Claude Desktop

Most people would look at this and say it's Claude Desktop, but nerdier and with extra work. A few things make it worth the setup:

  • Full Claude Code, not the chat product, with Exa wired in for search that reaches pages Claude can't normally crawl, and ScrapingBee for pages that are even harder to read (though, yes, you could wire those into Claude Desktop too)
  • Artifacts land as real files in real folders, not buried in a chat sidebar
  • Obsidian sync means the same notes are on desktop and mobile, and the focus stays on the content instead of the conversation
  • Nothing is Claude-specific. Swap in another coding agent tomorrow and the vault still works

The one annoying part

Pasting images over SSH is awkward. Apple Remote Desktop helps when I really need to drop a screenshot into a note, but the ergonomics are nobody's idea of fun. Everything else has been steady for weeks now, and the "conversations with my notes" pattern has quietly replaced most of what I used to do in a browser.