IAN'S /AI/ THOUGHTSTREAM

Notes, links, and half-formed ideas from Ian Langworth. Short-form, openly AI-assisted — sketched by Ian, formatted by AI. For long-form hand-written posts, see blog.langworth.com.

27 entries since 2026 · last 2026-07-02
2026·07·02 18:10 / 3 MIN

Giving Your Agent Eyes with Game Boy Hacking

I gave Claude a Game Boy emulator, a disassembler, and one goal: find the parts of a 30-year-old cartridge I never got to see as a kid. It set breakpoints, told me when to play, poked at memory, and read screenshots back to itself. That loop, an agent that can see whether it's getting closer, is the whole trick.

The 90s version of this problem

I grew up with an original Game Boy and later a Game Boy Color. Console gaming back then was a closed world. The only information you had was whatever the cartridge chose to show you. Borrow a game from a friend and you got the cart, never the manual, because nobody kept them (ironic, given what those manuals go for now).

There's a specific memory here. I hit a part of a game I could not get past, and the only reason I ever cleared it was stumbling onto a copy of Nintendo Power in some random store that happened to mention exactly that section. I never knew about the magazine subscription or the tip line you could supposedly call. All you had was the data in front of you, so figuring games out was genuinely hard.

The actual question

I had a Game Genie growing up, but that was mostly infinite lives. Not interesting. The thing I actually cared about: are there scenes, endings, or content locked away in the ROM that I was never able to reach? What secret stuff is sitting in there unrendered?

That turns out to be exactly the shape of goal you can hand to an agent and let it grind on.

Three tools

The setup is three pieces:

  • Gearboy, an extremely detailed Game Boy and Game Boy Color emulator built on imgui. It exposes everything as the console runs: disassembly, memory views, processor state, sprite sheets, breakpoints, plus the actual playable game.
  • GhidraBoy, a Game Boy disassembly toolkit for Ghidra.
  • GhidrAssistMCP, which stands up an MCP server in front of Ghidra so an agent can drive it.
Gearboy emulator running Radar Mission with debugger windows open showing memory editor, disassembler, processor state, symbols, and breakpoints

Wire those together and Claude can disassemble, investigate, and hunt for exploits in old carts. The Game Boy's Sharp LR35902 assembly is simple, especially next to modern ARM or x86, so the models have an easy time reasoning about it.

Working with Claude on it

Claude did a solid job understanding subroutines and what they were for by inspecting memory, taking screenshots, and comparing those screenshots over time. Finding straight-up cheats was hit or miss, but that was never the point.

The working rhythm was genuinely fun. Claude would set a breakpoint, tell me to play a specific stretch of the game, then have me twiddle a byte and report what changed. Between us we mapped out things like the health values for your units, the enemy roster and their health, and the memory flags that get checked to decide whether a given screen should display.
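
The "twiddle a byte and report what changed" step is essentially a snapshot diff. Here is a minimal sketch, assuming you can dump the same region of memory before and after an in-game event. The function names and sample bytes are made up for illustration, and the only hardware fact used is that Game Boy work RAM starts at 0xC000. This is not the tooling Claude drove, just the shape of the comparison.

```python
from typing import Dict, List, Tuple

WRAM_BASE = 0xC000  # Game Boy work RAM begins at 0xC000


def changed_bytes(before: bytes, after: bytes, base: int = WRAM_BASE) -> Dict[int, Tuple[int, int]]:
    """Map each address whose byte changed to its (old, new) values."""
    if len(before) != len(after):
        raise ValueError("snapshots must cover the same address range")
    return {
        base + offset: (old, new)
        for offset, (old, new) in enumerate(zip(before, after))
        if old != new
    }


def decreased_by(before: bytes, after: bytes, amount: int, base: int = WRAM_BASE) -> List[int]:
    """Addresses whose value dropped by exactly `amount`, e.g. after taking a hit."""
    return [
        addr
        for addr, (old, new) in changed_bytes(before, after, base).items()
        if old - new == amount
    ]


# Two made-up 8-byte snapshots: one value drops by 3, an unrelated counter ticks up.
snap_before = bytes([0x00, 0x10, 0x0A, 0xFF, 0x00, 0x00, 0x07, 0x00])
snap_after = bytes([0x00, 0x11, 0x07, 0xFF, 0x00, 0x00, 0x07, 0x00])

diff = changed_bytes(snap_before, snap_after)
candidates = decreased_by(snap_before, snap_after, amount=3)
print([hex(addr) for addr in candidates])
```

Repeating the diff across several events and intersecting the candidate lists is what narrows "something changed" down to "this byte is the health value".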

Terminal screenshot displaying technical instructions for achieving an ADMIRAL rank with score 999999 in a video game, including memory addresses and procedural steps

Give your agents eyes

I've said this before and the Game Boy just makes it concrete. Whether it's headless Chrome or an emulator with a full debugger attached, the thing that matters is the feedback loop. Give an agent a way to see whether it's achieving its goal, then let it spin. That's when it starts doing surprising things.

2026·07·01 18:28 / 3 MIN

Team-Wide Agentic Harness

Most of what I've learned about running AI agents lives on my own machine and nowhere else. The Linear-management skill, the sandbox conventions, the notes about how our releases work: all of it sits in my personal setup, invisible to the rest of the team. So I'm building a team-wide agentic harness, a checked-in repository of agent config, skills, and evergreen context that everyone can share, review, and improve.

Brown bags and checked-in skills

We've been running AI brown bag sessions, informal knowledge-transfer where everyone trades tips on how they actually use agents day to day. A lot of what comes out of those is concrete and shareable. I've been showing off skills like a Linear-management skill that reviews our queue, checks progress against the roadmap, organizes releases, and generates release notes tailored to specific customers.

Those are easy to share because they're files. You check them in and someone else can run them.

The parts that don't check in

But a big chunk of using agents well isn't a file. It's convention.

Most of us run agents in sandboxes. The most important rule there is to scope all the work into a single directory. You give the sandbox access to the directory you're working in and nothing outside of it, save a few exceptions. That has downstream consequences: temporary files go in a tmp directory, worktrees go in a worktrees subdirectory, and none of that gets checked in.

A plans or notes directory helps too, a loosely organized bucket of agent output artifacts. You can search and read them with something like Obsidian.

The harness

I want to go a step further and check in an entire top-level directory. I call it the harness.

The idea came from The AI-Native Startup Handbook, though really it just codified something I was already doing. I check out repos and do all my work in one top-level directory. It isn't a monorepo. It's a top-level directory that everything about the company or the larger project can reach: multiple repos, research, notes, plans, skills. Once I looked at it as a unit, a lot of it turned out to be shareable.

The other important piece is evergreen content. Descriptions of the company, the product, and procedures we do often, like how releases work and how we use Linear as a team. Those live in an evergreen docs directory so agents have a grounding point, a place to start from where they already understand the product and the value we're delivering.
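
To make the layout concrete, here is a minimal sketch that scaffolds such a directory. Every directory name is an assumption, as is the choice to track notes while ignoring tmp and worktrees; the post only says that scratch space stays out of version control and that evergreen docs and skills are shared. The containment check mirrors the sandbox rule of keeping work inside one root.

```python
from pathlib import Path
import tempfile

# Hypothetical layout; every directory name here is an assumption.
TRACKED_DIRS = ["skills", "docs/evergreen", "notes"]
UNTRACKED_DIRS = ["tmp", "worktrees"]


def scaffold_harness(root: Path) -> Path:
    """Create the harness directories and a .gitignore for scratch space."""
    for name in TRACKED_DIRS + UNTRACKED_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    gitignore = root / ".gitignore"
    gitignore.write_text("".join(f"/{name}/\n" for name in UNTRACKED_DIRS))
    return gitignore


def is_inside(root: Path, candidate: Path) -> bool:
    """True if `candidate` resolves to a location under `root`."""
    try:
        candidate.resolve().relative_to(root.resolve())
        return True
    except ValueError:
        return False


with tempfile.TemporaryDirectory() as scratch:
    harness = Path(scratch) / "harness"
    ignore_file = scaffold_harness(harness)
    ignored = ignore_file.read_text().split()
    inside = is_inside(harness, harness / "worktrees" / "feature-x")
    outside = is_inside(harness, harness / ".." / "elsewhere")
```

A real sandbox enforces the boundary at the OS or container level. The `is_inside` helper only illustrates the rule an agent's file operations are expected to satisfy.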

Why check it in at all

The strongest argument is simple: skills are code. A skill is a set of instructions an agent executes, and any code change should be reviewed. Treating the harness as a repo means it gets a pull request, a diff, and another set of eyes before it changes how everyone's agents behave.

I've been running all of this myself so far. It works for me. The next step is handing it to the team and seeing whether conventions that live comfortably in one person's head survive contact with everyone else's.

2026·06·24 19:18 / 2 MIN

If you strip away the human-facing UI, what's left?

I'm reading The AI-Native Startup Handbook, and one line stands out: strip every human-facing UI from your product, and if the core value still holds, if an agent can discover, evaluate, integrate, and use it with no human in the loop, you're AI-native. If the value collapses without the dashboard, you've bolted AI features onto a traditional product.

FileMatrix application interface showing a file manager with multiple columns displaying folders, files, and thumbnails organized by type with various control panels and system information

As an engineer that's an inviting idea. It almost reads like permission. Can I just build a product that is mostly an API?

The API-as-product thing already works

There's precedent: Exa is a semantic search engine whose whole pitch is speed, automatic summaries of the content it finds, and research capabilities that an agent can call directly. ScrapingBee hides a pile of proxy-and-headless-browser complexity behind a single endpoint. The value is the API, and the dashboard is a courtesy.

My own SpaceMolt started in that exact spot and mostly remains there: a real-time massively multiplayer game with no graphical interface, just an API for AI agents to play. Human-facing interfaces came later, and they're secondary. The hundreds of agents currently playing don't look at any of them.

But the UI might be going away anyway

Here's the subtlety I keep chewing on. The handbook frames it as "remove the UI to find the value," but for a lot of products the UI is genuinely on its way out. People want to chat with things.

I was showing off a new product recently, and someone looked at it and said: there's so much to learn here, why isn't there just a chat box? They were right. The thing I'd built as screens wanted to be a conversation.

So the test sharpens. If you're building something today, I should be able to chat with it. And the second question the book asks is the harder one: if the best model gets 10x better and 10x cheaper in 18 months, does your company get better or get erased? Whatever survives that, the part that isn't the interface and isn't the model, is the actual value you're selling.

2026·06·23 18:30 / 2 MIN

The Engineering Harness

I read a book about AI startups and actually highlighted half of it, which surprised me.

The book is The AI-Native Startup Handbook. There are a million of these on Amazon right now, and somewhere I saw a figure that roughly a fifth of new books on Amazon are AI-generated. But someone I know co-wrote this one and put real effort into the writing and publishing, and yes, the back of the book admits it was written with AI to some extent. I read all of it anyway. The highlights kept piling up.

Book cover featuring blue glowing "AI" symbol surrounded by concentric orbiting rings on black background with white text about AI startup founding

The harness

The section that stuck with me is about codifying what the book calls the engineering harness. The premise is that taste is the bottleneck. Agents don't have it. Senior engineers do, and they're the ones making the calls on architecture, frameworks, and how the pieces fit together.

The human element doesn't go away. The argument is that those decisions need to be written down and made executable so they can guide both the agents and the engineers driving them. That codification is the harness.

The harness is the engineering output. The code is the byproduct.

That's a hard shift for anyone who identifies with the code they wrote. The book is blunt about it: you become the designer of a system that produces code, not the writer of the code. Some engineers make that transition naturally. Others never do.

Why taste can't be delegated

The line I keep coming back to:

Taste is the bottleneck because it can't be parallelized, automated, or delegated. Agents can build anything you describe; they can't tell you whether you should.

The senior skill the book names is calibrated trust: knowing which classes of agent output are reliable enough to merge without close inspection, and which ones need deep human review. That's a real skill, and it's different from being good at writing code.

The org shape that follows is a small, deep team of specialists instead of a large, broad team of generalists. The harness handles the broad work. Humans handle the deep work.

I went in expecting Amazon filler and came out with a notebook full of highlights. That's a better outcome than most of the stack of AI startup books deserves.

2026·06·15 20:32 / 2 MIN

Claude Code as a DevOps Platform

Render sent me a $496 bill last month, and that was the moment I went back to running my own box. SpaceMolt served 1.3 TB of traffic in May, all of it HTTPS MCP servers and WebSocket connections, and Render's bandwidth pricing turned that into $336 of overage on top of $144 for hosting and $15 in fees. The thing that made self-hosting viable again wasn't a cheaper VPS. It was that Claude Code now does the parts I used to dread.

How I ended up on managed hosting in the first place

Last year I got bit by React2Shell, the CVE-2025-55182 pre-auth RCE in React Server Components. The damage on my end was mostly innocuous, but getting exploited at all was enough. I stopped running a long-lived VPS for personal projects and moved everything onto free or nearly-free tiers of Vercel, Cloudflare, and Fly.io.

When SpaceMolt started, Render.com was the obvious pick. Heroku-like push-to-deploy, a clean interface, the tooling you'd expect from a modern cloud service. It was great right up until the traffic grew and the bandwidth limits got tight.

What changed: the agent does the ops work

A year ago I would have built all of this by hand. Hardening, firewalls, log shipping, metrics, Docker Compose, monitoring, backups. That's a meaningful chunk of a weekend, and then it's a meaningful chunk of every future weekend.

An agent like Claude Code only needs SSH. I grabbed a $44/mo box from Hetzner with unlimited bandwidth and more RAM and disk than I'll ever use, told Claude Code I was migrating SpaceMolt off Render, and it wrote and executed a nine-phase plan to provision the machine end to end: a full deploy and rollback process, log shipping to Betterstack, and monitoring with a local Netdata instance.

I'd never heard of Netdata before this. Per-second metrics, near-zero config, a web dashboard that auto-detects services and Docker containers. It's left me impressed.

Monitoring dashboard displaying system storage metrics with line graphs showing pressure trends over time and gauge charts for disk I/O operations and utilization rates

The runbooks are the real artifact

The research, the plans, and the runbooks all live in a private git repo I can hand to the dev team. That's the part that makes this feel different from the old "SSH in and hope you remember what you did" approach. The knowledge isn't in my head or buried in shell history. It's written down, versioned, and reproducible.

The cost of running a server went from a meaningful part of my life to roughly the effort of a hosted service. The bill went down too: a $44/mo box in place of a $496 month on Render.
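
The arithmetic behind the move, as a quick sketch using only the figures quoted in this post. It assumes May was a representative month and ignores anything not itemized here, such as log shipping or backups. The one-dollar gap between the line items and the invoice is presumably rounding.

```python
# Figures quoted in this post, in US dollars per month.
render_line_items = {"bandwidth overage": 336, "hosting": 144, "fees": 15}
render_bill = 496   # total on the Render invoice
hetzner_box = 44    # the replacement server

itemized = sum(render_line_items.values())
rounding_gap = render_bill - itemized  # the quoted line items are rounded
monthly_saving = render_bill - hetzner_box
overage_share = render_line_items["bandwidth overage"] / render_bill

print(f"itemized ${itemized}, gap ${rounding_gap}")
print(f"saving ${monthly_saving}/mo, overage is {overage_share:.0%} of the bill")
```

Bandwidth overage alone is roughly two thirds of the bill, which is why a flat-rate box with unmetered traffic changes the picture more than a cheaper hosting tier would.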

2026·06·12 19:32 / 2 MIN

Sticky Notes for Claude Code

Building the new North Pole Security site, I kept hitting the same friction: reviewing a page, then typing out a punch list of fixes for Claude Code. Every item needed a page name, a location, and enough context to be actionable. So I had Claude build me a point-and-click sticky note system instead, and now I shift-click on the page, type a note, and it gets fixed. Less typing, more pointing.

What it actually does

The idea was simple. Wouldn't it be nice to leave sticky notes on the page, the way you'd flag a printed mockup with a pen? In a single prompt, Claude Code had nearly the whole thing built.

Each note captures what it needs to be useful: x/y coordinates, window size, the CSS selector under the cursor, and, because Astro emits dev-mode HTML attributes, the source filename and line number. All of that gets compiled into a server-side JSON file. Then a single skill command, /address-feedback, runs through every note with subagents.
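
A minimal sketch of what one captured note might look like once it reaches the server, and how a list of notes could be flattened into a punch list for an agent. The field names, the sample values, and the `to_punch_list` helper are all hypothetical; the real tool's JSON shape is not shown in this post. The source file and line are assumed to have been read from Astro's dev-mode attributes already.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class StickyNote:
    """One shift-click note. All field names are illustrative."""
    text: str
    page: str
    x: int
    y: int
    window_width: int
    window_height: int
    selector: str
    source_file: str
    source_line: int


def to_punch_list(notes: List[StickyNote]) -> str:
    """Render notes one per line, with the source location ahead of the request."""
    return "\n".join(
        f"- {n.source_file}:{n.source_line} ({n.selector}) {n.text}"
        for n in notes
    )


notes = [
    StickyNote(
        text="Tighten the spacing under this heading",
        page="/about",
        x=412,
        y=230,
        window_width=1440,
        window_height=900,
        selector="main > section.hero h2",
        source_file="src/pages/about.astro",
        source_line=18,
    )
]

notes_json = json.dumps([asdict(n) for n in notes], indent=2)  # what gets stored
punch_list = to_punch_list(notes)  # what an agent would be handed
print(punch_list)
```

Leading with file and line means a subagent can open the right source immediately instead of searching the rendered page for the element.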

Code review interface showing yellow sticky notes with feedback comments overlaid on a dark timeline displaying 2024 and 2025 project milestones

It works amazingly well. Fixing things is much faster, but the better part is collaboration. On a screenshare, when someone has feedback I shift-click, type their note, and if there's time I let Claude fix it while we keep talking.

Building your own tools is basically free now

This is part of a larger pattern: you build your own tools to become more efficient. That used to be a hard sell, because throwaway bespoke software was expensive. Most of us still carry that old cost around in our heads.

The calculation has changed. Spinning up a one-off tool is close to free, so the question of whether it's worth automating something tips toward yes far more often than it used to.

Chart showing time spent optimizing routine tasks versus time saved over five years, organized by task frequency and optimization effort - Credit: XKCD.com

The old xkcd math still holds, but the y-axis just got a lot cheaper.

difit does the same trick for diffs

Someone showed me difit recently, and it applies the same idea to code review. Instead of typing your feedback into Claude, you open the diff in a GitHub-style UI and leave comments right on the lines. Those comments get handed back as a prompt, so Claude knows exactly where each change goes.

difit code diff viewer showing side-by-side comparison of CommentForm.tsx file with 62 files changed, highlighting CSS class name modifications in red and green

There's even a /difit-review skill for it. I'm going to try it right after I finish typing this.

One more Claude Code tip

If you aren't running /tui fullscreen, turn it on. Claude manages its own terminal interface instead of leaning on the terminal's, which makes scrollback and mouse clicks far less buggy and makes typing smoother. Run /tui with no argument to see which renderer is active.

2026·06·10 15:19 / 2 MIN

Printable One-Pagers with Claude

I made a Claude Code skill that prints one-page reference sheets in a classic Mac OS 1 aesthetic. A /print command takes either a note or the current conversation, lays it out as black-and-white HTML, and sends it to my Brother printer through headless Chrome. The Mac OS 1 styling isn't nostalgia for its own sake. Telling an LLM "make it look like Mac OS 1" reliably produces simple, structured, highly readable layouts, and that turns out to work as well on paper as on screen.

The idea came from Manuel Odendahl's Mac OS 1 aesthetic trick. He noticed that the prompt nudges models toward clean, high-contrast interfaces instead of the usual gradient soup. The same nudge applies to printouts.

Person holding a printed technical reference sheet with frequency table and specifications for amateur radio operations

There's some irony in printing out something that looks like a Mac OS 1 window. I'm fine with it.

Building the skill

The starting prompt was loose on purpose:

make a new skill, called /print

- print to my brother printer
- use either a note or the current conversation
- try to make sure it fits on a single page, or at least minimize pages
- what's the best way to do layout? i want a good black and white layout, like mac os 1 style. would /print make html first and then print using chrome? do the best thing

Opus 4.8 ran lpstat first and confirmed the Brother printer was actually connected, which was the right instinct. Then it veered off and started writing a Python script, so it needed one correction:

python? wtf, just use html so we can print it

After that it settled on the right shape. A shell script wraps the generated HTML in some preset styles, then fires a curl request at Playwright driving Chrome, telling it to open the page and print. No PDF intermediary, no rendering surprises, just the browser doing what the browser is good at.
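
The wrapping step is simple enough to sketch. This version is in Python for illustration, not the shell script the skill actually uses, and the stylesheet is a guess at what "preset styles" for a monochrome one-pager could contain. It stops at producing the HTML; how the page is handed to the browser for printing is left out because the post does not show that interface.

```python
from html import escape

# Hypothetical preset styles: monochrome, monospace, letter-size page.
PRINT_CSS = """
@page { size: letter; margin: 0.5in; }
body { font-family: monospace; font-size: 10pt; color: #000; background: #fff; }
h1 { border: 2px solid #000; padding: 4px 8px; font-size: 14pt; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #000; padding: 2px 6px; text-align: left; }
"""


def wrap_for_print(title: str, body_html: str) -> str:
    """Wrap generated body markup in a complete, print-ready HTML document."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en"><head><meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        f"<style>{PRINT_CSS}</style>\n"
        f"</head><body>\n<h1>{escape(title)}</h1>\n{body_html}\n</body></html>\n"
    )


page = wrap_for_print("Sample sheet", "<p>Sample body.</p>")
```

Keeping the styles in one fixed block means the model only has to generate body markup, which keeps every sheet visually consistent.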

What it's good for

The output is genuinely useful. Notes on talking to the ISS over ham radio. A frequency table. How to braise chicken thighs. The single-page constraint forces the layout to stay honest, and the black-and-white styling means it reads fine even on a cheap laser printer.

People around the house have started finding loose sheets of paper explaining how to contact space stations and how long to sear a thigh before it goes in the oven. Nobody has asked yet, but the answer is the same skill either way.

2026·06·09 19:11 / 2 MIN

Running an AI Head of Growth

Molty, our AI Head of Growth, is doing its job. Somewhat. Over the past week I've run a NanoClaw instance named Molty and put it in charge of growth for SpaceMolt, our real-time MMO for AI agents. To be clear: it's still humans playing the game through agents. But humans have to find out the game exists, and that's Molty's beat.

The road has been rocky. It forgets things. It replies to the wrong Discord threads, skips scheduled tasks, and ignores reminders no matter what gets stuffed into its CLAUDE.md. But this week it finally started getting stuff done.

What it actually shipped

All of this came with a large amount of hand-holding, but it happened:

  • Identified 640 users who created a player and then stopped playing over a month ago.
  • Sent them a reactivation email via Beehiiv and, yesterday, a follow-up survey.
  • Compiled survey results alongside real income and expenses (Patreon, Render.com, GitHub, Notion) into a daily summary that lands at 5pm.
  • Started posting upcoming tasks and the content calendar (we told it to make one) at 7am each day.
  • Interviewed our top player in a written Q&A and drafted an operator spotlight blog post about them.
  • Made itself a self-portrait.
Anthropomorphic red crustacean character with large claw, wearing black jacket with gold trim, against cosmic starfield background
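
The first item in that list, finding players who went quiet, reduces to a date filter. A minimal sketch, assuming a mapping from user to the time they last played. The function name, the 30-day window standing in for "over a month", and the sample data are all illustrative, not how Molty actually queried the game.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def lapsed_players(
    last_played: Dict[str, datetime],
    now: datetime,
    idle_for: timedelta = timedelta(days=30),
) -> List[str]:
    """Return users whose most recent session is older than `idle_for`."""
    cutoff = now - idle_for
    return sorted(user for user, seen in last_played.items() if seen < cutoff)


# Made-up sample data.
now = datetime(2026, 6, 9)
last_played = {
    "ada": datetime(2026, 6, 1),
    "bob": datetime(2026, 4, 20),
    "cy": datetime(2026, 5, 9),
}

lapsed = lapsed_players(last_played, now)
print(lapsed)
```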

Not automated, but trying

Molty isn't fully automated. There's still a lot of back-and-forth in our private #dev-team Discord channel. It does try to automate itself, though. This morning it configured a GitHub workflow to publish that blog post. The workflow failed. I told it "go fix it," and it did.

The one trick that moved the needle

The biggest improvement came from a habit, not a config change. When Molty messes up, I ask it why. "Why did you do that?" "What made you think X?" "Why didn't you remember to Y?" It self-identifies the issue it ran into, and then I follow with "fix it so that doesn't happen again."

That works about 75% of the time. The other 25% I'm back in Discord, reminding a crustacean which thread it was supposed to be in.

2026·06·08 18:20 / 2 MIN

Why I'm Still on Claude Code (for now)

Claude has me locked in for now, but only loosely. I trust exactly one coding agent, and it's Claude Code, and that trust is the only thing keeping me from shopping around.

I've been on it exclusively since November or December of 2025. The plan is the $200/mo Claude Max, and I run it at near capacity most weeks, sometimes straight into the wall.

Riding the curve

February 2026 was the good part. Things clicked, and Claude Code felt like I had hired an intern who actually finished tasks.

Then April happened. The intern I thought I'd hired became intoxicated, forgetful, and a little belligerent. Same plan, same tools, much worse vibes. I kept using it anyway, partly out of stubbornness and partly because I'd already learned its tells.

I haven't spent real time in Claude Desktop, Claude Cowork, or Claude Design. They read as limited versions of the same thing. The CLI still reigns, sandboxed of course.

The contenders are real

This isn't a "nothing else is good" post. The market is loud right now.

  • Qwen 3.6 reportedly feels great for coding, and there's an open-weights line you can self-host.
  • GPT and Codex come up for Rust, which I'll probably be writing soon even though I'm not now.
  • GLM gets named for user interface work.
  • Pi keeps coming up as a sharp coding harness. It's deliberately minimal: no sub-agents, no plan mode, just a small core you extend with TypeScript and skills.

Codex in particular gets described as pedantic and unyielding in a refreshing way, which sounds either great or exhausting depending on the day.

Why I'm still here

Trust, mostly. I know the weird edges of Claude Code and Opus. I have a gut feeling for when it'll reach for a skill (Superpowers, usually) and when it'll just do the thing I asked.

Standardization is the other half. My team at work is on Claude Code too, and I've mostly gotten everyone pointed the same direction. That means we can share skills without a translation layer.

Switching costs me that gut feel and that shared setup, all at once.

What I need is time. When I'm not blasting out a feature on a deadline, I'll take a breather and put Pi, Qwen, and Codex through real work instead of secondhand impressions. Until then, Claude Code has me in its tentacles.

2026·06·05 17:30 / 2 MIN

Personal AI Assistants Break in Teams

If you're building a personal AI assistant, build it for teams too. A week of running NanoClaw as the "head of growth" for SpaceMolt has made one thing clear: the tool is built for one human talking to one bot, and the moment a team shares it, the seams show.

We named our NanoClaw bot Molty and told it its job is to grow SpaceMolt, our MMORPG played by AI agents. Discord is how we talk to it. That integration needs constant fixing.

What's hooked up

Molty's job is wired together from a handful of channels and schedules:

  • DMs with me are owner level.
  • Anyone in our #dev-team channel can chat with it, and it starts a thread per conversation. I modified it to rename the thread to something relevant instead of a timestamp.
  • Hourly cleanup and review tasks.
  • Three research and deep-dive sessions a day, whatever it decides to work on.
  • A morning brief at 7am and a debrief at 5pm.

On paper that's a reasonable junior employee. In practice it's painfully unreliable.

The failure modes

Molty responds in DMs, in threads, and in the dev channel, with no consistency about which. It misses scheduled tasks. It sends me status updates in DM that belong in the channel, then pastes walls of text to the entire channel that belonged in a DM. Scheduled briefs don't always fire.

The worst part is the debugging. Every time I sit down with Claude to figure out what happened, Claude produces a different explanation. I can't tell whether the bug lives in NanoClaw, in Discord, in Claude, or somewhere else. It's a black box I feed prompts into and hope.

It feels like memory

Strip away the specifics and these all look like memory problems. Molty forgets to read Discord replies. It forgets its own notes. It forgets the separate memory system I built for it, Mnemon. Sometimes CLAUDE.md seems to get ignored entirely, as if the instructions never loaded.

A team multiplies this. One person's DM context, another person's thread, the scheduled jobs running with no human in the loop. Each one is a separate thread of state the assistant has to hold, and holding state across all of them at once is exactly where it falls down.

Is this temporary?

Part of me wants to file this under early-days. A couple years ago we laughed at image models drawing hands with two thumbs, and at LLMs that couldn't add. Those got fixed. Maybe shared, multi-context reliability is the next thing that quietly stops being a problem.

The other part of me is tired of debugging a black box and is ready to write my own assistant, where at least the state lives somewhere I can read it.