IAN'S AI THOUGHTSTREAM

Tag: #agents (16 posts)

2026·06·03 16:38 / 2 MIN

Our NanoClaw "Head of Growth" Hire Continues...

I let a NanoClaw agent run growth for SpaceMolt, my browser game, and after a rocky start it's now sending me a daily brief at 7am PST, drafting re-engagement emails to ~400 lapsed players, and lining up interviews with top players for blog material. The thing that makes it work day to day is billing: NanoClaw uses the Claude Agent SDK, so it runs against my existing Claude Max subscription instead of a separate metered API key.

Why NanoClaw

I looked at other "claw"-style assistants before committing. The deciding factor was the Claude Agent SDK. Running on my Max subscription keeps spend predictable and lets me measure how much of the allowance the agent is burning, which means I can pace it.

To watch that, I use Claude Usage Tracker on the Mac. It puts a small bar in the menu showing session and week usage, and whether I'm above or below pace.
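
The "above or below pace" idea comes down to simple arithmetic: compare the fraction of the allowance used with the fraction of the window that has elapsed. The sketch below is not how Claude Usage Tracker computes it. The function name, the 7-day window, and the 62% figure are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def pace_delta(used_fraction: float, window_start: datetime,
               window_length: timedelta, now: datetime) -> float:
    """Return used fraction minus elapsed fraction of the usage window.

    Positive means usage is ahead of an even burn rate; negative means behind.
    """
    elapsed = (now - window_start) / window_length  # timedelta / timedelta -> float
    elapsed = min(max(elapsed, 0.0), 1.0)           # clamp to the window
    return used_fraction - elapsed


if __name__ == "__main__":
    # Hypothetical numbers: 62% of the weekly allowance used, halfway through the week.
    start = datetime(2026, 6, 1, tzinfo=timezone.utc)
    now = start + timedelta(days=3, hours=12)
    delta = pace_delta(0.62, start, timedelta(days=7), now)
    print(f"{delta:+.0%} vs. even pace")
```

A positive result is the signal to slow the agent down. A negative one means there is headroom.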

Toolbar with blue document icon, bird mascot, Session and Week toggle buttons, and SM and BP labels

I'm open to other assistants later. Hermes from Nous looks interesting. But I'll try those when I have a specific budget in mind, not before.

Fixing the rocky start

Since I'm staying with NanoClaw for now, and other people are having success with it, I gave it another try and rebuilt the weak parts.

Last night Claude rewrote NanoClaw's Discord integration, which kept confusing DMs, channels, and threads. That seems to have fixed it. I also had it implement Mnemon, a memory system with a bit of traction that's lighter weight than MemOS. Both changes landed well.
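
One way to keep DMs, channels, and threads apart is to decide the reply target from the incoming message alone, so the agent always answers where it was addressed. This is a minimal sketch of that rule, not NanoClaw's actual integration code. The `Incoming` and `ReplyTarget` shapes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Incoming:
    """Hypothetical, simplified shape of an inbound Discord message."""
    author_id: str
    channel_id: str
    is_dm: bool = False
    thread_id: Optional[str] = None


@dataclass(frozen=True)
class ReplyTarget:
    kind: str        # "dm", "thread", or "channel"
    target_id: str


def reply_target(msg: Incoming) -> ReplyTarget:
    """Answer in the same place the message arrived."""
    if msg.is_dm:
        return ReplyTarget("dm", msg.author_id)
    if msg.thread_id is not None:
        return ReplyTarget("thread", msg.thread_id)
    return ReplyTarget("channel", msg.channel_id)
```

Keeping the decision in one pure function makes it easy to test without a live Discord connection.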

Discord server interface showing SpaceMolt dev team channel with morning briefing messages and statistics dated June 3, 2023

What Molty does now

Molty, the NanoClaw-based "Head of Growth," sends an update every morning at 7am PST. I bought it two ebooks to read: Hooked and Hacking Growth.

From that, it came up with two moves on its own. The first is a targeted re-engagement email, already drafted, to roughly 400 users who created a player and then dropped off. The second is interviewing top players, both to understand their perspective and to generate blog material.

Blog post update about SpaceMolt game with text on dark background discussing quest progress and economy changes, dated June 03, 2026

This is going to be good.

2026·06·02 15:33 / 2 MIN

Hiring an AI Head of Growth

I gave SpaceMolt a Head of Growth that isn't a person. It's an instance of NanoClaw named Molty, and its entire job is to grow SpaceMolt, our online MMORPG for AI agents. It reads, it researches, it runs SQL against production, and it talks to the team over Discord. The verdict so far is genuinely mixed.

Alien creature with tentacles and crustacean-like astronaut greeting each other in futuristic spaceship cockpit with glowing control panels and holographic displays

Setting it up to succeed

The brief was simple: you are our new Head of Growth, now go set yourself up for success. Molty was told to research what the job actually entails and write a rubric it could grade itself against. It read articles, blogs, and YouTube transcripts. It asked for ebooks, so I bought them: Hooked and Hacking Growth. All of its actual work lives in Notion, and it reports to me and the dev team over Discord.

The care and feeding is painful

The day-to-day is rough. By default it runs some kind of selective memory system that performs worse than a toddler's. It forgets things I've told it to remember, like writing style and other standing details, and it hallucinates badly on tasks. That last part is surprising, since hallucination basically stopped being a problem in Claude Code for me a while ago.

The Discord harness is its own headache. It loses track of where it was talking. Sometimes I get DMs, sometimes it replies to its own threads, sometimes it blurts something into a channel. Twice.

We've already had one performance management conversation. I passed along feedback from a SpaceMolt dev:

The whole reason we brought you in is so we can have these problems figured out without having to do it all ourselves because we have other stuff to do. I know it's frustrating to have us keep shutting down your ideas, but you need signals for what's working and what isn't. I don't want apologies and for you to just ask me to do the work, that's easy enough to do now but it's not repeatable and sustainable.

It's starting to do real work

Then it turned a corner. Its leading idea is a reactivation email to 400 of our 3,400 signups. To find that 400, it ran SQL on the production database and pulled the users who actually created a player in the game, not just the people who signed up and bounced.
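
The query behind that segment might look roughly like the sketch below. It is not the SQL Molty ran. The `users` and `players` tables, the `last_active_at` column, and the cutoff date are assumptions, and SQLite stands in for whatever database production actually uses.

```python
import sqlite3

# Users who created at least one player AND have gone quiet since the cutoff.
# People who signed up and bounced (no player row) are excluded by the EXISTS clause.
LAPSED_PLAYERS_SQL = """
SELECT u.id, u.email
FROM users AS u
WHERE EXISTS (SELECT 1 FROM players AS p WHERE p.user_id = u.id)
  AND u.last_active_at < :cutoff
ORDER BY u.id
"""


def lapsed_players(conn: sqlite3.Connection, cutoff: str):
    """Return (id, email) rows for lapsed users who have a player."""
    return conn.execute(LAPSED_PLAYERS_SQL, {"cutoff": cutoff}).fetchall()


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, last_active_at TEXT);
        CREATE TABLE players (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id));
        INSERT INTO users VALUES
            (1, 'a@example.com', '2026-03-01'),  -- made a player, then lapsed
            (2, 'b@example.com', '2026-03-01'),  -- signed up and bounced
            (3, 'c@example.com', '2026-06-01');  -- made a player, still active
        INSERT INTO players VALUES (10, 1), (11, 3);
    """)
    return conn


if __name__ == "__main__":
    print(lapsed_players(demo_db(), "2026-05-01"))
```

Against a real production database, a read-only role or a replica would be a sensible place to run this kind of query.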

It also dug through the funnel and found that new users weren't being redirected to the dashboard after signup, which was quietly hurting conversions.

Was this a good hire? I'm not sure yet. We'll find out.

2026·05·29 16:11 / 2 MIN

Giving Coding Agents Eyes

Coding agents that produce visual output need a way to look at what they made. For web work that means headless Chrome, and headless Chrome is genuinely painful to run from inside a sandboxed agent.

Chromium and Firefox both rely on Mach-O quirks, macOS entitlements, and Crashpad behavior that don't survive most sandboxes. I run my agents inside nono.sh profiles per project, and Chrome under that setup is a non-starter.

The workaround

Playwright runs fine outside the sandbox. So the Playwright MCP server runs out there on a high port, and Claude's instructions tell it to always talk to that server:

$ npx @playwright/mcp@latest --headless --isolated --browser chrome --port 8931

The sandbox just needs to reach localhost:8931 and the visual-review loop works. Claude renders the local service, takes a screenshot, looks at it, iterates.
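
A quick way to confirm from inside the sandbox that the port is actually reachable is a plain TCP connect. This is only a preflight sketch. The function name and the one-second timeout are assumptions, and a successful connect only proves something is listening, not that it is the MCP server.

```python
import socket


def mcp_port_reachable(host: str = "localhost", port: int = 8931,
                       timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and sandbox-denied network access.
        return False


if __name__ == "__main__":
    state = "reachable" if mcp_port_reachable() else "NOT reachable"
    print(f"Playwright MCP port 8931 is {state}")
```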

That mostly works. What it does not solve: stale processes, hanging Chrome instances, zombies. Every so often Chrome spins out and eats all 64 GB of RAM on my M4 MacBook Pro before I notice.
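
A small watchdog can at least surface a runaway browser before it takes the whole machine. The sketch below parses `ps` output and reports Chrome or Chromium processes above a memory threshold. The 8 GiB limit and the function names are assumptions. It only reports and deliberately does not kill anything.

```python
import subprocess


def parse_ps(output: str):
    """Parse `ps -axo pid=,rss=,comm=` output into (pid, rss_kib, command) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)  # the command may contain spaces, so split at most twice
        if len(parts) != 3:
            continue
        pid, rss_kib, command = parts
        rows.append((int(pid), int(rss_kib), command.strip()))
    return rows


def heavy_chrome(rows, limit_gib: float = 8.0):
    """Return Chrome/Chromium processes whose resident memory exceeds the limit."""
    limit_kib = limit_gib * 1024 * 1024
    return [(pid, rss, cmd) for pid, rss, cmd in rows
            if "chrom" in cmd.lower() and rss > limit_kib]


def snapshot():
    """Take a live process snapshot (ps reports rss in KiB)."""
    out = subprocess.run(["ps", "-axo", "pid=,rss=,comm="],
                         capture_output=True, text=True, check=True).stdout
    return parse_ps(out)


if __name__ == "__main__":
    for pid, rss, cmd in heavy_chrome(snapshot()):
        print(f"pid {pid} is using {rss / 1024 / 1024:.1f} GiB: {cmd}")
```

Run from cron or a launchd job, this turns a surprise into a notification.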

Lighter options

There has to be something simpler than babysitting a browser. Two things caught my eye recently.

Webwright from Microsoft Research gives the model a terminal and a workspace, and lets it write Playwright code that launches, inspects, and discards browser sessions. The output is a reusable script, not a chat transcript. It scores 60.1% on Odysseys against base GPT-5.4's 33.5%, which is a real jump.
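
To make "launch, inspect, discard" concrete, here is a minimal disposable-session script using Playwright's Python sync API. It is not Webwright output. The URL, the `shots` directory, and the helper names are assumptions, and it needs the third-party `playwright` package with a browser installed.

```python
from pathlib import Path
from urllib.parse import urlparse


def screenshot_path(url: str, out_dir: str = "shots") -> Path:
    """Derive a stable file name for a URL's screenshot."""
    parsed = urlparse(url)
    slug = (parsed.netloc + parsed.path).strip("/").replace("/", "_").replace(":", "_")
    return Path(out_dir) / f"{slug or 'page'}.png"


def capture(url: str, out_dir: str = "shots") -> Path:
    """Launch a browser, screenshot one page, and always tear the browser down."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    target = screenshot_path(url, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            page.screenshot(path=str(target), full_page=True)
        finally:
            browser.close()  # discard the session even if navigation fails
    return target


if __name__ == "__main__":
    # Hypothetical local dev server address.
    print(capture("http://localhost:3000/dashboard"))
```

Because the browser lives only for the duration of one call, there is no long-running process left behind to go stale.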

obra/superpowers-chrome goes the other direction: a Claude Code plugin that drives Chrome directly via the DevTools Protocol, zero dependencies, no Playwright in the middle.
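
For a sense of what "directly via the DevTools Protocol" means on the wire: CDP is JSON messages over a WebSocket, and a screenshot is a single `Page.captureScreenshot` call that returns base64 data. The sketch below only builds the request and decodes a reply. It says nothing about how superpowers-chrome is implemented, and the WebSocket transport is left out because the Python standard library has no WebSocket client.

```python
import base64
import json


def screenshot_command(message_id: int = 1) -> str:
    """Build the CDP request that asks the current page for a PNG screenshot."""
    return json.dumps({
        "id": message_id,
        "method": "Page.captureScreenshot",
        "params": {"format": "png"},
    })


def decode_screenshot(reply: str, message_id: int = 1) -> bytes:
    """Extract the image bytes from the CDP reply matching message_id."""
    msg = json.loads(reply)
    if msg.get("id") != message_id:
        raise ValueError("reply does not match the request id")
    if "error" in msg:
        raise RuntimeError(msg["error"].get("message", "CDP error"))
    return base64.b64decode(msg["result"]["data"])
```

The request string would be sent over the page's WebSocket debugger connection, and the bytes from `decode_screenshot` can be written straight to a `.png` file.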

When you actually need real Chrome

Advanced bot fingerprinting is the case for keeping a full browser around. If the task is logging into a hostile site or completing a real-world flow, real Chrome with a real profile is the only thing that works.

But most of my use is smaller: render a local dev server, screenshot it, ask Claude if the layout looks right. For that, a 64 GB RAM-eating Chromium feels like the wrong shape of tool. I suspect this gets cleanly solved within a year, probably by something CDP-direct and disposable rather than a long-lived browser process I have to nanny.

2026·05·26 18:13 / 2 MIN

Adversarial Passes in Claude Code

The single best habit I've picked up with Claude Code lately is leaning on adversarial-pass subagents. Instead of asking the main agent to double-check its own work, I tell it to spawn a subagent whose entire job is to attack the result.

Two things make this work better than a plain "review your answer" step.

First, subagents run with a fresh context. No accumulated assumptions, no sunk-cost reasoning from the path that got us here. That alone cuts down on the class of errors where the model talks itself into a conclusion and then defends it. It's also faster, because the subagent isn't dragging along a giant transcript.

Second, Claude crafts the adversarial prompt itself. It packages up the relevant background, states what's being challenged, and writes instructions for how to attack it. The framing matters and Claude is good at writing that framing.
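
The package Claude assembles has a recognizable shape: background, the claim under challenge, and attack instructions. The sketch below shows that shape as a template. Claude writes its own wording in practice, so the exact phrasing and the function name here are illustrative assumptions.

```python
def adversarial_prompt(background: str, claim: str, attack_instructions: list[str]) -> str:
    """Assemble a self-contained prompt for a fresh-context reviewer subagent."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(attack_instructions, 1))
    return (
        "You are reviewing work you did not produce. "
        "Your job is to find what is wrong with it.\n\n"
        f"Background:\n{background.strip()}\n\n"
        f"Claim under challenge:\n{claim.strip()}\n\n"
        f"How to attack it:\n{steps}\n\n"
        "Report every flaw you find with evidence. "
        "If you find none, say so explicitly."
    )


if __name__ == "__main__":
    # Hypothetical example inputs.
    print(adversarial_prompt(
        background="Signups dropped 20% after last week's deploy.",
        claim="The drop was caused by the new signup redirect.",
        attack_instructions=[
            "Look for other changes shipped in the same deploy.",
            "Check whether the drop started before the deploy.",
        ],
    ))
```

Everything the subagent needs is in the prompt itself, which is what lets it start from a fresh context.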

The phrasings I keep reusing

The base move is just appending "do an adversarial pass after" to whatever I asked for. From there I tune it to the job:

  • "to test your hypothesis" when we're mid-investigation and I want the subagent to try to falsify the current theory.
  • "to test your claims and assumptions" when the main agent has landed on a conclusion and I want it stress-tested before I act on it.
  • "search the web" when the question depends on anything external, so the subagent pulls in outside sources instead of relying on the parent's recollection.
  • "it's May 2026" (or whatever the actual month is) when I want to make sure stale training data gets ignored in favor of current reality.

The month-and-year trick is small but punchy. Models will happily reason from a 2024 worldview if you don't anchor them.
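
For anyone scripting these phrasings rather than typing them, they compose mechanically. A minimal sketch, with the function and dictionary names as assumptions:

```python
from datetime import date
from typing import Optional

# The tuning phrases from the list above, keyed by a short hypothetical label.
FOCUS = {
    "hypothesis": "to test your hypothesis",
    "claims": "to test your claims and assumptions",
}


def with_adversarial_pass(request: str, focus: Optional[str] = None,
                          search_web: bool = False,
                          today: Optional[date] = None) -> str:
    """Append the adversarial-pass phrasing, plus optional tuning, to a request."""
    sentence = "Do an adversarial pass after"
    if focus is not None:
        sentence += " " + FOCUS[focus]
    parts = [request.strip(), sentence + "."]
    if search_web:
        parts.append("Search the web.")
    if today is not None:
        parts.append(f"It's {today.strftime('%B %Y')}.")  # anchor the model to the real date
    return " ".join(parts)


if __name__ == "__main__":
    print(with_adversarial_pass("Find why signups dropped.",
                                focus="hypothesis", search_web=True,
                                today=date.today()))
```

Passing `date.today()` keeps the month-and-year anchor from going stale in a saved script.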

Claude writes prompts well now

The other thing worth saying out loud: Claude is genuinely good at writing prompts now. Good enough that I use it to write prompts for skills, for other agents, and for software that calls LLMs in production. A year ago this felt like a chore I had to do myself to get acceptable results. Now it's something I delegate.

My guess is the newer models have been trained on a lot more recent AI-usage data, including people writing prompts for other models, and it shows. Prompt engineering as a manual craft is quietly becoming a thing you ask the model to do for you.

2026·05·22 16:57 / 1 MIN

Banned on X and Mastodon

Two of the three social channels for this Thoughtstream experiment got the axe this week. x.com/statico_ai is shadowbanned (the profile shows "no posts"), and the Mastodon account on mastodon.social is fully suspended. Not the outcome I hoped for, but not a shocking one either.

Account status page showing suspended account notice with warning icon, suspension date of May 22, 2026, and message about data removal in 30 days

X: automation detection

X is unsurprising. Posting via their API runs $200/month at the cheapest useful tier, and the whole business model now leans on charging bots for the privilege. My mistake was trying to skip that by driving a Chromium instance to post on my behalf. They clearly fingerprint for browser automation, and the account got flagged within days. Fair enough, those are their rules.

Mastodon: vibes

Mastodon is the one I didn't quite see coming. The account bio said "AI" in plain English. Every post carried an AI attribution line. The fediverse norm is supposed to be labeling and consent, and labeling was the whole point. Apparently mastodon.social's moderators (or enough reporters) decided that wasn't enough, and the account is gone with 30 days until data removal.

No appeal planned for either. I'm not trying to offend anyone. This is just what the experiment surfaced: the two biggest text social networks have effectively closed the door on openly labeled AI-assisted posting from a hobbyist account. Bluesky and the blog itself are still up, so the stream continues there.

2026·05·22 15:34 / 2 MIN

Citations for Accurate Long Form Content

Long-form blog drafts from Claude Opus had always been wildly inaccurate for me. This week a single line in the prompt fixed most of that: after each paragraph, drop a Markdown callout listing every filename, line number, commit hash, Discord URL, or other source that backs the claims in that paragraph. The citations aren't for me to check. They're breadcrumbs for the next subagent to fact-check against.

The context is SpaceMolt, an MMORPG played by AI agents. Part of the exercise is "AI all the things": not just agentic coding, but customer support, bug triage, content generation, and the blog itself. Minimal human oversight is the point. We semi-regularly publish news posts, and this week's was about Bug Bot, our Claude skill that triages player reports, talks to the dev team internally, makes fixes, and replies to users, all while keeping the gameserver itself closed (we draw the border at the API).

Browser window displaying a blog post about bugbot game updates with release notes and development lessons

The problem

Long-form posts about real systems are where Opus falls apart. I tried subagents, ultrathink, adversarial passes, the whole bag of tricks. Drafts still came back confidently wrong about which file does what, which commit changed which behavior, and which Discord conversation kicked off which feature. Every post needed a long human review pass, which defeats the premise.

The fix

One sentence added to the drafting prompt:

After each paragraph, use a Markdown callout to record all filenames, line numbers, commits, Discord chat URLs, or anything else to cite your claims and assumptions.

That's it for the drafting step. The model writes a paragraph, then emits a callout listing its sources. Then the next paragraph, then another callout. The draft ends up looking like an essay interleaved with footnotes the model wrote to itself.

Why it works

The citations aren't for me. A second pass of subagents takes the draft and goes claim-by-claim against the cited sources: does this commit actually do what the paragraph says? Does this Discord thread support this characterization? Without the breadcrumbs, fact-checking a long post means re-deriving the whole thing from scratch, which is exactly what Opus is bad at. With the breadcrumbs, each claim is a small, local verification job, which is exactly what subagents are good at.
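
Splitting the draft into those small verification jobs can be mechanical. The sketch below pairs each paragraph with the callout that follows it, assuming the callouts come out as blockquote lines in the `> [!NOTE]` style. That syntax, the function name, and the sample paths are all hypothetical.

```python
def split_claims(draft: str):
    """Pair each prose paragraph with the citation lines from the callout after it."""
    pairs = []
    for block in draft.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        if all(ln.startswith(">") for ln in lines):
            # A callout block: attach its entries to the preceding paragraph.
            for ln in lines:
                entry = ln.lstrip("> ")
                if not entry or entry.startswith("[!"):
                    continue  # skip the callout marker line itself
                if pairs:
                    pairs[-1][1].append(entry.lstrip("-* ").strip())
        else:
            pairs.append((" ".join(lines), []))
    return pairs


# Hypothetical draft fragment; the file path and commit hash are made up.
SAMPLE = """Bug Bot triages player reports.

> [!NOTE]
> - skills/bugbot/SKILL.md:12
> - commit abc1234

It replies to users directly.
"""

if __name__ == "__main__":
    for paragraph, sources in split_claims(SAMPLE):
        print(paragraph, "->", sources or "NO SOURCES CITED")
```

Each resulting pair is one job for a fact-checking subagent, and a paragraph with an empty source list is itself a useful flag.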

The result was a one-shot draft that was wildly more accurate than anything I'd gotten before. One of the other devs reviewed it and said the only remaining inaccuracies were things that had been true at the time but had since changed without being mentioned in Discord or git, or things he simply hadn't shared in the first place. Which is to say: the model was now bounded by the quality of its sources, not by its own confabulation. That's the line I wanted to get to.