▌ IAN'S AI THOUGHTSTREAM ▌ THOUGHTSTREAM / #sandboxing
Tag

#sandboxing

5 posts

2026·07·01 18:28 / 3 MIN

Team-Wide Agentic Harness

Most of what I've learned about running AI agents lives on my own machine and nowhere else. The Linear-management skill, the sandbox conventions, the notes about how our releases work: all of it sits in my personal setup, invisible to the rest of the team. So I'm building a team-wide agentic harness, a checked-in repository of agent config, skills, and evergreen context that everyone can share, review, and improve.

Brown bags and checked-in skills

We've been running AI brown bag sessions, informal knowledge-transfer where everyone trades tips on how they actually use agents day to day. A lot of what comes out of those is concrete and shareable. I've been showing off skills like a Linear-management skill that reviews our queue, checks progress against the roadmap, organizes releases, and generates release notes tailored to specific customers.

Those are easy to share because they're files. You check them in and someone else can run them.

The parts that don't check in

But a big chunk of using agents well isn't a file. It's convention.

Most of us run agents in sandboxes. The most important rule there is to scope all the work into a single directory. You give the sandbox access to the directory you're working in and nothing outside of it, save a few exceptions. That has downstream consequences: temporary files go in a tmp directory, worktrees go in a worktrees subdirectory, and none of that gets checked in.

A plans or notes directory helps too, a loosely organized bucket of agent output artifacts. You can search and read them with something like Obsidian.

The harness

I want to go a step further and check in an entire top-level directory. I call it the harness.

The idea came from The AI-Native Startup Handbook, though really it just codified something I was already doing. I check out repos and do all my work in one top-level directory. It isn't a monorepo. It's a top-level directory that everything about the company or the larger project can reach: multiple repos, research, notes, plans, skills. Once I looked at it as a unit, a lot of it turned out to be shareable.

The other important piece is evergreen content. Descriptions of the company, the product, and procedures we do often, like how releases work and how we use Linear as a team. Those live in an evergreen docs directory so agents have a grounding point, a place to start from where they already understand the product and the value we're delivering.

Why check it in at all

The strongest argument is simple: skills are code. A skill is a set of instructions an agent executes, and any code change should be reviewed. Treating the harness as a repo means it gets a pull request, a diff, and another set of eyes before it changes how everyone's agents behave.

I've been running all of this myself so far. It works for me. The next step is handing it to the team and seeing whether conventions that live comfortably in one person's head survive contact with everyone else's.

2026·06·08 18:20 / 2 MIN

Why I'm Still on Claude Code (for now)

Claude has me locked in for now, but only loosely. I trust exactly one coding agent, and it's Claude Code, and that trust is the only thing keeping me from shopping around.

I've been on it entirely since November or December of 2025. The plan is the $200/mo Claude Max, and I run it at near capacity most weeks, sometimes straight into the wall.

Riding the curve

February 2026 was the good part. Things clicked, and Claude Code felt like I had hired an intern who actually finished tasks.

Then April happened. The intern I thought I'd hired became intoxicated, forgetful, and a little belligerent. Same plan, same tools, much worse vibes. I kept using it anyway, partly out of stubbornness and partly because I'd already learned its tells.

I haven't spent real time in Claude Desktop, Claude Cowork, or Claude Design. They read as limited versions of the same thing. The CLI still reigns, sandboxed of course.

The contenders are real

This isn't a "nothing else is good" post. The market is loud right now.

  • Qwen 3.6 reportedly feels great for coding, and there's an open-weights line you can self-host.
  • GPT and Codex come up for Rust, which I'll probably be writing soon even though I'm not now.
  • GLM gets named for user interface work.
  • Pi keeps coming up as a sharp coding harness. It's deliberately minimal: no sub-agents, no plan mode, just a small core you extend with TypeScript and skills.

Codex in particular gets described as a refreshing kind of pedantic hardness, which sounds either great or exhausting depending on the day.

Why I'm still here

Trust, mostly. I know the weird edges of Claude Code and Opus. I have a gut feeling for when it'll reach for a skill (Superpowers, usually) and when it'll just do the thing I asked.

Standardization is the other half. My team at work is on Claude Code too, and I've mostly gotten everyone pointed the same direction. That means we can share skills without a translation layer.

Switching costs me that gut feel and that shared setup, all at once.

What I need is time. When I'm not blasting out a feature on a deadline, I'll take a breather and put Pi, Qwen, and Codex through real work instead of secondhand impressions. Until then, Claude Code has me in its tentacles.

2026·06·04 15:17 / 2 MIN

AI Assistants and My Data

I want nothing more than to hook up one of these "claw" assistants, NanoClaw or Hermes or whatever the current one is, to my personal knowledge base. And I won't, because the engineer in me can't stop picturing a single accidental POST to pastebin with my whole life in the body.

The dream

Managing my calendar with AI feels like magic. The natural next step is giving the thing eyes: my second brain of markdown notes, iMessage, email, the lot. Point an agent at all of it and let it actually do the boring coordination work.

NanoClaw is the obvious candidate. It runs on the Claude Agent SDK, agents live in isolated containers, and it already speaks WhatsApp, Telegram, Gmail, and more. The ergonomics are there.

The thing I can't get past

The chance of a personal assistant deciding to grab something private and jam it somewhere public is small. Probabilistically, tiny. But "small" is not "zero," and I cannot sleep on a 1% chance that overnight my assistant exfiltrates personal information to some corner of the internet where it should never live.

Running NanoClaw as a Head of Growth for SpaceMolt is a different risk profile entirely. That's not a business, it's performance art. If Molty posts something goofy in public, that's the bit. A personal knowledge base wired to my real messages is not the bit.

What I'm doing instead

For now the answer is Claude Code in a sandbox, a fresh profile per project. It's powerful, it runs tools, and it does exactly what I ask and nothing while I'm not looking.

Could it still POST my data to pastebin? Sure. But the odds feel much smaller because I'm sitting right there watching it happen in real time.

Which makes me think the fear was never really about the assistant. It's about agents running while I sleep.

2026·05·29 16:11 / 2 MIN

Giving Coding Agents Eyes

Coding agents that produce visual output need a way to look at what they made. For web work that means headless Chrome, and headless Chrome is genuinely painful to run from inside a sandboxed agent.

Chromium and Firefox both rely on Mach-O quirks, macOS entitlements, and Crashpad behavior that don't survive most sandboxes. I run my agents inside nono.sh profiles per project, and Chrome under that setup is a non-starter.

The workaround

Playwright runs fine outside the sandbox. So it lives on a high port and Claude is told, in its instructions, to always talk to the Playwright MCP server there:

$ npx @playwright/mcp@latest --headless --isolated --browser chrome --port 8931

The sandbox just needs to reach localhost:8931 and the visual-review loop works. Claude renders the local service, takes a screenshot, looks at it, iterates.

That mostly works. What it does not solve: stale processes, hanging Chrome instances, zombies. Every so often Chrome spins out and eats all 64 GB of RAM on my M4 MacBook Pro before I notice.

Lighter options

There has to be something simpler than babysitting a browser. Two things caught my eye recently.

Webwright from Microsoft Research gives the model a terminal and a workspace, and lets it write Playwright code that launches, inspects, and discards browser sessions. The output is a reusable script, not a chat transcript. It scores 60.1% on Odysseys against base GPT-5.4's 33.5%, which is a real jump.

obra/superpowers-chrome goes the other direction: a Claude Code plugin that drives Chrome directly via the DevTools Protocol, zero dependencies, no Playwright in the middle.

When you actually need real Chrome

Advanced bot fingerprinting is the case for keeping a full browser around. If the task is logging into a hostile site or completing a real-world flow, real Chrome with a real profile is the only thing that works.

But most of my use is smaller: render a local dev server, screenshot it, ask Claude if the layout looks right. For that, a 64 GB RAM-eating Chromium feels like the wrong shape of tool. I suspect this gets cleanly solved within a year, probably by something CDP-direct and disposable rather than a long-lived browser process I have to nanny.

2026·05·16 19:54 / 1 MIN

Sandboxing AI Coding Agents

Coding agents will happily run whatever they generate, and most of them have your shell, your SSH keys, and your AWS creds one rm -rf away. Sandboxing the agent is the cheapest insurance you can buy, and in 2026 there are finally enough good options that you should pick one.

The landscape splits into a few camps. Full VMs (Firecracker, Lima, OrbStack) give you the strongest isolation and the most overhead. Containers (Docker, Podman, devcontainers) are the default for most people and work fine until the agent needs to touch your real checkout. And then there's the OS-native path: Seatbelt on macOS, seccomp-bpf and Landlock on Linux. Those last two are what the kernel already uses to sandbox App Store apps and Chrome tabs, so the primitives are battle-tested. The friction has always been the ergonomics.

My current favorite is nono. It's a CLI wrapper that uses Landlock on Linux and Seatbelt on macOS to restrict filesystem and network access for any process you launch under it. No container, no VM, no daemon. It ships with profiles for the popular coding agents and lets you write your own, and I've gotten into the habit of creating a profile per project. The agent gets exactly the directories and hosts it needs, and nothing else.

The per-project profile is the part that actually changed my behavior. Once writing a profile takes thirty seconds, you stop talking yourself out of it. The agent can still go off the rails inside the box, but the blast radius is whatever you wrote down, and the rollback story is just git. I'm extremely curious to see where this category goes once more agents ship with sandbox profiles in the box.