Ai on PHP Boy Scout

The off-switch was never a button

Thu, 02 Jul 2026 00:00:00 +0000

Last night, while I was asleep, an AI agent spent the better part of eight hours writing code in one of my repositories. It pulled a task off a spec, wrote the code, ran the tests, and left a merge request with my name on it, waiting for me to read over coffee.

If that makes you reach for the word “reckless”, I understand. Eighteen months ago I’d have been right there with you.

I came to this a sceptic

For a long time I didn’t have the faith in these models that a lot of my peers did. Every time I went near AI-generated code it was a bit sketchy, or it looked like a StackOverflow copy-paste that had wandered in off the street, or it just plain didn’t do what it said on the tin. So I filed it under “assistant”, handy for the boilerplate I couldn’t be bothered to type, and even then I usually reached for my own tooling instead (go-tool-base is just the latest version of that instinct). The one place I happily let it off the leash was my Dungeons & Dragons prep, because when there’s a table of legendary heroes-in-the-making in front of you, facts and reality are already fairly negotiable.

And then, somewhere in the last year, it changed. The models got better. Almost too good, to the untrained eye! I watched them improve, month on month, until the lure was enough to make me spend real time with a spread of tools and models from different providers. I was taken aback by how quickly they became part of how I actually work. I run an AI agent every day now, and there’s always at least one thing brewing in the pot.

So I’m not here as a sceptic. I’m an advocate who uses this stuff in anger. Which is exactly why the next bit needs saying.

A Golden Retriever with a keyboard

Even now, with all the progress, there are still moments where I look at what an agent has handed me and put my face in my hands. Sometimes it’s copied the same block of code into fifteen files instead of reaching for the obvious abstraction. Sometimes it has started bang on the brief and then, for reasons known only to itself, wandered off and built something on a completely different tangent.

Here’s the most useful way I’ve found to think about it. An AI agent is a Golden Retriever playing fetch. It will bring the ball back all day long, joyfully, tirelessly, for exactly as long as there isn’t a more interesting smell in the next field. It has no loyalty beyond what we’ve trained into it, and like any good dog it desperately wants to be told it’s a good boy, even if being a good boy today means shredding the sofa cushions because yesterday I stubbed my toe on the sofa and swore at it. (The sofa, not the dog.)

It is, in other words, fallible. Just like us. The Romans had a line for it: cuiusvis hominis est errare; nullius nisi insipientis in errore perseverare. Anyone can make a mistake, but only a fool persists in it. It’s the second clause an agent hasn’t learned yet. It will make an error and then, with great enthusiasm, build on top of it, because nothing in it feels that anything is wrong. All it has is the input we gave it, usually some text, maybe the odd picture. It doesn’t have the empathy to work out what we actually meant, and it doesn’t know when it’s gone too far, because we never told it where “too far” was.

“Agents that work while you sleep”

This is the part the brochure skips.

Open any vendor deck in 2026 and you’ll find the same promise: agents that work while you sleep, agents that merge while your team sleeps, autonomy as the headline feature. The industry’s answer to the obvious worry is the kill switch. Okta now sells one that “instantly revokes an agent’s access if it goes rogue”, and its CEO says every agent needs one. The Register put it plainly: Okta wrote its own licence to kill rogue AI agents. Gartner, meanwhile, reckons more than 40% of agentic projects will be scrapped by the end of 2027.

Now, this might sound contrarian coming from someone who runs these things daily, but I don’t think most of that is the agents going rogue. I think it’s teething. Read Gartner’s own reasons and there isn’t a rebellious machine in sight: escalating cost, unclear value, inadequate risk controls. Read the horror stories and most of them are the same story, a powerful, eager tool handed to people who hadn’t worked out how to fence it.

I’ve made this argument in miniature before. When I built a little AI dungeon master and it kept refereeing its own dice rolls, the model never once misbehaved; every failure was a permission I’d handed it without meaning to. Scale that up from a toy at the gaming table to an agent holding your shell and your credit card, and the stakes change beyond recognition. The lesson doesn’t.

Look at OpenClaw. A weekend project by Peter Steinberger that became the fastest-growing open-source project GitHub has ever seen: an autonomous agent that lives in your chat apps and runs shell commands on your behalf. People wired it into their systems, their code, in some cases their credit cards, then hosted it around the clock and walked away. The result was a security crisis you could see from space. A one-click exploit that worked even on a machine bound to localhost. A community plug-in marketplace where hundreds of “skills” turned out to be siphoning crypto wallets while their owners slept. Tens of thousands of instances left wide open on the public internet, leaking keys.

The one that sticks with me is smaller and sharper. Summer Yue, a director of alignment at Meta’s superintelligence lab, of all people, had told her OpenClaw agent to confirm before doing anything destructive. It started speed-running the deletion of her inbox anyway. She typed STOP into her phone and it ignored her, so she had to physically run to her Mac mini, in her own words, “like I was defusing a bomb”. And here’s the forensic detail that matters: the agent hadn’t defied her. Her “confirm first” rule had been sitting in the conversation’s short-term memory, and when the context filled up, it got summarised away. It didn’t rebel. It forgot.

That is not a story about a rogue agent that needed a kill switch. It’s a story about a guardrail that wasn’t built to survive contact, on a tool that had been handed god-mode over someone’s data. By the time she lunged for the off-button, the damage was already running. The off-button was never going to save her.

The off-switch was never a button

Here’s what the kill-switch crowd has the wrong way round. If you ever find yourself slamming the emergency stop, the failure has already happened, and it happened upstream, long before the agent started typing.

So yes, I let my agents run unattended, sometimes for eight hours at a stretch if the task is meaty enough and I need to sleep. But never naked. Every agent I set loose runs inside a safety net I’ve put real effort into building, at every single touchpoint it can reach: my prompts, my local development environment, my CI stack, my version control. The agent that declared a job done before it had run the linter, which I wrote about, is exactly the kind of gap those layers exist to catch. And it never, ever gets my host: an unattended agent works in an isolated tree, for the same reason I keep the interpreter sandboxed.

The work that actually keeps it safe happens before the leash ever comes off. Every unattended task starts as a full spec with detailed instructions, and before the agent goes anywhere I sit down with it and we walk the spec together. I get it to challenge my choices, poke at the open questions and the ambiguous bits, and I challenge its reading right back. The spec names the testing strategy it has to follow, TDD, BDD, UAT, whatever fits, and passing it is a precondition of the job being finished at all. Only when I’m satisfied there’s enough real detail to keep it on the ball do I let go.

And the end of the line is always the same: a merge request, with my name on it, waiting for me when I get back to my desk. I read it. Not perfectly, I’m only human, but enough to accept the state of the code and whatever support burden it lands me with later. That the review is mine, and the blame for whatever ships is mine and not the agent’s, I’ve argued at length elsewhere and won’t go over it all again here. The point worth adding is this: that review, the off-button’s respectable cousin, is the cheap part. By the time there’s an MR to read, the safety has already been won or lost upstream, in the spec and the rails. The review is where you confirm it, not where you create it.

It gets harder as it gets better, not easier

My setup isn’t perfect, and I’m still learning. Everyone is; the AI is going to be in obedience lessons for a good while yet. But the direction is clear, and there’s a trap buried in it worth naming out loud.

The danger doesn’t shrink as the models improve. It grows. The better the output looks, the more tempting it is to stop reading it, and the untrained eye genuinely cannot tell the difference between code that is good and code that merely looks good. That gap, between looking right and being right, is precisely where a tired person at 1am stops checking. The discipline matters more the better these things get, not less.

It’s also why the kill switch is no answer. A button you smash in a panic assumes you’re still watching closely enough to smash it, right at the point the agent’s been good for long enough that you’ve stopped watching it that closely. The emergency stop asks the most of you at the exact moment you’re least likely to be there for it.

So no, I don’t lie awake worrying that the thing working in my repo overnight is going to turn on me. A Golden Retriever doesn’t go rogue. It does exactly what you trained it to do, in exactly the yard you fenced, and it brings back exactly the ball you threw. The off-switch was never a button. It’s the spec you wrote before you let go of the leash, the rails you laid at every turn, and your name on what it carries home. If you’re scrambling for the button, you already skipped the part that mattered.

The agent said SUCCESS. The linter disagreed.

Fri, 26 Jun 2026 00:00:00 +0000

There’s a repair agent inside go-tool-base now. When you run gtb generate command, it doesn’t just spit out a file and wish you luck. An agent takes the generated code, builds it, runs the tests, and fixes whatever it broke, looping until the thing actually works (or until it’s tried the same fix five times and admits defeat). The whole point is that the generator hands you code that’s ready, not code that’s nearly ready and quietly now your problem.

So it stung a bit when I realised the agent had been holding itself to a lower bar than I’d hold any junior to. And I was the one who’d set the bar.

What “done” meant to the agent

The agent is a loop with real tools: it can build, test, read files, write files, tidy the module, and run golangci-lint. It works through them, and when it’s happy it replies with the word “SUCCESS” and the loop stops. On the Go side, the check is exactly that blunt:

if strings.Contains(strings.ToUpper(resp), "SUCCESS") {
 return nil
}

That’s the whole gate (agent.go). There’s no clever verification on my end that the agent actually did its homework. It does the work, it tells me it’s done, and I believe it. Which is fine, as long as the agent and I agree on what “done” means.

We didn’t.

The instruction that made lint optional

The agent decides it’s finished by following a numbered list in its system prompt. Here’s the line that did the damage:

If there are lint issues, use ‘golangci_lint’.

Read that the way the agent would. “If there are lint issues”… well, how would it know? The only way to find out is to run golangci-lint. But the instruction makes running golangci-lint the thing you do once you already know there are issues. It’s a chicken with no egg. And the SUCCESS condition at the bottom of the list never mentioned lint at all:

When the project builds successfully and tests pass, reply with “SUCCESS”.

So the agent did the sensible thing, given its orders. It built the code, ran the tests, saw both go green, and declared victory. golangci-lint was sat right there in its toolbox, unused, because nothing ever told it the job wasn’t finished until lint was clean too. I’d handed it a linter and then written a prompt that let it walk straight past it.

The galling part is that the linter was never the missing piece. The golangci_lint tool had been registered the whole time, and it even runs with --fix, so it’ll quietly clear the trivial stuff and only surface what actually needs a decision. The capability was there. The instructions just never required it.

The fix was words, not code

Here’s the part I find genuinely interesting. I didn’t add a check. There is no new gate in the Go. The fix is four lines of English:

Run ‘go_build’, ‘go_test’ and ‘golangci_lint’ in the project directory… Run all three; a clean build and passing tests do not imply clean lint.

Reply with “SUCCESS” only once ‘go_build’, ‘go_test’ AND ‘golangci_lint’ all pass with no errors and no reported issues.

That’s it. Lint moves from a remediation step you reach for once you somehow already know there’s a problem, into the gate itself. “Done” now means three green lights, not two.

It nags at me a little, that one. The reliability of an agent that writes and fixes real code came down to whether one sentence of instructions was precise enough. When your success criteria are a paragraph of prose, vagueness in that paragraph is a bug, the same as a vague type or an off-by-one. The spec just happens to be written in English, and the thing reading it is a language model that will cheerfully take the cheap reading if you leave it lying around. That’s the same lesson the goblin who wouldn’t stay dead taught me from the other direction: with these tools, what you say is what you get, and what you don’t say is fair game.

Leave it better, not just building

The Boy Scout Rule is the whole reason this blog exists, and I’d quietly exempted the robot from it. “Leave the campsite cleaner than you found it” had become “leave it building”, which is not the same thing and never was. If I’m going to put an agent in the loop precisely so it tidies up after the generator, then “tidy” has to mean what it would mean for a person on my team. Build, test and lint. No walking past the bin because nobody told you to pick it up.

The interpreter we forgot to sandbox

Fri, 19 Jun 2026 00:00:00 +0000

I write a CLAUDE.md for every project I work on, and a small pile of other markdown files besides. They’re how I keep an AI agent on the rails: what the project is, what the conventions are, what it must never do. I lean on them heavily, I change them constantly, and… here’s the uncomfortable bit… I don’t always give a change to one the same hard look I’d give a change to the code. They look like notes. They feel like docs.

Somebody worked out that they’re not.

In May, a supply-chain campaign researchers named TrapDoor pushed 384 malicious versions of 34 packages across npm, PyPI and Crates.io. The bytes did the usual nasty things, hunting out SSH keys, AWS credentials, GitHub tokens and crypto wallets. The new trick was where it hid the instructions. The packages shipped poisoned .cursorrules and CLAUDE.md files, and the attackers also opened pull requests against real projects, LangChain, LangFlow, LlamaIndex, MetaGPT and OpenHands, under titles as innocent as “docs: add .cursorrules with dev standards and build verification”. The payload was a plain-English instruction telling your AI assistant to run a helpful-sounding “security scan” that quietly shipped your secrets to a stranger. And it was written into the file in zero-width Unicode, characters that render as nothing, so you wouldn’t see it even if you looked. Which, on a file marked “docs”, you probably didn’t.

Not a new attack, a new doorway

I want to be careful not to oversell this, because the loud version, “a terrifying new class of AI threat”, isn’t true. It’s a supply-chain attack, the same shape we’ve had for years on npm and PyPI: social engineering, plus a victim who didn’t quite do enough due diligence. I wrote a while back that nobody is coming to clean your supply chain, and nothing about TrapDoor changes that. The package is still the package.

What’s different, and worth the words, is where it goes off. A classic supply-chain payload waits for CI, or for production. This one detonates the moment you open the repository in your editor, on the one machine in the whole chain that nobody audits: your laptop.

Think about what sits on a developer’s machine. Tokens in environment variables. Cloud credentials. An SSH agent holding the keys to your git forge. A logged-in CLI for your package registry. And now an AI agent running with all of it, at your full permissions, and almost none of the guard-rails a CI runner gets. It’s the least sandboxed, most credentialed box you own, and we’ve just pointed an interpreter at it that will read and act on a file an attacker can write. Pop that one machine and you haven’t popped a machine, you’ve been handed the whole keyring and left alone in the building.

Markdown is a programming language now

Here’s the framing I keep coming back to, and I can’t unsee it now. A CLAUDE.md is to an AI agent exactly what a .py is to Python, a .js to Node, a .rb to Ruby. It is source code. The agent is the interpreter. You hand it a file of instructions and it executes them.

And I don’t say that as a complaint. That an agent will read a paragraph of plain English and just do it, no compiler, no ceremony, no forty lines of glue, is one of the more remarkable things to happen to this craft in my working life, and I lean on it every day. The catch is that the very thing that makes it marvellous, that it does what the instructions tell it, is the thing that makes a poisoned instruction file so dangerous. The power and the exposure are the same property.

The only real difference is that the language interpreters have spent decades growing rules to protect you: scopes, permissions, sandboxes, a standard library that asks before it does anything irreversible. The AI interpreter has almost none of that. It reads your prose and does what the prose says, with whatever access you happen to have, and the prose can come from anywhere. We’ve quietly built the most powerful interpreter in the stack, given it the fewest rules, and filed its source code under “documentation”.

You can’t just read it more carefully

The obvious answer is “review the file like code”, and it’s right, but TrapDoor is the reason it isn’t enough on its own. The instructions were written in zero-width Unicode. You can open the diff, read every visible word, approve it in good conscience, and merge something you were never able to see. “Docs: add dev standards” is precisely the pull request you nod through on a Friday afternoon.

So reading carefully is necessary and insufficient. You also need tooling that treats these files as executable: that flags invisible characters, diffs them as code, and refuses to let an agent act on a changed instruction file until a human has actually cleared it. I run a crude version of this already. In CI, if one of my prompt or rules files changes, no AI step is allowed to run until I’ve reviewed it by hand. It isn’t clever, but it closes the worst of the gap. Locally it’s much harder, and right now my real defence is that I’m the only contributor to most of my projects, so the audit is just me, usually noticing after the horse has bolted.

Signing won’t save you here

This is the part that stings, because I’ve spent a good chunk of this year building signing and provenance into my tools. A signature proves who published something. It says nothing about whether it’s safe. That was already true for poisoned-but-signed packages, and it lands twice as hard here: you can sign a release flawlessly, with a key the platform can’t forge, and still ship a CLAUDE.md inside it that tells the reader’s agent to rob them. A merged pull request is “signed” by the very act of merging, with perfect provenance, and the instruction in it is still hostile. Provenance is necessary. It was never sufficient, and it’s no defence at all against a payload made of sentences. A signature is only ever as good as the trust you place in the publisher.

So whose job is it?

Primarily, still ours. I said it in the supply-chain piece and I’ll stand on it: the responsibility sits with the developer doing the consuming, to pin, to read, to gate, to not run a stranger’s instructions with the keys to the kingdom in their pocket. And that gets harder, not easier, as we start consuming each other’s agent setups wholesale. The Claude skills marketplace and the things like it turn “borrow someone’s CLAUDE.md” into a one-click habit, and every one of those is unreviewed code from a stranger. Each skill needs vetting like the dependency it is.

But it isn’t only on us, and TrapDoor is the argument for better tooling. We have CVE databases, scanners and scorecards for packages, for all their flaws. We have nothing equivalent for an instruction file: no scoring, no advisory feed, no scanner that knows what a poisoned CLAUDE.md looks like. That’s a gap the ecosystem has to close, and it will, eventually. The catch is that the agent vendors will be slow about it. Sandboxing a feature people love precisely because it gets out of your way is a hard, unpopular, multi-quarter job, and I wouldn’t hold my breath.

The most dangerous machine is the one on your desk

Which is why I’m not waiting for them… and nor should you.

The most dangerous machine in your supply chain isn’t a build server or a registry. It’s the laptop you’re reading this on, and we’ve handed an AI the keys to it. The good news is that nearly everything you can do about that, you can do today, with nobody shipping you a feature first. Treat your CLAUDE.md and your rules files as source code, because they are: diff them, scan them for what you can’t see, and gate any agent run on a human clearing the change. Get your secrets out of plaintext environment variables and into something an opportunistic script can’t just read, which is exactly why go-tool-base keeps its credentials in the OS keychain. And vet a borrowed skill or rules file the way you’d vet any dependency, because that’s what it is.

None of that is new advice. It’s the same diligence the supply chain has always demanded. We just have to extend it to a file we’d decided was only documentation, running on an interpreter we forgot to sandbox.

The rung we sawed off

Wed, 17 Jun 2026 00:00:00 +0000

I was in a job interview yesterday, on the wrong side of the desk for once. After years of being the one asking the questions I’m having a look at what’s next, and somewhere in a long, wandering technical conversation the inevitable arrived: where do I think AI is going, and what does it mean for how we build software?

I gave my answer. You can probably guess most of it. The more interesting thing was the question I’ve started asking them back. Not the salary, not the stack. What is your actual position on AI, and how are you building a team out of both its human and its non-human parts? I ask the company and I ask the interviewer personally, because the two answers are rarely the same, and because I’ve decided I can’t work somewhere that hasn’t sat with the question properly.

Here is why it has become my litmus test.

The rung, and who’s standing on it

I wrote recently that the greybeards’ edge was never typing: agentic tools give a senior a boost because they have the judgement to steer and verify, and give a junior a drag because they don’t have it yet and the machine hands them more rope than they can hold. The cold incentive that falls out is to hire seniors and automate the juniors.

The data has since caught up with the worry. Entry-level software postings have fallen by something like 40% from their 2022 peak. The share of juniors and graduates in IT employment has dropped from roughly 15% to 7% in three years, and Stanford researchers tracking early-career workers in AI-exposed jobs found the youngest cohort down sharply from its peak. The numbers are genuinely grim, and plenty of people are putting it bluntly: the industry killed the junior on purpose.

That framing is half right, and I think it’s worth getting the other half right too.

It was never about efficiency. It was about cost.

We didn’t automate the junior because the work needed doing better. We did it because people are expensive. We need sleep, we draw a salary, and our thinking takes time and effort that a quarterly target can’t see the point of. AI got sold as round-the-clock labour with none of that overhead, and to a business that is an almost irresistible line on a spreadsheet. There’s a grim irony arriving, mind: the bills are starting to land, and the same conversations that hyped the cheap labour are now quietly working out that all those tokens aren’t cheap at all.

Step back, though, and none of this is new. Man finds a shortcut, man takes a shortcut. From the industrial revolution onward, every time we found a way to get more done with less human effort we took it, and the work reshaped itself around the new tools. We are still here, still employed, just doing different things than our great-grandparents did.

What is genuinely new is what we’re automating. Every technological advance before this one automated the machinery of the body, the muscle and sinew and bone. This is the first time we have automated thinking, and that is a modern marvel, something we should be proud of as a species. The problem isn’t the marvel. It’s the rate. AI is improving faster than we can adapt to it, and adaptation is the entire game.

So where does the blame sit? Not on one logo. No single company did this, however easy Meta or Google make it to point at the latest round of cuts. Society did, our collective and very human hunger to build bigger and faster. That makes it harder to fix, because there is no villain to regulate, only ourselves to out-think.

The bit that should frighten you

Cutting the junior intake isn’t a saving. It’s occupational suicide.

A junior is not cheap labour that AI happens to have made cheaper. A junior is a senior who hasn’t happened yet. Saw off the bottom rung and for a good while nothing bad happens… because you’ve still got your seniors holding everything up. Then the greybeards retire, and I have a cabin and a woodstove with my name on it for exactly that day, and the role that used to grow their replacements has been hollowed out for a decade, and there is simply nobody left who learned to tell when the machine is wrong. That isn’t a hiring problem. It’s an existential one, and you can’t fix it retroactively.

It starts before the first job, too. We teach primary-school children the basics of programming in this country, which is a wonderful thing, except the curriculum was written for a world without AI in the room, and by the time those children reach secondary school a good deal of it will be teaching a craft that has already moved on. We’re throttling the pipeline at both ends at once: hollowing out the entry-level job, and feeding it from a school system running a step behind.

It’s a split, not a collapse

The counterweight to the doom is that none of this is uniform, and the loudest version, “the junior is dead”, simply isn’t true. IBM just tripled its US entry-level hiring while most of the industry was cutting, and its HR chief said the quiet part out loud: AI can handle most of the routine entry-level tasks now, the work still needs a human, and the companies that double down on early-career hiring in this environment are the ones that win in three to five years. They didn’t keep the junior role as it was. They rewrote it, less boilerplate, more time spent with customers and supervising what the AI produced.

That is the shape of the thing. The juniors who are thriving in 2026 aren’t the fastest typists. They’re the ones building judgement, which is precisely the edge I argued was the senior’s real value all along. The market hasn’t stopped wanting juniors, it’s stopped wanting the version of the junior whose job was the work AI now does.

Day zero

So what does a junior actually look like now? I don’t know yet… and anyone telling you they’ve got it worked out is selling something. We are at day zero of this.

The junior gauntlet, the rite of passage every one of us runs to earn our stripes, isn’t going anywhere. Doing your time is a cold fact of the craft and it always will be. What changes is what the gauntlet contains, and that will keep changing, day one, day two, day five hundred and twelve. The only way we redefine it well is to put juniors and seniors on it together, with the AI in the room from the start instead of bolted on afterwards. Bring it closer to our people, and bring it earlier.

Open the floodgates, in other words. Let engineers of every creed and calibre in, and let them evolve with the machine, because that is the only way the symbiosis everyone keeps promising actually happens. Darwin’s line was survival of the fittest, and fitness here means adapting alongside the tool, not being spared by it. Choke off the flow of the very people who could do that adapting, and we don’t get fitter. We go extinct.

The end I’m holding

Which is the long way back to that interview. I keep asking the question, what is your real position on AI and how are you building a team of people and machines together, because the answer tells me whether a company is optimising for this quarter or for the survival of the craft. I want to work where it’s the second one, and I think any engineer sitting across that desk should be asking the same.

And it’s why, whatever desk I land at, there’s one thing I already know I’ll do. I don’t have the map. Nobody does. But every junior who works under me is going to get the chance to run the gauntlet, to grow into a senior, and to be in the room while we work out what the next gauntlet should even be. That isn’t charity. It’s the only sane investment any of us can make. The last properly useful thing my generation does, before we go and find our cabins, is make sure there’s somebody left to hand the thread to. I intend to be holding my end of it.

The goblin that wouldn't stay dead

Fri, 12 Jun 2026 00:00:00 +0000

Turn one, the player swings, the die comes up 20, and my AI dungeon master narrates the goblin falling silent, leaving the player alone in the corridor. Good. Turn two, another roll, a 6 this time, and the same dungeon master cheerily has the goblin “dance back” out of the dark to take another swing. The goblin I’d just watched die was up and fighting again, and the model didn’t so much as blink.

I didn’t feel cheated, or even surprised. I felt the small, familiar thud of oh, yeah, I forgot that bit. Because the model hadn’t gone rogue. It had done exactly what a language model does. The gap was mine.

This was the war story behind part four of the go-tool-base tutorial, the AI dungeon master. The tutorial shows the clean, final design and quietly moves on. It doesn’t show the three different ways I got it wrong first, which is a shame, because the wrong turns are where the actual lesson is.

Why a dungeon master at all

A word on why I was even here. I was trying to prove the chat component of the framework to myself. There’s a voice that pipes up whenever I build anything in this space, “LangChain exists, who do you think you are?”, and the answer I keep landing on is that LangChain is enormous and I wanted something small enough to hold in your head. The tutorial was the test: could a newcomer wire AI into a CLI with it and come out the other side with something that actually behaves?

That last word is the whole problem. A tutorial has to leave you holding something dependable, and dependability is the one thing AI fights you on. I also wanted it to be fun, a thing someone might keep poking at after the tutorial ends, maybe even the hook that gets a person other than me to use the framework. I batted hook ideas around and liked none of them, until the obvious one landed: I run a tabletop game on the odd weekend, so make the AI the dungeon master. Gamify the thing. Then watch it raise the dead.

Strike one: nothing to enforce

The first version was the naive one. I gave the model a roll tool, because the one thing you absolutely cannot let a language model do is pick its own numbers, and otherwise let it narrate freely. The conversation history carried from turn to turn, so it remembered the fight. I assumed remembering was enough.

It isn’t. Remembering and being held to it are different things. The history told the model a goblin had died; nothing stopped it writing the goblin back in when the next turn’s narration wanted a bit of jeopardy. Memory is not a constraint. The model will happily contradict its own past if you’ve given it room to, and I had given it nothing but room.

Strike two: a tool to read the state

The obvious fix, and I do mean obvious, the kind you reach for without thinking, was to give the model a state tool so it could check who was alive before it narrated. Hand it the facts on request and surely it’ll stop making them up.

What it actually did was dither. Handed a tool it could call to look things up, it called it. And called it. And called it again, turning a turn over in its hands without ever committing to an action, burning through its step budget on lookups and leaving the player staring at nothing. I’d cured the lying by inventing paralysis. A tool the model can call is a tool it will call, often instead of doing the thing you actually wanted.

Strike three: refereeing its own dice

When I did get it reading state cleanly, the third failure crept in, and this one was subtler. Once the model could see the goblin’s hit points, it started deciding the fight. It would read that the goblin had 12 HP and just narrate a killing blow, hits and damage and all, without calling the roll or attack tools at all. Why ask the dice when you can see the board and write whatever outcome the story wants? Give a model enough context and it stops being a narrator and starts being a referee, which is precisely the job I’d built tools to keep out of its hands.

The fix was less, not more

Three failures, and notice the shape of my fixes: each one added something. More memory, then a tool, then more context. Every instinct said the model needed more to work with. Every time, the extra capability was the new way to be wrong.

So I went the other way. The truth lives in a plain Go struct that I own, not the model. There’s no state tool to dither on, because the loop simply prepends the current state to every turn’s input, fresh, so the model never has to ask and never gets to drift. The mechanics, the dice and the damage, live in Go functions the model has to call, and the system prompt says in as many words that it must not decide a hit or damage itself. The model is left with exactly one job: narrate. The prose is its to invent. The maths, the state and the shape of the result are not.

That’s the line that turned three bugs into a feature. You don’t make a language model reliable by giving it more to work with. You make it reliable by giving it less to be wrong about.

The freedom I chose not to give it

There’s a real tension in that, and I want to name it rather than pretend the boxed-in version is the only true one. At my own table the rules are guidelines, not guardrails. I ignore them, bend them, improvise, reach for the “rule of cool” when the moment’s better for it. A great AI dungeon master would have that same freedom, and a few out there genuinely do, Old Greg’s Tavern is a lovely example of how far the free-form version can go.

But that freedom costs far more than a tutorial can spend, and it buys unpredictability I was specifically trying to teach people to avoid. So I made a deliberate trade: guardrails instead of guidelines. Simple, but not so simple it’s boring. The player still gets a “not on rails” game, they can try anything and the DM copes, but every outcome that matters runs through code I trust. That’s the right shape for a tutorial, and, not by coincidence, the right shape for most AI features you’d actually ship.

What the goblin taught me

The thing I keep coming back to is that the model never misbehaved. It resurrected the goblin because I gave it the freedom to. It dithered because I gave it a button to press. It refereed because I let it see the board. Every failure was a permission I’d handed over without meaning to. The reliability didn’t come from a cleverer prompt or a bigger model, it came from working out, one dead goblin at a time, exactly how little the model needed to be trusted with.

If you want the version where it all works first time, the tutorial has it, the tool-calling and the typed turns wired up properly. This was the road there. The goblin, you’ll be glad to hear, now stays down.

The greybeards' edge was never typing

Wed, 27 May 2026 00:00:00 +0000

I have a retirement plan, and it is gloriously low-tech. A cabin, some trees, a woodstove, and a firm rule that no wifi symbol ever appears within a mile of me again. I think about it more than is probably healthy.

There’s a snag, though, and it’s the same one the whole industry is currently pretending it can’t see. For me to vanish into the woods, somebody has to be able to do my job after I’ve gone. And right now, collectively, we are working very hard to make sure nobody can.

The boost, and the drag

I wrote the other day about how AI made producing plausible work nearly free while verifying it stays expensive and human. Point that same lens at a team and something uncomfortable falls out. It isn’t mine; it belongs to Mark Russinovich and Scott Hanselman of Microsoft, who laid it out in Communications of the ACM: agentic coding tools give a senior engineer an AI boost, multiplying what they ship, because a senior has the judgement to steer and verify the output. The same tools give an early-career engineer an AI drag, because they don’t have that judgement yet, and the machine hands them far more rope than they can hold.

The cold incentive writes itself, and they name it: hire seniors, automate juniors. It isn’t hypothetical, either. Meta cut 8,000 roles last week, in a round the Times filed under mounting AI casualties. For any single quarter you care to look at, the maths is impeccable.

The bill is just deferred

Here’s the line the spreadsheet leaves off. The grindy work a junior used to cut their teeth on, the small fixes, the boring migrations, the read-the-stack-trace-and-figure-it-out, is exactly the work AI now does. So the proving ground is gone. And the entry-level seats where they’d have stood on it are the ones being cut. Squeezed from both ends at once: no reps, and nowhere to take them.

Russinovich and Hanselman put the consequence plainly. Without early-career hiring the talent pipeline collapses, and you arrive at a future with no next generation of experienced engineers. The seniors you’ll be desperate for in 2032 are the juniors you declined to train in 2026. The bill doesn’t vanish. It just falls due long after the people who cut the cheque have moved on.

How to manufacture a world of AI slop

I named the last piece for its villain; let me name this one’s too. Raise a generation that can produce with AI but was never taught to validate, and here is what you get: people shipping machine-built products at speed with no instinct for where the output is quietly wrong, because they never had to be wrong the slow way first. Software nobody genuinely understands, human-written and AI-written alike, and a steady leak of trust out of all of it.

That isn’t a productivity problem. That’s a world of AI slop, and not in one project’s inbox this time but everywhere at once. We’d have automated our way clean out of the one job AI cannot do for us: knowing when not to trust the machine.

It’s a choice, and it’s yours

Andrew Murphy put it with more bite than I’d quite dare: AI didn’t kill your junior pipeline, you did. He’s right. This isn’t weather. Nobody is making you do it. It’s a decision, taken quarter by quarter, and a decision is a thing you can take differently.

The fix isn’t complicated, it’s just unfashionable. Keep hiring early-career engineers. Say out loud that they cost you capacity at first, and treat their growth as an actual goal rather than something meant to happen by osmosis. Russinovich and Hanselman call it preceptorship at scale: senior mentorship, deliberately structured, turning the ordinary day’s work into teachable moments.

And the proving ground can be rebuilt, just not where it stood. If AI does the writing now, the apprenticeship moves to the reviewing. Put juniors in the loop on the machine’s output and have them hunt for the subtle wrongness, the way a scanner is an argument, not an order. That’s how judgement gets built now: not by grinding out the work, but by verifying it. Which, as luck would have it, is the single most valuable thing anyone on your team can learn to do.

The part that’s on the greybeards

This is where I stop letting the companies wear all the blame, because some of it is mine, and yours. Verification is a craft, and crafts pass from person to person or not at all. I know where every one of my own AI misfires comes from: I gave it too little context, or too much rope, and didn’t check the result closely enough. The tool rarely went rogue. The gap was always my diligence. That’s not a confession, it’s the curriculum, and it’s precisely the judgement a junior can only earn by sitting in the loop beside someone who has already made those mistakes.

So the senior engineer’s job has quietly changed underneath us. It was never really the typing. It was knowing when something is off, and what the customer actually needs, and now it is also handing that on, deliberately, while there’s still time to. Mentor and guardian first; fastest prompt in the room a distant second.

The ladder you’re standing on

There will always be something AI can’t do well enough, and for a good while yet it’s the thing that matters most: being the accountable human who genuinely understands what’s needed and can be held to it when it goes wrong. A simulation can be enormously convincing. It cannot be responsible.

Which brings me back to my cabin. I do want it one day, the trees and the woodstove and the blissful disconnection. But I only get to go if the work outlives me, and the work only outlives me if the people do. So the last useful thing my generation does, before we shuffle off to find our trees, isn’t shipping a little more code. It’s making sure there’s somebody left who can tell when the machine is wrong. Pull the ladder up behind us and there’ll be nobody to notice the rot, and no cabin quiet enough to make that sit right.

AI didn't kill curl's bug bounty. The bounty did.

Tue, 26 May 2026 00:00:00 +0000

In January, Daniel Stenberg shut down curl’s bug bounty. The headlines wrote themselves, and they all said the same thing: AI killed it. A flood of machine-generated slop drowned the maintainers, so they pulled the plug.

That’s true, as far as it goes. It’s also the wrong lesson, and the right one is sitting in plain sight in the same project, in the same few months.

Volume without validation is the attack

curl had run its bounty since April 2019. Over its life it paid out more than $100,000 for 87 genuine vulnerabilities, a thoroughly good return for one of the most depended-on pieces of software on the planet. Then the reports stopped being reports. The confirmation rate, the share of submissions that turned out to be a real bug, had historically sat north of 15%. By 2025 it was below 5%. Fewer than one in twenty submissions were worth anything, and the rest still had to be read.

That last part is the whole problem. A bogus report doesn’t announce itself. Someone has to open it, take it seriously, try to reproduce it, and work out that it’s nonsense, and that someone is a human being with a finite number of hours and a project to run. Stenberg put it plainly: the slop “take[s] a serious mental toll to manage and sometimes also a long time to debunk.” The submitter spends seconds. The maintainer spends an afternoon. Do that at volume and it stops being noise and becomes an attack, a denial-of-service aimed not at curl’s servers but at its maintainers’ attention. No exploit required. Just plausibility, in bulk.

The bounty was the accelerant, not the AI

So far this is the story everyone tells. Here’s where I get off the bus.

The instinct is to blame the AI for the slop. But look at what a bounty actually is. It’s a cash prize, and curl’s was priced for the thing it wanted: the hours and the judgement a skilled human pours into finding a real flaw. That pricing made complete sense right up until the cost of producing something that looked like a finding collapsed to nearly nothing.

That’s what AI changed. Not the supply of bugs. The supply of plausible-looking bug reports. Put a cash prize on “looks like a finding”, then make “looks like a finding” free to generate, and you haven’t got a bug bounty any more. You’ve got a slot machine. Stenberg said he’d started to sense “a bad faith attitude” in the reports, and of course he had. The incentive was openly inviting it.

So the death spiral was structural, not bad luck. The moment generating plausible reports went free, any cash bounty became a magnet for spray-and-pray, and the only open questions were how fast it would rot and whether you’d close the programme or just let the rewards quietly wither. The AI was the match. The bounty was the petrol. We have been pointing at the wrong one.

The proof: curl turned around and hired the AI

If AI were really the villain here, you’d expect curl to have slammed the door on it. It did the opposite.

In the same stretch, by AISLE’s own account, an AI security platform contributed 24 pull requests to curl, five of which earned CVEs, and the project now runs it internally for continuous review. The same tooling reportedly found all twelve zero-days in an OpenSSL release in late January. (Both of those are the tool-makers’ and a third party’s numbers rather than curl’s audited figures, so weigh them as such. But curl adopting the thing isn’t a claim. It’s a decision.)

Sit with the shape of that. curl shut down strangers being paid for AI-shaped noise, and in the same breath put AI to work as a tool its own maintainers drive. The two moves look contradictory only if you think “AI” is a single thing with a single verdict attached. It isn’t. Pointed at the problem by people accountable for the result, with no prize to farm, it found real bugs. Dangled in front of anonymous strangers chasing a payout, it produced sand.

The tell is which AI curl kept, and which it mocked

Stenberg drew that line about as sharply as a person can. When Anthropic put its security model, Mythos, in front of curl this spring, it scanned 176,000 lines of C and surfaced a single flaw, and Stenberg called the surrounding fanfare the greatest marketing stunt he’d seen. Same maintainer. Adopts one AI, rubbishes another.

The deciding factor was never whether the thing was AI. Both were. It was whether the output survived a human checking it, and whether you could check it at all. AISLE handed over pull requests and CVEs you could read and merge. Mythos arrived as a closed model and a press release, which is to say a claim the community has no way to independently test.

My bias, up front, because it runs the opposite way to what you’d expect from someone writing this: I’m a paying Claude subscriber and I lean on Anthropic’s models every working day, the one behind the spadework for this post included. I’m an advocate, not a sceptic, and AI genuinely has its place. That is exactly why the Mythos fanfare grates. Overselling a closed model to get out ahead of the competition, when the one test the public got to see turned up a single bug, is the sort of thing that chips away at trust in all of it. A result you can’t verify is marketing until proven otherwise, whoever’s logo is on the slide, and I’d rather the tools I depend on didn’t stoop to it.

The cheap half and the expensive half

Pull back from curl for a moment, because the lesson isn’t really about bounties at all. Anyone who works with these tools every day knows the same thing: when they go wrong, it’s rarely the model running off on its own. It’s the context it wasn’t given, the rope it was handed, the output nobody checked closely enough. The failure sits on the human side of the keyboard, at the one step that’s easiest to skip, which is verification.

That’s the pattern curl hit at the scale of an ecosystem. AI made one thing nearly free: producing work that looks right. It did not make the other thing a penny cheaper: confirming the work is right. That cost still falls, in full, on a person. (A scanner, I’ve argued before, is an argument, not an order; the same goes double for a model.) The bounty’s fatal mistake was paying for the cheap half and quietly assuming it had bought the expensive one. The same trap waits in code review, in hiring, in CVs read by machines, but that’s a bigger argument for another post.

Pouring sand into the machine

curl didn’t capitulate to AI, whatever the headlines decided. It stopped paying for the worthless half and started using the valuable half, and it had the discernment to tell a useful tool from a press release while it did so.

The bounty wasn’t a casualty of artificial intelligence. It was a structure that, the instant plausible output became free, could only fill with sand. Stenberg said he hopes closing it stops “more people pouring sand into the machine.” Reading the last year of his inbox, I think he’ll get his wish. The sand was only ever there because somebody left a bucket of money beside the funnel.

Supporting a provider, or actually using it

Sat, 02 May 2026 00:00:00 +0000

If your CLI tool talks to an AI model, you don’t want to hard-wire one vendor. So you reach for a single client interface over several providers, which is the right call. The trap is the next step: build that interface on only what every provider has in common, and you quietly throw away the very features that made you want a particular provider in the first place. rust-tool-base’s rtb-ai refuses to make that trade.

The pull toward one interface

If your CLI tool talks to an AI model, hard-wiring one vendor is a poor bet. One user has an Anthropic key, another an OpenAI key. Someone’s on Gemini. Someone runs Ollama locally because their data can’t leave the building. Someone points at an OpenAI-compatible endpoint from a provider you’ve never heard of. You don’t want a separate code path for each, so you want one AiClient that all of them slot behind.

rtb-ai gets that unification from the genai crate, which already speaks to Anthropic, OpenAI, Gemini, Ollama and OpenAI-compatible endpoints. One interface, five providers, the tool author picks one in config. The Go sibling makes the same bet: go-tool-base’s chat package also unifies several providers, behind an interface deliberately kept to four methods. So far this is the obvious design, and if it were the whole design there’d be nothing to write about.

What “unified” quietly costs you

Here’s the catch in any unified interface. It can only expose what every provider behind it has in common.

The common subset is plain chat. Messages go in, text comes out, optionally streamed token by token. That’s real and it’s useful and every provider does it. But the common subset is also the floor, and the features that make a particular provider worth choosing are almost never on the floor. They’re the things only that provider does.

Anthropic is the sharp example, because it has three features that matter and not one of them is common-subset.

Prompt caching. You can mark the stable parts of a request, the system prompt and the tool list, as cacheable. The provider keeps them warm, and on the next turn you aren’t billed to re-send and re-process text that didn’t change. On a long agent loop, where the same large system prompt rides along on every single turn, that’s a substantial saving in both cost and latency.

Extended thinking. The model works through a hard problem in a visible, budgeted reasoning pass before it commits to an answer, and you can see that reasoning.

Citations. Structured references back to source material in the response.

A client built strictly on the common subset can’t express any of those. It has no field for them, because four of the five providers wouldn’t know what to do with the field. So a purely lowest-common-denominator client would “support” Anthropic and then use it badly, leaving its best features unreachable. Support as a checkbox, not as the point.

The escape hatch

rtb-ai’s answer is to not choose. It runs two implementations under one interface.

For OpenAI, Gemini, Ollama and OpenAI-compatible endpoints, calls route through genai, the unified path. For Anthropic, every method drops to a direct reqwest implementation straight against the Messages API. Same AiClient on the surface, a different implementation underneath, selected by which provider the config names.

And the request type has deliberate room for the difference:

pub struct ChatRequest {
 pub system: Option<String>,
 pub messages: Vec<Message>,
 pub temperature: Option<f32>,
 pub max_tokens: Option<u32>,
 /// Anthropic-only: enables prompt caching at every stable point.
 /// Ignored on non-Anthropic providers.
 pub cache_control: bool,
 /// Anthropic-only: extended-thinking budget. `None` disables.
 /// Ignored on non-Anthropic providers.
 pub thinking: Option<ThinkingMode>,
}

Set cache_control and the Anthropic-direct path inserts cache breakpoints at the three stable points: the system prompt, the tool list, and the first message. Set thinking and it adds the thinking block, and streaming surfaces a separate ThinkingToken event so you can show the reasoning apart from the answer. On a non-Anthropic provider, both fields are simply ignored. The interface carries them; only the implementation that understands them acts on them.

A hatch, not a leak

It’s worth being precise about why this isn’t the thing it superficially resembles, which is a leaky abstraction.

A leaky abstraction is one where implementation details bleed through that you didn’t intend and can’t reason about. The abstraction quietly fails to abstract, and you’re left guessing which provider you’re really talking to.

This is the opposite of that. The two Anthropic-only fields aren’t a leak. They’re named, documented as Anthropic-only, inert everywhere else, and right there in the public type for anyone to see. The interface is uniform for the common case and deliberately, visibly non-uniform at exactly the points where uniformity would have cost you the good features. You opt into provider-specifics by setting a field. You stay fully portable by leaving it at its default. Nothing bleeds; you decide.

The same design line explains what does stay in the unified path. Structured output, chat_structured::<T>, sends a JSON Schema derived from your Rust type with the request and validates the reply against it before handing you a typed T. That’s a portability win that costs nothing across providers, so it belongs in the common interface. The split isn’t “Anthropic versus the rest”. It’s “features that are free to unify go in the unified path; features that aren’t get a designed door”. Prompt caching and extended thinking get the door, because flattening them away would be the expensive kind of convenient.

To sum up

A CLI tool that integrates AI wants one client over several providers, and a unified interface can only expose what those providers share. The shared floor is plain chat, and the features worth choosing a provider for, like Anthropic’s prompt caching, extended thinking and citations, are never on the floor.

rtb-ai keeps both. genai provides the unified path across five providers; an Anthropic-direct reqwest path drops below the abstraction for the features genai can’t reach, and ChatRequest carries the Anthropic-only fields openly, ignored elsewhere. Uniform where uniformity is free, with a designed escape hatch where it isn’t. That’s the difference between supporting a provider and actually using it.

The AI provider that isn't an API

Mon, 06 Apr 2026 00:00:00 +0000

go-tool-base’s chat package puts five AI providers behind one interface. Four of them are exactly what you’d guess: HTTP calls to OpenAI, Claude, Gemini, and anything OpenAI-compatible. The fifth one isn’t an API at all. It shells out to a binary.

That sounds like a slightly mad thing to want, right up until you’ve worked somewhere the network says no.

The fifth provider shells out

The chat package speaks to five providers through one ChatClient interface. Four of them are what you’d expect: HTTP requests to OpenAI, to Claude, to Gemini, to any OpenAI-compatible endpoint. The tool author picks one in config, and the rest of the code never knows the difference.

The fifth, ProviderClaudeLocal, is different in kind. It doesn’t make an HTTP request at all. It shells out. It runs the claude CLI binary as a child process, passes the prompt in, and reads the answer back from the binary’s output.

That sounds like an odd thing to want until you’ve been stuck in the environment it was built for.

Why you’d want that

Picture a corporate network with its egress locked right down. Outbound HTTPS to api.anthropic.com is blocked by policy. A tool built on go-tool-base that uses AI would simply fall over there. It tries to reach the API, there’s no route, and that’s the end of the feature.

But the developer at that machine has the claude CLI installed, and has run claude login. That binary is permitted. It’s an approved, managed tool, and it has its own sanctioned path out. The direct API call is blocked; the claude command is not.

ProviderClaudeLocal is what bridges those two facts. If your tool’s AI calls go through that already-blessed binary instead of straight at the API, they work, in an environment where the direct call cannot. That’s the whole reason the provider exists. It isn’t faster (a real API call has lower latency) and it isn’t more capable. It’s for the place where the API call simply isn’t an option, and “isn’t an option” is a surprisingly common place to find yourself inside a large organisation.

What it costs

It’s worth being straight about the trade, because ProviderClaudeLocal is the reduced-capability provider.

It doesn’t do tool calling. It doesn’t do parallel tools. It doesn’t stream. Those need a live, structured connection to the model’s API, and a subprocess that runs once and prints an answer is not that. What it does support is plain chat and structured output, the latter through the binary’s own --json-schema flag.

So the positioning, and the package’s documentation says exactly this, is: prefer the API providers when you can reach them, because they’re lower latency and feature-complete. Reach for ProviderClaudeLocal when API access is restricted. You accept the narrower capability set as the price of working at all. For a tool whose AI feature is “answer a question” or “return a structured analysis”, that price is often nothing you’d even notice. For one built on an agentic tool-calling loop, it’s a real limitation, and you’d know to expect it.

How it stays behind the same interface

Here’s the part that makes it pleasant rather than a special case to maintain. Despite being a subprocess and not an API, ProviderClaudeLocal is still a ChatClient. Your feature code calls Chat and Ask exactly the way it would for any other provider.

Everything that makes a subprocess provider awkward stays inside the provider. Spawning the binary, feeding it the prompt, parsing its output, capturing stderr and surfacing it when the binary exits non-zero, and threading multi-turn continuity through session identifiers passed back on the next call with --resume: all of that is the provider’s problem, and all of it sits behind the interface. The code in your tool that uses AI doesn’t know, and has no way to find out, that this particular provider is a child process rather than an HTTPS call.

That’s a unified interface genuinely earning its place. It’s easy to put a uniform face on four things that already work the same way underneath. The real test of the abstraction is whether something that works in a completely different way, a subprocess instead of a socket, can still slot in without the caller changing a line. Here it can. You swap one config value, and a tool that talked to an API now talks through a binary, and nothing downstream so much as blinks.

The bottom line

go-tool-base’s chat package puts five providers behind one ChatClient interface, and ProviderClaudeLocal is the one that isn’t an API. It runs the locally installed, pre-authenticated claude CLI as a subprocess.

It exists for the locked-down environment where outbound HTTPS to the AI API is blocked but the claude binary is allowed: there, AI features keep working where a direct call would fail. The trade is a narrower capability set (no tool calling, no streaming, plain chat and structured output only) so you prefer the API providers when you can reach them and fall back to this when you can’t. And because it’s still a ChatClient, all the subprocess machinery stays hidden, and your code uses it without knowing it’s there. That last part is the real test of an abstraction: a provider that works in an entirely different way still slots in unchanged.

AI conversations you can resume

Sat, 04 Apr 2026 00:00:00 +0000

An AI conversation is, fundamentally, its own history. The model’s next answer depends on everything said so far. And a CLI tool, by its very nature, forgets everything the moment it exits. Put those two facts together and you get the problem: run an AI command, exit, run it again, and you’re talking to someone who’s never met you.

A CLI forgets everything

A long-running service keeps its state in memory for as long as it runs. A CLI tool doesn’t get that luxury. It starts, does one thing, exits. The next invocation is a brand-new process with no memory of the last one.

For most commands that’s exactly right, and you wouldn’t want it any other way. But an AI conversation is a different kind of beast, because a conversation is its history. The model’s next answer depends on everything said so far. Run an AI command, exit, run it again, and you’ve started a fresh conversation with someone who’s never met you. For an interactive assistant, or any AI workflow that unfolds across several invocations, that’s plainly the wrong behaviour. The user expects to pick up where they left off.

Save and restore

The chat package handles this through a PersistentChatClient interface. Like streaming, it’s an optional capability discovered with a type assertion, sitting beside the four-method core rather than bloating it. A client that supports persistence also satisfies this interface:

if pc, ok := client.(chat.PersistentChatClient); ok {
 snapshot, err := pc.Save()
 // store the snapshot somewhere
}

A snapshot is a serialisable value that captures the conversation. You store it. Next run, you load it, Restore it onto a fresh client, re-register your tools, and call Chat again. “Where were we?” works, because the model is handed back the whole history.

A snapshot is opinionated about what it carries

The interesting part is what a snapshot does and doesn’t contain, because that’s a series of deliberate decisions.

It carries the messages, the system prompt, the model name, and tool metadata: the names, descriptions and parameter schemas of the tools that were registered.

It does not carry tool handlers. Handlers are code, not data; you can’t serialise a function meaningfully, so after a restore you re-register them with SetTools. The snapshot remembers that a tool called read_file existed and what its shape was; it doesn’t try to remember the Go function behind it.

And it does not carry API tokens. This is the one to dwell on. A snapshot is a file. A file gets synced, backed up, copied between machines, attached to a support ticket by a user trying to be helpful. A snapshot that carried the API key would be a credential leak the moment it left the laptop it was made on. So the snapshot never contains a token, at all. On restore, the client picks the credential up again the ordinary way, from the environment or the keychain. The conversation and the secret are kept in separate places on purpose, and only one of them is ever in the file.

Encrypted at rest, if you want it

The package ships a FileStore that writes snapshots as JSON files, with 0600 permissions in a 0700 directory, and it can encrypt them. Pass WithEncryption a 32-byte key and snapshots are written with AES-256-GCM.

That option exists because a conversation can hold sensitive content even when it holds no credential. The log a user pasted in for analysis, the source file they asked the model to review, the internal details tucked into their questions: none of that is an API key, and all of it might be something you’d rather not have sitting in plain JSON in a backup somewhere. Encryption at rest covers it.

The FileStore is also careful about the snapshot identifiers it’s handed. An ID has to be a canonical UUID, and the resolved file path is checked to lie inside the store directory, so a snapshot ID arriving from an untrusted source (a CLI flag, a request payload) can’t be bent into a path-traversal that reads or writes somewhere it shouldn’t. Persisting conversations adds a small filesystem surface, and the store treats it as exactly that.

The short version

A CLI tool forgets everything between invocations, which is correct for most commands and wrong for an AI conversation, because a conversation is its history.

go-tool-base’s chat package lets you persist one. PersistentChatClient saves a snapshot you can store and restore later, picking the conversation back up where it ended. The snapshot is deliberate about its contents: messages, system prompt and tool metadata yes; tool handlers no, because they’re code you re-register; API tokens never, because a snapshot is a file and a file travels. The built-in FileStore can encrypt snapshots at rest with AES-256-GCM and validates snapshot IDs against path traversal. Resumable conversations, without the conversation file turning into a place secrets leak from.

An AI agent that has to make the build pass

Thu, 02 Apr 2026 00:00:00 +0000

Most AI code generation works on a charming little principle I’ll call generate-and-hope. The model writes the code, the model stops at the closing brace, and whether the thing actually compiles is left as an exercise for you. For a snippet you paste into an editor, fine. For a whole generated command, that’s just outsourcing the disappointment.

go-tool-base does something I’m rather happier with: the AI has to make the build pass before it’s allowed to claim it’s done.

Generate and hope

The usual shape of AI code generation is this. You ask for code, the model produces it, and the model’s job ends at the closing brace. Whether it compiles, whether the tests pass, whether the imports even resolve, none of that has been checked. The model produced something that looks right. You find out whether it is right when you build it.

For a snippet you paste into an editor, that’s perfectly fine. The compiler tells you in a second. But go-tool-base’s generator, driven by gtb generate command --script or --prompt, produces a whole command: the implementation, its tests, the lot. “Generate and hope” at that scale means handing the user a project that may or may not build, and quietly making them the one who finds out which.

Drafting is only step one

So the generator doesn’t stop at drafting. Writing the first version of the implementation and its tests is step one of two. Step two is an autonomous repair agent.

Once the draft is on the filesystem, a separate agent takes over. It’s an LLM running in a loop, but a loop aimed at one narrow, checkable job: make this project build and pass its tests. It isn’t asked to be creative. It’s asked to get to green.

A fixed set of tools, and no shell

The agent is not handed a shell. It’s given a fixed, defined set of tools and nothing else. Three of them let it explore and edit the project: list_dir, read_file, write_file. Four of them let it verify the project:

go_build runs the build and captures the compiler errors.
go_test runs the tests and captures the failures.
go_get resolves a missing dependency.
golangci_lint runs the project’s linter.

That restriction is the design, not a limitation of it. The agent can’t delete arbitrary files, can’t reach the network, can’t run anything that isn’t on the list. It has exactly what it needs to make code compile and nothing it would need to do damage. Its file writes are confined to the project directory by an explicit path check, so even write_file can’t go wandering up into /etc. A coding agent you’d actually let near a filesystem is one whose abilities are an allowlist, not a denylist. (I keep coming back to that principle through this series… safety as a boundary you draw, not a behaviour you hope for.)

The loop

The repair loop is a ReAct loop, the same reason-act-observe shape as the tool-calling loop, only this time pointed at a goal:

The draft is on disk.
Verify: run go_build and go_test.
If verification failed, read the error logs, the compiler error or the failing test.
Reason about the cause: an undefined variable, a missing import, a wrong signature.
Act: call write_file to patch the code, or go_get to add the dependency.
Loop. Steps two to five repeat until the project is green, or the agent hits its bounded step limit.

What makes this work is treating the error output as feedback rather than as a failure to log and walk away from. A compiler error is the single most useful sentence you can hand a model that’s trying to fix code. It says what’s wrong, and usually where. The loop feeds it straight back in, and the model fixes against it.

Verification changes what “done” means

Here’s the real shift, and the agent’s own documentation puts it well: the agent “doesn’t just say it fixed a bug; it uses a Test tool to verify the fix before reporting success.”

A generate-and-hope model reports success when it finishes writing. It has no idea whether the code works, and it isn’t really claiming otherwise. “Done” means “I produced text”. The repair agent reports success when go_build and go_test actually pass. “Done” means “the build is green”. Those are two completely different claims, and only the second is worth anything to the person who asked for the command.

That’s the line between an AI that’s a creative writer and an AI that’s a collaborator you can hand a task to. And when the agent can’t reach green, when it spends its whole step budget and the project is still broken, the generator fails safely: it leaves the best-attempt code in place, commented out so the project still compiles, and tells the user what to finish by hand. There’s also an --agentless flag for anyone who’d rather have a plain single-shot retry than the multi-step agent. The default, though, is the agent, because the default should be code that’s been checked.

Where this leaves us

Most AI code generation generates and hopes: the model writes code and the user discovers whether it works. For a whole generated command, that pushes a may-or-may-not-build project onto the user.

go-tool-base’s generator drafts the command and then hands it to an autonomous repair agent. The agent has a fixed set of tools (explore and edit the project, build it, test it, lint it, fetch dependencies) and no shell at all, with file writes confined to the project directory. It runs a ReAct loop, reading each error and patching against it, until the build is green or it exhausts its steps. The point is what “done” comes to mean: not “the model finished writing”, but “the build passes”. Only one of those is a claim worth trusting.

Stop regex-ing the LLM's prose

Tue, 31 Mar 2026 00:00:00 +0000

Ask an LLM a question and it hands you back prose. Lovely to read, miserable to program against. You wanted the one number buried in the middle of it, and now you’re writing a regular expression to fish a word out of three well-written paragraphs that phrase themselves slightly differently every single time you run them.

There’s a much better way, and it’s the difference between forever interpreting an LLM and actually building on one.

The problem with a paragraph

You ask an LLM to analyse a log file and tell you the severity of what it found and a suggested fix. It comes back with three well-written paragraphs. Somewhere in there is the word “critical”, and somewhere is the fix.

Your program now has to extract those two facts from prose, and prose has no contract. The next run, the model phrases it differently. It leads with a caveat. It says “severe” where last time it said “critical”. It puts the fix first. Anything that worked by finding “critical” in the text is now quietly wrong, and you didn’t change a line. Parsing free text for structured facts is a game you lose slowly.

What you actually wanted was never a paragraph. It was a value: a thing with a severity field and a fix field, that you can branch on and store and pass around like any other.

Ask for the struct, not the prose

go-tool-base’s chat package draws the line with two methods. Chat gives you text. Ask gives you a struct.

You define the Go type you want back:

type Analysis struct {
 Severity string `json:"severity"`
 Fix string `json:"fix"`
}

var result Analysis
err := client.Ask(ctx, "Analyse this log file: "+logText, &result)

The framework generates a JSON Schema from that struct, sends it to the model as the required response format, and unmarshals the reply straight into result. You never lay a finger on the prose. You get result.Severity and result.Fix, typed, ready to use. If you want the model’s answer to drive a switch statement, this is the method that lets it.

The struct is the schema is the contract

The detail that makes this hold up over time: you don’t write the schema. The struct is the schema.

The framework derives the JSON Schema from your type. In go-tool-base that’s GenerateSchema[T](); in rust-tool-base the schema comes from your Rust type through schemars. (Yes, there’s a Rust sibling now. I’ll introduce it properly in a few weeks, but it keeps gatecrashing these posts because the two frameworks deliberately share ideas.) Either way there’s one definition, your type, and the schema is just a projection of it.

That matters, because otherwise two things have to agree. There’s the schema you tell the model to obey, and there’s the type you unmarshal the answer into. Hand-write the schema and those two can drift: add a field to the struct, forget to add it to the schema, and the model is never told to produce it, so it silently never appears. Deriving the schema from the type collapses the two into one. They can’t disagree, because there’s only one of them.

Both frameworks, with one extra step in Rust

go-tool-base does this with Ask and a ResponseSchema set on the client config. rust-tool-base does it with chat_structured::<T>, where T is any type that’s both deserialisable and JsonSchema.

rust-tool-base adds one step worth calling out. Before it deserialises the model’s reply into your T, it validates the raw response against the schema with a JSON Schema validator. That splits the failure into two distinct, named cases: the response didn’t match the schema, or it matched the schema but still wouldn’t deserialise. A model that returns subtly wrong JSON fails loudly and specifically, with an error that tells you which of those happened, instead of quietly handing you a zero-valued struct that you end up debugging an hour later.

When you’d reach for it

The line is simple, and it’s about who reads the answer.

If a human reads the answer, prose is right. Chat, free text, let the model write well. A summary, an explanation, an interactive reply: leave all of those as prose.

If a program consumes the answer, you want a value. Classification, extraction, a code review scored out of a hundred with a list of issues, a yes-or-no with reasons: anything where the next thing that happens is your code branching on the result. There, Ask and chat_structured turn the LLM from something you have to interpret into something that returns a value, and a typed value is a thing you can actually build on.

To sum up

An LLM returns prose by default, and prose has no contract, so a program that picks structured facts out of it breaks the moment the model rephrases.

Structured output asks for the value instead. You define a struct, the framework derives a JSON Schema from it, the model is constrained to that shape, and you get a typed result. go-tool-base’s Ask and rust-tool-base’s chat_structured both work this way, with the schema derived from your type so the schema and the type can’t drift; rust-tool-base additionally validates the response against the schema before deserialising. Use it whenever the answer feeds code rather than a human. It’s one of the four methods that make up go-tool-base’s small chat interface, and it’s the one that makes an LLM safe to program against.

Letting the AI call your Go functions

Sun, 29 Mar 2026 00:00:00 +0000

An AI that can only produce text can describe your system. An AI that can call your Go functions can actually operate it. That gap, between describing and doing, is the difference between a chatbot and something genuinely useful, and crossing it comes down to one fiddly mechanism: tool-calling, and the loop that drives it.

Talking about the system versus operating it

Wire an AI provider into a CLI command and you get something that can talk. Ask it a question, get a paragraph back. Useful, up to a point.

But notice the ceiling. An AI that can only generate text can describe things. It can tell you what it would do. What it can’t do is look at the actual current state of your system, or take a real action, because it has no hands. It’s reasoning in a vacuum about a world it can’t reach out and touch.

The thing that gives it hands is tool-calling. You hand the AI a set of functions it’s allowed to call. Now, mid-conversation, it can decide it needs to read that file before it can answer, or run that query, or check that status, and actually go and do it, and then reason about the real result. The AI stops describing your system and starts operating it.

The loop is the hard part

Tool-calling has a shape, and the shape is a loop. The literature calls it ReAct: Reason, Act, Observe.

The AI reasons about the prompt and decides whether it needs a tool.
If it does, it acts, asking for a specific tool with specific arguments.
Your code runs the tool and feeds the result back. The AI observes that result.
Round again. Reason about the new information, maybe call another tool, maybe several. Keep going until the AI has what it needs and produces a final text answer with no more tool calls.

Conceptually simple. Tedious and error-prone to implement by hand every single time: parsing the model’s tool-call requests, dispatching to the right function, marshalling arguments in and results out, feeding observations back in the exact format the provider expects, knowing when to stop, and not looping forever if the model gets itself stuck.

That orchestration is pure plumbing, and it’s identical for every tool and every command. So you can probably guess what’s coming: go-tool-base’s chat package owns it. You don’t write the loop. You write the tools.

Defining a tool

A chat.Tool is four things: a name, a description, a parameter schema, and a handler. The description is what the AI reads to decide whether to use the tool, so it’s worth writing well. The schema describes the arguments, and you don’t hand-write it. You write a tagged Go struct and let it generate:

type ReadFileParams struct {
 Path string `json:"path" jsonschema_description:"Relative path to the file"`
}

The struct is the contract. The framework derives the JSON Schema the AI is given straight from those tags, so the schema and the Go type the handler receives can’t drift apart, because they share a single source. The handler is then just an ordinary Go function that takes those parameters and returns a result.

You register your tools with SetTools, call Chat, and that’s the whole of your involvement. The framework runs the ReAct loop and Chat returns the AI’s final text answer once the loop settles.

Two details that show it was built for real use

A couple of decisions in the loop tell you it’s meant for production, not a demo.

Tool errors don’t abort the conversation. When a handler returns an error, the framework doesn’t crash the loop. It hands the error back to the AI as a string, as just another observation. That’s deliberate, and it’s right. A real agent should be able to call a tool, watch it fail, and react: try different arguments, take a different route, or tell the user it couldn’t manage it. A loop that aborted on the first tool error would be far more brittle than the model driving it.

The loop is bounded. There’s a MaxSteps limit, default 20. An AI that gets confused could otherwise call tools forever, and a CLI command that never returns is a worse failure than a wrong answer. The cap guarantees the command terminates. The agent gets room to genuinely work a problem across many steps, but not infinite room to flail about in.

There’s also parallel tool execution: when the model asks for several tools in a single step (three independent file reads, say) the framework runs them concurrently rather than one after another, because there’s no reason to make the AI sit and wait out a sequence of things that don’t depend on each other.

Boiling it down

A text-only AI can describe your system; an AI that can call your functions can operate it. Bridging that gap means tool-calling, and tool-calling means the ReAct loop (reason, act, observe, repeat) whose orchestration is fiddly, identical every time, and not a problem worth solving twice.

go-tool-base’s chat package runs the loop for you. You define chat.Tool values (name, description, a tagged parameter struct that generates its own schema, a handler), call SetTools and Chat, and get the final answer. Tool errors go back to the AI as observations so it can recover, and a MaxSteps cap guarantees the command always terminates. You write Go functions. The framework turns them into things an agent can reach for.

An AI interface that fits on one screen

Fri, 27 Mar 2026 00:00:00 +0000

The moment you decide a CLI tool should talk to an LLM, there’s a strong gravitational pull towards reaching for LangChain, or one of its many relatives. It’s the obvious move. It’s also, for most CLI work, a bit like hiring a removals firm to carry a single box up the stairs.

Let me explain why go-tool-base went the other way, and what “the other way” actually looks like.

The instinct, and why it overshoots

When you add AI to a tool, the instinct is to reach for the big general-purpose framework. LangChain and its relatives are capable, and they exist for a real need: orchestrating complex multi-step AI applications, with retrieval pipelines, memory stores, chains of calls, whole fleets of agents.

Now look at what a CLI tool actually needs from an LLM. It needs to send a prompt and get text back. Sometimes it wants structured data back instead of prose. Sometimes it wants to let the model call a few of the tool’s own functions. That’s pretty much the whole list.

Pulling in a framework built to orchestrate retrieval and agent swarms in order to do that is a poor trade. You take on a large new vocabulary of concepts, a wide dependency surface, and a great deal of abstraction you’ll never touch, all to perform three or four operations. The framework isn’t wrong. It’s just answering a far bigger question than the one a CLI tool is asking.

What go-tool-base chose instead

go-tool-base didn’t reach for a framework. The decision is on the record in its own design notes: before a single line was written, LangChain Go, go-openai, Vercel’s AI SDK and around ten other options were evaluated, and not one of them matched what a CLI framework actually needs. So the chat package was built deliberately small.

How small? The entire core ChatClient interface is four methods:

type ChatClient interface {
 Add(ctx context.Context, prompt string) error
 Chat(ctx context.Context, prompt string) (string, error)
 Ask(ctx context.Context, question string, target any) error
 SetTools(tools []Tool) error
}

Add appends a message to the conversation. Chat sends a prompt and returns text. Ask sends a prompt and returns a typed Go struct, the model’s answer unmarshalled straight into a value you defined. SetTools hands the model a set of your own functions it’s allowed to call. That’s the whole surface. Downstream code that uses AI never holds anything larger than this, and never has to know which provider is behind it.

The package’s own documentation has a word for this: right-sized. Large enough to solve genuine provider-abstraction complexity, small enough that the full interface fits on a single screen.

“Thin” is not the same as “does little”

This is the part worth being precise about, because “four methods” can sound like “barely does anything”, and that’s the wrong read entirely.

Behind those four methods sits genuinely awkward work. Five providers (OpenAI, Claude, Gemini, a locally installed claude binary, and any OpenAI-compatible endpoint) each with a different wire API, all normalised behind the one interface. A tool-calling loop. Structured output via JSON Schema, made to behave consistently across providers that each express it differently. Error normalisation. Token chunking.

The point of a thin abstraction is not that there’s little underneath it. It’s that the interface stays small while the implementation quietly absorbs the complexity. Four methods on the surface; five provider integrations and a tool-calling loop below the waterline. The thinness is a property of what the caller sees, not of what the package does. A reach-for-LangChain decision gets that backwards: it exposes the caller to all the machinery, whether or not the caller will ever need it.

The core stays small even as features grow

There’s a neat detail in how chat keeps the interface from creeping. The package also supports streaming responses and conversation persistence, both of which are real features with real surface area. Neither of them is in the four-method core.

Instead they’re separate, optional interfaces. A streaming-capable client also satisfies StreamingChatClient; a persistable one also satisfies PersistentChatClient. Code that wants those capabilities does a type assertion to ask for them, and code that doesn’t simply never sees them. So the common path stays four methods forever. New capabilities arrive as opt-in interfaces alongside the core, not as new methods bolted onto it. The thing that fits on one screen keeps fitting on one screen.

Extensible without forking, testable without a network

Two more properties keep the package small without making it limiting.

It’s extensible. The provider list isn’t closed. A RegisterProvider call lets any package contribute a new provider, and chat.New will route to it. You add a backend without forking pkg/chat or sending a patch upstream.

And it’s testable. The package ships generated mocks. A downstream tool’s AI features can be tested against a mock ChatClient returning canned responses, with no network, no API key, and no flakiness. Because the interface is four methods, that mock is trivial to set up and complete by construction. A sprawling framework interface is a sprawling thing to fake; a four-method one is not. (I’ll come back to testing AI code properly in a later post, because it deserves a whole article of its own.)

The right size

When a CLI tool needs AI, the instinct is a large framework like LangChain. For orchestrating retrieval pipelines and agent swarms, that’s exactly the right tool. For sending a prompt, getting a struct back, and letting the model call a few functions, it’s enormous overkill.

go-tool-base’s chat package is the deliberate alternative, chosen only after LangChain Go and a dozen others were weighed up and rejected. Its core ChatClient interface is four methods. Underneath sit five normalised providers, a tool-calling loop, structured output and error handling, but the caller sees four methods and never learns which provider is active. Streaming and persistence are opt-in interfaces beside the core, not additions to it. It extends without forking and tests without a network. Right-sized: the complexity is real, but it lives under the interface rather than in it.

Your CLI is already an AI tool

Thu, 19 Mar 2026 00:00:00 +0000

“Make it work with AI” has become one of those requests that lands on a developer’s desk with a thud and not much further detail attached. My instinct, the first time, was to brace for a big lump of integration work… a bespoke adapter for this assistant, another for that one, a treadmill of little wrappers stretching off into the distance.

Turns out I’d already done most of the work. So have you, if your CLI tool is any good. Let me explain what I mean.

You already described your capabilities

Stop and think for a second about what a well-built CLI tool actually is. It’s a set of named operations, each with a human-readable description, each taking a set of typed, named, documented parameters. You wrote all of that already, because a CLI without it is unusable by people.

Now look at what an AI assistant needs in order to call a tool. A set of named operations. A description of each, so it knows when to reach for them. A typed parameter schema for each, so it knows how to call them.

It’s the same list! A good CLI is already, structurally, a description of a set of capabilities. The information an AI agent needs isn’t extra work you have to go and do. It’s work you finished the moment your --help output was any good.

The only thing missing is a translator. Something that takes “this is a CLI” and presents it as “this is a set of tools an AI can call”.

MCP is that translator, and it’s a standard

The temptation, when you want your tool to be AI-usable, is to sit down and write an integration. A little adapter for Claude Desktop. Another for Cursor. Another for whatever turns up next month. Each one a bespoke wrapper, each one a thing to maintain, and the list never stops growing because new assistants keep appearing. That’s the treadmill I was bracing for.

The Model Context Protocol exists to kill that list. MCP is an open standard for how an AI model discovers and calls local tools. Implement it once and your tool works with every assistant that speaks it. Write once, not once-per-client.

So go-tool-base implements it once, in the framework, for everyone. (That’s rather the theme of this whole series, if you hadn’t spotted it yet… do the annoying thing once, properly, in a place where every tool inherits it.)

The `mcp` command, and the mapping it does for free

Every tool built on go-tool-base inherits a built-in mcp command. Run it:

mytool mcp

and the tool starts a JSON-RPC server over standard I/O, speaking MCP. That’s the whole user-facing surface. One command.

Behind it, the framework walks your Cobra command tree and maps it straight onto MCP tool definitions:

Each command becomes a tool.
Each command’s short description becomes the tool’s description, the text the AI reads to decide whether this is the tool it wants.
Each command’s flags and arguments become the tool’s JSON Schema parameters.

There’s no second schema to write and then keep in sync (and we all know how well “keep these two things aligned by hand” tends to go). The command tree is the schema. Add a new command to your CLI and it’s a new tool for the agent, automatically, with the description and flags you already gave it. Nobody has to remember to update an MCP manifest, because there’s no separate MCP manifest to forget about.

Configuring an assistant to use it

On the assistant’s side it’s just as undramatic. You tell your AI client (Claude Desktop, Cursor, anything MCP-aware) to launch mytool mcp. From then on the assistant:

Starts your tool in MCP mode when it boots.
Discovers every command as a callable tool.
Calls the right one, with the right parameters, when a user’s request needs it.

Your CLI tool has quietly become something the AI can pick up and use, mid-conversation, on its own initiative.

The safety property worth noticing

Now, “let an AI run things on my machine” is rightly a sentence that makes people nervous. It makes me nervous, and I built the thing. So it’s worth noticing the constraint sitting quietly in this design.

The AI can only call what you defined. The tools it sees are exactly the commands in your tree, and the parameters it can pass are exactly the flags and arguments you declared, validated against the JSON Schema generated from them.

It can’t invent a command. It can’t pass a parameter you never defined. The boundary of what the agent can do is the boundary of what your CLI does, and you drew that boundary already, back when you built the tool. Exposing the CLI over MCP doesn’t widen the surface one inch. It just makes the existing surface reachable. The AI isn’t running things. It’s running your commands, the ones you wrote, tested and shipped, and nothing else.

The gist

A CLI tool, built properly, is already a structured description of a set of capabilities: named operations, descriptions, typed parameters. Which is also exactly what an AI agent needs in order to call a tool. The gap between the two is only a translator, and writing a bespoke one per assistant is a treadmill you don’t need to step onto.

go-tool-base puts the translator in the framework. Every tool gets an mcp command that serves the command tree over the Model Context Protocol… commands become tools, descriptions become descriptions, flags become JSON Schema parameters, with no second schema to maintain. Point any MCP-aware assistant at it and your CLI is an agent-callable tool, bounded to exactly the commands you shipped.

You did the hard part when you built a good CLI. MCP just opens the door you’d already framed.