Turn one, the player swings, the die comes up 20, and my AI dungeon master narrates the goblin falling silent, leaving the player alone in the corridor. Good. Turn two, another roll, a 6 this time, and the same dungeon master cheerily has the goblin “dance back” out of the dark to take another swing. The goblin I’d just watched die was up and fighting again, and the model didn’t so much as blink.
I didn’t feel cheated, or even surprised. I felt the small, familiar thud of “oh, yeah, I forgot that bit.” Because the model hadn’t gone rogue. It had done exactly what a language model does. The gap was mine.
This is the war story behind part four of the go-tool-base tutorial, the AI dungeon master. The tutorial shows the clean, final design and quietly moves on. It doesn’t show the three different ways I got it wrong first, which is a shame, because the wrong turns are where the actual lesson is.
Why a dungeon master at all
A word on why I was even here. I was trying to prove the chat component of the framework to myself. There’s a voice that pipes up whenever I build anything in this space, “LangChain exists, who do you think you are?”, and the answer I keep landing on is that LangChain is enormous and I wanted something small enough to hold in your head. The tutorial was the test: could a newcomer wire AI into a CLI with it and come out the other side with something that actually behaves?
That last word is the whole problem. A tutorial has to leave you holding something dependable, and dependability is the one thing AI fights you on. I also wanted it to be fun, a thing someone might keep poking at after the tutorial ends, maybe even the hook that gets a person other than me to use the framework. I batted hook ideas around and liked none of them, until the obvious one landed: I run a tabletop game on the odd weekend, so make the AI the dungeon master. Gamify the thing. Then watch it raise the dead.
Strike one: nothing to enforce
The first version was the naive one. I gave the model a roll tool, because the one thing you absolutely cannot let a language model do is pick its own numbers, and otherwise let it narrate freely. The conversation history carried from turn to turn, so it remembered the fight. I assumed remembering was enough.
It isn’t. Remembering and being held to it are different things. The history told the model a goblin had died; nothing stopped it writing the goblin back in when the next turn’s narration wanted a bit of jeopardy. Memory is not a constraint. The model will happily contradict its own past if you’ve given it room to, and I had given it nothing but room.
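The roll tool from that first version can be sketched in a few lines. This is a minimal illustration, not the tutorial's actual code: `Roller`, `NewRoller` and `Roll` are hypothetical names, and how the function gets registered as a tool with the framework is left out because the post doesn't show it. The point is only that the number comes from Go, never from the model.

```go
package main

import "math/rand"

// Roller owns the random source, so the model never picks a number itself.
// The type and its methods are illustrative names, not the framework's API.
type Roller struct {
	rng *rand.Rand
}

// NewRoller builds a roller from a seed. A fixed seed makes a session
// replayable, which is handy when testing a turn loop.
func NewRoller(seed int64) *Roller {
	return &Roller{rng: rand.New(rand.NewSource(seed))}
}

// Roll returns a value in the range 1..sides inclusive.
// Anything below a two-sided die falls back to a d20 (an assumed default).
func (r *Roller) Roll(sides int) int {
	if sides < 2 {
		sides = 20
	}
	return r.rng.Intn(sides) + 1
}
```

Note what this does and doesn't buy you. It makes the dice honest, but nothing in it records or enforces what a roll meant, which is exactly the gap the resurrected goblin walked through.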
Strike two: a tool to read the state
The obvious fix, and I do mean obvious, the kind you reach for without thinking, was to give the model a state tool so it could check who was alive before it narrated. Hand it the facts on request and surely it’ll stop making them up.
What it actually did was dither. Handed a tool it could call to look things up, it called it. And called it. And called it again, turning a turn over in its hands without ever committing to an action, burning through its step budget on lookups and leaving the player staring at nothing. I’d cured the lying by inventing paralysis. A tool the model can call is a tool it will call, often instead of doing the thing you actually wanted.
Strike three: refereeing its own dice
When I did get it reading state cleanly, the third failure crept in, and this one was subtler. Once the model could see the goblin’s hit points, it started deciding the fight. It would read that the goblin had 12 HP and just narrate a killing blow, hit, damage and all, without ever calling the roll or attack tools. Why ask the dice when you can see the board and write whatever outcome the story wants? Give a model enough context and it stops being a narrator and starts being a referee, which is precisely the job I’d built tools to keep out of its hands.
The fix was less, not more
Three failures, and notice the shape of my fixes: each one added something. More memory, then a tool, then more context. Every instinct said the model needed more to work with. Every time, the extra capability was the new way to be wrong.
So I went the other way. The truth lives in a plain Go struct that I own, not the model. There’s no state tool to dither on, because the loop simply prepends the current state to every turn’s input, fresh, so the model never has to ask and never gets to drift. The mechanics, the dice and the damage, live in Go functions the model has to call, and the system prompt says in as many words that it must not decide a hit or damage itself. The model is left with exactly one job: narrate. The prose is its to invent. The maths, the state and the shape of the result are not.
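The shape of that design can be sketched roughly as follows. Treat it as an illustration under assumptions rather than the tutorial's code: `GameState`, `Creature`, `ApplyDamage`, `Render` and `TurnInput` are all hypothetical names, and the exact text format of the prepended state is invented for the example. What it shows is the two moves described above: state that only Go code can change, and a fresh copy of it put at the front of every turn's input.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Creature and GameState are hypothetical types standing in for the
// "plain Go struct" that holds the truth. The model never writes to them.
type Creature struct {
	HP    int
	Alive bool
}

type GameState struct {
	Creatures map[string]*Creature
}

// ApplyDamage is the only path by which hit points change. It refuses to
// touch a dead creature, so a goblin that has fallen cannot be fought again.
func (s *GameState) ApplyDamage(target string, dmg int) error {
	c, ok := s.Creatures[target]
	if !ok {
		return fmt.Errorf("unknown target %q", target)
	}
	if !c.Alive {
		return fmt.Errorf("%s is already dead", target)
	}
	if dmg < 0 {
		dmg = 0
	}
	c.HP -= dmg
	if c.HP <= 0 {
		c.HP = 0
		c.Alive = false
	}
	return nil
}

// Render writes the state as text in a stable (sorted) order so that the
// same state always produces the same prompt fragment.
func (s *GameState) Render() string {
	names := make([]string, 0, len(s.Creatures))
	for name := range s.Creatures {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	b.WriteString("CURRENT STATE (authoritative):\n")
	for _, name := range names {
		c := s.Creatures[name]
		status := "alive"
		if !c.Alive {
			status = "dead"
		}
		fmt.Fprintf(&b, "- %s: %d HP, %s\n", name, c.HP, status)
	}
	return b.String()
}

// TurnInput prepends the freshly rendered state to the player's action.
// The loop would call this every turn, so the model never has to ask.
func TurnInput(s *GameState, playerAction string) string {
	return s.Render() + "\nPLAYER: " + playerAction
}
```

In a loop built this way, a call such as `TurnInput(state, "I search the corridor")` after the killing blow would hand the model a block that lists the goblin as dead before it reads a word of the player's action. The model can still ignore it, which is why the system prompt and the Go-side check in `ApplyDamage` both matter: one tells it the rule, the other means a contradiction has no mechanical effect.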
That’s the line that turned three bugs into a feature. You don’t make a language model reliable by giving it more to work with. You make it reliable by giving it less to be wrong about.
The freedom I chose not to give it
There’s a real tension in that, and I want to name it rather than pretend the boxed-in version is the only true one. At my own table the rules are guidelines, not guardrails. I ignore them, bend them, improvise, reach for the “rule of cool” when the moment’s better for it. A great AI dungeon master would have that same freedom, and a few out there genuinely do. Old Greg’s Tavern is a lovely example of how far the free-form version can go.
But that freedom costs far more than a tutorial can spend, and it buys unpredictability I was specifically trying to teach people to avoid. So I made a deliberate trade: guardrails instead of guidelines. Simple, but not so simple it’s boring. The player still gets a “not on rails” game (they can try anything and the DM copes), but every outcome that matters runs through code I trust. That’s the right shape for a tutorial, and, not by coincidence, the right shape for most AI features you’d actually ship.
What the goblin taught me
The thing I keep coming back to is that the model never misbehaved. It resurrected the goblin because I gave it the freedom to. It dithered because I gave it a button to press. It refereed because I let it see the board. Every failure was a permission I’d handed over without meaning to. The reliability didn’t come from a cleverer prompt or a bigger model; it came from working out, one dead goblin at a time, exactly how little the model needed to be trusted with.
If you want the version where it all works first time, the tutorial has it: the tool-calling and the typed turns, wired up properly. This was the road there. The goblin, you’ll be glad to hear, now stays down.
