Testing code that calls an LLM

TL;DR: An LLM is non-deterministic, so the instinct that you “can’t test” code that calls one is half right. You can’t assert on what the model says. But you can test both halves of your own code around it: the prompt you send, and what you do with the response. go-tool-base and rust-tool-base converge on the same split. Snapshot the request so a refactor can’t change it silently, and mock the response so tests never touch the network.

“You can’t test AI code”

It’s a fair worry. Your command calls an LLM. The LLM returns something slightly different every run. A test that asserts response == "..." is broken before you’ve finished typing it. So the conclusion arrives quickly: the AI path can’t be tested, leave it uncovered.

Which is a shame, because the AI call is usually the riskiest line in the whole command.

The conclusion is also wrong. It mistakes “I can’t test the model” for “I can’t test my code.” The model is not your code. Your code is the two pieces on either side of it.

Your code is a prompt and a handler

Strip the command down to what it actually does:

1. It builds a prompt. It assembles a system prompt, the user’s input, perhaps some context, and sends it.
2. The model does something. This is not your code.
3. It takes the response and does something with it. It parses it, branches on it, prints it, stores it.

Steps one and three are entirely yours, and entirely deterministic. The same inputs build the same prompt and handle the same response the same way, every single time. That is testable. Step two is the only part that isn’t, and step two was never yours to test.

So the job is to pin step two to a known value, and then test one and three properly.
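As a rough illustration of that split, here is a minimal Go sketch. None of it comes from go-tool-base: the names Asker, BuildPrompt, Handle and Run, the prompt layout, and the severity values are all hypothetical, chosen only to show which parts are deterministic.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Analysis is a hypothetical structured result the command expects back.
type Analysis struct {
	Severity string
}

// Asker is a hypothetical one-method stand-in for the model client.
type Asker interface {
	Ask(ctx context.Context, prompt string, target any) error
}

// BuildPrompt is step one: pure string assembly, so the same inputs
// always produce the same prompt.
func BuildPrompt(system, input string, contextLines []string) string {
	var b strings.Builder
	b.WriteString(system)
	b.WriteString("\n\n")
	for _, line := range contextLines {
		b.WriteString("- " + line + "\n")
	}
	b.WriteString("\nInput:\n")
	b.WriteString(input)
	return b.String()
}

// Handle is step three: a deterministic decision based on the response.
func Handle(a Analysis) (string, error) {
	switch a.Severity {
	case "critical":
		return "page the on-call engineer", nil
	case "low", "medium", "high":
		return "file a ticket", nil
	default:
		return "", fmt.Errorf("unexpected severity %q", a.Severity)
	}
}

// Run wires the steps together. The client.Ask call is step two,
// the only part whose result the code does not control.
func Run(ctx context.Context, client Asker, input string) (string, error) {
	prompt := BuildPrompt("You triage incident reports.", input, nil)
	var a Analysis
	if err := client.Ask(ctx, prompt, &a); err != nil {
		return "", fmt.Errorf("ask model: %w", err)
	}
	return Handle(a)
}
```

BuildPrompt and Handle can be tested directly with no client at all. Run takes the client as an interface, so a test can pass in whatever stand-in it likes.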

Test the prompt: snapshot it

Step one produces a prompt, and a prompt is just a string, which means you can pin it.

Both frameworks lean on snapshot testing here. go-tool-base uses a golden-file approach: the prompt your code generates is recorded to a file, and the test re-generates it and compares against that file. rust-tool-base does the same with insta, snapshotting the request body the client would send.

The reason this matters is that the prompt is load-bearing and quietly easy to break. You refactor how context gets assembled. Without noticing, you’ve changed the wording, or the ordering, or dropped a line the model was relying on. Nothing fails to compile. The behaviour just drifts.

A snapshot test catches exactly that. It fails, it shows you the diff between the old prompt and the new one, and it forces a decision. Was this change intended? If yes, you accept the new snapshot and move on. If no, you have just caught a bug before it shipped. Either way the prompt never changes by accident, which for AI code is most of the battle.

Test the handler: mock the response

Step three needs a response to handle, and in a unit test you do not get that response from the real model. You supply it.

go-tool-base ships generated mocks for the ChatClient interface. A test builds a mock client, tells it “when Ask is called, return this canned value,” and runs the command against it:

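mockClient := mock_chat.NewMockChatClient(t)
mockClient.EXPECT().
    // Accept any context and any prompt; the third argument must be a *Analysis.
    Ask(mock.Anything, mock.Anything, mock.AnythingOfType("*main.Analysis")).
    RunAndReturn(func(_ context.Context, _ string, target any) error {
        // Write the canned result into the caller's target instead of calling a model.
        *(target.(*Analysis)) = Analysis{Severity: "critical"}
        return nil
    })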

Because the interface is only four methods, that mock is trivial to set up and complete by construction. rust-tool-base takes the same idea one layer down: HTTP-bound tests use wiremock, which stands up a fake server returning a canned response body. The client makes a real HTTP request; it just goes to a fake endpoint the test controls.

Either way, step two is now fixed to a value you chose, which makes step three deterministic. And that unlocks the tests that actually matter: given a malformed response, does the command fail gracefully? Given a rate-limit error, an empty answer, a field missing? Those are the cases a live model almost never hands you on demand, and a mock hands you every time, on the first run.

This is, incidentally, the same discipline as the test-mocking work elsewhere in the framework: the dependency is injected, so the test gets to decide what it does.

What you deliberately don’t test

One honest boundary. None of this tests whether the model gives good answers. That question is real, but answering it is a different activity: evaluations, run as their own suite and kept out of the unit tests.

The unit suite’s job is your code: that it builds a sound prompt, and that it handles every shape of response correctly, including the ugly ones. Keep that separate from “is the model clever today.” A unit test that depends on the model being clever is a unit test that fails when the weather changes, and a flaky test teaches people to ignore the suite.

What it comes down to

Code that calls an LLM is testable; the model is not, and those are different statements. Your code is a prompt builder and a response handler, both deterministic, with the model in between.

go-tool-base and rust-tool-base converge on the same approach. Snapshot the prompt, with golden files or insta, so a refactor can’t change what you send without a test noticing. Mock the response, with generated ChatClient mocks or a wiremock server, so tests run with no network and you can feed in the malformed and error cases a real model won’t reliably produce. Leave “are the answers good” to a separate evaluation suite. Test the two halves you own, and the non-determinism in the middle stops being a reason to leave the riskiest line uncovered.