An AI agent that has to make the build pass

TL;DR: AI that generates code usually works by generating and hoping. The model emits the code, and you discover whether it compiles when you compile it. go-tool-base’s command generator does something better. After the AI drafts a command, an autonomous repair agent takes over with a fixed set of sandboxed tools (build, test, lint, fetch dependencies) and loops: run the build, read the error, fix the file, run again, until the project is actually green or the agent runs out of steps. The agent has no shell. It verifies its own work instead of leaving that to you.

Generate and hope

The usual shape of AI code generation is this. You ask for code, the model produces it, and the model’s job ends at the closing brace. Whether the code compiles, whether the tests pass, whether the imports resolve, none of that has been checked. The model produced something that looks right. You find out whether it is right when you build it.

For a snippet you paste into an editor, that’s fine. The compiler tells you in a second. But go-tool-base’s generator, driven by gtb generate command --script or --prompt, produces a whole command: the implementation, its tests, the lot. “Generate and hope” at that scale means handing the user a project that may or may not build, and making them the person who finds out which.

Drafting is only step one

So the generator doesn’t stop at drafting. Writing the first version of the implementation and its tests is step one of two. Step two is an autonomous repair agent.

Once the draft is on the filesystem, a separate agent takes over. It is an LLM running in a loop, but a loop aimed at one narrow, checkable job: make this project build and pass its tests. It is not asked to be creative. It is asked to get to green.

A fixed set of tools, and no shell

The agent is not given a shell. It is given a fixed, defined set of tools and nothing else. Three of them let it explore and edit the project: list_dir, read_file and write_file. Four of them let it verify the project:

go_build runs the build and captures the compiler errors.
go_test runs the tests and captures the failures.
go_get resolves a missing dependency.
golangci_lint runs the project’s linter.
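One way to picture a closed tool set is a registry keyed by name, where anything not registered simply cannot be called. The sketch below is an illustration under assumptions, not go-tool-base’s actual code: the Tool signature, newRegistry and dispatch are hypothetical names, and the handlers are stubs. Only the seven tool names come from the list above.

```go
package main

import (
	"fmt"
	"sort"
)

// Tool is a hypothetical handler signature: named arguments in, text out.
type Tool func(args map[string]string) (string, error)

// newRegistry builds the complete set of things the agent may do.
// The handlers are stubs; only the names are taken from the article.
func newRegistry() map[string]Tool {
	stub := func(name string) Tool {
		return func(args map[string]string) (string, error) {
			return "ran " + name, nil
		}
	}
	reg := map[string]Tool{}
	for _, name := range []string{
		"list_dir", "read_file", "write_file", // explore and edit
		"go_build", "go_test", "go_get", "golangci_lint", // verify
	} {
		reg[name] = stub(name)
	}
	return reg
}

// dispatch refuses any tool name that was never registered.
func dispatch(reg map[string]Tool, name string, args map[string]string) (string, error) {
	tool, ok := reg[name]
	if !ok {
		return "", fmt.Errorf("tool %q is not on the allowlist", name)
	}
	return tool(args)
}

func main() {
	reg := newRegistry()
	names := make([]string, 0, len(reg))
	for n := range reg {
		names = append(names, n)
	}
	sort.Strings(names)
	fmt.Println("allowed tools:", names)

	// A model asking for a shell gets an error, not a shell.
	if _, err := dispatch(reg, "shell", map[string]string{"cmd": "rm -rf /"}); err != nil {
		fmt.Println(err)
	}
}
```

The useful property is that there is no code path for a tool that is not in the map, so nothing has to be enumerated and blocked.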

That restriction is the design, not a limitation of it. The agent cannot delete arbitrary files, cannot reach the network except to fetch a dependency through go_get, and cannot run anything that isn’t on the list. It has exactly what it needs to make code compile and nothing it would need to do damage. Its file writes are confined to the project directory by an explicit path check, so even write_file cannot wander up into /etc. A coding agent you would actually let near a filesystem is one whose abilities are an allowlist, not a denylist.
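A path check of the kind described can be sketched in a few lines. This is a minimal illustration, assuming the check works by resolving the requested path against the project root and rejecting anything that resolves outside it. The function name confine and the example root /work/project are hypothetical, and the real check may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// confine resolves requested against root and returns the cleaned absolute
// path, or an error if that path would fall outside root.
func confine(root, requested string) (string, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return "", err
	}

	// Relative requests are interpreted relative to the project root.
	target := requested
	if !filepath.IsAbs(target) {
		target = filepath.Join(absRoot, target)
	}
	target = filepath.Clean(target)

	// If the relative path from root to target starts with "..",
	// the target is outside the root.
	rel, err := filepath.Rel(absRoot, target)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes project root %q", requested, absRoot)
	}
	return target, nil
}

func main() {
	for _, p := range []string{"cmd/hello/main.go", "../../etc/passwd", "/etc/passwd"} {
		resolved, err := confine("/work/project", p)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("allowed:", resolved)
	}
}
```

This sketch is purely lexical. It does not resolve symlinks, so a production check would also need to decide how to treat a link inside the project that points outside it.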

The loop

The repair loop is a ReAct loop, the same reason-act-observe shape as the tool-calling loop, pointed at a goal:

1. The draft is on disk.
2. Verify: run go_build and go_test.
3. If verification failed, read the error logs, the compiler error or the failing test.
4. Reason about the cause: an undefined variable, a missing import, a wrong signature.
5. Act: call write_file to patch the code, or go_get to add the dependency.
6. Loop. Steps two to five repeat until the project is green or the agent hits its step limit, which defaults to 15.
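The steps above can be sketched as a small bounded loop. This is a simplified model under stated assumptions, not the generator’s implementation: Verifier stands in for go_build plus go_test, Fixer stands in for the model reasoning and then calling write_file or go_get, and both types are hypothetical. How the real agent counts a “step” is not specified here, so this sketch counts one verify-and-fix round as one step.

```go
package main

import (
	"errors"
	"fmt"
)

// Verifier is a hypothetical stand-in for running go_build and go_test.
type Verifier func() (ok bool, output string)

// Fixer is a hypothetical stand-in for the model reading the error output
// and patching the project with write_file or go_get.
type Fixer func(errorOutput string) error

// ErrStepLimit reports that the budget ran out before the project was green.
var ErrStepLimit = errors.New("step limit reached before the project was green")

// defaultMaxSteps mirrors the default of 15 mentioned in the text.
const defaultMaxSteps = 15

// repair alternates verification and fixing until verification passes or
// maxSteps fix rounds have been used. It returns the number of fix rounds.
func repair(verify Verifier, fix Fixer, maxSteps int) (int, error) {
	for step := 0; step < maxSteps; step++ {
		ok, output := verify()
		if ok {
			return step, nil // green
		}
		// The error output is the input to the next fix, not a dead end.
		if err := fix(output); err != nil {
			return step, fmt.Errorf("fix failed: %w", err)
		}
	}
	// Check the result of the final fix before giving up.
	if ok, _ := verify(); ok {
		return maxSteps, nil
	}
	return maxSteps, ErrStepLimit
}

func main() {
	// A simulated project with two problems to work through.
	remaining := []string{"undefined: cfg", `"os" imported and not used`}
	verify := func() (bool, string) {
		if len(remaining) == 0 {
			return true, ""
		}
		return false, remaining[0]
	}
	fix := func(errorOutput string) error {
		fmt.Println("fixing:", errorOutput)
		remaining = remaining[1:]
		return nil
	}

	rounds, err := repair(verify, fix, defaultMaxSteps)
	fmt.Println("fix rounds:", rounds, "error:", err)
}
```

The step limit is what keeps the loop honest: a model that cannot converge stops after a known amount of work instead of spinning.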

What makes this work is treating the error output as feedback rather than as a failure to log and abandon. A compiler error is the single most useful sentence you can give a model that is trying to fix code. It says what is wrong, and usually where. The loop feeds it straight back in, and the model fixes against it.
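To see how much a compiler error carries, it helps to pull one apart. Go compiler errors generally take the form file.go:line:col: message, and the sketch below extracts those four parts. It only illustrates the “what and where” point: the Diagnostic type and parseDiagnostics are hypothetical, and the real agent may well hand the raw text to the model without parsing it.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Diagnostic is a hypothetical structured view of one compiler error line.
type Diagnostic struct {
	File    string
	Line    int
	Column  int
	Message string
}

// diagRE matches the usual "file.go:line:col: message" shape.
var diagRE = regexp.MustCompile(`^(.+\.go):(\d+):(\d+): (.+)$`)

// parseDiagnostics returns one Diagnostic per matching line and skips
// everything else, such as the "# package/path" header go build prints.
func parseDiagnostics(output string) []Diagnostic {
	var diags []Diagnostic
	for _, line := range strings.Split(output, "\n") {
		m := diagRE.FindStringSubmatch(strings.TrimSpace(line))
		if m == nil {
			continue
		}
		// The pattern only admits digits here, so conversion errors
		// are ignored in this sketch.
		ln, _ := strconv.Atoi(m[2])
		col, _ := strconv.Atoi(m[3])
		diags = append(diags, Diagnostic{File: m[1], Line: ln, Column: col, Message: m[4]})
	}
	return diags
}

func main() {
	// Sample output invented for illustration.
	output := "# example.com/tool/cmd\n./cmd/hello.go:12:5: undefined: cfg\n"
	for _, d := range parseDiagnostics(output) {
		fmt.Printf("%s line %d col %d: %s\n", d.File, d.Line, d.Column, d.Message)
	}
}
```

A file, a line, a column and a cause is about as precise as feedback gets, which is why feeding it back works better than asking the model to try again blind.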

Verification changes what “done” means

Here is the real shift, and the agent’s own documentation puts it well: the agent “doesn’t just say it fixed a bug; it uses a Test tool to verify the fix before reporting success.”

A generate-and-hope model reports success when it finishes writing. It has no idea whether the code works, and it isn’t really claiming otherwise. “Done” means “I produced text.” The repair agent reports success when go_build and go_test actually pass. “Done” means “the build is green.” Those are two completely different claims, and only the second is worth anything to the person who asked for the command.

That is the line between an AI that is a creative writer and an AI that is a collaborator you can hand a task to. And when the agent can’t reach green, when it spends its whole step budget and the project is still broken, the generator fails safely: it leaves the best-attempt code in place, commented out so the project still compiles, and tells the user what to finish by hand. There is also an --agentless flag for anyone who would rather have a plain single-shot retry than the multi-step agent. The default, though, is the agent, because the default should be code that has been checked.
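How the generator comments out the best-attempt code is not spelled out here, so the following is only one plausible shape for that fallback. It assumes a line-based approach: keep everything up to and including the package clause, because a Go file still needs one to compile, and turn every later non-blank line into a line comment. The function name commentOutBody is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// commentOutBody leaves everything up to and including the package clause
// untouched and prefixes every later non-blank line with "// ".
// It is deliberately naive: it finds the package clause by line prefix.
func commentOutBody(src string) string {
	lines := strings.Split(src, "\n")
	seenPackage := false
	for i, line := range lines {
		if !seenPackage {
			if strings.HasPrefix(line, "package ") {
				seenPackage = true
			}
			continue
		}
		if strings.TrimSpace(line) == "" {
			continue // keep blank lines blank
		}
		lines[i] = "// " + line
	}
	return strings.Join(lines, "\n")
}

func main() {
	// A draft that would not compile: the call is never closed.
	broken := "package cmd\n\nfunc Run() {\n\tbroken(\n}"
	fmt.Println(commentOutBody(broken))
}
```

The result is a file that compiles as an empty package while keeping the draft readable, so the user can see exactly what is left to finish by hand.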

Where this leaves us

Most AI code generation generates and hopes: the model writes code and the user discovers whether it works. For a whole generated command, that pushes a may-or-may-not-build project onto the user.

go-tool-base’s generator drafts the command and then hands it to an autonomous repair agent. The agent has a fixed set of tools (explore and edit the project, build it, test it, lint it, fetch dependencies) and no shell at all, with file writes confined to the project directory. It runs a ReAct loop, reading each error and patching against it, until the build is green or it exhausts its steps. The point is what “done” comes to mean: not “the model finished writing,” but “the build passes.” Only the second is a claim worth trusting.