Lifecycle management for when your CLI grows up into a service

TL;DR: Plenty of CLI tools quietly grow a serve or run command and become long-running services, at which point they need the unglamorous production plumbing: coordinated startup, graceful shutdown, health probes, and a sane response to a service that’s fallen over. go-tool-base’s pkg/controls provides that as a Controller managing one or more Controllable services — shared channels for errors and signals, ordered graceful shutdown, Kubernetes-style liveness and readiness probes, and an opt-in supervisor that restarts a failed service with exponential backoff.

The command that stops being a command

go-tool-base is CLI-first. It is not CLI-only, and the reason is a pattern I’ve watched play out more times than I can count.

A tool starts its life as an honest command-line utility. It runs, it does a thing, it exits. Then someone needs it to also expose a small HTTP endpoint. Or poll a queue. Or run a scheduler. So it grows a serve command, or a run command, and the moment it does, the thing that was a CLI tool is now a long-running service that happens to have a CLI in front of it.

And a long-running service needs a whole category of plumbing a one-shot command never did. It has to start things up in a sensible order. It has to shut them down gracefully when someone sends a SIGTERM, finishing in-flight work rather than dropping it. It has to tell an orchestrator whether it’s alive and whether it’s ready. It has to do something sensible when one of its internal services dies at 3am.

Hand-rolled, that’s a few hundred lines of goroutine choreography, channel wrangling and signal handling that every such tool reinvents, slightly differently, slightly wrong. It’s the first-afternoon problem again, just later in the project’s life. So go-tool-base ships it: pkg/controls.

A controller and the things it controls

The model is small. A Controller manages any number of services, each of which satisfies a Controllable interface — at heart, a StartFunc and a StopFunc. An HTTP server, a background worker, a scheduler, anything with a “begin” and an “end.”

You register your services with the controller and it owns their collective lifecycle. They share a common set of channels — errors, OS signals, health, control messages — so the whole set can react together. A SIGTERM doesn’t get caught by one service in isolation; it reaches the controller, and the controller takes everything down in order, each StopFunc given a context with a deadline so a sulking service can’t wedge the shutdown forever.

That ordering and timeout handling is the bit nobody enjoys writing and everybody needs. Centralising it means a tool that adds a second service later inherits correct coordinated shutdown for free, rather than discovering on its first production SIGTERM that it half-shuts-down.

Probes, because something is usually watching

If the service runs in Kubernetes — and a lot of them end up there — the orchestrator wants to ask two different questions, and they are genuinely different questions.

Liveness: are you alive, or are you wedged and in need of a kill? Readiness: are you alive and able to take traffic right now? A service can easily be live but not ready — still warming a cache, still waiting on a dependency. Conflating the two gets you killed during a slow startup, or sent traffic before you can serve it.

controls keeps them separate. You attach a WithLiveness probe and a WithReadiness probe to a service, each just a function returning a health report, and the controller exposes them. The tool answers Kubernetes honestly, in Kubernetes’ own terms, without you wiring up two more HTTP handlers by hand.

Self-healing, but only if you ask

The last piece is what happens when a service fails. A worker’s StartFunc returns an error. Health checks start failing. In a hand-rolled setup this is where you either crash the whole process or write a bespoke restart loop.

controls has a supervisor that can restart a failed service for you, and the important word in that sentence is can. It is off by default. A service is only supervised if you hand it a RestartPolicy at registration:

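controls.WithRestartPolicy(controls.RestartPolicy{
    MaxRestarts:            5,                // give up after five restarts
    InitialBackoff:         time.Second,      // delay before the first restart
    MaxBackoff:             30 * time.Second, // ceiling for the growing delay
    HealthFailureThreshold: 3,                // consecutive failed health checks tolerated
})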

With a policy in place, the controller restarts the service if its StartFunc errors out, or if its run of consecutive health-check failures exceeds HealthFailureThreshold. Restarts back off exponentially, from InitialBackoff up to a MaxBackoff ceiling, so a service that’s failing because its database is down doesn’t hammer that database flat with a tight restart loop. MaxRestarts caps the attempts, because a service that has failed five times in a row is not going to be fixed by a sixth try, and at that point honest failure beats a thrashing pretence of health.

Opt-in matters here. Automatic restarts are the right behaviour for a resilient daemon and the wrong behaviour for a tool where a failure should stop the line and get a human’s attention. The framework doesn’t decide that for you. It gives you the supervisor and lets you point it at the services that genuinely want it.

Worth remembering

A surprising number of CLI tools become long-running services the day they grow a serve command, and the day they do, they need coordinated startup, graceful ordered shutdown, real liveness and readiness probes, and a considered answer to a service falling over. That’s a few hundred lines of fiddly, easy-to-get-wrong plumbing.

pkg/controls provides it: a Controller over Controllable services with shared channels and deadline-bounded graceful shutdown, separate Kubernetes-style liveness and readiness probes, and an opt-in supervisor that restarts failed services with exponential backoff and a restart ceiling. Your tool can start as a command and grow into a daemon without that growth being a rewrite.

CLI-first, but not stuck there.