Telemetry that asks, and telemetry that doesn't

go-tool-base has had a thing called telemetry for a long while now. It’s the opt-in kind: the product analytics that asks a user’s permission before it phones a single byte home, sits there as a no-op until they say yes, and can be wiped on request. The whole package is built around consent.

Then the web-service series went and needed telemetry too. Not that telemetry. The other one, the one the rest of the industry means when it says the word: traces, metrics and logs of a running service. And the awkward thing about those two is that they share a name, they want to share a package, and they pull in exactly opposite directions on the one question that matters most.

This is the story of how 0.7.x grew a second telemetry without breaking the first, and where the line between them ended up getting drawn.

Why bother putting it in the framework at all

The starting point is that I could have left observability out. A reader could wire up OpenTelemetry in their own service and go about their day. But the six parts of the web-service series spent a lot of effort making the transports first-class: a gRPC server, an HTTP server, a gateway, TLS across all of them, each one a Register call against the controller. Turning a CLI into a real long-running service and then shrugging “observability is your problem” would have left a hole exactly where it hurts.

Because a service you can’t see into is a liability the moment it leaves your laptop. The series ended with a macguffin service that was typed, fast and served over TLS, and was also a black box: when it got slow, you had nowhere to look. Metrics and traces are how you get the lights on, and they deserved the same first-class treatment as the things they observe.

The other half of the reason is that the framework already had a foot in this world. The analytics package’s preferred backend speaks OTLP, the OpenTelemetry wire protocol. So OpenTelemetry was already in the building. Doing observability any other way would have meant two standards where one would do.

The catch: two telemetries, opposite instincts

Here’s where it gets interesting, and it’s the part worth slowing down on.

The analytics telemetry is about a user. It collects usage data (a hashed machine ID, which command ran, the exit code), and the entire design assumes you have to ask first. It is off by default. The collector you get when it’s disabled is a no-op, so nothing is recorded until the user opts in, and there’s a deletion path for when they change their mind. That’s not an add-on; it’s the design.

The observability telemetry is about a service. It emits operational data (how long a request took, which span was slow, how many requests errored) to a collector the operator runs. And there is no user in the loop to ask. The operator deploys the service, points it at their collector, and that act is itself the consent. Asking would be nonsensical: whose permission, for data about their own service, on their own infrastructure?

So you have two things called telemetry, wanting to live in one package, with the opposite default on consent. One is off until someone says yes; the other is on the moment it’s configured. Get that wiring wrong and you fail in one of two ugly ways. Gate the operational telemetry behind the user’s analytics opt-in, and a service’s tracing silently does nothing because nobody ticked a box meant for something else. Or loosen the analytics gate to make observability flow, and you start leaking usage data the user never agreed to share. Neither is acceptable, and “just use two packages” throws away everything the two genuinely have in common.

What they actually share

Quite a lot, as it turns out, and all of it below the consent line.

Both ship their data over OTLP to a collector. Both need to describe who is emitting: the service name and version, which together make up the resource in OpenTelemetry’s terms. Both parse an endpoint, attach headers, and decide whether the connection is plaintext. None of that has the faintest thing to do with consent. It’s just the plumbing of getting bytes to a collector, and the analytics backend already had all of it, written inline.
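To make "the plumbing" concrete, here is a minimal sketch of the endpoint half of it, using only the standard library. The `Endpoint` type and `parseEndpoint` function are names invented for this illustration, and treating an `http` scheme as the plaintext signal is an assumption; the framework's own function may well differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// Endpoint is a hypothetical parsed form of an OTLP collector URL.
type Endpoint struct {
	Host     string // host[:port] to dial
	Path     string // URL path, if the collector is mounted under one
	Insecure bool   // true when the scheme is plain http (an assumption of this sketch)
}

// parseEndpoint is an illustrative stand-in, not the framework's function.
// It splits a collector URL into the pieces an exporter needs.
func parseEndpoint(raw string) (Endpoint, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return Endpoint{}, fmt.Errorf("parse OTLP endpoint: %w", err)
	}
	switch u.Scheme {
	case "http", "https":
		// supported schemes
	default:
		return Endpoint{}, fmt.Errorf("unsupported scheme %q in OTLP endpoint %q", u.Scheme, raw)
	}
	if u.Host == "" {
		return Endpoint{}, fmt.Errorf("OTLP endpoint %q has no host", raw)
	}
	return Endpoint{Host: u.Host, Path: u.Path, Insecure: u.Scheme == "http"}, nil
}

func main() {
	ep, err := parseEndpoint("http://localhost:4318/v1/traces")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ep)
}
```

Nothing in it knows who is asking or why, which is what makes this layer safe to share.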

So the shape of the solution fell out of the problem. Lift the shared plumbing into one place, let both telemetries stand on it, and keep the consent decision firmly out of that shared layer. The structure under pkg/telemetry ended up like this:

pkg/telemetry/
    telemetry.go       the analytics Collector (consent-gated)
    backend_otel.go    its OTLP backend
    posthog/ datadog/  vendor analytics backends
    otelcore/          shared: OTLP endpoint, resource, telemetry.* config
    tracing/           observability signal
    metrics/           observability signal
    logs/              observability signal
    observability.go   Setup: builds the enabled signals (implied consent)

The new otelcore is the keystone. It holds the three things both sides need and nothing they don’t: ParseEndpoint for the OTLP URL, Resource for the service identity, and Resolve for reading the shared telemetry.* config (a base endpoint, plus per-signal overrides, in the same cascade as the TLS config). It imports no signal exporter and knows nothing about traces, metrics, logs or analytics. It is deliberately dumb plumbing.
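The "base endpoint, plus per-signal overrides" rule can be sketched in a few lines. This is an illustration under assumptions, not the real Resolve: the `TelemetryConfig` and `SignalConfig` types, their fields, and the `resolveEndpoint` helper are all hypothetical, and the only behaviour modelled is that a signal's own endpoint wins over the shared one.

```go
package main

import "fmt"

// SignalConfig is a hypothetical per-signal section such as telemetry.tracing.*.
type SignalConfig struct {
	Enabled  bool
	Endpoint string // optional override; empty means "inherit the base endpoint"
}

// TelemetryConfig is a hypothetical in-memory view of the telemetry.* root.
type TelemetryConfig struct {
	Endpoint string                  // base endpoint shared by every signal
	Signals  map[string]SignalConfig // keyed by "tracing", "metrics", "logs"
}

// resolveEndpoint returns the endpoint a signal should export to:
// the signal's own override if set, otherwise the shared base.
func resolveEndpoint(cfg TelemetryConfig, signal string) (string, error) {
	if sc, ok := cfg.Signals[signal]; ok && sc.Endpoint != "" {
		return sc.Endpoint, nil
	}
	if cfg.Endpoint != "" {
		return cfg.Endpoint, nil
	}
	return "", fmt.Errorf("no OTLP endpoint configured for %s", signal)
}

func main() {
	cfg := TelemetryConfig{
		Endpoint: "https://otel.example.com",
		Signals: map[string]SignalConfig{
			"tracing": {Enabled: true},
			"metrics": {Enabled: true, Endpoint: "https://metrics.example.com"},
		},
	}
	for _, s := range []string{"tracing", "metrics"} {
		ep, err := resolveEndpoint(cfg, s)
		if err != nil {
			panic(err)
		}
		fmt.Println(s, "->", ep)
	}
}
```

The lookup never branches on what kind of data the signal carries, only on which endpoint it should use.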

The refactor: making the old telemetry stand on the new core

This next part is where the old telemetry and the new one become a single thing. The analytics OTLP backend was the first user of OTLP in the framework, and it had grown its own copy of all that plumbing: a function that parsed the endpoint URL, split out the host and path, worked out the insecure flag, and built the resource from a service name. Exactly the code the three new signals were about to need.

So rather than write it a second time and let the two drift, the analytics backend was refactored onto otelcore. Its exporter builder, buildOTelExporterOpts, now calls otelcore.ParseEndpoint, the same function tracing, metrics and logs call, and the resource comes from otelcore.Resource, the same one they use. One implementation of “talk OTLP to a collector”, four callers: the analytics backend and the three observability signals. Change how the framework forms an OTLP endpoint, and every signal moves together.

The reassuring part was that the analytics tests didn’t budge. The refactor moved code without changing behaviour, and the consent machinery, the opt-in, the no-op-when-disabled, the deletion path, never came near otelcore. Which is exactly the point.

Where the line is

The shared core is the easy half. The half that earns its keep is the bit that isn’t shared, and it’s a single, deliberate line.

The analytics collector keeps its gate. The constructor, NewCollector, still returns a no-op the moment telemetry is disabled, so a user who hasn’t opted in gets a collector that silently discards everything. Informed consent, untouched.
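The gate is an instance of the null-object pattern, and a small sketch shows why it is hard to get wrong. Everything here is hypothetical (the `Collector` interface, the `Event` type, and the lowercase `newCollector` are invented for the example and are not the framework's API); the point is only that the disabled path returns a value with nothing behind it.

```go
package main

import "fmt"

// Event is a hypothetical usage event.
type Event struct {
	Command  string
	ExitCode int
}

// Collector is a hypothetical analytics interface.
type Collector interface {
	Track(Event)
}

// noopCollector discards everything; it is what a user who has not opted in gets.
type noopCollector struct{}

func (noopCollector) Track(Event) {}

// recordingCollector stands in for a real backend by keeping events in memory.
type recordingCollector struct {
	events []Event
}

func (c *recordingCollector) Track(e Event) { c.events = append(c.events, e) }

// newCollector applies the consent gate once, at construction time.
// Callers never check the opt-in themselves; they just call Track.
func newCollector(optedIn bool) Collector {
	if !optedIn {
		return noopCollector{}
	}
	return &recordingCollector{}
}

func main() {
	c := newCollector(false)
	c.Track(Event{Command: "build", ExitCode: 0}) // silently discarded
	fmt.Printf("%T\n", c)
}
```

Because the decision is made in the constructor, no call site can forget to check consent: there is nothing to check.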

Observability gets a different door entirely. Setup builds whichever signals the operator has switched on, and it is gated only by telemetry.tracing.enabled and its siblings, which the operator sets. It never consults the analytics opt-in. Turning on tracing doesn’t turn on analytics; disabling analytics doesn’t silence tracing. The analytics opt-in and the per-signal enable flags live under the same telemetry.* config root, sit next to each other, and never read each other.
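The independence of the two gates is easiest to see when both sets of flags sit in one struct. This is a sketch under assumptions: the `Flags` type, its field names, and `enabledSignals` are invented for the example, and the real Setup builds exporters rather than returning names.

```go
package main

import "fmt"

// Flags is a hypothetical flattened view of the telemetry.* enable flags.
type Flags struct {
	AnalyticsEnabled bool // the user's opt-in; read only by the analytics collector
	TracingEnabled   bool // operator-set
	MetricsEnabled   bool // operator-set
	LogsEnabled      bool // operator-set
}

// enabledSignals lists the observability signals to build.
// It deliberately never reads AnalyticsEnabled.
func enabledSignals(f Flags) []string {
	var out []string
	if f.TracingEnabled {
		out = append(out, "tracing")
	}
	if f.MetricsEnabled {
		out = append(out, "metrics")
	}
	if f.LogsEnabled {
		out = append(out, "logs")
	}
	return out
}

func main() {
	// Analytics is off, yet tracing and metrics still start.
	f := Flags{AnalyticsEnabled: false, TracingEnabled: true, MetricsEnabled: true}
	fmt.Println(enabledSignals(f))
}
```

Flipping `AnalyticsEnabled` in either direction leaves the returned list unchanged, which is the property the two failure modes described earlier would break.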

So that’s the whole architecture in a sentence: one package, one OTLP export core, two consent models that share everything except the answer to “do we need to ask”. The principle underneath, the one that decided every one of these calls, is that the kind of data sets the consent model. Usage data about a person needs informed consent. Operational data about a service runs on implied consent. The CLI and the web service are just where each kind tends to live.

Where this leaves the framework

0.7.x came out the other side with both telemetries: the one that asks first, exactly as it was, and a new one that doesn’t, because it has nobody to ask. They share an export core, a config root and a name, and they part company on the only thing they were ever going to disagree about.

I’ve been careful here to describe how the two consent models are kept apart, not to argue why they have to be. That argument, that “the kind of data decides the consent model” is a line worth holding rather than a convenient bit of engineering, is a piece of its own, and it’s the one I’m writing next.
