Observability Is a Counter in RAM, an ID in a Header, and a Log with Context
For a long time, “observability” sounded like one of those words vendors invent to sell you a logging system twice.
You import an SDK. You sprinkle some calls in your code. When the code runs, it sends something somewhere, that something gets stored, and a dashboard reads from the store. That’s it, right? That’s a logging pipeline with a nicer landing page.
I was almost right. And “almost right” is the worst place to be, because the mental model works just well enough that you never question it.
The model that almost works
Here’s the pitch, one layer up from the buzzword. Observability is the ability to understand what’s happening inside a system just by looking at its outputs — without shipping new code to ask new questions.
Monitoring answers known questions you set up dashboards for in advance: “is CPU high?” Observability answers unknown questions after the fact: “why is this one customer’s checkout slow on Tuesdays?”
Fine. But mechanically, that still sounds like logs plus smart queries. Collect everything, store everything, grep harder. Where’s the difference?
The difference isn’t in the analysis. It’s in the emission. And you only see it when you look at what each pillar actually does at runtime, one layer down at a time.
Pillar one: a metric is a number in RAM
The first pillar, metrics, doesn’t emit events at all.
The SDK keeps a number in your application’s memory. Your code just nudges it:
requests_total.increment() // counter goes 1, 2, 3...
active_connections.set(57) // gauge = current value
request_duration.record(0.23) // adds to a histogram
Nothing is sent at that moment. Every fifteen seconds or so, either the backend scrapes an endpoint your app exposes (/metrics), or the app pushes the current values out.
That’s why metrics are cheap. A million requests isn’t a million records; it’s one counter sitting at 1,000,000. You could rebuild a p99 latency graph by parsing log text, but only by storing and scanning a line for every request. A histogram is a fundamentally different data structure: a fixed set of bucket counts, read on a clock.
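To make the "number in RAM" idea concrete, here is a minimal sketch in Python. The Counter and Histogram classes, the bucket bounds, and the simulated durations are all invented for illustration; they are not any real SDK's API, just the smallest thing that behaves the way this section describes.

```python
import bisect
import threading


class Counter:
    """A monotonic counter that lives only in process memory (hypothetical class)."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self, amount: int = 1) -> None:
        with self._lock:  # touches RAM only; nothing is sent here
            self._value += amount

    def value(self) -> int:
        with self._lock:
            return self._value


class Histogram:
    """Counts observations per bucket instead of storing each one (hypothetical class)."""

    def __init__(self, upper_bounds: list) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot catches values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0

    def record(self, value: float) -> None:
        # Index of the first bound >= value, i.e. the bucket the value falls in.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile."""
        if self.total == 0:
            raise ValueError("no observations recorded")
        target = q * self.total
        running = 0
        for bound, count in zip(self.upper_bounds + [float("inf")], self.counts):
            running += count
            if running >= target:
                return bound
        return float("inf")


requests_total = Counter()
request_duration = Histogram([0.1, 0.25, 0.5, 1.0])  # seconds, assumed bounds

# Simulate 1,000 requests: mostly fast, a few slow.
for duration, how_many in [(0.05, 980), (0.23, 15), (0.9, 5)]:
    for _ in range(how_many):
        requests_total.increment()
        request_duration.record(duration)

print(requests_total.value())                        # 1000
print(request_duration.counts)                       # [980, 15, 0, 5, 0]
print(request_duration.quantile_upper_bound(0.99))   # 0.25
```

A thousand requests leave behind one integer and five bucket counts. The p99 answer is only as precise as the bucket bounds, which is the usual trade-off with bucketed histograms.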
There’s an obvious objection: if the number lives in RAM, what happens on Cloud Run, where containers are stateless and die constantly?
The answer is that the counter never needed to survive. It’s only a temporary accumulator between exports. Each container increments its own local counter, exports it on that same interval tagged with an instance ID, and the backend sums across instances. A container dying loses at most one export interval of not-yet-pushed data. The durable truth lives in the backend, never in your app.
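A rough sketch of that accounting, assuming each instance pushes its cumulative count and the backend keeps the latest value per instance. MetricsBackend is a toy stand-in I made up for illustration; real backends differ in the details (some accept deltas rather than cumulative values, for example).

```python
class MetricsBackend:
    """Toy stand-in for a metrics backend (hypothetical, not a real product API)."""

    def __init__(self) -> None:
        # (metric name, instance id) -> last cumulative value that instance pushed
        self._latest = {}

    def push(self, metric: str, instance_id: str, cumulative: int) -> None:
        self._latest[(metric, instance_id)] = cumulative

    def total(self, metric: str) -> int:
        return sum(v for (name, _), v in self._latest.items() if name == metric)


backend = MetricsBackend()

# Instance "a": serves 100 requests, exports, serves 7 more, then is killed.
a_local = 0
a_local += 100
backend.push("requests_total", "a", a_local)
a_local += 7  # never exported: the container dies before the next push

# Instance "b": serves 250 requests and exports normally.
b_local = 250
backend.push("requests_total", "b", b_local)

served = a_local + b_local
recorded = backend.total("requests_total")
print(served, recorded, served - recorded)  # 357 350 7
```

The only loss is what instance "a" accumulated after its last export. Everything it pushed earlier is still in the backend after the container is gone.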
Pillar two: a log with an ID stapled to it
The second pillar is the one I thought was the whole thing. And honestly, it is almost what I thought.
A log is still a timestamped event, written out, shipped, stored, searched. The observability upgrade is one field:
log.info("payment failed", { trace_id: "abc123", user: 42 })
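In this snippet the trace_id is spelled out so you can see it; in practice the SDK’s logging integration usually injects it automatically from the active request context. It’s the correlation glue: it lets this one line be tied back to the exact request that produced it, across every service that request touched.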
Mechanically, plain logging. Structurally, the difference between grepping scattered lines and pulling up everything that happened during this one request.
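Here is one way the automatic injection can work, sketched with nothing but Python's standard library: a context variable holds the current trace ID, and a logging filter copies it onto every record. The names current_trace_id, TraceContextFilter, JsonFormatter, and handle_request are mine, chosen for illustration; a real SDK's integration will look different.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of whatever request the current task is handling.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceContextFilter(logging.Filter):
    """Copies the active trace ID onto every log record (hypothetical helper)."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drops a record, only annotates it


class JsonFormatter(logging.Formatter):
    """Renders a record as one structured JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "user": getattr(record, "user", None),
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

log = logging.getLogger("payments")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False


def handle_request(trace_id: str, user: int) -> None:
    token = current_trace_id.set(trace_id)  # done once, at the edge of the request
    try:
        # The call site never mentions trace_id.
        log.info("payment failed", extra={"user": user})
    finally:
        current_trace_id.reset(token)


handle_request("abc123", 42)
entry = json.loads(stream.getvalue())
print(entry["trace_id"], entry["message"])  # abc123 payment failed
```

The business code logs exactly as it did before. The correlation field arrives because the request context was set once, upstream of every log call.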
Pillar three: the trick is an HTTP header
The third pillar is the one that broke my “fancy logging” model for good.
A trace follows one request across multiple services. Each unit of work — an endpoint handler, a DB query — is a span, with a start time, a duration, and a parent. The question that matters: how does service B know it’s part of the same request as service A?
It’s a header. That’s the whole trick.
Service A                                    Service B
─────────                                    ─────────
start span (trace_id=abc, span=1)
  │
  │  HTTP call to B ──────────────────►      reads header
  │  header: traceparent: abc-1              traceparent: abc-1
  │                                            │
  │                                          start span (trace_id=abc,
  │                                                      span=2, parent=1)
  │  ◄──────────────────────────────────     returns
end span                                     end span
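Service A stuffs the trace ID and its own span ID into an outgoing header: traceparent, defined by the W3C Trace Context standard. (The abc-1 in the diagram is shorthand; the real value is a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte, joined by hyphens.) Service B reads it and says: I’m part of trace abc, my parent is span 1. Each span is exported independently, and the backend reassembles them into a tree using the parent/child IDs.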
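The handoff can be sketched in a few lines of Python. The header layout follows the W3C traceparent format (version, 32-hex trace ID, 16-hex parent span ID, flags). The Span class, the helper functions, and the span names are invented for illustration, and build_tree is only a guess at the simplest thing a backend could do.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # 32 hex chars, shared by every span in the trace
    span_id: str              # 16 hex chars, unique per span
    parent_id: Optional[str]  # None for the root span
    name: str


def new_span(name: str, trace_id: Optional[str] = None,
             parent_id: Optional[str] = None) -> Span:
    return Span(trace_id or secrets.token_hex(16), secrets.token_hex(8), parent_id, name)


def to_traceparent(span: Span) -> str:
    # version 00, then trace ID, then the *caller's* span ID, then flags (01 = sampled)
    return f"00-{span.trace_id}-{span.span_id}-01"


def from_traceparent(header: str) -> tuple:
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError(f"malformed traceparent: {header!r}")
    return trace_id, parent_id


# Service A: starts the root span and injects the header into its outgoing call.
span_a = new_span("GET /checkout")
outgoing_headers = {"traceparent": to_traceparent(span_a)}

# Service B: reads the header and starts a child span in the same trace.
trace_id, parent_id = from_traceparent(outgoing_headers["traceparent"])
span_b = new_span("charge card", trace_id=trace_id, parent_id=parent_id)


def build_tree(spans: list) -> dict:
    """Backend side: group span names by parent ID, whatever order they arrived in."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span.name)
    return children


tree = build_tree([span_b, span_a])  # the child happens to arrive first
print(tree[None], tree[span_a.span_id])  # ['GET /checkout'] ['charge card']
```

Service B never talks to Service A's SDK, and the backend never sees the HTTP call. The two IDs in the header are the entire link.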
This generalizes cleanly. gRPC? Same ID, carried in gRPC metadata, which over HTTP/2 travels as ordinary headers. Kafka? Message headers. The SDK has a component called a propagator whose only job is injecting the ID into whatever carrier the outgoing call uses, and extracting it on the way in. Same trace, different envelope.
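A propagator can be sketched as two functions that know the header name but not the transport. The inject and extract names, and the setter/getter helpers, are my own illustration rather than a real SDK's signatures. The dict stands in for HTTP headers, and the list of (key, bytes) pairs is an assumption about how a Kafka client might represent message headers.

```python
from typing import Callable, Optional

TRACEPARENT = "traceparent"


def inject(value: str, carrier, setter: Callable) -> None:
    """Write the trace context into an outgoing carrier of any shape."""
    setter(carrier, TRACEPARENT, value)


def extract(carrier, getter: Callable) -> Optional[str]:
    """Read the trace context back out on the receiving side."""
    return getter(carrier, TRACEPARENT)


# HTTP-style carrier: a plain dict of header name -> string value.
def dict_set(carrier: dict, key: str, value: str) -> None:
    carrier[key] = value


def dict_get(carrier: dict, key: str) -> Optional[str]:
    return carrier.get(key)


# Kafka-style carrier (assumed shape): a list of (key, bytes) pairs.
def pairs_set(carrier: list, key: str, value: str) -> None:
    carrier.append((key, value.encode("utf-8")))


def pairs_get(carrier: list, key: str) -> Optional[str]:
    for k, v in carrier:
        if k == key:
            return v.decode("utf-8")
    return None


context = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"  # example value

http_headers = {"content-type": "application/json"}
inject(context, http_headers, dict_set)

kafka_headers = [("schema", b"order.v1")]
inject(context, kafka_headers, pairs_set)

print(extract(http_headers, dict_get) == extract(kafka_headers, pairs_get))  # True
```

The trace logic stays identical across transports. Supporting a new envelope means writing one setter and one getter.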
And this is the part no after-the-fact analysis can fake. If the ID wasn’t propagated at emission time, in-band, inside the request path, no amount of clever log parsing reconstructs the link. The relationship has to be recorded when the data is created — or it never exists.
What’s automatic, what isn’t
One more assumption worth killing: the SDK does not know your business.
Auto-instrumentation hooks into the generic machinery it can see — your HTTP framework, your DB client, your runtime. So you get request counts, latencies, status codes, query durations, memory, GC, CPU, for free. That’s the plumbing.
Anything with domain meaning — signups, abandoned carts, active users — is a manual one-liner, because only you know what those are. The split is clean:
- Auto: how is the plumbing behaving?
- Manual: how is my business behaving?
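The split can be shown with a toy stand-in for auto-instrumentation: a decorator that wraps any handler and records request counts, status codes, and durations without knowing what the handler does. The instrumented decorator, the counters and durations stores, and the signup handler are all hypothetical; real auto-instrumentation hooks into the framework rather than asking you to decorate anything.

```python
import functools
import time
from collections import Counter, defaultdict

counters = Counter()            # metric key -> count
durations = defaultdict(list)   # route -> observed durations in seconds


def instrumented(route: str):
    """Toy 'auto' layer: generic plumbing metrics for any handler."""
    def wrap(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = 500  # assumed if the handler raises
            try:
                status, body = handler(*args, **kwargs)
                return status, body
            finally:
                counters[("http_requests_total", route, status)] += 1
                durations[route].append(time.perf_counter() - start)
        return wrapper
    return wrap


@instrumented("/signup")
def signup(email: str):
    # Manual: only this code knows that a signup just happened.
    counters["signups_total"] += 1
    return 201, f"welcome {email}"


status, body = signup("ada@example.com")
print(status, counters[("http_requests_total", "/signup", 201)], counters["signups_total"])
# 201 1 1
```

The wrapper would produce the same plumbing metrics for any route. The signups_total line is the one piece that had to be written by someone who knows the product.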
The export side is boring by design, too. Instrumentation runs in-band, inside your request path, but export runs out-of-band. Signals get buffered in memory and shipped in async batches, so observability adds very little latency to each request. A dead backend costs you telemetry, not uptime: the buffer is typically bounded, so it fills up and drops data instead of blocking your app.
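A minimal sketch of that out-of-band export, assuming a bounded in-memory queue drained by a background thread. BatchExporter and its parameters are invented for illustration; real SDKs add retries, timeouts, and more careful accounting.

```python
import queue
import threading


class BatchExporter:
    """Buffers items in memory and ships them in batches off the request path."""

    def __init__(self, send, max_queue=1000, batch_size=50, interval=0.05):
        self._send = send                      # callable that ships one batch
        self._queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self._interval = interval
        self._stop = threading.Event()
        self.dropped = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def emit(self, item) -> None:
        """Called in-band: never blocks and never raises."""
        try:
            self._queue.put_nowait(item)
        except queue.Full:
            self.dropped += 1                  # shed load instead of stalling the app

    def _drain(self) -> list:
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _flush(self) -> None:
        while True:
            batch = self._drain()
            if not batch:
                return
            try:
                self._send(batch)
            except Exception:
                self.dropped += len(batch)     # backend is down: lose data, not the app

    def _run(self) -> None:
        while not self._stop.wait(self._interval):
            self._flush()
        self._flush()                          # final flush on shutdown

    def shutdown(self) -> None:
        self._stop.set()
        self._thread.join()


received = []
exporter = BatchExporter(send=received.append, batch_size=50)
for i in range(120):
    exporter.emit(i)                           # returns immediately
exporter.shutdown()


def dead_backend(batch):
    raise ConnectionError("backend unreachable")


failing = BatchExporter(send=dead_backend)
for i in range(10):
    failing.emit(i)                            # the app never sees the failure
failing.shutdown()

print(sum(len(b) for b in received), failing.dropped)
```

All 120 items reach the healthy backend in batches of at most 50. With the dead backend, the 10 items are counted as dropped and the calling code carries on.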
The click
So here’s where the “fancy logging” model finally dies.
A logging system emits one kind of thing: text events. Observability emits three structurally different kinds of data, and two of them aren’t logs at all. A metric is a live number in RAM, sampled on a timer — not an event. A trace is a set of spans linked by an ID that physically travels between services in request headers — a relationship that exists only because it was written down at the moment it happened.
The analysis layer isn’t the magic. The magic is that the right structure — the histogram buckets, the trace ID, the parent span — exists in the data before anyone asks a question.
That’s also why the debugging workflow works at all: a metric spike tells you something is wrong, a trace from the same time window tells you which span is slow, and the trace_id on the logs tells you exactly why. Metric → trace → log. Three pillars, with one ID stitching the request-level ones together.
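As a toy walk-through of that workflow, suppose the metric has already told us checkout latency spiked, and the backend holds the spans and logs below. Every record here is made up for illustration, and the three helper functions are hypothetical stand-ins for what a tracing UI or query language would do.

```python
# Invented sample data: two checkout requests, one of them slow.
spans = [
    {"trace_id": "abc123", "name": "POST /checkout", "parent": None, "duration_ms": 2140},
    {"trace_id": "abc123", "name": "db.query orders", "parent": "POST /checkout", "duration_ms": 35},
    {"trace_id": "abc123", "name": "payment gateway call", "parent": "POST /checkout", "duration_ms": 2050},
    {"trace_id": "def456", "name": "POST /checkout", "parent": None, "duration_ms": 120},
    {"trace_id": "def456", "name": "payment gateway call", "parent": "POST /checkout", "duration_ms": 80},
]
logs = [
    {"trace_id": "abc123", "message": "payment failed", "user": 42},
    {"trace_id": "def456", "message": "payment ok", "user": 7},
]


def slowest_trace(spans: list) -> str:
    """Step 2a: among root spans, find the request that took longest."""
    roots = [s for s in spans if s["parent"] is None]
    return max(roots, key=lambda s: s["duration_ms"])["trace_id"]


def slowest_child(spans: list, trace_id: str) -> str:
    """Step 2b: inside that trace, find the span eating the time."""
    children = [s for s in spans
                if s["trace_id"] == trace_id and s["parent"] is not None]
    return max(children, key=lambda s: s["duration_ms"])["name"]


def logs_for(logs: list, trace_id: str) -> list:
    """Step 3: pull every log line stamped with the same trace ID."""
    return [entry["message"] for entry in logs if entry["trace_id"] == trace_id]


trace_id = slowest_trace(spans)
culprit = slowest_child(spans, trace_id)
messages = logs_for(logs, trace_id)
print(trace_id, culprit, messages)  # abc123 payment gateway call ['payment failed']
```

None of these lookups is clever. Each one is a filter on a field that was written into the data when it was emitted.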
Not fancy logging. Three emission mechanisms, correlated at birth.
Core definition
Observability is the ability to understand a system’s internal state from the signals it emits — metrics, logs, and traces — well enough to answer questions you didn’t predict in advance. It differs from logging not in how the data is analyzed, but in how it’s created: structured signals and the relationships between them (trace IDs, span hierarchies, histogram buckets) are recorded at emission time, in-band, so they can be correlated later. Logging records events; observability records events, measurements, and request paths — already stitched together.