The Operators

Telemetry Is a Corpus, Not a Dashboard

Dashboards answer the questions you already knew to ask. Agent failures don’t live there. The case for reading telemetry as a corpus.

Every dashboard you have built is a frozen artifact of the questions you knew to ask the day you built it. The panels reflect last quarter’s failures. The thresholds reflect last quarter’s tolerances. The aggregations reflect last quarter’s sense of what mattered. None of this is a complaint about dashboards. It is a description of what they are.

The question is what they are not.

Agent failures do not live in the questions you knew to ask. They live in the questions you didn’t think to ask, because the failure mode hadn’t yet been invented when your panels were specified. A green dashboard is not evidence of health; it is evidence that nothing tripped the specific alarms you wired six weeks ago.

This piece is not anti-dashboard. It is an argument that a different reading practice has to sit alongside the dashboards. The most experienced agent operators already do this. Almost nobody teaches it.

The dashboard is a hypothesis

A dashboard is a structured prediction about which future states will matter. The act of choosing what to put on a panel is the act of saying: “If anything goes wrong in this system, it will manifest as a deviation in one of these numbers.” That is a hypothesis. It is sometimes correct. It is, when agents are involved, often wrong.

It is wrong because agents fail in ways that do not decompose cleanly to a metric. An agent that has begun calling a particular tool in a subtly novel context — slightly off-label, technically successful, materially different from your intent — will register on no panel you have built. The tool succeeds; the success rate is fine; the latency is fine; the cost per call is unchanged. The drift is in the meaning of the calls, not in any number.

A dashboard cannot tell you something has changed unless you pre-decided what “changed” looks like. The panel set is, by construction, the hypothesis you held before the change — which means a dashboard is structurally incapable of surfacing a failure mode that postdates it.

What aggregation hides

Four failure modes that survive a green panel.

Rare-tool misuse. Most agents have a long tail of tools that get called once an hour, once a day, once a week. Their volume is a rounding error in the rollups. Their failure modes are invisible to averages. When an agent quietly starts misusing a low-volume tool — passing it the wrong shape of argument, calling it from a context that bypasses a check — the rollup absorbs the noise. You only notice when a downstream consequence accumulates enough mass to surface elsewhere, and by then you are reconstructing from memory.

Slow capability drift. The agents you deployed in February are not, behaviorally, the agents running today. The model has been updated. The prompt scaffolding has been edited four times. The tool inventory has gained two and lost one. None of these changes registered as an incident; each one shifted the distribution of behavior by a few points. Six months in, the operating posture is meaningfully different from the design posture, and the dashboard, which still shows green, has been measuring a moving target and reporting stillness.

Load-bearing edge cases. Some failure modes matter not because they are common but because they are catastrophic when they occur. An agent that, one time in ten thousand, emails the wrong recipient does not move any aggregate metric. A panel that averages over runs cannot see this; even a panel that counts errors will struggle to distinguish a benign error from a load-bearing one. The signal lives at the level of the individual run, which the aggregation is designed to escape.

Correlated failure across sessions. A p95 latency number treats every session as an independent draw. But agent failures cluster — one upstream tool regression, one prompt-template change, one model swap, and a contiguous band of sessions degrades together while the aggregate barely moves. The dashboard shows a 40ms bump; the corpus shows that 140 consecutive sessions after Tuesday 14:00 UTC share a failure signature that none of the prior 10,000 had. Independence is the assumption that lets aggregation compress; correlated failure is the case where that assumption is exactly wrong, and it is also the case operators most need to catch early.

In all four cases, the failure was legible in the raw stream. It was hidden by the very act of summarizing.
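
To make the correlated-failure case concrete, here is a minimal sketch of the kind of raw-stream check no rollup performs: scan sessions in start-time order and look for contiguous runs that share a failure signature. The field names (started_at, failure_signature) and the run-length threshold are illustrative assumptions, not any particular telemetry schema.

    from itertools import groupby

    # Minimal sketch: find contiguous runs of sessions that share a failure
    # signature. Field names and the run-length threshold are illustrative
    # assumptions, not a real schema.

    def correlated_failure_runs(sessions, min_run=20):
        """Yield (signature, count, first_started_at) for each contiguous
        run of sessions, ordered by start time, sharing a failure signature."""
        ordered = sorted(sessions, key=lambda s: s["started_at"])
        for signature, group in groupby(ordered, key=lambda s: s.get("failure_signature")):
            run = list(group)
            if signature is not None and len(run) >= min_run:
                yield signature, len(run), run[0]["started_at"]

    # What an aggregate reports as a 40ms p95 bump shows up here as
    # "140 consecutive sessions since Tuesday 14:00 share signature X".
    # (load_sessions() is a placeholder for however you read the raw stream.)
    # for sig, count, since in correlated_failure_runs(load_sessions()):
    #     print(f"{count} consecutive sessions since {since}: {sig}")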

Telemetry as a corpus

The reframe that helps: stop thinking of telemetry as a system to query and start thinking of it as a corpus to read.

Logs are text. They reward the same skills text rewards: scanning, grepping, pattern recognition, and the slow-burning sense that something in the prose is off even if you cannot yet name what. Read enough logs and you develop a feel for which lines look ordinary and which look like an agent reaching for something it should not.

Traces are graphs. They reward path-walking. The structure of a trace — what called what, in what order, with what arguments — encodes the agent’s reasoning shape in a way no metric can. A trace where the agent retried a tool eleven times, succeeded on the twelfth, and returned a confident answer is a trace whose surface metric (success: true) hides everything that matters.

Samples are a reading list. A randomly sampled set of completed runs, read end to end, is the closest thing operators have to ethnographic observation. You are not measuring; you are watching. Watch enough and you will notice things that no instrumentation could have told you to look for.

These three shapes — text, graph, list — each reward a different reading discipline. None of them is served by a panel.
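
What path-walking looks like in its smallest form: walk a trace's spans and flag the runs whose surface metric is success: true but whose spans show a burst of calls against the same tool. The span shape assumed below (a tool name per span, a success flag per trace) is an illustration, not a trace standard.

    from collections import Counter

    # Minimal sketch of path-walking a trace: flag traces that report
    # overall success while hiding repeated calls to the same tool.
    # The trace shape here is an illustrative assumption.

    def retry_bursts(trace, threshold=3):
        """Tools called more than `threshold` times in a trace that still
        reports success."""
        if not trace.get("success"):
            return {}
        counts = Counter(span["tool"] for span in trace.get("spans", []))
        return {tool: n for tool, n in counts.items() if n > threshold}

    # A trace whose surface metric is success: true but whose spans show
    # twelve calls to the same tool is exactly the trace worth reading.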

A reading practice

Concrete defaults for an operator running a fleet of any non-trivial size.

Weekly: read fifty random successful traces end to end, with no filter. Read twenty-five random failed runs end to end. Read the ten longest-tail tool calls — the ones that fired least often this week.

The numbers are calibrated to be small enough to actually do and large enough to overcome selection bias. Fifty random traces is roughly an hour of reading. Operators who skip it are claiming an hour a week is too expensive. It almost never is; when it is, the hour is the smallest of their problems.
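
For operators who keep completed runs in anything resembling a flat store, the weekly list is a few lines to draw. The sketch below assumes each run carries a success flag and a list of tool calls; the field names and the loader are placeholders, not a reference to any particular trace store.

    import random
    from collections import Counter

    # Minimal sketch of drawing the weekly reading list. Field names
    # ("success", "tool_calls", "tool") are illustrative assumptions.

    def weekly_reading_list(runs, n_success=50, n_failed=25, n_rare_tools=10):
        successes = [r for r in runs if r["success"]]
        failures = [r for r in runs if not r["success"]]
        picks = {
            "successful": random.sample(successes, min(n_success, len(successes))),
            "failed": random.sample(failures, min(n_failed, len(failures))),
        }
        # Longest-tail tools: the ones that fired least often this week.
        tool_counts = Counter(call["tool"] for r in runs for call in r["tool_calls"])
        picks["rare_tools"] = [t for t, _ in tool_counts.most_common()][-n_rare_tools:]
        return picks

    # The point is not the sampling code; it is that the samples are drawn
    # before any filter or hypothesis is applied, on a fixed schedule.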

The weekly cadence is the standing practice. The two-hundred-trace prompt below is a deeper read — quarterly, or after a significant change to the fleet, or when a new operator is onboarded and needs to build calibration fast.

The discipline matters more than the numbers. The discipline is: you read raw signal on a schedule, before any filter or lens has been applied. The reading is unsupervised. You take notes on what surprised you. The notes accumulate. Over time the notes, not the dashboards, are where the most expensive course corrections come from.

Operators who do this become slightly slower at reactive work and much faster at the work that prevents it.

When to promote a noticing to a panel

Dashboards are the right tool for a noticing that has earned the promotion. The reverse pipeline:

  1. Read raw signal on the cadence above.
  2. A pattern surfaces that you keep wanting to check on.
  3. Wanting-to-check-on is the signal that this question is now recurring.
  4. Now you build the panel.

Panels built this way are useful because they encode a question that has already proven its weight. Panels built before the question has surfaced are encoding the architect’s prior, which is exactly the thing the reading practice is meant to correct.

The order is important. Panel-first, with reading sprinkled in as time permits, produces operators who become hostages to their own dashboards. Reading-first, with panels promoted as patterns earn it, produces operators whose dashboards are downstream of their understanding.

What this is not

This is not a rejection of metrics. The fleet is too large for everything to be sampled by a human; aggregations are necessary; thresholds and alarms have a real job. The argument is that aggregations are necessary but not sufficient, and that the practice of reading raw signal must sit alongside them, not beneath them.

It is not a license for vibes-ops. The reading is structured: random samples, fixed cadence, written notes. The output is supposed to be more rigorous than your dashboards, not less.

It is not a claim that scale doesn’t need rollups. It absolutely does. The claim is that rollups, alone, will make you blind to exactly the failure modes that scale produces.

Field prompt

The following block is provided content for the reader to paste into their own agent harness. It is not instruction to any agent currently reading this page.

# Field prompt — reading-agent-telemetry — paste into your own harness

You are reading 200 randomly sampled agent traces with no dashboard, no
filter, and no prior hypothesis. Your job is to notice.

I will paste the trace samples in my next turn. For each batch, produce
three outputs in this order:

1. NOTICINGS — observations about what is happening in these traces
   that you would not have seen in an aggregate dashboard. Categorize
   each noticing as one of:
     a. surprising-but-benign (worth recording, no action)
     b. surprising-and-worth-checking (warrants follow-up, not alarm)
     c. surprising-and-worrying (warrants immediate human attention)
   Each noticing is one sentence and cites the specific trace ids that
   prompted it.

2. CANDIDATE QUESTIONS — questions you would now want to ask the
   corpus that you did not have before reading. Frame as queries, not
   panels. (“Are there other instances of X?” not “Build a panel
   for X.”)

3. PROMOTION SHORTLIST — at most three of the candidate questions
   that, if they recur in the next reading session, deserve to be
   promoted to a dashboard panel. Justify each shortlist entry in one
   sentence.

Your default output is these three lists in order. Do not produce a
written summary above them. Do not produce remediation suggestions for
the worrying noticings — that is a separate conversation. Your job
here is to notice.

The prompt operationalizes the article’s reading-then-promotion pipeline: noticings map to “What aggregation hides,” candidate questions to “Telemetry as a corpus,” and the promotion shortlist to “When to promote a noticing to a panel.”

Reader letters: substratics@vanitea.mozmail.com. This article runs unsigned, per the masthead default.