What is an agent run?

A run is one delegated piece of work an agent attempts end to end — receiving a task, planning, calling tools, and producing a result. It is distinct from a session, which is the human-facing interaction wrapping it. A single session can contain many runs, and the session can look perfectly healthy while every run inside it fails.

Why are session metrics misleading for AI agents?

Because traditional product analytics was built for software where the human does the work step by step, so clicks and time-on-task proxy for value. With agents the human delegates and leaves. Activity metrics then measure that someone asked, not that anything useful came back. You can have rising engagement and falling delivered value at the same time, and a session dashboard will show only the first.

What metrics should you track for AI agents in production?

Six operational signals cover most needs — tokens, cost, latency, errors, tool calls, and output quality. On top of those, track task completion across repeated attempts rather than single runs, tool-call success rate, correction rate, and human acceptance. Evaluate the full execution trajectory, not just the final output, because intermediate reasoning and tool steps fail independently of the end result.

How reliable are AI agents in production?

Less than single-run benchmarks suggest. Enterprise data shows agents achieving around 60% success on a single run dropping to roughly 25% across eight runs — reliability degrades sharply under repetition, which is exactly the condition production imposes. This is why consistency across runs matters more than peak performance on one.

Agent product analytics: why the run matters more than the session

An agent dashboard can be entirely green while the work it was supposed to do is failing.

That is not a monitoring bug. It is a category error: measuring the conversation instead of the delegation. And it is the single most common reason teams are surprised when an agent pilot that looked healthy turns out to have delivered nothing.

The number that reframes the problem

Enterprise evaluation data shows agents achieving roughly 60% success on a single run, falling to about 25% across eight runs.

Sit with that gap. It means a system that looks acceptable when you try it once is unreliable when it runs repeatedly — which is the only mode that matters in production. It also means that a demo, which is a single run performed under favourable conditions, carries almost no information about production behaviour.

This is why 78% of enterprises have agent pilots running while fewer than 15% have reached production scale. The pilots are not lying, exactly. They are measuring the wrong number of attempts.

Session, run, trust

Three layers, and they fail independently.

The session is the visible interaction surface — someone opened the thing and asked for something. This is what conventional analytics captures, and on its own it tells you almost nothing.

The run is the delegated work: the agent receiving a task, planning an approach, calling tools, and producing a result. This is where value is either created or lost.

Trust is whether a human accepted the output and would delegate more next time. This is the only layer that predicts whether the system survives contact with the organisation.

Read in combination they diagnose cleanly:

Busy sessions, weak runs → the system is decorative
Completed runs, no trust → the system works and nobody uses it
Trusted, repeatable runs → this is where leverage starts

Most instrumentation captures the first layer, occasionally the second, and almost never the third — which is unfortunate, because the third is the one that determines renewal.

Why traditional analytics does not transfer

Product analytics assumes a human doing work step by step. Page views, funnels, and time-on-task are reasonable proxies for value under that assumption, because the human's attention is the work.

Agentic software breaks the assumption. The user hands the work over and leaves. Their attention is no longer a proxy for anything — if anything, a user spending a long time in an agent session is a bad sign, because it usually means they are correcting it.

The events that carry signal are different in kind:

The handoff — what was actually delegated, in what form
Tool calls — which were attempted, which succeeded, which silently returned nothing useful
Boundaries hit — where the agent stopped, asked, or should have and didn't
Corrections — how often a human intervened, and how substantially
Acceptance — whether the output was used as-is, edited, or discarded

That last one is the highest-value and least-instrumented metric in most deployments. "Discarded silently" and "accepted gratefully" produce identical session telemetry.

What to instrument

Six operational signals cover the baseline: tokens, cost, latency, errors, tool calls, and output quality. These are table stakes, and most teams get here.

The layer that separates a working system from a hopeful one:

Completion across repeated attempts. Not "did it work" but "how often, over how many tries, on the same class of task". Given the 60%-to-25% degradation, single-run success is close to meaningless as a production signal.

Trajectory, not just outcome. Intermediate tool calls, reasoning steps, and execution order all fail independently of the final result. An agent that reaches a correct answer through a broken path will produce a wrong one as soon as inputs shift. Evaluating only the output means you cannot see that coming.

Planning quality. Whether the agent forms a sensible plan before acting has become a central evaluation concern in 2026, because bad plans executed competently are the hardest failures to detect downstream.

Correction rate over time. The direction matters more than the level. Rising corrections against a stable task mix usually means drift — in the data, the process, or the underlying model.

The commercial test

Instrumentation is worth having only if it answers a business question. Five that are worth asking of any agentic workflow:

Does it reduce repetitive manual work, measurably?
Does it shorten the time to complete a real task?
Does it reduce error, rework, or escalation?
Does a human trust the result enough to reuse it without checking everything?
Can it be connected to customer value, operational leverage, or revenue?

A workflow that fails all five may still be technically impressive. It is not yet worth operating.

Workflows or agents

A distinction worth being disciplined about, because it determines what you should measure.

Anthropic draws the line clearly: "Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents ... dynamically direct their own processes and tool usage." OpenAI's framing is complementary — "Agents are systems that independently accomplish tasks on your behalf."

The operational implication:

Predictable path, control matters → build a workflow. Measure step success and throughput.
Ambiguous, variable, context-dependent work → consider an agent. Measure trajectory, corrections, and trust.

Most teams choose the tool before characterising the work, which is how you end up with a dynamically-reasoning agent solving a problem that wanted a for loop — expensive, slower, and harder to debug than the thing it replaced.

Where this bites in practice

The reason this is not an academic concern: agent reliability degrades under exactly the conditions production creates, and session dashboards are structurally incapable of showing it. A team can watch engagement climb for a quarter while delivered value falls, and have no instrument that contradicts the story.

Instrumenting runs and trust is what makes the failure visible early enough to act — which is the same reason pilots stall when nobody scoped the day-two work. We build this measurement into implementation rather than adding it after launch, because retrofitted observability tends to capture whatever was easy to capture rather than what mattered.

Measure the run. Prove the trust. Connect both to value. In agentic systems, steering is most of the job — and you cannot steer against a metric that cannot see the work.