TL;DR
- Agent tracing extends distributed tracing (OpenTelemetry) to LLM workflows — each LLM call, tool invocation, and retrieval becomes a span with inputs, outputs, timings, and token counts.
- The OpenTelemetry GenAI semantic conventions define vendor-neutral attribute names for LLM spans, and OpenLLMetry provides instrumentation that emits them. Any backend that accepts OTLP can ingest the resulting spans.
- Hosted options include LangSmith (LangChain), Braintrust, Phoenix (Arize), Langfuse (open source, self-hostable), Helicone, and traditional APM vendors with LLM extensions (Datadog, Honeycomb, New Relic).
- Without tracing, agent debugging is guess-and-print. With tracing, you can replay any production run, slice latency by span, and feed real traces into evaluation datasets.
Why Agent Tracing Differs
Classical application tracing follows a request through services. Agent tracing follows a request through a non-deterministic chain of model calls and tools where the topology itself is decided at runtime. A single user message can produce three tool calls, a model call, two more tool calls, and a final response — and the next identical message can produce a different shape.
This makes naive logging useless. You need structured spans, parent-child relationships, and rich span attributes (model name, prompt tokens, completion tokens, tool name, tool arguments) to make sense of what happened.
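To make the idea of structured spans with parent-child links concrete, here is a minimal standard-library sketch. The `Span` and `Recorder` classes are hypothetical illustrations, not the OpenTelemetry SDK, and the span names and attribute keys are placeholders chosen for this example.

```python
from __future__ import annotations

import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    """One unit of agent work (hypothetical model, not the OpenTelemetry SDK)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: str | None
    attributes: dict[str, Any] = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class Recorder:
    """Collects spans and tracks which span is currently open."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent of the new one.
        parent_id = self._stack[-1].span_id if self._stack else None
        s = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent_id, dict(attributes))
        s.start = time.perf_counter()
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.perf_counter()
            self._stack.pop()


def render(spans: list[Span], parent_id: str | None = None, depth: int = 0) -> list[str]:
    """Rebuild the trace tree from parent links, one indented line per span."""
    lines: list[str] = []
    for s in spans:
        if s.parent_id == parent_id:
            lines.append("  " * depth + s.name)
            lines.extend(render(spans, s.span_id, depth + 1))
    return lines


rec = Recorder()
with rec.span("agent.run", user_message="What is the weather in Leeds?"):
    with rec.span("model.completion", model="example-model", prompt_tokens=42, completion_tokens=11):
        pass
    with rec.span("tool.execute", tool_name="get_weather", tool_arguments={"city": "Leeds"}):
        pass
    with rec.span("model.completion", model="example-model", prompt_tokens=80, completion_tokens=25):
        pass

tree = render(rec.spans)
print("\n".join(tree))
```

Because the tree is rebuilt purely from `parent_id`, the same message can yield a different shape on the next run and still be rendered without any change to the code.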
OpenTelemetry GenAI
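The OpenTelemetry GenAI semantic conventions, developed through 2024-2025 and still marked as in development rather than stable, define vendor-neutral attribute names for LLM operations: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.operation.name`, and so on. Because the conventions are still evolving, attribute names can change between releases, so pin the version you instrument against. Spans with these attributes can be ingested by any OTLP-compatible backend.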
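As a rough illustration of how call metadata maps onto those attribute names, the sketch below builds the attribute dictionary by hand. The helper `genai_attributes` is hypothetical and the values are example data. With the OpenTelemetry Python API, each key-value pair would typically be attached with `span.set_attribute(key, value)`.

```python
from __future__ import annotations

from typing import Any


def genai_attributes(
    system: str,
    operation: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
) -> dict[str, Any]:
    """Map call metadata onto the GenAI attribute names listed above."""
    return {
        "gen_ai.system": system,
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


# Example values only; substitute whatever your provider actually reports.
attrs = genai_attributes("openai", "chat", "gpt-4o-mini", input_tokens=42, output_tokens=11)
for key, value in attrs.items():
    print(f"{key} = {value}")
```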
OpenLLMetry, maintained by Traceloop, ships auto-instrumentation libraries for Python and TypeScript that emit GenAI-style spans for OpenAI, Anthropic, Cohere, LangChain, LlamaIndex, and others. Install it, point it at an OTLP endpoint, and a single initialization call gives you agent tracing without changes to your existing call sites.
```python
# Auto-instrument every LLM call with one line.
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my-agent")

# Now every call into openai, anthropic, langchain, llamaindex
# emits OTel GenAI-compliant spans to your configured backend.
```

What Belongs in a Span
- Operation name — model.completion, tool.execute, retriever.search.
- Model identifiers — provider, model name, model version.
- Token usage — input, output, cached, reasoning (when applicable).
- Latency components — time to first token, total duration, queueing time.
- Inputs and outputs — prompt, completion, tool arguments, tool result. Subject to PII redaction.
- Error data — error type, message, retry count.
- Cost — derived from token counts and the model's per-token rate.
- Parent-child links — so a trace tree can be reconstructed across services.
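The fields above can be sketched as a single record type, with cost computed from token counts rather than stored. This is an illustrative layout only: `LLMSpanRecord`, the field names, and the prices in `PRICE_PER_MILLION` are all assumptions made for the example, not values from any provider or standard.

```python
from __future__ import annotations

from dataclasses import dataclass
from decimal import Decimal

# Hypothetical USD prices per one million tokens. Real rates differ by
# provider and change over time, so load them from configuration.
PRICE_PER_MILLION = {
    "example-model": {"input": Decimal("0.50"), "output": Decimal("1.50")},
}


@dataclass
class LLMSpanRecord:
    operation: str                # e.g. "model.completion"
    model: str
    input_tokens: int
    output_tokens: int
    time_to_first_token_s: float
    duration_s: float
    error_type: str | None = None
    retry_count: int = 0

    @property
    def cost_usd(self) -> Decimal:
        """Cost derived from token counts and the model's per-token rate."""
        rates = PRICE_PER_MILLION[self.model]
        total = rates["input"] * self.input_tokens + rates["output"] * self.output_tokens
        return total / Decimal(1_000_000)


record = LLMSpanRecord(
    operation="model.completion",
    model="example-model",
    input_tokens=12_000,
    output_tokens=2_000,
    time_to_first_token_s=0.4,
    duration_s=2.1,
)
print(record.cost_usd)
```

Using `Decimal` avoids floating-point drift when many small per-span costs are summed into a per-trace or per-customer total.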
Hosted Observability Options
| Platform | Strengths | Notes |
|---|---|---|
| LangSmith | Deep LangChain/LangGraph integration | Hosted + self-hosted enterprise |
| Braintrust | Experiment comparison + evals | Strong dataset workflows |
| Phoenix (Arize) | Open source, strong instrumentation | OSS or hosted SaaS |
| Langfuse | Open source, self-hostable | Self-hosting keeps trace data in your own infrastructure |
| Helicone | Proxy-based, very low friction | One header to enable |
| Datadog / Honeycomb / New Relic | Unified APM + LLM | Best when you already use them |
PII and Redaction
Prompt and completion contents often contain user PII or confidential data. Plan redaction from day one. Options include: client-side regex/Presidio redaction before emit, server-side redaction at the collector, hashing for correlation without storage, and selective sampling (full prompts on errors, hashed prompts on success).
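The sketch below combines three of those options: client-side regex redaction, hashing for correlation, and selective sampling. It is a simplified illustration. The two patterns catch only obvious emails and phone numbers, and the attribute keys `prompt.text` and `prompt.hmac_sha256` are hypothetical names. A keyed HMAC is used instead of a bare hash so that short or guessable prompts cannot be recovered by hashing candidate strings.

```python
from __future__ import annotations

import hashlib
import hmac
import re

# Deliberately simple patterns; production redaction needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like digit runs with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def fingerprint(text: str, key: bytes) -> str:
    """Keyed hash: equal prompts correlate, but the content is not stored."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def prompt_attributes(prompt: str, is_error: bool, key: bytes) -> dict[str, str]:
    """Selective sampling: redacted text on errors, fingerprint on success."""
    if is_error:
        return {"prompt.text": redact(prompt)}
    return {"prompt.hmac_sha256": fingerprint(prompt, key)}


KEY = b"replace-with-a-managed-secret"  # placeholder; keep real keys out of source
prompt = "Email jane.doe@example.com or call +44 20 7946 0958"

print(prompt_attributes(prompt, is_error=True, key=KEY))
print(prompt_attributes(prompt, is_error=False, key=KEY))
```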
Treat your trace store as a data system subject to your regulatory regime. Under UK GDPR, an unredacted trace containing customer data is a customer-data store, with retention, access, and right-to-erasure obligations attached.
From Traces to Evaluations
The highest-value use of tracing is closing the loop into evaluation. Sample production traces by some criterion (errors, low confidence, user thumbs-down, random), promote them to a dataset, annotate the desired output, and run them through your eval framework. This turns production into a feedback machine — every regression has a chance of producing the test case that catches the next one.
LangSmith, Braintrust, Phoenix, and Langfuse all support promoting traces directly into eval datasets. If your stack does not, build the path manually — it pays back disproportionately.
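If you do build the path manually, the selection step can be quite small. The following sketch assumes a hypothetical `Trace` record exported from your trace store and produces `EvalCase` rows awaiting annotation. The field names, the feedback encoding, and the sampling rate are illustrative assumptions, not any platform's schema.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    input: str
    output: str
    error: bool = False
    user_feedback: int | None = None   # assumed encoding: -1 thumbs-down, +1 thumbs-up


@dataclass
class EvalCase:
    source_trace_id: str
    input: str
    observed_output: str
    reason: str
    expected_output: str | None = None   # filled in later by an annotator


def select_for_eval(traces: list[Trace], random_rate: float = 0.05, seed: int = 0) -> list[EvalCase]:
    """Promote errors, thumbs-down runs, and a random slice into eval cases."""
    rng = random.Random(seed)
    cases: list[EvalCase] = []
    for t in traces:
        if t.error:
            reason = "error"
        elif t.user_feedback == -1:
            reason = "thumbs_down"
        elif rng.random() < random_rate:
            reason = "random"
        else:
            continue
        cases.append(EvalCase(t.trace_id, t.input, t.output, reason))
    return cases


traces = [
    Trace("t1", "Refund order 123", "Tool call failed", error=True),
    Trace("t2", "Summarise this email", "A vague summary", user_feedback=-1),
    Trace("t3", "What time is it in Tokyo?", "It is 09:00 in Tokyo."),
]

# Random sampling is disabled here so the selection is deterministic.
cases = select_for_eval(traces, random_rate=0.0)
for case in cases:
    print(case.source_trace_id, case.reason)
```

Recording the `reason` alongside each case makes it possible to report eval results separately for known failures and for the random baseline.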
References
- OpenTelemetry GenAI Semantic Conventions · OpenTelemetry
- OpenLLMetry on GitHub · GitHub (Traceloop)
- Langfuse on GitHub · GitHub