TL;DR
- Open-source (ELv2) observability and evaluation platform from Arize AI, first released in 2023. Runs locally as a Python package, as a self-hosted Docker container, or as managed cloud.
- Built on OpenTelemetry and the OpenInference semantic conventions Arize co-authored. Any OTel-instrumented LLM application can send traces to Phoenix without a Phoenix-specific SDK.
- Four pillars: LLM tracing (RAG, agents, multi-step workflows), evaluation (LLM-as-judge, code-based, dataset runs), dataset and experiment management, and retrieval debugging via embedding visualisations.
- Arize's wider commercial product covers traditional ML observability (drift, performance, fairness); Phoenix is the open-source counterpart focused on LLM workflows.
What Phoenix Offers#
Phoenix's design centre is the iteration loop: trace a real production interaction, capture it to a dataset, evaluate variants of the application against that dataset, ship the winning version, and repeat. Each step has a first-class UI: a trace browser with a span tree, a dataset and experiment manager, an evaluation runner with built-in templates (Q&A correctness, hallucination, retrieval relevance), and a comparison view across experiments.
It is also the strongest of the open-source LLM observability tools for retrieval workloads. The embedding-projection view plots query and document embeddings in 2D/3D (UMAP, t-SNE), making it possible to see clusters of failing queries and the documents they pull. For RAG debugging, this is qualitatively different from reading individual traces.
OpenInference Conventions#
OpenInference is a set of OpenTelemetry semantic conventions for LLM workloads — standard attribute names for prompts, completions, model identifiers, token counts, embedding inputs and outputs, retrieval contexts, and tool calls. Arize co-authored the spec and Phoenix consumes it natively. The same instrumentation works with Langfuse, Helicone, and any other OpenInference-compatible backend.
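To make the idea of standard attribute names concrete, here is a minimal sketch of the flat attribute set an LLM span might carry. The helper `llm_span_attributes` is hypothetical, and the attribute keys are written from memory of the OpenInference spec, so treat them as assumptions and check them against the published conventions before relying on them.

```python
import json


def llm_span_attributes(model, prompt, completion, prompt_tokens, completion_tokens):
    """Build a flat attribute dict for one LLM call (hypothetical helper).

    The key names follow OpenInference-style naming as an assumption;
    verify them against the spec.
    """
    return {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "input.value": prompt,
        "output.value": completion,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


attrs = llm_span_attributes(
    model="example-model",  # placeholder model identifier
    prompt="What is OTLP?",
    completion="A wire protocol for telemetry.",
    prompt_tokens=12,
    completion_tokens=9,
)
print(json.dumps(attrs, indent=2))
```

Because the attributes are plain key-value pairs on an OpenTelemetry span, any backend that understands the same names can render the span without knowing which instrumentor produced it.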
Auto-instrumentors exist for OpenAI, Anthropic, LangChain, LlamaIndex, DSPy, Haystack, AWS Bedrock, Vertex AI, MistralAI, Groq, and others. Instrument once, route the OTLP traces to Phoenix.
```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Register Phoenix as the OTel endpoint (local or self-hosted)
tracer_provider = register(endpoint="http://phoenix:6006/v1/traces")

# Auto-instrument the OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Every openai.chat.completions.create() call now emits an OTel span
# tagged with prompt, completion, model, latency, tokens, visible in Phoenix.
```
Evaluation Workflow#
Phoenix ships evaluation templates for common LLM judging tasks: Q&A correctness, hallucination detection, retrieval relevance, summarisation quality, and toxicity. Each template is a prompt that an evaluator LLM uses to grade outputs, with calibrated few-shot examples baked in. A custom evaluator is a Python function that returns a label and an explanation.
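As an illustration of the "label plus explanation" shape, here is a minimal code-based evaluator sketch. The names `EvalResult` and `contains_citation` are hypothetical and the signature is not Phoenix's actual evaluator interface; the point is only the return contract described above.

```python
import re
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical container for an evaluator's verdict."""
    label: str
    explanation: str


def contains_citation(output: str) -> EvalResult:
    """Code-based check: does the answer cite at least one source like [1]?"""
    match = re.search(r"\[\d+\]", output)
    if match:
        return EvalResult("cited", f"Found citation marker {match.group(0)}.")
    return EvalResult("uncited", "No bracketed citation marker found.")


print(contains_citation("Paris is the capital of France [1]."))
print(contains_citation("Paris is the capital of France."))
```

A deterministic check like this is cheap to run on every row, which makes it a useful complement to LLM-as-judge templates for properties that can be tested mechanically.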
The typical loop: collect a representative dataset from production traces, define one or more evaluators, run an experiment against the dataset (current app version vs. proposed version), inspect failures in the trace browser, iterate.
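The loop above can be sketched in plain Python without any Phoenix APIs. Everything here is hypothetical (the toy dataset, `app_v1`, `app_v2`, `exact_match`, `run_experiment`); it only shows the shape of running two application versions against one dataset and comparing the results.

```python
# Toy dataset: in practice these rows would come from production traces.
dataset = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "capital of Japan?", "expected": "Tokyo"},
    {"input": "capital of Kenya?", "expected": "Nairobi"},
]

# Stand-ins for the current and the proposed application versions.
V1_ANSWERS = {"capital of France?": "Paris", "capital of Japan?": "Kyoto"}
V2_ANSWERS = {"capital of France?": "Paris", "capital of Japan?": "Tokyo",
              "capital of Kenya?": "Nairobi"}


def app_v1(question):
    return V1_ANSWERS.get(question, "I don't know")


def app_v2(question):
    return V2_ANSWERS.get(question, "I don't know")


def exact_match(output, expected):
    """Evaluator returning a label and an explanation."""
    ok = output.strip().lower() == expected.strip().lower()
    label = "correct" if ok else "incorrect"
    return {"label": label, "explanation": f"got {output!r}, expected {expected!r}"}


def run_experiment(app, rows, evaluator):
    """Run the app on every row and attach the evaluator's verdict."""
    results = []
    for row in rows:
        output = app(row["input"])
        results.append({**row, "output": output, **evaluator(output, row["expected"])})
    return results


def accuracy(results):
    return sum(r["label"] == "correct" for r in results) / len(results)


baseline = run_experiment(app_v1, dataset, exact_match)
candidate = run_experiment(app_v2, dataset, exact_match)
failures = [r for r in baseline if r["label"] == "incorrect"]

print(f"v1 accuracy: {accuracy(baseline):.2f}, v2 accuracy: {accuracy(candidate):.2f}")
for row in failures:
    print("v1 failure:", row["input"], "->", row["explanation"])
```

Keeping the dataset fixed while swapping the application is what makes the two accuracy figures comparable; the failure rows are the ones worth opening in the trace browser.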
Build the eval dataset from real production failures, not synthetic examples. Phoenix makes it one click to convert a low-rated trace into a dataset row — use it.
Retrieval Debugging#
Phoenix's embedding view loads query and document embeddings, projects them with UMAP, and lets you brush over clusters to inspect their members. For a RAG system this answers questions like: which queries are landing in regions of embedding space with no relevant documents? Do retrieved chunks cluster around obvious topics while the failing queries scatter across many small clusters? Where do hallucinations correlate with retrieval shape?
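The first of those questions can be approximated numerically, without any projection. The sketch below is not how Phoenix's embedding view works; it swaps the UMAP visualisation for a plain nearest-document cosine similarity check, using made-up three-dimensional vectors, a hypothetical `uncovered_queries` helper, and an arbitrary threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def uncovered_queries(queries, documents, threshold=0.8):
    """Flag queries whose most similar document falls below the threshold."""
    flagged = []
    for query_id, query_vec in queries.items():
        best_doc, best_sim = max(
            ((doc_id, cosine(query_vec, doc_vec)) for doc_id, doc_vec in documents.items()),
            key=lambda pair: pair[1],
        )
        if best_sim < threshold:
            flagged.append((query_id, best_doc, round(best_sim, 3)))
    return flagged


# Toy embeddings; real ones would come from the embedding model in use.
documents = {
    "billing-faq": [1.0, 0.0, 0.0],
    "login-help": [0.0, 1.0, 0.0],
}
queries = {
    "refund policy": [0.9, 0.1, 0.0],
    "reset password": [0.1, 0.95, 0.0],
    "api rate limits": [0.1, 0.1, 0.98],
}

for query_id, nearest_doc, similarity in uncovered_queries(queries, documents):
    print(f"{query_id!r} has no close document (nearest: {nearest_doc}, sim={similarity})")
```

A check like this finds isolated queries one at a time; the projected view adds the ability to see whether those queries form a cluster, which usually points to a missing topic in the corpus rather than a one-off miss.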
This view is the strongest argument for Phoenix on retrieval-heavy systems. The other LLM observability tools surface individual trace details well; Phoenix surfaces dataset-level structure.
Deployment Modes#
- Local Python: `pip install arize-phoenix`, then `import phoenix as px; px.launch_app()` in Python, spins up a local UI ideal for notebook development.
- Self-hosted Docker: a single container backed by PostgreSQL, suitable for team-shared on-prem deployment.
- Phoenix Cloud: managed multi-tenant hosting.
- Arize AX: the commercial product, which adds production ML observability, SSO, RBAC, and SLA support on top.
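Whichever mode is used, the application only needs to know where to send OTLP traces. Here is a small sketch of resolving that endpoint, assuming the collector base URL is supplied explicitly or through a `PHOENIX_COLLECTOR_ENDPOINT` environment variable, with `http://localhost:6006` as a local default. The helper `traces_endpoint`, the variable name, and the default are assumptions to verify against the Phoenix documentation.

```python
import os

# Assumed default for a locally launched Phoenix instance.
DEFAULT_BASE_URL = "http://localhost:6006"


def traces_endpoint(base_url=None):
    """Return the OTLP/HTTP traces URL for a Phoenix base URL (hypothetical helper).

    Resolution order: explicit argument, then the PHOENIX_COLLECTOR_ENDPOINT
    environment variable (assumed name), then the local default.
    """
    base = base_url or os.environ.get("PHOENIX_COLLECTOR_ENDPOINT") or DEFAULT_BASE_URL
    base = base.rstrip("/")
    # Avoid doubling the path if the caller already passed the full traces URL.
    return base if base.endswith("/v1/traces") else base + "/v1/traces"


# Self-hosted container reachable as "phoenix" on the internal network.
print(traces_endpoint("http://phoenix:6006"))
```

Centralising the lookup in one function lets the same application code move from a notebook to a shared deployment by changing configuration only.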
Licensing#
Phoenix is released under the Elastic License v2 (ELv2), the same licence Elastic adopted for Elasticsearch in 2021. It permits self-hosting, modification, and integration into commercial products, but prohibits offering Phoenix to third parties as a hosted or managed service. ELv2 is source-available rather than OSI-approved open source, though for internal use it is functionally equivalent to an open-source licence.