Phoenix (Arize AI) is an open-source ML observability and LLM evaluation platform that combines embedding visualization, RAG retrieval analysis, and LLM tracing — enabling data scientists and ML engineers to diagnose why their AI systems are failing by visualizing high-dimensional data, analyzing retrieval quality, and tracing complex multi-step LLM pipelines in a unified interface.
What Is Phoenix?
- Definition: An open-source observability tool from Arize AI that runs locally or in cloud environments, providing interactive visualization of embeddings, traces of LLM pipeline executions, and evaluation frameworks for assessing RAG quality, hallucination, and response correctness.
- Embedding Visualization: Projects high-dimensional embedding vectors (sentence embeddings, document embeddings, image embeddings) into 3D UMAP space — enabling visual inspection of clustering, drift, and retrieval quality that are invisible in tabular metrics.
- RAG Debugging: Shows why a RAG retriever missed a relevant document — by visualizing query and document embeddings together, you can see when a user's query embedding is far from the relevant document's embedding, diagnosing semantic mismatch before trying prompt fixes.
- LLM Tracing: Full OpenTelemetry-compatible tracing for LangChain, LlamaIndex, OpenAI, and Anthropic — captures every step of a multi-agent or RAG pipeline with inputs, outputs, latency, and token counts.
- Evals Framework: Pre-built evaluation templates for hallucination detection, relevance scoring, toxicity, and Q&A correctness — run as batch evaluations over production traces or experiment datasets.
Why Phoenix Matters
- Visual Debugging: Metrics like "retrieval accuracy 78%" don't tell you why 22% of queries fail. Phoenix's embedding visualization shows you — query embeddings that cluster away from your document corpus reveal gaps in your knowledge base or chunking strategy.
- Drift Detection: Compare embedding distributions between a baseline (when the system worked well) and current production — visual drift in the UMAP projection indicates distribution shift before it shows up as metric degradation.
- RAG Quality Assessment: Phoenix provides the RAG Triad metrics (context relevance, groundedness, answer relevance) out of the box — quantify retrieval and generation quality separately to identify which component needs improvement.
- Open Source + Arize Ecosystem: Phoenix runs fully open-source locally, and traces can optionally be exported to Arize's commercial platform for enterprise-scale observability — giving teams a migration path from experimentation to production.
- Model-Agnostic: Works with any embedding model (OpenAI, Cohere, sentence-transformers, custom models) and any LLM provider — not tied to a specific vendor's ecosystem.
Core Phoenix Capabilities
Embedding Analysis:
- UMAP projection of query and document embeddings in 3D interactive space.
- Color by metadata (topic, user segment, timestamp) to identify patterns.
- Click any point to inspect the underlying text and its nearest neighbors.
- Compare two embedding snapshots to visualize distribution shift (see the sketch just below).
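A minimal sketch of loading two snapshots for that comparison, assuming the current `px.Inferences` API (older releases call it `px.Dataset`); `baseline_df` and `production_df` are hypothetical pandas DataFrames with an embedding vector, the raw text, and an id per row:
```python
import phoenix as px

# One schema describes both snapshots: an id, plus a named embedding feature
# built from the vector column and the raw text it was computed from.
schema = px.Schema(
    prediction_id_column_name="id",
    embedding_feature_column_names={
        "text_embedding": px.EmbeddingColumnNames(
            vector_column_name="embedding",
            raw_data_column_name="text",
        )
    },
)

# primary = current production traffic, reference = the baseline that worked well.
primary = px.Inferences(dataframe=production_df, schema=schema, name="production")
reference = px.Inferences(dataframe=baseline_df, schema=schema, name="baseline")

# In the UI, select "text_embedding" to see both snapshots overlaid in one UMAP
# projection; clusters that appear only in production indicate drift.
px.launch_app(primary=primary, reference=reference)
```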
LLM Tracing:
```python
from phoenix.otel import register

# Register an OpenTelemetry tracer provider pointed at the local Phoenix server.
tracer_provider = register(project_name="my-rag-app")
# LangChain, LlamaIndex, and OpenAI calls are traced once the corresponding
# instrumentor is enabled (see the sketch below).
```
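One caveat on the comment above: traces from a given framework only flow once the matching OpenInference instrumentor is installed and activated (newer Phoenix versions can also pass an `auto_instrument=True` flag to `register`). A sketch for LangChain, assuming the `openinference-instrumentation-langchain` package is available:
```python
# Requires: pip install openinference-instrumentation-langchain
from openinference.instrumentation.langchain import LangChainInstrumentor

# Hook LangChain's callback system into the tracer provider from register().
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, chains, retrievers, and LLM calls made through LangChain are
# exported to the "my-rag-app" project with inputs, outputs, latency, and token counts.
```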
Evaluation Framework:
```python
from phoenix.evals import HallucinationEvaluator, OpenAIModel

model = OpenAIModel(model="gpt-4o")
evaluator = HallucinationEvaluator(model)

# Evaluate a single record: the hallucination template expects the user's input,
# the model's output, and the retrieved context as the reference.
label, score, explanation = evaluator.evaluate(
    {
        "input": user_query,
        "output": response_text,
        "reference": retrieved_context,
    }
)
# label is "hallucinated" or "factual"; explanation holds the judge's reasoning when requested.
```
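The same checks can run in batch over production traces or experiment datasets, as noted above. A hedged sketch of that path using the prompt-template helpers shipped in `phoenix.evals`; the example rows are invented, and the column names simply match what the hallucination template expects:
```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Hypothetical eval set: one row per request, with the query ("input"), the
# retrieved context ("reference"), and the generated answer ("output").
eval_df = pd.DataFrame(
    {
        "input": ["What is the refund window?"],
        "reference": ["Refunds are accepted within 30 days of purchase."],
        "output": ["You can get a refund within 30 days."],
    }
)

results = llm_classify(
    dataframe=eval_df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),  # ["hallucinated", "factual"]
    provide_explanation=True,
)
# results is a DataFrame with one label (and explanation) per input row.
```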
RAG Retrieval Debugging Workflow
1. Ingest embeddings: Send query and document embeddings to Phoenix during evaluation runs (a sketch follows after this list).
2. Identify failing queries: Filter by low quality scores or user complaints.
3. Visualize in UMAP: Select the failing queries — if they cluster far from the relevant documents, the retriever is failing semantically.
4. Diagnose root cause: Too-large chunks? Wrong embedding model? Missing content in the knowledge base?
5. Validate fix: Re-run after the fix — embedding clusters should converge.
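Step 1 as code: a rough sketch that reuses the query-side `prompt_column_names` schema shown under Getting Started and adds the knowledge base as a corpus dataset, so queries and document chunks share one embedding view. `query_df` and `corpus_df` are hypothetical DataFrames, and the corpus-side schema fields follow Phoenix's retrieval-analysis setup but may vary by version:
```python
import phoenix as px

# Queries under evaluation: one row per query, with its embedding and raw text.
query_schema = px.Schema(
    prediction_id_column_name="id",
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="query_embedding",
        raw_data_column_name="query_text",
    ),
)
query_ds = px.Inferences(dataframe=query_df, schema=query_schema, name="queries")

# Knowledge base: one row per chunk, with its embedding and raw text.
corpus_schema = px.Schema(
    id_column_name="chunk_id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="chunk_embedding",
        raw_data_column_name="chunk_text",
    ),
)
corpus_ds = px.Inferences(dataframe=corpus_df, schema=corpus_schema, name="knowledge_base")

# Queries that land far from every document cluster in the UMAP view point to
# semantic retrieval failures (steps 2-3 above).
px.launch_app(primary=query_ds, corpus=corpus_ds)
```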
Phoenix vs Alternatives
| Feature | Phoenix | Langfuse | Weights & Biases | Arize (Commercial) |
|---------|---------|---------|-----------------|-------------------|
| Embedding visualization | Excellent | No | Good | Excellent |
| RAG debugging | Excellent | Good | Limited | Excellent |
| LLM tracing | Good | Excellent | Good | Excellent |
| Open source | Yes | Yes | No | No |
| Local run | Yes | Yes | No | No |
| Eval framework | Strong | Strong | Limited | Strong |
Getting Started
```bash
pip install arize-phoenix
phoenix serve  # Launches the UI at http://localhost:6006
```
```python
import phoenix as px

# Load query embeddings from a pandas DataFrame for analysis
# (px.Inferences is the current name for the older px.Dataset API).
ds = px.Inferences(
    dataframe=df,
    schema=px.Schema(
        prediction_id_column_name="id",
        prompt_column_names=px.EmbeddingColumnNames(
            vector_column_name="query_embedding",
            raw_data_column_name="query_text",
        ),
    ),
)

px.launch_app(primary=ds)  # Launches in a notebook; or browse the server started by `phoenix serve`
```
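To close the loop between tracing and evaluation, spans can be pulled out of a running Phoenix instance as a DataFrame and fed into the evaluators shown earlier. A rough sketch; the client method names reflect recent Phoenix releases and may differ in yours:
```python
import phoenix as px

# Assumes `phoenix serve` is running at the default endpoint (http://localhost:6006).
client = px.Client()
spans_df = client.get_spans_dataframe(project_name="my-rag-app")

# Each row is one traced span; filter to retriever/LLM spans and feed the relevant
# columns into llm_classify or the evaluators above to score production traffic.
print(len(spans_df), "spans exported")
```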
Phoenix is the ML observability tool that makes invisible embedding-level problems visible — by projecting high-dimensional retrieval and semantic data into inspectable visualizations, Phoenix enables AI teams to diagnose RAG failures, embedding drift, and retrieval quality issues that would otherwise require days of manual analysis to understand.