Langfuse is an open-source LLM engineering platform for tracing, evaluating, and monitoring AI applications. It provides end-to-end visibility into complex LangChain, LlamaIndex, and custom LLM pipelines through structured traces that capture every component's input, output, latency, and cost, enabling teams to debug production issues, run evaluations, and iteratively improve their AI systems.
What Is Langfuse?
- Definition: An open-source observability and analytics platform (MIT-licensed core, company founded 2023 in Berlin) designed specifically for the multi-step, non-deterministic nature of LLM applications, capturing hierarchical traces that show exactly what happened inside a LangChain agent, RAG pipeline, or custom AI workflow.
- Trace Model: Langfuse organizes observability data as nested traces: a top-level Trace contains Spans (non-LLM operations such as retrieval or tool calls) and Generations (LLM calls with token usage and cost), forming a full execution tree for any complex pipeline; see the low-level SDK sketch after this list.
- Framework Integration: Native instrumentation for LangChain, LlamaIndex, OpenAI SDK, Anthropic SDK, and any Python/TypeScript code — one-line SDK integration or auto-instrumentation via callbacks.
- Evaluation System: Built-in evaluation workflow — define evaluation criteria, run LLM-as-judge scoring on production traces, compare experiment results, and catch regressions before deployment.
- Prompt Management: Version-controlled prompt registry — manage prompt templates in Langfuse, fetch them in code via SDK, roll back to previous versions, and A/B test variants with tracked metrics.
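For code that is not covered by a framework integration, the same trace tree can be built directly with the low-level Python SDK. A minimal sketch, assuming the v2 Python SDK; the names, inputs, and token counts are illustrative:
```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

# Top-level trace for one request
trace = langfuse.trace(name="rag-query", user_id="user-123")

# Span: a non-LLM step such as vector retrieval
span = trace.span(name="vector-search", input={"query": "What is RAG?"})
span.end(output={"doc_ids": ["doc-1", "doc-2", "doc-3"]})

# Generation: the LLM call, with model and usage so Langfuse can attribute cost
generation = trace.generation(
    name="answer",
    model="gpt-4o",
    input=[{"role": "user", "content": "What is RAG?"}],
)
generation.end(
    output="RAG combines retrieval with generation ...",
    usage={"input": 120, "output": 85},
)

langfuse.flush()  # send buffered events before the process exits
```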
Why Langfuse Matters
- Multi-Step Visibility: Unlike simple request logging, Langfuse traces show the full execution of a RAG pipeline — which documents were retrieved, how long retrieval took, what the generator received, and what it returned — making debugging fast and precise.
- LLM Quality Monitoring: Set up automated evaluation jobs that score production traces using GPT-4 or Claude as a judge, producing continuous quality metrics without human labeling; scores can also be written programmatically, as in the sketch after this list.
- Cost Attribution: Track token usage and cost per trace component — identify which pipeline step consumes the most tokens and optimize accordingly.
- Experiment Tracking: Compare different prompt versions, model choices, or retrieval strategies as named experiments — quantitative evidence for engineering decisions.
- Self-Hostable: Deploy Langfuse on your own infrastructure with Docker Compose — complete data sovereignty, required for enterprises with data residency requirements.
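Quality signals end up as scores attached to traces. A minimal sketch of writing a score programmatically, assuming the v2 Python SDK; the trace ID is illustrative:
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Attach a numeric quality score to an existing trace
langfuse.score(
    trace_id="trace-abc-123",  # illustrative; in practice taken from the SDK or UI
    name="helpfulness",
    value=0.8,
    comment="Correct answer, but missed one edge case",
)
```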
Integration Examples
OpenAI SDK (Python):
```python
from langfuse.openai import openai

client = openai.OpenAI()  # Langfuse-wrapped drop-in client
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG."}],
    name="explain-rag",           # Trace name in Langfuse
    metadata={"user_id": "123"},  # Custom metadata
)
```
LangChain Callback:
```python
from langfuse.callback import CallbackHandler

handler = CallbackHandler(public_key="pk-...", secret_key="sk-...")
chain.invoke({"input": "user query"}, config={"callbacks": [handler]})
```
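Trace-level context such as a user or session identifier can be attached to the handler itself, so traces can be filtered per user or conversation in the UI. A hedged sketch; the user_id and session_id keyword arguments are assumed from the LangChain integration, and the values are illustrative:
```python
from langfuse.callback import CallbackHandler

handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-...",
    user_id="user-123",    # attributes resulting traces to this user
    session_id="chat-42",  # groups traces from the same conversation
)

chain.invoke({"input": "user query"}, config={"callbacks": [handler]})
```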
Custom Tracing (Decorator):
```python
from langfuse.decorators import observe, langfuse_context

@observe()
def retrieve_documents(query: str) -> list:
    docs = vector_store.similarity_search(query, k=5)
    langfuse_context.update_current_observation(metadata={"doc_count": len(docs)})
    return docs

@observe(name="rag-pipeline")
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)
    return generate_answer(question, docs)
```
Evaluation Workflow
Human Annotation:
- Review traces in the Langfuse UI and assign quality scores (correctness, helpfulness, groundedness) — build labeled datasets for fine-tuning and evaluation.
LLM-as-Judge:
- Define evaluators in Python that score traces using another LLM; they run automatically on new production traces for continuous quality monitoring.
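A minimal external evaluation loop, assuming the v2 SDK's fetch_traces helper; judge_groundedness() is a hypothetical function that prompts the judge model and returns a score between 0 and 1:
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch recent production traces for the pipeline under evaluation
traces = langfuse.fetch_traces(name="rag-pipeline", limit=50).data

for trace in traces:
    # judge_groundedness() is a placeholder for your LLM-as-judge call,
    # e.g. grading the answer against the retrieved context
    value = judge_groundedness(trace.input, trace.output)
    langfuse.score(trace_id=trace.id, name="groundedness", value=value)
```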
Dataset Experiments:
- Curate test datasets from production traces, run your pipeline against the dataset, compare scores across prompt/model versions in experiment view.
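A sketch of one experiment run against such a dataset, assuming the v2 Python SDK with the item.observe() helper, a dataset named "support-qa", and the answer_question() pipeline from the decorator example above:
```python
from langfuse import Langfuse

langfuse = Langfuse()

dataset = langfuse.get_dataset("support-qa")

for item in dataset.items:
    # item.observe() links the resulting trace to this named experiment run
    with item.observe(run_name="prompt-v4") as trace_id:
        output = answer_question(item.input)
        # Score against the expected output (simple exact match for illustration)
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )
```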
Prompt Management
```python
from langfuse import Langfuse

lf = Langfuse()
prompt = lf.get_prompt("customer-support-v3")  # Fetches from the registry
messages = prompt.compile(customer_name="Alice", issue="billing")
```
Langfuse vs Alternatives
| Feature | Langfuse | Helicone | Phoenix (Arize) | LangSmith |
|---------|---------|---------|----------------|----------|
| Open source | Yes (MIT) | Yes | Yes | No |
| Trace model | Hierarchical | Flat request logs | Hierarchical | Hierarchical |
| Evaluation system | Strong | Basic | Strong | Strong |
| Prompt management | Yes | No | No | Yes |
| Self-hostable | Yes (simple) | Yes | Yes | Enterprise plan only |
| LangChain integration | Excellent | Good | Good | Native |
Self-Hosting
```bash
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
# Access the UI at http://localhost:3000
```
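The SDKs and integrations shown above are then pointed at the self-hosted instance via the standard Langfuse environment variables (the key values below are placeholders):
```bash
export LANGFUSE_PUBLIC_KEY="pk-lf-..."        # project public key from the Langfuse UI
export LANGFUSE_SECRET_KEY="sk-lf-..."        # project secret key from the Langfuse UI
export LANGFUSE_HOST="http://localhost:3000"  # default address for the Docker Compose setup
```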
Langfuse is an open-source LLM observability platform that gives engineering teams the visibility and evaluation infrastructure needed to confidently ship and continuously improve AI applications. By combining structured tracing, automated evaluation, and prompt management in a single self-hostable platform, it provides the observability foundation that production LLM applications require, without vendor lock-in.