Langfuse is an open-source LLM engineering platform for tracing, evaluating, and monitoring AI applications. It provides end-to-end visibility into complex LangChain, LlamaIndex, and custom LLM pipelines through structured traces that capture every component's input, output, latency, and cost, enabling teams to debug production issues, run evaluations, and iteratively improve their AI systems.
What Is Langfuse?
- Definition: An open-source observability and analytics platform (MIT-licensed core, company founded 2023 in Berlin) designed specifically for the multi-step, non-deterministic nature of LLM applications, capturing hierarchical traces that show exactly what happened inside a LangChain agent, RAG pipeline, or custom AI workflow.
- Trace Model: Langfuse organizes observability data as nested traces: a top-level Trace contains Spans (non-LLM operations such as retrieval or tool calls) and Generations (LLM calls with token usage and cost), forming a full execution tree for any complex pipeline; see the low-level SDK sketch after this list.
- Framework Integration: Native instrumentation for LangChain, LlamaIndex, OpenAI SDK, Anthropic SDK, and any Python/TypeScript code — one-line SDK integration or auto-instrumentation via callbacks.
- Evaluation System: Built-in evaluation workflow — define evaluation criteria, run LLM-as-judge scoring on production traces, compare experiment results, and catch regressions before deployment.
- Prompt Management: Version-controlled prompt registry — manage prompt templates in Langfuse, fetch them in code via SDK, roll back to previous versions, and A/B test variants with tracked metrics.
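For code that is not covered by a framework integration, the same trace tree can be built directly with the low-level Python SDK. A minimal sketch, assuming the v2 Python SDK; the names, inputs, and token counts are illustrative:
```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

# Top-level trace for one request
trace = langfuse.trace(name="rag-query", user_id="user-123")

# Span: a non-LLM step such as vector retrieval
span = trace.span(name="vector-search", input={"query": "What is RAG?"})
span.end(output={"doc_ids": ["doc-1", "doc-2", "doc-3"]})

# Generation: the LLM call, with model and usage so Langfuse can attribute cost
generation = trace.generation(
    name="answer",
    model="gpt-4o",
    input=[{"role": "user", "content": "What is RAG?"}],
)
generation.end(
    output="RAG combines retrieval with generation ...",
    usage={"input": 120, "output": 85},
)

langfuse.flush()  # send buffered events before the process exits
```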
Why Langfuse Matters
- Multi-Step Visibility: Unlike simple request logging, Langfuse traces show the full execution of a RAG pipeline — which documents were retrieved, how long retrieval took, what the generator received, and what it returned — making debugging fast and precise.
- LLM Quality Monitoring: Set up automated evaluation jobs that score production traces using GPT-4 or Claude as a judge, producing continuous quality metrics without human labeling; scores can also be written programmatically, as in the sketch after this list.
- Cost Attribution: Track token usage and cost per trace component — identify which pipeline step consumes the most tokens and optimize accordingly.
- Experiment Tracking: Compare different prompt versions, model choices, or retrieval strategies as named experiments — quantitative evidence for engineering decisions.
- Self-Hostable: Deploy Langfuse on your own infrastructure with Docker Compose — complete data sovereignty, required for enterprises with data residency requirements.
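Quality signals end up as scores attached to traces. A minimal sketch of writing a score programmatically, assuming the v2 Python SDK; the trace ID is illustrative:
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Attach a numeric quality score to an existing trace
langfuse.score(
    trace_id="trace-abc-123",  # illustrative; in practice taken from the SDK or UI
    name="helpfulness",
    value=0.8,
    comment="Correct answer, but missed one edge case",
)
```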
Integration Examples
OpenAI SDK (Python):
```python
from langfuse.openai import openai

client = openai.OpenAI()  # Langfuse-wrapped drop-in client
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG."}],
    name="explain-rag",           # Trace name in Langfuse
    metadata={"user_id": "123"},  # Custom metadata
)
```
LangChain Callback:
```python
from langfuse.callback import CallbackHandler

handler = CallbackHandler(public_key="pk-...", secret_key="sk-...")
chain.invoke({"input": "user query"}, config={"callbacks": [handler]})
```
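Trace-level context such as a user or session identifier can be attached to the handler itself, so traces can be filtered per user or conversation in the UI. A hedged sketch; the user_id and session_id keyword arguments are assumed from the LangChain integration, and the values are illustrative:
```python
from langfuse.callback import CallbackHandler

handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-...",
    user_id="user-123",    # attributes resulting traces to this user
    session_id="chat-42",  # groups traces from the same conversation
)

chain.invoke({"input": "user query"}, config={"callbacks": [handler]})
```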
Custom Tracing (Decorator):
```python
from langfuse.decorators import observe, langfuse_context

@observe()
def retrieve_documents(query: str) -> list:
    docs = vector_store.similarity_search(query, k=5)
    langfuse_context.update_current_observation(metadata={"doc_count": len(docs)})
    return docs

@observe(name="rag-pipeline")
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)
    return generate_answer(question, docs)
```
Evaluation Workflow
Human Annotation:
- Review traces in the Langfuse UI and assign quality scores (correctness, helpfulness, groundedness) — build labeled datasets for fine-tuning and evaluation.
LLM-as-Judge:
- Define evaluators in Python that score traces using another LLM; they run automatically on new production traces for continuous quality monitoring.
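A minimal external evaluation loop, assuming the v2 SDK's fetch_traces helper; judge_groundedness() is a hypothetical function that prompts the judge model and returns a score between 0 and 1:
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch recent production traces for the pipeline under evaluation
traces = langfuse.fetch_traces(name="rag-pipeline", limit=50).data

for trace in traces:
    # judge_groundedness() is a placeholder for your LLM-as-judge call,
    # e.g. grading the answer against the retrieved context
    value = judge_groundedness(trace.input, trace.output)
    langfuse.score(trace_id=trace.id, name="groundedness", value=value)
```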
Dataset Experiments:
- Curate test datasets from production traces, run your pipeline against the dataset, compare scores across prompt/model versions in experiment view.
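A sketch of one experiment run against such a dataset, assuming the v2 Python SDK with the item.observe() helper, a dataset named "support-qa", and the answer_question() pipeline from the decorator example above:
```python
from langfuse import Langfuse

langfuse = Langfuse()

dataset = langfuse.get_dataset("support-qa")

for item in dataset.items:
    # item.observe() links the resulting trace to this named experiment run
    with item.observe(run_name="prompt-v4") as trace_id:
        output = answer_question(item.input)
        # Score against the expected output (simple exact match for illustration)
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )
```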
Prompt Management
```python
from langfuse import Langfuse

lf = Langfuse()
prompt = lf.get_prompt("customer-support-v3")  # Fetches from the registry
messages = prompt.compile(customer_name="Alice", issue="billing")
```
Langfuse vs Alternatives
| Feature | Langfuse | Helicone | Phoenix (Arize) | LangSmith |
|---------|---------|---------|----------------|----------|
| Open source | Yes (MIT) | Yes | Yes | No |
| Trace model | Hierarchical | Flat request logs | Hierarchical | Hierarchical |
| Evaluation system | Strong | Basic | Strong | Strong |
| Prompt management | Yes | No | No | Yes |
| Self-hostable | Yes (simple) | Yes | Yes | Enterprise plan only |
| LangChain integration | Excellent | Good | Good | Native |
Self-Hosting
```bash
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
# Access the UI at http://localhost:3000
```
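The SDKs and integrations shown above are then pointed at the self-hosted instance via the standard Langfuse environment variables (the key values below are placeholders):
```bash
export LANGFUSE_PUBLIC_KEY="pk-lf-..."        # project public key from the Langfuse UI
export LANGFUSE_SECRET_KEY="sk-lf-..."        # project secret key from the Langfuse UI
export LANGFUSE_HOST="http://localhost:3000"  # default address for the Docker Compose setup
```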
Langfuse is an open-source LLM observability platform that gives engineering teams the visibility and evaluation infrastructure needed to confidently ship and continuously improve AI applications. By combining structured tracing, automated evaluation, and prompt management in a single self-hostable platform, it provides the observability foundation that production LLM applications require, without vendor lock-in.