TruLens

Keywords: trulens,feedback,eval

TruLens is an open-source library for evaluating and tracking LLM applications using the RAG Triad framework — providing feedback functions that score context relevance, groundedness, and answer relevance as continuous metrics across every application interaction, enabling data-driven quality improvement for RAG systems, agents, and any LLM-powered workflow.

What Is TruLens?

- Definition: An open-source evaluation and observability library (TruEra, 2022) that wraps LLM application chains with instrumentation — capturing inputs, intermediate outputs, and final responses, then scoring them with user-defined or pre-built feedback functions that measure quality dimensions relevant to RAG and agent systems.
- The RAG Triad: TruLens popularized the "RAG Triad" evaluation framework — three metrics that together assess whether a RAG response is trustworthy: Context Relevance (retriever quality), Groundedness (hallucination absence), and Answer Relevance (response usefulness).
- Feedback Functions: Scoring logic is encapsulated in feedback functions — Python callables that take inputs and outputs and return a score between 0 and 1, powered by LLM providers or custom logic.
- TruChain / TruLlama: Drop-in wrappers for LangChain (TruChain) and LlamaIndex (TruLlama) that auto-instrument all calls — no manual trace instrumentation required.
- Leaderboard: The TruLens dashboard shows a "leaderboard" of experiment runs — compare different RAG configurations side-by-side on all three RAG Triad metrics.

Why TruLens Matters

- RAG Quality Decomposition: When a RAG system gives a wrong answer, TruLens tells you whether the retriever found the wrong documents (low context relevance), the LLM hallucinated beyond those documents (low groundedness), or the answer was off-topic (low answer relevance) — pinpointing which component to fix.
- Continuous Monitoring: Wrap your production RAG application with TruLens and every interaction is automatically scored — dashboards show quality trends without manual evaluation effort.
- Experiment Comparison: Run your RAG pipeline with chunk_size=512 and chunk_size=1024, log both to TruLens, and compare RAG Triad scores — data-driven hyperparameter optimization (see the sketch after this list).
- Feedback Function Flexibility: Beyond the RAG Triad, define custom feedback functions for any quality dimension — sentiment, technical accuracy, compliance with style guidelines, citation formatting.
- Open Source and Extensible: MIT license, all evaluation logic is inspectable and modifiable — no black-box scoring that you have to trust without understanding.
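
As a sketch of that comparison workflow: the snippet below assumes a hypothetical build_rag_chain(chunk_size) helper, a list of eval_questions, and the session, TruChain wrapper, and RAG Triad feedback functions defined in the usage section further down; only the app_version tagging and leaderboard call are TruLens-specific.

```python
# Sketch: log two chunking configurations as separate app versions, then compare.
# Assumes build_rag_chain(), eval_questions, session, and the f_* feedbacks exist.
for chunk_size in (512, 1024):
    chain = build_rag_chain(chunk_size=chunk_size)
    recorder = TruChain(
        chain,
        app_name="CustomerFAQ-RAG",
        app_version=f"chunk_{chunk_size}",
        feedbacks=[f_groundedness, f_context_relevance, f_answer_relevance],
    )
    with recorder:
        for question in eval_questions:
            chain.invoke({"query": question})

session.get_leaderboard()  # one row per app_version, averaged over all scored records
```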

The RAG Triad in Detail

Context Relevance (Retriever Quality):
- "Is the retrieved context actually relevant to the query?"
- Scores each retrieved chunk for relevance to the input question.
- Low score → retriever is pulling off-topic documents. Remediation: better embedding model, metadata filtering, query reformulation.

Groundedness (Generation Quality — Hallucination):
- "Is the answer supported by the retrieved context?"
- Extracts claims from the answer and verifies each against the context using an LLM judge.
- Low score → generator is inventing facts beyond what the context supports. Remediation: tighter system prompt, lower temperature, explicit instruction to answer only from the retrieved context.

Answer Relevance (Response Usefulness):
- "Does the answer address the user's question?"
- Evaluates whether the final response is on-topic and helpful for the query.
- Low score → response is tangential or incomplete. Remediation: prompt engineering, question preprocessing.
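
A minimal sketch of how the three checks above translate into a debugging decision once the triad scores for a record are available; the threshold and helper function are illustrative, not part of TruLens.

```python
def diagnose_rag(context_relevance: float, groundedness: float,
                 answer_relevance: float, threshold: float = 0.5) -> str:
    """Map low RAG Triad scores to the component most likely at fault."""
    if context_relevance < threshold:
        return "retriever: off-topic chunks (embeddings, filtering, query rewriting)"
    if groundedness < threshold:
        return "generator: unsupported claims (tighter prompt, lower temperature)"
    if answer_relevance < threshold:
        return "response: tangential or incomplete (prompt engineering, query preprocessing)"
    return "all three checks pass at this threshold"

print(diagnose_rag(context_relevance=0.9, groundedness=0.3, answer_relevance=0.8))
# -> generator: unsupported claims (tighter prompt, lower temperature)
```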

Core TruLens Usage

LangChain Integration:
```python
import numpy as np

from trulens.apps.langchain import TruChain
from trulens.core import TruSession
from trulens.core.feedback import Feedback
from trulens.providers.openai import OpenAI as TruOpenAI

session = TruSession()
session.reset_database()

# LLM provider used as the judge for all three RAG Triad feedback functions.
provider = TruOpenAI(model_engine="gpt-4o")

# Select the retrieved context from the instrumented chain so groundedness and
# context relevance are scored against the retrieved chunks, not the raw input.
context = TruChain.select_context(rag_chain)

f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())
    .on_output()
)
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

tru_rag = TruChain(
    rag_chain,
    app_name="CustomerFAQ-RAG",
    feedbacks=[f_groundedness, f_context_relevance, f_answer_relevance],
)

with tru_rag as recording:
    response = rag_chain.invoke({"query": "What is the return policy?"})

session.get_leaderboard()  # Compare experiment runs on the RAG Triad metrics
```
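
Aggregate scores are one view; individual records and their feedback scores can also be pulled into a pandas DataFrame for inspection. A minimal sketch, assuming the TruSession from above:

```python
records_df, feedback_names = session.get_records_and_feedback()
print(feedback_names)     # names of the registered feedback functions
print(records_df.head())  # one row per recorded call, with a column per feedback score
```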

TruLens Dashboard:
```python
from trulens.dashboard import run_dashboard

run_dashboard(session)  # Opens the Streamlit dashboard at http://localhost:8501
```

Custom Feedback Function:
```python
def technical_accuracy(question: str, response: str) -> float:
    """Score in [0, 1]: fraction of required technical terms present in the response."""
    required_terms = get_required_terms(question)  # user-defined lookup of expected terminology
    if not required_terms:
        return 1.0
    return sum(1 for term in required_terms if term in response) / len(required_terms)

f_technical = Feedback(technical_accuracy).on_input_output()
```
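
Custom feedback functions plug into the same recorder as the built-in ones; a sketch that simply extends the feedbacks list from the LangChain example above:

```python
tru_rag = TruChain(
    rag_chain,
    app_name="CustomerFAQ-RAG",
    feedbacks=[f_groundedness, f_context_relevance, f_answer_relevance, f_technical],
)
```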

TruLens vs Alternatives

| Feature | TruLens | RAGAS | DeepEval | Langfuse |
|---------|--------|------|---------|---------|
| RAG Triad | Native | Equivalent | Similar | No |
| LangChain integration | TruChain | Good | Good | Native |
| LlamaIndex integration | TruLlama | Good | Good | Good |
| Dashboard | Built-in | No | Confident AI | Built-in |
| Custom feedback fns | Excellent | Limited | Limited | Custom scorers |
| Open source | Yes | Yes | Yes | Yes |

TruLens is the evaluation library that makes RAG quality measurement concrete and actionable through the RAG Triad framework. By decomposing RAG quality into three independently measurable dimensions, it enables teams to diagnose exactly where a retrieval-augmented generation system is failing and to validate that fixes improve the right metric without degrading the others.
