DeepEval

DeepEval is an open-source LLM evaluation framework whose tests run as pytest-compatible unit tests in CI/CD pipelines. It provides pre-built metrics for hallucination detection, contextual relevance, bias, and answer correctness, plus G-Eval scoring, so that LLM quality is treated as a testable, measurable property rather than a subjective judgment.

What Is DeepEval?

Why DeepEval Matters

Core DeepEval Metrics

Faithfulness (Hallucination Detection):

Contextual Precision (Retrieval Quality):
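
To make the idea of retrieval quality concrete, the sketch below computes a rank-weighted precision over retrieved chunks in plain Python. It assumes the formulation given in DeepEval's documentation (precision at each relevant rank, averaged over the number of relevant chunks) and takes hand-supplied 0/1 relevance verdicts, where the library would obtain them from an LLM judge. The function name `contextual_precision` and the sample inputs are illustrative, not part of DeepEval's API.

```python
def contextual_precision(relevance):
    """Rank-weighted precision over retrieved chunks.

    relevance: list of 0/1 verdicts in retrieval order (1 = relevant).
    Hypothetical helper for illustration; not a DeepEval function.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # no relevant chunk was retrieved at all

    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            score += relevant_so_far / k  # precision at rank k
    return score / total_relevant


# Same chunks, different ordering: relevant chunks first scores higher.
good_ranking = contextual_precision([1, 1, 0])
poor_ranking = contextual_precision([0, 1, 1])
print(good_ranking, poor_ranking)
```

Because each relevant chunk contributes the precision at its own rank, pushing an irrelevant chunk to the top lowers the score even when the set of retrieved chunks is unchanged.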

Contextual Recall:
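
As a rough illustration, recall can be read as the share of statements in the expected answer that the retrieved context supports. The sketch below assumes that ratio and uses hand-written boolean verdicts in place of an LLM judge; `contextual_recall` and the 0.7 threshold are hypothetical names and values chosen for the example.

```python
def contextual_recall(attributable):
    """Fraction of expected-output statements supported by the retrieved context.

    attributable: list of booleans, one per statement in the expected output.
    Hypothetical helper for illustration; not a DeepEval function.
    """
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)


# Three of four expected statements can be traced back to the retrieved context.
score = contextual_recall([True, True, False, True])
passed = score >= 0.7  # assumed threshold, mirroring the threshold= argument on metrics
print(score, passed)
```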

Answer Relevancy:

G-Eval (Flexible LLM Scoring):

Bias and Toxicity:

Usage Example

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Using model="gpt-4o" as the judge requires OPENAI_API_KEY to be set in the environment.
def test_rag_faithfulness():
    # One test case: the user input, the application's answer, and the retrieved context.
    test_case = LLMTestCase(
        input="What is the return policy?",
        actual_output="Returns are accepted within 30 days with receipt.",
        retrieval_context=["Our policy: customers may return items within 30 days of purchase with proof of purchase."]
    )
    # Each metric passes only if its score reaches the given threshold.
    faithfulness = FaithfulnessMetric(threshold=0.8, model="gpt-4o")
    answer_relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
    # Fails the test if any metric scores below its threshold.
    assert_test(test_case, [faithfulness, answer_relevancy])

Run with: deepeval test run test_rag.py

Bulk Evaluation:

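from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Assumption: `dataset` is a list of dicts with "input", "actual_output", and "retrieval_context" keys.
test_cases = [LLMTestCase(**row) for row in dataset]
results = evaluate(test_cases, metrics=[FaithfulnessMetric(threshold=0.8)])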

DeepEval vs Alternatives

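| Feature | DeepEval | RAGAS | TruLens | Promptfoo |
|---|---|---|---|---|
| Pytest integration | Native | No | No | CLI only |
| RAG metrics | Comprehensive | Excellent | Good | Limited |
| Bias/toxicity | Yes | No | No | Limited |
| CI/CD integration | Excellent | Good | Limited | Excellent |
| Open source | Yes | Yes | Yes | Yes |
| LLM-as-judge | Yes | Yes | Yes | Yes |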

DeepEval brings unit-testing discipline to LLM application quality assurance. By making hallucination, relevance, and bias metrics runnable as pytest assertions in CI/CD pipelines, it lets engineering teams catch quality regressions automatically and ship LLM applications against measurable quality thresholds.
