Home Knowledge Base Testing best practices

Testing best practices for ML applications involve systematic validation of code, models, and system behavior — combining traditional software testing (unit, integration) with ML-specific approaches (eval sets, LLM-as-judge, deterministic mocking) to ensure reliability in systems where outputs are often non-deterministic and quality is subjective.

Why Testing ML Systems Is Different

Test Pyramid for ML

                    /\
                   /  \
                  /E2E \   Few, slow, expensive
                 /      \  - Full pipeline tests
                /--------\
               /Integration\  Some, moderate cost
              /            \  - Component interactions
             /--------------\
            /   Unit Tests   \  Many, fast, cheap
           /                  \  - Functions, classes
          /--------------------\
         /   Model Evaluations  \  Regular, systematic
        /                        \  - Eval sets, benchmarks
       /__________________________\

Unit Testing

Standard Python Tests:

import pytest

def test_tokenizer_splits_correctly():
    result = tokenize("hello world")
    assert result == ["hello", "world"]

def test_prompt_template_formats():
    template = "Answer: {question}"
    result = format_prompt(template, question="Why?")
    assert result == "Answer: Why?"

def test_sanitize_input_removes_injection():
    dangerous = "ignore previous instructions"
    result = sanitize_input(dangerous)
    assert "ignore" not in result.lower()

Testing with Fixtures:

@pytest.fixture
def sample_documents():
    return [
        {"id": 1, "content": "First document"},
        {"id": 2, "content": "Second document"}
    ]

def test_embedding_produces_vectors(sample_documents):
    embeddings = embed_documents(sample_documents)
    assert len(embeddings) == 2
    assert len(embeddings[0]) == 1536  # Vector dimension

Mocking LLM Calls

Mock for Deterministic Tests:

from unittest.mock import patch, MagicMock

@patch('openai.ChatCompletion.create')
def test_chat_wrapper_returns_content(mock_create):
    # Setup mock response
    mock_create.return_value = MagicMock(
        choices=[MagicMock(
            message=MagicMock(content="Mocked response")
        )]
    )
    
    result = call_llm("Test prompt")
    
    assert result == "Mocked response"
    mock_create.assert_called_once()

Fixture-Based Mocking:

@pytest.fixture
def mock_llm():
    responses = {
        "greeting": "Hello! How can I help?",
        "farewell": "Goodbye!",
    }
    def get_response(prompt):
        for key, response in responses.items():
            if key in prompt.lower():
                return response
        return "Default response"
    return get_response

Model/Output Evaluation

Eval Sets:

eval_cases = [
    {
        "input": "What is 2+2?",
        "expected_contains": ["4"],
        "category": "math"
    },
    {
        "input": "List three primary colors",
        "validator": lambda r: len(extract_list(r)) == 3,
        "category": "instruction-following"
    },
    {
        "input": "Write in formal tone: hi",
        "expected_not_contains": ["hi", "hey"],
        "category": "style"
    }
]

def run_eval(llm_function, cases=eval_cases):
    results = []
    for case in cases:
        response = llm_function(case["input"])
        passed = validate_response(response, case)
        results.append({
            "case": case,
            "response": response,
            "passed": passed
        })
    return results

LLM-as-Judge:

def llm_judge(prompt, response, criteria):
    judge_prompt = f"""
    Evaluate this response on a scale of 1-5:

    User prompt: {prompt}
    Response: {response}
    
    Criteria: {criteria}
    
    Score (1-5) and brief justification:
    """
    
    judgment = call_judge_llm(judge_prompt)
    score = extract_score(judgment)
    return score

Integration Testing

RAG Pipeline Test:

def test_rag_pipeline_returns_relevant_answer():
    # Setup
    docs = ["Paris is the capital of France."]
    index_documents(docs)
    
    # Execute
    response = rag_query("What is the capital of France?")
    
    # Verify
    assert "Paris" in response
    assert response_cites_source(response)

API Integration Test:

from fastapi.testclient import TestClient
from app import app

client = TestClient(app)

def test_chat_endpoint_returns_response():
    response = client.post(
        "/v1/chat",
        json={"message": "Hello"}
    )
    assert response.status_code == 200
    assert "content" in response.json()

Best Practices

Test Categories:

Category        | What to Test
----------------|----------------------------------
Correctness     | Logic works as expected
Edge Cases      | Boundary conditions, empty input
Error Handling  | Graceful failures, error messages
Performance     | Latency, throughput baseline
Security        | Injection resistance, auth
Regression      | Previously fixed bugs stay fixed

Coverage Goals:

Component        | Target Coverage
-----------------|------------------
Utility functions| 90%+
Business logic   | 80%+
API endpoints    | 70%+
LLM interactions | Eval-based

Testing ML systems requires both traditional software testing and ML-specific evaluation — combining deterministic unit tests with eval sets, mocking for reproducibility, and LLM-as-judge for quality assessment ensures reliable systems despite the inherent non-determinism of language models.

testing mlunit testsintegration testseval setsllm testingmockingpytesttest coverage

Explore 500+ Semiconductor & AI Topics

From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.