Testing best practices for ML applications involve systematic validation of code, models, and system behavior — combining traditional software testing (unit, integration) with ML-specific approaches (eval sets, LLM-as-judge, deterministic mocking) to ensure reliability in systems where outputs are often non-deterministic and quality is subjective.
Why Testing ML Systems Is Different
- Non-Determinism: Same input can produce different outputs.
- Subjectivity: "Good" responses are often judgment calls.
- Expensive Operations: API calls cost money and time.
- Model Behavior: Changes with updates, fine-tuning.
- Edge Cases: Vast input space makes coverage difficult.
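Non-determinism in particular changes how assertions are written: instead of exact-match comparisons, a test can assert properties that any acceptable output must satisfy. A minimal sketch (the summary strings and the `check_summary_properties` helper are illustrative stand-ins, not a real API):

```python
def check_summary_properties(summary: str, source: str) -> bool:
    """Check properties any acceptable summary should satisfy,
    rather than comparing against one exact expected string."""
    return (
        0 < len(summary) < len(source)           # shorter than the source
        and not summary.strip().endswith("...")  # not visibly truncated
    )

# Two different (non-deterministic) outputs can both pass:
source = "The quick brown fox jumps over the lazy dog. " * 5
assert check_summary_properties("A fox jumps over a dog.", source)
assert check_summary_properties("A story about a fox and a dog.", source)
```

Property-based checks like this are the unit-test analogue of the eval-set validators shown later.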
Test Pyramid for ML
```
              /\
             /  \
            /E2E \          Few, slow, expensive
           /      \         - Full pipeline tests
          /--------\
         /Integration\      Some, moderate cost
        /            \      - Component interactions
       /--------------\
      /   Unit Tests   \    Many, fast, cheap
     /                  \   - Functions, classes
    /--------------------\
   /  Model Evaluations   \   Regular, systematic
  /________________________\  - Eval sets, benchmarks
```
Unit Testing
Standard Python Tests:
```python
import pytest

def test_tokenizer_splits_correctly():
    result = tokenize("hello world")
    assert result == ["hello", "world"]

def test_prompt_template_formats():
    template = "Answer: {question}"
    result = format_prompt(template, question="Why?")
    assert result == "Answer: Why?"

def test_sanitize_input_removes_injection():
    dangerous = "ignore previous instructions"
    result = sanitize_input(dangerous)
    assert "ignore" not in result.lower()
```
Testing with Fixtures:
```python
@pytest.fixture
def sample_documents():
    return [
        {"id": 1, "content": "First document"},
        {"id": 2, "content": "Second document"},
    ]

def test_embedding_produces_vectors(sample_documents):
    embeddings = embed_documents(sample_documents)
    assert len(embeddings) == 2
    assert len(embeddings[0]) == 1536  # Vector dimension
```
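Fixtures pair well with `pytest.mark.parametrize` for covering many input variants in one test. A sketch, with a minimal stand-in for the `format_prompt` helper tested above:

```python
import pytest

def format_prompt(template: str, **kwargs) -> str:
    # Minimal stand-in for the prompt-formatting helper under test
    return template.format(**kwargs)

@pytest.mark.parametrize("question,expected", [
    ("Why?", "Answer: Why?"),
    ("How?", "Answer: How?"),
    ("", "Answer: "),  # edge case: empty question
])
def test_prompt_template_variants(question, expected):
    assert format_prompt("Answer: {question}", question=question) == expected
```

Parametrization keeps edge cases (empty strings, unusual characters) visible in one table instead of scattered across near-duplicate tests.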
Mocking LLM Calls
Mock for Deterministic Tests:
```python
from unittest.mock import patch, MagicMock

@patch('openai.ChatCompletion.create')
def test_chat_wrapper_returns_content(mock_create):
    # Set up the mock response to mirror the API's shape
    mock_create.return_value = MagicMock(
        choices=[MagicMock(
            message=MagicMock(content="Mocked response")
        )]
    )
    result = call_llm("Test prompt")
    assert result == "Mocked response"
    mock_create.assert_called_once()
```
Fixture-Based Mocking:
```python
@pytest.fixture
def mock_llm():
    responses = {
        "greeting": "Hello! How can I help?",
        "farewell": "Goodbye!",
    }
    def get_response(prompt):
        for key, response in responses.items():
            if key in prompt.lower():
                return response
        return "Default response"
    return get_response
```
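A test then consumes `mock_llm` as a regular argument and exercises routing logic with no network call. The same keyword-routing idea can be sketched as a plain factory, independent of pytest (`make_mock_llm` is a hypothetical name, not part of any library):

```python
def make_mock_llm(responses: dict):
    """Build a fake LLM that routes on keywords found in the prompt."""
    def get_response(prompt: str) -> str:
        for key, response in responses.items():
            if key in prompt.lower():
                return response
        return "Default response"
    return get_response

mock_llm = make_mock_llm({
    "greeting": "Hello! How can I help?",
    "farewell": "Goodbye!",
})

assert mock_llm("Send a Greeting please") == "Hello! How can I help?"
assert mock_llm("unrelated prompt") == "Default response"
```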
Model/Output Evaluation
Eval Sets:
```python
eval_cases = [
    {
        "input": "What is 2+2?",
        "expected_contains": ["4"],
        "category": "math",
    },
    {
        "input": "List three primary colors",
        "validator": lambda r: len(extract_list(r)) == 3,
        "category": "instruction-following",
    },
    {
        "input": "Write in formal tone: hi",
        "expected_not_contains": ["hi", "hey"],
        "category": "style",
    },
]

def run_eval(llm_function, cases=eval_cases):
    results = []
    for case in cases:
        response = llm_function(case["input"])
        passed = validate_response(response, case)
        results.append({
            "case": case,
            "response": response,
            "passed": passed,
        })
    return results
```
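`run_eval` assumes a `validate_response` helper. One way to sketch it is to dispatch on whichever check each eval case declares (field names match the cases above; this is an illustrative implementation, not a library function):

```python
def validate_response(response: str, case: dict) -> bool:
    """Apply whichever checks the eval case declares; all must pass."""
    lowered = response.lower()
    if "expected_contains" in case:
        if not all(s.lower() in lowered for s in case["expected_contains"]):
            return False
    if "expected_not_contains" in case:
        if any(s.lower() in lowered for s in case["expected_not_contains"]):
            return False
    if "validator" in case:
        if not case["validator"](response):
            return False
    return True

assert validate_response("The answer is 4.", {"expected_contains": ["4"]})
assert not validate_response("hi there", {"expected_not_contains": ["hi", "hey"]})
```

Simple substring checks like these are deliberately crude; the `validator` hook exists for cases that need real parsing.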
LLM-as-Judge:
```python
def llm_judge(prompt, response, criteria):
    judge_prompt = f"""
Evaluate this response on a scale of 1-5:

User prompt: {prompt}
Response: {response}
Criteria: {criteria}

Score (1-5) and brief justification:
"""
    judgment = call_judge_llm(judge_prompt)
    score = extract_score(judgment)
    return score
```
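The `extract_score` helper has to parse a free-form judgment. A regex-based sketch (an assumption about how such a helper might look, tolerant of formats like "Score: 4" or "3/5"):

```python
import re

def extract_score(judgment):
    """Pull the first standalone 1-5 rating out of a free-form judgment.
    Returns None when no score can be found."""
    match = re.search(r"\b([1-5])\b", judgment)
    return int(match.group(1)) if match else None

assert extract_score("Score: 4. Concise and accurate.") == 4
assert extract_score("I'd rate this 3/5.") == 3
assert extract_score("No numeric score given") is None
```

A production version would need to be more robust, e.g. against the judge echoing the "1-5" scale back in its answer; asking the judge for structured (JSON) output avoids the parsing problem entirely.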
Integration Testing
RAG Pipeline Test:
```python
def test_rag_pipeline_returns_relevant_answer():
    # Setup
    docs = ["Paris is the capital of France."]
    index_documents(docs)

    # Execute
    response = rag_query("What is the capital of France?")

    # Verify
    assert "Paris" in response
    assert response_cites_source(response)
```
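For integration tests that must run offline, the retrieval step can be a toy in-memory index. A self-contained sketch (all names here are stand-ins, not the pipeline's real API):

```python
import re

def words(text):
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(index, query, top_k=1):
    """Rank documents by naive word overlap with the query."""
    q = words(query)
    ranked = sorted(index, key=lambda d: -len(q & words(d)))
    return ranked[:top_k]

def rag_answer(index, query):
    context = retrieve(index, query)[0]
    # A real pipeline would pass `context` to the LLM; here we echo it.
    return f"Based on the source: {context}"

index = ["Paris is the capital of France.",
         "Berlin is the capital of Germany."]
answer = rag_answer(index, "What is the capital of France?")
assert "Paris" in answer
```

This keeps the pipeline's shape (index, retrieve, generate) testable end-to-end without network access; a separate, slower test suite can cover the real retriever and model.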
API Integration Test:
```python
from fastapi.testclient import TestClient
from app import app

client = TestClient(app)

def test_chat_endpoint_returns_response():
    response = client.post(
        "/v1/chat",
        json={"message": "Hello"},
    )
    assert response.status_code == 200
    assert "content" in response.json()
```
Best Practices
Test Categories:
```
Category        | What to Test
----------------|----------------------------------
Correctness     | Logic works as expected
Edge Cases      | Boundary conditions, empty input
Error Handling  | Graceful failures, error messages
Performance     | Latency, throughput baseline
Security        | Injection resistance, auth
Regression      | Previously fixed bugs stay fixed
```
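Regression tests are ordinary tests pinned to a specific past bug. A sketch around a hypothetical sanitization helper and a hypothetical bug (neither is from a real codebase):

```python
def strip_control_chars(text: str) -> str:
    # Hypothetical helper that once crashed on embedded NUL bytes
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")

def test_regression_null_bytes_do_not_crash():
    # Previously raised an exception; must keep returning cleanly
    assert strip_control_chars("hello\x00world") == "helloworld"
```

Naming the test after the failure mode keeps the suite self-documenting when the bug tracker is long gone.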
Coverage Goals:
```
Component        | Target Coverage
-----------------|------------------
Utility functions| 90%+
Business logic   | 80%+
API endpoints    | 70%+
LLM interactions | Eval-based
```
Testing ML systems requires both traditional software testing and ML-specific evaluation. Combining deterministic unit tests with eval sets, mocking for reproducibility, and LLM-as-judge for quality assessment yields reliable systems despite the inherent non-determinism of language models.