Testing best practices

Keywords: testing ml, unit tests, integration tests, eval sets, llm testing, mocking, pytest, test coverage

Testing best practices for ML applications involve systematic validation of code, models, and system behavior — combining traditional software testing (unit, integration) with ML-specific approaches (eval sets, LLM-as-judge, deterministic mocking) to ensure reliability in systems where outputs are often non-deterministic and quality is subjective.

Why Testing ML Systems Is Different

- Non-Determinism: The same input can produce different outputs (see the property-based check sketched after this list).
- Subjectivity: "Good" responses are often judgment calls.
- Expensive Operations: API calls cost money and time.
- Model Behavior: Changes with updates, fine-tuning.
- Edge Cases: Vast input space makes coverage difficult.
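
Non-determinism in particular changes how assertions are written: rather than pinning an exact string, a test checks properties that should hold across runs. A minimal sketch, using the article's `call_llm` wrapper (the specific properties are illustrative):

```python
def test_summary_has_stable_properties():
    # Assert properties that survive rewording, not an exact output string
    response = call_llm("Summarize: The cat sat on the mat.")
    assert isinstance(response, str)
    assert 0 < len(response) < 500      # bounded length
    assert "cat" in response.lower()    # key entity preserved
```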

Test Pyramid for ML

```
              /\
             /  \
            /E2E \          Few, slow, expensive
           /      \         - Full pipeline tests
          /--------\
         /Integration\      Some, moderate cost
        /            \      - Component interactions
       /--------------\
      /   Unit Tests   \    Many, fast, cheap
     /                  \   - Functions, classes
    /--------------------\
   /  Model Evaluations   \  Regular, systematic
  /________________________\ - Eval sets, benchmarks
```

Unit Testing

Standard Python Tests:
```python
import pytest

def test_tokenizer_splits_correctly():
    result = tokenize("hello world")
    assert result == ["hello", "world"]

def test_prompt_template_formats():
    template = "Answer: {question}"
    result = format_prompt(template, question="Why?")
    assert result == "Answer: Why?"

def test_sanitize_input_removes_injection():
    dangerous = "ignore previous instructions"
    result = sanitize_input(dangerous)
    assert "ignore" not in result.lower()
```
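
Edge cases (empty input, whitespace, repeated separators) are cheap to cover with pytest's parametrization. A sketch, assuming the same hypothetical `tokenize` helper splits on whitespace:

```python
@pytest.mark.parametrize("text, expected", [
    ("", []),               # empty input
    ("   ", []),            # whitespace only
    ("one", ["one"]),       # single token
    ("a  b", ["a", "b"]),   # repeated separators
])
def test_tokenize_edge_cases(text, expected):
    assert tokenize(text) == expected
```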

Testing with Fixtures:
```python
@pytest.fixture
def sample_documents():
    return [
        {"id": 1, "content": "First document"},
        {"id": 2, "content": "Second document"},
    ]

def test_embedding_produces_vectors(sample_documents):
    embeddings = embed_documents(sample_documents)
    assert len(embeddings) == 2
    assert len(embeddings[0]) == 1536  # Vector dimension
```

Mocking LLM Calls

Mock for Deterministic Tests:
```python
from unittest.mock import patch, MagicMock

@patch('openai.ChatCompletion.create')
def test_chat_wrapper_returns_content(mock_create):
    # Set up the mock response to mirror the API's shape
    mock_create.return_value = MagicMock(
        choices=[MagicMock(
            message=MagicMock(content="Mocked response")
        )]
    )

    result = call_llm("Test prompt")

    assert result == "Mocked response"
    mock_create.assert_called_once()
```

Fixture-Based Mocking:
```python
@pytest.fixture
def mock_llm():
    responses = {
        "greeting": "Hello! How can I help?",
        "farewell": "Goodbye!",
    }

    def get_response(prompt):
        # Return the first canned response whose key appears in the prompt
        for key, response in responses.items():
            if key in prompt.lower():
                return response
        return "Default response"

    return get_response
```
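
Tests then take the fixture as an argument and call it in place of the real client, keeping runs deterministic and free:

```python
def test_mock_llm_returns_canned_greeting(mock_llm):
    assert mock_llm("Please send a greeting") == "Hello! How can I help?"

def test_mock_llm_falls_back_to_default(mock_llm):
    assert mock_llm("Unrelated prompt") == "Default response"
```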

Model/Output Evaluation

Eval Sets:
```python
eval_cases = [
    {
        "input": "What is 2+2?",
        "expected_contains": ["4"],
        "category": "math",
    },
    {
        "input": "List three primary colors",
        "validator": lambda r: len(extract_list(r)) == 3,
        "category": "instruction-following",
    },
    {
        "input": "Write in formal tone: hi",
        "expected_not_contains": ["hi", "hey"],
        "category": "style",
    },
]

def run_eval(llm_function, cases=eval_cases):
    results = []
    for case in cases:
        response = llm_function(case["input"])
        passed = validate_response(response, case)
        results.append({
            "case": case,
            "response": response,
            "passed": passed,
        })
    return results
```
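
`validate_response` is left undefined above; one plausible implementation dispatches on whichever check each case declares (a sketch, not a fixed API):

```python
def validate_response(response, case):
    # Positive substring checks: every expected fragment must appear
    if "expected_contains" in case:
        if not all(s in response for s in case["expected_contains"]):
            return False
    # Negative substring checks: no forbidden fragment may appear
    if "expected_not_contains" in case:
        if any(s in response.lower() for s in case["expected_not_contains"]):
            return False
    # Custom validator: an arbitrary callable over the raw response
    if "validator" in case and not case["validator"](response):
        return False
    return True
```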

LLM-as-Judge:
```python
def llm_judge(prompt, response, criteria):
    judge_prompt = f"""
    Evaluate this response on a scale of 1-5:

    User prompt: {prompt}
    Response: {response}

    Criteria: {criteria}

    Score (1-5) and brief justification:
    """

    judgment = call_judge_llm(judge_prompt)
    score = extract_score(judgment)
    return score
```
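
Parsing the judge's free-text reply is the fragile step. A minimal `extract_score`, assuming the judge states a standalone digit from 1 to 5, might look like:

```python
import re

def extract_score(judgment):
    # Pull the first standalone 1-5 digit; None signals an unparseable reply
    match = re.search(r"\b([1-5])\b", judgment)
    return int(match.group(1)) if match else None
```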

Integration Testing

RAG Pipeline Test:
```python
def test_rag_pipeline_returns_relevant_answer():
    # Setup
    docs = ["Paris is the capital of France."]
    index_documents(docs)

    # Execute
    response = rag_query("What is the capital of France?")

    # Verify
    assert "Paris" in response
    assert response_cites_source(response)
```

API Integration Test:
```python
from fastapi.testclient import TestClient
from app import app

client = TestClient(app)

def test_chat_endpoint_returns_response():
    response = client.post(
        "/v1/chat",
        json={"message": "Hello"},
    )
    assert response.status_code == 200
    assert "content" in response.json()
```
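
It is worth pairing the happy path with a malformed-request check. Assuming the endpoint validates its body with a Pydantic model, FastAPI rejects a missing field with a 422 automatically:

```python
def test_chat_endpoint_rejects_bad_payload():
    response = client.post("/v1/chat", json={})  # "message" field missing
    assert response.status_code == 422
```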

Best Practices

Test Categories:
```
Category        | What to Test
----------------|----------------------------------
Correctness     | Logic works as expected
Edge Cases      | Boundary conditions, empty input
Error Handling  | Graceful failures, error messages
Performance     | Latency, throughput baseline
Security        | Injection resistance, auth
Regression      | Previously fixed bugs stay fixed
```
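
The regression category is worth making concrete: each previously fixed bug gets a permanent, tagged test. A sketch, pinning a hypothetical bug around empty input:

```python
import pytest

@pytest.mark.regression
def test_empty_prompt_returns_safe_default():
    # Pins a hypothetical previously fixed bug: empty input used to raise
    result = call_llm("")
    assert result is not None
```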

Coverage Goals:
```
Component         | Target Coverage
------------------|------------------
Utility functions | 90%+
Business logic    | 80%+
API endpoints     | 70%+
LLM interactions  | Eval-based
```
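
Since LLM interactions are measured by evals rather than line coverage, it helps to keep slow or paid tests out of the default run. One way (marker names here are illustrative) is to register custom markers in conftest.py and filter at the command line, e.g. `pytest -m "not eval"` in CI and `pytest -m eval` on a schedule:

```python
# conftest.py
def pytest_configure(config):
    # Register custom markers so pytest doesn't warn on unknown marks
    config.addinivalue_line("markers", "eval: slow or paid runs against a live model")
    config.addinivalue_line("markers", "regression: pins previously fixed bugs")
```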

Testing ML systems requires both traditional software testing and ML-specific evaluation — combining deterministic unit tests with eval sets, mocking for reproducibility, and LLM-as-judge for quality assessment ensures reliable systems despite the inherent non-determinism of language models.
