Testing best practices for ML applications involve systematic validation of code, models, and system behavior — combining traditional software testing (unit, integration) with ML-specific approaches (eval sets, LLM-as-judge, deterministic mocking) to ensure reliability in systems where outputs are often non-deterministic and quality is subjective.
Why Testing ML Systems Is Different
- Non-Determinism: Same input can produce different outputs.
- Subjectivity: "Good" responses are often judgment calls.
- Expensive Operations: API calls cost money and time.
- Model Behavior: Changes with updates, fine-tuning.
- Edge Cases: Vast input space makes coverage difficult.
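Non-determinism in particular changes how assertions are written: instead of exact-match comparisons, a test can assert properties that any acceptable output must satisfy. A minimal sketch (the summary strings and the `check_summary_properties` helper are illustrative stand-ins, not a real API):

```python
def check_summary_properties(summary: str, source: str) -> bool:
    """Check properties any acceptable summary should satisfy,
    rather than comparing against one exact expected string."""
    return (
        0 < len(summary) < len(source)           # shorter than the source
        and not summary.strip().endswith("...")  # not visibly truncated
    )

# Two different (non-deterministic) outputs can both pass:
source = "The quick brown fox jumps over the lazy dog. " * 5
assert check_summary_properties("A fox jumps over a dog.", source)
assert check_summary_properties("A story about a fox and a dog.", source)
```

Property-based checks like this are the unit-test analogue of the eval-set validators shown later.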
Test Pyramid for ML
```
              /\
             /  \
            /E2E \          Few, slow, expensive
           /      \         - Full pipeline tests
          /--------\
         /Integration\      Some, moderate cost
        /            \      - Component interactions
       /--------------\
      /   Unit Tests   \    Many, fast, cheap
     /                  \   - Functions, classes
    /--------------------\
   /  Model Evaluations   \   Regular, systematic
  /________________________\  - Eval sets, benchmarks
```
Unit Testing
Standard Python Tests:
```python
import pytest

def test_tokenizer_splits_correctly():
    result = tokenize("hello world")
    assert result == ["hello", "world"]

def test_prompt_template_formats():
    template = "Answer: {question}"
    result = format_prompt(template, question="Why?")
    assert result == "Answer: Why?"

def test_sanitize_input_removes_injection():
    dangerous = "ignore previous instructions"
    result = sanitize_input(dangerous)
    assert "ignore" not in result.lower()
```
Testing with Fixtures:
```python
@pytest.fixture
def sample_documents():
    return [
        {"id": 1, "content": "First document"},
        {"id": 2, "content": "Second document"},
    ]

def test_embedding_produces_vectors(sample_documents):
    embeddings = embed_documents(sample_documents)
    assert len(embeddings) == 2
    assert len(embeddings[0]) == 1536  # Vector dimension
```
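Fixtures pair well with `pytest.mark.parametrize` for covering many input variants in one test. A sketch, with a minimal stand-in for the `format_prompt` helper tested above:

```python
import pytest

def format_prompt(template: str, **kwargs) -> str:
    # Minimal stand-in for the prompt-formatting helper under test
    return template.format(**kwargs)

@pytest.mark.parametrize("question,expected", [
    ("Why?", "Answer: Why?"),
    ("How?", "Answer: How?"),
    ("", "Answer: "),  # edge case: empty question
])
def test_prompt_template_variants(question, expected):
    assert format_prompt("Answer: {question}", question=question) == expected
```

Parametrization keeps edge cases (empty strings, unusual characters) visible in one table instead of scattered across near-duplicate tests.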
Mocking LLM Calls
Mock for Deterministic Tests:
```python
from unittest.mock import patch, MagicMock

@patch('openai.ChatCompletion.create')
def test_chat_wrapper_returns_content(mock_create):
    # Set up the mock response to mirror the API's shape
    mock_create.return_value = MagicMock(
        choices=[MagicMock(
            message=MagicMock(content="Mocked response")
        )]
    )
    result = call_llm("Test prompt")
    assert result == "Mocked response"
    mock_create.assert_called_once()
```
Fixture-Based Mocking:
```python
@pytest.fixture
def mock_llm():
    responses = {
        "greeting": "Hello! How can I help?",
        "farewell": "Goodbye!",
    }
    def get_response(prompt):
        for key, response in responses.items():
            if key in prompt.lower():
                return response
        return "Default response"
    return get_response
```
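A test then consumes `mock_llm` as a regular argument and exercises routing logic with no network call. The same keyword-routing idea can be sketched as a plain factory, independent of pytest (`make_mock_llm` is a hypothetical name, not part of any library):

```python
def make_mock_llm(responses: dict):
    """Build a fake LLM that routes on keywords found in the prompt."""
    def get_response(prompt: str) -> str:
        for key, response in responses.items():
            if key in prompt.lower():
                return response
        return "Default response"
    return get_response

mock_llm = make_mock_llm({
    "greeting": "Hello! How can I help?",
    "farewell": "Goodbye!",
})

assert mock_llm("Send a Greeting please") == "Hello! How can I help?"
assert mock_llm("unrelated prompt") == "Default response"
```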
Model/Output Evaluation
Eval Sets:
```python
eval_cases = [
    {
        "input": "What is 2+2?",
        "expected_contains": ["4"],
        "category": "math",
    },
    {
        "input": "List three primary colors",
        "validator": lambda r: len(extract_list(r)) == 3,
        "category": "instruction-following",
    },
    {
        "input": "Write in formal tone: hi",
        "expected_not_contains": ["hi", "hey"],
        "category": "style",
    },
]

def run_eval(llm_function, cases=eval_cases):
    results = []
    for case in cases:
        response = llm_function(case["input"])
        passed = validate_response(response, case)
        results.append({
            "case": case,
            "response": response,
            "passed": passed,
        })
    return results
```
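`run_eval` assumes a `validate_response` helper. One way to sketch it is to dispatch on whichever check each eval case declares (field names match the cases above; this is an illustrative implementation, not a library function):

```python
def validate_response(response: str, case: dict) -> bool:
    """Apply whichever checks the eval case declares; all must pass."""
    lowered = response.lower()
    if "expected_contains" in case:
        if not all(s.lower() in lowered for s in case["expected_contains"]):
            return False
    if "expected_not_contains" in case:
        if any(s.lower() in lowered for s in case["expected_not_contains"]):
            return False
    if "validator" in case:
        if not case["validator"](response):
            return False
    return True

assert validate_response("The answer is 4.", {"expected_contains": ["4"]})
assert not validate_response("hi there", {"expected_not_contains": ["hi", "hey"]})
```

Simple substring checks like these are deliberately crude; the `validator` hook exists for cases that need real parsing.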
LLM-as-Judge:
```python
def llm_judge(prompt, response, criteria):
    judge_prompt = f"""
Evaluate this response on a scale of 1-5:

User prompt: {prompt}
Response: {response}
Criteria: {criteria}

Score (1-5) and brief justification:
"""
    judgment = call_judge_llm(judge_prompt)
    score = extract_score(judgment)
    return score
```
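The `extract_score` helper has to parse a free-form judgment. A regex-based sketch (an assumption about how such a helper might look, tolerant of formats like "Score: 4" or "3/5"):

```python
import re

def extract_score(judgment):
    """Pull the first standalone 1-5 rating out of a free-form judgment.
    Returns None when no score can be found."""
    match = re.search(r"\b([1-5])\b", judgment)
    return int(match.group(1)) if match else None

assert extract_score("Score: 4. Concise and accurate.") == 4
assert extract_score("I'd rate this 3/5.") == 3
assert extract_score("No numeric score given") is None
```

A production version would need to be more robust, e.g. against the judge echoing the "1-5" scale back in its answer; asking the judge for structured (JSON) output avoids the parsing problem entirely.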
Integration Testing
RAG Pipeline Test:
```python
def test_rag_pipeline_returns_relevant_answer():
    # Setup
    docs = ["Paris is the capital of France."]
    index_documents(docs)

    # Execute
    response = rag_query("What is the capital of France?")

    # Verify
    assert "Paris" in response
    assert response_cites_source(response)
```
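For integration tests that must run offline, the retrieval step can be a toy in-memory index. A self-contained sketch (all names here are stand-ins, not the pipeline's real API):

```python
import re

def words(text):
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(index, query, top_k=1):
    """Rank documents by naive word overlap with the query."""
    q = words(query)
    ranked = sorted(index, key=lambda d: -len(q & words(d)))
    return ranked[:top_k]

def rag_answer(index, query):
    context = retrieve(index, query)[0]
    # A real pipeline would pass `context` to the LLM; here we echo it.
    return f"Based on the source: {context}"

index = ["Paris is the capital of France.",
         "Berlin is the capital of Germany."]
answer = rag_answer(index, "What is the capital of France?")
assert "Paris" in answer
```

This keeps the pipeline's shape (index, retrieve, generate) testable end-to-end without network access; a separate, slower test suite can cover the real retriever and model.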
API Integration Test:
```python
from fastapi.testclient import TestClient
from app import app

client = TestClient(app)

def test_chat_endpoint_returns_response():
    response = client.post(
        "/v1/chat",
        json={"message": "Hello"},
    )
    assert response.status_code == 200
    assert "content" in response.json()
```
Best Practices
Test Categories:
```
Category        | What to Test
----------------|----------------------------------
Correctness     | Logic works as expected
Edge Cases      | Boundary conditions, empty input
Error Handling  | Graceful failures, error messages
Performance     | Latency, throughput baseline
Security        | Injection resistance, auth
Regression      | Previously fixed bugs stay fixed
```
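Regression tests are ordinary tests pinned to a specific past bug. A sketch around a hypothetical sanitization helper and a hypothetical bug (neither is from a real codebase):

```python
def strip_control_chars(text: str) -> str:
    # Hypothetical helper that once crashed on embedded NUL bytes
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")

def test_regression_null_bytes_do_not_crash():
    # Previously raised an exception; must keep returning cleanly
    assert strip_control_chars("hello\x00world") == "helloworld"
```

Naming the test after the failure mode keeps the suite self-documenting when the bug tracker is long gone.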
Coverage Goals:
```
Component        | Target Coverage
-----------------|------------------
Utility functions| 90%+
Business logic   | 80%+
API endpoints    | 70%+
LLM interactions | Eval-based
```
Testing ML systems requires both traditional software testing and ML-specific evaluation. Combining deterministic unit tests with eval sets, mocking for reproducibility, and LLM-as-judge for quality assessment yields reliable systems despite the inherent non-determinism of language models.