Experiment tracking tools such as Weights & Biases (W&B) and MLflow provide systematic logging of ML experiments, recording hyperparameters, metrics, model artifacts, and visualizations so that training runs can be reproduced, compared, and shared across a team.
Why Experiment Tracking Matters
- Reproducibility: Know exactly how a model was trained.
- Comparison: Find best configuration among experiments.
- Collaboration: Share results with team members.
- Debugging: Understand why experiments fail.
- Compliance: Audit trail for model development.
Key Concepts
What to Track:
```
Category         | Examples
-----------------|-----------------------------------
Hyperparameters  | Learning rate, batch size, epochs
Metrics          | Loss, accuracy, F1, custom metrics
Artifacts        | Model checkpoints, plots
Code             | Git commit, dependencies
Data             | Dataset version, splits
Environment      | GPU type, library versions
```
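Code and environment details are the easiest rows to forget. Below is a minimal sketch of capturing them at run start, assuming a git checkout, PyTorch, and the `wandb` client; the `environment_info` helper is illustrative, not a library API.
```python
import subprocess
import sys

import torch
import wandb

def environment_info():
    # Illustrative helper: collect code and environment metadata for the run config
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    return {
        "git_commit": commit,
        "python_version": sys.version.split()[0],
        "torch_version": torch.__version__,
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
    }

wandb.init(
    project="my-llm-project",  # project name reused from the examples below
    config={"learning_rate": 1e-4, **environment_info()},
)
```
W&B also records the git commit automatically when a run starts inside a repository, so explicit capture like this mainly matters for fields the tool does not pick up on its own.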
Weights & Biases (W&B)
Basic Setup:
```python
import wandb

# Initialize a run; config values are stored with the run and shown in the UI
wandb.init(
    project="my-llm-project",
    config={
        "learning_rate": 1e-4,
        "batch_size": 32,
        "epochs": 10,
        "model": "gpt2",
    },
)
config = wandb.config

# Training loop
for epoch in range(config.epochs):
    loss = train_epoch()    # placeholder training step
    accuracy = evaluate()   # placeholder evaluation

    # Log metrics for this epoch
    wandb.log({
        "epoch": epoch,
        "loss": loss,
        "accuracy": accuracy,
    })

# Finish the run
wandb.finish()
```
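For PyTorch models it can also be useful to track gradients and parameter histograms. A one-line sketch, assuming `model` is a `torch.nn.Module` defined elsewhere:
```python
# Log gradients and parameter histograms every 100 steps
# (`model` is assumed to be an existing torch.nn.Module)
wandb.watch(model, log="all", log_freq=100)
```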
Advanced W&B Features:
```python
# Log artifacts
artifact = wandb.Artifact("model", type="model")
artifact.add_file("model.pt")
wandb.log_artifact(artifact)

# Log tables
table = wandb.Table(columns=["input", "output", "label"])
for item in eval_data:
    table.add_data(item.input, item.output, item.label)
wandb.log({"predictions": table})

# Log custom plots
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(
    probs=probs, y_true=labels
)})

# Hyperparameter sweeps
sweep_config = {
    "method": "bayes",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [16, 32, 64]},
    },
}
sweep_id = wandb.sweep(sweep_config)
wandb.agent(sweep_id, train_function)
```
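The sweep snippet above references a `train_function` that is not shown. A rough sketch of what it could look like, reusing the placeholder `train_epoch`/`evaluate` helpers from the basic example: `wandb.agent` starts one run per trial and delivers the sampled values through the run's config.
```python
def train_function():
    # One sweep trial: wandb.agent calls this once per sampled configuration
    with wandb.init() as run:
        lr = run.config.learning_rate    # values sampled by the sweep
        batch_size = run.config.batch_size
        for epoch in range(5):
            loss = train_epoch()         # placeholder; wire lr/batch_size into your training code
            accuracy = evaluate()        # placeholder evaluation
            run.log({"epoch": epoch, "loss": loss, "accuracy": accuracy})
```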
MLflow
Basic Setup:
```python
import mlflow

# Set tracking URI
mlflow.set_tracking_uri("http://localhost:5000")

# Start run
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("batch_size", 32)

    # Training
    for epoch in range(epochs):
        loss = train_epoch()
        mlflow.log_metric("loss", loss, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts
    mlflow.log_artifact("config.yaml")
```
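For supported frameworks (scikit-learn, PyTorch Lightning, Keras, XGBoost, and others), `mlflow.autolog()` can replace much of the manual logging above. A minimal sketch, assuming a scikit-learn estimator rather than the PyTorch setup used elsewhere in this section:
```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Automatically log params, metrics, and the fitted model for supported libraries
mlflow.autolog()

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
with mlflow.start_run():
    LogisticRegression(max_iter=200).fit(X, y)  # the fit call is captured by autologging
```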
MLflow Model Registry:
```python
# Register model
mlflow.register_model(
    f"runs:/{run_id}/model",
    "production-model",
)

# Transition model stage
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="production-model",
    version=1,
    stage="Production",
)

# Load production model
model = mlflow.pyfunc.load_model(
    model_uri="models:/production-model/Production"
)
```
Comparison
```
Feature             | W&B           | MLflow
--------------------|---------------|----------------
Hosting             | Cloud/Self    | Self-hosted
Visualizations      | Excellent     | Good
Collaboration       | Built-in      | Manual setup
Artifact tracking   | Yes           | Yes
Model registry      | Yes           | Yes
Sweeps/Search       | Built-in      | Basic
LLM evaluations     | Yes           | Limited
Pricing             | Freemium      | Open source
```
Best Practices
Naming Conventions:
```python
# Clear run names
lr = 1e-4  # example value used in the run name
wandb.init(
    project="llm-finetune",
    name=f"llama-lora-r16-lr{lr}",
    tags=["lora", "llama", "production"],
)
```
Config Management:
```python
# Use structured configs
config = {
    "model": {
        "name": "llama-3.1-8b",
        "quantization": "4bit",
    },
    "training": {
        "learning_rate": 1e-4,
        "batch_size": 16,
    },
    "data": {
        "dataset": "my-instructions",
        "version": "v2",
    },
}
wandb.init(config=config)
```
Artifact Versioning:
```python
# Always version data and models
artifact = wandb.Artifact(
    f"training-data-{date}",
    type="dataset",
    metadata={"rows": len(data), "source": "internal"},
)
artifact.add_dir("data/")   # attach the actual files (illustrative path)
wandb.log_artifact(artifact)
```
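Versioned artifacts pay off when they are consumed later. A short sketch of pulling a versioned dataset into a downstream run; the artifact name is illustrative:
```python
# Consume a previously logged dataset artifact in a later run
run = wandb.init(project="llm-finetune", job_type="train")
artifact = run.use_artifact("training-data-2024-06-01:latest")  # illustrative name and alias
data_dir = artifact.download()  # local directory containing the versioned files
```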
Experiment tracking is essential infrastructure for serious ML work — without systematic logging, teams lose hours recreating experiments, can't compare approaches fairly, and struggle to reproduce their best results.