Experiment tracking tools such as Weights & Biases (W&B) and MLflow provide systematic logging of ML experiments, recording hyperparameters, metrics, model artifacts, and visualizations so that training runs can be reproduced, compared, and shared across a team.
Why Experiment Tracking Matters
- Reproducibility: Know exactly how a model was trained.
- Comparison: Find best configuration among experiments.
- Collaboration: Share results with team members.
- Debugging: Understand why experiments fail.
- Compliance: Audit trail for model development.
Key Concepts
What to Track:
```
Category         | Examples
-----------------|-----------------------------------
Hyperparameters  | Learning rate, batch size, epochs
Metrics          | Loss, accuracy, F1, custom metrics
Artifacts        | Model checkpoints, plots
Code             | Git commit, dependencies
Data             | Dataset version, splits
Environment      | GPU type, library versions
```
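Code and environment details are the easiest rows to forget. Below is a minimal sketch of capturing them at run start, assuming a git checkout, PyTorch, and the `wandb` client; the `environment_info` helper is illustrative, not a library API.
```python
import subprocess
import sys

import torch
import wandb

def environment_info():
    # Illustrative helper: collect code and environment metadata for the run config
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    return {
        "git_commit": commit,
        "python_version": sys.version.split()[0],
        "torch_version": torch.__version__,
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
    }

wandb.init(
    project="my-llm-project",  # project name reused from the examples below
    config={"learning_rate": 1e-4, **environment_info()},
)
```
W&B also records the git commit automatically when a run starts inside a repository, so explicit capture like this mainly matters for fields the tool does not pick up on its own.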
Weights & Biases (W&B)
Basic Setup:
```python
import wandb

# Initialize a run; config values are stored with the run and shown in the UI
wandb.init(
    project="my-llm-project",
    config={
        "learning_rate": 1e-4,
        "batch_size": 32,
        "epochs": 10,
        "model": "gpt2",
    },
)
config = wandb.config

# Training loop
for epoch in range(config.epochs):
    loss = train_epoch()    # placeholder training step
    accuracy = evaluate()   # placeholder evaluation

    # Log metrics for this epoch
    wandb.log({
        "epoch": epoch,
        "loss": loss,
        "accuracy": accuracy,
    })

# Finish the run
wandb.finish()
```
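For PyTorch models it can also be useful to track gradients and parameter histograms. A one-line sketch, assuming `model` is a `torch.nn.Module` defined elsewhere:
```python
# Log gradients and parameter histograms every 100 steps
# (`model` is assumed to be an existing torch.nn.Module)
wandb.watch(model, log="all", log_freq=100)
```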
Advanced W&B Features:
```python
# Log artifacts
artifact = wandb.Artifact("model", type="model")
artifact.add_file("model.pt")
wandb.log_artifact(artifact)

# Log tables
table = wandb.Table(columns=["input", "output", "label"])
for item in eval_data:
    table.add_data(item.input, item.output, item.label)
wandb.log({"predictions": table})

# Log custom plots
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(
    probs=probs, y_true=labels
)})

# Hyperparameter sweeps
sweep_config = {
    "method": "bayes",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [16, 32, 64]},
    },
}
sweep_id = wandb.sweep(sweep_config)
wandb.agent(sweep_id, train_function)
```
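The sweep snippet above references a `train_function` that is not shown. A rough sketch of what it could look like, reusing the placeholder `train_epoch`/`evaluate` helpers from the basic example: `wandb.agent` starts one run per trial and delivers the sampled values through the run's config.
```python
def train_function():
    # One sweep trial: wandb.agent calls this once per sampled configuration
    with wandb.init() as run:
        lr = run.config.learning_rate    # values sampled by the sweep
        batch_size = run.config.batch_size
        for epoch in range(5):
            loss = train_epoch()         # placeholder; wire lr/batch_size into your training code
            accuracy = evaluate()        # placeholder evaluation
            run.log({"epoch": epoch, "loss": loss, "accuracy": accuracy})
```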
MLflow
Basic Setup:
```python
import mlflow

# Set tracking URI
mlflow.set_tracking_uri("http://localhost:5000")

# Start run
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("batch_size", 32)

    # Training
    for epoch in range(epochs):
        loss = train_epoch()
        mlflow.log_metric("loss", loss, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts
    mlflow.log_artifact("config.yaml")
```
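For supported frameworks (scikit-learn, PyTorch Lightning, Keras, XGBoost, and others), `mlflow.autolog()` can replace much of the manual logging above. A minimal sketch, assuming a scikit-learn estimator rather than the PyTorch setup used elsewhere in this section:
```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Automatically log params, metrics, and the fitted model for supported libraries
mlflow.autolog()

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
with mlflow.start_run():
    LogisticRegression(max_iter=200).fit(X, y)  # the fit call is captured by autologging
```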
MLflow Model Registry:
```python
# Register model
mlflow.register_model(
    f"runs:/{run_id}/model",
    "production-model",
)

# Transition model stage
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="production-model",
    version=1,
    stage="Production",
)

# Load production model
model = mlflow.pyfunc.load_model(
    model_uri="models:/production-model/Production"
)
```
Comparison
```
Feature             | W&B           | MLflow
--------------------|---------------|----------------
Hosting             | Cloud/Self    | Self-hosted
Visualizations      | Excellent     | Good
Collaboration       | Built-in      | Manual setup
Artifact tracking   | Yes           | Yes
Model registry      | Yes           | Yes
Sweeps/Search       | Built-in      | Basic
LLM evaluations     | Yes           | Limited
Pricing             | Freemium      | Open source
```
Best Practices
Naming Conventions:
```python
# Clear run names
lr = 1e-4  # example value used in the run name
wandb.init(
    project="llm-finetune",
    name=f"llama-lora-r16-lr{lr}",
    tags=["lora", "llama", "production"],
)
```
Config Management:
```python
# Use structured configs
config = {
    "model": {
        "name": "llama-3.1-8b",
        "quantization": "4bit",
    },
    "training": {
        "learning_rate": 1e-4,
        "batch_size": 16,
    },
    "data": {
        "dataset": "my-instructions",
        "version": "v2",
    },
}
wandb.init(config=config)
```
Artifact Versioning:
```python
# Always version data and models
artifact = wandb.Artifact(
    f"training-data-{date}",
    type="dataset",
    metadata={"rows": len(data), "source": "internal"},
)
artifact.add_dir("data/")   # attach the actual files (illustrative path)
wandb.log_artifact(artifact)
```
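Versioned artifacts pay off when they are consumed later. A short sketch of pulling a versioned dataset into a downstream run; the artifact name is illustrative:
```python
# Consume a previously logged dataset artifact in a later run
run = wandb.init(project="llm-finetune", job_type="train")
artifact = run.use_artifact("training-data-2024-06-01:latest")  # illustrative name and alias
data_dir = artifact.download()  # local directory containing the versioned files
```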
Experiment tracking is essential infrastructure for serious ML work — without systematic logging, teams lose hours recreating experiments, can't compare approaches fairly, and struggle to reproduce their best results.