Experiment tracking

Keywords: experiment tracking, wandb, mlflow, logging, hyperparameters, metrics, reproducibility

Experiment tracking with tools like Weights & Biases (W&B) and MLflow enables systematic logging of ML experiments — recording hyperparameters, metrics, model artifacts, and visualizations to enable reproducibility, comparison, and collaboration across training runs and team members.

Why Experiment Tracking Matters

- Reproducibility: Know exactly how a model was trained.
- Comparison: Find the best configuration among experiments.
- Collaboration: Share results with team members.
- Debugging: Understand why experiments fail.
- Compliance: Audit trail for model development.

Key Concepts

What to Track:
```
Category        | Examples
----------------|-----------------------------------
Hyperparameters | Learning rate, batch size, epochs
Metrics         | Loss, accuracy, F1, custom metrics
Artifacts       | Model checkpoints, plots
Code            | Git commit, dependencies
Data            | Dataset version, splits
Environment     | GPU type, library versions
```
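Concretely, one tracked run covering these categories reduces to a structured record. The stdlib-only sketch below shows the shape of that data; the `build_run_record` helper and its field names are illustrative, not part of any tracker's API. W&B and MLflow capture the environment and code categories largely automatically.

```python
import json
import platform
import sys

def build_run_record(hyperparams, metrics):
    """Assemble one experiment record covering the categories
    in the table above (code/data versions omitted for brevity)."""
    return {
        "hyperparameters": hyperparams,   # learning rate, batch size, ...
        "metrics": metrics,               # loss, accuracy, ...
        "environment": {                  # interpreter and platform versions
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }

record = build_run_record(
    hyperparams={"learning_rate": 1e-4, "batch_size": 32},
    metrics={"loss": 0.42, "accuracy": 0.91},
)
print(json.dumps(record, indent=2))
```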

Weights & Biases (W&B)

Basic Setup:
```python
import wandb

# Initialize run
wandb.init(
    project="my-llm-project",
    config={
        "learning_rate": 1e-4,
        "batch_size": 32,
        "epochs": 10,
        "model": "gpt2",
    },
)

# Training loop (config values are read back from wandb.config)
for epoch in range(wandb.config.epochs):
    loss = train_epoch()
    accuracy = evaluate()

    # Log metrics
    wandb.log({
        "epoch": epoch,
        "loss": loss,
        "accuracy": accuracy,
    })

# Finish run
wandb.finish()
```

Advanced W&B Features:
```python
# Log artifacts
artifact = wandb.Artifact("model", type="model")
artifact.add_file("model.pt")
wandb.log_artifact(artifact)

# Log tables
table = wandb.Table(columns=["input", "output", "label"])
for item in eval_data:
    table.add_data(item.input, item.output, item.label)
wandb.log({"predictions": table})

# Log custom plots
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(
    probs=probs, y_true=labels
)})

# Hyperparameter sweeps
sweep_config = {
    "method": "bayes",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [16, 32, 64]},
    },
}
sweep_id = wandb.sweep(sweep_config)
wandb.agent(sweep_id, train_function)
```

MLflow

Basic Setup:
```python
import mlflow

# Set tracking URI
mlflow.set_tracking_uri("http://localhost:5000")

# Start run
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("batch_size", 32)

    # Training
    for epoch in range(epochs):
        loss = train_epoch()
        mlflow.log_metric("loss", loss, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts
    mlflow.log_artifact("config.yaml")
```

MLflow Model Registry:
```python
# Register model
mlflow.register_model(
    f"runs:/{run_id}/model",
    "production-model"
)

# Transition model stage
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="production-model",
    version=1,
    stage="Production"
)

# Load production model
model = mlflow.pyfunc.load_model(
    model_uri="models:/production-model/Production"
)
```

Comparison

```
Feature           | W&B        | MLflow
------------------|------------|----------------
Hosting           | Cloud/Self | Self-hosted
Visualizations    | Excellent  | Good
Collaboration     | Built-in   | Manual setup
Artifact tracking | Yes        | Yes
Model registry    | Yes        | Yes
Sweeps/Search     | Built-in   | Basic
LLM evaluations   | Yes        | Limited
Pricing           | Freemium   | Open source
```
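Whichever tool you choose, exported run data reduces to the same shape — one (config, metrics) record per run — so comparison logic stays portable across trackers. A minimal sketch; the `best_run` helper is hypothetical, not part of either library's API:

```python
def best_run(runs, metric, maximize=True):
    """Return the run with the best value for `metric` —
    the programmatic equivalent of sorting the runs table
    in the W&B or MLflow UI."""
    pick = max if maximize else min
    return pick(runs, key=lambda r: r["metrics"][metric])

runs = [
    {"config": {"learning_rate": 1e-4}, "metrics": {"accuracy": 0.89}},
    {"config": {"learning_rate": 3e-4}, "metrics": {"accuracy": 0.93}},
    {"config": {"learning_rate": 1e-3}, "metrics": {"accuracy": 0.85}},
]
print(best_run(runs, "accuracy")["config"])  # → {'learning_rate': 0.0003}
```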

Best Practices

Naming Conventions:
```python
# Clear run names
wandb.init(
    project="llm-finetune",
    name=f"llama-lora-r16-lr{lr}",
    tags=["lora", "llama", "production"],
)
```

Config Management:
```python
# Use structured configs
config = {
    "model": {
        "name": "llama-3.1-8b",
        "quantization": "4bit",
    },
    "training": {
        "learning_rate": 1e-4,
        "batch_size": 16,
    },
    "data": {
        "dataset": "my-instructions",
        "version": "v2",
    },
}
wandb.init(config=config)
```

Artifact Versioning:
```python
# Always version data and models
artifact = wandb.Artifact(
    f"training-data-{date}",
    type="dataset",
    metadata={"rows": len(data), "source": "internal"},
)
wandb.log_artifact(artifact)
```

Experiment tracking is essential infrastructure for serious ML work — without systematic logging, teams lose hours recreating experiments, can't compare approaches fairly, and struggle to reproduce their best results.
