MLflow Tracking

Keywords: mlflow tracking, experiment, log

MLflow Tracking is the open-source experiment logging system that records parameters, metrics, code versions, and model artifacts for every ML training run. By keeping a permanent, searchable record of the hyperparameters, data, and code that produced each model, it tackles machine learning's reproducibility problem and lets teams compare runs, reproduce results, and understand what actually makes models perform better.

What Is MLflow Tracking?

- Definition: The experiment tracking component of MLflow (open-source ML lifecycle platform created by Databricks in 2018) — a logging API and UI that records everything relevant to a model training run: hyperparameters (config), evaluation metrics (loss, accuracy), model artifacts (saved weights), and source code version (Git commit hash).
- Runs and Experiments: An Experiment is a named collection of related Runs. A Run is a single execution of your training code — MLflow tracks when it started, how long it took, what parameters were set, what metrics were logged, and what artifacts were saved.
- Automatic Logging (autolog): One line of code — mlflow.autolog() — automatically captures framework-specific information from PyTorch, TensorFlow, scikit-learn, XGBoost, LightGBM, and others without any manual log statements.
- Backend Stores: MLflow stores run metadata in a backend (SQLite for local use, PostgreSQL/MySQL for team use) and artifacts in a storage location (local filesystem, S3, GCS, Azure Blob); the same API works whether running locally or on a shared team server (see the configuration sketch after this list).
- Model Registry: An extension of tracking — promote the best run's model to the Model Registry with versioning, staging (Staging → Production), and deployment annotations.
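
A minimal configuration sketch, assuming a hypothetical tracking server URL (mlflow.internal:5000) and example server flags; only the tracking URI changes between local and remote use:

import mlflow

# Point the client at a shared tracking server (hypothetical URL).
# The server itself would be started separately, e.g.:
#   mlflow server --backend-store-uri postgresql://user:pass@db-host/mlflow \
#                 --default-artifact-root s3://my-bucket/mlflow-artifacts
mlflow.set_tracking_uri("http://mlflow.internal:5000")

# The rest of the API is identical to local use
mlflow.set_experiment("llm-fine-tuning")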

Why MLflow Tracking Matters for AI

- Reproducibility: Without tracking, reproducing a model that got 95% accuracy six months ago requires hoping someone documented the exact learning rate, batch size, data version, and random seed. MLflow makes this automatic.
- Experiment Comparison: The MLflow UI enables sorting runs by any metric — find the hyperparameter combination that minimized validation loss across 100 training runs in seconds rather than digging through log files.
- Team Collaboration: Shared MLflow server (PostgreSQL backend + S3 artifacts) gives the entire ML team visibility into experiments — a new team member can browse all prior experiments to understand what approaches have been tried.
- Model Lineage: Every registered model links back to its training run, which links to the Git commit, data version, and environment, giving complete lineage from raw data to production model artifact (see the retrieval sketch after this list).
- Framework Agnostic: Same API for PyTorch, TensorFlow, scikit-learn, HuggingFace Transformers, XGBoost — one tracking system for all ML frameworks, not separate logging per framework.
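
A sketch of reading a run's recorded context back through the client API, using a placeholder run ID; MLflow sets the mlflow.source.git.commit tag when training is launched from a Git repository:

import mlflow

client = mlflow.tracking.MlflowClient()
run = client.get_run("abc123def456")  # placeholder run ID

print(run.data.params)   # hyperparameters logged for the run
print(run.data.metrics)  # final metric values
print(run.data.tags.get("mlflow.source.git.commit"))  # Git commit, if captured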

MLflow Tracking Core API

Manual Logging:
import mlflow
import mlflow.pytorch

mlflow.set_experiment("llm-fine-tuning")

with mlflow.start_run(run_name="llama-3-8b-lora-v2"):
    # Log hyperparameters
    mlflow.log_params({
        "model": "meta-llama/Llama-3-8B",
        "learning_rate": 2e-4,
        "lora_rank": 16,
        "batch_size": 8,
        "epochs": 3
    })

    # Training loop (train_epoch and evaluate are user-defined helpers)
    for epoch in range(3):
        train_loss = train_epoch(model, train_loader)
        val_loss = evaluate(model, val_loader)

        # Log metrics once per epoch
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss
        }, step=epoch)

    # Log final model artifact
    mlflow.pytorch.log_model(model, "fine-tuned-llama")
    mlflow.log_artifact("training_config.yaml")

Automatic Logging:
import mlflow
mlflow.autolog()  # Enables autologging for every supported framework that is installed

# Hugging Face's Trainer also reports to MLflow through its built-in
# MLflowCallback integration when MLflow is installed
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()
# Parameters, metrics, and model info are logged without manual mlflow calls
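
A self-contained sketch of the same idea with scikit-learn and a toy dataset; mlflow.sklearn.autolog() is the framework-specific counterpart of the generic mlflow.autolog():

import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.sklearn.autolog()  # scikit-learn-specific autologging

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
with mlflow.start_run():
    # Fitting inside a run logs params, training metrics, and the model automatically
    RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)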

Model Registration:
# Register the best run's model (run ID copied from the UI or a search)
run_id = "abc123def456"
result = mlflow.register_model(f"runs:/{run_id}/fine-tuned-llama", "production-llm")

# Transition the newly created version to the Production stage
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage("production-llm", version=result.version, stage="Production")
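
A registered model can then be loaded back by name and stage for inference; a minimal sketch, where input_data is a placeholder for whatever input the model expects:

import mlflow.pyfunc

# Load the current Production version of the registered model
model = mlflow.pyfunc.load_model("models:/production-llm/Production")
predictions = model.predict(input_data)  # input_data is a placeholder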

Querying Experiments Programmatically:
runs = mlflow.search_runs(
    experiment_names=["llm-fine-tuning"],
    filter_string="metrics.val_loss < 0.5 AND params.lora_rank = '16'",
    order_by=["metrics.val_loss ASC"]
)
best_run = runs.iloc[0]
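
search_runs returns a pandas DataFrame (one row per run, with columns prefixed by metrics. and params.), so the best run's logged values can be read by column name; a brief usage sketch:

print(best_run["run_id"])
print(best_run["metrics.val_loss"])
print(best_run["params.learning_rate"])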

MLflow UI Features (launched locally with the mlflow ui command, or served by a shared tracking server):
- Compare multiple runs side-by-side with metric charts
- Filter runs by parameter values and metric thresholds
- View artifact files directly in the browser
- Diff hyperparameters between runs to identify what changed

MLflow Tracking vs Alternatives

| Tool | Open Source | Hosted Option | UI Quality | Auto-Logging | Best For |
|------|------------|--------------|---------|-------------|---------|
| MLflow | Yes (self-host) | Databricks | Good | Excellent | Teams wanting self-hosted |
| W&B | No (SaaS) | W&B Cloud | Excellent | Excellent | Research teams, collaboration |
| Neptune.ai | No (SaaS) | Neptune Cloud | Good | Good | Enterprise metadata |
| Comet ML | Partial | Comet Cloud | Good | Good | HPO visualization |

MLflow Tracking is the open-source standard for experiment logging, bringing reproducibility and accountability to machine learning. By automatically capturing the complete context of every training run (parameters, metrics, code, environment, and artifacts) in a searchable, comparable format, it turns chaotic model development into a systematic engineering practice where insights accumulate and results can be reproduced.
