[Data Science] From Zero to “Ah-ha!”: A Practical MLflow Tutorial (with Bite-Sized Examples)
This post is a hands-on, copy-pasteable guide for turning your experiments into clean, queryable, and reproducible assets with MLflow.
We’ll start small, then layer on features—params, metrics, artifacts, datasets, model signatures, the Model Registry, and serving.
0) One-time setup
Install the basics (per project virtualenv recommended):
pip install -U mlflow scikit-learn pandas matplotlib pyarrow
Point your code at your tracking server (replace with your URL):
export MLFLOW_TRACKING_URI="http://<your-vps>:5000"
# If you enabled basic auth:
# export MLFLOW_TRACKING_USERNAME="user"
# export MLFLOW_TRACKING_PASSWORD="pass"
Quick sanity check (should return the server URI, not file:///...):
python -c "import mlflow; print(mlflow.get_tracking_uri())"
1) The tiniest experiment: params, metrics, artifact
Goal: create a run that records a parameter, a metric series, and a small file.
# 01_minimal.py
import time
from pathlib import Path
import mlflow
mlflow.set_experiment("blog-mlflow-basics")
with mlflow.start_run(run_name=f"hello-{int(time.time())}"):
mlflow.log_param("model_family", "baseline")
for step, val in enumerate([0.71, 0.74, 0.76, 0.78]):
mlflow.log_metric("accuracy", val, step=step)
time.sleep(0.05)
Path("artifacts").mkdir(exist_ok=True)
Path("artifacts/notes.txt").write_text("First run ✅\n")
mlflow.log_artifact("artifacts/notes.txt", artifact_path="notes")
print("Run:", mlflow.active_run().info.run_id)
Check in the UI: Experiments → blog-mlflow-basics → (your run)
You should see Parameters, Metrics (with a line chart), and Artifacts (notes/notes.txt).
2) A real model: sklearn + confusion matrix figure
Goal: train, log params/metrics, and store a figure as artifact.
# 02_sklearn_and_figure.py
import mlflow, pandas as pd, matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from pathlib import Path
mlflow.set_experiment("blog-mlflow-basics")
iris = datasets.load_iris(as_frame=True)
X = iris.frame[iris.feature_names]
y = iris.frame["target"].astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
params = {"n_estimators": 200, "max_depth": 3, "random_state": 42}
with mlflow.start_run(run_name="rf-iris"):
mlflow.log_params(params)
mlflow.set_tags({"dataset": "iris", "stage": "dev"})
model = RandomForestClassifier(**params).fit(Xtr, ytr)
preds = model.predict(Xte)
acc = accuracy_score(yte, preds)
mlflow.log_metric("accuracy", acc)
# Log a figure
fig_path = Path("artifacts/confusion.png")
fig_path.parent.mkdir(exist_ok=True)
disp = ConfusionMatrixDisplay.from_predictions(yte, preds)
plt.tight_layout(); plt.savefig(fig_path, dpi=160); plt.close()
mlflow.log_artifact(str(fig_path), artifact_path="plots")
UI tip: On the experiment page, tick multiple runs → Compare to see overlaid metric curves and param tables.
3) Log datasets + model signature + input example
Goal: make your run portable, so someone else can load the model and know the expected columns and types.
# 03_signature_and_datasets.py
import mlflow, pandas as pd
from mlflow.models.signature import infer_signature
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
mlflow.set_experiment("blog-mlflow-basics")
iris = datasets.load_iris(as_frame=True)
X = iris.frame[iris.feature_names]
y = iris.frame["target"].astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
with mlflow.start_run(run_name="logreg-iris"):
model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
sig = infer_signature(Xtr, model.predict(Xtr))
mlflow.sklearn.log_model(
model, "model",
signature=sig,
input_example=Xtr.head(3)
)
# Log datasets (best-effort: works on MLflow 3.x)
try:
from mlflow.data.pandas_dataset import from_pandas
mlflow.log_input(from_pandas(Xtr.join(ytr.rename("target")), name="iris_train"), context="training")
mlflow.log_input(from_pandas(Xte.join(yte.rename("target")), name="iris_test"), context="testing")
except Exception as e:
mlflow.log_text(str(e), "logs/datasets_warning.txt")
Why it matters: The signature is validated at load/serve time, preventing silent schema drift.
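To see what a consumer of your run actually gets, load the logged model back as a generic pyfunc and inspect its schema. A minimal sketch (replace <run_id> with the logreg-iris run ID from the UI):
# Illustrative follow-up, not one of the numbered scripts above
import mlflow.pyfunc
loaded = mlflow.pyfunc.load_model("runs:/<run_id>/model")
print(loaded.metadata.get_input_schema())  # expected column names and dtypes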
4) Autologging (one line, lots of value)
Goal: get params, metrics, and the model logged automatically.
# 04_autolog.py
import mlflow
mlflow.set_experiment("blog-mlflow-autolog")
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
mlflow.autolog() # <-- magic line (works with many frameworks)
X, y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=17, stratify=y)
with mlflow.start_run(run_name="gb-autolog"):
m = GradientBoostingClassifier().fit(Xtr, ytr)
mlflow.log_metric("holdout_acc", accuracy_score(yte, m.predict(Xte)))
Autologging is great, but explicitly logging key artifacts/plots is still a good habit.
5) Register a model and manage stages
Goal: turn a one-off model artifact into a versioned, named asset with stages (Staging, Production, Archived).
# 05_register_and_promote.py
import mlflow
from mlflow import MlflowClient
from mlflow.exceptions import MlflowException
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
mlflow.set_experiment("blog-mlflow-registry")
X, y = datasets.load_breast_cancer(return_X_y=True)
with mlflow.start_run(run_name="rf-register") as r:
m = RandomForestClassifier(n_estimators=300, random_state=7).fit(X, y)
mlflow.sklearn.log_model(m, "model")
client = MlflowClient()
mv = client.create_model_version(
name="breast_cancer_rf",
source=f"{r.info.artifact_uri}/model",
run_id=r.info.run_id
)
print("Created version:", mv.version)
client.transition_model_version_stage(
name="breast_cancer_rf", version=mv.version, stage="Staging"
)
print("Promoted to Staging.")
Load by name+stage (anywhere):
import mlflow.pyfunc
mlflow.set_tracking_uri("http://<your-vps>:5000")
model = mlflow.pyfunc.load_model("models:/breast_cancer_rf/Staging")
print(model.predict([[14.0]*30])[:1])
6) Serve the production model as a REST API
Goal: ship the current Production model without writing a new server. (Above we only promoted to Staging, so transition a version to Production first, or serve models:/breast_cancer_rf/Staging while testing.)
mlflow models serve -m "models:/breast_cancer_rf/Production" -p 8000 --host 0.0.0.0
Curl test:
curl -X POST http://127.0.0.1:8000/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": [[14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14]]}'
7) Compare runs and pick winners (programmatically)
Goal: query your history with filters and order by metrics.
# 07_query_runs.py
import mlflow
from mlflow.entities import ViewType
mlflow.set_experiment("blog-mlflow-basics")
exp = mlflow.get_experiment_by_name("blog-mlflow-basics")
df = mlflow.search_runs(
    experiment_ids=[exp.experiment_id],
    filter_string='metrics.accuracy > 0.75 and tags.dataset = "iris"',
    order_by=["metrics.accuracy DESC"],
    output_format="pandas",
    max_results=20,
    run_view_type=ViewType.ACTIVE_ONLY,
)
print(df[["run_id", "metrics.accuracy", "params.n_estimators", "tags.dataset"]])
In the UI, use the search bar with the same filter grammar.
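From there it is one more step to reuse the winner. A sketch, assuming the top run logged a model under the artifact path "model" (as in example 03):
# Illustrative: load the best run's model for downstream use
if not df.empty:
    best_run_id = df.iloc[0]["run_id"]
    best_model = mlflow.pyfunc.load_model(f"runs:/{best_run_id}/model")
    print("Loaded model from run:", best_run_id)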
8) Nested runs for multi-step pipelines
Goal: keep each phase (prep → train → eval) separate but linked.
# 08_nested_runs.py
import mlflow, time
mlflow.set_experiment("blog-mlflow-nested")
with mlflow.start_run(run_name="pipeline"):
mlflow.set_tag("pipeline", "prep-train-eval")
with mlflow.start_run(run_name="prep", nested=True):
time.sleep(0.1)
mlflow.log_metric("rows_kept", 980)
with mlflow.start_run(run_name="train", nested=True):
mlflow.log_param("lr", 1e-3)
mlflow.log_metric("train_loss", 0.12)
with mlflow.start_run(run_name="eval", nested=True):
mlflow.log_metric("val_auc", 0.91)
UI: The parent run shows Children; click through to see step-level artifacts and metrics.
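You can also pull the children back out programmatically via the mlflow.parentRunId tag that MLflow sets on nested runs. A sketch (fill in the pipeline run's ID, e.g. copied from the UI):
# Illustrative: list the child runs of the "pipeline" parent
import mlflow
parent_run_id = "<pipeline-run-id>"
children = mlflow.search_runs(
    experiment_names=["blog-mlflow-nested"],
    filter_string=f'tags.`mlflow.parentRunId` = "{parent_run_id}"',
)
print(children[["run_id", "tags.mlflow.runName"]])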
9) Reproducibility: pin code & environment
Goal: capture code version and the Python env that produced the model.
# 09_reproducibility.py
import os, subprocess, mlflow, json, sys
mlflow.set_experiment("blog-mlflow-repro")
def git_commit():
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    except Exception:
        return "unknown"
with mlflow.start_run(run_name="env-and-code"):
    mlflow.set_tag("git_commit", git_commit())
    # Log a simple requirements snapshot
    reqs = subprocess.check_output([sys.executable, "-m", "pip", "freeze"]).decode()
    mlflow.log_text(reqs, "env/requirements.txt")
    # Optional: log a minimal conda env file
    conda_env = {
        "name": "mlflow-env",
        "channels": ["conda-forge"],
        "dependencies": ["python={}".format(".".join(map(str, sys.version_info[:3]))), "pip", {"pip": ["mlflow"]}],
    }
    mlflow.log_text(json.dumps(conda_env, indent=2), "env/conda_env.json")
Tip: mlflow.<flavor>.log_model(..., pip_requirements=[...], conda_env=...) lets you embed env specs inside the model artifact.
10) Model evaluation (one-liner)
Goal: log a battery of metrics/plots for classification problems.
# 10_evaluate.py
import mlflow
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
with mlflow.start_run(run_name="evaluate-example"):
m = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
res = mlflow.evaluate(
model=m,
data=Xte.assign(label=yte),
targets="label",
model_type="classifier",
evaluators=["default"], # logs metrics + confusion matrix + ROC, etc.
)
print(res.metrics.keys())
11) Serving-friendly signatures: strict schemas
Goal: fail fast if inputs are wrong at serving time.
# 11_strict_signature.py
import mlflow
from mlflow.models.signature import infer_signature
from sklearn.linear_model import LinearRegression
import pandas as pd
X = pd.DataFrame({"x1":[1,2,3], "x2":[0.1,0.2,0.3]})
y = pd.Series([3.2, 5.1, 7.0])
with mlflow.start_run(run_name="strict-sig"):
m = LinearRegression().fit(X, y)
sig = infer_signature(X, m.predict(X))
mlflow.sklearn.log_model(m, "model", signature=sig, input_example=X.head(2))
When you later call the REST API with wrong columns/types, MLflow will reject the request.
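You can see the same enforcement locally before serving. A sketch (replace <run_id> with the strict-sig run's ID):
# Illustrative: signature enforcement when loading as pyfunc
import mlflow.pyfunc, pandas as pd
loaded = mlflow.pyfunc.load_model("runs:/<run_id>/model")
print(loaded.predict(pd.DataFrame({"x1": [4], "x2": [0.4]})))  # matches the signature
try:
    loaded.predict(pd.DataFrame({"x1": [4]}))  # "x2" missing -> rejected
except Exception as e:
    print("Rejected:", type(e).__name__)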
12) Tagging conventions that pay off later
Use consistent tags to supercharge search & dashboards:
- project, dataset, owner, stage (dev, abtest, prod-candidate)
- git_commit, git_branch, feature_flags
- training_job_id, ml_platform (e.g., “k8s”, “ray”, “sagemaker”)
mlflow.set_tags({
    "project": "sensor-forecast",
    "dataset": "v2025-10-01",
    "owner": "you",
    "stage": "dev",
})
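The payoff: consistent tags turn cross-experiment queries into a single call. A sketch using the illustrative tag values above:
# Illustrative: find every dev run of the project, newest first
import mlflow
candidates = mlflow.search_runs(
    search_all_experiments=True,
    filter_string='tags.project = "sensor-forecast" and tags.stage = "dev"',
    order_by=["attributes.start_time DESC"],
)
print(len(candidates), "matching runs")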
13) A quick “gotchas” checklist
- Artifacts point at file:///... and fail to write? Your server likely isn’t serving artifacts. Start it with:
  - --serve-artifacts --artifacts-destination file:///path (or S3/MinIO), so the server proxies artifacts over HTTP, or
  - --default-artifact-root s3://bucket/..., so clients write artifacts directly to a shared store they can reach.
- Shadowed import: An AttributeError: partially initialized module 'mlflow'... usually means you have a local mlflow.py file. Rename it and restart the kernel.
- Conda vs pip mix: Prefer one channel per env; if mixing, install heavy deps (numpy/scipy) via conda first, then pip install mlflow.
- Tracking URI confusion: Print it before you run:
  print(mlflow.get_tracking_uri())
  Make sure it’s http(s)://..., not file:///....
14) A tiny template you can reuse
# template_train.py
import os, time, mlflow
from typing import Dict
MLFLOW_URI = os.getenv("MLFLOW_TRACKING_URI", "http://<your-vps>:5000")
EXPERIMENT = os.getenv("MLFLOW_EXPERIMENT", "my-project")
def train_and_log(params: Dict[str, float]) -> str:
    mlflow.set_tracking_uri(MLFLOW_URI)
    mlflow.set_experiment(EXPERIMENT)
    with mlflow.start_run(run_name=f"train-{int(time.time())}") as run:
        mlflow.log_params(params)
        # ... train ...
        mlflow.log_metric("final_metric", 0.123)  # note: "@" is not allowed in metric keys
        # mlflow.log_artifact("path/to/plot.png", artifact_path="plots")
    return run.info.run_id
if __name__ == "__main__":
    run_id = train_and_log({"learning_rate": 3e-4, "batch_size": 64})
    print("Logged run:", run_id)
Wrap-up
You now have everything to:
- Track parameters, metrics, figures, tables, and datasets
- Save models with signatures and examples
- Register versions, promote to stages, and serve over REST
- Query and compare runs at scale
- Keep experiments reproducible and discoverable