MLflow is the leading open-source platform for managing the ML lifecycle — experiment tracking, model registry, project orchestration. DING's `ding run` wraps an MLproject entry point, evaluates rules during the training run, and fires alerts on metric thresholds and exit codes, both during the run and on exit, with each alert linking back to the MLflow UI.
Prerequisites:
- DING >= v0.10.0 — see install
- mlflow >= 2.0 (`pip install mlflow`)
- a remote tracking server (`mlflow server`) for production deep-links to work

Below is the shortest config that wires DING into an MLflow project so alerts fire during training and on exit.
MLproject (project root):
```yaml
name: my-training
entry_points:
  main:
    parameters:
      epochs: { type: int, default: 10 }
    command: "ding run --config ding.yaml -- python train.py --epochs {epochs}"
```
ding.yaml:
```yaml
notifiers:
  slack:
    type: slack
    url: ${SLACK_WEBHOOK_URL}

rules:
  # During-run: fire if validation loss spikes mid-training.
  - name: loss_spike
    match: { metric: val_loss }
    condition: value > 10
    cooldown: 1m
    message: "val_loss spike: {{value}} on epoch {{epoch}} (run {{run_id}})"
    alert:
      - notifier: slack

  # On exit: fire if the training process exits non-zero.
  # The synthetic run.exit event is dispatched at end-of-run; a default
  # (during-run) rule matching it fires once when the wrapped command exits.
  - name: training_failed
    match: { metric: run.exit }
    condition: value > 0
    message: |
      MLflow run failed (exit {{value}} after {{duration}})
      <{{tracking_uri}}/#/experiments/{{experiment_id}}/runs/{{run_id}}|View run in MLflow UI>
    alert:
      - notifier: slack
```
train.py (excerpt — emit JSON events for DING alongside MLflow’s native logging):
```python
import json, mlflow

with mlflow.start_run():
    for epoch in range(epochs):
        loss = train_epoch()
        mlflow.log_metric("val_loss", loss, step=epoch)  # → MLflow tracking server
        print(json.dumps({                               # → DING
            "metric": "val_loss",
            "value": loss,
            "epoch": str(epoch),  # cast to string so the template variable resolves
        }))
```
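One caveat worth sketching: when stdout is not a TTY (as under `mlflow run`), Python may block-buffer `print` output, so a during-run rule like `loss_spike` could see events late. A small emitter that flushes each event avoids that; the helper name `ding_emit` is ours, not part of DING's API, and it assumes the event format used above (one JSON object per line, string-valued labels):

```python
import json
import sys

def ding_emit(metric: str, value: float, **labels: str) -> None:
    """Write one DING JSON event per line and flush immediately, so
    during-run rules see the event as soon as it happens rather than
    when the stdout buffer fills. (Helper name is ours, not DING API.)"""
    sys.stdout.write(json.dumps({
        "metric": metric,
        "value": float(value),
        **{k: str(v) for k, v in labels.items()},  # string labels so template variables resolve
    }) + "\n")
    sys.stdout.flush()
```

In the loop above, `ding_emit("val_loss", loss, epoch=epoch)` would replace the `print(json.dumps(...))` call.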
Invoke with:
```shell
mlflow run . --env-manager=local -P epochs=20
```
A Slack message during training when val_loss exceeds threshold:
🔔 **loss_spike**
val_loss spike: 12.4 on epoch 7 (run abc123def456)
…and on training-process exit:
🔔 **training_failed**
MLflow run failed (exit 1 after 42s) View run in MLflow UI
The deep-link in the second message takes you straight to the MLflow run page. All alerts are auto-tagged with run_id, runner=mlflow, experiment_id, tracking_uri.
runctx auto-detects MLflow when MLFLOW_RUN_ID is set in the entry point’s environment (always set by mlflow run):
| Label | Source env var | Notes |
|---|---|---|
| `run_id` | `MLFLOW_RUN_ID` | the MLflow run UUID |
| `runner` | `"mlflow"` (set by runctx) | |
| `experiment_id` | `MLFLOW_EXPERIMENT_ID` | enables Slack-channel routing per experiment |
| `tracking_uri` | `MLFLOW_TRACKING_URI` | only set when the value starts with `http://` or `https://`; local file paths are skipped |
Use these in match.labels or message template variables. See Configuration for the full notifier reference.
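For example, per-experiment routing could pin a rule to a single experiment via label matching. This is a sketch assuming `match.labels` takes exact-match key/value pairs in the same style as the `match` blocks shown earlier; the experiment id `"42"` and the rule name are illustrative:

```yaml
rules:
  - name: prod_loss_spike
    match:
      metric: val_loss
      labels: { experiment_id: "42" }  # auto-tagged by runctx under `mlflow run`
    condition: value > 10
    message: "val_loss spike in the prod experiment"
    alert:
      - notifier: slack  # the notifier defined in ding.yaml above
```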
```shell
pip install mlflow
mkdir mlflow-smoke && cd mlflow-smoke
# Author MLproject, train.py (with intentional non-zero exit), and ding.yaml per the example above
mlflow server --host 127.0.0.1 --port 5000 &
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
mlflow run . --env-manager=local
# Verify in Slack:
# 1. training_failed message fires within ~5s of the script exit
# 2. tracking URI deep-link is clickable and lands on the MLflow run page
# 3. labels include run_id, experiment_id, tracking_uri
```
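A minimal `train.py` for this smoke test could look like the following: one event that trips `loss_spike`, then a deliberate non-zero exit to trip `training_failed`. This is a sketch under the recipe's assumptions (JSON events on stdout, string-valued labels); it ignores the `--epochs` argument the MLproject entry point passes:

```python
import json
import sys

def fake_epoch() -> dict:
    """One synthetic 'training' step whose val_loss trips the loss_spike rule (> 10)."""
    return {"metric": "val_loss", "value": 99.0, "epoch": "0"}

if __name__ == "__main__":
    print(json.dumps(fake_epoch()), flush=True)
    sys.exit(1)  # non-zero exit → synthetic run.exit event → training_failed rule
```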
If the alert doesn’t fire, check the `mlflow run` log for `ding` output. Common issues: `SLACK_WEBHOOK_URL` not exported in the shell that ran `mlflow run`, or `drain_timeout` shorter than the notifier retry window — see Configuration → drain_timeout.
Gotchas:
- Bare scripts are not auto-detected: `python train.py` run directly produces `runner=local`; emit `mlflow_run_id` as a JSON event label inside the script if you need it on alerts.
- With `MLFLOW_TRACKING_URI=./mlruns` (the MLflow default), `tracking_uri` is omitted by runctx and Slack templates referencing it render an empty link. Use a real tracking server for deep-links.
- `mlflow run`'s env-manager defaults to conda. This recipe uses `--env-manager=local` to use the host environment where DING is on PATH. For isolation, install DING into the conda env via conda.yaml or use an absolute path in the MLproject command.
- DING ≠ MLflow tracking: `mlflow.log_metric` feeds MLflow's UI; `print(json.dumps(...))` feeds DING rules. DING fires real-time alerts; MLflow records history. Different purposes; no overlap.

This recipe is Tier 1 by the program's standard rubric:
- one dependency to install (`pip install mlflow`) — under the threshold of 5
- no external SaaS required (`mlflow server` is self-hostable)
- the 4 gotchas are conceptual (“DING ≠ MLflow tracking; bare scripts not auto-detected; conda env defaults; deep-links require remote tracking server”), not boilerplate-driven — a Tier-2 abstraction wouldn’t reduce them.
A future Tier-2 candidate worth tracking: a `type: mlflow_run_tag` notifier that writes the alert as a tag on the active MLflow run, surfacing failure context in the MLflow UI alongside metrics. Not built here.
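As a rough sketch of what that notifier might do, assuming DING hands it an alert dict: MLflow's real `MlflowClient.set_tag` API can write a tag onto the active run. The function names and the `ding.alert.*` key prefix below are our assumptions, not part of DING:

```python
import os

def alert_to_tag(alert: dict) -> tuple:
    """Map a DING alert to an MLflow run-tag (key, value) pair.
    Keys are prefixed to keep them apart from MLflow's own tags."""
    return ("ding.alert." + alert["rule"], alert.get("message", ""))

def notify_via_run_tag(alert: dict) -> None:
    """Write the alert onto the active run using MLflow's MlflowClient.set_tag."""
    from mlflow.tracking import MlflowClient  # real MLflow API
    run_id = os.environ["MLFLOW_RUN_ID"]      # set by `mlflow run`
    key, value = alert_to_tag(alert)
    MlflowClient().set_tag(run_id, key, value)
```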