
Using DING with MLflow

MLflow is a widely used open-source platform for managing the ML lifecycle: experiment tracking, model registry, and project orchestration. DING’s ding run wraps an MLproject entry point, evaluates rules as the training run emits events, and fires alerts on metric thresholds during the run and on the exit code when it ends, with each alert linking back to the MLflow UI.

Prerequisites
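
- mlflow installed (pip install mlflow) and the ding binary on PATH in the environment that runs the entry point
- a Slack incoming webhook, exported as SLACK_WEBHOOK_URL
- for clickable deep links, an MLflow tracking server reachable over http:// or https:// (local file stores get no link)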

Minimal example

The shortest config that wires DING into an MLflow project so alerts fire during training and on exit.

MLproject (project root):

name: my-training
entry_points:
  main:
    parameters:
      epochs: { type: int, default: 10 }
    command: "ding run --config ding.yaml -- python train.py --epochs {epochs}"

ding.yaml:

notifiers:
  slack:
    type: slack
    url: ${SLACK_WEBHOOK_URL}

rules:
  # During-run: fire if validation loss spikes mid-training.
  - name: loss_spike
    match: { metric: val_loss }
    condition: value > 10
    cooldown: 1m
    message: "val_loss spike:  on epoch  (run )"
    alert:
      - notifier: slack

  # On exit: fire if the training process exits non-zero.
  # The synthetic run.exit event is dispatched at end-of-run; a default
  # (during-run) rule matching it fires once when the wrapped command exits.
  - name: training_failed
    match: { metric: run.exit }
    condition: value > 0
    message: |
      MLflow run failed (exit {{value}} after {{duration}})
      <{{tracking_uri}}/#/experiments/{{experiment_id}}/runs/{{run_id}}|View run in MLflow UI>
    alert:
      - notifier: slack

train.py (excerpt — emit JSON events for DING alongside MLflow’s native logging):

import json, mlflow

with mlflow.start_run():
    for epoch in range(epochs):
        loss = train_epoch()
        mlflow.log_metric("val_loss", loss, step=epoch)              # → MLflow tracking server
        print(json.dumps({                                           # → DING
            "metric": "val_loss",
            "value": loss,
            "epoch": str(epoch),  # cast to string so the template variable resolves
        }))
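
If the script emits several metrics, a small helper keeps the event shape consistent. A sketch following the field conventions above (the ding_emit name is ours, not part of DING; flush=True guards against stdout block-buffering delaying during-run alerts):

import json

def ding_emit(metric, value, **labels):
    """Print one JSON event per line for DING to pick up from stdout."""
    event = {"metric": metric, "value": value}
    event.update({k: str(v) for k, v in labels.items()})  # strings, so template variables resolve
    print(json.dumps(event), flush=True)

Usage in the loop above: ding_emit("val_loss", loss, epoch=epoch).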

Invoke with:

mlflow run . --env-manager=local -P epochs=20

What you get

A Slack message during training when val_loss exceeds the rule’s threshold:

🔔 loss_spike val_loss spike: 12.4 on epoch 7 (run abc123def456)

…and on training-process exit:

🔔 training_failed MLflow run failed (exit 1 after 42s) View run in MLflow UI

The deep link in the second message takes you straight to the MLflow run page. All alerts are auto-tagged with run_id, runner=mlflow, experiment_id, and tracking_uri.

Configuration

runctx auto-detects MLflow when MLFLOW_RUN_ID is set in the entry point’s environment (always set by mlflow run):

Label           Source env var             Notes
run_id          MLFLOW_RUN_ID              the MLflow run UUID
runner          "mlflow" (set by runctx)
experiment_id   MLFLOW_EXPERIMENT_ID       enables Slack-channel routing per experiment
tracking_uri    MLFLOW_TRACKING_URI        only set when the value starts with http:// or https://; local file paths are skipped
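
The detection itself boils down to reading these variables. A rough Python equivalent of the logic in the table, for illustration only (this is not DING's actual source):

import os

def detect_mlflow_labels():
    """Mirror runctx's MLflow auto-detection as described above."""
    run_id = os.environ.get("MLFLOW_RUN_ID")
    if run_id is None:
        return None  # not launched via `mlflow run`: no MLflow labels attached
    labels = {"run_id": run_id, "runner": "mlflow"}
    experiment_id = os.environ.get("MLFLOW_EXPERIMENT_ID")
    if experiment_id:
        labels["experiment_id"] = experiment_id
    tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "")
    # local file-store paths are skipped; only remote HTTP(S) servers deep-link
    if tracking_uri.startswith(("http://", "https://")):
        labels["tracking_uri"] = tracking_uri
    return labels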

Use these in match.labels or message template variables. See Configuration for the full notifier reference.
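
For example, to scope an exit-code rule to a single experiment, match on the experiment_id label. A sketch, assuming match.labels does exact string matching and labels are exposed as {{...}} template variables (the experiment id 42 is hypothetical):

rules:
  - name: prod_experiment_failed
    match:
      metric: run.exit
      labels: { experiment_id: "42" }   # hypothetical experiment id
    condition: value > 0
    message: "Run {{run_id}} in experiment {{experiment_id}} exited non-zero"
    alert:
      - notifier: slack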

Verification

pip install mlflow
mkdir mlflow-smoke && cd mlflow-smoke
# Author MLproject and ding.yaml per the example above; a minimal train.py with
# an intentional non-zero exit is sketched after this block
mlflow server --host 127.0.0.1 --port 5000 &
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
mlflow run . --env-manager=local
# Verify in Slack:
#   1. training_failed message fires within ~5s of the script exit
#   2. tracking URI deep-link is clickable and lands on the MLflow run page
#   3. labels include run_id, experiment_id, tracking_uri
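
The smoke-test train.py only needs to emit a couple of events and then fail. A minimal sketch (the loss values are arbitrary and deliberately above the loss_spike threshold; MLflow logging is omitted because the smoke test only exercises alerting):

import argparse, json, sys

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=10)  # matches the MLproject parameter
args = parser.parse_args()

for epoch in range(args.epochs):
    fake_loss = 12.0 + epoch  # > 10, so loss_spike fires (subject to its 1m cooldown)
    print(json.dumps({"metric": "val_loss", "value": fake_loss, "epoch": str(epoch)}), flush=True)

sys.exit(1)  # intentional non-zero exit: run.exit > 0 triggers training_failed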

If the alert doesn’t fire, check the mlflow run log for ding output. Common issues: SLACK_WEBHOOK_URL not exported in the shell that ran mlflow run, or drain_timeout shorter than the notifier retry window — see Configuration → drain_timeout.

Tradeoffs / known limitations
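
Four gotchas, all conceptual rather than boilerplate-driven:

- DING ≠ MLflow tracking. DING evaluates the JSON events your script prints to stdout; it does not read metrics back from the tracking server, so log to both, as the train.py excerpt does.
- Bare scripts are not auto-detected. runctx only labels a run as MLflow when MLFLOW_RUN_ID is set, which mlflow run does; invoking python train.py directly yields alerts without MLflow labels.
- conda env defaults. mlflow run builds an isolated conda environment unless you pass --env-manager=local, and ding must resolve inside whichever environment runs the entry point.
- Deep links require a remote tracking server. tracking_uri is only recorded for http:// or https:// URIs, so runs against a local file store produce alerts without a clickable link.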

Escalation criteria

This recipe is Tier 1 by the program’s standard rubric:

The 4 gotchas are conceptual (“DING ≠ MLflow tracking; bare scripts not auto-detected; conda env defaults; deep-links require remote tracking server”), not boilerplate-driven — a Tier-2 abstraction wouldn’t reduce them.

A future Tier-2 candidate worth tracking: type: mlflow_run_tag notifier — writes the alert as a tag on the active MLflow run, surfacing failure context in the MLflow UI alongside metrics. Not built here.