
Using DING with Argo Workflows

Argo Workflows is Kubernetes-native multi-step orchestration — a controller watches Workflow CRs and schedules each step as a Pod. DING runs inside each step’s container: ding run wraps the step’s command, evaluates rules in-Pod, and alerts the moment the Pod exits, automatically tagging each alert with workflow, node, pod, and namespace labels.

Prerequisites

Minimal example

The shortest configuration that produces a working alert when an Argo Workflow step exits non-zero. The pattern: ding run wraps your command in the step’s main container; an initContainer copies /ding from the published image into a shared emptyDir; the step’s container mounts the ding.yaml ConfigMap directly and pulls Secret keys into its env via envFrom. DING expands ${SLACK_WEBHOOK_URL} from the env at startup — no template-rendering step needed.

Save the YAML below as ding-workflow.yaml and create it with kubectl create -f ding-workflow.yaml (the Workflow uses generateName, which kubectl apply does not support):

---
# Notifier credential — replace with your real Slack webhook URL.
# The Secret key MUST be a valid env var name (uppercase, no dashes) so
# `envFrom: secretRef:` below can surface it directly into the step's
# container environment.
apiVersion: v1
kind: Secret
metadata:
  name: ding-secrets
type: Opaque
stringData:
  SLACK_WEBHOOK_URL: https://hooks.slack.com/services/T.../B.../...
---
# DING config. ${SLACK_WEBHOOK_URL} is expanded by DING itself at startup
# (see docs/configuration.md#environment-variable-substitution).
apiVersion: v1
kind: ConfigMap
metadata:
  name: ding-config
data:
  ding.yaml: |
    server:
      drain_timeout: 30s
    notifiers:
      slack:
        type: slack
        url: ${SLACK_WEBHOOK_URL}
    rules:
      - name: step_failed
        match:
          metric: run.exit
        condition: value > 0
        message: "Argo step {{node}} (workflow {{workflow}}) failed with exit {{value}} after {{duration}}"
        alert:
          - notifier: slack
---
# Workflow — replace `image:` and the workload `command:` with your real workload.
# The example below intentionally exits 1 so the failure path can be observed.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ding-demo-
spec:
  entrypoint: main
  templates:
    - name: main
      volumes:
        - name: ding-bin
          emptyDir: {}
        - name: ding-config
          configMap:
            name: ding-config
      initContainers:
        - name: install-ding
          image: ghcr.io/ding-labs/ding:v0.10.0
          # `ding install` self-copies the binary — works against the FROM-scratch
          # release image (no /bin/sh available). Added in DING v0.5.1.
          command: ["/ding", "install", "/shared/ding"]
          mirrorVolumeMounts: true
      container:
        image: alpine:3
        command:
          - /shared/ding
          - run
          - --config
          - /config/ding.yaml
          - --
          - /bin/sh
          - -c
          - "echo running; sleep 1; exit 1"
        envFrom:
          # Surfaces every key from ding-secrets as an env var inside the
          # step's container. DING expands ${SLACK_WEBHOOK_URL} from here.
          - secretRef:
              name: ding-secrets
        env:
          # Argo's controller auto-injects ARGO_TEMPLATE, ARGO_NODE_ID, and
          # ARGO_CONTAINER_NAME on the main container — but injects the rest
          # of the ARGO_* set (ARGO_WORKFLOW_UID, ARGO_WORKFLOW_NAME,
          # ARGO_POD_NAME) ONLY on the auxiliary `wait` sidecar. We restore
          # the missing pieces on `main` via the downward API. ARGO_WORKFLOW_UID
          # isn't recoverable on main; runctx falls back to ARGO_WORKFLOW_NAME
          # for run_id (see Configuration below).
          - name: POD_NAMESPACE
            valueFrom: { fieldRef: { fieldPath: metadata.namespace } }
          - name: ARGO_POD_NAME
            valueFrom: { fieldRef: { fieldPath: metadata.name } }
          - name: ARGO_WORKFLOW_NAME
            valueFrom: { fieldRef: { fieldPath: "metadata.labels['workflows.argoproj.io/workflow']" } }
        volumeMounts:
          - { name: ding-bin, mountPath: /shared }
          - { name: ding-config, mountPath: /config, readOnly: true }

This is the wedge’s headline use case in Argo: silent step failure inside a multi-step DAG is exactly the observability gap the Argo controller doesn’t solve on its own — it knows the workflow phase, but not why a specific step’s program exited non-zero. DING surfaces that as an alert the moment the Pod exits.

What you get

Configuration

runctx auto-detects Argo Workflows via the presence of ARGO_TEMPLATE (controller-injected on every step’s main container) and captures these labels:

| Label | Source |
| --- | --- |
| run_id | ARGO_WORKFLOW_NAME via downward API on the workflows.argoproj.io/workflow pod label. Argo injects ARGO_WORKFLOW_UID only on the wait sidecar, so runctx’s UID-then-NAME fallback chain lands on the workflow name. |
| runner | "argo-workflows" (set by runctx) |
| workflow | ARGO_WORKFLOW_NAME via downward API (same source as run_id) |
| node | ARGO_NODE_ID — auto-injected by Argo on the main container; no downward API needed |
| pod | ARGO_POD_NAME via downward API on metadata.name |
| namespace | POD_NAMESPACE via downward API on metadata.namespace |
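Taken together, an alert from the minimal example carries a label set along these lines (every value below is invented for illustration; the real ones come from your cluster):

```yaml
# Hypothetical label set on one step_failed alert (values are illustrative):
run_id: ding-demo-x7k2p            # falls back to the workflow name (no UID on main)
runner: argo-workflows
workflow: ding-demo-x7k2p
node: ding-demo-x7k2p-1234567890   # ARGO_NODE_ID, unique per step per run
pod: ding-demo-x7k2p-main-987654
namespace: default
```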

Why so much downward API? Argo’s controller injects only ARGO_TEMPLATE, ARGO_NODE_ID, and ARGO_CONTAINER_NAME on the step’s main container. The richer set (ARGO_WORKFLOW_UID, ARGO_WORKFLOW_NAME, ARGO_POD_NAME) is hardcoded by the controller onto the auxiliary wait sidecar only. The recipe’s env: block uses the K8s downward API to surface workflow name, pod name, and namespace to the main container so runctx can populate the labels above. ARGO_WORKFLOW_UID cannot be recovered on the main container — the controller writes it as a literal env value on the wait sidecar with no corresponding pod label or annotation; runctx’s UID-then-NAME fallback handles this gracefully.

A self-hosted CI runner (GitHub Actions, GitLab CI, etc.) deployed on Argo-managed Kubernetes will set both its CI env vars and ARGO_TEMPLATE. In that case runctx reports the CI platform — its labels are richer for alerting purposes — and the Argo labels are skipped. See Configuration for the full notifier reference.
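The precedence described above can be sketched as a shell check. This is an illustration of the documented behavior, not runctx’s actual detection code; GITHUB_ACTIONS and GITLAB_CI are the standard variables those platforms set.

```shell
# Sketch of runctx's documented precedence: CI-platform detection wins
# over Argo detection when both sets of env vars are present.
detect_runner() {
  if [ -n "${GITHUB_ACTIONS:-}" ]; then echo "github-actions"
  elif [ -n "${GITLAB_CI:-}" ]; then echo "gitlab-ci"
  elif [ -n "${ARGO_TEMPLATE:-}" ]; then echo "argo-workflows"
  else echo "unknown"; fi
}
```

Inside a plain Argo step (no CI runner present), detect_runner reports argo-workflows; on a self-hosted GitHub Actions runner that happens to run on Argo, the CI platform wins.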

drain_timeout and terminationGracePeriodSeconds

Same constraint as the Kubernetes Jobs recipe: SIGTERM-initiated graceful shutdown completes the notifier retry cycle only if drain_timeout is longer than initial_backoff * 2^max_attempts and the pod’s terminationGracePeriodSeconds is longer than drain_timeout. The recipe sets drain_timeout: 30s; for production workflows, set terminationGracePeriodSeconds: 60 on the step’s pod spec via spec.templates[].podSpecPatch (or workflow-wide via spec.podSpecPatch). Argo additionally honors activeDeadlineSeconds on the Workflow CR — that’s an upper bound on total runtime, not a substitute for the grace period.
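As a worked example with hypothetical notifier settings of initial_backoff: 1s and max_attempts: 4, the worst-case retry window is 1 * 2^4 = 16s, which fits under the recipe’s drain_timeout: 30s, which in turn must fit under the 60s grace period. A sketch of raising the grace period with a template-level podSpecPatch (the dingstep template name is from the DAG example below; field names follow Argo’s template spec):

```yaml
# Sketch: give every dingstep pod a 60s grace period so DING's 30s drain
# window is never cut short by pod deletion.
spec:
  templates:
    - name: dingstep
      podSpecPatch: |
        terminationGracePeriodSeconds: 60
```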

Multi-step DAG

Real Argo workflows are multi-step DAGs. Define the wrapper pattern once as a reusable dingstep template parameterized by command, then reference it from dag.tasks:

spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: prepare
            template: dingstep
            arguments: { parameters: [{ name: cmd, value: "echo prepare; sleep 1" }] }
          - name: train
            template: dingstep
            depends: prepare
            arguments: { parameters: [{ name: cmd, value: "echo train; sleep 1; exit 1" }] }
          - name: eval
            template: dingstep
            depends: train
            arguments: { parameters: [{ name: cmd, value: "echo eval" }] }

    - name: dingstep
      inputs:
        parameters:
          - { name: cmd }
      volumes:
        - { name: ding-bin, emptyDir: {} }
        - { name: ding-config, configMap: { name: ding-config } }
      initContainers:
        - name: install-ding
          image: ghcr.io/ding-labs/ding:v0.10.0
          command: ["/ding", "install", "/shared/ding"]
          mirrorVolumeMounts: true
      container:
        image: alpine:3
        command:
          - /shared/ding
          - run
          - --config
          - /config/ding.yaml
          - --
          - /bin/sh
          - -c
          - "{{inputs.parameters.cmd}}"
        envFrom: [{ secretRef: { name: ding-secrets } }]
        env:
          # See Minimal example for why ARGO_POD_NAME and ARGO_WORKFLOW_NAME
          # need explicit downward API entries (Argo only injects them on
          # the `wait` sidecar, not on the main container).
          - name: POD_NAMESPACE
            valueFrom: { fieldRef: { fieldPath: metadata.namespace } }
          - name: ARGO_POD_NAME
            valueFrom: { fieldRef: { fieldPath: metadata.name } }
          - name: ARGO_WORKFLOW_NAME
            valueFrom: { fieldRef: { fieldPath: "metadata.labels['workflows.argoproj.io/workflow']" } }
        volumeMounts:
          - { name: ding-bin, mountPath: /shared }
          - { name: ding-config, mountPath: /config, readOnly: true }

When train exits 1, DING in the train step’s pod fires the alert; eval is skipped (Argo marks it Omitted, its default for failed dependencies). The Slack message carries the workflow, node, pod, and namespace labels described under Configuration, plus the non-zero exit code.

Per-step matching is constrained. DING’s match.labels does exact-match comparison, and Argo’s node/pod values are dynamic per Workflow run. You can’t write a match.labels: { node: my-train-step } rule that matches “the train step.” Pragmatic patterns:
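One such pattern, sketched here under the assumption that rules follow the syntax from the minimal example: exact-match only on labels that are stable across runs (namespace, runner) and let the dynamic node/pod labels ride along in the alert body for humans to read. The prod namespace below is a placeholder.

```yaml
# Hypothetical rule: match on the stable namespace label; the per-run
# node/pod labels still arrive on the alert, just not in the matcher.
rules:
  - name: prod_step_failed
    match:
      metric: run.exit
      labels:
        namespace: prod
    condition: value > 0
    alert:
      - notifier: slack
```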

Surfacing the template name (manual)

runctx does not parse ARGO_TEMPLATE JSON. If you want template-name labels on alerts, surface the controller-injected pod label via downward API:

env:
  - name: TEMPLATE_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['workflows.argoproj.io/template']

Then your script can emit TEMPLATE_NAME as an event label. This is opt-in — the recipe doesn’t include it in the minimal example.

Verification

# One-time cluster setup (~3-5 min cold)
kind create cluster --name argo-smoke
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.13/install.yaml
kubectl wait --for=condition=available --timeout=120s deployment/workflow-controller -n argo

# Smoke test 1 — single-step failure
# First edit ding-secrets in ding-workflow.yaml with your real Slack webhook URL
kubectl create -f ding-workflow.yaml        # `create`, not `apply`: the Workflow uses generateName
argo wait @latest -n default                # blocks until completion
argo get @latest -n default                 # verify phase: Failed
argo logs @latest -n default                # verify DING drained on exit
# Verify Slack: workflow / node / pod / namespace / exit_code labels present

# Smoke test 2 — multi-step DAG (train fails, eval skipped)
kubectl create -f ding-workflow-dag.yaml    # the Multi-step DAG manifest
argo wait @latest -n default
argo get @latest -n default                 # verify train Failed, eval Omitted

# Smoke test 3 — happy path (workload exits 0)
# Edit the minimal manifest's last command line to "exit 0", save as ding-workflow-success.yaml
kubectl create -f ding-workflow-success.yaml
argo wait @latest -n default
# Verify NO Slack message arrives within 10s of completion

If the alert doesn’t fire, common issues: the Secret wasn’t readable (RBAC on the default ServiceAccount in the workflow’s namespace), SLACK_WEBHOOK_URL was empty/missing in the Secret, or terminationGracePeriodSeconds was too tight for the Pod’s actual deletion path.

Tradeoffs / known limitations

Escalation criteria

This recipe is a Tier-2 candidate by the program’s standard rubric. The boilerplate count is the structural problem — both the minimal manifest and the DAG subsection are mostly mechanical plumbing (volumes, initContainers, the downward API env block) that every Argo user will copy verbatim. A separate argo-workflow-template repo (mirroring ding-k8s-job) that publishes a parameterized WorkflowTemplate to GHCR — invoked via argo submit --from workflowtemplate/ding-step --parameter image=my-app --parameter command='python train.py' --parameter slack-url=$SLACK_WEBHOOK_URL — would collapse the recipe to a one-line invocation. Defer it until 2+ users ask.