Kubernetes Jobs and CronJobs are the canonical primitives for ephemeral, run-to-completion workloads. Grafana watches the cluster; DING ships with the work —
`ding run` wraps your container’s command, evaluates rules in-Pod, and alerts when the Pod exits, automatically tagging each alert with namespace, pod, node, and Job name.
!!! tip "One-line install via Helm"
    For the common case (Slack alert on Job failure, no per-field K8s tuning), use the `ding-k8s-job` Helm chart instead of copying the manifest below:

    ```bash
    helm install nightly-batch oci://ghcr.io/ding-labs/ding-k8s-job \
      --set image.repository=my-app --set image.tag=v1.2.3 \
      --set command='{python,train.py}' \
      --set slack.webhookUrl=$SLACK_WEBHOOK_URL
    ```

    The recipe below is the equivalent unrolled manifest, kept for users who need fine-grained control or who prefer not to depend on Helm.
## Prerequisites

- **DING >= v0.10.0** — see install. The recipe pulls the official container image `ghcr.io/ding-labs/ding:v0.10.0` (multi-arch, scratch base) into your Pod via an initContainer; no need to bake DING into your workload image.
- **Kubernetes >= 1.21** for the primary wrapper pattern below. The sidecar alternative documented in Configuration requires >= 1.29 for native sidecar lifecycle.
- **`kubectl` access** to a namespace where you can create Jobs, ConfigMaps, and Secrets.

## Recipe

This is the shortest configuration that produces a working alert when a Job exits non-zero. The pattern: `ding run` wraps your command in the main container; an initContainer copies `/ding` from the published image into a shared emptyDir; the workload container mounts the `ding.yaml` ConfigMap directly and pulls Secret keys into its env via `envFrom`. DING expands `${SLACK_WEBHOOK_URL}` from the env at startup — no template-rendering step needed.
Save the YAML below as `ding-job.yaml` and apply with `kubectl apply -f ding-job.yaml`:
```yaml
---
# Notifier credential — replace with your real Slack webhook URL.
# The Secret key MUST be a valid env var name (uppercase, no dashes) so
# `envFrom: secretRef:` below can surface it directly into the workload
# container's environment.
apiVersion: v1
kind: Secret
metadata:
  name: ding-secrets
type: Opaque
stringData:
  SLACK_WEBHOOK_URL: https://hooks.slack.com/services/T.../B.../...
---
# DING config. ${SLACK_WEBHOOK_URL} is expanded by DING itself at startup
# (see docs/configuration.md#environment-variable-substitution).
apiVersion: v1
kind: ConfigMap
metadata:
  name: ding-config
data:
  ding.yaml: |
    server:
      drain_timeout: 30s
    notifiers:
      slack:
        type: slack
        url: ${SLACK_WEBHOOK_URL}
    rules:
      - name: job_failed
        match:
          metric: run.exit
        condition: value > 0
        message: " (Job ) failed with exit after "
        alert:
          - notifier: slack
---
# Job — replace `image:` and the workload `command:` with your real workload.
# The example below intentionally exits 1 so the failure path can be observed.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      terminationGracePeriodSeconds: 60
      volumes:
        - name: ding-bin
          emptyDir: {}
        - name: ding-config
          configMap:
            name: ding-config
      initContainers:
        - name: install-ding
          image: ghcr.io/ding-labs/ding:v0.10.0
          # `ding install` self-copies the binary — works against the FROM-scratch
          # release image (no /bin/sh available). Added in DING v0.5.1.
          command: ["/ding", "install", "/shared/ding"]
          volumeMounts:
            - { name: ding-bin, mountPath: /shared }
      containers:
        - name: workload
          image: alpine:3
          command:
            - /shared/ding
            - run
            - --config
            - /config/ding.yaml
            - --
            - /bin/sh
            - -c
            - echo running; sleep 1; exit 1
          envFrom:
            # Surfaces every key from ding-secrets as an env var inside the
            # workload container. DING expands ${SLACK_WEBHOOK_URL} from here.
            - secretRef:
                name: ding-secrets
          env:
            # Downward API surfaces Pod metadata as env vars; runctx
            # auto-labels alerts with these.
            - name: POD_UID
              valueFrom: { fieldRef: { fieldPath: metadata.uid } }
            - name: POD_NAME
              valueFrom: { fieldRef: { fieldPath: metadata.name } }
            - name: POD_NAMESPACE
              valueFrom: { fieldRef: { fieldPath: metadata.namespace } }
            - name: NODE_NAME
              valueFrom: { fieldRef: { fieldPath: spec.nodeName } }
            - name: JOB_NAME
              valueFrom: { fieldRef: { fieldPath: "metadata.labels['job-name']" } }
          volumeMounts:
            - { name: ding-bin, mountPath: /shared }
            - { name: ding-config, mountPath: /config, readOnly: true }
```
For a CronJob, wrap the same `template:` block in a `jobTemplate:`:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch
spec:
  schedule: "0 2 * * *"  # 02:00 UTC daily
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        # Identical to the Job's spec.template above:
        # restartPolicy, terminationGracePeriodSeconds, volumes,
        # initContainers, containers, downward-API env block.
        ...
```
This is the wedge’s headline use case in K8s: silent CronJob failure is the most universal observability pain in the platform, and DING surfaces it as an alert the moment the Pod exits non-zero — no scrape interval, no Pushgateway, no separate backend.
What you get:

- Every alert is tagged with `namespace`, `pod`, `node`, `job_name`, and `exit_code`.
- A `runner=kubernetes` label so rules can dispatch K8s alerts differently from CI alerts.
- On Pod deletion, `ding run` forwards the signal to the workload, drains in-flight notifier deliveries, then exits — within `terminationGracePeriodSeconds`.
- The Job’s status (`Complete` vs `Failed`) reflects reality.

runctx auto-detects Kubernetes via the presence of `KUBERNETES_SERVICE_HOST` (kubelet-injected on every Pod with a service account, which is the default) and captures these labels:
| Label | Source |
|---|---|
| `run_id` | `POD_UID` (downward API `metadata.uid`); falls back to `POD_NAME` |
| `runner` | `"kubernetes"` (set by runctx) |
| `namespace` | `POD_NAMESPACE` (downward API `metadata.namespace`) |
| `pod` | `POD_NAME` (downward API `metadata.name`) |
| `node` | `NODE_NAME` (downward API `spec.nodeName`) |
| `job_name` | `JOB_NAME` (downward API `metadata.labels['job-name']` — auto-injected by the Job controller) |
For CronJobs, `job_name` is the spawned Job’s randomized name (e.g. `nightly-batch-1737848400`); the parent CronJob’s name isn’t directly available via the downward API. If you want the CronJob name in alerts, surface it explicitly via a label on `jobTemplate.spec.template.metadata.labels` plus an additional `fieldRef` in the env block.
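For illustration, the wiring could look like the sketch below. The `cronjob-name` label key and `CRONJOB_NAME` env var are arbitrary names chosen for this example, not DING conventions:

```yaml
# CronJob excerpt: stamp the parent's name onto the Pod as a label,
# then surface it back into the env via the downward API.
spec:
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            cronjob-name: nightly-batch   # keep in sync with the CronJob's name
        spec:
          containers:
            - name: workload
              env:
                - name: CRONJOB_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.labels['cronjob-name']
```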
A self-hosted CI runner (GitHub Actions, GitLab CI, etc.) deployed on Kubernetes will set both its CI env vars and `KUBERNETES_SERVICE_HOST`. In that case runctx reports the CI platform — its labels are richer for alerting purposes — and the K8s labels are skipped. See Configuration for the full notifier reference.
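The precedence just described can be sketched as a small function. The runner names returned here are illustrative, not DING’s actual label values:

```python
# Sketch of runctx-style runner detection: CI platform env vars take
# precedence over the in-cluster KUBERNETES_SERVICE_HOST marker.
def detect_runner(env: dict) -> str:
    if env.get("GITHUB_ACTIONS") == "true":  # set by GitHub Actions runners
        return "github_actions"
    if env.get("GITLAB_CI") == "true":       # set by GitLab CI runners
        return "gitlab_ci"
    if "KUBERNETES_SERVICE_HOST" in env:     # kubelet-injected on every Pod
        return "kubernetes"
    return "local"

# A CI runner Pod sets both, and CI wins:
# detect_runner({"GITHUB_ACTIONS": "true", "KUBERNETES_SERVICE_HOST": "10.0.0.1"})
# -> "github_actions"
```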
## `drain_timeout` and `terminationGracePeriodSeconds`

Kubernetes sends SIGTERM on Pod deletion, then waits up to `terminationGracePeriodSeconds` (default 30) before force-killing with SIGKILL. DING’s `ding run` traps SIGTERM, forwards it to the child, then drains queued notifier deliveries before exiting — but only up to `server.drain_timeout` (default 5s).
The defaults are unsafe in practice. With Slack/PagerDuty’s default `initial_backoff: 1s` and `max_attempts: 3`, a full retry cycle takes ~7s — which the default 5s drain truncates silently. The recipe sets `drain_timeout: 30s` and `terminationGracePeriodSeconds: 60` so a SIGTERM-initiated graceful shutdown completes the retry cycle without truncation. Tune higher if you have notifiers with longer retry policies.
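As a back-of-envelope check on the ~7s figure, assuming each attempt is followed by a doubling backoff sleep (the exact retry policy is notifier-specific):

```python
def retry_cycle_seconds(initial_backoff: float, max_attempts: int) -> float:
    # Total backoff slept across the cycle: initial * (2**attempts - 1).
    return sum(initial_backoff * 2 ** i for i in range(max_attempts))

# initial_backoff=1s, max_attempts=3 -> 1 + 2 + 4 = 7s of backoff,
# which overruns the default drain_timeout of 5s.
print(retry_cycle_seconds(1.0, 3))  # 7.0
```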
The wrapper pattern above requires modifying the workload container’s `command:`. If you can’t (e.g. a third-party image with a fixed entrypoint), run DING as a native sidecar instead:
```yaml
spec:
  template:
    spec:
      initContainers:
        - name: ding
          image: ghcr.io/ding-labs/ding:v0.10.0
          restartPolicy: Always   # native sidecar — K8s 1.29+
          command: ["/ding", "serve", "--config", "/etc/ding/ding.yaml"]
          # ...volumeMounts for config + downward-API env block
      containers:
        - name: workload
          image: third-party/image:tag
          # ...workload posts events to http://localhost:8080/events
```
The workload sends events to DING over the Pod’s loopback. DING runs as a long-lived serve process; native sidecar lifecycle (initContainer with restartPolicy: Always) ensures the sidecar is auto-killed when the workload exits, so the Job can complete. This pattern requires Kubernetes 1.29 or later — earlier versions hit the long-standing Job-completion deadlock where sidecars never exit on their own.
The sidecar pattern is heavier (separate container, IPC over HTTP, no `ding run` lifecycle semantics), so use it only when the wrapper pattern can’t apply.
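For illustration, a minimal client the workload could use to post an event to the sidecar. The `/events` path comes from the recipe above, but the JSON field names (`metric`, `value`) are assumptions — check DING’s HTTP API reference before relying on them:

```python
import json
import urllib.request

# Pod loopback address where the `ding serve` sidecar listens (per the recipe).
DEFAULT_URL = "http://localhost:8080/events"

def post_event(metric: str, value: float, url: str = DEFAULT_URL) -> None:
    """One-shot JSON POST; raises on network/HTTP errors so the workload can log them."""
    body = json.dumps({"metric": metric, "value": value}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()  # drain the response body
```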
## Verify

1. Locally (with `SLACK_WEBHOOK_URL` exported in your shell), run `ding validate --config ding.yaml` — confirms the rule parses and `${SLACK_WEBHOOK_URL}` resolves.
2. Apply the manifests: `kubectl apply -f ding-job.yaml`.
3. Wait for the failure: `kubectl wait --for=condition=failed job/my-job --timeout=60s`. With the example’s `exit 1`, the Job should reach `Failed` quickly.
4. Confirm the Slack alert arrived, tagged with `pod`, `namespace`, `node`, `job_name`, and `exit_code`. Check `kubectl logs job/my-job -c workload` for DING’s drain output.
5. Flip the workload `command:`’s last line from `exit 1` to `exit 0`, reapply, and wait for `Complete`. Confirm no alert fires.
6. Replace the `kind: Job` manifest with the CronJob example, wait for the first spawned Job to complete, and verify the same alert behavior on a forced failure.

If the alert doesn’t fire, common issues: the Secret wasn’t readable (RBAC on the default ServiceAccount), `SLACK_WEBHOOK_URL` was empty/missing in the Secret, or `terminationGracePeriodSeconds` was too tight for the Pod’s actual deletion path.
If you want DING alerts to land as native Kubernetes Events visible to `kubectl describe pod` and `kubectl get events` — instead of (or alongside) Slack/PagerDuty/etc. — DING ships a built-in `type: kubernetes_event` notifier. No external webhook needed; alerts publish via the in-cluster ServiceAccount token. See `type: kubernetes_event` for the full reference.
Minimal `ding.yaml` snippet:
```yaml
notifiers:
  k8s:
    type: kubernetes_event
    # event_reason: DingAlertFired   # default
    # event_type: Warning            # default; "Normal" also valid
rules:
  - name: job_failed
    match: { metric: run.exit }
    condition: value > 0
    message: "Job failed (exit after )"
    alert:
      - notifier: k8s
```
Minimal RBAC (Role + RoleBinding granting the workload’s ServiceAccount permission to create Events in its own namespace):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ding-event-publisher
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ding-event-publisher
subjects:
  - kind: ServiceAccount
    name: default   # or the SA your Job uses
roleRef:
  kind: Role
  name: ding-event-publisher
  apiGroup: rbac.authorization.k8s.io
```
After applying these, `kubectl describe pod <ding-pod>` shows the alert in the `Events:` section, and `kubectl get events --field-selector reason=DingAlertFired` enumerates them across the namespace. Identical alerts within K8s’s aggregation window collapse into a single Event with `count` incremented (no extra config — DING’s per-rule cooldown still applies on top).
## Limitations

- The wrapper pattern requires control over the workload container’s `command:`. If the workload’s entrypoint is fixed (third-party image), use the sidecar alternative — but it requires K8s 1.29+ for native sidecar lifecycle.
- The Job controller auto-injects `job-name`, but the parent CronJob’s name has to be surfaced manually via a label on `jobTemplate.spec.template.metadata.labels` plus an additional `fieldRef`.
- The pre-1.29 sidecar workaround (`pkill` in a lifecycle hook) is documented in upstream K8s docs; it’s outside this recipe’s scope.

This recipe is a Tier-2 candidate by the program’s standard rubric. The boilerplate count is the structural problem: the manifest is mostly mechanical plumbing (volumes, initContainers, the downward-API env block) that every K8s user copies verbatim. The `ding-k8s-job` Helm chart (separate repo, mirroring the `ding-action` pattern) templates that wrapper-pattern manifest behind `helm install ding-k8s-job ... --set image=my-app --set command='python train.py'`, collapsing the recipe to the one-line install in the tip above. `${VAR}` substitution in the YAML parser shipped, which is why the recipe needs no template-rendering step; the remaining boilerplate is pure manifest plumbing.