Airflow's SLA misses, callbacks, and email alerts are all evaluated by the scheduler process. If the scheduler itself is down, none of that machinery ever runs — an external ping is the only thing that still notices.
Airflow's built-in reliability features — SLA misses, on_failure_callback, retry/alerting email — are all code that the scheduler process evaluates as it processes DAG runs. That's exactly the layer that can go missing without warning: the scheduler container OOM-kills and a deploy doesn't restart it, the metadata database connection pool exhausts and the scheduler heartbeat stalls, or a DAG-parsing error in an unrelated file (a broken import at the top level of any DAG in the folder) can, on some Airflow versions and configurations, slow or stall the whole DAG-file processing loop. In every one of these cases, the Airflow UI itself can look deceptively calm — the last successful DAG run is still sitting there in green, because nothing has come along to mark it stale. A DAG can also simply never get scheduled if it was accidentally left paused after a deploy (a very common mistake, since new DAGs default to paused in many Airflow configs) — again, nothing fails; nothing runs.
The idiomatic way to hook success/failure without adding a dedicated task is default_args callbacks, applied to every task in the DAG:
Because callbacks fire per-task, put them only on the DAG's final task (or use a DAG-level on_success_callback/on_failure_callback passed to DAG(...) directly instead of default_args) so one ping represents the whole run, not each task in it. Set the check's Cron schedule to match the DAG's schedule argument, and its timezone to the DAG's start_date timezone.
Callbacks don't have a "task started" hook by default — add a dummy first task that pings /start so hung-job duration tracking works the same way it does for any other scheduler:
Everything above depends on the scheduler successfully running the DAG in the first place — which is precisely what's missing if the scheduler process is down. Add a separate, trivial DAG scheduled every few minutes whose only task pings a heartbeat check:
Give that check a Simple schedule of 300 seconds and a tight grace (5–10 minutes). If the scheduler dies, this is the only signal that still fires — nothing inside Airflow can report the absence of its own heartbeat.
Every check has a public SVG badge that shows its live status (updates within ~1 minute). Paste this into any README — it doubles as a heartbeat anyone on the team can see:
Copy the exact markdown from your check's detail page. Add ?label=your-text to customize the left label.
Ready to wire this up? Create a free check — 20 checks, all alert channels, no card.