
profiles/cluster/client: Auto-restart slurmd

André Breda requested to merge ist189409/nixrnl:slurm-autorestart into master

Description of changes

slurmd occasionally dies, and RNL does not have a 24/7 ops team to keep nodes alive. Since the service holds no data, it is fairly safe to restart automatically, which lets nodes recover from transient faults elsewhere in the infrastructure without manual intervention. This MR enables auto-restart with an exponential back-off: the delay between restarts grows after each failure, up to a maximum of 1h.
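
To give a feel for the restart cadence, here is a minimal sketch of a capped exponential back-off schedule. Only the 1h cap comes from this description; the 1s initial delay, the doubling factor, and the name restart_delays are assumptions made for illustration and do not reflect the actual values or mechanism in the diff.

```python
from itertools import islice
from typing import Iterator

MAX_DELAY_S = 3600.0  # 1h cap, taken from the MR description


def restart_delays(
    initial_s: float = 1.0,   # assumed starting delay (not from the MR)
    factor: float = 2.0,      # assumed growth factor (not from the MR)
    max_delay_s: float = MAX_DELAY_S,
) -> Iterator[float]:
    """Yield the wait, in seconds, before each successive restart attempt."""
    delay = initial_s
    while True:
        yield min(delay, max_delay_s)
        # Clamp while growing so the value never exceeds the cap.
        delay = min(delay * factor, max_delay_s)


if __name__ == "__main__":
    # Print the first 14 delays of the hypothetical schedule.
    print(list(islice(restart_delays(), 14)))
```

Under these assumed parameters the delay reaches the cap on the 13th consecutive restart and stays there, so a persistently failing node is retried at most once per hour.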

Worst case scenario after applying this change: a faulty node crashes repeatedly without being marked faulty by slurmctld, so it keeps rejoining the cluster and failing user jobs. This seems unlikely, and users can work around it independently by explicitly excluding the node from their jobs (for example with sbatch --exclude).

Things done

  • Tested
  • Updated documentation (Wiki/NetBox)
  • Breaking change
