profiles/cluster/client: Auto-restart slurmd (!112) · Merge requests · RNL / Nix RNL

André Breda requested to merge ist189409/nixrnl:slurm-autorestart into master Aug 15, 2024

Description of changes

slurmd sometimes dies unexpectedly, and RNL does not have a 24/7 ops team to keep nodes alive. Since slurmd is a service that does not hold data, it is fairly safe to restart automatically, which lets nodes recover from transient faults elsewhere in the infrastructure without manual intervention. This MR enables auto-restart with an exponential back-off: the delay between restarts grows progressively, up to a maximum of 1h.
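
As a rough illustration of the approach described above, the following is a minimal sketch of such an override in a NixOS profile. It is not the actual diff of this MR. It assumes the unit is named `slurmd` and that the back-off relies on systemd's `RestartSteps` and `RestartMaxDelaySec` settings (available since systemd 254). The initial delay, the number of steps, and the choice of `Restart = "always"` are hypothetical values; only the 1h cap comes from the description.

```nix
{ ... }:
{
  systemd.services.slurmd = {
    # Assumption: disable start rate limiting so systemd never gives up
    # restarting the unit after repeated failures.
    unitConfig.StartLimitIntervalSec = 0;

    serviceConfig = {
      # Assumption: restart whenever the process exits, cleanly or not.
      Restart = "always";
      # Hypothetical initial delay before the first restart.
      RestartSec = "5s";
      # Hypothetical number of steps over which the delay grows
      # from RestartSec up to RestartMaxDelaySec.
      RestartSteps = 10;
      # Upper bound on the restart delay, per the MR description.
      RestartMaxDelaySec = "1h";
    };
  };
}
```

If the upstream module already sets one of these options, the override may need `lib.mkForce` to take precedence.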

Worst-case scenario after applying this change: a faulty node somehow crashes without being marked faulty by slurmctld and keeps getting re-added to the cluster, failing user jobs. This seems unlikely, and users can work around it on their own by explicitly excluding the node from their jobs.

Things done

Tested
Updated documentation (Wiki/NetBox)
Breaking change
