profiles/cluster/client: Auto-restart slurmd
Description of changes
slurmd seems to die unexpectedly from time to time, and RNL does not have a 24/7 ops team to keep nodes alive.
Given that it is a service that does not hold data, it is fairly safe to auto-restart, circumventing any transient faults in the rest of the infrastructure automatically.
This MR does so, using an exponential back-off that delays each successive restart further, up to a maximum of 1h.
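To give a feel for how such a back-off behaves, here is a minimal Python sketch of the resulting restart schedule. Only the 1h cap comes from this MR; the 1s starting delay and the doubling factor are assumptions chosen for illustration, and the actual values in the service configuration may differ.

```python
from itertools import islice

MAX_DELAY_S = 3600  # 1h cap, as stated in this MR
BASE_DELAY_S = 1    # assumption: delay before the first restart
FACTOR = 2          # assumption: delay doubles after each restart


def restart_delays(base=BASE_DELAY_S, factor=FACTOR, cap=MAX_DELAY_S):
    """Yield the delay (in seconds) before each successive restart."""
    delay = base
    while True:
        yield min(delay, cap)
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap)


# Delays before the first 14 consecutive restarts.
schedule = list(islice(restart_delays(), 14))
print(schedule)
```

Under these assumed values, the delay reaches the 1h cap on the 13th consecutive restart and stays there, so a persistently failing slurmd is retried at most once per hour.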
Worst-case scenario after applying this change: a faulty node somehow crashes without being marked faulty by slurmctld and keeps getting re-added to the cluster, failing user jobs.
This seems unlikely, and users can work around it on their own by explicitly excluding the node from their jobs.
Things done
- Tested
- Updated documentation (Wiki/NetBox)
- Breaking change