profiles/cluster: Fix slurmd kill task failed
Description of changes
slurmd sometimes takes longer than slurmctld expects to kill job tasks. This causes the slow node to be drained and to stop accepting jobs until it is manually resumed.
According to reports online, this is due to a mismatch between the timeouts used by the two components (the CGroup killer uses 120s, while slurmctld expects killed steps to be done within 60s).
This MR increases UnkillableStepTimeout to 128s, a conservative value which should give slurmd ample time to get its stuff together.
This should result in few or no nodes being drained with Reason=Kill task failed, at the cost of a small (~1min) extra delay before truly unhealthy nodes with truly unkillable tasks are drained.
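
For reference, a minimal sketch of the setting as it would appear in the resulting slurm.conf. Only the parameter name and value come from this MR; how the cluster profile actually generates the file is not shown here, so the surrounding layout is an assumption.

```
# slurm.conf (sketch; the exact file layout is an assumption)
# Seconds Slurm waits after sending SIGKILL before deciding that a
# step's processes are unkillable (Slurm default: 60).
UnkillableStepTimeout=128
```

The value is chosen to sit just above the 120s CGroup killer timeout mentioned above, so slurmd should finish its own cleanup before the step is declared unkillable.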
Things done
- Tested
- Updated documentation (Wiki/NetBox)
- Breaking change