profiles/cluster: Fix slurmd kill task failed (!113) · Merge requests · RNL / Nix RNL

André Breda requested to merge ist189409/nixrnl:slurm-kill-task-failed into master Aug 19, 2024

Description of changes

slurmd sometimes takes longer to kill job tasks than slurmctld expects. When that happens, the slow node is drained and stops accepting jobs until it is manually resumed.

According to reports from other Slurm users online, this is caused by a mismatch between the timeouts used by the two components: the cgroup killer waits up to 120s, while slurmctld expects killed steps to be done within 60s.

This MR increases UnkillableStepTimeout to 128s, a conservative value that sits above the 120s cgroup timeout and should give slurmd ample time to finish killing tasks. This should leave few or no nodes drained with Reason=Kill task failed, at the cost of a small (~1min) extra delay before truly unhealthy nodes with truly unkillable tasks are drained.
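
For illustration, the setting could be expressed in a NixOS module roughly as follows. This is only a sketch: it assumes the cluster profile configures Slurm through the upstream `services.slurm.extraConfig` option, and the actual diff in this MR may place the line elsewhere.

```nix
{ ... }:
{
  # Assumption: Slurm is configured via the upstream NixOS module and
  # extra slurm.conf lines are appended through extraConfig.
  services.slurm.extraConfig = ''
    # Wait up to 128s before declaring a step unkillable, so that the
    # 120s cgroup kill timeout can elapse first.
    UnkillableStepTimeout=128
  '';
}
```

After deploying, the effective value can be checked on a node with `scontrol show config | grep UnkillableStepTimeout`.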

Things done

Tested
Updated documentation (Wiki/NetBox)
Breaking change
