profiles/cluster: Facilitate prolog/epilog error debugging (!121)

André Breda requested to merge ist189409/nixrnl:debug-slurm into master Sep 09, 2024

Description of changes

Slurm has been draining nodes because of prolog errors, and these are not always timeouts. The errors seem to correlate with systemd/systemd-logind repeatedly flooding DBus with requests, using full CPU threads in the process.

To help identify the underlying cause, this MR:

- Redirects the stdout and stderr of the Slurm Prolog, TaskProlog and Epilog scripts to the node's journal.
- Increases journald's size limit from the default of 4GB to 50GB. The DBus flood generates ~32MB/min of journal entries, which fills 4GB in ~2h.
  - It also sets a conservative value for SystemKeepFree, ensuring journald leaves 80GB of disk space free.
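
For readers without the diff at hand, the changes listed above could be sketched in a NixOS module roughly as follows. This is an illustration under stated assumptions, not the MR's actual code: the `toJournal` helper and the `./prolog.sh` and `./epilog.sh` paths are hypothetical, and only the `SystemMaxUse` and `SystemKeepFree` values come from the description. It relies on `systemd-cat` running the given command with stdout and stderr connected to the journal.

```nix
{ pkgs, ... }:
let
  # Hypothetical helper (not taken from the MR): wraps a script so that its
  # stdout and stderr are sent to the journal under the tag "slurm-<name>".
  toJournal = name: script:
    pkgs.writeShellScript "slurm-${name}-journal" ''
      exec ${pkgs.systemd}/bin/systemd-cat -t slurm-${name} ${script}
    '';
in
{
  # Journal limits from the description: cap usage at 50G and stop growing
  # once less than 80G of disk space would remain free.
  services.journald.extraConfig = ''
    SystemMaxUse=50G
    SystemKeepFree=80G
  '';

  # Assumed script locations; the real profile may define them differently.
  services.slurm.extraConfig = ''
    Prolog=${toJournal "prolog" ./prolog.sh}
    Epilog=${toJournal "epilog" ./epilog.sh}
  '';
}
```

With a wrapper like this, the output of a failing prolog could be read on the node with `journalctl -t slurm-prolog`. TaskProlog is left out of the sketch because Slurm interprets that script's stdout (for example `export` lines), so redirecting it needs more care than a blanket wrapper.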

Things done

Tested
Updated documentation (Wiki/NetBox)
Breaking change

Edited Oct 03, 2024 by Carlos Jorge Simão Nogueira Vaz
