profiles/cluster: Facilitate prolog/epilog error debugging
Description of changes
Slurm has been draining nodes with prolog errors, which are not always timeouts. The errors seem to be correlated with systemd/systemd-logind repeatedly flooding DBus with requests (and using full CPU threads in the process).
To help identify the underlying cause, this MR:
- Redirects the stdout and stderr of Slurm Prolog, TaskProlog and Epilog scripts to the node's journal
- Increases journald's size limits from the default of 4GB to 50GB: the DBus flood generates ~32MB/min of journal entries, filling 4GB in ~2h.
  - It also sets a conservative value for SystemKeepFree, ensuring journald maintains 80GB of free space.
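For readers unfamiliar with the redirection technique, here is a minimal sketch of one way to forward a script's stdout and stderr to the journal with `systemd-cat`. It is not necessarily how this MR implements it: the `run_logged` helper, the `slurm-prolog` tag, the `LOG_SINK` override, and the example path are all hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch only: the function name, the "slurm-prolog" tag, and the LOG_SINK
# override are hypothetical and not taken from this MR.

# Run a command and send its stdout and stderr to the journal via systemd-cat.
# LOG_SINK may name another program that reads stdin (for example "cat"),
# so the helper can be tried on a machine without systemd.
run_logged() {
    tag="$1"
    shift
    if [ -n "${LOG_SINK:-}" ]; then
        "$@" 2>&1 | "$LOG_SINK"
    else
        "$@" 2>&1 | systemd-cat -t "$tag"
    fi
}

# Example (hypothetical path): run_logged slurm-prolog /path/to/real-prolog.sh
```

Entries written this way can be read back with `journalctl -t slurm-prolog`. Note that a pipeline returns the exit status of its last command, so this sketch does not propagate the wrapped script's own exit code; a real Prolog wrapper would need to preserve it, since Slurm acts on the Prolog's exit status.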
Things done
- Tested
- Updated documentation (Wiki/NetBox)
- Breaking change
Edited by Carlos Jorge Simão Nogueira Vaz