profiles/cluster: Facilitate prolog/epilog error debugging

André Breda requested to merge ist189409/nixrnl:debug-slurm into master

Description of changes

Slurm has been draining nodes because of prolog errors, and not all of them are timeouts. The errors appear to correlate with systemd/systemd-logind repeatedly flooding DBus with requests, saturating entire CPU threads in the process.

To help identify the underlying cause, this MR:

  • Redirects the stdout and stderr of Slurm Prolog, TaskProlog and Epilog scripts to the node's journal
  • Increases journald's size limit from the default of 4GB to 50GB. The DBus flood generates ~32MB/min of journal entries, which fills 4GB in ~2h.
    • It also sets a conservative value for SystemKeepFree, ensuring journald maintains 80GB of free space.
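
For reviewers unfamiliar with the knobs involved, the changes listed above could look roughly like the following NixOS fragment. This is a minimal sketch and not the diff from this MR: the `realProlog` path, the wrapper name and the `slurm-prolog` journal tag are placeholders assumed for illustration, and only the Prolog case is shown. At the stated ~32MB/min, a 50GB limit holds roughly 26 hours of flood output (50 × 1024 / 32 = 1600 min).

```nix
# Hypothetical NixOS module fragment, for illustration only.
{ pkgs, ... }:
let
  # Placeholder path to the site's existing prolog script (assumed).
  realProlog = "/etc/slurm/prolog.sh";

  # Hypothetical wrapper: systemd-cat connects the command's stdout and
  # stderr to the journal, tagged with the identifier given by -t, and
  # then executes the command, so the prolog's exit status is preserved.
  prologToJournal = pkgs.writeShellScript "slurm-prolog-journal" ''
    exec ${pkgs.systemd}/bin/systemd-cat -t slurm-prolog ${realProlog}
  '';
in
{
  # journald.conf settings: cap journal disk usage at 50G and keep at
  # least 80G of the filesystem free.
  services.journald.extraConfig = ''
    SystemMaxUse=50G
    SystemKeepFree=80G
  '';

  # Appended verbatim to slurm.conf; Epilog and TaskProlog would need
  # similar wrappers (not shown).
  services.slurm.extraConfig = ''
    Prolog=${prologToJournal}
  '';
}
```

With a wrapper like this in place, the captured output could be read on a node with `journalctl -t slurm-prolog`, assuming that tag is the one used.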

Things done

  • Tested
  • Updated documentation (Wiki/NetBox)
  • Breaking change
Edited by Carlos Jorge Simão Nogueira Vaz
