profiles/cluster_server: Re-add nodes to cluster after reboot

Description of changes

Slurm kicks nodes out of the cluster when they reboot "unexpectedly", that is, without scontrol reboot or similar. This includes any regular reboot invocation and any reboot triggered from a desktop environment. I doubt anyone has the patience to keep reacting to these events and bringing nodes back up by hand.

This MR changes Slurm's behavior so that any node with a valid configuration is always brought back up. However, this also automatically brings back nodes that went down for legitimate reasons. In the future it is probably wise to revert this change and instead route all regular reboots through Slurm, so that nodes come back up automatically only in those cases.
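
For reference, the behavior described above matches Slurm's ReturnToService option in slurm.conf. The fragment below is only a sketch and assumes that this is the setting the MR changes; the actual change lives in profiles/cluster_server and may be expressed differently there.

```
# slurm.conf (sketch; assumes ReturnToService is the option changed by this MR)
# 0: a DOWN node stays DOWN until an administrator resumes it
# 1: a DOWN node returns to service only if it was set DOWN for being non-responsive
# 2: a DOWN node returns to service as soon as it registers with a valid
#    configuration, whatever the reason it was set DOWN
ReturnToService=2
```

Without a setting like this, each node has to be resumed by hand, for example with scontrol update NodeName=<node> State=RESUME, where <node> is a placeholder for the affected node name.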

Things done

  • Tested
  • Updated documentation (Wiki/NetBox)
  • Breaking change