profiles/os/nixos: Fix network-online.target when using dhcpcd

André Breda requested to merge ist189409/nixrnl:fix-network-online into master

Description of changes

The cluster has been recently plagued with nodes that are not placed automatically in a draining state but fail all jobs they're included in with a message similar to:

srun: error: Nodes <full node list for job> are still not ready
srun: error: Something is wrong with the boot of the nodes.

Note: a previous version of this description mentioned nodes draining. This was false: we only observed it in nodes that were NOT draining but prevented jobs from executing nonetheless.

A recent investigation on lab0p1 (2025-03-27, around 15:13) suggested this may be related to service ordering issues. Concretely, lab0p1 had just booted, slurmd had started and accepted a job, but errors ensued in the PAM Kerberos module and related services, mentioning a lack of connectivity to the respective server. A few seconds later, we saw a message from the DHCP client stating that it had acquired an (IPv4) address. This leads us to hypothesize that the root of the problem is a lack of full network connectivity at the time jobs are started.

This issue is possible in the current state of the configuration. While slurmd.service is ordered after network-online.target, no service attached to that target enforces that the network is in fact available: currently, the target only waits for the DHCP client to start, not for it to acquire an address. Fixing this involves creating a new service that is ordered before network-online.target (and wanted by it) and that confirms the node has IPv4 and IPv6 connectivity, similar to NetworkManager-wait-online.service or systemd-networkd-wait-online.service.

This MR creates such a service, which exits once it has successfully pinged both the IPv4 and IPv6 addresses of Cloudflare's DNS server. The service is only created when NetworkManager is disabled, networking.useNetworkd is false, and no default gateways are configured. The MR also changes the configuration of a handful of hosts to set the gateway through the networking.defaultGateway* options instead of through routes on a particular interface.
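
For reviewers unfamiliar with this pattern, the following is a minimal sketch of what such a service could look like as a NixOS module. It is not the code in this MR. The unit name `wait-online-ping`, the retry interval, and the ping targets (1.1.1.1 and 2606:4700:4700::1111, Cloudflare's public resolver addresses) are assumptions made for illustration.

```nix
{ config, lib, pkgs, ... }:

let
  net = config.networking;

  # Mirror the conditions from the description: only needed when neither
  # NetworkManager nor systemd-networkd is in use and no static default
  # gateway has been configured.
  needsWaitOnline =
    !net.networkmanager.enable
    && !net.useNetworkd
    && net.defaultGateway == null
    && net.defaultGateway6 == null;
in
{
  # "wait-online-ping" is a hypothetical unit name chosen for this sketch.
  systemd.services.wait-online-ping = lib.mkIf needsWaitOnline {
    description = "Wait for IPv4 and IPv6 connectivity";

    # Pulled in by network-online.target and ordered before it, so units
    # ordered after that target (such as slurmd.service) wait for this one.
    wantedBy = [ "network-online.target" ];
    before = [ "network-online.target" ];

    # Provides the ping binary used in the script below.
    path = [ pkgs.iputils ];

    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
    };

    # Retry until both address families can reach the target.
    # The addresses and the 1-second intervals are assumptions.
    script = ''
      until ping -4 -c 1 -W 1 1.1.1.1 \
         && ping -6 -c 1 -W 1 2606:4700:4700::1111; do
        sleep 1
      done
    '';
  };
}
```

Because the unit is of type oneshot, systemd only considers it started once the script exits, which is what delays network-online.target until both pings succeed. A real deployment would likely also want to bound the wait (for example with TimeoutStartSec) so that a node without IPv6 connectivity does not block boot indefinitely.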

Alternatively, we could set networking.useNetworkd to true to use the systemd-networkd backend for NixOS networking configuration, but that backend is currently marked as experimental.

Things done

  • Tested
    • Script manually tested outside of the service.
    • The changes to tardis (which has two network interfaces) and labs (which use DHCP) warrant testing.
    • Remaining non-DHCP-using hosts should not be affected.
  • Updated documentation (Wiki/NetBox)
  • Breaking change
Edited by André Breda
