
profiles/cluster: Support seamless usage of containers

André Breda requested to merge ist189409/nixrnl:slurm-containers into master

Description of changes

Running containers in Slurm jobs is currently troublesome:

  • subuid/gid assignments are not automatically created.
  • /run/user/<uid> is not automatically created.
  • DOCKER_HOST variable is not present (so no docker compose).

Slurm jobs don't go through PAM, so they are not tracked as logind sessions, and none of our PAM modules (or the scripts they invoke) run for them.

This MR sets up a job prolog in Slurm:

  1. Create subid assignments using existing infrastructure (subidappend).
  2. Enable logind's "linger state" for the job's user, which results in the user's runtime dir being created.
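The two prolog steps above can be sketched roughly as follows. This is only an illustration, not the script shipped in this MR: it assumes subidappend takes no arguments and reads the target user from PAM_USER (as it does when run from PAM), and that SLURM_JOB_USER is present in the prolog environment. The helper name prolog_commands is made up for the example.

```python
import os
import subprocess
import sys


def prolog_commands(job_user, base_env):
    """Return (argv, env) pairs for the two prolog steps (hypothetical helper)."""
    # Step 1: subidappend is assumed to read the target user from PAM_USER.
    subid_env = dict(base_env, PAM_USER=job_user)
    return [
        (["subidappend"], subid_env),
        # Step 2: lingering makes logind create /run/user/<uid> for the user.
        (["loginctl", "enable-linger", job_user], dict(base_env)),
    ]


def main():
    job_user = os.environ.get("SLURM_JOB_USER")
    if not job_user:
        print("SLURM_JOB_USER is not set", file=sys.stderr)
        return 1
    for argv, env in prolog_commands(job_user, os.environ):
        subprocess.run(argv, env=env, check=True)
    return 0


# Only act when actually started by Slurm (SLURM_JOB_ID is set for prologs).
if __name__ == "__main__" and "SLURM_JOB_ID" in os.environ:
    sys.exit(main())
```

The matching epilog would be the same sketch with only a `loginctl disable-linger` call for the job's user.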

It also sets up a job epilog to undo step 2: it disables the "linger state" for the job's user. While writing this, I realized this is subject to races: if the same user has multiple jobs using containers on the same machine, the first job to finish disables lingering while the others are still running, so their runtime dir may be cleaned up underneath them. However, this should be rare, and can probably be left as future work (keep a counter somewhere and use it to control linger). Note that SSH and in-situ sessions are NOT affected, as they are properly tracked by logind and will inhibit resource cleanup (unlike Slurm jobs/tasks).
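For reference, the counter idea left as future work could look something like the sketch below. It is not part of this MR. It assumes a per-user counter file in some state directory (ideally under /run so it resets on reboot; the exact path is hypothetical) and uses flock so that concurrent prologs and epilogs are serialized. The apply callback stands in for the actual loginctl call and is invoked while the lock is held.

```python
import fcntl
import os


def adjust_linger_count(state_dir, user, delta, apply=None):
    """Atomically add delta to the user's job counter and return the new value.

    apply(verb, user) is called under the lock when the first job starts
    ("enable-linger") or the last job ends ("disable-linger").
    """
    os.makedirs(state_dir, exist_ok=True)
    path = os.path.join(state_dir, user)
    # Open for read/write, creating the counter file if it does not exist.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # released when the file is closed
        raw = f.read().strip()
        count = max(0, (int(raw) if raw else 0) + delta)
        f.seek(0)
        f.truncate()
        f.write(str(count))
        f.flush()
        if apply is not None:
            if delta > 0 and count == 1:
                apply("enable-linger", user)
            elif delta < 0 and count == 0:
                apply("disable-linger", user)
        return count
```

The prolog would call this with delta=+1 and the epilog with delta=-1, passing an apply function that runs `loginctl <verb> <user>`.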

Finally, this MR uses a Slurm task prolog to set the DOCKER_HOST environment variable.
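As a rough illustration of how a task prolog can do this: Slurm applies any `export NAME=value` line that the task prolog prints on stdout to the task's environment. The sketch below assumes a rootless Docker daemon listening on docker.sock inside the user's runtime dir; the socket path is an assumption and may differ from what this MR configures.

```python
import os


def docker_host_line(uid):
    """Build the line a task prolog prints to set DOCKER_HOST for the task."""
    # Assumed socket location of a rootless Docker daemon; adjust if different.
    return f"export DOCKER_HOST=unix:///run/user/{uid}/docker.sock"


if __name__ == "__main__":
    # Task prologs run as the job's user, so getuid() is that user's uid.
    print(docker_host_line(os.getuid()))
```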

Things done

  • [-] Tested
    • The basic principle of running subidappend with PAM_USER set and enabling the lingering state works, but I did not use the SLURM_JOB_USER env variable at that time (it should exist in prolog scripts according to SLURM docs).
  • Updated documentation (Wiki/NetBox)
  • Breaking change
Edited by André Breda
