profiles/ist-shell,cluster: Tame the OOM killer

André Breda requested to merge ist189409/nixrnl:oom into master

Description of changes

borg recently experienced several OOM killer events that unfortunately targeted the glusterfs process. Arguably, this means borg should have a bit more RAM, as GlusterFS seems to struggle under high memory usage. In any case, I expect the memory usage was not ridiculously high, and the OOM killer simply preferred killing glusterfs over a user process. That is not ideal, as it compromises cluster usage for all users (huge blast radius).

This MR:

  • Configures different OOM scores, which the OOM killer uses to choose which process to kill, so that it prefers killing user processes over any system process (with root's processes less likely to be killed than those of regular users), and generic system processes over glusterfs.
  • Configures zram swap and systemd-oomd on all "ist-shell" machines (labs, borg, nexus), so that under memory pressure processes are swapped out or preemptively terminated before the kernel OOM killer is triggered in the first place.

It does not increase the amount of available RAM in borg.
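
For readers unfamiliar with the knobs involved, the two changes listed above could look roughly like the following NixOS module. This is a minimal sketch, not the MR's actual diff: the score values, the `memoryPercent` value, and the choice of `glusterd` as the unit to protect are assumptions made for illustration, and the sketch does not reproduce the root versus regular user distinction described above.

```nix
{ ... }:
{
  # Compressed swap backed by RAM. The percentage is an illustrative value,
  # not necessarily what this MR configures.
  zramSwap = {
    enable = true;
    memoryPercent = 50;
  };

  # Have systemd-oomd watch memory pressure and act before the kernel
  # OOM killer has to. (enableUserSlices is the option name on recent
  # NixOS releases; older releases used a different name.)
  systemd.oomd = {
    enable = true;
    enableRootSlice = true;
    enableUserSlices = true;
  };

  # OOMScoreAdjust accepts values from -1000 (never kill) to 1000 (kill first).
  # Assumption: the process to protect runs as the glusterd unit; the real
  # unit name and value on borg may differ.
  systemd.services.glusterd.serviceConfig.OOMScoreAdjust = -900;

  # Assumption: one possible way to make user processes more attractive
  # victims is to raise the score of the per-user service manager, whose
  # children inherit it. The MR may implement this differently.
  systemd.services."user@".serviceConfig.OOMScoreAdjust = 500;
}
```

After a rebuild, the effective value for a given process can be checked by reading `/proc/<pid>/oom_score_adj`.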

Things done

  • Tested
  • Updated documentation (Wiki/NetBox)
  • Breaking change
Edited by André Breda