
Prevent OOM situation from unflushed page cache

André Breda requested to merge ist189409/nixrnl:fix-memlimit-pagecache into master

Description of changes

When downloading a large file on borg to $CLUSTER_HOME, I experienced frequent crashes after the first 1 GB had been downloaded. After investigating, I realized that the crashes were caused by the memory used by the page cache, not by my own processes.

To prevent this issue, this MR updates every place (profiles: cluster/server, ist/shell, nexus) where a hard memory limit (MemoryMax/memory.max) is set and adds a soft limit (MemoryHigh/memory.high) that triggers aggressive page cache reclamation.

This soft limit was originally set, but its purpose was misunderstood, and it was eventually removed after we observed long delays between a process attempting to use too much memory and that process being killed. However, by setting the soft limit close to the hard limit, I expect it not to substantially delay the OOM killer. Additionally, the separate CPU and IO priority adjustments that are already in place should limit bad actors' ability to abuse memory while avoiding being killed (thanks to the aggressive reclamation).
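
As a rough illustration of the approach described above, the fragment below pairs a hard limit with a soft limit set slightly below it. This is only a sketch: the slice name `example` and both values are assumptions made for illustration and are not taken from this MR; the actual profiles may attach the limits to different units and use different sizes.

```nix
# Hypothetical sketch: the slice name and both values are assumptions,
# not the settings used in the cluster/server, ist/shell or nexus profiles.
{ ... }:
{
  systemd.slices."example".sliceConfig = {
    # Hard limit (memory.max): exceeding it invokes the OOM killer
    # inside the cgroup.
    MemoryMax = "8G";

    # Soft limit (memory.high): above this threshold the kernel throttles
    # the cgroup and reclaims memory aggressively, including page cache.
    # It is kept close to MemoryMax so that a process that keeps allocating
    # still reaches the hard limit quickly.
    MemoryHigh = "7680M";
  };
}
```

The gap between the two values is the trade-off: a wider gap leaves more room for the page cache to be reclaimed before the hard limit is hit, while a narrower gap shortens the time a misbehaving process spends throttled before it is killed.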

Things done

  • Tested
    • mini-test with systemd-run: page cache is flushed when it gets full, preventing OOM condition
  • Updated documentation (Wiki/NetBox)
  • Breaking change
