Prevent OOM situation from unflushed page cache (!101)

André Breda requested to merge ist189409/nixrnl:fix-memlimit-pagecache into master Jun 14, 2024

Description of changes

When downloading a large file on borg to $CLUSTER_HOME, I experienced frequent crashes once 1 GB had been downloaded. After investigating, I found that the crashes were caused by the page cache's high memory usage, not by my own processes.

To prevent this issue, this MR updates every instance (profiles: cluster/server, ist/shell, nexus) where a hard memory limit (MemoryMax/memory.max) is set, adding a soft limit (MemoryHigh/memory.high) that triggers aggressive page cache reclamation.

A soft limit was set originally, but its purpose was misunderstood, and it was eventually removed after long delays were observed between a process attempting to use too much memory and that process being killed. By setting the soft limit close to the hard limit, however, I expect it not to substantially slow down the OOM killer. Additionally, the separate CPU and IO priority adjustments that are already in place should limit bad actors' ability to abuse memory while avoiding being killed (thanks to the aggressive reclamation).
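
As a minimal sketch of the pairing described above (not the actual diff of this MR; the unit name `example-job` and both sizes are hypothetical), a soft limit placed just below the hard limit could look roughly like this in a NixOS module:

```nix
{
  # Hypothetical unit name and sizes, for illustration only.
  systemd.services.example-job.serviceConfig = {
    # Hard limit (cgroup v2 memory.max): if usage reaches it and cannot be
    # reduced, the OOM killer is invoked in the cgroup.
    MemoryMax = "8G";
    # Soft limit (cgroup v2 memory.high): above it the kernel throttles the
    # cgroup and puts it under heavy reclaim pressure, page cache included.
    MemoryHigh = "7G";
  };
}
```

Keeping the gap between the two values small is meant to leave little room for a process to linger in the throttled state before it hits the hard limit.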

Things done

Tested
- mini-test with systemd-run: the page cache is flushed when it fills up, preventing an OOM condition
Updated documentation (Wiki/NetBox)
Breaking change
