Prevent OOM situation from unflushed page cache
Description of changes
When downloading a large file on borg to $CLUSTER_HOME, I experienced frequent crashes after downloading 1GB.
After investigating, I found that the cause was high memory usage by the page cache, not by my processes.
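For reviewers who want to reproduce the diagnosis, a minimal sketch follows. It assumes cgroup v2, where each cgroup directory exposes a memory.stat file whose "anon" line counts process memory and whose "file" line counts page cache. The function name cgroup_mem_breakdown is hypothetical and is not part of this MR.

```shell
#!/bin/sh
# Hypothetical helper (not part of this MR): print anonymous vs. page-cache
# memory, in bytes, for a cgroup v2 directory.
# Usage: cgroup_mem_breakdown /sys/fs/cgroup/<path-to-cgroup>
cgroup_mem_breakdown() {
    cg_dir=$1
    anon=
    file=
    # memory.stat holds one "key value" pair per line.
    while read -r key value; do
        case $key in
            anon) anon=$value ;;   # memory used by the processes themselves
            file) file=$value ;;   # page cache charged to this cgroup
        esac
    done < "$cg_dir/memory.stat"
    echo "anon=$anon file=$file"
}
```

A "file" value close to the cgroup's memory.max while "anon" stays small would match the situation described above.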
To prevent this issue, this MR updates all instances (profiles: cluster/server, ist/shell, nexus) where a hard memory limit (MemoryMax/memory.max) is set, and adds a soft limit (MemoryHigh/memory.high) that triggers aggressive page cache reclamation.
This soft limit was originally set, but its purpose was misunderstood, and it was eventually removed after long delays were observed between a process attempting to use too much memory and that process being killed. However, with the soft limit set close to the hard limit, I expect it not to substantially slow down the OOM killer. Additionally, the separate CPU and IO priority adjustments that are already present should limit bad actors' ability to abuse memory while avoiding being killed (thanks to the aggressive reclamation).
Things done
- Tested
  - mini-test with systemd-run: page cache is flushed when it gets full, preventing OOM condition
- Updated documentation (Wiki/NetBox)
- Breaking change
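A mini-test along the lines of the one mentioned under "Tested" could be sketched as follows. This is an illustration, not the exact command that was run: the limit values (1G and 900M), the wrapper name run_limited, and the DRY_RUN switch are all assumptions. It also assumes a systemd user session in which the memory controller is delegated.

```shell
#!/bin/sh
# Hypothetical limits for illustration; the values used in the profiles may differ.
MEM_MAX=${MEM_MAX:-1G}      # hard limit (MemoryMax / memory.max)
MEM_HIGH=${MEM_HIGH:-900M}  # soft limit (MemoryHigh / memory.high), kept close to the hard limit

# Run a command in a transient scope with both limits applied.
# With DRY_RUN=1, only print the systemd-run command instead of executing it.
run_limited() {
    ${DRY_RUN:+echo} systemd-run --user --scope \
        -p MemoryMax="$MEM_MAX" -p MemoryHigh="$MEM_HIGH" "$@"
}
```

For example, `run_limited dd if=/dev/zero of="$CLUSTER_HOME/pagecache-test.bin" bs=1M count=4096` (the file name is a placeholder) writes more data than the limits allow to sit in the page cache. With the soft limit in place, the cache should be reclaimed as it fills rather than pushing the scope into an OOM condition.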