## memfd-bind ##

`runc` normally has to make a binary copy of itself (or of a smaller helper
binary called `runc-dmz`) when constructing a container process in order to
defend against certain container runtime attacks such as CVE-2019-5736.

This cloned binary only exists until the container process starts. For
`runc run` and `runc exec` that is only a few hundred milliseconds; for
`runc create` it exists until `runc start` is called. However, because the
clone is done using a memfd (or by creating files in directories that are
likely to be a `tmpfs`), this can lead to temporary increases in *host* memory
usage. Unless you are running on a cgroupv1 system with the cgroupv1 memory
controller enabled and the (deprecated) `memory.move_charge_at_immigrate`
setting enabled, there is no effect on the container's memory.

However, for certain configurations this can still be undesirable. This daemon
allows you to create a sealed memfd copy of the `runc` binary, which will cause
`runc` to skip all binary copying, resulting in no additional memory usage for
each container process (instead there is a single in-memory copy of the
binary). Strictly speaking, this is slightly less secure if you are concerned
about Dirty COW-like 0-day kernel vulnerabilities, but for most users the
security benefit is identical.

The provided `memfd-bind@.service` file can be used to get systemd to manage
this daemon. You can supply the path like so:

```
% systemctl start memfd-bind@/usr/bin/runc
```

Thus, there are three ways of protecting against CVE-2019-5736, listed in
increasing order of how much memory they can use:

* `memfd-bind` only creates a single in-memory copy of the `runc` binary (about
  10MB), regardless of how many containers are running.

* `runc-dmz` is (depending on which libc it was compiled with) between 10kB and
  1MB in size, and a copy is created once per process spawned inside a
  container by runc (both the pid1 and every `runc exec`). The `RUNC_DMZ=true`
  environment variable needs to be set to opt in. There are circumstances where
  using `runc-dmz` will fail in ways that runc cannot predict ahead of time (such
  as restrictive LSMs applied to containers). `runc-dmz` also requires an
  additional `execve` compared to the other options, though since the binary is
  so small the cost is probably not even noticeable.

* The classic method of making a copy of the entire `runc` binary during
  container process setup takes up about 10MB per process spawned inside the
  container by runc (both pid1 and every `runc exec`).

### Caveats ###

There are several downsides to using `memfd-bind` on the `runc` binary:

* The `memfd-bind` process needs to continue to run indefinitely in order for
  the memfd reference to stay alive. If the process is forcefully killed, the
  bind-mount on top of the `runc` binary will become stale and nobody will be
  able to execute it (you can use `memfd-bind --cleanup` to clean up the stale
  mount).

* Only root can execute the cloned binary due to permission restrictions on
  accessing other processes' files. More specifically, only users with ptrace
  privileges over the `memfd-bind` daemon can access the file (but in practice
  this is usually only root).

* When updating `runc`, the daemon needs to be stopped before the update (so
  the package manager can access the underlying file) and then restarted after
  the update.
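
To put the three approaches compared earlier into numbers, the following is a
minimal back-of-the-envelope sketch in Python. It is not part of `runc`. The
sizes are the approximate figures quoted in this README, and `in_flight` is a
hypothetical count of container processes being set up at the same moment
(that is, whose temporary binary copy still exists). The function name
`binary_copy_bytes` is likewise made up for illustration.

```python
from __future__ import annotations

# Approximate sizes quoted in this README (decimal units assumed).
RUNC_BINARY_BYTES = 10 * 1000 * 1000  # "about 10MB"
DMZ_MIN_BYTES = 10 * 1000             # "10kB"
DMZ_MAX_BYTES = 1000 * 1000           # "1MB"


def binary_copy_bytes(in_flight: int) -> dict[str, tuple[int, int]]:
    """Return (low, high) estimates of host memory held by binary copies.

    `in_flight` is a hypothetical number of container processes that are
    currently being set up, so their temporary copy has not been freed yet.
    """
    if in_flight < 0:
        raise ValueError("in_flight must be non-negative")
    return {
        # A single shared copy held by the daemon, independent of in_flight.
        "memfd-bind": (RUNC_BINARY_BYTES, RUNC_BINARY_BYTES),
        # One small copy per process; the size depends on the libc used.
        "runc-dmz": (in_flight * DMZ_MIN_BYTES, in_flight * DMZ_MAX_BYTES),
        # One full copy of the runc binary per process.
        "classic": (in_flight * RUNC_BINARY_BYTES,
                    in_flight * RUNC_BINARY_BYTES),
    }


if __name__ == "__main__":
    for method, (low, high) in binary_copy_bytes(50).items():
        print(f"{method:>10}: {low / 1e6:.1f} MB to {high / 1e6:.1f} MB")
```

Note that the `memfd-bind` figure is a constant cost paid for as long as the
daemon runs, while the other two are transient and only matter while
processes are being started.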