## memfd-bind ##

`runc` normally has to make a binary copy of itself (or of a smaller helper
binary called `runc-dmz`) when constructing a container process in order to
defend against certain container runtime attacks such as CVE-2019-5736.

This cloned binary only exists until the container process starts (for
`runc run` and `runc exec` that is only a few hundred milliseconds; for
`runc create` it exists until `runc start` is called). However, because the
clone is done using a memfd (or by creating files in directories that are
likely to be a `tmpfs`), this can lead to temporary increases in *host* memory
usage. Unless you are running on a cgroupv1 system with both the memory
controller and the (deprecated) `memory.move_charge_at_immigrate` setting
enabled, there is no effect on the container's memory.

However, for certain configurations this can still be undesirable. This daemon
allows you to create a sealed memfd copy of the `runc` binary, which will cause
`runc` to skip all binary copying, resulting in no additional memory usage for
each container process (instead there is a single in-memory copy of the
binary). Strictly speaking, this is slightly less secure if you are concerned
about Dirty Cow-like 0-day kernel vulnerabilities, but for most users the
security benefit is identical.

The provided `memfd-bind@.service` file can be used to get systemd to manage
this daemon. You can supply the path like so:

```
% systemctl start memfd-bind@/usr/bin/runc
```

Thus, there are three ways of protecting against CVE-2019-5736, ordered by how
much memory they can use:

* `memfd-bind` only creates a single in-memory copy of the `runc` binary (about
  10MB), regardless of how many containers are running.

* `runc-dmz` is (depending on which libc it was compiled with) between 10kB and
  1MB in size, and a copy is created once per process spawned inside a
  container by runc (both the pid1 and every `runc exec`). The `RUNC_DMZ=true`
  environment variable needs to be set to opt in. There are circumstances where
  using `runc-dmz` will fail in ways that runc cannot predict ahead of time (such
  as restrictive LSMs applied to containers). `runc-dmz` also requires an
  additional `execve` compared to the other options, though since the binary is
  so small the cost is probably not even noticeable.

* The classic method of making a copy of the entire `runc` binary during
  container process setup takes up about 10MB per process spawned inside the
  container by runc (both pid1 and `runc exec`).

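The trade-off between the three options can be sketched as simple arithmetic.
The function below is a hypothetical helper, not part of runc. It uses the
approximate sizes quoted above (10MB for `runc`, and the 1MB upper bound for
`runc-dmz`) as default assumptions, and it treats every per-process copy as
live at the same time, which is a worst case since those copies only exist
until the container process starts.

```python
MB = 1000 * 1000


def extra_host_memory(method: str, processes: int,
                      runc_size: int = 10 * MB,
                      dmz_size: int = 1 * MB) -> int:
    """Rough worst-case bytes of host memory held by binary copies.

    `processes` is the number of container processes being set up at once.
    Sizes are the approximate figures from the list above (assumptions).
    """
    if method == "memfd-bind":
        # One shared in-memory copy, regardless of the process count.
        return runc_size
    if method == "runc-dmz":
        # One small copy per spawned process.
        return dmz_size * processes
    if method == "classic":
        # One full copy of runc per spawned process.
        return runc_size * processes
    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    for name in ("memfd-bind", "runc-dmz", "classic"):
        print(name, extra_host_memory(name, processes=50) // MB, "MB")
```

With only a handful of concurrent process starts, `runc-dmz` can hold less
memory than the single `memfd-bind` copy; the fixed cost of `memfd-bind` pays
off as the number of simultaneous starts grows.
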
### Caveats ###

There are several downsides to using `memfd-bind` on the `runc` binary:

* The `memfd-bind` process needs to continue to run indefinitely in order for
  the memfd reference to stay alive. If the process is forcefully killed, the
  bind-mount on top of the `runc` binary will become stale and nobody will be
  able to execute it (you can use `memfd-bind --cleanup` to clean up the stale
  mount).

* Only root can execute the cloned binary due to permission restrictions on
  accessing other processes' files. More specifically, only users with ptrace
  privileges over the `memfd-bind` daemon can access the file (but in practice
  this is usually only root).

* When updating `runc`, the daemon needs to be stopped before the update (so
  the package manager can access the underlying file) and then restarted after
  the update.