# Rootfs Overlay

Root filesystem overlay is now the default in runsc. This improves performance
for filesystem-heavy workloads by overlaying the container root filesystem with
a tmpfs filesystem. Learn more about this feature in the post below, which was
[originally posted](https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html)
on the [Google Open Source Blog](https://opensource.googleblog.com/).

--------------------------------------------------------------------------------

## Costly Filesystem Access

gVisor uses a trusted filesystem proxy process (“gofer”) to access the
filesystem on behalf of the sandbox. The sandbox process is considered untrusted
in gVisor’s
[security model](https://gvisor.dev/docs/architecture_guide/security/). As a
result, it is not given direct access to the container filesystem and
[its seccomp filters](https://github.com/google/gvisor/tree/master/runsc/boot/filter)
do not allow filesystem syscalls.

In gVisor, the container rootfs and
[bind mounts](https://docs.docker.com/storage/bind-mounts/#) are configured to
be served by a gofer.

![Figure 1](/assets/images/2023-05-08-rootfs-overlay-gofer-diagram.svg "Gofer process diagram."){:width="100%"}
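
For example, with runsc registered as a Docker runtime, a command along the
lines of `docker run --runtime=runsc -v /host/data:/data my-image` (the path and
image name here are just placeholders) results in both the image rootfs and the
`/data` bind mount being served through a gofer.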

When the container needs to perform a filesystem operation, it makes an RPC to
the gofer, which makes host system calls and services the RPC. This is quite
expensive due to:

1.  RPC cost: This is the cost of communicating with the gofer process,
    including process scheduling, message serialization and
    [IPC](https://en.wikipedia.org/wiki/Inter-process_communication) system
    calls.
    *   To ameliorate this, gVisor recently developed a purpose-built protocol
        called [LISAFS](https://github.com/google/gvisor/tree/master/pkg/lisafs)
        which is much more efficient than its predecessor.
    *   gVisor is also
        [experimenting](https://groups.google.com/g/gvisor-users/c/v-ODHzCrIjE)
        with giving the sandbox direct access to the container filesystem in a
        secure manner. This would essentially nullify RPC costs as it avoids the
        gofer being in the critical path of filesystem operations.
2.  Syscall cost: This is the cost of making the host syscall which actually
    accesses/modifies the container filesystem. Syscalls are expensive because
    they perform context switches into the kernel and back into userspace.
    *   To help with this, gVisor heavily caches the filesystem tree in memory,
        so operations like
        [stat(2)](https://man7.org/linux/man-pages/man2/lstat.2.html) on cached
        files are serviced quickly. But other operations like
        [mkdir(2)](https://man7.org/linux/man-pages/man2/mkdir.2.html) or
        [rename(2)](https://man7.org/linux/man-pages/man2/rename.2.html) still
        need to make host syscalls, as the sketch below illustrates.
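
To get a feel for the difference between these two kinds of operations, here is
a rough timing sketch. It is not gVisor code; the file path, iteration count,
and working directory are arbitrary choices for illustration. Running it once
natively and once inside a container makes the relative costs visible.

    // overlaybench.go: an illustrative sketch, not gVisor code. It compares a
    // metadata operation that gVisor can serve from its in-memory cache (stat)
    // with a mutation that has to reach the underlying filesystem (mkdir).
    package main

    import (
        "fmt"
        "os"
        "time"
    )

    func main() {
        const iters = 5000

        // stat(2) on the same path: after the first call, the result can be
        // served from the cached filesystem tree.
        start := time.Now()
        for i := 0; i < iters; i++ {
            if _, err := os.Stat("/etc/hostname"); err != nil {
                fmt.Println("stat error:", err)
                return
            }
        }
        fmt.Printf("stat:  %v per call\n", time.Since(start)/iters)

        // mkdir(2)/rmdir: mutations cannot be served from the cache. The
        // directories are created in the current working directory, so run
        // this from the filesystem you want to exercise (e.g. the rootfs).
        start = time.Now()
        for i := 0; i < iters; i++ {
            dir := fmt.Sprintf("bench-%d", i)
            if err := os.Mkdir(dir, 0o755); err != nil {
                fmt.Println("mkdir error:", err)
                return
            }
            os.Remove(dir)
        }
        fmt.Printf("mkdir: %v per call\n", time.Since(start)/iters)
    }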

## Container Root Filesystem

In Docker and Kubernetes, the container’s root filesystem (rootfs) is based on
the filesystem packaged with the image. The image’s filesystem is immutable. Any
change a container makes to the rootfs is stored separately and is destroyed
with the container. This way, the image’s filesystem can be shared efficiently
with all containers running the same image. This is different from bind mounts,
which allow containers to access the bound host filesystem tree. Changes to bind
mounts are always propagated to the host and persist after the container exits.

Docker and Kubernetes both use the
[overlay filesystem](https://docs.kernel.org/filesystems/overlayfs.html) by
default to configure container rootfs. Overlayfs mounts are composed of one
upper layer and multiple lower layers. The overlay filesystem presents a merged
view of all these filesystem layers at its mount location and ensures that lower
layers are read-only while all changes are held in the upper layer. The lower
layer(s) constitute the “image layer” and the upper layer is the “container
layer”. When the container is destroyed, the upper layer mount is destroyed as
well, discarding the root filesystem changes the container may have made.
Docker’s
[overlayfs driver documentation](https://docs.docker.com/storage/storagedriver/overlayfs-driver/#how-the-overlay2-driver-works)
has a good explanation.
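
You can see this layering outside of any container runtime with a plain
overlayfs mount, e.g.
`mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged`
(the directories here are placeholders you would create beforehand): files
written under `/merged` land in `/upper`, while `/lower` remains read-only.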

## Rootfs Configuration Before

Let’s consider an example where the image has files `foo` and `baz`. The
container overwrites `foo` and creates a new file `bar`. The diagram below shows
how the root filesystem used to be configured in gVisor: we went through the
gofer and accessed/mutated the overlaid directory on the host. It also shows the
state of the host overlay filesystem.

![Figure 2](/assets/images/2023-05-08-rootfs-overlay-before.svg "Rootfs state before."){:width="100%"}

## Opportunity! Sandbox Internal Overlay

Given that the upper layer is destroyed with the container and that it is
expensive to access/mutate a host filesystem from the sandbox, why keep the
upper layer on the host at all? Instead, we can move the upper layer **into the
sandbox**.

The idea is to overlay the rootfs using a sandbox-internal overlay mount. We can
use a tmpfs upper (container) layer and a read-only lower layer served by the
gofer client. Any changes to the rootfs would be held in tmpfs (in-memory).
Accessing/mutating the upper layer would not require any gofer RPCs or syscalls
to the host. This really speeds up filesystem operations on the upper layer,
which contains newly created or copied-up files and directories.
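
At the time of writing, this behavior is controlled by runsc’s `--overlay2` flag
(for example, `--overlay2=none` disables the sandbox-internal overlay); consult
`runsc help` or the gVisor documentation for the exact values supported by your
version.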

Using the same example as above, the following diagram shows what the rootfs
configuration would look like using a sandbox-internal overlay.

![Figure 3](/assets/images/2023-05-08-rootfs-overlay-memory.svg "Memory-backed rootfs overlay."){:width="100%"}

## Host-Backed Overlay

By default, the tmpfs mount will use the sandbox process’s memory to back all
the file data in the mount. This can cause sandbox memory usage to blow up and
exhaust the container’s memory limits, so it’s important to store all file data
from the tmpfs upper layer on disk. We need a tmpfs-backing “filestore” on the
host filesystem. Using the example from above, this filestore on the host will
store file data for `foo` and `bar`.

This essentially flattens all regular files in tmpfs into one host file. The
sandbox can [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) the
filestore into its address space. This allows it to access and mutate the
filestore very efficiently, without incurring gofer RPC or syscall overheads.
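
As a rough illustration of that mechanism (this is not gVisor code; the file
path and size below are made up), the sketch maps a host file into the process’s
address space and mutates it with plain memory stores, so no per-access
syscalls are needed once the mapping is established:

    // filestore_mmap.go: an illustrative sketch, not gVisor code. It shows how
    // a process can mmap(2) a host-backed file and then read/write its
    // contents with ordinary memory accesses instead of per-operation syscalls.
    package main

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    func main() {
        const size = 1 << 20 // 1 MiB backing file, an arbitrary size

        f, err := os.Create("/tmp/filestore-demo")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        if err := f.Truncate(size); err != nil {
            panic(err)
        }

        // Map the file shared and writable; the kernel writes changes to the
        // mapping back to the file.
        data, err := unix.Mmap(int(f.Fd()), 0, size,
            unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer unix.Munmap(data)

        // Mutate file contents with plain stores: no read(2)/write(2) calls.
        copy(data, []byte("hello from the mapped filestore"))
        fmt.Println(string(data[:31]))
    }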

## Self-Backed Overlay

In Kubernetes, you can set
[local ephemeral storage limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage).
The upper layer of the rootfs overlay (writable container layer) on the host
[contributes towards this limit](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-emphemeralstorage-consumption).
The kubelet enforces this limit by
[traversing](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L57-L58)
the entire
[upper layer](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/snapshots/overlay/overlay.go#L189-L190),
`stat(2)`-ing all files and
[summing up](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L69-L74)
their `stat.st_blocks*block_size`. If we move the upper layer into the sandbox,
then the host upper layer is empty and the kubelet will not be able to enforce
these limits.
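
The sketch below mimics that accounting in spirit (it is not the actual kubelet
or containerd code, and it does not deduplicate hard links): it walks a
directory, `stat(2)`s every entry, and sums `st_blocks * 512`, since
`st_blocks` is reported in 512-byte units on Linux.

    // du_sketch.go: an illustrative re-creation of the disk accounting
    // described above, not the actual containerd/kubelet code. It walks a
    // directory tree and sums st_blocks (512-byte units) for every entry.
    package main

    import (
        "fmt"
        "io/fs"
        "os"
        "path/filepath"
        "syscall"
    )

    func diskUsage(root string) (int64, error) {
        var bytes int64
        err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
            if err != nil {
                return err
            }
            info, err := d.Info()
            if err != nil {
                return err
            }
            if st, ok := info.Sys().(*syscall.Stat_t); ok {
                bytes += st.Blocks * 512 // st_blocks is measured in 512-byte blocks
            }
            return nil
        })
        return bytes, err
    }

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: du_sketch <upper-layer-dir>")
            os.Exit(1)
        }
        n, err := diskUsage(os.Args[1])
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Printf("%s uses %d bytes on disk\n", os.Args[1], n)
    }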

To address this issue, we
[introduced “self-backed” overlays](https://github.com/google/gvisor/commit/a53b22ad5283b00b766178eff847c3193c1293b7),
which create the filestore in the host upper layer. This way, when the kubelet
scans the host upper layer, the filestore will be detected and its
`stat.st_blocks` should be representative of the total file usage in the
sandbox-internal upper layer. It is also important to hide this filestore from
the containerized application to avoid confusing it. We do so by
[creating a whiteout](https://github.com/google/gvisor/commit/09459b203a532c24fbb76cc88484d533356b8b91)
in the sandbox-internal upper layer, which blocks this file from appearing in
the merged directory.

The following diagram shows what the rootfs configuration looks like today in
gVisor.

![Figure 4](/assets/images/2023-05-08-rootfs-overlay-self.svg "Self-backed rootfs overlay."){:width="100%"}

## Performance Gains

Let’s look at some filesystem-intensive workloads to see how rootfs overlay
impacts performance. These benchmarks were run on a gLinux desktop with the
[KVM platform](https://gvisor.dev/docs/architecture_guide/platforms/#kvm).

### Micro Benchmark

[Linux Test Project](https://linux-test-project.github.io/) provides an
[fsstress binary](https://github.com/linux-test-project/ltp/tree/master/testcases/kernel/fs/fsstress).
This program performs a large number of filesystem operations concurrently,
creating and modifying a large filesystem tree of all sorts of files. We ran
this program on the container's root filesystem. The exact usage was:

    sh -c "mkdir /test && time fsstress -d /test -n 500 -p 20 -s 1680153482 -X -l 10"

You can use the `-v` flag (verbose mode) to see what filesystem operations are
being performed.
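
To reproduce something similar with Docker and runsc, an invocation along these
lines should work, assuming an image that ships fsstress (the image name below
is a placeholder):
`docker run --runtime=runsc --rm <image-with-fsstress> sh -c "mkdir /test && time fsstress -d /test -n 500 -p 20 -s 1680153482 -X -l 10"`.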

The results were astounding! Rootfs overlay reduced the time to run this
fsstress program **from 262.79 seconds to 3.18 seconds**! However, note that
such microbenchmarks are not representative of real-world applications and we
should not extrapolate these results to real-world performance.

### Real-world Benchmark

Build jobs are very filesystem-intensive workloads. They read a lot of source
files, compile them, and write out binaries and object files. Let’s consider
building the [abseil-cpp project](https://github.com/abseil/abseil-cpp) with
[bazel](https://bazel.build/). Bazel performs a lot of filesystem operations in
the rootfs, in particular in its cache located at `~/.cache/bazel/`.

This is representative of the real world because many other applications also
use the container root filesystem as scratch space, due to the handy property
that it disappears on container exit. To make this more realistic, the
abseil-cpp repo was attached to the container using a bind mount, which does not
have an overlay.
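
A rough approximation of this setup, assuming runsc is registered with Docker
and an image that has Bazel installed (the image name and paths below are
placeholders), is:
`docker run --runtime=runsc --rm -v "$PWD/abseil-cpp:/abseil-cpp" -w /abseil-cpp <image-with-bazel> bazel build //absl/...`.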

When measuring performance, we care about reducing the sandboxing overhead and
bringing gVisor performance as close as possible to unsandboxed performance.
Sandboxing overhead can be calculated using the formula *overhead = (s-n)/n*,
where `s` is the time taken to run a workload inside the gVisor sandbox and `n`
is the time taken to run the same workload natively (unsandboxed). The following
graph shows that rootfs overlay **halved the sandboxing overhead** for the
abseil build!
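
(As a worked example with made-up numbers: a build that takes `n` = 100 seconds
natively and `s` = 150 seconds in the sandbox has an overhead of
(150 - 100)/100 = 0.5, i.e. 50%.)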

![Figure 5](/assets/images/2023-05-08-rootfs-overlay-benchmark-result.svg "Sandbox Overhead: rootfs overlay vs no overlay."){:width="100%"}

## Conclusion

Rootfs overlay in gVisor substantially improves performance for many
filesystem-intensive workloads, so that developers no longer have to make large
tradeoffs between performance and security. We recently made this optimization
[the default](https://github.com/google/gvisor/commit/38750cdedcce19a3039da10e515f5852565d2c7e)
in runsc. This is part of our ongoing efforts to improve gVisor performance. You
can use gVisor in GKE with GKE Sandbox. Happy sandboxing!