# Faster filesystem access with Directfs

Directfs is now the default in runsc. This feature gives gVisor’s application
kernel (the Sentry) secure direct access to the container filesystem, avoiding
expensive round trips to the filesystem gofer. Learn more about this feature in
the following blog post, which was
[originally posted](https://opensource.googleblog.com/2023/06/optimizing-gvisor-filesystems-with-directfs.html)
on the [Google Open Source Blog](https://opensource.googleblog.com/).

--------------------------------------------------------------------------------

## Origins of the Gofer

gVisor is used internally at Google to run a variety of services and workloads.
One of the challenges we faced while building gVisor was providing remote
filesystem access securely to the sandbox. gVisor’s strict
[security model](https://gvisor.dev/docs/architecture_guide/security/) and
defense-in-depth approach assume that the sandbox may get compromised because
it shares the same execution context as the untrusted application. Hence, the
sandbox cannot be given sensitive keys and credentials to access Google-internal
remote filesystems.

To address this challenge, we added a trusted filesystem proxy called a "gofer".
The gofer runs outside the sandbox, and provides a secure interface for
untrusted containers to access such remote filesystems. For architectural
simplicity, gofers were used to serve local filesystems as well as remote ones.

![Figure 1](/assets/images/2023-06-27-gofer-proxy.svg "Filesystem gofer proxy"){:width="100%"}

## Isolating the Container Filesystem in runsc

When gVisor was [open sourced](https://github.com/google/gvisor) as
[runsc](https://gvisor.dev/docs/), the same gofer model was copied over to
maintain the same security guarantees. runsc was configured to start one gofer
process per container, which serves the container filesystem to the sandbox over
a predetermined protocol (now
[LISAFS](https://github.com/google/gvisor/blob/master/pkg/lisafs)). However, a gofer
adds a layer of indirection with significant overhead.

This gofer model (built for remote filesystems) brings very few advantages for
the runsc use case, where all the filesystems served by the gofer (like rootfs
and [bind mounts](https://docs.docker.com/storage/bind-mounts/)) are mounted
locally on the host. The gofer directly accesses them using filesystem syscalls.

Linux provides some security primitives to effectively isolate local
filesystems. These include
[mount namespaces](https://man7.org/linux/man-pages/man7/mount_namespaces.7.html),
[`pivot_root`](https://man7.org/linux/man-pages/man2/pivot_root.2.html) and
detached bind mounts[^1]. **Directfs** is a new filesystem access mode that uses
these primitives to expose the container filesystem to the sandbox in a secure
manner. The sandbox’s view of the filesystem tree is limited to just the
container filesystem. The sandbox process is not given access to anything
mounted on the broader host filesystem. Even if the sandbox gets compromised,
these mechanisms provide additional barriers to prevent broader system
compromise.
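
To make the detached bind mount primitive more concrete, here is a minimal Python sketch of one way such a mount might be created and kept reachable only through a file descriptor. It is not gVisor's code (gVisor is written in Go). The helper name `open_detached_bind_mount` is hypothetical, the flag values are copied from the Linux headers, and the calls need `CAP_SYS_ADMIN` in the current mount namespace.

```python
import ctypes
import os
import tempfile

MS_BIND = 4096    # value of MS_BIND in <sys/mount.h> on Linux
MNT_DETACH = 2    # value of MNT_DETACH in <sys/mount.h> on Linux

libc = ctypes.CDLL(None, use_errno=True)
libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                       ctypes.c_ulong, ctypes.c_void_p]
libc.mount.restype = ctypes.c_int
libc.umount2.argtypes = [ctypes.c_char_p, ctypes.c_int]
libc.umount2.restype = ctypes.c_int


def _check(ret):
    """Raise OSError (with errno) if a libc call reported failure."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def open_detached_bind_mount(source):
    """Hypothetical helper: bind-mount `source`, open a directory FD on the
    new mount, then detach the mount from the filesystem tree. The returned
    FD keeps the detached mount alive; no path leads to it any more."""
    target = tempfile.mkdtemp()
    try:
        _check(libc.mount(os.fsencode(source), os.fsencode(target),
                          None, MS_BIND, None))
        try:
            fd = os.open(target, os.O_RDONLY | os.O_DIRECTORY)
        finally:
            # Lazy unmount: removes the mount from the tree immediately.
            _check(libc.umount2(os.fsencode(target), MNT_DETACH))
        return fd
    finally:
        os.rmdir(target)
```

Without the required privileges, the `mount` call fails and the helper raises `PermissionError`. With them, the returned FD can be used as the `dir_fd` argument of FD-relative calls.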

## Directfs

In directfs mode, the gofer still exists as a cooperative process outside the
sandbox. As usual, the gofer enters a new mount namespace, sets up appropriate
bind mounts to create the container filesystem in a new directory and then
[`pivot_root(2)`](https://man7.org/linux/man-pages/man2/pivot_root.2.html)s into
that directory. Similarly, the sandbox process enters new user and mount
namespaces and then
[`pivot_root(2)`](https://man7.org/linux/man-pages/man2/pivot_root.2.html)s into
an empty directory to ensure it cannot access anything via path traversal. But
instead of making RPCs to the gofer to access the container filesystem, the
sandbox asks the gofer to provide file descriptors for all the mount points
via [`SCM_RIGHTS` messages](https://man7.org/linux/man-pages/man7/unix.7.html).
The sandbox then directly makes file-descriptor-relative syscalls (e.g.
[`fstatat(2)`](https://linux.die.net/man/2/fstatat),
[`openat(2)`](https://linux.die.net/man/2/openat),
[`mkdirat(2)`](https://linux.die.net/man/2/mkdirat), etc.) to perform filesystem
operations.
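
The FD donation and the FD-relative calls described above can be sketched in a few lines of Python (3.9 or later, for `socket.send_fds`). This is only an illustration under simplifying assumptions: both "sides" live in one process and share a socket pair, there are no namespaces or `pivot_root`, and the names `donate_dir_fd`, `receive_dir_fd` and `demo` are made up for this example. It does not reflect how the Sentry and gofer are actually implemented.

```python
import os
import socket
import tempfile


def donate_dir_fd(sock, path):
    """Gofer side (illustrative): open `path`, send the FD via SCM_RIGHTS."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        socket.send_fds(sock, [b"mount"], [fd])
    finally:
        os.close(fd)  # the receiver holds its own duplicate of the FD


def receive_dir_fd(sock):
    """Sandbox side (illustrative): receive one directory FD."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


def demo():
    gofer_sock, sandbox_sock = socket.socketpair(socket.AF_UNIX,
                                                 socket.SOCK_STREAM)
    with tempfile.TemporaryDirectory() as root, gofer_sock, sandbox_sock:
        donate_dir_fd(gofer_sock, root)
        dir_fd = receive_dir_fd(sandbox_sock)
        try:
            # Everything below is relative to dir_fd, like mkdirat(2),
            # openat(2) and fstatat(2); no absolute host path is used.
            os.mkdir("data", 0o755, dir_fd=dir_fd)
            fd = os.open("data/hello.txt",
                         os.O_CREAT | os.O_WRONLY | os.O_NOFOLLOW,
                         0o644, dir_fd=dir_fd)
            os.write(fd, b"hi")
            os.close(fd)
            st = os.stat("data/hello.txt", dir_fd=dir_fd,
                         follow_symlinks=False)
            return st.st_size
        finally:
            os.close(dir_fd)
```

Calling `demo()` returns the size of the file written through the donated FD. The receiving side never needs to know where the directory lives on the host.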

![Figure 2](/assets/images/2023-06-27-directfs.svg "Directfs configuration"){:width="100%"}

Earlier, when the gofer performed all filesystem operations, we could deny all
these syscalls in the sandbox process using seccomp. But with directfs enabled,
the sandbox process's seccomp filters need to allow these syscalls.
Most notably, the sandbox can now make
[`openat(2)`](https://linux.die.net/man/2/openat) syscalls (which allow path
traversal), but with certain restrictions:
[`O_NOFOLLOW` is required](https://github.com/google/gvisor/commit/114a033bd038519fa6e867c230dc4ad4e057e675),
[no access to procfs](https://github.com/google/gvisor/commit/fcbc289a7ac14b8d84d0c0b23c4b2a14fc626e79)
and
[no directory FDs from the host](https://github.com/google/gvisor/commit/aa8abdfa9256cf057202ec8f4a81ba9f5d6a203f).
We also had to give the sandbox the same privileges as the gofer (for example
`CAP_DAC_OVERRIDE` and `CAP_DAC_READ_SEARCH`), so it can perform the same
filesystem operations.
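
To see what the `O_NOFOLLOW` requirement buys, the following Python sketch opens a name relative to a directory FD and shows that a symlink planted at that name is rejected with `ELOOP` on Linux rather than followed. The function names are hypothetical and this is not the seccomp filter itself, only the syscall-level behavior the filter relies on. Note that `O_NOFOLLOW` only covers the final path component.

```python
import errno
import os
import tempfile


def open_nofollow(dir_fd, name):
    """openat(2)-style open that refuses to follow a symlink at `name`."""
    return os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)


def symlink_is_rejected():
    """Return True if opening a symlink with O_NOFOLLOW fails with ELOOP."""
    with tempfile.TemporaryDirectory() as root:
        # A symlink that points outside the directory tree.
        os.symlink("/etc/passwd", os.path.join(root, "evil"))
        dir_fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
        try:
            fd = open_nofollow(dir_fd, "evil")
        except OSError as e:
            return e.errno == errno.ELOOP
        finally:
            os.close(dir_fd)
        os.close(fd)
        return False
```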

It is noteworthy that only the trusted gofer provides FDs (of the container
filesystem) to the sandbox. The sandbox cannot walk backwards (using `..`) or
follow a malicious symlink to escape the container filesystem. In effect,
we've decreased our dependence on the syscall filters to catch bad behavior, but
correspondingly increased our dependence on Linux's filesystem isolation
protections.

## Performance

Making RPCs to the gofer for every filesystem operation adds a lot of overhead
to runsc. Hence, avoiding gofer round trips significantly improves performance.
Let's find out what this means for some of our benchmarks. We will run the
benchmarks using our newly released
[systrap platform](https://gvisor.dev/blog/2023/04/28/systrap-release/) on bind
mounts (as opposed to rootfs). This simulates more realistic use cases
because bind mounts are extensively used while configuring filesystems in
containers. Bind mounts also do not have an overlay
([like the rootfs mount](https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html)),
so all operations go through the goferfs / directfs mount.

Let's first look at our
[stat micro-benchmark](https://github.com/google/gvisor/blob/master/test/perf/linux/stat_benchmark.cc),
which repeatedly calls
[`stat(2)`](https://man7.org/linux/man-pages/man2/lstat.2.html) on a file.
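
For readers who want a feel for what such a micro-benchmark measures, here is a rough Python stand-in. It is not the linked C++ benchmark, and interpreter overhead will inflate the absolute numbers, so treat it only as a sketch of the idea: call `stat` on the same file in a tight loop and report the mean time per call. The names `time_stat` and `demo` are made up for this example.

```python
import os
import tempfile
import time


def time_stat(path, iterations=100_000):
    """Return the mean wall-clock time of one os.stat(path) call, in seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(path)
    return (time.perf_counter() - start) / iterations


def demo():
    """Time stat calls on a temporary file (illustrative only)."""
    with tempfile.NamedTemporaryFile() as f:
        return time_stat(f.name, iterations=10_000)
```

Running the same loop inside a sandbox on a bind-mounted file, once with the gofer serving the mount and once with directfs, would give a comparable (if noisier) picture.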

![Figure 3](/assets/images/2023-06-27-stat-benchmark.svg "Stat micro benchmark"){:width="100%"}

The `stat(2)` syscall is more than 2x faster! However, since this is not
representative of real-world applications, we should not extrapolate these
results. So let's look at some
[real-world benchmarks](https://github.com/google/gvisor/blob/master/test/benchmarks/fs).

![Figure 4](/assets/images/2023-06-27-real-world-benchmarks.svg "Real world benchmarks"){:width="100%"}

We see a 12% reduction in the absolute time to run these workloads and a 17%
reduction in Ruby load time!
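
To compare the "2x faster" figure with the percentage reductions, it helps to put both on the same scale. Assuming "2x faster" refers to the ratio of time per operation, a speedup factor $$s$$ and a relative time reduction $$r$$ are related as follows (the symbols are introduced here for illustration):

$$
s = \frac{t_{\text{gofer}}}{t_{\text{directfs}}}, \qquad
r = \frac{t_{\text{gofer}} - t_{\text{directfs}}}{t_{\text{gofer}}} = 1 - \frac{1}{s}
$$

Under that reading, a speedup of more than 2x means a time reduction of more than 50%, while reductions of 12% and 17% correspond to speedups of roughly $$1/0.88 \approx 1.14$$ and $$1/0.83 \approx 1.20$$ respectively.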

## Conclusion

The gofer model in runsc was overly restrictive for accessing host files. We
were able to leverage existing filesystem isolation mechanisms in Linux to
bypass the gofer without compromising security. Directfs significantly improves
performance for certain workloads. This is part of our ongoing efforts to
improve gVisor performance. You can learn more about gVisor at
[gvisor.dev](http://www.gvisor.dev/). You can also use gVisor in
[GKE](https://cloud.google.com/kubernetes-engine) with
[GKE Sandbox](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods).
Happy sandboxing!

--------------------------------------------------------------------------------

[^1]: Detached bind mounts can be created by first creating a bind mount using
    `mount(MS_BIND)` and then detaching it from the filesystem tree using
    `umount2(MNT_DETACH)`.