# Faster filesystem access with Directfs

Directfs is now the default in runsc. This feature gives gVisor’s application
kernel (the Sentry) secure direct access to the container filesystem, avoiding
expensive round trips to the filesystem gofer. Learn more about this feature in
the following blog post, which was
[originally posted](https://opensource.googleblog.com/2023/06/optimizing-gvisor-filesystems-with-directfs.html)
on [Google Open Source Blog](https://opensource.googleblog.com/).

--------------------------------------------------------------------------------

## Origins of the Gofer

gVisor is used internally at Google to run a variety of services and workloads.
One of the challenges we faced while building gVisor was providing remote
filesystem access securely to the sandbox. gVisor’s strict
[security model](https://gvisor.dev/docs/architecture_guide/security/) and
defense-in-depth approach assume that the sandbox may get compromised because
it shares the same execution context as the untrusted application. Hence the
sandbox cannot be given sensitive keys and credentials to access Google-internal
remote filesystems.

To address this challenge, we added a trusted filesystem proxy called a "gofer".
The gofer runs outside the sandbox, and provides a secure interface for
untrusted containers to access such remote filesystems. For architectural
simplicity, gofers were used to serve local filesystems as well as remote ones.

![Figure 1](/assets/images/2023-06-27-gofer-proxy.svg "Filesystem gofer proxy"){:width="100%"}

## Isolating the Container Filesystem in runsc

When gVisor was [open sourced](https://github.com/google/gvisor) as
[runsc](https://gvisor.dev/docs/), the same gofer model was copied over to
maintain the same security guarantees.
runsc was configured to start one gofer
process per container which serves the container filesystem to the sandbox over
a predetermined protocol (now
[LISAFS](https://github.com/google/gvisor/blob/master/pkg/lisafs)). However, a gofer
adds a layer of indirection with significant overhead.

This gofer model (built for remote filesystems) brings very few advantages for
the runsc use case, where all the filesystems served by the gofer (like rootfs
and [bind mounts](https://docs.docker.com/storage/bind-mounts/)) are mounted
locally on the host. The gofer directly accesses them using filesystem syscalls.

Linux provides some security primitives to effectively isolate local
filesystems. These include
[mount namespaces](https://man7.org/linux/man-pages/man7/mount_namespaces.7.html),
[`pivot_root`](https://man7.org/linux/man-pages/man2/pivot_root.2.html) and
detached bind mounts[^1]. **Directfs** is a new filesystem access mode that uses
these primitives to expose the container filesystem to the sandbox in a secure
manner. The sandbox’s view of the filesystem tree is limited to just the
container filesystem. The sandbox process is not given access to anything
mounted on the broader host filesystem. Even if the sandbox gets compromised,
these mechanisms provide additional barriers to prevent broader system
compromise.

## Directfs

In directfs mode, the gofer still exists as a cooperative process outside the
sandbox. As usual, the gofer enters a new mount namespace, sets up appropriate
bind mounts to create the container filesystem in a new directory and then
[`pivot_root(2)`](https://man7.org/linux/man-pages/man2/pivot_root.2.html)s into
that directory. Similarly, the sandbox process enters new user and mount
namespaces and then
[`pivot_root(2)`](https://man7.org/linux/man-pages/man2/pivot_root.2.html)s into
an empty directory to ensure it cannot access anything via path traversal.
But
instead of making RPCs to the gofer to access the container filesystem, the
sandbox asks the gofer to provide file descriptors for all the mount points
via [`SCM_RIGHTS` messages](https://man7.org/linux/man-pages/man7/unix.7.html).
The sandbox then directly makes file-descriptor-relative syscalls (e.g.
[`fstatat(2)`](https://linux.die.net/man/2/fstatat),
[`openat(2)`](https://linux.die.net/man/2/openat),
[`mkdirat(2)`](https://linux.die.net/man/2/mkdirat), etc.) to perform filesystem
operations.

![Figure 2](/assets/images/2023-06-27-directfs.svg "Directfs configuration"){:width="100%"}

Earlier, when the gofer performed all filesystem operations, we could deny all
these syscalls in the sandbox process using seccomp. But with directfs enabled,
the sandbox process's seccomp filters need to allow the usage of these syscalls.
Most notably, the sandbox can now make
[`openat(2)`](https://linux.die.net/man/2/openat) syscalls (which allow path
traversal), but with certain restrictions:
[`O_NOFOLLOW` is required](https://github.com/google/gvisor/commit/114a033bd038519fa6e867c230dc4ad4e057e675),
[no access to procfs](https://github.com/google/gvisor/commit/fcbc289a7ac14b8d84d0c0b23c4b2a14fc626e79)
and
[no directory FDs from the host](https://github.com/google/gvisor/commit/aa8abdfa9256cf057202ec8f4a81ba9f5d6a203f).
We also had to give the sandbox the same privileges as the gofer (for example
`CAP_DAC_OVERRIDE` and `CAP_DAC_READ_SEARCH`), so it can perform the same
filesystem operations.

Note that only the trusted gofer provides FDs (of the container
filesystem) to the sandbox. The sandbox cannot walk backwards (using '..') or
follow a malicious symlink to escape out of the container filesystem.
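
The FD donation and FD-relative access described above can be illustrated with
a small, single-process Python sketch. It is not how runsc is implemented (the
Sentry is written in Go and the real gofer protocol is more involved). The
function names `donate_dir_fd`, `receive_dir_fd`, `open_nofollow` and `demo`
are hypothetical, a socket pair stands in for the gofer-to-sandbox connection,
and a temporary directory stands in for a container mount point. It assumes
Linux and Python 3.9 or newer (for `socket.send_fds` and `socket.recv_fds`).

```python
import errno
import os
import socket
import tempfile


def donate_dir_fd(sock: socket.socket, path: str) -> None:
    """'Gofer' side (hypothetical): send a directory FD via SCM_RIGHTS."""
    dir_fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        socket.send_fds(sock, [b"mount"], [dir_fd])
    finally:
        os.close(dir_fd)  # the receiver gets its own duplicate of the FD


def receive_dir_fd(sock: socket.socket) -> int:
    """'Sandbox' side (hypothetical): receive the donated directory FD."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return fds[0]


def open_nofollow(dir_fd: int, name: str) -> int:
    """openat(2) relative to dir_fd, refusing a symlink as the final component."""
    return os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)


def demo() -> dict:
    """Run both roles in one process and report what the 'sandbox' observed."""
    results = {}
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "data.txt"), "w") as f:
            f.write("hello")
        # A symlink that points outside the stand-in container filesystem.
        os.symlink("/etc/passwd", os.path.join(root, "evil"))

        gofer_sock, sandbox_sock = socket.socketpair(
            socket.AF_UNIX, socket.SOCK_STREAM
        )
        with gofer_sock, sandbox_sock:
            donate_dir_fd(gofer_sock, root)
            mount_fd = receive_dir_fd(sandbox_sock)
        try:
            # fstatat(2)-style call, relative to the donated FD.
            results["size"] = os.stat("data.txt", dir_fd=mount_fd).st_size
            # openat(2)-style call with O_NOFOLLOW.
            fd = open_nofollow(mount_fd, "data.txt")
            try:
                results["content"] = os.read(fd, 16)
            finally:
                os.close(fd)
            # mkdirat(2)-style call.
            os.mkdir("sub", dir_fd=mount_fd)
            results["mkdir"] = os.path.isdir(os.path.join(root, "sub"))
            # Opening the symlink is rejected instead of being followed.
            try:
                os.close(open_nofollow(mount_fd, "evil"))
                results["symlink_blocked"] = False
            except OSError as e:
                results["symlink_blocked"] = e.errno == errno.ELOOP
        finally:
            os.close(mount_fd)
    return results
```

Calling `demo()` returns a dictionary describing each step. Keep in mind that
`O_NOFOLLOW` only rejects a symlink in the final path component, so on its own
it does not cover symlinks earlier in a multi-component path. The sketch also
omits the namespace, `pivot_root(2)` and seccomp setup described above.
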
In effect,
we've decreased our dependence on the syscall filters to catch bad behavior, but
correspondingly increased our dependence on Linux's filesystem isolation
protections.

## Performance

Making RPCs to the gofer for every filesystem operation adds a lot of overhead
to runsc. Hence, avoiding gofer round trips significantly improves performance.
Let's find out what this means for some of our benchmarks. We will run the
benchmarks using our newly released
[systrap platform](https://gvisor.dev/blog/2023/04/28/systrap-release/) on bind
mounts (as opposed to rootfs). This simulates more realistic use cases
because bind mounts are extensively used while configuring filesystems in
containers. Bind mounts also do not have an overlay
([like the rootfs mount](https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html)),
so all operations go through the goferfs or directfs mount.

Let's first look at our
[stat micro-benchmark](https://github.com/google/gvisor/blob/master/test/perf/linux/stat_benchmark.cc),
which repeatedly calls
[`stat(2)`](https://man7.org/linux/man-pages/man2/lstat.2.html) on a file.

![Figure 3](/assets/images/2023-06-27-stat-benchmark.svg "Stat micro benchmark"){:width="100%"}

The `stat(2)` syscall is more than 2x faster! However, since this is not
representative of real-world applications, we should not extrapolate these
results. So let's look at some
[real-world benchmarks](https://github.com/google/gvisor/blob/master/test/benchmarks/fs).

![Figure 4](/assets/images/2023-06-27-real-world-benchmarks.svg "Real world benchmarks"){:width="100%"}

We see a 12% reduction in the absolute time to run these workloads and a 17%
reduction in Ruby load time!

## Conclusion

The gofer model in runsc was overly restrictive for accessing host files.
We
were able to leverage existing filesystem isolation mechanisms in Linux to
bypass the gofer without compromising security. Directfs significantly improves
performance for certain workloads. This is part of our ongoing efforts to
improve gVisor performance. You can learn more about gVisor at
[gvisor.dev](http://www.gvisor.dev/). You can also use gVisor in
[GKE](https://cloud.google.com/kubernetes-engine) with
[GKE Sandbox](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods).
Happy sandboxing!

--------------------------------------------------------------------------------

[^1]: Detached bind mounts can be created by first creating a bind mount using
      `mount(MS_BIND)` and then detaching it from the filesystem tree using
      `umount(MNT_DETACH)`.