# Rootfs Overlay

Root filesystem overlay is now the default in runsc. This improves performance
for filesystem-heavy workloads by overlaying the container root filesystem with
a tmpfs filesystem. Learn more about this feature in the following blog post,
which was
[originally posted](https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html)
on the [Google Open Source Blog](https://opensource.googleblog.com/).

--------------------------------------------------------------------------------

## Costly Filesystem Access

gVisor uses a trusted filesystem proxy process (“gofer”) to access the
filesystem on behalf of the sandbox. The sandbox process is considered untrusted
in gVisor’s
[security model](https://gvisor.dev/docs/architecture_guide/security/). As a
result, it is not given direct access to the container filesystem and
[its seccomp filters](https://github.com/google/gvisor/tree/master/runsc/boot/filter)
do not allow filesystem syscalls.

In gVisor, the container rootfs and
[bind mounts](https://docs.docker.com/storage/bind-mounts/#) are configured to
be served by a gofer.

![Figure 1](/assets/images/2023-05-08-rootfs-overlay-gofer-diagram.svg "Gofer process diagram."){:width="100%"}

When the container needs to perform a filesystem operation, it makes an RPC to
the gofer, which makes host system calls and services the RPC. This is quite
expensive due to:

1.  RPC cost: This is the cost of communicating with the gofer process,
    including process scheduling, message serialization and
    [IPC](https://en.wikipedia.org/wiki/Inter-process_communication) system
    calls.
    *   To ameliorate this, gVisor recently developed a purpose-built protocol
        called [LISAFS](https://github.com/google/gvisor/tree/master/pkg/lisafs)
        which is much more efficient than its predecessor.
    *   gVisor is also
        [experimenting](https://groups.google.com/g/gvisor-users/c/v-ODHzCrIjE)
        with giving the sandbox direct access to the container filesystem in a
        secure manner. This would essentially nullify RPC costs, as it removes
        the gofer from the critical path of filesystem operations.
2.  Syscall cost: This is the cost of making the host syscall which actually
    accesses/modifies the container filesystem. Syscalls are expensive because
    they perform context switches into the kernel and back into userspace.
    *   To help with this, gVisor heavily caches the filesystem tree in memory,
        so operations like
        [stat(2)](https://man7.org/linux/man-pages/man2/lstat.2.html) on cached
        files are serviced quickly. But other operations like
        [mkdir(2)](https://man7.org/linux/man-pages/man2/mkdir.2.html) or
        [rename(2)](https://man7.org/linux/man-pages/man2/rename.2.html) still
        need to make host syscalls.

## Container Root Filesystem

In Docker and Kubernetes, the container’s root filesystem (rootfs) is based on
the filesystem packaged with the image. The image’s filesystem is immutable. Any
change a container makes to the rootfs is stored separately and is destroyed
with the container. This way, the image’s filesystem can be shared efficiently
with all containers running the same image. This is different from bind mounts,
which allow containers to access the bound host filesystem tree. Changes to bind
mounts are always propagated to the host and persist after the container exits.

Docker and Kubernetes both use the
[overlay filesystem](https://docs.kernel.org/filesystems/overlayfs.html) by
default to configure the container rootfs. Overlayfs mounts are composed of one
upper layer and multiple lower layers. The overlay filesystem presents a merged
view of all these filesystem layers at its mount location and ensures that the
lower layers are read-only while all changes are held in the upper layer. The
lower layer(s) constitute the “image layer” and the upper layer is the
“container layer”. When the container is destroyed, the upper layer mount is
destroyed as well, discarding the root filesystem changes the container may have
made. Docker’s
[overlayfs driver documentation](https://docs.docker.com/storage/storagedriver/overlayfs-driver/#how-the-overlay2-driver-works)
has a good explanation.
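To make the layering concrete, here is a minimal sketch of how an overlayfs
mount is assembled on Linux. The `/lower`, `/upper`, `/work`, and `/merged`
paths are hypothetical directories that must already exist, and the program
must run as root; this is purely illustrative and is not how Docker,
Kubernetes, or runsc configure the mount internally.

```go
// Minimal sketch: assembling an overlayfs mount on Linux.
// The directories below are hypothetical; run as root.
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// lowerdir: read-only image layer(s); upperdir: writable container layer;
	// workdir: scratch directory required by overlayfs.
	opts := "lowerdir=/lower,upperdir=/upper,workdir=/work"
	if err := unix.Mount("overlay", "/merged", "overlay", 0, opts); err != nil {
		log.Fatalf("overlay mount failed: %v", err)
	}
	fmt.Println("merged view at /merged; all writes are copied up into /upper")
}
```

Writes under `/merged` land in `/upper` via copy-up while `/lower` stays
untouched, so the upper directory holds exactly the changes that are discarded
when the container exits.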
## Rootfs Configuration Before

Let’s consider an example where the image has files `foo` and `baz`. The
container overwrites `foo` and creates a new file `bar`. The diagram below shows
how the root filesystem used to be configured in gVisor earlier. We used to go
through the gofer and access/mutate the overlaid directory on the host. It also
shows the state of the host overlay filesystem.

![Figure 2](/assets/images/2023-05-08-rootfs-overlay-before.svg "Rootfs state before."){:width="100%"}

## Opportunity! Sandbox Internal Overlay

Given that the upper layer is destroyed with the container and that it is
expensive to access/mutate a host filesystem from the sandbox, why keep the
upper layer on the host at all? Instead, we can move the upper layer **into the
sandbox**.

The idea is to overlay the rootfs using a sandbox-internal overlay mount. We can
use a tmpfs upper (container) layer and a read-only lower layer served by the
gofer client. Any changes to the rootfs would be held in tmpfs (in-memory).
Accessing/mutating the upper layer would not require any gofer RPCs or syscalls
to the host. This really speeds up filesystem operations on the upper layer,
which contains newly created or copied-up files and directories.

Using the same example as above, the following diagram shows what the rootfs
configuration would look like using a sandbox-internal overlay.

![Figure 3](/assets/images/2023-05-08-rootfs-overlay-memory.svg "Memory-backed rootfs overlay."){:width="100%"}

## Host-Backed Overlay

By default, the tmpfs mount uses the sandbox process’s memory to back all the
file data in the mount. This can cause sandbox memory usage to blow up and
exhaust the container’s memory limits, so it’s important to store all file data
from the tmpfs upper layer on disk. We need a tmpfs-backing “filestore” on the
host filesystem. Using the example from above, this filestore on the host will
store file data for `foo` and `bar`.

This essentially flattens all regular files in tmpfs into one host file. The
sandbox can [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) the
filestore into its address space. This allows it to access and mutate the
filestore very efficiently, without incurring gofer RPC or syscall overheads.
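The snippet below is a minimal sketch of that idea: a host file is mapped with
`MAP_SHARED`, after which file data can be read and written as ordinary memory
rather than through per-access syscalls. The `/tmp/filestore` path and the size
are hypothetical, and gVisor’s actual filestore management is considerably more
involved.

```go
// Minimal sketch: backing file data with a host "filestore" that is mapped
// into the process's address space. Path and size are hypothetical.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 1 << 20 // 1 MiB filestore, for illustration only
	fd, err := unix.Open("/tmp/filestore", unix.O_RDWR|unix.O_CREAT, 0o600)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)
	if err := unix.Ftruncate(fd, size); err != nil {
		log.Fatal(err)
	}

	// MAP_SHARED ensures stores to the mapping are backed by the file on disk
	// rather than by anonymous (sandbox) memory.
	data, err := unix.Mmap(fd, 0, size, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Munmap(data)

	copy(data, []byte("file contents for foo")) // plain memory write, no syscall
}
```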
## Self-Backed Overlay

In Kubernetes, you can set
[local ephemeral storage limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage).
The upper layer of the rootfs overlay (writeable container layer) on the host
[contributes towards this limit](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-emphemeralstorage-consumption).
The kubelet enforces this limit by
[traversing](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L57-L58)
the entire
[upper layer](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/snapshots/overlay/overlay.go#L189-L190),
`stat(2)`-ing all files and
[summing up](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L69-L74)
their `stat.st_blocks*block_size`. If we move the upper layer into the sandbox,
then the host upper layer is empty and the kubelet will not be able to enforce
these limits.
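For intuition, a simplified sketch of that kind of accounting is shown below:
walk a directory tree, `stat(2)` each entry, and sum `st_blocks` (512-byte
units). This only illustrates the approach; the containerd/kubelet code linked
above handles more details than this, and the `/path/to/upper` path is
hypothetical.

```go
// Simplified sketch of du-style disk accounting via st_blocks.
// Not the actual containerd code.
package main

import (
	"fmt"
	"io/fs"
	"log"
	"path/filepath"
	"syscall"
)

// diskUsage walks root and sums st_blocks (512-byte units) for every entry.
func diskUsage(root string) (int64, error) {
	var bytes int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if st, ok := info.Sys().(*syscall.Stat_t); ok {
			bytes += st.Blocks * 512 // st_blocks is in 512-byte units
		}
		return nil
	})
	return bytes, err
}

func main() {
	usage, err := diskUsage("/path/to/upper") // hypothetical upper layer path
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("upper layer usage: %d bytes\n", usage)
}
```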
To address this issue, we
[introduced “self-backed” overlays](https://github.com/google/gvisor/commit/a53b22ad5283b00b766178eff847c3193c1293b7),
which create the filestore in the host upper layer. This way, when the kubelet
scans the host upper layer, the filestore will be detected and its
`stat.st_blocks` should be representative of the total file usage in the
sandbox-internal upper layer. It is also important to hide this filestore from
the containerized application to avoid confusing it. We do so by
[creating a whiteout](https://github.com/google/gvisor/commit/09459b203a532c24fbb76cc88484d533356b8b91)
in the sandbox-internal upper layer, which blocks this file from appearing in
the merged directory.

The following diagram shows what the rootfs configuration looks like in gVisor
today.

![Figure 4](/assets/images/2023-05-08-rootfs-overlay-self.svg "Self-backed rootfs overlay."){:width="100%"}

## Performance Gains

Let’s look at some filesystem-intensive workloads to see how rootfs overlay
impacts performance. These benchmarks were run on a gLinux desktop with the
[KVM platform](https://gvisor.dev/docs/architecture_guide/platforms/#kvm).

### Micro Benchmark

The [Linux Test Project](https://linux-test-project.github.io/) provides a
[fsstress binary](https://github.com/linux-test-project/ltp/tree/master/testcases/kernel/fs/fsstress).
This program performs a large number of filesystem operations concurrently,
creating and modifying a large filesystem tree of all sorts of files. We ran
this program on the container's root filesystem. The exact usage was:

`sh -c "mkdir /test && time fsstress -d /test -n 500 -p 20 -s 1680153482 -X -l 10"`

You can use the `-v` flag (verbose mode) to see what filesystem operations are
being performed.

The results were astounding! Rootfs overlay reduced the time to run this
fsstress program **from 262.79 seconds to 3.18 seconds**! However, note that
such microbenchmarks are not representative of real-world applications, and we
should not extrapolate these results to real-world performance.

### Real-world Benchmark

Build jobs are very filesystem-intensive workloads. They read a lot of source
files, compile, and write out binaries and object files. Let’s consider building
the [abseil-cpp project](https://github.com/abseil/abseil-cpp) with
[bazel](https://bazel.build/). Bazel performs a lot of filesystem operations in
the rootfs, specifically in Bazel’s cache located at `~/.cache/bazel/`.

This is representative of the real world because many other applications also
use the container root filesystem as scratch space, due to the handy property
that it disappears on container exit. To make this more realistic, the
abseil-cpp repo was attached to the container using a bind mount, which does not
have an overlay.

When measuring performance, we care about reducing the sandboxing overhead and
bringing gVisor performance as close as possible to unsandboxed performance.
Sandboxing overhead can be calculated using the formula *overhead = (s-n)/n*,
where `s` is the time taken to run a workload inside a gVisor sandbox and `n` is
the time taken to run the same workload natively (unsandboxed). The following
graph shows that rootfs overlay **halved the sandboxing overhead** for the
abseil build!

![Figure 5](/assets/images/2023-05-08-rootfs-overlay-benchmark-result.svg "Sandbox Overhead: rootfs overlay vs no overlay."){:width="100%"}

## Conclusion

Rootfs overlay in gVisor substantially improves performance for many
filesystem-intensive workloads, so that developers no longer have to make large
tradeoffs between performance and security. We recently made this optimization
[the default](https://github.com/google/gvisor/commit/38750cdedcce19a3039da10e515f5852565d2c7e)
in runsc. This is part of our ongoing efforts to improve gVisor performance. You
can use gVisor in GKE with GKE Sandbox. Happy sandboxing!