# Rootfs Overlay

Root filesystem overlay is now the default in runsc. This improves performance
for filesystem-heavy workloads by overlaying the container root filesystem with
a tmpfs filesystem. Learn more about this feature in the following blog post,
which was
[originally posted](https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html)
on the [Google Open Source Blog](https://opensource.googleblog.com/).

--------------------------------------------------------------------------------

## Costly Filesystem Access

gVisor uses a trusted filesystem proxy process (“gofer”) to access the
filesystem on behalf of the sandbox. The sandbox process is considered untrusted
in gVisor’s
[security model](https://gvisor.dev/docs/architecture_guide/security/). As a
result, it is not given direct access to the container filesystem and
[its seccomp filters](https://github.com/google/gvisor/tree/master/runsc/boot/filter)
do not allow filesystem syscalls.

In gVisor, the container rootfs and
[bind mounts](https://docs.docker.com/storage/bind-mounts/#) are configured to
be served by a gofer.

![Figure 1](/assets/images/2023-05-08-rootfs-overlay-gofer-diagram.svg "Gofer process diagram."){:width="100%"}

When the container needs to perform a filesystem operation, it makes an RPC to
the gofer, which makes host system calls and services the RPC. This is quite
expensive due to:

1.  RPC cost: This is the cost of communicating with the gofer process,
    including process scheduling, message serialization and
    [IPC](https://en.wikipedia.org/wiki/Inter-process_communication) system
    calls.
    *   To ameliorate this, gVisor recently developed a purpose-built protocol
        called [LISAFS](https://github.com/google/gvisor/tree/master/pkg/lisafs)
        which is much more efficient than its predecessor.
    *   gVisor is also
        [experimenting](https://groups.google.com/g/gvisor-users/c/v-ODHzCrIjE)
        with giving the sandbox direct access to the container filesystem in a
        secure manner. This would essentially nullify RPC costs, as it removes
        the gofer from the critical path of filesystem operations.
2.  Syscall cost: This is the cost of making the host syscall which actually
    accesses/modifies the container filesystem. Syscalls are expensive because
    they perform context switches into the kernel and back into userspace.
    *   To help with this, gVisor heavily caches the filesystem tree in memory,
        so operations like
        [stat(2)](https://man7.org/linux/man-pages/man2/lstat.2.html) on cached
        files are serviced quickly. But other operations like
        [mkdir(2)](https://man7.org/linux/man-pages/man2/mkdir.2.html) or
        [rename(2)](https://man7.org/linux/man-pages/man2/rename.2.html) still
        need to make host syscalls.

## Container Root Filesystem

In Docker and Kubernetes, the container’s root filesystem (rootfs) is based on
the filesystem packaged with the image. The image’s filesystem is immutable. Any
change a container makes to the rootfs is stored separately and is destroyed
with the container. This way, the image’s filesystem can be shared efficiently
with all containers running the same image. This is different from bind mounts,
which allow containers to access the bound host filesystem tree. Changes to bind
mounts are always propagated to the host and persist after the container exits.

Docker and Kubernetes both use the
[overlay filesystem](https://docs.kernel.org/filesystems/overlayfs.html) by
default to configure the container rootfs. Overlayfs mounts are composed of one
upper layer and multiple lower layers. The overlay filesystem presents a merged
view of all these filesystem layers at its mount location and ensures that the
lower layers are read-only while all changes are held in the upper layer. The
lower layer(s) constitute the “image layer” and the upper layer is the
“container layer”. When the container is destroyed, the upper layer mount is
destroyed as well, discarding the root filesystem changes the container may have
made. Docker’s
[overlayfs driver documentation](https://docs.docker.com/storage/storagedriver/overlayfs-driver/#how-the-overlay2-driver-works)
has a good explanation.
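To make the layering concrete, here is a minimal sketch of how an overlayfs
mount is assembled on Linux. The `/lower`, `/upper`, `/work`, and `/merged`
paths are hypothetical directories that must already exist, and the program
must run as root; this is purely illustrative and is not how Docker,
Kubernetes, or runsc configure the mount internally.

```go
// Minimal sketch: assembling an overlayfs mount on Linux.
// The directories below are hypothetical; run as root.
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// lowerdir: read-only image layer(s); upperdir: writable container layer;
	// workdir: scratch directory required by overlayfs.
	opts := "lowerdir=/lower,upperdir=/upper,workdir=/work"
	if err := unix.Mount("overlay", "/merged", "overlay", 0, opts); err != nil {
		log.Fatalf("overlay mount failed: %v", err)
	}
	fmt.Println("merged view at /merged; all writes are copied up into /upper")
}
```

Writes under `/merged` land in `/upper` via copy-up while `/lower` stays
untouched, so the upper directory holds exactly the changes that are discarded
when the container exits.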
## Rootfs Configuration Before

Let’s consider an example where the image has files `foo` and `baz`. The
container overwrites `foo` and creates a new file `bar`. The diagram below shows
how the root filesystem used to be configured in gVisor earlier. We used to go
through the gofer and access/mutate the overlaid directory on the host. It also
shows the state of the host overlay filesystem.

![Figure 2](/assets/images/2023-05-08-rootfs-overlay-before.svg "Rootfs state before."){:width="100%"}

## Opportunity! Sandbox Internal Overlay

Given that the upper layer is destroyed with the container and that it is
expensive to access/mutate a host filesystem from the sandbox, why keep the
upper layer on the host at all? Instead, we can move the upper layer **into the
sandbox**.

The idea is to overlay the rootfs using a sandbox-internal overlay mount. We can
use a tmpfs upper (container) layer and a read-only lower layer served by the
gofer client. Any changes to the rootfs would be held in tmpfs (in-memory).
Accessing/mutating the upper layer would not require any gofer RPCs or syscalls
to the host. This really speeds up filesystem operations on the upper layer,
which contains newly created or copied-up files and directories.

Using the same example as above, the following diagram shows what the rootfs
configuration would look like using a sandbox-internal overlay.

![Figure 3](/assets/images/2023-05-08-rootfs-overlay-memory.svg "Memory-backed rootfs overlay."){:width="100%"}

## Host-Backed Overlay

By default, the tmpfs mount uses the sandbox process’s memory to back all the
file data in the mount. This can cause sandbox memory usage to blow up and
exhaust the container’s memory limits, so it’s important to store all file data
from the tmpfs upper layer on disk. We need a tmpfs-backing “filestore” on the
host filesystem. Using the example from above, this filestore on the host will
store file data for `foo` and `bar`.

This essentially flattens all regular files in tmpfs into one host file. The
sandbox can [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) the
filestore into its address space. This allows it to access and mutate the
filestore very efficiently, without incurring gofer RPC or syscall overheads.
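The snippet below is a minimal sketch of that idea: a host file is mapped with
`MAP_SHARED`, after which file data can be read and written as ordinary memory
rather than through per-access syscalls. The `/tmp/filestore` path and the size
are hypothetical, and gVisor’s actual filestore management is considerably more
involved.

```go
// Minimal sketch: backing file data with a host "filestore" that is mapped
// into the process's address space. Path and size are hypothetical.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 1 << 20 // 1 MiB filestore, for illustration only
	fd, err := unix.Open("/tmp/filestore", unix.O_RDWR|unix.O_CREAT, 0o600)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)
	if err := unix.Ftruncate(fd, size); err != nil {
		log.Fatal(err)
	}

	// MAP_SHARED ensures stores to the mapping are backed by the file on disk
	// rather than by anonymous (sandbox) memory.
	data, err := unix.Mmap(fd, 0, size, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Munmap(data)

	copy(data, []byte("file contents for foo")) // plain memory write, no syscall
}
```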
## Self-Backed Overlay

In Kubernetes, you can set
[local ephemeral storage limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage).
The upper layer of the rootfs overlay (writeable container layer) on the host
[contributes towards this limit](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-emphemeralstorage-consumption).
The kubelet enforces this limit by
[traversing](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L57-L58)
the entire
[upper layer](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/snapshots/overlay/overlay.go#L189-L190),
`stat(2)`-ing all files and
[summing up](https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L69-L74)
their `stat.st_blocks*block_size`. If we move the upper layer into the sandbox,
then the host upper layer is empty and the kubelet will not be able to enforce
these limits.
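For intuition, a simplified sketch of that kind of accounting is shown below:
walk a directory tree, `stat(2)` each entry, and sum `st_blocks` (512-byte
units). This only illustrates the approach; the containerd/kubelet code linked
above handles more details than this, and the `/path/to/upper` path is
hypothetical.

```go
// Simplified sketch of du-style disk accounting via st_blocks.
// Not the actual containerd code.
package main

import (
	"fmt"
	"io/fs"
	"log"
	"path/filepath"
	"syscall"
)

// diskUsage walks root and sums st_blocks (512-byte units) for every entry.
func diskUsage(root string) (int64, error) {
	var bytes int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if st, ok := info.Sys().(*syscall.Stat_t); ok {
			bytes += st.Blocks * 512 // st_blocks is in 512-byte units
		}
		return nil
	})
	return bytes, err
}

func main() {
	usage, err := diskUsage("/path/to/upper") // hypothetical upper layer path
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("upper layer usage: %d bytes\n", usage)
}
```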
To address this issue, we
[introduced “self-backed” overlays](https://github.com/google/gvisor/commit/a53b22ad5283b00b766178eff847c3193c1293b7),
which create the filestore in the host upper layer. This way, when the kubelet
scans the host upper layer, the filestore will be detected and its
`stat.st_blocks` should be representative of the total file usage in the
sandbox-internal upper layer. It is also important to hide this filestore from
the containerized application to avoid confusing it. We do so by
[creating a whiteout](https://github.com/google/gvisor/commit/09459b203a532c24fbb76cc88484d533356b8b91)
in the sandbox-internal upper layer, which blocks this file from appearing in
the merged directory.

The following diagram shows what the rootfs configuration looks like in gVisor
today.

![Figure 4](/assets/images/2023-05-08-rootfs-overlay-self.svg "Self-backed rootfs overlay."){:width="100%"}

## Performance Gains

Let’s look at some filesystem-intensive workloads to see how rootfs overlay
impacts performance. These benchmarks were run on a gLinux desktop with the
[KVM platform](https://gvisor.dev/docs/architecture_guide/platforms/#kvm).

### Micro Benchmark

The [Linux Test Project](https://linux-test-project.github.io/) provides a
[fsstress binary](https://github.com/linux-test-project/ltp/tree/master/testcases/kernel/fs/fsstress).
This program performs a large number of filesystem operations concurrently,
creating and modifying a large filesystem tree of all sorts of files. We ran
this program on the container's root filesystem. The exact usage was:

`sh -c "mkdir /test && time fsstress -d /test -n 500 -p 20 -s 1680153482 -X -l 10"`

You can use the `-v` flag (verbose mode) to see what filesystem operations are
being performed.

The results were astounding! Rootfs overlay reduced the time to run this
fsstress program **from 262.79 seconds to 3.18 seconds**! However, note that
such microbenchmarks are not representative of real-world applications, and we
should not extrapolate these results to real-world performance.

### Real-world Benchmark

Build jobs are very filesystem-intensive workloads. They read a lot of source
files, compile, and write out binaries and object files. Let’s consider building
the [abseil-cpp project](https://github.com/abseil/abseil-cpp) with
[bazel](https://bazel.build/). Bazel performs a lot of filesystem operations in
the rootfs, specifically in Bazel’s cache located at `~/.cache/bazel/`.

This is representative of the real world because many other applications also
use the container root filesystem as scratch space, due to the handy property
that it disappears on container exit. To make this more realistic, the
abseil-cpp repo was attached to the container using a bind mount, which does not
have an overlay.

When measuring performance, we care about reducing the sandboxing overhead and
bringing gVisor performance as close as possible to unsandboxed performance.
Sandboxing overhead can be calculated using the formula *overhead = (s-n)/n*,
where `s` is the time taken to run a workload inside a gVisor sandbox and `n` is
the time taken to run the same workload natively (unsandboxed). The following
graph shows that rootfs overlay **halved the sandboxing overhead** for the
abseil build!

![Figure 5](/assets/images/2023-05-08-rootfs-overlay-benchmark-result.svg "Sandbox Overhead: rootfs overlay vs no overlay."){:width="100%"}

## Conclusion

Rootfs overlay in gVisor substantially improves performance for many
filesystem-intensive workloads, so that developers no longer have to make large
tradeoffs between performance and security. We recently made this optimization
[the default](https://github.com/google/gvisor/commit/38750cdedcce19a3039da10e515f5852565d2c7e)
in runsc. This is part of our ongoing efforts to improve gVisor performance. You
can use gVisor in GKE with GKE Sandbox. Happy sandboxing!