# Performance Guide

[TOC]

gVisor is designed to provide a secure, virtualized environment while preserving
key benefits of containerization, such as small fixed overheads and a dynamic
resource footprint. For containerized infrastructure, this can provide a
turn-key solution for sandboxing untrusted workloads: there are no changes to
the fundamental resource model.

gVisor imposes runtime costs over native containers. These costs come in two
forms: additional cycles and memory usage, which may manifest as increased
latency, reduced throughput or density, or not at all. In general, these costs
come from two different sources.

First, the existence of the [Sentry](../README.md#sentry) means that additional
memory will be required, and application system calls must traverse additional
layers of software. The design emphasizes
[security](/docs/architecture_guide/security/) and therefore we chose to use a
language for the Sentry that provides benefits in this domain but may not yet
offer the raw performance of other choices. Costs imposed by these design
choices are **structural costs**.

Second, as gVisor is an independent implementation of the system call surface,
many of the subsystems or specific calls are not as optimized as more mature
implementations. A good example here is the network stack, which is continuing
to evolve but does not support all the advanced recovery mechanisms offered by
other stacks and is less CPU efficient. This is an **implementation cost** and
is distinct from **structural costs**. Improvements here are ongoing and driven
by the workloads that matter to gVisor users and contributors.

This page provides a guide for understanding baseline performance, and calls out
distinct **structural costs** and **implementation costs**, highlighting where
improvements are possible and where they are not.

While we include a variety of workloads here, it’s worth emphasizing that gVisor
may not be an appropriate solution for every workload, for reasons other than
performance. For example, a sandbox may provide minimal benefit for a trusted
database, since *user data would already be inside the sandbox* and there is no
need for an attacker to break out in the first place.

## Methodology

All data below was generated using the [benchmark tools][benchmark-tools]
repository, and the machines under test are uniform [Google Compute Engine][gce]
Virtual Machines (VMs) with the following specifications:

```
Machine type: n1-standard-4 (broadwell)
Image: Debian GNU/Linux 9 (stretch) 4.19.0-0
BootDisk: 2048GB SSD persistent disk
```

Throughout this document, `runsc` is used to indicate the runtime provided by
gVisor. When relevant, we use the name `runsc-platform` to describe a specific
[platform choice](/docs/architecture_guide/platforms/).

**Except where specified, all tests below are conducted with the `ptrace`
platform. The `ptrace` platform works everywhere and does not require hardware
virtualization or kernel modifications, but suffers from the highest structural
costs by far. This platform is used to provide a clear understanding of the
performance model, but in no way represents an ideal scenario; users should use
Systrap for best performance in most cases. In the future, this guide will be
extended to bare metal environments and include additional platforms.**
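
For reference, one common way to compare platforms is to register a separate
Docker runtime per platform. A sketch of the relevant `/etc/docker/daemon.json`
entries is below; the binary path is illustrative, and Docker must be restarted
for the change to take effect:

```
{
    "runtimes": {
        "runsc-ptrace": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--platform=ptrace"]
        },
        "runsc-kvm": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--platform=kvm"]
        }
    }
}
```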

## Memory access

gVisor does not introduce any additional costs with respect to raw memory
accesses. Page faults and other Operating System (OS) mechanisms are translated
through the Sentry, but once mappings are installed and available to the
application, there is no additional overhead.

{% include graph.html id="sysbench-memory"
url="/performance/sysbench-memory.csv" title="perf.py sysbench.memory
--runtime=runc --runtime=runsc" %}

The above figure demonstrates the memory transfer rate as measured by
`sysbench`.
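
As a rough sketch, a similar measurement can be reproduced with any container
image that has `sysbench` installed (the image name below is a placeholder):

```
# Memory transfer rate; memory accesses are native under both runtimes,
# so comparable results are expected.
docker run --rm --runtime=runc sysbench-image \
    sysbench memory --memory-block-size=1K --memory-total-size=10G run
docker run --rm --runtime=runsc sysbench-image \
    sysbench memory --memory-block-size=1K --memory-total-size=10G run
```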

## Memory usage

The Sentry provides an additional layer of indirection, and it requires memory
in order to store state associated with the application. This memory generally
consists of a fixed component, plus an amount that varies with the usage of
operating system resources (e.g. how many sockets or files are opened).

For many use cases, fixed memory overheads are a primary concern. This may be
because sandboxed containers handle a low volume of requests, and it is
therefore important to achieve high densities for efficiency.

{% include graph.html id="density" url="/performance/density.csv" title="perf.py
density --runtime=runc --runtime=runsc" log="true" y_min="100000" %}

The above figure demonstrates these costs based on four sample applications.
This test is the result of running many instances of a container (50, or 5 in
the case of redis), calculating available memory on the host before and
afterwards, and dividing the difference by the number of containers. This
technique is used instead of the `usage_in_bytes` value of the container cgroup
because we found that some container runtimes, other than `runc` and `runsc`,
do not use an individual container cgroup.
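
A simplified sketch of this technique, using `MemAvailable` from
`/proc/meminfo` (the container count and image mirror the description above;
error handling is omitted):

```
#!/bin/bash
# Available memory (kB) before starting any containers.
before=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)

# Start 50 idle containers under the runtime being measured.
for i in $(seq 1 50); do
  docker run -d --runtime=runsc --name "sleep-$i" alpine sleep 1000000
done

# Available memory afterwards; the difference, divided by the container
# count, approximates the per-container overhead.
after=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
echo "per-container overhead: $(( (before - after) / 50 )) kB"
```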

The first application is an instance of `sleep`: a trivial application that does
nothing. The second application is a synthetic `node` application which imports
a number of modules and listens for requests. The third application is a similar
synthetic `ruby` application which does the same. Finally, we include an
instance of `redis` storing approximately 1GB of data. In all cases, the sandbox
itself is responsible for a small, mostly fixed amount of memory overhead.

## CPU performance

gVisor does not perform emulation or otherwise interfere with the raw execution
of CPU instructions by the application. Therefore, there is no runtime cost
imposed for CPU operations.

{% include graph.html id="sysbench-cpu" url="/performance/sysbench-cpu.csv"
title="perf.py sysbench.cpu --runtime=runc --runtime=runsc" %}

The above figure demonstrates the `sysbench` measurement of CPU events per
second. Events per second is based on a CPU-bound loop that calculates all prime
numbers in a specified range. We note that `runsc` does not impose a performance
penalty, as the code is executing natively in both cases.
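
The equivalent invocation, again assuming an image with `sysbench` installed
(the image name is a placeholder), looks like:

```
# CPU-bound prime calculation; events/second should be nearly identical
# across runtimes because execution is native in both.
docker run --rm --runtime=runsc sysbench-image \
    sysbench cpu --cpu-max-prime=10000 run
```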

This has important consequences for classes of workloads that are often
CPU-bound, such as data processing or machine learning. In these cases, `runsc`
will similarly impose minimal runtime overhead.

{% include graph.html id="tensorflow" url="/performance/tensorflow.csv"
title="perf.py tensorflow --runtime=runc --runtime=runsc" %}

For example, the above figure shows a sample TensorFlow workload, the
[convolutional neural network example][cnn]. The time indicated includes the
full start-up and run time for the workload, which trains a model.

## System calls

Some **structural costs** of gVisor are heavily influenced by the
[platform choice](/docs/architecture_guide/platforms/), which implements system
call interception. Today, gVisor supports a variety of platforms. These
platforms present distinct performance, compatibility and security trade-offs.
For example, the KVM platform has low overhead system call interception but runs
poorly with nested virtualization.

{% include graph.html id="syscall" url="/performance/syscall.csv" title="perf.py
syscall --runtime=runc --runtime=runsc-ptrace --runtime=runsc-kvm" y_min="100"
log="true" %}

The above figure demonstrates the time required for a raw system call on various
platforms. The test is implemented by a custom binary which performs a large
number of system calls and calculates the average time required.
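
The figure is generated with a custom binary, but a rough sense of the same
effect can be had by timing any syscall-heavy command under each runtime. For
example, `dd` with a one-byte block size issues a read and a write system call
per byte copied:

```
# Roughly two million system calls; compare wall-clock time across
# runtimes (an approximation, not the benchmark used for the figure).
docker run --rm --runtime=runc alpine \
    sh -c 'time dd if=/dev/zero of=/dev/null bs=1 count=1000000'
docker run --rm --runtime=runsc-ptrace alpine \
    sh -c 'time dd if=/dev/zero of=/dev/null bs=1 count=1000000'
```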

This cost will principally impact applications that are system call bound, which
tend to be high-performance data stores and static network services. In general,
the impact of system call interception will be lower the more work an
application does.

{% include graph.html id="redis" url="/performance/redis.csv" title="perf.py
redis --runtime=runc --runtime=runsc" %}

For example, `redis` is an application that performs relatively little work in
userspace: in general it reads from a connected socket, reads or modifies some
data, and writes a result back to the socket. The above figure shows the results
of running a [comprehensive set of benchmarks][redis-benchmark]. We can see that
small operations impose a large overhead, while larger operations, such as
`LRANGE`, where more work is done in the application, have a smaller relative
overhead.
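
A similar comparison can be reproduced with the standard `redis-benchmark`
tool against a sandboxed server (the container name and test selection here
are illustrative):

```
# Start a sandboxed redis server and expose its port to the host.
docker run -d --runtime=runsc --name redis-runsc -p 6379:6379 redis

# Drive it with the stock benchmark tool from the native host.
redis-benchmark -h localhost -p 6379 -n 100000 -t get,set,lrange
```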

Some of the costs above are **structural costs**, and `redis` is likely to
remain a challenging performance scenario. However, optimizing the
[platform](/docs/architecture_guide/platforms/) will also have a dramatic
impact.

## Start-up time

For many use cases, the ability to spin up containers quickly and efficiently is
important. A sandbox may be short-lived and perform minimal user work (e.g. a
function invocation).

{% include graph.html id="startup" url="/performance/startup.csv" title="perf.py
startup --runtime=runc --runtime=runsc" %}

The above figure shows the total time required to start a container through
[Docker][docker]. This benchmark uses three different applications. First, an
Alpine Linux container that executes `true`. Second, a `node` application that
loads a number of modules and binds an HTTP server. The time is measured by a
successful request to the bound port. Finally, a `ruby` application that
similarly loads a number of modules and binds an HTTP server.

> Note: most of the time overhead above is associated with Docker itself. This
> is evident with the empty `runc` benchmark. To avoid these costs with `runsc`,
> you may also consider using `runsc do` mode or invoking the
> [OCI runtime](../user_guide/quick_start/oci.md) directly.
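
For example, the following compares a Docker-mediated start against invoking
the runtime directly with `runsc do` (the `--rootless` flag avoids the need
for root privileges; timings will vary):

```
# Container start-up through Docker.
time docker run --rm --runtime=runsc alpine true

# The same trivial workload in a fresh sandbox, without Docker.
time runsc --rootless do true
```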

## Network

Networking is mostly bound by **implementation costs**, and gVisor's network
stack is improving quickly.

While raw throughput is typically not an important metric in practice for
common sandbox use cases, `iperf` is a common microbenchmark used to measure
it.

{% include graph.html id="iperf" url="/performance/iperf.csv" title="perf.py
iperf --runtime=runc --runtime=runsc" %}

The above figure shows the result of an `iperf` test between two instances. For
the upload case, the specified runtime is used for the `iperf` client, and in
the download case, the specified runtime is the server. A native runtime is
always used for the other endpoint in the test.
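
The upload case corresponds to something like the following, assuming an image
with `iperf` installed and a native server on the other endpoint
(`SERVER_ADDR` and the image name are placeholders):

```
# On the native endpoint: run the server.
iperf -s

# On the machine under test: the sandboxed client uploads to the server.
docker run --rm --runtime=runsc iperf-image iperf -c SERVER_ADDR
```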

{% include graph.html id="applications" metric="requests_per_second"
url="/performance/applications.csv" title="perf.py http.(node|ruby)
--connections=25 --runtime=runc --runtime=runsc" %}

The above figure shows the result of simple `node` and `ruby` web services that
render a template upon receiving a request. Because these synthetic benchmarks
do minimal work per request, much like the `redis` case, they suffer from high
overheads. In practice, the more work an application does per request, the
smaller the impact of **structural costs** becomes.

## File system

Some aspects of file system performance are also reflective of **implementation
costs**, and this is an area where gVisor's implementation is improving quickly.

In terms of raw disk I/O, gVisor does not introduce significant fundamental
overhead. For general file operations, gVisor introduces a small fixed overhead
for data that transitions across the sandbox boundary. This manifests as
**structural costs** in some cases, since these operations must be routed
through the [Gofer](../README.md#gofer) as a result of our
[Security Model](/docs/architecture_guide/security/), but in most cases are
dominated by **implementation costs**, due to an internal
[Virtual File System][vfs] (VFS) implementation that needs improvement.

{% include graph.html id="fio-bw" url="/performance/fio.csv" title="perf.py fio
--engine=sync --runtime=runc --runtime=runsc" log="true" %}

The above figure demonstrates the results of `fio` for reads and writes to and
from the disk. In this case, the disk quickly becomes the bottleneck and
dominates other costs.
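
A representative invocation for the sync engine, assuming an image with `fio`
installed (the image name, mount point and parameters are illustrative), is:

```
# Sequential 1m reads against a file on the bind-mounted disk; the
# device itself quickly becomes the bottleneck.
docker run --rm --runtime=runsc -v /tmp/fio:/data fio-image \
    fio --name=read --ioengine=sync --rw=read --bs=1m --size=1g \
        --directory=/data
```

Pointing `--directory` at a `tmpfs` mount instead (for example via `docker run
--tmpfs /data`) reproduces the sandbox-internal case shown next.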

{% include graph.html id="fio-tmpfs-bw" url="/performance/fio-tmpfs.csv"
title="perf.py fio --engine=sync --runtime=runc --tmpfs=True --runtime=runsc"
log="true" %}

The above figure shows the raw I/O performance of using a `tmpfs` mount, which
is sandbox-internal in the case of `runsc`. Generally these operations are
similarly bound by the cost of copying data around in memory, and we don't see
the cost of VFS operations.

{% include graph.html id="httpd100k" metric="transfer_rate"
url="/performance/httpd100k.csv" title="perf.py http.httpd --connections=1
--connections=5 --connections=10 --connections=25 --runtime=runc
--runtime=runsc" %}

The high costs of VFS operations can manifest in benchmarks that execute many
such operations in the hot path for serving requests. The above figure shows the
result of using gVisor to serve small pieces of static content, with predictably
poor results. This workload represents `apache` serving a single 100k file from
the container image to a client running [ApacheBench][ab] with varying levels of
concurrency. The high overhead comes principally from the VFS implementation
that needs improvement, with several internal serialization points (since all
requests are reading the same file). Note that some of the network stack
performance issues also impact this benchmark.
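
The client side of this benchmark is plain [ApacheBench][ab], along the lines
of the following (`CONTAINER_ADDR` and the file name are placeholders):

```
# 10,000 requests for the same 100k file, 25 concurrent connections.
ab -n 10000 -c 25 http://CONTAINER_ADDR/file-100k.txt
```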

{% include graph.html id="ffmpeg" url="/performance/ffmpeg.csv" title="perf.py
media.ffmpeg --runtime=runc --runtime=runsc" %}

For benchmarks that are bound by a mix of raw disk I/O and compute, file system
operations are less of an issue. The above figure shows the total time required
for an `ffmpeg` container to start, load and transcode a 27MB input video.

[ab]: https://en.wikipedia.org/wiki/ApacheBench
[benchmark-tools]: https://github.com/google/gvisor/tree/master/test/benchmarks
[gce]: https://cloud.google.com/compute/
[cnn]: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py
[docker]: https://docker.io
[redis-benchmark]: https://redis.io/topics/benchmarks
[vfs]: https://en.wikipedia.org/wiki/Virtual_file_system