# Performance Guide

[TOC]

gVisor is designed to provide a secure, virtualized environment while preserving
key benefits of containerization, such as small fixed overheads and a dynamic
resource footprint. For containerized infrastructure, this can provide a
turn-key solution for sandboxing untrusted workloads: there are no changes to
the fundamental resource model.

gVisor imposes runtime costs over native containers. These costs come in two
forms: additional cycles and memory usage, which may manifest as increased
latency, reduced throughput or density, or not at all. In general, these costs
come from two different sources.

First, the existence of the [Sentry](../README.md#sentry) means that additional
memory will be required, and application system calls must traverse additional
layers of software. The design emphasizes
[security](/docs/architecture_guide/security/) and therefore we chose to use a
language for the Sentry that provides benefits in this domain but may not yet
offer the raw performance of other choices. Costs imposed by these design
choices are **structural costs**.

Second, as gVisor is an independent implementation of the system call surface,
many of the subsystems or specific calls are not as optimized as more mature
implementations. A good example here is the network stack, which is continuing
to evolve but does not support all the advanced recovery mechanisms offered by
other stacks and is less CPU efficient. This is an **implementation cost** and
is distinct from **structural costs**. Improvements here are ongoing and driven
by the workloads that matter to gVisor users and contributors.

This page provides a guide for understanding baseline performance, and calls
out distinct **structural costs** and **implementation costs**, highlighting
where improvements are possible and where they are not.

While we include a variety of workloads here, it’s worth emphasizing that gVisor
may not be an appropriate solution for every workload, for reasons other than
performance. For example, a sandbox may provide minimal benefit for a trusted
database, since _user data would already be inside the sandbox_ and there is no
need for an attacker to break out in the first place.

## Methodology

All data below was generated using the [benchmark tools][benchmark-tools]
repository, and the machines under test are uniform [Google Compute Engine][gce]
Virtual Machines (VMs) with the following specifications:

    Machine type: n1-standard-4 (broadwell)
    Image: Debian GNU/Linux 9 (stretch) 4.19.0-0
    BootDisk: 2048GB SSD persistent disk

Throughout this document, `runsc` is used to indicate the runtime provided by
gVisor. When relevant, we use the name `runsc-platform` to describe a specific
[platform choice](/docs/architecture_guide/platforms/).

**Except where specified, all tests below are conducted with the `ptrace`
platform. The `ptrace` platform works everywhere and does not require hardware
virtualization or kernel modifications but suffers from the highest structural
costs by far. This platform is used to provide a clear understanding of the
performance model, but in no way represents an ideal scenario. In the future,
this guide will be extended to bare metal environments and include additional
platforms.**
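For readers who want to reproduce a `runsc-platform` configuration locally, the
following is a minimal sketch of registering gVisor runtimes with Docker. It is
not part of the benchmark tools, the runtime names are only illustrative, and
the exact flags accepted by `runsc install` may vary between gVisor releases.

```shell
# Sketch: register gVisor with Docker so that --runtime=runsc (and a
# platform-specific variant such as runsc-kvm) can be selected per container.
sudo runsc install                                        # default runtime "runsc"
sudo runsc install --runtime runsc-kvm -- --platform=kvm  # KVM platform variant
sudo systemctl restart docker

# Quick smoke test with the gVisor runtime.
docker run --rm --runtime=runsc alpine uname -a
```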
## Memory access

gVisor does not introduce any additional costs with respect to raw memory
accesses. Page faults and other Operating System (OS) mechanisms are translated
through the Sentry, but once mappings are installed and available to the
application, there is no additional overhead.

{% include graph.html id="sysbench-memory"
url="/performance/sysbench-memory.csv" title="perf.py sysbench.memory
--runtime=runc --runtime=runsc" %}

The above figure demonstrates the memory transfer rate as measured by
`sysbench`.

## Memory usage

The Sentry provides an additional layer of indirection, and it requires memory
in order to store state associated with the application. This memory generally
consists of a fixed component, plus an amount that varies with the usage of
operating system resources (e.g. how many sockets or files are opened).

For many use cases, fixed memory overheads are a primary concern. This may be
because sandboxed containers handle a low volume of requests, and it is
therefore important to achieve high densities for efficiency.

{% include graph.html id="density" url="/performance/density.csv" title="perf.py
density --runtime=runc --runtime=runsc" log="true" y_min="100000" %}

The above figure demonstrates these costs based on four sample applications.
The test runs many instances of a container (50, or 5 in the case of redis),
measures available memory on the host before and afterwards, and divides the
difference by the number of containers. This technique is used in preference to
the `usage_in_bytes` value of the container cgroup because we found that some
container runtimes, other than `runc` and `runsc`, do not use an individual
container cgroup.

The first application is an instance of `sleep`: a trivial application that
does nothing. The second application is a synthetic `node` application which
imports a number of modules and listens for requests. The third application is
a similar synthetic `ruby` application which does the same. Finally, we include
an instance of `redis` storing approximately 1GB of data. In all cases, the
sandbox itself is responsible for a small, mostly fixed amount of memory
overhead.
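As a rough illustration of the measurement technique described above (and not
the benchmark's actual implementation), the sketch below starts a number of
sandboxed containers and attributes the drop in available host memory evenly to
each one. The container count, image and command are illustrative.

```shell
# Rough sketch of the density measurement (not the benchmark's actual code).
N=50
before=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
for i in $(seq "$N"); do
  docker run -d --runtime=runsc --name "sleep-$i" alpine sleep 1000 >/dev/null
done
after=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
echo "approximate per-container overhead: $(( (before - after) / N )) kB"
```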
## CPU performance

gVisor does not perform emulation or otherwise interfere with the raw execution
of CPU instructions by the application. Therefore, there is no runtime cost
imposed for CPU operations.

{% include graph.html id="sysbench-cpu" url="/performance/sysbench-cpu.csv"
title="perf.py sysbench.cpu --runtime=runc --runtime=runsc" %}

The above figure demonstrates the `sysbench` measurement of CPU events per
second. Events per second is based on a CPU-bound loop that calculates all
prime numbers in a specified range. We note that `runsc` does not impose a
performance penalty, as the code is executing natively in both cases.

This has important consequences for classes of workloads that are often
CPU-bound, such as data processing or machine learning. In these cases, `runsc`
will similarly impose minimal runtime overhead.

{% include graph.html id="tensorflow" url="/performance/tensorflow.csv"
title="perf.py tensorflow --runtime=runc --runtime=runsc" %}

For example, the above figure shows a sample TensorFlow workload, the
[convolutional neural network example][cnn]. The time indicated includes the
full start-up and run time for the workload, which trains a model.

## System calls

Some **structural costs** of gVisor are heavily influenced by the
[platform choice](/docs/architecture_guide/platforms/), which implements system
call interception. Today, gVisor supports a variety of platforms. These
platforms present distinct performance, compatibility and security trade-offs.
For example, the KVM platform has low-overhead system call interception but
runs poorly with nested virtualization.

{% include graph.html id="syscall" url="/performance/syscall.csv" title="perf.py
syscall --runtime=runc --runtime=runsc-ptrace --runtime=runsc-kvm" y_min="100"
log="true" %}

The above figure demonstrates the time required for a raw system call on
various platforms. The test is implemented by a custom binary which performs a
large number of system calls and calculates the average time required.

This cost will principally impact applications that are system call bound,
which tend to be high-performance data stores and static network services. In
general, the impact of system call interception will be lower the more work an
application does.

{% include graph.html id="redis" url="/performance/redis.csv" title="perf.py
redis --runtime=runc --runtime=runsc" %}

For example, `redis` is an application that performs relatively little work in
userspace: in general it reads from a connected socket, reads or modifies some
data, and writes a result back to the socket. The above figure shows the
results of running a [comprehensive set of benchmarks][redis-benchmark]. We can
see that small operations impose a large overhead, while larger operations,
such as `LRANGE`, where more work is done in the application, have a smaller
relative overhead.

Some of these costs are **structural costs**, and `redis` is likely to remain a
challenging performance scenario. However, optimizing the
[platform](/docs/architecture_guide/platforms/) will also have a dramatic
impact.

## Start-up time

For many use cases, the ability to spin up containers quickly and efficiently
is important. A sandbox may be short-lived and perform minimal user work (e.g.
a function invocation).

{% include graph.html id="startup" url="/performance/startup.csv" title="perf.py
startup --runtime=runc --runtime=runsc" %}

The above figure shows the total time required to start a container through
[Docker][docker]. This benchmark uses three different applications. First, an
`alpine` Linux container that executes `true`. Second, a `node` application
that loads a number of modules and binds an HTTP server. The time is measured
by a successful request to the bound port. Finally, a `ruby` application that
similarly loads a number of modules and binds an HTTP server.

> Note: most of the time overhead above is associated with Docker itself. This
> is evident with the empty `runc` benchmark. To avoid these costs with
> `runsc`, you may also consider using `runsc do` mode or invoking the
> [OCI runtime](../user_guide/quick_start/oci.md) directly.
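As a rough illustration of the note above, the following sketch runs a one-off
command directly with `runsc`, avoiding the Docker control plane entirely. The
flags shown are illustrative and may differ between gVisor releases.

```shell
# Sketch: run a command in a sandbox without going through Docker.
sudo runsc --network=none do echo "hello from the sandbox"

# Or, without root, using rootless mode.
runsc --rootless do echo "hello from the sandbox"
```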
## Network

Networking is mostly bound by **implementation costs**, and gVisor's network
stack is improving quickly.

While raw throughput is typically not an important metric in practice for
common sandbox use cases, `iperf` is a common microbenchmark used to measure
it.

{% include graph.html id="iperf" url="/performance/iperf.csv" title="perf.py
iperf --runtime=runc --runtime=runsc" %}

The above figure shows the result of an `iperf` test between two instances. For
the upload case, the specified runtime is used for the `iperf` client, and in
the download case, the specified runtime is the server. A native runtime is
always used for the other endpoint in the test.
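A simplified version of the download case might look like the sketch below: the
`iperf` server runs under the runtime being measured, while a native client on
a second machine drives the test. The image name, port, and address placeholder
are illustrative.

```shell
# Sketch of the download case: the server runs under the runtime being
# measured; a native client elsewhere drives the test.
docker run -d --rm --runtime=runsc -p 5201:5201 --name iperf-server \
  networkstatic/iperf3 -s

# From the client machine (running natively):
iperf3 -c <server-address> -p 5201
```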
{% include graph.html id="applications" metric="requests_per_second"
url="/performance/applications.csv" title="perf.py http.(node|ruby)
--connections=25 --runtime=runc --runtime=runsc" %}

The above figure shows the result of simple `node` and `ruby` web services that
render a template upon receiving a request. Because these synthetic benchmarks
do minimal work per request, much like the `redis` case, they suffer from high
overheads. In practice, the more work an application does per request, the
smaller the impact of **structural costs** becomes.

## File system

Some aspects of file system performance are also reflective of
**implementation costs**, and this is an area where gVisor's implementation is
improving quickly.

In terms of raw disk I/O, gVisor does not introduce significant fundamental
overhead. For general file operations, gVisor introduces a small fixed overhead
for data that transitions across the sandbox boundary. This manifests as
**structural costs** in some cases, since these operations must be routed
through the [Gofer](../README.md#gofer) as a result of our
[Security Model](/docs/architecture_guide/security/), but in most cases it is
dominated by **implementation costs**, due to an internal
[Virtual File System][vfs] (VFS) implementation that needs improvement.

{% include graph.html id="fio-bw" url="/performance/fio.csv" title="perf.py fio
--engine=sync --runtime=runc --runtime=runsc" log="true" %}

The above figure demonstrates the results of `fio` for reads and writes to and
from the disk. In this case, the disk quickly becomes the bottleneck and
dominates other costs.

{% include graph.html id="fio-tmpfs-bw" url="/performance/fio-tmpfs.csv"
title="perf.py fio --engine=sync --runtime=runc --tmpfs=True --runtime=runsc"
log="true" %}

The above figure shows the raw I/O performance of using a `tmpfs` mount, which
is sandbox-internal in the case of `runsc`. Generally these operations are
similarly bound by the cost of copying data in memory, and we don't see the
cost of VFS operations.

{% include graph.html id="httpd100k" metric="transfer_rate"
url="/performance/httpd100k.csv" title="perf.py http.httpd --connections=1
--connections=5 --connections=10 --connections=25 --runtime=runc
--runtime=runsc" %}

The high costs of VFS operations can manifest in benchmarks that execute many
such operations in the hot path for serving requests. For example, the above
figure shows the result of using gVisor to serve small pieces of static
content, with predictably poor results. This workload represents `apache`
serving a single 100k file from the container image to a client running
[ApacheBench][ab] with varying levels of concurrency. The high overhead comes
principally from the VFS implementation that needs improvement, with several
internal serialization points (since all requests are reading the same file).
Note that some of the network stack performance issues also impact this
benchmark.

{% include graph.html id="ffmpeg" url="/performance/ffmpeg.csv" title="perf.py
media.ffmpeg --runtime=runc --runtime=runsc" %}

For benchmarks that are bound by raw disk I/O and a mix of compute, file system
operations are less of an issue. The above figure shows the total time required
for an `ffmpeg` container to start, load and transcode a 27MB input video.

[ab]: https://en.wikipedia.org/wiki/ApacheBench
[benchmark-tools]: https://github.com/google/gvisor/tree/master/test/benchmarks
[gce]: https://cloud.google.com/compute/
[cnn]: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py
[docker]: https://docker.io
[redis-benchmark]: https://redis.io/topics/benchmarks
[vfs]: https://en.wikipedia.org/wiki/Virtual_file_system