# Performance Guide

[TOC]

gVisor is designed to provide a secure, virtualized environment while preserving
key benefits of containerization, such as small fixed overheads and a dynamic
resource footprint. For containerized infrastructure, this can provide a
turn-key solution for sandboxing untrusted workloads: there are no changes to
the fundamental resource model.

gVisor imposes runtime costs over native containers. These costs come in two
forms: additional cycles and memory usage, which may manifest as increased
latency, reduced throughput or density, or not at all. In general, these costs
come from two different sources.

First, the existence of the [Sentry](../README.md#sentry) means that additional
memory will be required, and application system calls must traverse additional
layers of software. The design emphasizes
[security](/docs/architecture_guide/security/) and therefore we chose to use a
language for the Sentry that provides benefits in this domain but may not yet
offer the raw performance of other choices. Costs imposed by these design
choices are **structural costs**.

Second, as gVisor is an independent implementation of the system call surface,
many of the subsystems or specific calls are not as optimized as more mature
implementations. A good example here is the network stack, which is continuing
to evolve but does not support all the advanced recovery mechanisms offered by
other stacks and is less CPU efficient. This is an **implementation cost** and
is distinct from **structural costs**. Improvements here are ongoing and driven
by the workloads that matter to gVisor users and contributors.

This page provides a guide for understanding baseline performance, and calls out
distinct **structural costs** and **implementation costs**, highlighting where
improvements are possible and where they are not.

While we include a variety of workloads here, it’s worth emphasizing that gVisor
may not be an appropriate solution for every workload, for reasons other than
performance. For example, a sandbox may provide minimal benefit for a trusted
database, since *user data would already be inside the sandbox* and there is no
need for an attacker to break out in the first place.

## Methodology

All data below was generated using the [benchmark tools][benchmark-tools]
repository, and the machines under test are uniform [Google Compute Engine][gce]
Virtual Machines (VMs) with the following specifications:

```
Machine type: n1-standard-4 (broadwell)
Image: Debian GNU/Linux 9 (stretch) 4.19.0-0
BootDisk: 2048GB SSD persistent disk
```

Throughout this document, `runsc` is used to indicate the runtime provided by
gVisor. When relevant, we use the name `runsc-platform` to describe a specific
[platform choice](/docs/architecture_guide/platforms/).

**Except where specified, all tests below are conducted with the `ptrace`
platform. The `ptrace` platform works everywhere and does not require hardware
virtualization or kernel modifications, but suffers from the highest structural
costs by far. This platform is used to provide a clear understanding of the
performance model, but in no way represents an ideal scenario; users should use
Systrap for best performance in most cases. In the future, this guide will be
extended to bare metal environments and include additional platforms.**
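
For reference, one common way to compare platforms is to register a separate
Docker runtime per platform. A sketch of the relevant `/etc/docker/daemon.json`
entries is below; the binary path is illustrative, and Docker must be restarted
for the change to take effect:

```
{
    "runtimes": {
        "runsc-ptrace": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--platform=ptrace"]
        },
        "runsc-kvm": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--platform=kvm"]
        }
    }
}
```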

## Memory access

gVisor does not introduce any additional costs with respect to raw memory
accesses. Page faults and other Operating System (OS) mechanisms are translated
through the Sentry, but once mappings are installed and available to the
application, there is no additional overhead.

{% include graph.html id="sysbench-memory"
url="/performance/sysbench-memory.csv" title="perf.py sysbench.memory
--runtime=runc --runtime=runsc" %}

The above figure demonstrates the memory transfer rate as measured by
`sysbench`.
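
As a rough sketch, a similar measurement can be reproduced with any container
image that has `sysbench` installed (the image name below is a placeholder):

```
# Memory transfer rate; memory accesses are native under both runtimes,
# so comparable results are expected.
docker run --rm --runtime=runc sysbench-image \
    sysbench memory --memory-block-size=1K --memory-total-size=10G run
docker run --rm --runtime=runsc sysbench-image \
    sysbench memory --memory-block-size=1K --memory-total-size=10G run
```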

## Memory usage

The Sentry provides an additional layer of indirection, and it requires memory
in order to store state associated with the application. This memory generally
consists of a fixed component, plus an amount that varies with the usage of
operating system resources (e.g. how many sockets or files are opened).

For many use cases, fixed memory overheads are a primary concern. This may be
because sandboxed containers handle a low volume of requests, and it is
therefore important to achieve high densities for efficiency.

{% include graph.html id="density" url="/performance/density.csv" title="perf.py
density --runtime=runc --runtime=runsc" log="true" y_min="100000" %}

The above figure demonstrates these costs based on four sample applications.
This test is the result of running many instances of a container (50, or 5 in
the case of redis), calculating available memory on the host before and
afterwards, and dividing the difference by the number of containers. This
technique is used instead of the `usage_in_bytes` value of the container cgroup
because we found that some container runtimes, other than `runc` and `runsc`,
do not use an individual container cgroup.
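
A simplified sketch of this technique, using `MemAvailable` from
`/proc/meminfo` (the container count and image mirror the description above;
error handling is omitted):

```
#!/bin/bash
# Available memory (kB) before starting any containers.
before=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)

# Start 50 idle containers under the runtime being measured.
for i in $(seq 1 50); do
  docker run -d --runtime=runsc --name "sleep-$i" alpine sleep 1000000
done

# Available memory afterwards; the difference, divided by the container
# count, approximates the per-container overhead.
after=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
echo "per-container overhead: $(( (before - after) / 50 )) kB"
```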

The first application is an instance of `sleep`: a trivial application that does
nothing. The second application is a synthetic `node` application which imports
a number of modules and listens for requests. The third application is a similar
synthetic `ruby` application which does the same. Finally, we include an
instance of `redis` storing approximately 1GB of data. In all cases, the sandbox
itself is responsible for a small, mostly fixed amount of memory overhead.

## CPU performance

gVisor does not perform emulation or otherwise interfere with the raw execution
of CPU instructions by the application. Therefore, there is no runtime cost
imposed for CPU operations.

{% include graph.html id="sysbench-cpu" url="/performance/sysbench-cpu.csv"
title="perf.py sysbench.cpu --runtime=runc --runtime=runsc" %}

The above figure demonstrates the `sysbench` measurement of CPU events per
second. Events per second is based on a CPU-bound loop that calculates all prime
numbers in a specified range. We note that `runsc` does not impose a performance
penalty, as the code is executing natively in both cases.
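
The equivalent invocation, again assuming an image with `sysbench` installed
(the image name is a placeholder), looks like:

```
# CPU-bound prime calculation; events/second should be nearly identical
# across runtimes because execution is native in both.
docker run --rm --runtime=runsc sysbench-image \
    sysbench cpu --cpu-max-prime=10000 run
```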

This has important consequences for classes of workloads that are often
CPU-bound, such as data processing or machine learning. In these cases, `runsc`
will similarly impose minimal runtime overhead.

{% include graph.html id="tensorflow" url="/performance/tensorflow.csv"
title="perf.py tensorflow --runtime=runc --runtime=runsc" %}

For example, the above figure shows a sample TensorFlow workload, the
[convolutional neural network example][cnn]. The time indicated includes the
full start-up and run time for the workload, which trains a model.

## System calls

Some **structural costs** of gVisor are heavily influenced by the
[platform choice](/docs/architecture_guide/platforms/), which implements system
call interception. Today, gVisor supports a variety of platforms. These
platforms present distinct performance, compatibility and security trade-offs.
For example, the KVM platform has low overhead system call interception but runs
poorly with nested virtualization.

{% include graph.html id="syscall" url="/performance/syscall.csv" title="perf.py
syscall --runtime=runc --runtime=runsc-ptrace --runtime=runsc-kvm" y_min="100"
log="true" %}

The above figure demonstrates the time required for a raw system call on various
platforms. The test is implemented by a custom binary which performs a large
number of system calls and calculates the average time required.
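
The figure is generated with a custom binary, but a rough sense of the same
effect can be had by timing any syscall-heavy command under each runtime. For
example, `dd` with a one-byte block size issues a read and a write system call
per byte copied:

```
# Roughly two million system calls; compare wall-clock time across
# runtimes (an approximation, not the benchmark used for the figure).
docker run --rm --runtime=runc alpine \
    sh -c 'time dd if=/dev/zero of=/dev/null bs=1 count=1000000'
docker run --rm --runtime=runsc-ptrace alpine \
    sh -c 'time dd if=/dev/zero of=/dev/null bs=1 count=1000000'
```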

This cost will principally impact applications that are system call bound, which
tend to be high-performance data stores and static network services. In general,
the impact of system call interception will be lower the more work an
application does.

{% include graph.html id="redis" url="/performance/redis.csv" title="perf.py
redis --runtime=runc --runtime=runsc" %}

For example, `redis` is an application that performs relatively little work in
userspace: in general it reads from a connected socket, reads or modifies some
data, and writes a result back to the socket. The above figure shows the results
of running a [comprehensive set of benchmarks][redis-benchmark]. We can see that
small operations impose a large overhead, while larger operations, such as
`LRANGE`, where more work is done in the application, have a smaller relative
overhead.
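
A similar comparison can be reproduced with the standard `redis-benchmark`
tool against a sandboxed server (the container name and test selection here
are illustrative):

```
# Start a sandboxed redis server and expose its port to the host.
docker run -d --runtime=runsc --name redis-runsc -p 6379:6379 redis

# Drive it with the stock benchmark tool from the native host.
redis-benchmark -h localhost -p 6379 -n 100000 -t get,set,lrange
```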

Some of the costs above are **structural costs**, and `redis` is likely to
remain a challenging performance scenario. However, optimizing the
[platform](/docs/architecture_guide/platforms/) will also have a dramatic
impact.

## Start-up time

For many use cases, the ability to spin up containers quickly and efficiently is
important. A sandbox may be short-lived and perform minimal user work (e.g. a
function invocation).

{% include graph.html id="startup" url="/performance/startup.csv" title="perf.py
startup --runtime=runc --runtime=runsc" %}

The above figure shows the total time required to start a container through
[Docker][docker]. This benchmark uses three different applications. First, an
Alpine Linux container that executes `true`. Second, a `node` application that
loads a number of modules and binds an HTTP server. The time is measured by a
successful request to the bound port. Finally, a `ruby` application that
similarly loads a number of modules and binds an HTTP server.

> Note: most of the time overhead above is associated with Docker itself. This
> is evident with the empty `runc` benchmark. To avoid these costs with `runsc`,
> you may also consider using `runsc do` mode or invoking the
> [OCI runtime](../user_guide/quick_start/oci.md) directly.
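
For example, the following compares a Docker-mediated start against invoking
the runtime directly with `runsc do` (the `--rootless` flag avoids the need
for root privileges; timings will vary):

```
# Container start-up through Docker.
time docker run --rm --runtime=runsc alpine true

# The same trivial workload in a fresh sandbox, without Docker.
time runsc --rootless do true
```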

## Network

Networking is mostly bound by **implementation costs**, and gVisor's network
stack is improving quickly.

While raw throughput is typically not an important metric in practice for
common sandbox use cases, `iperf` is a common microbenchmark used to measure
it.

{% include graph.html id="iperf" url="/performance/iperf.csv" title="perf.py
iperf --runtime=runc --runtime=runsc" %}

The above figure shows the result of an `iperf` test between two instances. For
the upload case, the specified runtime is used for the `iperf` client, and in
the download case, the specified runtime is the server. A native runtime is
always used for the other endpoint in the test.
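
The upload case corresponds to something like the following, assuming an image
with `iperf` installed and a native server on the other endpoint
(`SERVER_ADDR` and the image name are placeholders):

```
# On the native endpoint: run the server.
iperf -s

# On the machine under test: the sandboxed client uploads to the server.
docker run --rm --runtime=runsc iperf-image iperf -c SERVER_ADDR
```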

{% include graph.html id="applications" metric="requests_per_second"
url="/performance/applications.csv" title="perf.py http.(node|ruby)
--connections=25 --runtime=runc --runtime=runsc" %}

The above figure shows the result of simple `node` and `ruby` web services that
render a template upon receiving a request. Because these synthetic benchmarks
do minimal work per request, much like the `redis` case, they suffer from high
overheads. In practice, the more work an application does per request, the
smaller the impact of **structural costs** becomes.

## File system

Some aspects of file system performance are also reflective of **implementation
costs**, and this is an area where gVisor's implementation is improving quickly.

In terms of raw disk I/O, gVisor does not introduce significant fundamental
overhead. For general file operations, gVisor introduces a small fixed overhead
for data that transitions across the sandbox boundary. This manifests as
**structural costs** in some cases, since these operations must be routed
through the [Gofer](../README.md#gofer) as a result of our
[Security Model](/docs/architecture_guide/security/), but in most cases are
dominated by **implementation costs**, due to an internal
[Virtual File System][vfs] (VFS) implementation that needs improvement.

{% include graph.html id="fio-bw" url="/performance/fio.csv" title="perf.py fio
--engine=sync --runtime=runc --runtime=runsc" log="true" %}

The above figure demonstrates the results of `fio` for reads and writes to and
from the disk. In this case, the disk quickly becomes the bottleneck and
dominates other costs.
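
A representative invocation for the sync engine, assuming an image with `fio`
installed (the image name, mount point and parameters are illustrative), is:

```
# Sequential 1m reads against a file on the bind-mounted disk; the
# device itself quickly becomes the bottleneck.
docker run --rm --runtime=runsc -v /tmp/fio:/data fio-image \
    fio --name=read --ioengine=sync --rw=read --bs=1m --size=1g \
        --directory=/data
```

Pointing `--directory` at a `tmpfs` mount instead (for example via `docker run
--tmpfs /data`) reproduces the sandbox-internal case shown next.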

{% include graph.html id="fio-tmpfs-bw" url="/performance/fio-tmpfs.csv"
title="perf.py fio --engine=sync --runtime=runc --tmpfs=True --runtime=runsc"
log="true" %}

The above figure shows the raw I/O performance of using a `tmpfs` mount, which
is sandbox-internal in the case of `runsc`. Generally these operations are
similarly bound by the cost of copying data around in memory, and we don't see
the cost of VFS operations.

{% include graph.html id="httpd100k" metric="transfer_rate"
url="/performance/httpd100k.csv" title="perf.py http.httpd --connections=1
--connections=5 --connections=10 --connections=25 --runtime=runc
--runtime=runsc" %}

The high costs of VFS operations can manifest in benchmarks that execute many
such operations in the hot path for serving requests. The above figure shows the
result of using gVisor to serve small pieces of static content, with predictably
poor results. This workload represents `apache` serving a single 100k file from
the container image to a client running [ApacheBench][ab] with varying levels of
concurrency. The high overhead comes principally from the VFS implementation
that needs improvement, with several internal serialization points (since all
requests are reading the same file). Note that some of the network stack
performance issues also impact this benchmark.
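
The client side of this benchmark is plain [ApacheBench][ab], along the lines
of the following (`CONTAINER_ADDR` and the file name are placeholders):

```
# 10,000 requests for the same 100k file, 25 concurrent connections.
ab -n 10000 -c 25 http://CONTAINER_ADDR/file-100k.txt
```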

{% include graph.html id="ffmpeg" url="/performance/ffmpeg.csv" title="perf.py
media.ffmpeg --runtime=runc --runtime=runsc" %}

For benchmarks that are bound by a mix of raw disk I/O and compute, file system
operations are less of an issue. The above figure shows the total time required
for an `ffmpeg` container to start, load and transcode a 27MB input video.

[ab]: https://en.wikipedia.org/wiki/ApacheBench
[benchmark-tools]: https://github.com/google/gvisor/tree/master/test/benchmarks
[gce]: https://cloud.google.com/compute/
[cnn]: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py
[docker]: https://docker.io
[redis-benchmark]: https://redis.io/topics/benchmarks
[vfs]: https://en.wikipedia.org/wiki/Virtual_file_system