# Performance Guide

[TOC]

gVisor is designed to provide a secure, virtualized environment while preserving
key benefits of containerization, such as small fixed overheads and a dynamic
resource footprint. For containerized infrastructure, this can provide a
turn-key solution for sandboxing untrusted workloads: there are no changes to
the fundamental resource model.

gVisor imposes runtime costs over native containers. These costs come in two
forms: additional cycles and memory usage, which may manifest as increased
latency, reduced throughput or density, or not at all. In general, these costs
come from two different sources.

First, the existence of the [Sentry](../README.md#sentry) means that additional
memory will be required, and application system calls must traverse additional
layers of software. The design emphasizes
[security](/docs/architecture_guide/security/) and therefore we chose to use a
language for the Sentry that provides benefits in this domain but may not yet
offer the raw performance of other choices. Costs imposed by these design
choices are **structural costs**.

Second, as gVisor is an independent implementation of the system call surface,
many of the subsystems or specific calls are not as optimized as more mature
implementations. A good example here is the network stack, which is continuing
to evolve but does not support all the advanced recovery mechanisms offered by
other stacks and is less CPU efficient. This is an **implementation cost** and
is distinct from **structural costs**. Improvements here are ongoing and driven
by the workloads that matter to gVisor users and contributors.
This page provides a guide for understanding baseline performance, and calls out
distinct **structural costs** and **implementation costs**, highlighting where
improvements are possible and where they are not.

While we include a variety of workloads here, it’s worth emphasizing that gVisor
may not be an appropriate solution for every workload, for reasons other than
performance. For example, a sandbox may provide minimal benefit for a trusted
database, since _user data would already be inside the sandbox_ and there is no
need for an attacker to break out in the first place.

## Methodology

All data below was generated using the [benchmark tools][benchmark-tools]
repository, and the machines under test are uniform [Google Compute Engine][gce]
Virtual Machines (VMs) with the following specifications:

    Machine type: n1-standard-4 (broadwell)
    Image: Debian GNU/Linux 9 (stretch) 4.19.0-0
    BootDisk: 2048GB SSD persistent disk

Throughout this document, `runsc` is used to indicate the runtime provided by
gVisor. When relevant, we use the name `runsc-platform` to describe a specific
[platform choice](/docs/architecture_guide/platforms/).
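
For example, a `runsc-kvm` runtime can be registered with Docker by passing the
`--platform` flag as a runtime argument. A minimal sketch of
`/etc/docker/daemon.json`, assuming `runsc` is installed at
`/usr/local/bin/runsc` (the path is illustrative):

    {
      "runtimes": {
        "runsc-kvm": {
          "path": "/usr/local/bin/runsc",
          "runtimeArgs": ["--platform=kvm"]
        }
      }
    }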

**Except where specified, all tests below are conducted with the `ptrace`
platform. The `ptrace` platform works everywhere and does not require hardware
virtualization or kernel modifications but suffers from the highest structural
costs by far. This platform is used to provide a clear understanding of the
performance model, but in no way represents an ideal scenario. In the future,
this guide will be extended to bare metal environments and include additional
platforms.**

## Memory access

gVisor does not introduce any additional costs with respect to raw memory
accesses. Page faults and other Operating System (OS) mechanisms are translated
through the Sentry, but once mappings are installed and available to the
application, there is no additional overhead.

{% include graph.html id="sysbench-memory"
url="/performance/sysbench-memory.csv" title="perf.py sysbench.memory
--runtime=runc --runtime=runsc" %}

The above figure demonstrates the memory transfer rate as measured by
`sysbench`.
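
To reproduce a similar measurement outside the benchmark harness, `sysbench` can
be run directly under each runtime. A rough sketch, assuming an image with
`sysbench` installed (the image name is illustrative):

    # Compare memory transfer rates under the native and sandboxed runtimes.
    docker run --rm --runtime=runc  my-sysbench-image sysbench memory run
    docker run --rm --runtime=runsc my-sysbench-image sysbench memory run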

## Memory usage

The Sentry provides an additional layer of indirection, and it requires memory
in order to store state associated with the application. This memory generally
consists of a fixed component, plus an amount that varies with the usage of
operating system resources (e.g. how many sockets or files are opened).

For many use cases, fixed memory overheads are a primary concern. This may be
because sandboxed containers handle a low volume of requests, and it is
therefore important to achieve high densities for efficiency.

{% include graph.html id="density" url="/performance/density.csv" title="perf.py
density --runtime=runc --runtime=runsc" log="true" y_min="100000" %}

The above figure demonstrates these costs based on four sample applications.
This test is the result of running many instances of a container (50, or 5 in
the case of redis), calculating the available memory on the host before and
afterwards, and dividing the difference by the number of containers. This
technique is used instead of reading the `usage_in_bytes` value of the container
cgroup because we found that some container runtimes, other than `runc` and
`runsc`, do not use an individual container cgroup.
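
A simplified sketch of this technique, assuming 50 containers and an
illustrative image name:

    # Record available memory, start the containers, then measure again.
    # Column 7 of `free -b` is "available" in recent procps releases.
    before=$(free -b | awk '/Mem:/ {print $7}')
    for i in $(seq 1 50); do
      docker run -d --runtime=runsc my-app-image
    done
    after=$(free -b | awk '/Mem:/ {print $7}')
    echo $(( (before - after) / 50 ))  # approximate bytes per container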

The first application is an instance of `sleep`: a trivial application that does
nothing. The second application is a synthetic `node` application which imports
a number of modules and listens for requests. The third application is a similar
synthetic `ruby` application which does the same. Finally, we include an
instance of `redis` storing approximately 1GB of data. In all cases, the sandbox
itself is responsible for a small, mostly fixed amount of memory overhead.

## CPU performance

gVisor does not perform emulation or otherwise interfere with the raw execution
of CPU instructions by the application. Therefore, there is no runtime cost
imposed for CPU operations.

{% include graph.html id="sysbench-cpu" url="/performance/sysbench-cpu.csv"
title="perf.py sysbench.cpu --runtime=runc --runtime=runsc" %}

The above figure demonstrates the `sysbench` measurement of CPU events per
second. Events per second is based on a CPU-bound loop that calculates all prime
numbers in a specified range. We note that `runsc` does not impose a performance
penalty, as the code is executing natively in both cases.
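
A run of this prime-calculation benchmark under both runtimes might look like
the following, again assuming an illustrative image with `sysbench` installed:

    # Both runs should report a similar events-per-second figure.
    docker run --rm --runtime=runc  my-sysbench-image \
        sysbench cpu --cpu-max-prime=10000 run
    docker run --rm --runtime=runsc my-sysbench-image \
        sysbench cpu --cpu-max-prime=10000 run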

This has important consequences for classes of workloads that are often
CPU-bound, such as data processing or machine learning. In these cases, `runsc`
will similarly impose minimal runtime overhead.

{% include graph.html id="tensorflow" url="/performance/tensorflow.csv"
title="perf.py tensorflow --runtime=runc --runtime=runsc" %}

For example, the above figure shows a sample TensorFlow workload, the
[convolutional neural network example][cnn]. The time indicated includes the
full start-up and run time for the workload, which trains a model.

## System calls

Some **structural costs** of gVisor are heavily influenced by the
[platform choice](/docs/architecture_guide/platforms/), which implements system
call interception. Today, gVisor supports a variety of platforms. These
platforms present distinct performance, compatibility and security trade-offs.
For example, the KVM platform has low overhead system call interception but runs
poorly with nested virtualization.

{% include graph.html id="syscall" url="/performance/syscall.csv" title="perf.py
syscall --runtime=runc --runtime=runsc-ptrace --runtime=runsc-kvm" y_min="100"
log="true" %}

The above figure demonstrates the time required for a raw system call on various
platforms. The test is implemented by a custom binary which performs a large
number of system calls and calculates the average time required.

This cost will principally impact applications that are system call bound, which
tend to be high-performance data stores and static network services. In general,
the impact of system call interception will be lower the more work an
application does.
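
One way to gauge how system call bound an application is, before sandboxing it,
is to profile a native run with `strace` (the application name is illustrative):

    # -c prints a summary of system call counts and times; -f follows forks.
    strace -c -f ./my-app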

{% include graph.html id="redis" url="/performance/redis.csv" title="perf.py
redis --runtime=runc --runtime=runsc" %}

For example, `redis` is an application that performs relatively little work in
userspace: in general it reads from a connected socket, reads or modifies some
data, and writes a result back to the socket. The above figure shows the results
of running a [comprehensive set of benchmarks][redis-benchmark]. We can see that
small operations impose a large overhead, while larger operations, such as
`LRANGE`, where more work is done in the application, have a smaller relative
overhead.
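
This effect can be explored directly with the standard `redis-benchmark` tool
against a sandboxed server. A sketch, assuming the benchmark client runs on the
same host:

    # Start a sandboxed redis and drive it with small and large operations.
    docker run -d --runtime=runsc --name=redis-sandboxed -p 6379:6379 redis
    redis-benchmark -h localhost -p 6379 -n 100000 -t get,set,lrange -q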

Some of the costs above are **structural costs**, and `redis` is likely to
remain a challenging performance scenario. However, optimizing the
[platform](/docs/architecture_guide/platforms/) will also have a dramatic
impact.

## Start-up time

For many use cases, the ability to spin up containers quickly and efficiently is
important. A sandbox may be short-lived and perform minimal user work (e.g. a
function invocation).

{% include graph.html id="startup" url="/performance/startup.csv" title="perf.py
startup --runtime=runc --runtime=runsc" %}

The above figure indicates the total time required to start a container through
[Docker][docker]. This benchmark uses three different applications. First, an
`alpine` Linux container that executes `true`. Second, a `node` application that
loads a number of modules and binds an HTTP server. The time is measured by a
successful request to the bound port. Finally, a `ruby` application that
similarly loads a number of modules and binds an HTTP server.

> Note: most of the time overhead above is associated with Docker itself. This
> is evident with the empty `runc` benchmark. To avoid these costs with `runsc`,
> you may also consider using `runsc do` mode or invoking the
> [OCI runtime](../user_guide/quick_start/oci.md) directly.
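
For a rough comparison on a single machine, the end-to-end Docker start-up cost
can be measured with `time`, and `runsc do` runs a command in a fresh sandbox
without Docker in the path (a sketch; see the quick start guides for setup):

    # Most of the difference between these two runs is Docker overhead.
    time docker run --rm --runtime=runc  alpine true
    time docker run --rm --runtime=runsc alpine true

    # Runs the command in a new sandbox, bypassing Docker entirely.
    sudo runsc do true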

## Network

Networking is mostly bound by **implementation costs**, and gVisor's network
stack is improving quickly.

While raw throughput is typically not an important metric in practice for common
sandbox use cases, `iperf` is nevertheless a common microbenchmark used to
measure it.

{% include graph.html id="iperf" url="/performance/iperf.csv" title="perf.py
iperf --runtime=runc --runtime=runsc" %}

The above figure shows the result of an `iperf` test between two instances. For
the upload case, the specified runtime is used for the `iperf` client, and in
the download case, the specified runtime is the server. A native runtime is
always used for the other endpoint in the test.
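
A minimal version of this setup, assuming an image with `iperf` installed and an
illustrative server address:

    # Server on one VM, using the native runtime.
    docker run -d --runtime=runc -p 5001:5001 my-iperf-image iperf -s

    # Client on another VM, using the runtime under test (the upload case).
    docker run --rm --runtime=runsc my-iperf-image iperf -c 10.0.0.2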

{% include graph.html id="applications" metric="requests_per_second"
url="/performance/applications.csv" title="perf.py http.(node|ruby)
--connections=25 --runtime=runc --runtime=runsc" %}

The above figure shows the result of simple `node` and `ruby` web services that
render a template upon receiving a request. Because these synthetic benchmarks
do minimal work per request, much like the `redis` case, they suffer from high
overheads. In practice, the more work an application does, the smaller the
impact of **structural costs** becomes.

## File system

Some aspects of file system performance are also reflective of **implementation
costs**; this is an area where gVisor's implementation is improving quickly.

In terms of raw disk I/O, gVisor does not introduce significant fundamental
overhead. For general file operations, gVisor introduces a small fixed overhead
for data that transitions across the sandbox boundary. This manifests as
**structural costs** in some cases, since these operations must be routed
through the [Gofer](../README.md#gofer) as a result of our
[Security Model](/docs/architecture_guide/security/), but in most cases are
dominated by **implementation costs**, due to an internal
[Virtual File System][vfs] (VFS) implementation that needs improvement.

{% include graph.html id="fio-bw" url="/performance/fio.csv" title="perf.py fio
--engine=sync --runtime=runc --runtime=runsc" log="true" %}

The above figures demonstrate the results of `fio` for reads and writes to and
from the disk. In this case, the disk quickly becomes the bottleneck and
dominates other costs.
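
A representative `fio` invocation, assuming an image with `fio` installed and a
host directory mounted at `/data` (the names are illustrative):

    # Sequential reads against an on-disk volume.
    docker run --rm --runtime=runsc -v /tmp/data:/data my-fio-image \
        fio --name=read --directory=/data --ioengine=sync \
            --rw=read --bs=1m --size=1g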

{% include graph.html id="fio-tmpfs-bw" url="/performance/fio-tmpfs.csv"
title="perf.py fio --engine=sync --runtime=runc --tmpfs=True --runtime=runsc"
log="true" %}

The above figure shows the raw I/O performance of using a `tmpfs` mount, which
is sandbox-internal in the case of `runsc`. Generally these operations are
similarly bound by the cost of copying data around in memory, and we don't see
the cost of VFS operations.
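
The same test can be pointed at a sandbox-internal `tmpfs` with Docker's
`--tmpfs` flag (again with illustrative names):

    # Sequential reads against a tmpfs mount, internal to the sandbox.
    docker run --rm --runtime=runsc --tmpfs /data my-fio-image \
        fio --name=read --directory=/data --ioengine=sync \
            --rw=read --bs=1m --size=512m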

{% include graph.html id="httpd100k" metric="transfer_rate"
url="/performance/httpd100k.csv" title="perf.py http.httpd --connections=1
--connections=5 --connections=10 --connections=25 --runtime=runc
--runtime=runsc" %}

The high costs of VFS operations can manifest in benchmarks that execute many
such operations in the hot path for serving requests, for example. The above
figure shows the result of using gVisor to serve small pieces of static content
with predictably poor results. This workload represents `apache` serving a
single file sized 100k from the container image to a client running
[ApacheBench][ab] with varying levels of concurrency. The high overhead comes
principally from the VFS implementation that needs improvement, with several
internal serialization points (since all requests are reading the same file).
Note that some of the network stack performance issues also impact this
benchmark.
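
The client side of this benchmark is plain [ApacheBench][ab]; for example, with
25 concurrent connections (the URL is illustrative):

    # 10,000 requests for the same 100k file, 25 at a time.
    ab -n 10000 -c 25 http://10.0.0.2/file-100k.html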

{% include graph.html id="ffmpeg" url="/performance/ffmpeg.csv" title="perf.py
media.ffmpeg --runtime=runc --runtime=runsc" %}

For benchmarks that are bound by raw disk I/O and a mix of compute, file system
operations are less of an issue. The above figure shows the total time required
for an `ffmpeg` container to start, load and transcode a 27MB input video.

[ab]: https://en.wikipedia.org/wiki/ApacheBench
[benchmark-tools]: https://github.com/google/gvisor/tree/master/test/benchmarks
[gce]: https://cloud.google.com/compute/
[cnn]: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py
[docker]: https://docker.io
[redis-benchmark]: https://redis.io/topics/benchmarks
[vfs]: https://en.wikipedia.org/wiki/Virtual_file_system