
# How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling

In an
[earlier blog post](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/)
about networking security, we described how and why gVisor implements its own
userspace network stack in the Sentry (gVisor kernel). In summary, we’ve
implemented our networking stack – aka Netstack – in Go to minimize exposure to
unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack,
gVisor can do all packet processing internally and only has to enable a few host
I/O syscalls for near-complete networking capabilities. This keeps gVisor’s
exposure to host vulnerabilities as narrow as possible.

Although writing Netstack in Go was important for runtime safety, up until now
it had an undeniable performance cost. iperf benchmarks showed Netstack was
spending between 20% and 30% of its processing time allocating memory and
pausing for garbage collection, a slowdown that limited gVisor’s ability to
efficiently sandbox networking workloads. In this blog we will show how we
crafted a cure for Netstack’s allocation addiction, reducing allocations by 99%
while also increasing gVisor networking throughput by 30+%.

![Figure 1](/assets/images/2022-10-24-buffer-pooling-figure1.png "Buffer pooling results."){:width="100%"}

## A Waste Management Problem

Go guarantees a basic level of memory safety through the use of a garbage
collector (GC), which is described in great detail by the Go team
[here](https://tip.golang.org/doc/gc-guide). The Go runtime automatically tracks
and frees objects allocated from the heap, relieving the programmer of the often
painful and error-prone process of manual memory management. Unfortunately,
tracking and freeing memory during runtime comes at a performance cost. Running
the GC adds scheduling overhead, consumes valuable CPU time, and occasionally
pauses the entire program’s progress to track down garbage.

Go’s GC is highly optimized, tunable, and sufficient for a majority of
workloads. Most of the other parts of gVisor happily use Go's GC with no
complaints. However, under high network stress, Netstack needed to aggressively
allocate buffers used for processing TCP/IP data and metadata. These buffers
often had short lifespans, and once the processing was done they were left to be
cleaned up by the GC. This meant Netstack was producing tons of garbage that
needed to be tracked and freed by GC workers.

## Recycling to the Rescue

Luckily, we weren't the only ones with this problem. This pattern of small,
frequently allocated and discarded objects was common enough that the Go team
introduced [`sync.Pool`](https://pkg.go.dev/sync#Pool) in Go 1.3. `sync.Pool` is
designed to take pressure off the Go GC by maintaining a thread-safe cache of
previously allocated objects. `sync.Pool` retrieves an object from the cache if
one exists, or allocates a new one according to a user-specified allocation
function. Once the user is finished with an object, they can safely return it to
the cache to be reused.

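In its simplest form, the pattern looks something like the sketch below. This is
a minimal, hypothetical example of `sync.Pool` usage, not Netstack code; the
1500-byte size is chosen here only for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out 1500-byte slices. New runs only when the cache is
// empty; otherwise Get returns an object previously handed back with Put.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 1500)
		return &b // pool a pointer to avoid an extra allocation per Put
	},
}

func main() {
	buf := bufPool.Get().(*[]byte) // reuse a cached slice or allocate one
	n := copy(*buf, "some packet data")
	fmt.Println(string((*buf)[:n]))
	bufPool.Put(buf) // done: hand it back to the cache instead of the GC
}
```

Pooling a `*[]byte` rather than the slice itself is a common idiom: storing a
plain slice in the pool's `interface{}` would allocate on every `Put`.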
While `sync.Pool` was exactly what we needed to reduce allocations,
incorporating it into Netstack wasn’t going to be as easy as just replacing all
our `make()`s with `pool.Get()`s.

## Netstack Challenges

Netstack uses a few different types of buffers under the hood. Some of these are
specific to protocols, like
[`segment`](https://github.com/google/gvisor/blob/master/pkg/tcpip/transport/tcp/segment.go)
for TCP, and others are more widely shared, like
[`PacketBuffer`](https://github.com/google/gvisor/blob/master/pkg/tcpip/stack/packet_buffer.go),
which is used for IP, ICMP, UDP, etc. Although each of these buffer types is
slightly different, they generally share a few common traits that made it
difficult to use `sync.Pool` out of the box:

*   The buffers were originally built with the assumption that a garbage
    collector would clean them up automatically – there was little (if any)
    effort put into tracking object lifetimes. This meant that we had no way to
    know when it was safe to return buffers to a pool.
*   Buffers have dynamic sizes that are determined during creation, usually
    depending on the size of the packet holding them. A `sync.Pool` out of the
    box can only accommodate buffers of a single size. One common solution to
    this is to fill a pool with
    [`bytes.Buffer`](https://pkg.go.dev/bytes#Buffer), but even a pooled
    `bytes.Buffer` could incur allocations if it were too small and had to be
    grown to the requested size.
*   Netstack splits, merges, and clones buffers at various points during
    processing (for example, breaking a large segment into smaller MTU-sized
    packets). Modifying a buffer’s size during runtime could mean lots of
    reallocating from the pool in a one-size-fits-all setup. This would limit
    the theoretical effectiveness of a pooled solution.

We needed an efficient, low-level buffer abstraction that had answers for the
Netstack-specific challenges and could be shared by the various intermediate
buffer types. By sharing a common buffer abstraction, we could maximize the
benefits of pooling and avoid introducing additional allocations while minimally
changing any intermediate buffer processing logic.

## Introducing bufferv2

Our solution was
[bufferv2](https://github.com/google/gvisor/tree/1ceb81454444981448ad57612139adfc0def1b85/pkg/bufferv2).
Bufferv2 is a non-contiguous, reference-counted, pooled, copy-on-write,
buffer-like data structure.

Internally, a bufferv2 `Buffer` is a linked list of `View`s. Each `View` has
start/end indices and holds a pointer to a `Chunk`. A `Chunk` is a
reference-counted structure that’s allocated from a pool and holds data in a
byte slice. There are several `Chunk` pools, each of which allocates chunks with
different-sized byte slices. These sizes start at 64 bytes and double up to 64k.

![Figure 2](/assets/images/2022-10-24-buffer-pooling-figure2.png "bufferv2 implementation diagram."){:width="100%"}

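The layout described above can be sketched roughly as follows. The field names
here are illustrative guesses for the sake of the diagram, not the actual
`pkg/bufferv2` definitions:

```go
package main

import "fmt"

// Chunk is what the pools hand out: a reference-counted byte slice.
type Chunk struct {
	refCount int64  // updated atomically in a real implementation
	data     []byte // 64B, 128B, ... up to 64k, depending on the pool
}

// View is a window [read, write) into a Chunk's data.
type View struct {
	read, write int
	chunk       *Chunk
	next        *View // next View in the Buffer's linked list
}

// Buffer chains Views together, so the logical payload never has to be
// contiguous in memory.
type Buffer struct {
	head, tail *View
	size       int64 // total readable bytes across all Views
}

func main() {
	// Two 64-byte chunks presented as one logical 128-byte buffer.
	v1 := &View{write: 64, chunk: &Chunk{refCount: 1, data: make([]byte, 64)}}
	v2 := &View{write: 64, chunk: &Chunk{refCount: 1, data: make([]byte, 64)}}
	v1.next = v2
	buf := &Buffer{head: v1, tail: v2, size: 128}
	fmt.Println(buf.size) // 128
}
```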
The design of bufferv2 has a few key advantages over simpler object pooling:

*   **Zero-cost copies and copy-on-write**: Cloning a Buffer only increments the
    reference count of the underlying chunks instead of reallocating from the
    pool. Since buffers are much more frequently read than modified, this saves
    allocations. In the cases where a buffer is modified, only the chunk that’s
    changed has to be cloned, not the whole buffer.
*   **Fast buffer transformations**: Truncating and merging buffers or appending
    and prepending Views to Buffers are fast operations. Thanks to the
    non-contiguous memory structure these operations are usually as quick as
    adding a node to a linked list or changing the indices in a View.
*   **Tiered pools**: When growing a Buffer or appending data, the new chunks
    come from different pools of previously allocated chunks. Using multiple
    pools means we are flexible enough to efficiently accommodate packets of all
    sizes with minimal overhead. Unlike a one-size-fits-all solution, we don't
    have to waste lots of space with a chunk size that is too big or loop
    forever allocating small chunks.

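The tiered-pool idea boils down to picking the smallest chunk size that fits the
payload. A sketch of what that selection could look like (our assumption for
illustration, not the actual pool code):

```go
package main

import "fmt"

const (
	minChunkSize = 64
	maxChunkSize = 64 << 10 // 64k
)

// chunkSizeFor returns the smallest power-of-two chunk size, between 64
// bytes and 64k, that can hold a payload of n bytes. Payloads larger
// than 64k would span multiple chunks.
func chunkSizeFor(n int) int {
	size := minChunkSize
	for size < n && size < maxChunkSize {
		size <<= 1
	}
	return size
}

func main() {
	for _, n := range []int{1, 64, 65, 1500, 64 << 10} {
		fmt.Printf("payload %6d -> chunk %6d\n", n, chunkSizeFor(n))
	}
}
```

A 1500-byte MTU-sized packet, for example, would land in the 2048-byte pool
rather than forcing a 64k allocation or a chain of tiny ones.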
## Trade-offs

Shifting Netstack to bufferv2 came with some costs. To start, rewriting all
buffers to use bufferv2 was a sizable effort that took many months to fully roll
out. Any place in Netstack that allocated or used a byte slice needed to be
rewritten. Reference counting had to be introduced so all the aforementioned
intermediate buffer types (`PacketBuffer`, `segment`, etc.) could accurately
track buffer lifetimes, and tests had to be modified to ensure reference
counting correctness.

In addition to the upfront cost, the shift to bufferv2 also increased the
engineering complexity of future Netstack changes. Netstack contributors must
adhere to new rules to maintain memory safety and maximize the benefits of
pooling. These rules are strict – there needs to be strong justification to
break them. They are as follows:

*   Never allocate a byte slice; always use `NewView()` instead.
*   Use a `View` for simple data operations (e.g., writing some data of a fixed
    size) and a `Buffer` for more complex I/O operations (e.g., appending data
    of variable size, merging data, writing from an `io.Reader`).
*   If you need access to the contents of a `View` as a byte slice, use
    `View.AsSlice()`. If you need access to the contents of a `Buffer` as a byte
    slice, consider refactoring, as this will cause an allocation.
*   Never write to or modify the slices returned by `View.AsSlice()`; they are
    still owned by the view.
*   Release bufferv2 objects as close to where they're created as possible. This
    is usually most easily done with `defer`.
*   Document function ownership of bufferv2 object parameters. If there is no
    documentation, it is assumed that the function does not take ownership of
    its parameters.
*   If a function takes ownership of its bufferv2 parameters, the bufferv2
    objects must be cloned before passing them as arguments.
*   All new Netstack tests must enable the leak checker and run a final leak
    check after the test is complete.

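Put together, code following these rules tends to look like the sketch below.
The `NewView`, `AsSlice`, `Clone`, and `Release` definitions here are
hypothetical minimal stand-ins written for illustration; the real signatures
live in `pkg/bufferv2`:

```go
package main

import "fmt"

// View is a hypothetical stand-in for bufferv2's View.
type View struct{ data []byte }

// NewView replaces make([]byte, size), per the first rule.
func NewView(size int) *View { return &View{data: make([]byte, size)} }

// AsSlice exposes the contents; by convention callers must never write
// to the returned slice.
func (v *View) AsSlice() []byte { return v.data }

// Clone is what a caller does before handing off ownership.
func (v *View) Clone() *View { return &View{data: append([]byte(nil), v.data...)} }

// Release would return the underlying chunk to its pool.
func (v *View) Release() {}

// consume documents that it takes ownership of v, so it releases v.
func consume(v *View) {
	defer v.Release()
	fmt.Println(len(v.AsSlice()))
}

func main() {
	v := NewView(1500) // rule: use NewView, never a bare byte slice
	defer v.Release()  // rule: release close to where it was created

	// rule: clone before passing to a function that takes ownership.
	consume(v.Clone())
}
```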
## Give it a Try

Bufferv2 is enabled by default as of
[gVisor 20221017](https://github.com/google/gvisor/releases/tag/release-20221017.0),
and will be rolling out to
[GKE Sandbox](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods)
soon, so no action is required to see a performance boost. Network-bound
workloads, such as web servers or databases like Redis, are the most likely to
see benefits. All the code implementing bufferv2 is public
[here](https://github.com/google/gvisor/tree/master/pkg/bufferv2), and
contributions are welcome! If you’d like to run the iperf benchmark for
yourself, you can run:

```
make run-benchmark BENCHMARKS_TARGETS=//test/benchmarks/network:iperf_test \
  RUNTIME=your-runtime-here BENCHMARKS_OPTIONS=-test.benchtime=60s
```

in the base gVisor directory. If you experience any issues, please feel free to
let us know at [gvisor.dev/issues](https://github.com/google/gvisor/issues).