# How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling

In an
[earlier blog post](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/)
about networking security, we described how and why gVisor implements its own
userspace network stack in the Sentry (gVisor kernel). In summary, we’ve
implemented our networking stack – aka Netstack – in Go to minimize exposure to
unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack,
gVisor can do all packet processing internally and only has to enable a few host
I/O syscalls for near-complete networking capabilities. This keeps gVisor’s
exposure to host vulnerabilities as narrow as possible.

Although writing Netstack in Go was important for runtime safety, up until now
it had an undeniable performance cost. iperf benchmarks showed Netstack was
spending 20-30% of its processing time allocating memory and pausing for
garbage collection, a slowdown that limited gVisor’s ability to efficiently
sandbox networking workloads. In this blog we will show how we crafted a cure
for Netstack’s allocation addiction, reducing allocations by 99% while also
increasing gVisor networking throughput by 30+%.

![Figure 1](/assets/images/2022-10-24-buffer-pooling-figure1.png "Buffer pooling results."){:width="100%"}

## A Waste Management Problem

Go guarantees a basic level of memory safety through the use of a garbage
collector (GC), which is described in great detail by the Go team
[here](https://tip.golang.org/doc/gc-guide). The Go runtime automatically tracks
and frees objects allocated from the heap, relieving the programmer of the often
painful and error-prone process of manual memory management. Unfortunately,
tracking and freeing memory at runtime comes at a performance cost. Running
the GC adds scheduling overhead, consumes valuable CPU time, and occasionally
pauses the entire program’s progress to track down garbage.

Go’s GC is highly optimized, tunable, and sufficient for a majority of
workloads. Most of the other parts of gVisor happily use Go's GC with no
complaints. However, under high network stress, Netstack needed to aggressively
allocate buffers used for processing TCP/IP data and metadata. These buffers
often had short lifespans, and once the processing was done they were left to be
cleaned up by the GC. This meant Netstack was producing tons of garbage that
needed to be tracked and freed by GC workers.

## Recycling to the Rescue

Luckily, we weren't the only ones with this problem. This pattern of small,
frequently allocated and discarded objects was common enough that the Go team
introduced [`sync.Pool`](https://pkg.go.dev/sync#Pool) in Go 1.3. `sync.Pool` is
designed to take pressure off the Go GC by maintaining a thread-safe cache of
previously allocated objects. `sync.Pool` can retrieve an object from the cache
if it exists or allocate a new one according to a user-specified allocation
function. Once the user is finished with an object, they can safely return it to
the cache to be reused again.

While `sync.Pool` was exactly what we needed to reduce allocations,
incorporating it into Netstack wasn’t going to be as easy as just replacing all
our `make()`s with `pool.Get()`s.
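To make the pattern concrete, here is a minimal, generic sketch of how
`sync.Pool` is typically used. This is an illustration of the standard library
API rather than code from Netstack; the `scratch` type and `handlePacket`
function are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// scratch is a small fixed-size buffer type. Pooling pointers (rather than
// bare slices) avoids an extra allocation when values are boxed into the
// pool's interface type.
type scratch struct {
	data [1500]byte
}

// New is only called when the pool has nothing cached, so steady-state
// Get/Put cycles reuse memory instead of allocating.
var scratchPool = sync.Pool{
	New: func() any { return new(scratch) },
}

func handlePacket(payload []byte) {
	s := scratchPool.Get().(*scratch) // grab a cached buffer if one exists
	defer scratchPool.Put(s)          // hand it back for the next caller

	n := copy(s.data[:], payload)
	fmt.Printf("processed %d bytes\n", n)
}

func main() {
	handlePacket([]byte("hello"))
}
```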
## Netstack Challenges

Netstack uses a few different types of buffers under the hood. Some of these are
specific to protocols, like
[`segment`](https://github.com/google/gvisor/blob/master/pkg/tcpip/transport/tcp/segment.go)
for TCP, and others are more widely shared, like
[`PacketBuffer`](https://github.com/google/gvisor/blob/master/pkg/tcpip/stack/packet_buffer.go),
which is used for IP, ICMP, UDP, etc. Although each of these buffer types is
slightly different, they generally share a few common traits that made it
difficult to use `sync.Pool` out of the box:

*   The buffers were originally built with the assumption that a garbage
    collector would clean them up automatically – there was little (if any)
    effort put into tracking object lifetimes. This meant that we had no way to
    know when it was safe to return buffers to a pool.
*   Buffers have dynamic sizes that are determined during creation, usually
    depending on the size of the packet holding them. A `sync.Pool` out of the
    box can only accommodate buffers of a single size. One common solution to
    this is to fill a pool with
    [`bytes.Buffer`](https://pkg.go.dev/bytes#Buffer), but even a pooled
    `bytes.Buffer` could incur allocations if it were too small and had to be
    grown to the requested size.
*   Netstack splits, merges, and clones buffers at various points during
    processing (for example, breaking a large segment into smaller MTU-sized
    packets). Modifying a buffer’s size during runtime could mean lots of
    reallocating from the pool in a one-size-fits-all setup. This would limit
    the theoretical effectiveness of a pooled solution.

We needed an efficient, low-level buffer abstraction that had answers for the
Netstack-specific challenges and could be shared by the various intermediate
buffer types. By sharing a common buffer abstraction, we could maximize the
benefits of pooling and avoid introducing additional allocations while minimally
changing any intermediate buffer processing logic.

## Introducing bufferv2

Our solution was
[bufferv2](https://github.com/google/gvisor/tree/1ceb81454444981448ad57612139adfc0def1b85/pkg/bufferv2).
Bufferv2 is a non-contiguous, reference-counted, pooled, copy-on-write,
buffer-like data structure.

Internally, a bufferv2 `Buffer` is a linked list of `View`s. Each `View` has
start/end indices and holds a pointer to a `Chunk`. A `Chunk` is a
reference-counted structure that’s allocated from a pool and holds data in a
byte slice. There are several `Chunk` pools, each of which allocates chunks with
different-sized byte slices. These sizes start at 64 bytes and double up to 64 KB.

![Figure 2](/assets/images/2022-10-24-buffer-pooling-figure2.png "bufferv2 implementation diagram."){:width="100%"}
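To make that layout easier to picture, here is a heavily simplified sketch of
the idea. The types below are illustrative stand-ins rather than the real
bufferv2 implementation, but they show the tiered, pooled chunks and the
view/buffer layering described above.

```go
package main

import (
	"fmt"
	"sync"
)

// chunk is a reference-counted, pooled byte slice. The real bufferv2 Chunk is
// more involved; this stand-in only shows the pooling and refcounting idea.
type chunk struct {
	refs int
	data []byte
}

// Tiered pools: one pool per power-of-two size class from 64 bytes to 64 KB.
var chunkPools = func() []*sync.Pool {
	var pools []*sync.Pool
	for size := 64; size <= 64*1024; size *= 2 {
		size := size
		pools = append(pools, &sync.Pool{
			New: func() any { return &chunk{data: make([]byte, size)} },
		})
	}
	return pools
}()

// newChunk takes a chunk from the smallest pool whose size class fits n bytes.
func newChunk(n int) *chunk {
	for i, size := 0, 64; size <= 64*1024; i, size = i+1, size*2 {
		if n <= size {
			c := chunkPools[i].Get().(*chunk)
			c.refs = 1
			return c
		}
	}
	// Oversized requests fall back to a plain allocation.
	return &chunk{refs: 1, data: make([]byte, n)}
}

// view is a window (start/end indices) into a shared, refcounted chunk.
type view struct {
	read, write int // indices into chunk.data
	chunk       *chunk
	next        *view // views are linked together to form a buffer
}

// buffer is a non-contiguous sequence of views; appending data or another
// buffer is just linked-list surgery, with no copying of chunk contents.
type buffer struct {
	head, tail *view
	size       int
}

func main() {
	c := newChunk(1500)                        // comes from the 2048-byte size class
	v := &view{read: 0, write: 1500, chunk: c} // a view covering the packet
	b := &buffer{head: v, tail: v, size: 1500}
	fmt.Println("buffer holds", b.size, "bytes in a", cap(c.data), "byte chunk")
}
```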
The design of bufferv2 has a few key advantages over simpler object pooling:

*   **Zero-cost copies and copy-on-write**: Cloning a Buffer only increments the
    reference count of the underlying chunks instead of reallocating from the
    pool. Since buffers are much more frequently read than modified, this saves
    allocations. In the cases where a buffer is modified, only the chunk that’s
    changed has to be cloned, not the whole buffer.
*   **Fast buffer transformations**: Truncating and merging buffers or appending
    and prepending Views to Buffers are fast operations. Thanks to the
    non-contiguous memory structure, these operations are usually as quick as
    adding a node to a linked list or changing the indices in a View.
*   **Tiered pools**: When growing a Buffer or appending data, the new chunks
    come from different pools of previously allocated chunks. Using multiple
    pools means we are flexible enough to efficiently accommodate packets of all
    sizes with minimal overhead. Unlike a one-size-fits-all solution, we don't
    have to waste lots of space with a chunk size that is too big or loop
    forever allocating small chunks.

## Trade-offs

Shifting Netstack to bufferv2 came with some costs. To start, rewriting all
buffers to use bufferv2 was a sizable effort that took many months to fully roll
out. Any place in Netstack that allocated or used a byte slice needed to be
rewritten. Reference counting had to be introduced so all the aforementioned
intermediate buffer types (`PacketBuffer`, `segment`, etc.) could accurately
track buffer lifetimes, and tests had to be modified to ensure reference
counting correctness.

In addition to the upfront cost, the shift to bufferv2 also increased the
engineering complexity of future Netstack changes. Netstack contributors must
adhere to new rules to maintain memory safety and maximize the benefits of
pooling. These rules are strict – there needs to be strong justification to
break them. They are as follows (a short sketch of the resulting pattern appears
at the end of this post):

*   Never allocate a byte slice; always use `NewView()` instead.
*   Use a `View` for simple data operations (e.g. writing some data of a fixed
    size) and a `Buffer` for more complex I/O operations (e.g. appending data of
    variable size, merging data, writing from an `io.Reader`).
*   If you need access to the contents of a `View` as a byte slice, use
    `View.AsSlice()`. If you need access to the contents of a `Buffer` as a byte
    slice, consider refactoring, as this will cause an allocation.
*   Never write to or modify the slices returned by `View.AsSlice()`; they are
    still owned by the view.
*   Release bufferv2 objects as close to where they're created as possible. This
    is usually most easily done with `defer`.
*   Document function ownership of bufferv2 object parameters. If there is no
    documentation, it is assumed that the function does not take ownership of
    its parameters.
*   If a function takes ownership of its bufferv2 parameters, the bufferv2
    objects must be cloned before passing them as arguments.
*   All new Netstack tests must enable the leak checker and run a final leak
    check after the test is complete.

## Give it a Try

Bufferv2 is enabled by default as of
[gVisor 20221017](https://github.com/google/gvisor/releases/tag/release-20221017.0),
and will be rolling out to
[GKE Sandbox](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods)
soon, so no action is required to see a performance boost. Network-bound
workloads, such as web servers or databases like Redis, are the most likely to
see benefits. All the code implementing bufferv2 is public
[here](https://github.com/google/gvisor/tree/master/pkg/bufferv2), and
contributions are welcome! If you’d like to run the iperf benchmark for
yourself, you can run:

```
make run-benchmark BENCHMARKS_TARGETS=//test/benchmarks/network:iperf_test \
  RUNTIME=your-runtime-here BENCHMARKS_OPTIONS=-test.benchtime=60s
```

in the base gVisor directory.
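If you are considering contributing, the ownership rules from the Trade-offs
section boil down to a pattern roughly like the sketch below. The types here
are simplified stand-ins for illustration only, not the real bufferv2 API.

```go
package main

import "fmt"

// view is a stand-in for a pooled, reference-counted bufferv2 object: just
// enough structure to demonstrate the ownership rules, not the real API.
type view struct{ data []byte }

// newView stands in for bufferv2's pooled allocation helper (rule: never
// allocate a bare byte slice directly).
func newView(size int) *view { return &view{data: make([]byte, size)} }

// Release stands in for returning the underlying storage to its pool.
func (v *view) Release() { v.data = nil }

// checksum does NOT take ownership of v: its ownership is undocumented, so by
// convention it must not release v or hold a reference to it after returning.
func checksum(v *view) (sum byte) {
	for _, b := range v.data { // read-only access to the view's contents
		sum += b
	}
	return sum
}

func main() {
	v := newView(20)  // allocate through the pool helper...
	defer v.Release() // ...and release close to where the object was created

	copy(v.data, []byte{0x45, 0x00, 0x00, 0x14})
	fmt.Printf("checksum: %#x\n", checksum(v))
}
```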
If you experience any issues, please feel free to let us know at
[gvisor.dev/issues](https://github.com/google/gvisor/issues).