# How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling

In an
[earlier blog post](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/)
about networking security, we described how and why gVisor implements its own
userspace network stack in the Sentry (gVisor kernel). In summary, we’ve
implemented our networking stack – aka Netstack – in Go to minimize exposure to
unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack,
gVisor can do all packet processing internally and only has to enable a few host
I/O syscalls for near-complete networking capabilities. This keeps gVisor’s
exposure to host vulnerabilities as narrow as possible.

Although writing Netstack in Go was important for runtime safety, up until now
it had an undeniable performance cost. iperf benchmarks showed Netstack was
spending 20-30% of its processing time allocating memory and pausing for
garbage collection, a slowdown that limited gVisor’s ability to efficiently
sandbox networking workloads. In this blog we will show how we crafted a cure
for Netstack’s allocation addiction, reducing allocations by 99% while also
increasing gVisor networking throughput by 30+%.

![Figure 1](/assets/images/2022-10-24-buffer-pooling-figure1.png "Buffer pooling results."){:width="100%"}

## A Waste Management Problem

Go guarantees a basic level of memory safety through the use of a garbage
collector (GC), which is described in great detail by the Go team
[here](https://tip.golang.org/doc/gc-guide). The Go runtime automatically tracks
and frees objects allocated from the heap, relieving the programmer of the often
painful and error-prone process of manual memory management. Unfortunately,
tracking and freeing memory at runtime comes at a performance cost. Running
the GC adds scheduling overhead, consumes valuable CPU time, and occasionally
pauses the entire program’s progress to track down garbage.

Go’s GC is highly optimized, tunable, and sufficient for a majority of
workloads. Most of the other parts of gVisor happily use Go's GC with no
complaints. However, under high network stress, Netstack needed to aggressively
allocate buffers used for processing TCP/IP data and metadata. These buffers
often had short lifespans, and once the processing was done they were left to be
cleaned up by the GC. This meant Netstack was producing tons of garbage that
needed to be tracked and freed by GC workers.

## Recycling to the Rescue

Luckily, we weren't the only ones with this problem. This pattern of small,
frequently allocated and discarded objects was common enough that the Go team
introduced [`sync.Pool`](https://pkg.go.dev/sync#Pool) in Go 1.3. `sync.Pool` is
designed to take pressure off the Go GC by maintaining a thread-safe cache of
previously allocated objects. `sync.Pool` can retrieve an object from the cache
if it exists or allocate a new one according to a user-specified allocation
function. Once the user is finished with an object, they can safely return it to
the cache to be reused again.

While `sync.Pool` was exactly what we needed to reduce allocations,
incorporating it into Netstack wasn’t going to be as easy as just replacing all
our `make()`s with `pool.Get()`s.
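To make the pattern concrete, here is a minimal, generic sketch of how
`sync.Pool` is typically used. This is an illustration of the standard library
API rather than code from Netstack; the `scratch` type and `handlePacket`
function are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// scratch is a small fixed-size buffer type. Pooling pointers (rather than
// bare slices) avoids an extra allocation when values are boxed into the
// pool's interface type.
type scratch struct {
	data [1500]byte
}

// New is only called when the pool has nothing cached, so steady-state
// Get/Put cycles reuse memory instead of allocating.
var scratchPool = sync.Pool{
	New: func() any { return new(scratch) },
}

func handlePacket(payload []byte) {
	s := scratchPool.Get().(*scratch) // grab a cached buffer if one exists
	defer scratchPool.Put(s)          // hand it back for the next caller

	n := copy(s.data[:], payload)
	fmt.Printf("processed %d bytes\n", n)
}

func main() {
	handlePacket([]byte("hello"))
}
```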
## Netstack Challenges

Netstack uses a few different types of buffers under the hood. Some of these are
specific to protocols, like
[`segment`](https://github.com/google/gvisor/blob/master/pkg/tcpip/transport/tcp/segment.go)
for TCP, and others are more widely shared, like
[`PacketBuffer`](https://github.com/google/gvisor/blob/master/pkg/tcpip/stack/packet_buffer.go),
which is used for IP, ICMP, UDP, etc. Although each of these buffer types is
slightly different, they generally share a few common traits that made it
difficult to use `sync.Pool` out of the box:

*   The buffers were originally built with the assumption that a garbage
    collector would clean them up automatically – there was little (if any)
    effort put into tracking object lifetimes. This meant that we had no way to
    know when it was safe to return buffers to a pool.
*   Buffers have dynamic sizes that are determined during creation, usually
    depending on the size of the packet holding them. A `sync.Pool` out of the
    box can only accommodate buffers of a single size. One common solution to
    this is to fill a pool with
    [`bytes.Buffer`](https://pkg.go.dev/bytes#Buffer), but even a pooled
    `bytes.Buffer` could incur allocations if it were too small and had to be
    grown to the requested size.
*   Netstack splits, merges, and clones buffers at various points during
    processing (for example, breaking a large segment into smaller MTU-sized
    packets). Modifying a buffer’s size during runtime could mean lots of
    reallocating from the pool in a one-size-fits-all setup. This would limit
    the theoretical effectiveness of a pooled solution.

We needed an efficient, low-level buffer abstraction that had answers for the
Netstack-specific challenges and could be shared by the various intermediate
buffer types. By sharing a common buffer abstraction, we could maximize the
benefits of pooling and avoid introducing additional allocations while minimally
changing any intermediate buffer processing logic.

## Introducing bufferv2

Our solution was
[bufferv2](https://github.com/google/gvisor/tree/1ceb81454444981448ad57612139adfc0def1b85/pkg/bufferv2).
Bufferv2 is a non-contiguous, reference-counted, pooled, copy-on-write,
buffer-like data structure.

Internally, a bufferv2 `Buffer` is a linked list of `View`s. Each `View` has
start/end indices and holds a pointer to a `Chunk`. A `Chunk` is a
reference-counted structure that’s allocated from a pool and holds data in a
byte slice. There are several `Chunk` pools, each of which allocates chunks with
different-sized byte slices. These sizes start at 64 bytes and double up to 64 KB.

![Figure 2](/assets/images/2022-10-24-buffer-pooling-figure2.png "bufferv2 implementation diagram."){:width="100%"}
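To make that layout easier to picture, here is a heavily simplified sketch of
the idea. The types below are illustrative stand-ins rather than the real
bufferv2 implementation, but they show the tiered, pooled chunks and the
view/buffer layering described above.

```go
package main

import (
	"fmt"
	"sync"
)

// chunk is a reference-counted, pooled byte slice. The real bufferv2 Chunk is
// more involved; this stand-in only shows the pooling and refcounting idea.
type chunk struct {
	refs int
	data []byte
}

// Tiered pools: one pool per power-of-two size class from 64 bytes to 64 KB.
var chunkPools = func() []*sync.Pool {
	var pools []*sync.Pool
	for size := 64; size <= 64*1024; size *= 2 {
		size := size
		pools = append(pools, &sync.Pool{
			New: func() any { return &chunk{data: make([]byte, size)} },
		})
	}
	return pools
}()

// newChunk takes a chunk from the smallest pool whose size class fits n bytes.
func newChunk(n int) *chunk {
	for i, size := 0, 64; size <= 64*1024; i, size = i+1, size*2 {
		if n <= size {
			c := chunkPools[i].Get().(*chunk)
			c.refs = 1
			return c
		}
	}
	// Oversized requests fall back to a plain allocation.
	return &chunk{refs: 1, data: make([]byte, n)}
}

// view is a window (start/end indices) into a shared, refcounted chunk.
type view struct {
	read, write int // indices into chunk.data
	chunk       *chunk
	next        *view // views are linked together to form a buffer
}

// buffer is a non-contiguous sequence of views; appending data or another
// buffer is just linked-list surgery, with no copying of chunk contents.
type buffer struct {
	head, tail *view
	size       int
}

func main() {
	c := newChunk(1500)                        // comes from the 2048-byte size class
	v := &view{read: 0, write: 1500, chunk: c} // a view covering the packet
	b := &buffer{head: v, tail: v, size: 1500}
	fmt.Println("buffer holds", b.size, "bytes in a", cap(c.data), "byte chunk")
}
```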
The design of bufferv2 has a few key advantages over simpler object pooling:

*   **Zero-cost copies and copy-on-write**: Cloning a Buffer only increments the
    reference count of the underlying chunks instead of reallocating from the
    pool. Since buffers are much more frequently read than modified, this saves
    allocations. In the cases where a buffer is modified, only the chunk that’s
    changed has to be cloned, not the whole buffer.
*   **Fast buffer transformations**: Truncating and merging buffers or appending
    and prepending Views to Buffers are fast operations. Thanks to the
    non-contiguous memory structure, these operations are usually as quick as
    adding a node to a linked list or changing the indices in a View.
*   **Tiered pools**: When growing a Buffer or appending data, the new chunks
    come from different pools of previously allocated chunks. Using multiple
    pools means we are flexible enough to efficiently accommodate packets of all
    sizes with minimal overhead. Unlike a one-size-fits-all solution, we don't
    have to waste lots of space with a chunk size that is too big or loop
    forever allocating small chunks.

## Trade-offs

Shifting Netstack to bufferv2 came with some costs. To start, rewriting all
buffers to use bufferv2 was a sizable effort that took many months to fully roll
out. Any place in Netstack that allocated or used a byte slice needed to be
rewritten. Reference counting had to be introduced so all the aforementioned
intermediate buffer types (`PacketBuffer`, `segment`, etc.) could accurately
track buffer lifetimes, and tests had to be modified to ensure reference
counting correctness.

In addition to the upfront cost, the shift to bufferv2 also increased the
engineering complexity of future Netstack changes. Netstack contributors must
adhere to new rules to maintain memory safety and maximize the benefits of
pooling. These rules are strict – there needs to be strong justification to
break them. They are as follows (a short sketch of the resulting pattern appears
at the end of this post):

*   Never allocate a byte slice; always use `NewView()` instead.
*   Use a `View` for simple data operations (e.g. writing some data of a fixed
    size) and a `Buffer` for more complex I/O operations (e.g. appending data of
    variable size, merging data, writing from an `io.Reader`).
*   If you need access to the contents of a `View` as a byte slice, use
    `View.AsSlice()`. If you need access to the contents of a `Buffer` as a byte
    slice, consider refactoring, as this will cause an allocation.
*   Never write to or modify the slices returned by `View.AsSlice()`; they are
    still owned by the view.
*   Release bufferv2 objects as close to where they're created as possible. This
    is usually most easily done with `defer`.
*   Document function ownership of bufferv2 object parameters. If there is no
    documentation, it is assumed that the function does not take ownership of
    its parameters.
*   If a function takes ownership of its bufferv2 parameters, the bufferv2
    objects must be cloned before passing them as arguments.
*   All new Netstack tests must enable the leak checker and run a final leak
    check after the test is complete.

## Give it a Try

Bufferv2 is enabled by default as of
[gVisor 20221017](https://github.com/google/gvisor/releases/tag/release-20221017.0),
and will be rolling out to
[GKE Sandbox](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods)
soon, so no action is required to see a performance boost. Network-bound
workloads, such as web servers or databases like Redis, are the most likely to
see benefits. All the code implementing bufferv2 is public
[here](https://github.com/google/gvisor/tree/master/pkg/bufferv2), and
contributions are welcome! If you’d like to run the iperf benchmark for
yourself, you can run:

```
make run-benchmark BENCHMARKS_TARGETS=//test/benchmarks/network:iperf_test \
  RUNTIME=your-runtime-here BENCHMARKS_OPTIONS=-test.benchtime=60s
```

in the base gVisor directory.
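If you are considering contributing, the ownership rules from the Trade-offs
section boil down to a pattern roughly like the sketch below. The types here
are simplified stand-ins for illustration only, not the real bufferv2 API.

```go
package main

import "fmt"

// view is a stand-in for a pooled, reference-counted bufferv2 object: just
// enough structure to demonstrate the ownership rules, not the real API.
type view struct{ data []byte }

// newView stands in for bufferv2's pooled allocation helper (rule: never
// allocate a bare byte slice directly).
func newView(size int) *view { return &view{data: make([]byte, size)} }

// Release stands in for returning the underlying storage to its pool.
func (v *view) Release() { v.data = nil }

// checksum does NOT take ownership of v: its ownership is undocumented, so by
// convention it must not release v or hold a reference to it after returning.
func checksum(v *view) (sum byte) {
	for _, b := range v.data { // read-only access to the view's contents
		sum += b
	}
	return sum
}

func main() {
	v := newView(20)  // allocate through the pool helper...
	defer v.Release() // ...and release close to where the object was created

	copy(v.data, []byte{0x45, 0x00, 0x00, 0x14})
	fmt.Printf("checksum: %#x\n", checksum(v))
}
```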
If you experience any issues, please feel free to let us know at
[gvisor.dev/issues](https://github.com/google/gvisor/issues).