# Optimizing seccomp usage in gVisor

gVisor is a multi-layered security sandbox. [`seccomp-bpf`][seccomp-bpf] is
gVisor's second layer of defense against container escape attacks. gVisor uses
`seccomp-bpf` to filter the system calls it makes to the host kernel. This
significantly reduces the host attack surface that a compromised gVisor process
can access. However, this layer comes at a cost: every legitimate system call that
gVisor makes must be evaluated against this filter by the host kernel before it
is actually executed. **This blog post contains more than you ever wanted to
know about `seccomp-bpf`, and explores the past few months of work to optimize
gVisor's use of it.**

![gVisor and seccomp](/assets/images/2024-02-01-gvisor-seccomp.png "gVisor and seccomp"){:style="max-width:100%"}
<span class="attribution">A diagram showing gVisor's two main layers of
security: gVisor itself, and `seccomp-bpf`. This blog post touches on the
`seccomp-bpf` part.
[Tux logo by Larry Ewing and The GIMP](https://commons.wikimedia.org/wiki/File:Tux.svg).</span>

--------------------------------------------------------------------------------

## Understanding `seccomp-bpf` performance in gVisor {#performance-considerations}

One challenge with gVisor performance improvement ideas is that it is often very
difficult to estimate how much they will impact performance without first doing
most of the work necessary to actually implement them. Profiling tools help with
knowing where to look, but going from there to numbers is difficult.

`seccomp-bpf` is one area that is actually much more straightforward to
estimate. Because it is a secondary layer of defense that lives outside of
gVisor, and it is merely a filter, we can simply yank it out of gVisor and
benchmark the performance we get.
While running gVisor in this way is strictly
**less secure** and not a mode that gVisor should support, the numbers we get in
this manner do provide an upper bound on the maximum *potential* performance
gains we could see from optimizations within gVisor's use of `seccomp-bpf`.

To visualize this, we can run a benchmark with the following variants:

-   **Unsandboxed**: Unsandboxed performance without gVisor.
-   **gVisor**: gVisor from before any of the performance improvements described
    later in this post.
-   **gVisor with empty filter**: Same as **gVisor**, but with the `seccomp-bpf`
    filter replaced with one that unconditionally approves every system call.

From these three variants, we can separate the overhead that comes from
gVisor itself from the overhead that comes from `seccomp-bpf` filtering. The difference
between **gVisor** and **unsandboxed** represents the total gVisor performance
overhead, and the difference between **gVisor** and **gVisor with empty filter**
represents the performance overhead of gVisor's `seccomp-bpf` filtering rules.

Let's run these numbers for the ABSL build benchmark:

![ABSL seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-absl-empty-filter.png "ABSL seccomp-bpf performance"){:style="max-width:100%"}

We can now use these numbers to give a rough breakdown of where the overhead is
coming from:

![ABSL seccomp-bpf performance breakdown](/assets/images/2024-02-01-gvisor-seccomp-absl-breakdown.png "ABSL seccomp-bpf performance breakdown"){:style="max-width:100%"}

The `seccomp-bpf` overhead is small in absolute terms. The numbers suggest that
optimizing the `seccomp-bpf` filters can shave **at most**
3.4 seconds off the total ABSL build time, which represents a reduction of
total runtime by ~3.6%.
However, when looking at this amount relative to
gVisor's overhead over unsandboxed runtime, optimizing the
`seccomp-bpf` filters may remove **up to** ~15% of gVisor's overhead, which is
significant. *(Not all benchmarks have this behavior; some benchmarks show
smaller `seccomp-bpf`-related overhead. The overhead is also highly
platform-dependent.)*

Of course, this level of performance is what was reached with **empty
`seccomp-bpf` filtering rules**, so we cannot hope to fully reach this level of
performance gains. However, it is still useful as an upper bound. Let's see how
much of it we can recoup without compromising security.

## A primer on BPF and `seccomp-bpf`

### BPF, cBPF, eBPF, oh my!

[BPF (Berkeley Packet Filter)][BPF] is a virtual machine and eponymous machine
language. Its name comes from its original purpose: filtering packets in a
kernel network stack. However, its use has expanded to other domains of the
kernel where programmability is desirable. Syscall filtering in the context of
`seccomp` is one such area.

BPF itself comes in two dialects: "Classic BPF" (sometimes stylized as cBPF),
and the now-better-known ["Extended BPF" (commonly known as eBPF)][EBPF].
eBPF is a superset of cBPF and is usable extensively throughout the kernel.
However, `seccomp` is not one such area. While
[the topic has been heavily debated](https://lwn.net/Articles/857228/), the
status quo remains that `seccomp` filters may only use cBPF, so this post will
focus on cBPF alone.

### So what is `seccomp-bpf` exactly?

`seccomp-bpf` is the part of the Linux kernel that allows a program to impose
syscall filters on itself. A `seccomp-bpf` filter is a cBPF program that is
given syscall data as input, and outputs an "action" (a 32-bit integer) to take as
a result of this system call: allow it, reject it, crash the program, trap
execution, etc.
The kernel evaluates the cBPF program on every system call the
application makes. The "input" of this cBPF program is the byte layout of the
`seccomp_data` struct, which can be loaded into the registers of the cBPF
virtual machine for analysis.

Here's what the `seccomp_data` struct looks like in
[Linux's `include/uapi/linux/seccomp.h`](https://github.com/torvalds/linux/blob/master/include/uapi/linux/seccomp.h):

```c
struct seccomp_data {
    int   nr;                  // 32 bits
    __u32 arch;                // 32 bits
    __u64 instruction_pointer; // 64 bits
    __u64 args[6];             // 64 bits × 6
};                             // Total: 512 bits
```

### Sample `seccomp-bpf` filter {#sample-filter}

Here is an example `seccomp-bpf` filter, adapted from the
[Linux kernel documentation](https://www.kernel.org/doc/Documentation/networking/filter.txt)[^1]:

<!-- Markdown note: This uses "javascript" syntax highlighting because that
     happens to work pretty well with this pseudo-assembly-like language.
     It is not actually JavaScript. -->

```javascript
00: load32 4              // Load 32 bits at offsetof(struct seccomp_data, arch) (= 4)
                          // of the seccomp_data input struct into register A.
01: jeq 0xc000003e, 0, 11 // If A == AUDIT_ARCH_X86_64, jump by 0 instructions [to 02]
                          // else jump by 11 instructions [to 13].
02: load32 0              // Load 32 bits at offsetof(struct seccomp_data, nr) (= 0)
                          // of the seccomp_data input struct into register A.
03: jeq 15, 10, 0         // If A == __NR_rt_sigreturn, jump by 10 instructions [to 14]
                          // else jump by 0 instructions [to 04].
04: jeq 231, 9, 0         // If A == __NR_exit_group, jump by 9 instructions [to 14]
                          // else jump by 0 instructions [to 05].
05: jeq 60, 8, 0          // If A == __NR_exit, jump by 8 instructions [to 14]
                          // else jump by 0 instructions [to 06].
06: jeq 0, 7, 0           // Same thing for __NR_read.
07: jeq 1, 6, 0           // Same thing for __NR_write.
08: jeq 5, 5, 0           // Same thing for __NR_fstat.
09: jeq 9, 4, 0           // Same thing for __NR_mmap.
10: jeq 14, 3, 0          // Same thing for __NR_rt_sigprocmask.
11: jeq 13, 2, 0          // Same thing for __NR_rt_sigaction.
12: jeq 35, 1, 0          // If A == __NR_nanosleep, jump by 1 instruction [to 14]
                          // else jump by 0 instructions [to 13].
13: return 0              // Return SECCOMP_RET_KILL_THREAD
14: return 0x7fff0000     // Return SECCOMP_RET_ALLOW
```

This filter effectively allows only the following syscalls: `rt_sigreturn`,
`exit_group`, `exit`, `read`, `write`, `fstat`, `mmap`, `rt_sigprocmask`,
`rt_sigaction`, and `nanosleep`. All other syscalls result in the calling thread
being killed.

### `seccomp-bpf` and cBPF limitations {#cbpf-limitations}

cBPF is quite limited as a language. The following limitations all factor into
the optimizations described in this blog post:

-   The cBPF virtual machine only has 2 32-bit registers, and a tertiary
    pseudo-register for a 32-bit immediate value. (Note that syscall arguments
    evaluated in the context of `seccomp` are 64-bit values, so you can already
    foresee that this leads to complications.)
-   `seccomp-bpf` programs are limited to 4,096 instructions.
-   Jump instructions can only go forward (this ensures that programs must
    halt).
-   Jump instructions may only jump by a fixed ("immediate") number of
    instructions. (You cannot say: "jump by whatever this register says".)
-   Jump instructions come in two flavors:
    -   "Unconditional" jump instructions, which jump by a fixed number of
        instructions. This number must fit in 16 bits.
    -   "Conditional" jump instructions, which include a condition expression
        and two jump targets:
        -   The number of instructions to jump by if the condition is true. This
            number must fit in 8 bits, so this cannot jump by more than 255
            instructions.
        -   The number of instructions to jump by if the condition is false.
            This number must fit in 8 bits, so this cannot jump by more than
            255 instructions.

### `seccomp-bpf` caching in Linux

Since
[Linux kernel version 5.11](https://www.phoronix.com/news/Linux-5.11-SECCOMP-Performance),
when a program uploads a `seccomp-bpf` filter into the kernel,
[Linux runs a BPF emulator](https://github.com/torvalds/linux/commit/8e01b51a31a1e08e2c3e8fcc0ef6790441be2f61)
that looks for system call numbers where the BPF program doesn't do any fancy
operations nor load any bits from the `instruction_pointer` or `args` fields of
the `seccomp_data` input struct, and still returns "allow". When this is the
case, **Linux will cache this information** in a per-syscall-number bitfield.

Later, when a cacheable syscall number is executed, the BPF program is not
evaluated at all; since the kernel knows that the program is deterministic and
doesn't depend on the syscall arguments, it can safely allow the syscall without
actually running the BPF program.

This post uses the term "cacheable" to refer to syscalls that match this
criterion.

## How gVisor builds its `seccomp-bpf` filter

gVisor imposes a `seccomp-bpf` filter on itself as part of Sentry start-up. This
process works as follows:

-   gVisor gathers bits of configuration that are relevant to the construction
    of its `seccomp-bpf` filter. This includes which platform is in use, whether
    certain features that require looser filtering are enabled (e.g. host
    networking, profiling, GPU proxying, etc.), and certain file descriptors
    (FDs) which may be checked against syscall arguments that pass in FDs.
-   gVisor generates a sequence of rulesets from this configuration.
    A ruleset
    is a mapping from syscall number to a predicate that must be true for this
    system call, along with an "action" (return code) that is taken should this
    predicate be satisfied. For ease of human understanding, the predicate is
    often written as a
    [disjunctive rule](https://en.wikipedia.org/wiki/Logical_disjunction), for
    which each sub-rule is a
    [conjunctive rule](https://en.wikipedia.org/wiki/Logical_conjunction) that
    verifies each syscall argument. In other words, `(fA(args[0]) && fB(args[1])
    && ...) || (fC(args[0]) && fD(args[1]) && ...) || ...`. This is represented
    [in gVisor code](https://github.com/google/gvisor/blob/master/runsc/boot/filter/config/config_main.go)
    as follows:

    ```go
    Or{ // Disjunction rule
        PerArg{ // Conjunction rule over each syscall argument
            fA, // Predicate for `seccomp_data.args[0]`
            fB, // Predicate for `seccomp_data.args[1]`
            // ... More predicates can go here (up to 6 arguments per syscall)
        },
        PerArg{ // Conjunction rule over each syscall argument
            fC, // Predicate for `seccomp_data.args[0]`
            fD, // Predicate for `seccomp_data.args[1]`
            // ... More predicates can go here (up to 6 arguments per syscall)
        },
    }
    ```

-   gVisor performs several optimizations on this data structure.

-   gVisor then renders this list of rulesets into a linear program that looks
    close to the final machine language, other than jump offsets, which are
    initially represented as symbolic named labels during the rendering process.

-   gVisor then resolves all the labels to their actual instruction index, and
    computes the actual jump targets of all jump instructions to obtain valid
    cBPF machine code.

-   gVisor runs further optimizations on this cBPF bytecode.

-   Finally, the cBPF bytecode is uploaded into the host kernel and the
    `seccomp-bpf` filter becomes effective.
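To make the label-resolution step concrete, here is a minimal sketch in Go. The `inst` type, its field names, and the `resolve` function are invented for illustration; gVisor's actual renderer (in `pkg/bpf`) is more involved.

```go
package main

import "fmt"

// inst is a toy model of one cBPF instruction during rendering: jump
// instructions carry a symbolic target label instead of a numeric offset.
type inst struct {
	op     string // "jmp", "ret", "load32", ...
	target string // symbolic jump target, "" for non-jumps
	label  string // label naming this instruction, "" if unnamed
	k      uint32 // immediate value; receives the resolved offset for jumps
}

// resolve turns symbolic labels into relative forward jump offsets.
func resolve(prog []inst) error {
	labels := make(map[string]int)
	for i, in := range prog {
		if in.label != "" {
			labels[in.label] = i
		}
	}
	for i := range prog {
		if prog[i].target == "" {
			continue
		}
		dst, ok := labels[prog[i].target]
		if !ok {
			return fmt.Errorf("undefined label %q", prog[i].target)
		}
		if dst <= i {
			return fmt.Errorf("backward jump to %q; cBPF jumps only go forward", prog[i].target)
		}
		// cBPF jump offsets are relative to the instruction after the jump.
		prog[i].k = uint32(dst - i - 1)
	}
	return nil
}

func main() {
	prog := []inst{
		{op: "jmp", target: "allow"},               // 0
		{op: "ret", k: 0},                          // 1: SECCOMP_RET_KILL_THREAD
		{op: "ret", label: "allow", k: 0x7fff0000}, // 2: SECCOMP_RET_ALLOW
	}
	if err := resolve(prog); err != nil {
		panic(err)
	}
	fmt.Println(prog[0].k) // resolved offset from instruction 0 to "allow"
}
```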
Optimizing the `seccomp-bpf` filter to be more efficient allows the program to
be more compact (i.e. it's possible to pack more complex filters within the 4,096
instruction limit), and to run faster. While `seccomp-bpf` evaluation is
measured in nanoseconds, the impact of any optimization is magnified here,
because host syscalls are an important part of the synchronous "syscall hot
path" that must execute as part of handling certain performance-sensitive
syscalls from the sandboxed application. The relationship is not 1-to-1: a single
application syscall may result in several host syscalls, especially due to
`futex(2)`, which the Sentry calls many times to synchronize its own operations.
Therefore, shaving a nanosecond here and there results in several shaved
nanoseconds in the syscall hot path.

## Structural optimizations {#structure}

The first optimization done for gVisor's `seccomp-bpf` filter was to turn its linear
search over syscall numbers into a
[binary search tree](https://en.wikipedia.org/wiki/Binary_search_tree). This
turns the search for syscall numbers from `O(n)` to `O(log n)` instructions.
This is a very common `seccomp-bpf` optimization technique which is replicated
in other projects such as
[libseccomp](https://github.com/seccomp/libseccomp/issues/116) and Chromium.

To do this, a cBPF program basically loads the 32-bit `nr` (syscall number)
field of the `seccomp_data` struct, and does a binary tree traversal of the
[syscall number space](https://chromium.googlesource.com/chromiumos/docs/+/HEAD/constants/syscalls.md#tables).
When it finds a match, it jumps to a set of instructions that check that
syscall's arguments for validity, and then returns allow/reject.

But why stop here? Let's go further.

The problem with the binary search tree approach is that it treats all syscall
numbers equally.
This is a problem for three reasons:

1.  It does not matter to have good performance for disallowed syscalls, because
    such syscalls should never happen during normal program execution.
2.  It does not matter to have good performance for syscalls which can be cached
    by the kernel, because the BPF program will only have to run once for these
    system calls.
3.  For the system calls which are allowed but are not cacheable by the kernel,
    there is a
    [Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution) of
    their relative frequency. To exploit this, we should evaluate the most-often
    used syscalls faster than the least-often used ones. The binary tree
    structure does not exploit this distribution, and instead treats all
    syscalls equally.

So gVisor splits syscall numbers into four sets:

-   🅰: Non-cacheable 🅰llowed, called very frequently.
-   🅱: Non-cacheable allowed, called once in a 🅱lue moon.
-   🅲: 🅲acheable allowed (whether called frequently or not).
-   🅳: 🅳isallowed (which, by definition, is neither cacheable nor expected to
    ever be called).

Then, the cBPF program is structured in the following layout:

-   Linear search over allowed frequently-called non-cacheable syscalls (🅰).
    These syscalls are ordered most-frequently-called first (e.g. `futex(2)`
    is the first one, as it is by far the most-frequently-called system call).
-   Binary search over allowed infrequently-called non-cacheable syscalls (🅱).
-   Binary search over allowed cacheable syscalls (🅲).
-   Reject anything else (🅳).

This structure takes full advantage of the kernel caching functionality, and of
the Pareto distribution of syscalls.
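Here is a sketch of how such a classification could drive the program layout. The class names, syscall numbers, and frequencies are invented for illustration; gVisor derives the real classification from its filter configuration and from syscall frequency data.

```go
package main

import (
	"fmt"
	"sort"
)

// class models the 🅰/🅱/🅲 split described above. Syscalls absent from the
// rules map are 🅳 (disallowed) and handled by the final reject fallthrough.
type class int

const (
	hotClass       class = iota // 🅰: allowed, non-cacheable, frequent
	coldClass                   // 🅱: allowed, non-cacheable, rare
	cacheableClass              // 🅲: allowed, cacheable
)

// layout splits classified syscalls into the three program sections: a
// linear-search list for 🅰 and sorted lists (for BST rendering) for 🅱/🅲.
func layout(rules map[int]class, freq map[int]int) (linear, coldBST, cacheBST []int) {
	for nr, c := range rules {
		switch c {
		case hotClass:
			linear = append(linear, nr)
		case coldClass:
			coldBST = append(coldBST, nr)
		case cacheableClass:
			cacheBST = append(cacheBST, nr)
		}
	}
	// 🅰 is ordered most-frequently-called first.
	sort.Slice(linear, func(i, j int) bool { return freq[linear[i]] > freq[linear[j]] })
	// The binary search trees are built from sorted syscall numbers.
	sort.Ints(coldBST)
	sort.Ints(cacheBST)
	return linear, coldBST, cacheBST
}

func main() {
	rules := map[int]class{
		202: hotClass,       // futex
		1:   hotClass,       // write
		35:  coldClass,      // nanosleep
		39:  cacheableClass, // getpid
	}
	freq := map[int]int{202: 1000, 1: 500} // illustrative call counts
	linear, coldBST, cacheBST := layout(rules, freq)
	fmt.Println(linear, coldBST, cacheBST) // futex first in the linear section
}
```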
<details markdown="1">

<summary markdown="1">

### Binary search tree optimizations

Beyond classifying syscalls to see which binary search tree they should be a
part of, gVisor also optimizes the binary search process itself.

</summary>

Each syscall number is a node in the tree. When traversing the tree, there are
three options at each point:

-   The syscall number is an exact match
-   The syscall number is lower than the node's value
-   The syscall number is higher than the node's value

In order to render the BST as cBPF bytecode, gVisor used to render the following
(in pseudocode):

```javascript
if syscall number == current node value
    jump @rules_for_this_syscall
if syscall number < current node value
    jump @left_node
jump @right_node

@rules_for_this_syscall:
// Render bytecode for this syscall's filters here...

@left_node:
// Recursively render the bytecode for the left node value here...

@right_node:
// Recursively render the bytecode for the right node value here...
```

Keep in mind the [cBPF limitations](#cbpf-limitations) here. Because conditional
jumps are limited to 255 instructions, the jump to `@left_node` can be further
than 255 instructions away (especially for syscalls with complex filtering rules
like [`ioctl(2)`](https://man7.org/linux/man-pages/man2/ioctl.2.html)). The jump
to `@right_node` is almost certainly more than 255 instructions away. This means
that in actual cBPF bytecode, we would often need to use conditional jumps followed
by unconditional jumps in order to jump so far forward. Meanwhile, the jump to
`@rules_for_this_syscall` would be a very short hop away, but this locality
would only be taken advantage of for a single node of the entire tree for each
traversal.
Consider this structure instead:

```javascript
// Traversal code:
if syscall number < current node value
    jump @left_node
if syscall number > current node value
    jump @right_node
jump @rules_for_this_syscall
@left_node:
// Recursively render only the traversal code for the left node here
@right_node:
// Recursively render only the traversal code for the right node here

// Filtering code:
@rules_for_this_syscall:
// Render bytecode for this syscall's filters here
// Recursively render only the filtering code for the left node here
// Recursively render only the filtering code for the right node here
```

This effectively separates the per-syscall rules from the traversal of the BST.
This ensures that the traversal can be done entirely using conditional jumps,
and that for any given execution of the cBPF program, there will be at most one
unconditional jump to the syscall-specific rules.

This structure is further improvable by taking advantage of the fact that
syscall numbers are a dense space, and so are syscall filter rules. This means
we can often avoid needless comparisons. For example, given the following tree:

```
     22
    /  \
   9    24
  /    /  \
 8    23   50
```

Notice that the tree contains `22`, `23`, and `24`. This means that if we get to
node `23`, we do not need to check for syscall number equality, because we've
already established from the traversal that the syscall number must be `23`.

</details>

## cBPF bytecode optimizations

gVisor now implements a
[bytecode-level cBPF optimizer](https://github.com/google/gvisor/blob/master/pkg/bpf/optimizer.go)
running a few lossless optimizations. These optimizations are run repeatedly
until the bytecode no longer changes. This is because each type of optimization
tends to feed on the fruits of the others, as we'll see below.
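The "run until the bytecode no longer changes" loop can be sketched as a fixpoint iteration. The `insn` type and the single `dropZeroJumps` pass below are toy examples invented for illustration (a real pass removing instructions would also have to rewrite the targets of jumps that cross the removed instruction):

```go
package main

import (
	"fmt"
	"reflect"
)

// insn is a minimal stand-in for a cBPF instruction.
type insn struct {
	op string
	k  int // jump offset or immediate value
}

// optimize runs every pass repeatedly until a full round leaves the program
// unchanged, i.e. until a fixpoint is reached.
func optimize(prog []insn, passes []func([]insn) []insn) []insn {
	for changed := true; changed; {
		changed = false
		for _, pass := range passes {
			next := pass(prog)
			if !reflect.DeepEqual(next, prog) {
				prog = next
				changed = true
			}
		}
	}
	return prog
}

// dropZeroJumps removes unconditional jumps with a zero-instruction target,
// which are no-ops. (Jump-target rewriting is omitted in this sketch.)
func dropZeroJumps(prog []insn) []insn {
	out := make([]insn, 0, len(prog))
	for _, in := range prog {
		if in.op == "jmp" && in.k == 0 {
			continue // jumping by 0 instructions just falls through
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []insn{{op: "load32"}, {op: "jmp", k: 0}, {op: "ret"}}
	prog = optimize(prog, []func([]insn) []insn{dropZeroJumps})
	fmt.Println(len(prog)) // the no-op jump is gone
}
```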
![gVisor sentry seccomp-bpf filter program size](/assets/images/2024-02-01-gvisor-seccomp-sentry-filter-size.png "gVisor sentry seccomp-bpf filter program size"){:style="max-width:100%"}

gVisor's `seccomp-bpf` program size is reduced by over a factor of 4 using the
optimizations below.

<details markdown="1">

<summary markdown="1">

### Optimizing cBPF jumps

The [limitations of cBPF jump instructions described earlier](#cbpf-limitations)
mean that typical BPF bytecode rendering code will usually favor unconditional
jumps even when they are not necessary. However, they can be optimized after the
fact.

</summary>

Typical BPF bytecode rendering code for a simple condition is usually rendered
as follows:

```javascript
jif <condition>, 0, 1     // If <condition> is true, continue,
                          // otherwise skip over 1 instruction.
jmp @condition_was_true   // Unconditional jump to label @condition_was_true.
jmp @condition_was_false  // Unconditional jump to label @condition_was_false.
```

... or as follows:

```javascript
jif <condition>, 1, 0     // If <condition> is true, jump by 1 instruction,
                          // otherwise continue.
jmp @condition_was_false  // Unconditional jump to label @condition_was_false.
                          // Flow through here if the condition was true.
```

... In other words, the generated code always uses unconditional jumps, and
conditional jump offsets are always either 0 or 1 instructions forward. This is
because conditional jumps are limited to 8 bits (255 instructions), and it is
not always possible at BPF bytecode rendering time to know ahead of time that
the jump targets (`@condition_was_true`, `@condition_was_false`) will resolve to
an instruction that is close enough ahead that the offset would fit in 8 bits.
The safe thing to do is to always use an unconditional jump.
Since unconditional
jump targets have 16 bits to play with, and `seccomp-bpf` programs are limited
to 4,096 instructions, it is always possible to encode a jump using an
unconditional jump instruction.

But of course, the jump target often *does* fit in 8 bits. So gVisor looks over
the bytecode for optimization opportunities:

-   **Conditional jumps that jump to unconditional jumps** are rewritten to
    their final destination, so long as this fits within the 255-instruction
    conditional jump limit.
-   **Unconditional jumps that jump to other unconditional jumps** are rewritten
    to their final destination.
-   **Conditional jumps where both branches jump to the same instruction** are
    replaced by an unconditional jump to that instruction.
-   **Unconditional jumps with a zero-instruction jump target** are removed.

The aim of these optimizations is to clean up the needless indirection that is
a byproduct of cBPF bytecode rendering code. Once they have all run, all jumps
are as tight as they can be.

</details>

<details markdown="1">

<summary markdown="1">

### Removing dead code

Because cBPF is a very restricted language, it is possible to determine with
certainty that some instructions can never be reached.

</summary>

In cBPF, each instruction either:

-   **Flows** forward (e.g. `load` operations, math operations).
-   **Jumps** by a fixed (immediate) number of instructions.
-   **Stops** the execution immediately (`return` instructions).

Therefore, gVisor runs a simple program traversal algorithm. It creates a
bitfield with one bit per instruction, then traverses the program and all its
possible branches. Then, all instructions that were never traversed are removed
from the program, and all jump targets are updated to account for these
removals.
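The traversal can be sketched as follows. The instruction kinds and field names are invented for illustration, and the subsequent removal and jump-target rewriting steps are omitted:

```go
package main

import "fmt"

// flowKind classifies instructions by the three flow kinds listed above.
type flowKind int

const (
	flows flowKind = iota // falls through to the next instruction
	jumps                 // jumps forward by jt or jf instructions
	stops                 // return: execution ends here
)

type op struct {
	kind   flowKind
	jt, jf int // relative forward offsets for jump instructions
}

// reachable marks every instruction reachable from instruction 0, using a
// one-flag-per-instruction slice in place of gVisor's bitfield.
func reachable(prog []op) []bool {
	seen := make([]bool, len(prog))
	var visit func(int)
	visit = func(pc int) {
		if pc >= len(prog) || seen[pc] {
			return
		}
		seen[pc] = true
		switch prog[pc].kind {
		case flows:
			visit(pc + 1)
		case jumps:
			// Both branch targets are relative to the next instruction.
			visit(pc + 1 + prog[pc].jt)
			visit(pc + 1 + prog[pc].jf)
		case stops:
			// No successors.
		}
	}
	visit(0)
	return seen
}

func main() {
	prog := []op{
		{kind: jumps, jt: 1, jf: 2}, // 0: branches to 2 or 3
		{kind: flows},               // 1: unreachable dead code
		{kind: stops},               // 2
		{kind: stops},               // 3
	}
	fmt.Println(reachable(prog)) // instruction 1 is never marked
}
```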
In turn, this makes the program shorter, which makes more jump optimizations
possible.

</details>

<details markdown="1">

<summary markdown="1">

### Removing redundant `load` instructions {#redundant-loads}

cBPF programs filter system calls by inspecting their arguments. To do these
comparisons, this data must first be loaded into the cBPF VM registers. These
load operations can be optimized.

</summary>

cBPF's conditional operations (e.g. "is equal to", "is greater than", etc.)
operate on a single 32-bit register called "A". As such, a `seccomp-bpf` program
typically consists of many load operations (`load32`) that load a 32-bit value
from a given offset of the `seccomp_data` struct into register A, then perform
a comparative operation on it to see if it matches the filter.

```javascript
00: load32 <offset>
01: jif <condition1>, @condition1_was_true, @condition1_was_false
02: load32 <offset>
03: jif <condition2>, @condition2_was_true, @condition2_was_false
// ...
```

But when a syscall rule is of the form "this syscall argument must be one of the
following values", we don't need to reload the same value (from the same offset)
multiple times. So gVisor looks for redundant loads like this, and removes them:

```javascript
00: load32 <offset>
01: jif <condition1>, @condition1_was_true, @condition1_was_false
02: jif <condition2>, @condition2_was_true, @condition2_was_false
// ...
```

Note that syscall arguments are **64-bit** values, whereas the A register is
only 32 bits wide. Therefore, asserting that a syscall argument matches a
predicate usually involves at least 2 `load32` operations on different offsets,
thereby making this optimization useless for the "this syscall argument must be
one of the following values" case. We'll get back to that.
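The scan over straight-line code can be sketched like this, with an invented instruction model. A real pass must additionally prove that no jump lands on a removed load, and fix up jump offsets afterwards:

```go
package main

import "fmt"

// li is a tiny instruction model for this sketch: "load32" sets register A
// from a seccomp_data offset; "jif" reads A without modifying it.
type li struct {
	op  string
	off int // load offset (meaningful for load32 only)
}

// dropRedundantLoads removes a load32 whose offset matches the value that
// register A already holds.
func dropRedundantLoads(prog []li) []li {
	out := make([]li, 0, len(prog))
	loadedOff := -1 // offset currently held in A; -1 means unknown
	for _, in := range prog {
		switch in.op {
		case "load32":
			if in.off == loadedOff {
				continue // A already holds this value; the load is redundant
			}
			loadedOff = in.off
		case "jif":
			// Conditional jumps read A but do not clobber it.
		default:
			loadedOff = -1 // anything else may clobber A
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []li{
		{op: "load32", off: 16}, // load 32 bits of an argument
		{op: "jif"},
		{op: "load32", off: 16}, // redundant: A still holds offset 16
		{op: "jif"},
	}
	fmt.Println(len(dropRedundantLoads(prog))) // one instruction removed
}
```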
</details>

<details markdown="1">

<summary markdown="1">

### Minimizing the number of `return` instructions

A typical syscall filter program consists of many predicates which return either
"allowed" or "rejected". These are encoded in the bytecode as either `return`
instructions, or jumps to `return` instructions. These instructions can show up
dozens or hundreds of times in the cBPF bytecode in quick succession, presenting
an optimization opportunity.

</summary>

Since two `return` instructions with the same immediate return code are exactly
equivalent to one another, it is possible to rewrite jumps to all `return`
instructions that return "allowed" to go to a single `return` instruction that
returns this code (and similarly for "rejected"), so long as the jump offsets fit
within the limits of conditional jumps (255 instructions). In turn, this makes
the program shorter, and therefore makes more jump optimizations possible.

To implement this optimization, gVisor first replaces all unconditional jump
instructions that go to `return` statements with a copy of that `return`
statement. This removes needless indirection.

```javascript
Original bytecode                  New bytecode
00: jeq 0, 0, 1                    00: jeq 0, 0, 1
01: jmp @good                 -->  01: return allowed
02: jmp @bad                  -->  02: return rejected
...                                ...
10: jge 0, 0, 1                    10: jge 0, 0, 1
11: jmp @good                 -->  11: return allowed
12: jmp @bad                  -->  12: return rejected
...                                ...
100 [@good]: return allowed        100 [@good]: return allowed
101 [@bad]: return rejected        101 [@bad]: return rejected
```

gVisor then searches for `return` statements which can be entirely removed by
seeing if it is possible to rewrite the rest of the program to jump or flow
through to an equivalent `return` statement (without making the program longer
in the process).
In the above example:

```javascript
Original bytecode                  New bytecode
00: jeq 0, 0, 1               -->  00: jeq 0, 99, 100   // Targets updated
01: return allowed                 01: return allowed   // Now dead code
02: return rejected                02: return rejected  // Now dead code
...                                ...
10: jge 0, 0, 1               -->  10: jge 0, 89, 90    // Targets updated
11: return allowed                 11: return allowed   // Now dead code
12: return rejected                12: return rejected  // Now dead code
...                                ...
100 [@good]: return allowed        100 [@good]: return allowed
101 [@bad]: return rejected        101 [@bad]: return rejected
```

Finally, the dead code removal pass cleans up the dead `return` statements, and
the program becomes shorter.

```javascript
Original bytecode                  New bytecode
00: jeq 0, 99, 100            -->  00: jeq 0, 95, 96    // Targets updated
01: return allowed            -->  /* Removed */
02: return rejected           -->  /* Removed */
...                                ...
10: jge 0, 89, 90             -->  08: jge 0, 87, 88    // Targets updated
11: return allowed            -->  /* Removed */
12: return rejected           -->  /* Removed */
...                                ...
100 [@good]: return allowed        96 [@good]: return allowed
101 [@bad]: return rejected        97 [@bad]: return rejected
```

While this search is expensive to perform, in a program full of predicates —
which is exactly what `seccomp-bpf` programs are — this approach massively
reduces program size.

</details>

## Ruleset optimizations {#optimize-rulesets}

Bytecode-level optimizations are cool, but why stop here? gVisor now also
performs
[`seccomp` ruleset optimizations](https://github.com/google/gvisor/blob/master/pkg/seccomp/seccomp_optimizer.go).

In gVisor, a `seccomp` `RuleSet` is a mapping from syscall number to a logical
expression named `SyscallRule`, along with a `seccomp-bpf` action (e.g. "allow")
taken if a syscall with a given number matches its `SyscallRule`.
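To make the rule representation concrete, here is a stripped-down model of such rules as matchers. The type and function names mirror the ones used in this post, but gVisor's real `SyscallRule` types (in `pkg/seccomp`) are richer and also know how to render themselves to cBPF bytecode:

```go
package main

import "fmt"

// SyscallRule is a simplified model of the predicate described above: it
// decides whether a syscall's six 64-bit arguments match.
type SyscallRule interface {
	Matches(args [6]uint64) bool
}

// MatchAll matches any seccomp_data.
type MatchAll struct{}

func (MatchAll) Matches([6]uint64) bool { return true }

// Or is a disjunction of rules.
type Or []SyscallRule

func (o Or) Matches(args [6]uint64) bool {
	for _, r := range o {
		if r.Matches(args) {
			return true
		}
	}
	return false
}

// And is a conjunction of rules.
type And []SyscallRule

func (a And) Matches(args [6]uint64) bool {
	for _, r := range a {
		if !r.Matches(args) {
			return false
		}
	}
	return true
}

// ArgMatcher is a predicate over a single 64-bit syscall argument.
type ArgMatcher func(uint64) bool

func AnyValue(uint64) bool { return true }

func EqualTo(v uint64) ArgMatcher {
	return func(x uint64) bool { return x == v }
}

// PerArg is the conjunction over each syscall argument; a nil matcher
// means "any value".
type PerArg [6]ArgMatcher

func (p PerArg) Matches(args [6]uint64) bool {
	for i, m := range p {
		if m != nil && !m(args[i]) {
			return false
		}
	}
	return true
}

func main() {
	rule := Or{
		PerArg{AnyValue, EqualTo(2)},
		PerArg{AnyValue, EqualTo(3)},
	}
	fmt.Println(rule.Matches([6]uint64{0, 2}), rule.Matches([6]uint64{0, 4}))
}
```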
<details markdown="1">

<summary markdown="1">

### Basic ruleset simplifications {#basic-ruleset-simplifications}

A `SyscallRule` is a predicate over the data contained in the `seccomp_data`
struct (beyond its `nr`). A trivial implementation is `MatchAll`, which simply
matches any `seccomp_data`. Other implementations include `Or` and `And` (which
do what they sound like), and `PerArg`, which applies predicates to each specific
argument of a `seccomp_data` and forms the meat of actual syscall filtering
rules. Some basic simplifications are already possible with these building
blocks.

</summary>

gVisor implements the following basic optimizers, which look like they may be
useless on their own but end up simplifying the logic of the more complex
optimizer described in other sections quite a bit:

-   `Or` and `And` rules with a single predicate within them are replaced with
    just that predicate.
-   Duplicate predicates within `Or` and `And` rules are removed.
-   `Or` rules within `Or` rules are flattened.
-   `And` rules within `And` rules are flattened.
-   An `Or` rule which contains a `MatchAll` predicate is replaced with
    `MatchAll`.
-   `MatchAll` predicates within `And` rules are removed.
-   `PerArg` rules with `MatchAll` predicates for each argument are replaced
    with a rule that matches anything.

As with the bytecode-level optimizations, gVisor runs these in a loop until the
structure of the rules no longer changes. With the basic optimizations above,
this silly-looking rule:

```go
Or{
    Or{
        And{
            MatchAll,
            PerArg{AnyValue, EqualTo(2), AnyValue},
        },
        MatchAll,
    },
    PerArg{AnyValue, EqualTo(2), AnyValue},
    PerArg{AnyValue, EqualTo(2), AnyValue},
}
```

... is simplified down to just `PerArg{AnyValue, EqualTo(2), AnyValue}`.
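A few of these simplifications (flattening, deduplication, single-child collapse) can be sketched over an invented toy rule AST; gVisor's real optimizers operate on its actual `SyscallRule` types rather than this string-based stand-in:

```go
package main

import "fmt"

// rule is a toy AST for this sketch: either a leaf (a named predicate such
// as a PerArg rule) or an "Or"/"And" node over children.
type rule struct {
	op       string // "leaf", "Or", or "And"
	name     string // leaf predicate name
	children []rule
}

// simplify applies a subset of the basic optimizers listed above:
// Or-in-Or / And-in-And flattening, duplicate removal, and replacing a
// single-child Or/And with its only child.
func simplify(r rule) rule {
	if r.op == "leaf" {
		return r
	}
	// Simplify children first, flattening same-op nesting as we go.
	var kids []rule
	for _, c := range r.children {
		c = simplify(c)
		if c.op == r.op {
			kids = append(kids, c.children...) // flatten Or-in-Or / And-in-And
		} else {
			kids = append(kids, c)
		}
	}
	// Remove duplicate children (compared structurally via their printout).
	seen := make(map[string]bool)
	var uniq []rule
	for _, c := range kids {
		key := fmt.Sprintf("%v", c)
		if !seen[key] {
			seen[key] = true
			uniq = append(uniq, c)
		}
	}
	if len(uniq) == 1 {
		return uniq[0] // a single-predicate Or/And is just that predicate
	}
	return rule{op: r.op, children: uniq}
}

func main() {
	p := rule{op: "leaf", name: "PerArg{AnyValue, EqualTo(2), AnyValue}"}
	silly := rule{op: "Or", children: []rule{
		{op: "Or", children: []rule{p}},
		p,
		p,
	}}
	fmt.Println(simplify(silly).name) // collapses to the single PerArg leaf
}
```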

</details>

<details markdown="1">

<summary markdown="1">

### Extracting repeated argument matchers

This is the main optimization that gVisor performs on rulesets. gVisor looks for
common argument matchers that are repeated across all combinations of *other*
argument matchers in branches of an `Or` rule. It removes them from these
`PerArg` rules, and `And`s the overall syscall rule with a single instance of
that argument matcher. Sound complicated? Let's look at an example.

</summary>

In the
[gVisor Sentry `seccomp-bpf` configuration](https://github.com/google/gvisor/blob/master/runsc/boot/filter/config/),
these are the rules for the
[`fcntl(2)` system call](https://man7.org/linux/man-pages/man2/fcntl.2.html):

```go
rules = ...(map[uintptr]SyscallRule{
  SYS_FCNTL: Or{
    PerArg{
      NonNegativeFD,
      EqualTo(F_GETFL),
    },
    PerArg{
      NonNegativeFD,
      EqualTo(F_SETFL),
    },
    PerArg{
      NonNegativeFD,
      EqualTo(F_GETFD),
    },
  },
})
```

... This means that for the `fcntl(2)` system call, `seccomp_data.args[0]` may
be any non-negative number, `seccomp_data.args[1]` may be either `F_GETFL`,
`F_SETFL`, or `F_GETFD`, and all other `seccomp_data` fields may be any value.

If rendered naively in BPF, this would iterate over each branch of the `Or`
expression, and re-check `NonNegativeFD` each time. Clearly wasteful.
Conceptually, the ideal expression is something like this:

```go
rules = ...(map[uintptr]SyscallRule{
  SYS_FCNTL: PerArg{
    NonNegativeFD,
    AnyOf(F_GETFL, F_SETFL, F_GETFD),
  },
})
```

... But going through all the syscall rules to look for this pattern would be
quite tedious, and some of them are actually `Or`'d from multiple
`map[uintptr]SyscallRule` in different files (e.g.
platform-dependent syscalls),
so they cannot all be specified in a single location with a single predicate on
`seccomp_data.args[1]`. So gVisor needs to detect this programmatically at
optimization time.

Conceptually, gVisor goes from:

```go
Or{
  PerArg{A1, B1, C1, D},
  PerArg{A2, B1, C1, D},
  PerArg{A1, B2, C2, D},
  PerArg{A2, B2, C2, D},
  PerArg{A1, B3, C3, D},
  PerArg{A2, B3, C3, D},
}
```

... to (after one pass):

```go
And{
  Or{
    PerArg{A1, AnyValue, AnyValue, AnyValue},
    PerArg{A2, AnyValue, AnyValue, AnyValue},
    PerArg{A1, AnyValue, AnyValue, AnyValue},
    PerArg{A2, AnyValue, AnyValue, AnyValue},
    PerArg{A1, AnyValue, AnyValue, AnyValue},
    PerArg{A2, AnyValue, AnyValue, AnyValue},
  },
  Or{
    PerArg{AnyValue, B1, C1, D},
    PerArg{AnyValue, B1, C1, D},
    PerArg{AnyValue, B2, C2, D},
    PerArg{AnyValue, B2, C2, D},
    PerArg{AnyValue, B3, C3, D},
    PerArg{AnyValue, B3, C3, D},
  },
}
```

Then the [basic optimizers](#basic-ruleset-simplifications) will kick in and
detect duplicate `PerArg` rules in `Or` expressions, and delete them:

```go
And{
  Or{
    PerArg{A1, AnyValue, AnyValue, AnyValue},
    PerArg{A2, AnyValue, AnyValue, AnyValue},
  },
  Or{
    PerArg{AnyValue, B1, C1, D},
    PerArg{AnyValue, B2, C2, D},
    PerArg{AnyValue, B3, C3, D},
  },
}
```

... Then, on the next pass, the second inner `Or` rule gets recursively
optimized:

```go
And{
  Or{
    PerArg{A1, AnyValue, AnyValue, AnyValue},
    PerArg{A2, AnyValue, AnyValue, AnyValue},
  },
  And{
    Or{
      PerArg{AnyValue, AnyValue, AnyValue, D},
      PerArg{AnyValue, AnyValue, AnyValue, D},
      PerArg{AnyValue, AnyValue, AnyValue, D},
    },
    Or{
      PerArg{AnyValue, B1, C1, AnyValue},
      PerArg{AnyValue, B2, C2, AnyValue},
      PerArg{AnyValue, B3, C3, AnyValue},
    },
  },
}
```

...
which, after other basic optimizers clean this all up, finally becomes:

```go
And{
  Or{
    PerArg{A1, AnyValue, AnyValue, AnyValue},
    PerArg{A2, AnyValue, AnyValue, AnyValue},
  },
  PerArg{AnyValue, AnyValue, AnyValue, D},
  Or{
    PerArg{AnyValue, B1, C1, AnyValue},
    PerArg{AnyValue, B2, C2, AnyValue},
    PerArg{AnyValue, B3, C3, AnyValue},
  },
}
```

This has turned what would be 24 comparisons into just 9:

- `seccomp_data[0]` must match either predicate `A1` or `A2`.
- `seccomp_data[3]` must match predicate `D`.
- At least one of the following must be true:
    - `seccomp_data[1]` must match predicate `B1` and `seccomp_data[2]` must
      match predicate `C1`.
    - `seccomp_data[1]` must match predicate `B2` and `seccomp_data[2]` must
      match predicate `C2`.
    - `seccomp_data[1]` must match predicate `B3` and `seccomp_data[2]` must
      match predicate `C3`.

To go back to our `fcntl(2)` example, the rules would therefore be rewritten to:

```go
rules = ...(map[uintptr]SyscallRule{
  SYS_FCNTL: And{
    // Check for args[0] exclusively:
    PerArg{NonNegativeFD, AnyValue},
    // Check for args[1] exclusively:
    Or{
      PerArg{AnyValue, EqualTo(F_GETFL)},
      PerArg{AnyValue, EqualTo(F_SETFL)},
      PerArg{AnyValue, EqualTo(F_GETFD)},
    },
  },
})
```

... thus we've turned 6 comparisons into 4. But we can do better still!

</details>

<details markdown="1">

<summary markdown="1">

### Extracting repeated 32-bit match logic from 64-bit argument matchers

We can apply the same optimization, but down to the 32-bit matching logic that
underlies the 64-bit syscall argument matching predicates.

</summary>

As you may recall,
[cBPF instructions are limited to 32-bit math](#cbpf-limitations).
This means
that when rendered, each of these argument comparisons is actually 2 operations:
one for the first 32-bit half of the argument, and one for the second 32-bit
half of the argument.

Let's look at the `F_GETFL`, `F_SETFL`, and `F_GETFD` constants:

```go
F_GETFL = 0x3
F_SETFL = 0x4
F_GETFD = 0x1
```

The cBPF bytecode for checking the arguments of this syscall may therefore look
something like this:

```javascript
// Check for `seccomp_data.args[0]`:
00: load32 16               // Load the first 32 bits of
                            // `seccomp_data.args[0]` into register A.
01: jeq 0, 0, @bad          // If A == 0, continue, otherwise jump to @bad.
02: load32 20               // Load the second 32 bits of
                            // `seccomp_data.args[0]` into register A.
03: jset 0x80000000, @bad, 0  // If A & 0x80000000 != 0, jump to @bad,
                              // otherwise continue.

// Check for `seccomp_data.args[1]`:
04: load32 24               // Load the first 32 bits of
                            // `seccomp_data.args[1]` into register A.
05: jeq 0, 0, @next1        // If A == 0, continue, otherwise jump to @next1.
06: load32 28               // Load the second 32 bits of
                            // `seccomp_data.args[1]` into register A.
07: jeq 0x3, @good, @next1  // If A == 0x3, jump to @good,
                            // otherwise jump to @next1.

@next1:
08: load32 24               // Load the first 32 bits of
                            // `seccomp_data.args[1]` into register A.
09: jeq 0, 0, @next2        // If A == 0, continue, otherwise jump to @next2.
10: load32 28               // Load the second 32 bits of
                            // `seccomp_data.args[1]` into register A.
11: jeq 0x4, @good, @next2  // If A == 0x4, jump to @good,
                            // otherwise jump to @next2.

@next2:
12: load32 24               // Load the first 32 bits of
                            // `seccomp_data.args[1]` into register A.
13: jeq 0, 0, @bad          // If A == 0, continue, otherwise jump to @bad.
14: load32 28               // Load the second 32 bits of
                            // `seccomp_data.args[1]` into register A.
15: jeq 0x1, @good, @bad    // If A == 0x1, jump to @good,
                            // otherwise jump to @bad.

// Good/bad jump targets for the checks above to jump to:
@good:
16: return ALLOW
@bad:
17: return REJECT
```

Clearly this could be better. The first 32 bits must be zero in all possible
cases. So the syscall argument value-matching primitives (e.g. `EqualTo`) may be
split into two 32-bit value matchers:

```go
rules = ...(map[uintptr]SyscallRule{
  SYS_FCNTL: And{
    PerArg{NonNegativeFD, AnyValue},
    Or{
      PerArg{
        AnyValue,
        splitMatcher{
          high32bits: EqualTo32Bits(
            F_GETFL & 0xffffffff00000000 /* = 0 */),
          low32bits: EqualTo32Bits(
            F_GETFL & 0x00000000ffffffff /* = 0x3 */),
        },
      },
      PerArg{
        AnyValue,
        splitMatcher{
          high32bits: EqualTo32Bits(
            F_SETFL & 0xffffffff00000000 /* = 0 */),
          low32bits: EqualTo32Bits(
            F_SETFL & 0x00000000ffffffff /* = 0x4 */),
        },
      },
      PerArg{
        AnyValue,
        splitMatcher{
          high32bits: EqualTo32Bits(
            F_GETFD & 0xffffffff00000000 /* = 0 */),
          low32bits: EqualTo32Bits(
            F_GETFD & 0x00000000ffffffff /* = 0x1 */),
        },
      },
    },
  },
})
```

gVisor then applies the same optimization as earlier, but this time going into
each 32-bit half of each argument.
This means it can extract the
`EqualTo32Bits(0)` matcher from the `high32bits` part of each `splitMatcher` and
move it up to the `And` expression like so:

```go
rules = ...(map[uintptr]SyscallRule{
  SYS_FCNTL: And{
    PerArg{NonNegativeFD, AnyValue},
    PerArg{
      AnyValue,
      splitMatcher{
        high32bits: EqualTo32Bits(0),
        low32bits: Any32BitsValue,
      },
    },
    Or{
      PerArg{
        AnyValue,
        splitMatcher{
          high32bits: Any32BitsValue,
          low32bits: EqualTo32Bits(
            F_GETFL & 0x00000000ffffffff /* = 0x3 */),
        },
      },
      PerArg{
        AnyValue,
        splitMatcher{
          high32bits: Any32BitsValue,
          low32bits: EqualTo32Bits(
            F_SETFL & 0x00000000ffffffff /* = 0x4 */),
        },
      },
      PerArg{
        AnyValue,
        splitMatcher{
          high32bits: Any32BitsValue,
          low32bits: EqualTo32Bits(
            F_GETFD & 0x00000000ffffffff /* = 0x1 */),
        },
      },
    },
  },
})
```

This looks bigger as a tree, but keep in mind that the `AnyValue` and
`Any32BitsValue` matchers do not produce any bytecode. So now let's render that
tree to bytecode:

```javascript
// Check for `seccomp_data.args[0]`:
00: load32 16               // Load the first 32 bits of
                            // `seccomp_data.args[0]` into register A.
01: jeq 0, 0, @bad          // If A == 0, continue, otherwise jump to @bad.
02: load32 20               // Load the second 32 bits of
                            // `seccomp_data.args[0]` into register A.
03: jset 0x80000000, @bad, 0  // If A & 0x80000000 != 0, jump to @bad,
                              // otherwise continue.

// Check for `seccomp_data.args[1]`:
04: load32 24               // Load the first 32 bits of
                            // `seccomp_data.args[1]` into register A.
05: jeq 0, 0, @bad          // If A == 0, continue, otherwise jump to @bad.
06: load32 28               // Load the second 32 bits of
                            // `seccomp_data.args[1]` into register A.
07: jeq 0x3, @good, @next1  // If A == 0x3, jump to @good,
                            // otherwise jump to @next1.

@next1:
08: load32 28               // Load the second 32 bits of
                            // `seccomp_data.args[1]` into register A.
09: jeq 0x4, @good, @next2  // If A == 0x4, jump to @good,
                            // otherwise jump to @next2.

@next2:
10: load32 28               // Load the second 32 bits of
                            // `seccomp_data.args[1]` into register A.
11: jeq 0x1, @good, @bad    // If A == 0x1, jump to @good,
                            // otherwise jump to @bad.

// Good/bad jump targets for the checks above to jump to:
@good:
12: return ALLOW
@bad:
13: return REJECT
```

This is where the bytecode-level optimization to remove redundant loads
[described earlier](#redundant-loads) finally becomes relevant. We don't need to
load the second 32 bits of `seccomp_data.args[1]` multiple times in a row:

```javascript
// Check for `seccomp_data.args[0]`:
00: load32 16               // Load the first 32 bits of
                            // `seccomp_data.args[0]` into register A.
01: jeq 0, 0, @bad          // If A == 0, continue, otherwise jump to @bad.
02: load32 20               // Load the second 32 bits of
                            // `seccomp_data.args[0]` into register A.
03: jset 0x80000000, @bad, 0  // If A & 0x80000000 != 0, jump to @bad,
                              // otherwise continue.

// Check for `seccomp_data.args[1]`:
04: load32 24               // Load the first 32 bits of
                            // `seccomp_data.args[1]` into register A.
05: jeq 0, 0, @bad          // If A == 0, continue, otherwise jump to @bad.
06: load32 28               // Load the second 32 bits of
                            // `seccomp_data.args[1]` into register A.
07: jeq 0x3, @good, @next1  // If A == 0x3, jump to @good,
                            // otherwise jump to @next1.

@next1:
08: jeq 0x4, @good, @next2  // If A == 0x4, jump to @good,
                            // otherwise jump to @next2.

@next2:
09: jeq 0x1, @good, @bad    // If A == 0x1, jump to @good,
                            // otherwise jump to @bad.

// Good/bad jump targets for the checks above to jump to:
@good:
10: return ALLOW
@bad:
11: return REJECT
```

Of course, in practice the `@good`/`@bad` jump targets would also be unified
with rules from other system call filters in order to cut down on those too. And
by having reduced the number of instructions in each individual filtering rule,
the jumps to these targets can be deduplicated against that many more rules.

This example demonstrates how **optimizations build on top of each other**,
making each optimization more likely to make *other* optimizations useful in
turn.

</details>

## Other optimizations

Beyond these, gVisor also has the following minor optimizations.

<details markdown="1">

<summary markdown="1">

### Making `futex(2)` rules faster

[`futex(2)`](https://man7.org/linux/man-pages/man2/futex.2.html) is by far the
system call that gVisor itself calls most often as part of its operation. It is
used for synchronization, so it needs to be very efficient.

</summary>

Its rules used to look like this:

```go
SYS_FUTEX: Or{
  PerArg{
    AnyValue,
    EqualTo(FUTEX_WAIT | FUTEX_PRIVATE_FLAG),
  },
  PerArg{
    AnyValue,
    EqualTo(FUTEX_WAKE | FUTEX_PRIVATE_FLAG),
  },
  PerArg{
    AnyValue,
    EqualTo(FUTEX_WAIT),
  },
  PerArg{
    AnyValue,
    EqualTo(FUTEX_WAKE),
  },
},
```

Essentially a 4-way `Or` between 4 different values allowed for
`seccomp_data.args[1]`. This is all well and good, and the above optimizations
already reduce this to the minimum number of `jeq` comparison operations.

But looking at the actual bit values of the `FUTEX_*` constants above:

```go
FUTEX_WAIT         = 0x00
FUTEX_WAKE         = 0x01
FUTEX_PRIVATE_FLAG = 0x80
```

...
We can see that this is equivalent to checking that no bits other than
`0x01` and `0x80` may be set. It turns out that cBPF has an instruction for
that, so this is now optimized down to two comparison operations:

```javascript
01: load32 24                    // Load the first 32 bits of
                                 // `seccomp_data.args[1]` into register A.
02: jeq 0, 0, @bad               // If A == 0, continue,
                                 // otherwise jump to @bad.
03: load32 28                    // Load the second 32 bits of
                                 // `seccomp_data.args[1]` into register A.
04: jset 0xffffff7e, @bad, @good // If A & ^(0x01 | 0x80) != 0, jump to @bad,
                                 // otherwise jump to @good.
```

</details>

<details markdown="1">

<summary markdown="1">

### Optimizing non-negative FD checks

A lot of syscall arguments are file descriptors (FD numbers), which we need to
filter efficiently.

</summary>

An FD is a 32-bit positive integer, but it is passed as a 64-bit value, as all
syscall arguments are. Instead of doing a "less than" operation, we can turn
this into a bitwise check: verify that the first half of the 64-bit value is
zero, and that the 31st bit of the second half is not set.

</details>

<details markdown="1">

<summary markdown="1">

### Enforcing consistency of argument-wise matchers

When one syscall argument is checked consistently across all branches of an
`Or`, enforcing that this is the case ensures that the
[optimization for such matchers](#optimize-rulesets) remains effective.

</summary>

The `ioctl(2)` system call takes an FD as one of its arguments. Since it is a
"grab bag" of a system call, gVisor's rules for `ioctl(2)` were similarly spread
across many files and rules, and not all of them checked that the FD argument
was non-negative; some of them simply accepted any value for the FD argument.

Before this optimization work, this meant that the BPF program did less work for
the rules which didn't check the value of the FD argument. However, now that
gVisor [optimizes repeated argument-wise matchers](#optimize-rulesets), it is
actually *cheaper* if *all* `ioctl(2)` rules verify the value of the FD argument
consistently, as that argument check can be performed exactly once for all
possible branches of the `ioctl(2)` rules. So gVisor now has a test that
verifies that this is the case. This is a good example that shows that
**optimization work can lead to improved security**, due to the efficiency gains
that come from applying security checks consistently.

</details>

## `secbench`: Benchmarking `seccomp-bpf` programs {#secbench}

To measure the effectiveness of the above improvements, measuring gVisor
performance itself would be very difficult, because each improvement is a rather
tiny part of the syscall hot path. At the scale of each of these optimizations,
we need to zoom in a bit more.

So gVisor now has
[tooling for benchmarking `seccomp-bpf` programs](https://github.com/google/gvisor/blob/master/test/secbench/).
It works by taking a
[cBPF program along with several possible syscalls](https://github.com/google/gvisor/blob/master/runsc/boot/filter/filter_bench_test.go)
to try with it. It runs a subprocess that installs this program as a
`seccomp-bpf` filter for itself, replacing all actions (other than "approve
syscall") with "return error" in order to avoid crashing. Then it measures the
latency of each syscall. This is then compared against the latency of the very
same syscalls in a subprocess that has an empty `seccomp-bpf` filter (i.e. one
whose only instruction is `return ALLOW`).

Let's measure the effect of the above improvements on a gVisor-like workload.

<details markdown="1">

<summary markdown="1">

### Modeling gVisor `seccomp-bpf` behavior for benchmarking

This can be done by running gVisor under `ptrace` to see what system calls the
gVisor process is doing.

</summary>

Note that `ptrace` here refers to the mechanism by which we can inspect the
system calls that the gVisor Sentry is making. This is distinct from the system
calls the *sandboxed* application is doing. It also has nothing to do with
gVisor's former "ptrace" platform.

For example, after running a Postgres benchmark inside gVisor with Systrap, the
`ptrace` tool generated the following summary table:

```markdown
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.10  431.799048         496    870063     46227 futex
  4.23   29.399526         106    275649        38 nanosleep
  0.87    6.032292          37    160201           sendmmsg
  0.28    1.939492          16    115769           fstat
 27.96  194.415343        2787     69749       137 ppoll
  1.05    7.298717         315     23131           fsync
  0.06    0.446930          31     14096           pwrite64
  3.37   23.398106        1907     12266         9 epoll_pwait
  0.00    0.019711           9      1991         6 close
  0.02    0.116739          82      1414           tgkill
  0.01    0.068481          48      1414       201 rt_sigreturn
  0.02    0.147048         104      1413           getpid
  0.01    0.045338          41      1080           write
  0.01    0.039876          37      1056           read
  0.00    0.015637          18       836        24 openat
  0.01    0.066699          81       814           madvise
  0.00    0.029757         111       267           fallocate
  0.00    0.006619          15       420           pread64
  0.00    0.013334          35       375           sched_yield
  0.00    0.008112         114        71           pwritev2
  0.00    0.003005          57        52           munmap
  0.00    0.000343          18        19         6 unlinkat
  0.00    0.000249          15        16           shutdown
  0.00    0.000100           8        12           getdents64
  0.00    0.000045           4        10           newfstatat
...
------ ----------- ----------- --------- --------- ----------------
100.00  695.311111         447   1552214     46651 total
```

To mimic the syscall profile of this gVisor sandbox from the perspective of
`seccomp-bpf` overhead, we need to have it call these system calls with the same
relative frequency. Therefore, the dimension that matters here isn't `time` or
`seconds` or even `usecs/call`; it is actually just the number of system calls
(`calls`). In graph form:

![Sentry syscall profile](/assets/images/2024-02-01-gvisor-seccomp-sentry-syscall-profile.png "Sentry syscall profile"){:style="max-width:100%"}

The Pareto distribution of system calls becomes immediately clear.

</details>

### `seccomp-bpf` filtering overhead reduction

The `secbench` library lets us take the top 10 system calls, measure their
`seccomp-bpf` filtering overhead individually, and build a weighted aggregate of
their overall overhead. Here are the numbers from before and after the filtering
optimizations described in this post:

![Systrap seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-systrap.png "Systrap seccomp-bpf performance"){:style="max-width:100%"}

The `nanosleep(2)` system call is a bit of an oddball here. Unlike the others,
this system call causes the current thread to be descheduled. To make the
results more legible, here is the same data with the duration normalized to the
`seccomp-bpf` filtering overhead from before optimizations:

![Systrap seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-systrap-normalized.png "Systrap seccomp-bpf performance"){:style="max-width:100%"}

This shows that most system calls have had their filtering overhead reduced,
while others haven't significantly changed (10% or less change in either
direction).
This is to be expected: those that have not changed are the ones that are
cacheable: `nanosleep(2)`, `fstat(2)`, `ppoll(2)`, `fsync(2)`, `pwrite64(2)`,
`close(2)`, `getpid(2)`. The non-cacheable syscalls
[which have dedicated checks](#structure) before the main BST, `futex(2)` and
`sendmmsg(2)`, experienced the biggest boost. Lastly, `epoll_pwait(2)` is
non-cacheable but doesn't have a dedicated check before the main BST, so while
it still sees a small performance gain, that gain is smaller than its
counterparts'.

The "Aggregate" number comes from the `secbench` library and represents the
total time difference spent in system calls after calling them using weighted
randomness. It represents the average system call overhead that a Sentry using
Systrap would incur. Per these numbers, these optimizations removed ~29% of
gVisor's overall `seccomp-bpf` filtering overhead.

Here is the same data for KVM, which has a slightly different syscall profile,
with `ioctl(2)` and `rt_sigreturn(2)` being critical for performance:

![KVM seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-kvm-normalized.png "KVM seccomp-bpf performance"){:style="max-width:100%"}

Lastly, let's look at GPU workload performance. This benchmark enables gVisor's
[experimental `nvproxy` feature for GPU support](/blog/2023/06/20/gpu-pytorch-stable-diffusion/).
What matters for this workload is `ioctl(2)` performance, as this is the system
call used to issue commands to the GPU.
Here is the `seccomp-bpf` filtering
overhead of various CUDA control commands issued via `ioctl(2)`:

![nvproxy ioctl seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-nvproxy-ioctl.png "nvproxy ioctl seccomp-bpf performance"){:style="max-width:100%"}

As `nvproxy` adds a lot of complexity to the `ioctl(2)` filtering rules, this is
where we see the most improvement from these optimizations.

## `secfuzz`: Fuzzing `seccomp-bpf` programs {#secfuzz}

To ensure that the optimizations above don't accidentally produce a cBPF program
that behaves differently from the unoptimized one, gVisor also has
[`seccomp-bpf` fuzz tests](https://github.com/google/gvisor/blob/master/test/secfuzz/).

Because gVisor knows which high-level filters went into constructing the
`seccomp-bpf` program, it also
[automatically generates test cases](https://github.com/google/gvisor/blob/master/runsc/boot/filter/filter_fuzz_test.go)
from these filters, and the fuzzer verifies that every line and every branch of
the optimized cBPF bytecode is executed, and that the result is the same as
giving the same input to the unoptimized program.

(Line or branch coverage of the unoptimized program is not enforceable, because
without optimizations, the bytecode contains many redundant checks whose later
branches can never be reached.)

## Optimizing in-gVisor `seccomp-bpf` filtering

gVisor supports sandboxed applications adding `seccomp-bpf` filters onto
themselves, and
[implements its own cBPF interpreter](https://github.com/google/gvisor/blob/master/pkg/bpf/interpreter.go)
for this purpose.
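To sketch what such an interpreter loop involves, here is a toy cBPF-style
evaluator with a made-up three-instruction subset (`load32`, `jeq`, `ret`);
gVisor's real interpreter supports the full cBPF instruction set and is
considerably more involved:

```go
package main

import "fmt"

// A minimal cBPF-style instruction set; illustrative, not gVisor's code.
type op int

const (
	opLoad32 op = iota // A = input[k] (k is a word index here, for brevity)
	opJeq              // if A == k, skip jt instructions, else skip jf
	opRet              // return k
)

type insn struct {
	code   op
	k      uint32
	jt, jf uint8
}

// run interprets the program against the input words and returns the
// program's return value (e.g. an allow/reject action).
func run(prog []insn, input []uint32) uint32 {
	var a uint32 // The accumulator register A.
	for pc := 0; pc < len(prog); pc++ {
		ins := prog[pc]
		switch ins.code {
		case opLoad32:
			a = input[ins.k]
		case opJeq:
			if a == ins.k {
				pc += int(ins.jt)
			} else {
				pc += int(ins.jf)
			}
		case opRet:
			return ins.k
		}
	}
	return 0
}

func main() {
	const allow, reject = 0x7fff0000, 0
	// "If input word 0 == 42, allow; otherwise reject."
	prog := []insn{
		{code: opLoad32, k: 0},
		{code: opJeq, k: 42, jt: 0, jf: 1},
		{code: opRet, k: allow},
		{code: opRet, k: reject},
	}
	fmt.Println(run(prog, []uint32{42}) == allow)  // true
	fmt.Println(run(prog, []uint32{7}) == reject)  // true
}
```

Note how every filtered syscall pays for one trip through this loop, which is
why the caching described below is worthwhile.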

Because the cBPF bytecode-level optimizations are lossless and generally
applicable to any cBPF program, they are applied to programs uploaded by
sandboxed applications, making filter evaluation faster in gVisor itself.

Additionally, gVisor removed the use of Go interfaces previously used for
loading data from the BPF "input" (i.e. the `seccomp_data` struct for
`seccomp-bpf`). This used to require an endianness-specific interface due to how
the BPF interpreter was used in two places in gVisor: network processing (which
uses network byte ordering), and `seccomp-bpf` (which uses native byte
ordering). This interface has now been replaced with
[Go generics](https://go.dev/doc/tutorial/generics), yielding a 2x speedup on
[the reference simplistic `seccomp-bpf` filter](#sample-filter). The more `load`
instructions are in the filter, the bigger the effect. *(Naturally, this also
benefits network filtering performance!)*

### gVisor cBPF interpreter performance

The graph below shows the gVisor cBPF interpreter's performance against three
sample filters: [the reference simplistic `seccomp-bpf` filter](#sample-filter),
and optimized vs unoptimized versions of gVisor's own syscall filter (to
represent a more complex filter).

![gVisor cBPF interpreter performance](/assets/images/2024-02-01-gvisor-seccomp-interpreter.png "gVisor cBPF interpreter performance"){:style="max-width:100%"}

### `seccomp-bpf` filter result caching for sandboxed applications

Lastly, gVisor now also implements an in-sandbox caching mechanism for syscalls
which do not depend on the `instruction_pointer` or syscall arguments.
Unlike
Linux's `seccomp-bpf` cache, gVisor's implementation also handles actions other
than "allow", and supports the entire set of cBPF instructions rather than the
restricted emulator Linux uses for cacheability evaluation purposes. This
removes the interpreter from the syscall hot path entirely for cacheable
syscalls, further speeding up system calls from applications that use
`seccomp-bpf` within gVisor.

![gVisor seccomp-bpf cache](/assets/images/2024-02-01-gvisor-seccomp-cache.png "gVisor seccomp-bpf cache"){:style="max-width:100%"}

## Faster gVisor startup via filter precompilation

Due to these optimizations, the overall process of building the syscall
filtering rules, rendering them to cBPF bytecode, and running all the
optimizations can take quite a while (~10ms). As one of gVisor's strengths is
its startup latency being much faster than that of VMs, this is an unacceptable
delay.

Therefore, gVisor now
[precompiles the rules](https://github.com/google/gvisor/blob/master/pkg/seccomp/precompiledseccomp/)
to optimized cBPF bytecode for most possible gVisor configurations. This means
the `runsc` binary contains cBPF bytecode embedded in it for some subset of
popular configurations, and it will use this bytecode rather than compiling the
cBPF program from scratch during startup. If `runsc` is invoked with a
configuration for which the cBPF bytecode isn't embedded in the `runsc` binary,
it will fall back to compiling the program from scratch.

<details markdown="1">

<summary markdown="1">

### Dealing with dynamic values in precompiled rules

</summary>

One challenge with this approach is supporting parts of the configuration that
are only known at `runsc` startup time.
For example, many filters act on a
specific file descriptor used for interacting with the `runsc` process after
startup over a Unix Domain Socket (called the "controller FD"). This is an
integer that is only known at runtime, so its value cannot be embedded inside
the optimized cBPF bytecode prepared at `runsc` compilation time.

To address this, the `seccomp-bpf` precompilation tooling supports the notion of
32-bit "variables", and takes as input a function that renders cBPF bytecode for
a given key-value mapping of variables to placeholder 32-bit values. The
precompiler calls this function *twice* with different arbitrary value mappings
for each variable, and observes where these arbitrary values show up in the
generated cBPF bytecode. This takes advantage of the fact that gVisor's
`seccomp-bpf` program generation is deterministic.

If the two cBPF programs are of the same byte length, the placeholder values
show up at exactly the same byte offsets within the cBPF bytecode both times,
and the rest of the cBPF bytecode is byte-for-byte equivalent, then the
precompiler has very high confidence that these offsets are where the 32-bit
variables are represented in the cBPF bytecode. It then stores these offsets as
part of the embedded data inside the `runsc` binary. Finally, at `runsc`
execution time, the bytes at these offsets are replaced with the now-known
values of the variables.

</details>

## OK that's great and all, but is gVisor actually faster? {#performance}

The short answer is: **yes, but only slightly**. As we
[established earlier](#performance-considerations), `seccomp-bpf` is only a
small portion of gVisor's total overhead, and the `secbench` benchmarks show
that this work only removes a portion of that overhead, so we should not expect
large differences here.

Let's come back to the trusty ABSL build benchmark, with a new build of gVisor
with all of these optimizations turned on:

![ABSL build performance](/assets/images/2024-02-01-gvisor-seccomp-absl-vs-unsandboxed.png "ABSL build performance"){:style="max-width:100%"}

Let's zoom the vertical axis in on the gVisor variants to see the difference
better:

![ABSL build performance](/assets/images/2024-02-01-gvisor-seccomp-absl.png "ABSL build performance"){:style="max-width:100%"}

This is about in line with what the earlier benchmarks showed. The initial
benchmarks showed that `seccomp-bpf` filtering overhead for this benchmark was
on the order of ~3.6% of total runtime, and the `secbench` benchmarks showed
that the optimizations reduced `seccomp-bpf` filter evaluation time by ~29% in
aggregate. The final absolute reduction in total runtime should then be around
~1%, which is just about what this result shows.

Other benchmarks show a similar pattern. Here's the gRPC build, similar to ABSL:

![gRPC build performance](/assets/images/2024-02-01-gvisor-seccomp-grpc-vs-unsandboxed.png "gRPC build performance"){:style="max-width:100%"}

![gRPC build performance](/assets/images/2024-02-01-gvisor-seccomp-grpc.png "gRPC build performance"){:style="max-width:100%"}

Here's a benchmark running the
[Ruby Fastlane](https://github.com/fastlane/fastlane) test suite:

![Ruby Fastlane performance](/assets/images/2024-02-01-gvisor-seccomp-rubydev-vs-unsandboxed.png "Ruby Fastlane performance"){:style="max-width:100%"}

![Ruby Fastlane performance](/assets/images/2024-02-01-gvisor-seccomp-rubydev.png "Ruby Fastlane performance"){:style="max-width:100%"}

Here's the 50th percentile of nginx serving latency for an empty webpage.
[Every microsecond counts when it comes to web serving](https://www.prnewswire.com/news-releases/akamai-online-retail-performance-report-milliseconds-are-critical-300441498.html),
and here we've shaved off 20 of them.

![nginx performance](/assets/images/2024-02-01-gvisor-seccomp-nginx-vs-unsandboxed.png "nginx performance"){:style="max-width:100%"}

![nginx performance](/assets/images/2024-02-01-gvisor-seccomp-nginx.png "nginx performance"){:style="max-width:100%"}

CUDA workloads also get a boost from this work. Since their gVisor-related
overhead is already relatively small, **`seccomp-bpf` filtering makes up a
higher proportion of their overhead**. Additionally, as the performance
improvements described in this post disproportionately help the `ioctl(2)`
system call, this cuts a larger portion of the `seccomp-bpf` filtering overhead
of these workloads, since CUDA uses the `ioctl(2)` system call to communicate
with the GPU.

![PyTorch performance](/assets/images/2024-02-01-gvisor-seccomp-pytorch-vs-unsandboxed.png "PyTorch performance"){:style="max-width:100%"}

![PyTorch performance](/assets/images/2024-02-01-gvisor-seccomp-pytorch.png "PyTorch performance"){:style="max-width:100%"}

While some of these results may not seem like much in absolute terms, it's
important to remember:

-   These improvements have resulted in gVisor being able to enforce **more**
    `seccomp-bpf` filters than it previously could; gVisor's `seccomp-bpf`
    filter was nearly half the maximum `seccomp-bpf` program size, so it could
    at most double in complexity. After optimizations, it is reduced to less
    than a fourth of this size.
-   These improvements allow the gVisor filters to **scale better**. This is
    visible from the effects on `ioctl(2)` performance with `nvproxy` enabled.
-   The resulting work has produced useful libraries for `seccomp-bpf` tooling
    which may be helpful for other projects: testing, fuzzing, and benchmarking
    `seccomp-bpf` filters.
-   This overhead could not have been addressed in another way. Unlike other
    areas of gVisor, such as network overhead or file I/O, the overhead from
    the host kernel evaluating `seccomp-bpf` filters lives outside of gVisor
    itself, and therefore it can only be improved upon by this type of work.

## Further work

One potential source of work is to look into the performance gap between no
`seccomp-bpf` filter at all versus performance with an empty `seccomp-bpf`
filter (equivalent to an all-cacheable filter). This points to a potential
inefficiency in the Linux kernel implementation of the `seccomp-bpf` cache.

Another potential point of improvement is to port the optimizations that went
into searching for a syscall number over to the
[`ioctl(2)` system call][ioctl]. `ioctl(2)` is a "grab-bag" kind of system
call, used by many drivers and other subsets of the Linux kernel to extend the
syscall interface without using up valuable syscall numbers. For example, the
[KVM](https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine) subsystem is
almost entirely controlled through `ioctl(2)` system calls issued against
`/dev/kvm` or against per-VM file descriptors.

For this reason, the first non-file-descriptor argument of [`ioctl(2)`][ioctl]
("request") usually encodes something analogous to what the syscall number
usually represents: the type of request made to the kernel. Currently, gVisor
performs a linear scan through all possible enumerations of this argument. This
is usually fine, but with features like `nvproxy` which massively expand this
list of possible values, this can take a long time.
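To sketch why a search structure would help (a hypothetical illustration, not
gVisor's actual cBPF generation): a linear scan over N allowed request values
costs O(N) comparisons in the worst case, while a search over a sorted list
costs O(log N). In cBPF this would be unrolled into a tree of conditional
jumps, but the shape of the computation is that of a plain binary search:

```go
package main

import (
	"fmt"
	"sort"
)

// allowedIoctlRequest reports whether req appears in the sorted allowlist,
// using O(log N) comparisons instead of a linear scan. In generated cBPF,
// this comparison tree would be emitted as conditional jump instructions.
func allowedIoctlRequest(sorted []uint32, req uint32) bool {
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= req })
	return i < len(sorted) && sorted[i] == req
}

func main() {
	// Hypothetical request values for illustration only; nvproxy-style
	// filters contain a much longer list.
	allowlist := []uint32{0x1268, 0x4008ae47, 0xc008ae05, 0xc018ae46}
	fmt.Println(allowedIoctlRequest(allowlist, 0x4008ae47)) // true
	fmt.Println(allowedIoctlRequest(allowlist, 0x12345678)) // false
}
```

With thousands of possible request values, the difference between N and
log₂(N) comparisons per `ioctl(2)` call adds up quickly.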
`ioctl` performance is also critical for gVisor's KVM platform. A binary search
tree would make sense here.

gVisor welcomes further contributions to its `seccomp-bpf` machinery. Thanks
for reading!

[seccomp-bpf]: https://www.kernel.org/doc/html/v4.19/userspace-api/seccomp_filter.html
[BPF]: https://en.wikipedia.org/wiki/Berkeley_Packet_Filter
[EBPF]: https://en.wikipedia.org/wiki/EBPF
[ioctl]: https://man7.org/linux/man-pages/man2/ioctl.2.html

[^1]: cBPF does not have a canonical assembly-style representation. The
    assembly-like code in this blog post is close to
    [the one used in `bpfc`](https://man7.org/linux/man-pages/man8/bpfc.8.html)
    but diverges in ways to make it hopefully clearer as to what's happening,
    and all code is annotated with `// comments`.