gvisor.dev/gvisor@v0.0.0-20240520182842-f9d4d51c7e0f/website/blog/2024-02-01-seccomp.md

     1  # Optimizing seccomp usage in gVisor
     2  
gVisor is a multi-layered security sandbox. [`seccomp-bpf`][seccomp-bpf] is
gVisor's second layer of defense against container escape attacks. gVisor uses
`seccomp-bpf` to have the host kernel filter the system calls that gVisor
itself makes. This significantly reduces the host attack surface that a
compromised gVisor process could otherwise access. However, this layer comes at
a cost: every legitimate system call that gVisor makes must be evaluated
against this filter by the host kernel before it is actually executed. **This
blog post contains more than you ever wanted to know about `seccomp-bpf`, and
explores the past few months of work to optimize gVisor's use of it.**
    12  
    13  ![gVisor and seccomp](/assets/images/2024-02-01-gvisor-seccomp.png "gVisor and seccomp"){:style="max-width:100%"}
    14  <span class="attribution">A diagram showing gVisor's two main layers of
    15  security: gVisor itself, and `seccomp-bpf`. This blog post touches on the
    16  `seccomp-bpf` part.
    17  [Tux logo by Larry Ewing and The GIMP](https://commons.wikimedia.org/wiki/File:Tux.svg).</span>
    18  
    19  --------------------------------------------------------------------------------
    20  
    21  ## Understanding `seccomp-bpf` performance in gVisor {#performance-considerations}
    22  
    23  One challenge with gVisor performance improvement ideas is that it is often very
    24  difficult to estimate how much they will impact performance without first doing
    25  most of the work necessary to actually implement them. Profiling tools help with
    26  knowing where to look, but going from there to numbers is difficult.
    27  
`seccomp-bpf` is one area where such estimation is much more straightforward.
Because it is a secondary layer of defense that lives outside of gVisor, and it
is merely a filter, we can simply yank it out of gVisor and benchmark the
performance we get. While running gVisor in this way is strictly **less
secure** and not a mode that gVisor should support, the numbers we get in this
manner do provide an upper bound on the maximum *potential* performance gains
we could see from optimizations within gVisor's use of `seccomp-bpf`.
    35  
    36  To visualize this, we can run a benchmark with the following variants:
    37  
    38  *   **Unsandboxed**: Unsandboxed performance without gVisor.
    39  *   **gVisor**: gVisor from before any of the performance improvements described
    40      later in this post.
    41  *   **gVisor with empty filter**: Same as **gVisor**, but with the `seccomp-bpf`
    42      filter replaced with one that unconditionally approves every system call.
    43  
From these three variants, we can break down how much of the overhead comes
from gVisor itself versus from its `seccomp-bpf` filtering. The difference
between **gVisor** and **unsandboxed** represents the total gVisor performance
overhead, and the difference between **gVisor** and **gVisor with empty filter**
represents the performance overhead of gVisor's `seccomp-bpf` filtering rules.
    49  
    50  Let's run these numbers for the ABSL build benchmark:
    51  
    52  ![ABSL seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-absl-empty-filter.png "ABSL seccomp-bpf performance"){:style="max-width:100%"}
    53  
    54  We can now use these numbers to give a rough breakdown of where the overhead is
    55  coming from:
    56  
    57  ![ABSL seccomp-bpf performance breakdown](/assets/images/2024-02-01-gvisor-seccomp-absl-breakdown.png "ABSL seccomp-bpf performance breakdown"){:style="max-width:100%"}
    58  
The `seccomp-bpf` overhead is small in absolute terms. The numbers suggest that
optimizing the `seccomp-bpf` filters can shave **up to** 3.4 seconds off the
total ABSL build time, a reduction of total runtime by ~3.6%. However, relative
to gVisor's overhead over unsandboxed time, this means that optimizing the
`seccomp-bpf` filters may remove **up to** ~15% of gVisor overhead, which is
significant. *(Not all benchmarks have this behavior; some benchmarks show
smaller `seccomp-bpf`-related overhead. The overhead is also highly
platform-dependent.)*
    68  
Of course, this level of performance was reached with **empty `seccomp-bpf`
filtering rules**, so we cannot hope to fully realize these gains. However, it
is still useful as an upper bound. Let's see how much of it we can recoup
without compromising security.
    73  
    74  ## A primer on BPF and `seccomp-bpf`
    75  
    76  ### BPF, cBPF, eBPF, oh my!
    77  
    78  [BPF (Berkeley Packet Filter)][BPF] is a virtual machine and eponymous machine
    79  language. Its name comes from its original purpose: filtering packets in a
    80  kernel network stack. However, its use has expanded to other domains of the
    81  kernel where programmability is desirable. Syscall filtering in the context of
    82  `seccomp` is one such area.
    83  
    84  BPF itself comes in two dialects: "Classic BPF" (sometimes stylized as cBPF),
    85  and the now-more-well-known ["Extended BPF" (commonly known as eBPF)][EBPF].
    86  eBPF is a superset of cBPF and is usable extensively throughout the kernel.
    87  However, `seccomp` is not one such area. While
    88  [the topic has been heavily debated](https://lwn.net/Articles/857228/), the
    89  status quo remains that `seccomp` filters may only use cBPF, so this post will
    90  focus on cBPF alone.
    91  
    92  ### So what is `seccomp-bpf` exactly?
    93  
`seccomp-bpf` is a part of the Linux kernel which allows a program to impose
syscall filters on itself. A `seccomp-bpf` filter is a cBPF program that is
given syscall data as input, and outputs an "action" (a 32-bit integer) that
determines what happens to this system call: allow it, reject it, crash the
program, trap execution, etc. The kernel evaluates the cBPF program on every
system call the application makes. The "input" of this cBPF program is the byte
layout of the `seccomp_data` struct, which can be loaded into the registers of
the cBPF virtual machine for analysis.
   102  
   103  Here's what the `seccomp_data` struct looks like in
   104  [Linux's `include/uapi/linux/seccomp.h`](https://github.com/torvalds/linux/blob/master/include/uapi/linux/seccomp.h):
   105  
   106  ```c
   107  struct seccomp_data {
   108      int nr;                     // 32 bits
   109      __u32 arch;                 // 32 bits
   110      __u64 instruction_pointer;  // 64 bits
   111      __u64 args[6];              // 64 bits × 6
   112  };                              // Total 512 bits
   113  ```
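
A `seccomp-bpf` filter addresses these fields by byte offset. As a rough Go
sketch (hypothetical helper names; offsets follow the little-endian x86-64
layout of the struct above), the 32-bit words a filter can load live at:

```go
package main

import "fmt"

// Byte offsets of the 32-bit words a seccomp-bpf filter loads from
// struct seccomp_data, per the layout above.
const (
	offNr   = 0 // offsetof(seccomp_data, nr)
	offArch = 4 // offsetof(seccomp_data, arch)
	offIP   = 8 // offsetof(seccomp_data, instruction_pointer)
)

// argWordOffsets returns the offsets of the low and high 32-bit halves
// of seccomp_data.args[i]; cBPF must load each half separately.
func argWordOffsets(i int) (low, high int) {
	base := 16 + 8*i // args[] starts after nr, arch and instruction_pointer
	return base, base + 4
}

func main() {
	low, high := argWordOffsets(0)
	fmt.Println(offNr, offArch, offIP, low, high)
}
```

These are the `(= 0)` and `(= 4)` offsets that appear in the sample filter
below, and the split into low/high halves is what later forces two loads per
64-bit syscall argument.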
   114  
   115  ### Sample `seccomp-bpf` filter {#sample-filter}
   116  
   117  Here is an example `seccomp-bpf` filter, adapted from the
   118  [Linux kernel documentation](https://www.kernel.org/doc/Documentation/networking/filter.txt)[^1]:
   119  
   120  <!-- Markdown note: This uses "javascript" syntax highlighting because that
   121       happens to work pretty well with this pseudo-assembly-like language.
   122       It is not actually JavaScript. -->
   123  
   124  ```javascript
   125  00: load32 4                // Load 32 bits at offsetof(struct seccomp_data, arch) (= 4)
   126                              //   of the seccomp_data input struct into register A.
   127  01: jeq 0xc000003e, 0, 11   // If A == AUDIT_ARCH_X86_64, jump by 0 instructions [to 02]
   128                              //   else jump by 11 instructions [to 13].
   129  02: load32 0                // Load 32 bits at offsetof(struct seccomp_data, nr) (= 0)
   130                              //   of the seccomp_data input struct into register A.
   131  03: jeq  15,  10,   0       // If A == __NR_rt_sigreturn, jump by 10 instructions [to 14]
   132                              //   else jump by 0 instructions [to 04].
   133  04: jeq 231,   9,   0       // If A == __NR_exit_group, jump by 9 instructions [to 14]
   134                              //   else jump by 0 instructions [to 05].
   135  05: jeq  60,   8,   0       // If A == __NR_exit, jump by 8 instructions [to 14]
   136                              //   else jump by 0 instructions [to 06].
   137  06: jeq   0,   7,   0       // Same thing for __NR_read.
   138  07: jeq   1,   6,   0       // Same thing for __NR_write.
   139  08: jeq   5,   5,   0       // Same thing for __NR_fstat.
   140  09: jeq   9,   4,   0       // Same thing for __NR_mmap.
   141  10: jeq  14,   3,   0       // Same thing for __NR_rt_sigprocmask.
   142  11: jeq  13,   2,   0       // Same thing for __NR_rt_sigaction.
   143  12: jeq  35,   1,   0       // If A == __NR_nanosleep, jump by 1 instruction [to 14]
   144                              //   else jump by 0 instructions [to 13].
   145  13: return 0                // Return SECCOMP_RET_KILL_THREAD
   146  14: return 0x7fff0000       // Return SECCOMP_RET_ALLOW
   147  ```
   148  
   149  This filter effectively allows only the following syscalls: `rt_sigreturn`,
   150  `exit_group`, `exit`, `read`, `write`, `fstat`, `mmap`, `rt_sigprocmask`,
   151  `rt_sigaction`, and `nanosleep`. All other syscalls result in the calling thread
   152  being killed.
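
To make the control flow concrete, here is a toy Go interpreter for the three
instruction kinds used above (`load32`, `jeq`, `return`), evaluating a trimmed
version of the filter that only allows `read(2)` and `write(2)`. The encoding
is a simplified sketch for illustration, not real cBPF bytecode or gVisor code:

```go
package main

import "fmt"

// insn is a simplified instruction form (hypothetical, not real cBPF).
type insn struct {
	op     string // "load32", "jeq" or "return"
	k      uint32 // load offset, comparison value, or return value
	jt, jf uint8  // relative jump offsets for "jeq"
}

// A trimmed version of the allowlist above: x86-64 only, and only
// read(2) (nr 0) and write(2) (nr 1) are allowed.
var sampleProg = []insn{
	{op: "load32", k: 4},                     // A = seccomp_data.arch
	{op: "jeq", k: 0xc000003e, jt: 0, jf: 3}, // AUDIT_ARCH_X86_64?
	{op: "load32", k: 0},                     // A = seccomp_data.nr
	{op: "jeq", k: 0, jt: 2, jf: 0},          // __NR_read
	{op: "jeq", k: 1, jt: 1, jf: 0},          // __NR_write
	{op: "return", k: 0},                     // SECCOMP_RET_KILL_THREAD
	{op: "return", k: 0x7fff0000},            // SECCOMP_RET_ALLOW
}

// run evaluates prog against input, a map from byte offset within
// seccomp_data to the 32-bit word stored there.
func run(prog []insn, input map[uint32]uint32) uint32 {
	var a uint32 // the A register
	for pc := 0; pc < len(prog); {
		in := prog[pc]
		switch in.op {
		case "load32":
			a = input[in.k]
			pc++
		case "jeq":
			if a == in.k {
				pc += 1 + int(in.jt)
			} else {
				pc += 1 + int(in.jf)
			}
		case "return":
			return in.k
		}
	}
	return 0
}

func main() {
	write := map[uint32]uint32{4: 0xc000003e, 0: 1} // write(2) on x86-64
	fmt.Printf("%#x\n", run(sampleProg, write))     // 0x7fff0000 (allow)
}
```

Note how a jump of `jt` instructions lands at `pc + 1 + jt`, matching the
`[to NN]` annotations in the listing above.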
   153  
   154  ### `seccomp-bpf` and cBPF limitations {#cbpf-limitations}
   155  
   156  cBPF is quite limited as a language. The following limitations all factor into
   157  the optimizations described in this blog post:
   158  
-   The cBPF virtual machine only has two 32-bit registers, plus a
    pseudo-register for a 32-bit immediate value. (Note that syscall arguments
    evaluated in the context of `seccomp` are 64-bit values, so you can already
    foresee that this leads to complications.)
   163  -   `seccomp-bpf` programs are limited to 4,096 instructions.
   164  -   Jump instructions can only go forward (this ensures that programs must
   165      halt).
   166  -   Jump instructions may only jump by a fixed ("immediate") number of
   167      instructions. (You cannot say: "jump by whatever this register says".)
   168  -   Jump instructions come in two flavors:
   169      -   "Unconditional" jump instructions, which jump by a fixed number of
   170          instructions. This number must fit in 16 bits.
   171      -   "Conditional" jump instructions, which include a condition expression
   172          and two jump targets:
   173          -   The number of instructions to jump by if the condition is true. This
   174              number must fit in 8 bits, so this cannot jump by more than 255
   175              instructions.
        -   The number of instructions to jump by if the condition is false.
            This number must fit in 8 bits, so this cannot jump by more than
            255 instructions.
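
These jump limits shape how renderers emit code. A renderer's decision when
encoding "jump forward by `n` instructions if the condition is true" can be
sketched in Go (hypothetical pseudo-instruction strings for illustration):

```go
package main

import "fmt"

// encodeJump sketches the choice a cBPF renderer faces: a jump of n
// instructions fits in a conditional jump only if n <= 255 (8 bits).
// Otherwise it must emit a conditional hop over an unconditional jump,
// whose 16-bit offset can reach anywhere in a 4,096-instruction program.
func encodeJump(n int) []string {
	if n <= 255 {
		return []string{fmt.Sprintf("jif <cond>, %d, 0", n)}
	}
	return []string{
		"jif <cond>, 0, 1",         // true: fall through; false: skip the jmp
		fmt.Sprintf("jmp %d", n-1), // the jmp itself consumes one slot
	}
}

func main() {
	fmt.Println(encodeJump(10))  // one conditional jump suffices
	fmt.Println(encodeJump(300)) // conditional hop + unconditional jump
}
```

The two-instruction spill pattern in the long case is exactly what the
bytecode-level jump optimizations described later in this post clean up.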
   179  
   180  ### `seccomp-bpf` caching in Linux
   181  
   182  Since
   183  [Linux kernel version 5.11](https://www.phoronix.com/news/Linux-5.11-SECCOMP-Performance),
when a program uploads a `seccomp-bpf` filter into the kernel,
[Linux runs a BPF emulator](https://github.com/torvalds/linux/commit/8e01b51a31a1e08e2c3e8fcc0ef6790441be2f61)
that looks for system call numbers for which the BPF program doesn't do any
fancy operations or load any bits from the `instruction_pointer` or `args`
fields of the `seccomp_data` input struct, and still returns "allow". When this
is the case, **Linux will cache this information** in a per-syscall-number
bitfield.
   190  
   191  Later, when a cacheable syscall number is executed, the BPF program is not
   192  evaluated at all; since the kernel knows that the program is deterministic and
   193  doesn't depend on the syscall arguments, it can safely allow the syscall without
   194  actually running the BPF program.
   195  
This post uses the term "cacheable" to refer to syscalls that meet these
criteria.
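
The idea can be sketched in Go (hypothetical and heavily simplified; Linux's
real emulator operates on actual cBPF bytecode, and the instruction encoding
below is invented for illustration):

```go
package main

import "fmt"

// insn is a simplified instruction form (hypothetical, not real cBPF).
type insn struct {
	op     string
	k      uint32
	jt, jf uint8
}

const allow = 0x7fff0000 // SECCOMP_RET_ALLOW

// cacheable sketches the caching check: for a given syscall number,
// emulate the filter with nr and arch known. If it returns "allow"
// without ever loading instruction_pointer or args (byte offsets >= 8),
// the decision cannot depend on the arguments, so the kernel may cache
// "allow" for that syscall number.
func cacheable(prog []insn, nr, arch uint32) bool {
	var a uint32
	for pc := 0; pc < len(prog); {
		in := prog[pc]
		switch in.op {
		case "load32":
			if in.k >= 8 {
				return false // reads instruction_pointer or args
			}
			if in.k == 0 {
				a = nr
			} else {
				a = arch
			}
			pc++
		case "jeq":
			if a == in.k {
				pc += 1 + int(in.jt)
			} else {
				pc += 1 + int(in.jf)
			}
		case "return":
			return in.k == allow
		default:
			return false // anything fancier defeats caching
		}
	}
	return false
}

// A filter that allows nr 1 unconditionally, allows nr 2 only if the low
// word of args[0] is zero, and kills everything else.
var filter = []insn{
	{op: "load32", k: 0},            // A = nr
	{op: "jeq", k: 1, jt: 3, jf: 0}, // nr == 1 → allow
	{op: "jeq", k: 2, jt: 0, jf: 3}, // nr == 2 → check args[0]
	{op: "load32", k: 16},           // A = low word of args[0]
	{op: "jeq", k: 0, jt: 0, jf: 1}, // arg check → allow or kill
	{op: "return", k: allow},
	{op: "return", k: 0}, // SECCOMP_RET_KILL_THREAD
}

func main() {
	fmt.Println(cacheable(filter, 1, 0)) // allowed without reading args
	fmt.Println(cacheable(filter, 2, 0)) // depends on args: must run each time
}
```

A `cacheable` result of `false` does not mean the syscall is rejected; it only
means the filter must actually run for it every time.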
   198  
   199  ## How gVisor builds its `seccomp-bpf` filter
   200  
   201  gVisor imposes a `seccomp-bpf` filter on itself as part of Sentry start-up. This
   202  process works as follows:
   203  
   204  -   gVisor gathers bits of configuration that are relevant to the construction
   205      of its `seccomp-bpf` filter. This includes which platform is in use, whether
   206      certain features that require looser filtering are enabled (e.g. host
   207      networking, profiling, GPU proxying, etc.), and certain file descriptors
   208      (FDs) which may be checked against syscall arguments that pass in FDs.
   209  -   gVisor generates a sequence of rulesets from this configuration. A ruleset
   210      is a mapping from syscall number to a predicate that must be true for this
   211      system call, along with an "action" (return code) that is taken should this
   212      predicate be satisfied. For ease of human understanding, the predicate is
   213      often written as a
   214      [disjunctive rule](https://en.wikipedia.org/wiki/Logical_disjunction), for
   215      which each sub-rule is a
   216      [conjunctive rule](https://en.wikipedia.org/wiki/Logical_conjunction) that
   217      verifies each syscall argument. In other words, `(fA(args[0]) && fB(args[1])
   218      && ...) || (fC(args[0]) && fD(args[1]) && ...) || ...`. This is represented
   219      [in gVisor code](https://github.com/google/gvisor/blob/master/runsc/boot/filter/config/config_main.go)
   220      as follows:
   221  
   222      ```go
   223      Or{          // Disjunction rule
   224          PerArg{  // Conjunction rule over each syscall argument
   225              fA,  // Predicate for `seccomp_data.args[0]`
   226              fB,  // Predicate for `seccomp_data.args[1]`
   227              // ... More predicates can go here (up to 6 arguments per syscall)
   228          },
   229          PerArg{  // Conjunction rule over each syscall argument
   230              fC,  // Predicate for `seccomp_data.args[0]`
   231              fD,  // Predicate for `seccomp_data.args[1]`
   232              // ... More predicates can go here (up to 6 arguments per syscall)
   233          },
   234      }
   235      ```
   236  
   237  -   gVisor performs several optimizations on this data structure.
   238  
   239  -   gVisor then renders this list of rulesets into a linear program that looks
   240      close to the final machine language, other than jump offsets which are
   241      initially represented as symbolic named labels during the rendering process.
   242  
   243  -   gVisor then resolves all the labels to their actual instruction index, and
   244      computes the actual jump targets of all jump instructions to obtain valid
   245      cBPF machine code.
   246  
   247  -   gVisor runs further optimizations on this cBPF bytecode.
   248  
   249  -   Finally, the cBPF bytecode is uploaded into the host kernel and the
   250      `seccomp-bpf` filter becomes effective.
   251  
Optimizing the `seccomp-bpf` filter to be more efficient allows the program to
be more compact (i.e. it's possible to pack more complex filters in the 4,096
instruction limit), and to run faster. While `seccomp-bpf` evaluation is
measured in nanoseconds, the impact of any optimization is magnified here,
because host syscalls are an important part of the synchronous "syscall hot
path" that must execute as part of handling certain performance-sensitive
syscalls from the sandboxed application. The relationship is not 1-to-1: a
single application syscall may result in several host syscalls, especially due
to `futex(2)`, which the Sentry calls many times to synchronize its own
operations. Therefore, shaving a nanosecond here and there results in several
shaved nanoseconds in the syscall hot path.
   263  
   264  ## Structural optimizations {#structure}
   265  
   266  The first optimization done for gVisor's `seccomp-bpf` was to turn its linear
   267  search over syscall numbers into a
   268  [binary search tree](https://en.wikipedia.org/wiki/Binary_search_tree). This
   269  turns the search for syscall numbers from `O(n)` to `O(log n)` instructions.
   270  This is a very common `seccomp-bpf` optimization technique which is replicated
   271  in other projects such as
   272  [libseccomp](https://github.com/seccomp/libseccomp/issues/116) and Chromium.
   273  
   274  To do this, a cBPF program basically loads the 32-bit `nr` (syscall number)
   275  field of the `seccomp_data` struct, and does a binary tree traversal of the
   276  [syscall number space](https://chromium.googlesource.com/chromiumos/docs/+/HEAD/constants/syscalls.md#tables).
   277  When it finds a match, it jumps to a set of instructions that check that
   278  syscall's arguments for validity, and then returns allow/reject.
   279  
   280  But why stop here? Let's go further.
   281  
   282  The problem with the binary search tree approach is that it treats all syscall
   283  numbers equally. This is a problem for three reasons:
   284  
1.  Good performance does not matter for disallowed syscalls, because such
    syscalls should never happen during normal program execution.
2.  Good performance does not matter for syscalls which the kernel can cache,
    because the BPF program will only have to run once for these system calls.
3.  For the system calls which are allowed but are not cacheable by the kernel,
    there is a
    [Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution) of
    their relative frequency. To exploit this, we should evaluate the
    most-often-used syscalls faster than the least-often-used ones. The binary
    tree structure does not exploit this distribution, and instead treats all
    syscalls equally.
   297  
   298  So gVisor splits syscall numbers into four sets:
   299  
   300  -   🅰: Non-cacheable 🅰llowed, called very frequently.
   301  -   🅱: Non-cacheable allowed, called once in a 🅱lue moon.
   302  -   🅲: 🅲acheable allowed (whether called frequently or not).
   303  -   🅳: 🅳isallowed (which, by definition, is neither cacheable nor expected to
   304      ever be called).
   305  
   306  Then, the cBPF program is structured in the following layout:
   307  
   308  -   Linear search over allowed frequently-called non-cacheable syscalls (🅰).
   309      These syscalls are ordered in most-frequently-called first (e.g. `futex(2)`
   310      is the first one as it is by far the most-frequently-called system call).
   311  -   Binary search over allowed infrequently-called non-cacheable syscalls (🅱).
   312  -   Binary search over allowed cacheable syscalls (🅲).
   313  -   Reject anything else (🅳).
   314  
   315  This structure takes full advantage of the kernel caching functionality, and of
   316  the Pareto distribution of syscalls.
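
A sketch of this classification in Go (all names, syscall frequencies and the
threshold below are hypothetical; gVisor's real implementation derives these
sets differently):

```go
package main

import (
	"fmt"
	"sort"
)

// syscallInfo is a hypothetical summary of what is known about each
// allowed syscall when laying out the filter.
type syscallInfo struct {
	nr        int
	cacheable bool  // the kernel can cache the "allow" decision (set 🅲)
	frequency int64 // rough relative call frequency; near-zero means rare
}

// Example inputs (x86-64 syscall numbers; frequencies are made up).
var infos = []syscallInfo{
	{nr: 202, frequency: 1000}, // futex: hot, not cacheable
	{nr: 1, frequency: 400},    // write: hot, not cacheable
	{nr: 35, frequency: 2},     // nanosleep: cold, not cacheable
	{nr: 231, cacheable: true}, // exit_group: cacheable
}

// layout splits allowed syscalls into the 🅰/🅱/🅲 orderings described
// above: hot non-cacheable syscalls first (linear scan, most frequent
// first), then cold non-cacheable and cacheable ones (binary-searched,
// so sorted by number). Everything else falls through to 🅳 (reject).
func layout(infos []syscallInfo, hotThreshold int64) (hot, cold, cached []int) {
	freq := map[int]int64{}
	for _, s := range infos {
		freq[s.nr] = s.frequency
		switch {
		case s.cacheable:
			cached = append(cached, s.nr)
		case s.frequency >= hotThreshold:
			hot = append(hot, s.nr)
		default:
			cold = append(cold, s.nr)
		}
	}
	sort.Slice(hot, func(i, j int) bool { return freq[hot[i]] > freq[hot[j]] })
	sort.Ints(cold)
	sort.Ints(cached)
	return hot, cold, cached
}

func main() {
	hot, cold, cached := layout(infos, 100)
	fmt.Println(hot, cold, cached) // futex and write first, in that order
}
```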
   317  
   318  <details markdown="1">
   319  
   320  <summary markdown="1">
   321  
   322  ### Binary search tree optimizations
   323  
   324  Beyond classifying syscalls to see which binary search tree they should be a
   325  part of, gVisor also optimizes the binary search process itself.
   326  
   327  </summary>
   328  
   329  Each syscall number is a node in the tree. When traversing the tree, there are
   330  three options at each point:
   331  
   332  -   The syscall number is an exact match
   333  -   The syscall number is lower than the node's value
   334  -   The syscall number is higher than the node's value
   335  
   336  In order to render the BST as cBPF bytecode, gVisor used to render the following
   337  (in pseudocode):
   338  
   339  ```javascript
   340  if syscall number == current node value
   341      jump @rules_for_this_syscall
   342  if syscall number < current node value
   343      jump @left_node
   344  jump @right_node
   345  
   346  @rules_for_this_syscall:
   347    // Render bytecode for this syscall's filters here...
   348  
   349  @left_node:
   350    // Recursively render the bytecode for the left node value here...
   351  
   352  @right_node:
   353    // Recursively render the bytecode for the right node value here...
   354  ```
   355  
   356  Keep in mind the [cBPF limitations](#cbpf-limitations) here. Because conditional
   357  jumps are limited to 255 instructions, the jump to `@left_node` can be further
   358  than 255 instructions away (especially for syscalls with complex filtering rules
   359  like [`ioctl(2)`](https://man7.org/linux/man-pages/man2/ioctl.2.html)). The jump
   360  to `@right_node` is almost certainly more than 255 instructions away. This means
   361  in actual cBPF bytecode, we would often need to use conditional jumps followed
   362  by unconditional jumps in order to jump so far forward. Meanwhile, the jump to
   363  `@rules_for_this_syscall` would be a very short hop away, but this locality
   364  would only be taken advantage of for a single node of the entire tree for each
   365  traversal.
   366  
   367  Consider this structure instead:
   368  
   369  ```javascript
   370  // Traversal code:
   371    if syscall number < current node value
   372        jump @left_node
   373    if syscall_number > current node value
   374        jump @right_node
   375    jump @rules_for_this_syscall
   376    @left_node:
   377      // Recursively render only the traversal code for the left node here
   378    @right_node:
   379      // Recursively render only the traversal code for the right node here
   380  
   381  // Filtering code:
   382    @rules_for_this_syscall:
   383      // Render bytecode for this syscall's filters here
   384    // Recursively render only the filtering code for the left node here
   385    // Recursively render only the filtering code for the right node here
   386  ```
   387  
   388  This effectively separates the per-syscall rules from the traversal of the BST.
   389  This ensures that the traversal can be done entirely using conditional jumps,
   390  and that for any given execution of the cBPF program, there will be at most one
   391  unconditional jump to the syscall-specific rules.
   392  
   393  This structure is further improvable by taking advantage of the fact that
   394  syscall numbers are a dense space, and so are syscall filter rules. This means
   395  we can often avoid needless comparisons. For example, given the following tree:
   396  
   397  ```
   398        22
   399       /  \
   400      9    24
   401     /    /  \
   402    8   23    50
   403  ```
   404  
   405  Notice that the tree contains `22`, `23`, and `24`. This means that if we get to
   406  node `23`, we do not need to check for syscall number equality, because we've
   407  already established from the traversal that the syscall number must be `23`.
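
This range-pinning trick can be sketched in Go (hypothetical helper, not
gVisor's implementation), counting how many nodes of the example tree can skip
their equality comparison:

```go
package main

import "fmt"

// node is one syscall number in the binary search tree.
type node struct {
	val         int
	left, right *node
}

// Sentinel bounds meaning "no constraint yet" (syscall numbers are small).
const noLo, noHi = -1 << 30, 1 << 30

// The example tree from above: 22, 23 and 24 are all present.
var tree = &node{
	val:  22,
	left: &node{val: 9, left: &node{val: 8}},
	right: &node{val: 24,
		left:  &node{val: 23},
		right: &node{val: 50},
	},
}

// countElidable walks the tree tracking the open interval (lo, hi) of
// syscall numbers that can still reach each node. When that interval
// pins the node's value exactly (lo == val-1 and hi == val+1), the
// equality comparison can be elided from the generated cBPF.
func countElidable(n *node, lo, hi int) int {
	if n == nil {
		return 0
	}
	count := 0
	if lo == n.val-1 && hi == n.val+1 {
		count = 1 // e.g. node 23, reached only when 22 < nr < 24
	}
	return count +
		countElidable(n.left, lo, n.val) +
		countElidable(n.right, n.val, hi)
}

func main() {
	fmt.Println(countElidable(tree, noLo, noHi)) // only node 23 qualifies
}
```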
   408  
   409  </details>
   410  
   411  ## cBPF bytecode optimizations
   412  
   413  gVisor now implements a
   414  [bytecode-level cBPF optimizer](https://github.com/google/gvisor/blob/master/pkg/bpf/optimizer.go)
   415  running a few lossless optimizations. These optimizations are run repeatedly
   416  until the bytecode no longer changes. This is because each type of optimization
   417  tends to feed on the fruits of the others, as we'll see below.
   418  
   419  ![gVisor sentry seccomp-bpf filter program size](/assets/images/2024-02-01-gvisor-seccomp-sentry-filter-size.png "gVisor sentry seccomp-bpf filter program size"){:style="max-width:100%"}
   420  
   421  gVisor's `seccomp-bpf` program size is reduced by over a factor of 4 using the
   422  optimizations below.
   423  
   424  <details markdown="1">
   425  
   426  <summary markdown="1">
   427  
   428  ### Optimizing cBPF jumps
   429  
The [limitations of cBPF jump instructions described earlier](#cbpf-limitations)
mean that typical BPF bytecode rendering code will usually favor unconditional
jumps even when they are not necessary. However, they can be optimized after the
fact.
   434  
   435  </summary>
   436  
BPF bytecode rendering code usually emits the following for a simple condition:
   439  
   440  ```javascript
   441  jif <condition>, 0, 1     // If <condition> is true, continue,
   442                            //   otherwise skip over 1 instruction.
   443  jmp @condition_was_true   // Unconditional jump to label @condition_was_true.
   444  jmp @condition_was_false  // Unconditional jump to label @condition_was_false.
   445  ```
   446  
   447  ... or as follows:
   448  
   449  ```javascript
   450  jif <condition>, 1, 0     // If <condition> is true, jump by 1 instruction,
   451                            //   otherwise continue.
   452  jmp @condition_was_false  // Unconditional jump to label @condition_was_false.
   453  // Flow through here if the condition was true.
   454  ```
   455  
   456  ... In other words, the generated code always uses unconditional jumps, and
   457  conditional jump offsets are always either 0 or 1 instructions forward. This is
   458  because conditional jumps are limited to 8 bits (255 instructions), and it is
   459  not always possible at BPF bytecode rendering time to know ahead of time that
   460  the jump targets (`@condition_was_true`, `@condition_was_false`) will resolve to
   461  an instruction that is close enough ahead that the offset would fit in 8 bits.
   462  The safe thing to do is to always use an unconditional jump. Since unconditional
   463  jump targets have 16 bits to play with, and `seccomp-bpf` programs are limited
   464  to 4,096 instructions, it is always possible to encode a jump using an
   465  unconditional jump instruction.
   466  
   467  But of course, the jump target often *does* fit in 8 bits. So gVisor looks over
   468  the bytecode for optimization opportunities:
   469  
   470  -   **Conditional jumps that jump to unconditional jumps** are rewritten to
   471      their final destination, so long as this fits within the 255-instruction
   472      conditional jump limit.
   473  -   **Unconditional jumps that jump to other unconditional jumps** are rewritten
   474      to their final destination.
   475  -   **Conditional jumps where both branches jump to the same instruction** are
   476      replaced by an unconditional jump to that instruction.
   477  -   **Unconditional jumps with a zero-instruction jump target** are removed.
   478  
The aim of these optimizations is to clean up the needless indirection that is
a byproduct of cBPF bytecode rendering. Once they have all run, all jumps are
as tight as they can be.
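
The first of these rewrites, jump threading, can be sketched in Go
(hypothetical instruction model for illustration, not gVisor's optimizer):

```go
package main

import "fmt"

// insn is a simplified instruction form: "jmp" jumps k instructions
// forward; "jif" jumps jt or jf instructions forward; anything else
// (here "ret") just terminates a chain.
type insn struct {
	op     string
	jt, jf int // targets for "jif" (must fit in 8 bits)
	k      int // target for "jmp" (must fit in 16 bits)
}

// threadJumps rewrites conditional jumps whose target is an
// unconditional jump so they go straight to the final destination, as
// long as the combined offset still fits in a conditional jump's 8 bits.
func threadJumps(prog []insn) {
	for pc, in := range prog {
		if in.op != "jif" {
			continue
		}
		prog[pc].jt = thread(prog, pc, in.jt)
		prog[pc].jf = thread(prog, pc, in.jf)
	}
}

func thread(prog []insn, pc, off int) int {
	target := pc + 1 + off
	for target < len(prog) && prog[target].op == "jmp" {
		next := off + 1 + prog[target].k // hop over the jmp to its target
		if next > 255 {
			break // would no longer fit in 8 bits
		}
		off = next
		target = pc + 1 + off
	}
	return off
}

func main() {
	prog := []insn{
		{op: "jif", jt: 0, jf: 1}, // true → 1 (a jmp), false → 2 (a jmp)
		{op: "jmp", k: 2},         // → 4
		{op: "jmp", k: 2},         // → 5
		{op: "ret"},
		{op: "ret"},
		{op: "ret"},
	}
	threadJumps(prog)
	fmt.Println(prog[0].jt, prog[0].jf) // both branches now skip the jmps
}
```

After this pass, the two `jmp` instructions are no longer reachable from the
conditional jump, which is exactly what the dead-code pass below feeds on.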
   482  
   483  </details>
   484  
   485  <details markdown="1">
   486  
   487  <summary markdown="1">
   488  
   489  ### Removing dead code
   490  
   491  Because cBPF is a very restricted language, it is possible to determine with
   492  certainty that some instructions can never be reached.
   493  
   494  </summary>
   495  
   496  In cBPF, each instruction either:
   497  
   498  -   **Flows** forward (e.g. `load` operations, math operations).
   499  -   **Jumps** by a fixed (immediate) number of instructions.
   500  -   **Stops** the execution immediately (`return` instructions).
   501  
   502  Therefore, gVisor runs a simple program traversal algorithm. It creates a
   503  bitfield with one bit per instruction, then traverses the program and all its
   504  possible branches. Then, all instructions that were never traversed are removed
   505  from the program, and all jump targets are updated to account for these
   506  removals.
   507  
   508  In turn, this makes the program shorter, which makes more jump optimizations
   509  possible.
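
The traversal can be sketched in Go (a simplified instruction model mirroring
the three behaviors listed above; not gVisor's actual code):

```go
package main

import "fmt"

// insn is a simplified instruction form: each instruction either flows
// forward, jumps forward, or stops execution.
type insn struct {
	op     string // "flow", "jif", "jmp" or "ret"
	jt, jf int    // forward offsets for "jif"
	k      int    // forward offset for "jmp"
}

// reachable marks every instruction that some execution can reach,
// starting from instruction 0 and following both sides of each branch.
func reachable(prog []insn) []bool {
	seen := make([]bool, len(prog))
	var visit func(pc int)
	visit = func(pc int) {
		if pc >= len(prog) || seen[pc] {
			return
		}
		seen[pc] = true
		switch prog[pc].op {
		case "flow":
			visit(pc + 1)
		case "jif":
			visit(pc + 1 + prog[pc].jt)
			visit(pc + 1 + prog[pc].jf)
		case "jmp":
			visit(pc + 1 + prog[pc].k)
		case "ret":
			// Execution stops here; no successors.
		}
	}
	visit(0)
	return seen
}

func main() {
	prog := []insn{
		{op: "jif", jt: 0, jf: 1}, // branches to 1 or 2
		{op: "jmp", k: 2},         // jumps to 4
		{op: "ret"},
		{op: "flow"}, // dead: nothing flows or jumps here
		{op: "ret"},
	}
	fmt.Println(reachable(prog)) // instruction 3 is never reached
}
```

Every instruction left unmarked is then deleted, with the jump targets of the
surviving instructions adjusted to account for the removals.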
   510  
   511  </details>
   512  
   513  <details markdown="1">
   514  
   515  <summary markdown="1">
   516  
   517  ### Removing redundant `load` instructions {#redundant-loads}
   518  
   519  cBPF programs filter system calls by inspecting their arguments. To do these
   520  comparisons, this data must first be loaded into the cBPF VM registers. These
   521  load operations can be optimized.
   522  
   523  </summary>
   524  
cBPF's conditional operations (e.g. "is equal to", "is greater than", etc.)
operate on a single 32-bit register called "A". As such, a `seccomp-bpf` program
typically consists of many load operations (`load32`) that each load a 32-bit
value from a given offset of the `seccomp_data` struct into register A, then
perform a comparison on it to see if it matches the filter.
   530  
   531  ```javascript
   532  00: load32 <offset>
   533  01: jif <condition1>, @condition1_was_true, @condition1_was_false
   534  02: load32 <offset>
   535  03: jif <condition2>, @condition2_was_true, @condition2_was_false
   536  // ...
   537  ```
   538  
   539  But when a syscall rule is of the form "this syscall argument must be one of the
   540  following values", we don't need to reload the same value (from the same offset)
   541  multiple times. So gVisor looks for redundant loads like this, and removes them.
   542  
   543  ```javascript
   544  00: load32 <offset>
   545  01: jif <condition1>, @condition1_was_true, @condition1_was_false
   546  02: jif <condition2>, @condition2_was_true, @condition2_was_false
   547  // ...
   548  ```
   549  
Note that syscall arguments are **64-bit** values, whereas the A register is
only 32 bits wide. Therefore, asserting that a syscall argument matches a
predicate usually involves at least 2 `load32` operations on different offsets,
thereby making this optimization useless for the "this syscall argument must be
one of the following values" case. We'll get back to that.
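
The load-elimination idea can be sketched in Go over a string form of the
instructions (a hypothetical simplification: it ignores the jump-target fixup
that the real optimizer must also perform when instructions are removed):

```go
package main

import "fmt"

// removeRedundantLoads drops a load32 whose offset is already loaded in
// register A, i.e. the same load was just performed and only comparisons
// (which never clobber A) have run since.
func removeRedundantLoads(prog []string) []string {
	out := prog[:0:0]
	lastLoad := "" // the load32 instruction currently defining A, if known
	for _, in := range prog {
		isLoad := len(in) > 7 && in[:7] == "load32 "
		switch {
		case isLoad && in == lastLoad:
			continue // A already holds this word; drop the reload
		case isLoad:
			lastLoad = in
		case len(in) > 4 && in[:4] == "jif ":
			// Comparisons leave A intact; lastLoad stays valid.
		default:
			lastLoad = "" // anything else may clobber A
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []string{
		"load32 16", // low word of args[0]
		"jif eq 0, 1, 0",
		"load32 16", // redundant: A still holds the same word
		"jif eq 4, 1, 0",
		"load32 20", // different offset: must stay
		"jif eq 0, 1, 0",
	}
	fmt.Println(removeRedundantLoads(prog))
}
```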
   555  
   556  </details>
   557  
   558  <details markdown="1">
   559  
   560  <summary markdown="1">
   561  
   562  ### Minimizing the number of `return` instructions
   563  
   564  A typical syscall filter program consists of many predicates which return either
   565  "allowed" or "rejected". These are encoded in the bytecode as either `return`
   566  instructions, or jumps to `return` instructions. These instructions can show up
   567  dozens or hundreds of times in the cBPF bytecode in quick succession, presenting
   568  an optimization opportunity.
   569  
   570  </summary>
   571  
Since two `return` instructions with the same immediate return code are exactly
equivalent to one another, it is possible to rewrite jumps to all `return`
instructions that return "allowed" to go to a single `return` instruction that
returns this code (and similarly for "rejected"), so long as the jump offsets
fit within the limits of conditional jumps (255 instructions). In turn, this
makes the program shorter, and therefore makes more jump optimizations possible.
   578  
   579  To implement this optimization, gVisor first replaces all unconditional jump
   580  instructions that go to `return` statements with a copy of that `return`
   581  statement. This removes needless indirection.
   582  
   583  ```javascript
   584      Original bytecode                      New bytecode
   585  00: jeq 0, 0, 1                        00: jeq 0, 0, 1
   586  01: jmp @good                    -->   01: return allowed
   587  02: jmp @bad                     -->   02: return rejected
   588  ...                                    ...
   589  10: jge 0, 0, 1                        10: jge 0, 0, 1
   590  11: jmp @good                    -->   11: return allowed
   591  12: jmp @bad                     -->   12: return rejected
   592  ...                                    ...
   593  100 [@good]: return allowed            100 [@good]: return allowed
   594  101 [@bad]:  return rejected           101 [@bad]:  return rejected
   595  ```
   596  
   597  gVisor then searches for `return` statements which can be entirely removed by
   598  seeing if it is possible to rewrite the rest of the program to jump or flow
   599  through to an equivalent `return` statement (without making the program longer
   600  in the process). In the above example:
   601  
   602  ```javascript
   603      Original bytecode                      New bytecode
   604  00: jeq 0, 0, 1                  -->   00: jeq 0, 99, 100   // Targets updated
   605  01: return allowed                     01: return allowed   // Now dead code
   606  02: return rejected                    02: return rejected  // Now dead code
   607  ...                                    ...
   608  10: jge 0, 0, 1                  -->   10: jge 0, 89, 90    // Targets updated
   609  11: return allowed                     11: return allowed   // Now dead code
   610  12: return rejected                    12: return rejected  // Now dead code
   611  ...                                    ...
   612  100 [@good]: return allowed            100 [@good]: return allowed
   613  101 [@bad]:  return rejected           101 [@bad]:  return rejected
   614  ```
   615  
   616  Finally, the dead code removal pass cleans up the dead `return` statements and
   617  the program becomes shorter.
   618  
   619  ```javascript
   620      Original bytecode                      New bytecode
   621  00: jeq 0, 99, 100               -->   00: jeq 0, 95, 96  // Targets updated
   622  01: return allowed               -->   /* Removed */
   623  02: return rejected              -->   /* Removed */
   624  ...                                    ...
   625  10: jge 0, 89, 90                -->   08: jge 0, 87, 88  // Targets updated
   626  11: return allowed               -->   /* Removed */
   627  12: return rejected              -->   /* Removed */
   628  ...                                    ...
   629  100 [@good]: return allowed            96 [@good]: return allowed
   630  101 [@bad]:  return rejected           97 [@bad]:  return rejected
   631  ```
   632  
   633  While this search is expensive to perform, in a program full of predicates —
   634  which is exactly what `seccomp-bpf` programs are — this approach massively
   635  reduces program size.
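
As a sketch of the first step (inlining jumps that lead straight to `return`
instructions), here is a minimal model in Go. The `inst` type and the use of
absolute jump targets are simplifications for illustration only; gVisor's real
cBPF assembler uses relative offsets and a richer instruction encoding.

```go
package main

import "fmt"

// inst is a minimal cBPF-like instruction: an opcode plus one operand.
// For "jmp" the operand is an absolute target index (the real bytecode
// uses relative offsets); for "ret" it is the return code.
type inst struct {
	op  string
	arg int
}

// inlineJumpsToReturns replaces every unconditional jump whose target is
// a `ret` instruction with a copy of that `ret`, removing the needless
// indirection. Later passes can then deduplicate and remove dead code.
func inlineJumpsToReturns(prog []inst) []inst {
	out := make([]inst, len(prog))
	copy(out, prog)
	for i, ins := range out {
		if ins.op == "jmp" && prog[ins.arg].op == "ret" {
			out[i] = prog[ins.arg]
		}
	}
	return out
}

func main() {
	const (
		allowed  = 1
		rejected = 0
	)
	prog := []inst{
		{op: "jmp", arg: 2},        // 00: jmp @good
		{op: "jmp", arg: 3},        // 01: jmp @bad
		{op: "ret", arg: allowed},  // 02 [@good]: return allowed
		{op: "ret", arg: rejected}, // 03 [@bad]:  return rejected
	}
	fmt.Println(inlineJumpsToReturns(prog))
}
```

After this pass, both jumps have become `ret` copies, which the dead-code and
jump-target rewriting passes can then shrink further.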
   636  
   637  </details>
   638  
   639  ## Ruleset optimizations {#optimize-rulesets}
   640  
   641  Bytecode-level optimizations are cool, but why stop here? gVisor now also
   642  performs
   643  [`seccomp` ruleset optimizations](https://github.com/google/gvisor/blob/master/pkg/seccomp/seccomp_optimizer.go).
   644  
   645  In gVisor, a `seccomp` `RuleSet` is a mapping from syscall number to a logical
   646  expression named `SyscallRule`, along with a `seccomp-bpf` action (e.g. "allow")
   647  to take when a syscall with that number matches its `SyscallRule`.
   648  
   649  <details markdown="1">
   650  
   651  <summary markdown="1">
   652  
   653  ### Basic ruleset simplifications {#basic-ruleset-simplifications}
   654  
   655  A `SyscallRule` is a predicate over the data contained in the `seccomp_data`
   656  struct (beyond its `nr`). A trivial implementation is `MatchAll`, which simply
   657  matches any `seccomp_data`. Other implementations include `Or` and `And` (which
   658  do what they sound like), and `PerArg` which applies predicates to each specific
   659  argument of a `seccomp_data`, and forms the meat of actual syscall filtering
   660  rules. Some basic simplifications are already possible with these building
   661  blocks.
   662  
   663  </summary>
   664  
   665  gVisor implements the following basic optimizers. They may look useless on
   666  their own, but they end up considerably simplifying the logic of the more
   667  complex optimizers described in other sections:
   668  
   669  -   `Or` and `And` rules with a single predicate within them are replaced with
   670      just that predicate.
   671  -   Duplicate predicates within `Or` and `And` rules are removed.
   672  -   `Or` rules within `Or` rules are flattened.
   673  -   `And` rules within `And` rules are flattened.
   674  -   An `Or` rule which contains a `MatchAll` predicate is replaced with
   675      `MatchAll`.
   676  -   `MatchAll` predicates within `And` rules are removed.
   677  -   `PerArg` rules with `MatchAll` predicates for each argument are replaced
   678      with a rule that matches anything.
   679  
   680  As with the bytecode-level optimizations, gVisor runs these in a loop until the
   681  structure of the rules no longer changes. With the basic optimizations above,
   682  this silly-looking rule:
   683  
   684  ```go
   685  Or{
   686      Or{
   687          And{
   688              MatchAll,
   689              PerArg{AnyValue, EqualTo(2), AnyValue},
   690          },
   691          PerArg{AnyValue, EqualTo(2), AnyValue},
   692      },
   693      PerArg{AnyValue, EqualTo(2), AnyValue},
   694      PerArg{AnyValue, EqualTo(2), AnyValue},
   695  }
   696  ```
   697  
   698  ... is simplified down to just `PerArg{AnyValue, EqualTo(2), AnyValue}`.
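
To make this concrete, here is a stripped-down sketch of such a simplifier in
Go. The type names mirror the post, but this is an illustrative model, not
gVisor's actual `SyscallRule` API; it implements only flattening, duplicate
removal, single-child collapsing, and the `MatchAll` identities.

```go
package main

import "fmt"

// rule is a stripped-down model of a SyscallRule tree.
type rule interface{ String() string }

type matchAll struct{}
type perArg struct{ desc string }
type or []rule
type and []rule

func (matchAll) String() string { return "MatchAll" }
func (p perArg) String() string { return p.desc }
func (o or) String() string     { return fmt.Sprintf("Or%v", []rule(o)) }
func (a and) String() string    { return fmt.Sprintf("And%v", []rule(a)) }

// simplify applies the basic rewrites in a loop until the structure of
// the rule no longer changes.
func simplify(r rule) rule {
	for {
		next := simplifyOnce(r)
		if next.String() == r.String() {
			return next
		}
		r = next
	}
}

func simplifyOnce(r rule) rule {
	switch n := r.(type) {
	case or:
		var kids []rule
		seen := map[string]bool{}
		add := func(k rule) {
			if !seen[k.String()] { // Drop duplicate predicates.
				seen[k.String()] = true
				kids = append(kids, k)
			}
		}
		for _, k := range n {
			k = simplifyOnce(k)
			if _, ok := k.(matchAll); ok {
				return matchAll{} // Or containing MatchAll matches everything.
			}
			if inner, ok := k.(or); ok { // Flatten Or-in-Or.
				for _, kk := range inner {
					add(kk)
				}
				continue
			}
			add(k)
		}
		if len(kids) == 1 {
			return kids[0]
		}
		return or(kids)
	case and:
		var kids []rule
		for _, k := range n {
			k = simplifyOnce(k)
			if _, ok := k.(matchAll); ok {
				continue // MatchAll inside And is a no-op.
			}
			if inner, ok := k.(and); ok { // Flatten And-in-And.
				kids = append(kids, inner...)
				continue
			}
			kids = append(kids, k)
		}
		switch len(kids) {
		case 0:
			return matchAll{}
		case 1:
			return kids[0]
		}
		return and(kids)
	}
	return r
}

func main() {
	arg1Is2 := perArg{"PerArg{AnyValue, EqualTo(2), AnyValue}"}
	silly := or{
		and{matchAll{}, arg1Is2},
		arg1Is2,
		arg1Is2,
	}
	fmt.Println(simplify(silly))
}
```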
   699  
   700  </details>
   701  
   702  <details markdown="1">
   703  
   704  <summary markdown="1">
   705  
   706  ### Extracting repeated argument matchers
   707  
   708  This is the main optimization that gVisor performs on rulesets. gVisor looks for
   709  common argument matchers that are repeated across all combinations of *other*
   710  argument matchers in branches of an `Or` rule. It removes them from these
   711  `PerArg` rules, and `And` the overall syscall rule with a single instance of
   712  that argument matcher. Sound complicated? Let's look at an example.
   713  
   714  </summary>
   715  
   716  In the
   717  [gVisor Sentry `seccomp-bpf` configuration](https://github.com/google/gvisor/blob/master/runsc/boot/filter/config/),
   718  these are the rules for the
   719  [`fcntl(2)` system call](https://man7.org/linux/man-pages/man2/fcntl.2.html):
   720  
   721  ```go
   722  rules = ...(map[uintptr]SyscallRule{
   723      SYS_FCNTL: Or{
   724          PerArg{
   725              NonNegativeFD,
   726              EqualTo(F_GETFL),
   727          },
   728          PerArg{
   729              NonNegativeFD,
   730              EqualTo(F_SETFL),
   731          },
   732          PerArg{
   733              NonNegativeFD,
   734              EqualTo(F_GETFD),
   735          },
   736      },
   737  })
   738  ```
   739  
   740  ... This means that for the `fcntl(2)` system call, `seccomp_data.args[0]` may
   741  be any non-negative number, `seccomp_data.args[1]` may be one of `F_GETFL`,
   742  `F_SETFL`, or `F_GETFD`, and all other `seccomp_data` fields may be any value.
   743  
   744  If rendered naively in BPF, this would iterate over each branch of the `Or`
   745  expression, and re-check the `NonNegativeFD` predicate each time. Clearly
   746  wasteful. Conceptually, the ideal expression is something like this:
   747  
   748  ```go
   749  rules = ...(map[uintptr]SyscallRule{
   750      SYS_FCNTL: PerArg{
   751          NonNegativeFD,
   752          AnyOf(F_GETFL, F_SETFL, F_GETFD),
   753      },
   754  })
   755  ```
   756  
   757  ... But going through all the syscall rules to look for this pattern would be
   758  quite tedious, and some of them are actually `Or`'d from multiple
   759  `map[uintptr]SyscallRule` in different files (e.g. platform-dependent syscalls),
   760  so they cannot all be specified in a single location with a single predicate on
   761  `seccomp_data.args[1]`. So gVisor needs to detect this programmatically at
   762  optimization time.
   763  
   764  Conceptually, gVisor goes from:
   765  
   766  ```go
   767  Or{
   768      PerArg{A1, B1, C1, D},
   769      PerArg{A2, B1, C1, D},
   770      PerArg{A1, B2, C2, D},
   771      PerArg{A2, B2, C2, D},
   772      PerArg{A1, B3, C3, D},
   773      PerArg{A2, B3, C3, D},
   774  }
   775  ```
   776  
   777  ... to (after one pass):
   778  
   779  ```go
   780  And{
   781      Or{
   782          PerArg{A1, AnyValue, AnyValue, AnyValue},
   783          PerArg{A2, AnyValue, AnyValue, AnyValue},
   784          PerArg{A1, AnyValue, AnyValue, AnyValue},
   785          PerArg{A2, AnyValue, AnyValue, AnyValue},
   786          PerArg{A1, AnyValue, AnyValue, AnyValue},
   787          PerArg{A2, AnyValue, AnyValue, AnyValue},
   788      },
   789      Or{
   790          PerArg{AnyValue, B1, C1, D},
   791          PerArg{AnyValue, B1, C1, D},
   792          PerArg{AnyValue, B2, C2, D},
   793          PerArg{AnyValue, B2, C2, D},
   794          PerArg{AnyValue, B3, C3, D},
   795          PerArg{AnyValue, B3, C3, D},
   796      },
   797  }
   798  ```
   799  
   800  Then the [basic optimizers](#basic-ruleset-simplifications) will kick in and
   801  detect duplicate `PerArg` rules in `Or` expressions, and delete them:
   802  
   803  ```go
   804  And{
   805      Or{
   806          PerArg{A1, AnyValue, AnyValue, AnyValue},
   807          PerArg{A2, AnyValue, AnyValue, AnyValue},
   808      },
   809      Or{
   810          PerArg{AnyValue, B1, C1, D},
   811          PerArg{AnyValue, B2, C2, D},
   812          PerArg{AnyValue, B3, C3, D},
   813      },
   814  }
   815  ```
   816  
   817  ... Then, on the next pass, the second inner `Or` rule gets recursively
   818  optimized:
   819  
   820  ```go
   821  And{
   822      Or{
   823          PerArg{A1, AnyValue, AnyValue, AnyValue},
   824          PerArg{A2, AnyValue, AnyValue, AnyValue},
   825      },
   826      And{
   827          Or{
   828              PerArg{AnyValue, AnyValue, AnyValue, D},
   829              PerArg{AnyValue, AnyValue, AnyValue, D},
   830              PerArg{AnyValue, AnyValue, AnyValue, D},
   831          },
   832          Or{
   833              PerArg{AnyValue, B1, C1, AnyValue},
   834              PerArg{AnyValue, B2, C2, AnyValue},
   835              PerArg{AnyValue, B3, C3, AnyValue},
   836          },
   837      },
   838  }
   839  ```
   840  
   841  ... which, after other basic optimizers clean this all up, finally becomes:
   842  
   843  ```go
   844  And{
   845      Or{
   846          PerArg{A1, AnyValue, AnyValue, AnyValue},
   847          PerArg{A2, AnyValue, AnyValue, AnyValue},
   848      },
   849      PerArg{AnyValue, AnyValue, AnyValue, D},
   850      Or{
   851          PerArg{AnyValue, B1, C1, AnyValue},
   852          PerArg{AnyValue, B2, C2, AnyValue},
   853          PerArg{AnyValue, B3, C3, AnyValue},
   854      },
   855  }
   856  ```
   857  
   858  This has turned what would be 24 comparisons into just 9:
   859  
   860  -   `seccomp_data.args[0]` must match either predicate `A1` or `A2`.
   861  -   `seccomp_data.args[3]` must match predicate `D`.
   862  -   At least one of the following must be true:
   863      -   `seccomp_data.args[1]` must match predicate `B1` and
   864          `seccomp_data.args[2]` must match predicate `C1`.
   865      -   `seccomp_data.args[1]` must match predicate `B2` and
   866          `seccomp_data.args[2]` must match predicate `C2`.
   867      -   `seccomp_data.args[1]` must match predicate `B3` and
   868          `seccomp_data.args[2]` must match predicate `C3`.
   869  
   870  To go back to our `fcntl(2)` example, the rules would therefore be rewritten to:
   871  
   872  ```go
   873  rules = ...(map[uintptr]SyscallRule{
   874      SYS_FCNTL: And{
   875          // Check for args[0] exclusively:
   876          PerArg{NonNegativeFD, AnyValue},
   877          // Check for args[1] exclusively:
   878          Or{
   879              PerArg{AnyValue, EqualTo(F_GETFL)},
   880              PerArg{AnyValue, EqualTo(F_SETFL)},
   881              PerArg{AnyValue, EqualTo(F_GETFD)},
   882          },
   883      },
   884  })
   885  ```
   886  
   887  ... thus we've turned 6 comparisons into 4. But we can do better still!
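
A simplified sketch of this extraction pass in Go, handling only the easiest
case, where one argument position holds the exact same matcher in every branch
of an `Or` (gVisor's real optimizer also handles matchers repeated across
combinations of the *other* arguments). Matchers are modeled as plain strings
for illustration, with `"Any"` standing in for `AnyValue`:

```go
package main

import "fmt"

// perArg models a PerArg rule as one matcher name per syscall argument.
type perArg []string

// extractCommon finds argument positions whose matcher is identical in
// every branch, moves that matcher into a single up-front check (the
// `And` part), and replaces it with "Any" in each branch. It assumes a
// non-empty branch list with uniform arity.
func extractCommon(branches []perArg) (common perArg, rest []perArg) {
	nargs := len(branches[0])
	common = make(perArg, nargs)
	rest = make([]perArg, len(branches))
	for i := range branches {
		rest[i] = append(perArg(nil), branches[i]...) // Deep copy.
	}
	for arg := 0; arg < nargs; arg++ {
		same := true
		for _, b := range branches {
			if b[arg] != branches[0][arg] {
				same = false
				break
			}
		}
		if same {
			common[arg] = branches[0][arg] // Check once, up front.
			for i := range rest {
				rest[i][arg] = "Any" // No longer checked per branch.
			}
		} else {
			common[arg] = "Any"
		}
	}
	return common, rest
}

func main() {
	// The fcntl(2) example: the same FD check appears in every branch.
	branches := []perArg{
		{"NonNegativeFD", "EqualTo(F_GETFL)"},
		{"NonNegativeFD", "EqualTo(F_SETFL)"},
		{"NonNegativeFD", "EqualTo(F_GETFD)"},
	}
	common, rest := extractCommon(branches)
	fmt.Println("And{ PerArg", common, ", Or", rest, "}")
}
```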
   888  
   889  </details>
   890  
   891  <details markdown="1">
   892  
   893  <summary markdown="1">
   894  
   895  ### Extracting repeated 32-bit match logic from 64-bit argument matchers
   896  
   897  We can apply the same optimization, but down to the 32-bit matching logic that
   898  underlies the 64-bit syscall argument matching predicates.
   899  
   900  </summary>
   901  
   902  As you may recall,
   903  [cBPF instructions are limited to 32-bit math](#cbpf-limitations). This means
   904  that when rendered, each of these argument comparisons is actually two
   905  operations: one for the first 32-bit half of the argument, and one for the
   906  second 32-bit half.
   907  
   908  Let's look at the `F_GETFL`, `F_SETFL`, and `F_GETFD` constants:
   909  
   910  ```go
   911  F_GETFL = 0x3
   912  F_SETFL = 0x4
   913  F_GETFD = 0x1
   914  ```
   915  
   916  The cBPF bytecode for checking the arguments of this syscall may therefore look
   917  something like this:
   918  
   919  ```javascript
   920  // Check for `seccomp_data.args[0]`:
   921    00: load32 16                // Load the first 32 bits of
   922                                 //   `seccomp_data.args[0]` into register A.
   923    01: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
   924    02: load32 20                // Load the second 32 bits of
   925                                 //   `seccomp_data.args[0]` into register A.
   926    03: jset 0x80000000, @bad, 0 // If A & 0x80000000 != 0, jump to @bad,
   927                                 //   otherwise continue.
   928  
   929  // Check for `seccomp_data.args[1]`:
   930    04: load32 24                // Load the first 32 bits of
   931                                 //   `seccomp_data.args[1]` into register A.
   932    05: jeq 0, 0, @next1         // If A == 0, continue, otherwise jump to @next1.
   933    06: load32 28                // Load the second 32 bits of
   934                                 //   `seccomp_data.args[1]` into register A.
   935    07: jeq 0x3, @good, @next1   // If A == 0x3, jump to @good,
   936                                 //   otherwise jump to @next1.
   937  
   938  @next1:
   939    08: load32 24                // Load the first 32 bits of
   940                                 //   `seccomp_data.args[1]` into register A.
   941    09: jeq 0, 0, @next2         // If A == 0, continue, otherwise jump to @next2.
   942    10: load32 28                // Load the second 32 bits of
   943                                 //   `seccomp_data.args[1]` into register A.
   944  11: jeq 0x4, @good, @next2   // If A == 0x4, jump to @good,
   945                                 //   otherwise jump to @next2.
   946  
   947  @next2:
   948    12: load32 24                // Load the first 32 bits of
   949                                 //   `seccomp_data.args[1]` into register A.
   950    13: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
   951    14: load32 28                // Load the second 32 bits of
   952                                 //   `seccomp_data.args[1]` into register A.
   953    15: jeq 0x1, @good, @bad     // If A == 0x1, jump to @good,
   954                                 //   otherwise jump to @bad.
   955  
   956  // Good/bad jump targets for the checks above to jump to:
   957  @good:
   958    16: return ALLOW
   959  @bad:
   960    17: return REJECT
   961  ```
   962  
   963  Clearly this could be better. The first 32 bits must be zero in all possible
   964  cases. So the syscall argument value-matching primitives (e.g. `EqualTo`) may be
   965  split into two 32-bit value matchers:
   966  
   967  ```go
   968  rules = ...(map[uintptr]SyscallRule{
   969      SYS_FCNTL: And{
   970          PerArg{NonNegativeFD, AnyValue},
   971          Or{
   972              PerArg{
   973                  AnyValue,
   974                  splitMatcher{
   975                      high32bits: EqualTo32Bits(
   976                        F_GETFL & 0xffffffff00000000 /* = 0 */),
   977                      low32bits:  EqualTo32Bits(
   978                        F_GETFL & 0x00000000ffffffff /* = 0x3 */),
   979                  },
   980              },
   981              PerArg{
   982                  AnyValue,
   983                  splitMatcher{
   984                      high32bits: EqualTo32Bits(
   985                        F_SETFL & 0xffffffff00000000 /* = 0 */),
   986                      low32bits:  EqualTo32Bits(
   987                        F_SETFL & 0x00000000ffffffff /* = 0x4 */),
   988                  },
   989              },
   990              PerArg{
   991                  AnyValue,
   992                  splitMatcher{
   993                      high32bits: EqualTo32Bits(
   994                        F_GETFD & 0xffffffff00000000 /* = 0 */),
   995                      low32bits:  EqualTo32Bits(
   996                        F_GETFD & 0x00000000ffffffff /* = 0x1 */),
   997                  },
   998              },
   999          },
  1000      },
  1001  })
  1002  ```
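
The two halves can be computed mechanically from the 64-bit comparison value.
A small illustrative helper (`split` is hypothetical, not gVisor's actual
`splitMatcher` construction code):

```go
package main

import "fmt"

// split decomposes a 64-bit comparison value into the two 32-bit halves
// that cBPF can actually compare, one per `load32` offset.
func split(value uint64) (high32, low32 uint32) {
	return uint32(value >> 32), uint32(value & 0xffffffff)
}

func main() {
	for _, c := range []struct {
		name  string
		value uint64
	}{
		{"F_GETFL", 0x3},
		{"F_SETFL", 0x4},
		{"F_GETFD", 0x1},
	} {
		high, low := split(c.value)
		fmt.Printf("%s: high32bits=EqualTo32Bits(%#x) low32bits=EqualTo32Bits(%#x)\n",
			c.name, high, low)
	}
}
```

For all three `fcntl(2)` commands the high half is zero, which is exactly the
repetition the next pass extracts.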
  1003  
  1004  gVisor then applies the same optimization as earlier, but this time going into
  1005  each 32-bit half of each argument. This means it can extract the
  1006  `EqualTo32Bits(0)` matcher from the `high32bits` part of each `splitMatcher` and
  1007  move it up to the `And` expression like so:
  1008  
  1009  ```go
  1010  rules = ...(map[uintptr]SyscallRule{
  1011      SYS_FCNTL: And{
  1012          PerArg{NonNegativeFD, AnyValue},
  1013          PerArg{
  1014              AnyValue,
  1015              splitMatcher{
  1016                  high32bits: EqualTo32Bits(0),
  1017                  low32bits:  Any32BitsValue,
  1018              },
  1019          },
  1020          Or{
  1021              PerArg{
  1022                  AnyValue,
  1023                  splitMatcher{
  1024                      high32bits: Any32BitsValue,
  1025                      low32bits:  EqualTo32Bits(
  1026                        F_GETFL & 0x00000000ffffffff /* = 0x3 */),
  1027                  },
  1028              },
  1029              PerArg{
  1030                  AnyValue,
  1031                  splitMatcher{
  1032                      high32bits: Any32BitsValue,
  1033                      low32bits:  EqualTo32Bits(
  1034                        F_SETFL & 0x00000000ffffffff /* = 0x4 */),
  1035                  },
  1036              },
  1037              PerArg{
  1038                  AnyValue,
  1039                  splitMatcher{
  1040                      high32bits: Any32BitsValue,
  1041                      low32bits:  EqualTo32Bits(
  1042                        F_GETFD & 0x00000000ffffffff /* = 0x1 */),
  1043                  },
  1044              },
  1045          },
  1046      },
  1047  })
  1048  ```
  1049  
  1050  This looks bigger as a tree, but keep in mind that the `AnyValue` and
  1051  `Any32BitsValue` matchers do not produce any bytecode. So now let's render that
  1052  tree to bytecode:
  1053  
  1054  ```javascript
  1055  // Check for `seccomp_data.args[0]`:
  1056    00: load32 16                // Load the first 32 bits of
  1057                                 //   `seccomp_data.args[0]` into register A.
  1058    01: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  1059    02: load32 20                // Load the second 32 bits of
  1060                                 //   `seccomp_data.args[0]` into register A.
  1061    03: jset 0x80000000, @bad, 0 // If A & 0x80000000 != 0, jump to @bad,
  1062                                 //   otherwise continue.
  1063  
  1064  // Check for `seccomp_data.args[1]`:
  1065    04: load32 24                // Load the first 32 bits of
  1066                                 //   `seccomp_data.args[1]` into register A.
  1067    05: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  1068    06: load32 28                // Load the second 32 bits of
  1069                                 //   `seccomp_data.args[1]` into register A.
  1070    07: jeq 0x3, @good, @next1   // If A == 0x3, jump to @good,
  1071                                 //   otherwise jump to @next1.
  1072  
  1073  @next1:
  1074    08: load32 28                // Load the second 32 bits of
  1075                                 //   `seccomp_data.args[1]` into register A.
  1076    09: jeq 0x4, @good, @next2   // If A == 0x4, jump to @good,
  1077                                 //   otherwise jump to @next2.
  1078  
  1079  @next2:
  1080    10: load32 28                // Load the second 32 bits of
  1081                                 //   `seccomp_data.args[1]` into register A.
  1082    11: jeq 0x1, @good, @bad     // If A == 0x1, jump to @good,
  1083                                 //   otherwise jump to @bad.
  1084  
  1085  // Good/bad jump targets for the checks above to jump to:
  1086  @good:
  1087    12: return ALLOW
  1088  @bad:
  1089    13: return REJECT
  1090  ```
  1091  
  1092  This is where the bytecode-level optimization to remove redundant loads
  1093  [described earlier](#redundant-loads) finally becomes relevant. We don't need to
  1094  load the second 32 bits of `seccomp_data.args[1]` multiple times in a row:
  1095  
  1096  ```javascript
  1097  // Check for `seccomp_data.args[0]`:
  1098    00: load32 16                // Load the first 32 bits of
  1099                                 //   `seccomp_data.args[0]` into register A.
  1100    01: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  1101    02: load32 20                // Load the second 32 bits of
  1102                                 //   `seccomp_data.args[0]` into register A.
  1103    03: jset 0x80000000, @bad, 0 // If A & 0x80000000 != 0, jump to @bad,
  1104                                 //   otherwise continue.
  1105  
  1106  // Check for `seccomp_data.args[1]`:
  1107    04: load32 24                // Load the first 32 bits of
  1108                                 //   `seccomp_data.args[1]` into register A.
  1109    05: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  1110    06: load32 28                // Load the second 32 bits of
  1111                                 //   `seccomp_data.args[1]` into register A.
  1112    07: jeq 0x3, @good, @next1   // If A == 0x3, jump to @good,
  1113                                 //   otherwise jump to @next1.
  1114  
  1115  @next1:
  1116    08: jeq 0x4, @good, @next2   // If A == 0x4, jump to @good,
  1117                                 //   otherwise jump to @next2.
  1118  
  1119  @next2:
  1120    09: jeq 0x1, @good, @bad     // If A == 0x1, jump to @good,
  1121                                 //   otherwise jump to @bad.
  1122  
  1123  // Good/bad jump targets for the checks above to jump to:
  1124  @good:
  1125    10: return ALLOW
  1126  @bad:
  1127    11: return REJECT
  1128  ```
  1129  
  1130  Of course, in practice the `@good`/`@bad` jump targets would also be unified
  1131  with rules from other system call filters in order to cut down on those too. And
  1132  by having reduced the number of instructions in each individual filtering rule,
  1133  the jumps to these targets can be deduplicated against that many more rules.
  1134  
  1135  This example demonstrates how **optimizations build on top of each other**,
  1136  making each optimization more likely to make *other* optimizations useful in
  1137  turn.
  1138  
  1139  </details>
  1140  
  1141  ## Other optimizations
  1142  
  1143  Beyond these, gVisor also has the following minor optimizations.
  1144  
  1145  <details markdown="1">
  1146  
  1147  <summary markdown="1">
  1148  
  1149  ### Making `futex(2)` rules faster
  1150  
  1151  [`futex(2)`](https://man7.org/linux/man-pages/man2/futex.2.html) is by far the
  1152  system call that gVisor makes most often as part of its operation. It is used
  1153  for synchronization, so it needs to be very efficient.
  1154  
  1155  </summary>
  1156  
  1157  Its rules used to look like this:
  1158  
  1159  ```go
  1160  SYS_FUTEX: Or{
  1161      PerArg{
  1162          AnyValue,
  1163          EqualTo(FUTEX_WAIT | FUTEX_PRIVATE_FLAG),
  1164      },
  1165      PerArg{
  1166          AnyValue,
  1167          EqualTo(FUTEX_WAKE | FUTEX_PRIVATE_FLAG),
  1168      },
  1169      PerArg{
  1170          AnyValue,
  1171          EqualTo(FUTEX_WAIT),
  1172      },
  1173      PerArg{
  1174          AnyValue,
  1175          EqualTo(FUTEX_WAKE),
  1176      },
  1177  },
  1178  ```
  1179  
  1180  Essentially, this is a 4-way `Or` between the 4 different values allowed for
  1181  `seccomp_data.args[1]`. This is all well and good, and the above optimizations
  1182  already reduce this to the minimum number of `jeq` comparison operations.
  1183  
  1184  But looking at the actual bit values of the `FUTEX_*` constants above:
  1185  
  1186  ```go
  1187  FUTEX_WAIT         = 0x00
  1188  FUTEX_WAKE         = 0x01
  1189  FUTEX_PRIVATE_FLAG = 0x80
  1190  ```
  1191  
  1192  ... We can see that this is equivalent to checking that no bits other than
  1193  `0x01` and `0x80` may be set. It turns out that cBPF has an instruction for
  1194  that (`jset`). This is now optimized down to two comparison operations:
  1195  
  1196  ```javascript
  1197  01: load32 24                     // Load the first 32 bits of
  1198                                    //   `seccomp_data.args[1]` into register A.
  1199  02: jeq 0, 0, @bad                // If A == 0, continue,
  1200                                    //   otherwise jump to @bad.
  1201  03: load32 28                     // Load the second 32 bits of
  1202                                    //   `seccomp_data.args[1]` into register A.
  1203  04: jset 0xffffff7e, @bad, @good  // If A & ^(0x01 | 0x80) != 0, jump to @bad,
  1204                                    //   otherwise jump to @good.
  1205  ```
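
To sketch the mask math in Go (the constants match the post; `allowedByMask`
is a hypothetical helper mirroring what the `jset` instruction computes):

```go
package main

import "fmt"

const (
	FUTEX_WAIT         = 0x00
	FUTEX_WAKE         = 0x01
	FUTEX_PRIVATE_FLAG = 0x80
)

// allowedByMask mirrors the `jset 0xffffff7e` test: reject the value if
// any bit outside FUTEX_WAKE|FUTEX_PRIVATE_FLAG is set. The four allowed
// operation values are exactly the subsets of those two bits, so a single
// mask test replaces four equality checks.
func allowedByMask(op uint32) bool {
	const mask = ^uint32(FUTEX_WAKE | FUTEX_PRIVATE_FLAG) // 0xffffff7e
	return op&mask == 0
}

func main() {
	for _, op := range []uint32{
		FUTEX_WAIT, FUTEX_WAKE,
		FUTEX_WAIT | FUTEX_PRIVATE_FLAG, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
	} {
		fmt.Printf("op %#04x allowed: %v\n", op, allowedByMask(op))
	}
	// Any other operation value, e.g. 0x3, sets a bit outside the mask.
	fmt.Println("op 0x03 allowed:", allowedByMask(0x3))
}
```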
  1206  
  1207  </details>
  1208  
  1209  <details markdown="1">
  1210  
  1211  <summary markdown="1">
  1212  
  1213  ### Optimizing non-negative FD checks
  1214  
  1215  A lot of syscall arguments are file descriptors (FD numbers), which we need to
  1216  filter efficiently.
  1217  
  1218  </summary>
  1219  
  1220  An FD is a non-negative 32-bit integer, but is passed as a 64-bit value as
  1221  all syscall arguments are. Instead of doing a "less than" comparison, we can
  1222  turn it into a bitwise check: the first half of the 64-bit value must be
  1223  zero, and bit 31 (the sign bit) of the second half of the 64-bit value must
  1224  not be set.
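
In Go terms, the two cBPF operations amount to this (an illustrative sketch,
not gVisor's actual implementation):

```go
package main

import "fmt"

// fdCheck mirrors the two cBPF operations behind the non-negative FD
// check: the first (high) 32 bits of the 64-bit argument must be zero,
// and bit 31 (the sign bit of a 32-bit int) of the remaining 32 bits
// must be unset. This accepts exactly the values 0 through 0x7fffffff.
func fdCheck(arg uint64) bool {
	high := uint32(arg >> 32)
	low := uint32(arg)
	return high == 0 && low&0x80000000 == 0
}

func main() {
	fmt.Println(fdCheck(3))                  // A typical FD.
	fmt.Println(fdCheck(0x7fffffff))         // Largest value that passes.
	fmt.Println(fdCheck(0xffffffffffffffff)) // -1 as a 64-bit value.
}
```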
  1225  
  1226  </details>
  1227  
  1228  <details markdown="1">
  1229  
  1230  <summary markdown="1">
  1231  
  1232  ### Enforcing consistency of argument-wise matchers
  1233  
  1234  When one syscall argument is checked consistently across all branches of an
  1235  `Or`, enforcing that this is the case ensures that the
  1236  [optimization for such matchers](#optimize-rulesets) remains effective.
  1237  
  1238  </summary>
  1239  
  1240  The `ioctl(2)` system call takes an FD as one of its arguments. Since it is a
  1241  "grab bag" of a system call, gVisor's rules for `ioctl(2)` were similarly spread
  1242  across many files and rules, and not all of them checked that the FD argument
  1243  was non-negative; some of them simply accepted any value for the FD argument.
  1244  
  1245  Before this optimization work, this meant that the BPF program did less work for
  1246  the rules which didn't check the value of the FD argument. However, now that
  1247  gVisor [optimizes repeated argument-wise matchers](#optimize-rulesets), it is
  1248  actually *cheaper* if *all* `ioctl(2)` rules verify the value of the FD
  1249  argument consistently, as that argument check can be performed exactly once for
  1250  all possible branches of the `ioctl(2)` rules. So now gVisor has a test that
  1251  verifies that this is the case. This is a good example that shows that
  1252  **optimization work can lead to improved security** due to the efficiency gains
  1253  that come from applying security checks consistently.
  1254  
  1255  </details>
  1256  
  1257  ## `secbench`: Benchmarking `seccomp-bpf` programs {#secbench}
  1258  
  1259  Measuring the effectiveness of the above improvements through gVisor's
  1260  overall performance would be very difficult, because each improvement is a
  1261  rather tiny part of the syscall hot path. At the scale of each of these
  1262  optimizations, we need to zoom in a bit more.
  1263  
  1264  So now gVisor has
  1265  [tooling for benchmarking `seccomp-bpf` programs](https://github.com/google/gvisor/blob/master/test/secbench/).
  1266  It works by taking a
  1267  [cBPF program along with several possible syscalls](https://github.com/google/gvisor/blob/master/runsc/boot/filter/filter_bench_test.go)
  1268  to try with it. It runs a subprocess that installs this program as a
  1269  `seccomp-bpf` filter for itself, replacing all actions (other than "approve
  1270  syscall") with "return error" in order to avoid crashing. Then it measures the
  1271  latency of each syscall. This is then compared against the latency of the very
  1272  same syscalls in a subprocess that has an empty `seccomp-bpf` filter (i.e. the
  1273  only instruction within it is `return ALLOW`).
  1274  
  1275  Let's measure the effect of the above improvements on a gVisor-like workload.
  1276  
  1277  <details markdown="1">
  1278  
  1279  <summary markdown="1">
  1280  
  1281  ### Modeling gVisor `seccomp-bpf` behavior for benchmarking
  1282  
  1283  This can be done by running gVisor under `ptrace` to see what system calls the
  1284  gVisor process is doing.
  1285  
  1286  </summary>
  1287  
  1288  Note that `ptrace` here refers to the mechanism by which we can inspect the
  1289  system calls that the gVisor Sentry is making. This is distinct from the system
  1290  calls the *sandboxed* application is doing. It also has nothing to do with
  1291  gVisor's former "ptrace" platform.
  1292  
  1293  For example, after running a Postgres benchmark inside gVisor with Systrap, the
  1294  `ptrace` tool generated the following summary table:
  1295  
  1296  ```markdown
  1297  % time     seconds  usecs/call     calls    errors syscall
  1298  ------ ----------- ----------- --------- --------- ----------------
  1299   62.10  431.799048         496    870063     46227 futex
  1300    4.23   29.399526         106    275649        38 nanosleep
  1301    0.87    6.032292          37    160201           sendmmsg
  1302    0.28    1.939492          16    115769           fstat
  1303   27.96  194.415343        2787     69749       137 ppoll
  1304    1.05    7.298717         315     23131           fsync
  1305    0.06    0.446930          31     14096           pwrite64
  1306    3.37   23.398106        1907     12266         9 epoll_pwait
  1307    0.00    0.019711           9      1991         6 close
  1308    0.02    0.116739          82      1414           tgkill
  1309    0.01    0.068481          48      1414       201 rt_sigreturn
  1310    0.02    0.147048         104      1413           getpid
  1311    0.01    0.045338          41      1080           write
  1312    0.01    0.039876          37      1056           read
  1313    0.00    0.015637          18       836        24 openat
  1314    0.01    0.066699          81       814           madvise
  0.00    0.006619          15       420           pread64
  0.00    0.013334          35       375           sched_yield
  0.00    0.029757         111       267           fallocate
  1318    0.00    0.008112         114        71           pwritev2
  1319    0.00    0.003005          57        52           munmap
  1320    0.00    0.000343          18        19         6 unlinkat
  1321    0.00    0.000249          15        16           shutdown
  1322    0.00    0.000100           8        12           getdents64
  1323    0.00    0.000045           4        10           newfstatat
  1324  ...
  1325  ------ ----------- ----------- --------- --------- ----------------
  1326  100.00  695.311111         447   1552214     46651 total
  1327  ```
  1328  
  1329  To mimic the syscall profile of this gVisor sandbox from the perspective of
  1330  `seccomp-bpf` overhead, we need to have it call these system calls with the same
  1331  relative frequency. Therefore, the dimension that matters here isn't `time` or
  1332  `seconds` or even `usecs/call`; it is actually just the number of system calls
  1333  (`calls`). In graph form:
  1334  
  1335  ![Sentry syscall profile](/assets/images/2024-02-01-gvisor-seccomp-sentry-syscall-profile.png "Sentry syscall profile"){:style="max-width:100%"}
  1336  
  1337  The Pareto distribution of system calls becomes immediately clear.
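
To turn these counts into a benchmark input, the fractions can be computed
mechanically. The sketch below (with a hypothetical `weights` helper and only
the top few syscalls from the table above) shows the idea:

```go
package main

import "fmt"

// callCounts holds the per-syscall call counts for the most frequent
// syscalls from the summary table above.
var callCounts = map[string]int{
	"futex":     870063,
	"nanosleep": 275649,
	"sendmmsg":  160201,
	"fstat":     115769,
	"ppoll":     69749,
	"fsync":     23131,
}

// weights converts raw call counts into relative frequencies, which is
// the dimension that matters when mimicking this syscall profile.
func weights(counts map[string]int) map[string]float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	w := make(map[string]float64, len(counts))
	for name, c := range counts {
		w[name] = float64(c) / float64(total)
	}
	return w
}

func main() {
	for name, w := range weights(callCounts) {
		fmt.Printf("%-10s %5.1f%%\n", name, 100*w)
	}
}
```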
  1338  
  1339  </details>
  1340  
  1341  ### `seccomp-bpf` filtering overhead reduction
  1342  
  1343  The `secbench` library lets us take the top 10 system calls and measure their
  1344  `seccomp-bpf` filtering overhead individually, as well as building a weighted
  1345  aggregate of their overall overhead. Here are the numbers from before and after
  1346  the filtering optimizations described in this post:
  1347  
  1348  ![Systrap seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-systrap.png "Systrap seccomp-bpf performance"){:style="max-width:100%"}
  1349  
  1350  The `nanosleep(2)` system call is a bit of an oddball here. Unlike the others,
  1351  this system call causes the current thread to be descheduled. To make the
  1352  results more legible, here is the same data with the duration normalized to the
  1353  `seccomp-bpf` filtering overhead from before optimizations:
  1354  
  1355  ![Systrap seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-systrap-normalized.png "Systrap seccomp-bpf performance"){:style="max-width:100%"}
  1356  
  1357  This shows that most system calls have had their filtering overhead reduced, but
  1358  others haven't significantly changed (10% or less change in either direction).
  1359  This is to be expected: those that have not changed are the ones that are
  1360  cacheable: `nanosleep(2)`, `fstat(2)`, `ppoll(2)`, `fsync(2)`, `pwrite64(2)`,
  1361  `close(2)`, `getpid(2)`. The non-cacheable syscalls
  1362  [which have dedicated checks](#structure) before the main BST, `futex(2)` and
  1363  `sendmmsg(2)`, experienced the biggest boost. Lastly, `epoll_pwait(2)` is
  1364  non-cacheable but doesn't have a dedicated check before the main BST, so while
it still sees a small performance gain, that gain is smaller than that of its
counterparts.
  1367  
The "Aggregate" number comes from the `secbench` library and represents the
total difference in time spent in system calls when issuing them with weighted
randomness. It approximates the average system call overhead that a Sentry using
  1371  Systrap would incur. Therefore, per these numbers, these optimizations removed
  1372  ~29% from gVisor's overall `seccomp-bpf` filtering overhead.
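
The weighted-randomness step can be sketched as follows. This is an
illustration of the approach, not `secbench`'s actual implementation; the three
syscalls and their weights are lifted from the earlier summary table:

```go
package main

import (
	"fmt"
	"math/rand"
)

// syscallWeight pairs a syscall name with its call count from the
// earlier ptrace summary, used as a sampling weight.
type syscallWeight struct {
	name   string
	weight float64
}

var table = []syscallWeight{
	{"futex", 870063},
	{"nanosleep", 275649},
	{"sendmmsg", 160201},
}

// pick maps x (uniform in [0, 1)) to a syscall name with probability
// proportional to its weight, so a benchmark loop can issue syscalls
// in gVisor-like proportions.
func pick(x float64, table []syscallWeight) string {
	total := 0.0
	for _, sw := range table {
		total += sw.weight
	}
	x *= total
	for _, sw := range table {
		x -= sw.weight
		if x < 0 {
			return sw.name
		}
	}
	return table[len(table)-1].name
}

func main() {
	r := rand.New(rand.NewSource(1))
	counts := map[string]int{}
	for i := 0; i < 100000; i++ {
		counts[pick(r.Float64(), table)]++ // a real harness issues the syscall here
	}
	fmt.Println(counts)
}
```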
  1373  
  1374  Here is the same data for KVM, which has a slightly different syscall profile
  1375  with `ioctl(2)` and `rt_sigreturn(2)` being critical for performance:
  1376  
  1377  ![KVM seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-kvm-normalized.png "KVM seccomp-bpf performance"){:style="max-width:100%"}
  1378  
  1379  Lastly, let's look at GPU workload performance. This benchmark enables gVisor's
  1380  [experimental `nvproxy` feature for GPU support](/blog/2023/06/20/gpu-pytorch-stable-diffusion/).
  1381  What matters for this workload is `ioctl(2)` performance, as this is the system
  1382  call used to issue commands to the GPU. Here is the `seccomp-bpf` filtering
  1383  overhead of various CUDA control commands issued via `ioctl(2)`:
  1384  
  1385  ![nvproxy ioctl seccomp-bpf performance](/assets/images/2024-02-01-gvisor-seccomp-nvproxy-ioctl.png "nvproxy ioctl seccomp-bpf performance"){:style="max-width:100%"}
  1386  
  1387  As `nvproxy` adds a lot of complexity to the `ioctl(2)` filtering rules, this is
  1388  where we see the most improvement from these optimizations.
  1389  
  1390  ## `secfuzz`: Fuzzing `seccomp-bpf` programs {#secfuzz}
  1391  
To ensure that the optimizations above don't accidentally produce a cBPF
program whose behavior differs from that of the unoptimized one,
  1394  gVisor also has
  1395  [`seccomp-bpf` fuzz tests](https://github.com/google/gvisor/blob/master/test/secfuzz/).
  1396  
  1397  Because gVisor knows which high-level filters went into constructing the
  1398  `seccomp-bpf` program, it also
  1399  [automatically generates test cases](https://github.com/google/gvisor/blob/master/runsc/boot/filter/filter_fuzz_test.go)
  1400  from these filters, and the fuzzer verifies that each line and every branch of
  1401  the optimized cBPF bytecode is executed, and that the result is the same as
  1402  giving the same input to the unoptimized program.
  1403  
  1404  (Line or branch coverage of the unoptimized program is not enforceable, because
  1405  without optimizations, the bytecode contains many redundant checks for which
  1406  later branches can never be reached.)
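
The core differential check is simple in principle. In this sketch, filters are
modeled as plain Go functions rather than cBPF bytecode run through gVisor's
interpreter, so it only illustrates the comparison logic:

```go
package main

import "fmt"

// filter models a compiled seccomp program as a function from syscall
// number to action; the real fuzzer executes cBPF bytecode instead.
type filter func(sysno uint32) string

// firstDivergence runs both programs over the generated test cases and
// returns the first input on which the optimized program disagrees.
func firstDivergence(unopt, opt filter, inputs []uint32) (uint32, bool) {
	for _, in := range inputs {
		if unopt(in) != opt(in) {
			return in, true
		}
	}
	return 0, false
}

func main() {
	unopt := filter(func(sysno uint32) string {
		if sysno == 39 { // e.g. getpid(2) on x86-64
			return "allow"
		}
		return "trap"
	})
	opt := unopt // a correct optimizer preserves behavior exactly
	if in, ok := firstDivergence(unopt, opt, []uint32{0, 39, 4096}); ok {
		fmt.Println("behavior mismatch on syscall", in)
	} else {
		fmt.Println("programs agree on all test cases")
	}
}
```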
  1407  
  1408  ## Optimizing in-gVisor `seccomp-bpf` filtering
  1409  
  1410  gVisor supports sandboxed applications adding `seccomp-bpf` filters onto
  1411  themselves, and
  1412  [implements its own cBPF interpreter](https://github.com/google/gvisor/blob/master/pkg/bpf/interpreter.go)
  1413  for this purpose.
  1414  
  1415  Because the cBPF bytecode-level optimizations are lossless and are generally
applicable to any cBPF program, they are applied to programs uploaded by
  1417  sandboxed applications to make filter evaluation faster in gVisor itself.
  1418  
  1419  Additionally, gVisor removed the use of Go interfaces previously used for
  1420  loading data from the BPF "input" (i.e. the `seccomp_data` struct for
  1421  `seccomp-bpf`). This used to require an endianness-specific interface due to how
  1422  the BPF interpreter was used in two places in gVisor: network processing (which
  1423  uses network byte ordering), and `seccomp-bpf` (which uses native byte
  1424  ordering). This interface has now been replaced with
[Go generics](https://go.dev/doc/tutorial/generics), yielding a 2x speedup
on [the reference simplistic `seccomp-bpf` filter](#sample-filter). The more
`load` instructions a filter contains, the larger the effect. *(Naturally, this
  1428  also benefits network filtering performance!)*
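
The shape of that change can be sketched like so (simplified, not gVisor's
actual types): the type parameter lets the compiler specialize and inline each
load, whereas an interface value forces dynamic dispatch on every `load`
instruction.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Endianness is implemented by small stateless types so that load32 can
// be instantiated once per byte order; the compiler specializes the call
// instead of dispatching through an interface at every load.
type Endianness interface {
	Uint32(b []byte) uint32
}

type BigEndian struct{} // network byte order, used for packet filtering

func (BigEndian) Uint32(b []byte) uint32 { return binary.BigEndian.Uint32(b) }

type LittleEndian struct{} // native order on x86-64, used for seccomp-bpf

func (LittleEndian) Uint32(b []byte) uint32 { return binary.LittleEndian.Uint32(b) }

// load32 reads a 32-bit word from the BPF "input" (for seccomp-bpf, the
// seccomp_data struct) at the given offset.
func load32[E Endianness](e E, input []byte, off uint32) uint32 {
	return e.Uint32(input[off:])
}

func main() {
	data := []byte{0x00, 0x00, 0x00, 0x2a}
	fmt.Println(load32(BigEndian{}, data, 0))    // 42
	fmt.Println(load32(LittleEndian{}, data, 0)) // 42 << 24
}
```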
  1429  
  1430  ### gVisor cBPF interpreter performance
  1431  
  1432  The graph below shows the gVisor cBPF interpreter's performance against three
  1433  sample filters: [the reference simplistic `seccomp-bpf` filter](#sample-filter),
  1434  and optimized vs unoptimized versions of gVisor's own syscall filter (to
  1435  represent a more complex filter).
  1436  
  1437  ![gVisor cBPF interpreter performance](/assets/images/2024-02-01-gvisor-seccomp-interpreter.png "gVisor cBPF interpreter performance"){:style="max-width:100%"}
  1438  
  1439  ### `seccomp-bpf` filter result caching for sandboxed applications
  1440  
  1441  Lastly, gVisor now also implements an in-sandbox caching mechanism for syscalls
  1442  which do not depend on the `instruction_pointer` or syscall arguments. Unlike
  1443  Linux's `seccomp-bpf` cache, gVisor's implementation also handles actions other
  1444  than "allow", and supports the entire set of cBPF instructions rather than the
  1445  restricted emulator Linux uses for caching evaluation purposes. This removes the
  1446  interpreter from the syscall hot path entirely for cacheable syscalls, further
  1447  speeding up system calls from applications that use `seccomp-bpf` within gVisor.
  1448  
  1449  ![gVisor seccomp-bpf cache](/assets/images/2024-02-01-gvisor-seccomp-cache.png "gVisor seccomp-bpf cache"){:style="max-width:100%"}
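
A minimal sketch of this caching idea (the types and names here are
illustrative, not gVisor's real API): actions for cacheable syscall numbers are
looked up directly, and everything else falls through to the interpreter.

```go
package main

import "fmt"

// Action is a seccomp-bpf result such as "allow", "errno", or "trap".
type Action string

// filterCache serves syscalls whose filter result provably does not
// depend on the instruction pointer or arguments straight from a map,
// keeping the cBPF interpreter off the hot path entirely.
type filterCache struct {
	cached    map[uint32]Action
	interpret func(sysno uint32, args [6]uint64) Action
}

func (c *filterCache) Evaluate(sysno uint32, args [6]uint64) Action {
	if act, ok := c.cached[sysno]; ok {
		return act // hot path: no interpretation needed
	}
	return c.interpret(sysno, args) // non-cacheable: run the program
}

func main() {
	c := &filterCache{
		// Unlike Linux's cache, cached actions need not be "allow".
		cached: map[uint32]Action{39: "allow", 101: "errno(EPERM)"},
		interpret: func(sysno uint32, args [6]uint64) Action {
			return "trap"
		},
	}
	fmt.Println(c.Evaluate(39, [6]uint64{}))  // served from the cache
	fmt.Println(c.Evaluate(999, [6]uint64{})) // interpreted
}
```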
  1450  
  1451  ## Faster gVisor startup via filter precompilation
  1452  
  1453  Due to these optimizations, the overall process of building the syscall
  1454  filtering rules, rendering them to cBPF bytecode, and running all the
optimizations, can take quite a while (~10ms). Since one of gVisor's strengths
is startup latency far lower than that of VMs, such a delay is unacceptable.
  1457  
  1458  Therefore, gVisor now
  1459  [precompiles the rules](https://github.com/google/gvisor/blob/master/pkg/seccomp/precompiledseccomp/)
  1460  to optimized cBPF bytecode for most possible gVisor configurations. This means
  1461  the `runsc` binary contains cBPF bytecode embedded in it for some subset of
  1462  popular configurations, and it will use this bytecode rather than compiling the
  1463  cBPF program from scratch during startup. If `runsc` is invoked with a
  1464  configuration for which the cBPF bytecode isn't embedded in the `runsc` binary,
  1465  it will fall back to compiling the program from scratch.
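
The lookup-with-fallback logic amounts to something like the following; the
configuration keys and bytecode bytes are placeholders, not runsc's real
embedding format:

```go
package main

import "fmt"

// embeddedPrograms maps a configuration key to cBPF bytecode that was
// compiled and optimized at runsc build time. Keys and bytes here are
// placeholders for illustration.
var embeddedPrograms = map[string][]byte{
	"default":         {0x20, 0x00, 0x00, 0x00},
	"default+nvproxy": {0x20, 0x00, 0x00, 0x04},
}

// programFor returns precompiled bytecode when this configuration was
// covered at build time, and compiles from scratch otherwise.
func programFor(configKey string, compile func() []byte) []byte {
	if prog, ok := embeddedPrograms[configKey]; ok {
		return prog // fast path: skip rule building and optimization
	}
	return compile() // slow path (~10ms): unusual configuration
}

func main() {
	compileFromScratch := func() []byte {
		fmt.Println("compiling seccomp-bpf program from scratch...")
		return []byte{0x06, 0x00, 0x00, 0x00}
	}
	fmt.Println(len(programFor("default", compileFromScratch)))
	fmt.Println(len(programFor("exotic-config", compileFromScratch)))
}
```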
  1466  
  1467  <details markdown="1">
  1468  
  1469  <summary markdown="1">
  1470  
  1471  ### Dealing with dynamic values in precompiled rules
  1472  
  1473  </summary>
  1474  
  1475  One challenge with this approach is to support parts of the configuration that
  1476  are only known at `runsc` startup time. For example, many filters act on a
  1477  specific file descriptor used for interacting with the `runsc` process after
  1478  startup over a Unix Domain Socket (called the "controller FD"). This is an
  1479  integer that is only known at runtime, so its value cannot be embedded inside
  1480  the optimized cBPF bytecode prepared at `runsc` compilation time.
  1481  
  1482  To address this, the `seccomp-bpf` precompilation tooling actually supports the
  1483  notions of 32-bit "variables", and takes as input a function to render cBPF
  1484  bytecode for a given key-value mapping of variables to placeholder 32-bit
  1485  values. The precompiler calls this function *twice* with different arbitrary
  1486  value mappings for each variable, and observes where these arbitrary values show
  1487  up in the generated cBPF bytecode. This takes advantage of the fact that
  1488  gVisor's `seccomp-bpf` program generation is deterministic.
  1489  
  1490  If the two cBPF programs are of the same byte length, and the placeholder values
  1491  show up at exactly the same byte offsets within the cBPF bytecode both times,
  1492  and the rest of the cBPF bytecode is byte-for-byte equivalent, the precompiler
  1493  has very high confidence that these offsets are where the 32-bit variables are
  1494  represented in the cBPF bytecode. It then stores these offsets as part of the
  1495  embedded data inside the `runsc` binary. Finally, at `runsc` execution time, the
  1496  bytes at these offsets are replaced with the now-known values of the variables.
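
Concretely, the render-twice-and-diff trick looks roughly like this (a
simplified sketch that checks only the placeholder slots, omitting the
byte-for-byte cross-check of the rest of the program described above):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// render stands in for gVisor's deterministic cBPF generation: it emits
// bytecode that embeds the controller FD as a 32-bit immediate operand.
func render(controllerFD uint32) []byte {
	prog := []byte{0x20, 0x00, 0x00, 0x00} // illustrative load instruction
	prog = binary.LittleEndian.AppendUint32(prog, controllerFD)
	return append(prog, 0x06, 0x00, 0x00, 0x00) // illustrative return
}

// findOffsets renders the program twice with two arbitrary placeholder
// values and records the offsets where those values appear, relying on
// the generator being deterministic.
func findOffsets(render func(uint32) []byte) ([]int, error) {
	const a, b = 0x11111111, 0x22222222
	pa, pb := render(a), render(b)
	if len(pa) != len(pb) {
		return nil, fmt.Errorf("program length depends on the value")
	}
	var offs []int
	for i := 0; i+4 <= len(pa); i++ {
		if binary.LittleEndian.Uint32(pa[i:]) == a &&
			binary.LittleEndian.Uint32(pb[i:]) == b {
			offs = append(offs, i)
			i += 3 // skip the rest of this 32-bit slot
		}
	}
	return offs, nil
}

// patch writes the now-known runtime value over each recorded offset.
func patch(prog []byte, offs []int, value uint32) {
	for _, off := range offs {
		binary.LittleEndian.PutUint32(prog[off:], value)
	}
}

func main() {
	offs, err := findOffsets(render) // done at runsc build time
	if err != nil {
		panic(err)
	}
	prog := render(0)    // embedded bytecode with a dummy FD value
	patch(prog, offs, 5) // at runsc startup, the real FD is 5
	fmt.Println(offs, binary.LittleEndian.Uint32(prog[4:]))
}
```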
  1497  
  1498  </details>
  1499  
  1500  ## OK that's great and all, but is gVisor actually faster? {#performance}
  1501  
  1502  The short answer is: **yes, but only slightly**. As we
  1503  [established earlier](#performance-considerations), `seccomp-bpf` is only a
  1504  small portion of gVisor's total overhead, and the `secbench` benchmark shows
  1505  that this work only removes a portion of that overhead, so we should not expect
  1506  large differences here.
  1507  
  1508  Let's come back to the trusty ABSL build benchmark, with a new build of gVisor
  1509  with all of these optimizations turned on:
  1510  
  1511  ![ABSL build performance](/assets/images/2024-02-01-gvisor-seccomp-absl-vs-unsandboxed.png "ABSL build performance"){:style="max-width:100%"}
  1512  
  1513  Let's zoom the vertical axis in on the gVisor variants to see the difference
  1514  better:
  1515  
  1516  ![ABSL build performance](/assets/images/2024-02-01-gvisor-seccomp-absl.png "ABSL build performance"){:style="max-width:100%"}
  1517  
  1518  This is about in line with what the earlier benchmarks showed. The initial
  1519  benchmarks showed that `seccomp-bpf` filtering overhead for this benchmark was
  1520  on the order of ~3.6% of total runtime, and the `secbench` benchmarks showed
  1521  that the optimizations reduced `seccomp-bpf` filter evaluation time by ~29% in
  1522  aggregate. The final absolute reduction in total runtime should then be around
  1523  ~1%, which is just about what this result shows.
  1524  
  1525  Other benchmarks show a similar pattern. Here's gRPC build, similar to ABSL:
  1526  
  1527  ![gRPC build performance](/assets/images/2024-02-01-gvisor-seccomp-grpc-vs-unsandboxed.png "gRPC build performance"){:style="max-width:100%"}
  1528  
  1529  ![gRPC build performance](/assets/images/2024-02-01-gvisor-seccomp-grpc.png "gRPC build performance"){:style="max-width:100%"}
  1530  
  1531  Here's a benchmark running the
  1532  [Ruby Fastlane](https://github.com/fastlane/fastlane) test suite:
  1533  
  1534  ![Ruby Fastlane performance](/assets/images/2024-02-01-gvisor-seccomp-rubydev-vs-unsandboxed.png "Ruby Fastlane performance"){:style="max-width:100%"}
  1535  
  1536  ![Ruby Fastlane performance](/assets/images/2024-02-01-gvisor-seccomp-rubydev.png "Ruby Fastlane performance"){:style="max-width:100%"}
  1537  
  1538  Here's the 50th percentile of nginx serving latency for an empty webpage.
  1539  [Every microsecond counts when it comes to web serving](https://www.prnewswire.com/news-releases/akamai-online-retail-performance-report-milliseconds-are-critical-300441498.html),
and here we've shaved off 20 of them.
  1541  
  1542  ![nginx performance](/assets/images/2024-02-01-gvisor-seccomp-nginx-vs-unsandboxed.png "nginx performance"){:style="max-width:100%"}
  1543  
  1544  ![nginx performance](/assets/images/2024-02-01-gvisor-seccomp-nginx.png "nginx performance"){:style="max-width:100%"}
  1545  
  1546  CUDA workloads also get a boost from this work. Since their gVisor-related
  1547  overhead is already relatively small, **`seccomp-bpf` filtering makes up a
  1548  higher proportion of their overhead**. Additionally, as the performance
  1549  improvements described in this post disproportionately help the `ioctl(2)`
  1550  system call, this cuts a larger portion of the `seccomp-bpf` filtering overhead
of these workloads, since CUDA uses the `ioctl(2)` system call to communicate
  1552  with the GPU.
  1553  
  1554  ![PyTorch performance](/assets/images/2024-02-01-gvisor-seccomp-pytorch-vs-unsandboxed.png "PyTorch performance"){:style="max-width:100%"}
  1555  
  1556  ![PyTorch performance](/assets/images/2024-02-01-gvisor-seccomp-pytorch.png "PyTorch performance"){:style="max-width:100%"}
  1557  
  1558  While some of these results may not seem like much in absolute terms, it's
  1559  important to remember:
  1560  
  1561  -   These improvements have resulted in gVisor being able to enforce **more**
  1562      `seccomp-bpf` filters than it previously could; gVisor's `seccomp-bpf`
  1563      filter was nearly half the maximum `seccomp-bpf` program size, so it could
  1564      at most double in complexity. After optimizations, it is reduced to less
  1565      than a fourth of this size.
  1566  -   These improvements allow the gVisor filters to **scale better**. This is
  1567      visible from the effects on `ioctl(2)` performance with `nvproxy` enabled.
  1568  -   The resulting work has produced useful libraries for `seccomp-bpf` tooling
  1569      which may be helpful for other projects: testing, fuzzing, and benchmarking
  1570      `seccomp-bpf` filters.
  1571  -   This overhead could not have been addressed in another way. Unlike other
  1572      areas of gVisor, such as network overhead or file I/O, overhead from the
    host kernel evaluating `seccomp-bpf` filters lives outside of gVisor itself,
    and therefore can only be improved upon by this type of work.
  1575  
  1576  ## Further work
  1577  
  1578  One potential source of work is to look into the performance gap between no
  1579  `seccomp-bpf` filter at all versus performance with an empty `seccomp-bpf`
  1580  filter (equivalent to an all-cacheable filter). This points to a potential
  1581  inefficiency in the Linux kernel implementation of the `seccomp-bpf` cache.
  1582  
  1583  Another potential point of improvement is to port over the optimizations that
  1584  went into searching for a syscall number into the
  1585  [`ioctl(2)` system call][ioctl]. `ioctl(2)` is a "grab-bag" kind of system call,
  1586  used by many drivers and other subsets of the Linux kernel to extend the syscall
  1587  interface without using up valuable syscall numbers. For example, the
  1588  [KVM](https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine) subsystem is
  1589  almost entirely controlled through `ioctl(2)` system calls issued against
  1590  `/dev/kvm` or against per-VM file descriptors.
  1591  
  1592  For this reason, the first non-file-descriptor argument of [`ioctl(2)`][ioctl]
  1593  ("request") usually encodes something analogous to what the syscall number
  1594  usually represents: the type of request made to the kernel. Currently, gVisor
  1595  performs a linear scan through all possible enumerations of this argument. This
  1596  is usually fine, but with features like `nvproxy` which massively expand this
  1597  list of possible values, this can take a long time. `ioctl` performance is also
  1598  critical for gVisor's KVM platform. A binary search tree would make sense here.
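
The lookup-strategy difference can be modeled in Go as below. In actual cBPF
this would be emitted as a tree of conditional jumps rather than a library
call, and the request values here are made-up placeholders, not gVisor's real
allowlist:

```go
package main

import (
	"fmt"
	"sort"
)

// allowedRequests is a sorted allowlist of ioctl(2) request numbers;
// with nvproxy enabled, the real list is very large.
var allowedRequests = []uint32{0x1234, 0x4642, 0x8012, 0xae41, 0xc0ffee}

// allowedLinear mirrors the current approach: O(n) comparisons.
func allowedLinear(req uint32) bool {
	for _, r := range allowedRequests {
		if r == req {
			return true
		}
	}
	return false
}

// allowedBST mirrors the proposed approach: O(log n) comparisons, the
// same idea already used for syscall numbers in the main BST.
func allowedBST(req uint32) bool {
	i := sort.Search(len(allowedRequests), func(i int) bool {
		return allowedRequests[i] >= req
	})
	return i < len(allowedRequests) && allowedRequests[i] == req
}

func main() {
	for _, req := range []uint32{0x8012, 0xdead} {
		fmt.Println(req, allowedLinear(req), allowedBST(req))
	}
}
```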
  1599  
  1600  gVisor welcomes further contributions to its `seccomp-bpf` machinery. Thanks for
  1601  reading!
  1602  
  1603  [seccomp-bpf]: https://www.kernel.org/doc/html/v4.19/userspace-api/seccomp_filter.html
  1604  [BPF]: https://en.wikipedia.org/wiki/Berkeley_Packet_Filter
  1605  [EBPF]: https://en.wikipedia.org/wiki/EBPF
  1606  [ioctl]: https://man7.org/linux/man-pages/man2/ioctl.2.html
  1607  [^1]: cBPF does not have a canonical assembly-style representation. The
  1608      assembly-like code in this blog post is close to
  1609      [the one used in `bpfc`](https://man7.org/linux/man-pages/man8/bpfc.8.html)
  1610      but diverges in ways to make it hopefully clearer as to what's happening,
  1611      and all code is annotated with `// comments`.