
# Releasing Systrap - A high-performance gVisor platform

We are releasing a new gVisor platform: Systrap. Like the existing ptrace
platform, Systrap runs on most Linux machines out of the box without
virtualization. Unlike the ptrace platform, it’s fast 🚀. Go try it by adding
`--platform=systrap` to the runsc flags. If you want to know more about it, read
on.

--------------------------------------------------------------------------------

gVisor is a security boundary for arbitrary Linux processes. Boundaries do not
come for free, and gVisor imposes some performance overhead on sandboxed
applications. One of the most fundamental performance challenges with the
security model implemented by gVisor is system call interception, which is the
focus of this post.

To recap the
[security model](https://gvisor.dev/docs/architecture_guide/security/#what-can-a-sandbox-do):
gVisor is an application kernel that implements the Linux ABI. This includes
system calls, signals, memory management, and more. For example, when a
sandboxed application calls
[`read(2)`](https://man7.org/linux/man-pages/man2/read.2.html), it actually
transparently calls into
[gVisor's implementation of this system call](https://github.com/google/gvisor/blob/44e2d0fcfeb641f3b8013c3f93cacdae447cc0f1/pkg/sentry/syscalls/linux/sys_read_write.go#L36).
This minimizes the attack surface of the host kernel, because sandboxed programs
simply can’t make system calls directly to the host in the first place[^1]. This
interception happens through an internal layer called the Platform interface,
which we have written about in a previous
[blog post](https://gvisor.dev/blog/2020/10/22/platform-portability/). To handle
these interceptions, this interface must also create new address spaces,
allocate memory, and create execution contexts to run the workload.

gVisor had two platform implementations: KVM and ptrace. The KVM platform uses
the kernel’s KVM functionality to allow the Sentry to act as both guest OS and
VMM (virtual machine monitor). It does system call interception just like a
normal virtual machine would. This gives good performance when using bare-metal
virtualization, but has a noticeable impact with nested virtualization. The
other obvious downside is that it requires support for nested virtualization in
the first place, which not all hardware supports (ARM CPUs, for example) and
which is unavailable in some cloud environments.

The ptrace platform was the alternative wherever KVM was not available. It works
through the
[`PTRACE_SYSEMU`](http://man7.org/linux/man-pages/man2/ptrace.2.html) action,
which makes the user process hand back execution to the sentry whenever it
encounters a system call. This is a clean method to achieve system call
interception in any environment, virtualized or not, except that it’s quite
slow. To see just how slow, an unrealistic but highly illustrative benchmark is
the
[`getpid` benchmark](https://github.com/google/gvisor/blob/108410638aa8480e82933870ba8279133f543d2b/test/perf/linux/getpid_benchmark.cc)[^2].
This benchmark runs the
[`getpid(2)`](https://man7.org/linux/man-pages/man2/getpid.2.html) system call
in a tight `while` loop. No useful application behaves this way, so it is not a
realistic benchmark, but it is well suited to measuring system call latency.

![Figure 1](/assets/images/2023-04-28-getpid-ptrace-vs-native.svg "Getpid benchmark: ptrace vs. native Linux."){:width="100%"}

All `getpid` runs have been performed on a GCE n2-standard-4 VM, with the
`debian-11-bullseye-v20230306` image.

While this benchmark is not representative of most real-world workloads, just
about any workload will suffer from high system call overhead. Since running in
a virtualized environment is the default state for most cloud users these days,
it's important that gVisor performs well in this context. Systrap is the new
platform targeting this important use case.

Systrap relies on multiple techniques to implement the Platform interface. Like
the ptrace platform, Systrap uses Linux's ptrace subsystem to initialize
workload executor threads, which are started as child processes of the main
gVisor sentry process. Systrap additionally sets a very restrictive seccomp
filter, installs a custom signal handler, and allocates chunks of memory shared
between user threads and the runsc sentry. This shared memory is what serves as
the main form of communication between the sentry and sandboxed programs:
whenever the sandboxed process attempts to execute a system call, it triggers a
`SIGSYS` signal, which is caught by our signal handler. The signal handler in
turn populates the shared memory region and asks the sentry to handle the
requested system call. This alone proved to be faster than using
`PTRACE_SYSEMU`, as demonstrated by the `getpid` benchmark:

![Figure 2](/assets/images/2023-04-28-getpid-ptrace-vs-systrap-unoptimized.svg "Getpid benchmark: ptrace vs. Systrap."){:width="100%"}

Can we make it even faster? Recall what the main purpose of our signal handler
is: to send a request to the sentry via shared memory. To do that, the sandboxed
process must first incur the overhead of executing the seccomp filter[^3], and
then generate a full signal stack before being able to run the signal handler.
What if there were a way to simply have the sandboxed process jump to another
user-space function when it wanted to perform a system call? Well, it turns out,
there is[^4]. There is a popular x86 instruction pattern that’s used to perform
system calls, and it goes a little something like this: **`mov sysno, %eax;
syscall`**. The size of the `mov` instruction is 5 bytes and the size of the
`syscall` instruction is 2 bytes. Luckily, this is just enough space to fit a
**`jmp *%gs:offset`** instruction. When the signal handler sees this instruction
pattern, it signals to the sentry that the original instructions can be replaced
with a **`jmp`** to trampoline code that performs the same function as the
regular `SIGSYS` signal handler. The system call number is not lost, but rather
encoded in the offset. The results are even more impressive:

![Figure 3](/assets/images/2023-04-28-getpid-ptrace-vs-systrap-opt.svg "Getpid benchmark: ptrace vs. Optimized Systrap."){:width="100%"}

As mentioned, the `getpid` benchmark is not representative of real-world
performance. To get a better picture of the magnitude of improvement, here are
some real-world workloads:

*   The
    [Build ABSL benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/fs/bazel_test.go)
    measures compilation performance by compiling
    [abseil.io](https://abseil.io/); this is a highly system-call-dependent
    workload because it performs a lot of filesystem I/O (gVisor’s filesystem
    overhead also depends on the file system isolation it implements, which you
    can learn about
    [here](https://gvisor.dev/docs/user_guide/filesystem/)).
*   The
    [ffmpeg benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/media/ffmpeg_test.go)
    runs a multimedia processing tool to perform, for example, video stream
    encoding and decoding; this workload makes relatively few system calls, so
    there are very few userspace-to-kernel mode switches.
*   The
    [Tensorflow benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/ml/tensorflow_test.go)
    trains a variety of machine learning models on CPU; the system call usage of
    this workload sits between compilation and ffmpeg, due to needing to
    retrieve training and validation data, but the majority of time is still
    spent running userspace computations.
*   Finally, the Redis benchmark performs SET RPC calls with 5 concurrent
    clients, measures the latency that each call takes to execute, and reports
    the median (scaled by 250,000 to fit the graph's axis); this workload is
    heavily bounded by system call performance due to high network stack usage.

![Figure 4](/assets/images/2023-04-28-systrap-sample-workloads.svg "Comparison of sample workloads running on ptrace, Systrap, and native Linux."){:width="100%"}

Systrap will replace the ptrace platform and become the default by September
2023. Until then, we are working hard to make it production-ready, which
includes additional performance and stability improvements, and maintaining a
high bar for security through fuzz testing targeted specifically at Systrap.

In the meantime, we would like gVisor users to try it out and give us feedback!
If you run gVisor using ptrace today (either by specifying `--platform ptrace`
or by not specifying the `--platform` flag at all), or you use the KVM platform
with nested virtualization, switching to Systrap should be a drop-in performance
upgrade. All you have to do is specify `--platform systrap` to runsc. If you
encounter any issues, please let us know at
[gvisor.dev/issues](https://github.com/google/gvisor/issues).
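
For example, if you run gVisor through Docker, one way to pass the flag
(assuming runsc is already registered as a Docker runtime; the binary path below
is illustrative) is via `/etc/docker/daemon.json`:

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--platform=systrap"]
    }
  }
}
```

Then restart the Docker daemon and run containers with `--runtime=runsc` as
usual.
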
<br>
<br>

--------------------------------------------------------------------------------

<!-- mdformat off(Footnotes need to be separated by linebreaks to be rendered) -->

[^1]: Even if the sandbox itself is compromised, it will still be bound by
    several defense-in-depth layers, including a restricted set of seccomp
    filters. You can find more details here:
    [https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/](https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/).

[^2]: Once the system call has been intercepted by gVisor (or, in the case of
    Linux, once the process has entered kernel mode), actually executing the
    `getpid` system call itself is very fast, so this benchmark effectively
    measures single-thread syscall-interception overhead.

[^3]: Seccomp filters are known to have a “not insubstantial” overhead:
    [https://lwn.net/Articles/656307/](https://lwn.net/Articles/656307/).

[^4]: This applies to the x86_64 architecture; ARM does not have this
    optimization as of the time of writing.

<!-- mdformat on -->