# Releasing Systrap - A high-performance gVisor platform

We are releasing a new gVisor platform: Systrap. Like the existing ptrace
platform, Systrap runs on most Linux machines out of the box without
virtualization. Unlike the ptrace platform, it’s fast 🚀. Go try it by adding
`--platform=systrap` to the runsc flags. If you want to know more about it,
read on.

--------------------------------------------------------------------------------

gVisor is a security boundary for arbitrary Linux processes. Boundaries do not
come for free, and gVisor imposes some performance overhead on sandboxed
applications. One of the most fundamental performance challenges with the
security model implemented by gVisor is system call interception, which is the
focus of this post.

To recap the
[security model](https://gvisor.dev/docs/architecture_guide/security/#what-can-a-sandbox-do):
gVisor is an application kernel that implements the Linux ABI. This includes
system calls, signals, memory management, and more. For example, when a
sandboxed application calls
[`read(2)`](https://man7.org/linux/man-pages/man2/read.2.html), it actually
transparently calls into
[gVisor's implementation of this system call](https://github.com/google/gvisor/blob/44e2d0fcfeb641f3b8013c3f93cacdae447cc0f1/pkg/sentry/syscalls/linux/sys_read_write.go#L36).
This minimizes the attack surface of the host kernel, because sandboxed
programs simply can’t make system calls directly to the host in the first
place[^1]. This interception happens through an internal layer called the
Platform interface, which we have written about in a previous
[blog post](https://gvisor.dev/blog/2020/10/22/platform-portability/). To
handle these interceptions, this interface must also create new address
spaces, allocate memory, and create execution contexts to run the workload.

Until now, gVisor has had two platform implementations: KVM and ptrace. The
KVM platform uses the kernel’s KVM functionality to allow the Sentry to act as
both guest OS and VMM (virtual machine monitor). It does system call
interception just like a normal virtual machine would. This gives good
performance when using bare-metal virtualization, but has a noticeable impact
under nested virtualization. The other obvious downside is that it requires
nested virtualization in the first place, which is not supported by all
hardware (such as ARM CPUs) or within some cloud environments.

The ptrace platform has been the alternative wherever KVM is not available. It
works through the
[`PTRACE_SYSEMU`](http://man7.org/linux/man-pages/man2/ptrace.2.html) action,
which makes the user process hand execution back to the Sentry whenever it
encounters a system call. This is a clean method to achieve system call
interception in any environment, virtualized or not, except that it’s quite
slow.
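To make the mechanism concrete, here is a minimal toy sketch of
`PTRACE_SYSEMU`-based interception, written in C for illustration (gVisor
itself is written in Go, and its real ptrace platform is considerably more
involved; the pid value `4242` below is an arbitrary stand-in):

```c
// Toy PTRACE_SYSEMU interception loop (x86-64). The tracer plays the
// role of the Sentry: every syscall the child makes is suppressed by
// the kernel and "emulated" by the tracer instead.
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t child = fork();
  if (child == 0) {
    ptrace(PTRACE_TRACEME, 0, NULL, NULL);
    raise(SIGSTOP);                    // Let the tracer take over.
    _exit(getpid() == 4242 ? 0 : 1);   // Exit status reveals the pid we saw.
  }

  int status;
  waitpid(child, &status, 0);          // Child is stopped at SIGSTOP.
  for (;;) {
    // Resume the child until its next syscall *entry*; the kernel
    // suppresses the syscall itself rather than executing it.
    ptrace(PTRACE_SYSEMU, child, NULL, NULL);
    waitpid(child, &status, 0);
    if (!WIFSTOPPED(status)) break;

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);
    if (regs.orig_rax == SYS_getpid) {
      regs.rax = 4242;                 // Write back the emulated result.
      ptrace(PTRACE_SETREGS, child, NULL, &regs);
    } else if (regs.orig_rax == SYS_exit_group) {
      // rdi holds the exit status: prints 0, so the child saw pid 4242.
      printf("child exiting with status %lld\n", (long long)regs.rdi);
      kill(child, SIGKILL);
      break;
    }
  }
  return 0;
}
```

Every single system call in the child costs two context switches into the
tracer plus several ptrace calls, which is exactly where the slowness comes
from.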
To see just how slow, an unrealistic but highly illustrative benchmark to use
is the
[`getpid` benchmark](https://github.com/google/gvisor/blob/108410638aa8480e82933870ba8279133f543d2b/test/perf/linux/getpid_benchmark.cc)[^2].
This benchmark runs the
[`getpid(2)`](https://man7.org/linux/man-pages/man2/getpid.2.html) system call
in a tight `while` loop. No useful application has this behavior, so it is not
a realistic benchmark, but it is well-suited to measuring system call latency.

![Figure 1](/assets/images/2023-04-28-getpid-ptrace-vs-native.svg "Getpid benchmark: ptrace vs. native Linux."){:width="100%"}

All `getpid` runs were performed on a GCE n2-standard-4 VM with the
`debian-11-bullseye-v20230306` image.

While this benchmark is not representative of most real-world workloads, just
about any workload suffers when system call overhead is high. Since running in
a virtualized environment is the default for most cloud users these days, it's
important that gVisor perform well in this context. Systrap is the new
platform targeting this use case.

Systrap relies on multiple techniques to implement the Platform interface.
Like the ptrace platform, Systrap uses Linux's ptrace subsystem to initialize
workload executor threads, which are started as child processes of the main
gVisor Sentry process. Systrap additionally sets a very restrictive seccomp
filter, installs a custom signal handler, and allocates chunks of memory
shared between workload threads and the runsc Sentry. This shared memory
serves as the main form of communication between the Sentry and sandboxed
programs: whenever the sandboxed process attempts to execute a system call, it
triggers a `SIGSYS` signal, which is handled by our signal handler. The signal
handler in turn populates the shared memory regions and requests that the
Sentry handle the system call. This alone proved to be faster than using
`PTRACE_SYSEMU`, as demonstrated by the `getpid` benchmark:

![Figure 2](/assets/images/2023-04-28-getpid-ptrace-vs-systrap-unoptimized.svg "Getpid benchmark: ptrace vs. Systrap."){:width="100%"}
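The Linux primitive underneath this is seccomp's `SECCOMP_RET_TRAP` action,
which turns a filtered system call into a `SIGSYS` delivered to a user-space
handler. Here is a minimal, self-contained sketch of that mechanism, again in
C for illustration only: Systrap's real handler forwards the request to the
Sentry over shared memory rather than emulating it in place, and `4242` is
once more an arbitrary stand-in value.

```c
// Syscall interception via seccomp + SIGSYS (x86-64).
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <ucontext.h>
#include <unistd.h>

static void handle_sigsys(int sig, siginfo_t *info, void *ucontext) {
  (void)sig;
  ucontext_t *ctx = ucontext;
  // In Systrap, this handler would marshal a request into shared memory
  // for the Sentry; here we just emulate getpid() in place by writing
  // the "return value" into the interrupted context's RAX.
  if (info->si_syscall == SYS_getpid) {
    ctx->uc_mcontext.gregs[REG_RAX] = 4242;
  }
}

int main(void) {
  struct sigaction sa = {0};
  sa.sa_sigaction = handle_sigsys;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSYS, &sa, NULL);

  // Trap getpid with SIGSYS; allow everything else. A real filter must
  // also validate seccomp_data.arch before trusting the syscall number.
  struct sock_filter filter[] = {
      BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
      BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getpid, /*jt=*/0, /*jf=*/1),
      BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
      BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
  };
  struct sock_fprog prog = {
      .len = sizeof(filter) / sizeof(filter[0]),
      .filter = filter,
  };
  prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

  printf("getpid() returned %ld\n", (long)getpid());  // Prints 4242.
  return 0;
}
```

Note that the whole round trip stays within the sandboxed process: there is no
context switch to a tracer, which is why this beats `PTRACE_SYSEMU`.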
Can we make it even faster? Recall what the main purpose of our signal handler
is: to send a request to the Sentry via shared memory. To do that, the
sandboxed process must first incur the overhead of executing the seccomp
filter[^3], and then of generating a full signal stack, before the signal
handler can run. What if there was a way to simply have the sandboxed process
jump to another user-space function whenever it wanted to perform a system
call? Well, it turns out there is[^4]. There is a popular x86 instruction
pattern used to perform system calls, and it goes a little something like
this: **`mov sysno, %eax; syscall`**. The size of the mov instruction is 5
bytes and the size of the syscall instruction is 2 bytes. Luckily this is just
enough space to fit a **`jmp *%gs:offset`** instruction. When the signal
handler sees this instruction pattern, it signals to the Sentry that the
original instructions can be replaced with a **`jmp`** to trampoline code that
performs the same function as the regular `SIGSYS` signal handler. The system
call number is not lost, but rather encoded in the offset.
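To make the pattern concrete, here is a sketch of how a handler can recognize
the 7-byte sequence at the faulting instruction pointer (with
`SECCOMP_RET_TRAP`, the reported program counter points just past the syscall
instruction). The helper below is hypothetical and purely illustrative;
Systrap's actual patching logic lives in the Sentry and must, among other
things, rewrite the bytes atomically with respect to other running threads.

```c
// Hypothetical check for the patchable syscall sequence:
//   b8 xx xx xx xx    mov $sysno, %eax   (5 bytes)
//   0f 05             syscall            (2 bytes)
// If it matches, those 7 bytes can be overwritten with the
// `jmp *%gs:offset` described above, with sysno encoded in the offset.
#include <stdbool.h>
#include <stdint.h>

static bool patchable_syscall(const uint8_t *rip_after_syscall,
                              uint32_t *sysno) {
  const uint8_t *p = rip_after_syscall - 7;  // Start of the mov.
  if (p[0] != 0xb8 || p[5] != 0x0f || p[6] != 0x05) return false;
  // The mov's 32-bit little-endian immediate is the syscall number.
  *sysno = (uint32_t)p[1] | (uint32_t)p[2] << 8 |
           (uint32_t)p[3] << 16 | (uint32_t)p[4] << 24;
  return true;
}
```

Once a call site has been patched, subsequent system calls from it bypass the
seccomp filter and signal delivery entirely and jump straight into the
trampoline.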
The results are even more impressive:

![Figure 3](/assets/images/2023-04-28-getpid-ptrace-vs-systrap-opt.svg "Getpid benchmark: ptrace vs. Optimized Systrap."){:width="100%"}

As mentioned, the `getpid` benchmark is not representative of real-world
performance. To get a better picture of the magnitude of improvement, here are
some real-world workloads:

*   The
    [Build ABSL benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/fs/bazel_test.go)
    measures compilation performance by compiling
    [abseil.io](https://abseil.io/); this is a highly system-call-dependent
    workload because it performs a lot of filesystem I/O (gVisor’s filesystem
    overhead also depends on the level of filesystem isolation it implements,
    which you can learn about
    [here](https://gvisor.dev/docs/user_guide/filesystem/)).
*   The
    [ffmpeg benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/media/ffmpeg_test.go)
    runs a multimedia processing tool to perform, for example, video stream
    encoding and decoding; this workload does not make a significant number of
    system calls, so there are very few user-space-to-kernel-mode switches.
*   The
    [Tensorflow benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/ml/tensorflow_test.go)
    trains a variety of machine learning models on CPU; the system call usage
    of this workload falls between the compilation and ffmpeg workloads, since
    it needs to retrieve training and validation data, but the majority of its
    time is still spent running user-space computations.
*   Finally, the Redis benchmark performs SET RPC calls with 5 concurrent
    clients, measures the latency of each call, and reports the median (scaled
    by 250,000 to fit the graph's axis); this workload is heavily bound by
    system call performance due to heavy network stack usage.

![Figure 4](/assets/images/2023-04-28-systrap-sample-workloads.svg "Comparison of sample workloads running on ptrace, Systrap, and native Linux."){:width="100%"}

Systrap will replace the ptrace platform and become the default by September
2023. Until then, we are working hard to make it production-ready, which
includes additional performance and stability improvements, as well as
maintaining a high bar for security through fuzz testing targeted specifically
at Systrap.

In the meantime, we would like gVisor users to try it out and give us
feedback! If you run gVisor using ptrace today (either by specifying
`--platform ptrace` or by not specifying the `--platform` flag at all), or if
you use the KVM platform with nested virtualization, switching to Systrap
should be a drop-in performance upgrade. All you have to do is specify
`--platform systrap` to runsc. If you encounter any issues, please let us know
at [gvisor.dev/issues](https://github.com/google/gvisor/issues).

<br>
<br>

--------------------------------------------------------------------------------

<!-- mdformat off(Footnotes need to be separated by linebreaks to be rendered) -->

[^1]: Even if the sandbox itself is compromised, it will still be bound by
    several defense-in-depth layers, including a restricted set of seccomp
    filters. You can find more details here:
    [https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/](https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/).

[^2]: Once the system call has been intercepted by gVisor (or, in the case of
    Linux, once the process has entered kernel mode), actually executing the
    `getpid` system call itself is very fast, so this benchmark effectively
    measures single-thread syscall-interception overhead.

[^3]: Seccomp filters are known to have a “not insubstantial” overhead:
    [https://lwn.net/Articles/656307/](https://lwn.net/Articles/656307/).

[^4]: This optimization applies to the x86_64 architecture; ARM does not have
    it as of the time of writing.

<!-- mdformat on -->