
# Releasing Systrap - A high-performance gVisor platform

We are releasing a new gVisor platform: Systrap. Like the existing ptrace
platform, Systrap runs on most Linux machines out of the box without
virtualization. Unlike the ptrace platform, it’s fast 🚀. Go try it by adding
`--platform=systrap` to the runsc flags. If you want to know more about it, read
on.

--------------------------------------------------------------------------------

gVisor is a security boundary for arbitrary Linux processes. Boundaries do not
come for free, and gVisor imposes some performance overhead on sandboxed
applications. One of the most fundamental performance challenges with the
security model implemented by gVisor is system call interception, which is the
focus of this post.

To recap the
[security model](https://gvisor.dev/docs/architecture_guide/security/#what-can-a-sandbox-do):
gVisor is an application kernel that implements the Linux ABI. This includes
system calls, signals, memory management, and more. For example, when a
sandboxed application calls
[`read(2)`](https://man7.org/linux/man-pages/man2/read.2.html), it actually
transparently calls into
[gVisor's implementation of this system call](https://github.com/google/gvisor/blob/44e2d0fcfeb641f3b8013c3f93cacdae447cc0f1/pkg/sentry/syscalls/linux/sys_read_write.go#L36).
This minimizes the attack surface of the host kernel, because sandboxed programs
simply can’t make system calls directly to the host in the first place[^1]. This
interception happens through an internal layer called the Platform interface,
which we have written about in a previous
[blog post](https://gvisor.dev/blog/2020/10/22/platform-portability/). To handle
these interceptions, this interface must also create new address spaces,
allocate memory, and create execution contexts to run the workload.

gVisor had two platform implementations: KVM and ptrace. The KVM platform uses
the kernel’s KVM functionality to allow the Sentry to act as both guest OS and
VMM (virtual machine monitor). It does system call interception just like a
normal virtual machine would. This gives good performance when using bare-metal
virtualization, but has a noticeable impact with nested virtualization. The
other obvious downside is that it requires support for nested virtualization in
the first place, which not all hardware supports (ARM CPUs, for example) and
which is unavailable in some cloud environments.

The ptrace platform was the alternative wherever KVM was not available. It works
through the
[`PTRACE_SYSEMU`](http://man7.org/linux/man-pages/man2/ptrace.2.html) action,
which makes the user process hand back execution to the sentry whenever it
encounters a system call. This is a clean method to achieve system call
interception in any environment, virtualized or not, except that it’s quite
slow. To see just how slow, an unrealistic but highly illustrative benchmark is
the
[`getpid` benchmark](https://github.com/google/gvisor/blob/108410638aa8480e82933870ba8279133f543d2b/test/perf/linux/getpid_benchmark.cc)[^2].
This benchmark runs the
[`getpid(2)`](https://man7.org/linux/man-pages/man2/getpid.2.html) system call
in a tight `while` loop. No useful application behaves this way, so it is not a
realistic benchmark, but it is well suited to measuring system call latency.

![Figure 1](/assets/images/2023-04-28-getpid-ptrace-vs-native.svg "Getpid benchmark: ptrace vs. native Linux."){:width="100%"}

All `getpid` runs have been performed on a GCE n2-standard-4 VM, with the
`debian-11-bullseye-v20230306` image.

While this benchmark is not representative of most real-world workloads, just
about any workload will suffer from high system call overhead. Since running in
a virtualized environment is the default state for most cloud users these days,
it's important that gVisor performs well in this context. Systrap is the new
platform targeting this important use case.

Systrap relies on multiple techniques to implement the Platform interface. Like
the ptrace platform, Systrap uses Linux's ptrace subsystem to initialize
workload executor threads, which are started as child processes of the main
gVisor sentry process. Systrap additionally sets a very restrictive seccomp
filter, installs a custom signal handler, and allocates chunks of memory shared
between user threads and the runsc sentry. This shared memory is what serves as
the main form of communication between the sentry and sandboxed programs:
whenever the sandboxed process attempts to execute a system call, it triggers a
`SIGSYS` signal, which is caught by our signal handler. The signal handler in
turn populates the shared memory region and asks the sentry to handle the
requested system call. This alone proved to be faster than using
`PTRACE_SYSEMU`, as demonstrated by the `getpid` benchmark:

![Figure 2](/assets/images/2023-04-28-getpid-ptrace-vs-systrap-unoptimized.svg "Getpid benchmark: ptrace vs. Systrap."){:width="100%"}

Can we make it even faster? Recall what the main purpose of our signal handler
is: to send a request to the sentry via shared memory. To do that, the sandboxed
process must first incur the overhead of executing the seccomp filter[^3], and
then generate a full signal stack before being able to run the signal handler.
What if there were a way to simply have the sandboxed process jump to another
user-space function when it wanted to perform a system call? Well, it turns out,
there is[^4]. There is a popular x86 instruction pattern that’s used to perform
system calls, and it goes a little something like this: **`mov sysno, %eax;
syscall`**. The size of the `mov` instruction is 5 bytes and the size of the
`syscall` instruction is 2 bytes. Luckily, this is just enough space to fit a
**`jmp *%gs:offset`** instruction. When the signal handler sees this instruction
pattern, it signals to the sentry that the original instructions can be replaced
with a **`jmp`** to trampoline code that performs the same function as the
regular `SIGSYS` signal handler. The system call number is not lost, but rather
encoded in the offset. The results are even more impressive:

![Figure 3](/assets/images/2023-04-28-getpid-ptrace-vs-systrap-opt.svg "Getpid benchmark: ptrace vs. Optimized Systrap."){:width="100%"}

As mentioned, the `getpid` benchmark is not representative of real-world
performance. To get a better picture of the magnitude of improvement, here are
some real-world workloads:

*   The
    [Build ABSL benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/fs/bazel_test.go)
    measures compilation performance by compiling
    [abseil.io](https://abseil.io/); this is a highly system-call-dependent
    workload because it performs a lot of filesystem I/O (gVisor’s filesystem
    overhead also depends on the file system isolation it implements, which you
    can learn about
    [here](https://gvisor.dev/docs/user_guide/filesystem/)).
*   The
    [ffmpeg benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/media/ffmpeg_test.go)
    runs a multimedia processing tool to perform, for example, video stream
    encoding and decoding; this workload makes relatively few system calls, so
    there are very few userspace-to-kernel mode switches.
*   The
    [Tensorflow benchmark](https://github.com/google/gvisor/blob/master/test/benchmarks/ml/tensorflow_test.go)
    trains a variety of machine learning models on CPU; the system call usage of
    this workload sits between compilation and ffmpeg, due to needing to
    retrieve training and validation data, but the majority of time is still
    spent running userspace computations.
*   Finally, the Redis benchmark performs SET RPC calls with 5 concurrent
    clients, measures the latency that each call takes to execute, and reports
    the median (scaled by 250,000 to fit the graph's axis); this workload is
    heavily bounded by system call performance due to high network stack usage.

![Figure 4](/assets/images/2023-04-28-systrap-sample-workloads.svg "Comparison of sample workloads running on ptrace, Systrap, and native Linux."){:width="100%"}

Systrap will replace the ptrace platform and become the default by September
2023. Until then, we are working hard to make it production-ready, which
includes additional performance and stability improvements, and maintaining a
high bar for security through fuzz testing targeted specifically at Systrap.

In the meantime, we would like gVisor users to try it out and give us feedback!
If you run gVisor using ptrace today (either by specifying `--platform ptrace`
or by not specifying the `--platform` flag at all), or you use the KVM platform
with nested virtualization, switching to Systrap should be a drop-in performance
upgrade. All you have to do is specify `--platform systrap` to runsc. If you
encounter any issues, please let us know at
[gvisor.dev/issues](https://github.com/google/gvisor/issues).
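
For example, if you run gVisor through Docker, one way to pass the flag
(assuming runsc is already registered as a Docker runtime; the binary path below
is illustrative) is via `/etc/docker/daemon.json`:

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--platform=systrap"]
    }
  }
}
```

Then restart the Docker daemon and run containers with `--runtime=runsc` as
usual.
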
<br>
<br>

--------------------------------------------------------------------------------

<!-- mdformat off(Footnotes need to be separated by linebreaks to be rendered) -->

[^1]: Even if the sandbox itself is compromised, it will still be bound by
    several defense-in-depth layers, including a restricted set of seccomp
    filters. You can find more details here:
    [https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/](https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/).

[^2]: Once the system call has been intercepted by gVisor (or, in the case of
    Linux, once the process has entered kernel mode), actually executing the
    `getpid` system call itself is very fast, so this benchmark effectively
    measures single-thread syscall-interception overhead.

[^3]: Seccomp filters are known to have a “not insubstantial” overhead:
    [https://lwn.net/Articles/656307/](https://lwn.net/Articles/656307/).

[^4]: This applies to the x86_64 architecture; ARM does not have this
    optimization as of the time of writing.

<!-- mdformat on -->