# Running gVisor in Production at Scale in Ant

> This post was contributed by [Ant Group](https://www.antgroup.com/), a
> large-scale digital payment platform. Jianfeng and Yong are engineers at Ant
> Group working on infrastructure systems, and contributors to gVisor.

At Ant Group, we are committed to keeping online transactions safe and
efficient. Continuously improving security against potential system-level
attacks is one of many measures we take. As a container runtime, gVisor
provides container-native security without sacrificing resource efficiency.
Therefore, it has been on our radar since it was released.

However, there have been performance concerns raised by members of
[academia](https://www.usenix.org/system/files/hotcloud19-paper-young.pdf) and
[industry](https://news.ycombinator.com/item?id=19924036). Users of gVisor tend
to bear the extra overhead as the tax of security. But we tend to agree that
[security is no excuse for poor performance (See Chapter 6!)](https://sel4.systems/About/seL4-whitepaper.pdf).

In this article, we present how we identified bottlenecks in gVisor and
unblocked large-scale production adoption. Our main focus is the CPU
utilization and latency overhead that gVisor brings. A small memory footprint
is also a valued goal, but it is not discussed in this blog. As a result of
these efforts and of community improvements, 70% of our applications running on
runsc have <1% overhead; another 25% have <3% overhead. Some of our most valued
applications are the focus of our optimization, and they get even better
performance compared with runc.

The rest of this blog is organized as follows:

*   First, we analyze the cost of the different syscall paths in gVisor.
*   Then, we propose a way to profile the whole picture of a running instance,
    to find out whether slow syscall paths are being hit. Some invisible
    overhead in the Go runtime is also discussed.
*   At last, we give a short summary on performance optimization, together with
    some other factors in production adoption.

For convenience of discussion, we are targeting KVM-based, or hypervisor-based,
platforms, unless explicitly stated.

## Cost of different syscall paths

[Defense-in-depth](../../../../2019/11/18/gvisor-security-basics-part-1/#defense-in-depth)
is the key design principle of gVisor. In gVisor, different syscalls take
different paths, which leads to different costs (orders of magnitude apart) in
latency and CPU consumption. Here are the syscall paths in gVisor.

![Figure 1](/assets/images/2021-12-02-syscall-figure1.png "Sentry syscall paths.")

### Path 1: User-space vDSO

Sentry provides a
[vDSO library](https://github.com/google/gvisor/tree/master/vdso) for its
sandboxed processes. Several syscalls are short-circuited and implemented in
user space. These syscalls cost almost as much as on native Linux. But note
that the vDSO library is only partially implemented. We once noticed that some
[syscalls](https://github.com/google/gvisor/issues/3101) in our environment
were not properly terminated in user space. We created some additional
implementations in the vDSO, and aim to push these improvements upstream when
possible.
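To make the gap between these paths concrete, here is a rough micro-benchmark
sketch of our own (it is not gVisor code): it contrasts a vDSO-served call
(<code>clock_gettime</code> via `time.Now`, assuming the Sentry vDSO serves it,
as it normally does on x86_64) with <code>getpid(2)</code>, which takes the
Sentry-contained path described next. Run it under both runc and runsc to
compare; the absolute numbers depend heavily on hardware and on the platform
(KVM vs. ptrace), so treat them as illustrative only.

```
// syscall_paths.go: rough comparison of a vDSO-served call vs. an
// intercepted syscall. Illustrative only.
package main

import (
	"fmt"
	"syscall"
	"time"
)

const iters = 1_000_000

func main() {
	// Path 1: clock_gettime is served by the vDSO, so time.Now() should
	// stay in user space and cost roughly the same as on native Linux.
	start := time.Now()
	for i := 0; i < iters; i++ {
		_ = time.Now()
	}
	vdso := time.Since(start)

	// Path 2: getpid(2) is intercepted and handled in Sentry; every call
	// pays the structural interception cost discussed below.
	start = time.Now()
	for i := 0; i < iters; i++ {
		_ = syscall.Getpid()
	}
	intercepted := time.Since(start)

	fmt.Printf("time.Now (vDSO):      %v/call\n", vdso/iters)
	fmt.Printf("getpid (intercepted): %v/call\n", intercepted/iters)
}
```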
### Path 2: Sentry contained

Most syscalls, e.g., <code>clone(2)</code>, are implemented in Sentry. They are
the basic abstractions of an operating system, such as process/thread
lifecycle, scheduling, IPC, memory management, etc. These syscalls, and all the
paths below, suffer from a structural cost of syscall interception. The
overhead is about 800ns, while that of a native syscall is about 70ns. We'll
dig into it further below. Syscalls of this kind take several microseconds,
which is competitive with the corresponding native Linux syscalls.

### Path 3: Host-kernel involved

Some resource-related syscalls, e.g., read/write, are redirected into the host
kernel. Note that gVisor never passes application syscalls directly through to
the host kernel, for functional and security reasons. So compared to native
Linux, the time spent in Sentry appears as extra overhead. Another overhead is
the way a host kernel syscall has to be issued. Let's use the KVM platform on
x86_64 as an example. After Sentry issues the syscall instruction, if it is
running in GR0 (guest ring 0), it first goes to the syscall entrypoint defined
in LSTAR, then halts to HR3 (host ring 3; a vmexit happens here), exits from a
signal handler, and executes the syscall instruction again. We can save the
"halt to HR3" step by introducing vmcall here, but there is still a syscall
trampoline, and the vmexit/vmentry overhead is not trivial. Nevertheless, this
overhead is not that significant.

For some Sentry-contained syscalls in Path 2, although the syscall semantics
are terminated in Sentry, the syscall may further introduce one or more
unexpected exits to the host kernel. It could be a page fault while Sentry
runs, and, more likely, a schedule event in the Go runtime, e.g., an M
idle/wakeup. An example at hand is that <code>futex(FUTEX_WAIT)</code> and
<code>epoll_wait(2)</code> can lead to an M going idle and a further futex call
into the host kernel if the scheduler does not find any runnable Gs. (See the
comments in https://go.dev/src/runtime/proc.go for further explanation of the
Go scheduler.)

### Path 4: Gofer involved

Other IO-related syscalls, especially security-sensitive ones, go through
another layer of protection, the Gofer. Such a syscall usually involves one or
more Sentry/Gofer inter-process communications. Even with the recent
optimization of using lisafs to supersede P9, it is still the slowest path,
which we should try our best to avoid.
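One practical way for an application to stay off the slow paths is to batch
I/O so that fewer <code>read(2)</code>/<code>write(2)</code> calls reach the
host kernel or Gofer at all. The sketch below is a generic Go illustration,
not gVisor code; the file path and the 4 KiB buffer size are arbitrary choices.

```
package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	f, err := os.Create("/tmp/out.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Unbuffered, each WriteString below would become a write(2) that
	// leaves the sandbox (host kernel and possibly Gofer involved).
	// Buffered, the writes accumulate in user space and are flushed as a
	// few large write(2) calls, so far fewer syscalls take the slow path.
	w := bufio.NewWriterSize(f, 4096)
	for i := 0; i < 1000; i++ {
		if _, err := w.WriteString("a small log line\n"); err != nil {
			log.Fatal(err)
		}
	}
	if err := w.Flush(); err != nil {
		log.Fatal(err)
	}
}
```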
As shown above, some syscall paths are slow by design, and they should be
identified and reduced as much as possible. Let's hold that for the next
section, and first dig into the details of the structural and
implementation-specific costs of syscalls, because the performance of some
Sentry-contained syscalls is not good enough.

### The structural cost

The first kind of cost is comparatively stable; it is introduced by syscall
interception. It is platform-specific, depending on the way syscalls are
intercepted. And whether this cost matters also depends on the syscall rate of
the sandboxed applications.

Here is a benchmark result on the structural cost of a syscall. We got the data
on an Intel(R) Xeon(R) CPU E5-2650 v2 platform, using the
[getpid benchmark](https://github.com/google/gvisor/blob/master/test/perf/linux/getpid_benchmark.cc).
As we can see, on the KVM platform, syscall interception costs more than 10x a
native Linux syscall.

Configuration | getpid latency (ns)
------------- | -------------------
Native        | 62
Native-KPTI   | 236
runsc-KVM     | 830
runsc-ptrace  | 6249

\* "Native" stands for a vanilla Linux kernel.

To understand the structural cost of syscall interception, we did a
[quantitative analysis](https://github.com/google/gvisor/issues/2354) on the
kvm platform. According to the analysis, the overhead mainly comes from:

1.  KPTI-like CR3 switches: to maintain the address equation of Sentry running
    in HR3 and GR0, it has to switch the CR3 register twice, on each
    user/kernel switch;

2.  Platform's Switch(): Linux is very efficient by just switching to a
    per-thread kernel stack and calling the corresponding syscall entry
    function. But in Sentry, each task is represented by a goroutine; before
    calling into syscall entry functions, it needs to pop the stack to recover
    the big while loop, i.e., kernel.(*Task).run.

Can we save the structural cost of syscall interception? This cost is there by
design. We can optimize it, for example, by avoiding allocation and map
operations in the switch process, but it cannot be eliminated.

Does the structural cost of syscall interception really matter? It depends on
the syscall rate. Most applications in our case have a syscall rate < 200K/sec,
and according to flame graphs (which will be described later in this blog), we
see 2~3% of samples in the switch path. Secondly, most syscalls, except those
as simple as <code>getpid(2)</code>, take several microseconds. In proportion,
that is not a significant overhead. However, if you have an elephant RPC (one
that involves many DB accesses), or a service served by a long chain of RPCs,
this brings nontrivial overhead on latency.

### The implementation-specific cost

The other kind of cost is implementation-specific. For example, a syscall may
involve some heavy malloc operations; or defer may be used in some frequent
syscall paths (defer is optimized in Go 1.14); what's worse, the application
process may trigger a long-path syscall with the host kernel or Gofer involved.
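As a concrete illustration of the defer case, here is a contrived sketch with
hypothetical names (it is not Sentry code); the pattern mainly matters for Go
versions before 1.14, where defer carried a fixed per-call cost.

```
// Package hotpath is an illustrative example only, not gVisor code.
package hotpath

import "sync"

// Registry is a stand-in for any lock-protected structure that is touched on
// every syscall.
type Registry struct {
	mu sync.Mutex
	m  map[int32]int64
}

// LookupDeferred uses defer. Before Go 1.14, defer added a fixed per-call
// cost, which shows up as runtime.deferreturn in flame graphs of frequent
// syscalls.
func (r *Registry) LookupDeferred(k int32) int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.m[k]
}

// Lookup unlocks explicitly. It is slightly more error-prone to maintain, but
// it avoids the defer overhead on a path that can run hundreds of thousands
// of times per second.
func (r *Registry) Lookup(k int32) int64 {
	r.mu.Lock()
	v := r.m[k]
	r.mu.Unlock()
	return v
}
```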
When we try to optimize the gVisor runtime, we need information on the
sandboxed applications, the POD configurations, and runsc internals. But most
people work either as platform engineers or as application engineers. So we
need an easier way to understand the whole picture.

## Performance profile of a running instance

To quickly understand the whole picture of performance, we need some way to
profile a running gVisor instance. As the gVisor sandbox process is essentially
a Go process, Go pprof is an existing option:

*   [Go pprof](https://golang.org/pkg/runtime/pprof/) - provides CPU and heap
    profiles through
    [runsc debug subcommands](https://gvisor.dev/docs/user_guide/debugging/#profiling).
*   [Go trace](https://golang.org/pkg/runtime/trace/) - provides more internal
    profile types like synchronization blocking and scheduler latency.

Unfortunately, the above tools only provide hot spots in Sentry, instead of the
whole picture (how much time is spent in GR3 and HR0). And the CPU profile
relies on the [SIGPROF signal](https://golang.org/pkg/runtime/pprof/), which
may not be accurate enough.

[perf-kvm](https://www.linux-kvm.org/page/Perf_events) cannot provide what we
need either. It may help to top/record/stat some information in the guest with
the help of the option [--guestkallsyms], but it cannot analyze the call chain
(which is not supported in the host kernel; see Linux's
perf_callchain_kernel).

### Perf sandbox process like a normal process

Then we turn to a nice virtual address equation in Sentry: [(GR0 VA) = (HR3
VA)]. It exists to make sure any pointer in HR3 can be directly used in GR0.

The equation is helpful here: it means we can profile Sentry just like a
normal HR3 process, with a little hack on kvm.

-   First, as said above, Linux does not support analyzing the call chain of
    the guest. So change [is_in_guest] to pretend that it runs in host mode
    even when it is in guest mode. This can be done in
    [kvm_is_in_guest](https://github.com/torvalds/linux/blob/v4.19/arch/x86/kvm/x86.c#L6560):

    ```
    int kvm_is_in_guest(void)
    {
    -        return __this_cpu_read(current_vcpu) != NULL;
    +        return 0;
    }
    ```

-   Secondly, change the handling of guest profiling. Previously, after the
    PMU counter overflows and triggers an NMI interrupt, the vCPU is forced to
    exit to the host, and [int $2] is called immediately for later recording.
    Now, instead of calling [int $2], we call **do_nmi** directly with correct
    registers (i.e., pt_regs):

    ```
    +void (*fn_do_nmi)(struct pt_regs *, long);
    +
    +#define HIGHER_HALF_CANONICAL_ADDR 0xFFFF800000000000
    +
    +void make_pt_regs(struct kvm_vcpu *vcpu, struct pt_regs *regs)
    +{
    +        /* In Sentry GR0, we will use address among
    +         * [HIGHER_HALF_CANONICAL_ADDR, 2^64-1)
    +         * when syscall just happens. To avoid conflicting with HR0,
    +         * we correct these addresses into HR3 addresses.
    +         */
    +        regs->bp = vcpu->arch.regs[VCPU_REGS_RBP] & ~HIGHER_HALF_CANONICAL_ADDR;
    +        regs->ip = vmcs_readl(GUEST_RIP) & ~HIGHER_HALF_CANONICAL_ADDR;
    +        regs->sp = vmcs_readl(GUEST_RSP) & ~HIGHER_HALF_CANONICAL_ADDR;
    +
    +        regs->flags = (vmcs_readl(GUEST_RFLAGS) & 0xFF) |
    +                X86_EFLAGS_IF | 0x2;
    +        regs->cs = __USER_CS;
    +        regs->ss = __USER_DS;
    +}
    +
     static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
     {
             u32 exit_intr_info;
    @@ -8943,7 +8965,14 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
             /* We need to handle NMIs before interrupts are enabled */
             if (is_nmi(exit_intr_info)) {
                     kvm_before_handle_nmi(&vmx->vcpu);
    -                asm("int $2");
    +                if (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF)
    +                        asm("int $2");
    +                else {
    +                        struct pt_regs regs;
    +                        memset((void *)&regs, 0, sizeof(regs));
    +                        make_pt_regs(&vmx->vcpu, &regs);
    +                        fn_do_nmi(&regs, 0);
    +                }
                     kvm_after_handle_nmi(&vmx->vcpu);
             }
     }
    @@ -11881,6 +11927,10 @@ static int __init vmx_init(void)
             }
         }

    +        fn_do_nmi = (void *) kallsyms_lookup_name("do_nmi");
    +        if (!fn_do_nmi)
    +                printk(KERN_ERR "kvm: lookup do_nmi fail\n");
    +
    ```

As shown above, we now properly handle samples in GR3 and in the GR0
trampoline.
### An example of profile

First, make sure to compile runsc with symbols not stripped: `bazel build
runsc --strip=never`

As an example, run the command below inside the gVisor container to make it
busy: `stress -i 1 -c 1 -m 1`

Perf the instance with the command: `perf kvm --host --guest record -a -g -e
cycles -G <path/to/cgroup> -- sleep 10 >/dev/null`

Note that we still need to perf the instance with `perf kvm` and `--guest`,
because kvm-intel requires this to keep the PMU hardware event enabled in
guest mode.

Then generate a flame graph using
[Brendan's tool](https://github.com/brendangregg/FlameGraph), and we get this
[flame graph](https://raw.githubusercontent.com/zhuangel/gvisor/zhuangel_blog/website/blog/blog-kvm-stress.svg).

Let's roughly divide it to differentiate GR3 and GR0 like this:

![Figure 2](/assets/images/2021-12-02-flamegraph-figure2.png "Flamegraph of stress.")

### Optimize based on flame graphs

Now we can get clear information like:

1.  The bottleneck syscall(s): the above flame graph shows that
    <code>sync(2)</code> is a relatively large block of samples. If we cannot
    avoid such syscalls in user space, they are worth the time to optimize.
    Some real cases we found and optimized are: superseding CopyIn/CopyOut
    with CopyInBytes/CopyOutBytes to avoid reflection; avoiding defer in some
    frequent syscalls, in which case you can see <code>deferreturn()</code> in
    the flame graph (not needed if you have already upgraded to a newer Go
    version). Another optimization: after finding in the flame graph that
    append writes to a shared volume spend a lot of time querying Gofer for
    the current file length, we proposed adding
    [a handle only for append write](https://github.com/google/gvisor/issues/1792).

2.  Whether GC is a real problem: we can barely see samples related to GC in
    this case. But if we do, we can further search for <code>mallocgc()</code>
    to see where heap allocation is frequent. We can perform a heap profile to
    see the allocated objects. And we can consider adjusting the
    [GC percent](https://golang.org/pkg/runtime/debug/#SetGCPercent), 100% by
    default, to sacrifice memory for less CPU utilization (see the sketch
    after this list). We once found that allocating an object > 32 KB also
    triggers GC, referring to
    [this](https://github.com/google/gvisor/commit/f697d1a33e4e7cefb4164ec977c38ccc2a228099).

3.  Percentage of time spent in the GR3 app and in Sentry: we can determine
    whether it is worth continuing the optimization. If most of the samples
    are in GR3, we had better turn to optimizing the application code instead.

4.  A rather large chunk of samples lies in EPT violations and
    <code>fallocate(2)</code> (into HR0). This is caused by frequent memory
    allocation and freeing. We can either optimize the application to avoid
    this, or add a memory buffer layer in memfile management to relieve it.
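To make point 2 above concrete, here is a small, generic Go sketch (not Sentry
code) of the two knobs mentioned: raising the GC percent to trade memory for
CPU, and reusing large buffers so frequent > 32 KB allocations do not keep
hitting the allocator and GC. The 200% value and the 64 KiB buffer size are
arbitrary examples.

```
package main

import (
	"runtime/debug"
	"sync"
)

// bufPool reuses 64 KiB buffers so a hot path does not repeatedly allocate
// large objects, which would otherwise add GC pressure.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 64<<10) },
}

func handle(process func([]byte)) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	process(buf)
}

func main() {
	// The default is 100: GC runs when the heap doubles. A larger value
	// trades memory footprint for fewer GC cycles and less CPU spent in
	// mallocgc and the GC itself.
	debug.SetGCPercent(200)

	handle(func(b []byte) {
		// ... fill and use the buffer ...
		_ = b
	})
}
```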
As a short summary, we now have a tool to get a visible graph of what is going
on in a running gVisor instance. Unfortunately, we cannot get the details of
the application processes in the above flame graph because of the semantic
gap. To get a flame graph of the application processes, we have prototyped a
way in Sentry. Hopefully, we'll discuss it in a later blog.

A visual approach is very helpful when we try to optimize a new application on
gVisor. However, there is another kind of overhead, invisible like "dark
matter".

## Invisible overhead in Go runtime

Sentry inherits the timer, scheduler, channels, and heap allocator of the Go
runtime. While this saves a lot of the code needed to build a kernel, it also
introduces some unpleasant overhead. The Go runtime, after all, is designed
and massively used for general-purpose Go applications. When it is used as
part of, or the basis of, a kernel, we have to be very careful about the
implementation and overhead of this syntactic sugar.

Unfortunately, we did not find a universal method to identify this kind of
overhead. The only way seems to be getting your hands dirty with the Go
runtime. We'll show some examples from our use case.

### Timer

It's known that the Go timer (before 1.14) suffers from
[lock contention and context switches](https://github.com/golang/go/issues/27707).
What's worse, statistics of Sentry syscalls show that a lot of
<code>futex()</code> calls are introduced by the timers (64 timer buckets),
and the fact that Sentry syscalls walk a much longer path (redpill) makes it
even more costly.

We made two optimizations here: 1. decrease the number of timer buckets, from
64 to 4; 2. decrease the timer precision from ns to ms. You may worry about
the decrease in timer precision, but as we see it, most of our applications
are event-based and are not affected by a coarse-grained timer.

However, Go changed the implementation of the timer in 1.14; how to port this
optimization remains an open question.

### Scheduler

gVisor introduces an extra level of scheduling alongside the host Linux
scheduler (usually CFS). An L2 scheduler sometimes brings a positive impact,
as it saves the heavy context switches of the L1 scheduler. We can find many
two-level scheduler cases, for example, coroutines, virtual machines, etc.

gVisor reuses Go's work-stealing scheduler, which was originally designed for
goroutines, as the L2 scheduler. They share the same goal:

"We need to balance between keeping enough running worker threads to utilize
available hardware parallelism and parking excessive running worker threads to
conserve CPU resources and power." -- From
[Go scheduler code](https://golang.org/src/runtime/proc.go).

If not properly tuned, the L2 scheduler may leak schedule pressure to the L1
scheduler. According to the G-P-M model of Go, the parallelism is closely
related to the GOMAXPROCS limit. Upstream gVisor by default uses the number of
host cores, which leads to a lot of wasted M wake/stops. By properly
configuring GOMAXPROCS for PODs of 4/8/16 cores, we find it can save some CPU
cycles without worsening workload latency.

To further restrict extra M wake/stops, before wakep() we calculate the number
of running Gs and the number of running Ps to decide whether it is necessary
to wake an M (see the sketch below). We also find it is better to steal first
from the longest local run queue, compared to the previous random-sequential
way. Another related optimization: most applications get back to Sentry very
soon, so it is not necessary to hand off a task's P when it leaves for user
space and to find an idle P again when it gets back.
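Stripped of runtime details, the wake-up gating idea is simply: do not wake
another M unless there is more runnable work than Ps currently running it. The
snippet below is a stand-alone restatement of that heuristic with hypothetical
counters; the real change is a patch to wakep() inside Go's runtime/proc.go
and has to respect the scheduler's own atomics and invariants.

```
// Illustrative only: a stand-alone restatement of the wake-up gating
// heuristic, not actual Go runtime code.
package sched

// shouldWakeM reports whether it is worth waking (or creating) another worker
// thread M. All three counters are hypothetical snapshots of scheduler state.
func shouldWakeM(runnableGs, runningPs, spinningMs int) bool {
	// If a spinning M is already looking for work, let it pick up the new
	// G instead of waking one more thread.
	if spinningMs > 0 {
		return false
	}
	// Only wake another M when there are more runnable Gs than Ps currently
	// executing them; otherwise an existing P will reach the new G soon, and
	// the futex wake/sleep round trip into the host kernel is wasted.
	return runnableGs > runningPs
}
```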
Some of our optimizations to Go are published
[here](https://github.com/zhuangel/go/tree/go1.13.4.blog). What we learned from
the optimization process of gVisor is to dig into the Go runtime and understand
what is really going on there. And it is normal that some ideas work while
others fail.

## Summary

We introduced how we profiled gVisor for production-ready performance. Using
this methodology, along with some other aggressive measures, we finally got
gVisor running with an acceptable overhead, and even better than runc on some
workloads. We also absorbed a lot of optimization progress from the community,
e.g., VFS2.

So far, we have deployed more than 100K gVisor instances in the production
environment, and they have supported the transactions of the
[Singles' Day Global Shopping Festivals](https://en.wikipedia.org/wiki/Singles%27_Day)
very well.

Along with performance, there are also some other important aspects of
production adoption. For example, generating a core dump after a Sentry panic
is helpful for debugging; a coverage tool is necessary to make sure new changes
are properly covered by test cases. We'll leave these topics to later
discussions.