     1  # Running gVisor in Production at Scale in Ant
     2  
     3  > This post was contributed by [Ant Group](https://www.antgroup.com/), a
     4  > large-scale digital payment platform. Jianfeng and Yong are engineers at Ant
     5  > Group working on infrastructure systems, and contributors to gVisor.
     6  
At Ant Group, we are committed to keeping online transactions safe and
efficient. Continuously improving our defenses against potential system-level
attacks is one of many such measures. As a container runtime, gVisor provides
container-native security without sacrificing resource efficiency, so it has
been on our radar since it was released.
    12  
However, there have been performance concerns raised by members of
[academia](https://www.usenix.org/system/files/hotcloud19-paper-young.pdf) and
[industry](https://news.ycombinator.com/item?id=19924036). Users of gVisor tend
to accept the extra overhead as the price of security, but we agree that
[security is no excuse for poor performance (See Chapter 6!)](https://sel4.systems/About/seL4-whitepaper.pdf).
    18  
In this article, we will present how we identified bottlenecks in gVisor and
unblocked large-scale production adoption. Our main focus is the CPU
utilization and latency overhead gVisor brings. A small memory footprint is
also a valued goal, but it is not discussed in this blog. As a result of these
efforts and community improvements, 70% of our applications running on runsc
have <1% overhead; another 25% have <3% overhead. Some of our most valued
applications were the focus of our optimization, and they get even better
performance compared with runc.
    27  
The rest of this blog is organized as follows:

*   First, we analyze the cost of different syscall paths in gVisor.
*   Then, we propose a way to profile the whole picture of a running instance
    and find out whether any slow syscall paths are being hit. Some invisible
    overhead in the Go runtime is also discussed.
*   Finally, we give a short summary on performance optimization, together
    with some other factors in production adoption.
    33  
For convenience of discussion, we are targeting KVM-based (i.e.,
hypervisor-based) platforms, unless explicitly stated otherwise.
    36  
    37  ## Cost of different syscall paths
    38  
    39  [Defense-in-depth](../../../../2019/11/18/gvisor-security-basics-part-1/#defense-in-depth)
is the key design principle of gVisor. In gVisor, different syscalls take
different paths, leading to costs that differ by orders of magnitude in latency
and CPU consumption. Here are the syscall paths in gVisor.
    43  
    44  ![Figure 1](/assets/images/2021-12-02-syscall-figure1.png "Sentry syscall paths.")
    45  
    46  ### Path 1: User-space vDSO
    47  
    48  Sentry provides a
    49  [vDSO library](https://github.com/google/gvisor/tree/master/vdso) for its
sandboxed processes. Several syscalls are short-circuited and implemented in
user space. These syscalls cost almost the same as on native Linux. Note,
however, that the vDSO library is only partially implemented. We once noticed
that some [syscalls](https://github.com/google/gvisor/issues/3101) in our
environment were not terminated in user space as expected. We added some
implementations to the vDSO, and aim to push these improvements upstream when
possible.
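
To see the difference between Path 1 and the intercepted paths described below
on your own setup, a micro-benchmark like the rough sketch below (ours, not
part of gVisor) can be run both on the host and inside a runsc container:
`time.Now()` is served by the vDSO's `clock_gettime`, while the raw
`getpid(2)` has to be intercepted.

```
// vdso_bench.go: a rough sketch (not part of gVisor) comparing a syscall that
// Sentry's vDSO serves in user space (clock_gettime, used by time.Now) with a
// raw syscall that must be intercepted (getpid). Run it on the host and inside
// a runsc container and compare the numbers.
package main

import (
	"fmt"
	"syscall"
	"time"
)

func bench(name string, n int, f func()) {
	start := time.Now()
	for i := 0; i < n; i++ {
		f()
	}
	fmt.Printf("%-14s %6.0f ns/op\n", name,
		float64(time.Since(start).Nanoseconds())/float64(n))
}

func main() {
	const n = 1000000
	// Served by the vDSO: stays in user space (Path 1).
	bench("clock_gettime", n, func() { _ = time.Now() })
	// Raw syscall: intercepted by Sentry (Path 2).
	bench("getpid", n, func() { syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0) })
}
```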
    56  
    57  ### Path 2: Sentry contained
    58  
Most syscalls, e.g., <code>clone(2)</code>, are implemented in Sentry. They
provide the basic abstractions of an operating system, such as process/thread
lifecycle, scheduling, IPC, memory management, etc. These syscalls, and all of
the paths below, suffer from the structural cost of syscall interception: about
800ns, while a native syscall costs about 70ns. We'll dig into this further
below. Syscalls of this kind take several microseconds, which is competitive
with the corresponding native Linux syscalls.
    66  
    67  ### Path 3: Host-kernel involved
    68  
Some resource-related syscalls, e.g., read/write, are redirected into the host
kernel. Note that gVisor never passes application syscalls directly through to
the host kernel, for functional and security reasons. So compared to native
Linux, the time spent in Sentry is extra overhead. Another overhead is the way
a host kernel syscall is issued. Let's use the KVM platform on x86_64 as an
example. After Sentry issues the syscall instruction, if it is in GR0, it first
goes to the syscall entrypoint defined in LSTAR, then halts to HR3 (a vmexit
happens here), exits from a signal handler, and executes the syscall
instruction again. We can save the "halt to HR3" by introducing vmcall here,
but there is still a syscall trampoline and the vmexit/vmentry overhead is not
trivial. Nevertheless, this overhead is not that significant.
    80  
For some Sentry-contained syscalls in Path 2, although the syscall semantics
are fully handled in Sentry, they may further introduce one or more unexpected
exits to the host kernel. It could be a page fault while Sentry runs, or, more
likely, a scheduling event in the Go runtime, e.g., an M going idle or being
woken up. A handy example is that <code>futex(FUTEX_WAIT)</code> and
<code>epoll_wait(2)</code> can lead to an idle M and a further futex call into
the host kernel if the scheduler does not find any runnable Gs. (See the
comments in https://go.dev/src/runtime/proc.go for further explanation of the
Go scheduler.)
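
To make this kind of exit visible, here is a small sketch (our illustration,
not Sentry code) whose alternating busy and quiet phases cause worker threads
(Ms) to park and unpark. Running it with `GODEBUG=schedtrace=1000` shows the
idle threads, and running it under `strace -f -c -e trace=futex` shows the
futex traffic they generate.

```
// sched_demo.go: make M idle/wakeup behaviour visible. Every time all
// goroutines block (the quiet phase), worker threads park on a futex; when
// work arrives again they must be woken up, which in gVisor means extra exits
// into the host kernel.
package main

import (
	"sync"
	"time"
)

func main() {
	for round := 0; round < 10; round++ {
		// Busy phase: plenty of runnable goroutines keep the Ms awake.
		var wg sync.WaitGroup
		for i := 0; i < 8; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				sum := 0
				for j := 0; j < 50000000; j++ {
					sum += j
				}
				_ = sum
			}()
		}
		wg.Wait()

		// Quiet phase: everything blocks, Ms go idle and park on futexes.
		time.Sleep(500 * time.Millisecond)
	}
}
```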
    89  
    90  ### Path 4: Gofer involved
    91  
Other I/O-related syscalls, especially security-sensitive ones, go through
another layer of protection - Gofer. Such a syscall usually involves one or
more Sentry/Gofer inter-process communications. Even with the recent
optimization that uses lisafs to supersede P9, this is still the slowest path,
and we should try our best to avoid it.
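
When the Gofer path cannot be avoided entirely, reducing the number of round
trips helps. The sketch below is a generic application-level mitigation (our
suggestion, not a gVisor API): batch many small writes into fewer large ones so
that fewer I/O requests have to cross the Sentry/Gofer boundary.

```
// batched_writes.go: coalesce small writes so fewer write requests may have
// to travel the Sentry/Gofer path (a generic sketch, not gVisor code).
package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	f, err := os.Create("/tmp/out.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Without the buffer, each WriteString below could become a separate
	// write(2); with a 64 KiB buffer most of them are coalesced into a
	// handful of large writes.
	w := bufio.NewWriterSize(f, 64<<10)
	defer w.Flush()

	for i := 0; i < 100000; i++ {
		w.WriteString("small record\n")
	}
}
```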
    97  
As shown above, some syscall paths are slow by design, and they should be
identified and reduced as much as possible. Let's hold that for the next
section, and first dig into the details of the structural and
implementation-specific costs of syscalls, because the performance of some
Sentry-contained syscalls is not good enough.
   103  
   104  ### The structural cost
   105  
The first kind of cost is comparatively stable; it is introduced by syscall
interception. It is platform-specific, depending on the way syscalls are
intercepted. Whether this cost matters also depends on the syscall rate of the
sandboxed applications.
   110  
Here are the benchmark results for the structural cost of a syscall. We got the
data on an Intel(R) Xeon(R) CPU E5-2650 v2 platform, using the
[getpid benchmark](https://github.com/google/gvisor/blob/master/test/perf/linux/getpid_benchmark.cc).
As we can see, on the KVM platform, syscall interception costs more than 10x as
much as a native Linux syscall.
   116  
   117  getpid       | benchmark (ns)
   118  ------------ | --------------
   119  Native       | 62
   120  Native-KPTI  | 236
   121  runsc-KVM    | 830
   122  runsc-ptrace | 6249
   123  
   124  \* "Native" stands for using vanilla linux kernel.
   125  
To understand the structural cost of syscall interception, we did a
[quantitative analysis](https://github.com/google/gvisor/issues/2354) on the
KVM platform. According to the analysis, the overhead mainly comes from:
   129  
1.  KPTI-like CR3 switches: to maintain the address equation of Sentry running
    in HR3 and GR0, it has to switch the CR3 register twice on each user/kernel
    switch;
   133  
2.  Platform's Switch(): Linux is very efficient here; it just switches to a
    per-thread kernel stack and calls the corresponding syscall entry function.
    In Sentry, however, each task is represented by a goroutine; before calling
    into the syscall entry functions, it needs to pop the stack to recover the
    big while loop, i.e., kernel.(*Task).run.
   139  
Can we save the structural cost of syscall interception? This cost is actually
by design. We can optimize it, for example, by avoiding allocation and map
operations in the switch process, but it cannot be eliminated.
   143  
Does the structural cost of syscall interception really matter? It depends on
the syscall rate. Most applications in our case have a syscall rate < 200K/sec.
As a rough estimate, 200K syscalls/sec at ~760ns of extra interception cost
each is about 0.15 CPU-seconds per second, i.e., a few percent of a 4-8 core
pod; and indeed, according to flame graphs (which will be described later in
this blog), we see 2~3% of samples in the switch process. Secondly, most
syscalls, except those as simple as <code>getpid(2)</code>, take several
microseconds. In proportion, that is not a significant overhead. However, if
you have an elephant RPC (which involves many DB accesses), or a service served
by a long RPC chain, this brings nontrivial overhead on latency.
   152  
   153  ### The implementation-specific cost
   154  
The other kind of cost is implementation-specific. For example, a syscall path
may involve heavy malloc operations, or use defer in a hot path (defer is much
cheaper since Go 1.14); worse, the application may trigger a long-path syscall
that involves the host kernel or Gofer.
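
The Go benchmarks below are a sketch of how we look for this kind of cost (our
own illustration, not Sentry code): save them in a `_test.go` file and run
`go test -bench=. -benchmem`. On Go versions before 1.14 the defer variant is
measurably slower, and the per-call allocation shows up both in `-benchmem` and
as `mallocgc()` in flame graphs; a `sync.Pool` keeps it out of the hot path.

```
// cost_test.go: micro-benchmarks for two implementation-specific costs often
// seen in hot syscall paths: defer (expensive before Go 1.14) and per-call
// heap allocation.
package cost

import (
	"sync"
	"testing"
)

var (
	mu   sync.Mutex
	sink []byte
	pool = sync.Pool{New: func() interface{} { return make([]byte, 4096) }}
)

func withDefer() {
	mu.Lock()
	defer mu.Unlock() // convenient, but adds per-call overhead before Go 1.14
}

func withoutDefer() {
	mu.Lock()
	mu.Unlock()
}

func BenchmarkDefer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		withDefer()
	}
}

func BenchmarkNoDefer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		withoutDefer()
	}
}

func BenchmarkAlloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = make([]byte, 4096) // escapes: one mallocgc per call
	}
}

func BenchmarkPool(b *testing.B) {
	for i := 0; i < b.N; i++ {
		buf := pool.Get().([]byte) // reuse buffers, no mallocgc in steady state
		pool.Put(buf)
	}
}
```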
   159  
When we try to optimize the gVisor runtime, we need information on the
sandboxed applications, POD configurations, and runsc internals. But most
people work as either platform engineers or application engineers, not both. So
we need an easier way to understand the whole picture.
   164  
   165  ## Performance profile of a running instance
   166  
To quickly understand the whole picture of performance, we need some way to
profile a running gVisor instance. As the gVisor sandbox process is essentially
a Go process, the standard Go tooling applies (see the sketch after the list
below):
   170  
   171  *   [Go pprof](https://golang.org/pkg/runtime/pprof/) - provides CPU and heap
   172      profile through
   173      [runsc debug subcommands](https://gvisor.dev/docs/user_guide/debugging/#profiling).
   174  *   [Go trace](https://golang.org/pkg/runtime/trace/) - provides more internal
   175      profile types like synchronization blocking and scheduler latency.
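
The runsc debug subcommands ultimately expose the standard Go runtime
profiling facilities; the standalone sketch below (ours, for illustration)
shows the minimal set of APIs behind those profiles.

```
// profile.go: a minimal illustration of the runtime APIs behind the profiles
// listed above (generic Go code, not runsc internals).
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

func main() {
	cpu, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	pprof.StartCPUProfile(cpu) // sampled via SIGPROF, 100 Hz by default

	tr, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	trace.Start(tr) // scheduler latency, blocking, GC events, ...

	time.Sleep(10 * time.Second) // the workload to be profiled goes here

	trace.Stop()
	pprof.StopCPUProfile()

	heap, _ := os.Create("heap.pprof")
	pprof.WriteHeapProfile(heap)
	// Inspect with: go tool pprof cpu.pprof (or heap.pprof),
	// and: go tool trace trace.out
}
```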
   176  
Unfortunately, the above tools only provide hot spots in Sentry, instead of the
whole picture (how much time is spent in GR3 and HR0). Also, the CPU profile
relies on the [SIGPROF signal](https://golang.org/pkg/runtime/pprof/), which
may not be accurate enough.
   181  
[perf-kvm](https://www.linux-kvm.org/page/Perf_events) cannot provide what we
need either. It can top/record/stat some information in the guest with the help
of the [--guestkallsyms] option, but it cannot analyze the call chain (which is
not supported by the host kernel; see Linux's perf_callchain_kernel).
   186  
   187  ### Perf sandbox process like a normal process
   188  
Then we turn to a nice virtual address equation in Sentry: [(GR0 VA) = (HR3
VA)]. It makes sure any pointer in HR3 can be directly used in GR0.

This equation helps solve the problem: we can profile Sentry just like a normal
HR3 process, with a little hack on KVM.
   194  
-   First, as said above, Linux does not support analyzing the guest's call
    chain. So change [is_in_guest] to pretend that it runs in host mode even
    when it's in guest mode. This can be done in
    [kvm_is_in_guest](https://github.com/torvalds/linux/blob/v4.19/arch/x86/kvm/x86.c#L6560):
   199  
   200  ```
   201  int kvm_is_in_guest(void)
   202   {
   203  -       return __this_cpu_read(current_vcpu) != NULL;
   204  +       return 0;
   205   }
   206  ```
   207  
-   Secondly, change the guest profiling process. Previously, after the PMU
    counter overflowed and triggered an NMI interrupt, the vCPU was forced to
    exit to the host and called [int $2] immediately for later recording. Now,
    instead of calling [int $2], we call **do_nmi** directly with the correct
    registers (i.e., pt_regs):
   213  
   214  ```
   215  +void (*fn_do_nmi)(struct pt_regs *, long);
   216  +
   217  +#define HIGHER_HALF_CANONICAL_ADDR 0xFFFF800000000000
   218  +
   219  +void make_pt_regs(struct kvm_vcpu *vcpu, struct pt_regs *regs)
   220  +{
   221  +       /* In Sentry GR0, we will use address among
   222  +        *   [HIGHER_HALF_CANONICAL_ADDR, 2^64-1)
   223  +        * when syscall just happens. To avoid conflicting with HR0,
   224  +        * we correct these addresses into HR3 addresses.
   225  +        */
   226  +       regs->bp = vcpu->arch.regs[VCPU_REGS_RBP] & ~HIGHER_HALF_CANONICAL_ADDR;
   227  +       regs->ip = vmcs_readl(GUEST_RIP) & ~HIGHER_HALF_CANONICAL_ADDR;
   228  +       regs->sp = vmcs_readl(GUEST_RSP) & ~HIGHER_HALF_CANONICAL_ADDR;
   229  +
   230  +       regs->flags = (vmcs_readl(GUEST_RFLAGS) & 0xFF) |
   231  +                     X86_EFLAGS_IF | 0x2;
   232  +       regs->cs = __USER_CS;
   233  +       regs->ss = __USER_DS;
   234  +}
   235  +
   236   static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
   237   {
   238          u32 exit_intr_info;
   239  @@ -8943,7 +8965,14 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
   240          /* We need to handle NMIs before interrupts are enabled */
   241          if (is_nmi(exit_intr_info)) {
   242                  kvm_before_handle_nmi(&vmx->vcpu);
   243  -               asm("int $2");
   244  +               if (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF)
   245  +                       asm("int $2");
   246  +               else {
   247  +                       struct pt_regs regs;
   248  +                       memset((void *)&regs, 0, sizeof(regs));
   249  +                       make_pt_regs(&vmx->vcpu, &regs);
   250  +                       fn_do_nmi(&regs, 0);
   251  +               }
   252                  kvm_after_handle_nmi(&vmx->vcpu);
   253          }
   254   }
   255  @@ -11881,6 +11927,10 @@ static int __init vmx_init(void)
   256                  }
   257          }
   258  
   259  +       fn_do_nmi = (void *) kallsyms_lookup_name("do_nmi");
   260  +       if (!fn_do_nmi)
   261  +               printk(KERN_ERR "kvm: lookup do_nmi fail\n");
   262  +
   263  ```
   264  
As shown above, we now properly handle samples in GR3 and in the GR0
trampoline.
   266  
   267  ### An example of profile
   268  
First, make sure runsc is compiled with symbols not stripped: `bazel build
runsc --strip=never`
   271  
As an example, run the command below inside the gVisor container to make it
busy: `stress -i 1 -c 1 -m 1`
   274  
Profile the instance with: `perf kvm --host --guest record -a -g -e cycles -G
<path/to/cgroup> -- sleep 10 >/dev/null`
   277  
Note that we still need to profile the instance with 'perf kvm' and '--guest',
because kvm-intel requires this to keep the PMU hardware events enabled in
guest mode.
   280  
Then generate a flame graph using
[Brendan Gregg's tool](https://github.com/brendangregg/FlameGraph); we got this
[flame graph](https://raw.githubusercontent.com/zhuangel/gvisor/zhuangel_blog/website/blog/blog-kvm-stress.svg).
   284  
   285  Let's roughly divide it to differentiate GR3 and GR0 like this:
   286  
   287  ![Figure 2](/assets/images/2021-12-02-flamegraph-figure2.png "Flamegraph of stress.")
   288  
   289  ### Optimize based on flame graphs
   290  
   291  Now we can get clear information like:
   292  
1.  The bottleneck syscall(s): the above flame graph shows that
    <code>sync(2)</code> accounts for a relatively large block of samples. If
    such syscalls cannot be avoided in user space, they are worth the time to
    optimize. Some real cases we found and optimized are: superseding
    CopyIn/CopyOut with CopyInBytes/CopyOutBytes to avoid reflection, and
    avoiding defer in some frequent syscalls, in which case you can see
    <code>deferreturn()</code> in the flame graph (not needed if you have
    already upgraded to a newer Go version). Another optimization: after we
    found in the flame graph that append writes to a shared volume spend a lot
    of time querying Gofer for the current file length, we proposed adding
    [a handle only for append write](https://github.com/google/gvisor/issues/1792).
   303  
2.  Whether GC is a real problem: we can barely see samples related to GC in
    this case. But if we do, we can further search for <code>mallocgc()</code>
    to see where heap allocation is frequent, and perform a heap profile to see
    the allocated objects. We can also consider adjusting the
    [GC percent](https://golang.org/pkg/runtime/debug/#SetGCPercent), 100% by
    default, to sacrifice memory for less CPU utilization (see the sketch after
    this list). We once found that allocating an object > 32 KB also triggers
    GC; see
    [this commit](https://github.com/google/gvisor/commit/f697d1a33e4e7cefb4164ec977c38ccc2a228099).
   312  
3.  The percentage of time spent in the GR3 app versus Sentry: we can determine
    whether it is worth continuing to optimize the runtime. If most of the
    samples are in GR3, then we had better turn to optimizing the application
    code instead.
   316  
4.  A rather large chunk of samples lies in EPT violations and
    <code>fallocate(2)</code> (into HR0). This is caused by frequent memory
    allocation and freeing. We can either optimize the application to avoid
    this, or add a memory buffer layer in memfile management to relieve it.
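
Here is the GC-percent sketch referenced in item 2 above (our illustration; for
a sandbox this would be applied to the runsc process, e.g., via the standard
GOGC environment variable, rather than in application code):

```
// gcpercent.go: trading memory for fewer GC cycles, as mentioned in item 2.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Default is 100; a larger value lets the heap grow more between
	// collections, so GC runs less often and burns fewer CPU cycles.
	old := debug.SetGCPercent(300)
	fmt.Println("previous GC percent:", old)

	// ... run the allocation-heavy workload here ...

	var stats runtime.MemStats
	runtime.ReadMemStats(&stats)
	fmt.Println("completed GC cycles:", stats.NumGC)
}
```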
   321  
As a short summary, we now have a tool to get a visual picture of what's going
on in a running gVisor instance. Unfortunately, we cannot get the details of
the application processes in the above flame graph because of the semantic gap.
To get a flame graph of the application processes, we have prototyped a way in
Sentry; hopefully, we'll discuss it in a later blog.
   327  
Such a visual approach is very helpful when we try to optimize a new
application on gVisor. However, there's another kind of overhead that stays
invisible, like "dark matter".
   330  
   331  ## Invisible overhead in Go runtime
   332  
Sentry inherits the timer, scheduler, channels, and heap allocator of the Go
runtime. While this saves a lot of code when building a kernel, it also
introduces some unpleasant overhead. The Go runtime, after all, is designed and
massively used for general-purpose Go applications. When it is used as a part
of, or the basis of, a kernel, we must be very careful with the implementation
and overhead of this syntactic sugar.
   339  
Unfortunately, we did not find a universal method to identify this kind of
overhead. The only way seems to be getting your hands dirty with the Go
runtime. We'll show some examples from our use case.
   343  
   344  ### Timer
   345  
It's known that Go timers (before Go 1.14) suffer from
[lock contention and context switches](https://github.com/golang/go/issues/27707).
What's worse, statistics of Sentry syscalls show that a lot of
<code>futex()</code> calls are introduced by the timers (64 timer buckets), and
the fact that Sentry syscalls walk a much longer path (redpill) makes it worse.
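
A small sketch (ours) makes this easy to see: the program below keeps 64
goroutines arming and firing short timers. Run it under
`strace -f -c -e trace=futex` on the host, or watch the Sentry process while
the same pattern runs inside gVisor, and compare the futex counts with and
without the timers.

```
// timer_futex.go: generate timer churn so the futex traffic caused by the
// (pre-Go-1.14) timer buckets becomes visible in strace or in a flame graph.
package main

import (
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				// Each iteration arms a timer and blocks until it fires.
				<-time.After(time.Millisecond)
			}
		}()
	}
	wg.Wait()
}
```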
   351  
We made two optimizations here: 1. decrease the number of timer buckets from 64
to 4; 2. decrease the timer precision from ns to ms. You may worry about the
loss of timer precision, but as far as we have seen, most applications are
event-based and are not affected by a coarse-grained timer.
   356  
However, Go changed the timer implementation in 1.14; how to port this
optimization remains an open question.
   359  
   360  ### Scheduler
   361  
gVisor introduces an extra level of scheduling on top of the host Linux
scheduler (usually CFS). An L2 scheduler can have a positive impact, as it
saves the heavy context switches of the L1 scheduler. Two-level scheduling
appears in many places, for example, coroutines, virtual machines, etc.
   366  
gVisor reuses Go's work-stealing scheduler, which was originally designed for
goroutines, as the L2 scheduler. They share the same goal:
   369  
   370  "We need to balance between keeping enough running worker threads to utilize
   371  available hardware parallelism and parking excessive running worker threads to
   372  conserve CPU resources and power." -- From
   373  [Go scheduler code](https://golang.org/src/runtime/proc.go).
   374  
If not properly tuned, the L2 scheduler may leak scheduling pressure to the L1
scheduler. According to Go's G-P-M model, the parallelism is closely related to
the GOMAXPROCS limit. Upstream gVisor by default uses the number of host cores,
which leads to a lot of wasted M wake/stop operations. By properly configuring
GOMAXPROCS for PODs of 4/8/16 cores, we find we can save some CPU cycles
without worsening workload latency.
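
A minimal sketch of this tuning (our illustration; in production we configure
the sandbox process itself, and the `POD_CPU_LIMIT` variable below is just a
hypothetical way to pass in the pod's CPU quota):

```
// gomaxprocs.go: cap the Go scheduler's parallelism at the pod's CPU quota
// instead of the number of host cores. POD_CPU_LIMIT is a hypothetical
// environment variable used for illustration.
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

func main() {
	if v, err := strconv.Atoi(os.Getenv("POD_CPU_LIMIT")); err == nil && v > 0 {
		runtime.GOMAXPROCS(v) // default is the number of host cores
	}
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```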
   381  
To further restrict extra M wake/stop operations, before wakep() we calculate
the number of running Gs and running Ps to decide whether it is necessary to
wake an M. And we find it's better to steal from the longest local run queue
first, compared with the previous random-sequential way. Another related
optimization: we find most applications come back to Sentry very soon, so it's
not necessary to hand off a task's P when it leaves for user space and to find
an idle P when it gets back.
   389  
Some of our optimizations to Go are published
[here](https://github.com/zhuangel/go/tree/go1.13.4.blog). What we learned from
the optimization process of gVisor is to dig into the Go runtime to understand
what's going on there. And it's normal that some ideas work while others fail.
   394  
   395  ## Summary
   396  
We introduced how we profiled gVisor for production-ready performance. Using
this methodology, along with some other aggressive measures, we finally got
gVisor running with acceptable overhead, and even better than runc for some
workloads. We also absorbed a lot of optimization progress from the community,
e.g., VFS2.
   402  
So far, we have deployed more than 100K gVisor instances in our production
environment, and they have supported the transactions of
[Singles' Day Global Shopping Festivals](https://en.wikipedia.org/wiki/Singles%27_Day)
very well.
   406  
Along with performance, there are also some other important aspects of
production adoption. For example, generating a core dump after a Sentry panic
is helpful for debugging, and a coverage tool is necessary to make sure new
changes are properly covered by test cases. We'll leave these topics for later
discussions.