# `runtime.DedicateOSThread`

Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that
this may be difficult to support in the Go runtime due to issues with GC.

## Summary

Allow goroutines to bind to kernel threads in a way that allows their scheduling
to be kernel-managed rather than runtime-managed.

## Objectives

*   Reduce Go runtime overhead in the gVisor sentry (#2184).

*   Minimize intrusiveness of changes to the Go runtime.

## Background

In Go, execution contexts are referred to as goroutines, which the runtime calls
Gs. The Go runtime maintains a variably-sized pool of threads (called Ms by the
runtime) on which Gs are executed, as well as a pool of "virtual processors"
(called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`. Usually,
each M requires a P in order to execute Gs, limiting the number of concurrently
executing goroutines to `runtime.GOMAXPROCS()`.

The `runtime.LockOSThread` function temporarily locks the invoking goroutine to
its current thread. It is primarily useful for interacting with OS or non-Go
library facilities that are per-thread. It does not reduce interactions with the
Go runtime scheduler: locked Ms relinquish their P when they become blocked, and
only continue execution after another M "chooses" their locked G to run and
donates their P to the locked M instead.
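
For context, a typical use of `runtime.LockOSThread` today looks like the
sketch below: it pins the calling goroutine to its thread for the duration of a
per-thread interaction with the kernel (here, a `ptrace` attach/detach
sequence). The helper name and choice of syscalls are illustrative only, not
taken from the sentry.

```go
package example

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// traceTarget sketches the existing LockOSThread pattern: ptrace requires that
// every request for a given tracee come from the same OS thread, so the
// goroutine must stay wired to its thread for the whole sequence.
func traceTarget(pid int) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := unix.PtraceAttach(pid); err != nil {
		return err
	}
	// Wait for the tracee to stop before issuing further requests.
	var status unix.WaitStatus
	if _, err := unix.Wait4(pid, &status, 0, nil); err != nil {
		return err
	}
	// ... further ptrace requests would be issued from this same thread ...
	return unix.PtraceDetach(pid)
}
```

Even with the goroutine locked, the scheduler handoff described above still
applies whenever `traceTarget` blocks, which is the kind of overhead discussed
under "Problems" below.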

## Problems

### Context Switch Overhead

Most goroutines in the gVisor sentry are task goroutines, which back application
threads. Task goroutines spend large amounts of time blocked on syscalls that
execute untrusted application code. When making such a syscall (the specific
syscall varies by gVisor platform), the task goroutine may interact with the Go
runtime in one of three ways:

*   It can invoke the syscall without informing the runtime. In this case, the
    task goroutine will continue to hold its P during the syscall, limiting the
    number of application threads that can run concurrently to
    `runtime.GOMAXPROCS()`. This is problematic because the Go runtime scheduler
    is known to scale poorly with `GOMAXPROCS`; see #1942 and
    https://github.com/golang/go/issues/28808. It also means that preemption of
    application threads must be driven by sentry or runtime code, which is
    strictly slower than kernel-driven preemption (since the sentry must invoke
    another syscall to preempt the application thread).

*   It can call `runtime.entersyscallblock` before invoking the syscall, and
    `runtime.exitsyscall` after the syscall returns. In this case, the task
    goroutine will release its P while the syscall is executing. This allows the
    number of threads concurrently executing application code to exceed
    `GOMAXPROCS`. However, this incurs additional latency on syscall entry (to
    hand off the released P to another M, often requiring a `futex(FUTEX_WAKE)`
    syscall) and on syscall exit (to acquire a new P). It also drastically
    increases the number of threads that concurrently interact with the runtime
    scheduler, which is also problematic for performance (both in terms of CPU
    utilization and in terms of context switch latency); see #205.

*   It can call `runtime.entersyscall` before invoking the syscall, and
    `runtime.exitsyscall` after the syscall returns. In this case, the task
    goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread to
    steal it on behalf of another M after a 20us delay. This mitigates the
    context switch latency problem when there are few task goroutines and the
    interval between switches to application code (i.e. the interval between
    application syscalls, page faults, or signal delivery) is short. (Cynically,
    this means that it's most effective in microbenchmarks). However, the delay
    before a P is stolen can also be problematic for performance when there are
    both many task goroutines switching to application code (lazily releasing
    their Ps) *and* many task goroutines switching to sentry code (contending
    for Ps), which is likely in larger heterogeneous workloads. (The sketch
    after this list contrasts the first and third patterns using the exported
    `syscall` wrappers.)
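
For concreteness, the first and third patterns correspond roughly to the
exported `syscall.RawSyscall` and `syscall.Syscall` wrappers: the former does
not inform the runtime at all, while the latter brackets the call with
`runtime.entersyscall`/`runtime.exitsyscall`. (The second pattern uses the
unexported `runtime.entersyscallblock`, which has no exported equivalent.) The
sketch below is illustrative; the sentry's platform code does not actually look
like this.

```go
package example

import (
	"syscall"
	"unsafe"
)

// readNoRuntime issues a read without informing the Go runtime (the first
// pattern above): the calling M keeps its P for the duration of the syscall,
// so the blocked thread still counts against GOMAXPROCS.
func readNoRuntime(fd int, buf []byte) (int, error) {
	n, _, errno := syscall.RawSyscall(syscall.SYS_READ, uintptr(fd),
		uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)))
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}

// readWithRuntime issues the same read via syscall.Syscall, which wraps the
// call in runtime.entersyscall/runtime.exitsyscall (the third pattern above):
// the P is lazily released, and sysmon may steal it after ~20us.
func readWithRuntime(fd int, buf []byte) (int, error) {
	n, _, errno := syscall.Syscall(syscall.SYS_READ, uintptr(fd),
		uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)))
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}
```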

### Blocking Overhead

Task goroutines block on behalf of application syscalls like `futex` and
`epoll_wait` by receiving from a Go channel. (Future work may convert task
goroutine blocking to use the `syncevent` package to avoid overhead associated
with channels and `select`, but this does not change how blocking interacts with
the Go runtime scheduler.)
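
In simplified form (the names here are hypothetical, not the sentry's actual
types), such a wait looks like the following:

```go
package example

import (
	"syscall"
	"time"
)

// waitForWake sketches how a task goroutine blocks on behalf of an application
// syscall such as futex(FUTEX_WAIT): it parks on a channel until another
// goroutine (e.g. one handling FUTEX_WAKE or delivering a signal) sends to it,
// or until the application-supplied timeout expires.
func waitForWake(wake <-chan struct{}, timeout time.Duration) error {
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case <-wake:
		return nil
	case <-t.C:
		return syscall.ETIMEDOUT
	}
}
```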

If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then
when the task goroutine is unblocked (by e.g. an application `FUTEX_WAKE`,
signal delivery, or a timeout) via a send to the channel it is blocked on,
`runtime.ready` migrates the unblocked G to the unblocking P. In most cases,
this implies that every application thread block/unblock cycle results in a
migration of the thread between Ps, and therefore Ms, and therefore cores,
resulting in reduced application performance due to loss of warm CPU caches.
Furthermore, in most cases, the unblocking P cannot immediately switch to the
unblocked G (instead resuming execution of its current application thread after
completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall), often
requiring that another P steal the unblocked G before it can resume execution.

If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the
G will remain locked to its M, avoiding the core migration described above;
however, wakeup latency is significantly increased since, as described in
"Background", the G still needs to be selected by the scheduler before it can
run, and the M that selects the G then needs to transfer its P to the locked M,
incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling.

## Proposal

We propose to add a function, tentatively called `DedicateOSThread`, to the Go
`runtime` package, documented as follows:

```go
// DedicateOSThread wires the calling goroutine to its current operating system
// thread, and exempts it from counting against GOMAXPROCS. The calling
// goroutine will always execute in that thread, and no other goroutine will
// execute in it, until the calling goroutine has made as many calls to
// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits
// without unlocking the thread, the thread will be terminated.
//
// DedicateOSThread should only be used by long-lived goroutines that usually
// block due to blocking system calls, rather than interaction with other
// goroutines.
func DedicateOSThread()
```
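
If adopted, a sentry task goroutine might use the API roughly as follows. This
is only a sketch: `UndedicateOSThread`, the `Task` type, and its methods are
assumed here for illustration and are not part of the proposal text.

```go
package example

import "runtime"

// runTask sketches how a task goroutine might adopt the proposed API: it
// dedicates its thread once at startup so that switches to and from untrusted
// application code bypass the Go scheduler, and undoes the dedication when the
// task exits so the M can be reused.
func runTask(t *Task) {
	runtime.DedicateOSThread()
	defer runtime.UndedicateOSThread()

	for {
		// Switch to application code; this blocks in a platform-specific
		// syscall without releasing the dedicated P.
		ev := t.SwitchToApp()

		// Handle the resulting syscall, fault, or signal in sentry code.
		if done := t.HandleEvent(ev); done {
			return
		}
	}
}
```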

Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
invoking G to an M), but additionally locks the invoking M to a P. Ps locked by
`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number
of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`).
Corollaries:

*   If `runtime.ready` observes that a readied G is locked to an M locked to a
    P, it immediately wakes the locked M without migrating the G to the readying
    P or waiting for a future call to `runtime.schedule` to select the readied G
    in `runtime.findrunnable` (sketched after this list).

*   `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of
    locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and
    `runtime.exitsyscall` skip re-acquisition of Ps if one is locked.

*   sysmon does not attempt to preempt Gs that are locked to Ps, avoiding
    fruitless overhead from `tgkill` syscalls and signal delivery.

*   `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that
    unlocked Ps be tracked in a separate array). `runtime.findrunnable` on
    locked Ps skips the global run queue, work stealing, and possibly netpoll.

*   New goroutines created by goroutines with locked Ps are enqueued on the
    global run queue rather than the invoking P's local run queue.
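
To illustrate the first corollary, the readying path might grow a fast path
along the following lines. This is a sketch against hypothetical names (in
particular, the `lockedp` field on Ms does not exist today), not a patch to the
actual runtime.

```go
// ready marks gp runnable. Sketch of the proposed fast path: if gp is locked
// to an M that is itself locked to a P (i.e. a dedicated thread), wake that M
// directly rather than migrating gp to the readying P or leaving it for a
// later runtime.findrunnable to discover.
func ready(gp *g, traceskip int, next bool) {
	if gp.lockedm != 0 && gp.lockedm.ptr().lockedp != 0 { // lockedp is hypothetical
		casgstatus(gp, _Gwaiting, _Grunnable)
		// The dedicated M already owns a P, so it can run gp immediately.
		notewakeup(&gp.lockedm.ptr().park)
		return
	}
	// ... existing path: mark gp runnable, runqput on the readying P, wakep() ...
}
```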

While gVisor's use case does not strictly require that the association be
reversible (with `runtime.UndedicateOSThread`), such a feature is required to
allow reuse of locked Ms, which is likely to be critical for performance.
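
The proposal above does not specify the reverse operation; a plausible doc
comment, mirroring `UnlockOSThread`, might read as follows (a sketch only, not
part of the proposal text):

```go
// UndedicateOSThread undoes an earlier call to DedicateOSThread. When it
// undoes the last remaining call, the calling goroutine's thread and its
// dedicated P are returned to the runtime's shared pools, and the goroutine
// may again be scheduled onto other threads.
func UndedicateOSThread()
```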

## Alternatives Considered

*   Make the runtime scale well with `GOMAXPROCS`. While we are investigating
    this problem concurrently, doing so would not address the issues of
    increased preemption cost or blocking overhead.

*   Make the runtime scale well with the number of Ms. It is unclear whether
    this is actually feasible, and it would not address blocking overhead.

*   Make P-locking part of `LockOSThread`'s behavior. This would likely
    introduce performance regressions in existing uses of `LockOSThread` that do
    not fit this usage pattern. In particular, since `DedicateOSThread`
    transitions the invoker's P from "counted against `GOMAXPROCS`" to "not
    counted against `GOMAXPROCS`", it may need to wake another M to run a new P
    (that is counted against `GOMAXPROCS`), and the converse applies to
    `UndedicateOSThread`.

*   Rewrite the gVisor sentry in a language that does not force userspace
    scheduling. This is a last resort due to the amount of code involved.

## Related Issues

The proposed functionality is directly analogous to `spawn_blocking` in the Rust
async runtimes
[`async_std`](https://docs.rs/async-std/1.8.0/async_std/task/fn.spawn_blocking.html)
and [`tokio`](https://docs.rs/tokio/0.3.5/tokio/task/fn.spawn_blocking.html).

Outside of gVisor:

*   https://github.com/golang/go/issues/21827#issuecomment-595152452 describes a
    use case for this feature in go-delve, where the goroutine that would use
    this feature spends much of its time blocked in `ptrace` syscalls.

*   This feature may improve performance in the use case described in
    https://github.com/golang/go/issues/18237, given the prominence of
    `syscall.Syscall` in the profile given in that bug report.