# `runtime.DedicateOSThread`

Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that
this may be difficult to support in the Go runtime due to issues with GC.

## Summary

Allow goroutines to bind to kernel threads so that their scheduling is
kernel-managed rather than runtime-managed.

## Objectives

*   Reduce Go runtime overhead in the gVisor sentry (#2184).

*   Minimize intrusiveness of changes to the Go runtime.

## Background

In Go, execution contexts are referred to as goroutines, which the runtime
calls Gs. The Go runtime maintains a variably-sized pool of threads (called Ms
by the runtime) on which Gs are executed, as well as a pool of "virtual
processors" (called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`.
Usually, each M requires a P in order to execute Gs, limiting the number of
concurrently executing goroutines to `runtime.GOMAXPROCS()`.

The `runtime.LockOSThread` function temporarily locks the invoking goroutine to
its current thread. It is primarily useful for interacting with OS or non-Go
library facilities that are per-thread. It does not reduce interactions with
the Go runtime scheduler: locked Ms relinquish their P when they become
blocked, and only continue execution after another M "chooses" their locked G
to run and donates its P to the locked M instead.

## Problems

### Context Switch Overhead

Most goroutines in the gVisor sentry are task goroutines, which back
application threads. Task goroutines spend large amounts of time blocked on
syscalls that execute untrusted application code. When invoking such a syscall
(which varies by gVisor platform), a task goroutine may interact with the Go
runtime in one of three ways (a sketch contrasting these modes follows the
list):

*   It can invoke the syscall without informing the runtime. In this case, the
    task goroutine will continue to hold its P during the syscall, limiting the
    number of application threads that can run concurrently to
    `runtime.GOMAXPROCS()`. This is problematic because the Go runtime
    scheduler is known to scale poorly with `GOMAXPROCS`; see #1942 and
    https://github.com/golang/go/issues/28808. It also means that preemption of
    application threads must be driven by sentry or runtime code, which is
    strictly slower than kernel-driven preemption (since the sentry must invoke
    another syscall to preempt the application thread).

*   It can call `runtime.entersyscallblock` before invoking the syscall, and
    `runtime.exitsyscall` after the syscall returns. In this case, the task
    goroutine will release its P while the syscall is executing. This allows
    the number of threads concurrently executing application code to exceed
    `GOMAXPROCS`. However, this incurs additional latency on syscall entry (to
    hand off the released P to another M, often requiring a
    `futex(FUTEX_WAKE)` syscall) and on syscall exit (to acquire a new P). It
    also drastically increases the number of threads that concurrently interact
    with the runtime scheduler, which is also problematic for performance (both
    in terms of CPU utilization and in terms of context switch latency); see
    #205.

*   It can call `runtime.entersyscall` before invoking the syscall, and
    `runtime.exitsyscall` after the syscall returns. In this case, the task
    goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread
    to steal it on behalf of another M after a 20us delay. This mitigates the
    context switch latency problem when there are few task goroutines and the
    interval between switches to application code (i.e. the interval between
    application syscalls, page faults, or signal delivery) is short.
    (Cynically, this means that it's most effective in microbenchmarks.)
    However, the delay before a P is stolen can also be problematic for
    performance when there are both many task goroutines switching to
    application code (lazily releasing their Ps) *and* many task goroutines
    switching to sentry code (contending for Ps), which is likely in larger
    heterogeneous workloads.
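The first and third modes correspond to exported `syscall` package entry
points. The following minimal sketch contrasts them, assuming linux/amd64 and
the `syscall` package's historical implementation, in which `RawSyscall` skips
runtime notification entirely while `Syscall` brackets the trap with
`runtime.entersyscall`/`runtime.exitsyscall`:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// First mode: syscall.RawSyscall does not inform the runtime at all, so
	// the calling goroutine holds its P for the duration of the syscall.
	// This is safe here only because getpid cannot block.
	pid, _, _ := syscall.RawSyscall(syscall.SYS_GETPID, 0, 0, 0)

	// Third mode: syscall.Syscall brackets the trap with
	// runtime.entersyscall/runtime.exitsyscall, so the P is lazily released
	// and may be stolen by sysmon if the syscall runs longer than ~20us.
	tid, _, _ := syscall.Syscall(syscall.SYS_GETTID, 0, 0, 0)

	// Second mode: runtime.entersyscallblock has no exported equivalent; it
	// is reachable only from inside the runtime or via go:linkname.
	fmt.Printf("pid=%d tid=%d\n", pid, tid)
}
```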
### Blocking Overhead

Task goroutines block on behalf of application syscalls like `futex` and
`epoll_wait` by receiving from a Go channel. (Future work may convert task
goroutine blocking to use the `syncevent` package to avoid overhead associated
with channels and `select`, but this does not change how blocking interacts
with the Go runtime scheduler.)

If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then
when the task goroutine is unblocked (by e.g. an application `FUTEX_WAKE`,
signal delivery, or a timeout) by a send to the channel it is blocked on,
`runtime.ready` migrates the unblocked G to the unblocking P. In most cases,
this implies that every application thread block/unblock cycle results in a
migration of the thread between Ps, and therefore Ms, and therefore cores,
resulting in reduced application performance due to loss of CPU caches.
Furthermore, in most cases, the unblocking P cannot immediately switch to the
unblocked G (instead resuming execution of its current application thread after
completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall),
often requiring that another P steal the unblocked G before it can resume
execution.

If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the
G will remain locked to its M, avoiding the core migration described above;
however, wakeup latency is significantly increased since, as described in
"Background", the G still needs to be selected by the scheduler before it can
run, and the M that selects the G then needs to transfer its P to the locked M,
incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling.

## Proposal

We propose to add a function, tentatively called `DedicateOSThread`, to the Go
`runtime` package, documented as follows:

```go
// DedicateOSThread wires the calling goroutine to its current operating system
// thread, and exempts it from counting against GOMAXPROCS. The calling
// goroutine will always execute in that thread, and no other goroutine will
// execute in it, until the calling goroutine has made as many calls to
// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits
// without unlocking the thread, the thread will be terminated.
//
// DedicateOSThread should only be used by long-lived goroutines that usually
// block due to blocking system calls, rather than interaction with other
// goroutines.
func DedicateOSThread()
```
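For illustration, a minimal sketch of how a sentry task goroutine might use the
proposed API. Since `DedicateOSThread` and `UndedicateOSThread` do not exist,
the sketch substitutes `LockOSThread`/`UnlockOSThread` stand-ins so that it
compiles today; the channel receive models task goroutine blocking as described
in "Blocking Overhead" above:

```go
package main

import "runtime"

// dedicateOSThread and undedicateOSThread are hypothetical stand-ins for the
// proposed runtime.DedicateOSThread/UndedicateOSThread so that this sketch
// compiles today. LockOSThread wires the G to an M but, unlike the proposal,
// still requires a P handoff on every block/unblock cycle.
func dedicateOSThread()   { runtime.LockOSThread() }
func undedicateOSThread() { runtime.UnlockOSThread() }

// taskGoroutine models a sentry task goroutine backing one application
// thread: long-lived, and blocked on a channel most of the time.
func taskGoroutine(events <-chan int, done <-chan struct{}) {
	dedicateOSThread()
	defer undedicateOSThread()
	for {
		select {
		case <-events:
			// With a dedicated M and P, this wakeup would go directly to
			// the locked M instead of migrating the G to the readying P.
		case <-done:
			return
		}
	}
}

func main() {
	events := make(chan int)
	done := make(chan struct{})
	finished := make(chan struct{})
	go func() {
		taskGoroutine(events, done)
		close(finished)
	}()
	events <- 1 // models e.g. an application FUTEX_WAKE readying the task
	close(done)
	<-finished
}
```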
Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
invoking G to an M), but additionally locks the invoking M to a P. Ps locked by
`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the
number of bound Ps (plus some slack to avoid frequent changes to
`runtime.allp`). Corollaries:

*   If `runtime.ready` observes that a readied G is locked to an M that is
    locked to a P, it immediately wakes the locked M without migrating the G to
    the readying P or waiting for a future call to `runtime.schedule` to select
    the readied G in `runtime.findrunnable`.

*   `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of
    locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and
    `runtime.exitsyscall` skip re-acquisition of Ps if one is locked.

*   sysmon does not attempt to preempt Gs that are locked to Ps, avoiding
    fruitless overhead from `tgkill` syscalls and signal delivery.

*   `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that
    unlocked Ps be tracked in a separate array). `runtime.findrunnable` on
    locked Ps skips the global run queue, work stealing, and possibly netpoll.

*   New goroutines created by goroutines with locked Ps are enqueued on the
    global run queue rather than the invoking P's local run queue.

While gVisor's use case does not strictly require that the association be
reversible (with `runtime.UndedicateOSThread`), such a feature is required to
allow reuse of locked Ms, which is likely to be critical for performance.

## Alternatives Considered

*   Make the runtime scale well with `GOMAXPROCS`. While we are also
    concurrently investigating this problem, this would not address the issues
    of increased preemption cost or blocking overhead.

*   Make the runtime scale well with the number of Ms. It is unclear if this is
    actually feasible, and it would not address blocking overhead.

*   Make P-locking part of `LockOSThread`'s behavior. This would likely
    introduce performance regressions in existing uses of `LockOSThread` that
    do not fit this usage pattern. In particular, since `DedicateOSThread`
    transitions the invoker's P from "counted against `GOMAXPROCS`" to "not
    counted against `GOMAXPROCS`", it may need to wake another M to run a new P
    (that is counted against `GOMAXPROCS`), and the converse applies to
    `UndedicateOSThread`.

*   Rewrite the gVisor sentry in a language that does not force userspace
    scheduling. This is a last resort due to the amount of code involved.

## Related Issues

The proposed functionality is directly analogous to `spawn_blocking` in the
Rust async runtimes
[`async_std`](https://docs.rs/async-std/1.8.0/async_std/task/fn.spawn_blocking.html)
and [`tokio`](https://docs.rs/tokio/0.3.5/tokio/task/fn.spawn_blocking.html).

Outside of gVisor:

*   https://github.com/golang/go/issues/21827#issuecomment-595152452 describes
    a use case for this feature in go-delve, where the goroutine that would use
    this feature spends much of its time blocked in `ptrace` syscalls (a
    minimal sketch of this pattern follows below).

*   This feature may improve performance in the use case described in
    https://github.com/golang/go/issues/18237, given the prominence of
    `syscall.Syscall` in the profile given in that bug report.
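To make the go-delve-style workload concrete, here is a minimal tracer sketch
(assuming linux/amd64; `/bin/true` as the tracee is an arbitrary choice).
Because `ptrace` requests must be issued by the thread that traces the child,
such a goroutine must call `runtime.LockOSThread` today and spends most of its
time blocked in `wait4`; it is exactly the kind of long-lived, syscall-blocked
goroutine that `DedicateOSThread` targets:

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
)

func main() {
	// ptrace requests must come from the thread that traces the child, so
	// this goroutine must be wired to one OS thread for its whole lifetime.
	// Under the proposal it could call DedicateOSThread instead, avoiding
	// the P handoff that LockOSThread incurs around every blocking wait4.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	cmd := exec.Command("/bin/true")
	cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// Wait returns once the traced child stops at its initial SIGTRAP,
	// typically with the error "stop signal: trace/breakpoint trap".
	_ = cmd.Wait()

	pid := cmd.Process.Pid
	var ws syscall.WaitStatus
	stops := 0
	for {
		// Resume the child until its next syscall entry or exit, then
		// block in wait4; this block/unblock cycle is where a tracer
		// goroutine spends most of its time.
		if err := syscall.PtraceSyscall(pid, 0); err != nil {
			panic(err)
		}
		if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil {
			panic(err)
		}
		if ws.Exited() {
			break
		}
		stops++
	}
	fmt.Printf("child made %d syscall stops\n", stops)
}
```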