# `runtime.DedicateOSThread`

Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that
this may be difficult to support in the Go runtime due to issues with GC.

## Summary

Allow goroutines to bind to kernel threads in a way that allows their scheduling
to be kernel-managed rather than runtime-managed.

## Objectives

*   Reduce Go runtime overhead in the gVisor sentry (#2184).

*   Minimize intrusiveness of changes to the Go runtime.

## Background

In Go, execution contexts are referred to as goroutines, which the runtime calls
Gs. The Go runtime maintains a variably-sized pool of threads (called Ms by the
runtime) on which Gs are executed, as well as a pool of "virtual processors"
(called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`. Usually,
each M requires a P in order to execute Gs, limiting the number of concurrently
executing goroutines to `runtime.GOMAXPROCS()`.

The `runtime.LockOSThread` function temporarily locks the invoking goroutine to
its current thread. It is primarily useful for interacting with OS or non-Go
library facilities that are per-thread. It does not reduce interactions with the
Go runtime scheduler: locked Ms relinquish their P when they become blocked, and
only continue execution after another M "chooses" their locked G to run and
donates their P to the locked M instead.
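
For context, a typical use of `runtime.LockOSThread` today looks like the
sketch below: it pins the calling goroutine to its thread for the duration of a
per-thread interaction with the kernel (here, a `ptrace` attach/detach
sequence). The helper name and choice of syscalls are illustrative only, not
taken from the sentry.

```go
package example

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// traceTarget sketches the existing LockOSThread pattern: ptrace requires that
// every request for a given tracee come from the same OS thread, so the
// goroutine must stay wired to its thread for the whole sequence.
func traceTarget(pid int) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := unix.PtraceAttach(pid); err != nil {
		return err
	}
	// Wait for the tracee to stop before issuing further requests.
	var status unix.WaitStatus
	if _, err := unix.Wait4(pid, &status, 0, nil); err != nil {
		return err
	}
	// ... further ptrace requests would be issued from this same thread ...
	return unix.PtraceDetach(pid)
}
```

Even with the goroutine locked, the scheduler handoff described above still
applies whenever `traceTarget` blocks, which is the kind of overhead discussed
under "Problems" below.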

## Problems

### Context Switch Overhead

Most goroutines in the gVisor sentry are task goroutines, which back application
threads. Task goroutines spend large amounts of time blocked on syscalls that
execute untrusted application code. When making such a syscall (the specific
syscall varies by gVisor platform), the task goroutine may interact with the Go
runtime in one of three ways:

*   It can invoke the syscall without informing the runtime. In this case, the
    task goroutine will continue to hold its P during the syscall, limiting the
    number of application threads that can run concurrently to
    `runtime.GOMAXPROCS()`. This is problematic because the Go runtime scheduler
    is known to scale poorly with `GOMAXPROCS`; see #1942 and
    https://github.com/golang/go/issues/28808. It also means that preemption of
    application threads must be driven by sentry or runtime code, which is
    strictly slower than kernel-driven preemption (since the sentry must invoke
    another syscall to preempt the application thread).

*   It can call `runtime.entersyscallblock` before invoking the syscall, and
    `runtime.exitsyscall` after the syscall returns. In this case, the task
    goroutine will release its P while the syscall is executing. This allows the
    number of threads concurrently executing application code to exceed
    `GOMAXPROCS`. However, this incurs additional latency on syscall entry (to
    hand off the released P to another M, often requiring a `futex(FUTEX_WAKE)`
    syscall) and on syscall exit (to acquire a new P). It also drastically
    increases the number of threads that concurrently interact with the runtime
    scheduler, which is also problematic for performance (both in terms of CPU
    utilization and in terms of context switch latency); see #205.

*   It can call `runtime.entersyscall` before invoking the syscall, and
    `runtime.exitsyscall` after the syscall returns. In this case, the task
    goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread to
    steal it on behalf of another M after a 20us delay. This mitigates the
    context switch latency problem when there are few task goroutines and the
    interval between switches to application code (i.e. the interval between
    application syscalls, page faults, or signal delivery) is short. (Cynically,
    this means that it's most effective in microbenchmarks). However, the delay
    before a P is stolen can also be problematic for performance when there are
    both many task goroutines switching to application code (lazily releasing
    their Ps) *and* many task goroutines switching to sentry code (contending
    for Ps), which is likely in larger heterogeneous workloads. (The sketch
    after this list contrasts the first and third patterns using the exported
    `syscall` wrappers.)
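
For concreteness, the first and third patterns correspond roughly to the
exported `syscall.RawSyscall` and `syscall.Syscall` wrappers: the former does
not inform the runtime at all, while the latter brackets the call with
`runtime.entersyscall`/`runtime.exitsyscall`. (The second pattern uses the
unexported `runtime.entersyscallblock`, which has no exported equivalent.) The
sketch below is illustrative; the sentry's platform code does not actually look
like this.

```go
package example

import (
	"syscall"
	"unsafe"
)

// readNoRuntime issues a read without informing the Go runtime (the first
// pattern above): the calling M keeps its P for the duration of the syscall,
// so the blocked thread still counts against GOMAXPROCS.
func readNoRuntime(fd int, buf []byte) (int, error) {
	n, _, errno := syscall.RawSyscall(syscall.SYS_READ, uintptr(fd),
		uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)))
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}

// readWithRuntime issues the same read via syscall.Syscall, which wraps the
// call in runtime.entersyscall/runtime.exitsyscall (the third pattern above):
// the P is lazily released, and sysmon may steal it after ~20us.
func readWithRuntime(fd int, buf []byte) (int, error) {
	n, _, errno := syscall.Syscall(syscall.SYS_READ, uintptr(fd),
		uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)))
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}
```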

### Blocking Overhead

Task goroutines block on behalf of application syscalls like `futex` and
`epoll_wait` by receiving from a Go channel. (Future work may convert task
goroutine blocking to use the `syncevent` package to avoid overhead associated
with channels and `select`, but this does not change how blocking interacts with
the Go runtime scheduler.)
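
In simplified form (the names here are hypothetical, not the sentry's actual
types), such a wait looks like the following:

```go
package example

import (
	"syscall"
	"time"
)

// waitForWake sketches how a task goroutine blocks on behalf of an application
// syscall such as futex(FUTEX_WAIT): it parks on a channel until another
// goroutine (e.g. one handling FUTEX_WAKE or delivering a signal) sends to it,
// or until the application-supplied timeout expires.
func waitForWake(wake <-chan struct{}, timeout time.Duration) error {
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case <-wake:
		return nil
	case <-t.C:
		return syscall.ETIMEDOUT
	}
}
```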

If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then
when the task goroutine is unblocked (by e.g. an application `FUTEX_WAKE`,
signal delivery, or a timeout) via a send to the channel it is blocked on,
`runtime.ready` migrates the unblocked G to the unblocking P. In most cases,
this implies that every application thread block/unblock cycle results in a
migration of the thread between Ps, and therefore Ms, and therefore cores,
resulting in reduced application performance due to loss of warm CPU caches.
Furthermore, in most cases, the unblocking P cannot immediately switch to the
unblocked G (instead resuming execution of its current application thread after
completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall), often
requiring that another P steal the unblocked G before it can resume execution.

If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the
G will remain locked to its M, avoiding the core migration described above;
however, wakeup latency is significantly increased since, as described in
"Background", the G still needs to be selected by the scheduler before it can
run, and the M that selects the G then needs to transfer its P to the locked M,
incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling.

## Proposal

We propose to add a function, tentatively called `DedicateOSThread`, to the Go
`runtime` package, documented as follows:

```go
// DedicateOSThread wires the calling goroutine to its current operating system
// thread, and exempts it from counting against GOMAXPROCS. The calling
// goroutine will always execute in that thread, and no other goroutine will
// execute in it, until the calling goroutine has made as many calls to
// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits
// without unlocking the thread, the thread will be terminated.
//
// DedicateOSThread should only be used by long-lived goroutines that usually
// block due to blocking system calls, rather than interaction with other
// goroutines.
func DedicateOSThread()
```
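
If adopted, a sentry task goroutine might use the API roughly as follows. This
is only a sketch: `UndedicateOSThread`, the `Task` type, and its methods are
assumed here for illustration and are not part of the proposal text.

```go
package example

import "runtime"

// runTask sketches how a task goroutine might adopt the proposed API: it
// dedicates its thread once at startup so that switches to and from untrusted
// application code bypass the Go scheduler, and undoes the dedication when the
// task exits so the M can be reused.
func runTask(t *Task) {
	runtime.DedicateOSThread()
	defer runtime.UndedicateOSThread()

	for {
		// Switch to application code; this blocks in a platform-specific
		// syscall without releasing the dedicated P.
		ev := t.SwitchToApp()

		// Handle the resulting syscall, fault, or signal in sentry code.
		if done := t.HandleEvent(ev); done {
			return
		}
	}
}
```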

Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
invoking G to an M), but additionally locks the invoking M to a P. Ps locked by
`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number
of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`).
Corollaries:

*   If `runtime.ready` observes that a readied G is locked to an M locked to a
    P, it immediately wakes the locked M without migrating the G to the readying
    P or waiting for a future call to `runtime.schedule` to select the readied G
    in `runtime.findrunnable` (sketched after this list).

*   `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of
    locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and
    `runtime.exitsyscall` skip re-acquisition of Ps if one is locked.

*   sysmon does not attempt to preempt Gs that are locked to Ps, avoiding
    fruitless overhead from `tgkill` syscalls and signal delivery.

*   `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that
    unlocked Ps be tracked in a separate array). `runtime.findrunnable` on
    locked Ps skips the global run queue, work stealing, and possibly netpoll.

*   New goroutines created by goroutines with locked Ps are enqueued on the
    global run queue rather than the invoking P's local run queue.
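
To illustrate the first corollary, the readying path might grow a fast path
along the following lines. This is a sketch against hypothetical names (in
particular, the `lockedp` field on Ms does not exist today), not a patch to the
actual runtime.

```go
// ready marks gp runnable. Sketch of the proposed fast path: if gp is locked
// to an M that is itself locked to a P (i.e. a dedicated thread), wake that M
// directly rather than migrating gp to the readying P or leaving it for a
// later runtime.findrunnable to discover.
func ready(gp *g, traceskip int, next bool) {
	if gp.lockedm != 0 && gp.lockedm.ptr().lockedp != 0 { // lockedp is hypothetical
		casgstatus(gp, _Gwaiting, _Grunnable)
		// The dedicated M already owns a P, so it can run gp immediately.
		notewakeup(&gp.lockedm.ptr().park)
		return
	}
	// ... existing path: mark gp runnable, runqput on the readying P, wakep() ...
}
```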

While gVisor's use case does not strictly require that the association be
reversible (with `runtime.UndedicateOSThread`), such a feature is required to
allow reuse of locked Ms, which is likely to be critical for performance.
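
The proposal above does not specify the reverse operation; a plausible doc
comment, mirroring `UnlockOSThread`, might read as follows (a sketch only, not
part of the proposal text):

```go
// UndedicateOSThread undoes an earlier call to DedicateOSThread. When it
// undoes the last remaining call, the calling goroutine's thread and its
// dedicated P are returned to the runtime's shared pools, and the goroutine
// may again be scheduled onto other threads.
func UndedicateOSThread()
```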

## Alternatives Considered

*   Make the runtime scale well with `GOMAXPROCS`. While we are investigating
    this problem concurrently, doing so would not address the issues of
    increased preemption cost or blocking overhead.

*   Make the runtime scale well with the number of Ms. It is unclear whether
    this is actually feasible, and it would not address blocking overhead.

*   Make P-locking part of `LockOSThread`'s behavior. This would likely
    introduce performance regressions in existing uses of `LockOSThread` that do
    not fit this usage pattern. In particular, since `DedicateOSThread`
    transitions the invoker's P from "counted against `GOMAXPROCS`" to "not
    counted against `GOMAXPROCS`", it may need to wake another M to run a new P
    (that is counted against `GOMAXPROCS`), and the converse applies to
    `UndedicateOSThread`.

*   Rewrite the gVisor sentry in a language that does not force userspace
    scheduling. This is a last resort due to the amount of code involved.

## Related Issues

The proposed functionality is directly analogous to `spawn_blocking` in the Rust
async runtimes
[`async_std`](https://docs.rs/async-std/1.8.0/async_std/task/fn.spawn_blocking.html)
and [`tokio`](https://docs.rs/tokio/0.3.5/tokio/task/fn.spawn_blocking.html).

Outside of gVisor:

*   https://github.com/golang/go/issues/21827#issuecomment-595152452 describes a
    use case for this feature in go-delve, where the goroutine that would use
    this feature spends much of its time blocked in `ptrace` syscalls.

*   This feature may improve performance in the use case described in
    https://github.com/golang/go/issues/18237, given the prominence of
    `syscall.Syscall` in the profile given in that bug report.