More Research Problems of Implementing Go

Dmitry Vyukov
Google

http://golang.org/

* About Go

Go is an open source programming language that makes it easy to build simple, reliable, and efficient software.

Design began in late 2007.

- Robert Griesemer, Rob Pike, Ken Thompson
- Russ Cox, Ian Lance Taylor

Became open source in November 2009.

Developed entirely in the open; very active community.
Language stable as of Go 1, early 2012.
Work continues.

* Motivation for Go

.image research2/datacenter.jpg

* Motivation for Go

Started as an answer to software problems at Google:

- multicore processors
- networked systems
- massive computation clusters
- scale: 10⁷⁺ lines of code
- scale: 10³⁺ programmers
- scale: 10⁶⁺ machines (design point)

Deployed: parts of YouTube, dl.google.com, Blogger, Google Code, Google Fiber, ...

* Go

A simple but powerful and fun language.

- start with C, remove complex parts
- add interfaces, concurrency
- also: garbage collection, closures, reflection, strings, ...

For more background on design:

- [[http://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html][Less is exponentially more]]
- [[http://talks.golang.org/2012/splash.article][Go at Google: Language Design in the Service of Software Engineering]]

* Research and Go

Go is designed for building production systems at Google.

- Goal: make that job easier, faster, better.
- Non-goal: break new ground in programming language research.

Plenty of research questions about how to implement Go well.
- Concurrency
- Scheduling
- Garbage collection
- Race and deadlock detection
- Testing of the implementation

* Concurrency

.image research2/busy.jpg

* Concurrency

Go provides two important concepts:

A goroutine is a thread of control within the program, with its own local variables and stack. Cheap, easy to create.

A channel carries typed messages between goroutines.

* Concurrency

.play research2/hello.go

* Concurrency: CSP

Channels adopted from Hoare's Communicating Sequential Processes.

- Orthogonal to the rest of the language
- Can keep familiar model for computation
- Focus on _composition_ of regular code

Go _enables_ simple, safe concurrent programming.
It doesn't _forbid_ bad programming.

Caveat: not purely memory safe; sharing is legal.
Passing a pointer over a channel is idiomatic.

Experience shows this is practical.

* Concurrency

Sequential network address resolution, given a work list:

.play research2/addr1.go /lookup/+1,/^}/-1

* Concurrency

Concurrent network address resolution, given a work list:

.play research2/addr2.go /lookup/+1,/^}/-1

* Concurrency

Select statements: a switch for communication.

.play research2/select.go /select/,/^}/-1

It is select that makes an efficient implementation difficult.

* Implementing Concurrency

Challenge: make channel communication scale.

- start with one global channel lock
- per-channel locks, locked in address order for multi-channel operations

Research question: lock-free channels?
* Scheduling

.image research2/gophercomplex6.jpg

* Scheduling

On the one hand we have arbitrary user programs:

- fine-grained goroutines, coarse-grained goroutines, or a mix of both
- computational goroutines, IO-bound goroutines, or a mix of both
- arbitrary dynamic communication patterns
- busy, idle, bursty programs

No user hints!

* Scheduling

On the other hand we have complex hardware topology:

- per-core caches
- caches shared between cores
- cores shared between hyper-threads (HT)
- multiple processors with non-uniform memory access (NUMA)

* Scheduling

Challenge: make it all magically work efficiently.

- start with one global lock for all scheduler state
- distributed work-stealing scheduler with per-"processor" state
- network poller integrated into the scheduler
- lock-free work queues

* Scheduling

Current scheduler:

    ┌─┐        ┌─┐        ┌─┐        ┌─┐        ┌─┐
    │ │        │ │        │ │        │ │        │ │
    ├─┤        ├─┤        ├─┤        ├─┤        ├─┤      Global
    │ │        │G│        │ │        │ │        │ │      state
    ├─┤        ├─┤        ├─┤        ├─┤        ├─┤
    │G│        │G│        │G│        │ │        │G│
    ├─┤        ├─┤        ├─┤        ├─┤        ├─┤
    │G│        │G│        │G│        │G│        │G│
    └┬┘        └┬┘        └┬┘        └┬┘        └─┘
     │          │          │          │
     ↓          ↓          ↓          ↓
    ┌─┬──────┐ ┌─┬──────┐ ┌─┬──────┐ ┌─┬──────┐  ┌────┐┌──────┐┌───────┐
    │P│mcache│ │P│mcache│ │P│mcache│ │P│mcache│  │heap││timers││netpoll│
    └┬┴──────┘ └┬┴──────┘ └┬┴──────┘ └┬┴──────┘  └────┘└──────┘└───────┘
     │          │          │          │
     ↓          ↓          ↓          ↓
    ┌─┐        ┌─┐        ┌─┐        ┌─┐   ┌─┐ ┌─┐ ┌─┐
    │M│        │M│        │M│        │M│   │M│ │M│ │M│
    └─┘        └─┘        └─┘        └─┘   └─┘ └─┘ └─┘

G - goroutine; P - logical processor; M - OS thread (machine)

* Scheduling

Want:

- temporal locality to exploit caches
- spatial locality to exploit NUMA
- schedule mostly LIFO but ensure weak fairness
- allocate local memory and stacks
- scan local memory in GC
- collocate communicating goroutines
- distribute non-communicating goroutines
- distribute timers and network poller
- poll network on the same core where the last read was issued

* Garbage Collection

* Garbage Collection

Garbage collection simplifies APIs.

- In C and C++, too much API design (and too much programming effort!) is about memory management.

Fundamental to concurrency: too hard to track ownership otherwise.

Fundamental to interfaces: memory management details do not bifurcate otherwise-similar APIs.

Of course, it adds cost, latency, and complexity in the runtime system.

* Garbage Collection

Plenty of research about garbage collection, mostly in the Java context.

- Parallel stop-the-world
- CMS: concurrent mark-and-sweep, stop-the-world compaction
- G1: region-based incremental copying collector

Java collectors usually:

- are generational/incremental because the allocation rate is high
- compact memory to support generations
- have pauses because concurrent compaction is tricky and slow

* Garbage Collection

But Go is very different!

- User can avoid lots of allocations by embedding objects:

    type Point struct {
        X, Y int
    }
    type Rectangle struct {
        Min, Max Point
    }

- Fewer pointers.
- Lots of stack allocations.
- Interior pointers are allowed:

    p := &rect.Max

- Hundreds of thousands of stacks (goroutines).
- No object headers so far.

* Implementing Garbage Collection

Current GC: stop the world, parallel mark, start the world, concurrent sweep.
Concurrent mark is almost ready.

Cannot reuse Java GC algorithms directly.

Research question: what GC algorithm is the best fit for Go?
Do we need generations? Do we need compaction? What are efficient data structures that support interior pointers?
* Race and deadlock detection

.image research2/race.png 160 600

* Race detection

Based on the ThreadSanitizer runtime, which originally targeted mainly C/C++.
A traditional happens-before race detector based on vector clocks (devil in the details!).
Works fine for Go, except:

    $ go run -race lots_of_goroutines.go
    race: limit on 8192 simultaneously alive goroutines is exceeded, dying

Research question: a race detector that efficiently supports hundreds of thousands of goroutines?

* Deadlock detection

Deadlock on mutexes due to lock order inversion:

    // thread 1                   // thread 2
    pthread_mutex_lock(&m1);      pthread_mutex_lock(&m2);
    pthread_mutex_lock(&m2);      pthread_mutex_lock(&m1);
    ...                           ...
    pthread_mutex_unlock(&m2);    pthread_mutex_unlock(&m1);
    pthread_mutex_unlock(&m1);    pthread_mutex_unlock(&m2);

Lock order inversions are easy to detect:

- build the "M1 is locked under M2" relation
- if it becomes cyclic, there is a potential deadlock
- whenever a new edge is added to the graph, do a DFS to find cycles

* Deadlock detection

Go has channels and mutexes. Channels are semaphores. A mutex can be unlocked in
a different goroutine, so it is essentially a binary semaphore too.

Deadlock example:

    // Parallel file tree walk.
    func worker(pendingItems chan os.FileInfo) {
        for f := range pendingItems {
            if f.IsDir() {
                filepath.Walk(f.Name(), func(path string, info os.FileInfo, err error) error {
                    pendingItems <- info
                    return nil
                })
            } else {
                visit(f)
            }
        }
    }

The pendingItems channel has limited capacity. All workers can block on send to pendingItems.
* Deadlock detection

Another deadlock example:

    var (
        c   = make(chan T, 100)
        mtx sync.RWMutex
    )

    // goroutine 1      // goroutine 2      // goroutine 3
    // does send        // does receive     // "resizes" the channel
    mtx.RLock()         mtx.RLock()         mtx.Lock()
    c <- v              v := <-c            tmp := make(chan T, 200)
    mtx.RUnlock()       mtx.RUnlock()       copyAll(c, tmp)
                                            c = tmp
                                            mtx.Unlock()

RWMutex is fair to both readers and writers: when a writer arrives, new readers are not allowed to enter the critical section.
Goroutine 1 blocks on the chan send; then goroutine 3 blocks on mtx.Lock; then goroutine 2 blocks on mtx.RLock.

* Deadlock detection

Research question: how to detect deadlocks on semaphores?

No known theory to date.

* Testing of the implementation

.image research2/gopherswrench.jpg 240 405

* Testing of the implementation

So now we have a new language with several complex implementations:

- lexer
- parser
- transformation and optimization passes
- code generation
- linker
- channel and map operations
- garbage collector
- ...

*How*do*you*test*it?*

* Testing of the implementation

Csmith is a tool that generates random C programs that statically and dynamically conform to the C99 standard.

.image research2/csmith.png

* Testing of the implementation

GoSmith is a tool that generates random Go programs that statically and dynamically conform to the Go standard.

Turned out to be much simpler than for C: no undefined behavior all around!
- no uninitialized variables
- no concurrent mutations between sequence points (x[i++] = --i)
- no UB on signed overflow
- in total, C has 191 kinds of undefined behavior and 52 kinds of unspecified behavior

* Testing of the implementation

But it generates programs that are uninteresting from an execution point of view: most of them deadlock or crash on a nil dereference.

Trophies:

- 31 bugs in the gc compiler
- 18 bugs in the gccgo compiler
- 5 bugs in the llgo compiler
- 1 bug in gofmt
- 3 bugs in the spec

.image research2/emoji.png

* Testing of the implementation

Research question: how to generate random *interesting*concurrent* Go programs?

Must:

- create and wait for goroutines
- communicate over channels
- protect data with mutexes (reader-writer)
- pass data ownership between goroutines (explicitly and implicitly)

Must not:

- deadlock
- cause data races
- have non-deterministic results

* Research and Go

Plenty of research questions about how to implement Go well.

- Concurrency
- Scheduling
- Garbage collection
- Race and deadlock detection
- Testing of the implementation
- [Polymorphism]
- [Program translation]