Go Concurrency Patterns: Pipelines and cancellation
13 Mar 2014
Tags: concurrency, pipelines, cancellation

Sameer Ajmani

* Introduction

Go's concurrency primitives make it easy to construct streaming data pipelines
that make efficient use of I/O and multiple CPUs. This article presents
examples of such pipelines, highlights subtleties that arise when operations
fail, and introduces techniques for dealing with failures cleanly.

* What is a pipeline?

There's no formal definition of a pipeline in Go; it's just one of many kinds of
concurrent programs. Informally, a pipeline is a series of _stages_ connected
by channels, where each stage is a group of goroutines running the same
function. In each stage, the goroutines

- receive values from _upstream_ via _inbound_ channels
- perform some function on that data, usually producing new values
- send values _downstream_ via _outbound_ channels

Each stage has any number of inbound and outbound channels, except the
first and last stages, which have only outbound or inbound channels,
respectively. The first stage is sometimes called the _source_ or
_producer_; the last stage, the _sink_ or _consumer_.

We'll begin with a simple example pipeline to explain the ideas and techniques.
Later, we'll present a more realistic example.

* Squaring numbers

Consider a pipeline with three stages.

The first stage, `gen`, is a function that converts a list of integers to a
channel that emits the integers in the list.
The `gen` function starts a
goroutine that sends the integers on the channel and closes the channel when all
the values have been sent:

.code pipelines/square.go /func gen/,/^}/

The second stage, `sq`, receives integers from a channel and returns a
channel that emits the square of each received integer. After the
inbound channel is closed and this stage has sent all the values
downstream, it closes the outbound channel:

.code pipelines/square.go /func sq/,/^}/

The `main` function sets up the pipeline and runs the final stage: it receives
values from the second stage and prints each one, until the channel is closed:

.code pipelines/square.go /func main/,/^}/

Since `sq` has the same type for its inbound and outbound channels, we
can compose it any number of times. We can also rewrite `main` as a
range loop, like the other stages:

.code pipelines/square2.go /func main/,/^}/

* Fan-out, fan-in

Multiple functions can read from the same channel until that channel is closed;
this is called _fan-out_. This provides a way to distribute work amongst a group
of workers to parallelize CPU use and I/O.

A function can read from multiple inputs and proceed until all are closed by
multiplexing the input channels onto a single channel that's closed when all the
inputs are closed. This is called _fan-in_.

We can change our pipeline to run two instances of `sq`, each reading from the
same input channel. We introduce a new function, _merge_, to fan in the
results:

.code pipelines/sqfan.go /func main/,/^}/

The `merge` function converts a list of channels to a single channel by starting
a goroutine for each inbound channel that copies the values to the sole outbound
channel. Once all the `output` goroutines have been started, `merge` starts one
more goroutine to close the outbound channel after all sends on that channel are
done.
Sends on a closed channel panic, so it's important to ensure all sends
are done before calling close. The
[[http://golang.org/pkg/sync/#WaitGroup][`sync.WaitGroup`]] type
provides a simple way to arrange this synchronization:

.code pipelines/sqfan.go /func merge/,/^}/

* Stopping short

There is a pattern to our pipeline functions:

- stages close their outbound channels when all the send operations are done.
- stages keep receiving values from inbound channels until those channels are closed.

This pattern allows each receiving stage to be written as a `range` loop and
ensures that all goroutines exit once all values have been successfully sent
downstream.

But in real pipelines, stages don't always receive all the inbound
values. Sometimes this is by design: the receiver may only need a
subset of values to make progress. More often, a stage exits early
because an inbound value represents an error in an earlier stage. In
either case the receiver should not have to wait for the remaining
values to arrive, and we want earlier stages to stop producing values
that later stages don't need.

In our example pipeline, if a stage fails to consume all the inbound values, the
goroutines attempting to send those values will block indefinitely:

.code pipelines/sqleak.go /first value/,/^}/

This is a resource leak: goroutines consume memory and runtime resources, and
heap references in goroutine stacks keep data from being garbage collected.
Goroutines are not garbage collected; they must exit on their own.

We need to arrange for the upstream stages of our pipeline to exit even when the
downstream stages fail to receive all the inbound values. One way to do this is
to change the outbound channels to have a buffer.
A buffer can hold a fixed
number of values; send operations complete immediately if there's room in the
buffer:

	c := make(chan int, 2) // buffer size 2
	c <- 1  // succeeds immediately
	c <- 2  // succeeds immediately
	c <- 3  // blocks until another goroutine does <-c and receives 1

When the number of values to be sent is known at channel creation time, a buffer
can simplify the code. For example, we can rewrite `gen` to copy the list of
integers into a buffered channel and avoid creating a new goroutine:

.code pipelines/sqbuffer.go /func gen/,/^}/

Returning to the blocked goroutines in our pipeline, we might consider adding a
buffer to the outbound channel returned by `merge`:

.code pipelines/sqbuffer.go /func merge/,/unchanged/

While this fixes the blocked goroutine in this program, this is bad code. The
choice of buffer size of 1 here depends on knowing the number of values `merge`
will receive and the number of values downstream stages will consume. This is
fragile: if we pass an additional value to `gen`, or if the downstream stage
reads any fewer values, we will again have blocked goroutines.

Instead, we need to provide a way for downstream stages to indicate to the
senders that they will stop accepting input.

* Explicit cancellation

When `main` decides to exit without receiving all the values from
`out`, it must tell the goroutines in the upstream stages to abandon
the values they're trying to send. It does so by sending values on a
channel called `done`. It sends two values since there are
potentially two blocked senders:

.code pipelines/sqdone1.go /func main/,/^}/

The sending goroutines replace their send operation with a `select` statement
that proceeds either when the send on `out` happens or when they receive a value
from `done`.
The value type of `done` is the empty struct because the value
doesn't matter: it is the receive event that indicates the send on `out` should
be abandoned. The `output` goroutines continue looping on their inbound
channel, `c`, so the upstream stages are not blocked. (We'll discuss in a moment
how to allow this loop to return early.)

.code pipelines/sqdone1.go /func merge/,/unchanged/

This approach has a problem: _each_ downstream receiver needs to know the number
of potentially blocked upstream senders and arrange to signal those senders on
early return. Keeping track of these counts is tedious and error-prone.

We need a way to tell an unknown and unbounded number of goroutines to
stop sending their values downstream. In Go, we can do this by
closing a channel, because
[[http://golang.org/ref/spec#Receive_operator][a receive operation on a closed channel can always proceed immediately, yielding the element type's zero value.]]

This means that `main` can unblock all the senders simply by closing
the `done` channel. This close is effectively a broadcast signal to
the senders. We extend _each_ of our pipeline functions to accept
`done` as a parameter and arrange for the close to happen via a
`defer` statement, so that all return paths from `main` will signal
the pipeline stages to exit.

.code pipelines/sqdone3.go /func main/,/^}/

Each of our pipeline stages is now free to return as soon as `done` is closed.
The `output` routine in `merge` can return without draining its inbound channel,
since it knows the upstream sender, `sq`, will stop attempting to send when
`done` is closed. `output` ensures `wg.Done` is called on all return paths via
a `defer` statement:

.code pipelines/sqdone3.go /func merge/,/unchanged/

Similarly, `sq` can return as soon as `done` is closed.
`sq` ensures its `out`
channel is closed on all return paths via a `defer` statement:

.code pipelines/sqdone3.go /func sq/,/^}/

Here are the guidelines for pipeline construction:

- stages close their outbound channels when all the send operations are done.
- stages keep receiving values from inbound channels until those channels are closed or the senders are unblocked.

Pipelines unblock senders either by ensuring there's enough buffer for all the
values that are sent or by explicitly signalling senders when the receiver may
abandon the channel.

* Digesting a tree

Let's consider a more realistic pipeline.

MD5 is a message-digest algorithm that's useful as a file checksum. The command
line utility `md5sum` prints digest values for a list of files.

	% md5sum *.go
	d47c2bbc28298ca9befdfbc5d3aa4e65 bounded.go
	ee869afd31f83cbb2d10ee81b2b831dc parallel.go
	b88175e65fdcbc01ac08aaf1fd9b5e96 serial.go

Our example program is like `md5sum` but instead takes a single directory as an
argument and prints the digest values for each regular file under that
directory, sorted by path name.

	% go run serial.go .
	d47c2bbc28298ca9befdfbc5d3aa4e65 bounded.go
	ee869afd31f83cbb2d10ee81b2b831dc parallel.go
	b88175e65fdcbc01ac08aaf1fd9b5e96 serial.go

The main function of our program invokes a helper function `MD5All`, which
returns a map from path name to digest value, then sorts and prints the results:

.code pipelines/serial.go /func main/,/^}/

The `MD5All` function is the focus of our discussion. In
[[pipelines/serial.go][serial.go]], the implementation uses no concurrency and
simply reads and sums each file as it walks the tree.

.code pipelines/serial.go /MD5All/,/^}/

* Parallel digestion

In [[pipelines/parallel.go][parallel.go]], we split `MD5All` into a two-stage
pipeline.
The first stage, `sumFiles`, walks the tree, digests each file in
a new goroutine, and sends the results on a channel with value type `result`:

.code pipelines/parallel.go /type result/,/}/ HLresult

`sumFiles` returns two channels: one for the `results` and another for the error
returned by `filepath.Walk`. The walk function starts a new goroutine to
process each regular file, then checks `done`. If `done` is closed, the walk
stops immediately:

.code pipelines/parallel.go /func sumFiles/,/^}/

`MD5All` receives the digest values from `c`. `MD5All` returns early on error,
closing `done` via a `defer`:

.code pipelines/parallel.go /func MD5All/,/^}/ HLdone

* Bounded parallelism

The `MD5All` implementation in [[pipelines/parallel.go][parallel.go]]
starts a new goroutine for each file. In a directory with many large
files, this may allocate more memory than is available on the machine.

We can limit these allocations by bounding the number of files read in
parallel. In [[pipelines/bounded.go][bounded.go]], we do this by
creating a fixed number of goroutines for reading files. Our pipeline
now has three stages: walk the tree, read and digest the files, and
collect the digests.

The first stage, `walkFiles`, emits the paths of regular files in the tree:

.code pipelines/bounded.go /func walkFiles/,/^}/

The middle stage starts a fixed number of `digester` goroutines that receive
file names from `paths` and send `results` on channel `c`:

.code pipelines/bounded.go /func digester/,/^}/ HLpaths

Unlike our previous examples, `digester` does not close its output channel, as
multiple goroutines are sending on a shared channel.
Instead, code in `MD5All`
arranges for the channel to be closed when all the `digesters` are done:

.code pipelines/bounded.go /fixed number/,/End of pipeline/ HLc

We could instead have each digester create and return its own output
channel, but then we would need additional goroutines to fan in the
results.

The final stage receives all the `results` from `c` then checks the
error from `errc`. This check cannot happen any earlier, since before
this point, `walkFiles` may block sending values downstream:

.code pipelines/bounded.go /m := make/,/^}/ HLerrc

* Conclusion

This article has presented techniques for constructing streaming data pipelines
in Go. Dealing with failures in such pipelines is tricky, since each stage in
the pipeline may block attempting to send values downstream, and the downstream
stages may no longer care about the incoming data. We showed how closing a
channel can broadcast a "done" signal to all the goroutines started by a
pipeline and defined guidelines for constructing pipelines correctly.

Further reading:

- [[http://talks.golang.org/2012/concurrency.slide#1][Go Concurrency Patterns]] ([[https://www.youtube.com/watch?v=f6kdp27TYZs][video]]) presents the basics of Go's concurrency primitives and several ways to apply them.
- [[http://blog.golang.org/advanced-go-concurrency-patterns][Advanced Go Concurrency Patterns]] ([[http://www.youtube.com/watch?v=QDDwwePbDtw][video]]) covers more complex uses of Go's primitives, especially `select`.
- Douglas McIlroy's paper [[http://swtch.com/~rsc/thread/squint.pdf][Squinting at Power Series]] shows how Go-like concurrency provides elegant support for complex calculations.