Go Concurrency Patterns: Pipelines and cancellation
13 Mar 2014
Tags: concurrency, pipelines, cancellation

Sameer Ajmani

* Introduction

Go's concurrency primitives make it easy to construct streaming data pipelines
that make efficient use of I/O and multiple CPUs.  This article presents
examples of such pipelines, highlights subtleties that arise when operations
fail, and introduces techniques for dealing with failures cleanly.

* What is a pipeline?

There's no formal definition of a pipeline in Go; it's just one of many kinds of
concurrent programs.  Informally, a pipeline is a series of _stages_ connected
by channels, where each stage is a group of goroutines running the same
function.  In each stage, the goroutines

- receive values from _upstream_ via _inbound_ channels
- perform some function on that data, usually producing new values
- send values _downstream_ via _outbound_ channels

Each stage has any number of inbound and outbound channels, except the
first and last stages, which have only outbound or inbound channels,
respectively.  The first stage is sometimes called the _source_ or
_producer_; the last stage, the _sink_ or _consumer_.

We'll begin with a simple example pipeline to explain the ideas and techniques.
Later, we'll present a more realistic example.

* Squaring numbers

Consider a pipeline with three stages.

The first stage, `gen`, is a function that converts a list of integers to a
channel that emits the integers in the list.  The `gen` function starts a
goroutine that sends the integers on the channel and closes the channel when all
the values have been sent:

.code pipelines/square.go /func gen/,/^}/
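
The companion file isn't reproduced inline here; a minimal sketch of a `gen` function matching this description might look like the following (details may differ from the published example):

        // gen starts a goroutine that sends the given integers on the
        // returned channel and closes the channel once all values are sent.
        func gen(nums ...int) <-chan int {
            out := make(chan int)
            go func() {
                for _, n := range nums {
                    out <- n
                }
                close(out)
            }()
            return out
        }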

The second stage, `sq`, receives integers from a channel and returns a
channel that emits the square of each received integer.  After the
inbound channel is closed and this stage has sent all the values
downstream, it closes the outbound channel:

.code pipelines/square.go /func sq/,/^}/
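
A corresponding sketch of `sq`, again only an approximation of the companion file:

        // sq reads integers from in, squares each one, and sends the result
        // on the returned channel, closing it once in is closed and drained.
        func sq(in <-chan int) <-chan int {
            out := make(chan int)
            go func() {
                for n := range in {
                    out <- n * n
                }
                close(out)
            }()
            return out
        }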

The `main` function sets up the pipeline and runs the final stage: it receives
values from the second stage and prints each one, until the channel is closed:

.code pipelines/square.go /func main/,/^}/
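
A sketch of such a `main`, assuming the `gen` and `sq` sketches above and the `fmt` package:

        func main() {
            // Set up the pipeline.
            c := gen(2, 3)
            out := sq(c)

            // Consume the output; prints 4 then 9.
            fmt.Println(<-out)
            fmt.Println(<-out)
        }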

Since `sq` has the same type for its inbound and outbound channels, we
can compose it any number of times.  We can also rewrite `main` as a
range loop, like the other stages:

.code pipelines/square2.go /func main/,/^}/
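
The range-loop version might look something like:

        func main() {
            // Compose two sq stages and range over the result; prints 16 then 81.
            for n := range sq(sq(gen(2, 3))) {
                fmt.Println(n)
            }
        }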

* Fan-out, fan-in

Multiple functions can read from the same channel until that channel is closed;
this is called _fan-out_. This provides a way to distribute work amongst a group
of workers to parallelize CPU use and I/O.

A function can read from multiple inputs and proceed until all are closed by
multiplexing the input channels onto a single channel that's closed when all the
inputs are closed.  This is called _fan-in_.

We can change our pipeline to run two instances of `sq`, each reading from the
same input channel.  We introduce a new function, _merge_, to fan in the
results:

.code pipelines/sqfan.go /func main/,/^}/
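
A sketch of the fanned-out `main`, assuming a `merge` like the one described next:

        func main() {
            in := gen(2, 3)

            // Fan out: two sq instances read from the same inbound channel.
            c1 := sq(in)
            c2 := sq(in)

            // Fan in: consume the merged output from c1 and c2.
            for n := range merge(c1, c2) {
                fmt.Println(n) // 4 then 9, or 9 then 4
            }
        }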

The `merge` function converts a list of channels to a single channel by starting
a goroutine for each inbound channel that copies the values to the sole outbound
channel.  Once all the `output` goroutines have been started, `merge` starts one
more goroutine to close the outbound channel after all sends on that channel are
done.

Sends on a closed channel panic, so it's important to ensure all sends
are done before calling close.  The
[[http://golang.org/pkg/sync/#WaitGroup][`sync.WaitGroup`]] type
provides a simple way to arrange this synchronization:

.code pipelines/sqfan.go /func merge/,/^}/
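
A sketch of a `merge` along these lines, using `sync.WaitGroup` as described (it assumes the `sync` package is imported):

        // merge fans in any number of int channels onto a single channel.
        func merge(cs ...<-chan int) <-chan int {
            var wg sync.WaitGroup
            out := make(chan int)

            // Start an output goroutine for each inbound channel in cs.
            // output copies values from c to out until c is closed, then
            // calls wg.Done.
            output := func(c <-chan int) {
                for n := range c {
                    out <- n
                }
                wg.Done()
            }
            wg.Add(len(cs))
            for _, c := range cs {
                go output(c)
            }

            // Start a goroutine to close out once all the output goroutines
            // are done.  This must start after the wg.Add call.
            go func() {
                wg.Wait()
                close(out)
            }()
            return out
        }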

* Stopping short

There is a pattern to our pipeline functions:

- stages close their outbound channels when all the send operations are done.
- stages keep receiving values from inbound channels until those channels are closed.

This pattern allows each receiving stage to be written as a `range` loop and
ensures that all goroutines exit once all values have been successfully sent
downstream.

But in real pipelines, stages don't always receive all the inbound
values.  Sometimes this is by design: the receiver may only need a
subset of values to make progress.  More often, a stage exits early
because an inbound value represents an error in an earlier stage. In
either case the receiver should not have to wait for the remaining
values to arrive, and we want earlier stages to stop producing values
that later stages don't need.

In our example pipeline, if a stage fails to consume all the inbound values, the
goroutines attempting to send those values will block indefinitely:

.code pipelines/sqleak.go /first value/,/^}/
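
For instance, a `main` that reads only the first value might look like the following sketch; one of the `output` goroutines inside `merge` is left blocked on its send:

        func main() {
            in := gen(2, 3)
            c1 := sq(in)
            c2 := sq(in)

            // Consume the first value from the merged output, then return
            // without draining the rest.
            out := merge(c1, c2)
            fmt.Println(<-out) // 4 or 9

            // An output goroutine in merge is still blocked trying to send
            // the second value; it never exits.
        }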

This is a resource leak: goroutines consume memory and runtime resources, and
heap references in goroutine stacks keep data from being garbage collected.
Goroutines are not garbage collected; they must exit on their own.

We need to arrange for the upstream stages of our pipeline to exit even when the
downstream stages fail to receive all the inbound values.  One way to do this is
to change the outbound channels to have a buffer.  A buffer can hold a fixed
number of values; send operations complete immediately if there's room in the
buffer:

        c := make(chan int, 2) // buffer size 2
        c <- 1  // succeeds immediately
        c <- 2  // succeeds immediately
        c <- 3  // blocks until another goroutine does <-c and receives 1

When the number of values to be sent is known at channel creation time, a buffer
can simplify the code.  For example, we can rewrite `gen` to copy the list of
integers into a buffered channel and avoid creating a new goroutine:

.code pipelines/sqbuffer.go /func gen/,/^}/
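
A sketch of the buffered `gen`:

        // gen copies its arguments into a buffered channel and closes it;
        // no goroutine is needed because every send succeeds immediately.
        func gen(nums ...int) <-chan int {
            out := make(chan int, len(nums))
            for _, n := range nums {
                out <- n
            }
            close(out)
            return out
        }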

Returning to the blocked goroutines in our pipeline, we might consider adding a
buffer to the outbound channel returned by `merge`:

.code pipelines/sqbuffer.go /func merge/,/unchanged/
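
Relative to the `merge` sketch above, the only change would be the channel allocation, roughly:

        out := make(chan int, 1) // enough space for the unread inputs; the rest of merge is unchanged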

While this fixes the blocked goroutine in this program, this is bad code.  The
choice of buffer size of 1 here depends on knowing the number of values `merge`
will receive and the number of values downstream stages will consume.  This is
fragile: if we pass an additional value to `gen`, or if the downstream stage
reads any fewer values, we will again have blocked goroutines.

Instead, we need to provide a way for downstream stages to indicate to the
senders that they will stop accepting input.

* Explicit cancellation

When `main` decides to exit without receiving all the values from
`out`, it must tell the goroutines in the upstream stages to abandon
the values they're trying to send.  It does so by sending values on a
channel called `done`.  It sends two values since there are
potentially two blocked senders:

.code pipelines/sqdone1.go /func main/,/^}/
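
A sketch of such a `main`, assuming `merge` now takes a `done` channel as its first argument:

        func main() {
            in := gen(2, 3)
            c1 := sq(in)
            c2 := sq(in)

            // Consume the first value from the merged output.
            done := make(chan struct{}, 2)
            out := merge(done, c1, c2)
            fmt.Println(<-out) // 4 or 9

            // Tell the remaining senders we're leaving: one value per
            // potentially blocked sender.
            done <- struct{}{}
            done <- struct{}{}
        }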

The sending goroutines replace their send operation with a `select` statement
that proceeds either when the send on `out` happens or when they receive a value
from `done`.  The value type of `done` is the empty struct because the value
doesn't matter: it is the receive event that indicates the send on `out` should
be abandoned.  The `output` goroutines continue looping on their inbound
channel, `c`, so the upstream stages are not blocked. (We'll discuss in a moment
how to allow this loop to return early.)

.code pipelines/sqdone1.go /func merge/,/unchanged/
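
Relative to the earlier `merge` sketch, only the `output` closure changes (plus the new `done` parameter on `merge` itself); roughly:

        // output either sends n on out or receives from done, whichever
        // can proceed first; the rest of merge is unchanged.
        output := func(c <-chan int) {
            for n := range c {
                select {
                case out <- n:
                case <-done:
                }
            }
            wg.Done()
        }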

This approach has a problem: _each_ downstream receiver needs to know the number
of potentially blocked upstream senders and arrange to signal those senders on
early return.  Keeping track of these counts is tedious and error-prone.

We need a way to tell an unknown and unbounded number of goroutines to
stop sending their values downstream.  In Go, we can do this by
closing a channel, because
[[http://golang.org/ref/spec#Receive_operator][a receive operation on a closed channel can always proceed immediately, yielding the element type's zero value.]]

This means that `main` can unblock all the senders simply by closing
the `done` channel.  This close is effectively a broadcast signal to
the senders.  We extend _each_ of our pipeline functions to accept
`done` as a parameter and arrange for the close to happen via a
`defer` statement, so that all return paths from `main` will signal
the pipeline stages to exit.

.code pipelines/sqdone3.go /func main/,/^}/
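
A sketch of the reworked `main`, assuming every stage now takes `done` as its first parameter:

        func main() {
            // Set up a done channel shared by the whole pipeline, and close
            // it when this pipeline exits, as a signal for all the goroutines
            // we started to exit.
            done := make(chan struct{})
            defer close(done)

            in := gen(done, 2, 3)
            c1 := sq(done, in)
            c2 := sq(done, in)

            // Consume the first value from the merged output; the deferred
            // close of done unblocks everything else.
            out := merge(done, c1, c2)
            fmt.Println(<-out) // 4 or 9
        }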

Each of our pipeline stages is now free to return as soon as `done` is closed.
The `output` routine in `merge` can return without draining its inbound channel,
since it knows the upstream sender, `sq`, will stop attempting to send when
`done` is closed.  `output` ensures `wg.Done` is called on all return paths via
a `defer` statement:

.code pipelines/sqdone3.go /func merge/,/unchanged/
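
In sketch form, the `output` closure becomes (the rest of `merge` stays as in the earlier sketch):

        // output returns as soon as done is closed, without draining c;
        // the defer guarantees wg.Done runs on every return path.
        output := func(c <-chan int) {
            defer wg.Done()
            for n := range c {
                select {
                case out <- n:
                case <-done:
                    return
                }
            }
        }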

Similarly, `sq` can return as soon as `done` is closed.  `sq` ensures its `out`
channel is closed on all return paths via a `defer` statement:

.code pipelines/sqdone3.go /func sq/,/^}/
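
A sketch of the cancellable `sq`:

        // sq stops early when done is closed; the defer guarantees out is
        // closed on every return path so downstream range loops terminate.
        func sq(done <-chan struct{}, in <-chan int) <-chan int {
            out := make(chan int)
            go func() {
                defer close(out)
                for n := range in {
                    select {
                    case out <- n * n:
                    case <-done:
                        return
                    }
                }
            }()
            return out
        }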

Here are the guidelines for pipeline construction:

- stages close their outbound channels when all the send operations are done.
- stages keep receiving values from inbound channels until those channels are closed or the senders are unblocked.

Pipelines unblock senders either by ensuring there's enough buffer for all the
values that are sent or by explicitly signalling senders when the receiver may
abandon the channel.

* Digesting a tree

Let's consider a more realistic pipeline.

MD5 is a message-digest algorithm that's useful as a file checksum.  The command
line utility `md5sum` prints digest values for a list of files.

        % md5sum *.go
        d47c2bbc28298ca9befdfbc5d3aa4e65  bounded.go
        ee869afd31f83cbb2d10ee81b2b831dc  parallel.go
        b88175e65fdcbc01ac08aaf1fd9b5e96  serial.go

Our example program is like `md5sum` but instead takes a single directory as an
argument and prints the digest values for each regular file under that
directory, sorted by path name.

        % go run serial.go .
        d47c2bbc28298ca9befdfbc5d3aa4e65  bounded.go
        ee869afd31f83cbb2d10ee81b2b831dc  parallel.go
        b88175e65fdcbc01ac08aaf1fd9b5e96  serial.go

The main function of our program invokes a helper function `MD5All`, which
returns a map from path name to digest value, then sorts and prints the results:

.code pipelines/serial.go /func main/,/^}/
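
A sketch of such a `main`, assuming the `fmt`, `os`, and `sort` packages:

        func main() {
            // Calculate the MD5 sum of all files under the specified
            // directory, then print the results sorted by path name.
            m, err := MD5All(os.Args[1])
            if err != nil {
                fmt.Println(err)
                return
            }
            var paths []string
            for path := range m {
                paths = append(paths, path)
            }
            sort.Strings(paths)
            for _, path := range paths {
                fmt.Printf("%x  %s\n", m[path], path)
            }
        }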

The `MD5All` function is the focus of our discussion.  In
[[pipelines/serial.go][serial.go]], the implementation uses no concurrency and
simply reads and sums each file as it walks the tree.

.code pipelines/serial.go /MD5All/,/^}/
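
A sketch of a serial `MD5All` along these lines, assuming the `crypto/md5`, `io/ioutil`, `os`, and `path/filepath` packages:

        // MD5All reads all the files in the tree rooted at root and returns
        // a map from file path to the MD5 sum of the file's contents.  If
        // the walk or any read fails, MD5All returns an error.
        func MD5All(root string) (map[string][md5.Size]byte, error) {
            m := make(map[string][md5.Size]byte)
            err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
                if err != nil {
                    return err
                }
                if !info.Mode().IsRegular() {
                    return nil
                }
                data, err := ioutil.ReadFile(path)
                if err != nil {
                    return err
                }
                m[path] = md5.Sum(data)
                return nil
            })
            if err != nil {
                return nil, err
            }
            return m, nil
        }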

* Parallel digestion

In [[pipelines/parallel.go][parallel.go]], we split `MD5All` into a two-stage
pipeline.  The first stage, `sumFiles`, walks the tree, digests each file in
a new goroutine, and sends the results on a channel with value type `result`:

.code pipelines/parallel.go /type result/,/}/  HLresult
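
A `result` matching that description might be sketched as:

        // A result carries one file's path and MD5 digest, plus any error
        // encountered while reading it.
        type result struct {
            path string
            sum  [md5.Size]byte
            err  error
        }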

`sumFiles` returns two channels: one for the `results` and another for the error
returned by `filepath.Walk`.  The walk function starts a new goroutine to
process each regular file, then checks `done`.  If `done` is closed, the walk
stops immediately:

.code pipelines/parallel.go /func sumFiles/,/^}/
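
A sketch of a `sumFiles` with this structure, reusing the `result` type sketched above (it also assumes the `errors`, `io/ioutil`, `os`, `path/filepath`, and `sync` packages):

        // sumFiles starts goroutines to walk the tree at root and digest
        // each regular file.  The digests are sent on the first returned
        // channel and the result of the walk itself on the second.
        func sumFiles(done <-chan struct{}, root string) (<-chan result, <-chan error) {
            c := make(chan result)
            errc := make(chan error, 1)
            go func() {
                var wg sync.WaitGroup
                err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
                    if err != nil {
                        return err
                    }
                    if !info.Mode().IsRegular() {
                        return nil
                    }
                    // Digest each file in its own goroutine.
                    wg.Add(1)
                    go func() {
                        data, err := ioutil.ReadFile(path)
                        select {
                        case c <- result{path, md5.Sum(data), err}:
                        case <-done:
                        }
                        wg.Done()
                    }()
                    // Abort the walk if done is closed.
                    select {
                    case <-done:
                        return errors.New("walk canceled")
                    default:
                        return nil
                    }
                })
                // Walk has returned, so all calls to wg.Add are done.  Start
                // a goroutine to close c once all the sends are done.
                go func() {
                    wg.Wait()
                    close(c)
                }()
                // No select needed here, since errc is buffered.
                errc <- err
            }()
            return c, errc
        }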

`MD5All` receives the digest values from `c`.  `MD5All` returns early on error,
closing `done` via a `defer`:

.code pipelines/parallel.go /func MD5All/,/^}/  HLdone
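
A sketch of the two-stage `MD5All`:

        // MD5All returns a map from file path to the MD5 sum of the file's
        // contents for every regular file under root.  It returns early on
        // the first error; closing done via defer then tells sumFiles to stop.
        func MD5All(root string) (map[string][md5.Size]byte, error) {
            // MD5All closes the done channel when it returns; it may do so
            // before receiving all the values from c and errc.
            done := make(chan struct{})
            defer close(done)

            c, errc := sumFiles(done, root)

            m := make(map[string][md5.Size]byte)
            for r := range c {
                if r.err != nil {
                    return nil, r.err
                }
                m[r.path] = r.sum
            }
            if err := <-errc; err != nil {
                return nil, err
            }
            return m, nil
        }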

* Bounded parallelism

The `MD5All` implementation in [[pipelines/parallel.go][parallel.go]]
starts a new goroutine for each file. In a directory with many large
files, this may allocate more memory than is available on the machine.

We can limit these allocations by bounding the number of files read in
parallel.  In [[pipelines/bounded.go][bounded.go]], we do this by
creating a fixed number of goroutines for reading files.  Our pipeline
now has three stages: walk the tree, read and digest the files, and
collect the digests.

The first stage, `walkFiles`, emits the paths of regular files in the tree:

.code pipelines/bounded.go /func walkFiles/,/^}/
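
A sketch of a `walkFiles` matching this description:

        // walkFiles starts a goroutine to walk the tree at root and send the
        // path of each regular file on the string channel.  It sends the
        // result of the walk on the error channel.  If done is closed,
        // walkFiles abandons its work.
        func walkFiles(done <-chan struct{}, root string) (<-chan string, <-chan error) {
            paths := make(chan string)
            errc := make(chan error, 1)
            go func() {
                // Close the paths channel after Walk returns.
                defer close(paths)
                // No select needed for this send, since errc is buffered.
                errc <- filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
                    if err != nil {
                        return err
                    }
                    if !info.Mode().IsRegular() {
                        return nil
                    }
                    select {
                    case paths <- path:
                    case <-done:
                        return errors.New("walk canceled")
                    }
                    return nil
                })
            }()
            return paths, errc
        }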

The middle stage starts a fixed number of `digester` goroutines that receive
file names from `paths` and send `results` on channel `c`:

.code pipelines/bounded.go /func digester/,/^}/ HLpaths
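
A sketch of `digester`, reusing the `result` type from the parallel version:

        // digester reads path names from paths and sends digests of the
        // corresponding files on c until either paths or done is closed.
        func digester(done <-chan struct{}, paths <-chan string, c chan<- result) {
            for path := range paths {
                data, err := ioutil.ReadFile(path)
                select {
                case c <- result{path, md5.Sum(data), err}:
                case <-done:
                    return
                }
            }
        }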

Unlike our previous examples, `digester` does not close its output channel, as
multiple goroutines are sending on a shared channel.  Instead, code in `MD5All`
arranges for the channel to be closed when all the `digesters` are done:

.code pipelines/bounded.go /fixed number/,/End of pipeline/ HLc
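
A sketch of the whole bounded `MD5All`, including the final collection stage discussed below; the digester count of 20 is an arbitrary choice for illustration:

        func MD5All(root string) (map[string][md5.Size]byte, error) {
            // MD5All closes the done channel when it returns; it may do so
            // before receiving all the values from c and errc.
            done := make(chan struct{})
            defer close(done)

            paths, errc := walkFiles(done, root)

            // Start a fixed number of goroutines to read and digest files.
            c := make(chan result)
            var wg sync.WaitGroup
            const numDigesters = 20
            wg.Add(numDigesters)
            for i := 0; i < numDigesters; i++ {
                go func() {
                    digester(done, paths, c)
                    wg.Done()
                }()
            }
            go func() {
                wg.Wait()
                close(c)
            }()
            // End of pipeline.

            m := make(map[string][md5.Size]byte)
            for r := range c {
                if r.err != nil {
                    return nil, r.err
                }
                m[r.path] = r.sum
            }
            // Check whether the Walk failed.  This cannot happen earlier:
            // until c is drained, walkFiles may block sending paths downstream.
            if err := <-errc; err != nil {
                return nil, err
            }
            return m, nil
        }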

We could instead have each digester create and return its own output
channel, but then we would need additional goroutines to fan in the
results.

The final stage receives all the `results` from `c` then checks the
error from `errc`.  This check cannot happen any earlier, since before
this point, `walkFiles` may block sending values downstream:

.code pipelines/bounded.go /m := make/,/^}/ HLerrc

* Conclusion

This article has presented techniques for constructing streaming data pipelines
in Go.  Dealing with failures in such pipelines is tricky, since each stage in
the pipeline may block attempting to send values downstream, and the downstream
stages may no longer care about the incoming data.  We showed how closing a
channel can broadcast a "done" signal to all the goroutines started by a
pipeline and defined guidelines for constructing pipelines correctly.

Further reading:

- [[http://talks.golang.org/2012/concurrency.slide#1][Go Concurrency Patterns]] ([[https://www.youtube.com/watch?v=f6kdp27TYZs][video]]) presents the basics of Go's concurrency primitives and several ways to apply them.
- [[http://blog.golang.org/advanced-go-concurrency-patterns][Advanced Go Concurrency Patterns]] ([[http://www.youtube.com/watch?v=QDDwwePbDtw][video]]) covers more complex uses of Go's primitives, especially `select`.
- Douglas McIlroy's paper [[http://swtch.com/~rsc/thread/squint.pdf][Squinting at Power Series]] shows how Go-like concurrency provides elegant support for complex calculations.