---
title: Bigslice - cluster computing in Go
layout: default
---

<a href="https://github.com/grailbio/bigslice/" class="github-corner" aria-label="View source on GitHub"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#800080; color:#fff; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path></svg></a><style>.github-corner:hover .octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style>

# Bigslice

<img src="bigslice.png" alt="Bigslice gopher" height="200"/>

Bigslice is a system for
<i>fast</i>, large-scale,
serverless data processing
using [Go](https://golang.org).

Bigslice provides an API that lets users express their
computation with a handful of familiar data transformation
primitives such as
<span class="small">map</span>,
<span class="small">filter</span>,
<span class="small">reduce</span>, and
<span class="small">join</span>.
When the program is run,
Bigslice creates an ad hoc cluster on a cloud computing provider
and transparently distributes the computation among the nodes
in the cluster.

Bigslice is similar to data processing systems like
[Apache Spark](https://spark.apache.org/)
and [FlumeJava](https://ai.google/research/pubs/pub35650),
but with different aims:
* *Bigslice is built for Go.* Bigslice is used as an ordinary Go package,
  users can reuse their existing Go code, and Bigslice binaries are compiled
  like ordinary Go binaries.
* *Bigslice is serverless.* Requiring nothing more than cloud credentials,
  Bigslice will have you processing large datasets in no time, without the use
  of any other external infrastructure.
* *Bigslice is simple and transparent.* Bigslice programs are regular
  Go programs, providing users with a familiar environment and tools.
  A Bigslice program can be run on a single node like any other program,
  but it is also capable of transparently distributing itself across an
  ad hoc cluster, managed entirely by the program itself.
    45  
    46  <div class="links">
    47  <a href="https://github.com/grailbio/bigslice">GitHub project</a> &middot;
    48  <a href="https://godoc.org/github.com/grailbio/bigslice">API documentation</a> &middot;
    49  <a href="https://github.com/grailbio/bigslice/issues">issue tracker</a> &middot;
    50  <a href="https://godoc.org/github.com/grailbio/bigmachine">bigmachine</a>
    51  </div>
    52  
    53  # Getting started
    54  
    55  To get a sense of what writing and running Bigslice programs is like,
    56  we'll implement a simple word counter,
    57  computing the frequencies of words used in Shakespeare's combined works.
    58  Of course, it would be silly to use Bigslice for a corpus this small,
    59  but it serves to illustrate the various features of Bigslice, and,
    60  because the data are small, we enjoy a very quick feedback loop.
    61  
First, we'll install the bigslice command.
This command is not strictly needed to use Bigslice,
but it makes common tasks and setup simple.
The bigslice command helps us build and run Bigslice programs and,
as we'll see later,
also performs the necessary setup for your cloud provider.

```
go install github.com/grailbio/bigslice/cmd/bigslice@latest
```

Now, we write a Go file that implements our word count.
Don't worry too much about the details for now;
we'll go over this later.

```
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"net/http"
	"sort"
	"strings"

	"github.com/grailbio/bigslice"
	"github.com/grailbio/bigslice/sliceconfig"
)

var wordCount = bigslice.Func(func(url string) bigslice.Slice {
	slice := bigslice.ScanReader(8, func() (io.ReadCloser, error) {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != 200 {
			return nil, fmt.Errorf("get %v: %v", url, resp.Status)
		}
		return resp.Body, nil
	})
	slice = bigslice.Flatmap(slice, strings.Fields)
	slice = bigslice.Map(slice, func(token string) (string, int) {
		return token, 1
	})
	slice = bigslice.Reduce(slice, func(a, e int) int {
		return a + e
	})
	return slice
})

const shakespeare = "https://ocw.mit.edu/ans7870/6" +
	"/6.006/s08/lecturenotes/files/t8.shakespeare.txt"

func main() {
	sess := sliceconfig.Parse()
	defer sess.Shutdown()

	ctx := context.Background()
	tokens, err := sess.Run(ctx, wordCount, shakespeare)
	if err != nil {
		log.Fatal(err)
	}
	scanner := tokens.Scanner()
	defer scanner.Close()
	type counted struct {
		token string
		count int
	}
	var (
		token  string
		count  int
		counts []counted
	)
	for scanner.Scan(ctx, &token, &count) {
		counts = append(counts, counted{token, count})
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	sort.Slice(counts, func(i, j int) bool {
		return counts[i].count > counts[j].count
	})
	if len(counts) > 10 {
		counts = counts[:10]
	}
	for _, count := range counts {
		fmt.Println(count.token, count.count)
	}
}
```

Now that we have our computation,
we can run it with the bigslice tool.
In order to test it out,
we'll run it in local mode.

```
$ bigslice run shake.go -local
the 23242
I 19540
and 18297
to 15623
of 15544
a 12532
my 10824
in 9576
you 9081
is 7851
$
```

Let's run the same thing on EC2.
First we run `bigslice setup-ec2`
to configure the required EC2 security group,
and then run shake.go without the `-local` flag:
```
$ bigslice setup-ec2
bigslice: no existing bigslice security group found; creating new
bigslice: found default VPC vpc-2c860354
bigslice: authorizing ingress traffic for security group sg-0d4f69daa025633f9
bigslice: tagging security group sg-0d4f69daa025633f9
bigslice: created security group sg-0d4f69daa025633f9
bigslice: set up new security group sg-0d4f69daa025633f9
bigslice: wrote configuration to /Users/marius/.bigslice/config
$ bigslice run shake.go
2019/09/26 07:43:33 http: serve :3333
2019/09/26 07:43:33 slicemachine: 0 machines (0 procs); 1 machines pending (3 procs)
2019/09/26 07:43:33 slicemachine: 0 machines (0 procs); 2 machines pending (6 procs)
2019/09/26 07:43:33 slicemachine: 0 machines (0 procs); 3 machines pending (9 procs)
the 23242
I 19540
and 18297
to 15623
of 15544
a 12532
my 10824
in 9576
you 9081
is 7851
$
```

Bigslice launched an ad hoc cluster of 3 nodes
in order to perform the computation;
as soon as the job finished,
the cluster tore itself down automatically.

While a job is running,
Bigslice exports its status via a built-in HTTP server.
For example,
we can inspect the current status of the job
using [curl](https://curl.haxx.se/).

```
$ curl :3333/debug/status
bigmachine:
  :  waiting for machine to boot  33s
  :  waiting for machine to boot  19s
  :  waiting for machine to boot  19s
run /Users/marius/shake.go:41 [1] slices: count: 4
  reader@/Users/marius/shake.go:17:   tasks idle/running/done: 8/0/0  1m6s
  flatmap@/Users/marius/shake.go:27:  tasks idle/running/done: 8/0/0  1m6s
  map@/Users/marius/shake.go:28:      tasks idle/running/done: 8/0/0  1m6s
  reduce@/Users/marius/shake.go:29:   tasks idle/running/done: 8/0/0  1m6s
run /Users/marius/shake.go:41 [1] tasks: tasks: runnable: 8
  inv1_reader_flatmap_map@8:0(1):  waiting for a machine  1m6s
  inv1_reader_flatmap_map@8:1(1):  waiting for a machine  1m6s
  inv1_reader_flatmap_map@8:2(1):  waiting for a machine  1m6s
  inv1_reader_flatmap_map@8:3(1):  waiting for a machine  1m6s
  inv1_reader_flatmap_map@8:4(1):  waiting for a machine  1m6s
  inv1_reader_flatmap_map@8:5(1):  waiting for a machine  1m6s
  inv1_reader_flatmap_map@8:6(1):  waiting for a machine  1m6s
  inv1_reader_flatmap_map@8:7(1):  waiting for a machine  1m6s
```

The first clause tells us there are 3 machines
(in this case, EC2 instances)
waiting to boot.
The second clause shows the status of the tasks associated with
the slice operations at the given source lines, above.
In this case, every task is idle
because there are not yet any machines on which to run them.
The final clause shows the physical tasks
that require scheduling by Bigslice.

A little later,
we query the status again.

```
$ curl :3333/debug/status
bigmachine:
  :                  waiting for machine to boot                                                 36s
  https://ec2-.../:  mem 117.0MiB/15.2GiB disk 62.4MiB/7.6GiB load 0.4/0.1/0.0 counters tasks:4  22s
  https://ec2-.../:  mem 120.8MiB/15.2GiB disk 62.4MiB/7.6GiB load 0.2/0.1/0.0 counters tasks:4  22s
run /Users/marius/shake.go:41 [1] slices: count: 4
  reader@/Users/marius/shake.go:17:   tasks idle/running/done: 0/8/0  1m8s
  flatmap@/Users/marius/shake.go:27:  tasks idle/running/done: 0/8/0  1m8s
  map@/Users/marius/shake.go:28:      tasks idle/running/done: 0/8/0  1m8s
  reduce@/Users/marius/shake.go:29:   tasks idle/running/done: 8/0/0  1m8s
run /Users/marius/shake.go:41 [1] tasks: tasks: runnable: 8
  inv1_reader_flatmap_map@8:0(1):  https://ec2-18-236-204-88.../  1m8s
  inv1_reader_flatmap_map@8:1(1):  https://ec2-34-221-236-36.../  1m8s
  inv1_reader_flatmap_map@8:2(1):  https://ec2-18-236-204-88.../  1m8s
  inv1_reader_flatmap_map@8:3(1):  https://ec2-18-236-204-88.../  1m8s
  inv1_reader_flatmap_map@8:4(1):  https://ec2-34-221-236-36.../  1m8s
  inv1_reader_flatmap_map@8:5(1):  https://ec2-18-236-204-88.../  1m8s
  inv1_reader_flatmap_map@8:6(1):  https://ec2-34-221-236-36.../  1m8s
  inv1_reader_flatmap_map@8:7(1):  https://ec2-34-221-236-36.../  1m8s
```

This time,
we see that the computation is in progress.
Two of the three machines in the cluster have booted;
the first clause shows the resource utilization of these machines.
Next, we see that every step but the reduce is currently running.
This is because <span class="small">reduce</span> requires
a shuffle step, and so depends upon the completion
of its antecedent tasks.
Finally,
the last clause shows the individual tasks and their runtimes.

Note that there is not a one-to-one correspondence
between the high-level slice operations in the second clause
and the tasks in the third.
This is for two reasons.
First, Bigslice *pipelines* operations when it can.
The task names hint at this:
each currently running task corresponds to
a pipeline of reader, flatmap, and map.
Second, the underlying data are split into individual *shards*,
with each task handling a subset of the data.
This is how Bigslice parallelizes computation.
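
The shard count is under the programmer's control:
the first argument to `ScanReader` in our program (8) sets the number of shards,
which bounds how many tasks can process the input in parallel.
A sketch, assuming a hypothetical `openCorpus` function with the same
signature as the reader-opening func we passed above:

```
// 32 shards instead of 8: up to 32 tasks can read the corpus in parallel.
// openCorpus is a hypothetical func() (io.ReadCloser, error).
slice := bigslice.ScanReader(32, openCorpus)
```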

Let's walk through the code.

```
var wordCount = bigslice.Func(func(url string) bigslice.Slice {
	slice := bigslice.ScanReader(8, func() (io.ReadCloser, error) {  // (1)
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != 200 {
			return nil, fmt.Errorf("get %v: %v", url, resp.Status)
		}
		return resp.Body, nil
	})
	slice = bigslice.Flatmap(slice, strings.Fields)                  // (2)
	slice = bigslice.Map(slice, func(token string) (string, int) {   // (3)
		return token, 1
	})
	slice = bigslice.Reduce(slice, func(a, e int) int {              // (4)
		return a + e
	})
	return slice
})
```

Every Bigslice operation must be implemented by a `bigslice.Func`.
A `bigslice.Func` wraps an ordinary Go func value so that
it can be invoked by Bigslice. `bigslice.Func`s must return values of
type `bigslice.Slice`, which describe the actual computation to be performed.
This may seem like an indirect way of doing things,
but it provides two big advantages.
First, by using `bigslice.Func`,
Bigslice can name and run Go code remotely without
performing on-demand compilation or shipping a whole toolchain.
Second, by expressing data processing tasks in terms of
`bigslice.Slice` values,
Bigslice is free to partition, distribute, and retry bits of
the operations in ways not specified by the user.

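To make the pattern concrete,
here is a minimal sketch of another `bigslice.Func`,
hypothetical and separate from our word-count program.
Funcs are declared at package scope so that every process,
driver and worker alike, creates them in the same, deterministic order:

```
// histogram counts the occurrences of each value in a fixed dataset.
var histogram = bigslice.Func(func(vals []int) bigslice.Slice {
	slice := bigslice.Const(4, vals) // a constant dataset, split into 4 shards
	counted := bigslice.Map(slice, func(v int) (int, int) { return v, 1 })
	return bigslice.Reduce(counted, func(a, e int) int { return a + e })
})
```
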
In our example,
we define a word count operation as a function of a URL.
The first operation is a [ScanReader](https://godoc.org/github.com/grailbio/bigslice#ScanReader) (1),
which takes a function that opens an `io.ReadCloser`
and returns a `bigslice.Slice` representing the lines scanned from that reader.
The type of this `bigslice.Slice` value is schematically `bigslice.Slice<string>`.
While we do not have generics in Go,
a `bigslice.Slice` can nevertheless represent a container of any underlying type;
Bigslice performs runtime type checking to make sure that incompatible
`bigslice.Slice` operators are not combined.

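For example,
at this point `slice` is schematically a `bigslice.Slice<string>`,
so applying an operator whose function expects a different row type
fails with a type error when the operation is constructed,
rather than deep into a distributed run.
A sketch, deliberately wrong:

```
// Type error: slice's rows are strings, but this func expects an int.
// Bigslice panics with a type-checking error when Map is applied.
bad := bigslice.Map(slice, func(n int) int { return n * 2 })
```
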
We then take the output and tokenize it with
[Flatmap](https://godoc.org/github.com/grailbio/bigslice#Flatmap) (2),
which maps each input string (a line) to a list of strings (its tokens).
The resulting `bigslice.Slice` represents
all the tokens in the corpus.
Note that since `strings.Fields` already has the correct signature,
we did not need to wrap it in a `func` of our own.

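Had `strings.Fields` not matched the expected signature,
we could have supplied the wrapper ourselves;
the following is equivalent to the call above:

```
slice = bigslice.Flatmap(slice, func(line string) []string {
	return strings.Fields(line)
})
```
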
Next,
we map (3) each token found in the corpus to two columns:
the first column is the token itself,
and the second column is the integer value 1,
representing the count of that token.
`bigslice.Slice` values may contain multiple columns of values;
they are analogous to tuples in other programming languages.
The type of the returned `bigslice.Slice` is schematically
`bigslice.Slice<string, int>`.

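Nor are slices limited to two columns:
a map function may return any number of values,
each becoming a column of the resulting slice.
A hypothetical sketch, producing a `bigslice.Slice<string, int, bool>`:

```
// Adds a third column derived from the first: a flag for long tokens.
flagged := bigslice.Map(slice, func(token string, count int) (string, int, bool) {
	return token, count, len(token) > 7
})
```
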
Finally,
we apply `bigslice.Reduce` (4) to the slice of token counts.
The reduce operation aggregates the values for each unique
value of the first column (the "key").
In this case, we just add the counts together
to produce the final count for each unique token.

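The reducer may be any function that combines two values of the
value column; for instance, a sketch of the same call with a
different combiner, keeping a maximum instead of a sum:

```
// Keep the largest value observed per token rather than their sum.
slice = bigslice.Reduce(slice, func(a, e int) int {
	if e > a {
		return e
	}
	return a
})
```
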
That's the end of our `bigslice.Func`.
Let's look at our `main` function.

```
func main() {
	sess := sliceconfig.Parse()                          // (1)
	defer sess.Shutdown()

	ctx := context.Background()
	tokens, err := sess.Run(ctx, wordCount, shakespeare) // (2)
	if err != nil {
		log.Fatal(err)
	}
	scanner := tokens.Scanner()                          // (3)
	defer scanner.Close()
	type counted struct {
		token string
		count int
	}
	var (
		token  string
		count  int
		counts []counted
	)
	for scanner.Scan(ctx, &token, &count) {              // (4)
		counts = append(counts, counted{token, count})
	}
	if err := scanner.Err(); err != nil {                // (5)
		log.Fatal(err)
	}

	sort.Slice(counts, func(i, j int) bool {
		return counts[i].count > counts[j].count
	})
	if len(counts) > 10 {
		counts = counts[:10]
	}
	for _, count := range counts {
		fmt.Println(count.token, count.count)
	}
}
```

First, notice that our program is an ordinary Go program,
with an ordinary entry point.
While Bigslice offers low-level APIs to set up a Bigslice session,
the [sliceconfig](https://godoc.org/github.com/grailbio/bigslice/sliceconfig)
package offers a convenient way to set up such
a session by reading the configuration in `$HOME/.bigslice/config`,
which in our case was written by `bigslice setup-ec2`.
`sliceconfig.Parse` (1) reads the Bigslice configuration,
parses command line flags,
and then sets up a session accordingly.
The Bigslice session is required in order to invoke `bigslice.Func`s.

That is exactly what we do next (2):
we invoke the `wordCount` `bigslice.Func` with the
Shakespeare corpus URL as an argument.
The returned value represents the results of the computation.
We can scan the result to extract the rows,
each of which consists of two columns:
the token and
the number of times that token occurred in the corpus.
Scanning in Bigslice follows the general pattern for scanning in Go:
first, we extract a scanner (3),
whose `Scan` method (4) returns a boolean indicating whether to continue scanning
(and also populates the value for each column in the scanned row),
and whose `Err` method (5) returns any error that occurred while scanning.

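Note also that a session may invoke a `bigslice.Func` any number of times;
each invocation runs as its own computation on the same cluster.
A sketch with a hypothetical second corpus URL:

```
// melvilleURL is a hypothetical second corpus; wordCount is reused as-is.
melville, err := sess.Run(ctx, wordCount, melvilleURL)
if err != nil {
	log.Fatal(err)
}
scanner := melville.Scanner() // scan the new results just as before
defer scanner.Close()
```
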
# Some more details to keep you going

In the examples above,
we used the [bigslice command](https://godoc.org/github.com/grailbio/bigslice/cmd/bigslice)
to build and run Bigslice jobs.
It is needed only to build "fat" binaries that contain builds
both for the host operating system and architecture
and for linux/amd64,
which is what the cluster nodes run[^cluster-arch].
If your host is already linux/amd64,
then `bigslice build` and `bigslice run`
are equivalent to `go build` and `go run`.

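For example, we can build the fat binary once and run it directly
(a sketch; as with `go build`, we assume the binary takes its name
from the source file):

```
$ bigslice build shake.go
$ ./shake -local    # run on the local machine
$ ./shake           # run on an ad hoc cloud cluster
```
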
Bigslice
uses package [github.com/grailbio/base/config](https://godoc.org/github.com/grailbio/base/config)
to maintain its configuration at `$HOME/.bigslice/config`.
You can either edit this file directly
or override individual parameters
at runtime with the `-set` flag,
for example,
to use [m5.24xlarge](https://aws.amazon.com/ec2/instance-types/m5/)
instances in the Bigslice cluster:

```
$ bigslice run shake.go -set bigmachine/ec2cluster.instance=m5.24xlarge
```

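The same override can be made permanent by editing the configuration file;
schematically, an entry like the following
(see the base/config documentation for the exact profile syntax):

```
param bigmachine/ec2cluster instance = "m5.24xlarge"
```
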
Bigslice uses [bigmachine](https://github.com/grailbio/bigmachine)
to manage clusters of cloud compute instances.
See its documentation for further details.

# Articles

- <a href="implementation.html">About the implementation</a>
- <a href="parallelism.html">Parallelism in Bigslice</a>

[^cluster-arch]: Bigslice, by way of Bigmachine, currently only supports
    linux/amd64 remote cluster instances.

<footer>
The Bigslice gopher design was inspired by
<a href="http://reneefrench.blogspot.com/">Renee French</a>.
</footer>