---
title: Bigslice - cluster computing in Go
layout: default
---

<a href="https://github.com/grailbio/bigslice/" class="github-corner" aria-label="View source on GitHub"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#800080; color:#fff; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path></svg></a><style>.github-corner:hover .octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style>

# Bigslice

<img src="bigslice.png" alt="Bigslice gopher" height="200"/>

Bigslice is a system for
<i>fast</i>, large-scale,
serverless data processing
using [Go](https://golang.org).

Bigslice provides an API that lets users express their
computation with a handful of familiar data transformation
primitives such as
<span class="small">map</span>,
<span class="small">filter</span>,
<span class="small">reduce</span>, and
<span class="small">join</span>.
When the program is run,
Bigslice creates an ad hoc cluster on a cloud computing provider
and transparently distributes the computation among the nodes
in the cluster.

Bigslice is similar to data processing systems like
[Apache Spark](https://spark.apache.org/)
and [FlumeJava](https://ai.google/research/pubs/pub35650),
but with different aims:

* *Bigslice is built for Go.* Bigslice is used as an ordinary Go package,
  users can use their existing Go code, and Bigslice binaries are compiled
  like ordinary Go binaries.
* *Bigslice is serverless.* Requiring nothing more than cloud credentials,
  Bigslice will have you processing large datasets in no time, without
  any other external infrastructure.
* *Bigslice is simple and transparent.* Bigslice programs are regular
  Go programs, providing users with a familiar environment and tools.
  A Bigslice program can be run on a single node like any other program,
  but it is also capable of transparently distributing itself across an
  ad hoc cluster, managed entirely by the program itself.

<div class="links">
<a href="https://github.com/grailbio/bigslice">GitHub project</a> ·
<a href="https://godoc.org/github.com/grailbio/bigslice">API documentation</a> ·
<a href="https://github.com/grailbio/bigslice/issues">issue tracker</a> ·
<a href="https://godoc.org/github.com/grailbio/bigmachine">bigmachine</a>
</div>

# Getting started

To get a sense of what writing and running Bigslice programs is like,
we'll implement a simple word counter,
computing the frequencies of words used in Shakespeare's combined works.
Of course, it would be silly to use Bigslice for a corpus this small,
but it serves to illustrate the various features of Bigslice, and,
because the data are small, we enjoy a very quick feedback loop.

First, we'll install the bigslice command.
This command is not strictly needed to use Bigslice,
but it makes common build and setup tasks simple.
The bigslice command helps us to build and run Bigslice programs and,
as we'll see later,
to perform the necessary setup for your cloud provider.

```
GO111MODULE=on go get github.com/grailbio/bigslice/cmd/bigslice@latest
```

Now, we write a Go file that implements our word count.
Don't worry too much about the details for now;
we'll go over them later.

```
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"net/http"
	"sort"
	"strings"

	"github.com/grailbio/bigslice"
	"github.com/grailbio/bigslice/sliceconfig"
)

var wordCount = bigslice.Func(func(url string) bigslice.Slice {
	slice := bigslice.ScanReader(8, func() (io.ReadCloser, error) {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != 200 {
			return nil, fmt.Errorf("get %v: %v", url, resp.Status)
		}
		return resp.Body, nil
	})
	slice = bigslice.Flatmap(slice, strings.Fields)
	slice = bigslice.Map(slice, func(token string) (string, int) {
		return token, 1
	})
	slice = bigslice.Reduce(slice, func(a, e int) int {
		return a + e
	})
	return slice
})

const shakespeare = "https://ocw.mit.edu/ans7870/6" +
	"/6.006/s08/lecturenotes/files/t8.shakespeare.txt"

func main() {
	sess := sliceconfig.Parse()
	defer sess.Shutdown()

	ctx := context.Background()
	tokens, err := sess.Run(ctx, wordCount, shakespeare)
	if err != nil {
		log.Fatal(err)
	}
	scanner := tokens.Scanner()
	defer scanner.Close()
	type counted struct {
		token string
		count int
	}
	var (
		token  string
		count  int
		counts []counted
	)
	for scanner.Scan(ctx, &token, &count) {
		counts = append(counts, counted{token, count})
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	sort.Slice(counts, func(i, j int) bool {
		return counts[i].count > counts[j].count
	})
	if len(counts) > 10 {
		counts = counts[:10]
	}
	for _, count := range counts {
		fmt.Println(count.token, count.count)
	}
}
```

Now that we have our computation,
we can run it with the bigslice tool.
To test it out,
we'll run it in local mode.

```
$ GO111MODULE=on bigslice run shake.go -local
the 23242
I 19540
and 18297
to 15623
of 15544
a 12532
my 10824
in 9576
you 9081
is 7851
$
```

Let's run the same thing on EC2.
First we run `bigslice setup-ec2`
to configure the required EC2 security group,
and then run shake.go without the -local flag:

```
$ bigslice setup-ec2
bigslice: no existing bigslice security group found; creating new
bigslice: found default VPC vpc-2c860354
bigslice: authorizing ingress traffic for security group sg-0d4f69daa025633f9
bigslice: tagging security group sg-0d4f69daa025633f9
bigslice: created security group sg-0d4f69daa025633f9
bigslice: set up new security group sg-0d4f69daa025633f9
bigslice: wrote configuration to /Users/marius/.bigslice/config
$ GO111MODULE=on bigslice run shake.go
2019/09/26 07:43:33 http: serve :3333
2019/09/26 07:43:33 slicemachine: 0 machines (0 procs); 1 machines pending (3 procs)
2019/09/26 07:43:33 slicemachine: 0 machines (0 procs); 2 machines pending (6 procs)
2019/09/26 07:43:33 slicemachine: 0 machines (0 procs); 3 machines pending (9 procs)
the 23242
I 19540
and 18297
to 15623
of 15544
a 12532
my 10824
in 9576
you 9081
is 7851
$
```

Bigslice launched an ad hoc cluster of 3 nodes
in order to perform the computation;
as soon as the job finished,
the cluster tore itself down automatically.

While a job is running,
Bigslice exports its status via a built-in HTTP server.
For example,
we can inspect the current status of this job with
[curl](https://curl.haxx.se/):

```
$ curl :3333/debug/status
bigmachine:
	: waiting for machine to boot 33s
	: waiting for machine to boot 19s
	: waiting for machine to boot 19s
run /Users/marius/shake.go:41 [1] slices: count: 4
	reader@/Users/marius/shake.go:17: tasks idle/running/done: 8/0/0 1m6s
	flatmap@/Users/marius/shake.go:27: tasks idle/running/done: 8/0/0 1m6s
	map@/Users/marius/shake.go:28: tasks idle/running/done: 8/0/0 1m6s
	reduce@/Users/marius/shake.go:29: tasks idle/running/done: 8/0/0 1m6s
run /Users/marius/shake.go:41 [1] tasks: tasks: runnable: 8
	inv1_reader_flatmap_map@8:0(1): waiting for a machine 1m6s
	inv1_reader_flatmap_map@8:1(1): waiting for a machine 1m6s
	inv1_reader_flatmap_map@8:2(1): waiting for a machine 1m6s
	inv1_reader_flatmap_map@8:3(1): waiting for a machine 1m6s
	inv1_reader_flatmap_map@8:4(1): waiting for a machine 1m6s
	inv1_reader_flatmap_map@8:5(1): waiting for a machine 1m6s
	inv1_reader_flatmap_map@8:6(1): waiting for a machine 1m6s
	inv1_reader_flatmap_map@8:7(1): waiting for a machine 1m6s
```

The first clause tells us there are 3 machines
(in this case, EC2 instances)
waiting to boot.
The second clause shows the status of the tasks associated with
the slice operations at the source lines given above.
In this case, every task is idle
because there are not yet any machines on which to run them.
The final clause shows the physical tasks
that require scheduling by Bigslice.

A little later,
we query the status again.

```
$ curl :3333/debug/status
bigmachine:
	: waiting for machine to boot 36s
	https://ec2-.../: mem 117.0MiB/15.2GiB disk 62.4MiB/7.6GiB load 0.4/0.1/0.0 counters tasks:4 22s
	https://ec2-.../: mem 120.8MiB/15.2GiB disk 62.4MiB/7.6GiB load 0.2/0.1/0.0 counters tasks:4 22s
run /Users/marius/shake.go:41 [1] slices: count: 4
	reader@/Users/marius/shake.go:17: tasks idle/running/done: 0/8/0 1m8s
	flatmap@/Users/marius/shake.go:27: tasks idle/running/done: 0/8/0 1m8s
	map@/Users/marius/shake.go:28: tasks idle/running/done: 0/8/0 1m8s
	reduce@/Users/marius/shake.go:29: tasks idle/running/done: 8/0/0 1m8s
run /Users/marius/shake.go:41 [1] tasks: tasks: runnable: 8
	inv1_reader_flatmap_map@8:0(1): https://ec2-18-236-204-88.../ 1m8s
	inv1_reader_flatmap_map@8:1(1): https://ec2-34-221-236-36.../ 1m8s
	inv1_reader_flatmap_map@8:2(1): https://ec2-18-236-204-88.../ 1m8s
	inv1_reader_flatmap_map@8:3(1): https://ec2-18-236-204-88.../ 1m8s
	inv1_reader_flatmap_map@8:4(1): https://ec2-34-221-236-36.../ 1m8s
	inv1_reader_flatmap_map@8:5(1): https://ec2-18-236-204-88.../ 1m8s
	inv1_reader_flatmap_map@8:6(1): https://ec2-34-221-236-36.../ 1m8s
	inv1_reader_flatmap_map@8:7(1): https://ec2-34-221-236-36.../ 1m8s
```

This time,
we see that the computation is in progress.
Two out of 3 machines in the cluster have booted;
the first clause shows the resource utilization of these machines.
Next, we see that all but the reduce steps are currently running.
This is because <span class="small">reduce</span> requires
a shuffle step, and so depends upon the completion
of its antecedent tasks.
Finally,
the last clause shows the individual tasks and their runtimes.

Note that there is not a one-to-one correspondence
between the high-level slice operations in the second clause
and the tasks in the third.
This is for two reasons.
First, Bigslice *pipelines* operations when it can.
The task names give a hint at this:
the currently running tasks each correspond to
a pipeline of reader, flatmap, and map.
Second, the underlying data are split into individual *shards*,
each task handling a subset of the data.
This is how Bigslice parallelizes computation.

Let's walk through the code.

```
var wordCount = bigslice.Func(func(url string) bigslice.Slice {
	slice := bigslice.ScanReader(8, func() (io.ReadCloser, error) { // (1)
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != 200 {
			return nil, fmt.Errorf("get %v: %v", url, resp.Status)
		}
		return resp.Body, nil
	})
	slice = bigslice.Flatmap(slice, strings.Fields) // (2)
	slice = bigslice.Map(slice, func(token string) (string, int) { // (3)
		return token, 1
	})
	slice = bigslice.Reduce(slice, func(a, e int) int { // (4)
		return a + e
	})
	return slice
})
```

Every Bigslice operation must be implemented by a `bigslice.Func`.
A `bigslice.Func` is a way to wrap an ordinary Go func value so that
it can be invoked by Bigslice. `bigslice.Func`s must return values of
type `bigslice.Slice`, which describe the actual operation to be done.
This may seem like an indirect way of doing things,
but it provides two big advantages.
First, by using `bigslice.Func`,
Bigslice can name and run Go code remotely without
performing on-demand compilation or shipping a whole toolchain.
Second, by expressing data processing tasks in terms of
`bigslice.Slice` values,
Bigslice is free to partition, distribute, and retry bits of
the operations in ways not specified by the user.

In our example,
we define a word count operation as a function of a URL.
The first operation is a [ScanReader](https://godoc.org/github.com/grailbio/bigslice#ScanReader) (1),
which takes a shard count and a function that opens an io.ReadCloser,
and returns a `bigslice.Slice`
that represents the scanned lines from that reader.
The type of this `bigslice.Slice` value is schematically `bigslice.Slice<string>`.
While we do not have generics in Go,
a `bigslice.Slice` can nevertheless represent a container of any underlying type;
Bigslice performs runtime type checking to make sure that incompatible
`bigslice.Slice` operators are not combined.

We then take the output and tokenize it with
[Flatmap](https://godoc.org/github.com/grailbio/bigslice#Flatmap) (2),
which takes each input string (a line) and outputs a list of strings (its tokens).
The resulting `bigslice.Slice` represents
all the tokens in the corpus.
Note that since `strings.Fields` already has the correct signature,
we did not need to wrap it in a `func` of our own.

Next,
we map each token found in the corpus to two columns (3):
the first column is the token itself,
and the second column is the integer value 1,
representing the count of that token.
`bigslice.Slice` values may contain multiple columns of values;
they are analogous to tuples in other programming languages.
The type of the returned `bigslice.Slice` is schematically
`bigslice.Slice<string, int>`.

Finally,
we apply `bigslice.Reduce` (4) to the slice of token counts.
The reduce operation aggregates the values for each unique
value of the first column (the "key").
In this case, we just want to add them together
in order to produce the final count for each unique token.

That's the end of our `bigslice.Func`.
Let's look at our `main` function.

```
func main() {
	sess := sliceconfig.Parse() // (1)
	defer sess.Shutdown()

	ctx := context.Background()
	tokens, err := sess.Run(ctx, wordCount, shakespeare) // (2)
	if err != nil {
		log.Fatal(err)
	}
	scanner := tokens.Scanner() // (3)
	defer scanner.Close()
	type counted struct {
		token string
		count int
	}
	var (
		token  string
		count  int
		counts []counted
	)
	for scanner.Scan(ctx, &token, &count) { // (4)
		counts = append(counts, counted{token, count})
	}
	if err := scanner.Err(); err != nil { // (5)
		log.Fatal(err)
	}

	sort.Slice(counts, func(i, j int) bool {
		return counts[i].count > counts[j].count
	})
	if len(counts) > 10 {
		counts = counts[:10]
	}
	for _, count := range counts {
		fmt.Println(count.token, count.count)
	}
}
```

First, notice that our program is an ordinary Go program,
with an ordinary entry point.
While Bigslice offers low-level APIs to set up a Bigslice session,
the [sliceconfig](https://godoc.org/github.com/grailbio/bigslice/sliceconfig)
package offers a convenient way to set up such
a session by reading the configuration in `$HOME/.bigslice/config`,
which in our case was written by `bigslice setup-ec2`.
`sliceconfig.Parse` (1) reads the Bigslice configuration,
parses command line flags,
and then sets up a session accordingly.
The Bigslice session is required in order to invoke `bigslice.Func`s.

That is exactly what we do next (2):
we invoke the `wordCount` `bigslice.Func` with the
Shakespeare corpus URL as an argument.
The returned value represents the results of the computation.
We can scan the result to extract the rows,
each of which consists of two columns:
the token and
the number of times that token occurred in the corpus.
Scanning in Bigslice follows the general pattern for scanning in Go:
first, we extract a scanner (3),
whose `Scan` method (4) returns a boolean indicating whether to continue scanning
(and also populates the value for each column in the scanned row),
while its `Err` method (5) returns any error that occurred while scanning.

# Some more details to keep you going

In the examples above,
we used the [command bigslice](https://godoc.org/github.com/grailbio/bigslice/cmd/bigslice)
to build and run Bigslice jobs.
This is needed only to build "fat" binaries that include binaries
for both the host operating system and architecture
as well as linux/amd64,
which is used by the cluster nodes[^cluster-arch].
If your host platform is already linux/amd64,
then `bigslice build` and `bigslice run`
are equivalent to `go build` and `go run`.

Bigslice
uses package [github.com/grailbio/base/config](https://godoc.org/github.com/grailbio/base/config)
to maintain its configuration at `$HOME/.bigslice/config`.
You can either edit this file directly
or override individual parameters
at runtime with the `-set` flag,
for example,
to use [m5.24xlarge](https://aws.amazon.com/ec2/instance-types/m5/)
instances in the Bigslice cluster:

```
$ bigslice run shake.go -set bigmachine/ec2cluster.instance=m5.24xlarge
```

Bigslice uses [bigmachine](https://github.com/grailbio/bigmachine)
to manage clusters of cloud compute instances.
See its documentation for further details.

# Articles

- <a href="implementation.html">About the implementation</a>
- <a href="parallelism.html">Parallelism in Bigslice</a>

[^cluster-arch]: Bigslice, by way of Bigmachine, currently only supports
linux/amd64 remote cluster instances.

<footer>
The Bigslice gopher design was inspired by
<a href="http://reneefrench.blogspot.com/">Renee French</a>.
</footer>