# Distributed Computing

Distributing your computations across multiple workers is a fundamental part of processing big data. When you build production-scale pipelines, you need to adjust the number of workers and the resources that are allocated to each job to optimize throughput.

Pachyderm workers are identical Kubernetes pods that run the Docker image that you specified in the [pipeline spec](../../../reference/pipeline_spec/). Your analysis code does not affect how Pachyderm distributes the workload among workers. Instead, Pachyderm spreads out the data that needs to be processed across the various workers and makes that data available to your code.

When you create a pipeline, Pachyderm spins up worker pods that run continuously in the cluster, waiting for new data to become available for processing. You can change this behavior by setting `"standby": true`. Therefore, you do not need to recreate and schedule workers for every new job.

For each job, all the datums are queued up and then distributed across the available workers. When a worker finishes processing its datum, it grabs a new datum from the queue until all the datums complete processing. If a worker pod crashes, its datums are redistributed to other workers for maximum fault tolerance.

The following animation shows how distributed computing works:



In the diagram above, you have three Pachyderm worker pods that process your data. When a pod finishes processing a datum, it automatically takes another datum from the queue to process. Datums might differ in size and, therefore, some of them might be processed faster than others.

Each datum goes through the following processing phases inside a Pachyderm worker pod:

| Phase       | Description |
| ----------- | ----------- |
| Downloading | The Pachyderm worker pod downloads the datum contents from Pachyderm. |
| Processing  | The Pachyderm worker pod runs the contents of the datum against your code. |
| Uploading   | The Pachyderm worker pod uploads the results of processing into an output repository. |

When a datum completes a phase, the Pachyderm worker moves it to the next one, while another datum from the queue takes its place in the processing sequence.

The following animation displays what happens inside a pod during datum processing:



<!--TBA: the chunk_size property explanation article. Probably in a separate
How-to, but need to add a link to it here-->

You can control the number of worker pods that Pachyderm runs in a pipeline by defining the `parallelism_spec` parameter in the [pipeline specification](../../../reference/pipeline_spec/).

!!! example
    ```json
    "parallelism_spec": {
        // Exactly one of these two fields should be set
        "constant": int,
        "coefficient": double
    }
    ```

Pachyderm has the following parallelism strategies that you can set in the pipeline spec:

| Strategy | Description |
| -------------- | ------------------ |
| `constant` | Pachyderm starts the specified number of workers. For example, if you set `"constant": 10`, Pachyderm spreads the computation workload among ten workers. |
| `coefficient` | Pachyderm starts a number of workers proportional to your Kubernetes cluster size. For example, if your Kubernetes cluster has ten nodes and you set `"coefficient": 0.5`, Pachyderm starts five workers. If you set `"coefficient": 2.0`, Pachyderm starts twenty workers. |

By default, Pachyderm sets `parallelism_spec` to `"constant": 1`, which means that it spawns a single worker for the pipeline.

Pipelines that do not have a constant flow of data to process should use the autoscaling feature by setting `"autoscaling": true` in the pipeline spec. Doing so causes the pipeline to go into standby when there is nothing for the workers to do. In standby, a pipeline has no workers and consumes no resources; it simply waits for data to come in for it to process.

When data does come in, the pipeline exits standby and spins up workers to process the new data. Initially, a single worker spins up and lays out a distributed processing plan for the job. It then starts working on the job and, if more of the work could happen in parallel, spins up additional workers, up to the limit defined by the `parallelism_spec`.

Multiple jobs can run in parallel and cause new workers to spin up. For example, if a job comes in with a single datum, it causes a single worker to spin up. If another job with a single datum comes in while the first job is still running, another worker spins up to work on the second job. Again, this is bounded by the limit defined in the `parallelism_spec`.

One limitation of autoscaling is that it cannot dynamically scale down. Suppose a job with a large number of datums is near completion, and only one worker is still working while the others sit idle. Pachyderm does not yet have a way for the idle workers to steal work, and a few issues prevent Pachyderm from spinning down the idle workers. Kubernetes does not have a good way to scale down a controller while specifying which pods should be killed, so scaling down might kill the worker pod that is still doing work; another worker would then have to restart that work from scratch, and the job would take longer. In addition, Pachyderm keeps the workers around so that they can participate in the distributed merge process that happens at the end of the job.
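To tie these settings together, the following is a minimal sketch of a pipeline spec that combines an input glob pattern, a `parallelism_spec`, and `autoscaling`. The pipeline name, image, command, and input repo (`word-counter`, `my-registry/word-counter:1.0`, `/count.py`, and `text-data`) are hypothetical placeholders:

!!! example
    ```json
    {
      "pipeline": {
        "name": "word-counter"
      },
      "transform": {
        "image": "my-registry/word-counter:1.0",
        "cmd": ["python3", "/count.py"]
      },
      "input": {
        "pfs": {
          "repo": "text-data",
          "glob": "/*"
        }
      },
      "parallelism_spec": {
        "constant": 10
      },
      "autoscaling": true
    }
    ```

With a spec like this, the `/*` glob pattern turns each top-level file or directory in `text-data` into a datum, up to ten workers process those datums in parallel, and the pipeline goes into standby whenever no data is waiting. You can submit such a spec with `pachctl create pipeline -f <file>.json`.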
!!! note "See Also:"

    * [Glob Pattern](../../pipeline-concepts/datum/glob-pattern/)
    * [Pipeline Specification](../../../reference/pipeline_spec/)