github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.9.x/concepts/advanced-concepts/distributed_computing.md

# Configure Distributed Computing

Distributing your computations across multiple workers
is a fundamental part of any big data processing.
When you build production-scale pipelines, you need
to adjust the number of workers and the resources that are
allocated to each job to optimize throughput.

Pachyderm workers are identical Kubernetes pods that run
the Docker image that you specified in the
[pipeline spec](../reference/pipeline_spec.md). Your analysis code
does not affect how Pachyderm distributes the workload among workers.
Instead, Pachyderm spreads out the data that needs to be processed
across the various workers and makes that data available for your code.

When you create a pipeline, Pachyderm spins up worker pods that
continuously run in the cluster, waiting for new data to be available
for processing. Therefore, you do not need to recreate and
schedule workers for every new job. You can change this behavior
by setting `"standby": true` in the pipeline spec.

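For example, a minimal pipeline spec that enables standby might look like the following sketch. The pipeline name, image, command, and input repo are hypothetical placeholders, not values from this document:

```json
{
  "pipeline": {
    "name": "edges"
  },
  "transform": {
    "image": "example/edges:1.0",
    "cmd": ["python3", "/edges.py"]
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  },
  "standby": true
}
```

With `"standby": true`, the pipeline's workers are scaled down while there is no data to process and are brought back up when a new job arrives.
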
For each job, all the datums are queued up and then distributed
across the available workers. When a worker finishes processing
its datum, it grabs a new datum from the queue until all the datums
complete processing. If a worker pod crashes, its datums are
redistributed to other workers for maximum fault tolerance.

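The number of datums in a job is determined by the pipeline's input and its glob pattern, not by your code. As a sketch, a hypothetical input like the following splits the `images` repo into one datum per top-level file or directory, and Pachyderm then queues those datums and spreads them across the available workers:

```json
"input": {
  "pfs": {
    "repo": "images",
    "glob": "/*"
  }
}
```
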
The following animation shows how distributed computing works:

![Distributed computing basics](../assets/images/distributed_computing101.gif)

In the diagram above, you have three Pachyderm worker pods that
process your data. When a pod finishes processing a datum,
it automatically takes another datum from the queue. Datums
might differ in size and, therefore, some of them might be
processed faster than others.

Each datum goes through the following processing phases inside a Pachyderm
worker pod:

| Phase       | Description |
| ----------- | ----------- |
| Downloading | The Pachyderm worker pod downloads the datum contents <br>from Pachyderm. |
| Processing  | The Pachyderm worker pod runs your code against <br>the contents of the datum. |
| Uploading   | The Pachyderm worker pod uploads the results of processing <br>to an output repository. |

When a datum completes a phase, the Pachyderm worker moves it to the next
one while another datum from the queue takes its place in the
processing sequence.

The following animation shows what happens inside a pod during
datum processing:

![Distributed processing internals](../assets/images/distributed_computing102.gif)

<!--TBA: the chunk_size property explanation article. Probably in a separate
How-to, but need to add a link to it here-->

You can control the number of worker pods that Pachyderm runs in a
pipeline by defining the `parallelism_spec` parameter in the
[pipeline specification](../reference/pipeline_spec.md).

!!! example
    ```json
    "parallelism_spec": {
       // Exactly one of these two fields should be set
       "constant": int,
       "coefficient": double
    }
    ```

Pachyderm has the following parallelism strategies that you
can set in the pipeline spec:

| Strategy       | Description        |
| -------------- | ------------------ |
| `constant`     | Pachyderm starts the specified number of workers. For example, <br> if you set `"constant": 10`, Pachyderm spreads the computation workload among ten workers. |
| `coefficient`  | Pachyderm starts a number of workers that is a multiple of <br> your Kubernetes cluster size. For example, if your Kubernetes cluster has ten nodes <br> and you set `"coefficient": 0.5`, Pachyderm starts five workers. If you set `"coefficient": 2.0`, Pachyderm starts twenty workers. |

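For example, either of the following mutually exclusive settings could appear in a pipeline spec. The first pins the pipeline at exactly ten workers regardless of cluster size; the second starts one worker for every two Kubernetes nodes, so five workers on a ten-node cluster:

```json
"parallelism_spec": {
  "constant": 10
}
```

```json
"parallelism_spec": {
  "coefficient": 0.5
}
```
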
By default, Pachyderm sets parallelism to `"coefficient": 1`, which means
that it spawns one worker per Kubernetes node for this pipeline.

!!! note "See Also:"
    * [Glob Pattern](../concepts/pipeline-concepts/datum/glob-pattern.md)
    * [Pipeline Specification](../reference/pipeline_spec.md)