
# Working with Pipelines

A typical Pachyderm workflow involves multiple iterations of
experimenting with your code and pipeline specs.

!!! info
    Before you read this section, make sure that you
    understand basic Pachyderm pipeline concepts described in
    [Concepts](../../concepts/pipeline-concepts/index.md).

In general, working with a pipeline involves five steps,
summarized in the image below.

![Developer workflow](../../assets/images/d_steps_analysis_pipeline.svg)

We'll walk through each of these steps in detail.

## Step 1: Write Your Analysis Code

Because Pachyderm is completely language-agnostic, the code
that processes data in Pachyderm can be written in any
language and can use any libraries you choose. Whether
your code is as simple as a bash command or as complicated as a
TensorFlow neural network, it needs to be built with all its required
dependencies into a container that can run anywhere, including inside
of Pachyderm. See [Examples](https://github.com/pachyderm/pachyderm/tree/master/examples).

Your code does not have to import any special Pachyderm
functionality or libraries. However, it must meet the
following requirements:

* **Read files from a local file system**. Pachyderm automatically
  mounts each input data repository as `/pfs/<repo_name>` in the running
  containers of your Docker image. Therefore, the code that you write needs
  to read input data from this directory, as it would from any other
  file system.

  Because Pachyderm automatically spreads data across parallel
  containers, your analysis code does not have to deal with data
  sharding or parallelization. For example, if you have four
  containers that run your Python code, Pachyderm automatically
  supplies 1/4 of the input data to `/pfs/<repo_name>` in
  each running container. You can adjust this workload
  distribution through tunable parameters in the pipeline
  specification.

* **Write files into a local file system**, such as saving results.
  Your code must write to the `/pfs/out` directory that Pachyderm
  mounts in all of your running containers. As with reading data,
  your code does not have to manage parallelization or sharding.

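The requirements above can be sketched in a few lines of Python. The script below is hypothetical (it assumes an input repo named `data`); it counts the lines of every input file and writes one result file per input to `/pfs/out`:

```python
import os

def count_lines(input_dir="/pfs/data", output_dir="/pfs/out"):
    """Count lines in every input file and write one result file per input.

    The defaults match Pachyderm's mount points for an input repo named
    "data"; the parameters let you run the same code against local
    directories for testing.
    """
    for name in os.listdir(input_dir):
        in_path = os.path.join(input_dir, name)
        if not os.path.isfile(in_path):
            continue
        with open(in_path) as f:
            n = sum(1 for _ in f)
        # Anything written under /pfs/out is committed to the pipeline's
        # output repo when the job finishes.
        with open(os.path.join(output_dir, name + ".count"), "w") as out:
            out.write(f"{n}\n")

if __name__ == "__main__" and os.path.isdir("/pfs/data"):
    count_lines()
```

Note that the script contains no Pachyderm-specific imports and no sharding logic: it simply processes whatever Pachyderm mounts under `/pfs`, so the same code works unchanged whether the job runs in one container or many.
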
## Step 2: Build Your Docker Image

When you create a Pachyderm pipeline, you need
to specify a Docker image that includes the code or binary that
you want to run. Therefore, every time you modify your code,
you need to build a new Docker image, push it to your image registry,
and update the image tag in the pipeline spec. This section
describes one way of building Docker images, but
if you have your own routine, feel free to use it instead.

To build an image, you need to create a `Dockerfile`. However, do not
use the `CMD` field in your `Dockerfile` to specify the commands that
you want to run. Instead, specify them in the `cmd` field of your pipeline
specification. Pachyderm runs these commands inside the
container during job execution rather than relying on Docker
to run them. The reason is that Pachyderm cannot execute your code
immediately when your container starts. Instead, it runs a shim process
in your container and calls your pipeline specification's `cmd` from there.

!!! note
    The `Dockerfile` example below is provided for your reference
    only. Your `Dockerfile` might look completely different.

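For instance, a minimal `Dockerfile` for a Python pipeline might look like the following sketch (`requirements.txt` and `my_script.py` are placeholder names for your own project files):

```dockerfile
FROM python:3.8-slim

# Install the dependencies your analysis code needs.
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Copy the analysis code into the image.
COPY my_script.py /app/my_script.py

# No CMD on purpose: the command to run belongs in the
# pipeline specification's `cmd` field instead.
```
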
To build a Docker image, complete the following steps:

1. If you do not have a registry, create one with a preferred provider.
   If you decide to use DockerHub, follow the [Docker Hub Quickstart](https://docs.docker.com/docker-hub/) to
   create a repository for your project.
1. Create a `Dockerfile` for your project. See the [OpenCV example](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv/Dockerfile).
1. Build a new image from the `Dockerfile` by specifying a tag:

   ```shell
   docker build -t <IMAGE>:<TAG> .
   ```

For more information about building Docker images, see
[Docker documentation](https://docs.docker.com/engine/tutorials/dockerimages/).

## Step 3: Push Your Docker Image to a Registry

After building your image, you need to upload it into
a public or private image registry, such as
[DockerHub](https://hub.docker.com).

Alternatively, you can use Pachyderm's built-in functionality to
tag, build, and push images by running the `pachctl update pipeline` command
with the `--build` flag. For more information, see
[Update a pipeline](../updating_pipelines.md).

1. Log in to an image registry.

   * If you use DockerHub, run:

     ```shell
     docker login --username=<dockerhub-username> --password=<dockerhub-password> <dockerhub-fqdn>
     ```

1. Push your image to your image registry.

   * If you use DockerHub, run:

     ```shell
     docker push <IMAGE>:<TAG>
     ```

!!! note
    Pipelines require a unique image tag to ensure that the appropriate
    image is pulled. If you use a floating tag, such as `latest`, the
    Kubernetes cluster may become out of sync with the Docker registry
    by concluding that it already has the `latest` image.
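
One simple way to guarantee a unique tag is to derive it from your code's git commit. The following is a sketch; substitute your own registry and image name for the placeholders:

```shell
# Use the current git commit as a unique, reproducible image tag.
TAG=$(git rev-parse --short HEAD)
docker build -t <registry-user>/<image>:$TAG .
docker push <registry-user>/<image>:$TAG
```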

## Step 4: Create/Edit the Pipeline Config

Pachyderm's pipeline specifications store the configuration information
about the Docker image and code that Pachyderm should run. Pipeline
specifications are stored in JSON or YAML format.

A standard pipeline specification must include the following
parameters:

- `name`
- `transform`
- `parallelism`
- `input`

!!! note
    Some special types of pipelines, such as a spout pipeline, do not
    require you to specify all of these parameters.

You can store your pipeline locally or in a remote location, such
as a GitHub repository.

To create a pipeline, complete the following steps:

1. Create a pipeline specification. Here is an example of a pipeline
   spec saved as `my-pipeline.json`:

   ```json
   {
     "pipeline": {
       "name": "my-pipeline"
     },
     "transform": {
       "image": "<image>:<tag>",
       "cmd": ["/binary", "/pfs/data", "/pfs/out"]
     },
     "input": {
       "pfs": {
         "repo": "data",
         "glob": "/*"
       }
     }
   }
   ```
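
   Pipeline specs can also be written in YAML; the same pipeline could
   equivalently be expressed as:

   ```yaml
   # my-pipeline.yaml -- the same spec as above in YAML form
   pipeline:
     name: my-pipeline
   transform:
     image: "<image>:<tag>"
     cmd: ["/binary", "/pfs/data", "/pfs/out"]
   input:
     pfs:
       repo: data
       glob: "/*"
   ```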

!!! note "See Also:"
    - [Pipeline Specification](../../reference/pipeline_spec.md)

## Step 5: Deploy/Update the Pipeline

As soon as you create a pipeline, Pachyderm immediately spins up one
or more Kubernetes pods in which the pipeline code runs. By default,
after the pipeline finishes running, the pods continue to run while
waiting for new data to be committed into the Pachyderm input
repository. You can configure this behavior, as well as many others,
in the pipeline specification.

1. Create a Pachyderm pipeline from the spec:

   ```shell
   pachctl create pipeline -f my-pipeline.json
   ```

   You can specify a local file or a file stored in a remote
   location, such as a GitHub repository. For example,
   `https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json`.
1. If your pipeline specification changes, update the pipeline
   by running:

   ```shell
   pachctl update pipeline -f my-pipeline.json
   ```
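
After creating or updating the pipeline, you can confirm that it is running and watch the jobs it triggers. For example:

```shell
# List pipelines and check the new pipeline's state.
pachctl list pipeline

# List the jobs started by this pipeline.
pachctl list job --pipeline my-pipeline
```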

!!! note "See Also:"
    - [Updating Pipelines](../updating_pipelines.md)
<!-- - [Running Pachyderm in Production](TBA)-->