# Working with Pipelines

A typical Pachyderm workflow involves multiple iterations of
experimenting with your code and pipeline specs.

!!! info
    Before you read this section, make sure that you
    understand basic Pachyderm pipeline concepts described in
    [Concepts](../../concepts/pipeline-concepts/index.md).

In general, there are five steps to working with a pipeline:

1. Write your analysis code.
1. Build your Docker image.
1. Push your Docker image to a registry.
1. Create or edit the pipeline config.
1. Deploy or update the pipeline.

We'll walk through each of these stages in detail.

## Step 1: Write Your Analysis Code

Because Pachyderm is completely language-agnostic, the code
that is used to process data in Pachyderm can
be written in any language and can use any libraries of your choice.
Whether your code is as simple as a bash command or as complicated as a
TensorFlow neural network, it needs to be built with all the required
dependencies into a container that can run anywhere, including inside
of Pachyderm. See [Examples](https://github.com/pachyderm/pachyderm/tree/master/examples).

Your code does not have to import any special Pachyderm
functionality or libraries. However, it must meet the
following requirements, illustrated by the sketch after this list:

* **Read files from a local file system**. Pachyderm automatically
  mounts each input data repository as `/pfs/<repo_name>` in the running
  containers of your Docker image. Therefore, the code that you write needs
  to read input data from this directory, as it would from any other
  file system.

  Because Pachyderm automatically spreads data across parallel
  containers, your analysis code does not have to deal with data
  sharding or parallelization. For example, if you have four
  containers that run your Python code, Pachyderm automatically
  supplies 1/4 of the input data to `/pfs/<repo_name>` in
  each running container. You can adjust this workload balancing
  through tunable parameters in the pipeline specification.

* **Write files into a local file system**, such as saving results.
  Your code must write to the `/pfs/out` directory that Pachyderm
  mounts in all of your running containers. Similar to reading data,
  your code does not have to manage parallelization or sharding.
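To make this contract concrete, here is a minimal sketch of analysis
code that satisfies both requirements. It assumes an input repo named
`data` (so Pachyderm mounts it at `/pfs/data`) and merely counts the
lines of each input file; your own code can do anything:

```shell
#!/bin/sh
# Hypothetical transform: count the lines of every file in the input
# repo and write one result file per input. Assumes an input repo
# named "data", which Pachyderm mounts at /pfs/data; results written
# to /pfs/out are collected into the pipeline's output repo.
for f in /pfs/data/*; do
    wc -l < "$f" > "/pfs/out/$(basename "$f")"
done
```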
## Step 2: Build Your Docker Image

When you create a Pachyderm pipeline, you need
to specify a Docker image that includes the code or binary that
you want to run. Therefore, every time you modify your code,
you need to build a new Docker image, push it to your image registry,
and update the image tag in the pipeline spec. This section
describes one way of building Docker images, but
if you already have your own build routine, feel free to use it.

To build an image, you need to create a `Dockerfile`. However, do not
use the `CMD` field in your `Dockerfile` to specify the commands that
you want to run. Instead, add them to the `cmd` field in your pipeline
specification. Pachyderm runs these commands inside the
container during job execution rather than relying on Docker
to run them.
The reason is that Pachyderm cannot execute your code immediately when
your container starts. Instead, it runs a shim process in your container
and calls your pipeline specification's `cmd` from there.

!!! note
    The `Dockerfile` example linked below is provided for your reference
    only. Your `Dockerfile` might look completely different.

To build a Docker image, complete the following steps:

1. If you do not have a registry, create one with a preferred provider.
   If you decide to use DockerHub, follow the [Docker Hub Quickstart](https://docs.docker.com/docker-hub/) to
   create a repository for your project.
1. Create a `Dockerfile` for your project. See the [OpenCV example](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv/Dockerfile).
1. Build a new image from the `Dockerfile` by specifying a tag:

    ```shell
    docker build -t <image>:<tag> .
    ```

For more information about building Docker images, see the
[Docker documentation](https://docs.docker.com/engine/tutorials/dockerimages/).

## Step 3: Push Your Docker Image to a Registry

After building your image, you need to upload it to
a public or private image registry, such as
[DockerHub](https://hub.docker.com).

Alternatively, you can use Pachyderm's built-in functionality to
tag, build, and push images by running the `pachctl update pipeline` command
with the `--build` flag. For more information, see
[Update a Pipeline](../updating_pipelines.md).

1. Log in to an image registry.

    * If you use DockerHub, run:

        ```shell
        docker login --username=<dockerhub-username> --password=<dockerhub-password> <dockerhub-fqdn>
        ```

1. Push your image to your image registry.

    * If you use DockerHub, run:

        ```shell
        docker push <image>:<tag>
        ```

!!! note
    Pipelines require a unique tag to ensure that the appropriate image
    is pulled. If you use a floating tag, such as `latest`, the
    Kubernetes cluster can fall out of sync with the Docker registry:
    Kubernetes concludes that it already has the `latest` image and
    does not pull the new one.

## Step 4: Create/Edit the Pipeline Config

Pachyderm's pipeline specifications store the configuration information
about the Docker image and code that Pachyderm should run. Pipeline
specifications are stored in JSON or YAML format.

A standard pipeline specification must include the following
parameters:

- `name`
- `transform`
- `parallelism`
- `input`

!!! note
    Some special types of pipelines, such as a spout pipeline, do not
    require you to specify all of these parameters.

You can store your pipeline specification locally or in a remote location,
such as a GitHub repository.

To create a pipeline, complete the following steps:

1. Create a pipeline specification. Here is an example of a pipeline
   spec saved as `my-pipeline.json`:

    ```json
    {
      "pipeline": {
        "name": "my-pipeline"
      },
      "transform": {
        "image": "<image>:<tag>",
        "cmd": ["/binary", "/pfs/data", "/pfs/out"]
      },
      "input": {
        "pfs": {
          "repo": "data",
          "glob": "/*"
        }
      }
    }
    ```
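Because pipeline specs can be written in YAML as well as JSON, the same
pipeline might look like this in YAML form. This is an equivalent sketch
of the JSON spec above; the filename `my-pipeline.yml` is an assumption:

```yaml
# my-pipeline.yml -- the same hypothetical pipeline as above
pipeline:
  name: my-pipeline
transform:
  image: <image>:<tag>
  cmd: ["/binary", "/pfs/data", "/pfs/out"]
input:
  pfs:
    repo: data
    glob: "/*"
```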
!!! note "See Also:"
    - [Pipeline Specification](../../reference/pipeline_spec.md)

## Step 5: Deploy/Update the Pipeline

As soon as you create a pipeline, Pachyderm immediately spins up one or
more Kubernetes pods in which the pipeline code runs. By default, after
the pipeline finishes running, the pods continue to run while waiting
for new data to be committed into the Pachyderm input repository. You
can configure this behavior, along with many others, in the pipeline
specification.

1. Create a Pachyderm pipeline from the spec:

    ```shell
    pachctl create pipeline -f my-pipeline.json
    ```

    You can specify a local file or a file stored in a remote
    location, such as a GitHub repository. For example,
    `https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json`.
1. If your pipeline specification changes, you can update the pipeline
   by running:

    ```shell
    pachctl update pipeline -f my-pipeline.json
    ```

!!! note "See Also:"
    - [Updating Pipelines](../updating_pipelines.md)
<!-- - [Running Pachyderm in Production](TBA)-->
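To tie the steps together, a typical iteration after a code change might
look like the following sketch. The image name and tag are placeholders,
and the spec filename assumes the example above:

```shell
# Hypothetical iteration loop after a code change.
docker build -t <image>:<new-tag> .     # Step 2: rebuild the image
docker push <image>:<new-tag>           # Step 3: push it to the registry
# Edit my-pipeline.json so transform.image points at <image>:<new-tag>,
# then redeploy the pipeline (Step 5):
pachctl update pipeline -f my-pipeline.json
```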