# Individual Developer Workflow

A typical Pachyderm workflow involves multiple iterations of
experimenting with your code and pipeline specs.

!!! info
    Before you read this section, make sure that you
    understand basic Pachyderm pipeline concepts described in
    [Concepts](../concepts/pipeline-concepts/index.md).

## How it works

Working with Pachyderm includes multiple iterations of the
following steps:

## Step 1: Write Your Analysis Code

Because Pachyderm is completely language-agnostic, the code
that processes data in Pachyderm can be written in any language
and can use any libraries of your choice. Whether your code is as
simple as a bash command or as complicated as a TensorFlow neural
network, it needs to be built with all the required dependencies
into a container that can run anywhere, including inside of
Pachyderm. See [Examples](https://github.com/pachyderm/pachyderm/tree/master/examples).

Your code does not have to import any special Pachyderm
functionality or libraries. However, it must meet the
following requirements:

* **Read files from a local file system.** Pachyderm automatically
  mounts each input data repository as `/pfs/<repo_name>` in the running
  containers of your Docker image. Therefore, the code that you write
  needs to read input data from this directory, similar to any other
  file system.

  Because Pachyderm automatically spreads data across parallel
  containers, your analysis code does not have to deal with data
  sharding or parallelization. For example, if you have four
  containers that run your Python code, Pachyderm automatically
  supplies 1/4 of the input data to `/pfs/<repo_name>` in
  each running container. You can adjust this workload balancing
  through tunable parameters in the pipeline specification.

* **Write files to a local file system** to save your results.
  Your code must write to the `/pfs/out` directory that Pachyderm
  mounts in all of your running containers. Similar to reading data,
  your code does not have to manage parallelization or sharding.

## Step 2: Build Your Docker Image

When you create a Pachyderm pipeline, you need
to specify a Docker image that includes the code or binary that
you want to run. Therefore, every time you modify your code,
you need to build a new Docker image, push it to your image registry,
and update the image tag in the pipeline spec. This section
describes one way of building Docker images, but
if you have your own routine, feel free to apply it.

To build an image, you need to create a `Dockerfile`. However, do not
use the `CMD` field in your `Dockerfile` to specify the commands that
you want to run. Instead, add them to the `cmd` field in your pipeline
specification. Pachyderm runs these commands inside the container
during job execution rather than relying on Docker to run them.
The reason is that Pachyderm cannot execute your code immediately when
your container starts, so it runs a shim process in your container
instead and then calls your pipeline specification's `cmd` from there.
A minimal `Dockerfile` sketch is shown below.

After building your image, you need to upload the image to
a public or private image registry, such as
[DockerHub](https://hub.docker.com).
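
For example, the following `Dockerfile` is a minimal sketch for a
hypothetical Python pipeline, not a definitive template: the base image,
the `numpy` dependency, and the `my_analysis.py` script are placeholder
assumptions. It installs dependencies and copies the code into the image,
but deliberately omits a `CMD` instruction because the command to run is
defined in the `cmd` field of the pipeline specification instead.

```dockerfile
# Hypothetical example image for a Python pipeline.
# The base image, packages, and file names are placeholders.
FROM python:3.7-slim

# Install the libraries that the analysis code needs.
RUN pip install --no-cache-dir numpy

# Copy the analysis code into the image.
COPY my_analysis.py /my_analysis.py

# No CMD here: the command is set in the pipeline spec's "cmd" field.
```

With an image like this, the `cmd` field of your pipeline specification
might, for example, be `["python3", "/my_analysis.py"]`.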
Alternatively, you can use Pachyderm's built-in functionality to
tag, build, and push images by running the `pachctl update pipeline`
command with the `--build` and `--push-images` flags. For more
information, see [Update a Pipeline](updating_pipelines.md).

!!! note
    The `Dockerfile` examples in this section are provided for
    reference only. Your `Dockerfile` might look completely different.

To build a Docker image, complete the following steps:

1. If you do not have a registry, create one with a preferred provider.
   If you decide to use DockerHub, follow the [Docker Hub Quickstart](https://docs.docker.com/docker-hub/) to
   create a repository for your project.
1. Create a `Dockerfile` for your project. See the [OpenCV example](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv/Dockerfile).
1. Log in to an image registry.

    * If you use DockerHub, run:

        ```shell
        docker login --username=<dockerhub-username> --password=<dockerhub-password> <dockerhub-fqdn>
        ```

1. Build a new image from the `Dockerfile` by specifying a tag:

    ```shell
    docker build -t <IMAGE>:<TAG> .
    ```

1. Push your image to your image registry.

    * If you use DockerHub, run:

        ```shell
        docker push <IMAGE>:<TAG>
        ```

For more information about building Docker images, see the
[Docker documentation](https://docs.docker.com/engine/tutorials/dockerimages/).

## Step 3: Create a Pipeline

Pachyderm's pipeline specifications store the configuration information
about the Docker image and code that Pachyderm should run. Pipeline
specifications are stored in JSON format. As soon as you create a pipeline,
Pachyderm immediately spins up a pod or pods on a Kubernetes worker node
in which the pipeline code runs. By default, after the pipeline finishes
running, the pods continue to run while waiting for new data to be
committed into the Pachyderm input repository. You can configure this
behavior, as well as many other parameters, in the pipeline specification.

A minimum pipeline specification must include the following
parameters:

- `name`
- `transform`
- `parallelism`
- `input`

You can store your pipeline specification locally or in a remote location,
such as a GitHub repository.

To create a pipeline, complete the following steps:

1. Create a pipeline specification. Here is an example of a pipeline
   spec:

    ```shell
    # my-pipeline.json
    {
      "pipeline": {
        "name": "my-pipeline"
      },
      "transform": {
        "image": "my-pipeline-image",
        "cmd": ["/binary", "/pfs/data", "/pfs/out"]
      },
      "input": {
        "pfs": {
          "repo": "data",
          "glob": "/*"
        }
      }
    }
    ```

1. Create a Pachyderm pipeline from the spec:

    ```shell
    pachctl create pipeline -f my-pipeline.json
    ```

    You can specify a local file or a file stored in a remote
    location, such as a GitHub repository. For example,
    `https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json`.
    After the pipeline is created, you can verify that it is running
    with the `pachctl` commands sketched at the end of this page.

!!! note "See Also:"

    - [Pipeline Specification](../reference/pipeline_spec.md)
<!-- - [Running Pachyderm in Production](TBA)-->
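
As a quick, optional sanity check after Step 3, you can confirm that the
new pipeline is up and inspect its output repository with standard
`pachctl` commands. The sketch below assumes the `my-pipeline` and `data`
names from the sample spec above; adjust them to your own pipeline and
repositories.

```shell
# Confirm that the pipeline was created and is running.
pachctl list pipeline

# Watch the jobs that the pipeline starts when data is committed.
pachctl list job --pipeline my-pipeline

# Inspect the results that the pipeline wrote to /pfs/out.
pachctl list file my-pipeline@master
```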