# Individual Developer Workflow

A typical Pachyderm workflow involves multiple iterations of
experimenting with your code and pipeline specs.

!!! info
    Before you read this section, make sure that you
    understand basic Pachyderm pipeline concepts described in
    [Concepts](../concepts/pipeline-concepts/index.md).

## How it works

Working with Pachyderm includes multiple iterations of the
following steps:

## Step 1: Write Your Analysis Code

Because Pachyderm is completely language-agnostic, the code
that processes data in Pachyderm can be written in any language
and can use any libraries. Whether your code is as simple as a
bash command or as complicated as a TensorFlow neural network,
it needs to be built, with all of its required dependencies, into
a container that can run anywhere, including inside of Pachyderm.
See [Examples](https://github.com/pachyderm/pachyderm/tree/master/examples).

Your code does not have to import any special Pachyderm
functionality or libraries. However, it must meet the
following requirements:

* **Read files from a local file system**. Pachyderm automatically
    mounts each input data repository as `/pfs/<repo_name>` in the
    running containers of your Docker image. Therefore, the code that
    you write needs to read input data from this directory, as it
    would from any other file system.

    Because Pachyderm automatically spreads data across parallel
    containers, your analysis code does not have to deal with data
    sharding or parallelization. For example, if you have four
    containers that run your Python code, Pachyderm automatically
    supplies 1/4 of the input data to `/pfs/<repo_name>` in
    each running container.
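    For example, if your analysis is as simple as a bash command, it
    might be a small script like the following sketch, which counts
    the words in every input file. The input repo name `data` and the
    word-count logic are illustrative only, not part of this guide:

    ```shell
    # Read every file that Pachyderm mounts under /pfs/data and write
    # one result file to /pfs/out. The directory arguments exist only
    # so that the sketch can be tried outside a container; inside
    # Pachyderm, the defaults are the directories mounted for you.
    count_words() {
        in_dir="${1:-/pfs/data}"
        out_dir="${2:-/pfs/out}"
        for f in "$in_dir"/*; do
            printf '%s\t%s\n' "$(basename "$f")" "$(wc -w < "$f")" >> "$out_dir/counts.txt"
        done
    }

    # Run only when the Pachyderm mount is actually present.
    if [ -d /pfs/data ]; then
        count_words
    fi
    ```

    Note that the script never talks to Pachyderm directly; it only
    reads and writes local paths.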
    These workload balancing settings
    can be adjusted as needed through Pachyderm tunable parameters
    in the pipeline specification.

* **Write files into a local file system**, such as saving results.
    Your code must write to the `/pfs/out` directory that Pachyderm
    mounts in all of your running containers. Similar to reading data,
    your code does not have to manage parallelization or sharding.

## Step 2: Build Your Docker Image

When you create a Pachyderm pipeline, you need
to specify a Docker image that includes the code or binary that
you want to run. Therefore, every time you modify your code,
you need to build a new Docker image, push it to your image registry,
and update the image tag in the pipeline spec. This section
describes one way of building Docker images, but
if you have your own routine, feel free to apply it.

To build an image, you need to create a `Dockerfile`. However, do not
use the `CMD` field in your `Dockerfile` to specify the commands that
you want to run. Instead, add them to the `cmd` field in your pipeline
specification. Pachyderm runs these commands inside the
container during job execution rather than relying on Docker
to run them.
The reason is that Pachyderm cannot execute your code immediately when
your container starts, so it runs a shim process in your container
instead and then calls your pipeline specification's `cmd` from there.

After building your image, you need to upload the image into
a public or private image registry, such as
[DockerHub](https://hub.docker.com).

Alternatively, you can use Pachyderm's built-in functionality to
tag, build, and push images by running the `pachctl update pipeline` command
with the `--build` and `--push-images` flags. For more information, see
[Update a Pipeline](updating_pipelines.md).

!!! note
    The `Dockerfile` example below is provided for your reference
    only. Your `Dockerfile` might look completely different.

To build a Docker image, complete the following steps:

1. If you do not have a registry, create one with a preferred provider.
   If you decide to use DockerHub, follow the [Docker Hub Quickstart](https://docs.docker.com/docker-hub/) to
   create a repository for your project.
1. Create a `Dockerfile` for your project. See the [OpenCV example](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv/Dockerfile).
1. Log in to an image registry.

    * If you use DockerHub, run:

        ```shell
        docker login --username=<dockerhub-username> --password=<dockerhub-password> <dockerhub-fqdn>
        ```

1. Build a new image from the `Dockerfile` by specifying a tag:

    ```shell
    docker build -t <image>:<tag> .
    ```

1. Push your image to your image registry.

    * If you use DockerHub, run:

        ```shell
        docker push <image>:<tag>
        ```

For more information about building Docker images, see the
[Docker documentation](https://docs.docker.com/engine/tutorials/dockerimages/).

## Step 3: Load Your Data to Pachyderm

You need to add your data to Pachyderm so that your pipeline runs your code
against it. You can do so by using one of the following methods:

* By using the `pachctl put file` command
* By using a special type of pipeline, such as a spout or cron
* By using one of Pachyderm's [language clients](../reference/clients/)
* By using a compatible S3 client
* By using the Pachyderm UI (Enterprise version or free trial)

For more information, see [Load Your Data Into Pachyderm](../load-data-into-pachyderm/).

## Step 4: Create a Pipeline

Pachyderm's pipeline specifications store the configuration information
about the Docker image and code that Pachyderm should run.
Pipeline
specifications are stored in JSON format. As soon as you create a pipeline,
Pachyderm immediately spins up a pod or pods on a Kubernetes worker node
in which the pipeline code runs. By default, after the pipeline finishes
running, the pods continue to run while waiting for new data to be
committed into the Pachyderm input repository. You can configure this
behavior, as well as many other parameters, in the pipeline specification.

A standard pipeline specification must include the following
parameters:

- `name`
- `transform`
- `parallelism`
- `input`

!!! note
    Some special types of pipelines, such as a spout pipeline, do not
    require you to specify all of these parameters.

You can store your pipeline spec locally or in a remote location, such
as a GitHub repository.

To create a pipeline, complete the following steps:

1. Create a pipeline specification. Here is an example of a pipeline
   spec:

    ```shell
    # my-pipeline.json
    {
      "pipeline": {
        "name": "my-pipeline"
      },
      "transform": {
        "image": "my-pipeline-image",
        "cmd": ["/binary", "/pfs/data", "/pfs/out"]
      },
      "input": {
        "pfs": {
          "repo": "data",
          "glob": "/*"
        }
      }
    }
    ```

1. Create a Pachyderm pipeline from the spec:

    ```shell
    pachctl create pipeline -f my-pipeline.json
    ```

    You can specify a local file or a file stored in a remote
    location, such as a GitHub repository. For example,
    `https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json`.

!!! note "See Also:"
    - [Pipeline Specification](../reference/pipeline_spec.md)

<!-- - [Running Pachyderm in Production](TBA)-->
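Putting the four steps together, one full iteration of this workflow
might look like the following shell sketch. The registry, image, repo,
and file names are illustrative placeholders, not values prescribed by
this guide:

```shell
# Step 2: rebuild and push the image that contains your analysis code.
docker build -t <registry>/my-pipeline-image:v1 .
docker push <registry>/my-pipeline-image:v1

# Step 3: create an input repo and commit a file into it.
pachctl create repo data
pachctl put file data@master:/sample.txt -f sample.txt

# Step 4: create the pipeline; Pachyderm starts a job against the new data.
pachctl create pipeline -f my-pipeline.json

# Watch the job and, when it finishes, list the output files.
pachctl list job
pachctl list file my-pipeline@master
```

On the next iteration, you typically rebuild and push the image with a
new tag, update the tag in the pipeline spec, and run
`pachctl update pipeline -f my-pipeline.json` instead of `create`.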