# Beginner Tutorial

Welcome to the beginner tutorial for Pachyderm! If you have already installed
Pachyderm, this tutorial should take about 15 minutes to complete. This tutorial
introduces basic Pachyderm concepts.

## Image processing with OpenCV

This tutorial walks you through the deployment of a Pachyderm pipeline
that performs [edge
detection](https://en.wikipedia.org/wiki/Edge_detection) on a few
images. Thanks to Pachyderm's built-in processing primitives, we can
keep our code simple but still run the pipeline in a
distributed, streaming fashion. Moreover, as new data is added, the
pipeline automatically processes it and outputs the results.

If you hit any errors not covered in this guide, get help in our [public
community Slack](http://slack.pachyderm.io), submit an issue on
[GitHub](https://github.com/pachyderm/pachyderm), or email us at
<support@pachyderm.io>. We are more than happy to help!

### Prerequisites

This guide assumes that you already have Pachyderm running locally.
If you haven't done so already, install Pachyderm on your local
machine as described in [Local Installation](local_installation.md).

### Create a Repo

A `repo` is the highest level data primitive in Pachyderm. Like many
things in Pachyderm, it shares its name with a primitive in Git and is
designed to behave analogously. Generally, repos should be dedicated to
a single source of data such as log messages from a particular service,
a users table, or training data for an ML model. Repos are easy to create
and do not take much space when empty so do not worry about making
tons of them.

For this demo, we create a repo called `images` to hold the
data we want to process:

```shell
$ pachctl create repo images
$ pachctl list repo
NAME   CREATED       SIZE (MASTER)
images 7 seconds ago 0B
```

This output shows that the repo has been successfully created. Because we
have not added anything to it yet, the size of the repository HEAD commit
on the master branch is 0B.

### Adding Data to Pachyderm

Now that we have created a repo, it is time to add some data. In
Pachyderm, you write data to an explicit `commit`. Commits are immutable
snapshots of your data which give Pachyderm its version control properties.
You can add, remove, or update `files` in a given commit.

Let's start by just adding a file, in this case an image, to a new
commit. We have provided some sample images for you that we host on
Imgur.

Use the `pachctl put file` command along with the `-f` flag. The `-f` flag can
take a local file, a URL, or an object storage bucket, which it
scrapes automatically. In this case, we simply pass the URL.

Unlike Git, commits in Pachyderm must be explicitly started and finished
because they can contain huge amounts of data, and we do not want that much
*dirty* data hanging around in an unpersisted state. `pachctl put file`
automatically starts and finishes a commit for you so you can add files
more easily. If you want to add many files over a period of time, you
can run `pachctl start commit` and `pachctl finish commit` yourself.

We also specify the repo name `"images"`, the branch name `"master"`,
and the file name `"liberty.png"`.

Here is an example atomic commit of the file `liberty.png` to the
`master` branch of the `images` repo:

```shell
$ pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png
```

We can check to make sure the data we just added is in Pachyderm.

* Use the `pachctl list repo` command to check that data has been added:

  ```shell
  $ pachctl list repo
  NAME   CREATED            SIZE (MASTER)
  images About a minute ago 57.27KiB
  ```

* View the commit that was just created:

  ```shell
  $ pachctl list commit images
  REPO   COMMIT                           PARENT STARTED        DURATION           SIZE
  images d89758a7496a4c56920b0eaa7d7d3255 <none> 29 seconds ago Less than a second 57.27KiB
  ```

* View the file in that commit:

  ```shell
  $ pachctl list file images@master
  COMMIT                           NAME         TYPE COMMITTED          SIZE
  d89758a7496a4c56920b0eaa7d7d3255 /liberty.png file About a minute ago 57.27KiB
  ```

Also, you can view the file you have just added to Pachyderm. Because this is an
image, you cannot just print it out in the terminal, but the following
commands will let you view it easily:

* If you are on macOS, run:

  ```shell
  $ pachctl get file images@master:liberty.png | open -f -a /Applications/Preview.app
  ```

* If you are on Linux, run:

  ```shell
  $ pachctl get file images@master:liberty.png | display
  ```

### Create a Pipeline

Now that you have some data in your repo, it is time to do something
with it. Pipelines are the core processing primitive in Pachyderm and
you can define them with a JSON encoding. For this example, we have
already created the pipeline for you and you can find the [code on
GitHub](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv).

When you want to create your own pipelines later, you can refer to the
full [Pipeline Specification](../../reference/pipeline_spec) to use
more advanced options. Options include building your own code into a
container instead of the pre-built Docker image that we are
using in this tutorial.

For now, we are going to create a single pipeline that takes in images
and does some simple edge detection.

Below is the pipeline spec and Python code that we are using. Let's walk
through the details.

```shell
# edges.json
{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}
```

Our pipeline spec contains a few simple sections. First is the pipeline
`name`, `edges`. Then we have the `transform`, which specifies the Docker
image we want to use, `pachyderm/opencv` (defaults to DockerHub as the
registry), and the entry point `edges.py`. Lastly, we specify the input.
Here we only have one PFS input, our `images` repo with a particular glob
pattern.

The glob pattern defines how the input data can be broken up if we want
to distribute our computation. `/*` means that each file can be
processed individually, which makes sense for images. Glob patterns are
one of the most powerful features in Pachyderm.

The following is the Python code that we run in this pipeline:

``` python
# edges.py
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]
    edges = cv2.Canny(img,100,200)
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath, file))
```

The code simply walks over all the images in `/pfs/images`, performs edge
detection, and writes the result to `/pfs/out`.

`/pfs/images` and `/pfs/out` are special local directories that
Pachyderm creates within the container automatically. All the input data
for a pipeline is stored in `/pfs/<input_repo_name>` and your code
should always write out to `/pfs/out`. Pachyderm automatically
gathers everything you write to `/pfs/out` and versions it as this
pipeline's output.

Now, let's create the pipeline in Pachyderm:

```shell
$ pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json
```

### What Happens When You Create a Pipeline

Creating a pipeline tells Pachyderm to run your code on the data in your
input repo (the HEAD commit) as well as **all future commits** that
occur after the pipeline is created. Our repo already had a commit, so
Pachyderm automatically launched a `job` to process that data.

The first time Pachyderm runs a pipeline job, it needs to download the
Docker image (specified in the pipeline spec) from the specified Docker
registry (DockerHub in this case). This first run might take a
minute or so because of the image download, depending on your Internet
connection. Subsequent runs will be much faster.
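
To make the effect of the glob pattern in `edges.json` more concrete, here is
a minimal Python sketch of how a glob could divide an input repo into
independently processable units, which Pachyderm calls datums. It is a
simplified model and not Pachyderm's implementation: the `split_into_datums`
helper and the file list are hypothetical, and only the two globs that appear
in this tutorial (`/*` and `/`) are handled.

```python
from collections import defaultdict


def split_into_datums(paths, glob):
    """Group repo file paths into datums (hypothetical, simplified model)."""
    if glob == "/":
        # The whole repo is treated as a single datum.
        return [sorted(paths)]
    if glob == "/*":
        # Each top-level file or directory becomes its own datum.
        groups = defaultdict(list)
        for path in paths:
            top_level = path.lstrip("/").split("/", 1)[0]
            groups[top_level].append(path)
        return [sorted(groups[name]) for name in sorted(groups)]
    raise ValueError(f"glob {glob!r} is not covered by this sketch")


# Hypothetical contents of the `images` repo.
images = ["/liberty.png", "/AT-AT.png", "/kitten.png"]

per_file = split_into_datums(images, "/*")
whole_repo = split_into_datums(images, "/")

print(len(per_file), len(whole_repo))  # prints: 3 1
```

With `/*`, each of the three files can be handed to the code separately, so
the work can be spread across workers. With `/`, all files are handed over as
one unit.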

You can view the job with:

``` bash
$ pachctl list job
ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
0f6a53829eeb4ca193bb7944fe693700 edges    16 seconds ago Less than a second 0       1 + 0 / 1 57.27KiB 22.22KiB success
```

Yay! Our pipeline succeeded! Pachyderm creates a corresponding output
repo for every pipeline. This output repo has the same name as the
pipeline, and all the results of that pipeline are versioned in this
output repo. In our example, the `edges` pipeline created a repo
called `edges` to store the results.

``` bash
$ pachctl list repo
NAME   CREATED       SIZE (MASTER)
edges  2 minutes ago 22.22KiB
images 5 minutes ago 57.27KiB
```

### Reading the Output

We can view the output data from the `edges` repo in the same fashion
that we viewed the input data.

``` bash
# on macOS
$ pachctl get file edges@master:liberty.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get file edges@master:liberty.png | display
```

The output should look similar to:

### Processing More Data

Pipelines also automatically process the data from new commits as
they are created. Think of pipelines as being subscribed to any new
commits on their input repo(s). As in Git, commits have a
parental structure that tracks which files have changed. In this case,
we are going to add more images.

Let's create two new commits in a parental structure. To do this, we
run two more `put file` commands. Because we specify `master`
as the branch, Pachyderm automatically parents our commits onto each other.
Branch names are just references to a particular HEAD commit.

``` bash
$ pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
$ pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
```

Adding a new commit of data automatically triggers the pipeline to
run on the new data we've added. We'll see corresponding jobs get
started and commits made to the output `edges` repo. Let's also view our
new outputs.

``` bash
# view the jobs that were kicked off
$ pachctl list job
ID                               STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
81ae47a802f14038b95f8f248cddbed2 7 seconds ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
ce448c12d0dd4410b3a5ae0c0f07e1f9 16 seconds ago Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
490a28be32de491e942372018cd42460 9 minutes ago  35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
```

``` bash
# View the output data

# on macOS
$ pachctl get file edges@master:AT-AT.png | open -f -a /Applications/Preview.app

$ pachctl get file edges@master:kitten.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get file edges@master:AT-AT.png | display

$ pachctl get file edges@master:kitten.png | display
```

### Adding Another Pipeline

We have successfully deployed and used a single stage Pachyderm pipeline.
Now, let's add a processing stage to illustrate a multi-stage Pachyderm
pipeline.
Specifically, let's add a `montage` pipeline that takes our
original and edge-detected images and arranges them into a single
montage of images:

Below is the pipeline spec for this new pipeline:

``` bash
# montage.json
{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    },
    {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}
```

This `montage` pipeline spec is similar to our `edges` pipeline except
for the following differences:

1. We are using a different Docker image that
has `imagemagick` installed.
2. We are executing a `sh` command with
`stdin` instead of a Python script.
3. We have multiple input data repositories.

In the `montage` pipeline, we combine our multiple input data
repositories using a `cross` pattern. Because both inputs use the glob
`/`, each repo is treated as a single unit, so this `cross` pattern creates a
single pairing of our input images with our edge-detected images. There
are several interesting ways to combine data in Pachyderm, which are
discussed in the
[pipeline specification](../../reference/pipeline_spec/#input-required)
and in the
[join documentation](../concepts/pipeline-concepts/pipeline/join.md).
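
The pairing that `cross` produces can be sketched with a Cartesian product.
The following Python snippet is an illustration under simplifying
assumptions, not Pachyderm's implementation: the `cross` helper and the file
lists are hypothetical, and each input is modeled as a plain list of datums.

```python
from itertools import product


def cross(*inputs):
    """Return every combination that takes one datum from each input."""
    return list(product(*inputs))


# Hypothetical repo contents; `edges` holds one output per input image.
files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]

# Glob "/" (as in montage.json): each repo is one datum holding all files.
images_whole = [list(files)]
edges_whole = [list(files)]
single_pairing = cross(images_whole, edges_whole)

# Glob "/*" (a hypothetical alternative): each file is its own datum.
images_per_file = [[path] for path in files]
edges_per_file = [[path] for path in files]
all_pairings = cross(images_per_file, edges_per_file)

print(len(single_pairing), len(all_pairings))  # prints: 1 9
```

Using `/` on both inputs is what lets the single `montage` command see every
original and every edge-detected image at once. With `/*` on both inputs, the
command would instead run once per pair of files.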

We create the `montage` pipeline as before, with `pachctl`:

```shell
$ pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/montage.json
```

Creating the pipeline triggers a job that generates a montage for all the
current HEAD commits of the input repos:

```shell
$ pachctl list job
ID                               STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
92cecc40c3144fd5b4e07603bb24b104 45 seconds ago 6 seconds          0       1 + 0 / 1 371.9KiB 1.284MiB success
81ae47a802f14038b95f8f248cddbed2 2 minutes ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
ce448c12d0dd4410b3a5ae0c0f07e1f9 2 minutes ago  Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
490a28be32de491e942372018cd42460 11 minutes ago 35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
```

You can view the generated montage image with:

``` bash
# on macOS
$ pachctl get file montage@master:montage.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get file montage@master:montage.png | display
```

Exploring your DAG in the Pachyderm dashboard
---------------------------------------------

When you deployed Pachyderm locally, the Pachyderm Enterprise dashboard
was also deployed by default. This dashboard lets you interactively
explore your pipeline, visualize the structure of the pipeline, explore
your data, debug jobs, etc. To access the dashboard, visit
`localhost:30080` in an Internet browser (e.g., Google Chrome). You
should see something similar to this:

Enter your email address if you would like to obtain a free trial token
for the dashboard.
Upon entering this trial token, you will be able to
see your pipeline structure and interactively explore the various pieces
of your pipeline as pictured below:

Next Steps
----------

Pachyderm is now running locally with data and a pipeline! To play with
Pachyderm locally, you can use what you've learned to build on or
change this pipeline. You can also dig in and learn more details about:

- [Deploying Pachyderm to the cloud or on prem](../deploy-manage/deploy/index.md)
- [Load Your Data into Pachyderm](../how-tos/load-data-into-pachyderm.md)
- [Individual Developer Workflow](../how-tos/individual-developer-workflow.md)

We'd love to help and see what you come up with, so submit any
issues or questions you come across on
[GitHub](https://github.com/pachyderm/pachyderm) or
[Slack](http://slack.pachyderm.io), or email us at <support@pachyderm.io>
if you want to show off anything nifty you've created!