github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/getting_started/beginner_tutorial.md (about) 1 # Beginner Tutorial 2 3 Welcome to the beginner tutorial for Pachyderm! If you have already installed 4 Pachyderm, this tutorial should take about 15 minutes to complete. This tutorial 5 introduces basic Pachyderm concepts. 6 7 !!! tip 8 If you are new to Pachyderm, try [Pachyderm Shell](../../deploy-manage/manage/pachctl_shell/). 9 This handy tool suggests you `pachctl` commands as you type and 10 helps you learn Pachyderm faster. 11 12 ## Image processing with OpenCV 13 14 This tutorial walks you through the deployment of a Pachyderm pipeline 15 that performs [edge 16 detection](https://en.wikipedia.org/wiki/Edge_detection) on a few 17 images. Thanks to Pachyderm's built-in processing primitives, we can 18 keep our code simple but still run the pipeline in a 19 distributed, streaming fashion. Moreover, as new data is added, the 20 pipeline automatically processes it and outputs the results. 21 22 If you hit any errors not covered in this guide, get help in our [public 23 community Slack](http://slack.pachyderm.io), submit an issue on 24 [GitHub](https://github.com/pachyderm/pachyderm), or email us at 25 <support@pachyderm.io>. We are more than happy to help! 26 27 ### Prerequisites 28 29 This guide assumes that you already have Pachyderm running locally. 30 If you haven't done so already, install Pachyderm on your local 31 machine as described in [Local Installation](local_installation.md). 32 33 ### Create a Repo 34 35 A `repo` is the highest level data primitive in Pachyderm. Like many 36 things in Pachyderm, it shares its name with a primitive in Git and is 37 designed to behave analogously. Generally, repos should be dedicated to 38 a single source of data such as log messages from a particular service, 39 a users table, or training data for an ML model. Repos are easy to create 40 and do not take much space when empty so do not worry about making 41 tons of them. 42 43 For this demo, we create a repo called `images` to hold the 44 data we want to process: 45 46 ```shell 47 pachctl create repo images 48 ``` 49 50 Verify that the repository was created: 51 52 ```shell 53 pachctl list repo 54 ``` 55 56 **System response:** 57 58 ```shell 59 NAME CREATED SIZE (MASTER) 60 images 7 seconds ago 0B 61 ``` 62 63 This output shows that the repo has been successfully created. Because we 64 have not added anything to it yet, the size of the repository HEAD commit 65 on the master branch is 0B. 66 67 ### Adding Data to Pachyderm 68 69 Now that we have created a repo it is time to add some data. In 70 Pachyderm, you write data to an explicit `commit`. Commits are immutable 71 snapshots of your data which give Pachyderm its version control properties. 72 You can add, remove, or update `files` in a given commit. 73 74 Let's start by just adding a file, in this case an image, to a new 75 commit. We have provided some sample images for you that we host on 76 Imgur. 77 78 Use the `pachctl put file` command along with the `-f` flag. The `-f` flag can 79 take either a local file, a URL, or a object storage bucket which it 80 scrapes automatically. In this case, we simply pass the URL. 81 82 Unlike Git, commits in Pachyderm must be explicitly started and finished 83 as they can contain huge amounts of data and we do not want that much 84 *dirty* data hanging around in an unpersisted state. `pachctl put file` 85 automatically starts and finishes a commit for you so you can add files 86 more easily. If you want to add many files over a period of time, you 87 can do `pachctl start commit` and `pachctl finish commit` yourself. 88 89 We also specify the repo name `"images"`, the branch name `"master"`, 90 and the file name: `"liberty.png"`. 91 92 Here is an example atomic commit of the file `liberty.png` to the 93 `images` repo `master` branch: 94 95 ```shell 96 pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png 97 ``` 98 99 We can check to make sure the data we just added is in Pachyderm. 100 101 * Use the `pachctl list repo` command to check that data has been added: 102 103 ```shell 104 pachctl list repo 105 ``` 106 107 **System response:** 108 109 ``` 110 NAME CREATED SIZE (MASTER) 111 images About a minute ago 57.27KiB 112 ``` 113 114 * View the commit that was just created: 115 116 ```shell 117 pachctl list commit images 118 ``` 119 120 **System response:** 121 122 ``` 123 REPO COMMIT PARENT STARTED DURATION SIZE 124 images d89758a7496a4c56920b0eaa7d7d3255 <none> 29 seconds ago Less than a second 57.27KiB 125 ``` 126 127 * View the file in that commit: 128 129 ```shell 130 pachctl list file images@master 131 ``` 132 133 **System response:** 134 135 ``` 136 COMMIT NAME TYPE COMMITTED SIZE 137 d89758a7496a4c56920b0eaa7d7d3255 /liberty.png file About a minute ago 57.27KiB 138 ``` 139 140 Now you can view the file by retrieving it from Pachyderm. Because this is an 141 image, you cannot just print it out in the terminal, but the following 142 command will let you view it: 143 144 * On macOS, run: 145 146 ```shell 147 pachctl get file images@master:liberty.png | open -f -a Preview.app 148 ``` 149 150 * On Linux 64-bit, run: 151 152 ```shell 153 pachctl get file images@master:liberty.png | display 154 ``` 155 156 ### Create a Pipeline 157 158 Now that you have some data in your repo, it is time to do something 159 with it. Pipelines are the core processing primitive in Pachyderm. 160 Pipelines are defined with a simple JSON file called a pipeline 161 specification or pipeline spec for short. For this example, we already 162 [created the pipeline spec for you](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv). 163 164 When you want to create your own pipeline specs later, you can refer to the 165 full [Pipeline Specification](../../reference/pipeline_spec) to use 166 more advanced options. Options include building your own code into a 167 container. In this tutorial, we are using a pre-built Docker image. 168 169 For now, we are going to create a single pipeline spec that takes in images 170 and does some simple edge detection. 171 172  173 174 Below is the `edges.json` pipeline spec. Let's walk 175 through the details. 176 177 ```json 178 { 179 "pipeline": { 180 "name": "edges" 181 }, 182 "description": "A pipeline that performs image edge detection by using the OpenCV library.", 183 "transform": { 184 "cmd": [ "python3", "/edges.py" ], 185 "image": "pachyderm/opencv" 186 }, 187 "input": { 188 "pfs": { 189 "repo": "images", 190 "glob": "/*" 191 } 192 } 193 } 194 ``` 195 196 The pipeline spec contains a few simple sections. The pipeline section contains 197 a `name`, which is how you will identify your pipeline. Your pipeline will also 198 automatically create an output repo with the same name. The `transform` section 199 allows you to specify the docker image you want to use. In this case, 200 `pachyderm/opencv` is the docker image (defaults to DockerHub as the registry), 201 and the entry point is `edges.py`. The input section specifies repos visible 202 to the running pipeline, and how to process the data from the repos. Commits to 203 these repos will automatically trigger the pipeline to create new jobs to 204 process them. In this case, `images` is the repo, and `/*` is the glob pattern. 205 206 The glob pattern defines how the input data can be broken up if you want 207 to distribute computation. `/*` means that each file can be 208 processed individually, which makes sense for images. Glob patterns are 209 one of the most powerful features in Pachyderm. 210 211 The following text is the Python code run in this pipeline: 212 213 ```python 214 import cv2 215 import numpy as np 216 from matplotlib import pyplot as plt 217 import os 218 219 # make_edges reads an image from /pfs/images and outputs the result of running 220 # edge detection on that image to /pfs/out. Note that /pfs/images and 221 # /pfs/out are special directories that Pachyderm injects into the container. 222 def make_edges(image): 223 img = cv2.imread(image) 224 tail = os.path.split(image)[1] 225 edges = cv2.Canny(img,100,200) 226 plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray') 227 228 # walk /pfs/images and call make_edges on every file found 229 for dirpath, dirs, files in os.walk("/pfs/images"): 230 for file in files: 231 make_edges(os.path.join(dirpath, file)) 232 ``` 233 234 The code simply walks over all the images in `/pfs/images`, performs edge 235 detection, and writes the result to `/pfs/out`. 236 237 `/pfs/images` and `/pfs/out` are special local directories that 238 Pachyderm creates within the container automatically. All the input data 239 for a pipeline is stored in `/pfs/<input_repo_name>` and your code 240 should always write out to `/pfs/out`. Pachyderm automatically 241 gathers everything you write to `/pfs/out` and version it as this 242 pipeline output. 243 244 Now, let's create the pipeline in Pachyderm: 245 246 ```shell 247 pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json 248 ``` 249 250 ### What Happens When You Create a Pipeline 251 252 Creating a pipeline tells Pachyderm to run your code on the data in your 253 input repo (the HEAD commit) as well as **all future commits** that 254 occur after the pipeline is created. Our repo already had a commit, so 255 Pachyderm automatically launched a `job` to process that data. 256 257 The first time Pachyderm runs a pipeline job, it needs to download the 258 Docker image (specified in the pipeline spec) from the specified Docker 259 registry (DockerHub in this case). This first run this might take a 260 minute or so because of the image download, depending on your Internet 261 connection. Subsequent runs will be much faster. 262 263 You can view the job with: 264 265 ```shell 266 pachctl list job 267 ``` 268 269 **System response:** 270 271 ```shell 272 ID PIPELINE STARTED DURATION RESTART PROGRESS DL UL STATE 273 0f6a53829eeb4ca193bb7944fe693700 edges 16 seconds ago Less than a second 0 1 + 0 / 1 57.27KiB 22.22KiB success 274 ``` 275 276 Yay! Our pipeline succeeded! Pachyderm creates a corresponding output 277 repo for every pipeline. This output repo will have the same name as the 278 pipeline, and all the results of that pipeline will be versioned in this 279 output repo. In our example, the `edges` pipeline created a repo 280 called `edges` to store the results. 281 282 ``` 283 pachctl list repo 284 ``` 285 286 **System response:** 287 288 ```shell 289 NAME CREATED SIZE (MASTER) 290 edges 2 minutes ago 22.22KiB 291 images 5 minutes ago 57.27KiB 292 ``` 293 294 ### Reading the Output 295 296 We can view the output data from the `edges` repo in the same fashion 297 that we viewed the input data. 298 299 * On macOS, run: 300 301 ```shell 302 pachctl get file edges@master:liberty.png | open -f -a Preview.app 303 ``` 304 305 * On Linux 64-bit, run: 306 307 ```shell 308 pachctl get file edges@master:liberty.png | display 309 ``` 310 311 The output should look similar to: 312 313  314 315 ### Processing More Data 316 317 Pipelines will also automatically process the data from new commits as 318 they are created. Think of pipelines as being subscribed to any new 319 commits on their input repo(s). Also similar to Git, commits have a 320 parental structure that tracks which files have changed. In this case 321 we are going to be adding more images. 322 323 Let's create two new commits in a parental structure. To do this we 324 will simply do two more `put file` commands and by specifying `master` 325 as the branch, it automatically parents our commits onto each other. 326 Branch names are just references to a particular HEAD commit. 327 328 ```shell 329 pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png 330 pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png 331 ``` 332 333 Adding a new commit of data will automatically trigger the pipeline to 334 run on the new data we've added. We'll see corresponding jobs get 335 started and commits to the output "edges" repo. Let's also view our 336 new outputs. 337 338 View the list of jobs that have started: 339 340 ```shell 341 pachctl list job 342 ``` 343 344 **System response:** 345 346 ``` 347 ID STARTED DURATION RESTART PROGRESS DL UL STATE 348 81ae47a802f14038b95f8f248cddbed2 7 seconds ago Less than a second 0 1 + 2 / 3 102.4KiB 74.21KiB success 349 ce448c12d0dd4410b3a5ae0c0f07e1f9 16 seconds ago Less than a second 0 1 + 1 / 2 78.7KiB 37.15KiB success 350 490a28be32de491e942372018cd42460 9 minutes ago 35 seconds 0 1 + 0 / 1 57.27KiB 22.22KiB success 351 ``` 352 353 View the output data 354 355 * On macOS, run: 356 357 ```shell 358 pachctl get file edges@master:AT-AT.png | open -f -a /Applications/Preview.app 359 pachctl get file edges@master:kitten.png | open -f -a /Applications/Preview.app 360 ``` 361 362 * On Linux, run: 363 364 ```shell 365 pachctl get file edges@master:AT-AT.png | display 366 pachctl get file edges@master:kitten.png | display 367 ``` 368 369 ### Adding Another Pipeline 370 371 We have successfully deployed and used a single stage Pachyderm pipeline. 372 Now, let's add a processing stage to illustrate a multi-stage Pachyderm 373 pipeline. Specifically, let's add a `montage` pipeline that take our 374 original and edge detected images and arranges them into a single 375 montage of images: 376 377  378 379 Below is the pipeline spec for this new pipeline: 380 381 ```json 382 { 383 "pipeline": { 384 "name": "montage" 385 }, 386 "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.", 387 "input": { 388 "cross": [ { 389 "pfs": { 390 "glob": "/", 391 "repo": "images" 392 } 393 }, 394 { 395 "pfs": { 396 "glob": "/", 397 "repo": "edges" 398 } 399 } ] 400 }, 401 "transform": { 402 "cmd": [ "sh" ], 403 "image": "v4tech/imagemagick", 404 "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ] 405 } 406 } 407 ``` 408 409 This `montage` pipeline spec is similar to our `edges` pipeline except 410 for the following differences: 411 412 1. We are using a different Docker image that 413 has `imagemagick` installed. 414 2. We are executing a `sh` command with 415 `stdin` instead of a python script. 416 3. We have multiple input data repositories. 417 418 In the `montage` pipeline we are combining our multiple input data 419 repositories using a `cross` pattern. This `cross` pattern creates a 420 single pairing of our input images with our edge detected images. There 421 are several interesting ways to combine data in Pachyderm, which are 422 discussed 423 [here](../../reference/pipeline_spec/#input-required) 424 and 425 [here](../../concepts/pipeline-concepts/datum/join/). 426 427 We create the `montage` pipeline as before, with `pachctl`: 428 429 ```shell 430 pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/montage.json 431 ``` 432 433 Pipeline creating triggers a job that generates a montage for all the 434 current HEAD commits of the input repos: 435 436 ```shell 437 pachctl list job 438 ``` 439 440 **System response:** 441 442 ```shell 443 ID STARTED DURATION RESTART PROGRESS DL UL STATE 444 92cecc40c3144fd5b4e07603bb24b104 45 seconds ago 6 seconds 0 1 + 0 / 1 371.9KiB 1.284MiB success 445 81ae47a802f14038b95f8f248cddbed2 2 minutes ago Less than a second 0 1 + 2 / 3 102.4KiB 74.21KiB success 446 ce448c12d0dd4410b3a5ae0c0f07e1f9 2 minutes ago Less than a second 0 1 + 1 / 2 78.7KiB 37.15KiB success 447 490a28be32de491e942372018cd42460 11 minutes ago 35 seconds 0 1 + 0 / 1 57.27KiB 22.22KiB success 448 ``` 449 450 View the generated montage image by running one of 451 the following commands: 452 453 * On macOS, run: 454 455 ```shell 456 pachctl get file montage@master:montage.png | open -f -a Preview.app 457 ``` 458 459 * On Linux 64-bit, run: 460 461 ```shell 462 pachctl get file montage@master:montage.png | display 463 ``` 464 465  466 467 Exploring your DAG in the Pachyderm dashboard 468 --------------------------------------------- 469 470 When you deployed Pachyderm locally, the Pachyderm Enterprise dashboard 471 was also deployed by default. This dashboard will let you interactively 472 explore your pipeline, visualize the structure of the pipeline, explore 473 your data, debug jobs, etc. To access the dashboard visit 474 `localhost:30080` in an Internet browser (e.g., Google Chrome). You 475 should see something similar to this: 476 477  478 479 Enter your email address if you would like to obtain a free trial token 480 for the dashboard. Upon entering this trial token, you will be able to 481 see your pipeline structure and interactively explore the various pieces 482 of your pipeline as pictured below: 483 484  485 486  487 488 Next Steps 489 ---------- 490 491 Pachyderm is now running locally with data and a pipeline! To play with 492 Pachyderm locally, you can use what you've learned to build on or 493 change this pipeline. You can also dig in and learn more details about: 494 495 - [Deploying Pachyderm to the cloud or on prem](../deploy-manage/deploy/index.md) 496 - [Load Your Data into Pachyderm](../how-tos/load-data-into-pachyderm.md) 497 - [Working with Pipelines](../how-tos/developer-workflow/working-with-pipelines.md) 498 499 We'd love to help and see what you come up with, so submit any 500 issues/questions you come across on 501 [GitHub](https://github.com/pachyderm/pachyderm), 502 [Slack](http://slack.pachyderm.io), or email at <support@pachyderm.io> 503 if you want to show off anything nifty you've created!