# Beginner Tutorial

Welcome to the beginner tutorial for Pachyderm! If you have already installed
Pachyderm, this tutorial should take about 15 minutes to complete. This tutorial
introduces basic Pachyderm concepts.

!!! tip
    If you are new to Pachyderm, try [Pachyderm Shell](../../deploy-manage/manage/pachctl_shell/).
    This handy tool suggests `pachctl` commands as you type and
    helps you learn Pachyderm faster.

## Image processing with OpenCV

This tutorial walks you through the deployment of a Pachyderm pipeline
that performs [edge
detection](https://en.wikipedia.org/wiki/Edge_detection) on a few
images. Thanks to Pachyderm's built-in processing primitives, we can
keep our code simple but still run the pipeline in a
distributed, streaming fashion. Moreover, as new data is added, the
pipeline automatically processes it and outputs the results.

If you hit any errors not covered in this guide, get help in our [public
community Slack](http://slack.pachyderm.io), submit an issue on
[GitHub](https://github.com/pachyderm/pachyderm), or email us at
<support@pachyderm.io>. We are more than happy to help!

### Prerequisites

This guide assumes that you already have Pachyderm running locally.
If you haven't done so already, install Pachyderm on your local
machine as described in [Local Installation](local_installation.md).

### Create a Repo

A `repo` is the highest-level data primitive in Pachyderm. Like many
things in Pachyderm, it shares its name with a primitive in Git and is
designed to behave analogously. Generally, repos should be dedicated to
a single source of data such as log messages from a particular service,
a users table, or training data for an ML model.
Repos are easy to create
and do not take much space when empty, so do not worry about making
tons of them.

For this demo, we create a repo called `images` to hold the
data we want to process:

```shell
pachctl create repo images
```

Verify that the repository was created:

```shell
pachctl list repo
```

**System response:**

```shell
NAME   CREATED       SIZE (MASTER)
images 7 seconds ago 0B
```

This output shows that the repo has been successfully created. Because we
have not added anything to it yet, the size of the repository HEAD commit
on the master branch is 0B.

### Adding Data to Pachyderm

Now that we have created a repo, it is time to add some data. In
Pachyderm, you write data to an explicit `commit`. Commits are immutable
snapshots of your data which give Pachyderm its version control properties.
You can add, remove, or update `files` in a given commit.

Let's start by just adding a file, in this case an image, to a new
commit. We have provided some sample images for you that we host on
Imgur.

Use the `pachctl put file` command along with the `-f` flag. The `-f` flag can
take a local file, a URL, or an object storage bucket, which it
scrapes automatically. In this case, we simply pass the URL.

Unlike Git, commits in Pachyderm must be explicitly started and finished
as they can contain huge amounts of data and we do not want that much
*dirty* data hanging around in an unpersisted state. `pachctl put file`
automatically starts and finishes a commit for you so you can add files
more easily. If you want to add many files over a period of time, you
can run `pachctl start commit` and `pachctl finish commit` yourself.

We also specify the repo name `"images"`, the branch name `"master"`,
and the file name: `"liberty.png"`.
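
The `repo@branch:path` addressing and the explicit start/finish flow described above can be sketched in Python. This is only an illustration: the helpers `file_ref`, `put_file_args`, and `explicit_commit_commands` are hypothetical names, not part of Pachyderm, and they only assemble `pachctl` argument lists without contacting a cluster.

```python
def file_ref(repo, branch, path):
    """Build a `repo@branch:path` reference as used by `pachctl put file`."""
    return f"{repo}@{branch}:{path}"


def put_file_args(repo, branch, path, source):
    """Assemble the arguments for `pachctl put file <ref> -f <source>`."""
    return ["pachctl", "put", "file", file_ref(repo, branch, path), "-f", source]


def explicit_commit_commands(repo, branch, files):
    """Wrap several `put file` calls between `start commit` and `finish commit`.

    `files` maps a destination path in the repo to a source (local path or URL).
    """
    target = f"{repo}@{branch}"
    commands = [["pachctl", "start", "commit", target]]
    for path, source in files.items():
        commands.append(put_file_args(repo, branch, path, source))
    commands.append(["pachctl", "finish", "commit", target])
    return commands


# The single-file case from this tutorial.
print(" ".join(put_file_args("images", "master", "liberty.png",
                             "http://imgur.com/46Q8nDz.png")))
```

To run the commands against a live cluster, each argument list could be passed to `subprocess.run`; that step is left out here so the sketch stays runnable without Pachyderm installed.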

Here is an example atomic commit of the file `liberty.png` to the
`images` repo `master` branch:

```shell
pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png
```

We can check to make sure the data we just added is in Pachyderm.

* Use the `pachctl list repo` command to check that data has been added:

  ```shell
  pachctl list repo
  ```

  **System response:**

  ```
  NAME   CREATED            SIZE (MASTER)
  images About a minute ago 57.27KiB
  ```

* View the commit that was just created:

  ```shell
  pachctl list commit images
  ```

  **System response:**

  ```
  REPO   COMMIT                           PARENT STARTED        DURATION           SIZE
  images d89758a7496a4c56920b0eaa7d7d3255 <none> 29 seconds ago Less than a second 57.27KiB
  ```

* View the file in that commit:

  ```shell
  pachctl list file images@master
  ```

  **System response:**

  ```
  COMMIT                           NAME         TYPE COMMITTED          SIZE
  d89758a7496a4c56920b0eaa7d7d3255 /liberty.png file About a minute ago 57.27KiB
  ```

Also, you can view the file you have just added to Pachyderm. Because this is an
image, you cannot just print it out in the terminal, but the following
commands will let you view it easily:

* On macOS prior to Catalina, run:

  ```
  pachctl get file images@master:liberty.png | open -f -a /Applications/Preview.app
  ```

* On macOS Catalina, run:

  ```
  pachctl get file images@master:liberty.png | open -f -a /System/Applications/Preview.app
  ```

* On Linux 64-bit, run:

  ```
  pachctl get file images@master:liberty.png | display
  ```

### Create a Pipeline

Now that you have some data in your repo, it is time to do something
with it. Pipelines are the core processing primitive in Pachyderm and
you can define them with a JSON encoding.
For this example, we have
already created the pipeline for you and you can find the [code on
GitHub](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv).

When you want to create your own pipelines later, you can refer to the
full [Pipeline Specification](../../reference/pipeline_spec) to use
more advanced options. Options include building your own code into a
container instead of the pre-built Docker image that we are
using in this tutorial.

For now, we are going to create a single pipeline that takes in images
and does some simple edge detection.

![image](../assets/images/opencv-liberty.png)

Below is the pipeline spec and Python code that we are using. Let's walk
through the details.

```shell
# edges.json
{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}
```

Our pipeline spec contains a few simple sections. First is the pipeline
`name`, `edges`. Then we have the `transform`, which specifies the Docker
image we want to use, `pachyderm/opencv` (defaults to DockerHub as the
registry), and the entry point `edges.py`. Lastly, we specify the input.
Here we only have one PFS input, our `images` repo with a particular glob
pattern.

The glob pattern defines how the input data can be broken up if we want
to distribute our computation. `/*` means that each file can be
processed individually, which makes sense for images. Glob patterns are
one of the most powerful features in Pachyderm.
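
To make the effect of a glob pattern more concrete, here is a simplified Python model of how a pattern might split a repo's files into independently processable units. It is a sketch under stated assumptions, not Pachyderm's implementation: the function name `split_into_datums` is hypothetical, only the `/` and `/*`-style patterns are modeled, and the file names are borrowed from this tutorial.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def split_into_datums(paths, glob):
    """Group file paths by the prefix that matches `glob`.

    Simplified model: "/" puts every file into one group, while "/*"
    creates one group per top-level entry.
    """
    patterns = [part for part in glob.split("/") if part]
    depth = len(patterns)
    groups = defaultdict(list)
    for path in sorted(paths):
        parts = [part for part in path.split("/") if part]
        if len(parts) < depth:
            continue  # path is shallower than the pattern, so it cannot match
        head = parts[:depth]
        if all(fnmatchcase(p, pat) for p, pat in zip(head, patterns)):
            groups["/" + "/".join(head)].append(path)
    return dict(groups)


files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]

# "/*": each top-level file is its own unit of work.
print(split_into_datums(files, "/*"))

# "/": the whole repo is a single unit of work.
print(split_into_datums(files, "/"))
```

With `/*`, the three files can be handed to three workers independently; with `/`, a single worker sees all of them at once.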

The following is the Python code that we run in this pipeline:

``` python
# edges.py
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]
    edges = cv2.Canny(img,100,200)
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath, file))
```

The code simply walks over all the images in `/pfs/images`, performs edge
detection, and writes the result to `/pfs/out`.

`/pfs/images` and `/pfs/out` are special local directories that
Pachyderm creates within the container automatically. All the input data
for a pipeline is stored in `/pfs/<input_repo_name>` and your code
should always write out to `/pfs/out`. Pachyderm automatically
gathers everything you write to `/pfs/out` and versions it as this
pipeline's output.

Now, let's create the pipeline in Pachyderm:

```shell
pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json
```

### What Happens When You Create a Pipeline

Creating a pipeline tells Pachyderm to run your code on the data in your
input repo (the HEAD commit) as well as **all future commits** that
occur after the pipeline is created. Our repo already had a commit, so
Pachyderm automatically launched a `job` to process that data.
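
Each job runs the pipeline's command against the `/pfs/<input_repo_name>` and `/pfs/out` directories. The same walk-and-write pattern can be tried locally before building an image. The sketch below is an assumption-laden stand-in, not the tutorial's code: temporary directories play the role of `/pfs/images` and `/pfs/out`, the helper `process_all` is a hypothetical name, and a trivial byte-reversing function replaces the OpenCV edge detection so that the example runs without extra dependencies.

```python
import os
import tempfile


def process_all(input_dir, output_dir, transform):
    """Walk input_dir and write transform(contents) for every file to output_dir."""
    written = []
    for dirpath, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            with open(os.path.join(dirpath, name), "rb") as src:
                data = src.read()
            # Mirror edges.py: keep the base name, always emit a .png file.
            out_name = os.path.splitext(name)[0] + ".png"
            with open(os.path.join(output_dir, out_name), "wb") as dst:
                dst.write(transform(data))
            written.append(out_name)
    return written


with tempfile.TemporaryDirectory() as root:
    images = os.path.join(root, "images")  # stands in for /pfs/images
    out = os.path.join(root, "out")        # stands in for /pfs/out
    os.makedirs(images)
    os.makedirs(out)
    with open(os.path.join(images, "liberty.png"), "wb") as f:
        f.write(b"placeholder bytes, not a real image")
    # Placeholder transform; the real pipeline would run edge detection here.
    result = process_all(images, out, transform=lambda data: data[::-1])
    print(result)
```

Keeping the input and output directories as parameters means the same function could be pointed at `/pfs/images` and `/pfs/out` inside the container.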

The first time Pachyderm runs a pipeline job, it needs to download the
Docker image (specified in the pipeline spec) from the specified Docker
registry (DockerHub in this case). This first run might take a
minute or so because of the image download, depending on your Internet
connection. Subsequent runs will be much faster.

You can view the job with:

``` bash
pachctl list job
```

**System response:**

```shell
ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
0f6a53829eeb4ca193bb7944fe693700 edges    16 seconds ago Less than a second 0       1 + 0 / 1 57.27KiB 22.22KiB success
```

Yay! Our pipeline succeeded! Pachyderm creates a corresponding output
repo for every pipeline. This output repo will have the same name as the
pipeline, and all the results of that pipeline will be versioned in this
output repo. In our example, the `edges` pipeline created a repo
called `edges` to store the results.

``` bash
pachctl list repo
```

**System response:**

```shell
NAME   CREATED       SIZE (MASTER)
edges  2 minutes ago 22.22KiB
images 5 minutes ago 57.27KiB
```

### Reading the Output

We can view the output data from the `edges` repo in the same fashion
that we viewed the input data.

* On macOS prior to Catalina, run:

  ```
  pachctl get file edges@master:liberty.png | open -f -a /Applications/Preview.app
  ```

* On macOS Catalina, run:

  ```
  pachctl get file edges@master:liberty.png | open -f -a /System/Applications/Preview.app
  ```

* On Linux 64-bit, run:

  ```
  pachctl get file edges@master:liberty.png | display
  ```

The output should look similar to:

![image](../assets/images/edges-screenshot.png)

### Processing More Data

Pipelines will also automatically process the data from new commits as
they are created.
Think of pipelines as being subscribed to any new
commits on their input repo(s). Also similar to Git, commits have a
parental structure that tracks which files have changed. In this case
we are going to be adding more images.

Let's create two new commits in a parental structure. To do this, we
simply run two more `put file` commands. Because we specify `master`
as the branch, Pachyderm automatically parents our commits onto each other.
Branch names are just references to a particular HEAD commit.

```shell
pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
```

Adding a new commit of data automatically triggers the pipeline to
run on the new data we've added. We'll see corresponding jobs get
started and new commits appear in the output `edges` repo. Let's also view our
new outputs.

View the list of jobs that have started:

``` bash
pachctl list job
```

**System response:**

```
ID                               STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
81ae47a802f14038b95f8f248cddbed2 7 seconds ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
ce448c12d0dd4410b3a5ae0c0f07e1f9 16 seconds ago Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
490a28be32de491e942372018cd42460 9 minutes ago  35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
```

View the output data:

* On macOS, run:

  ```shell
  pachctl get file edges@master:AT-AT.png | open -f -a /Applications/Preview.app
  pachctl get file edges@master:kitten.png | open -f -a /Applications/Preview.app
  ```

* On Linux, run:

  ```shell
  pachctl get file edges@master:AT-AT.png | display
  pachctl get file edges@master:kitten.png | display
  ```

### Adding Another Pipeline

We have successfully deployed and used a single-stage Pachyderm pipeline.
Now, let's add a processing stage to illustrate a multi-stage Pachyderm
pipeline. Specifically, let's add a `montage` pipeline that takes our
original and edge-detected images and arranges them into a single
montage of images:

![image](../assets/images/opencv-liberty-montage.png)

Below is the pipeline spec for this new pipeline:

```shell
# montage.json
{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    },
    {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}
```

This `montage` pipeline spec is similar to our `edges` pipeline except
for the following differences:

1. We are using a different Docker image that
   has `imagemagick` installed.
2. We are executing a `sh` command with
   `stdin` instead of a Python script.
3. We have multiple input data repositories.

In the `montage` pipeline we are combining our multiple input data
repositories using a `cross` pattern. This `cross` pattern creates a
single pairing of our input images with our edge-detected images. There
are several interesting ways to combine data in Pachyderm, which are
discussed
[here](../../reference/pipeline_spec/#input-required)
and
[here](../../concepts/pipeline-concepts/datum/join/).
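
The reason the `cross` above yields a single pairing can be illustrated with a short Python model. This is a conceptual sketch, not Pachyderm code: it assumes a cross input behaves like a Cartesian product over the units of work that each input's glob produces, and the function name `cross` and the file lists are illustrative.

```python
from itertools import product


def cross(*inputs):
    """Pair every unit of work from each input with every unit from the others."""
    return list(product(*inputs))


files = ["/AT-AT.png", "/kitten.png", "/liberty.png"]

# With glob "/", each repo contributes one unit of work containing all its files.
images_whole = [files]
edges_whole = [files]
pairs = cross(images_whole, edges_whole)
print(len(pairs))  # 1: one job input that sees both repos in full

# With glob "/*" on both inputs, every file would be paired with every other file.
images_each = [[f] for f in files]
edges_each = [[f] for f in files]
per_file = cross(images_each, edges_each)
print(len(per_file))  # 9
```

For the montage, a single pairing is what we want, because one `montage` invocation needs to see every image from both repos at the same time.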

We create the `montage` pipeline as before, with `pachctl`:

```shell
pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/montage.json
```

Creating the pipeline triggers a job that generates a montage for all the
current HEAD commits of the input repos:

```shell
pachctl list job
```

**System response:**

```shell
ID                               STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
92cecc40c3144fd5b4e07603bb24b104 45 seconds ago 6 seconds          0       1 + 0 / 1 371.9KiB 1.284MiB success
81ae47a802f14038b95f8f248cddbed2 2 minutes ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
ce448c12d0dd4410b3a5ae0c0f07e1f9 2 minutes ago  Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
490a28be32de491e942372018cd42460 11 minutes ago 35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
```

View the generated montage image by running one of
the following commands:

* On macOS prior to Catalina, run:

  ```
  pachctl get file montage@master:montage.png | open -f -a /Applications/Preview.app
  ```

* On macOS Catalina, run:

  ```
  pachctl get file montage@master:montage.png | open -f -a /System/Applications/Preview.app
  ```

* On Linux 64-bit, run:

  ```
  pachctl get file montage@master:montage.png | display
  ```

![image](../assets/images/montage-screenshot.png)

Exploring your DAG in the Pachyderm dashboard
---------------------------------------------

When you deployed Pachyderm locally, the Pachyderm Enterprise dashboard
was also deployed by default. This dashboard lets you interactively
explore your pipeline, visualize the structure of the pipeline, explore
your data, debug jobs, and more. To access the dashboard, visit
`localhost:30080` in an Internet browser (e.g., Google Chrome).
You
should see something similar to this:

![image](../assets/images/dashboard1.png)

Enter your email address if you would like to obtain a free trial token
for the dashboard. Upon entering this trial token, you will be able to
see your pipeline structure and interactively explore the various pieces
of your pipeline as pictured below:

![image](../assets/images/dashboard2.png)

![image](../assets/images/dashboard3.png)

Next Steps
----------

Pachyderm is now running locally with data and a pipeline! To play with
Pachyderm locally, you can use what you've learned to build on or
change this pipeline. You can also dig in and learn more details about:

- [Deploying Pachyderm to the cloud or on prem](../deploy-manage/deploy/index.md)
- [Load Your Data into Pachyderm](../how-tos/load-data-into-pachyderm.md)
- [Working with Pipelines](../how-tos/developer-workflow/working-with-pipelines.md)

We'd love to help and see what you come up with. Submit any
issues or questions you come across on
[GitHub](https://github.com/pachyderm/pachyderm) or
[Slack](http://slack.pachyderm.io), or email us at <support@pachyderm.io>
if you want to show off anything nifty you've created!