# Beginner Tutorial

Welcome to the beginner tutorial for Pachyderm! If you have already installed
Pachyderm, this tutorial should take about 15 minutes to complete. This tutorial
introduces basic Pachyderm concepts.

!!! tip
    If you are new to Pachyderm, try [Pachyderm Shell](../../deploy-manage/manage/pachctl_shell/).
    This handy tool suggests `pachctl` commands as you type and
    helps you learn Pachyderm faster.

## Image processing with OpenCV

This tutorial walks you through the deployment of a Pachyderm pipeline
that performs [edge
detection](https://en.wikipedia.org/wiki/Edge_detection) on a few
images. Thanks to Pachyderm's built-in processing primitives, we can
keep our code simple but still run the pipeline in a
distributed, streaming fashion. Moreover, as new data is added, the
pipeline automatically processes it and outputs the results.

If you hit any errors not covered in this guide, get help in our [public
community Slack](http://slack.pachyderm.io), submit an issue on
[GitHub](https://github.com/pachyderm/pachyderm), or email us at
<support@pachyderm.io>. We are more than happy to help!

### Prerequisites

This guide assumes that you already have Pachyderm running locally.
If you haven't done so already, install Pachyderm on your local
machine as described in [Local Installation](local_installation.md).

### Create a Repo

A `repo` is the highest level data primitive in Pachyderm. Like many
things in Pachyderm, it shares its name with a primitive in Git and is
designed to behave analogously. Generally, repos should be dedicated to
a single source of data such as log messages from a particular service,
a users table, or training data for an ML model.
Repos are easy to create
and do not take much space when empty, so do not worry about making
tons of them.

For this demo, we create a repo called `images` to hold the
data we want to process:

```shell
pachctl create repo images
```

Verify that the repository was created:

```shell
pachctl list repo
```

**System response:**

```shell
NAME   CREATED       SIZE (MASTER)
images 7 seconds ago 0B
```

This output shows that the repo has been successfully created. Because we
have not added anything to it yet, the size of the repository HEAD commit
on the master branch is 0B.

### Adding Data to Pachyderm

Now that we have created a repo, it is time to add some data. In
Pachyderm, you write data to an explicit `commit`. Commits are immutable
snapshots of your data which give Pachyderm its version control properties.
You can add, remove, or update `files` in a given commit.

Let's start by just adding a file, in this case an image, to a new
commit. We have provided some sample images for you that we host on
Imgur.

Use the `pachctl put file` command along with the `-f` flag. The `-f` flag can
take a local file, a URL, or an object storage bucket, which it
scrapes automatically. In this case, we simply pass the URL.

Unlike Git, commits in Pachyderm must be explicitly started and finished
as they can contain huge amounts of data and we do not want that much
*dirty* data hanging around in an unpersisted state. `pachctl put file`
automatically starts and finishes a commit for you so you can add files
more easily. If you want to add many files over a period of time, you
can do `pachctl start commit` and `pachctl finish commit` yourself.

We also specify the repo name `"images"`, the branch name `"master"`,
and the file name: `"liberty.png"`.

Here is an example atomic commit of the file `liberty.png` to the
`images` repo `master` branch:

```shell
pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png
```

We can check to make sure the data we just added is in Pachyderm.

* Use the `pachctl list repo` command to check that data has been added:

  ```shell
  pachctl list repo
  ```

  **System response:**

  ```
  NAME   CREATED            SIZE (MASTER)
  images About a minute ago 57.27KiB
  ```

* View the commit that was just created:

  ```shell
  pachctl list commit images
  ```

  **System response:**

  ```
  REPO   COMMIT                           PARENT STARTED        DURATION           SIZE
  images d89758a7496a4c56920b0eaa7d7d3255 <none> 29 seconds ago Less than a second 57.27KiB
  ```

* View the file in that commit:

  ```shell
  pachctl list file images@master
  ```

  **System response:**

  ```
  COMMIT                           NAME         TYPE COMMITTED          SIZE
  d89758a7496a4c56920b0eaa7d7d3255 /liberty.png file About a minute ago 57.27KiB
  ```

Also, you can view the file you have just added to Pachyderm. Because this is an
image, you cannot just print it out in the terminal, but the following
commands will let you view it easily:

* On macOS prior to Catalina, run:

  ```
  pachctl get file images@master:liberty.png | open -f -a /Applications/Preview.app
  ```

* On macOS Catalina, run:

  ```
  pachctl get file images@master:liberty.png | open -f -a /System/Applications/Preview.app
  ```

* On Linux 64-bit, run:

  ```
  pachctl get file images@master:liberty.png | display
  ```

### Create a Pipeline

Now that you have some data in your repo, it is time to do something
with it. Pipelines are the core processing primitive in Pachyderm and
you can define them with a JSON encoding.
For this example, we have
already created the pipeline for you and you can find the [code on
GitHub](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv).

When you want to create your own pipelines later, you can refer to the
full [Pipeline Specification](../../reference/pipeline_spec) to use
more advanced options. Options include building your own code into a
container instead of the pre-built Docker image that we are
using in this tutorial.

For now, we are going to create a single pipeline that takes in images
and does some simple edge detection.

Below is the pipeline spec and Python code that we are using. Let's walk
through the details.

```shell
# edges.json
{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}
```

Our pipeline spec contains a few simple sections. First is the pipeline
`name`, `edges`. Then we have the `transform`, which specifies the Docker
image we want to use, `pachyderm/opencv` (defaults to DockerHub as the
registry), and the entry point `edges.py`. Lastly, we specify the input.
Here we only have one PFS input, our `images` repo with a particular glob
pattern.

The glob pattern defines how the input data can be broken up if we want
to distribute our computation. `/*` means that each file can be
processed individually, which makes sense for images. Glob patterns are
one of the most powerful features in Pachyderm.
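
To make the effect of the glob pattern concrete, here is a small toy model in Python. It is only a sketch and not how Pachyderm implements globbing: the function name `split_into_datums` is hypothetical, it models only the two patterns `/` and `/*`, and the file list is an assumed repo listing that uses this tutorial's sample image names.

```python
from collections import defaultdict


def split_into_datums(paths, glob):
    """Toy model of how a glob pattern groups input files into units of work.

    Only two patterns are modelled (an assumption of this sketch):
      "/"  -> the whole repo is processed as a single unit
      "/*" -> each top-level file or directory is its own unit
    """
    if glob == "/":
        return [sorted(paths)]
    if glob == "/*":
        groups = defaultdict(list)
        for path in paths:
            # The first path component decides which group a file belongs to.
            top_level = path.strip("/").split("/")[0]
            groups[top_level].append(path)
        return [sorted(groups[name]) for name in sorted(groups)]
    raise ValueError(f"pattern not modelled in this sketch: {glob!r}")


# Hypothetical repo listing, using the sample image names from this tutorial.
files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]

print(split_into_datums(files, "/*"))  # three units, one per image
print(split_into_datums(files, "/"))   # one unit containing every image
```

With `/*`, each image can be handed to a separate worker, while `/` forces all images to be processed together.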

The following text is the Python code that we run in this pipeline:

``` python
# edges.py
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]
    edges = cv2.Canny(img,100,200)
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath, file))
```

The code simply walks over all the images in `/pfs/images`, performs edge
detection, and writes the result to `/pfs/out`.

`/pfs/images` and `/pfs/out` are special local directories that
Pachyderm creates within the container automatically. All the input data
for a pipeline is stored in `/pfs/<input_repo_name>` and your code
should always write out to `/pfs/out`. Pachyderm automatically
gathers everything you write to `/pfs/out` and versions it as this
pipeline's output.

Now, let's create the pipeline in Pachyderm:

```shell
pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json
```

### What Happens When You Create a Pipeline

Creating a pipeline tells Pachyderm to run your code on the data in your
input repo (the HEAD commit) as well as **all future commits** that
occur after the pipeline is created. Our repo already had a commit, so
Pachyderm automatically launched a `job` to process that data.

The first time Pachyderm runs a pipeline job, it needs to download the
Docker image (specified in the pipeline spec) from the specified Docker
registry (DockerHub in this case). This first run might take a
minute or so because of the image download, depending on your Internet
connection. Subsequent runs will be much faster.

You can view the job with:

``` bash
pachctl list job
```

**System response:**

```shell
ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
0f6a53829eeb4ca193bb7944fe693700 edges    16 seconds ago Less than a second 0       1 + 0 / 1 57.27KiB 22.22KiB success
```

Yay! Our pipeline succeeded! Pachyderm creates a corresponding output
repo for every pipeline. This output repo will have the same name as the
pipeline, and all the results of that pipeline will be versioned in this
output repo. In our example, the `edges` pipeline created a repo
called `edges` to store the results.

``` bash
pachctl list repo
```

**System response:**

```shell
NAME   CREATED       SIZE (MASTER)
edges  2 minutes ago 22.22KiB
images 5 minutes ago 57.27KiB
```

### Reading the Output

We can view the output data from the `edges` repo in the same fashion
that we viewed the input data.

* On macOS prior to Catalina, run:

  ```
  pachctl get file edges@master:liberty.png | open -f -a /Applications/Preview.app
  ```

* On macOS Catalina, run:

  ```
  pachctl get file edges@master:liberty.png | open -f -a /System/Applications/Preview.app
  ```

* On Linux 64-bit, run:

  ```
  pachctl get file edges@master:liberty.png | display
  ```

The output should look similar to:

### Processing More Data

Pipelines will also automatically process the data from new commits as
they are created.
Think of pipelines as being subscribed to any new
commits on their input repo(s). Also similar to Git, commits have a
parental structure that tracks which files have changed. In this case
we are going to be adding more images.

Let's create two new commits in a parental structure. To do this, we
simply run two more `put file` commands. Because we specify `master`
as the branch, Pachyderm automatically parents our commits onto each other.
Branch names are just references to a particular HEAD commit.

```shell
pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
```

Adding a new commit of data will automatically trigger the pipeline to
run on the new data we've added. We'll see corresponding jobs get
started and commits to the output `edges` repo. Let's also view our
new outputs.

View the list of jobs that have started:

``` bash
pachctl list job
```

**System response:**

```
ID                               STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
81ae47a802f14038b95f8f248cddbed2 7 seconds ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
ce448c12d0dd4410b3a5ae0c0f07e1f9 16 seconds ago Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
490a28be32de491e942372018cd42460 9 minutes ago  35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
```

View the output data:

* On macOS, run:

  ```shell
  pachctl get file edges@master:AT-AT.png | open -f -a /Applications/Preview.app
  pachctl get file edges@master:kitten.png | open -f -a /Applications/Preview.app
  ```

* On Linux, run:

  ```shell
  pachctl get file edges@master:AT-AT.png | display
  pachctl get file edges@master:kitten.png | display
  ```

### Adding Another Pipeline

We have successfully deployed and used a single stage Pachyderm pipeline.
Now, let's add a processing stage to illustrate a multi-stage Pachyderm
pipeline. Specifically, let's add a `montage` pipeline that takes our
original and edge-detected images and arranges them into a single
montage of images.

Below is the pipeline spec for this new pipeline:

```shell
# montage.json
{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    },
    {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}
```

This `montage` pipeline spec is similar to our `edges` pipeline except
for the following differences:

1. We are using a different Docker image that
   has `imagemagick` installed.
2. We are executing a `sh` command with
   `stdin` instead of a Python script.
3. We have multiple input data repositories.

In the `montage` pipeline we are combining our multiple input data
repositories using a `cross` pattern. Because both inputs use the `/`
glob, this `cross` pattern creates a
single pairing of our input images with our edge-detected images. There
are several interesting ways to combine data in Pachyderm, which are
discussed in the
[input section of the pipeline specification](../../reference/pipeline_spec/#input-required)
and in the
[join documentation](../../concepts/pipeline-concepts/datum/join/).
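
The pairing behaviour of `cross` can be illustrated with a short Python sketch. This is a simplified model, not Pachyderm's implementation: the helper `cross` is hypothetical, each input is represented as a plain list of groups of file names, and the file names are this tutorial's sample images.

```python
from itertools import product


def cross(*inputs):
    """Return every combination that takes one group of files from each input."""
    return list(product(*inputs))


# With glob "/", each repo contributes exactly one group: its whole contents.
images_whole = [("AT-AT.png", "kitten.png", "liberty.png")]
edges_whole = [("AT-AT.png", "kitten.png", "liberty.png")]
montage_datums = cross(images_whole, edges_whole)
print(len(montage_datums))  # 1

# For comparison only: with glob "/*" on both inputs, each file is its own
# group, so the cross would pair every image with every edge image.
images_each = [("AT-AT.png",), ("kitten.png",), ("liberty.png",)]
edges_each = [("AT-AT.png",), ("kitten.png",), ("liberty.png",)]
per_file_datums = cross(images_each, edges_each)
print(len(per_file_datums))  # 9
```

The single pairing is what lets one `montage` command see all the files under `/pfs` at once.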

We create the `montage` pipeline as before, with `pachctl`:

```shell
pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/montage.json
```

Creating the pipeline triggers a job that generates a montage for all the
current HEAD commits of the input repos:

```shell
pachctl list job
```

**System response:**

```shell
ID                               STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
92cecc40c3144fd5b4e07603bb24b104 45 seconds ago 6 seconds          0       1 + 0 / 1 371.9KiB 1.284MiB success
81ae47a802f14038b95f8f248cddbed2 2 minutes ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
ce448c12d0dd4410b3a5ae0c0f07e1f9 2 minutes ago  Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
490a28be32de491e942372018cd42460 11 minutes ago 35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
```

View the generated montage image by running one of
the following commands:

* On macOS prior to Catalina, run:

  ```
  pachctl get file montage@master:montage.png | open -f -a /Applications/Preview.app
  ```

* On macOS Catalina, run:

  ```
  pachctl get file montage@master:montage.png | open -f -a /System/Applications/Preview.app
  ```

* On Linux 64-bit, run:

  ```
  pachctl get file montage@master:montage.png | display
  ```

Exploring your DAG in the Pachyderm dashboard
---------------------------------------------

When you deployed Pachyderm locally, the Pachyderm Enterprise dashboard
was also deployed by default. This dashboard lets you interactively
explore your pipeline, visualize the structure of the pipeline, explore
your data, debug jobs, and more. To access the dashboard, visit
`localhost:30080` in an Internet browser (e.g., Google Chrome).
You
should see something similar to this:

Enter your email address if you would like to obtain a free trial token
for the dashboard. Upon entering this trial token, you will be able to
see your pipeline structure and interactively explore the various pieces
of your pipeline as pictured below:

Next Steps
----------

Pachyderm is now running locally with data and a pipeline! To play with
Pachyderm locally, you can use what you've learned to build on or
change this pipeline. You can also dig in and learn more details about:

- [Deploying Pachyderm to the cloud or on prem](../deploy-manage/deploy/index.md)
- [Load Your Data into Pachyderm](../how-tos/load-data-into-pachyderm.md)
- [Individual Developer Workflow](../how-tos/individual-developer-workflow.md)

We'd love to help and see what you come up with, so submit any
issues or questions you come across on
[GitHub](https://github.com/pachyderm/pachyderm) or
[Slack](http://slack.pachyderm.io), or email us at <support@pachyderm.io>
if you want to show off anything nifty you've created!