github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/getting_started/beginner_tutorial.md

github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/getting_started/beginner_tutorial.md (about)

     1  # Beginner Tutorial
     2  
     3  Welcome to the beginner tutorial for Pachyderm! If you have already installed
     4  Pachyderm, this tutorial should take about 15 minutes to complete. This tutorial
     5  introduces basic Pachyderm concepts.
     6  
     7  !!! tip
     8      If you are new to Pachyderm, try [Pachyderm Shell](../../deploy-manage/manage/pachctl_shell/).
     9      This handy tool suggests you `pachctl` commands as you type and
    10      helps you learn Pachyderm faster.
    11  
    12  ## Image processing with OpenCV
    13  
    14  This tutorial walks you through the deployment of a Pachyderm pipeline
    15  that performs [edge
    16  detection](https://en.wikipedia.org/wiki/Edge_detection) on a few
    17  images. Thanks to Pachyderm's built-in processing primitives, we can
    18  keep our code simple but still run the pipeline in a
    19  distributed, streaming fashion. Moreover, as new data is added, the
    20  pipeline automatically processes it and outputs the results.
    21  
    22  If you hit any errors not covered in this guide, get help in our [public
    23  community Slack](http://slack.pachyderm.io), submit an issue on
    24  [GitHub](https://github.com/pachyderm/pachyderm), or email us at
    25  <support@pachyderm.io>. We are more than happy to help!
    26  
    27  ### Prerequisites
    28  
    29  This guide assumes that you already have Pachyderm running locally.
    30  If you haven't done so already, install Pachyderm on your local
    31  machine as described in [Local Installation](local_installation.md).
    32  
    33  ### Create a Repo
    34  
    35  A `repo` is the highest level data primitive in Pachyderm. Like many
    36  things in Pachyderm, it shares its name with a primitive in Git and is
    37  designed to behave analogously. Generally, repos should be dedicated to
    38  a single source of data such as log messages from a particular service,
    39  a users table, or training data for an ML model. Repos are easy to create
    40  and do not take much space when empty so do not worry about making
    41  tons of them.
    42  
    43  For this demo, we create a repo called `images` to hold the
    44  data we want to process:
    45  
    46  ```shell
    47  pachctl create repo images
    48  ```
    49  
    50  Verify that the repository was created:
    51  
    52  ```shell
    53  pachctl list repo
    54  ```
    55  
    56  **System response:**
    57  
    58  ```shell
    59  NAME   CREATED       SIZE (MASTER)
    60  images 7 seconds ago 0B
    61  ```
    62  
    63  This output shows that the repo has been successfully created. Because we
    64  have not added anything to it yet, the size of the repository HEAD commit
    65  on the master branch is 0B.
    66  
    67  ### Adding Data to Pachyderm
    68  
    69  Now that we have created a repo it is time to add some data. In
    70  Pachyderm, you write data to an explicit `commit`. Commits are immutable
    71  snapshots of your data which give Pachyderm its version control properties.
    72  You can add, remove, or update `files` in a given commit.
    73  
    74  Let's start by just adding a file, in this case an image, to a new
    75  commit. We have provided some sample images for you that we host on
    76  Imgur.
    77  
    78  Use the `pachctl put file` command along with the `-f` flag.  The `-f` flag can
    79  take either a local file, a URL, or a object storage bucket which it
    80  scrapes automatically. In this case, we simply pass the URL.
    81  
    82  Unlike Git, commits in Pachyderm must be explicitly started and finished
    83  as they can contain huge amounts of data and we do not want that much
    84  *dirty* data hanging around in an unpersisted state. `pachctl put file`
    85  automatically starts and finishes a commit for you so you can add files
    86  more easily. If you want to add many files over a period of time, you
    87  can do `pachctl start commit` and `pachctl finish commit` yourself.
    88  
    89  We also specify the repo name `"images"`, the branch name `"master"`,
    90  and the file name: `"liberty.png"`.
    91  
    92  Here is an example atomic commit of the file `liberty.png` to the
    93  `images` repo `master` branch:
    94  
    95  ```shell
    96  pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png
    97  ```
    98  
    99  We can check to make sure the data we just added is in Pachyderm.
   100  
   101  * Use the `pachctl list repo` command to check that data has been added:
   102  
   103    ```shell
   104    pachctl list repo
   105    ```
   106  
   107    **System response:**
   108  
   109    ```
   110    NAME   CREATED            SIZE (MASTER)
   111    images About a minute ago 57.27KiB
   112    ```
   113  
   114  * View the commit that was just created:
   115  
   116    ```shell
   117    pachctl list commit images
   118    ```
   119  
   120    **System response:**
   121  
   122    ```
   123    REPO   COMMIT                           PARENT STARTED        DURATION           SIZE
   124    images d89758a7496a4c56920b0eaa7d7d3255 <none> 29 seconds ago Less than a second 57.27KiB
   125    ```
   126  
   127  * View the file in that commit:
   128  
   129    ```shell
   130    pachctl list file images@master
   131    ```
   132  
   133    **System response:**
   134  
   135    ```
   136    COMMIT                           NAME         TYPE COMMITTED          SIZE
   137    d89758a7496a4c56920b0eaa7d7d3255 /liberty.png file About a minute ago 57.27KiB
   138    ```
   139  
   140  Now you can view the file by retrieving it from Pachyderm. Because this is an
   141  image, you cannot just print it out in the terminal, but the following
   142  command will let you view it:
   143  
   144  * On macOS, run:
   145  
   146  ```shell
   147  pachctl get file images@master:liberty.png | open -f -a Preview.app
   148  ```
   149  
   150  * On Linux 64-bit, run:
   151  
   152  ```shell
   153  pachctl get file images@master:liberty.png | display
   154  ```
   155  
   156  ### Create a Pipeline
   157  
   158  Now that you have some data in your repo, it is time to do something
   159  with it. Pipelines are the core processing primitive in Pachyderm.
   160  Pipelines are defined with a simple JSON file called a pipeline
   161  specification or pipeline spec for short. For this example, we already
   162  [created the pipeline spec for you](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv).
   163  
   164  When you want to create your own pipeline specs later, you can refer to the
   165  full [Pipeline Specification](../../reference/pipeline_spec) to use
   166  more advanced options. Options include building your own code into a
   167  container. In this tutorial, we are using a pre-built Docker image.
   168  
   169  For now, we are going to create a single pipeline spec that takes in images
   170  and does some simple edge detection.
   171  
   172  ![image](../assets/images/opencv-liberty.png)
   173  
   174  Below is the `edges.json` pipeline spec. Let's walk
   175  through the details.
   176  
   177  ```json
   178  {
   179    "pipeline": {
   180      "name": "edges"
   181    },
   182    "description": "A pipeline that performs image edge detection by using the OpenCV library.",
   183    "transform": {
   184      "cmd": [ "python3", "/edges.py" ],
   185      "image": "pachyderm/opencv"
   186    },
   187    "input": {
   188      "pfs": {
   189        "repo": "images",
   190        "glob": "/*"
   191      }
   192    }
   193  }
   194  ```
   195  
   196  The pipeline spec contains a few simple sections. The pipeline section contains
   197  a `name`, which is how you will identify your pipeline. Your pipeline will also
   198  automatically create an output repo with the same name. The `transform` section
   199  allows you to specify the docker image you want to use. In this case,
   200  `pachyderm/opencv` is the docker image (defaults to DockerHub as the registry),
   201  and the entry point is `edges.py`. The input section specifies repos visible
   202  to the running pipeline, and how to process the data from the repos. Commits to
   203  these repos will automatically trigger the pipeline to create new jobs to
   204  process them. In this case, `images` is the repo, and `/*` is the glob pattern.
   205  
   206  The glob pattern defines how the input data can be broken up if you want
   207  to distribute computation. `/*` means that each file can be
   208  processed individually, which makes sense for images. Glob patterns are
   209  one of the most powerful features in Pachyderm.
   210  
   211  The following text is the Python code run in this pipeline:
   212  
   213  ```python
   214  import cv2
   215  import numpy as np
   216  from matplotlib import pyplot as plt
   217  import os
   218  
   219  # make_edges reads an image from /pfs/images and outputs the result of running
   220  # edge detection on that image to /pfs/out. Note that /pfs/images and
   221  # /pfs/out are special directories that Pachyderm injects into the container.
   222  def make_edges(image):
   223     img = cv2.imread(image)
   224     tail = os.path.split(image)[1]
   225     edges = cv2.Canny(img,100,200)
   226     plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray')
   227  
   228  # walk /pfs/images and call make_edges on every file found
   229  for dirpath, dirs, files in os.walk("/pfs/images"):
   230     for file in files:
   231         make_edges(os.path.join(dirpath, file))
   232  ```
   233  
   234  The code simply walks over all the images in `/pfs/images`, performs edge
   235  detection, and writes the result to `/pfs/out`.
   236  
   237  `/pfs/images` and `/pfs/out` are special local directories that
   238  Pachyderm creates within the container automatically. All the input data
   239  for a pipeline is stored in `/pfs/<input_repo_name>` and your code
   240  should always write out to `/pfs/out`. Pachyderm automatically
   241  gathers everything you write to `/pfs/out` and version it as this
   242  pipeline output.
   243  
   244  Now, let's create the pipeline in Pachyderm:
   245  
   246  ```shell
   247  pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json
   248  ```
   249  
   250  ### What Happens When You Create a Pipeline
   251  
   252  Creating a pipeline tells Pachyderm to run your code on the data in your
   253  input repo (the HEAD commit) as well as **all future commits** that
   254  occur after the pipeline is created. Our repo already had a commit, so
   255  Pachyderm automatically launched a `job` to process that data.
   256  
   257  The first time Pachyderm runs a pipeline job, it needs to download the
   258  Docker image (specified in the pipeline spec) from the specified Docker
   259  registry (DockerHub in this case). This first run this might take a
   260  minute or so because of the image download, depending on your Internet
   261  connection. Subsequent runs will be much faster.
   262  
   263  You can view the job with:
   264  
   265  ```shell
   266  pachctl list job
   267  ```
   268  
   269  **System response:**
   270  
   271  ```shell
   272  ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   273  0f6a53829eeb4ca193bb7944fe693700 edges    16 seconds ago Less than a second 0       1 + 0 / 1 57.27KiB 22.22KiB success
   274  ```
   275  
   276  Yay! Our pipeline succeeded! Pachyderm creates a corresponding output
   277  repo for every pipeline. This output repo will have the same name as the
   278  pipeline, and all the results of that pipeline will be versioned in this
   279  output repo. In our example, the `edges` pipeline created a repo
   280  called `edges` to store the results.
   281  
   282  ```
   283  pachctl list repo
   284  ```
   285  
   286  **System response:**
   287  
   288  ```shell
   289  NAME   CREATED       SIZE (MASTER)
   290  edges  2 minutes ago 22.22KiB
   291  images 5 minutes ago 57.27KiB
   292  ```
   293  
   294  ### Reading the Output
   295  
   296  We can view the output data from the `edges` repo in the same fashion
   297  that we viewed the input data.
   298  
   299  * On macOS, run:
   300  
   301  ```shell
   302  pachctl get file edges@master:liberty.png | open -f -a Preview.app
   303  ```
   304  
   305  * On Linux 64-bit, run:
   306  
   307  ```shell
   308  pachctl get file edges@master:liberty.png | display
   309  ```
   310  
   311  The output should look similar to:
   312  
   313  ![image](../assets/images/edges-screenshot.png)
   314  
   315  ### Processing More Data
   316  
   317  Pipelines will also automatically process the data from new commits as
   318  they are created. Think of pipelines as being subscribed to any new
   319  commits on their input repo(s). Also similar to Git, commits have a
   320  parental structure that tracks which files have changed. In this case
   321  we are going to be adding more images.
   322  
   323  Let's create two new commits in a parental structure. To do this we
   324  will simply do two more `put file` commands and by specifying `master`
   325  as the branch, it automatically parents our commits onto each other.
   326  Branch names are just references to a particular HEAD commit.
   327  
   328  ```shell
   329  pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
   330  pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
   331  ```
   332  
   333  Adding a new commit of data will automatically trigger the pipeline to
   334  run on the new data we've added. We'll see corresponding jobs get
   335  started and commits to the output "edges" repo. Let's also view our
   336  new outputs.
   337  
   338  View the list of jobs that have started:
   339  
   340  ```shell
   341  pachctl list job
   342  ```
   343  
   344  **System response:**
   345  
   346  ```
   347  ID                                STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   348  81ae47a802f14038b95f8f248cddbed2  7 seconds ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
   349  ce448c12d0dd4410b3a5ae0c0f07e1f9  16 seconds ago Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
   350  490a28be32de491e942372018cd42460  9 minutes ago  35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
   351  ```
   352  
   353  View the output data
   354  
   355  * On macOS, run:
   356  
   357    ```shell
   358    pachctl get file edges@master:AT-AT.png | open -f -a /Applications/Preview.app
   359    pachctl get file edges@master:kitten.png | open -f -a /Applications/Preview.app
   360    ```
   361  
   362  * On Linux, run:
   363  
   364    ```shell
   365    pachctl get file edges@master:AT-AT.png | display
   366    pachctl get file edges@master:kitten.png | display
   367    ```
   368  
   369  ### Adding Another Pipeline
   370  
   371  We have successfully deployed and used a single stage Pachyderm pipeline.
   372  Now, let's add a processing stage to illustrate a multi-stage Pachyderm
   373  pipeline. Specifically, let's add a `montage` pipeline that take our
   374  original and edge detected images and arranges them into a single
   375  montage of images:
   376  
   377  ![image](../assets/images/opencv-liberty-montage.png)
   378  
   379  Below is the pipeline spec for this new pipeline:
   380  
   381  ```json
   382  {
   383    "pipeline": {
   384      "name": "montage"
   385    },
   386    "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
   387    "input": {
   388      "cross": [ {
   389        "pfs": {
   390          "glob": "/",
   391          "repo": "images"
   392        }
   393      },
   394      {
   395        "pfs": {
   396          "glob": "/",
   397          "repo": "edges"
   398        }
   399      } ]
   400    },
   401    "transform": {
   402      "cmd": [ "sh" ],
   403      "image": "v4tech/imagemagick",
   404      "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
   405    }
   406  }
   407  ```
   408  
   409  This `montage` pipeline spec is similar to our `edges` pipeline except
   410  for the following differences:
   411  
   412  1. We are using a different Docker image that
   413  has `imagemagick` installed.
   414  2. We are executing a `sh` command with
   415  `stdin` instead of a python script.
   416  3. We have multiple input data repositories.
   417  
   418  In the `montage` pipeline we are combining our multiple input data
   419  repositories using a `cross` pattern. This `cross` pattern creates a
   420  single pairing of our input images with our edge detected images. There
   421  are several interesting ways to combine data in Pachyderm, which are
   422  discussed
   423  [here](../../reference/pipeline_spec/#input-required)
   424  and
   425  [here](../../concepts/pipeline-concepts/datum/join/).
   426  
   427  We create the `montage` pipeline as before, with `pachctl`:
   428  
   429  ```shell
   430  pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/montage.json
   431  ```
   432  
   433  Pipeline creating triggers a job that generates a montage for all the
   434  current HEAD commits of the input repos:
   435  
   436  ```shell
   437  pachctl list job
   438  ```
   439  
   440  **System response:**
   441  
   442  ```shell
   443  ID                                  STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   444  92cecc40c3144fd5b4e07603bb24b104    45 seconds ago 6 seconds          0       1 + 0 / 1 371.9KiB 1.284MiB success
   445  81ae47a802f14038b95f8f248cddbed2    2 minutes ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
   446  ce448c12d0dd4410b3a5ae0c0f07e1f9    2 minutes ago  Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
   447  490a28be32de491e942372018cd42460    11 minutes ago 35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
   448  ```
   449  
   450  View the generated montage image by running one of
   451  the following commands:
   452  
   453  * On macOS, run:
   454  
   455  ```shell
   456  pachctl get file montage@master:montage.png | open -f -a Preview.app
   457  ```
   458  
   459  * On Linux 64-bit, run:
   460  
   461  ```shell
   462  pachctl get file montage@master:montage.png | display
   463  ```
   464  
   465  ![image](../assets/images/montage-screenshot.png)
   466  
   467  Exploring your DAG in the Pachyderm dashboard
   468  ---------------------------------------------
   469  
   470  When you deployed Pachyderm locally, the Pachyderm Enterprise dashboard
   471  was also deployed by default. This dashboard will let you interactively
   472  explore your pipeline, visualize the structure of the pipeline, explore
   473  your data, debug jobs, etc. To access the dashboard visit
   474  `localhost:30080` in an Internet browser (e.g., Google Chrome). You
   475  should see something similar to this:
   476  
   477  ![image](../assets/images/dashboard1.png)
   478  
   479  Enter your email address if you would like to obtain a free trial token
   480  for the dashboard. Upon entering this trial token, you will be able to
   481  see your pipeline structure and interactively explore the various pieces
   482  of your pipeline as pictured below:
   483  
   484  ![image](../assets/images/dashboard2.png)
   485  
   486  ![image](../assets/images/dashboard3.png)
   487  
   488  Next Steps
   489  ----------
   490  
   491  Pachyderm is now running locally with data and a pipeline! To play with
   492  Pachyderm locally, you can use what you've learned to build on or
   493  change this pipeline. You can also dig in and learn more details about:
   494  
   495  - [Deploying Pachyderm to the cloud or on prem](../deploy-manage/deploy/index.md)
   496  - [Load Your Data into Pachyderm](../how-tos/load-data-into-pachyderm.md)
   497  - [Working with Pipelines](../how-tos/developer-workflow/working-with-pipelines.md)
   498  
   499  We'd love to help and see what you come up with, so submit any
   500  issues/questions you come across on
   501  [GitHub](https://github.com/pachyderm/pachyderm),
   502  [Slack](http://slack.pachyderm.io), or email at <support@pachyderm.io>
   503  if you want to show off anything nifty you've created!