github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.9.x/getting_started/beginner_tutorial.md

     1  # Beginner Tutorial
     2  
     3  Welcome to the beginner tutorial for Pachyderm! If you have already installed
     4  Pachyderm, this tutorial should take about 15 minutes to complete. This tutorial
     5  introduces basic Pachyderm concepts.
     6  
     7  ## Image processing with OpenCV
     8  
     9  This tutorial walks you through the deployment of a Pachyderm pipeline
    10  that performs [edge
    11  detection](https://en.wikipedia.org/wiki/Edge_detection) on a few
    12  images. Thanks to Pachyderm's built-in processing primitives, we can
    13  keep our code simple but still run the pipeline in a
    14  distributed, streaming fashion. Moreover, as new data is added, the
    15  pipeline automatically processes it and outputs the results.
    16  
    17  If you hit any errors not covered in this guide, get help in our [public
    18  community Slack](http://slack.pachyderm.io), submit an issue on
    19  [GitHub](https://github.com/pachyderm/pachyderm), or email us at
    20  <support@pachyderm.io>. We are more than happy to help!
    21  
    22  ### Prerequisites
    23  
    24  This guide assumes that you already have Pachyderm running locally.
    25  If you haven't done so already, install Pachyderm on your local
    26  machine as described in [Local Installation](local_installation.md).
    27  
    28  ### Create a Repo
    29  
A `repo` is the highest-level data primitive in Pachyderm. Like many
things in Pachyderm, it shares its name with a primitive in Git and is
designed to behave analogously. Generally, repos should be dedicated to
a single source of data such as log messages from a particular service,
a users table, or training data for an ML model. Repos are easy to create
and take little space when empty, so do not worry about making
many of them.
    37  
    38  For this demo, we create a repo called `images` to hold the
    39  data we want to process:
    40  
```shell
$ pachctl create repo images
$ pachctl list repo
NAME   CREATED       SIZE (MASTER)
images 7 seconds ago 0B
```
    47  
    48  This output shows that the repo has been successfully created. Because we
    49  have not added anything to it yet, the size of the repository HEAD commit
    50  on the master branch is 0B.
    51  
    52  ### Adding Data to Pachyderm
    53  
Now that we have created a repo, it is time to add some data. In
Pachyderm, you write data to an explicit `commit`. Commits are immutable
snapshots of your data, which give Pachyderm its version control properties.
You can add, remove, or update `files` in a given commit.
    58  
    59  Let's start by just adding a file, in this case an image, to a new
    60  commit. We have provided some sample images for you that we host on
    61  Imgur.
    62  
Use the `pachctl put file` command along with the `-f` flag. The `-f` flag can
take a local file, a URL, or an object storage bucket, which it
scrapes automatically. In this case, we simply pass the URL.
    66  
    67  Unlike Git, commits in Pachyderm must be explicitly started and finished
    68  as they can contain huge amounts of data and we do not want that much
    69  *dirty* data hanging around in an unpersisted state. `pachctl put file`
    70  automatically starts and finishes a commit for you so you can add files
    71  more easily. If you want to add many files over a period of time, you
    72  can do `pachctl start commit` and `pachctl finish commit` yourself.
    73  
    74  We also specify the repo name `"images"`, the branch name `"master"`,
    75  and the file name: `"liberty.png"`.
    76  
    77  Here is an example atomic commit of the file `liberty.png` to the
    78  `images` repo `master` branch:
    79  
```shell
$ pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png
```
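
For the multi-file case mentioned above, where you start and finish the commit yourself, the sequence of `pachctl` calls can be sketched in Python. This is only a sketch under stated assumptions: the helper names `commit_commands` and `run_commands` and the local file names are hypothetical, and the code only builds the argument lists unless you explicitly call `run_commands`.

```python
import subprocess

def commit_commands(repo, branch, files):
    """Build the pachctl calls for one explicit commit containing several files.

    `files` maps the destination path in the repo to a local path or URL.
    """
    target = f"{repo}@{branch}"
    commands = [["pachctl", "start", "commit", target]]
    for dest, source in files.items():
        commands.append(["pachctl", "put", "file", f"{target}:{dest}", "-f", source])
    commands.append(["pachctl", "finish", "commit", target])
    return commands

def run_commands(commands):
    """Run each command in order, stopping at the first failure (requires pachctl)."""
    for argv in commands:
        subprocess.run(argv, check=True)

# Hypothetical local files, used only for illustration.
example = commit_commands("images", "master", {"a.png": "./a.png", "b.png": "./b.png"})
for argv in example:
    print(" ".join(argv))
```

With this approach, both files should land in a single commit rather than in one commit per `put file` call.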
    83  
    84  We can check to make sure the data we just added is in Pachyderm.
    85  
* Use the `pachctl list repo` command to check that data has been added:

  ```shell
  $ pachctl list repo
  NAME   CREATED            SIZE (MASTER)
  images About a minute ago 57.27KiB
  ```

* View the commit that was just created:

  ```shell
  $ pachctl list commit images
  REPO   COMMIT                           PARENT STARTED        DURATION           SIZE
  images d89758a7496a4c56920b0eaa7d7d3255 <none> 29 seconds ago Less than a second 57.27KiB
  ```

* View the file in that commit:

  ```shell
  $ pachctl list file images@master
  COMMIT                           NAME         TYPE COMMITTED          SIZE
  d89758a7496a4c56920b0eaa7d7d3255 /liberty.png file About a minute ago 57.27KiB
  ```
   109  
   110  Also, you can view the file you have just added to Pachyderm. Because this is an
   111  image, you cannot just print it out in the terminal, but the following
   112  commands will let you view it easily:
   113  
* If you are on macOS, run:

  ```shell
  $ pachctl get file images@master:liberty.png | open -f -a /Applications/Preview.app
  ```

* If you are on Linux, run:

  ```shell
  $ pachctl get file images@master:liberty.png | display
  ```
   125  
   126  ### Create a Pipeline
   127  
   128  Now that you have some data in your repo, it is time to do something
   129  with it. Pipelines are the core processing primitive in Pachyderm and
   130  you can define them with a JSON encoding. For this example, we have
   131  already created the pipeline for you and you can find the [code on
   132  GitHub](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv).
   133  
   134  When you want to create your own pipelines later, you can refer to the
   135  full [Pipeline Specification](../../reference/pipeline_spec) to use
   136  more advanced options. Options include building your own code into a
   137  container instead of the pre-built Docker image that we are
   138  using in this tutorial.
   139  
   140  For now, we are going to create a single pipeline that takes in images
   141  and does some simple edge detection.
   142  
   143  ![image](../assets/images/opencv-liberty.png)
   144  
Below is the pipeline spec (`edges.json`) and the Python code that we are
using. Let's walk through the details.

```json
{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}
```
   167  
Our pipeline spec contains a few simple sections. First is the pipeline
`name`, `edges`. Then we have the `transform`, which specifies the Docker
image we want to use, `pachyderm/opencv` (the registry defaults to Docker
Hub), and the entry point `edges.py`. Lastly, we specify the input.
Here we have only one PFS input, our `images` repo with a particular glob
pattern.
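
If you prefer to generate specs rather than write them by hand, the sections described above can be assembled in a few lines of Python. This is a minimal sketch: `make_pipeline_spec` is a hypothetical helper, not part of Pachyderm, and it covers only the fields used in this tutorial.

```python
import json

def make_pipeline_spec(name, image, cmd, repo, glob, description=""):
    """Build a single-input pipeline spec with the fields used in this tutorial."""
    spec = {
        "pipeline": {"name": name},
        "transform": {"cmd": list(cmd), "image": image},
        "input": {"pfs": {"repo": repo, "glob": glob}},
    }
    if description:
        spec["description"] = description
    return spec

edges_spec = make_pipeline_spec(
    name="edges",
    image="pachyderm/opencv",
    cmd=["python3", "/edges.py"],
    repo="images",
    glob="/*",
)

# Print the spec as JSON; redirect this to a file to pass it to `pachctl create pipeline -f`.
print(json.dumps(edges_spec, indent=2))
```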
   174  
The glob pattern defines how the input data can be broken up if we want
to distribute our computation. `/*` means that each file can be
processed individually, which makes sense for images. Glob patterns are
one of the most powerful features in Pachyderm.
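
To make the effect of the glob pattern concrete, here is a simplified model of how a pattern splits a repo into independently processable units. It is an illustration only, not Pachyderm's actual matching code: `split_into_datums` is a hypothetical helper that matches path components one at a time with Python's `fnmatch`.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def split_into_datums(paths, glob):
    """Group file paths by the path prefix that `glob` matches.

    Each key in the result stands for one unit of work that could be
    processed independently of the others.
    """
    glob_parts = [part for part in glob.split("/") if part]
    datums = defaultdict(list)
    for path in paths:
        parts = [part for part in path.split("/") if part]
        if len(parts) < len(glob_parts):
            continue  # path is shallower than the glob, so it cannot match
        prefix = parts[:len(glob_parts)]
        if all(fnmatchcase(seg, pat) for seg, pat in zip(prefix, glob_parts)):
            datums["/" + "/".join(prefix)].append(path)
    return dict(datums)

paths = ["/liberty.png", "/AT-AT.png", "/kitten.png"]
per_file = split_into_datums(paths, "/*")
whole_repo = split_into_datums(paths, "/")
print(len(per_file))    # 3: each image is its own unit of work
print(len(whole_repo))  # 1: the whole repo is processed together
```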
   179  
   180  The following text is the Python code that we run in this pipeline:
   181  
``` python
# edges.py
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]
    # Canny edge detection with lower and upper thresholds of 100 and 200
    edges = cv2.Canny(img, 100, 200)
    # Save under the same base name, always with a .png extension
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0] + '.png'), edges, cmap='gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath, file))
```
   203  
   204  The code simply walks over all the images in `/pfs/images`, performs edge
   205  detection, and writes the result to `/pfs/out`.
   206  
`/pfs/images` and `/pfs/out` are special local directories that
Pachyderm creates within the container automatically. All the input data
for a pipeline is stored in `/pfs/<input_repo_name>` and your code
should always write out to `/pfs/out`. Pachyderm automatically
gathers everything you write to `/pfs/out` and versions it as this
pipeline's output.
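
Because the contract is just "read from one directory, write to another", you can exercise the same walk-and-write structure locally before building a container. The sketch below is not the tutorial's code: `process_all` and `copy_as_png` are hypothetical names, temporary directories stand in for `/pfs/images` and `/pfs/out`, and a plain file copy stands in for the OpenCV step.

```python
import os
import shutil
import tempfile

def process_all(input_dir, output_dir, transform):
    """Walk input_dir and call transform(src_path, output_dir) for every file."""
    processed = []
    for dirpath, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            transform(os.path.join(dirpath, name), output_dir)
            processed.append(name)
    return processed

def copy_as_png(src, output_dir):
    """Stand-in transform: copy the file, renaming its extension to .png."""
    stem = os.path.splitext(os.path.basename(src))[0]
    shutil.copyfile(src, os.path.join(output_dir, stem + ".png"))

# Local dry run with temporary directories in place of /pfs/images and /pfs/out.
with tempfile.TemporaryDirectory() as pfs_in, tempfile.TemporaryDirectory() as pfs_out:
    with open(os.path.join(pfs_in, "liberty.jpg"), "wb") as f:
        f.write(b"placeholder bytes, not a real image")
    names = process_all(pfs_in, pfs_out, copy_as_png)
    outputs = sorted(os.listdir(pfs_out))
    print(names, outputs)
```

Keeping the directories as parameters means the same function could later be called with `/pfs/images` and `/pfs/out` inside the pipeline container.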
   213  
   214  Now, let's create the pipeline in Pachyderm:
   215  
```shell
$ pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json
```
   219  
   220  ### What Happens When You Create a Pipeline
   221  
   222  Creating a pipeline tells Pachyderm to run your code on the data in your
   223  input repo (the HEAD commit) as well as **all future commits** that
   224  occur after the pipeline is created. Our repo already had a commit, so
   225  Pachyderm automatically launched a `job` to process that data.
   226  
The first time Pachyderm runs a pipeline job, it needs to download the
Docker image (specified in the pipeline spec) from the specified Docker
registry (Docker Hub in this case). This first run might take a
minute or so because of the image download, depending on your Internet
connection. Subsequent runs will be much faster.
   232  
   233  You can view the job with:
   234  
``` bash
$ pachctl list job
ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
0f6a53829eeb4ca193bb7944fe693700 edges    16 seconds ago Less than a second 0       1 + 0 / 1 57.27KiB 22.22KiB success
```
   240  
   241  Yay! Our pipeline succeeded! Pachyderm creates a corresponding output
   242  repo for every pipeline. This output repo will have the same name as the
   243  pipeline, and all the results of that pipeline will be versioned in this
   244  output repo. In our example, the `edges` pipeline created a repo
   245  called `edges` to store the results.
   246  
``` bash
$ pachctl list repo
NAME   CREATED       SIZE (MASTER)
edges  2 minutes ago 22.22KiB
images 5 minutes ago 57.27KiB
```
   253  
   254  ### Reading the Output
   255  
   256  We can view the output data from the `edges` repo in the same fashion
   257  that we viewed the input data.
   258  
``` bash
# on macOS
$ pachctl get file edges@master:liberty.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get file edges@master:liberty.png | display
```
   266  
   267  The output should look similar to:
   268  
   269  ![image](../assets/images/edges-screenshot.png)
   270  
   271  ### Processing More Data
   272  
Pipelines will also automatically process the data from new commits as
they are created. Think of pipelines as being subscribed to any new
commits on their input repo(s). Also similar to Git, commits have a
parental structure that tracks which files have changed. In this case,
we are going to add more images.

Let's create two new commits in a parental structure. To do this, we
simply run two more `put file` commands. Because we specify `master`
as the branch, each new commit is automatically parented onto the previous one.
Branch names are just references to a particular HEAD commit.
   283  
``` bash
$ pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
$ pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
```
   288  
   289  Adding a new commit of data will automatically trigger the pipeline to
   290  run on the new data we've added. We'll see corresponding jobs get
   291  started and commits to the output "edges" repo. Let's also view our
   292  new outputs.
   293  
``` bash
# view the jobs that were kicked off
$ pachctl list job
ID                                STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
81ae47a802f14038b95f8f248cddbed2  7 seconds ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
ce448c12d0dd4410b3a5ae0c0f07e1f9  16 seconds ago Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
490a28be32de491e942372018cd42460  9 minutes ago  35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
```
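
The `PROGRESS` values in the job list above are consistent with reading the column as "processed + skipped / total": each new commit adds one image, and the images that were already processed are not run again. The following toy model reproduces those numbers. It rests on that assumed reading and is not Pachyderm's scheduler: `plan_job` and `progress` are hypothetical helpers, and the hash strings are placeholders.

```python
def plan_job(previous_files, current_files):
    """Return (to_process, to_skip) for a pipeline with one unit of work per file.

    Both arguments map a file path to a content hash (any string). A file is
    skipped when the same path with the same hash was present in the previous commit.
    """
    to_process, to_skip = [], []
    for path, digest in sorted(current_files.items()):
        if previous_files.get(path) == digest:
            to_skip.append(path)
        else:
            to_process.append(path)
    return to_process, to_skip

def progress(previous_files, current_files):
    """Format the plan in the same shape as the PROGRESS column."""
    to_process, to_skip = plan_job(previous_files, current_files)
    return f"{len(to_process)} + {len(to_skip)} / {len(current_files)}"

# Three successive commits on images@master, each adding one file.
commit1 = {"/liberty.png": "h1"}
commit2 = {**commit1, "/AT-AT.png": "h2"}
commit3 = {**commit2, "/kitten.png": "h3"}

print(progress({}, commit1))       # 1 + 0 / 1
print(progress(commit1, commit2))  # 1 + 1 / 2
print(progress(commit2, commit3))  # 1 + 2 / 3
```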
   302  
``` bash
# View the output data

# on macOS
$ pachctl get file edges@master:AT-AT.png | open -f -a /Applications/Preview.app

$ pachctl get file edges@master:kitten.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get file edges@master:AT-AT.png | display

$ pachctl get file edges@master:kitten.png | display
```
   316  
   317  ### Adding Another Pipeline
   318  
We have successfully deployed and used a single-stage Pachyderm pipeline.
Now, let's add a processing stage to illustrate a multi-stage Pachyderm
pipeline. Specifically, let's add a `montage` pipeline that takes our
original and edge-detected images and arranges them into a single
montage of images:
   324  
   325  ![image](../assets/images/opencv-liberty-montage.png)
   326  
Below is the pipeline spec for this new pipeline, `montage.json`:

```json
{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    },
    {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}
```
   357  
This `montage` pipeline spec is similar to our `edges` pipeline except
for the following differences:

1. We are using a different Docker image that has `imagemagick` installed.
2. We are executing a `sh` command with `stdin` instead of a Python script.
3. We have multiple input data repositories.
   366  
In the `montage` pipeline, we combine our multiple input data
repositories using a `cross` pattern. This `cross` pattern creates a
single pairing of our input images with our edge-detected images. There
are several interesting ways to combine data in Pachyderm, which are
discussed in the
[pipeline specification](../../reference/pipeline_spec/#input-required)
and in the
[join pipeline documentation](../concepts/pipeline-concepts/pipeline/join.md).
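
To see why the `cross` in `montage.json` yields a single pairing, it can help to model a cross as a Cartesian product over the units of work from each input. The sketch below is a simplified illustration, not Pachyderm's implementation; `cross` here is a hypothetical Python helper, and the file lists mirror the three images used in this tutorial.

```python
from itertools import product

def cross(*inputs):
    """Return every combination that takes one unit of work from each input."""
    return list(product(*inputs))

images_files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]
edges_files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]

# Glob "/" (as in montage.json): each repo is one unit containing all of its files.
whole_repo = cross([("images", images_files)], [("edges", edges_files)])
print(len(whole_repo))  # 1

# Glob "/*" (not what montage uses): one unit per file, so 3 x 3 combinations.
per_file = cross(
    [("images", [f]) for f in images_files],
    [("edges", [f]) for f in edges_files],
)
print(len(per_file))  # 9
```

With `/` on both inputs, the single combination gives the `montage` command access to every file from both repos at once, which is what it needs to build one combined image.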
   375  
   376  We create the `montage` pipeline as before, with `pachctl`:
   377  
```shell
$ pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/montage.json
```

Pipeline creation triggers a job that generates a montage for all the
current HEAD commits of the input repos:

```shell
$ pachctl list job
ID                                  STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
92cecc40c3144fd5b4e07603bb24b104    45 seconds ago 6 seconds          0       1 + 0 / 1 371.9KiB 1.284MiB success
81ae47a802f14038b95f8f248cddbed2    2 minutes ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
ce448c12d0dd4410b3a5ae0c0f07e1f9    2 minutes ago  Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
490a28be32de491e942372018cd42460    11 minutes ago 35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
```
   393  
   394  And you can view the generated montage image via:
   395  
``` bash
# on macOS
$ pachctl get file montage@master:montage.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get file montage@master:montage.png | display
```
   403  
   404  ![image](../assets/images/montage-screenshot.png)
   405  
   406  Exploring your DAG in the Pachyderm dashboard
   407  ---------------------------------------------
   408  
   409  When you deployed Pachyderm locally, the Pachyderm Enterprise dashboard
   410  was also deployed by default. This dashboard will let you interactively
   411  explore your pipeline, visualize the structure of the pipeline, explore
   412  your data, debug jobs, etc. To access the dashboard visit
   413  `localhost:30080` in an Internet browser (e.g., Google Chrome). You
   414  should see something similar to this:
   415  
   416  ![image](../assets/images/dashboard1.png)
   417  
   418  Enter your email address if you would like to obtain a free trial token
   419  for the dashboard. Upon entering this trial token, you will be able to
   420  see your pipeline structure and interactively explore the various pieces
   421  of your pipeline as pictured below:
   422  
   423  ![image](../assets/images/dashboard2.png)
   424  
   425  ![image](../assets/images/dashboard3.png)
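
If you want a quick textual view of the same structure, the DAG for this tutorial can be written down as a mapping from each pipeline to the repos it reads. The sketch below is an illustration built from the two pipeline specs above, not something the dashboard or `pachctl` produces; `build_dag` is a hypothetical helper and the ordering uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def build_dag(pipelines):
    """Map each pipeline's output repo to the set of repos it reads from."""
    return {name: set(inputs) for name, inputs in pipelines.items()}

# Inputs taken from the edges and montage pipeline specs in this tutorial.
dag = build_dag({
    "edges": ["images"],
    "montage": ["images", "edges"],
})

# A topological order lists every repo after all of the repos it depends on.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['images', 'edges', 'montage']
```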
   426  
   427  Next Steps
   428  ----------
   429  
   430  Pachyderm is now running locally with data and a pipeline! To play with
   431  Pachyderm locally, you can use what you've learned to build on or
   432  change this pipeline. You can also dig in and learn more details about:
   433  
   434  - [Deploying Pachyderm to the cloud or on prem](../deploy-manage/deploy/index.md)
   435  - [Load Your Data into Pachyderm](../how-tos/load-data-into-pachyderm.md)
   436  - [Individual Developer Workflow](../how-tos/individual-developer-workflow.md)
   437  
We'd love to help and see what you come up with. Submit any
issues or questions you come across on
[GitHub](https://github.com/pachyderm/pachyderm) or
[Slack](http://slack.pachyderm.io), or email us at <support@pachyderm.io>
if you want to show off anything nifty you've created!