github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.11.x/getting_started/beginner_tutorial.md

     1  # Beginner Tutorial
     2  
     3  Welcome to the beginner tutorial for Pachyderm! If you have already installed
     4  Pachyderm, this tutorial should take about 15 minutes to complete. This tutorial
     5  introduces basic Pachyderm concepts.
     6  
     7  !!! tip
     8      If you are new to Pachyderm, try [Pachyderm Shell](../../deploy-manage/manage/pachctl_shell/).
     9      This handy tool suggests `pachctl` commands as you type and
    10      helps you learn Pachyderm faster.
    11  
    12  ## Image processing with OpenCV
    13  
    14  This tutorial walks you through the deployment of a Pachyderm pipeline
    15  that performs [edge
    16  detection](https://en.wikipedia.org/wiki/Edge_detection) on a few
    17  images. Thanks to Pachyderm's built-in processing primitives, we can
    18  keep our code simple but still run the pipeline in a
    19  distributed, streaming fashion. Moreover, as new data is added, the
    20  pipeline automatically processes it and outputs the results.
    21  
    22  If you hit any errors not covered in this guide, get help in our [public
    23  community Slack](http://slack.pachyderm.io), submit an issue on
    24  [GitHub](https://github.com/pachyderm/pachyderm), or email us at
    25  <support@pachyderm.io>. We are more than happy to help!
    26  
    27  ### Prerequisites
    28  
    29  This guide assumes that you already have Pachyderm running locally.
    30  If you haven't done so already, install Pachyderm on your local
    31  machine as described in [Local Installation](local_installation.md).
    32  
    33  ### Create a Repo
    34  
    35  A `repo` is the highest-level data primitive in Pachyderm. Like many
    36  things in Pachyderm, it shares its name with a primitive in Git and is
    37  designed to behave analogously. Generally, repos should be dedicated to
    38  a single source of data such as log messages from a particular service,
    39  a users table, or training data for an ML model. Repos are easy to create
    40  and do not take much space when empty, so do not worry about making
    41  tons of them.
    42  
    43  For this demo, we create a repo called `images` to hold the
    44  data we want to process:
    45  
    46  ```shell
    47  pachctl create repo images
    48  ```
    49  
    50  Verify that the repository was created:
    51  
    52  ```shell
    53  pachctl list repo
    54  ```
    55  
    56  **System response:**
    57  
    58  ```shell
    59  NAME   CREATED       SIZE (MASTER)
    60  images 7 seconds ago 0B
    61  ```
    62  
    63  This output shows that the repo has been successfully created. Because we
    64  have not added anything to it yet, the size of the repository HEAD commit
    65  on the master branch is 0B.
    66  
    67  ### Adding Data to Pachyderm
    68  
    69  Now that we have created a repo it is time to add some data. In
    70  Pachyderm, you write data to an explicit `commit`. Commits are immutable
    71  snapshots of your data which give Pachyderm its version control properties.
    72  You can add, remove, or update `files` in a given commit.
    73  
    74  Let's start by just adding a file, in this case an image, to a new
    75  commit. We have provided some sample images for you that we host on
    76  Imgur.
    77  
    78  Use the `pachctl put file` command along with the `-f` flag. The `-f` flag can
    79  take a local file, a URL, or an object storage bucket, which it
    80  scrapes automatically. In this case, we simply pass the URL.
    81  
    82  Unlike Git, commits in Pachyderm must be explicitly started and finished
    83  as they can contain huge amounts of data and we do not want that much
    84  *dirty* data hanging around in an unpersisted state. `pachctl put file`
    85  automatically starts and finishes a commit for you so you can add files
    86  more easily. If you want to add many files over a period of time, you
    87  can run `pachctl start commit` and `pachctl finish commit` yourself.
    88  
    89  We also specify the repo name `"images"`, the branch name `"master"`,
    90  and the file name: `"liberty.png"`.
    91  
    92  Here is an example atomic commit of the file `liberty.png` to the
    93  `images` repo `master` branch:
    94  
    95  ```shell
    96  pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png
    97  ```
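
For the case mentioned earlier, where many files go into one commit, the explicit start and finish steps can be scripted. The following is a minimal sketch that only builds the `pachctl` command lines; the helper name `commit_commands` and the local file paths are hypothetical and exist only for illustration.

```python
import shlex


def commit_commands(repo, branch, files):
    """Return the pachctl invocations for one explicit commit.

    `files` maps a destination path in the repo to a local path or URL.
    """
    ref = f"{repo}@{branch}"
    commands = [["pachctl", "start", "commit", ref]]
    for dest, source in files.items():
        commands.append(["pachctl", "put", "file", f"{ref}:{dest}", "-f", source])
    commands.append(["pachctl", "finish", "commit", ref])
    return commands


# Hypothetical local files, used only to show the shape of the commands.
for argv in commit_commands("images", "master", {"a.png": "./a.png", "b.png": "./b.png"}):
    print(shlex.join(argv))
```

To execute the commands rather than print them, each list could be passed to `subprocess.run(argv, check=True)` on a machine where `pachctl` is installed and connected to a cluster.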
    98  
    99  We can check to make sure the data we just added is in Pachyderm.
   100  
   101  * Use the `pachctl list repo` command to check that data has been added:
   102  
   103    ```shell
   104    pachctl list repo
   105    ```
   106  
   107    **System response:**
   108  
   109    ```
   110    NAME   CREATED            SIZE (MASTER)
   111    images About a minute ago 57.27KiB
   112    ```
   113  
   114  * View the commit that was just created:
   115  
   116    ```shell
   117    pachctl list commit images
   118    ```
   119  
   120    **System response:**
   121  
   122    ```
   123    REPO   COMMIT                           PARENT STARTED        DURATION           SIZE
   124    images d89758a7496a4c56920b0eaa7d7d3255 <none> 29 seconds ago Less than a second 57.27KiB
   125    ```
   126  
   127  * View the file in that commit:
   128  
   129    ```shell
   130    pachctl list file images@master
   131    ```
   132  
   133    **System response:**
   134  
   135    ```
   136    COMMIT                           NAME         TYPE COMMITTED          SIZE 
   137    d89758a7496a4c56920b0eaa7d7d3255 /liberty.png file About a minute ago 57.27KiB
   138    ```
   139  
   140  Also, you can view the file you have just added to Pachyderm. Because this is an
   141  image, you cannot just print it out in the terminal, but the following
   142  commands will let you view it easily:
   143  
   144  
   145  * On macOS prior to Catalina, run:
   146  
   147      ```
   148      pachctl get file images@master:liberty.png | open -f -a /Applications/Preview.app
   149      ```
   150  
   151  * On macOS Catalina or later, run:
   152  
   153      ```
   154      pachctl get file images@master:liberty.png | open -f -a /System/Applications/Preview.app
   155      ```
   156  
   157  * On Linux 64-bit, run:
   158  
   159      ```
   160      pachctl get file images@master:liberty.png | display
   161      ```
   162  
   163  ### Create a Pipeline
   164  
   165  Now that you have some data in your repo, it is time to do something
   166  with it. Pipelines are the core processing primitive in Pachyderm and
   167  you can define them with a JSON encoding. For this example, we have
   168  already created the pipeline for you and you can find the [code on
   169  GitHub](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv).
   170  
   171  When you want to create your own pipelines later, you can refer to the
   172  full [Pipeline Specification](../../reference/pipeline_spec) to use
   173  more advanced options. Options include building your own code into a
   174  container instead of the pre-built Docker image that we are
   175  using in this tutorial.
   176  
   177  For now, we are going to create a single pipeline that takes in images
   178  and does some simple edge detection.
   179  
   180  ![image](../assets/images/opencv-liberty.png)
   181  
   182  Below is the pipeline spec and Python code that we are using. Let's walk
   183  through the details.
   184  
   185  ```shell
   186  # edges.json
   187  {
   188    "pipeline": {
   189      "name": "edges"
   190    },
   191    "description": "A pipeline that performs image edge detection by using the OpenCV library.",
   192    "transform": {
   193      "cmd": [ "python3", "/edges.py" ],
   194      "image": "pachyderm/opencv"
   195    },
   196    "input": {
   197      "pfs": {
   198        "repo": "images",
   199        "glob": "/*"
   200      }
   201    }
   202  }
   203  ```
   204  
   205  Our pipeline spec contains a few simple sections. First is the pipeline
   206  `name`, `edges`. Then we have the `transform`, which specifies the Docker
   207  image we want to use, `pachyderm/opencv` (defaults to Docker Hub as the
   208  registry), and the entry point `edges.py`. Lastly, we specify the input.
   209  Here we only have one PFS input, our `images` repo with a particular glob
   210  pattern.
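
If you prefer to generate the spec rather than write it by hand, the same three sections can be built as a plain dictionary and serialized with the standard `json` module. This is a sketch only: the helper names `edges_spec` and `check_spec` are hypothetical, and the check covers just the sections discussed here, not the full pipeline specification.

```python
import json


def edges_spec():
    """Build the edges pipeline spec from this tutorial as a dictionary."""
    return {
        "pipeline": {"name": "edges"},
        "description": "A pipeline that performs image edge detection by using the OpenCV library.",
        "transform": {
            "cmd": ["python3", "/edges.py"],
            "image": "pachyderm/opencv",
        },
        "input": {"pfs": {"repo": "images", "glob": "/*"}},
    }


def check_spec(spec):
    """Check that the sections discussed in the text exist; not a full validator."""
    missing = [key for key in ("pipeline", "transform", "input") if key not in spec]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return spec["pipeline"]["name"]


spec = edges_spec()
check_spec(spec)
print(json.dumps(spec, indent=2))
```

The printed JSON could be saved to a file and passed to `pachctl create pipeline -f`.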
   211  
   212  The glob pattern defines how the input data can be broken up if we want
   213  to distribute our computation. `/*` means that each file can be
   214  processed individually, which makes sense for images. Glob patterns are
   215  one of the most powerful features in Pachyderm.
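
To build intuition for what a glob pattern does, the sketch below groups a list of file paths into units of work (Pachyderm calls them datums). It is a deliberately simplified model that handles only `/` and single-level patterns such as `/*`; the function `split_into_datums` is hypothetical and is not how Pachyderm implements globbing.

```python
from fnmatch import fnmatchcase


def split_into_datums(paths, glob):
    """Group file paths into datums, one per top-level match of the glob.

    Simplified model: "/" puts the whole repo in one datum, and a
    single-level pattern such as "/*" yields one datum per top-level entry.
    """
    if glob == "/":
        return [sorted(paths)]
    datums = {}
    for path in sorted(paths):
        # The top-level entry that this path belongs to, for example "/kitten.png".
        top = "/" + path.strip("/").split("/")[0]
        if fnmatchcase(top, glob):
            datums.setdefault(top, []).append(path)
    return list(datums.values())


files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]
print(len(split_into_datums(files, "/*")))  # one datum per image
print(len(split_into_datums(files, "/")))   # a single datum with every image
```

With `/*`, each image can be handed to a different worker; with `/`, all images are processed together.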
   216  
   217  The following text is the Python code that we run in this pipeline:
   218  
   219  ``` python
   220  # edges.py
   221  import cv2
   222  import numpy as np
   223  from matplotlib import pyplot as plt
   224  import os
   225  
   226  # make_edges reads an image from /pfs/images and outputs the result of running
   227  # edge detection on that image to /pfs/out. Note that /pfs/images and
   228  # /pfs/out are special directories that Pachyderm injects into the container.
   229  def make_edges(image):
   230     img = cv2.imread(image)
   231     tail = os.path.split(image)[1]
   232     edges = cv2.Canny(img,100,200)
   233     plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray')
   234  
   235  # walk /pfs/images and call make_edges on every file found
   236  for dirpath, dirs, files in os.walk("/pfs/images"):
   237     for file in files:
   238         make_edges(os.path.join(dirpath, file))
   239  ```
   240  
   241  The code simply walks over all the images in `/pfs/images`, performs edge
   242  detection, and writes the result to `/pfs/out`.
   243  
   244  `/pfs/images` and `/pfs/out` are special local directories that
   245  Pachyderm creates within the container automatically. All the input data
   246  for a pipeline is stored in `/pfs/<input_repo_name>` and your code
   247  should always write out to `/pfs/out`. Pachyderm automatically
   248  gathers everything you write to `/pfs/out` and versions it as this
   249  pipeline's output.
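
The mapping from an input path to an output path can be isolated from the OpenCV code and checked on its own. The helper `output_path` below is a hypothetical extraction of the path logic in `make_edges`; it uses `posixpath` because the paths live inside a Linux container.

```python
import posixpath


def output_path(input_path, out_dir="/pfs/out"):
    """Return where edges.py writes the result for one input file."""
    tail = posixpath.split(input_path)[1]
    # The original extension is replaced with .png, as in make_edges.
    return posixpath.join(out_dir, posixpath.splitext(tail)[0] + ".png")


print(output_path("/pfs/images/liberty.png"))
```

Note that only the file name is kept, so an input in a subdirectory of `/pfs/images` would be written to the top level of `/pfs/out`.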
   250  
   251  Now, let's create the pipeline in Pachyderm:
   252  
   253  ```shell
   254  pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json
   255  ```
   256  
   257  ### What Happens When You Create a Pipeline
   258  
   259  Creating a pipeline tells Pachyderm to run your code on the data in your
   260  input repo (the HEAD commit) as well as **all future commits** that
   261  occur after the pipeline is created. Our repo already had a commit, so
   262  Pachyderm automatically launched a `job` to process that data.
   263  
   264  The first time Pachyderm runs a pipeline job, it needs to download the
   265  Docker image (specified in the pipeline spec) from the specified Docker
   266  registry (Docker Hub in this case). The first run might take a
   267  minute or so because of the image download, depending on your Internet
   268  connection. Subsequent runs will be much faster.
   269  
   270  You can view the job with:
   271  
   272  ``` bash
   273  pachctl list job
   274  ```
   275  
   276  **System response:**
   277  
   278  ```shell
   279  ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   280  0f6a53829eeb4ca193bb7944fe693700 edges    16 seconds ago Less than a second 0       1 + 0 / 1 57.27KiB 22.22KiB success
   281  ```
   282  
   283  Yay! Our pipeline succeeded! Pachyderm creates a corresponding output
   284  repo for every pipeline. This output repo will have the same name as the
   285  pipeline, and all the results of that pipeline will be versioned in this
   286  output repo. In our example, the `edges` pipeline created a repo
   287  called `edges` to store the results.
   288  
   289  ``` bash
   290  pachctl list repo
   291  ```
   292  
   293  **System response:**
   294  
   295  ```shell
   296  NAME   CREATED       SIZE (MASTER)
   297  edges  2 minutes ago 22.22KiB
   298  images 5 minutes ago 57.27KiB
   299  ```
   300  
   301  ### Reading the Output
   302  
   303  We can view the output data from the `edges` repo in the same fashion
   304  that we viewed the input data.
   305  
   306  * On macOS prior to Catalina, run:
   307  
   308     ```
   309     pachctl get file edges@master:liberty.png | open -f -a /Applications/Preview.app
   310     ```
   311  
   312  * On macOS Catalina or later, run:
   313  
   314     ```
   315     pachctl get file edges@master:liberty.png | open -f -a /System/Applications/Preview.app
   316     ```
   317  
   318  * On Linux 64-bit, run:
   319  
   320     ```
   321     pachctl get file edges@master:liberty.png | display
   322     ```
   323  
   324  The output should look similar to:
   325  
   326  ![image](../assets/images/edges-screenshot.png)
   327  
   328  ### Processing More Data
   329  
   330  Pipelines will also automatically process the data from new commits as
   331  they are created. Think of pipelines as being subscribed to any new
   332  commits on their input repo(s). Also similar to Git, commits have a
   333  parental structure that tracks which files have changed. In this case
   334  we are going to be adding more images.
   335  
   336  Let's create two new commits in a parental structure. To do this we
   337  will simply run two more `put file` commands. Because we specify `master`
   338  as the branch, Pachyderm automatically parents our commits onto each other.
   339  Branch names are just references to a particular HEAD commit.
   340  
   341  ```shell
   342  pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
   343  pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
   344  ```
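
The parent structure that these two commands create can be pictured with a small in-memory model. This is a toy sketch, not Pachyderm's storage format: the `Commit` and `Repo` classes are hypothetical, and each `put_file` call stands in for one automatically started and finished commit.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot that points at its parent commit."""
    id: str
    parent: Optional[str]
    files: Tuple[str, ...]


class Repo:
    def __init__(self):
        self.commits: Dict[str, Commit] = {}
        # A branch name is just a reference to a HEAD commit ID.
        self.branches: Dict[str, str] = {}

    def put_file(self, branch, name):
        """Create a new commit on `branch` whose parent is the current HEAD."""
        parent_id = self.branches.get(branch)
        parent_files = self.commits[parent_id].files if parent_id else ()
        files = tuple(sorted(set(parent_files) | {name}))
        commit = Commit(uuid.uuid4().hex, parent_id, files)
        self.commits[commit.id] = commit
        self.branches[branch] = commit.id  # move HEAD to the new commit
        return commit


images = Repo()
first = images.put_file("master", "liberty.png")
second = images.put_file("master", "AT-AT.png")
third = images.put_file("master", "kitten.png")
print(third.parent == second.id, images.branches["master"] == third.id)
```

Each new commit records the previous HEAD as its parent, and the `master` entry moves forward to the newest commit.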
   345  
   346  Adding a new commit of data will automatically trigger the pipeline to
   347  run on the new data we've added. We'll see corresponding jobs get
   348  started and commits to the output "edges" repo. Let's also view our
   349  new outputs.
   350  
   351  View the list of jobs that have started:
   352  
   353  ``` bash
   354  pachctl list job
   355  ```
   356  
   357  **System response:**
   358  
   359  ```
   360  ID                                STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   361  81ae47a802f14038b95f8f248cddbed2  7 seconds ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
   362  ce448c12d0dd4410b3a5ae0c0f07e1f9  16 seconds ago Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
   363  490a28be32de491e942372018cd42460  9 minutes ago  35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
   364  ```
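
The PROGRESS column is worth a closer look. Assuming it reads as `processed + skipped / total`, the values above suggest that each new job processed only the newly added image and skipped the ones already handled. The sketch below parses such a value; the `Progress` type and `parse_progress` function are hypothetical helpers, not part of `pachctl`.

```python
import re
from typing import NamedTuple


class Progress(NamedTuple):
    processed: int
    skipped: int
    total: int


# Assumed layout of the column: "<processed> + <skipped> / <total>".
_PROGRESS = re.compile(r"^\s*(\d+)\s*\+\s*(\d+)\s*/\s*(\d+)\s*$")


def parse_progress(text):
    """Parse a PROGRESS value such as "1 + 2 / 3" into its three counts."""
    match = _PROGRESS.match(text)
    if match is None:
        raise ValueError(f"unrecognised PROGRESS value: {text!r}")
    return Progress(*(int(group) for group in match.groups()))


for value in ("1 + 0 / 1", "1 + 1 / 2", "1 + 2 / 3"):
    progress = parse_progress(value)
    print(value, "->", progress.processed, "processed,", progress.skipped, "skipped")
```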
   365  
   366  View the output data:
   367  
   368  * On macOS, run:
   369  
   370    ```shell
   371    pachctl get file edges@master:AT-AT.png | open -f -a /Applications/Preview.app
   372    pachctl get file edges@master:kitten.png | open -f -a /Applications/Preview.app
   373    ```
   374  
   375  * On Linux, run:
   376  
   377    ```shell
   378    pachctl get file edges@master:AT-AT.png | display
   379    pachctl get file edges@master:kitten.png | display
   380    ```
   381  
   382  ### Adding Another Pipeline
   383  
   384  We have successfully deployed and used a single-stage Pachyderm pipeline.
   385  Now, let's add a processing stage to illustrate a multi-stage Pachyderm
   386  pipeline. Specifically, let's add a `montage` pipeline that takes our
   387  original and edge-detected images and arranges them into a single
   388  montage of images:
   389  
   390  ![image](../assets/images/opencv-liberty-montage.png)
   391  
   392  Below is the pipeline spec for this new pipeline:
   393  
   394  ```shell
   395  # montage.json
   396  {
   397    "pipeline": {
   398      "name": "montage"
   399    },
   400    "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
   401    "input": {
   402      "cross": [ {
   403        "pfs": {
   404          "glob": "/",
   405          "repo": "images"
   406        }
   407      },
   408      {
   409        "pfs": {
   410          "glob": "/",
   411          "repo": "edges"
   412        }
   413      } ]
   414    },
   415    "transform": {
   416      "cmd": [ "sh" ],
   417      "image": "v4tech/imagemagick",
   418      "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
   419    }
   420  }
   421  ```
   422  
   423  This `montage` pipeline spec is similar to our `edges` pipeline except
   424  for the following differences:
   425  
   426  1. We are using a different Docker image that
   427  has `imagemagick` installed.
   428  2. We are executing a `sh` command with
   429  `stdin` instead of a Python script.
   430  3. We have multiple input data repositories.
   431  
   432  In the `montage` pipeline we are combining our multiple input data
   433  repositories using a `cross` pattern. This `cross` pattern creates a
   434  single pairing of our input images with our edge-detected images. There
   435  are several interesting ways to combine data in Pachyderm, which are
   436  discussed in the
   437  [pipeline specification](../../reference/pipeline_spec/#input-required)
   438  and the
   439  [join input documentation](../../concepts/pipeline-concepts/datum/join/).
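
A `cross` behaves like a Cartesian product over the datums of each input. The sketch below models this with `itertools.product`; it is a simplified illustration, and the `cross` function here is a hypothetical stand-in for Pachyderm's behavior. Because both inputs use the glob `/`, each repo contributes exactly one datum, which is why there is a single pairing.

```python
from itertools import product


def cross(*inputs):
    """Pair every datum of each input with every datum of the other inputs."""
    return list(product(*inputs))


images = ["liberty.png", "AT-AT.png", "kitten.png"]
edges = ["liberty.png", "AT-AT.png", "kitten.png"]

# Glob "/" as in montage.json: each repo is a single datum holding all files.
whole_repos = cross([tuple(images)], [tuple(edges)])
print(len(whole_repos))

# Hypothetical alternative with glob "/*": one datum per file in each repo.
per_file = cross(images, edges)
print(len(per_file))
```

The first call yields one pairing that contains every file from both repos, which is what the `montage` command needs. The second call yields nine pairings, which would run the command once per image pair instead.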
   440  
   441  We create the `montage` pipeline as before, with `pachctl`:
   442  
   443  ```shell
   444  pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/montage.json
   445  ```
   446  
   447  Creating the pipeline triggers a job that generates a montage for all the
   448  current HEAD commits of the input repos:
   449  
   450  ```shell
   451  pachctl list job
   452  ```
   453  
   454  **System response:**
   455  
   456  ```shell
   457  ID                                  STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   458  92cecc40c3144fd5b4e07603bb24b104    45 seconds ago 6 seconds          0       1 + 0 / 1 371.9KiB 1.284MiB success
   459  81ae47a802f14038b95f8f248cddbed2    2 minutes ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
   460  ce448c12d0dd4410b3a5ae0c0f07e1f9    2 minutes ago  Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
   461  490a28be32de491e942372018cd42460    11 minutes ago 35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
   462  ```
   463  
   464  View the generated montage image by running one of
   465  the following commands:
   466  
   467  
   468  * On macOS prior to Catalina, run:
   469  
   470     ```
   471     pachctl get file montage@master:montage.png | open -f -a /Applications/Preview.app
   472     ```
   473  
   474  
   475  * On macOS Catalina or later, run:
   476  
   477     ```
   478     pachctl get file montage@master:montage.png | open -f -a /System/Applications/Preview.app
   479     ```
   480  
   481  
   482  * On Linux 64-bit, run:
   483  
   484     ```
   485     pachctl get file montage@master:montage.png | display
   486     ```
   487  
   488    ![image](../assets/images/montage-screenshot.png)
   489  
   490  Exploring your DAG in the Pachyderm dashboard
   491  ---------------------------------------------
   492  
   493  When you deployed Pachyderm locally, the Pachyderm Enterprise dashboard
   494  was also deployed by default. This dashboard will let you interactively
   495  explore your pipeline, visualize the structure of the pipeline, explore
   496  your data, debug jobs, etc. To access the dashboard, visit
   497  `localhost:30080` in a web browser (e.g., Google Chrome). You
   498  should see something similar to this:
   499  
   500  ![image](../assets/images/dashboard1.png)
   501  
   502  Enter your email address if you would like to obtain a free trial token
   503  for the dashboard. Upon entering this trial token, you will be able to
   504  see your pipeline structure and interactively explore the various pieces
   505  of your pipeline as pictured below:
   506  
   507  ![image](../assets/images/dashboard2.png)
   508  
   509  ![image](../assets/images/dashboard3.png)
   510  
   511  Next Steps
   512  ----------
   513  
   514  Pachyderm is now running locally with data and a pipeline! To play with
   515  Pachyderm locally, you can use what you've learned to build on or
   516  change this pipeline. You can also dig in and learn more details about:
   517  
   518  - [Deploying Pachyderm to the cloud or on prem](../deploy-manage/deploy/index.md)
   519  - [Load Your Data into Pachyderm](../how-tos/load-data-into-pachyderm.md)
   520  - [Working with Pipelines](../how-tos/developer-workflow/working-with-pipelines.md)
   521  
   522  We'd love to help and see what you come up with. Submit any
   523  issues or questions you come across on
   524  [GitHub](https://github.com/pachyderm/pachyderm) or
   525  [Slack](http://slack.pachyderm.io), or email us at <support@pachyderm.io>
   526  if you want to show off anything nifty you've created!