github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.10.x/getting_started/beginner_tutorial.md

     1  # Beginner Tutorial
     2  
     3  Welcome to the beginner tutorial for Pachyderm! If you have already installed
     4  Pachyderm, this tutorial should take about 15 minutes to complete. This tutorial
     5  introduces basic Pachyderm concepts.
     6  
!!! tip
    If you are new to Pachyderm, try [Pachyderm Shell](../../deploy-manage/manage/pachctl_shell/).
    This handy tool suggests `pachctl` commands as you type and
    helps you learn Pachyderm faster.
    11  
    12  ## Image processing with OpenCV
    13  
    14  This tutorial walks you through the deployment of a Pachyderm pipeline
    15  that performs [edge
    16  detection](https://en.wikipedia.org/wiki/Edge_detection) on a few
    17  images. Thanks to Pachyderm's built-in processing primitives, we can
    18  keep our code simple but still run the pipeline in a
    19  distributed, streaming fashion. Moreover, as new data is added, the
    20  pipeline automatically processes it and outputs the results.
    21  
    22  If you hit any errors not covered in this guide, get help in our [public
    23  community Slack](http://slack.pachyderm.io), submit an issue on
    24  [GitHub](https://github.com/pachyderm/pachyderm), or email us at
    25  <support@pachyderm.io>. We are more than happy to help!
    26  
    27  ### Prerequisites
    28  
    29  This guide assumes that you already have Pachyderm running locally.
    30  If you haven't done so already, install Pachyderm on your local
    31  machine as described in [Local Installation](local_installation.md).
    32  
    33  ### Create a Repo
    34  
A `repo` is the highest-level data primitive in Pachyderm. Like many
things in Pachyderm, it shares its name with a primitive in Git and is
designed to behave analogously. Generally, repos should be dedicated to
a single source of data such as log messages from a particular service,
a users table, or training data for an ML model. Repos are easy to create
and take up little space when empty, so do not worry about making
tons of them.
    42  
    43  For this demo, we create a repo called `images` to hold the
    44  data we want to process:
    45  
    46  ```shell
    47  pachctl create repo images
    48  ```
    49  
    50  Verify that the repository was created:
    51  
    52  ```shell
    53  pachctl list repo
    54  ```
    55  
    56  **System response:**
    57  
    58  ```shell
    59  NAME   CREATED       SIZE (MASTER)
    60  images 7 seconds ago 0B
    61  ```
    62  
    63  This output shows that the repo has been successfully created. Because we
    64  have not added anything to it yet, the size of the repository HEAD commit
    65  on the master branch is 0B.
    66  
    67  ### Adding Data to Pachyderm
    68  
Now that we have created a repo, it is time to add some data. In
Pachyderm, you write data to an explicit `commit`. Commits are immutable
snapshots of your data, which give Pachyderm its version control properties.
You can add, remove, or update `files` in a given commit.
    73  
    74  Let's start by just adding a file, in this case an image, to a new
    75  commit. We have provided some sample images for you that we host on
    76  Imgur.
    77  
Use the `pachctl put file` command along with the `-f` flag. The `-f` flag can
take a local file, a URL, or an object storage bucket, which it
scrapes automatically. In this case, we simply pass the URL.
    81  
    82  Unlike Git, commits in Pachyderm must be explicitly started and finished
    83  as they can contain huge amounts of data and we do not want that much
    84  *dirty* data hanging around in an unpersisted state. `pachctl put file`
    85  automatically starts and finishes a commit for you so you can add files
    86  more easily. If you want to add many files over a period of time, you
    87  can do `pachctl start commit` and `pachctl finish commit` yourself.
    88  
    89  We also specify the repo name `"images"`, the branch name `"master"`,
    90  and the file name: `"liberty.png"`.
    91  
    92  Here is an example atomic commit of the file `liberty.png` to the
    93  `images` repo `master` branch:
    94  
    95  ```shell
    96  pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png
    97  ```
    98  
    99  We can check to make sure the data we just added is in Pachyderm.
   100  
   101  * Use the `pachctl list repo` command to check that data has been added:
   102  
   103    ```shell
   104    pachctl list repo
   105    ```
   106  
   107    **System response:**
   108  
   109    ```
   110    NAME   CREATED            SIZE (MASTER)
   111    images About a minute ago 57.27KiB
   112    ```
   113  
   114  * View the commit that was just created:
   115  
   116    ```shell
   117    pachctl list commit images
   118    ```
   119  
   120    **System response:**
   121  
   122    ```
   123    REPO   COMMIT                           PARENT STARTED        DURATION           SIZE
   124    images d89758a7496a4c56920b0eaa7d7d3255 <none> 29 seconds ago Less than a second 57.27KiB
   125    ```
   126  
   127  * View the file in that commit:
   128  
   129    ```shell
   130    pachctl list file images@master
   131    ```
   132  
   133    **System response:**
   134  
   135    ```
   136    COMMIT                           NAME         TYPE COMMITTED          SIZE 
   137    d89758a7496a4c56920b0eaa7d7d3255 /liberty.png file About a minute ago 57.27KiB
   138    ```
   139  
   140  Also, you can view the file you have just added to Pachyderm. Because this is an
   141  image, you cannot just print it out in the terminal, but the following
   142  commands will let you view it easily:
   143  
   144  * On macOS prior to Catalina, run:
   145  
   146  ```
   147  pachctl get file images@master:liberty.png | open -f -a /Applications/Preview.app
   148  ```
   149  
   150  * On macOS Catalina, run:
   151  
   152  ```
   153  pachctl get file images@master:liberty.png | open -f -a /System/Applications/Preview.app
   154  ```
   155  
   156  * On Linux 64-bit, run:
   157  
   158  ```
   159  pachctl get file images@master:liberty.png | display
   160  ```
   161  
   162  ### Create a Pipeline
   163  
   164  Now that you have some data in your repo, it is time to do something
   165  with it. Pipelines are the core processing primitive in Pachyderm and
   166  you can define them with a JSON encoding. For this example, we have
   167  already created the pipeline for you and you can find the [code on
   168  GitHub](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv).
   169  
   170  When you want to create your own pipelines later, you can refer to the
   171  full [Pipeline Specification](../../reference/pipeline_spec) to use
   172  more advanced options. Options include building your own code into a
   173  container instead of the pre-built Docker image that we are
   174  using in this tutorial.
   175  
   176  For now, we are going to create a single pipeline that takes in images
   177  and does some simple edge detection.
   178  
   179  ![image](../assets/images/opencv-liberty.png)
   180  
Below are the pipeline spec (`edges.json`) and the Python code that we are
using. Let's walk through the details.

```json
{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}
```
   203  
Our pipeline spec contains a few simple sections. First is the pipeline
`name`, `edges`. Then we have the `transform`, which specifies the Docker
image we want to use, `pachyderm/opencv` (Docker Hub is the default
registry), and the entry point, `edges.py`. Lastly, we specify the input.
Here we have only one PFS input, our `images` repo, with a particular glob
pattern.
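
As a rough sketch of how these sections fit together, the same spec can be held as a Python dictionary and serialized to JSON. The field names and values are copied from `edges.json`. The `summarize` helper is hypothetical and exists only for this illustration; it is not part of Pachyderm.

```python
import json

# Field names and values are copied from the edges.json spec in this tutorial.
edges_spec = {
    "pipeline": {"name": "edges"},
    "description": "A pipeline that performs image edge detection by using the OpenCV library.",
    "transform": {
        "cmd": ["python3", "/edges.py"],
        "image": "pachyderm/opencv",
    },
    "input": {
        "pfs": {"repo": "images", "glob": "/*"},
    },
}


def summarize(spec):
    """Hypothetical helper: describe a single-PFS-input spec in one line."""
    pfs = spec["input"]["pfs"]
    return "{name}: runs '{cmd}' in {image} on repo {repo} with glob {glob}".format(
        name=spec["pipeline"]["name"],
        cmd=" ".join(spec["transform"]["cmd"]),
        image=spec["transform"]["image"],
        repo=pfs["repo"],
        glob=pfs["glob"],
    )


if __name__ == "__main__":
    print(summarize(edges_spec))
    # JSON text that could be saved to a file and passed to `pachctl create pipeline -f`.
    print(json.dumps(edges_spec, indent=2))
```

Generating specs this way can be convenient when several pipelines differ only in a field or two, but a hand-written JSON file works just as well for this tutorial.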
   210  
The glob pattern defines how the input data can be broken up if we want
to distribute our computation. `/*` means that each file can be
processed individually, which makes sense for images. Glob patterns are
one of the most powerful features in Pachyderm.
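
To make the effect of a glob pattern concrete, here is a simplified sketch that models only the two patterns used in this tutorial, `/` and `/*`. It is an illustration, not Pachyderm's actual matching logic, and the function name `split_into_datums` is made up for this example. The file names are the ones added to the `images` repo over the course of the tutorial.

```python
def split_into_datums(paths, glob):
    """Group file paths into independently processable units.

    Simplified model for illustration only:
      "/"  -> the whole repo is one unit
      "/*" -> each top-level file or directory is its own unit
    """
    if glob == "/":
        return [sorted(paths)]
    if glob == "/*":
        groups = {}
        for path in paths:
            # The first path component decides which unit a file belongs to.
            top = "/" + path.lstrip("/").split("/", 1)[0]
            groups.setdefault(top, []).append(path)
        return [sorted(groups[key]) for key in sorted(groups)]
    raise ValueError("this sketch only models the '/' and '/*' patterns")


if __name__ == "__main__":
    files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]
    print(split_into_datums(files, "/*"))  # three units, one per image
    print(split_into_datums(files, "/"))   # one unit holding all three images
```

With `/*`, each image can be handed to a different worker. With `/`, the whole repo has to be processed together.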
   215  
   216  The following text is the Python code that we run in this pipeline:
   217  
```python
# edges.py
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
    img = cv2.imread(image)
    tail = os.path.split(image)[1]
    edges = cv2.Canny(img, 100, 200)
    plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0] + '.png'), edges, cmap='gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
    for file in files:
        make_edges(os.path.join(dirpath, file))
```
   239  
   240  The code simply walks over all the images in `/pfs/images`, performs edge
   241  detection, and writes the result to `/pfs/out`.
   242  
`/pfs/images` and `/pfs/out` are special local directories that
Pachyderm creates within the container automatically. All the input data
for a pipeline is stored in `/pfs/<input_repo_name>`, and your code
should always write its output to `/pfs/out`. Pachyderm automatically
gathers everything you write to `/pfs/out` and versions it as this
pipeline's output.
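
Because the pipeline code only sees ordinary directories, you can reason about its input-to-output mapping without a cluster. The sketch below reproduces the file-naming rule from `edges.py` (keep the base name, write a `.png`) and walks a temporary directory that stands in for `/pfs/images`. The helper names `output_path_for` and `plan_outputs` are hypothetical, and no image processing is performed.

```python
import os
import tempfile


def output_path_for(input_path, out_dir="/pfs/out"):
    """Apply the naming rule from edges.py: same base name, .png extension."""
    tail = os.path.split(input_path)[1]
    return os.path.join(out_dir, os.path.splitext(tail)[0] + ".png")


def plan_outputs(in_dir, out_dir):
    """Walk in_dir as edges.py walks /pfs/images and map each input file to its output path."""
    plan = {}
    for dirpath, _dirs, files in os.walk(in_dir):
        for name in files:
            src = os.path.join(dirpath, name)
            plan[src] = output_path_for(src, out_dir)
    return plan


if __name__ == "__main__":
    # A temporary directory tree stands in for the /pfs mount inside the container.
    with tempfile.TemporaryDirectory() as root:
        in_dir = os.path.join(root, "pfs", "images")
        out_dir = os.path.join(root, "pfs", "out")
        os.makedirs(in_dir)
        os.makedirs(out_dir)
        open(os.path.join(in_dir, "liberty.png"), "wb").close()  # empty placeholder file
        for src, dst in plan_outputs(in_dir, out_dir).items():
            print(src, "->", dst)
```

Note that two inputs with the same base name but different extensions would map to the same output path under this rule.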
   249  
   250  Now, let's create the pipeline in Pachyderm:
   251  
   252  ```shell
   253  pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json
   254  ```
   255  
   256  ### What Happens When You Create a Pipeline
   257  
   258  Creating a pipeline tells Pachyderm to run your code on the data in your
   259  input repo (the HEAD commit) as well as **all future commits** that
   260  occur after the pipeline is created. Our repo already had a commit, so
   261  Pachyderm automatically launched a `job` to process that data.
   262  
The first time Pachyderm runs a pipeline job, it needs to download the
Docker image (specified in the pipeline spec) from the specified Docker
registry (Docker Hub in this case). This first run might take a
minute or so because of the image download, depending on your Internet
connection. Subsequent runs will be much faster.
   268  
   269  You can view the job with:
   270  
   271  ``` bash
   272  pachctl list job
   273  ```
   274  
   275  **System response:**
   276  
   277  ```shell
   278  ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   279  0f6a53829eeb4ca193bb7944fe693700 edges    16 seconds ago Less than a second 0       1 + 0 / 1 57.27KiB 22.22KiB success
   280  ```
   281  
   282  Yay! Our pipeline succeeded! Pachyderm creates a corresponding output
   283  repo for every pipeline. This output repo will have the same name as the
   284  pipeline, and all the results of that pipeline will be versioned in this
   285  output repo. In our example, the `edges` pipeline created a repo
   286  called `edges` to store the results.
   287  
   288  ``` bash
   289  pachctl list repo
   290  ```
   291  
   292  **System response:**
   293  
   294  ```shell
   295  NAME   CREATED       SIZE (MASTER)
   296  edges  2 minutes ago 22.22KiB
   297  images 5 minutes ago 57.27KiB
   298  ```
   299  
   300  ### Reading the Output
   301  
   302  We can view the output data from the `edges` repo in the same fashion
   303  that we viewed the input data.
   304  
   305  * On macOS prior to Catalina, run:
   306  
   307  ```
   308  pachctl get file edges@master:liberty.png | open -f -a /Applications/Preview.app
   309  ```
   310  
   311  * On macOS Catalina, run:
   312  
   313  ```
   314  pachctl get file edges@master:liberty.png | open -f -a /System/Applications/Preview.app
   315  ```
   316  
   317  * On Linux 64-bit, run:
   318  
   319  ```
   320  pachctl get file edges@master:liberty.png | display
   321  ```
   322  
   323  The output should look similar to:
   324  
   325  ![image](../assets/images/edges-screenshot.png)
   326  
   327  ### Processing More Data
   328  
Pipelines will also automatically process the data from new commits as
they are created. Think of pipelines as being subscribed to any new
commits on their input repo(s). Also similar to Git, commits have a
parental structure that tracks which files have changed. In this case,
we are going to add more images.

Let's create two new commits in a parental structure. To do this, we
run two more `put file` commands. Because we specify `master` as the
branch, Pachyderm automatically parents each new commit onto the previous one.
Branch names are just references to a particular HEAD commit.
   339  
   340  ```shell
   341  pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
   342  pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
   343  ```
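
The parent chain and the idea of a branch as a reference to a HEAD commit can be pictured with a small toy model. This is only a sketch for intuition: the class names and commit IDs are invented, and Pachyderm's real storage is far more involved than copying a tuple of file names.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Commit:
    """Toy commit: an ID, a link to its parent, and the file names visible in it."""
    commit_id: str
    parent: Optional["Commit"] = None
    files: Tuple[str, ...] = ()


class ToyRepo:
    """Toy repo in which each branch name simply points at its HEAD commit."""

    def __init__(self):
        self.branches: Dict[str, Commit] = {}
        self._counter = 0

    def put_file(self, branch, path):
        """Create a new commit on `branch` whose parent is the branch's current HEAD."""
        parent = self.branches.get(branch)
        inherited = parent.files if parent else ()
        files = tuple(sorted(set(inherited) | {path}))
        self._counter += 1
        commit = Commit("commit-%d" % self._counter, parent, files)
        self.branches[branch] = commit  # move the branch reference to the new HEAD
        return commit


def history(commit):
    """Follow parent links from a commit back to the first commit."""
    ids = []
    while commit is not None:
        ids.append(commit.commit_id)
        commit = commit.parent
    return ids


if __name__ == "__main__":
    repo = ToyRepo()
    for name in ["liberty.png", "AT-AT.png", "kitten.png"]:
        repo.put_file("master", name)
    head = repo.branches["master"]
    print(history(head))  # newest commit first
    print(head.files)     # all three files are visible at HEAD
```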
   344  
Adding a new commit of data automatically triggers the pipeline to
run on the new data we've added. We'll see corresponding jobs start
and new commits appear in the output `edges` repo. Let's also view our
new outputs.
   349  
   350  View the list of jobs that have started:
   351  
   352  ``` bash
   353  pachctl list job
   354  ```
   355  
   356  **System response:**
   357  
   358  ```
   359  ID                                STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   360  81ae47a802f14038b95f8f248cddbed2  7 seconds ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
   361  ce448c12d0dd4410b3a5ae0c0f07e1f9  16 seconds ago Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
   362  490a28be32de491e942372018cd42460  9 minutes ago  35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
   363  ```
   364  
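View the output data:

* On macOS, run the following commands. On macOS Catalina, use
  `/System/Applications/Preview.app` instead of `/Applications/Preview.app`: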
   368  
   369    ```shell
   370    pachctl get file edges@master:AT-AT.png | open -f -a /Applications/Preview.app
   371    pachctl get file edges@master:kitten.png | open -f -a /Applications/Preview.app
   372    ```
   373  
   374  * On Linux, run:
   375  
   376    ```shell
   377    pachctl get file edges@master:AT-AT.png | display
   378    pachctl get file edges@master:kitten.png | display
   379    ```
   380  
   381  ### Adding Another Pipeline
   382  
We have successfully deployed and used a single-stage Pachyderm pipeline.
Now, let's add a processing stage to illustrate a multi-stage Pachyderm
pipeline. Specifically, let's add a `montage` pipeline that takes our
original and edge-detected images and arranges them into a single
montage of images:
   388  
   389  ![image](../assets/images/opencv-liberty-montage.png)
   390  
Below is the pipeline spec for this new pipeline, `montage.json`:

```json
{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    },
    {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}
```
   421  
This `montage` pipeline spec is similar to our `edges` pipeline except
for the following differences:

1. We are using a different Docker image that has `imagemagick` installed.
2. We are executing a `sh` command with `stdin` instead of a Python script.
3. We have multiple input data repositories.
   430  
In the `montage` pipeline, we combine our multiple input data
repositories using a `cross` pattern. Because both inputs use the glob
pattern `/`, each repo is passed as a single unit, so this `cross` pattern
creates a single pairing of our input images with our edge-detected images.
There are several interesting ways to combine data in Pachyderm, which are
discussed in the
[pipeline specification](../../reference/pipeline_spec/#input-required)
and the
[join documentation](../../concepts/pipeline-concepts/datum/join/).
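
To see why the glob pattern matters for a `cross` input, the pairing can be sketched as a Cartesian product with Python's `itertools.product`. This is a conceptual model rather than Pachyderm code. The `cross` function below is a stand-in written for this example, and the lists represent how each repo would be split under the two glob patterns seen in this tutorial.

```python
from itertools import product


def cross(*inputs):
    """Return every combination that takes exactly one unit from each input."""
    return list(product(*inputs))


names = ("AT-AT.png", "kitten.png", "liberty.png")

# Glob "/": each repo contributes one unit that contains all of its files.
images_whole = [names]
edges_whole = [names]

# Glob "/*": each repo contributes one unit per file.
images_each = list(names)
edges_each = list(names)

if __name__ == "__main__":
    print(len(cross(images_whole, edges_whole)))  # a single pairing, as in montage.json
    print(len(cross(images_each, edges_each)))    # every image paired with every edge image
```

The `montage` pipeline wants all files at once so that it can build one picture, which is why a single pairing is the right shape for it.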
   439  
   440  We create the `montage` pipeline as before, with `pachctl`:
   441  
   442  ```shell
   443  pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/montage.json
   444  ```
   445  
Creating the pipeline triggers a job that generates a montage for all the
current HEAD commits of the input repos:
   448  
   449  ```shell
   450  pachctl list job
   451  ```
   452  
   453  **System response:**
   454  
   455  ```shell
   456  ID                                  STARTED        DURATION           RESTART PROGRESS  DL       UL       STATE
   457  92cecc40c3144fd5b4e07603bb24b104    45 seconds ago 6 seconds          0       1 + 0 / 1 371.9KiB 1.284MiB success
   458  81ae47a802f14038b95f8f248cddbed2    2 minutes ago  Less than a second 0       1 + 2 / 3 102.4KiB 74.21KiB success
   459  ce448c12d0dd4410b3a5ae0c0f07e1f9    2 minutes ago  Less than a second 0       1 + 1 / 2 78.7KiB  37.15KiB success
   460  490a28be32de491e942372018cd42460    11 minutes ago 35 seconds         0       1 + 0 / 1 57.27KiB 22.22KiB success
   461  ```
   462  
   463  View the generated montage image by running one of
   464  the following commands:
   465  
   466  * On macOS prior to Catalina, run:
   467  
   468  ```
   469  pachctl get file montage@master:montage.png | open -f -a /Applications/Preview.app
   470  ```
   471  
   472  * On macOS Catalina, run:
   473  
   474  ```
   475  pachctl get file montage@master:montage.png | open -f -a /System/Applications/Preview.app
   476  ```
   477  
   478  * On Linux 64-bit, run:
   479  
   480  ```
   481  pachctl get file montage@master:montage.png | display
   482  ```
   483  
   484    ![image](../assets/images/montage-screenshot.png)
   485  
   486  Exploring your DAG in the Pachyderm dashboard
   487  ---------------------------------------------
   488  
   489  When you deployed Pachyderm locally, the Pachyderm Enterprise dashboard
   490  was also deployed by default. This dashboard will let you interactively
   491  explore your pipeline, visualize the structure of the pipeline, explore
   492  your data, debug jobs, etc. To access the dashboard visit
   493  `localhost:30080` in an Internet browser (e.g., Google Chrome). You
   494  should see something similar to this:
   495  
   496  ![image](../assets/images/dashboard1.png)
   497  
   498  Enter your email address if you would like to obtain a free trial token
   499  for the dashboard. Upon entering this trial token, you will be able to
   500  see your pipeline structure and interactively explore the various pieces
   501  of your pipeline as pictured below:
   502  
   503  ![image](../assets/images/dashboard2.png)
   504  
   505  ![image](../assets/images/dashboard3.png)
   506  
   507  Next Steps
   508  ----------
   509  
   510  Pachyderm is now running locally with data and a pipeline! To play with
   511  Pachyderm locally, you can use what you've learned to build on or
   512  change this pipeline. You can also dig in and learn more details about:
   513  
   514  - [Deploying Pachyderm to the cloud or on prem](../deploy-manage/deploy/index.md)
   515  - [Load Your Data into Pachyderm](../how-tos/load-data-into-pachyderm.md)
   516  - [Individual Developer Workflow](../how-tos/individual-developer-workflow.md)
   517  
We'd love to help and to see what you come up with. Submit any
issues or questions you come across on
[GitHub](https://github.com/pachyderm/pachyderm) or
[Slack](http://slack.pachyderm.io), or email us at <support@pachyderm.io>
if you want to show off anything nifty you've created!