>![pach_logo](../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

:warning: **Warning**: This is a tutorial for Pachyderm versions prior to 1.4. This example is deprecated until it has been updated for the latest versions of Pachyderm.

# Quick Start Guide: Web Scraper
In this guide you're going to create a Pachyderm pipeline to scrape web pages.
We'll use a standard Unix tool, `wget`, to do our scraping.

## Setup

This guide assumes that you already have a Pachyderm cluster running and have configured `pachctl` to talk to the cluster. [Installation instructions can be found here](http://pachyderm.readthedocs.io/en/stable/getting_started/local_installation.html).
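
Before moving on, it's worth a quick sanity check that `pachctl` can actually reach your cluster (the version numbers you see will differ from machine to machine):

```shell
# Prints the pachctl client version and the pachd server version;
# if the pachd line errors out, pachctl is not connected to your cluster.
$ pachctl version
```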

## Create a Repo

A `Repo` is the highest level primitive in `pfs`. Like all primitives in pfs, it shares
its name with a primitive in Git and is designed to behave analogously.
Generally, a `repo` should be dedicated to a single source of data such as log
messages from a particular service. Repos are dirt cheap, so don't be shy about
making them very specific.

For this demo we'll simply create a `repo` called
“urls” to hold a list of URLs that we want to scrape.

```shell
$ pachctl create repo urls
$ pachctl list repo
urls
```
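
If you want a bit more detail than `pachctl list repo` gives you, `pachctl inspect repo` shows metadata such as when the repo was created and how much data it currently holds:

```shell
# More detail about a single repo (creation time, size, description)
$ pachctl inspect repo urls
```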

## Start a Commit
Now that we’ve created a `repo` we’ve got a place to add data.
If you try writing to the `repo` right away though, it will fail because you can't write directly to a
`Repo`. In Pachyderm, you write data to an explicit `commit`. Commits are
immutable snapshots of your data, which is what gives Pachyderm its version-control
properties for data. Unlike Git though, commits in Pachyderm must be explicitly
started and finished.

Let's start a new commit in the “urls” repo:
```shell
$ pachctl start commit urls@master
master/0
```

This returns a reference to the brand new commit.
Now if we take a look inside our repo, we can see the new commit:
```shell
$ pachctl list commit urls
master/0
```

Now we can start adding data to the new commit. Data for this example is just a single file containing a list of URLs. We've provided a sample file for you with just three URLs: Google, Reddit, and Imgur.
We're going to write that data as a file called “urls” in pfs.

```shell
# Write sample data into pfs
$ cat examples/scraper/urls | pachctl put file urls@master/0:urls
```

## Finish a Commit

Pachyderm won't let you read data from a commit until the `commit` is `finished`.
This prevents reads from racing with writes. Furthermore, every write
to pfs is atomic. Now let's finish the commit:

```shell
$ pachctl finish commit urls@master/0
```

Now we can view the file:

```shell
$ pachctl get file urls@master/0:urls
www.google.com
www.reddit.com
www.imgur.com
```
However, we've lost the ability to write to this `commit` since finished
commits are immutable. In Pachyderm, a `commit` is always either _write-only_
when it's been started and files are being added, or _read-only_ after it's
finished.
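
You can see this immutability for yourself by trying to write to the commit we just finished; the exact error text depends on your Pachyderm version, but the write should be rejected (a quick sketch, not part of the original example):

```shell
# This should fail, since master/0 has already been finished.
$ echo www.example.com | pachctl put file urls@master/0:urls
```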

## Create a Pipeline

Now that we've got some data in our `repo` it's time to do something with it.
Pipelines are the core primitive for Pachyderm's processing system (pps) and
they're specified with a JSON encoding. We're going to create a pipeline that simply scrapes each of the web pages in “urls.”

```
+----------+     +---------------+
|input data| --> |scrape pipeline|
+----------+     +---------------+
```

The `pipeline` we're creating can be found at [scraper.json](scraper.json). The full content is also below.
```json
{
  "pipeline": {
    "name": "scraper"
  },
  "transform": {
    "cmd": [ "wget",
        "--recursive",
        "--level", "1",
        "--accept", "jpg,jpeg,png,gif,bmp",
        "--page-requisites",
        "--adjust-extension",
        "--span-hosts",
        "--no-check-certificate",
        "--timestamping",
        "--directory-prefix",
        "/pfs/out",
        "--input-file", "/pfs/urls/urls"
    ],
    "acceptReturnCode": [4,5,6,7,8]
  },
  "parallelism": "1",
  "input": {
    "pfs": {
      "repo": "urls"
    }
  }
}
```

In this pipeline, we’re just using `wget` to scrape the content of our input web pages. The `--level` flag controls how deep `wget` will recurse when following links. We currently have it set to 1, which will only scrape the home pages, but you can crank it up later if you want.
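
If you want to see what the transform does before running it in Pachyderm, you can try essentially the same `wget` invocation locally, writing into a scratch directory instead of `/pfs/out` (this local dry run is just for illustration and isn't part of the pipeline):

```shell
# Same flags as the pipeline spec, but output goes to ./scrape-test
$ wget --recursive --level 1 \
       --accept jpg,jpeg,png,gif,bmp \
       --page-requisites --adjust-extension --span-hosts \
       --no-check-certificate --timestamping \
       --directory-prefix scrape-test \
       --input-file examples/scraper/urls
$ ls scrape-test
```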

Another important section to notice is that we read data
from `/pfs/urls/urls` (/pfs/[input_repo_name]) and write data to `/pfs/out/`. `wget` creates a directory for each URL in “urls”, with all of the relevant scrapes as files.

Now let's create the pipeline in Pachyderm:

```shell
$ pachctl create pipeline -f examples/scraper/scraper.json
```

## What Happens When You Create a Pipeline
Creating a `pipeline` tells Pachyderm to run your code on *every* finished
`commit` in a `repo` as well as *all future commits* that happen after the pipeline is
created. Our `repo` already had a `commit` with the file “urls” in it, so Pachyderm will automatically
launch a `job` to scrape those web pages.

You can view the job with:

```shell
$ pachctl list job
ID                                 OUTPUT                                     STATE
09a7eb68995c43979cba2b0d29432073   scraper/2b43def9b52b4fdfadd95a70215e90c9   JOB_STATE_RUNNING
```

Depending on how quickly you do the above, you may see `running` or
`success`.
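
To dig into a specific job, `pachctl inspect job` takes the job ID from the listing above and prints more detail, and `pachctl logs` shows the output of your `wget` run (the ID below is the one from this example; substitute your own, and note that the exact flags can vary between pachctl releases):

```shell
# More detail about the job, then the worker logs for it
$ pachctl inspect job 09a7eb68995c43979cba2b0d29432073
$ pachctl logs --job=09a7eb68995c43979cba2b0d29432073
```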

Pachyderm `job`s are implemented as Kubernetes jobs, so you can also see your job with:

```shell
$ kubectl get job
JOB                                CONTAINER(S)   IMAGE(S)       SELECTOR                                                         SUCCESSFUL
09a7eb68995c43979cba2b0d29432073   user           ubuntu:14.04   app in (09a7eb68995c43979cba2b0d29432073),suite in (pachyderm)   1
```
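
If a job seems stuck, the underlying Kubernetes pods are a good place to look; something along these lines works (pod names will differ on your cluster):

```shell
# Find the pod that ran the job, then read its logs
$ kubectl get pods
$ kubectl logs <pod-name>
```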

Every `pipeline` creates a corresponding `repo` with the same
name where it stores its output results. In our example, the pipeline was named “scraper” so it created a `repo` called “scraper” which contains the final output.


## Reading the Output
There are a couple of different ways to retrieve the output. We can read a single output file from the “scraper” `repo` in the same fashion that we read the input data:

```shell
$ pachctl list file scraper@2b43def9b52b4fdfadd95a70215e90c9:urls
$ pachctl get file scraper@2b43def9b52b4fdfadd95a70215e90c9:urls/www.imgur.com/index.html
```
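
Because `pachctl get file` writes to stdout, you can also redirect a scraped page into a local file and open it in your browser (the commit ID and path below are the ones from this example; use whatever `pachctl list file` shows you):

```shell
$ pachctl get file scraper@2b43def9b52b4fdfadd95a70215e90c9:urls/www.imgur.com/index.html > imgur.html
$ open imgur.html        # use xdg-open on Linux
```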

Using `get file` is good if you know exactly what file you’re looking for, but for this example we want to just see all the scraped pages. One great way to do this is to mount the distributed filesystem locally and then just poke around.

## Mount the Filesystem
First create the mount point:

```shell
$ mkdir ~/pfs
```

And then mount it:

```shell
# We background this process because it blocks.
$ pachctl mount ~/pfs &
```

This mounts pfs at `~/pfs`, and you can inspect the filesystem like you would any
other local filesystem. Try:

```shell
$ ls ~/pfs
urls
scraper
```
You should see the “urls” repo that we created along with the “scraper” repo that the pipeline created.

Now you can simply `ls` and `cd` around the file system. Try pointing your browser at the scraped output files!
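
For example, something along these lines should show the scraped files for one of the sites (the exact paths depend on how `wget` named its output directories), and you can unmount when you're done:

```shell
$ ls ~/pfs/scraper
$ ls ~/pfs/scraper/urls/www.imgur.com    # hypothetical path; check what ls actually shows
$ pachctl unmount ~/pfs
```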


## Processing More Data

Pipelines can be triggered manually, but they will also automatically process the data from new commits as they are
created. Think of pipelines as being subscribed to any new commits that are
finished on their input repo(s).

If we want to re-scrape some of our URLs to see if the sites have changed, we can use the `pachctl update pipeline` command with the `--reprocess` flag:

```shell
$ pachctl update pipeline scraper --reprocess
```

Next, let’s add additional URLs to our input data. We're going to append more URLs from “urls2” to the file “urls.”

We first need to start a new commit to add more data. Similar to Git, commits have a parental
structure that tracks how files change over time. Specifying a parent is
optional when creating a commit (notice we didn't specify a parent when we
created the first commit), but in this case we're going to be adding
more data to the same file “urls.”


Let's create a new commit with our previous commit as the parent (starting another commit on `master` does this automatically):

```shell
$ pachctl start commit urls@master
master/1
```

Append more data to our urls file in the new commit:
```shell
$ cat examples/scraper/urls2 | pachctl put file urls@master/1:urls
```
Finally, we'll want to finish our second commit. After it's finished, we can
read “scraper” from the latest commit to see all the scrapes.

```shell
$ pachctl finish commit urls@master/1
```
Finishing this commit will also automatically trigger the pipeline to run on
the new data we've added. We'll see a corresponding commit to the output
“scraper” repo with data from our newly added sites.

```shell
$ pachctl list commit scraper
```
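
If you'd rather block until that new scrape has finished than poll `pachctl list job`, the 1.x series this example targets has a flush command for exactly this (if your pachctl version doesn't have it, `pachctl list job` works fine too):

```shell
# Waits until every job triggered by urls@master/1 has finished
$ pachctl flush commit urls@master/1
```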

## Next Steps
You've now got a working Pachyderm cluster with data and a pipeline! Here are a few ideas for next steps to expand on your working setup.
- Add a bunch more URLs and crank up the “--level” in the pipeline. You’ll have to delete the old pipeline and re-create it, or give your pipeline a new name.
- Add a new pipeline that does something interesting with the scraper output. Image or text processing could be fun. Just create a pipeline with the scraper repo as an input (see the sketch below).
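
As a starting point for that second idea, here is a rough sketch of what a follow-on pipeline spec could look like, using the same fields as scraper.json above; the name “linecount”, the `sh`/`wc` transform, and the output file are placeholders of our own, not part of the original example:

```shell
# Sketch only: "linecount" and its transform are placeholders for your own pipeline.
$ cat <<'EOF' > linecount.json
{
  "pipeline": {
    "name": "linecount"
  },
  "transform": {
    "cmd": [ "sh" ],
    "stdin": [ "find /pfs/scraper -type f | xargs wc -l > /pfs/out/counts" ]
  },
  "parallelism": "1",
  "input": {
    "pfs": {
      "repo": "scraper"
    }
  }
}
EOF
$ pachctl create pipeline -f linecount.json
```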

We'd love to help and see what you come up with, so submit any issues or questions you come across, or email us at info@pachyderm.io if you want to show off anything nifty you've created!