
>![pach_logo](../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples for pre- and post-2.0 versions are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# Pachyderm Word Count - Map/Reduce 101
>![pach_logo](./img/pach_logo.svg) New to Pachyderm? Start with the [beginner tutorial](https://docs.pachyderm.com/1.13.x/getting_started/beginner_tutorial/).
     6  
In this guide, we will write a classic MapReduce word count application in Pachyderm. A MapReduce job typically splits your input data into independent chunks that are processed in parallel by a `map` pipeline. The outputs of the maps then feed a `reduce` pipeline, which aggregates them.
     8  
- In the first part of this example, we will:
    - **Map**: Extract a list of words occurring in given web pages and create one text file named after each word. Each line in a word's file records the number of occurrences of that word on one page.
    - **Reduce**: Aggregate those numbers to display the total occurrence of each word across all pages considered.
- In the second part, we add an additional web page and witness how the map/reduce pipelines handle the additional words.
    13  
***Table of Contents***
- [1. Getting ready](#1-getting-ready)
- [2. Pipelines setup](#2-pipelines-setup)
- [3. Example part one](#3-example-part-one)
    - [***Step 1*** Create and populate Pachyderm's entry repo and pipelines](#step-1-create-and-populate-pachyderms-entry-repo-and-pipelines)
    - [***Step 2*** Now, let's take a closer look at their content](#step-2-now-lets-take-a-closer-look-at-their-content)
- [4. Expand on the example](#4-expand-on-the-example)
    22  
***Key concepts***
For this example, we recommend being familiar with the following concepts:
- The original Map/Reduce word count example.
- Pachyderm's [file appending strategy](https://docs.pachyderm.com/1.13.x/concepts/data-concepts/file/#file-processing-strategies) -
when you put a file into a Pachyderm repository and a file of the same name already exists, Pachyderm appends the new data to the existing file by default, unless you add the `--overwrite` flag to your command (see the sketch after this list).
- [Parallelism](https://docs.pachyderm.com/1.13.x/concepts/advanced-concepts/distributed_computing/) and [Glob Pattern](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/glob-pattern/) to fine-tune performance.
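
To make the appending behavior concrete, here is a minimal sketch against a scratch repo (the repo name `demo` and the file `notes.txt` are illustrative, not part of this example):
```shell
# the first put creates the file; an identical second put appends by default
$ pachctl create repo demo
$ echo "one" > notes.txt
$ pachctl put file demo@master:/notes.txt -f notes.txt
$ pachctl put file demo@master:/notes.txt -f notes.txt
$ pachctl get file demo@master:/notes.txt
one
one
# --overwrite replaces the file's content instead of appending to it
$ pachctl put file demo@master:/notes.txt -f notes.txt --overwrite
```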
    29  
## 1. Getting ready
***Prerequisite***
- A workspace on [Pachyderm Hub](https://docs.pachyderm.com/1.13.x/hub/hub_getting_started/) (recommended) or Pachyderm running [locally](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/).
- The [pachctl command-line tool](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/#install-pachctl) installed and your context created (i.e., you are logged in).
    34  
***Getting started***
- Clone this repo.
- Make sure Pachyderm is running. You should be able to connect to your Pachyderm cluster via the `pachctl` CLI.
    Run a quick:
    ```shell
    $ pachctl version

    COMPONENT           VERSION
    pachctl             1.12.0
    pachd               1.12.0
    ```
    Ideally, your pachctl and pachd versions should match. At a minimum, always use the same major and minor versions of pachctl and pachd.
- You can run this example as is, or you can build, tag, and push your own image to your Docker Hub.
If so, make sure to update `CONTAINER_TAG` in the `Makefile` as well as in your pipelines' specifications,
then run `make docker-image` (a rough manual equivalent is sketched below).
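
For reference, a rough sketch of what building and pushing your own image looks like (the image name and tag are illustrative; check the `Makefile` for the actual values):
```shell
# build the example image from this directory, then push it to your registry
$ docker build -t <your-dockerhub-user>/word-count:<tag> .
$ docker push <your-dockerhub-user>/word-count:<tag>
```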
    51  
## 2. Pipelines setup
***Goal***
In this example, we will have three successive processing stages (`scraper`, `map`, `reduce`) defined by three pipelines:

![alt text](./img/pachyderm_word_count.png)

1. **Pipeline input repository**: The `urls` repository, in which we will commit files containing URLs.
Each file is named after the site we want to scrape, and its content lists the URLs of the pages considered.
    60  
1. **Pipelines**:
    - [scraper.json](./pipelines/scraper.json) will first retrieve the .html content of the pages at the given URLs.
    - [map](./pipelines/map.json) will then tokenize the words from each page in parallel and record each word's occurrences.
    - Finally, [reduce](./pipelines/reduce.json) will aggregate the total counts for each word across all pages.

    All 3 pipelines, including `reduce`, can be run in a distributed fashion to maximize performance.
    67  
1. **Pipeline output repository**: The output repo `reduce` will contain a list of text files named after the words identified in the pages, each holding the word's cumulative occurrence count. (A minimal pipeline-spec sketch follows this list.)
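
For orientation, here is a minimal sketch of the overall shape of a Pachyderm 1.13 pipeline spec (the image name and command are illustrative placeholders; see the actual files under [./pipelines](./pipelines)):
```json
{
  "pipeline": { "name": "scraper" },
  "transform": {
    "image": "<your-image>",
    "cmd": ["/bin/sh", "-c", "/scrape.sh"]
  },
  "input": {
    "pfs": { "repo": "urls", "glob": "/*" }
  }
}
```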
    69  
## 3. Example part one
### ***Step 1*** Create and populate Pachyderm's entry repo and pipelines

In the `examples/word_count` directory, run:
```shell
$ make wordcount
```
or manually create the entry repo `urls` and add the Wikipedia file:
```shell
$ pachctl create repo urls
$ cd data && pachctl put file urls@master -f Wikipedia
```
The input data [`Wikipedia`](./data/Wikipedia) contains 2 URLs referring to 2 Wikipedia pages.
    83  
... then create your 3 pipelines:
```shell
$ pachctl create pipeline -f pipelines/scraper.json
$ pachctl create pipeline -f pipelines/map.json
$ pachctl create pipeline -f pipelines/reduce.json
```

The entry repository of the first pipeline already contains data to process.
Therefore, creating the pipelines will immediately trigger a series of jobs.
    93  
You should be able to see your jobs running:
```shell
$ pachctl list job
```
```shell
ID                               PIPELINE  STARTED        DURATION           RESTART PROGRESS     DL       UL       STATE

ba77c94678ae401db1a2b58528b74d78 reduce    42 seconds ago -                  0       0 + 0 / 1488 0B       0B       running
7e2ed8e0a8dd49a18f1e45d62942c7ee map       43 seconds ago Less than a second 0       2 + 0 / 2    5.218MiB 4.495KiB success
bc3031c8076c4f91aa1d8e6ba450b096 map_build 50 seconds ago 2 seconds          0       1 + 0 / 1    1.455KiB 2.557MiB success
badd3d81d3ce46358d91bedbb34dd0ed scraper   59 seconds ago 16 seconds         0       1 + 0 / 1    81B      103.4KiB success
```
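
To dig deeper into a particular job or pipeline, two handy commands (the job ID below is a placeholder; use one from your own listing):
```shell
# show the details of a single job
$ pachctl inspect job <job-id>
# stream the logs of a pipeline's workers
$ pachctl logs --pipeline=scraper
```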
   106  
Let's have a look at your repos and check that all entry/output repos have been created:
```shell
$ pachctl list repo
```
```shell
NAME       CREATED        SIZE (MASTER) DESCRIPTION

reduce     13 minutes ago 4.62KiB       Output repo for pipeline reduce.
map        13 minutes ago 4.495KiB      Output repo for pipeline map.
scraper    13 minutes ago 103.4KiB      Output repo for pipeline scraper.
urls       13 minutes ago 81B
```
   119  
### ***Step 2*** Now, let's take a closer look at their content
- Scraper content
    ```shell
    $ pachctl list file scraper@master
    ```
    ```shell
    NAME       TYPE SIZE
    /Wikipedia dir  103.4KiB
    ```
    ```shell
    $ pachctl list file scraper@master:/Wikipedia
    ```
    ```shell
    NAME                      TYPE SIZE
    /Wikipedia/Main_Page.html file 77.53KiB
    /Wikipedia/Pachyderm.html file 25.88KiB
    ```
    We have successfully retrieved the 2 .html pages corresponding to the 2 URLs provided.
   138  
- Map content

    The `map` pipeline counts the number of occurrences of each word it encounters in each of the scraped webpages (see [map.go](./src/map.go)).
    The filename for each word is the word itself.

    ```shell
    $ pachctl list file map@master
    ```
    ```shell
    NAME          TYPE SIZE
    ...
    /liberated   file 2B
    /librarians  file 2B
    /library     file 2B
    /license     file 4B
    /licenses    file 4B
    ...
    ```
    For every word on those pages, there is a separate file.
    Each file contains one line per page, holding the number of times that word appeared on that page.

    ```shell
    $ pachctl get file map@master:/license
    5
    5
    ```
    It looks like the word `license` appeared 5 times on each of the 2 pages considered.
   166  
    By default, Pachyderm will spin up the same number of workers as the number of nodes in your cluster.
    This can be changed in the pipeline specification, as sketched below.
    For more info on controlling the number of workers, check the [Distributed Computing](https://docs.pachyderm.com/1.13.x/concepts/advanced-concepts/distributed_computing/#controlling-the-number-of-workers) page.
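
    As a hedged illustration, pinning a pipeline to a fixed number of workers in a Pachyderm 1.13 spec looks roughly like this (the value 4 is arbitrary):
    ```json
    {
      "parallelism_spec": { "constant": 4 }
    }
    ```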
   170  
- Reduce content

    The final pipeline, `reduce`, aggregates the total count per word.
    Take a look at the words counted:

    ```shell
    $ pachctl list file reduce@master
    ```
    ```shell
    NAME         TYPE SIZE
    ...
    /liberated   file 2B
    /librarians  file 2B
    /library     file 2B
    /license     file 3B
    /licenses    file 2B
    ...
    ```
    For the word `license`, let's confirm that the total count is the sum of the two per-page counts above: 5 + 5.

    ```shell
    $ pachctl get file reduce@master:/license
    10
    ```
## 4. Expand on the example

Now that we have a full end-to-end scraper and word count use case set up, let's add more to it. Go ahead and add one more site to scrape.

```shell
$ cd data && pachctl put file urls@master -f Github
```
Your scraper should automatically start pulling the new site (it won't scrape Wikipedia again).
That will in turn trigger the `map` and `reduce` pipelines to process the new data
and update the word counts for all the sites combined.
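
If you want to block until every downstream pipeline has finished processing the new commit, `pachctl flush commit` (shipped with pachctl 1.x) is one way to do it:
```shell
# waits until all jobs triggered by this commit have completed
$ pachctl flush commit urls@master
```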
   205  
- Scraper content
    ```shell
    $ pachctl list file scraper@master
    ```
    ```shell
    NAME       TYPE SIZE
    /Github    dir  195.1KiB
    /Wikipedia dir  103.4KiB
    ```
    ```shell
    $ pachctl list file scraper@master:/Github
    ```
    ```shell
    NAME                   TYPE SIZE
    /Github/pachyderm.html file 195.1KiB
    ```
    The scraper has added one additional .html page, following the URL provided in the new Github file.
   223  
- Map content
    ```shell
    $ pachctl list file map@master
    ```
    | Then | **NOW** |
    |------|-----|
    |NAME - TYPE - SIZE|NAME - TYPE - SIZE|
    |...|...|
    |/liberated   file 2B|/liberated   file 2B|
    |/librarians  file 2B|/librarians  file 2B|
    | |**/libraries   file 2B**|
    |/library     file 2B|/library     file 2B|
    |/license     file 4B|/license     file **7B**|
    |/licenses    file 4B|/licenses    file 4B|
    |...|...|

    We have highlighted the newcomer (the new word `libraries`) and the change in the size of the `license` file.
   242  
    ```shell
    $ pachctl get file map@master:/license
    23
    5
    5
    ```
    It looks like the word `license` appeared 23 times on our new GitHub page.
   250  
- Reduce content

    ```shell
    $ pachctl list file reduce@master
    ```
    ```shell
    NAME         TYPE SIZE
    ...
    /liberated   file 2B
    /librarians  file 2B
    /libraries   file 2B
    /library     file 2B
    /license     file 3B
    /licenses    file 2B
    ...
    ```
    Let's check that the total count of all occurrences is now 23 + 5 + 5 = 33.

    ```shell
    $ pachctl get file reduce@master:/license
    33
    ```

>![pach_logo](./img/pach_logo.svg) By default, pipelines spin up one worker for each node in your cluster,
but you can choose to set a [different number of workers](https://docs.pachyderm.com/1.13.x/concepts/advanced-concepts/distributed_computing/) in your pipeline specification.
Further, the pipelines are already configured to spread computation across the various workers with `"glob": "/*/*"`.
Check out our [Glob Pattern](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/glob-pattern/) documentation to learn more.
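
To preview how a glob pattern splits a repo's content into datums, one option (assuming your pachctl build includes the `glob file` command) is:
```shell
# list the paths matching the glob, i.e. the datums a pipeline would see
$ pachctl glob file "scraper@master:/*/*"
```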
   279  