
>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre- and post-2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Spout Pipelines - An introductory example
>![pach_logo](./img/pach_logo.svg) This new implementation of the spout functionality is available in version **1.12 and higher**.

## Intro
A spout is a type of pipeline that ingests
streaming data (message queues, database transaction logs,
event notifications...),
acting as **a bridge
between an external stream of data and Pachyderm's repo**.

For those familiar with enterprise integration patterns,
a Pachyderm spout implements the
*[Polling Consumer](https://www.enterpriseintegrationpatterns.com/patterns/messaging/PollingConsumer.html)*
(it subscribes to a stream of data,
reads its published messages,
then pushes them, in our case, to the spout's output repository).

For more information about spout pipelines,
we recommend reading the following pages in our documentation:

- [Spout](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/pipeline/spout/) concept.
- [Spout](https://docs.pachyderm.com/1.13.x/reference/pipeline_spec/#spout-optional) configuration.

In this example, we emulate the reception
of messages from a third-party messaging system
to focus on the specifics of the spout pipeline.

Note that we used [`python-pachyderm`](https://github.com/pachyderm/python-pachyderm)'s Client to connect to Pachyderm's API.

![spout101](./img/spout101.png)

Feel free to explore more of our examples
to discover how we used a spout to listen
for new S3 object notifications via an Amazon™ SQS queue
or connected to an IMAP email account
to analyze the polarity of its emails.

## Getting ready
***Prerequisites***
- A workspace on [Pachyderm Hub](https://docs.pachyderm.com/1.13.x/hub/hub_getting_started/) (recommended) or Pachyderm running [locally](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/).
- [pachctl command-line tool](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/#install-pachctl) installed, and your context created (i.e., you are logged in).

***Getting started***
- Clone this repo.
- Make sure Pachyderm is running. You should be able to connect to your Pachyderm cluster via the `pachctl` CLI.
Run a quick version check:
```shell
$ pachctl version

COMPONENT           VERSION
pachctl             1.12.0
pachd               1.12.0
```
Ideally, have your pachctl and pachd versions match. At a minimum, you should always use the same major and minor versions of pachctl and pachd.

## Example - Spout 101
***Goal***
In this example,
we will keep generating two random strings,
one of 1KB and one of 2KB,
at intervals varying between 10s and 30s.
Our spout pipeline will actively receive
those events and commit them as text files
to its output repo
using the **put file** command
of the `python-pachyderm` library.
A second pipeline will then process those commits
and log an entry in a separate log file
depending on their size.
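
To make the flow concrete, here is a minimal sketch of what such a spout loop can look like with `python-pachyderm` (the real code lives in [`./src/consumer/main.py`](./src/consumer/main.py); the repo name, file naming, and timing below are illustrative):

```python
import hashlib
import random
import string
import time

import python_pachyderm

# Connect to pachd from inside the pipeline's pod, as the example does.
client = python_pachyderm.Client()

OUTPUT_REPO = "spout"  # assumption: matches the spout pipeline's output repo

while True:
    # Emulate a message arriving from an external stream:
    # a random string of either 1KB or 2KB.
    size = random.choice([1024, 2048])
    message = "".join(random.choices(string.ascii_letters, k=size)).encode()
    file_name = hashlib.sha256(message).hexdigest()

    # Open a commit on the spout's output repo and write the message
    # as a text file named after its hash.
    with client.commit(OUTPUT_REPO, "master") as commit:
        client.put_file_bytes(commit, f"/{file_name}.txt", message)

    # Wait between 10s and 30s before producing the next message.
    time.sleep(random.randint(10, 30))
```

Because the pipeline is a spout, it has no input repository to wait on; the loop itself decides when new data lands in Pachyderm.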

1. **Pipeline input repository**: None

1. **Spout and processing pipelines**: [`spout.json`](./pipelines/spout.json) polls and commits to its output repo using the **put file** functionality of the `python-pachyderm` library. [`processor.json`](./pipelines/processor.json) then reads the files from its spout input repo and logs their content separately depending on their size.

    >![pach_logo](./img/pach_logo.svg) Have a quick look at the source code of our spout pipeline in [`./src/consumer/main.py`](./src/consumer/main.py) and notice that we used `client = python_pachyderm.Client()` to connect to pachd and `client.put_file_bytes` to write files to the spout output repo.

1. **Pipeline output repository**: `spout` will contain one commit per set of received messages. Each message has been written to a txt file named after its hash (for uniqueness). `processor` will contain two files (`1K.txt` and `2K.txt`) listing the messages received according to their size.
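
For reference, the processing step can be as simple as the following sketch. This is hypothetical code, not the example's actual source; it only assumes the usual `/pfs/<input repo>` and `/pfs/out` mounts of a Pachyderm pipeline and the file names described above (see [`processor.json`](./pipelines/processor.json) and the example's source for the real implementation):

```python
import os

INPUT_DIR = "/pfs/spout"  # the spout repo, mounted as this pipeline's input
OUTPUT_DIR = "/pfs/out"   # everything written here becomes the output commit

for name in os.listdir(INPUT_DIR):
    path = os.path.join(INPUT_DIR, name)
    size = os.path.getsize(path)
    # Route each message to 1K.txt or 2K.txt depending on its size.
    log_name = "1K.txt" if size <= 1024 else "2K.txt"
    with open(os.path.join(OUTPUT_DIR, log_name), "a") as log:
        log.write(f"{name}: {size} bytes\n")
```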

***Example walkthrough***

1.  We have a Docker Hub image of this example ready for you.
    However, you can choose to build your own and push it to your repository.

    In the `examples/spouts/spout101` directory,
    make sure to update the
    `CONTAINER_TAG` in the `Makefile` accordingly,
    as well as your pipelines' specifications,
    then run:
    ```shell
    $ make docker-image
    ```
    >![pach_logo](./img/pach_logo.svg) Need a refresher on building, tagging, and pushing your image on Docker Hub? Take a look at this [how-to](https://docs.pachyderm.com/1.13.x/how-tos/developer-workflow/working-with-pipelines/).

1. Let's deploy our spout and processing pipelines:

    Update the `image` field of the `transform` attribute in your pipeline specifications `./pipelines/spout.json` and `./pipelines/processor.json`.

    In the `examples/spouts/spout101` directory, run:
    ```shell
    $ pachctl create pipeline -f ./pipelines/spout.json
    $ pachctl create pipeline -f ./pipelines/processor.json
    ```
    Or, run the following target:
    ```shell
    $ make deploy
    ```
    Your pipelines should all be running:

    ![pipelines](./img/pachctl_list_pipeline.png)

1. Now that the spout pipeline is up, check its output repo once or twice.
    You should be able to see that new commits are coming in.

    ```shell
    $ pachctl list file spout@master
    ```
    ![list_file_spout_master](./img/pachctl_list_file_spout_master.png)

    and some time later...

    ![list_file_spout_master](./img/pachctl_list_file_spout_master_later.png)

    Each of those commits triggers a job in the `processor` pipeline:
    ```shell
    $ pachctl list job
    ```

    ![list_job](./img/pachctl_list_job.png)

    and...

    ![list_job](./img/pachctl_list_job_later.png)

1. Take a look at the output repository of your second pipeline, `processor`:
    ```shell
    $ pachctl list file processor@master
    ```

    ![list_file_processor](./img/pachctl_list_file_processor_master.png)

    New entries keep being added to each of the two log files:

    ![list_file_processor](./img/pachctl_list_file_processor_master_later.png)

    Zoom into one of them:
    ```shell
    $ pachctl get file processor@master:/1K.txt
    ```
    ![get_file_processor_master_1K](./img/pachctl_get_file_processor_master_1K.png)

    ...

    ![get_file_processor_master_1K](./img/pachctl_get_file_processor_master_1K_later.png)

    That is it.

1. When you are done, think about deleting your pipelines.
Remember, a spout pipeline keeps running:

    In the `examples/spouts/spout101` directory, run:
    ```shell
    $ pachctl delete all
    ```
    You will be prompted to make sure the delete is intentional. Yes, it is.

    >![pach_logo](./img/pach_logo.svg) Hub users, try `pachctl delete pipeline --all` and `pachctl delete repo --all`.

    A final check on your pipelines: the list should be empty. You are good to go.
    ```shell
    $ pachctl list pipeline
    ```