github.com/pachyderm/pachyderm@v1.13.4/examples/spouts/go-rabbitmq-spout/README.md

> ![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# Commit messages from RabbitMQ

This is a simple example of using spouts with [RabbitMQ](https://www.rabbitmq.com/) to process messages and write them to files.

The example spout connects to a RabbitMQ instance and reads messages from a queue. The messages are written in batches to text files
in the output repository for downstream processing.

## Prerequisites

The Pachyderm code in this example requires a Pachyderm cluster version 1.12.0 or later and a functioning RabbitMQ deployment.

## Introduction

RabbitMQ is a simple messaging system. It is lightweight and easy to deploy, which makes it well suited to applications that do not need a heavier system.
While it is not itself cloud native, standing up a deployment in Kubernetes is not too challenging. If you need a lightweight message queue
and don't need a full-scale Kafka cluster, RabbitMQ is one possible alternative, depending on your architecture.

Pachyderm spouts are a way to ingest data into Pachyderm
by having your code get the data from inside a Pachyderm pipeline.

This is a simple implementation of a version 2 Pachyderm spout with a few extra features, and it can hopefully serve as the basis
for a more robust spout.

This spout reads messages from a single configurable RabbitMQ queue. The messages are pushed into a local buffer (a Go slice),
which is written to a newline-delimited file (e.g. NDJSON) when the buffer is full or when a user-configurable flush interval elapses. Every new file creates a new
commit on the `COMMIT_BRANCH`. After a commit is finalized, all messages written in that commit are acknowledged at once. This provides
fault tolerance if the pipeline crashes at any point: unacknowledged messages are redelivered by RabbitMQ, so you do not need to save your place or be concerned about the
number of consumers (within RabbitMQ's limits, that is). A separate goroutine also reads each commit hash and, at a configurable interval (e.g. 60 seconds), points the
`SWITCH_BRANCH` at the latest finalized commit to control the rate at which downstream pipelines are triggered.

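The buffering and acknowledgement logic described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped with this example: the `Delivery` and `Sink` types, the `Batch` function, and the in-memory `memSink` are hypothetical names introduced here. A real spout would presumably back `Commit` with a Pachyderm commit on `COMMIT_BRANCH` and `AckUpTo` with a RabbitMQ acknowledgement that covers multiple deliveries.

```go
package main

import (
	"bytes"
	"time"
)

// Delivery is a hypothetical stand-in for a RabbitMQ message:
// a body plus the delivery tag used to acknowledge it.
type Delivery struct {
	Tag  uint64
	Body []byte
}

// Sink is a hypothetical interface for the two side effects the spout needs.
type Sink interface {
	// Commit writes one newline-delimited file as a single commit.
	Commit(ndjson []byte) error
	// AckUpTo acknowledges every delivery up to and including tag.
	AckUpTo(tag uint64) error
}

// memSink is an in-memory Sink for trying the loop without a cluster.
type memSink struct {
	files []string
	acked []uint64
}

func (s *memSink) Commit(b []byte) error    { s.files = append(s.files, string(b)); return nil }
func (s *memSink) AckUpTo(tag uint64) error { s.acked = append(s.acked, tag); return nil }

// Batch buffers deliveries and flushes them when the buffer holds prefetch
// messages, when the flush channel fires, or when msgs is closed.
func Batch(msgs <-chan Delivery, flush <-chan time.Time, prefetch int, sink Sink) error {
	buf := make([]Delivery, 0, prefetch)

	commit := func() error {
		if len(buf) == 0 {
			return nil // nothing buffered, so no empty commit
		}
		var out bytes.Buffer
		for _, d := range buf {
			out.Write(d.Body)
			out.WriteByte('\n')
		}
		if err := sink.Commit(out.Bytes()); err != nil {
			return err // nothing was acked, so the broker can redeliver
		}
		last := buf[len(buf)-1].Tag
		buf = buf[:0]
		// Acknowledge only after the commit succeeded.
		return sink.AckUpTo(last)
	}

	for {
		select {
		case d, ok := <-msgs:
			if !ok {
				return commit() // queue closed: flush what is left
			}
			buf = append(buf, d)
			if len(buf) >= prefetch {
				if err := commit(); err != nil {
					return err
				}
			}
		case <-flush:
			if err := commit(); err != nil {
				return err
			}
		}
	}
}
```

Calling `Batch(msgs, time.Tick(10*time.Second), 2000, sink)` would mirror the default `PREFETCH` and `FLUSH_INTERVAL_MS` values. Because acknowledgement happens only after `Commit` succeeds, a crash before that point leaves the messages unacknowledged, and RabbitMQ redelivers them.
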
### Pachyderm setup

1. To use the prebuilt spout image,
create the spout with the `pachctl` command,
using the pipeline definition available in the `pipelines` directory:

```shell
$ pachctl create pipeline -f pipelines/spout.pipeline.json
```

2. To create your own version of the spout,
modify the pipeline file and point it at your own container registry.

The Makefile has targets for `create-pipeline` and `update-pipeline`,
or you can build just the image with `docker-image`.

### Configuration/Customization

| Variable Name | Description | Default Value |
|---------------|-------------|---------------|
| `PREFETCH` | The RabbitMQ prefetch count, which is also the maximum number of messages written into a single file. | 2000 |
| `EXTENSION` | The file extension. | ndjson |
| `FLUSH_INTERVAL_MS` | The maximum time, in milliseconds, before buffered messages are flushed to a file and committed. | 10000 |
| `SWITCH_INTERVAL_MS` | How often, in milliseconds, to commit to `SWITCH_BRANCH` (`master` by default) and trigger a downstream pipeline. | 60000 |
| `RABBITMQ_HOST` | The transport endpoint for RabbitMQ (port included). | `rabbitmq.default.svc.cluster.local:5672` |
| `RABBITMQ_USER` | The username for RabbitMQ. | peter |
| `RABBITMQ_PASSWORD` | (Secret) The RabbitMQ password. | `rabbitmq-password` |
| `SWITCH_BRANCH` | The branch to switch to periodically. | master |
| `COMMIT_BRANCH` | The branch to commit to. | staging |

The following command-line arguments are also available for the RabbitMQ spout:

| Flag | Description |
|------|-------------|
| `-topic` | The name of the messaging topic to read from |
| `-overwrite` | Whether or not to overwrite output |
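
As a rough illustration of how these settings could be consumed, the sketch below reads the environment variables with the defaults listed above and parses the two flags with Go's standard `flag` package. It is not the spout's actual source: the `Config` struct and `LoadConfig` function are hypothetical, the flag defaults (empty topic, `overwrite` off) are assumptions, and the password is read without a default because it is marked as a secret.

```go
package main

import (
	"flag"
	"strconv"
	"time"
)

// Config is a hypothetical container for the settings documented above.
type Config struct {
	Prefetch       int
	Extension      string
	FlushInterval  time.Duration
	SwitchInterval time.Duration
	Host           string
	User           string
	Password       string
	SwitchBranch   string
	CommitBranch   string
	Topic          string
	Overwrite      bool
}

// LoadConfig builds a Config from an environment lookup function and
// command-line arguments, falling back to the documented defaults.
func LoadConfig(getenv func(string) string, args []string) (Config, error) {
	str := func(key, def string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return def
	}
	num := func(key string, def int) (int, error) {
		v := getenv(key)
		if v == "" {
			return def, nil
		}
		return strconv.Atoi(v)
	}
	ms := func(key string, def int) (time.Duration, error) {
		n, err := num(key, def)
		return time.Duration(n) * time.Millisecond, err
	}

	c := Config{
		Extension:    str("EXTENSION", "ndjson"),
		Host:         str("RABBITMQ_HOST", "rabbitmq.default.svc.cluster.local:5672"),
		User:         str("RABBITMQ_USER", "peter"),
		Password:     getenv("RABBITMQ_PASSWORD"), // secret: no default in this sketch
		SwitchBranch: str("SWITCH_BRANCH", "master"),
		CommitBranch: str("COMMIT_BRANCH", "staging"),
	}

	var err error
	if c.Prefetch, err = num("PREFETCH", 2000); err != nil {
		return c, err
	}
	if c.FlushInterval, err = ms("FLUSH_INTERVAL_MS", 10000); err != nil {
		return c, err
	}
	if c.SwitchInterval, err = ms("SWITCH_INTERVAL_MS", 60000); err != nil {
		return c, err
	}

	// Flag defaults here are assumptions, not documented values.
	fs := flag.NewFlagSet("rabbitmq-spout", flag.ContinueOnError)
	fs.StringVar(&c.Topic, "topic", "", "messaging topic to read from")
	fs.BoolVar(&c.Overwrite, "overwrite", false, "whether to overwrite output")
	return c, fs.Parse(args)
}
```

In a real program this would typically be called as `LoadConfig(os.Getenv, os.Args[1:])`. Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the parsing easy to exercise with a plain map.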