---
id: joining-streams
title: Joining Streams
description: How to hydrate documents by joining multiple streams.
---

This cookbook demonstrates how to merge JSON events from parallel streams using content-based rules and a [cache][caches] of your choice.

The imaginary problem we are going to solve is hydrating a feed of article comments with information from their parent articles. We will be consuming and writing to Kafka, but the example works with any [input][inputs] and [output][outputs] combination.

Articles are received over the topic `articles` and look like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dope article",
    "content": "this is a totally dope article"
  },
  "user": {
    "id": "user1"
  }
}
```

Comments can either be posted on an article or a parent comment, are received over the topic `comments`, and look like this:

```json
{
  "type": "comment",
  "comment": {
    "id": "456bar",
    "parent_id": "123foo",
    "content": "this article sucks"
  },
  "user": {
    "id": "user2"
  }
}
```

Our goal is to end up with a single stream of comments, where information about the root article of each comment is attached to the event. The above comment should exit our pipeline looking like this:

```json
{
  "type": "comment",
  "comment": {
    "id": "456bar",
    "parent_id": "123foo",
    "content": "this article sucks"
  },
  "article": {
    "title": "Dope article",
    "content": "this is a totally dope article"
  },
  "user": {
    "id": "user2"
  }
}
```

In order to achieve this we will need to cache articles as they pass through our pipeline and then retrieve them for each comment. Since the parent of a comment might be another comment we will also need to cache and retrieve comments in the same way.

## Caching Articles

Our first pipeline is simple: we consume articles, reduce them to only the fields we wish to cache, and then cache them. If we receive the same article multiple times we're going to assume it's okay to overwrite the old article in the cache.

In this example I'm targeting Redis, but you can choose any of the supported [cache targets][caches]. The TTL of cached articles is set to one week.

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ articles ]
    consumer_group: benthos_articles_group

pipeline:
  processors:
    # Reduce the document into only the fields we wish to cache.
    - bloblang: 'article = article'

    # Store reduced articles in our cache.
    - cache:
        operator: set
        resource: hydration_cache
        key: '${!json("article.id")}'
        value: '${!content()}'

# Drop all articles after they are cached.
output:
  drop: {}

cache_resources:
  - label: hydration_cache
    redis:
      expiration: 168h
      retries: 3
      retry_period: 500ms
      url: TODO
```
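To make the reduction concrete, here's what that `bloblang` mapping should produce for the example article above, and therefore what ends up in Redis under the key `123foo`:

```json
{
  "article": {
    "id": "123foo",
    "title": "Dope article",
    "content": "this is a totally dope article"
  }
}
```

Only the `article` object survives the mapping; `type` and `user` are dropped since comments won't need them.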
## Hydrating Comments

Our second pipeline consumes comments, caches each one in case a subsequent comment references it, looks up its parent (article or comment) in the cache, and attaches the root article to the event before sending it to our output topic `comments_hydrated`.

In this config we make use of the [`branch`][processor.branch] processor, as it allows us to reduce documents into smaller maps for caching and gives us greater control over how results are mapped back into the document. Note that the first branch's request map falls back to `deleted()`, which skips the cache lookup entirely for comments that have no parent ID.

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ comments ]
    consumer_group: benthos_comments_group

pipeline:
  processors:
    # Perform both hydration and caching within a for_each block as this
    # ensures that a given message of a batch is cached before the next
    # message is hydrated, so that when a message of the batch has a parent
    # within the same batch hydration still works.
    - for_each:
      # Attempt to obtain parent event from cache (if the ID exists).
      - branch:
          request_map: 'root = this.comment.parent_id | deleted()'
          processors:
            - cache:
                operator: get
                resource: hydration_cache
                key: '${!content()}'
          # And if successful copy it into the field `article`.
          result_map: 'root.article = this.article'

      # Reduce comment into only fields we wish to cache.
      - branch:
          request_map: |
            root.comment.id = this.comment.id
            root.article = this.article
          processors:
            # Store reduced comment into our cache.
            - cache:
                operator: set
                resource: hydration_cache
                key: '${!json("comment.id")}'
                value: '${!content()}'
          # No `result_map` since we don't need to map back into the original message.

# Send resulting documents to our hydrated topic.
output:
  kafka:
    addresses: [ TODO ]
    topic: comments_hydrated

cache_resources:
  - label: hydration_cache
    redis:
      expiration: 168h
      retries: 3
      retry_period: 500ms
      url: TODO
```
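To see what the second branch actually stores, trace our example comment through the pipeline: the first branch attaches the article before the second branch runs, so the document cached under the key `456bar` should look like this:

```json
{
  "comment": {
    "id": "456bar"
  },
  "article": {
    "title": "Dope article",
    "content": "this is a totally dope article"
  }
}
```

This is what makes nested replies work: a later comment with `"parent_id": "456bar"` hits this entry in the first branch and copies the root article straight out of it, with no need to walk the whole parent chain.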
This pipeline satisfies our basic needs but errors aren't handled at all, meaning intermittent cache connectivity problems that outlast our cache retries will result in failed documents entering our `comments_hydrated` topic. This is also the case if a comment arrives in our pipeline before its parent.

There are [many patterns for error handling][error-handling] to choose from in Benthos. In this example we're going to introduce a delayed retry queue, as it enables us to reprocess failed documents after a grace period that is isolated from our main pipeline.

## Adding a Retry Queue

Our retry queue is going to be another topic called `comments_retry`. Since most of our errors are time related we will delay retry attempts by storing the timestamp of the failed attempt as a metadata field and sleeping until a one-hour grace period since that attempt has elapsed; a comment that failed ten minutes ago therefore waits a further fifty minutes before being reprocessed.

We will use an input [`broker`][input.broker] so that we can consume both the `comments` and `comments_retry` topics in the same pipeline.

Our config (omitting the caching sections for brevity) now looks like this:

```yaml
input:
  broker:
    inputs:
      - kafka:
          addresses: [ TODO ]
          topics: [ comments ]
          consumer_group: benthos_comments_group

      - kafka:
          addresses: [ TODO ]
          topics: [ comments_retry ]
          consumer_group: benthos_comments_group

        processors:
          - for_each:
            # Calculate time until the next retry attempt and sleep for that
            # duration. This sleep blocks the topic 'comments_retry' but NOT
            # 'comments', because both topics are consumed independently and
            # these processors only apply to the 'comments_retry' input.
            - sleep:
                duration: '${! 3600 - ( timestamp_unix() - meta("last_attempted").number() ) }s'

pipeline:
  processors:
    - try:
      - for_each:
        # Attempt to obtain parent event from cache.
        - branch:
            {} # Omitted

        # Reduce document into only fields we wish to cache.
        - branch:
            {} # Omitted

      # If we've reached this point then both processors succeeded.
      - bloblang: 'meta output_topic = "comments_hydrated"'

    - catch:
      # If we reach here then a processing stage failed.
      - bloblang: |
          meta output_topic = "comments_retry"
          meta last_attempted = timestamp_unix()

# Send resulting documents either to our hydrated topic or the retry topic.
output:
  kafka:
    addresses: [ TODO ]
    topic: '${!meta("output_topic")}'

cache_resources:
  - label: hydration_cache
    redis: {} # Omitted
```

You can find a full example [in the project repo][full-example]. With this config we can deploy as many instances of Benthos as we need, as the Kafka partitions will be balanced across the consumers.

[caches]: /docs/components/caches/about
[inputs]: /docs/components/inputs/about
[input.broker]: /docs/components/inputs/broker
[outputs]: /docs/components/outputs/about
[error-handling]: /docs/configuration/error_handling
[processor.branch]: /docs/components/processors/branch
[full-example]: https://github.com/Jeffail/benthos/blob/master/config/examples/joining_streams.yaml
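If you want to try the result out locally, you can run the full config directly; here I'm assuming it's been saved as `joining_streams.yaml` with the `TODO` fields filled in:

```sh
benthos -c ./joining_streams.yaml
```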