---
id: joining-streams
title: Joining Streams
description: How to hydrate documents by joining multiple streams.
---

This cookbook demonstrates how to merge JSON events from parallel streams using content-based rules and a [cache][caches] of your choice.

The imaginary problem we are going to solve is hydrating a feed of article comments with information from their parent articles. We will be consuming and writing to Kafka, but the example works with any [input][inputs] and [output][outputs] combination.

Articles are received over the topic `articles` and look like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dope article",
    "content": "this is a totally dope article"
  },
  "user": {
    "id": "user1"
  }
}
```

Comments can either be posted on an article or a parent comment, are received over the topic `comments`, and look like this:

```json
{
  "type": "comment",
  "comment": {
    "id": "456bar",
    "parent_id": "123foo",
    "content": "this article sucks"
  },
  "user": {
    "id": "user2"
  }
}
```

Our goal is to end up with a single stream of comments, where information about the root article of each comment is attached to the event. The above comment should exit our pipeline looking like this:

```json
{
  "type": "comment",
  "comment": {
    "id": "456bar",
    "parent_id": "123foo",
    "content": "this article sucks"
  },
  "article": {
    "title": "Dope article",
    "content": "this is a totally dope article"
  },
  "user": {
    "id": "user2"
  }
}
```

In order to achieve this we will need to cache articles as they pass through our pipeline and then retrieve them for each comment. Since the parent of a comment might be another comment we will also need to cache and retrieve comments in the same way.

## Caching Articles

Our first pipeline is simple: we consume articles, reduce them to only the fields we wish to cache, and then cache them. If we receive the same article multiple times we're going to assume it's okay to overwrite the old article in the cache.

In this example I'm targeting Redis, but you can choose any of the supported [cache targets][caches]. The TTL of cached articles is set to one week.

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ articles ]
    consumer_group: benthos_articles_group

pipeline:
  processors:
    # Reduce the document into only the fields we wish to cache.
    - bloblang: 'article = article'

    # Store reduced articles in our cache.
    - cache:
        operator: set
        resource: hydration_cache
        key: '${!json("article.id")}'
        value: '${!content()}'

# Drop all articles after they are cached.
output:
  drop: {}

cache_resources:
  - label: hydration_cache
    redis:
      expiration: 168h
      retries: 3
      retry_period: 500ms
      url: TODO
```
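To make the reduction concrete, here's what that `bloblang` mapping should produce for the example article above, and therefore what ends up in Redis under the key `123foo`:

```json
{
  "article": {
    "id": "123foo",
    "title": "Dope article",
    "content": "this is a totally dope article"
  }
}
```

Only the `article` object survives the mapping; `type` and `user` are dropped since comments won't need them.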
## Hydrating Comments

Our second pipeline consumes comments, caches each one in case a subsequent comment references it, looks up its parent (article or comment) in the cache, and attaches the root article to the event before sending it to our output topic `comments_hydrated`.

In this config we make use of the [`branch`][processor.branch] processor, as it allows us to reduce documents into smaller maps for caching and gives us greater control over how results are mapped back into the document. Note that the first branch's request map falls back to `deleted()`, which skips the cache lookup entirely for comments that have no parent ID.

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ comments ]
    consumer_group: benthos_comments_group

pipeline:
  processors:
    # Perform both hydration and caching within a for_each block as this
    # ensures that a given message of a batch is cached before the next
    # message is hydrated, so that when a message of the batch has a parent
    # within the same batch hydration still works.
    - for_each:
      # Attempt to obtain parent event from cache (if the ID exists).
      - branch:
          request_map: 'root = this.comment.parent_id | deleted()'
          processors:
            - cache:
                operator: get
                resource: hydration_cache
                key: '${!content()}'
          # And if successful copy it into the field `article`.
          result_map: 'root.article = this.article'

      # Reduce comment into only fields we wish to cache.
      - branch:
          request_map: |
            root.comment.id = this.comment.id
            root.article = this.article
          processors:
            # Store reduced comment into our cache.
            - cache:
                operator: set
                resource: hydration_cache
                key: '${!json("comment.id")}'
                value: '${!content()}'
          # No `result_map` since we don't need to map back into the original message.

# Send resulting documents to our hydrated topic.
output:
  kafka:
    addresses: [ TODO ]
    topic: comments_hydrated

cache_resources:
  - label: hydration_cache
    redis:
      expiration: 168h
      retries: 3
      retry_period: 500ms
      url: TODO
```
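To see what the second branch actually stores, trace our example comment through the pipeline: the first branch attaches the article before the second branch runs, so the document cached under the key `456bar` should look like this:

```json
{
  "comment": {
    "id": "456bar"
  },
  "article": {
    "title": "Dope article",
    "content": "this is a totally dope article"
  }
}
```

This is what makes nested replies work: a later comment with `"parent_id": "456bar"` hits this entry in the first branch and copies the root article straight out of it, with no need to walk the whole parent chain.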
This pipeline satisfies our basic needs but errors aren't handled at all, meaning intermittent cache connectivity problems that outlast our cache retries will result in failed documents entering our `comments_hydrated` topic. This is also the case if a comment arrives in our pipeline before its parent.

There are [many patterns for error handling][error-handling] to choose from in Benthos. In this example we're going to introduce a delayed retry queue, as it enables us to reprocess failed documents after a grace period that is isolated from our main pipeline.

## Adding a Retry Queue

Our retry queue is going to be another topic called `comments_retry`. Since most of our errors are time related we will delay retry attempts by storing the timestamp of the failed attempt as a metadata field and sleeping until a one-hour grace period since that attempt has elapsed; a comment that failed ten minutes ago therefore waits a further fifty minutes before being reprocessed.

We will use an input [`broker`][input.broker] so that we can consume both the `comments` and `comments_retry` topics in the same pipeline.

Our config (omitting the caching sections for brevity) now looks like this:

```yaml
input:
  broker:
    inputs:
      - kafka:
          addresses: [ TODO ]
          topics: [ comments ]
          consumer_group: benthos_comments_group

      - kafka:
          addresses: [ TODO ]
          topics: [ comments_retry ]
          consumer_group: benthos_comments_group

        processors:
          - for_each:
            # Calculate time until the next retry attempt and sleep for that
            # duration. This sleep blocks the topic 'comments_retry' but NOT
            # 'comments', because both topics are consumed independently and
            # these processors only apply to the 'comments_retry' input.
            - sleep:
                duration: '${! 3600 - ( timestamp_unix() - meta("last_attempted").number() ) }s'

pipeline:
  processors:
    - try:
      - for_each:
        # Attempt to obtain parent event from cache.
        - branch:
            {} # Omitted

        # Reduce document into only fields we wish to cache.
        - branch:
            {} # Omitted

      # If we've reached this point then both processors succeeded.
      - bloblang: 'meta output_topic = "comments_hydrated"'

    - catch:
      # If we reach here then a processing stage failed.
      - bloblang: |
          meta output_topic = "comments_retry"
          meta last_attempted = timestamp_unix()

# Send resulting documents either to our hydrated topic or the retry topic.
output:
  kafka:
    addresses: [ TODO ]
    topic: '${!meta("output_topic")}'

cache_resources:
  - label: hydration_cache
    redis: {} # Omitted
```

You can find a full example [in the project repo][full-example]. With this config we can deploy as many instances of Benthos as we need, as the Kafka partitions will be balanced across the consumers.

[caches]: /docs/components/caches/about
[inputs]: /docs/components/inputs/about
[input.broker]: /docs/components/inputs/broker
[outputs]: /docs/components/outputs/about
[error-handling]: /docs/configuration/error_handling
[processor.branch]: /docs/components/processors/branch
[full-example]: https://github.com/Jeffail/benthos/blob/master/config/examples/joining_streams.yaml
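If you want to try the result out locally, you can run the full config directly; here I'm assuming it's been saved as `joining_streams.yaml` with the `TODO` fields filled in:

```sh
benthos -c ./joining_streams.yaml
```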