---
id: enrichments
title: Enrichment Workflows
description: How to configure Benthos to process a workflow of enrichment services.
---

This cookbook demonstrates how to enrich a stream of JSON documents with HTTP services. This method also works with [AWS Lambda functions][processor.lambda], [subprocesses][processor.subprocess], etc.

We will start off by configuring a single enrichment, then we will move on to a workflow of enrichments with a network of dependencies using the [`workflow` processor][processor.workflow].

Each enrichment will be performed in parallel across a [pre-batched][batching] stream of documents. Workflow enrichments that do not depend on each other will also be performed in parallel, making this orchestration method very efficient.

The imaginary problem we are going to solve is applying a set of NLP-based enrichments to a feed of articles in order to detect fake news. We will be consuming from and writing to Kafka, but the example works with any [input][inputs] and [output][outputs] combination.

Articles are received over the topic `articles` and look like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dogs Stop Barking",
    "content": "The world was shocked this morning to find that all dogs have stopped barking."
  }
}
```

## Meet the Enrichments

### Claims Detector

To start us off we will configure a single enrichment, which is an imaginary 'claims detector' service. This is an HTTP service that wraps a trained machine learning model to extract claims that are made within a body of text.

The service expects a `POST` request with a JSON payload of the form:

```json
{
  "text": "The world was shocked this morning to find that all dogs have stopped barking."
}
```

And returns a JSON payload of the form:

```json
{
  "claims": [
    {
      "entity": "world",
      "claim": "shocked"
    },
    {
      "entity": "dogs",
      "claim": "NOT barking"
    }
  ]
}
```

Since each request only applies to a single document we will make this enrichment scale by deploying multiple HTTP services and hitting those instances in parallel across our document batches.

In order to send a mapped request and map the response back into the original document we will use the [`branch` processor][processor.branch] with a child [`http`][processor.http] processor.

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ articles ]
    consumer_group: benthos_articles_group
    batching:
      count: 20 # Tune this to set the size of our document batches.
      period: 1s

pipeline:
  processors:
    - branch:
        request_map: 'root.text = this.article.content'
        processors:
          - http:
              parallel: true
              url: http://localhost:4197/claims
              verb: POST
        result_map: 'root.tmp.claims = this.claims'

output:
  kafka:
    addresses: [ TODO ]
    topic: articles_hydrated
```

With this pipeline our documents will come out looking something like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dogs Stop Barking",
    "content": "The world was shocked this morning to find that all dogs have stopped barking."
  },
  "tmp": {
    "claims": [
      {
        "entity": "world",
        "claim": "shocked"
      },
      {
        "entity": "dogs",
        "claim": "NOT barking"
      }
    ]
  }
}
```
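In a production setting you may also want to guard against a slow or flaky claims service. The `http` processor exposes timeout and retry fields for this; here is a minimal sketch of the same branch with those fields tuned (the values shown are illustrative, not recommendations):

```yaml
pipeline:
  processors:
    - branch:
        request_map: 'root.text = this.article.content'
        processors:
          - http:
              parallel: true
              url: http://localhost:4197/claims
              verb: POST
              timeout: 5s      # Give up on requests that take longer than this.
              retries: 3       # Retry failed requests before flagging the document as errored.
              retry_period: 1s # Wait this long between retry attempts.
        result_map: 'root.tmp.claims = this.claims'
```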
### Hyperbole Detector

Next up is a 'hyperbole detector' that takes a `POST` request containing the article contents and returns a hyperbole score between 0 and 1. This time the format is array-based and therefore supports processing multiple documents in a single request, making better use of the host machine's GPU.

A request should take the following form:

```json
[
  {
    "text": "The world was shocked this morning to find that all dogs have stopped barking."
  }
]
```

And the response looks like this:

```json
[
  {
    "hyperbole_rank": 0.73
  }
]
```

In order to create a single request from a batch of documents, and subsequently map the result back into our batch, we will use the [`archive`][processor.archive] and [`unarchive`][processor.unarchive] processors in our [`branch`][processor.branch] flow, like this:

```yaml
pipeline:
  processors:
    - branch:
        request_map: 'root.text = this.article.content'
        processors:
          - archive:
              format: json_array
          - http:
              url: http://localhost:4198/hyperbole
              verb: POST
          - unarchive:
              format: json_array
        result_map: 'root.tmp.hyperbole_rank = this.hyperbole_rank'
```

The `archive` processor with the `json_array` format takes a batch of JSON documents and places them within a single document as an array, meaning we send a single request for each batch.

After the request is made we do the opposite with the `unarchive` processor in order to convert the response back into a batch of the original size.

### Fake News Detector

Finally, we are going to use a 'fake news detector' that takes the article contents as well as the output of the previous two enrichments and calculates a fake news rank between 0 and 1.

This service behaves similarly to the claims detector service and takes a document of the form:

```json
{
  "text": "The world was shocked this morning to find that all dogs have stopped barking.",
  "hyperbole_rank": 0.73,
  "claims": [
    {
      "entity": "world",
      "claim": "shocked"
    },
    {
      "entity": "dogs",
      "claim": "NOT barking"
    }
  ]
}
```

And returns an object of the form:

```json
{
  "fake_news_rank": 0.893
}
```

We then wish to map the field `fake_news_rank` from that result into the original document at the path `article.fake_news_score`. Our [`branch`][processor.branch] block for this enrichment would look like this:

```yaml
pipeline:
  processors:
    - branch:
        request_map: |
          root.text = this.article.content
          root.claims = this.tmp.claims
          root.hyperbole_rank = this.tmp.hyperbole_rank
        processors:
          - http:
              parallel: true
              url: http://localhost:4199/fakenews
              verb: POST
        result_map: 'root.article.fake_news_score = this.fake_news_rank'
```

Note that in our `request_map` we are targeting fields that are populated by the previous two enrichments.
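Running the enrichments one after another is simply a matter of listing the three `branch` processors in order. Composed directly from the three snippets above, the naive sequential pipeline looks like this:

```yaml
pipeline:
  processors:
    - branch: # Claims detector
        request_map: 'root.text = this.article.content'
        processors:
          - http:
              parallel: true
              url: http://localhost:4197/claims
              verb: POST
        result_map: 'root.tmp.claims = this.claims'

    - branch: # Hyperbole detector
        request_map: 'root.text = this.article.content'
        processors:
          - archive:
              format: json_array
          - http:
              url: http://localhost:4198/hyperbole
              verb: POST
          - unarchive:
              format: json_array
        result_map: 'root.tmp.hyperbole_rank = this.hyperbole_rank'

    - branch: # Fake news detector, consumes the output of the two branches above
        request_map: |
          root.text = this.article.content
          root.claims = this.tmp.claims
          root.hyperbole_rank = this.tmp.hyperbole_rank
        processors:
          - http:
              parallel: true
              url: http://localhost:4199/fakenews
              verb: POST
        result_map: 'root.article.fake_news_score = this.fake_news_rank'
```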
If we were to execute all three enrichments in this sequence we'd end up with a document looking like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dogs Stop Barking",
    "content": "The world was shocked this morning to find that all dogs have stopped barking.",
    "fake_news_score": 0.893
  },
  "tmp": {
    "hyperbole_rank": 0.73,
    "claims": [
      {
        "entity": "world",
        "claim": "shocked"
      },
      {
        "entity": "dogs",
        "claim": "NOT barking"
      }
    ]
  }
}
```

Great! However, as a streaming pipeline this setup isn't ideal since our first two enrichments are independent and could be executed in parallel in order to reduce processing latency.

## Combining into a Workflow

If we configure our enrichments within a [`workflow` processor][processor.workflow] we can use Benthos to automatically detect our dependency graph, giving us two key benefits:

1. Enrichments at the same level of the dependency graph (claims and hyperbole) will be executed in parallel.
2. When introducing more enrichments to our pipeline the added complexity of resolving the dependency graph is handled automatically by Benthos.

Placing our branches within a [`workflow` processor][processor.workflow] makes our final pipeline configuration look like this:

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ articles ]
    consumer_group: benthos_articles_group
    batching:
      count: 20 # Tune this to set the size of our document batches.
      period: 1s

pipeline:
  processors:
    - workflow:
        meta_path: '' # Don't bother storing branch metadata.
        branches:
          claims:
            request_map: 'root.text = this.article.content'
            processors:
              - http:
                  parallel: true
                  url: http://localhost:4197/claims
                  verb: POST
            result_map: 'root.tmp.claims = this.claims'

          hyperbole:
            request_map: 'root.text = this.article.content'
            processors:
              - archive:
                  format: json_array
              - http:
                  url: http://localhost:4198/hyperbole
                  verb: POST
              - unarchive:
                  format: json_array
            result_map: 'root.tmp.hyperbole_rank = this.hyperbole_rank'

          fake_news:
            request_map: |
              root.text = this.article.content
              root.claims = this.tmp.claims
              root.hyperbole_rank = this.tmp.hyperbole_rank
            processors:
              - http:
                  parallel: true
                  url: http://localhost:4199/fakenews
                  verb: POST
            result_map: 'root.article.fake_news_score = this.fake_news_rank'

    - catch:
        - log:
            fields:
              content: "${!content()}"
            message: "Enrichments failed due to: ${!error()}"

    - bloblang: |
        root = this
        root.tmp = deleted()

output:
  kafka:
    addresses: [ TODO ]
    topic: articles_hydrated
```

Since the contents of `tmp` won't be required downstream we remove them after our enrichments using a [`bloblang` processor][processor.bloblang].

A [`catch`][processor.catch] processor was added at the end of the pipeline in order to capture documents that failed enrichment. You can replace the log event with a wide range of recovery actions, such as sending the document to a dead-letter/retry queue or dropping it entirely. You can read more about error handling [in this article][error-handling].
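As a rough illustration of the second option, here is a minimal variation of that `catch` block which logs the error and then drops the failed document rather than forwarding it (in Bloblang, assigning `deleted()` to `root` removes the message from the batch):

```yaml
    - catch:
        - log:
            message: "Dropping failed enrichment: ${!error()}"
        - bloblang: 'root = deleted()'
```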
[inputs]: /docs/components/inputs/about
[outputs]: /docs/components/outputs/about
[error-handling]: /docs/configuration/error_handling
[batching]: /docs/configuration/batching
[processor.catch]: /docs/components/processors/catch
[processor.archive]: /docs/components/processors/archive
[processor.unarchive]: /docs/components/processors/unarchive
[processor.bloblang]: /docs/components/processors/bloblang
[processor.subprocess]: /docs/components/processors/subprocess
[processor.lambda]: /docs/components/processors/aws_lambda
[processor.http]: /docs/components/processors/http
[processor.branch]: /docs/components/processors/branch
[processor.workflow]: /docs/components/processors/workflow