---
id: enrichments
title: Enrichment Workflows
description: How to configure Benthos to process a workflow of enrichment services.
---

This cookbook demonstrates how to enrich a stream of JSON documents with HTTP services. This method also works with [AWS Lambda functions][processor.lambda], [subprocesses][processor.subprocess], etc.

We will start off by configuring a single enrichment, then we will move on to a workflow of enrichments with a network of dependencies using the [`workflow` processor][processor.workflow].

Each enrichment will be performed in parallel across a [pre-batched][batching] stream of documents. Workflow enrichments that do not depend on each other will also be performed in parallel, making this orchestration method very efficient.

The imaginary problem we are going to solve is applying a set of NLP based enrichments to a feed of articles in order to detect fake news. We will be consuming and writing to Kafka, but the example works with any [input][inputs] and [output][outputs] combination.

Articles are received over the topic `articles` and look like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dogs Stop Barking",
    "content": "The world was shocked this morning to find that all dogs have stopped barking."
  }
}
```

## Meet the Enrichments

### Claims Detector

To start us off we will configure a single enrichment, which is an imaginary 'claims detector' service. This is an HTTP service that wraps a trained machine learning model to extract claims that are made within a body of text.

The service expects a `POST` request with JSON payload of the form:

```json
{
  "text": "The world was shocked this morning to find that all dogs have stopped barking."
}
```

And returns a JSON payload of the form:

```json
{
  "claims": [
    {
      "entity": "world",
      "claim": "shocked"
    },
    {
      "entity": "dogs",
      "claim": "NOT barking"
    }
  ]
}
```

Since each request only applies to a single document we will make this enrichment scale by deploying multiple HTTP services and hitting those instances in parallel across our document batches.

In order to send a mapped request and map the response back into the original document we will use the [`branch` processor][processor.branch], with a child [`http`][processor.http] processor.

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ articles ]
    consumer_group: benthos_articles_group
    batching:
      count: 20 # Tune this to set the size of our document batches.
      period: 1s

pipeline:
  processors:
    - branch:
        request_map: 'root.text = this.article.content'
        processors:
          - http:
              parallel: true
              url: http://localhost:4197/claims
              verb: POST
        result_map: 'root.tmp.claims = this.claims'

output:
  kafka:
    addresses: [ TODO ]
    topic: comments_hydrated
```

With this pipeline our documents will come out looking something like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dogs Stop Barking",
    "content": "The world was shocked this morning to find that all dogs have stopped barking."
  },
  "tmp": {
    "claims": [
      {
        "entity": "world",
        "claim": "shocked"
      },
      {
        "entity": "dogs",
        "claim": "NOT barking"
      }
    ]
  }
}
```
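
The [`http`][processor.http] processor also exposes the usual HTTP client options, so if the claims service is occasionally slow or rate limited we can tune the calls without touching the mappings. A rough sketch of the same branch with a timeout and retries added (the values are illustrative, not recommendations):

```yaml
pipeline:
  processors:
    - branch:
        request_map: 'root.text = this.article.content'
        processors:
          - http:
              parallel: true
              url: http://localhost:4197/claims
              verb: POST
              timeout: 5s         # Give up on a single attempt after five seconds.
              retries: 3          # Retry failed requests a few times before erroring.
              backoff_on: [ 429 ] # Back off when the service asks us to slow down.
        result_map: 'root.tmp.claims = this.claims'
```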

### Hyperbole Detector

Next up is a 'hyperbole detector' that takes a `POST` request containing the article contents and returns a hyperbole score between 0 and 1. This time the format is array-based and therefore supports processing multiple documents in a single request, making better use of the host machine's GPU.

A request should take the following form:

```json
[
  {
    "text": "The world was shocked this morning to find that all dogs have stopped barking."
  }
]
```

And the response looks like this:

```json
[
  {
    "hyperbole_rank": 0.73
  }
]
```

In order to create a single request from a batch of documents, and subsequently map the result back into our batch, we will use the [`archive`][processor.archive] and [`unarchive`][processor.unarchive] processors in our [`branch`][processor.branch] flow, like this:

```yaml
pipeline:
  processors:
    - branch:
        request_map: 'root.text = this.article.content'
        processors:
          - archive:
              format: json_array
          - http:
              url: http://localhost:4198/hyperbole
              verb: POST
          - unarchive:
              format: json_array
        result_map: 'root.tmp.hyperbole_rank = this.hyperbole_rank'
```

The purpose of the `json_array` format `archive` processor is to take a batch of JSON documents and place them into a single document as an array, meaning we only send a single request per batch.

After the request is made we do the opposite with the `unarchive` processor in order to convert it back into a batch of the original size.
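
To make that concrete, here's roughly what the archived request body would look like for a batch of two mapped documents (the second article is invented purely for illustration):

```json
[
  {
    "text": "The world was shocked this morning to find that all dogs have stopped barking."
  },
  {
    "text": "Cats are reportedly delighted by the silence."
  }
]
```

The `unarchive` step then splits the two-element response array back into a batch of two documents, so each `hyperbole_rank` is mapped back onto the article it belongs to.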

### Fake News Detector

Finally, we are going to use a 'fake news detector' that takes the article contents as well as the output of the previous two enrichments and calculates a fake news rank between 0 and 1.

This service behaves similarly to the claims detector service and takes a document of the form:

```json
{
  "text": "The world was shocked this morning to find that all dogs have stopped barking.",
  "hyperbole_rank": 0.73,
  "claims": [
    {
      "entity": "world",
      "claim": "shocked"
    },
    {
      "entity": "dogs",
      "claim": "NOT barking"
    }
  ]
}
```

And returns an object of the form:

```json
{
  "fake_news_rank": 0.893
}
```

We then wish to map the field `fake_news_rank` from that result into the original document at the path `article.fake_news_score`. Our [`branch`][processor.branch] block for this enrichment would look like this:

```yaml
pipeline:
  processors:
    - branch:
        request_map: |
          root.text = this.article.content
          root.claims = this.tmp.claims
          root.hyperbole_rank = this.tmp.hyperbole_rank
        processors:
          - http:
              parallel: true
              url: http://localhost:4199/fakenews
              verb: POST
        result_map: 'root.article.fake_news_score = this.fake_news_rank'
```

Note that in our `request_map` we are targeting fields that are populated by the previous two enrichments.

If we execute all three enrichments in sequence we'll end up with a document looking like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dogs Stop Barking",
    "content": "The world was shocked this morning to find that all dogs have stopped barking.",
    "fake_news_score": 0.76
  },
  "tmp": {
    "hyperbole_rank": 0.34,
    "claims": [
      {
        "entity": "world",
        "claim": "shocked"
      },
      {
        "entity": "dogs",
        "claim": "NOT barking"
      }
    ]
  }
}
```

Great! However, as a streaming pipeline this setup isn't ideal as our first two enrichments are independent and could be executed in parallel to reduce processing latency.

## Combining into a Workflow

If we configure our enrichments within a [`workflow` processor][processor.workflow] we can use Benthos to automatically detect our dependency graph, giving us two key benefits:

1. Enrichments at the same level of a dependency graph (claims and hyperbole) will be executed in parallel.
2. When introducing more enrichments to our pipeline the added complexity of resolving the dependency graph is handled automatically by Benthos.
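
The dependency graph is inferred from the `request_map` and `result_map` of each branch, but if you'd rather pin the execution order yourself the [`workflow` processor][processor.workflow] also accepts an explicit `order` field of branch tiers. A minimal sketch of what that would look like for our three enrichments (the branch bodies are elided here; they're the same as in the full config below):

```yaml
pipeline:
  processors:
    - workflow:
        # Tiers run in sequence, branches within a tier run in parallel.
        order: [ [ claims, hyperbole ], [ fake_news ] ]
        branches:
          # ... claims, hyperbole and fake_news branches as below ...
```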

Placing our branches within a [`workflow` processor][processor.workflow] makes our final pipeline configuration look like this:

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ articles ]
    consumer_group: benthos_articles_group
    batching:
      count: 20 # Tune this to set the size of our document batches.
      period: 1s

pipeline:
  processors:
    - workflow:
        meta_path: '' # Don't bother storing branch metadata.
        branches:
          claims:
            request_map: 'root.text = this.article.content'
            processors:
              - http:
                  parallel: true
                  url: http://localhost:4197/claims
                  verb: POST
            result_map: 'root.tmp.claims = this.claims'

          hyperbole:
            request_map: 'root.text = this.article.content'
            processors:
              - archive:
                  format: json_array
              - http:
                  url: http://localhost:4198/hyperbole
                  verb: POST
              - unarchive:
                  format: json_array
            result_map: 'root.tmp.hyperbole_rank = this.hyperbole_rank'

          fake_news:
            request_map: |
              root.text = this.article.content
              root.claims = this.tmp.claims
              root.hyperbole_rank = this.tmp.hyperbole_rank
            processors:
              - http:
                  parallel: true
                  url: http://localhost:4199/fakenews
                  verb: POST
            result_map: 'root.article.fake_news_score = this.fake_news_rank'

    - catch:
        - log:
            fields:
              content: "${!content()}"
            message: "Enrichments failed due to: ${!error()}"

    - bloblang: |
        root = this
        root.tmp = deleted()

output:
  kafka:
    addresses: [ TODO ]
    topic: comments_hydrated
```

Since the contents of `tmp` won't be required downstream we remove that field after our enrichments using a [`bloblang` processor][processor.bloblang].
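
The documents leaving the pipeline should therefore look something like this:

```json
{
  "type": "article",
  "article": {
    "id": "123foo",
    "title": "Dogs Stop Barking",
    "content": "The world was shocked this morning to find that all dogs have stopped barking.",
    "fake_news_score": 0.76
  }
}
```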

A [`catch`][processor.catch] processor was added at the end of the pipeline to handle documents that failed any of the enrichments. You can replace the log event with a wide range of recovery actions, such as sending to a dead-letter/retry queue or dropping the message entirely. You can read more about error handling [in this article][error-handling].
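
For example, if failed documents shouldn't be sent downstream at all, the log could be followed by a mapping that deletes them. A minimal sketch of that alternative:

```yaml
pipeline:
  processors:
    # ... workflow processor as above ...
    - catch:
        - log:
            message: "Enrichments failed due to: ${!error()}"
        - bloblang: 'root = deleted()' # Drop any document that failed an enrichment.
```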

[inputs]: /docs/components/inputs/about
[outputs]: /docs/components/outputs/about
[error-handling]: /docs/configuration/error_handling
[batching]: /docs/configuration/batching
[processor.catch]: /docs/components/processors/catch
[processor.archive]: /docs/components/processors/archive
[processor.unarchive]: /docs/components/processors/unarchive
[processor.bloblang]: /docs/components/processors/bloblang
[processor.subprocess]: /docs/components/processors/subprocess
[processor.lambda]: /docs/components/processors/aws_lambda
[processor.http]: /docs/components/processors/http
[processor.branch]: /docs/components/processors/branch
[processor.workflow]: /docs/components/processors/workflow