---
title: Windowed Processing
description: Learn how to process periodic windows of messages with Benthos
---

A window is a batch of messages collected with respect to time, against which we can perform processing that analyses or aggregates the messages of the window. This is useful in stream processing because the dataset is never "complete", and so in order to perform analysis against a collection of messages we must create a continuous feed of windows (collections), with our analysis performed against each window.

For example, given a stream of messages relating to cars passing through various traffic lights:

```json
{
  "traffic_light": "cbf2eafc-806e-4067-9211-97be7e42cee3",
  "created_at": "2021-08-07T09:49:35Z",
  "registration_plate": "AB1C DEF",
  "passengers": 3
}
```

Windowing allows us to produce a stream of messages representing the total traffic for each light every hour:

```json
{
  "traffic_light": "cbf2eafc-806e-4067-9211-97be7e42cee3",
  "created_at": "2021-08-07T10:00:00Z",
  "unique_cars": 15,
  "passengers": 43
}
```

## Creating Windows

The first step in processing windows is producing the windows themselves, which can be done by configuring a window producing buffer after your input:

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

<Tabs defaultValue="system" values={[
  { label: 'System Clock', value: 'system', },
]}>
<TabItem value="system">

A [`system_window` buffer][buffers.system_window] creates windows by following the system clock of the running machine. Windows will be created and emitted at predictable times, but this also means that windows for historic data will not be emitted, which prevents backfills of traffic data:

```yaml
input:
  kafka:
    addresses: [ TODO ]
    topics: [ traffic_data ]
    consumer_group: traffic_consumer
    checkpoint_limit: 1000

buffer:
  system_window:
    timestamp_mapping: root = this.created_at
    size: 1h
    allowed_lateness: 3m
```
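
The same buffer can also emit overlapping windows. A minimal sketch, assuming we keep the one hour window size but use the buffer's `slide` field to emit a fresh window every ten minutes (so each message may contribute to several windows):

```yaml
buffer:
  system_window:
    timestamp_mapping: root = this.created_at
    size: 1h
    # Emit a new one hour window every ten minutes rather than hourly.
    slide: 10m
    allowed_lateness: 3m
```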

For more information about this buffer refer to [the `system_window` buffer docs][buffers.system_window].

</TabItem>
</Tabs>

## Grouping

With a window buffer chosen, our stream of messages will be emitted periodically as batches containing all of the messages that fit within each window. Since we want to analyse the traffic of each light separately we need to expand each window batch into one batch per traffic light identifier within the window. For that purpose we have two processor options: [`group_by`][processors.group_by] and [`group_by_value`][processors.group_by_value].
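
The [`group_by`][processors.group_by] processor splits a batch into a fixed set of groups using Bloblang check predicates, which suits cases where the groups are known up front. As a rough sketch (the predicates here are purely illustrative), splitting each window into busy and single-occupancy journeys might look like this:

```yaml
pipeline:
  processors:
    # Each window batch is split into two groups based on these checks.
    - group_by:
        - check: this.passengers > 1
        - check: this.passengers <= 1
```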

In our case we want to group by the value of the field `traffic_light` of each message, which we can do with the following:

```yaml
pipeline:
  processors:
    - group_by_value:
        value: ${! json("traffic_light") }
```

## Aggregating

Once our window has been grouped, the next step is to calculate the aggregated passenger and unique car counts. For this purpose the Benthos [mapping language Bloblang][bloblang.about] comes in handy, as the method [`from_all`][bloblang.methods.from_all] executes the target function against the entire batch and returns an array of the values, allowing us to mutate the result with chained methods such as [`unique`][bloblang.methods.unique] and [`sum`][bloblang.methods.sum]:

```yaml
pipeline:
  processors:
    - group_by_value:
        value: ${! json("traffic_light") }

    - bloblang: |
        let is_first_message = batch_index() == 0

        root.traffic_light = this.traffic_light
        root.created_at = meta("window_end_timestamp")

        # Aggregate across the whole batch, assigning the results to the
        # first message only.
        root.unique_cars = if $is_first_message {
          json("registration_plate").from_all().unique().length()
        }
        root.passengers = if $is_first_message {
          json("passengers").from_all().sum()
        }

        # Only keep the first batch message containing the aggregated results.
        root = if ! $is_first_message {
          deleted()
        }
```

[Bloblang][bloblang.about] is very powerful, and by using [`from`][bloblang.methods.from] and [`from_all`][bloblang.methods.from_all] it's possible to perform a wide range of batch-wide processing. If you fancy a challenge, try updating the above mapping to only count passengers from the first journey of each registration plate in the window (hint: the [`fold` method][bloblang.methods.fold] might come in handy).
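
Finally, to run these snippets as a complete stream you'll also need an output for the aggregated messages. A minimal sketch, assuming the hourly results are written back to Kafka (the addresses and topic name here are placeholders):

```yaml
# Pairs with the input, buffer, and pipeline sections shown above.
output:
  kafka:
    addresses: [ TODO ]
    topic: traffic_aggregates
```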

[buffers.system_window]: /docs/components/buffers/system_window
[processors.group_by]: /docs/components/processors/group_by
[processors.group_by_value]: /docs/components/processors/group_by_value
[bloblang.about]: /docs/guides/bloblang/about
[bloblang.methods.from_all]: /docs/guides/bloblang/methods#from_all
[bloblang.methods.sum]: /docs/guides/bloblang/methods#sum
[bloblang.methods.unique]: /docs/guides/bloblang/methods#unique
[bloblang.methods.from]: /docs/guides/bloblang/methods#from
[bloblang.methods.fold]: /docs/guides/bloblang/methods#fold