github.com/Jeffail/benthos/v3@v3.65.0/website/docs/configuration/batching.md

---
title: Message Batching
---

Benthos is able to join sources and sinks with sometimes conflicting batching behaviours without sacrificing its strong delivery guarantees. It's also able to perform powerful [processing functions][windowing] across batches of messages such as grouping, archiving and reduction. Therefore, batching within Benthos is a mechanism that serves multiple purposes:

1. [Performance (throughput)](#performance)
2. [Grouped message processing](#grouped-message-processing)
3. [Compatibility (mixing multi and single part message protocols)](#compatibility)

## Performance

For most users the only benefit of batching messages is improving throughput over your output protocol. For some protocols this can happen in the background and requires no configuration from you. However, if an output has a `batching` configuration block this means it benefits from batching and requires you to specify how you'd like your batches to be formed by configuring a [batching policy](#batch-policy):

```yaml
output:
  kafka:
    addresses: [ todo:9092 ]
    topic: benthos_stream

    # Either send batches when they reach 10 messages or when 100ms has passed
    # since the last batch.
    batching:
      count: 10
      period: 100ms
```

However, a small number of inputs such as [`kafka`][input_kafka] must be consumed sequentially (in this case by partition) and therefore benefit from specifying your batch policy at the input level instead:

```yaml
input:
  kafka:
    addresses: [ todo:9092 ]
    topics: [ benthos_input_stream ]
    batching:
      count: 10
      period: 100ms

output:
  kafka:
    addresses: [ todo:9092 ]
    topic: benthos_stream
```

Inputs that behave this way are documented as such and have a `batching` configuration block.

Sometimes you may prefer to create your batches before processing in order to benefit from [batch wide processing](#grouped-message-processing), in which case if your input doesn't already support [a batch policy](#batch-policy) you can instead use a [`broker`][input_broker], which also allows you to combine inputs with a single batch policy:

```yaml
input:
  broker:
    inputs:
      - resource: foo
      - resource: bar
    batching:
      count: 50
      period: 500ms
```

This also works the same with [output brokers][output_broker].
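
As a minimal sketch of the output side, the config below places one batch policy in front of two outputs. The resource names `foo` and `bar` are placeholders carried over from the input example, and `fan_out` is an assumed choice of broker pattern rather than a requirement:

```yaml
output:
  broker:
    # Assumed pattern for illustration: every batch is sent to all outputs.
    pattern: fan_out
    outputs:
      - resource: foo
      - resource: bar
    batching:
      count: 50
      period: 500ms
```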

## Grouped Message Processing

One of the more powerful features of Benthos is that all processors are "batch aware", which means processors that operate on single messages can be configured using the `parts` field to only operate on select messages of a batch:

```yaml
pipeline:
  processors:
    # This processor only acts on the first message of a batch
    - protobuf:
        parts: [ 0 ]
        operator: to_json
        message: header.Message
        import_paths: [ /tmp/protos ]
```

Some processors, such as [`sleep`][processor.sleep], are executed once per batch. You can avoid this behaviour with the [`for_each` processor][proc_for_each]:

```yaml
pipeline:
  processors:
    # Sleep for one second for each message of a batch
    - for_each:
      - sleep:
          duration: 1s
```

There's a vast number of processors that specialise in operations across batches, such as [grouping][proc_group_by] and [archiving][proc_archive]. For example, the following processors group a batch of messages according to a metadata field and compress each group into a separate `.tar.gz` archive:

```yaml
pipeline:
  processors:
    - group_by_value:
        value: ${! meta("kafka_partition") }
    - archive:
        format: tar
    - compress:
        algorithm: gzip

output:
  aws_s3:
    bucket: TODO
    path: docs/${! meta("kafka_partition") }/${! count("files") }-${! timestamp_unix_nano() }.tar.gz
```

For more examples of batched (or windowed) processing check out [this document][windowing].

## Compatibility

Benthos is able to read and write over protocols that support multiple part messages, and all payloads travelling through Benthos are represented as a multiple part message. Therefore, all components within Benthos are able to work with multiple parts in a message as standard.

When messages reach an output that _doesn't_ support multiple parts, the message is broken down into an individual message per part, and then one of two behaviours happens depending on the output. If the output supports batch sending messages then the collection of messages is sent as a single batch. Otherwise, Benthos falls back to sending the messages sequentially in multiple, individual requests.

This behaviour means that not only can multiple part message protocols be easily matched with single part protocols, but also that the concepts of multiple part messages and message batches are interchangeable within Benthos.

### Shrinking Batches

A message batch (or multiple part message) can be broken down into smaller batches using the [`split`][split] processor:

```yaml
input:
  # Consume messages that arrive in three parts.
  resource: foo
  processors:
    # Drop the third part
    - select_parts:
        parts: [ 0, 1 ]
    # Then break our message parts into individual messages
    - split:
        size: 1
```

This is also useful when your input source creates batches that are too large for your output protocol:

```yaml
input:
  aws_s3:
    bucket: todo

pipeline:
  processors:
    - decompress:
        algorithm: gzip
    - unarchive:
        format: tar
    # Limit batch sizes to 5MB
    - split:
        byte_size: 5_000_000
```

## Batch Policy

When an input or output component has a config field `batching` that means it supports a batch policy. This is a mechanism that allows you to configure exactly how your batching should work on messages before they are routed to the input or output it's associated with. Batches are considered complete and will be flushed downstream when any of the following conditions is met:

- The `byte_size` field is non-zero and the total size of the batch in bytes matches or exceeds it (disregarding metadata).
- The `count` field is non-zero and the total number of messages in the batch matches or exceeds it.
- A message added to the batch causes the [`check`][bloblang] query to return `true`.
- The `period` field is non-empty and the time since the last batch exceeds its value.

This allows you to combine conditions:

```yaml
output:
  kafka:
    addresses: [ todo:9092 ]
    topic: benthos_stream

    # Either send batches when they reach 10 messages or when 100ms has passed
    # since the last batch.
    batching:
      count: 10
      period: 100ms
```
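
The size and `check` conditions can be mixed into a policy in the same way. The following sketch flushes a batch once it reaches roughly 1MB, once one second has passed since the last batch, or as soon as an added message satisfies the `check` query, whichever happens first. The field name `type` and the value `end_of_transaction` are assumptions about the message payload, used purely for illustration:

```yaml
output:
  kafka:
    addresses: [ todo:9092 ]
    topic: benthos_stream

    batching:
      # Flush when the batch reaches this many bytes (metadata not counted).
      byte_size: 1000000
      # Flush when this much time has passed since the last batch.
      period: 1s
      # Flush when a message with this (hypothetical) field value is added.
      check: this.type == "end_of_transaction"
```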

:::caution
A batch policy has the capability to _create_ batches, but not to break them down.
:::

If your configured pipeline is processing messages that are batched _before_ they reach the batch policy then they may circumvent the conditions you've specified here, resulting in sizes you aren't expecting.

If you are affected by this limitation then consider breaking the batches down with a [`split` processor][split] before they reach the batch policy.
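
One possible arrangement, assuming the oversized batches are formed upstream of the pipeline, is to split every batch into single messages in the pipeline so that the batch policy on the output rebuilds them to the configured size. The addresses and topic are the same placeholders used earlier:

```yaml
pipeline:
  processors:
    # Break any existing batches down into individual messages.
    - split:
        size: 1

output:
  kafka:
    addresses: [ todo:9092 ]
    topic: benthos_stream

    # Rebuild batches of at most 10 messages from the split messages.
    batching:
      count: 10
      period: 100ms
```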

### Post-Batch Processing

A batch policy also has a field `processors` which allows you to define an optional list of [processors][processors] to apply to each batch before it is flushed. This is a good place to aggregate or archive the batch into a compatible format for an output:

```yaml
output:
  http_client:
    url: http://localhost:4195/post
    batching:
      count: 10
      processors:
        - archive:
            format: lines
```

The above config will batch up messages and then merge them into a line delimited format before sending the result over HTTP. This is an easier format to parse than the default, which would have been [rfc1341 multipart](https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html).
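
If the receiving endpoint expects JSON instead, a similar sketch can collapse each batch into a single JSON array document. This assumes every message in the batch is itself a valid JSON document, and it reuses the placeholder URL from the config above:

```yaml
output:
  http_client:
    url: http://localhost:4195/post
    batching:
      count: 10
      period: 1s
      processors:
        # Merge the batch into one message containing a JSON array, with
        # one element per original message.
        - archive:
            format: json_array
```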

During shutdown any remaining messages waiting for a batch to complete will be flushed down the pipeline.

[processors]: /docs/components/processors/about
[processor.sleep]: /docs/components/processors/sleep
[split]: /docs/components/processors/split
[archive]: /docs/components/processors/archive
[unarchive]: /docs/components/processors/unarchive
[proc_for_each]: /docs/components/processors/for_each
[proc_group_by]: /docs/components/processors/group_by
[proc_archive]: /docs/components/processors/archive
[input_broker]: /docs/components/inputs/broker
[output_broker]: /docs/components/outputs/broker
[input_kafka]: /docs/components/inputs/kafka
[function_interpolation]: /docs/configuration/interpolation#bloblang-queries
[bloblang]: /docs/guides/bloblang/about
[windowing]: /docs/configuration/windowed_processing