github.com/Jeffail/benthos/v3@v3.65.0/website/docs/components/processors/dedupe.md (about)

---
title: dedupe
type: processor
status: stable
categories: ["Utility"]
---

<!--
     THIS FILE IS AUTOGENERATED!

     To make changes please edit the contents of:
     lib/processor/dedupe.go
-->

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';


Deduplicates message batches by caching selected (and optionally hashed)
messages, dropping batches that are already cached.


<Tabs defaultValue="common" values={[
  { label: 'Common', value: 'common', },
  { label: 'Advanced', value: 'advanced', },
]}>

<TabItem value="common">

```yaml
# Common config fields, showing default values
label: ""
dedupe:
  cache: ""
  hash: none
  key: ""
  drop_on_err: true
```

</TabItem>
<TabItem value="advanced">

```yaml
# All config fields, showing default values
label: ""
dedupe:
  cache: ""
  hash: none
  key: ""
  drop_on_err: true
  parts:
    - 0
```

</TabItem>
</Tabs>

This processor acts across an entire batch. To deduplicate individual
messages within a batch, use this processor with the
[`for_each`](/docs/components/processors/for_each) processor.

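As a minimal sketch of the per-message setup described above, the following config nests `dedupe` inside `for_each` so that each message of a batch is checked on its own. The cache name `foocache` and the JSON field `id` are assumptions chosen for illustration.

```yaml
pipeline:
  processors:
    - for_each:
        # Each message of the batch is processed as if it were its own batch.
        - dedupe:
            cache: foocache          # hypothetical cache resource name
            key: ${! json("id") }    # hypothetical JSON field used as the key
```
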
Optionally, the `key` field can be populated in order to hash on a
function-interpolated string rather than the full contents of messages. This
allows you to deduplicate based on dynamic fields within a message, such as its
metadata, JSON fields, etc. A full list of interpolation functions can be found
[here](/docs/configuration/interpolation#bloblang-queries).

For example, the following config would deduplicate based on the concatenated
values of the metadata field `kafka_key` and the value of the JSON
path `id` within the message contents:

```yaml
pipeline:
  processors:
    - dedupe:
        cache: foocache
        key: ${! meta("kafka_key") }-${! json("id") }
```

Caches should be configured as a resource. For more information, check out the
[documentation here](/docs/components/caches/about).

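A cache resource for the processor to target might look like the sketch below. It assumes the list-based `cache_resources` form of resource config and the `memory` cache with a `ttl` given in seconds. The label `foocache` and the TTL value are illustrative only, so check the caches documentation for the fields your version supports.

```yaml
cache_resources:
  - label: foocache   # hypothetical label, referenced by the dedupe `cache` field
    memory:
      ttl: 300        # assumed value: keep entries for five minutes

pipeline:
  processors:
    - dedupe:
        cache: foocache
```
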
When using this processor with an output target that might fail, you should
always wrap the output within a [`retry`](/docs/components/outputs/retry)
block. This ensures that during outages your messages aren't reprocessed after
a failed delivery, at which point they would already be present in the cache
and would therefore be dropped as duplicates.

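The wrapping described above could be sketched as follows. The child output (`http_client`) and its URL are placeholders chosen for illustration, and any output that might fail can take their place.

```yaml
output:
  retry:
    # The wrapped output is retried in place rather than the message
    # being sent back through the pipeline (and the dedupe processor).
    output:
      http_client:
        url: http://localhost:8080/post   # hypothetical endpoint
```
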
## Delivery Guarantees

Performing deduplication on a stream using a distributed cache voids any
at-least-once guarantees that it previously had. This is because the cache will
preserve message signatures even if the message fails to leave the Benthos
pipeline, which would cause message loss in the event of an outage at the output
sink followed by a restart of the Benthos instance.

If you intend to preserve at-least-once delivery guarantees, you can avoid this
problem by using a memory-based cache. This is a compromise that can achieve
effective deduplication, but parallel deployments of the pipeline as well as
service restarts increase the chances of duplicates passing undetected.

## Fields

### `cache`

The [`cache` resource](/docs/components/caches/about) to target with this processor.


Type: `string`  
Default: `""`  

### `hash`

The hash type to use.


Type: `string`  
Default: `"none"`  
Options: `none`, `xxhash`.

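As a small illustration, the sketch below leaves `key` empty so that deduplication is based on the message contents, and sets `hash` to `xxhash` so that those contents are hashed before being cached. The cache name `foocache` is an assumption.

```yaml
pipeline:
  processors:
    - dedupe:
        cache: foocache   # hypothetical cache resource name
        hash: xxhash      # hash the message contents before caching
```
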
### `key`

An optional key to use for deduplication (instead of the entire message contents).
This field supports [interpolation functions](/docs/configuration/interpolation#bloblang-queries).


Type: `string`  
Default: `""`  

### `drop_on_err`

Whether messages should be dropped when the cache returns an error.


Type: `bool`  
Default: `true`  

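If letting a possible duplicate through is preferable to losing a message during a cache outage, the default can be overridden. The following is a sketch only, with `foocache` and the `id` field as assumed names.

```yaml
pipeline:
  processors:
    - dedupe:
        cache: foocache         # hypothetical cache resource name
        key: ${! json("id") }   # hypothetical JSON field used as the key
        drop_on_err: false      # keep messages when the cache returns an error
```
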
### `parts`

An array of message indexes within the batch to base deduplication on. If left empty, all messages are included. This field is only applicable when batching messages [at the input level](/docs/configuration/batching).


Type: `array`  
Default: `[0]`  
