---
title: dedupe
type: processor
status: stable
categories: ["Utility"]
---

<!--
     THIS FILE IS AUTOGENERATED!

     To make changes please edit the contents of:
     lib/processor/dedupe.go
-->

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';


Deduplicates message batches by caching selected (and optionally hashed)
messages, dropping batches that are already cached.


<Tabs defaultValue="common" values={[
  { label: 'Common', value: 'common', },
  { label: 'Advanced', value: 'advanced', },
]}>

<TabItem value="common">

```yaml
# Common config fields, showing default values
label: ""
dedupe:
  cache: ""
  hash: none
  key: ""
  drop_on_err: true
```

</TabItem>
<TabItem value="advanced">

```yaml
# All config fields, showing default values
label: ""
dedupe:
  cache: ""
  hash: none
  key: ""
  drop_on_err: true
  parts:
    - 0
```

</TabItem>
</Tabs>

This processor acts across an entire batch. In order to deduplicate individual
messages within a batch, use this processor with the
[`for_each`](/docs/components/processors/for_each) processor.

Optionally, the `key` field can be populated in order to hash on a
function interpolated string rather than the full contents of messages. This
allows you to deduplicate based on dynamic fields within a message, such as its
metadata, JSON fields, etc. A full list of interpolation functions can be found
[here](/docs/configuration/interpolation#bloblang-queries).

For example, the following config would deduplicate based on the concatenated
values of the metadata field `kafka_key` and the value of the JSON
path `id` within the message contents:

```yaml
pipeline:
  processors:
    - dedupe:
        cache: foocache
        key: ${! meta("kafka_key") }-${! json("id") }
```

Caches should be configured as a resource; for more information check out the
[documentation here](/docs/components/caches/about).

When using this processor with an output target that might fail you should
always wrap the output within a [`retry`](/docs/components/outputs/retry)
block. This ensures that during outages your messages aren't reprocessed after
failures, which would result in them being dropped by the deduplication cache.

## Delivery Guarantees

Performing deduplication on a stream using a distributed cache voids any
at-least-once guarantees that it previously had. This is because the cache will
preserve message signatures even if the message fails to leave the Benthos
pipeline, which would cause message loss in the event of an outage at the output
sink followed by a restart of the Benthos instance.

If you intend to preserve at-least-once delivery guarantees you can avoid this
problem by using a memory based cache. This is a compromise that can achieve
effective deduplication, but parallel deployments of the pipeline as well as
service restarts increase the chances of duplicates passing undetected.

## Fields

### `cache`

The [`cache` resource](/docs/components/caches/about) to target with this processor.


Type: `string`  
Default: `""`  

### `hash`

The hash type to use.


Type: `string`  
Default: `"none"`  
Options: `none`, `xxhash`.

### `key`

An optional key to use for deduplication (instead of the entire message contents).
This field supports [interpolation functions](/docs/configuration/interpolation#bloblang-queries).


Type: `string`  
Default: `""`  

### `drop_on_err`

Whether messages should be dropped when the cache returns an error.


Type: `bool`  
Default: `true`  

### `parts`

An array of message indexes within the batch to deduplicate based on.
If left empty all messages are included. This field is only applicable when batching messages [at the input level](/docs/configuration/batching).


Type: `array`  
Default: `[0]`  
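
## Example

As a sketch combining the guidance above, the config below deduplicates on a metadata key using an in-process memory cache (the compromise discussed under Delivery Guarantees) and wraps the output in a `retry` block. The `http_client` output target, its URL, and the cache `ttl` value are illustrative assumptions, not defaults:

```yaml
pipeline:
  processors:
    - dedupe:
        cache: dedupe_cache
        key: ${! meta("kafka_key") }

output:
  retry:
    output:
      # Illustrative output target; substitute your own.
      http_client:
        url: http://example.com/ingest
        verb: POST

cache_resources:
  - label: dedupe_cache
    memory:
      # Seconds to remember message signatures; tune to the window
      # within which duplicates are expected to arrive.
      ttl: 300
```

Because the memory cache lives inside the Benthos process, a restart clears it and parallel instances don't share it, so duplicates can pass undetected in those cases, as noted above.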