---
title: sequence
type: input
status: stable
categories: ["Utility"]
---

<!--
     THIS FILE IS AUTOGENERATED!

     To make changes please edit the contents of:

     lib/input/sequence.go
-->

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

Reads messages from a sequence of child inputs, starting with the first and once
that input gracefully terminates starts consuming from the next, and so on.

<Tabs defaultValue="common" values={[
  { label: 'Common', value: 'common', },
  { label: 'Advanced', value: 'advanced', },
]}>

<TabItem value="common">

```yaml
# Common config fields, showing default values
input:
  label: ""
  sequence:
    inputs: []
```

</TabItem>
<TabItem value="advanced">

```yaml
# All config fields, showing default values
input:
  label: ""
  sequence:
    sharded_join:
      type: none
      id_path: ""
      iterations: 1
      merge_strategy: array
    inputs: []
```

</TabItem>
</Tabs>

This input is useful for consuming from inputs that have an explicit end but
must not be consumed in parallel.

## Examples

<Tabs defaultValue="End of Stream Message" values={[
  { label: 'End of Stream Message', value: 'End of Stream Message', },
  { label: 'Joining Data (Simple)', value: 'Joining Data (Simple)', },
  { label: 'Joining Data (Advanced)', value: 'Joining Data (Advanced)', },
]}>

<TabItem value="End of Stream Message">

A common use case for sequence might be to generate a message at the end of our main input. With the following config, once the records within `./dataset.csv` are exhausted, our final payload `{"status":"finished"}` will be routed through the pipeline.

```yaml
input:
  sequence:
    inputs:
      - csv:
          paths: [ ./dataset.csv ]
      - generate:
          count: 1
          mapping: 'root = {"status":"finished"}'
```

</TabItem>
<TabItem value="Joining Data (Simple)">

Benthos can be used to join unordered data from fragmented datasets in memory by specifying a common identifier field and a number of sharded iterations. For example, given two CSV files, the first called "main.csv", which contains rows of user data:

```csv
uuid,name,age
AAA,Melanie,34
BBB,Emma,28
CCC,Geri,45
```

And the second called "hobbies.csv" that, for each user, contains zero or more rows of hobbies:

```csv
uuid,hobby
CCC,pokemon go
AAA,rowing
AAA,golf
```

We can parse and join this data into a single dataset:

```json
{"uuid":"AAA","name":"Melanie","age":34,"hobbies":["rowing","golf"]}
{"uuid":"BBB","name":"Emma","age":28}
{"uuid":"CCC","name":"Geri","age":45,"hobbies":["pokemon go"]}
```

With the following config:

```yaml
input:
  sequence:
    sharded_join:
      type: full-outter
      id_path: uuid
      merge_strategy: array
    inputs:
      - csv:
          paths:
            - ./hobbies.csv
            - ./main.csv
```

</TabItem>
<TabItem value="Joining Data (Advanced)">

In this example we join unordered and fragmented data from a combination of CSV files and newline-delimited JSON documents by specifying multiple sequence inputs, each with its own processors for extracting the structured data.

The first file, "main.csv", contains straightforward CSV data:

```csv
uuid,name,age
AAA,Melanie,34
BBB,Emma,28
CCC,Geri,45
```

And the second file, "hobbies.ndjson", contains JSON documents, one per line, that associate an identifier with an array of hobbies.
However, these data objects are in a nested format:

```json
{"document":{"uuid":"CCC","hobbies":[{"type":"pokemon go"}]}}
{"document":{"uuid":"AAA","hobbies":[{"type":"rowing"},{"type":"golf"}]}}
```

And so we will want to map these into a flattened structure before the join, after which we will end up with a single dataset that looks like this:

```json
{"uuid":"AAA","name":"Melanie","age":34,"hobbies":["rowing","golf"]}
{"uuid":"BBB","name":"Emma","age":28}
{"uuid":"CCC","name":"Geri","age":45,"hobbies":["pokemon go"]}
```

With the following config:

```yaml
input:
  sequence:
    sharded_join:
      type: full-outter
      id_path: uuid
      iterations: 10
      merge_strategy: array
    inputs:
      - csv:
          paths: [ ./main.csv ]
      - file:
          codec: lines
          paths: [ ./hobbies.ndjson ]
        processors:
          - bloblang: |
              root.uuid = this.document.uuid
              root.hobbies = this.document.hobbies.map_each(this.type)
```

</TabItem>
</Tabs>

## Fields

### `sharded_join`

EXPERIMENTAL: Provides a way to perform outer joins of arbitrarily structured and unordered data resulting from the input sequence, even when the overall size of the data surpasses the memory available on the machine.

When configured, the sequence of inputs will be consumed one or more times according to the number of iterations, and when more than one iteration is specified each iteration will process an entirely different set of messages by sharding them by the ID field. Increasing the number of iterations reduces the memory consumption at the cost of needing to fully parse the data each time.

Each message must be structured (JSON or otherwise processed into a structured form) and the fields will be aggregated with those of other messages sharing the same ID.
At the end of each iteration the joined messages are flushed downstream before the next iteration begins, hence keeping memory usage limited.

Type: `object`
Requires version 3.40.0 or newer

### `sharded_join.type`

The type of join to perform. A `full-outter` join ensures that all identifiers seen in any of the input sequences are sent, and is performed by consuming all input sequences before flushing the joined results. An `outter` join consumes all input sequences but only writes data joined from the last input in the sequence, similar to a left or right outer join. With an `outter` join, if an identifier appears multiple times within the final sequence input it will be flushed each time it appears.

Type: `string`
Default: `"none"`
Options: `none`, `full-outter`, `outter`.

### `sharded_join.id_path`

A [dot path](/docs/configuration/field_paths) that points to a common field within messages of each fragmented data set and can be used to join them. Messages that are not structured or are missing this field will be dropped. This field must be set in order to enable joins.

Type: `string`
Default: `""`

### `sharded_join.iterations`

The total number of iterations (shards). Increasing this number will increase the overall time taken to process the data, but reduces the memory used in the process. The real memory usage required is significantly higher than the raw size of the data, and therefore the number of iterations should be at least an order of magnitude higher than the overall size of the dataset divided by the available memory.

Type: `int`
Default: `1`

### `sharded_join.merge_strategy`

The chosen strategy to use when a data join would otherwise result in a collision of field values. The strategy `array` means non-array colliding values are placed into an array and colliding arrays are merged.
The strategy `replace` replaces old values with new values. The strategy `keep` keeps the old value.

Type: `string`
Default: `"array"`
Options: `array`, `replace`, `keep`.

### `inputs`

An array of inputs to read from sequentially.

Type: `array`
Default: `[]`
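
As a minimal sketch of sequential consumption, the following config drains one file input entirely before moving on to the next (the file names here are hypothetical placeholders):

```yaml
input:
  sequence:
    inputs:
      # Consumed first; the sequence only advances once this input
      # reaches the end of its files and terminates gracefully.
      - file:
          codec: lines
          paths: [ ./first.jsonl ]
      # Consumed second, after the first input is exhausted.
      - file:
          codec: lines
          paths: [ ./second.jsonl ]
```

Because each child input must signal a graceful end, this pattern suits bounded inputs such as files; an unbounded input placed earlier in the list would prevent the sequence from ever advancing.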