---
title: parquet
type: processor
status: experimental
categories: ["Parsing"]
---

<!--
     THIS FILE IS AUTOGENERATED!

     To make changes please edit the contents of:
     lib/processor/parquet.go
-->

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

:::caution EXPERIMENTAL
This component is experimental and therefore subject to change or removal outside of major version releases.
:::

Converts batches of documents to or from [Parquet files](https://parquet.apache.org/documentation/latest/).

Introduced in version 3.62.0.

```yaml
# Config fields, showing default values
label: ""
parquet:
  operator: ""
  compression: snappy
  schema_file: ""
  schema: ""
```

### Troubleshooting

This processor is experimental and the error messages that it provides are often vague and unhelpful. An error message of the form `interface {} is nil, not <value type>` implies that a field of the given type was expected but not found in the processed message when writing parquet files.

Unfortunately the name of the field will sometimes be missing from the error message. In that case it's worth double-checking the schema you provided to make sure that there are no typos in the field names. If that doesn't reveal the issue, it can help to mark fields as OPTIONAL in the schema and gradually change them back to REQUIRED until the error returns.

### Defining the Schema

The schema must be specified as a JSON string, containing an object that describes the fields expected at the root of each document.
Each field can itself have more fields defined, allowing for nested structures:

```json
{
  "Tag": "name=root, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=name, inname=NameIn, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
    {"Tag": "name=age, inname=Age, type=INT32, repetitiontype=REQUIRED"},
    {"Tag": "name=id, inname=Id, type=INT64, repetitiontype=REQUIRED"},
    {"Tag": "name=weight, inname=Weight, type=FLOAT, repetitiontype=REQUIRED"},
    {
      "Tag": "name=favPokemon, inname=FavPokemon, type=LIST, repetitiontype=OPTIONAL",
      "Fields": [
        {"Tag": "name=name, inname=PokeName, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=coolness, inname=Coolness, type=FLOAT, repetitiontype=REQUIRED"}
      ]
    }
  ]
}
```

## Fields

### `operator`

Determines whether the processor converts messages into a parquet file or expands parquet files into messages. Converting into JSON allows subsequent processors and mappings to convert the data into any other format.


Type: `string`

| Option | Summary |
|---|---|
| `from_json` | Compress a batch of JSON documents into a file. |
| `to_json` | Expand a file into one or more JSON messages. |


### `compression`

The type of compression to use when writing parquet files; this field is ignored when consuming parquet files.


Type: `string`
Default: `"snappy"`
Options: `uncompressed`, `snappy`, `gzip`, `lz4`, `zstd`.

### `schema_file`

A file path containing a schema used to describe the parquet files being generated or consumed. The format of the schema is a JSON document detailing the tag and fields of documents, documented at: https://pkg.go.dev/github.com/xitongsys/parquet-go#readme-json. Either a `schema_file` or `schema` field must be specified.

Type: `string`

```yaml
# Examples

schema_file: schemas/foo.json
```

### `schema`

A schema used to describe the parquet files being generated or consumed. The format of the schema is a JSON document detailing the tag and fields of documents, documented at: https://pkg.go.dev/github.com/xitongsys/parquet-go#readme-json. Either a `schema_file` or `schema` field must be specified.


Type: `string`

```yaml
# Examples

schema: |-
  {
    "Tag": "name=root, repetitiontype=REQUIRED",
    "Fields": [
      {"Tag":"name=name,inname=NameIn,type=BYTE_ARRAY,convertedtype=UTF8, repetitiontype=REQUIRED"},
      {"Tag":"name=age,inname=Age,type=INT32,repetitiontype=REQUIRED"}
    ]
  }
```

## Examples

<Tabs defaultValue="Batching Output Files" values={[
  { label: 'Batching Output Files', value: 'Batching Output Files', },
]}>

<TabItem value="Batching Output Files">

Parquet is often used to write batches of documents to a file store.

```yaml
output:
  broker:
    outputs:
      - file:
          path: ./stuff-${! uuid_v4() }.parquet
          codec: all-bytes
    batching:
      count: 100
      period: 30s
      processors:
        - parquet:
            operator: from_json
            schema: |-
              {
                "Tag": "name=root, repetitiontype=REQUIRED",
                "Fields": [
                  {"Tag":"name=name,inname=NameIn,type=BYTE_ARRAY,convertedtype=UTF8, repetitiontype=REQUIRED"},
                  {"Tag":"name=age,inname=Age,type=INT32,repetitiontype=REQUIRED"}
                ]
              }
```

</TabItem>
</Tabs>
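Consuming works in reverse: feed whole parquet files to the processor as raw bytes and use the `to_json` operator to expand each file into JSON messages. The following is a minimal sketch, not an autogenerated example; the input glob `./stuff-*.parquet` is illustrative and the schema must match the one the files were written with:

```yaml
input:
  file:
    # Hypothetical path glob; point this at your own parquet files.
    paths: [ ./stuff-*.parquet ]
    # Each message must contain the entire file as raw bytes.
    codec: all-bytes

pipeline:
  processors:
    - parquet:
        operator: to_json
        schema: |-
          {
            "Tag": "name=root, repetitiontype=REQUIRED",
            "Fields": [
              {"Tag":"name=name,inname=NameIn,type=BYTE_ARRAY,convertedtype=UTF8, repetitiontype=REQUIRED"},
              {"Tag":"name=age,inname=Age,type=INT32,repetitiontype=REQUIRED"}
            ]
          }
```

The resulting JSON messages can then be reshaped by any subsequent processors or mappings.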