<h1 align="center">parquet-go</h1>
<p align="center">
        <a href="https://github.com/fraugster/parquet-go/releases"><img src="https://img.shields.io/github/v/tag/fraugster/parquet-go.svg?color=brightgreen&label=version&sort=semver"></a>
        <a href="https://circleci.com/gh/fraugster/parquet-go/tree/master"><img src="https://circleci.com/gh/fraugster/parquet-go/tree/master.svg?style=shield"></a>
        <a href="https://goreportcard.com/report/github.com/fraugster/parquet-go"><img src="https://goreportcard.com/badge/github.com/fraugster/parquet-go"></a>
        <a href="https://codecov.io/gh/fraugster/parquet-go"><img src="https://codecov.io/gh/fraugster/parquet-go/branch/master/graph/badge.svg"/></a>
        <a href="https://godoc.org/github.com/fraugster/parquet-go"><img src="https://img.shields.io/badge/godoc-reference-blue.svg?color=blue"></a>
        <a href="https://github.com/fraugster/parquet-go/blob/master/LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-blue"></a>
</p>

---

parquet-go is an implementation of the [Apache Parquet file format](https://github.com/apache/parquet-format)
in Go. It provides functionality to both read and write parquet files, as well
as high-level functionality to manage the data schema of parquet files, to
write Go objects directly to parquet files using automatic or custom
marshalling, and to read records from parquet files into Go objects using
automatic or custom unmarshalling.

parquet is a file format to store nested data structures in a flat columnar
format. By storing data in a column-oriented way, it allows for efficient
reading of individual columns without having to read and decode complete rows.
This enables faster processing when the file format is used in conjunction
with distributed data processing frameworks like Apache Hadoop or distributed
SQL query engines like Presto and AWS Athena.
This implementation is divided into several packages. The top-level package is
the low-level implementation of the parquet file format. It is accompanied by
the sub-packages parquetschema and floor. parquetschema provides functionality
to parse textual schema definitions as well as the data types to manually or
programmatically construct schema definitions. floor is a high-level wrapper
around the low-level package. It provides functionality to open parquet files
and to read from or write to them using automatic or custom marshalling and
unmarshalling.

## Supported Features

| Feature | Read | Write | Note |
| --- | ---- | ---- | --- |
| Compression | Yes | Yes | Only GZIP and SNAPPY are supported out of the box, but it is possible to add other compressors, see below. |
| Dictionary Encoding | Yes | Yes | |
| Run Length Encoding / Bit-Packing Hybrid | Yes | Yes | The reader can read RLE/bit-packed encoding, but the writer only uses bit-packing. |
| Delta Encoding | Yes | Yes | |
| Byte Stream Split | No | No | |
| Data page V1 | Yes | Yes | |
| Data page V2 | Yes | Yes | |
| Statistics in page meta data | No | Yes | Page meta data is generally not made available to users and not used by parquet-go. |
| Index Pages | No | No | |
| Dictionary Pages | Yes | Yes | |
| Encryption | No | No | |
| Bloom Filter | No | No | |
| Logical Types | Yes | Yes | Support for logical types is in the high-level package (floor); the low-level parquet library only supports the basic types, see the type mapping table. |

## Supported Data Types

| Type in parquet | Type in Go | Note |
| ----------------------- | --------------- | ---- |
| boolean | bool | |
| int32 | int32 | See the note about the int type below. |
| int64 | int64 | See the note about the int type below. |
| int96 | [12]byte | |
| float | float32 | |
| double | float64 | |
| byte_array | []byte | |
| fixed_len_byte_array(N) | [N]byte, []byte | Use any positive number for `N`. |

Note: the low-level implementation only supports int32 for the INT32 type and
int64 for the INT64 type in parquet. Plain int or uint are not supported. The
high-level `floor` package contains more extensive support for these data
types.

## Supported Logical Types

| Logical Type | Mapped to Go types | Note |
| -------------- | ----------------------- | ---- |
| STRING | string, []byte | |
| DATE | int32, time.Time | int32: days since Unix epoch (Jan 01 1970 00:00:00 UTC); time.Time only in `floor`. |
| TIME | int32, int64, time.Time | int32: TIME(MILLIS, ...); int64: TIME(MICROS, ...) and TIME(NANOS, ...); time.Time only in `floor`. |
| TIMESTAMP | int64, int96, time.Time | time.Time only in `floor`. |
| UUID | [16]byte | |
| LIST | []T | Slices of any type. |
| MAP | map[T1]T2 | Maps with any key and value types. |
| ENUM | string, []byte | |
| BSON | []byte | |
| DECIMAL | []byte, [N]byte | |
| INT | {,u}int{8,16,32,64} | The implementation is loose and will allow any INT logical type to be mapped to any signed or unsigned Go int type. |

## Supported Converted Types

| Converted Type | Mapped to Go types | Note |
| -------------------- | ------------------- | ---- |
| UTF8 | string, []byte | |
| TIME\_MILLIS | int32 | Number of milliseconds since the beginning of the day. |
| TIME\_MICROS | int64 | Number of microseconds since the beginning of the day. |
| TIMESTAMP\_MILLIS | int64 | Number of milliseconds since Unix epoch (Jan 01 1970 00:00:00 UTC). |
| TIMESTAMP\_MICROS | int64 | Number of microseconds since Unix epoch (Jan 01 1970 00:00:00 UTC). |
| {,U}INT\_{8,16,32,64} | {,u}int{8,16,32,64} | The implementation is loose and will allow any converted type to be mapped to any Go int type. |
| INTERVAL | [12]byte | |

Please note that converted types are deprecated; logical types should be used
instead.

## Supported Compression Algorithms

| Compression Algorithm | Supported | Notes |
| --------------------- | --------- | ----- |
| GZIP | Yes; out of the box | |
| SNAPPY | Yes; out of the box | |
| BROTLI | Yes; by importing [github.com/akrennmair/parquet-go-brotli](https://github.com/akrennmair/parquet-go-brotli) | |
| LZ4 | No | LZ4 has been deprecated as of parquet-format 2.9.0. |
| LZ4\_RAW | Yes; by importing [github.com/akrennmair/parquet-go-lz4raw](https://github.com/akrennmair/parquet-go-lz4raw) | |
| LZO | Yes; by importing [github.com/akrennmair/parquet-go-lzo](https://github.com/akrennmair/parquet-go-lzo) | Uses a cgo wrapper around the original LZO implementation, which is licensed as GPLv2+. |
| ZSTD | Yes; by importing [github.com/akrennmair/parquet-go-zstd](https://github.com/akrennmair/parquet-go-zstd) | |

## Schema Definition

parquet-go comes with support for textual schema definitions. The sub-package
`parquetschema` provides a parser that turns a textual schema definition into
the data type used elsewhere in the library to specify parquet schemas.
The syntax has been mostly reverse-engineered from a similar format that is
supported but barely documented in [Parquet's Java implementation](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java).

For the full syntax, please have a look at the [parquetschema package Go documentation](http://godoc.org/github.com/fraugster/parquet-go/parquetschema).

Generally, the schema definition describes the structure of a message. Parquet
then flattens this into a purely column-based structure when writing the
actual data to parquet files.

A message consists of a number of fields. Each field either has a type or is a
group. A group itself consists of a number of fields, which in turn can either
have a type or be groups themselves. This allows for theoretically unlimited
levels of hierarchy.

Each field has a repetition type, describing whether a field is required (i.e.
a value has to be present), optional (i.e. a value can be present but doesn't
have to be) or repeated (i.e. zero or more values can be present). Optionally,
each field (including groups) can have an annotation, which contains a logical
type or converted type that says something about the general structure at this
point, e.g. `LIST` indicates a more complex list structure, and `MAP` a
key-value map structure, both following certain conventions. Optionally, a
typed field can also have a numeric field ID. The field ID has no purpose
intrinsic to the parquet file format.

Here is a simple example of a message with a few typed fields:

```
message coordinates {
    required double latitude;
    required double longitude;
    optional int32 elevation = 1;
    optional binary comment (STRING);
}
```

In this example, we have a message with four typed fields, two of them
required and two of them optional.
`double`, `int32` and `binary` describe the fundamental data types of the
fields, while `latitude`, `longitude`, `elevation` and `comment` are the field
names. The parentheses contain the annotation `STRING`, which indicates that
the field is a string, encoded as binary data, i.e. a byte array. The field
`elevation` also has a field ID of `1`, written as a numeric literal and
separated from the field name by the equal sign `=`.

In the following example, we introduce a plain group as well as two nested
groups annotated with logical types to indicate certain data structures:

```
message transaction {
    required fixed_len_byte_array(16) txn_id (UUID);
    required int32 amount;
    required int96 txn_ts;
    optional group attributes {
        optional int64 shop_id;
        optional binary country_code (STRING);
        optional binary postcode (STRING);
    }
    required group items (LIST) {
        repeated group list {
            required int64 item_id;
            optional binary name (STRING);
        }
    }
    optional group user_attributes (MAP) {
        repeated group key_value {
            required binary key (STRING);
            required binary value (STRING);
        }
    }
}
```

In this example, we see a number of top-level fields, some of which are
groups. The first group, named `attributes`, is simply a group of typed
fields.

The second group, `items`, is annotated as a `LIST` and in turn contains a
`repeated group list`, which in turn contains a number of typed fields. When
a group is annotated as `LIST`, it needs to follow a particular convention:
it has to contain a `repeated group` named `list`. Any fields can be present
inside this group.

The third group, `user_attributes`, is annotated as a `MAP`. Similar to
`LIST`, it follows certain conventions.
In particular, it has to contain only a single `repeated group` with the name
`key_value`, which in turn contains exactly two fields, one named `key`, the
other named `value`. This represents a map structure in which each key is
associated with one value.

## Examples

For examples of how to use both the low-level and high-level APIs of this
library, please see the `examples` directory. You can also check out the
accompanying tools (see below) for more advanced examples. The tools are
located in the `cmd` directory.

## Tools

`parquet-go` comes with tooling to inspect and generate parquet files.

### parquet-tool

`parquet-tool` allows you to inspect the meta data, the schema and the number
of rows of a parquet file, as well as print its content. You can also use it
to split an existing parquet file into multiple smaller files.

Install it by running `go get github.com/fraugster/parquet-go/cmd/parquet-tool`
on your command line. For more detailed help on how to use the tool, consult
`parquet-tool --help`.

### csv2parquet

`csv2parquet` converts an existing CSV file into a parquet file. By default,
all columns are simply turned into strings, but you can provide type hints to
influence the generated parquet schema.

You can install this tool by running `go get github.com/fraugster/parquet-go/cmd/csv2parquet`
on your command line. For more help, consult `csv2parquet --help`.

## Contributing

If you want to hack on this repository, please read the short
[CONTRIBUTING.md](CONTRIBUTING.md) guide first.

## Versioning

We use [SemVer](http://semver.org/) for versioning. For the versions
available, see the [tags on this repository][tags].

## Authors

- **Forud Ghafouri** - *Initial work* [fzerorubigd](https://github.com/fzerorubigd)
- **Andreas Krennmair** - *floor package, schema parser* [akrennmair](https://github.com/akrennmair)
- **Stefan Koshiw** - *Engineering Manager for Core Team* [panamafrancis](https://github.com/panamafrancis)

See also the list of [contributors][contributors] who participated in this project.

## Special Mentions

- **Nathan Hanna** - *proposal and prototyping of automatic schema generator* [jnathanh](https://github.com/jnathanh)

## License

Copyright 2021 Fraugster GmbH

This project is licensed under the Apache-2 License - see the [LICENSE](LICENSE) file for details.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[tags]: https://github.com/fraugster/parquet-go/tags
[contributors]: https://github.com/fraugster/parquet-go/graphs/contributors