<h1 align="center">parquet-go</h1>
<p align="center">
        <a href="https://github.com/fraugster/parquet-go/releases"><img src="https://img.shields.io/github/v/tag/fraugster/parquet-go.svg?color=brightgreen&label=version&sort=semver"></a>
        <a href="https://circleci.com/gh/fraugster/parquet-go/tree/master"><img src="https://circleci.com/gh/fraugster/parquet-go/tree/master.svg?style=shield"></a>
        <a href="https://goreportcard.com/report/github.com/fraugster/parquet-go"><img src="https://goreportcard.com/badge/github.com/fraugster/parquet-go"></a>
        <a href="https://codecov.io/gh/fraugster/parquet-go"><img src="https://codecov.io/gh/fraugster/parquet-go/branch/master/graph/badge.svg"/></a>
        <a href="https://godoc.org/github.com/fraugster/parquet-go"><img src="https://img.shields.io/badge/godoc-reference-blue.svg?color=blue"></a>
        <a href="https://github.com/fraugster/parquet-go/blob/master/LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-blue"></a>
</p>

---

parquet-go is an implementation of the [Apache Parquet file format](https://github.com/apache/parquet-format)
in Go. It provides functionality to both read and write parquet files, as well
as high-level functionality to manage the data schema of parquet files, to
write Go objects directly to parquet files using automatic or custom
marshalling, and to read records from parquet files into Go objects using
automatic or custom unmarshalling.

parquet is a file format to store nested data structures in a flat columnar
format. By storing data in a column-oriented way, it allows for efficient
reading of individual columns without having to read and decode complete rows.
This enables faster processing when the file format is used in conjunction
with distributed data processing frameworks like Apache Hadoop or distributed
SQL query engines like Presto and AWS Athena.
This implementation is divided into several packages. The top-level package is
the low-level implementation of the parquet file format. It is accompanied by
the sub-packages parquetschema and floor. parquetschema provides functionality
to parse textual schema definitions as well as the data types to manually or
programmatically construct schema definitions. floor is a high-level wrapper
around the low-level package. It provides functionality to open parquet files
and to read from or write to them using automatic or custom marshalling and
unmarshalling.

## Supported Features

| Feature | Read | Write | Note |
| --- | ---- | ---- | --- |
| Compression | Yes | Yes | Only GZIP and SNAPPY are supported out of the box, but it is possible to add other compressors, see below. |
| Dictionary Encoding | Yes | Yes | |
| Run Length Encoding / Bit-Packing Hybrid | Yes | Yes | The reader can read RLE/bit-packed encoding, but the writer only uses bit-packing. |
| Delta Encoding | Yes | Yes | |
| Byte Stream Split | No | No | |
| Data page V1 | Yes | Yes | |
| Data page V2 | Yes | Yes | |
| Statistics in page meta data | No | Yes | Page meta data is generally not made available to users and not used by parquet-go. |
| Index Pages | No | No | |
| Dictionary Pages | Yes | Yes | |
| Encryption | No | No | |
| Bloom Filter | No | No | |
| Logical Types | Yes | Yes | Support for logical types is in the high-level package (floor); the low-level parquet library only supports the basic types, see the type mapping table. |

## Supported Data Types

| Type in parquet | Type in Go | Note |
| ----------------------- | --------------- | ---- |
| boolean | bool | |
| int32 | int32 | See the note about the int type below. |
| int64 | int64 | See the note about the int type below. |
| int96 | [12]byte | |
| float | float32 | |
| double | float64 | |
| byte_array | []byte | |
| fixed_len_byte_array(N) | [N]byte, []byte | Use any positive number for `N`. |

Note: the low-level implementation only supports int32 for the INT32 type and
int64 for the INT64 type in parquet. Plain int or uint are not supported. The
high-level `floor` package contains more extensive support for these data
types.

## Supported Logical Types

| Logical Type | Mapped to Go types | Note |
| -------------- | ----------------------- | ---- |
| STRING | string, []byte | |
| DATE | int32, time.Time | int32: days since Unix epoch (Jan 01 1970 00:00:00 UTC); time.Time only in `floor`. |
| TIME | int32, int64, time.Time | int32: TIME(MILLIS, ...); int64: TIME(MICROS, ...) and TIME(NANOS, ...); time.Time only in `floor`. |
| TIMESTAMP | int64, int96, time.Time | time.Time only in `floor`. |
| UUID | [16]byte | |
| LIST | []T | Slices of any type. |
| MAP | map[T1]T2 | Maps with any key and value types. |
| ENUM | string, []byte | |
| BSON | []byte | |
| DECIMAL | []byte, [N]byte | |
| INT | {,u}int{8,16,32,64} | The implementation is loose and will allow any INT logical type to be mapped to any signed or unsigned Go int type. |

## Supported Converted Types

| Converted Type | Mapped to Go types | Note |
| -------------------- | ------------------- | ---- |
| UTF8 | string, []byte | |
| TIME\_MILLIS | int32 | Number of milliseconds since the beginning of the day. |
| TIME\_MICROS | int64 | Number of microseconds since the beginning of the day. |
| TIMESTAMP\_MILLIS | int64 | Number of milliseconds since Unix epoch (Jan 01 1970 00:00:00 UTC). |
| TIMESTAMP\_MICROS | int64 | Number of microseconds since Unix epoch (Jan 01 1970 00:00:00 UTC). |
| {,U}INT\_{8,16,32,64} | {,u}int{8,16,32,64} | The implementation is loose and will allow any converted type to be mapped to any Go int type. |
| INTERVAL | [12]byte | |

Please note that converted types are deprecated; logical types should be used
instead.

## Supported Compression Algorithms

| Compression Algorithm | Supported | Notes |
| --------------------- | --------- | ----- |
| GZIP | Yes; out of the box | |
| SNAPPY | Yes; out of the box | |
| BROTLI | Yes; by importing [github.com/akrennmair/parquet-go-brotli](https://github.com/akrennmair/parquet-go-brotli) | |
| LZ4 | No | LZ4 has been deprecated as of parquet-format 2.9.0. |
| LZ4\_RAW | Yes; by importing [github.com/akrennmair/parquet-go-lz4raw](https://github.com/akrennmair/parquet-go-lz4raw) | |
| LZO | Yes; by importing [github.com/akrennmair/parquet-go-lzo](https://github.com/akrennmair/parquet-go-lzo) | Uses a cgo wrapper around the original LZO implementation, which is licensed as GPLv2+. |
| ZSTD | Yes; by importing [github.com/akrennmair/parquet-go-zstd](https://github.com/akrennmair/parquet-go-zstd) | |

## Schema Definition

parquet-go comes with support for textual schema definitions. The sub-package
`parquetschema` provides a parser that turns a textual schema definition into
the data type used elsewhere in the library to specify parquet schemas.
The syntax has been mostly reverse-engineered from a similar format that is
supported but barely documented in [Parquet's Java implementation](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java).

For the full syntax, please have a look at the [parquetschema package Go documentation](http://godoc.org/github.com/fraugster/parquet-go/parquetschema).

Generally, the schema definition describes the structure of a message. Parquet
then flattens this into a purely column-based structure when writing the
actual data to parquet files.

A message consists of a number of fields. Each field either has a type or is a
group. A group itself consists of a number of fields, which in turn can either
have a type or be groups themselves. This allows for theoretically unlimited
levels of hierarchy.

Each field has a repetition type, describing whether a field is required (i.e.
a value has to be present), optional (i.e. a value can be present but doesn't
have to be) or repeated (i.e. zero or more values can be present). Optionally,
each field (including groups) can have an annotation, which contains a logical
type or converted type that says something about the general structure at this
point, e.g. `LIST` indicates a more complex list structure, and `MAP` a
key-value map structure, both following certain conventions. Optionally, a
typed field can also have a numeric field ID. The field ID has no purpose
intrinsic to the parquet file format.

Here is a simple example of a message with a few typed fields:

```
message coordinates {
    required double latitude;
    required double longitude;
    optional int32 elevation = 1;
    optional binary comment (STRING);
}
```

In this example, we have a message with four typed fields, two of them
required and two of them optional.
`double`, `int32` and `binary` describe the fundamental data types of the
fields, while `latitude`, `longitude`, `elevation` and `comment` are the field
names. The parentheses contain the annotation `STRING`, which indicates that
the field is a string, encoded as binary data, i.e. a byte array. The field
`elevation` also has a field ID of `1`, written as a numeric literal and
separated from the field name by the equal sign `=`.

In the following example, we introduce a plain group as well as two nested
groups annotated with logical types to indicate certain data structures:

```
message transaction {
    required fixed_len_byte_array(16) txn_id (UUID);
    required int32 amount;
    required int96 txn_ts;
    optional group attributes {
        optional int64 shop_id;
        optional binary country_code (STRING);
        optional binary postcode (STRING);
    }
    required group items (LIST) {
        repeated group list {
            required int64 item_id;
            optional binary name (STRING);
        }
    }
    optional group user_attributes (MAP) {
        repeated group key_value {
            required binary key (STRING);
            required binary value (STRING);
        }
    }
}
```

In this example, we see a number of top-level fields, some of which are
groups. The first group, named `attributes`, is simply a group of typed
fields.

The second group, `items`, is annotated as a `LIST` and in turn contains a
`repeated group list`, which in turn contains a number of typed fields. When
a group is annotated as `LIST`, it needs to follow a particular convention:
it has to contain a `repeated group` named `list`. Any fields can be present
inside this group.

The third group, `user_attributes`, is annotated as a `MAP`. Similar to
`LIST`, it follows certain conventions.
In particular, it has to contain only a single `repeated group` with the name
`key_value`, which in turn contains exactly two fields, one named `key`, the
other named `value`. This represents a map structure in which each key is
associated with one value.

## Examples

For examples of how to use both the low-level and high-level APIs of this
library, please see the `examples` directory. You can also check out the
accompanying tools (see below) for more advanced examples. The tools are
located in the `cmd` directory.

## Tools

`parquet-go` comes with tooling to inspect and generate parquet files.

### parquet-tool

`parquet-tool` allows you to inspect the meta data, the schema and the number
of rows of a parquet file, as well as print its content. You can also use it
to split an existing parquet file into multiple smaller files.

Install it by running `go get github.com/fraugster/parquet-go/cmd/parquet-tool`
on your command line. For more detailed help on how to use the tool, consult
`parquet-tool --help`.

### csv2parquet

`csv2parquet` converts an existing CSV file into a parquet file. By default,
all columns are simply turned into strings, but you can provide type hints to
influence the generated parquet schema.

You can install this tool by running `go get github.com/fraugster/parquet-go/cmd/csv2parquet`
on your command line. For more help, consult `csv2parquet --help`.

## Contributing

If you want to hack on this repository, please read the short
[CONTRIBUTING.md](CONTRIBUTING.md) guide first.

## Versioning

We use [SemVer](http://semver.org/) for versioning. For the versions
available, see the [tags on this repository][tags].

## Authors

- **Forud Ghafouri** - *Initial work* [fzerorubigd](https://github.com/fzerorubigd)
- **Andreas Krennmair** - *floor package, schema parser* [akrennmair](https://github.com/akrennmair)
- **Stefan Koshiw** - *Engineering Manager for Core Team* [panamafrancis](https://github.com/panamafrancis)

See also the list of [contributors][contributors] who participated in this project.

## Special Mentions

- **Nathan Hanna** - *proposal and prototyping of automatic schema generator* [jnathanh](https://github.com/jnathanh)

## License

Copyright 2021 Fraugster GmbH

This project is licensed under the Apache-2 License - see the [LICENSE](LICENSE) file for details.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[tags]: https://github.com/fraugster/parquet-go/tags
[contributors]: https://github.com/fraugster/parquet-go/graphs/contributors