# parquet-go/parquet-go [![build status](https://github.com/parquet-go/parquet-go/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/parquet-go/parquet-go/actions) [![Go Report Card](https://goreportcard.com/badge/github.com/parquet-go/parquet-go)](https://goreportcard.com/report/github.com/parquet-go/parquet-go) [![Go Reference](https://pkg.go.dev/badge/github.com/parquet-go/parquet-go.svg)](https://pkg.go.dev/github.com/parquet-go/parquet-go)

High-performance Go library to manipulate parquet files, initially developed at
[Twilio Segment](https://segment.com/engineering).

![parquet-go-logo](https://github.com/parquet-go/parquet-go/assets/96151026/5b1f043b-2cee-4a64-a3c3-40d3353fecc0)

## Motivation

Parquet has been established as a powerful solution to represent columnar data
on persistent storage mediums, achieving levels of compression and query
performance that enable managing data sets at petabyte scale. In addition,
having data-intensive applications share a common format creates opportunities
for interoperation in our tool kits, providing greater leverage and value to
engineers maintaining and operating those systems.

The creation and evolution of large-scale data management systems, combined
with real-time expectations, comes with challenging maintenance and performance
requirements that existing Go solutions for parquet were not addressing.

The `parquet-go/parquet-go` package was designed and developed to respond to those
challenges, offering high-level APIs to read and write parquet files, while
keeping a low compute and memory footprint in order to be used in environments
where data volumes and cost constraints require software to achieve high levels
of efficiency.

## Specification

Columnar storage allows Parquet to store data more efficiently than, say,
JSON or Protobuf. For more information, refer to the [Parquet Format Specification](https://github.com/apache/parquet-format).

## Installation

The package is distributed as a standard Go module that programs can take a
dependency on and install with the following command:

```
go get github.com/parquet-go/parquet-go
```

Go 1.20 or later is required to use the package.

### Compatibility Guarantees

The package is currently released as a pre-v1 version, which gives maintainers
the freedom to break backward compatibility to help improve the APIs as we learn
which initial design decisions would need to be revisited to better support the
use cases that the library solves for. These occurrences are expected to be
rare, and documentation will be produced to guide users on how to adapt their
programs to breaking changes.

## Usage

The following sections describe how to use APIs exposed by the library,
highlighting the use cases with code examples to demonstrate how they are used
in practice.
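As a quick overview, here is a minimal, self-contained sketch that writes a few
rows to a local file and reads them back using the high-level helpers described
in the sections below (the file path is only illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/parquet-go/parquet-go"
)

type RowType struct { FirstName, LastName string }

func main() {
	// Write a couple of rows to a new parquet file.
	if err := parquet.WriteFile("file.parquet", []RowType{
		{FirstName: "Bob"},
		{FirstName: "Alice"},
	}); err != nil {
		log.Fatal(err)
	}

	// Read the rows back into memory.
	rows, err := parquet.ReadFile[RowType]("file.parquet")
	if err != nil {
		log.Fatal(err)
	}
	for _, row := range rows {
		fmt.Printf("%+v\n", row)
	}
}
```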
### Writing Parquet Files: [parquet.GenericWriter[T]](https://pkg.go.dev/github.com/parquet-go/parquet-go#GenericWriter)

A parquet file is a collection of rows sharing the same schema, arranged in
columns to support faster scan operations on subsets of the data set.

For simple use cases, the `parquet.WriteFile[T]` function allows the creation
of parquet files on the file system from a slice of Go values representing the
rows to write to the file.

```go
type RowType struct { FirstName, LastName string }

if err := parquet.WriteFile("file.parquet", []RowType{
	{FirstName: "Bob"},
	{FirstName: "Alice"},
}); err != nil {
	...
}
```

The `parquet.GenericWriter[T]` type denormalizes rows into columns, then encodes
the columns into a parquet file, generating row groups, column chunks, and pages
based on configurable heuristics.

```go
type RowType struct { FirstName, LastName string }

writer := parquet.NewGenericWriter[RowType](output)

_, err := writer.Write([]RowType{
	...
})
if err != nil {
	...
}

// Closing the writer is necessary to flush buffers and write the file footer.
if err := writer.Close(); err != nil {
	...
}
```

Explicit declaration of the parquet schema on a writer is useful when the
application needs to ensure that data written to a file adheres to a predefined
schema, which may differ from the schema derived from the writer's type
parameter. The `parquet.Schema` type is an in-memory representation of the
schema of parquet rows, translated from the type of Go values, and can be used
for this purpose.

```go
schema := parquet.SchemaOf(new(RowType))
writer := parquet.NewGenericWriter[any](output, schema)
...
```

### Reading Parquet Files: [parquet.GenericReader[T]](https://pkg.go.dev/github.com/parquet-go/parquet-go#GenericReader)

For simple use cases where the data set fits in memory and the program will
read most rows of the file, the `parquet.ReadFile[T]` function returns a slice
of Go values representing the rows read from the file.

```go
type RowType struct { FirstName, LastName string }

rows, err := parquet.ReadFile[RowType]("file.parquet")
if err != nil {
	...
}

for _, c := range rows {
	fmt.Printf("%+v\n", c)
}
```

The expected schema of rows can be explicitly declared when the reader is
constructed, which is useful to ensure that the program receives rows matching
a specific format; for example, when dealing with files from remote storage
sources that applications cannot trust to have used an expected schema.

Configuring the schema of a reader is done by passing a `parquet.Schema`
instance as argument when constructing a reader. When the schema is declared,
conversion rules implemented by the package are applied to ensure that rows
read by the application match the desired format (see **Evolving Parquet Schemas**).

```go
schema := parquet.SchemaOf(new(RowType))
reader := parquet.NewReader(file, schema)
...
```
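For data sets too large to fit in memory, the `parquet.GenericReader[T]` type
reads rows in batches. A minimal sketch, assuming `file` is an open `*os.File`
(or another supported `io.ReaderAt`):

```go
type RowType struct { FirstName, LastName string }

reader := parquet.NewGenericReader[RowType](file)
defer reader.Close()

rows := make([]RowType, 64) // read rows in batches of up to 64
for {
	n, err := reader.Read(rows)
	for _, row := range rows[:n] {
		fmt.Printf("%+v\n", row)
	}
	if err != nil {
		break // io.EOF when all rows have been read
	}
}
```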
### Inspecting Parquet Files: [parquet.File](https://pkg.go.dev/github.com/parquet-go/parquet-go#File)

Sometimes, lower-level APIs can be useful to leverage the columnar layout of
parquet files. The `parquet.File` type is intended to provide such features to
Go applications, by exposing APIs to iterate over the various parts of a
parquet file.

```go
f, err := parquet.OpenFile(file, size)
if err != nil {
	...
}

for _, rowGroup := range f.RowGroups() {
	for _, columnChunk := range rowGroup.ColumnChunks() {
		...
	}
}
```

### Evolving Parquet Schemas: [parquet.Convert](https://pkg.go.dev/github.com/parquet-go/parquet-go#Convert)

Parquet files embed all the metadata necessary to interpret their content,
including a description of the schema of the tables represented by the rows and
columns they contain.

Parquet files are also immutable; once written, there is no mechanism for
_updating_ a file. If their contents need to be changed, rows must be read,
modified, and written to a new file.

Because applications evolve, the schemas written to parquet files also tend to
evolve over time. These requirements create challenges when applications need
to operate on parquet files with heterogeneous schemas: algorithms that expect
new columns to exist may have issues dealing with rows that come from files with
mismatching schema versions.

To help build applications that can handle evolving schemas, `parquet-go/parquet-go`
implements conversion rules that create views of row groups to translate between
schema versions.

The `parquet.Convert` function is the low-level routine constructing conversion
rules from a source to a target schema. The function is used to build converted
views of `parquet.RowReader` or `parquet.RowGroup`, for example:

```go
type RowTypeV1 struct { ID int64; FirstName string }
type RowTypeV2 struct { ID int64; FirstName, LastName string }

source := parquet.SchemaOf(RowTypeV1{})
target := parquet.SchemaOf(RowTypeV2{})

conversion, err := parquet.Convert(target, source)
if err != nil {
	...
}

targetRowGroup := parquet.ConvertRowGroup(sourceRowGroup, conversion)
...
```

At this time, conversion rules only support adding or removing columns from
the schemas; no type conversions are performed, nor are there ways to rename
columns, etc. More advanced conversion rules may be added in the future.

Conversion rules are automatically applied by the `parquet.CopyRows` function
when the reader and writer passed to the function also implement the
`parquet.RowReaderWithSchema` and `parquet.RowWriterWithSchema` interfaces.
The copy determines whether the reader and writer schemas can be converted from
one to the other, and automatically applies the conversion rules to facilitate
the translation between schemas.
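For example, a sketch of such a copy, reusing the `RowTypeV2` type from the
example above and assuming `input`, `size`, and `output` are provided by the
surrounding program:

```go
f, err := parquet.OpenFile(input, size) // a file containing RowTypeV1 rows
if err != nil {
	...
}

rows := f.RowGroups()[0].Rows() // a parquet.RowReaderWithSchema
defer rows.Close()

// The generic writer exposes its schema, so parquet.CopyRows applies the
// conversion rules while copying.
writer := parquet.NewGenericWriter[RowTypeV2](output)
if _, err := parquet.CopyRows(writer, rows); err != nil {
	...
}
if err := writer.Close(); err != nil {
	...
}
```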
### Sorting Row Groups: [parquet.GenericBuffer[T]](https://pkg.go.dev/github.com/parquet-go/parquet-go#Buffer)

The `parquet.GenericWriter[T]` type is optimized for minimal memory usage,
keeping the order of rows unchanged and flushing pages as soon as they are filled.

Parquet supports expressing columns by which rows are sorted through the
declaration of _sorting columns_ on row groups. Sorting row groups requires
buffering all rows before ordering and writing them to a parquet file.

To help with those use cases, the `parquet-go/parquet-go` package exposes the
`parquet.GenericBuffer[T]` type, which acts as a buffer of rows and implements
`sort.Interface` to allow applications to sort rows prior to writing them
to a file.

The columns that rows are ordered by are configured when creating
`parquet.GenericBuffer[T]` instances, using the `parquet.SortingColumns` function
to construct row group options configuring the buffer. The type of parquet
columns defines how values are compared; see [Parquet Logical Types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
for details.

When written to a file, the buffer is materialized into a single row group with
the declared sorting columns. After being written, buffers can be reused by
calling their `Reset` method.

The following example shows how to use a `parquet.GenericBuffer[T]` to order rows
written to a parquet file:

```go
type RowType struct { FirstName, LastName string }

buffer := parquet.NewGenericBuffer[RowType](
	parquet.SortingRowGroupConfig(
		parquet.SortingColumns(
			parquet.Ascending("LastName"),
			parquet.Ascending("FirstName"),
		),
	),
)

buffer.Write([]RowType{
	{FirstName: "Luke", LastName: "Skywalker"},
	{FirstName: "Han", LastName: "Solo"},
	{FirstName: "Anakin", LastName: "Skywalker"},
})

sort.Sort(buffer)

writer := parquet.NewGenericWriter[RowType](output)
_, err := parquet.CopyRows(writer, buffer.Rows())
if err != nil {
	...
}
if err := writer.Close(); err != nil {
	...
}
```

### Merging Row Groups: [parquet.MergeRowGroups](https://pkg.go.dev/github.com/parquet-go/parquet-go#MergeRowGroups)

Parquet files are often used as part of the underlying engine for data
processing or storage layers, in which case merging multiple row groups
into one that contains more rows can be a useful operation to improve query
performance; for example, bloom filters in parquet files are stored for each
row group: the larger the row group, the fewer filters need to be stored and
the more effective they become.

The `parquet-go/parquet-go` package supports creating merged views of row groups,
where the view contains all the rows of the merged groups, maintaining the order
defined by the sorting columns of the groups.

There are a few constraints when merging row groups:

* The sorting columns of all the row groups must be the same, or the merge
  operation must be explicitly configured with a set of sorting columns which
  are a prefix of the sorting columns of all merged row groups (see the sketch
  after this example).

* The schemas of row groups must all be equal, or the merge operation must
  be explicitly configured with a schema that all row groups can be converted
  to, in which case the limitations of schema conversions apply.

Once a merged view is created, it may be written to a new parquet file or buffer
in order to create a larger row group:

```go
merge, err := parquet.MergeRowGroups(rowGroups)
if err != nil {
	...
}

writer := parquet.NewGenericWriter[RowType](output)
_, err = parquet.CopyRows(writer, merge.Rows())
if err != nil {
	...
}
if err := writer.Close(); err != nil {
	...
}
```
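If the row groups do not all declare the same sorting columns, a sketch of an
explicitly configured merge, assuming `LastName` is a prefix of every group's
sorting columns:

```go
merge, err := parquet.MergeRowGroups(rowGroups,
	parquet.SortingRowGroupConfig(
		parquet.SortingColumns(
			parquet.Ascending("LastName"),
		),
	),
)
if err != nil {
	...
}
```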
### Using Bloom Filters: [parquet.BloomFilter](https://pkg.go.dev/github.com/parquet-go/parquet-go#BloomFilter)

Parquet files can embed bloom filters to help improve the performance of point
lookups in the files. The format of parquet bloom filters is documented in
the parquet specification: [Parquet Bloom Filter](https://github.com/apache/parquet-format/blob/master/BloomFilter.md)

By default, no bloom filters are created in parquet files, but applications can
configure the list of columns to create filters for using the `parquet.BloomFilters`
option when instantiating writers; for example:

```go
type RowType struct {
	FirstName string `parquet:"first_name"`
	LastName  string `parquet:"last_name"`
}

const filterBitsPerValue = 10
writer := parquet.NewGenericWriter[RowType](output,
	parquet.BloomFilters(
		// Configures the writer to generate split-block bloom filters for the
		// "first_name" and "last_name" columns of the parquet schema of rows
		// written by the application.
		parquet.SplitBlockFilter(filterBitsPerValue, "first_name"),
		parquet.SplitBlockFilter(filterBitsPerValue, "last_name"),
	),
)
...
```

Generating bloom filters requires knowing how many values exist in a column
chunk in order to properly size the filter, which requires buffering all the
values written to the column in memory. Because of this, the memory footprint
of `parquet.GenericWriter[T]` increases linearly with the number of columns
that the writer needs to generate filters for. This extra cost is optimized
away when rows are copied from a `parquet.GenericBuffer[T]` to a writer, since
in this case the number of values per column is known because the buffer already
holds all the values in memory.

When reading parquet files, column chunks expose the generated bloom filters
with the `parquet.ColumnChunk.BloomFilter` method, returning a
`parquet.BloomFilter` instance if a filter was available, or `nil` when there
were no filters.

Bloom filters are useful when performing point lookups in parquet files:
searching for rows in which a column matches a given value. Programs can
quickly eliminate column chunks that they know do not contain the value they
search for by checking the filter first, which is often multiple orders of
magnitude faster than scanning the column.

The following code snippet highlights how filters are typically used:

```go
var candidateChunks []parquet.ColumnChunk

for _, rowGroup := range file.RowGroups() {
	columnChunk := rowGroup.ColumnChunks()[columnIndex]
	bloomFilter := columnChunk.BloomFilter()

	if bloomFilter != nil {
		if ok, err := bloomFilter.Check(value); err != nil {
			...
		} else if !ok {
			// Bloom filters may return false positives, but never false
			// negatives; we know this column chunk does not contain the value.
			continue
		}
	}

	candidateChunks = append(candidateChunks, columnChunk)
}
```
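In the snippet above, `value` is a `parquet.Value`; a program searching for a
string would typically construct it with `parquet.ValueOf`:

```go
value := parquet.ValueOf("Skywalker")
```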
## Optimizations

The following sections describe common optimization techniques supported by the
library.

### Optimizing Reads

Lower-level APIs used to read parquet files offer more efficient ways to access
column values. Consecutive sequences of values are grouped into pages which are
represented by the `parquet.Page` interface.

A column chunk may contain multiple pages, each holding a section of the column
values. Applications can retrieve the column values either by reading them into
buffers of `parquet.Value`, or type asserting the pages to read arrays of
primitive Go values. The following example demonstrates how to use both
mechanisms to read column values:

```go
pages := column.Pages()
defer func() {
	checkErr(pages.Close())
}()

for {
	p, err := pages.ReadPage()
	if err != nil {
		... // io.EOF when there are no more pages
	}

	switch page := p.Values().(type) {
	case parquet.Int32Reader:
		values := make([]int32, p.NumValues())
		_, err := page.ReadInt32s(values)
		...
	case parquet.Int64Reader:
		values := make([]int64, p.NumValues())
		_, err := page.ReadInt64s(values)
		...
	default:
		values := make([]parquet.Value, p.NumValues())
		_, err := page.ReadValues(values)
		...
	}
}
```

Reading arrays of typed values is often preferable when performing aggregations
on the values, as this model offers a more compact representation of the values
in memory and pairs well with optimizations like SIMD vectorization.

### Optimizing Writes

Applications that deal with columnar storage are sometimes designed to work with
columnar data throughout the abstraction layers; it then becomes possible to
write columns of values directly instead of reconstructing rows from the column
values. The package offers two main mechanisms to satisfy those use cases:

#### A. Writing Columns of Typed Arrays

The first solution assumes that the program works with in-memory arrays of typed
values, for example slices of primitive Go types like `[]float32`; this would be
the case if the application is built on top of a framework like
[Apache Arrow](https://pkg.go.dev/github.com/apache/arrow/go/arrow).

`parquet.GenericBuffer[T]` is an implementation of the `parquet.RowGroup`
interface which maintains in-memory buffers of column values. Rows can be
written by either boxing primitive values into arrays of `parquet.Value`,
or type asserting the columns to access specialized versions of the write
methods accepting arrays of Go primitive types.

When using either of these models, the application is responsible for ensuring
that the same number of rows are written to each column, or the resulting parquet
file will be malformed.

The following examples demonstrate how to use these two models to write columns
of Go values:

```go
type RowType struct { FirstName, LastName string }

func writeColumns(buffer *parquet.GenericBuffer[RowType], firstNames []string) error {
	values := make([]parquet.Value, len(firstNames))
	for i := range firstNames {
		values[i] = parquet.ValueOf(firstNames[i])
	}
	_, err := buffer.ColumnBuffers()[0].WriteValues(values)
	return err
}
```

```go
type RowType struct { ID int64; Value float32 }

func writeColumns(buffer *parquet.GenericBuffer[RowType], ids []int64, values []float32) error {
	if len(ids) != len(values) {
		return fmt.Errorf("number of ids and values mismatch: ids=%d values=%d", len(ids), len(values))
	}
	columns := buffer.ColumnBuffers()
	if _, err := columns[0].(parquet.Int64Writer).WriteInt64s(ids); err != nil {
		return err
	}
	if _, err := columns[1].(parquet.FloatWriter).WriteFloats(values); err != nil {
		return err
	}
	return nil
}
```

The latter is more efficient as it does not require boxing the input into an
intermediary array of `parquet.Value`. However, it may not always be the right
model depending on the situation; sometimes the generic abstraction can be a
more expressive model.
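To put these functions in context, a sketch of how the typed-array variant
might be wired into a complete write path, assuming `ids`, `values`, and
`output` are provided by the caller:

```go
buffer := parquet.NewGenericBuffer[RowType]()

if err := writeColumns(buffer, ids, values); err != nil {
	...
}

// Flush the buffered columns to a parquet file.
writer := parquet.NewGenericWriter[RowType](output)
if _, err := parquet.CopyRows(writer, buffer.Rows()); err != nil {
	...
}
if err := writer.Close(); err != nil {
	...
}
```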
#### B. Implementing parquet.RowGroup

Programs that need full control over the construction of row groups can choose
to provide their own implementation of the `parquet.RowGroup` interface, which
includes defining implementations of `parquet.ColumnChunk` and `parquet.Page`
to expose column values of the row group.

This model can be preferable when the underlying storage or in-memory
representation of the data needs to be optimized further than what can be
achieved by using an intermediary buffering layer with `parquet.GenericBuffer[T]`.

See [parquet.RowGroup](https://pkg.go.dev/github.com/parquet-go/parquet-go#RowGroup)
for the full interface documentation.

#### C. Using On-Disk Page Buffers

When generating parquet files, the writer needs to buffer all pages before it
can create the row group. This may require significant amounts of memory as the
entire file content must be buffered prior to generating it. In some cases, the
files might even be larger than the amount of memory available to the program.

The `parquet.GenericWriter[T]` can be configured to use disk storage as a
scratch buffer when generating files, by configuring a different page buffer
pool using the `parquet.ColumnPageBuffers` option and `parquet.PageBufferPool`
interface.

The `parquet-go/parquet-go` package provides an implementation of the interface
which uses temporary files to store pages while a file is generated, allowing
programs to use local storage as swap space to hold pages and keep memory
utilization to a minimum. The following example demonstrates how to configure
a parquet writer to use on-disk page buffers:

```go
type RowType struct { ... }

writer := parquet.NewGenericWriter[RowType](output,
	parquet.ColumnPageBuffers(
		parquet.NewFileBufferPool("", "buffers.*"),
	),
)
```

When a row group is complete, pages buffered to disk need to be copied back to
the output file. This results in doubling I/O operations and storage space
requirements (the system needs to have enough free disk space to hold two copies
of the file). The resulting write amplification can often be optimized away by
the kernel if the file system supports copy-on-write of disk pages, since copies
between `os.File` instances are optimized using `copy_file_range(2)` (on Linux).

See [parquet.PageBufferPool](https://pkg.go.dev/github.com/parquet-go/parquet-go#PageBufferPool)
for the full interface documentation.

## Maintenance

While initial design and development occurred at Twilio Segment, the project is
now maintained by the open source community. We welcome external contributors
to participate in the form of discussions or code changes. Please review the
[Contribution](./CONTRIBUTING.md) guidelines as well as the [Code of Conduct](./CODE_OF_CONDUCT.md)
before submitting contributions.

### Continuous Integration

The project uses [GitHub Actions](https://github.com/parquet-go/parquet-go/actions) for CI.

### Debugging

The package has debugging capabilities built in which can be turned on using the
`PARQUETGODEBUG` environment variable.
The value follows a model similar to `GODEBUG`: it must be formatted as a
comma-separated list of `key=value` pairs.

The following debug flags are currently supported:

- `tracebuf=1` turns on tracing of internal buffers, which validates that
  reference counters are set to zero when buffers are reclaimed by the garbage
  collector. When the package detects that a buffer was leaked, it logs an error
  message along with the stack trace captured when the buffer was last used.
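For example, to run a program with buffer tracing enabled (the binary name is
only illustrative):

```
PARQUETGODEBUG=tracebuf=1 ./myprogram
```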