# segmentio/parquet-go [![build status](https://github.com/segmentio/parquet-go/actions/workflows/test.yml/badge.svg)](https://github.com/segmentio/parquet-go/actions) [![Go Report Card](https://goreportcard.com/badge/github.com/segmentio/parquet-go)](https://goreportcard.com/report/github.com/segmentio/parquet-go) [![Go Reference](https://pkg.go.dev/badge/github.com/segmentio/parquet-go.svg)](https://pkg.go.dev/github.com/segmentio/parquet-go)

High-performance Go library to manipulate parquet files.

## Motivation

Parquet has been established as a powerful solution to represent columnar data
on persistent storage mediums, achieving levels of compression and query
performance that enable managing data sets at scales that reach the petabytes.
In addition, having intensive data applications sharing a common format creates
opportunities for interoperation in our tool kits, providing greater leverage
and value to engineers maintaining and operating those systems.

The creation and evolution of large scale data management systems, combined
with realtime expectations, come with challenging maintenance and performance
requirements that existing solutions for using parquet with Go were not
addressing.

The `segmentio/parquet-go` package was designed and developed to respond to
those challenges, offering high level APIs to read and write parquet files,
while keeping a low compute and memory footprint in order to be used in
environments where data volumes and cost constraints require software to
achieve high levels of efficiency.

## Specification

Columnar storage allows Parquet to store data more efficiently than, say,
using JSON or Protobuf. For more information, refer to the
[Parquet Format Specification](https://github.com/apache/parquet-format).

## Installation

The package is distributed as a standard Go module that programs can take a
dependency on and install with the following command:

```
go get github.com/segmentio/parquet-go
```

Go 1.18 or later is required to use the package. As a backward-compatibility
mechanism, the package can also be built with Go 1.17, in which case the APIs
based on generics are disabled.

### Compatibility Guarantees

The package is currently released as a pre-v1 version, which gives maintainers
the freedom to break backward compatibility to help improve the APIs as we
learn which initial design decisions need to be revisited to better support the
use cases that the library solves for. These occurrences are expected to be
rare, and documentation will be produced to guide users on how to adapt their
programs to breaking changes.

## Usage

The following sections describe how to use APIs exposed by the library,
highlighting the use cases with code examples to demonstrate how they are used
in practice.

### Writing Parquet Files: [parquet.GenericWriter[T]](https://pkg.go.dev/github.com/segmentio/parquet-go#GenericWriter)

A parquet file is a collection of rows sharing the same schema, arranged in
columns to support faster scan operations on subsets of the data set.

For simple use cases, the `parquet.WriteFile[T]` function allows the creation
of parquet files on the file system from a slice of Go values representing the
rows to write to the file.

```go
type RowType struct { FirstName, LastName string }

if err := parquet.WriteFile("file.parquet", []RowType{
    {FirstName: "Bob"},
    {FirstName: "Alice"},
}); err != nil {
    ...
}
```
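By default, column names are derived from the Go struct field names. They can
be customized with `parquet` struct tags, as documented on `parquet.SchemaOf`;
the sketch below is a minimal illustration, and the `optional` tag option
(marking the column as nullable) is an assumption based on that documentation:

```go
type RowType struct {
    FirstName string `parquet:"first_name"`
    LastName  string `parquet:"last_name,optional"` // assumed: optional declares a nullable column
}
```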
The `parquet.GenericWriter[T]` type denormalizes rows into columns, then
encodes the columns into a parquet file, generating row groups, column chunks,
and pages based on configurable heuristics.

```go
type RowType struct { FirstName, LastName string }

writer := parquet.NewGenericWriter[RowType](output)

_, err := writer.Write([]RowType{
    ...
})
if err != nil {
    ...
}

// Closing the writer is necessary to flush buffers and write the file footer.
if err := writer.Close(); err != nil {
    ...
}
```

Explicit declaration of the parquet schema on a writer is useful when the
application needs to ensure that data written to a file adheres to a predefined
schema, which may differ from the schema derived from the writer's type
parameter. The `parquet.Schema` type is an in-memory representation of the
schema of parquet rows, translated from the type of Go values, and can be used
for this purpose.

```go
schema := parquet.SchemaOf(new(RowType))
writer := parquet.NewGenericWriter[any](output, schema)
...
```

### Reading Parquet Files: [parquet.GenericReader[T]](https://pkg.go.dev/github.com/segmentio/parquet-go#GenericReader)

For simple use cases where the data set fits in memory and the program will
read most rows of the file, the `parquet.ReadFile[T]` function returns a slice
of Go values representing the rows read from the file.

```go
type RowType struct { FirstName, LastName string }

rows, err := parquet.ReadFile[RowType]("file.parquet")
if err != nil {
    ...
}

for _, c := range rows {
    fmt.Printf("%+v\n", c)
}
```

The expected schema of rows can be explicitly declared when the reader is
constructed, which is useful to ensure that the program receives rows matching
a specific format; for example, when dealing with files from remote storage
sources that applications cannot trust to have used an expected schema.

Configuring the schema of a reader is done by passing a `parquet.Schema`
instance as argument when constructing a reader. When the schema is declared,
conversion rules implemented by the package are applied to ensure that rows
read by the application match the desired format (see **Evolving Parquet Schemas**).

```go
schema := parquet.SchemaOf(new(RowType))
reader := parquet.NewReader(file, schema)
...
```

### Inspecting Parquet Files: [parquet.File](https://pkg.go.dev/github.com/segmentio/parquet-go#File)

Sometimes, lower-level APIs can be useful to leverage the columnar layout of
parquet files. The `parquet.File` type is intended to provide such features to
Go applications, by exposing APIs to iterate over the various parts of a
parquet file.

```go
f, err := parquet.OpenFile(file, size)
if err != nil {
    ...
}

for _, rowGroup := range f.RowGroups() {
    for _, columnChunk := range rowGroup.ColumnChunks() {
        ...
    }
}
```
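`parquet.OpenFile` takes an `io.ReaderAt` and the size of the file. When
reading from the local file system, both can be obtained from an `os.File`;
the helper below is a minimal sketch (its name and error handling are
illustrative, not part of the package API):

```go
func openParquetFile(path string) (*parquet.File, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    // Note: the *os.File must remain open for as long as the *parquet.File is
    // in use, since column pages are read from it lazily.
    s, err := f.Stat()
    if err != nil {
        f.Close()
        return nil, err
    }
    // *os.File implements io.ReaderAt, and Stat reports the size needed to
    // locate the parquet footer.
    return parquet.OpenFile(f, s.Size())
}
```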
### Evolving Parquet Schemas: [parquet.Convert](https://pkg.go.dev/github.com/segmentio/parquet-go#Convert)

Parquet files embed all the metadata necessary to interpret their content,
including a description of the schema of the tables represented by the rows
and columns they contain.

Parquet files are also immutable; once written, there is no mechanism for
_updating_ a file. If their contents need to be changed, rows must be read,
modified, and written to a new file.

Because applications evolve, the schemas written to parquet files also tend to
evolve over time. These requirements create challenges when applications need
to operate on parquet files with heterogeneous schemas: algorithms that expect
new columns to exist may have issues dealing with rows that come from files
with mismatching schema versions.

To help build applications that can handle evolving schemas,
`segmentio/parquet-go` implements conversion rules that create views of row
groups to translate between schema versions.

The `parquet.Convert` function is the low-level routine constructing conversion
rules from a source to a target schema. The function is used to build converted
views of `parquet.RowReader` or `parquet.RowGroup`, for example:

```go
type RowTypeV1 struct { ID int64; FirstName string }
type RowTypeV2 struct { ID int64; FirstName, LastName string }

source := parquet.SchemaOf(RowTypeV1{})
target := parquet.SchemaOf(RowTypeV2{})

conversion, err := parquet.Convert(target, source)
if err != nil {
    ...
}

targetRowGroup := parquet.ConvertRowGroup(sourceRowGroup, conversion)
...
```

Conversion rules are automatically applied by the `parquet.CopyRows` function
when the reader and writer passed to the function also implement the
`parquet.RowReaderWithSchema` and `parquet.RowWriterWithSchema` interfaces.
The copy determines whether the reader and writer schemas can be converted from
one to the other, and automatically applies the conversion rules to facilitate
the translation between schemas.

At this time, conversion rules only support adding or removing columns from
the schemas; there are no type conversions performed, nor ways to rename
columns, etc. More advanced conversion rules may be added in the future.
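The following sketch combines the APIs shown above to rewrite a file from the
V1 schema to the V2 schema. It assumes the source file was already opened with
`parquet.OpenFile` and that `output` is an `io.Writer` provided by the
application:

```go
func rewriteToV2(source *parquet.File, output io.Writer) error {
    conversion, err := parquet.Convert(
        parquet.SchemaOf(RowTypeV2{}), // target
        parquet.SchemaOf(RowTypeV1{}), // source
    )
    if err != nil {
        return err
    }

    writer := parquet.NewGenericWriter[RowTypeV2](output)

    for _, rowGroup := range source.RowGroups() {
        // The converted view exposes the V1 rows under the V2 schema so they
        // can be copied to the writer.
        view := parquet.ConvertRowGroup(rowGroup, conversion)
        if _, err := parquet.CopyRows(writer, view.Rows()); err != nil {
            return err
        }
    }

    return writer.Close()
}
```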
### Sorting Row Groups: [parquet.GenericBuffer[T]](https://pkg.go.dev/github.com/segmentio/parquet-go#Buffer)

The `parquet.GenericWriter[T]` type is optimized for minimal memory usage,
keeping the order of rows unchanged and flushing pages as soon as they are
filled.

Parquet supports expressing columns by which rows are sorted through the
declaration of _sorting columns_ on row groups. Sorting row groups requires
buffering all rows before ordering and writing them to a parquet file.

To help with those use cases, the `segmentio/parquet-go` package exposes the
`parquet.GenericBuffer[T]` type, which acts as a buffer of rows and implements
`sort.Interface` to allow applications to sort rows prior to writing them
to a file.

The columns that rows are ordered by are configured when creating
`parquet.GenericBuffer[T]` instances, using the `parquet.SortingColumns`
function to construct the row group option that configures the buffer. The type
of parquet columns defines how values are compared; see
[Parquet Logical Types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
for details.

When written to a file, the buffer is materialized into a single row group with
the declared sorting columns. After being written, buffers can be reused by
calling their `Reset` method.

The following example shows how to use a `parquet.GenericBuffer[T]` to order
rows written to a parquet file:

```go
type RowType struct { FirstName, LastName string }

buffer := parquet.NewGenericBuffer[RowType](
    parquet.SortingColumns(
        parquet.Ascending("LastName"),
        parquet.Ascending("FirstName"),
    ),
)

buffer.Write([]RowType{
    {FirstName: "Luke", LastName: "Skywalker"},
    {FirstName: "Han", LastName: "Solo"},
    {FirstName: "Anakin", LastName: "Skywalker"},
})

sort.Sort(buffer)

writer := parquet.NewGenericWriter[RowType](output)
_, err := parquet.CopyRows(writer, buffer.Rows())
if err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```

### Merging Row Groups: [parquet.MergeRowGroups](https://pkg.go.dev/github.com/segmentio/parquet-go#MergeRowGroups)

Parquet files are often used as part of the underlying engine for data
processing or storage layers, in which case merging multiple row groups
into one that contains more rows can be a useful operation to improve query
performance; for example, bloom filters in parquet files are stored for each
row group: the larger the row group, the fewer filters need to be stored and
the more effective they become.

The `segmentio/parquet-go` package supports creating merged views of row
groups, where the view contains all the rows of the merged groups, maintaining
the order defined by the sorting columns of the groups.

There are a few constraints when merging row groups:

* The sorting columns of all the row groups must be the same, or the merge
  operation must be explicitly configured with a set of sorting columns which
  are a prefix of the sorting columns of all merged row groups (see the sketch
  at the end of this section).

* The schemas of row groups must all be equal, or the merge operation must
  be explicitly configured with a schema that all row groups can be converted
  to, in which case the limitations of schema conversions apply.

Once a merged view is created, it may be written to a new parquet file or
buffer in order to create a larger row group:

```go
merge, err := parquet.MergeRowGroups(rowGroups)
if err != nil {
    ...
}

writer := parquet.NewGenericWriter[RowType](output)
_, err = parquet.CopyRows(writer, merge.Rows())
if err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```
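When the merged row groups do not all declare the same sorting columns, the
merge can be configured explicitly. The sketch below assumes that
`parquet.MergeRowGroups` accepts the same row group options used to configure
buffers in the previous section:

```go
// Assumption: parquet.SortingColumns constructs a row group option that
// parquet.MergeRowGroups accepts, declaring the ordering of the merged view.
merge, err := parquet.MergeRowGroups(rowGroups,
    parquet.SortingColumns(
        parquet.Ascending("LastName"),
    ),
)
if err != nil {
    ...
}
```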
### Using Bloom Filters: [parquet.BloomFilter](https://pkg.go.dev/github.com/segmentio/parquet-go#BloomFilter)

Parquet files can embed bloom filters to help improve the performance of point
lookups in the files. The format of parquet bloom filters is documented in
the parquet specification: [Parquet Bloom Filter](https://github.com/apache/parquet-format/blob/master/BloomFilter.md)

By default, no bloom filters are created in parquet files, but applications can
configure the list of columns to create filters for using the
`parquet.BloomFilters` option when instantiating writers; for example:

```go
type RowType struct {
    FirstName string `parquet:"first_name"`
    LastName  string `parquet:"last_name"`
}

writer := parquet.NewGenericWriter[RowType](output,
    parquet.BloomFilters(
        // Configures the writer to generate split-block bloom filters for the
        // "first_name" and "last_name" columns of the parquet schema of rows
        // written by the application.
        parquet.SplitBlockFilter("first_name"),
        parquet.SplitBlockFilter("last_name"),
    ),
)
...
```

Generating bloom filters requires knowing how many values exist in a column
chunk in order to properly size the filter, which requires buffering all the
values written to the column in memory. Because of this, the memory footprint
of `parquet.GenericWriter[T]` increases linearly with the number of columns
that the writer needs to generate filters for. This extra cost is optimized
away when rows are copied from a `parquet.GenericBuffer[T]` to a writer, since
in that case the number of values per column is known because the buffer
already holds all the values in memory.

When reading parquet files, column chunks expose the generated bloom filters
with the `parquet.ColumnChunk.BloomFilter` method, returning a
`parquet.BloomFilter` instance if a filter was available, or `nil` when the
column chunk has no filter.

Bloom filters are useful when performing point lookups in parquet files, that
is, searching for rows where a column matches a given value. Programs can
quickly eliminate column chunks that they know do not contain the value by
checking the filter first, which is often multiple orders of magnitude faster
than scanning the column.

The following code snippet highlights how filters are typically used:

```go
var candidateChunks []parquet.ColumnChunk

for _, rowGroup := range file.RowGroups() {
    columnChunk := rowGroup.ColumnChunks()[columnIndex]
    bloomFilter := columnChunk.BloomFilter()

    if bloomFilter != nil {
        if ok, err := bloomFilter.Check(value); err != nil {
            ...
        } else if !ok {
            // Bloom filters may return false positives, but never return false
            // negatives, so we know this column chunk does not contain the
            // value.
            continue
        }
    }

    candidateChunks = append(candidateChunks, columnChunk)
}
```
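The `columnIndex` used above is the position of the column in the row group's
column chunks. One way to resolve it from a column name is to look it up in the
file schema; the sketch below assumes the `parquet.Schema.Lookup` method and a
hypothetical "first_name" column:

```go
// Assumption: Lookup returns the leaf column matching the path, along with a
// boolean reporting whether the column exists in the schema.
leaf, ok := file.Schema().Lookup("first_name")
if !ok {
    ... // the column does not exist in this file
}
columnIndex := leaf.ColumnIndex
```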
## Optimizations

The following sections describe common optimization techniques supported by the
library.

### Optimizing Reads

Lower level APIs used to read parquet files offer more efficient ways to access
column values. Consecutive sequences of values are grouped into pages which are
represented by the `parquet.Page` interface.

A column chunk may contain multiple pages, each holding a section of the column
values. Applications can retrieve the column values either by reading them into
buffers of `parquet.Value`, or by type asserting the pages to read arrays of
primitive Go values. The following example demonstrates how to use both
mechanisms to read column values:

```go
pages := column.Pages()

for {
    p, err := pages.ReadPage()
    if err != nil {
        ... // io.EOF when there are no more pages
    }

    switch page := p.Values().(type) {
    case parquet.Int32Page:
        values := make([]int32, page.NumValues())
        _, err := page.ReadInt32s(values)
        ...
    case parquet.Int64Page:
        values := make([]int64, page.NumValues())
        _, err := page.ReadInt64s(values)
        ...
    default:
        values := make([]parquet.Value, page.NumValues())
        _, err := page.ReadValues(values)
        ...
    }
}
```

Reading arrays of typed values is often preferable when performing aggregations
on the values as this model offers a more compact representation of the values
in memory, and pairs well with the use of optimizations like SIMD vectorization.

### Optimizing Writes

Applications that deal with columnar storage are sometimes designed to work
with columnar data throughout the abstraction layers; it then becomes possible
to write columns of values directly instead of reconstructing rows from the
column values. The package offers two main mechanisms to satisfy those use
cases:

#### A. Writing Columns of Typed Arrays

The first solution assumes that the program works with in-memory arrays of
typed values, for example slices of primitive Go types like `[]float32`; this
would be the case if the application is built on top of a framework like
[Apache Arrow](https://pkg.go.dev/github.com/apache/arrow/go/arrow).

`parquet.GenericBuffer[T]` is an implementation of the `parquet.RowGroup`
interface which maintains in-memory buffers of column values. Rows can be
written by either boxing primitive values into arrays of `parquet.Value`, or by
type asserting the columns to access specialized versions of the write methods
accepting arrays of Go primitive types.

When using either of these models, the application is responsible for ensuring
that the same number of rows is written to each column, or the resulting
parquet file will be malformed.

The following examples demonstrate how to use these two models to write columns
of Go values:

```go
type RowType struct { FirstName, LastName string }

func writeColumns(buffer *parquet.GenericBuffer[RowType], firstNames []string) error {
    values := make([]parquet.Value, len(firstNames))
    for i := range firstNames {
        values[i] = parquet.ValueOf(firstNames[i])
    }
    _, err := buffer.ColumnBuffers()[0].WriteValues(values)
    return err
}
```

```go
type RowType struct { ID int64; Value float32 }

func writeColumns(buffer *parquet.GenericBuffer[RowType], ids []int64, values []float32) error {
    if len(ids) != len(values) {
        return fmt.Errorf("number of ids and values mismatch: ids=%d values=%d", len(ids), len(values))
    }
    columns := buffer.ColumnBuffers()
    if _, err := columns[0].(parquet.Int64Writer).WriteInt64s(ids); err != nil {
        return err
    }
    if _, err := columns[1].(parquet.FloatWriter).WriteFloats(values); err != nil {
        return err
    }
    return nil
}
```

The latter is more efficient as it does not require boxing the input into an
intermediary array of `parquet.Value`. However, it may not always be the right
model depending on the situation; sometimes the generic abstraction can be a
more expressive model.
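In both models, the rows accumulated in the buffer still need to be written to
a file. A minimal sketch, reusing the `parquet.CopyRows` pattern from the
sorting example and assuming `output` is an `io.Writer` supplied by the
application:

```go
func flushBuffer(output io.Writer, buffer *parquet.GenericBuffer[RowType]) error {
    writer := parquet.NewGenericWriter[RowType](output)
    if _, err := parquet.CopyRows(writer, buffer.Rows()); err != nil {
        return err
    }
    // Closing the writer flushes the buffered pages and writes the file
    // footer; the buffer can then be reused by calling its Reset method.
    return writer.Close()
}
```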
#### B. Implementing parquet.RowGroup

Programs that need full control over the construction of row groups can choose
to provide their own implementation of the `parquet.RowGroup` interface, which
includes defining implementations of `parquet.ColumnChunk` and `parquet.Page`
to expose column values of the row group.

This model can be preferable when the underlying storage or in-memory
representation of the data needs to be optimized further than what can be
achieved by using an intermediary buffering layer with
`parquet.GenericBuffer[T]`.

See [parquet.RowGroup](https://pkg.go.dev/github.com/segmentio/parquet-go#RowGroup)
for the full interface documentation.

#### C. Using on-disk page buffers

When generating parquet files, the writer needs to buffer all pages before it
can create the row group. This may require significant amounts of memory as the
entire file content must be buffered prior to generating it. In some cases, the
files might even be larger than the amount of memory available to the program.

The `parquet.GenericWriter[T]` can be configured to use disk storage instead as
a scratch buffer when generating files, by configuring a different page buffer
pool using the `parquet.ColumnPageBuffers` option and `parquet.PageBufferPool`
interface.

The `segmentio/parquet-go` package provides an implementation of the interface
which uses temporary files to store pages while a file is generated, allowing
programs to use local storage as swap space to hold pages and keep memory
utilization to a minimum. The following example demonstrates how to configure
a parquet writer to use on-disk page buffers:

```go
type RowType struct { ... }

writer := parquet.NewGenericWriter[RowType](output,
    parquet.ColumnPageBuffers(
        parquet.NewFileBufferPool("", "buffers.*"),
    ),
)
```

When a row group is complete, pages buffered to disk need to be copied back to
the output file. This results in doubling I/O operations and storage space
requirements (the system needs to have enough free disk space to hold two
copies of the file). The resulting write amplification can often be optimized
away by the kernel if the file system supports copy-on-write of disk pages,
since copies between `os.File` instances are optimized using
`copy_file_range(2)` (on Linux).

See [parquet.PageBufferPool](https://pkg.go.dev/github.com/segmentio/parquet-go#PageBufferPool)
for the full interface documentation.

## Maintenance

The project is hosted and maintained by Twilio; we welcome external
contributors to participate in the form of discussions or code changes. Please
review the [Contribution](./CONTRIBUTING.md) guidelines as well as the
[Code of Conduct](./CODE_OF_CONDUCT.md) before submitting contributions.

### Continuous Integration

The project uses [GitHub Actions](https://github.com/segmentio/parquet-go/actions) for CI.