
# segmentio/parquet-go [![build status](https://github.com/segmentio/parquet-go/actions/workflows/test.yml/badge.svg)](https://github.com/segmentio/parquet-go/actions) [![Go Report Card](https://goreportcard.com/badge/github.com/segmentio/parquet-go)](https://goreportcard.com/report/github.com/segmentio/parquet-go) [![Go Reference](https://pkg.go.dev/badge/github.com/segmentio/parquet-go.svg)](https://pkg.go.dev/github.com/segmentio/parquet-go)

High-performance Go library to manipulate parquet files.

## Motivation

Parquet has been established as a powerful solution to represent columnar data
on persistent storage media, achieving levels of compression and query
performance that enable managing data sets at petabyte scale. In addition,
having data-intensive applications share a common format creates opportunities
for interoperation across toolkits, providing greater leverage and value to
the engineers maintaining and operating those systems.

The creation and evolution of large-scale data management systems, combined
with realtime expectations, comes with challenging maintenance and performance
requirements that existing solutions for using parquet with Go were not
addressing.

The `segmentio/parquet-go` package was designed and developed to respond to
those challenges, offering high-level APIs to read and write parquet files,
while keeping a low compute and memory footprint in order to be used in
environments where data volumes and cost constraints require software to
achieve high levels of efficiency.

## Specification

Columnar storage allows Parquet to store data more efficiently than, say,
using JSON or Protobuf. For more information, refer to the [Parquet Format Specification](https://github.com/apache/parquet-format).

## Installation

The package is distributed as a standard Go module that programs can take a
dependency on and install with the following command:

```
go get github.com/segmentio/parquet-go
```

Go 1.18 or later is required to use the package. As a backward-compatibility
mechanism, the package can also be built with Go 1.17, in which case the APIs
based on Generics are disabled.
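
When generics are not available, the package's non-generic APIs remain usable.
The following is a minimal sketch of that fallback, assuming an `output`
destination and a `RowType` struct similar to the ones used in later examples:

```go
// Sketch of writing rows without generics (e.g. when building with Go 1.17):
// parquet.NewWriter and the Write method accept arbitrary Go values.
type RowType struct { FirstName, LastName string }

writer := parquet.NewWriter(output)

if err := writer.Write(RowType{FirstName: "Bob"}); err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```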

### Compatibility Guarantees

The package is currently released as a pre-v1 version, which gives maintainers
the freedom to break backward compatibility to help improve the APIs as we learn
which initial design decisions need to be revisited to better support the
use cases that the library solves for. These occurrences are expected to be rare,
and documentation will be produced to guide users on how to adapt their
programs to breaking changes.

## Usage

The following sections describe how to use APIs exposed by the library,
highlighting the use cases with code examples to demonstrate how they are used
in practice.

### Writing Parquet Files: [parquet.GenericWriter[T]](https://pkg.go.dev/github.com/segmentio/parquet-go#GenericWriter)

A parquet file is a collection of rows sharing the same schema, arranged in
columns to support faster scan operations on subsets of the data set.

For simple use cases, the `parquet.WriteFile[T]` function allows the creation
of parquet files on the file system from a slice of Go values representing the
rows to write to the file.

```go
type RowType struct { FirstName, LastName string }

if err := parquet.WriteFile("file.parquet", []RowType{
    {FirstName: "Bob"},
    {FirstName: "Alice"},
}); err != nil {
    ...
}
```

The `parquet.GenericWriter[T]` type denormalizes rows into columns, then encodes
the columns into a parquet file, generating row groups, column chunks, and pages
based on configurable heuristics.

```go
type RowType struct { FirstName, LastName string }

writer := parquet.NewGenericWriter[RowType](output)

_, err := writer.Write([]RowType{
    ...
})
if err != nil {
    ...
}

// Closing the writer is necessary to flush buffers and write the file footer.
if err := writer.Close(); err != nil {
    ...
}
```

Explicit declaration of the parquet schema on a writer is useful when the
application needs to ensure that data written to a file adheres to a predefined
schema, which may differ from the schema derived from the writer's type
parameter. The `parquet.Schema` type is an in-memory representation of the schema
of parquet rows, translated from the type of Go values, and can be used for this
purpose.

```go
schema := parquet.SchemaOf(new(RowType))
writer := parquet.NewGenericWriter[any](output, schema)
...
```

### Reading Parquet Files: [parquet.GenericReader[T]](https://pkg.go.dev/github.com/segmentio/parquet-go#GenericReader)

For simple use cases where the data set fits in memory and the program will
read most rows of the file, the `parquet.ReadFile[T]` function returns a slice
of Go values representing the rows read from the file.

```go
type RowType struct { FirstName, LastName string }

rows, err := parquet.ReadFile[RowType]("file.parquet")
if err != nil {
    ...
}

for _, c := range rows {
    fmt.Printf("%+v\n", c)
}
```

The expected schema of rows can be explicitly declared when the reader is
constructed, which is useful to ensure that the program receives rows matching
a specific format; for example, when dealing with files from remote storage
sources that applications cannot trust to have used an expected schema.

Configuring the schema of a reader is done by passing a `parquet.Schema`
instance as argument when constructing a reader. When the schema is declared,
conversion rules implemented by the package are applied to ensure that rows
read by the application match the desired format (see **Evolving Parquet Schemas**).

```go
schema := parquet.SchemaOf(new(RowType))
reader := parquet.NewReader(file, schema)
...
```

### Inspecting Parquet Files: [parquet.File](https://pkg.go.dev/github.com/segmentio/parquet-go#File)

Sometimes, lower-level APIs can be useful to leverage the columnar layout of
parquet files. The `parquet.File` type is intended to provide such features to
Go applications, by exposing APIs to iterate over the various parts of a
parquet file.

```go
f, err := parquet.OpenFile(file, size)
if err != nil {
    ...
}

for _, rowGroup := range f.RowGroups() {
    for _, columnChunk := range rowGroup.ColumnChunks() {
        ...
    }
}
```

### Evolving Parquet Schemas: [parquet.Convert](https://pkg.go.dev/github.com/segmentio/parquet-go#Convert)

Parquet files embed all the metadata necessary to interpret their content,
including a description of the schema of the tables represented by the rows and
columns they contain.

Parquet files are also immutable; once written, there is no mechanism for
_updating_ a file. If their contents need to be changed, rows must be read,
modified, and written to a new file.

Because applications evolve, the schemas written to parquet files also tend to
evolve over time. Those changes create challenges when applications need to
operate on parquet files with heterogeneous schemas: algorithms that expect
new columns to exist may have issues dealing with rows that come from files with
mismatching schema versions.

To help build applications that can handle evolving schemas, `segmentio/parquet-go`
implements conversion rules that create views of row groups to translate between
schema versions.

The `parquet.Convert` function is the low-level routine constructing conversion
rules from a source to a target schema. The function is used to build converted
views of `parquet.RowReader` or `parquet.RowGroup`, for example:

```go
type RowTypeV1 struct { ID int64; FirstName string }
type RowTypeV2 struct { ID int64; FirstName, LastName string }

source := parquet.SchemaOf(RowTypeV1{})
target := parquet.SchemaOf(RowTypeV2{})

conversion, err := parquet.Convert(target, source)
if err != nil {
    ...
}

targetRowGroup := parquet.ConvertRowGroup(sourceRowGroup, conversion)
...
```

Conversion rules are automatically applied by the `parquet.CopyRows` function
when the reader and writer passed to the function also implement the
`parquet.RowReaderWithSchema` and `parquet.RowWriterWithSchema` interfaces.
The copy determines whether the reader and writer schemas can be converted from
one to the other, and automatically applies the conversion rules to facilitate
the translation between schemas.
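
As an illustration, the following sketch copies rows written with the
`RowTypeV1` schema into a file using the `RowTypeV2` schema from the previous
example. It assumes `input` is an `io.ReaderAt` over the source file and
`output` is the destination writer; `parquet.Reader` and `parquet.Writer` both
carry a schema, so `parquet.CopyRows` applies the conversion automatically:

```go
// Hedged sketch: copy rows between schema versions with parquet.CopyRows.
reader := parquet.NewReader(input)                                 // schema read from the file footer
writer := parquet.NewWriter(output, parquet.SchemaOf(RowTypeV2{})) // target schema

if _, err := parquet.CopyRows(writer, reader); err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```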

At this time, conversion rules only support adding or removing columns from
the schemas; no type conversions are performed, and columns cannot be renamed.
More advanced conversion rules may be added in the future.

### Sorting Row Groups: [parquet.GenericBuffer[T]](https://pkg.go.dev/github.com/segmentio/parquet-go#GenericBuffer)

The `parquet.GenericWriter[T]` type is optimized for minimal memory usage,
keeping the order of rows unchanged and flushing pages as soon as they are filled.

Parquet supports expressing columns by which rows are sorted through the
declaration of _sorting columns_ on row groups. Sorting row groups requires
buffering all rows before ordering and writing them to a parquet file.

To help with those use cases, the `segmentio/parquet-go` package exposes the
`parquet.GenericBuffer[T]` type which acts as a buffer of rows and implements
`sort.Interface` to allow applications to sort rows prior to writing them
to a file.

The columns that rows are ordered by are configured when creating
`parquet.GenericBuffer[T]` instances using the `parquet.SortingColumns` function
to construct row group options configuring the buffer. The type of parquet
columns defines how values are compared; see [Parquet Logical Types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
for details.

When written to a file, the buffer is materialized into a single row group with
the declared sorting columns. After being written, buffers can be reused by
calling their `Reset` method.

The following example shows how to use a `parquet.GenericBuffer[T]` to order rows
written to a parquet file:

```go
type RowType struct { FirstName, LastName string }

buffer := parquet.NewGenericBuffer[RowType](
    parquet.SortingColumns(
        parquet.Ascending("LastName"),
        parquet.Ascending("FirstName"),
    ),
)

buffer.Write([]RowType{
    {FirstName: "Luke", LastName: "Skywalker"},
    {FirstName: "Han", LastName: "Solo"},
    {FirstName: "Anakin", LastName: "Skywalker"},
})

sort.Sort(buffer)

writer := parquet.NewGenericWriter[RowType](output)
_, err := parquet.CopyRows(writer, buffer.Rows())
if err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```

### Merging Row Groups: [parquet.MergeRowGroups](https://pkg.go.dev/github.com/segmentio/parquet-go#MergeRowGroups)

Parquet files are often used as part of the underlying engine for data
processing or storage layers, in which cases merging multiple row groups
into one that contains more rows can be a useful operation to improve query
performance; for example, bloom filters in parquet files are stored for each
row group, so the larger the row group, the fewer filters need to be stored and
the more effective they become.

The `segmentio/parquet-go` package supports creating merged views of row groups,
where the view contains all the rows of the merged groups, maintaining the order
defined by the sorting columns of the groups.

There are a few constraints when merging row groups:

* The sorting columns of all the row groups must be the same, or the merge
  operation must be explicitly configured with a set of sorting columns which
  are a prefix of the sorting columns of all merged row groups.

* The schemas of row groups must all be equal, or the merge operation must
  be explicitly configured with a schema that all row groups can be converted
  to, in which case the limitations of schema conversions apply (see the
  sketch below).
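
For example, a merge over row groups with convertible schemas might be
configured explicitly as follows. This is a hedged sketch: it assumes the
`RowTypeV2` schema and the `LastName` sorting column are valid for every row
group in `rowGroups`:

```go
// Sketch: the schema and sorting columns passed as options must satisfy the
// constraints listed above for every row group being merged.
merge, err := parquet.MergeRowGroups(rowGroups,
    parquet.SchemaOf(RowTypeV2{}),
    parquet.SortingColumns(
        parquet.Ascending("LastName"),
    ),
)
if err != nil {
    ...
}
```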

Once a merged view is created, it may be written to a new parquet file or buffer
in order to create a larger row group:

```go
merge, err := parquet.MergeRowGroups(rowGroups)
if err != nil {
    ...
}

writer := parquet.NewGenericWriter[RowType](output)
_, err = parquet.CopyRows(writer, merge)
if err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```

### Using Bloom Filters: [parquet.BloomFilter](https://pkg.go.dev/github.com/segmentio/parquet-go#BloomFilter)

Parquet files can embed bloom filters to help improve the performance of point
lookups in the files. The format of parquet bloom filters is documented in
the parquet specification: [Parquet Bloom Filter](https://github.com/apache/parquet-format/blob/master/BloomFilter.md)

By default, no bloom filters are created in parquet files, but applications can
configure the list of columns to create filters for using the `parquet.BloomFilters`
option when instantiating writers; for example:

```go
type RowType struct {
    FirstName string `parquet:"first_name"`
    LastName  string `parquet:"last_name"`
}

writer := parquet.NewGenericWriter[RowType](output,
    parquet.BloomFilters(
        // Configures the writer to generate split-block bloom filters for the
        // "first_name" and "last_name" columns of the parquet schema of rows
        // written by the application.
        parquet.SplitBlockFilter("first_name"),
        parquet.SplitBlockFilter("last_name"),
    ),
)
...
```

Generating bloom filters requires knowing how many values exist in a column
chunk in order to properly size the filter, which requires buffering all the
values written to the column in memory. Because of this, the memory footprint
of `parquet.GenericWriter[T]` increases linearly with the number of columns
that the writer needs to generate filters for. This extra cost is optimized
away when rows are copied from a `parquet.GenericBuffer[T]` to a writer, since
in this case the number of values per column is known because the buffer already
holds all the values in memory.
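
A sketch of that pattern is shown below, combining the buffer and writer types
from the previous examples; `rows` is assumed to be a `[]RowType` slice built
by the application:

```go
// Hedged sketch: because the buffer already holds every value in memory, the
// writer knows the number of values per column chunk up front and can size
// the bloom filters without buffering the column values a second time.
buffer := parquet.NewGenericBuffer[RowType]()

writer := parquet.NewGenericWriter[RowType](output,
    parquet.BloomFilters(
        parquet.SplitBlockFilter("last_name"),
    ),
)

if _, err := buffer.Write(rows); err != nil {
    ...
}
if _, err := parquet.CopyRows(writer, buffer.Rows()); err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```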

When reading parquet files, column chunks expose the generated bloom filters
with the `parquet.ColumnChunk.BloomFilter` method, returning a
`parquet.BloomFilter` instance if a filter was available, or `nil` when there
were no filters.

Using bloom filters in parquet files is useful when performing point lookups;
that is, searching for column rows matching a given value. Programs can
quickly eliminate column chunks that they know do not contain the value they
search for by checking the filter first, which is often multiple orders of
magnitude faster than scanning the column.

The following code snippet highlights how filters are typically used:

```go
var candidateChunks []parquet.ColumnChunk

for _, rowGroup := range file.RowGroups() {
    columnChunk := rowGroup.ColumnChunks()[columnIndex]
    bloomFilter := columnChunk.BloomFilter()

    if bloomFilter != nil {
        if ok, err := bloomFilter.Check(value); err != nil {
            ...
        } else if !ok {
            // Bloom filters may return false positives, but never return false
            // negatives, so we know this column chunk does not contain the value.
            continue
        }
    }

    candidateChunks = append(candidateChunks, columnChunk)
}
```

## Optimizations

The following sections describe common optimization techniques supported by the
library.

### Optimizing Reads

Lower level APIs used to read parquet files offer more efficient ways to access
column values. Consecutive sequences of values are grouped into pages which are
represented by the `parquet.Page` interface.

A column chunk may contain multiple pages, each holding a section of the column
values. Applications can retrieve the column values either by reading them into
buffers of `parquet.Value`, or type asserting the pages to read arrays of
primitive Go values. The following example demonstrates how to use both
mechanisms to read column values:

```go
pages := column.Pages()

for {
    p, err := pages.ReadPage()
    if err != nil {
        ... // io.EOF when there are no more pages
    }

    switch page := p.Values().(type) {
    case parquet.Int32Reader:
        values := make([]int32, p.NumValues())
        _, err := page.ReadInt32s(values)
        ...
    case parquet.Int64Reader:
        values := make([]int64, p.NumValues())
        _, err := page.ReadInt64s(values)
        ...
    default:
        values := make([]parquet.Value, p.NumValues())
        _, err := page.ReadValues(values)
        ...
    }
}
```

Reading arrays of typed values is often preferable when performing aggregations
on the values, as this model offers a more compact representation of the values
in memory and pairs well with optimizations like SIMD vectorization.
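
For instance, a sum over a column of 64-bit integers can operate directly on
`[]int64` slices read from each page. The following is a hedged sketch (the
`sumInt64Column` helper is illustrative, and `io` must be imported for
`io.EOF`), assuming the column chunk holds INT64 values:

```go
// Iterate the pages of a column chunk and accumulate values through the
// typed reader; pages that do not expose parquet.Int64Reader are skipped.
func sumInt64Column(chunk parquet.ColumnChunk) (int64, error) {
    var sum int64
    pages := chunk.Pages()
    for {
        p, err := pages.ReadPage()
        if err != nil {
            if err == io.EOF {
                return sum, nil // all pages consumed
            }
            return sum, err
        }
        values := make([]int64, p.NumValues())
        if r, ok := p.Values().(parquet.Int64Reader); ok {
            if _, err := r.ReadInt64s(values); err != nil && err != io.EOF {
                return sum, err
            }
            for _, v := range values {
                sum += v
            }
        }
    }
}
```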

### Optimizing Writes

Applications that deal with columnar storage are sometimes designed to work with
columnar data throughout the abstraction layers; it then becomes possible to
write columns of values directly instead of reconstructing rows from the column
values. The package offers two main mechanisms to satisfy those use cases:

#### A. Writing Columns of Typed Arrays

The first solution assumes that the program works with in-memory arrays of typed
values, for example slices of primitive Go types like `[]float32`; this would be
the case if the application is built on top of a framework like
[Apache Arrow](https://pkg.go.dev/github.com/apache/arrow/go/arrow).

`parquet.GenericBuffer[T]` is an implementation of the `parquet.RowGroup`
interface which maintains in-memory buffers of column values. Rows can be
written by either boxing primitive values into arrays of `parquet.Value`,
or type asserting the columns to access specialized versions of the write
methods accepting arrays of Go primitive types.

When using either of these models, the application is responsible for ensuring
that the same number of rows are written to each column or the resulting parquet
file will be malformed.

The following examples demonstrate how to use these two models to write columns
of Go values:

```go
type RowType struct { FirstName, LastName string }

func writeColumns(buffer *parquet.GenericBuffer[RowType], firstNames []string) error {
    values := make([]parquet.Value, len(firstNames))
    for i := range firstNames {
        values[i] = parquet.ValueOf(firstNames[i])
    }
    _, err := buffer.ColumnBuffers()[0].WriteValues(values)
    return err
}
```

```go
type RowType struct { ID int64; Value float32 }

func writeColumns(buffer *parquet.GenericBuffer[RowType], ids []int64, values []float32) error {
    if len(ids) != len(values) {
        return fmt.Errorf("number of ids and values mismatch: ids=%d values=%d", len(ids), len(values))
    }
    columns := buffer.ColumnBuffers()
    if _, err := columns[0].(parquet.Int64Writer).WriteInt64s(ids); err != nil {
        return err
    }
    if _, err := columns[1].(parquet.FloatWriter).WriteFloats(values); err != nil {
        return err
    }
    return nil
}
```

The latter is more efficient as it does not require boxing the input into an
intermediary array of `parquet.Value`. However, it may not always be the right
model depending on the situation; sometimes the generic abstraction can be a
more expressive model.

#### B. Implementing parquet.RowGroup

Programs that need full control over the construction of row groups can choose
to provide their own implementation of the `parquet.RowGroup` interface, which
includes defining implementations of `parquet.ColumnChunk` and `parquet.Page`
to expose column values of the row group.

This model can be preferable when the underlying storage or in-memory
representation of the data needs to be optimized further than what can be
achieved by using an intermediary buffering layer with `parquet.GenericBuffer[T]`.

See [parquet.RowGroup](https://pkg.go.dev/github.com/segmentio/parquet-go#RowGroup)
for the full interface documentation.

#### C. Using On-Disk Page Buffers

When generating parquet files, the writer needs to buffer all pages before it
can create the row group. This may require significant amounts of memory as the
entire file content must be buffered prior to generating it. In some cases, the
files might even be larger than the amount of memory available to the program.

The `parquet.GenericWriter[T]` can be configured to use disk storage as a
scratch buffer when generating files, by configuring a different page buffer
pool using the `parquet.ColumnPageBuffers` option and `parquet.PageBufferPool`
interface.

The `segmentio/parquet-go` package provides an implementation of the interface
which uses temporary files to store pages while a file is generated, allowing
programs to use local storage as swap space to hold pages and keep memory
utilization to a minimum. The following example demonstrates how to configure
a parquet writer to use on-disk page buffers:

```go
type RowType struct { ... }

writer := parquet.NewGenericWriter[RowType](output,
    parquet.ColumnPageBuffers(
        parquet.NewFileBufferPool("", "buffers.*"),
    ),
)
```

When a row group is complete, pages buffered to disk need to be copied back to
the output file. This results in doubling I/O operations and storage space
requirements (the system needs to have enough free disk space to hold two copies
of the file). The resulting write amplification can often be optimized away by
the kernel if the file system supports copy-on-write of disk pages, since copies
between `os.File` instances are optimized using `copy_file_range(2)` (on Linux).

See [parquet.PageBufferPool](https://pkg.go.dev/github.com/segmentio/parquet-go#PageBufferPool)
for the full interface documentation.

## Maintenance

The project is hosted and maintained by Twilio; we welcome external contributors
to participate in the form of discussions or code changes. Please review the
[Contribution](./CONTRIBUTING.md) guidelines as well as the [Code of Conduct](./CODE_OF_CONDUCT.md)
before submitting contributions.

### Continuous Integration

The project uses [GitHub Actions](https://github.com/segmentio/parquet-go/actions) for CI.