# parquet-go/parquet-go [![build status](https://github.com/parquet-go/parquet-go/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/parquet-go/parquet-go/actions) [![Go Report Card](https://goreportcard.com/badge/github.com/parquet-go/parquet-go)](https://goreportcard.com/report/github.com/parquet-go/parquet-go) [![Go Reference](https://pkg.go.dev/badge/github.com/parquet-go/parquet-go.svg)](https://pkg.go.dev/github.com/parquet-go/parquet-go)

High-performance Go library to manipulate parquet files, initially developed at
[Twilio Segment](https://segment.com/engineering).

![parquet-go-logo](https://github.com/parquet-go/parquet-go/assets/96151026/5b1f043b-2cee-4a64-a3c3-40d3353fecc0)

## Motivation

Parquet has been established as a powerful solution to represent columnar data
on persistent storage media, achieving levels of compression and query
performance that make it practical to manage data sets at petabyte scale.
In addition, having intensive data applications share a common format creates
opportunities for interoperation in our toolkits, providing greater leverage
and value to engineers maintaining and operating those systems.

The creation and evolution of large scale data management systems, combined
with realtime expectations, come with challenging maintenance and performance
requirements that existing solutions for using parquet with Go were not
addressing.

The `parquet-go/parquet-go` package was designed and developed to respond to those
challenges, offering high level APIs to read and write parquet files, while
keeping a low compute and memory footprint in order to be used in environments
where data volumes and cost constraints require software to achieve high levels
of efficiency.

## Specification

Columnar storage allows Parquet to store data more efficiently than, say,
using JSON or Protobuf. For more information, refer to the [Parquet Format Specification](https://github.com/apache/parquet-format).

## Installation

The package is distributed as a standard Go module that programs can take a
dependency on and install with the following command:

```
go get github.com/parquet-go/parquet-go
```

Go 1.20 or later is required to use the package.

### Compatibility Guarantees

The package is currently released as a pre-v1 version, which gives maintainers
the freedom to break backward compatibility to help improve the APIs as we learn
which initial design decisions would need to be revisited to better support the
use cases that the library solves for. These occurrences are expected to be rare,
and documentation will be produced to guide users on how to adapt their programs
to breaking changes.

## Usage

The following sections describe how to use APIs exposed by the library,
highlighting the use cases with code examples to demonstrate how they are used
in practice.

### Writing Parquet Files: [parquet.GenericWriter[T]](https://pkg.go.dev/github.com/parquet-go/parquet-go#GenericWriter)

A parquet file is a collection of rows sharing the same schema, arranged in
columns to support faster scan operations on subsets of the data set.

For simple use cases, the `parquet.WriteFile[T]` function allows the creation
of parquet files on the file system from a slice of Go values representing the
rows to write to the file.

```go
type RowType struct { FirstName, LastName string }

if err := parquet.WriteFile("file.parquet", []RowType{
    {FirstName: "Bob"},
    {FirstName: "Alice"},
}); err != nil {
    ...
}
```

The `parquet.GenericWriter[T]` type denormalizes rows into columns, then encodes
the columns into a parquet file, generating row groups, column chunks, and pages
based on configurable heuristics.

```go
type RowType struct { FirstName, LastName string }

writer := parquet.NewGenericWriter[RowType](output)

_, err := writer.Write([]RowType{
    ...
})
if err != nil {
    ...
}

// Closing the writer is necessary to flush buffers and write the file footer.
if err := writer.Close(); err != nil {
    ...
}
```

Explicit declaration of the parquet schema on a writer is useful when the
application needs to ensure that data written to a file adheres to a predefined
schema, which may differ from the schema derived from the writer's type
parameter. The `parquet.Schema` type is an in-memory representation of the schema
of parquet rows, translated from the type of Go values, and can be used for this
purpose.

```go
schema := parquet.SchemaOf(new(RowType))
writer := parquet.NewGenericWriter[any](output, schema)
...
```

### Reading Parquet Files: [parquet.GenericReader[T]](https://pkg.go.dev/github.com/parquet-go/parquet-go#GenericReader)

For simple use cases where the data set fits in memory and the program will
read most rows of the file, the `parquet.ReadFile[T]` function returns a slice
of Go values representing the rows read from the file.

```go
type RowType struct { FirstName, LastName string }

rows, err := parquet.ReadFile[RowType]("file.parquet")
if err != nil {
    ...
}

for _, row := range rows {
    fmt.Printf("%+v\n", row)
}
```

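For larger files, rows can also be streamed in batches with the
`parquet.GenericReader[T]` type; the following is a minimal sketch, assuming
`file` is an `io.ReaderAt` opened by the application:

```go
type RowType struct { FirstName, LastName string }

reader := parquet.NewGenericReader[RowType](file)
defer reader.Close()

rows := make([]RowType, 16)
for {
    // Read fills the slice with up to len(rows) rows and returns io.EOF
    // after the last rows of the file have been read.
    n, err := reader.Read(rows)
    for _, row := range rows[:n] {
        fmt.Printf("%+v\n", row)
    }
    if err != nil {
        break // io.EOF when all rows have been read
    }
}
```
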
The expected schema of rows can be explicitly declared when the reader is
constructed, which is useful to ensure that the program receives rows matching
a specific format; for example, when dealing with files from remote storage
sources that applications cannot trust to have used an expected schema.

Configuring the schema of a reader is done by passing a `parquet.Schema`
instance as argument when constructing a reader. When the schema is declared,
conversion rules implemented by the package are applied to ensure that rows
read by the application match the desired format (see **Evolving Parquet Schemas**).

```go
schema := parquet.SchemaOf(new(RowType))
reader := parquet.NewGenericReader[any](file, schema)
...
```

### Inspecting Parquet Files: [parquet.File](https://pkg.go.dev/github.com/parquet-go/parquet-go#File)

Sometimes, lower-level APIs can be useful to leverage the columnar layout of
parquet files. The `parquet.File` type is intended to provide such features to
Go applications, by exposing APIs to iterate over the various parts of a
parquet file.

```go
f, err := parquet.OpenFile(file, size)
if err != nil {
    ...
}

for _, rowGroup := range f.RowGroups() {
    for _, columnChunk := range rowGroup.ColumnChunks() {
        ...
    }
}
```

### Evolving Parquet Schemas: [parquet.Convert](https://pkg.go.dev/github.com/parquet-go/parquet-go#Convert)

Parquet files embed all the metadata necessary to interpret their content,
including a description of the schema of the tables represented by the rows and
columns they contain.

Parquet files are also immutable; once written, there is no mechanism for
_updating_ a file. If their contents need to be changed, rows must be read,
modified, and written to a new file.

Because applications evolve, the schemas written to parquet files also tend to
evolve over time. These requirements create challenges when applications need
to operate on parquet files with heterogeneous schemas: algorithms that expect
new columns to exist may have issues dealing with rows that come from files with
mismatching schema versions.

To help build applications that can handle evolving schemas, `parquet-go/parquet-go`
implements conversion rules that create views of row groups to translate between
schema versions.

The `parquet.Convert` function is the low-level routine constructing conversion
rules from a source to a target schema. The function is used to build converted
views of `parquet.RowReader` or `parquet.RowGroup`, for example:

```go
type RowTypeV1 struct { ID int64; FirstName string }
type RowTypeV2 struct { ID int64; FirstName, LastName string }

source := parquet.SchemaOf(RowTypeV1{})
target := parquet.SchemaOf(RowTypeV2{})

conversion, err := parquet.Convert(target, source)
if err != nil {
    ...
}

targetRowGroup := parquet.ConvertRowGroup(sourceRowGroup, conversion)
...
```

Conversion rules are automatically applied by the `parquet.CopyRows` function
when the reader and writer passed to the function also implement the
`parquet.RowReaderWithSchema` and `parquet.RowWriterWithSchema` interfaces.
The copy determines whether the reader and writer schemas can be converted from
one to the other, and automatically applies the conversion rules to facilitate
the translation between schemas.
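For example, since `parquet.GenericReader[T]` and `parquet.GenericWriter[T]`
both carry a schema, copying between them translates rows automatically; a
minimal sketch, assuming `input` and `output` were opened by the application
and reusing the `RowTypeV1` and `RowTypeV2` types defined above:

```go
reader := parquet.NewGenericReader[RowTypeV1](input)
writer := parquet.NewGenericWriter[RowTypeV2](output)

// Rows read with the RowTypeV1 schema are converted to the RowTypeV2
// schema as they are copied to the writer.
if _, err := parquet.CopyRows(writer, reader); err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```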

At this time, conversion rules only support adding or removing columns from
the schemas; no type conversions are performed, and there is no way to rename
columns. More advanced conversion rules may be added in the future.

### Sorting Row Groups: [parquet.GenericBuffer[T]](https://pkg.go.dev/github.com/parquet-go/parquet-go#GenericBuffer)

The `parquet.GenericWriter[T]` type is optimized for minimal memory usage,
keeping the order of rows unchanged and flushing pages as soon as they are filled.

Parquet supports expressing columns by which rows are sorted through the
declaration of _sorting columns_ on row groups. Sorting row groups requires
buffering all rows before ordering and writing them to a parquet file.

To help with those use cases, the `parquet-go/parquet-go` package exposes the
`parquet.GenericBuffer[T]` type, which acts as a buffer of rows and implements
`sort.Interface` to allow applications to sort rows prior to writing them
to a file.

The columns by which rows are ordered are configured when creating
`parquet.GenericBuffer[T]` instances, using the `parquet.SortingColumns` function
to construct the row group options configuring the buffer. The type of parquet
columns defines how values are compared; see [Parquet Logical Types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
for details.

When written to a file, the buffer is materialized into a single row group with
the declared sorting columns. After being written, buffers can be reused by
calling their `Reset` method, as shown after the example below.

The following example shows how to use a `parquet.GenericBuffer[T]` to order rows
written to a parquet file:

```go
type RowType struct { FirstName, LastName string }

buffer := parquet.NewGenericBuffer[RowType](
    parquet.SortingRowGroupConfig(
        parquet.SortingColumns(
            parquet.Ascending("LastName"),
            parquet.Ascending("FirstName"),
        ),
    ),
)

buffer.Write([]RowType{
    {FirstName: "Luke", LastName: "Skywalker"},
    {FirstName: "Han", LastName: "Solo"},
    {FirstName: "Anakin", LastName: "Skywalker"},
})

sort.Sort(buffer)

writer := parquet.NewGenericWriter[RowType](output)
_, err := parquet.CopyRows(writer, buffer.Rows())
if err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```
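
Once the rows have been written, the same buffer can be reused to accumulate
the next row group:

```go
// Reset clears the buffered rows while retaining the buffer configuration.
buffer.Reset()
```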

### Merging Row Groups: [parquet.MergeRowGroups](https://pkg.go.dev/github.com/parquet-go/parquet-go#MergeRowGroups)

Parquet files are often used as part of the underlying engine for data
processing or storage layers, in which case merging multiple row groups
into one that contains more rows can be a useful operation to improve query
performance; for example, bloom filters in parquet files are stored for each
row group, so the larger the row group, the fewer filters need to be stored
and the more effective they become.

The `parquet-go/parquet-go` package supports creating merged views of row groups,
where the view contains all the rows of the merged groups, maintaining the order
defined by the sorting columns of the groups.

There are a few constraints when merging row groups:

* The sorting columns of all the row groups must be the same, or the merge
  operation must be explicitly configured with a set of sorting columns which
  are a prefix of the sorting columns of all merged row groups.

* The schemas of the row groups must all be equal, or the merge operation must
  be explicitly configured with a schema that all row groups can be converted
  to, in which case the limitations of schema conversions apply (see the
  sketch after this list).

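A merge across schema versions can be configured explicitly; a sketch, assuming
the row groups were written with the `RowTypeV1` and `RowTypeV2` types from the
conversion example, and that all groups are sorted by `ID`:

```go
merge, err := parquet.MergeRowGroups(rowGroups,
    // Convert all row groups to the latest schema before merging them,
    // ordering the merged rows by the ID column.
    parquet.SchemaOf(RowTypeV2{}),
    parquet.SortingRowGroupConfig(
        parquet.SortingColumns(parquet.Ascending("ID")),
    ),
)
if err != nil {
    ...
}
```
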
Once a merged view is created, it may be written to a new parquet file or buffer
in order to create a larger row group:

```go
merge, err := parquet.MergeRowGroups(rowGroups)
if err != nil {
    ...
}

writer := parquet.NewGenericWriter[RowType](output)
_, err = parquet.CopyRows(writer, merge)
if err != nil {
    ...
}
if err := writer.Close(); err != nil {
    ...
}
```

### Using Bloom Filters: [parquet.BloomFilter](https://pkg.go.dev/github.com/parquet-go/parquet-go#BloomFilter)

Parquet files can embed bloom filters to help improve the performance of point
lookups in the files. The format of parquet bloom filters is documented in the
parquet specification: [Parquet Bloom Filter](https://github.com/apache/parquet-format/blob/master/BloomFilter.md).

By default, no bloom filters are created in parquet files, but applications can
configure the list of columns to create filters for using the `parquet.BloomFilters`
option when instantiating writers; for example:

```go
type RowType struct {
    FirstName string `parquet:"first_name"`
    LastName  string `parquet:"last_name"`
}

const filterBitsPerValue = 10
writer := parquet.NewGenericWriter[RowType](output,
    parquet.BloomFilters(
        // Configures the writer to generate split-block bloom filters for the
        // "first_name" and "last_name" columns of the parquet schema of rows
        // written by the application.
        parquet.SplitBlockFilter(filterBitsPerValue, "first_name"),
        parquet.SplitBlockFilter(filterBitsPerValue, "last_name"),
    ),
)
...
```

Generating bloom filters requires knowing how many values exist in a column
chunk in order to properly size the filter, which requires buffering all the
values written to the column in memory. Because of this, the memory footprint
of `parquet.GenericWriter[T]` increases linearly with the number of columns
that the writer needs to generate filters for. This extra cost is optimized
away when rows are copied from a `parquet.GenericBuffer[T]` to a writer, since
in this case the number of values per column is known because the buffer already
holds all the values in memory.
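
For example, rows can be staged in a `parquet.GenericBuffer[T]` and copied to
the writer in a single pass; a minimal sketch, assuming `rows` is a `[]RowType`
slice and `writer` is the bloom-filter-enabled writer created above:

```go
buffer := parquet.NewGenericBuffer[RowType]()

if _, err := buffer.Write(rows); err != nil {
    ...
}

// The buffer already knows how many values each column holds, so the writer
// can size the bloom filters without buffering the column values again.
if _, err := parquet.CopyRows(writer, buffer.Rows()); err != nil {
    ...
}
```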

When reading parquet files, column chunks expose the generated bloom filters
with the `parquet.ColumnChunk.BloomFilter` method, returning a
`parquet.BloomFilter` instance if a filter was available, or `nil` when there
were no filters.

Using bloom filters is useful when performing point lookups in parquet files,
that is, searching for column rows matching a given value. Programs can quickly
eliminate column chunks that they know do not contain the value they search
for by checking the filter first, which is often multiple orders of magnitude
faster than scanning the column.

The following code snippet highlights how filters are typically used:

```go
var candidateChunks []parquet.ColumnChunk

for _, rowGroup := range file.RowGroups() {
    columnChunk := rowGroup.ColumnChunks()[columnIndex]
    bloomFilter := columnChunk.BloomFilter()

    if bloomFilter != nil {
        if ok, err := bloomFilter.Check(value); err != nil {
            ...
        } else if !ok {
            // Bloom filters may return false positives, but never return false
            // negatives, so we know this column chunk does not contain the value.
            continue
        }
    }

    candidateChunks = append(candidateChunks, columnChunk)
}
```

## Optimizations

The following sections describe common optimization techniques supported by the
library.

### Optimizing Reads

Lower level APIs used to read parquet files offer more efficient ways to access
column values. Consecutive sequences of values are grouped into pages which are
represented by the `parquet.Page` interface.

A column chunk may contain multiple pages, each holding a section of the column
values. Applications can retrieve the column values either by reading them into
buffers of `parquet.Value`, or by type asserting the page's value reader to read
arrays of primitive Go values. The following example demonstrates how to use both
mechanisms to read column values:

```go
pages := column.Pages()
defer func() {
    checkErr(pages.Close())
}()

for {
    p, err := pages.ReadPage()
    if err != nil {
        ... // io.EOF when there are no more pages
    }

    switch page := p.Values().(type) {
    case parquet.Int32Reader:
        values := make([]int32, p.NumValues())
        _, err := page.ReadInt32s(values)
        ...
    case parquet.Int64Reader:
        values := make([]int64, p.NumValues())
        _, err := page.ReadInt64s(values)
        ...
    default:
        values := make([]parquet.Value, p.NumValues())
        _, err := page.ReadValues(values)
        ...
    }
}
```

Reading arrays of typed values is often preferable when performing aggregations
on the values, as this model offers a more compact representation of the values
in memory and pairs well with the use of optimizations like SIMD vectorization.

### Optimizing Writes

Applications that deal with columnar storage are sometimes designed to work with
columnar data throughout the abstraction layers; it then becomes possible to
write columns of values directly instead of reconstructing rows from the column
values. The package offers two main mechanisms to satisfy those use cases:

#### A. Writing Columns of Typed Arrays

The first solution assumes that the program works with in-memory arrays of typed
values, for example slices of primitive Go types like `[]float32`; this would be
the case if the application is built on top of a framework like
[Apache Arrow](https://pkg.go.dev/github.com/apache/arrow/go/arrow).

`parquet.GenericBuffer[T]` is an implementation of the `parquet.RowGroup`
interface which maintains in-memory buffers of column values. Rows can be
written by either boxing primitive values into arrays of `parquet.Value`,
or by type asserting the columns to access specialized versions of the write
methods accepting arrays of Go primitive types.

When using either of these models, the application is responsible for ensuring
that the same number of rows are written to each column, or the resulting
parquet file will be malformed.

The following examples demonstrate how to use these two models to write columns
of Go values:

```go
type RowType struct { FirstName, LastName string }

func writeColumns(buffer *parquet.GenericBuffer[RowType], firstNames []string) error {
    values := make([]parquet.Value, len(firstNames))
    for i := range firstNames {
        values[i] = parquet.ValueOf(firstNames[i])
    }
    _, err := buffer.ColumnBuffers()[0].WriteValues(values)
    return err
}
```

```go
type RowType struct { ID int64; Value float32 }

func writeColumns(buffer *parquet.GenericBuffer[RowType], ids []int64, values []float32) error {
    if len(ids) != len(values) {
        return fmt.Errorf("number of ids and values mismatch: ids=%d values=%d", len(ids), len(values))
    }
    columns := buffer.ColumnBuffers()
    if _, err := columns[0].(parquet.Int64Writer).WriteInt64s(ids); err != nil {
        return err
    }
    if _, err := columns[1].(parquet.FloatWriter).WriteFloats(values); err != nil {
        return err
    }
    return nil
}
```

The latter is more efficient as it does not require boxing the input into an
intermediary array of `parquet.Value`. However, it may not always be the right
model depending on the situation; sometimes the generic abstraction can be a
more expressive model.

#### B. Implementing parquet.RowGroup

Programs that need full control over the construction of row groups can choose
to provide their own implementation of the `parquet.RowGroup` interface, which
includes defining implementations of `parquet.ColumnChunk` and `parquet.Page`
to expose column values of the row group.

This model can be preferable when the underlying storage or in-memory
representation of the data needs to be optimized further than what can be
achieved by using an intermediary buffering layer with `parquet.GenericBuffer[T]`.

See [parquet.RowGroup](https://pkg.go.dev/github.com/parquet-go/parquet-go#RowGroup)
for the full interface documentation.
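
To give a sense of the surface involved, here is a minimal skeleton of the
interface; `customRowGroup` is a hypothetical type and the method bodies are
stubs for the application to fill in:

```go
// customRowGroup sketches a custom parquet.RowGroup implementation backed by
// application-defined column chunks.
type customRowGroup struct {
    schema  *parquet.Schema
    columns []parquet.ColumnChunk
    numRows int64
}

func (g *customRowGroup) NumRows() int64 { return g.numRows }

func (g *customRowGroup) ColumnChunks() []parquet.ColumnChunk { return g.columns }

func (g *customRowGroup) Schema() *parquet.Schema { return g.schema }

func (g *customRowGroup) SortingColumns() []parquet.SortingColumn { return nil }

// Rows must return an iterator that reconstructs rows from the column chunks.
func (g *customRowGroup) Rows() parquet.Rows { panic("not implemented") }
```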

#### C. Using On-Disk Page Buffers

When generating parquet files, the writer needs to buffer all pages before it
can create the row group. This may require significant amounts of memory as the
entire file content must be buffered prior to generating it. In some cases, the
files might even be larger than the amount of memory available to the program.

The `parquet.GenericWriter[T]` can be configured to use disk storage instead as
a scratch buffer when generating files, by configuring a different page buffer
pool using the `parquet.ColumnPageBuffers` option and `parquet.PageBufferPool`
interface.

The `parquet-go/parquet-go` package provides an implementation of the interface
which uses temporary files to store pages while a file is generated, allowing
programs to use local storage as swap space to hold pages and keep memory
utilization to a minimum. The following example demonstrates how to configure
a parquet writer to use on-disk page buffers:

```go
type RowType struct { ... }

writer := parquet.NewGenericWriter[RowType](output,
    parquet.ColumnPageBuffers(
        parquet.NewFileBufferPool("", "buffers.*"),
    ),
)
```

When a row group is complete, pages buffered to disk need to be copied back to
the output file. This results in doubled I/O operations and storage space
requirements (the system needs to have enough free disk space to hold two copies
of the file). The resulting write amplification can often be optimized away by
the kernel if the file system supports copy-on-write of disk pages, since copies
between `os.File` instances are optimized using `copy_file_range(2)` (on Linux).

See [parquet.PageBufferPool](https://pkg.go.dev/github.com/parquet-go/parquet-go#PageBufferPool)
for the full interface documentation.

## Maintenance

While initial design and development occurred at Twilio Segment, the project is
now maintained by the open source community. We welcome external contributors
to participate in the form of discussions or code changes. Please review the
[Contributing](./CONTRIBUTING.md) guidelines as well as the [Code of Conduct](./CODE_OF_CONDUCT.md)
before submitting contributions.

### Continuous Integration

The project uses [GitHub Actions](https://github.com/parquet-go/parquet-go/actions) for CI.

### Debugging

The package has debugging capabilities built in which can be turned on using the
`PARQUETGODEBUG` environment variable. The value follows a model similar to
`GODEBUG`: it must be formatted as a comma-separated list of `key=value` pairs.

The following debug flags are currently supported:

- `tracebuf=1` turns on tracing of internal buffers, which validates that
  reference counters are set to zero when buffers are reclaimed by the garbage
  collector. When the package detects that a buffer was leaked, it logs an error
  message along with the stack trace captured when the buffer was last used.
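
For example, assuming a program binary named `myprogram`, buffer tracing can be
enabled like this:

```
PARQUETGODEBUG=tracebuf=1 ./myprogram
```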