github.com/fraugster/parquet-go@v0.12.0/doc.go

// Package goparquet is an implementation of the parquet file format in Go. It provides
// functionality to both read and write parquet files, as well as high-level functionality
// to manage the data schema of parquet files, to write Go objects directly to parquet files
// using automatic or custom marshalling, and to read records from parquet files into
// Go objects using automatic or custom unmarshalling.
//
// parquet is a file format to store nested data structures in a flat columnar format. By
// storing data in a column-oriented way, it allows for efficient reading of individual columns
// without having to read and decode complete rows. This allows for efficient reading and
// faster processing when using the file format in conjunction with distributed data processing
// frameworks like Apache Hadoop or distributed SQL query engines like Presto and AWS Athena.
//
// This particular implementation is divided into several packages. The top-level package
// that you're currently viewing is the low-level implementation of the file format. It is
// accompanied by the sub-packages parquetschema and floor.
//
// parquetschema provides functionality to parse textual schema definitions, as well as the
// data types to construct schema definitions manually or programmatically. The textual
// schema definition format is based on the barely documented schema definition format
// that is implemented in the parquet Java implementation. See the parquetschema
// sub-package for further documentation on how to use this package and the grammar of
// the schema definition format, as well as examples.
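//
// As a hedged sketch (the schema text below is purely illustrative), parsing a
// textual schema definition with parquetschema looks roughly like this:
//
//	sd, err := parquetschema.ParseSchemaDefinition(`message example {
//		required int64 id;
//		optional binary name (STRING);
//	}`)
//	if err != nil {
//		// handle the parse error
//	}
//	// sd can now be passed to goparquet.WithSchemaDefinition when creating a FileWriter.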
//
// floor is a high-level wrapper around the low-level package. It provides functionality
// to open parquet files to read from them or to write to them. When reading from parquet files,
// floor takes care of automatically unmarshalling the low-level data into user-provided
// Go objects. When writing to parquet files, user-provided Go objects are first marshalled
// to a low-level data structure that is then written to the parquet file. These mechanisms
// allow you to read and write Go objects directly without having to deal with the details of the
// low-level parquet format. Alternatively, marshalling and unmarshalling can be implemented
// in a custom manner, giving the user maximum flexibility in case of disparities between
// the parquet schema definition and the actual Go data structure. For more information, please
// refer to the floor sub-package's documentation.
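//
// As an illustrative sketch (struct name, file name, the schema definition sd
// and the exact mapping of struct fields to columns are assumptions; consult
// the floor sub-package's documentation for the authoritative API), reading and
// writing Go objects with floor might look like this:
//
//	type record struct {
//		ID   int64
//		Name string
//	}
//
//	w, err := floor.NewFileWriter("records.parquet", goparquet.WithSchemaDefinition(sd))
//	if err != nil {
//		// handle error
//	}
//	if err := w.Write(record{ID: 1, Name: "alice"}); err != nil {
//		// handle error
//	}
//	if err := w.Close(); err != nil {
//		// handle error
//	}
//
//	r, err := floor.NewFileReader("records.parquet")
//	if err != nil {
//		// handle error
//	}
//	for r.Next() {
//		var rec record
//		if err := r.Scan(&rec); err != nil {
//			// handle error
//		}
//	}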
//
// To aid in working with parquet files, this package also provides a command-line tool named
// "parquet-tool" that allows you to inspect a parquet file's schema, metadata, row count and
// content, as well as to merge and split parquet files.
//
// When operating with parquet files, most users should be able to cover their regular use cases
// of reading and writing files using just the high-level floor package together with the
// parquetschema package. Only users with more specialized requirements for how they work with
// parquet files should need to use this low-level package directly.
//
// To write to a parquet file, the type provided by this package is the FileWriter. Create a
// new *FileWriter object using the NewFileWriter function. You have a number of options available
// with which you can influence the FileWriter's behaviour. You can use these options to e.g. set
// metadata, the compression algorithm to use, the schema definition to use, or whether the
// data should be written in the V2 format. If you didn't set a schema definition, you then need
// to manually create columns using the functions NewDataColumn, NewListColumn and NewMapColumn,
// and then add them to the FileWriter by using the AddColumn method. To further structure
// your data into groups, use AddGroup to create groups. When you add columns to groups, you need
// to provide the full column name using dotted notation (e.g. "groupname.fieldname") to AddColumn.
// Using the AddData method, you can then add records. The provided data is of type map[string]interface{}.
// This data can be nested: to provide data for a repeated field, the data type to use for the
// map value is []interface{}. When the provided data is a group, the data type for the group itself
// again needs to be map[string]interface{}.
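//
// Putting this together, a hedged sketch of writing a file with the low-level
// API (file name, the schema definition sd and the column values are
// illustrative; the parquet package is this module's generated thrift package
// github.com/fraugster/parquet-go/parquet):
//
//	f, err := os.Create("output.parquet")
//	if err != nil {
//		// handle error
//	}
//	fw := goparquet.NewFileWriter(f,
//		goparquet.WithSchemaDefinition(sd),
//		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
//	)
//	err = fw.AddData(map[string]interface{}{
//		"id":   int64(1),
//		"name": []byte("alice"), // string columns take []byte values
//	})
//	if err != nil {
//		// handle error
//	}
//	if err := fw.Close(); err != nil {
//		// handle error
//	}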
//
// The data within a parquet file is divided into row groups of a certain size. You can either set
// the desired row group size as a FileWriterOption, or you can manually check the estimated data
// size of the current row group using the CurrentRowGroupSize method, and use FlushRowGroup
// to write the data to disk and start a new row group. Please note that CurrentRowGroupSize
// only estimates the _uncompressed_ data size. If you've enabled compression, it is impossible
// to predict the compressed data size in advance, so the row groups actually written to disk
// may be a lot smaller than the estimate, depending on how efficiently your data can be compressed.
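//
// A small sketch of manual row group flushing (the 100 MiB threshold is an
// arbitrary example, not a recommendation):
//
//	if fw.CurrentRowGroupSize() > 100*1024*1024 {
//		if err := fw.FlushRowGroup(); err != nil {
//			// handle error
//		}
//	}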
//
// When you're done writing, always use the Close method to flush any remaining data and to
// write the file's footer.
//
// To read from files, create a FileReader object using the NewFileReader function. You can
// optionally provide a list of columns to read. If these are set, only these columns are read
// from the file, while all other columns are ignored. If no columns are provided, then all
// columns are read.
//
// With the FileReader, you can then go through the row groups (using PreLoad and SkipRowGroup)
// and iterate through the row data in each row group (using NextRow). To find out how many rows
// to expect in total and per row group, use the NumRows and RowGroupNumRows methods. The number
// of row groups can be determined using the RowGroupCount method.
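//
// A hedged sketch of reading all rows (the file name is illustrative):
//
//	f, err := os.Open("output.parquet")
//	if err != nil {
//		// handle error
//	}
//	fr, err := goparquet.NewFileReader(f)
//	if err != nil {
//		// handle error
//	}
//	for i := int64(0); i < fr.NumRows(); i++ {
//		row, err := fr.NextRow()
//		if err != nil {
//			// handle error
//		}
//		_ = row // row is a map[string]interface{} keyed by column name
//	}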
package goparquet

//go:generate go run bitpack_gen.go