// Package goparquet is an implementation of the parquet file format in Go. It provides
// functionality to both read and write parquet files, as well as high-level functionality
// to manage the data schema of parquet files, to directly write Go objects to parquet files
// using automatic or custom marshalling, and to read records from parquet files into
// Go objects using automatic or custom marshalling.
//
// parquet is a file format to store nested data structures in a flat columnar format. By
// storing data in a column-oriented way, it allows for efficient reading of individual columns
// without having to read and decode complete rows. This allows for efficient reading and
// faster processing when using the file format in conjunction with distributed data processing
// frameworks like Apache Hadoop or distributed SQL query engines like Presto and AWS Athena.
//
// This particular implementation is divided into several packages. The top-level package
// that you're currently viewing is the low-level implementation of the file format. It is
// accompanied by the sub-packages parquetschema and floor.
//
// parquetschema provides functionality to parse textual schema definitions, as well as the
// data types to construct schema definitions programmatically. The textual schema definition
// format is based on the barely documented schema definition format that is implemented in
// the parquet Java implementation. See the parquetschema sub-package for further documentation
// on how to use this package and the grammar of the schema definition format, as well as examples.
//
// floor is a high-level wrapper around the low-level package. It provides functionality
// to open parquet files to read from them or to write to them.
// When reading from parquet files,
// floor takes care of automatically unmarshalling the low-level data into the user-provided
// Go object. When writing to parquet files, user-provided Go objects are first marshalled
// to a low-level data structure that is then written to the parquet file. These mechanisms
// allow you to directly read and write Go objects without having to deal with the details of
// the low-level parquet format. Alternatively, marshalling and unmarshalling can be implemented
// in a custom manner, giving the user maximum flexibility in case of disparities between
// the parquet schema definition and the actual Go data structure. For more information, please
// refer to the floor sub-package's documentation.
//
// To aid in working with parquet files, this package also provides a command-line tool named
// "parquet-tool" that allows you to inspect a parquet file's schema, metadata, row count and
// content, as well as to merge and split parquet files.
//
// Most users should be able to cover their regular use cases of reading and writing files
// using just the high-level floor package together with the parquetschema package. Using
// this low-level package directly is advisable only for users with more specialized
// requirements for how to work with parquet files.
//
// To write to a parquet file, the type provided by this package is the FileWriter. Create a
// new *FileWriter object using the NewFileWriter function. A number of options are available
// with which you can influence the FileWriter's behaviour. You can use these options to e.g.
// set metadata, the compression algorithm to use, the schema definition to use, or whether
// the data should be written in the V2 format.
// If you didn't set a schema definition, you then need
// to manually create columns using the functions NewDataColumn, NewListColumn and NewMapColumn,
// and then add them to the FileWriter by using the AddColumn method. To further structure
// your data into groups, use AddGroup to create groups. When you add columns to groups, you need
// to provide the full column name using dotted notation (e.g. "groupname.fieldname") to AddColumn.
// Using the AddData method, you can then add records. The provided data is of type
// map[string]interface{}. This data can be nested: to provide data for a repeated field, the
// data type to use for the map value is []interface{}. When the provided data is a group, the
// data type for the group itself again needs to be map[string]interface{}.
//
// The data within a parquet file is divided into row groups of a certain size. You can either set
// the desired row group size as a FileWriterOption, or you can manually check the estimated data
// size of the current row group using the CurrentRowGroupSize method, and use FlushRowGroup
// to write the data to disk and start a new row group. Please note that CurrentRowGroupSize
// only estimates the _uncompressed_ data size. If you've enabled compression, it is impossible
// to predict the compressed data size in advance, so the actual row groups written to disk may
// be a lot smaller than the uncompressed estimate, depending on how efficiently your data can
// be compressed.
//
// When you're done writing, always use the Close method to flush any remaining data and to
// write the file's footer.
//
// To read from files, create a FileReader object using the NewFileReader function. You can
// optionally provide a list of columns to read. If these are set, only these columns are read
// from the file, while all other columns are ignored. If no columns are provided, then all
// columns are read.
//
// With the FileReader, you can then go through the row groups (using PreLoad and SkipRowGroup),
// and iterate through the row data in each row group (using NextRow). To find out how many rows
// to expect in total and per row group, use the NumRows and RowGroupNumRows methods. The number
// of row groups can be determined using the RowGroupCount method.
package goparquet

//go:generate go run bitpack_gen.go