// Package goparquet is an implementation of the parquet file format in Go. It provides
// functionality to both read and write parquet files, as well as high-level functionality
// to manage the data schema of parquet files, to directly write Go objects to parquet files
// using automatic or custom marshalling, and to read records from parquet files into
// Go objects using automatic or custom marshalling.
//
// parquet is a file format to store nested data structures in a flat columnar format. By
// storing data in a column-oriented way, it allows for efficient reading of individual columns
// without having to read and decode complete rows. This allows for efficient reading and
// faster processing when using the file format in conjunction with distributed data processing
// frameworks like Apache Hadoop or distributed SQL query engines like Presto and AWS Athena.
//
// This particular implementation is divided into several packages. The top-level package
// that you're currently viewing is the low-level implementation of the file format. It is
// accompanied by the sub-packages parquetschema and floor.
//
// parquetschema provides functionality to parse textual schema definitions, as well as the
// data types to construct schema definitions programmatically. The textual schema definition
// format is based on the barely documented schema definition format that is implemented in
// the parquet Java implementation. See the parquetschema sub-package for further documentation
// on how to use this package and the grammar of the schema definition format, as well as examples.
//
// floor is a high-level wrapper around the low-level package. It provides functionality
// to open parquet files to read from them or to write to them.
// When reading from parquet files,
// floor takes care of automatically unmarshalling the low-level data into the user-provided
// Go object. When writing to parquet files, user-provided Go objects are first marshalled
// to a low-level data structure that is then written to the parquet file. These mechanisms
// allow you to directly read and write Go objects without having to deal with the details of
// the low-level parquet format. Alternatively, marshalling and unmarshalling can be implemented
// in a custom manner, giving the user maximum flexibility in case of disparities between
// the parquet schema definition and the actual Go data structure. For more information, please
// refer to the floor sub-package's documentation.
//
// To aid in working with parquet files, this package also provides a command-line tool named
// "parquet-tool" that allows you to inspect a parquet file's schema, metadata, row count and
// content, as well as to merge and split parquet files.
//
// Most users should be able to cover their regular use cases of reading and writing files
// using just the high-level floor package together with the parquetschema package. Using
// this low-level package directly is advisable only for users with more specialized
// requirements for how to work with parquet files.
//
// To write to a parquet file, the type provided by this package is the FileWriter. Create a
// new *FileWriter object using the NewFileWriter function. A number of options are available
// with which you can influence the FileWriter's behaviour. You can use these options to e.g.
// set metadata, the compression algorithm to use, the schema definition to use, or whether
// the data should be written in the V2 format.
// If you didn't set a schema definition, you then need
// to manually create columns using the functions NewDataColumn, NewListColumn and NewMapColumn,
// and then add them to the FileWriter by using the AddColumn method. To further structure
// your data into groups, use AddGroup to create groups. When you add columns to groups, you need
// to provide the full column name using dotted notation (e.g. "groupname.fieldname") to AddColumn.
// Using the AddData method, you can then add records. The provided data is of type
// map[string]interface{}. This data can be nested: to provide data for a repeated field, the
// data type to use for the map value is []interface{}. When the provided data is a group, the
// data type for the group itself again needs to be map[string]interface{}.
//
// The data within a parquet file is divided into row groups of a certain size. You can either set
// the desired row group size as a FileWriterOption, or you can manually check the estimated data
// size of the current row group using the CurrentRowGroupSize method, and use FlushRowGroup
// to write the data to disk and start a new row group. Please note that CurrentRowGroupSize
// only estimates the _uncompressed_ data size. If you've enabled compression, it is impossible
// to predict the compressed data size in advance, so the actual row groups written to disk may
// be a lot smaller than the uncompressed estimate, depending on how efficiently your data can
// be compressed.
//
// When you're done writing, always use the Close method to flush any remaining data and to
// write the file's footer.
//
// To read from files, create a FileReader object using the NewFileReader function. You can
// optionally provide a list of columns to read. If these are set, only these columns are read
// from the file, while all other columns are ignored. If no columns are provided, then all
// columns are read.
//
// With the FileReader, you can then go through the row groups (using PreLoad and SkipRowGroup),
// and iterate through the row data in each row group (using NextRow). To find out how many rows
// to expect in total and per row group, use the NumRows and RowGroupNumRows methods. The number
// of row groups can be determined using the RowGroupCount method.
package goparquet

//go:generate go run bitpack_gen.go