github.com/fraugster/parquet-go@v0.12.0/floor/doc.go

/*

Package floor provides a high-level interface to read from and write to parquet files. It works
in conjunction with the goparquet package.

To start writing to a parquet file, you must first create a Writer object. You can do this
using the NewWriter function and a goparquet.FileWriter object that you've previously created using
the lower-level goparquet package. Alternatively, there is also a NewFileWriter function available
that accepts a filename and an optional list of goparquet.FileWriterOption objects, and returns
a floor.Writer object or an error. If you use that function, you also need to provide a
parquet schema definition, which you can create using the goparquet.CreateSchemaDefinition
function.

	sd, err := goparquet.CreateSchemaDefinition(`
		message your_msg {
			required int64 id;
			optional binary data (JSON);
			optional double score;
			optional group attrs (MAP) {
				repeated group key_value (MAP_KEY_VALUE) {
					required binary key (STRING);
					required binary value (STRING);
				}
			}
		}
	`)
	// ...
	w, err := floor.NewFileWriter("your-file.parquet", goparquet.UseSchemaDefinition(sd))
	if err != nil {
		// ...
	}
	defer w.Close()
	// ...
	record := &yourRecord{ID: id, Data: string(data), Score: computeScore(data), Attrs: map[string]string{"foo": "bar"}}
	if err := w.Write(record); err != nil {
		// ...
	}

By default, floor will use reflection to map your data structure to a parquet schema. Alternatively,
you can choose to bypass reflection by implementing the floor.Marshaller interface. This is
especially useful if the structure of your parquet schema doesn't exactly match the structure of your
Go data structure and some translation or mapping is required.

To implement the floor.Marshaller interface, a data type needs to implement the method MarshalParquet(MarshalObject) error.
The MarshalObject argument provides methods for adding fields, setting their values for the data types supported by
parquet, and structuring the object using lists and maps.

	func (r *yourRecord) MarshalParquet(obj MarshalObject) error {
		obj.AddField("id").SetInt64(r.ID)
		obj.AddField("data").SetByteArray([]byte(r.Data))
		obj.AddField("score").SetFloat64(r.Score)
		attrMap := obj.AddField("attrs").Map()
		for attrName, attrValue := range r.Attrs {
			kvPair := attrMap.Add()
			kvPair.Key().SetByteArray([]byte(attrName))
			kvPair.Value().SetByteArray([]byte(attrValue))
		}
		return nil
	}

Reflection does this work automatically for you, but in turn you pay a slight performance penalty for using reflection,
and you lose some flexibility in how you define your Go structs in relation to your parquet schema definition. If the object
that you want to write does not implement the floor.Marshaller interface, then (*Writer).Write will inspect it via reflection.
You can only write objects that are either a struct or a *struct. It will then iterate over the struct's fields, attempting to
encode each field according to its data type. Struct fields are matched up with parquet columns by converting the Go field name
to lowercase. If the lowercased field name is equal to the parquet column name, it's a positive match. The exact mechanics of this may
change in the future.
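For illustration, the yourRecord type used throughout these examples could be defined as in the following
sketch. The field types are inferred from the record literal above and from the mapping rules described
below; this is an assumption for illustration, not a definition taken from the package.

	// A sketch of yourRecord: the field names lowercase to id, data, score
	// and attrs, matching the columns of the schema definition above.
	type yourRecord struct {
		ID    int64             // required int64 id
		Data  string            // optional binary data (JSON)
		Score float64           // optional double score
		Attrs map[string]string // optional group attrs (MAP)
	}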
Boolean types and numeric types will be mapped to their parquet equivalents. In particular, Go's int, int8, int16, int32,
uint, uint8, and uint16 types will be mapped to parquet's int32 type, while Go's int64, uint32 and uint64 types will be
mapped to parquet's int64 type. Go's bool will be mapped to parquet's boolean.

Go's float32 will be mapped to parquet's float, and Go's float64 will be mapped to parquet's double.

Go strings, byte slices and byte arrays will be mapped to parquet byte arrays. If the schema specifies a fixed-length byte
array, byte slices and byte arrays are mapped to that type instead, with an additional check to ensure that the length of
the slice or array matches the parquet schema definition.

Go slices of other data types will be mapped to parquet's LIST logical type. Strict adherence to a structure like this
is enforced:

	<repetition-type> group <group-name> (LIST) {
		repeated group list {
			<repetition-type> <data-type> element;
		}
	}

Go maps will be mapped to parquet's MAP logical type. As with the LIST type, strict adherence to a particular structure
is enforced:

	<repetition-type> group <group-name> (MAP) {
		repeated group key_value (MAP_KEY_VALUE) {
			<repetition-type> <data-type> key;
			<repetition-type> <data-type> value;
		}
	}

Nested Go types will be mapped to parquet groups, e.g. if your Go type is a slice of a struct, it will be encoded to match
a schema definition of a LIST logical type in which the element is a group containing the fields of the struct.
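To make the LIST and MAP structures above more concrete, the following sketch pairs a struct containing a slice and
a map with a schema definition they would line up with. The type name, column names and repetition types here are
illustrative assumptions only, not part of the package's API.

	// Sketch only: tags and attrs are hypothetical columns, and the
	// repetition types were chosen purely for illustration.
	type taggedRecord struct {
		Tags  []string          // corresponds to the LIST group "tags"
		Attrs map[string]string // corresponds to the MAP group "attrs"
	}

	sd, err := goparquet.CreateSchemaDefinition(`
		message tagged_record {
			optional group tags (LIST) {
				repeated group list {
					required binary element (STRING);
				}
			}
			optional group attrs (MAP) {
				repeated group key_value (MAP_KEY_VALUE) {
					required binary key (STRING);
					required binary value (STRING);
				}
			}
		}
	`)
	// ...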
Pointers are automagically taken care of when analyzing a data structure via reflection. Types such as interfaces, channels
and functions do not have suitable equivalents in parquet, and are therefore unsupported. Attempting to write data structures
that involve any of these types in any of their fields will fail.

Reading from parquet files works in a similar fashion. Package floor provides the function NewFileReader. The only parameter
it accepts is the filename of your parquet file, and it returns a floor.Reader object or an error if opening the parquet file
failed. There is no need to provide any further parameters, as other necessary information, such as the parquet schema, is
stored in the parquet file itself.

The floor.Reader object is styled after the iterator pattern that can be found in other Go packages:

	r, err := floor.NewFileReader("your-file.parquet")
	// ...
	for r.Next() {
		var record yourRecord
		if err := r.Scan(&record); err != nil {
			// ...
		}
		// ...
	}

The (*floor.Reader).Scan method supports two ways of populating your objects: by default, it uses reflection. If the provided
object implements the floor.Unmarshaller interface, it will call (floor.Unmarshaller).UnmarshalParquet on the object instead. This
approach works without any reflection and gives the implementer the greatest freedom in terms of dealing with differences
between the parquet schema definition and the Go data structures, but it also puts all the burden of correctly populating the
data onto the implementer. To implement the floor.Unmarshaller interface, a data type needs to implement the method
UnmarshalParquet(UnmarshalObject) error.

	func (r *yourRecord) UnmarshalParquet(record floor.UnmarshalObject) error {
		idField, err := record.GetField("id")
		if err != nil {
			return err
		}
		r.ID, err = idField.Int64()
		if err != nil {
			return err
		}

		dataField, err := record.GetField("data")
		if err != nil {
			return err
		}
		dataValue, err := dataField.ByteArray()
		if err != nil {
			return err
		}
		r.Data = string(dataValue)

		scoreField, err := record.GetField("score")
		if err != nil {
			return err
		}
		r.Score, err = scoreField.Float64()
		if err != nil {
			return err
		}

		attrsField, err := record.GetField("attrs")
		if err != nil {
			return err
		}

		attrsMap, err := attrsField.Map()
		if err != nil {
			return err
		}
		r.Attrs = make(map[string]string)

		for attrsMap.Next() {
			keyField, err := attrsMap.Key()
			if err != nil {
				return err
			}

			valueField, err := attrsMap.Value()
			if err != nil {
				return err
			}

			key, err := keyField.ByteArray()
			if err != nil {
				return err
			}

			value, err := valueField.ByteArray()
			if err != nil {
				return err
			}

			r.Attrs[string(key)] = string(value)
		}

		return nil
	}

As with the Writer implementation, the Reader also supports reflection. When an object without a
floor.Unmarshaller implementation is passed to the Scan function, it will use reflection to analyze
the structure of the object and fill all fields from the data read from the current record in
the parquet file. The mapping of parquet data types to Go data types is equivalent to the
Writer implementation.

The object you want to have populated by Scan needs to be passed as a pointer, and on the top level
needs to be a struct. Struct fields are matched up with parquet columns by converting the Go field name
to lowercase. If the lowercased field name is equal to the parquet column name, it's a positive match. The exact
mechanics of this may change in the future.

*/
package floor