// github.com/fraugster/parquet-go@v0.12.0/floor/doc.go

/*

Package floor provides a high-level interface to read from and write to parquet files. It works
in conjunction with the goparquet package.

To start writing to a parquet file, you must first create a Writer object. You can do this
using the NewWriter function and a goparquet.FileWriter object that you've previously created using
the lower-level goparquet package. Alternatively, the NewFileWriter function accepts a filename
and an optional list of goparquet.FileWriterOption objects, and returns a floor.Writer object or
an error. If you use that function, you also need to provide a parquet schema definition, which
you can create using the goparquet.CreateSchemaDefinition function.

	// Parse the textual schema; note that parquet's 64-bit float type is called double.
	sd, err := goparquet.CreateSchemaDefinition(`
		message your_msg {
			required int64 id;
			optional binary data (JSON);
			optional double score;
			optional group attrs (MAP) {
				repeated group key_value (MAP_KEY_VALUE) {
					required binary key (STRING);
					required binary value (STRING);
				}
			}
		}
	`)
	// ... (handle err)
	w, err := floor.NewFileWriter("your-file.parquet", goparquet.UseSchemaDefinition(sd))
	if err != nil {
		// ...
	}
	defer w.Close()
	// ...
	// Field names are matched to the schema's column names (id, data, score, attrs).
	record := &yourRecord{ID: id, Data: string(data), Score: computeScore(data), Attrs: map[string]string{"foo": "bar"}}
	if err := w.Write(record); err != nil {
		// ...
	}

By default, floor will use reflection to map your data structure to a parquet schema. Alternatively,
you can choose to bypass the use of reflection by implementing the floor.Marshaller interface. This is
especially useful if the structure of your parquet schema doesn't exactly match the structure of your
Go data structure but rather requires some translating or mapping.

To implement the floor.Marshaller interface, a data type needs to implement the method MarshalParquet(MarshalObject) error.
The MarshalObject object provides methods for adding fields, setting their values using any data type supported by
parquet, and structuring the object using lists and maps.

	func (r *yourRecord) MarshalParquet(obj MarshalObject) error {
		obj.AddField("id").SetInt64(r.ID)
		obj.AddField("data").SetByteArray([]byte(r.Data))
		obj.AddField("score").SetFloat64(r.Score)
		attrMap := obj.AddField("attrs").Map()
		for attrName, attrValue := range r.Attrs {
			kvPair := attrMap.Add()
			kvPair.Key().SetByteArray([]byte(attrName))
			kvPair.Value().SetByteArray([]byte(attrValue))
		}
		return nil
	}

Reflection does this work automatically for you, but in return you pay a slight performance penalty
and lose some flexibility in how you define your Go structs in relation to your parquet schema definition. If the object
that you want to write does not implement the floor.Marshaller interface, then (*Writer).Write will inspect it via reflection.
You can only write objects that are either a struct or a *struct. Write then iterates over the struct's fields, attempting to
decode each field according to its data type. Struct fields are matched up with parquet columns by converting the Go field name
to lowercase. If the lowercased field name is equal to the parquet column name, the field is matched to that column. The exact
mechanics of this may change in the future.
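
As a rough illustration of the matching rule described above, the following standalone sketch collects the exported
fields of a struct or *struct under their lowercased names. The helper columnValues and the record type are
hypothetical and use only the standard library; floor's actual implementation may differ.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// columnValues is a hypothetical helper, not part of floor. It accepts a
// struct or a pointer to a struct and returns the exported fields keyed by
// the lowercased Go field name.
func columnValues(obj interface{}) (map[string]interface{}, error) {
	v := reflect.ValueOf(obj)
	if v.Kind() == reflect.Ptr {
		if v.IsNil() {
			return nil, fmt.Errorf("nil pointer")
		}
		v = v.Elem()
	}
	if v.Kind() != reflect.Struct {
		return nil, fmt.Errorf("expected struct or *struct, got %T", obj)
	}
	t := v.Type()
	out := make(map[string]interface{}, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.PkgPath != "" {
			// Unexported fields cannot be read through Interface(); skip them.
			continue
		}
		out[strings.ToLower(f.Name)] = v.Field(i).Interface()
	}
	return out, nil
}

// record is an illustrative type; its fields would match columns "id" and "score".
type record struct {
	ID    int64
	Score float64
}

func main() {
	cols, err := columnValues(&record{ID: 7, Score: 0.5})
	if err != nil {
		panic(err)
	}
	fmt.Println(cols["id"], cols["score"])
}
```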

Boolean types and numeric types will be mapped to their parquet equivalents.

In particular, Go's int, int8, int16, int32, uint, uint8, and uint16 types will be mapped to parquet's int32 type, while
Go's int64, uint32 and uint64 types will be mapped to parquet's int64 type. Go's bool will be mapped to parquet's boolean.

Go's float32 will be mapped to parquet's float, and Go's float64 will be mapped to parquet's double.

Go strings, byte slices and byte arrays will be mapped to parquet byte arrays. If the schema specifies a fixed length
byte array instead, byte slices and byte arrays are mapped to that type, with an additional check to ensure that the
length of the slice or array matches the length given in the parquet schema definition.
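
The length check for fixed length byte arrays amounts to a simple comparison. The sketch below assumes the schema
declares a 16-byte column; checkFixedLen is a hypothetical helper and not part of floor.

```go
package main

import "fmt"

// checkFixedLen is a hypothetical helper: it reports an error unless data
// has exactly the length that the schema declares for the column.
func checkFixedLen(data []byte, typeLength int) error {
	if len(data) != typeLength {
		return fmt.Errorf("got %d bytes, schema requires exactly %d", len(data), typeLength)
	}
	return nil
}

func main() {
	var id [16]byte // a Go byte array; slice it to pass it on
	fmt.Println(checkFixedLen(id[:], 16))
	fmt.Println(checkFixedLen([]byte("abc"), 16))
}
```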

Go slices of other data types will be mapped to parquet's LIST logical type. The schema must strictly follow
this structure:

	<repetition-type> group <group-name> (LIST) {
		repeated group list {
			<repetition-type> <data-type> element;
		}
	}

Go maps will be mapped to parquet's MAP logical type. As with the LIST type, the schema must strictly follow
a particular structure:

	<repetition-type> group <group-name> (MAP) {
		repeated group key_value (MAP_KEY_VALUE) {
			<repetition-type> <data-type> key;
			<repetition-type> <data-type> value;
		}
	}

Nested Go types will be mapped to parquet groups, e.g. if your Go type is a slice of a struct, it will be encoded to match
a schema definition of a LIST logical type in which the element is a group containing the fields of the struct.

Pointers are handled automatically when analyzing a data structure via reflection. Types such as interfaces, channels
and functions do not have suitable equivalents in parquet, and are therefore unsupported. Attempting to write data structures
that involve any of these types in any of their fields will fail.

Reading from parquet files works in a similar fashion. Package floor provides the function NewFileReader. The only parameter
it accepts is the filename of your parquet file, and it returns a floor.Reader object or an error if opening the parquet file
failed. There is no need to provide any further parameters, as other necessary information, such as the parquet schema, is
stored in the parquet file itself.

The floor.Reader object is styled after the iterator pattern that can be found in other Go packages:

	r, err := floor.NewFileReader("your-file.parquet")
	// ...
	for r.Next() {
		var record yourRecord
		if err := r.Scan(&record); err != nil {
			// ...
		}
		// ...
	}
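
For comparison, the same loop shape appears in the standard library, for example in bufio.Scanner. The sketch below
uses only bufio.Scanner to show the pattern, including the check for a deferred error after the loop; it makes no claim
about which error-reporting methods floor.Reader offers, and readLines is a hypothetical helper.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// readLines is a hypothetical helper that drains an iterator-style reader:
// advance with Scan, read the current item with Text, and check Err once the
// loop has ended.
func readLines(input string) ([]string, error) {
	sc := bufio.NewScanner(strings.NewReader(input))
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return lines, nil
}

func main() {
	lines, err := readLines("first\nsecond\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(lines), lines)
}
```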

The (*floor.Reader).Scan method supports two ways of populating your objects: by default, it uses reflection. If the provided
object implements the floor.Unmarshaller interface, it will call (floor.Unmarshaller).UnmarshalParquet on the object instead. This
approach works without any reflection and gives the implementer the greatest freedom in dealing with differences
between the parquet schema definition and the Go data structures, but it also places the full burden of correctly
populating the data on the implementer.

	func (r *yourRecord) UnmarshalParquet(record floor.UnmarshalObject) error {
		idField, err := record.GetField("id")
		if err != nil {
			return err
		}
		r.ID, err = idField.Int64()
		if err != nil {
			return err
		}

		dataField, err := record.GetField("data")
		if err != nil {
			return err
		}
		dataValue, err := dataField.ByteArray()
		if err != nil {
			return err
		}
		r.Data = string(dataValue)

		scoreField, err := record.GetField("score")
		if err != nil {
			return err
		}
		r.Score, err = scoreField.Float64()
		if err != nil {
			return err
		}

		attrsField, err := record.GetField("attrs")
		if err != nil {
			return err
		}

		attrsMap, err := attrsField.Map()
		if err != nil {
			return err
		}
		r.Attrs = make(map[string]string)

		// Iterate over all key/value pairs of the MAP column.
		for attrsMap.Next() {
			keyField, err := attrsMap.Key()
			if err != nil {
				return err
			}

			valueField, err := attrsMap.Value()
			if err != nil {
				return err
			}

			key, err := keyField.ByteArray()
			if err != nil {
				return err
			}

			value, err := valueField.ByteArray()
			if err != nil {
				return err
			}

			// Keys and values arrive as byte arrays; convert them to strings.
			r.Attrs[string(key)] = string(value)
		}

		return nil
	}

As with the Writer implementation, the Reader also supports reflection. When an object that does not
implement floor.Unmarshaller is passed to the Scan function, it will use reflection to analyze
the structure of the object and fill all fields from the data read from the current record in
the parquet file. The mapping of parquet data types to Go data types is equivalent to the
Writer implementation.
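
The choice between the two paths can be pictured as a type assertion with a fallback. In this sketch, rowUnmarshaller
is a hypothetical stand-in for floor.Unmarshaller that takes a plain map instead of a floor.UnmarshalObject, and
scanInto only reports which path it would take; neither is part of floor.

```go
package main

import (
	"errors"
	"fmt"
)

// rowUnmarshaller is a hypothetical stand-in for floor.Unmarshaller.
type rowUnmarshaller interface {
	UnmarshalRow(row map[string]interface{}) error
}

// manual populates itself and therefore needs no reflection.
type manual struct{ ID int64 }

func (m *manual) UnmarshalRow(row map[string]interface{}) error {
	id, ok := row["id"].(int64)
	if !ok {
		return errors.New("id missing or not an int64")
	}
	m.ID = id
	return nil
}

// scanInto is a hypothetical dispatcher. It returns the name of the path it
// chose: "interface" if dest populates itself, "reflection" otherwise.
func scanInto(row map[string]interface{}, dest interface{}) (string, error) {
	if u, ok := dest.(rowUnmarshaller); ok {
		return "interface", u.UnmarshalRow(row)
	}
	// A real implementation would fill dest via the reflect package here.
	return "reflection", nil
}

func main() {
	m := &manual{}
	path, err := scanInto(map[string]interface{}{"id": int64(9)}, m)
	fmt.Println(path, err, m.ID)

	path, err = scanInto(nil, &struct{ ID int64 }{})
	fmt.Println(path, err)
}
```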

The object you want to have populated by Scan needs to be passed as a pointer, and on the top level
needs to be a struct. Struct fields are matched up with parquet columns by converting the Go field name
to lowercase. If the lowercased field name is equal to the parquet column name, the field is matched to
that column. The exact mechanics of this may change in the future.
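
The requirements on the Scan target can be sketched with the reflect package as follows. The function fill and the
record type are hypothetical, the row is represented as a plain map for brevity, and floor's actual implementation
may differ.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// fill is a hypothetical helper: dest must be a non-nil pointer to a struct.
// Each exported field is set from the row entry whose key equals the
// lowercased field name; fields without a matching entry are left untouched.
func fill(dest interface{}, row map[string]interface{}) error {
	v := reflect.ValueOf(dest)
	if v.Kind() != reflect.Ptr || v.IsNil() || v.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("need a non-nil pointer to a struct, got %T", dest)
	}
	v = v.Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.PkgPath != "" {
			continue // unexported fields cannot be set
		}
		name := strings.ToLower(f.Name)
		val, ok := row[name]
		if !ok {
			continue
		}
		rv := reflect.ValueOf(val)
		if !rv.IsValid() || !rv.Type().AssignableTo(f.Type) {
			return fmt.Errorf("column %q: cannot assign %T to %s", name, val, f.Type)
		}
		v.Field(i).Set(rv)
	}
	return nil
}

// record is an illustrative type; its fields match columns "id" and "data".
type record struct {
	ID   int64
	Data string
}

func main() {
	var r record
	err := fill(&r, map[string]interface{}{"id": int64(3), "data": "x"})
	fmt.Println(r.ID, r.Data, err)
}
```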

*/
package floor