[![CI Status](https://github.com/tobgu/qframe/actions/workflows/ci.yaml/badge.svg)](https://github.com/tobgu/qframe/actions/workflows/ci.yaml)
[![Go Coverage](https://github.com/tobgu/qframe/wiki/coverage.svg)](https://raw.githack.com/wiki/tobgu/qframe/coverage.html)
[![Go Report Card](https://goreportcard.com/badge/github.com/tobgu/qframe)](https://goreportcard.com/report/github.com/tobgu/qframe)
[![GoDoc](https://godoc.org/github.com/tobgu/qframe?status.svg)](https://godoc.org/github.com/tobgu/qframe)

QFrame is an immutable data frame that supports filtering, aggregation
and data manipulation. Any operation on a QFrame results in
a new QFrame; the original QFrame remains unchanged. This can be done
fairly efficiently since much of the underlying data will be shared
between the two frames.

The design of QFrame has mainly been driven by the requirements from
[qocache](https://github.com/tobgu/qocache) but it is in many aspects
a general purpose data frame. Suggestions for added or improved
functionality to support a wider scope are always of interest as long
as they don't conflict with the requirements from qocache!
See [Contribute](#contribute).

## Installation
`go get github.com/tobgu/qframe`

## Usage
Below are some examples of common use cases. The list is not exhaustive
in any way. For a complete description of all operations including more
examples see the [docs](https://godoc.org/github.com/tobgu/qframe).

### IO
QFrames can currently be read from and written to CSV, record
oriented JSON, and any SQL database supported by a Go `database/sql`
driver.

#### CSV Data

Read CSV data:
```go
input := `COL1,COL2
a,1.5
b,2.25
c,3.0`

f := qframe.ReadCSV(strings.NewReader(input))
fmt.Println(f)
```
Output:
```
COL1(s) COL2(f)
------- -------
      a     1.5
      b    2.25
      c       3

Dims = 2 x 3
```

#### SQL Data

QFrame supports reading and writing data through the standard library `database/sql`
drivers. It has been tested with [SQLite](https://github.com/mattn/go-sqlite3), [Postgres](https://github.com/lib/pq), and [MariaDB](https://github.com/go-sql-driver/mysql).

##### SQLite Example

Load data to and from an in-memory SQLite database. Note
that this example requires you to have [go-sqlite3](https://github.com/mattn/go-sqlite3) installed
prior to running.

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3"
	"github.com/tobgu/qframe"
	qsql "github.com/tobgu/qframe/config/sql"
)

func main() {
	// Create a new in-memory SQLite database.
	db, _ := sql.Open("sqlite3", ":memory:")
	// Add a new table.
	db.Exec(`
	CREATE TABLE test (
		COL1 INT,
		COL2 REAL,
		COL3 TEXT,
		COL4 BOOL
	);`)
	// Create a new QFrame to populate our table with.
	qf := qframe.New(map[string]interface{}{
		"COL1": []int{1, 2, 3},
		"COL2": []float64{1.1, 2.2, 3.3},
		"COL3": []string{"one", "two", "three"},
		"COL4": []bool{true, true, true},
	})
	fmt.Println(qf)
	// Start a new SQL Transaction.
	tx, _ := db.Begin()
	// Write the QFrame to the database.
	qf.ToSQL(tx,
		// Write only to the test table
		qsql.Table("test"),
		// Explicitly set SQLite compatibility.
		qsql.SQLite(),
	)
	// Create a new QFrame from SQL.
	newQf := qframe.ReadSQL(tx,
		// A query must return at least one column. In this
		// case it will return all of the columns we created above.
		qsql.Query("SELECT * FROM test"),
		// SQLite stores boolean values as integers, so we
		// can coerce them back to bools with the CoercePair option.
		qsql.Coerce(qsql.CoercePair{Column: "COL4", Type: qsql.Int64ToBool}),
		qsql.SQLite(),
	)
	fmt.Println(newQf)
	fmt.Println(newQf.Equals(qf))
}
```

Output:

```
COL1(i) COL2(f) COL3(s) COL4(b)
------- ------- ------- -------
      1     1.1     one    true
      2     2.2     two    true
      3     3.3   three    true

Dims = 4 x 3
true
```

### Filtering
Filtering can be done either by applying individual filters
to the QFrame or by combining filters using AND and OR.

Filter with OR-clause:
```go
f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
newF := f.Filter(qframe.Or(
	qframe.Filter{Column: "COL1", Comparator: ">", Arg: 2},
	qframe.Filter{Column: "COL2", Comparator: "=", Arg: "a"}))
fmt.Println(newF)
```

Output:
```
COL1(i) COL2(s)
------- -------
      1       a
      3       c

Dims = 2 x 2
```

### Grouping and aggregation
Grouping and aggregation are done in two distinct steps. The function
used in the aggregation step takes a slice of elements and
returns a single element. For floats this function signature matches
many of the statistical functions in [Gonum](https://github.com/gonum/gonum),
so these can be applied directly.

```go
intSum := func(xx []int) int {
	result := 0
	for _, x := range xx {
		result += x
	}
	return result
}

f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 2, 3, 3}, "COL2": []string{"a", "b", "c", "a", "b"}})
f = f.GroupBy(groupby.Columns("COL2")).Aggregate(qframe.Aggregation{Fn: intSum, Column: "COL1"})
fmt.Println(f.Sort(qframe.Order{Column: "COL2"}))
```

Output:
```
COL2(s) COL1(i)
------- -------
      a       4
      b       5
      c       2

Dims = 2 x 3
```

### Data manipulation
There are two different functions by which data can be manipulated,
`Apply` and `Eval`.
`Eval` is slightly more high level and takes a more data driven approach
but basically boils down to a series of `Apply` calls in the end.

Example using `Apply` to concatenate two columns as strings:
```go
f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
f = f.Apply(
	qframe.Instruction{Fn: function.StrI, DstCol: "COL1", SrcCol1: "COL1"},
	qframe.Instruction{Fn: function.ConcatS, DstCol: "COL3", SrcCol1: "COL1", SrcCol2: "COL2"})
fmt.Println(f.Select("COL3"))
```

Output:
```
COL3(s)
-------
     1a
     2b
     3c

Dims = 1 x 3
```

The same example using `Eval` instead:
```go
f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
f = f.Eval("COL3", qframe.Expr("+", qframe.Expr("str", types.ColumnName("COL1")), types.ColumnName("COL2")))
fmt.Println(f.Select("COL3"))
```

## More usage examples
Examples of the most common operations are available in the
[docs](https://godoc.org/github.com/tobgu/qframe).

## Error handling
All operations that may result in errors will set the `Err` field
on the returned QFrame to indicate that an error occurred.
The presence of an error on the QFrame will prevent any future operations
from being executed on the frame (i.e. it follows a monad-like pattern).
This allows for smooth chaining of multiple operations without having
to explicitly check errors between each operation.

## Configuration parameters
API functions that require configuration parameters make use of
[functional options](https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis)
to allow more options to be easily added in the future in a backwards
compatible way.

## Design goals
* Performance
  - Speed should be on par with, or better than, Python Pandas for corresponding operations.
  - No or very little memory overhead per data element.
  - Performance impact of operations should be straightforward to reason about.
* API
  - Should be reasonably small and low ceremony.
  - Should allow custom, user-provided functions to be used for data processing.
  - Should provide built-in functions for most common operations.

## High level design
A QFrame is a collection of columns which can be of type int, float,
string, bool or enum. For more information about the data types see the
[types docs](https://godoc.org/github.com/tobgu/qframe/types).

In addition to the columns there is also an index which controls
which rows in the columns are part of the QFrame and the
sort order of these rows.
Many operations on QFrames only affect the index; the underlying
data remains the same.

Many functions and methods in qframe take the empty interface as a parameter,
for example to accept either a function to be applied or a string
reference to an internal function.
These always correspond to a union/sum type with a fixed set of valid types
that are checked at runtime through type switches (there's hardly any
reflection applied in QFrame for performance reasons).
Which types are valid depends on the function called and the column type
that is affected. Modelling this statically is hard or impossible in Go,
hence the dynamic approach. If you plan to use QFrame with datasets
with fixed layout and types it should be a small task to write tiny
wrappers for the types you are using to regain static type safety.

## Limitations
* The API cannot yet be considered stable.
* The maximum number of rows in a QFrame is 4294967296 (2^32).
* The CSV parser only handles ASCII characters as separators.
* Individual strings cannot be longer than 268 MB (2^28 bytes).
* A string column cannot contain more than a total of 34 GB (2^35 bytes).
* At the moment you cannot rely on any of the errors returned to
  fulfill anything other than the `error` interface. In the future
  this will hopefully be improved to provide more help in identifying
  the root cause of errors.

## Performance/benchmarks
There are a number of benchmarks in [qbench](https://github.com/tobgu/qbench)
comparing QFrame to Pandas and Gota where applicable.

## Other data frames
The work on QFrame has been inspired by [Python Pandas](https://pandas.pydata.org/)
and [Gota](https://github.com/kniren/gota).

## Contribute
Want to contribute? Great! Open an issue on GitHub and let the discussions
begin! Below are some instructions for working with the QFrame repo.

### Ideas for further work
Below are some ideas of areas where contributions would be welcome.

* Support for more input and output formats.
* Support for additional column formats.
* Support for using the [Arrow](https://github.com/apache/arrow) format for columns.
* General CPU and memory optimizations.
* Improve documentation.
* More analytical functionality.
* Dataset joins.
* Improved interoperability with other libraries in the Go data science ecosystem.
* Improve string representation of QFrames.

### Install dependencies
`make dev-deps`

### Tests
Please contribute tests together with any code. The tests should be
written against the public API to avoid lockdown of the implementation
and internal structure which would make it more difficult to change in
the future.

Run tests:
`make test`

This will also trigger code to be regenerated.

### Code generation
The codebase contains some generated code to reduce the amount of
duplication required for similar functionality across different column
types. Generated code is recognized by file names ending with `_gen.go`.
These files must never be edited directly.

To trigger code generation:
`make generate`