github.com/tobgu/qframe@v0.4.0/README.md (about)

[![CI Status](https://github.com/tobgu/qframe/actions/workflows/ci.yaml/badge.svg)](https://github.com/tobgu/qframe/actions/workflows/ci.yaml)
[![Go Coverage](https://github.com/tobgu/qframe/wiki/coverage.svg)](https://raw.githack.com/wiki/tobgu/qframe/coverage.html)
[![Go Report Card](https://goreportcard.com/badge/github.com/tobgu/qframe)](https://goreportcard.com/report/github.com/tobgu/qframe)
[![GoDoc](https://godoc.org/github.com/tobgu/qframe?status.svg)](https://godoc.org/github.com/tobgu/qframe)

QFrame is an immutable data frame that supports filtering, aggregation
and data manipulation. Any operation on a QFrame results in
a new QFrame; the original QFrame remains unchanged. This can be done
fairly efficiently since much of the underlying data will be shared
between the two frames.

The design of QFrame has mainly been driven by the requirements from
[qocache](https://github.com/tobgu/qocache) but it is in many aspects
a general purpose data frame. Suggestions for added or improved
functionality to support a wider scope are always of interest as long
as they don't conflict with the requirements from qocache!
See [Contribute](#contribute).

## Installation
`go get github.com/tobgu/qframe`

## Usage
Below are some examples of common use cases. The list is not exhaustive
in any way. For a complete description of all operations, including more
examples, see the [docs](https://godoc.org/github.com/tobgu/qframe).

### IO
QFrames can currently be read from and written to CSV, record
oriented JSON, and any SQL database that has a Go `database/sql`
driver.

#### CSV Data

Read CSV data:
```go
input := `COL1,COL2
a,1.5
b,2.25
c,3.0`

f := qframe.ReadCSV(strings.NewReader(input))
fmt.Println(f)
```
Output:
```
COL1(s) COL2(f)
------- -------
      a     1.5
      b    2.25
      c       3

Dims = 2 x 3
```

#### SQL Data

QFrame supports reading data from and writing data to SQL databases through
the standard library `database/sql` package and its drivers. It has been tested
with [SQLite](https://github.com/mattn/go-sqlite3), [Postgres](https://github.com/lib/pq), and [MariaDB](https://github.com/go-sql-driver/mysql).

##### SQLite Example

Write data to an in-memory SQLite database and read it back. Note
that this example requires you to have [go-sqlite3](https://github.com/mattn/go-sqlite3) installed
prior to running.

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3"
	"github.com/tobgu/qframe"
	qsql "github.com/tobgu/qframe/config/sql"
)

func main() {
	// Create a new in-memory SQLite database.
	db, _ := sql.Open("sqlite3", ":memory:")
	// Add a new table.
	db.Exec(`
	CREATE TABLE test (
		COL1 INT,
		COL2 REAL,
		COL3 TEXT,
		COL4 BOOL
	);`)
	// Create a new QFrame to populate our table with.
	qf := qframe.New(map[string]interface{}{
		"COL1": []int{1, 2, 3},
		"COL2": []float64{1.1, 2.2, 3.3},
		"COL3": []string{"one", "two", "three"},
		"COL4": []bool{true, true, true},
	})
	fmt.Println(qf)
	// Start a new SQL Transaction.
	tx, _ := db.Begin()
	// Write the QFrame to the database.
	qf.ToSQL(tx,
		// Write only to the test table
		qsql.Table("test"),
		// Explicitly set SQLite compatibility.
		qsql.SQLite(),
	)
	// Create a new QFrame from SQL.
	newQf := qframe.ReadSQL(tx,
		// A query must return at least one column. In this
		// case it will return all of the columns we created above.
		qsql.Query("SELECT * FROM test"),
		// SQLite stores boolean values as integers, so we
		// can coerce them back to bools with the CoercePair option.
		qsql.Coerce(qsql.CoercePair{Column: "COL4", Type: qsql.Int64ToBool}),
		qsql.SQLite(),
	)
	fmt.Println(newQf)
	fmt.Println(newQf.Equals(qf))
}
```

Output:

```
COL1(i) COL2(f) COL3(s) COL4(b)
------- ------- ------- -------
      1     1.1     one    true
      2     2.2     two    true
      3     3.3   three    true

Dims = 4 x 3
true
```

### Filtering
Filtering can be done either by applying individual filters
to the QFrame or by combining filters using AND and OR.

Filter with OR-clause:
```go
f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
newF := f.Filter(qframe.Or(
    qframe.Filter{Column: "COL1", Comparator: ">", Arg: 2},
    qframe.Filter{Column: "COL2", Comparator: "=", Arg: "a"}))
fmt.Println(newF)
```

Output:
```
COL1(i) COL2(s)
------- -------
      1       a
      3       c

Dims = 2 x 2
```
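
To illustrate how OR and AND clauses compose, here is a conceptual sketch in plain Go. It does not use QFrame's API and is not how QFrame is implemented; the names `row`, `pred`, `or`, `and` and `filterRows` are hypothetical and exist only for this example.

```go
package main

import "fmt"

// row is a hypothetical record mirroring the COL1/COL2 example above.
type row struct {
	col1 int
	col2 string
}

// pred reports whether a row should be kept.
type pred func(row) bool

// or keeps a row if at least one predicate matches.
func or(ps ...pred) pred {
	return func(r row) bool {
		for _, p := range ps {
			if p(r) {
				return true
			}
		}
		return false
	}
}

// and keeps a row only if every predicate matches.
func and(ps ...pred) pred {
	return func(r row) bool {
		for _, p := range ps {
			if !p(r) {
				return false
			}
		}
		return true
	}
}

// filterRows returns a new slice; the input slice is left untouched.
func filterRows(rows []row, p pred) []row {
	var out []row
	for _, r := range rows {
		if p(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rows := []row{{1, "a"}, {2, "b"}, {3, "c"}}
	gt2 := func(r row) bool { return r.col1 > 2 }
	isA := func(r row) bool { return r.col2 == "a" }

	fmt.Println(filterRows(rows, or(gt2, isA)))  // [{1 a} {3 c}]
	fmt.Println(filterRows(rows, and(gt2, isA))) // []
}
```

The OR case keeps the same two rows as the QFrame example above, while the AND case keeps none because no row satisfies both conditions.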

### Grouping and aggregation
Grouping and aggregation are done in two distinct steps. The function
used in the aggregation step takes a slice of elements and
returns an element. For floats this function signature matches
many of the statistical functions in [Gonum](https://github.com/gonum/gonum),
so these can be applied directly.

```go
intSum := func(xx []int) int {
    result := 0
    for _, x := range xx {
        result += x
    }
    return result
}

f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 2, 3, 3}, "COL2": []string{"a", "b", "c", "a", "b"}})
f = f.GroupBy(groupby.Columns("COL2")).Aggregate(qframe.Aggregation{Fn: intSum, Column: "COL1"})
fmt.Println(f.Sort(qframe.Order{Column: "COL2"}))
```

Output:
```
COL2(s) COL1(i)
------- -------
      a       4
      b       5
      c       2

Dims = 2 x 3
```
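
The two-step idea (group first, then reduce each group with a slice-to-element function) can be sketched in plain Go without QFrame. This is only an illustration under that assumption; `groupBy`, `aggregate` and `mean` are hypothetical helpers, not part of the QFrame API.

```go
package main

import (
	"fmt"
	"sort"
)

// mean has the shape func([]float64) float64: it takes a slice of
// elements and returns a single element.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// groupBy is step one: collect the values that belong to each key.
func groupBy(keys []string, values []float64) map[string][]float64 {
	groups := make(map[string][]float64)
	for i, k := range keys {
		groups[k] = append(groups[k], values[i])
	}
	return groups
}

// aggregate is step two: reduce every group to a single value with fn.
func aggregate(groups map[string][]float64, fn func([]float64) float64) map[string]float64 {
	out := make(map[string]float64, len(groups))
	for k, vs := range groups {
		out[k] = fn(vs)
	}
	return out
}

func main() {
	keys := []string{"a", "b", "c", "a", "b"}
	values := []float64{1, 2, 2, 3, 3}
	result := aggregate(groupBy(keys, values), mean)

	// Map iteration order is random, so sort the keys before printing.
	names := make([]string, 0, len(result))
	for k := range result {
		names = append(names, k)
	}
	sort.Strings(names)
	for _, k := range names {
		fmt.Println(k, result[k])
	}
}
```

Any other function with the same shape can be passed to `aggregate` in place of `mean`.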

### Data manipulation
There are two different functions by which data can be manipulated,
`Apply` and `Eval`.
`Eval` is slightly higher level and takes a more data-driven approach,
but it ultimately boils down to a series of `Apply` calls.

Example using `Apply` to concatenate two columns as strings:
```go
f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
f = f.Apply(
    qframe.Instruction{Fn: function.StrI, DstCol: "COL1", SrcCol1: "COL1"},
    qframe.Instruction{Fn: function.ConcatS, DstCol: "COL3", SrcCol1: "COL1", SrcCol2: "COL2"})
fmt.Println(f.Select("COL3"))
```

Output:
```
COL3(s)
-------
     1a
     2b
     3c

Dims = 1 x 3
```

The same example using `Eval` instead:
```go
f := qframe.New(map[string]interface{}{"COL1": []int{1, 2, 3}, "COL2": []string{"a", "b", "c"}})
f = f.Eval("COL3", qframe.Expr("+", qframe.Expr("str", types.ColumnName("COL1")), types.ColumnName("COL2")))
fmt.Println(f.Select("COL3"))
```

## More usage examples
Examples of the most common operations are available in the
[docs](https://godoc.org/github.com/tobgu/qframe).

## Error handling
All operations that may result in errors will set the `Err` field
on the returned QFrame to indicate that an error occurred.
The presence of an error on the QFrame will prevent any future operations
from being executed on the frame (i.e. it follows a monad-like pattern).
This allows for smooth chaining of multiple operations without having
to explicitly check errors between each operation.
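
The pattern can be sketched in plain Go as follows. This is a conceptual illustration only, not QFrame's implementation; the `frame` type and its `divideBy` and `add` methods are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// frame is a hypothetical stand-in for a data frame that carries an error.
type frame struct {
	values []int
	Err    error
}

// divideBy returns a new frame. If the receiver already holds an error,
// the operation is skipped and the error is passed along unchanged.
func (f frame) divideBy(d int) frame {
	if f.Err != nil {
		return f
	}
	if d == 0 {
		return frame{Err: errors.New("divideBy: division by zero")}
	}
	out := make([]int, len(f.values))
	for i, v := range f.values {
		out[i] = v / d
	}
	return frame{values: out}
}

// add returns a new frame with n added to every value, or passes on
// an error that occurred earlier in the chain.
func (f frame) add(n int) frame {
	if f.Err != nil {
		return f
	}
	out := make([]int, len(f.values))
	for i, v := range f.values {
		out[i] = v + n
	}
	return frame{values: out}
}

func main() {
	f := frame{values: []int{10, 20, 30}}

	ok := f.divideBy(10).add(1)
	fmt.Println(ok.values, ok.Err) // [2 3 4] <nil>

	// The failure in the first step short-circuits the remaining steps,
	// so a single check at the end of the chain is enough.
	bad := f.divideBy(0).add(1).add(2)
	fmt.Println(bad.values, bad.Err) // [] divideBy: division by zero
}
```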

## Configuration parameters
API functions that require configuration parameters make use of
[functional options](https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis)
to allow more options to be easily added in the future in a backwards
compatible way.
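
For readers unfamiliar with the pattern, here is a minimal, generic sketch of functional options in Go. The `readConfig` type and the `Delimiter` and `SkipHeader` options are hypothetical and are not QFrame's actual option names.

```go
package main

import "fmt"

// readConfig holds settings for a hypothetical reader.
type readConfig struct {
	delimiter  byte
	skipHeader bool
}

// Option mutates the configuration. New options can be added later
// without changing the signature of newReadConfig or breaking callers.
type Option func(*readConfig)

// Delimiter overrides the default field delimiter.
func Delimiter(d byte) Option {
	return func(c *readConfig) { c.delimiter = d }
}

// SkipHeader controls whether the first line is skipped.
func SkipHeader(skip bool) Option {
	return func(c *readConfig) { c.skipHeader = skip }
}

// newReadConfig starts from defaults and applies the options in order.
func newReadConfig(opts ...Option) readConfig {
	c := readConfig{delimiter: ','}
	for _, o := range opts {
		o(&c)
	}
	return c
}

func main() {
	def := newReadConfig()
	custom := newReadConfig(Delimiter(';'), SkipHeader(true))
	fmt.Printf("delimiter=%c skipHeader=%t\n", def.delimiter, def.skipHeader)
	fmt.Printf("delimiter=%c skipHeader=%t\n", custom.delimiter, custom.skipHeader)
}
```

Callers that pass no options get the defaults, which is what makes later additions backwards compatible.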

## Design goals
* Performance
  - Speed should be on par with, or better than, Python Pandas for corresponding operations.
  - No or very little memory overhead per data element.
  - Performance impact of operations should be straightforward to reason about.
* API
  - Should be reasonably small and low ceremony.
  - Should allow custom, user-provided functions to be used for data processing.
  - Should provide built-in functions for the most common operations.

## High level design
A QFrame is a collection of columns which can be of type int, float,
string, bool or enum. For more information about the data types see the
[types docs](https://godoc.org/github.com/tobgu/qframe/types).

In addition to the columns there is also an index which controls
which rows in the columns are part of the QFrame and the
sort order of those rows.
Many operations on QFrames only affect the index; the underlying
data remains the same.
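
The following plain Go sketch illustrates the idea of an index over shared column data. It is a simplification for illustration and not QFrame's actual implementation; the `frame` type, its methods, and the choice of `uint32` for index entries are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// frame is a hypothetical single-column frame: data is shared and never
// modified, index lists which rows are visible and in what order.
type frame struct {
	data  []int
	index []uint32
}

// newFrame creates a frame whose index exposes every row in order.
func newFrame(data []int) frame {
	idx := make([]uint32, len(data))
	for i := range idx {
		idx[i] = uint32(i)
	}
	return frame{data: data, index: idx}
}

// filter builds a new index; data is shared with the receiver.
func (f frame) filter(keep func(int) bool) frame {
	var idx []uint32
	for _, i := range f.index {
		if keep(f.data[i]) {
			idx = append(idx, i)
		}
	}
	return frame{data: f.data, index: idx}
}

// sortDesc copies the index and reorders the copy; data is untouched.
func (f frame) sortDesc() frame {
	idx := append([]uint32(nil), f.index...)
	sort.Slice(idx, func(a, b int) bool { return f.data[idx[a]] > f.data[idx[b]] })
	return frame{data: f.data, index: idx}
}

// values materializes the visible rows in index order.
func (f frame) values() []int {
	out := make([]int, 0, len(f.index))
	for _, i := range f.index {
		out = append(out, f.data[i])
	}
	return out
}

func main() {
	f := newFrame([]int{5, 1, 4, 2, 3})
	g := f.filter(func(v int) bool { return v > 2 }).sortDesc()

	fmt.Println(f.values())               // [5 1 4 2 3]
	fmt.Println(g.values())               // [5 4 3]
	fmt.Println(&f.data[0] == &g.data[0]) // true: both frames share one backing array
}
```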

Many functions and methods in qframe take the empty interface as a parameter,
for example for functions to be applied or for string references to internal
functions.
These parameters always correspond to a union/sum type with a fixed set of
valid types that are checked at runtime through type switches (hardly any
reflection is used in QFrame, for performance reasons).
Which types are valid depends on the function called and the column type
that is affected. Modelling this statically is hard or impossible in Go,
hence the dynamic approach. If you plan to use QFrame with datasets
that have a fixed layout and fixed types, it should be a small task to write
tiny wrappers for the types you are using to regain static type safety.
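
As a rough illustration of this approach, the sketch below shows an `interface{}` parameter validated with a type switch, plus a tiny typed wrapper that restores compile-time checking. It is written in plain Go and does not reflect QFrame's internals; `column`, `aggregate` and `aggregateInts` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// column is a hypothetical column holding either ints or floats.
type column struct {
	ints   []int
	floats []float64
	isInt  bool
}

// aggregate accepts fn as interface{} and validates it at runtime with
// a type switch. Which function types are valid depends on the column type.
func aggregate(c column, fn interface{}) (interface{}, error) {
	switch f := fn.(type) {
	case func([]int) int:
		if !c.isInt {
			return nil, errors.New("aggregate: int function on float column")
		}
		return f(c.ints), nil
	case func([]float64) float64:
		if c.isInt {
			return nil, errors.New("aggregate: float function on int column")
		}
		return f(c.floats), nil
	default:
		return nil, fmt.Errorf("aggregate: unsupported function type %T", fn)
	}
}

// aggregateInts is a tiny typed wrapper: the compiler now rejects
// anything that is not a func([]int) int.
func aggregateInts(c column, fn func([]int) int) (int, error) {
	v, err := aggregate(c, fn)
	if err != nil {
		return 0, err
	}
	return v.(int), nil
}

func intSum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	c := column{ints: []int{1, 2, 3}, isInt: true}

	total, err := aggregateInts(c, intSum)
	fmt.Println(total, err) // 6 <nil>

	// A value outside the fixed set of valid types is only caught at runtime.
	_, err = aggregate(c, "sum")
	fmt.Println(err) // aggregate: unsupported function type string
}
```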

## Limitations
* The API can still not be considered stable.
* The maximum number of rows in a QFrame is 4294967296 (2^32).
* The CSV parser only handles ASCII characters as separators.
* Individual strings cannot be longer than 268 MB (2^28 bytes).
* A string column cannot contain more than a total of 34 GB (2^35 bytes).
* At the moment you cannot rely on any of the errors returned to
  fulfill anything other than the `error` interface. In the future
  this will hopefully be improved to provide more help in identifying
  the root cause of errors.
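
The numeric limits above are plain powers of two. The short Go snippet below just works through the arithmetic; the constant names are made up for this example and do not exist in QFrame.

```go
package main

import "fmt"

// Hypothetical names for the limits listed above.
const (
	maxRows        = uint64(1) << 32 // rows per QFrame
	maxStringBytes = uint64(1) << 28 // bytes per individual string
	maxColumnBytes = uint64(1) << 35 // total bytes per string column
)

func main() {
	fmt.Println(maxRows)                                 // 4294967296
	fmt.Printf("%.1f MB\n", float64(maxStringBytes)/1e6) // 268.4 MB
	fmt.Printf("%.1f GB\n", float64(maxColumnBytes)/1e9) // 34.4 GB
}
```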

## Performance/benchmarks
There are a number of benchmarks in [qbench](https://github.com/tobgu/qbench)
comparing QFrame to Pandas and Gota where applicable.

## Other data frames
The work on QFrame has been inspired by [Python Pandas](https://pandas.pydata.org/)
and [Gota](https://github.com/kniren/gota).

## Contribute
Want to contribute? Great! Open an issue on GitHub and let the discussions
begin! Below are some instructions for working with the QFrame repo.

### Ideas for further work
Below are some ideas of areas where contributions would be welcome.

* Support for more input and output formats.
* Support for additional column formats.
* Support for using the [Arrow](https://github.com/apache/arrow) format for columns.
* General CPU and memory optimizations.
* Improve documentation.
* More analytical functionality.
* Dataset joins.
* Improved interoperability with other libraries in the Go data science ecosystem.
* Improve string representation of QFrames.

### Install dependencies
`make dev-deps`

### Tests
Please contribute tests together with any code. The tests should be
written against the public API to avoid lockdown of the implementation
and internal structure which would make it more difficult to change in
the future.

Run tests:
`make test`

This will also trigger code to be regenerated.

### Code generation
The codebase contains some generated code to reduce the amount of
duplication required for similar functionality across different column
types. Generated code is recognized by file names ending with `_gen.go`.
These files must never be edited directly.

To trigger code generation:
`make generate`