# ETL

The ETL framework is most commonly used in [staged sync](https://github.com/ledgerwatch/erigon/blob/devel/eth/stagedsync/README.md).

It implements a pattern where we extract some data from a database, transform it,
then put it into temp files and insert it back into the database in sorted order.

Inserting entries into our KV storage sorted by keys helps to minimize write
amplification, hence it is much faster, even considering the additional I/O
generated by storing the temp files.

It behaves similarly to enterprise [Extract, Transform, Load](https://en.wikipedia.org/wiki/Extract,_transform,_load) frameworks, hence the name.
We use temporary files because they help keep RAM usage predictable and allow
using ETL on large amounts of data.

### Example

```
func keyTransformExtractFunc(transformKey func([]byte) ([]byte, error)) etl.ExtractFunc {
	return func(k, v []byte, next etl.ExtractNextFunc) error {
		newK, err := transformKey(k)
		if err != nil {
			return err
		}
		return next(k, newK, v) // pass the original key, the transformed key, and the value
	}
}

err := etl.Transform(
	db,                         // database
	dbutils.PlainStateBucket,   // "from" bucket
	dbutils.CurrentStateBucket, // "to" bucket
	datadir,                    // where to store temp files
	keyTransformExtractFunc(transformPlainStateKey), // transformFunc on extraction
	etl.IdentityLoadFunc,       // transform on load
	etl.TransformArgs{          // additional arguments
		Quit: quit,
	},
)
if err != nil {
	return err
}
```

## Data Transformation

The whole flow is shown in the image below.

![](./ETL.png)

Data can be transformed in two places along the pipeline:

* transform on extraction

* transform on loading

### Transform On Extraction

`type ExtractFunc func(k []byte, v []byte, next ExtractNextFunc) error`

The transform-on-extraction function receives the current key and value from the
source bucket.

### Transform On Loading

`type LoadFunc func(k []byte, value []byte, state State, next LoadNextFunc) error`

As well as the current key and value, the transform-on-loading function
receives a `State` object that can read data from the destination bucket.

That is used in index generation, where we want to extend existing index entries
with new data instead of just adding new ones.

### `<...>NextFunc` pattern

Sometimes we need to produce multiple entries from a single entry when
transforming.

To do that, each of the transform functions receives a next function that should
be called to pass data further down the pipeline. That means that each
transformation can produce any number of outputs for a single input.

It can be one output, like in `IdentityLoadFunc`:

```
func IdentityLoadFunc(k []byte, value []byte, _ State, next LoadNextFunc) error {
	return next(k, k, value) // go to the next step
}
```

It can be multiple outputs, like when each entry is a `ChangeSet`:

```
func(dbKey, dbValue []byte, next etl.ExtractNextFunc) error {
	blockNum, _ := dbutils.DecodeTimestamp(dbKey)
	return bytes2walker(dbValue).Walk(func(changesetKey, changesetValue []byte) error {
		key := common.CopyBytes(changesetKey)
		v := make([]byte, 9)
		binary.BigEndian.PutUint64(v, blockNum) // the first 8 bytes encode the block number
		if len(changesetValue) == 0 {
			v[8] = 1 // the 9th byte flags a deleted value
		}
		return next(dbKey, key, v) // go to the next step
	})
}
```

### Buffer Types

Before the data is flushed into temp files, it is collected into a buffer
until the buffer overflows (`etl.TransformArgs.BufferSize`).

There are different types of buffers available, with different behaviour.

* `SortableSliceBuffer` -- just appends `(k, v1)`, `(k, v2)` onto a slice. Duplicate keys
will lead to duplicate entries: `[(k, v1) (k, v2)]`.

* `SortableAppendBuffer` -- on duplicate keys: merge. `(k, v1)`, `(k, v2)`
will lead to `k: [v1 v2]`.

* `SortableOldestAppearedBuffer` -- on duplicate keys: keep the oldest. `(k, v1)`,
`(k, v2)` will lead to `k: v1`.
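As a minimal sketch, the buffer type can be selected via `etl.TransformArgs`; here we assume the `BufferType` field and the `etl.SortableAppendBuffer` constant of this package, while `db`, `datadir`, `quit`, and `extractFunc` are placeholders for your own values:

```
// A sketch (not verbatim from the package): select SortableAppendBuffer so
// that values extracted under the same key are merged into a single entry.
err := etl.Transform(
	db,                         // database
	dbutils.PlainStateBucket,   // "from" bucket
	dbutils.CurrentStateBucket, // "to" bucket
	datadir,                    // where to store temp files
	extractFunc,                // your etl.ExtractFunc
	etl.IdentityLoadFunc,       // no transform on load
	etl.TransformArgs{
		BufferType: etl.SortableAppendBuffer, // merge values on duplicate keys
		Quit:       quit,
	},
)
if err != nil {
	return err
}
```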
### Transforming Structs

Both transform functions and next functions accept only byte slices.
If you need to pass a struct, you will need to marshal it first.

### Loading Into Database

We load data from the temp files into the database in batches, limited by the
`IdealBatchSize()` of an `ethdb.Mutation`.

(For tests, this batch size can also be overridden.)

### Handling Interruptions

ETL processes are long, so we need to be able to handle interruptions.

#### Handling `Ctrl+C`

You can pass your quit channel as the `Quit` parameter of `etl.TransformArgs`.

When this channel is closed, ETL will be interrupted.

#### Saving & Restoring State

Interrupting in the middle of loading can lead to an inconsistent state in the
database.

To avoid that, the ETL framework allows storing progress by setting `OnLoadCommit` in `etl.TransformArgs`.

Then we can use this data to know how much progress the ETL transformation has made.

You can also specify `ExtractStartKey` and `ExtractEndKey` to limit the range
of keys transformed.

## Ways to work with the ETL framework

There are two scenarios for working with the ETL framework.

![](./ETL-collector.png)

### `etl.Transform` function

In the vast majority of use cases, we extract data from one bucket and, in the
end, load it into another bucket. That is the use case for the `etl.Transform`
function.

### `etl.Collector` struct

If you want more modular behaviour instead of just reading from the DB (like
generating intermediate hashes in `../../core/chain_makers.go`), you can use
the `etl.Collector` struct directly.

It has a `.Collect()` method that you can feed your data to; see the sketch at
the end of this README.

## Optimizations

* If all the data fits into a single buffer (i.e. it never overflows), we don't
write anything to disk and just use in-memory storage.
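### Example: using `etl.Collector` directly

Here is a minimal, hedged sketch of driving the collector by hand, written in
the same API era as the `etl.Transform` example above (the exact
`NewCollector`/`Load` signatures have changed between versions); `db`,
`datadir`, `quit`, and the destination bucket are placeholders:

```
// A sketch (signatures vary between versions): collect entries one by one,
// then load them into the destination bucket in sorted order.
collector := etl.NewCollector(datadir, etl.NewSortableBuffer(etl.BufferOptimalSize))

for blockNum := uint64(0); blockNum < 10; blockNum++ {
	k := make([]byte, 8)
	binary.BigEndian.PutUint64(k, blockNum)
	if err := collector.Collect(k, []byte("some value")); err != nil {
		return err
	}
}

// Entries arrive in the load function sorted by key.
err := collector.Load(db, dbutils.CurrentStateBucket, etl.IdentityLoadFunc, etl.TransformArgs{Quit: quit})
if err != nil {
	return err
}
```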