- Feature Name: Transactional Analytic Engine
- Status: In Progress
- Start Date: 2022-02-21
- Authors: [Xu Peng](https://github.com/XuPeng-SH)
- Implementation PR:
- Issue for this RFC: [2360](https://github.com/matrixorigin/matrixone/issues/2360)

# Summary

**TAE** (Transactional Analytic Engine) is designed for hybrid transactional and analytical query workloads, and can be used as the underlying storage engine of a database management system (DBMS) for hybrid transactional/analytical processing (HTAP).

# Guide-level design

## Terms
### Layout
- **Block**: Piece of a segment, and the minimum unit of table data. The maximum number of rows in a block is fixed
- **Segment**: Piece of a table, composed of blocks
- **Table**: Piece of a database, composed of segments
- **Database**: A collection of tables that share the same log space

**TODO**

### State
- **Transient Block**: Block whose row count has not reached the upper limit, or a block queued to be sorted and flushed
- **Sorted Block**: Block sorted by primary key
- **Unsorted Segment**: Segment that has not been merge-sorted
- **Sorted Segment**: Segment that has been merge-sorted

**TODO**

### Container
- **Vector**: Data fragment of a column in memory
- **Batch**: A combination of vectors in which the number of rows of each vector is aligned

**TODO**

## Data storage
### Table
**TAE** stores data represented as tables. Each table is bound to a schema consisting of a number of column definitions. Table data is organized as a log-structured merge-tree (LSM tree).

Currently, **TAE** is a three-level LSM tree, with levels called L0, L1 and L2. L0 is small and can be entirely resident in memory, whereas L1 and L2 are both definitely resident on disk. In **TAE**, L0 consists of transient blocks and L1 consists of sorted blocks. Incoming new data is always inserted into the latest transient block. If the insertion causes the block to exceed the maximum row count of a block, the block is sorted by primary key and flushed into L1 as a sorted block. If the number of sorted blocks exceeds the maximum number of blocks in a segment, the segment is sorted by primary key using merge sort.

L1 and L2 are organized into sorted runs of data. Each run contains data sorted by the primary key and can be represented on disk as a single file. There will be overlapping primary key ranges between sorted runs. The difference between L1 and L2 is that a run in L1 is a **block** while a run in L2 is a **segment**.

A segment can be compacted into a new segment if it has accumulated many updates (deletions). Segments can also be merged into a new segment. The scheduling behind this follows customizable strategies, mainly trading off write amplification against read amplification.

As described above, transient blocks can be entirely resident in memory, but not necessarily so. Because there will be many tables, and each table has transient blocks, keeping them all resident in memory would be very wasteful. In **TAE**, transient blocks from all tables share a dedicated fixed-size LRU cache. An evicted transient block is unloaded from memory and flushed as a transient block file. In practice, transient blocks are constantly flowing to L1 and the number of transient blocks per table at any given time is very small, so the active transient blocks will likely reside in memory even with a small-sized cache.
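
To make the write path above concrete, here is a minimal Go sketch of the L0 append flow. It is illustrative only and uses hypothetical names (`row`, `transientBlock`, `flushToL1`); it is not the TAE implementation.

```go
package example

import "sort"

// row is a toy record keyed by an integer primary key (illustrative only).
type row struct {
	PK  int
	Val []byte
}

// transientBlock models an L0 block with a fixed maximum row count.
type transientBlock struct {
	maxRows int
	rows    []row
}

// append inserts rows into the latest transient block. Once the block reaches
// its row limit, it is sorted by primary key and handed to L1 as a sorted block.
// Splitting an oversized batch across several blocks is omitted for brevity.
func (b *transientBlock) append(rs []row, flushToL1 func(sorted []row)) {
	b.rows = append(b.rows, rs...)
	if len(b.rows) < b.maxRows {
		return
	}
	sort.Slice(b.rows, func(i, j int) bool { return b.rows[i].PK < b.rows[j].PK })
	flushToL1(b.rows) // becomes a sorted block in L1
	b.rows = nil      // start a new transient block
}
```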

### Indexes
There is no table-level index in **TAE**; only segment-level and block-level indexes are available.

In **TAE**, there is a dedicated fixed-size LRU cache for all indexes. Compared with the original data, indexes occupy limited space, but they accelerate queries significantly and are called very frequently. A dedicated cache avoids a memory copy on each call.

#### Primary key index
**TAE** creates an index for each table's primary key by default. Its main functions are deduplication when inserting data and filtering by primary key. Deduplication is the critical path for data insertion. We need to make trade-offs in the following three aspects:
- Query performance
- Memory usage
- Match with the underlying data store layout

From the granularity of the index, we divide indexes into two categories: table-level indexes, and index sets composed of a series of partition indexes. For example, we can have a table-level B-tree index, or each segment can have its own B-tree index. The table data of **TAE** consists of multiple segments, and each segment starts unordered and only later becomes ordered. Compaction, merging, or splitting may take place afterwards. This scenario is very unfriendly to a table-level index, so the index of **TAE** is a segment-level index set.

There are two types of segments: appendable and non-appendable. For a non-appendable segment, the segment-level index is a two-level structure: a bloom filter and a zonemap. There are two options for the bloom filter: a segment-based bloom filter and a block-based bloom filter. The segment-based one is a better choice when the index can be fully resident in memory. An appendable segment consists of at least one appendable block plus multiple non-appendable blocks. The index of an appendable block is a memory-resident ART-tree plus a zonemap, while that of a non-appendable block is a bloom filter plus a zonemap.

<img src="https://user-images.githubusercontent.com/39627130/155945090-945442eb-25f6-4624-8c5a-442475328e49.png" height="70%" width="70%" />

The query flow chart of the segment index is as follows:

<img src="https://user-images.githubusercontent.com/39627130/155949839-3e771818-8fb0-41d5-8765-f81fdc650289.png" height="100%" width="100%" />

**TODO**

#### Secondary index
**TODO**

### Compression
**TAE** is a column-oriented data store, which is very friendly to data compression. It supports per-column compression codecs; currently only **LZ4** is used. You can easily obtain the meta information of compressed blocks. In **TAE**, the compression unit is a column of a block.

### Layout

#### Storage File Format
<img src="https://user-images.githubusercontent.com/39627130/158129098-dfe77d0b-c6dc-4323-a63e-b7b381f15949.png" height="100%" width="100%" />

##### Header
```
+-------------+--------------+-------------+----------------+------------------+------------------+
|  Magic (8B) | Version (2B) | Format (2B) | Reserved (32B) | MinPartSize (2B) | MaxPartSize (2B) |
+-------------+--------------+-------------+----------------+------------------+------------------+

Magic       = Engine identity (0x01346616). TAE only
Version     = File version
Format      = Layout format
Reserved    = 32 bytes reserved space
MinPartSize = Size of the smallest part unit. 1 stands for 1K bytes. If 4 is specified, the smallest part size is 4K bytes
MaxPartSize = Size of the biggest part. 1 stands for 1K bytes
```
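
As a concrete reading of the header layout, the sketch below encodes and decodes the 48-byte header with `encoding/binary`. The Go struct, field names and byte order are assumptions for illustration; the RFC does not prescribe them.

```go
package example

import (
	"bytes"
	"encoding/binary"
)

// fileHeader mirrors the header fields above: 8 + 2 + 2 + 32 + 2 + 2 = 48 bytes.
type fileHeader struct {
	Magic       uint64   // engine identity, e.g. 0x01346616
	Version     uint16   // file version
	Format      uint16   // layout format
	Reserved    [32]byte // reserved space
	MinPartSize uint16   // in units of 1K bytes
	MaxPartSize uint16   // in units of 1K bytes
}

// marshalHeader writes the fields in declaration order.
func marshalHeader(h fileHeader) ([]byte, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.BigEndian, h); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unmarshalHeader is the inverse of marshalHeader.
func unmarshalHeader(data []byte) (h fileHeader, err error) {
	err = binary.Read(bytes.NewReader(data), binary.BigEndian, &h)
	return
}
```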

##### Meta
```
Meta-1, Meta-2 = One is stale while the other is active

+----------+----------+-
| <Meta-1> | <Meta-2> |
+----------+----------+-
     |
     |
+---------------------------------------------------+
|                        Meta                       |
+---------------+-----------------+-----------------+
| Version (2B)  | PartOffset (4B) | Reserved (128B) |
+---------------+-----------------+-----------------+

Version    = Meta version
PartOffset = The first part position
Reserved   = 128 bytes reserved space
```

##### Part

```
+------------+---------------------+---- ... ----+
| Size (2B)  | NextPartOffset (4B) |   Payload   |
+------------+---------------------+---- ... ----+

Size           = Part size. 1 stands for 1K bytes
NextPartOffset = Next part pointer
Payload        = The size of the payload is ($Size * 1K - 2B - 4B)
```

#### Segment File

| Format | Value | Description |
| ------------ | -------- | ----- |
| SegmentFile | 0x10 | Segment file format |

```
+--------------------------------------------------------------------------------+
|                          Segment File Meta Part Payload                        |
+-----------+---------------+--------------+---------------+---------------------+
| Cols (2B) | BlockCnt (2B) | <ColumnMeta> | <IndexesMeta> | <BlockIndexesMeta>  |
+-----------+---------------+--------------+---------------+---------------------+
       |                           |               |                 |
       |                           |               |                 |
       |                           |               |   +-------------------------------------+
       |                           |               |   |           BlockIndexesMeta          |
       |                           |               |   +------------------+-------...--------+
       |                           |               |   |  <IndexesMeta>   |       ...        |
       |                           |               |   +------------------+-------...--------+
       |                           |               |
       |                           |    +-------------------------------------------------------+
       |                           |    |                      IndexesMeta                      |
       |                           |    +-------------------+------------+-------------+--...--+
       |                           |    | CompressAlgo (1B) | Count (2B) | <IndexMeta> |  ...  |
       |                           |    +-------------------+------------+-------------+--...--+
       |                           |                                           |
       |                           |                                           |
+----------------------------------+    +----------------+----------------+-----------------+------------+-----------+--------------+
|            ColumnMeta            |    |                                   IndexMeta                                               |
+-------------------+--------------+    +----------------+----------------+-----------------+------------+-----------+--------------+
| CompressAlgo (1B) | <BlocksMeta> |    | IndexType (1B) | ColumnIdx (2B) | PartOffset (4B) | Start (4B) | Size (4B) | RawSize (4B) |
+-------------------+--------------+    +----------------+----------------+-----------------+------------+-----------+--------------+
                           |
                           |
      +----------------------------------------------------+
      |                     BlocksMeta                     |
      +--------------+-------------+---...---+-------------+
      | <BlockMeta>  | <BlockMeta> |         | <BlockMeta> |
      +--------------+-------------+---...---+-------------+
                           |
                           |
      +----------------------------------------------------------+
      |                        BlockMeta                         |
      +------------------+------------+-----------+--------------+
      | PartOffset (4B)  | Start (4B) | Size (4B) | RawSize (4B) |
      +------------------+------------+-----------+--------------+
```

##### Mutation

Suppose the `MinPartSize` is `4`, which stands for `4K`. It starts with an empty segment (no parts).
Now we do the following in sequence; the part-size rule applied in each step is sketched after the list.

1. Append a block (Block0) of 3 columns, where the compressed sizes of the three columns are 3.4K, 7.1K, and 2.5K.
```
1) Calculate an optimal size for each column: 4K, 8K and 4K
2) Find unused candidate parts. Not found here.
3) Allocate new parts for these 3 columns
4) Flush these 3 columns to the specified parts
5) Flush meta
```

<img src="https://user-images.githubusercontent.com/39627130/158206129-09394ea4-4a24-4c7a-a167-bf61aa6113a7.png" height="100%" width="100%" />

2. Append a block (Block1) of 3 columns, where the compressed sizes of the three columns are 11.3K, 7.8K, and 3.5K.
```
1) Calculate an optimal size for each column: 12K, 8K and 4K
2) Find unused candidate parts. Not found here.
3) Allocate new parts for these 3 columns
4) Flush these 3 columns to the specified new parts
5) Flush meta
```
<img src="https://user-images.githubusercontent.com/39627130/158206785-e59a0212-e53c-475a-b34d-cdc0a9d781a5.png" height="100%" width="100%" />

3. There are some updates to the 3rd column of Block0, which is marked as `Block0-2`. Now we start to checkpoint the updates into the segment file. The compressed size of the updated `Block0-2` is 7.3K.
```
1) Calculate an optimal size: 8K
2) Find unused candidate parts. Not found here.
3) Allocate a new part
4) Flush to the specified new part
5) Flush meta
```

![image](https://user-images.githubusercontent.com/39627130/158207135-dfb052ab-8306-4282-9e55-d2c774b9ec29.png)

4. There are some updates to the first column of Block1, which is marked as `Block1-0`. Now we start to checkpoint the updates into the segment file. The compressed size of the updated `Block1-0` is 3.6K.
```
1) Calculate an optimal size: 4K
2) Find unused candidate parts. Unused Part(3, 1) found
3) Flush to the specified unused part
4) Flush meta
```

![image](https://user-images.githubusercontent.com/39627130/158209321-59404617-fa1b-4ac8-aecf-9f30ecffa200.png)

5. Append a block (Block2) of 3 columns, where the compressed sizes of the three columns are 2.8K, 3.1K, and 3.7K.
```
1) Calculate an optimal size for each column: 4K, 4K and 4K
2) Find unused candidate parts. Unused Part(4, 1), Part(5, 1), Part(6, 1) found
3) Flush these 3 columns to the specified unused parts
4) Flush meta
```

![image](https://user-images.githubusercontent.com/39627130/158207658-e674f419-c734-48c7-8a40-f4c2476e9ea4.png)
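
In every step above, the compressed size is rounded up to the next multiple of `MinPartSize` (3.4K to 4K, 7.1K to 8K, 11.3K to 12K), and an unused part of the target size is reused before a new part is allocated. A hedged Go sketch of that allocation rule, with a hypothetical `segment` bookkeeping type:

```go
package example

// optimalPartSize rounds a compressed size (in bytes) up to the next multiple
// of minPartSize (given in units of 1K), e.g. 3.4K -> 4K, 7.1K -> 8K, 11.3K -> 12K.
func optimalPartSize(compressedBytes, minPartSize int) int {
	unit := minPartSize * 1024
	return (compressedBytes + unit - 1) / unit * unit
}

// segment tracks unused parts by size so they can be reused before new parts
// are allocated (illustrative bookkeeping only).
type segment struct {
	unused map[int][]int // part size -> offsets of unused parts
	next   int           // offset for the next newly allocated part
}

// allocatePart prefers an unused part of exactly the requested size and only
// grows the file when none is available; the caller then flushes the meta.
func (s *segment) allocatePart(size int) (offset int) {
	if parts := s.unused[size]; len(parts) > 0 {
		offset = parts[len(parts)-1]
		s.unused[size] = parts[:len(parts)-1]
		return offset
	}
	offset = s.next
	s.next += size
	return offset
}
```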

## Buffer manager
The buffer manager is responsible for the allocation of buffer space. It handles all requests for data pages and temporary blocks of **TAE**.
1. Each page is bound to a buffer node with a unique node ID
2. A buffer node has two states:
   1) Loaded
   2) Unloaded
3. When a requestor **Pin**s a node:
   1) If the node is in the **Loaded** state, it increases the node reference count by 1 and wraps a node handle with the page address in memory
   2) If the node is in the **Unloaded** state, it first reads the page from disk|remote, increases the node reference count by 1 and wraps a node handle with the page address in memory. When there is no room left in the buffer, some victim node is unloaded to make room. The current replacement strategy is **LRU**
4. When a requestor **Unpin**s a node, it just calls **Close** on the node handle, which decreases the node reference count by 1. If the reference count drops to 0, the node becomes a candidate for eviction. A node with a reference count greater than 0 is never evicted (see the sketch below).

There are currently four buffer managers for different purposes in **TAE**:
1. Mutation buffer manager: A dedicated fixed-size buffer used by L0 transient blocks. Each block corresponds to a node in the buffer
2. SST buffer manager: A dedicated fixed-size buffer used by L1 and L2 blocks. Each column within a block corresponds to a node in the buffer
3. Index buffer manager: A dedicated fixed-size buffer used by indexes. Each block or segment index corresponds to a node in the buffer
4. Redo log buffer manager: A dedicated fixed-size buffer used by uncommitted transactions. Each transaction's local storage consists of at least one buffer node.
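
The **Pin**/**Unpin** protocol above is essentially reference counting plus lazy loading. The following Go sketch is illustrative only; `bufferNode` and its methods are not the real TAE API, and LRU eviction is left to the surrounding manager.

```go
package example

import "sync"

// bufferNode is a toy buffer node with the two states described above.
type bufferNode struct {
	mu       sync.Mutex
	loaded   bool
	refCount int
	page     []byte
}

// pin loads the page if necessary and increases the reference count.
// A real manager would evict LRU victims here when the buffer is full.
func (n *bufferNode) pin(load func() []byte) []byte {
	n.mu.Lock()
	defer n.mu.Unlock()
	if !n.loaded {
		n.page = load() // read from disk or remote
		n.loaded = true
	}
	n.refCount++
	return n.page
}

// unpin decreases the reference count; a node with refCount == 0 becomes an
// eviction candidate, and a pinned node (refCount > 0) is never evicted.
func (n *bufferNode) unpin() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.refCount--
}
```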

## LogStore
An embedded log-structured data store. It is used as the underlying driver of the **Catalog** and the **WAL**.

**TODO**

## WAL
Write-ahead logging (WAL) is the key to providing atomicity and durability. All modifications must be written to the log before being applied. In TAE, the REDO log does not need to record every write operation, but it must be recorded when the transaction is committed. Using the redo log buffer manager reduces I/O usage and avoids any I/O events for short transactions that may need to be rolled back due to various conflicts. It can also support long or large transactions.

### Log Entry

#### Entry Layout

<img src="https://user-images.githubusercontent.com/39627130/156019975-ea156716-79fa-4682-ab63-a9cbe0c07f33.png" height="40%" width="40%" />

#### Entry Header Layout
| Item | Size (Byte) | Scope | Description |
| ------------ | -------- | ----- | ------------------------------------------------------------ |
| `GroupId` | 4 | `All` | Specify the group id |
| `LSN` | 8 | `All` | Specify the log sequence number |
| `Length` | 4 | `All` | Specify the length of the entry |
| `Type` | 1 | `All` | Specify the entry type |

#### Entry Type

| Type | Datatype | Value | Description |
| ---------- | -------- | ----- | ---------------------------------------------------------- |
| `AC` | int8 | 0x10 | A committed transaction of complete write operations |
| `PC` | int8 | 0x11 | A committed transaction of partial write operations |
| `UC` | int8 | 0x12 | Partial write operations of an uncommitted transaction |
| `RB` | int8 | 0x13 | Rollback of a transaction |
| `CKP` | int8 | 0x40 | Checkpoint |

#### Transaction Log Entry
Most transactions have only one log entry; only long or large transactions may need to record multiple log entries. So the log of a transaction is either one or more `UC`-type log entries plus one `PC`-type log entry, or a single `AC`-type log entry. **TAE** assigns a dedicated group to log entries of type `UC`. Here is the transaction log of six committed transactions. <img src="https://latex.codecogs.com/svg.image?E_{2,3}" /> specifies the log entry in group 2 with LSN 3.

- <img src="https://latex.codecogs.com/svg.image?Txn_{1}=E_{2,1}\leftarrow&space;&space;E_{1,5}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{2}=E_{1,1}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{3}=E_{1,2}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{4}=E_{2,2}\leftarrow&space;&space;E_{1,4}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{5}=E_{1,3}\leftarrow&space;&space;E_{1,5}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{6}=E_{2,3}\leftarrow&space;&space;E_{2,4}\leftarrow&space;&space;E_{1,6}" />

<img src="https://user-images.githubusercontent.com/39627130/156754808-658a1f73-4635-4bd6-909f-5e930b6bcab6.png" height="100%" width="100%" />

A transaction log entry includes multiple nodes, and there are multiple types of nodes: delete info node, append info node, update info node, and so on. A node is an atomic command, which can be annotated as a sub-entry index of a committed entry. For example, there are 3 nodes in <img src="https://latex.codecogs.com/svg.image?Txn_{6}" />, which can be annotated as <img src="https://latex.codecogs.com/svg.image?\{E_{6-1},E_{6-2},E_{6-3}\}" title="\{E_{6-1},E_{6-2},E_{6-3}\}" />

##### Transaction Payload

<img src="https://user-images.githubusercontent.com/39627130/156021428-3282eb20-4e69-4ba0-a4ed-ba5409ac7b71.png" height="40%" width="40%" />

##### Meta Layout
| Item | Size (Byte) | Scope | Description |
| ------------ | -------- | -------- | ------------------------------------------------------------ |
| `Id` | 8 | `All` | Specify the transaction id |
| `InfoLen` | 2 | `All` | Specify the length of the transaction info |
| `Info` | - | `All` | Specify the transaction info |
| `PrevPtr` | 12 | `UC`,`PC`,`RB` | Specify the previous entry |
| `NodeCnt` | 4 | `All` | Specify the count of nodes |

##### Node Type
| Type | Datatype | Value | Description |
| ---------- | -------- | ----- | ------------------------------------------------------------ |
| `DML_A` | int8 | 0x1 | Append info |
| `DML_D` | int8 | 0x2 | Delete info |
| `DML_U` | int8 | 0x3 | Update info |
| `DDL_CT` | int8 | 0x4 | Create table info |
| `DDL_DT` | int8 | 0x5 | Drop table info |
| `DDL_CD` | int8 | 0x6 | Create database info |
| `DDL_DD` | int8 | 0x7 | Drop database info |
| `DDL_CI` | int8 | 0x8 | Create index info |
| `DDL_DI` | int8 | 0x9 | Drop index info |

##### Node Layout
| Item | Size (Byte) | Scope | Description |
| ------------ | -------- | -------- | ------------------------------------------------------------ |
| `Type` | 1 | `All` | Specify the type of node |
| `Len` | 4 | `All` | Specify the length of the node |
| `Buf` | - | `All` | Node payload |
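
Following the node layout table, a node can be framed as a 1-byte type, a 4-byte length and the payload. The sketch below is an assumed encoding for illustration (including the byte order), not the actual TAE codec.

```go
package example

import (
	"bytes"
	"encoding/binary"
)

// writeNode frames one transaction node as Type (1B) + Len (4B) + Buf,
// matching the node layout table above. Endianness is an assumption.
func writeNode(w *bytes.Buffer, nodeType uint8, buf []byte) error {
	if err := w.WriteByte(nodeType); err != nil {
		return err
	}
	if err := binary.Write(w, binary.BigEndian, uint32(len(buf))); err != nil {
		return err
	}
	_, err := w.Write(buf)
	return err
}
```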

#### Example
Here are 6 concurrent transactions, arranged chronologically from left to right. <img src="https://latex.codecogs.com/svg.image?W_{1-2}" title="W_{1-2}" /> specifies the second write operation of <img src="https://latex.codecogs.com/svg.image?Txn_{1}" title="Txn_{1}" />. <img src="https://latex.codecogs.com/svg.image?C_{1}" title="C_{1}" /> specifies the commit of <img src="https://latex.codecogs.com/svg.image?Txn_{1}" title="Txn_{1}" />. <img src="https://latex.codecogs.com/svg.image?A_{2}" title="A_{2}" /> specifies the rollback of <img src="https://latex.codecogs.com/svg.image?Txn_{2}" title="Txn_{2}" />.

<img src="https://user-images.githubusercontent.com/39627130/156098948-0c442ca1-6386-43a2-907e-22a55c012546.png" height="50%" width="50%" />

As mentioned in the chapter on transactions, all active transactions share a fixed-size memory space, which is managed by the buffer manager. When the remaining space is not enough, some transaction nodes will be unloaded. The first time a node is unloaded, it is saved to the redo log as a log entry, and when it is loaded again, the corresponding log entry is loaded from the redo log.

<img src="https://user-images.githubusercontent.com/39627130/156109116-eb2e5978-f8a1-491b-b13e-b14aa297d418.png" height="80%" width="80%" />

<img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" /> specifies the first transaction node of <img src="https://latex.codecogs.com/svg.image?Txn_{1}" title="Txn_{1}" />. At first, <img src="https://latex.codecogs.com/svg.image?Txn_{1}" title="Txn_{1}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{1-1}" title="W_{1-1}" /> into it.

1. <img src="https://latex.codecogs.com/svg.image?Txn_{2}" title="Txn_{2}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{2-1}" title="TN_{2-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{2-1}" title="W_{2-1}" /> into it. Add <img src="https://latex.codecogs.com/svg.image?W_{1-2}" title="W_{1-2}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" />.
2. <img src="https://latex.codecogs.com/svg.image?Txn_{3}" title="Txn_{3}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{3-1}" title="TN_{3-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{3-1}" title="W_{3-1}" /> into it.
3. <img src="https://latex.codecogs.com/svg.image?Txn_{4}" title="Txn_{4}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{4-1}" title="TN_{4-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{4-1}" title="W_{4-1}" /> into it. Add <img src="https://latex.codecogs.com/svg.image?W_{2-2}" title="W_{2-2}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{2-1}" title="TN_{2-1}" />. Up to now, no log entry has been generated.

<img src="https://user-images.githubusercontent.com/39627130/156114351-54b40ca9-358a-4df3-b46c-17ea583c963b.png" height="80%" width="80%" />

4. <img src="https://latex.codecogs.com/svg.image?Txn_{5}" title="Txn_{5}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{5-1}" title="TN_{5-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{5-1}" title="W_{5-1}" /> into it.
5. <img src="https://latex.codecogs.com/svg.image?Txn_{6}" title="Txn_{6}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{6-1}" title="TN_{6-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{6-1}" title="W_{6-1}" /> into it. Add <img src="https://latex.codecogs.com/svg.image?W_{3-2}" title="W_{3-2}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{3-1}" title="TN_{3-1}" />. Add <img src="https://latex.codecogs.com/svg.image?W_{2-3}" title="W_{2-3}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{2-1}" title="TN_{2-1}" />. Add <img src="https://latex.codecogs.com/svg.image?C_{5}" title="C_{5}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{5-1}" title="TN_{5-1}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{5-1}" title="TN_{5-1}" />. Add <img src="https://latex.codecogs.com/svg.image?C_{4}" title="C_{4}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{4-1}" title="TN_{4-1}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{4-1}" title="TN_{4-1}" />.
6. Unregister <img src="https://latex.codecogs.com/svg.image?TN_{4-1}" title="TN_{4-1}" /> and <img src="https://latex.codecogs.com/svg.image?TN_{5-1}" title="TN_{5-1}" /> from the buffer manager. Before adding <img src="https://latex.codecogs.com/svg.image?W_{3-3}" title="W_{3-3}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{3-1}" title="TN_{3-1}" />, there is no space left. <img src="https://latex.codecogs.com/svg.image?TN_{2-1}" title="TN_{2-1}" /> is chosen as a candidate for eviction and is unloaded as a log entry. Add <img src="https://latex.codecogs.com/svg.image?W_{3-3}" title="W_{3-3}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{3-1}" title="TN_{3-1}" />. <img src="https://latex.codecogs.com/svg.image?Txn_{2}" title="Txn_{2}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{2-2}" title="TN_{2-2}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{2-4}" title="W_{2-4}" /> into it. Add <img src="https://latex.codecogs.com/svg.image?C_{1}" title="C_{1}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" />. Add <img src="https://latex.codecogs.com/svg.image?C_{6}" title="C_{6}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{6-1}" title="TN_{6-1}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{6-1}" title="TN_{6-1}" />. Add <img src="https://latex.codecogs.com/svg.image?W_{2-5}" title="W_{2-5}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{2-2}" title="TN_{2-2}" />. Add <img src="https://latex.codecogs.com/svg.image?A_{2}" title="A_{2}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{2-2}" title="TN_{2-2}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{2-2}" title="TN_{2-2}" /> (for a rolled-back transaction, a log entry is only necessary if some of the previous operations have been persisted).

### Checkpoint
Typically, a checkpoint is a safe point from which the state machine can start applying log entries during restart. Entries before the checkpoint are no longer needed and will be physically destroyed at the appropriate time. A checkpoint can represent the equivalent of the collection of data in the range it indicates. For example, <img src="https://latex.codecogs.com/svg.image?CKP_{LSN=11}(-\infty,&space;10]" title="CKP_{LSN=11}(-\infty, 10]" /> is the equivalent of the entries from <img src="https://latex.codecogs.com/svg.image?E_{LSN=1}" title="E_{LSN=1}" /> to <img src="https://latex.codecogs.com/svg.image?E_{LSN=10}" title="E_{LSN=10}" />, and entries in that range are no longer needed. On restart, replay starts from the last checkpoint log entry <img src="https://latex.codecogs.com/svg.image?CKP_{LSN=11}(-\infty,&space;10]" title="CKP_{LSN=11}(-\infty, 10]" />.

<img src="https://user-images.githubusercontent.com/39627130/156356421-9bc93fb8-6e11-4e2c-aad0-e6772f4b6cf0.png" height="100%" width="100%" />

However, the checkpoint in **TAE** is different:

<img src="https://user-images.githubusercontent.com/39627130/156372311-1cee455b-ab97-406f-b6fe-abb14c097917.png" height="100%" width="100%" />

#### Dedicated Group
The log of **TAE** has different groups, and the LSN of each group is continuously and monotonically increasing. There is a dedicated group for checkpoint log entries.

#### Fuzzy Checkpoint
The range indicated by a typical checkpoint is always a continuous interval from the minimum value up to a certain LSN, such as <img src="https://latex.codecogs.com/svg.image?(-\infty,&space;4]" title="(-\infty, 4]" /> or <img src="https://latex.codecogs.com/svg.image?(-\infty,&space;10]" title="(-\infty, 10]" />. The interval of **TAE**, in contrast, does not need to be continuous, e.g. <img src="https://latex.codecogs.com/svg.image?\{[1,4],&space;[6,8]\}" title="\{[1,4], [6,8]\}" />. Furthermore, given that each committed entry is a collection of multiple subcommands, each checkpoint should be a collection of subcommand indexes <img src="https://latex.codecogs.com/svg.image?\{[E_{1},E_{4}],&space;\{E_{5-1}\},&space;[E_{6},E_{8}]\}" title="\{[E_{1},E_{4}], \{E_{5-1}\}, [E_{6},E_{8}]\}" />
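
Because a fuzzy checkpoint is a set of LSN intervals plus individual sub-command indexes rather than a single prefix, deciding whether an entry is already covered means testing membership in that set. A hypothetical Go sketch of such a structure:

```go
package example

// lsnRange is a closed interval of LSNs covered by a fuzzy checkpoint.
type lsnRange struct{ from, to uint64 }

// fuzzyCheckpoint covers whole entries via ranges and individual sub-commands
// of partially covered entries, e.g. {[1,4], {5-1}, [6,8]}.
type fuzzyCheckpoint struct {
	ranges      []lsnRange
	subCommands map[uint64][]uint32 // LSN -> covered sub-command indexes
}

// covers reports whether a whole entry is covered by the checkpoint.
func (c *fuzzyCheckpoint) covers(lsn uint64) bool {
	for _, r := range c.ranges {
		if lsn >= r.from && lsn <= r.to {
			return true
		}
	}
	return false
}
```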

## Catalog
**Catalog** is TAE's in-memory metadata manager. It manages all states of the engine, and its underlying driver is an embedded LogStore. The Catalog can be fully replayed from the underlying LogStore.
1. Storage layout info
2. Database and table schema info
3. DDL operation info

### Yet Another Database-Database
**Catalog** is the database of **TAE** metadata. As a database, it has the following characteristics:

1. In-memory database (data and indexes)
2. Row-oriented database
3. MVCC
4. RR isolation
5. DDL like `CREATE|DROP TABLE, CREATE|DROP DATABASE` is not supported. There are only a few built-in tables, and they cannot be deleted.

### Built-in Tables
The hierarchical relationship from top to bottom is: **Database**, **Table**, **Segment**, **Block**. **Catalog** is responsible for managing these resources and supports addition, deletion, modification and query.

<img src="https://user-images.githubusercontent.com/39627130/145529173-1c6ad8eb-84e2-4d7e-a49a-9085153f3436.png" height="60%" width="60%" />

**Catalog** creates a table for each resource, corresponding to **Database Entry**, **Table Entry**, **Segment Entry** and **Block Entry**.

<img src="https://user-images.githubusercontent.com/39627130/156906206-c1d905ef-ad5e-4c42-93cf-a38174dfd7be.png" height="80%" width="80%" />

### Transactional Operation
A **Catalog** transaction is actually a sub-transaction of the engine transaction. Each table has an in-memory primary key index, and each node in the index corresponds to a table row. Once there is any update on a row, a version chain is created for it.

<img src="https://user-images.githubusercontent.com/39627130/156907205-01ffab4b-5f44-4400-9d3b-5def876e7d49.png" height="50%" width="50%" />

#### Insert
1. Search the primary key index; if the same primary key exists, return a Duplicate error
2. Create a new index node (uncommitted)

#### Delete
1. Search the primary key index; if not found, return a NotFound error
2. If the row was already deleted (committed), return a NotFound error
3. If there is no version chain on the row, create a version chain, insert a delete node into the chain and return
4. Scan the version chain. If a delete node is found:
   - If the delete node is committed, return a NotFound error
   - If the delete node is uncommitted and belongs to a different transaction, return a `W-W` conflict error; otherwise return a NotFound error
5. Insert a delete node into the chain and return

#### Update
1. Search the primary key index; if not found, return a NotFound error
2. If the row was already deleted (committed), return a NotFound error
3. If there is no version chain on the row, create a version chain, insert an update node into the chain and return
4. Scan the version chain:
   - If a delete node is found
     - If the delete node is committed, return a NotFound error
     - If the delete node is uncommitted and belongs to a different transaction, return a `W-W` conflict error; otherwise return a NotFound error
   - If an update node is found
     - If the update node is uncommitted, return a `W-W` conflict error
5. Insert an update node into the chain and return

#### Query
1. Search the primary key index; if not found, return a NotFound error
2. If the row was already deleted (committed), return a NotFound error
3. If there is no version chain on the row:
   - If the row is uncommitted
     - If it belongs to the same transaction, return the row value
     - If it belongs to a different transaction, return a NotFound error
   - If the row is committed, return the row value
4. Scan the version chain:
   - If a delete node is found
     - If the delete node was committed before the query transaction started, return a NotFound error
     - If the delete node is uncommitted and belongs to the same transaction, return a NotFound error
   - If an update node is found
     - If the update node was committed before the query transaction started, return the row value
     - If the update node is uncommitted and belongs to the same transaction, return the row value
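
The Insert/Delete/Update/Query rules above all reduce to walking a row's version chain and comparing commit timestamps with the reading transaction. The Go sketch below captures only that visibility check under simplified, assumed types; it is not the Catalog implementation.

```go
package example

// versionNode is one node in a row's version chain (toy model).
type versionNode struct {
	txnID    uint64 // owner transaction
	commitTS uint64 // 0 means still uncommitted
	isDelete bool
	prev     *versionNode
}

// visibleVersion walks the chain from newest to oldest and returns the first
// node visible to a reader with start timestamp startTS and id txnID.
// It returns nil if the row is deleted or has no visible version
// (boundary handling at startTS is simplified).
func visibleVersion(head *versionNode, txnID, startTS uint64) *versionNode {
	for n := head; n != nil; n = n.prev {
		ownChange := n.commitTS == 0 && n.txnID == txnID
		committedBeforeRead := n.commitTS != 0 && n.commitTS <= startTS
		if !ownChange && !committedBeforeRead {
			continue // invisible: uncommitted by others, or committed after we started
		}
		if n.isDelete {
			return nil // a visible delete hides the row (NotFound)
		}
		return n
	}
	return nil
}
```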

#### Commit & Rollback
1. All uncommitted changes are stored in the transaction's in-memory store
2. All rows read in the transaction are recorded in the txn store
3. Commit pipeline
   - Prepare:
     - Check R-W anti-dependency. If violated, rollback
     - Update the commit info of all related uncommitted rows
   - Commit: flush and update the visible version
   - Rollback: remove all uncommitted rows

#### Checkpoint
Any operation within a sub-transaction corresponds to a sub-command in the engine's transaction log. So a commit or rollback of a **Catalog** transaction is a checkpoint of some sub-commands in the engine's transaction log.

#### Compaction
1. In-memory version chain pruning
2. Hard delete previously soft-deleted rows
3. Compact disk data

## Database (Column Families)
In **TAE**, a **Table** is a **Column Family** while a **Database** is a set of **Column Families**. The main idea behind **Column Families** is that they share the write-ahead log (they share a **Log Space**), so that we can implement **Database-level** atomic writes. The old **WAL** cannot be compacted when the mutable buffer of a **Table** is flushed, since it may contain live data from other **Tables**. It can only be compacted when the mutable buffers of all related **Tables** have been flushed.

**TAE** supports multiple **Databases**; that is, one **TAE** instance can work with multiple **Log Spaces**. Our **MatrixOne** DBMS is built upon multi-raft and each node only needs one **TAE** engine, with each raft group corresponding to a **Database**. It is complicated, and what makes it more complicated is that the engine shares the external **WAL** with the **Raft** log.

## Multi-Version Concurrency Control (MVCC)
**TAE** uses MVCC to provide snapshot isolation (SI) for individual transactions. For SI, the consistent read view of a transaction is determined by the transaction start time, so that data read within the transaction never reflects changes made by other simultaneous transactions. For example, for <img src="https://latex.codecogs.com/svg.image?Txn-2" title="Txn-2" />, the read view includes <img src="https://latex.codecogs.com/svg.image?[seg-2,&space;seg-4]" title="[seg-2, seg-4]" />, and the more fine-grained, block-level read view includes <img src="https://latex.codecogs.com/svg.image?[blk2-1,&space;blk2-3,&space;blk2-5,&space;blk4-1,&space;blk4-3,&space;blk4-5]" title="[blk2-1, blk2-3, blk2-5, blk4-1, blk4-3, blk4-5]" />.

<img src="https://user-images.githubusercontent.com/39627130/155067474-3303ef05-ad1c-4a4c-9bf5-50cb15e41c63.png" height="60%" width="60%" />

**TAE** provides value-level fine-grained optimistic concurrency control: only updates to the same row and same column conflict. A transaction uses the value versions that exist when the transaction begins, and no locks are placed on the data when it is read. When two transactions attempt to update the same value, the second transaction will fail due to a write-write conflict.
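
A transaction therefore fails at commit when another transaction has already committed a newer version of the same (row, column) value after it started, or still holds an uncommitted update on it. A hypothetical Go sketch of that value-level check:

```go
package example

import "errors"

var errWWConflict = errors.New("w-w conflict")

// columnUpdate describes the latest update to one (row, column) value.
type columnUpdate struct {
	txnID    uint64
	commitTS uint64 // 0 means uncommitted
}

// checkWW returns an error if writing the same value would conflict with
// another transaction, following the value-level rule described above.
func checkWW(latest *columnUpdate, txnID, startTS uint64) error {
	if latest == nil || latest.txnID == txnID {
		return nil // no previous writer, or our own earlier write
	}
	if latest.commitTS == 0 || latest.commitTS > startTS {
		return errWWConflict // uncommitted by others, or committed after we started
	}
	return nil
}
```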

### Read View

In **TAE**, a table includes multiple segments. A segment is the result of the combined action of multiple transactions, so a segment can be represented as <img src="https://latex.codecogs.com/svg.image?[T_{start},&space;T_{end}]" title="[T_{start}, T_{end}]" /> (<img src="https://latex.codecogs.com/svg.image?T_{start}" title="T_{start}" /> is the commit time of the oldest transaction while <img src="https://latex.codecogs.com/svg.image?T_{end}" /> is the commit time of the newest). Since a segment can be compacted into a new segment and segments can be merged into a new segment, we need to add a dimension to the segment representation to distinguish versions: <img src="https://latex.codecogs.com/svg.image?([T_{start},T_{end}],&space;[T_{create},T_{drop}])" title="([T_{start},T_{end}], [T_{create},T_{drop}])" /> (<img src="https://latex.codecogs.com/svg.image?T_{create}" title="T_{create}" /> is the segment create time while <img src="https://latex.codecogs.com/svg.image?T_{drop}" title="T_{drop}" /> is the segment drop time). <img src="https://latex.codecogs.com/svg.image?T_{drop}&space;=&space;0" title="T_{drop} = 0" /> means the segment is not dropped. The block representation is the same as that of the segment: <img src="https://latex.codecogs.com/svg.image?([T_{start},T_{end}],&space;[T_{create},T_{drop}])" title="([T_{start},T_{end}], [T_{create},T_{drop}])" />.

A transaction can be represented as <img src="https://latex.codecogs.com/svg.image?[Txn_{start},&space;Txn_{commit}]" title="[Txn_{start}, Txn_{commit}]" /> (<img src="https://latex.codecogs.com/svg.image?Txn_{start}" /> is the transaction start time while <img src="https://latex.codecogs.com/svg.image?Txn_{commit}" /> is the commit time). The read view of a transaction can be determined by the following formula:

<img src="https://latex.codecogs.com/svg.image?(Txn_{start}&space;\geqslant&space;T_{create})&space;\bigcap&space;((T_{drop}=&space;0)\bigcup&space;(T_{drop}>Txn_{start}))" title="(Txn_{start} \geqslant T_{create}) \bigcap ((T_{drop} = 0)\bigcup (T_{drop}>Txn_{start}))" />

When a transaction is committed, it is necessary to obtain a read view related to the commit time for deduplication:

<img src="https://latex.codecogs.com/svg.image?(Txn_{commit}&space;\geqslant&space;T_{create})&space;\bigcap&space;((T_{drop}=&space;0)\bigcup&space;(T_{drop}>Txn_{commit}))" title="(Txn_{commit} \geqslant T_{create}) \bigcap ((T_{drop}= 0)\bigcup (T_{drop}>Txn_{commit}))" />

For example, the read view of <img src="https://latex.codecogs.com/svg.image?Txn-2" title="Txn-2" /> includes <img src="https://latex.codecogs.com/svg.image?[seg1]" title="[seg1]" /> while the read view during commit includes <img src="https://latex.codecogs.com/svg.image?[seg1,seg2,seg3]" title="[seg1,seg2,seg3]" />.

<img src="https://user-images.githubusercontent.com/39627130/154995795-2c367e33-bafa-4e47-812d-f80d82594613.png" height="100%" width="100%" />

The block read view is similar to the segment read view.
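
The two formulas translate directly into a visibility predicate over a segment's or block's `(Tcreate, Tdrop)` pair; the only difference is which timestamp is passed in. A small Go rendering (the timestamp type is an assumption):

```go
package example

// visibleAt implements the read-view formula above:
// ts >= Tcreate AND (Tdrop == 0 OR Tdrop > ts).
// Pass the transaction start timestamp for reads, or the commit timestamp
// when building the deduplication read view at commit time.
func visibleAt(ts, createTS, dropTS uint64) bool {
	return ts >= createTS && (dropTS == 0 || dropTS > ts)
}
```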
<img src="https://latex.codecogs.com/svg.image?Block1_{L_{0}}" title="Block1_{L_{0}}" /> starts to sort @ <img src="https://latex.codecogs.com/svg.image?t_{11}" title="t_{11}" />,and its block read view is the baseline plus an uncommitted update node, which will be skipped. Sort and persist a block may take a long time. There are two committed transactions <img src="https://latex.codecogs.com/svg.image?\{Txn5,Txn6\}" title="\{Txn5,Txn6\}" /> and one uncommitted <img src="https://latex.codecogs.com/svg.image?\{Txn7\}" title="\{Txn7\}" /> before commiting sorted <img src="https://latex.codecogs.com/svg.image?Block2_{L_{1}}" title="Block2_{L_{1}}" />. When commiting <img src="https://latex.codecogs.com/svg.image?\{Txn7\}" title="\{Txn7\}" /> @ <img src="https://latex.codecogs.com/svg.image?t_{16}" title="t_{16}" />, it will fail because <img src="https://latex.codecogs.com/svg.image?Block1_{L_{0}}" title="Block1_{L_{0}}" /> has been terminated. Update nodes <img src="https://latex.codecogs.com/svg.image?\{Txn5,Txn6\}" title="\{Txn5,Txn6\}" /> that were committed in between <img src="https://latex.codecogs.com/svg.image?(t_{11},&space;t_{16})" title="(t_{11}, t_{16})" /> will be merged into a new update node and it will be committed together with <img src="https://latex.codecogs.com/svg.image?Block2_{L_{1}}" title="Block2_{L_{1}}" /> @ <img src="https://latex.codecogs.com/svg.image?t_{16}" title="t_{16}" />. 497 498 ![image](https://user-images.githubusercontent.com/39627130/155317195-483f7b67-48b1-4474-8555-315805492204.png) 499 500 #### Compaction As A Transactional Schema Change 501 A compaction is the termination of a series of blocks or segments, while atomically creating a new one (building index). It usually takes a long time compared to normal transactions and we don't want to block update or delete transactions on involved blocks or segments. Here we extend the content of the **read view** to include the metadata of blocks and segments into it. When commiting a normal transaction, once it is detected that the metadata of blocks (segments) corresponding to write operation has been changed (committed), it will fail. 502 503 For a compaction transaction, write operations include block (segment) soft deletion and addition. During the execution of the transaction, each write will detect a write-write conflict. Once there is a conflict, the transaction will be terminated in advance. 504 505 ## Transaction 506 A tuple <img src="https://latex.codecogs.com/svg.image?(T_{start},T_{commit})" title="(T_{start},T_{commit})" /> is the representation of a transaction where both elements are of type `uint64`, and <img src="https://latex.codecogs.com/svg.image?T_{start}" /> can be regarded as <img src="https://latex.codecogs.com/svg.image?TxnId" />. The gloabl timestamp is monotonically increasing continuously from 1, and after restarting, the previous value needs to be restored as a new starting point. When starting a transaction, the global timestamp is assigned to the transaction's <img src="https://latex.codecogs.com/svg.image?T_{start}" /> and incremented by 1. When a transaction is committed, the global timestamp is also assigned to its <img src="https://latex.codecogs.com/svg.image?T_{commit}" /> and incremented by 1. 507 508 ### Preprocessing 509 510 A transaction usually consists of multiple commands, and each command is usually trivial. When committing a transaction, we want to be able to preprocess some commands ahead of time. It is a requirement of the fuzzy checkpoint mechanism. 

### Preprocessing

A transaction usually consists of multiple commands, and each command is usually trivial. When committing a transaction, we want to be able to preprocess some commands ahead of time. This is a requirement of the fuzzy checkpoint mechanism.

#### Command
Here are all command types:

| Command | Datatype | Value | Description |
| ---------- | -------- | ----- | ---------------------------------------------------------- |
| `CREATE_DB` | int8 | 0x01 | Create a database |
| `DELETE_DB` | int8 | 0x02 | Delete a database |
| `CREATE_TABLE` | int8 | 0x13 | Create a table |
| `UPDATE_TABLE` | int8 | 0x14 | Update a table |
| `DELETE_TABLE` | int8 | 0x15 | Delete a table |
| `INSERT` | int8 | 0x30 | Insert rows |
| `UPDATE_COMMITTED` | int8 | 0x32 | Update a committed value |
| `DELETE_LOCAL` | int8 | 0x33 | Delete a row in the transaction local store |
| `DELETE_COMMITTED` | int8 | 0x34 | Delete a committed row |

#### Split Commands
The raw command list <img src="https://latex.codecogs.com/svg.image?&space;&space;&space;CMD_{1}^{t_{1}}&space;\to&space;CMD_{2}^{t_{2}}\to&space;CMD_{3}^{t_{3}}&space;\to&space;...&space;\to&space;CMD_{n}^{t_{n}}" title=" CMD_{1}^{t_{1}} \to CMD_{2}^{t_{2}}\to CMD_{3}^{t_{3}} \to ... \to CMD_{n}^{t_{n}}" /> of a transaction is in the order accepted. <img src="https://latex.codecogs.com/svg.image?n" title="n" /> is the command sequence and <img src="https://latex.codecogs.com/svg.image?t_{n}" title="t_{n}" /> is the command type. **Split** splits the command list into several lists, which improves parallelism during commit. **Split** is table-bounded: all commands applied to the same table are grouped into the same list.

#### Merge Commands
**Merge** transforms a command list <img src="https://latex.codecogs.com/svg.image?&space;&space;&space;CMD_{1}^{t_{1}}&space;\to&space;CMD_{2}^{t_{2}}\to&space;CMD_{3}^{t_{3}}&space;\to&space;...&space;\to&space;CMD_{n}^{t_{n}}" title=" CMD_{1}^{t_{1}} \to CMD_{2}^{t_{2}}\to CMD_{3}^{t_{3}} \to ... \to CMD_{n}^{t_{n}}" /> into a new list <img src="https://latex.codecogs.com/svg.image?&space;&space;CMD_{1}^{t_{1}}&space;\to&space;CMD_{2}^{t_{2}}\to&space;CMD_{3}^{t_{3}}&space;\to&space;...&space;\to&space;CMD_{m}^{t_{m}}&space;" title=" CMD_{1}^{t_{1}} \to CMD_{2}^{t_{2}}\to CMD_{3}^{t_{3}} \to ... \to CMD_{m}^{t_{m}} " />, where <img src="https://latex.codecogs.com/svg.image?m&space;<&space;n" title="m < n" />.

##### Rules
- Some types of commands cannot be merged
```
CREATE_DB, DELETE_DB, CREATE_TABLE, DELETE_TABLE
```
- Commands of different types can be merged
```
1. INSERT + INSERT => INSERT
2. INSERT + DELETE_LOCAL => INSERT
```
- Command dependency: prepend all `DELETE_COMMITTED` commands
```
INSERT, INSERT, DELETE_COMMITTED, INSERT => DELETE_COMMITTED, INSERT
```

##### Example
<img src="https://user-images.githubusercontent.com/39627130/157154690-d66b71f2-15c5-488d-867c-ec2604ab3c5d.png" height="90%" width="90%" />
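
A hedged Go sketch of the merge rules for one table's command list: `DELETE_COMMITTED` commands are collected and prepended, consecutive `INSERT`s are concatenated, and a `DELETE_LOCAL` is folded into the accumulated insert. The command model (integer row ids as payload) is a simplification for illustration, not the real command format.

```go
package example

// cmdType mirrors the command table above (values are illustrative in Go).
type cmdType int8

const (
	cmdInsert cmdType = iota
	cmdDeleteLocal
	cmdDeleteCommitted
	// DDL commands are never merged and are omitted here.
)

type command struct {
	typ  cmdType
	rows []int // toy payload: affected row ids
}

// mergeCommands applies the rules above to one table's command list.
func mergeCommands(cmds []command) []command {
	deletes := command{typ: cmdDeleteCommitted}
	insert := command{typ: cmdInsert}
	for _, c := range cmds {
		switch c.typ {
		case cmdDeleteCommitted:
			deletes.rows = append(deletes.rows, c.rows...)
		case cmdInsert:
			insert.rows = append(insert.rows, c.rows...)
		case cmdDeleteLocal:
			insert.rows = removeRows(insert.rows, c.rows)
		}
	}
	out := []command{}
	if len(deletes.rows) > 0 {
		out = append(out, deletes) // DELETE_COMMITTED commands come first
	}
	if len(insert.rows) > 0 {
		out = append(out, insert)
	}
	return out
}

// removeRows drops the given row ids from rows (toy helper).
func removeRows(rows, drop []int) []int {
	del := map[int]bool{}
	for _, r := range drop {
		del[r] = true
	}
	kept := rows[:0]
	for _, r := range rows {
		if !del[r] {
			kept = append(kept, r)
		}
	}
	return kept
}
```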

### DML
#### INSERT
##### Uncommitted
All inserted data is stored in transaction local storage before commit. Data in transaction local storage is grouped into tables, and an insert request inserts data into the target table. The data for each table contains one or more batches. If the amount of data in a request is particularly large, it is split into multiple batches: <img src="https://latex.codecogs.com/svg.image?request_{rows=n+20}&space;\overset{split}{\rightarrow}&space;\{batch_{rows=20},&space;batch_{rows=n}\}" />.

Suppose the maximum number of rows in a batch is 10. Here are the commands in a transaction:

```
1. INSERT 15 ROWS INTO TABLE1
```
- <img src="https://latex.codecogs.com/svg.image?request_{rows=15}^{1}&space;\overset{split}{\rightarrow}&space;\{batch_{rows=10}^{1},&space;batch_{rows=5}^{2}\}&space;" title="request_{rows=15}^{1} \overset{split}{\rightarrow} \{batch_{rows=10}^{1}, batch_{rows=5}^{2}\} " />
- <img src="https://latex.codecogs.com/svg.image?batch_{rows=10}^{1}&space;\overset{append}{\rightarrow}&space;Batch_{}^{1}" title="batch_{rows=10}^{1} \overset{append}{\rightarrow} Batch_{}^{1}" />, <img src="https://latex.codecogs.com/svg.image?batch_{rows=5}^{2}&space;\overset{append}{\rightarrow}&space;Batch_{}^{2}" title="batch_{rows=5}^{2} \overset{append}{\rightarrow} Batch_{}^{2}" />
- <img src="https://latex.codecogs.com/svg.image?Table_{}^{1}&space;=&space;\{Batch_{}^{1},&space;Batch_{}^{2}\}" title="Table_{}^{1} = \{Batch_{}^{1}, Batch_{}^{2}\}" />

```
2. INSERT 4 ROWS INTO TABLE2
```
- <img src="https://latex.codecogs.com/svg.image?request_{rows=4}^{2}&space;\overset{split}{\rightarrow}&space;\{batch_{rows=4}^{1}\}" title="request_{rows=4}^{2} \overset{split}{\rightarrow} \{batch_{rows=4}^{1}\}" />
- <img src="https://latex.codecogs.com/svg.image?batch_{rows=4}^{1}&space;\overset{append}{\rightarrow}&space;Batch_{}^{3}" title="batch_{rows=4}^{1} \overset{append}{\rightarrow} Batch_{}^{3}" />
- <img src="https://latex.codecogs.com/svg.image?Table_{}^{2}&space;=&space;\{Batch_{}^{3}\}" title="Table_{}^{2} = \{Batch_{}^{3}\}" />

```
3. INSERT 5 ROWS INTO TABLE1
```
- <img src="https://latex.codecogs.com/svg.image?request_{rows=5}^{3}&space;\overset{split}{\rightarrow}&space;\{batch_{rows=5}^{1}\}" title="request_{rows=5}^{3} \overset{split}{\rightarrow} \{batch_{rows=5}^{1}\}" />
- <img src="https://latex.codecogs.com/svg.image?batch_{rows=5}^{1}&space;\overset{append}{\rightarrow}&space;Batch_{}^{2}" title="batch_{rows=5}^{1} \overset{append}{\rightarrow} Batch_{}^{2}" />
- <img src="https://latex.codecogs.com/svg.image?Table_{}^{1}&space;=&space;\{Batch_{}^{1},&space;Batch_{}^{2}\}" title="Table_{}^{1} = \{Batch_{}^{1}, Batch_{}^{2}\}" />

```
Txn Local Storage
   |
   |--- Table1
   |     |
   |     |--- Batch1
   |     |--- Batch2
   |
   |--- Table2
         |
         |--- Batch3
```

##### Committed

<img src="https://user-images.githubusercontent.com/39627130/156866455-e9dff497-21a9-463b-be66-8f8f564544d9.png" height="80%" width="80%" />

#### UPDATE & DELETE
##### UPDATE & DELETE INSERTED DATA
If any deletes are applied to a batch in the transaction local store, a deletion bitmap is created for the batch. Any update is transformed into a delete plus an insert.

##### UPDATE & DELETE COMMITTED DATA
A delete history is created for blocks with any **DELETE** operation.
```go
type UpdateEntry struct {
	txn     *TxnEntry
	indexes []*LogIndex
}

type DeleteHistory struct {
	entries map[int32]*UpdateEntry
}
```

An **UPDATE** of the primary key is converted into a **DELETE** plus an **INSERT**, so the **UPDATE** we are talking about here is always an update to a non-primary-key column.
```go
type UpdateNode struct {
	UpdateEntry
	values map[int32]interface{}
}
```

<img src="https://user-images.githubusercontent.com/39627130/156788303-4fd2a2e4-6975-493e-8b5f-16f156fdc9dc.png" height="80%" width="80%" />

##### Version Chain

<img src="https://user-images.githubusercontent.com/39627130/159515818-ddbd5317-0e15-4d3e-8c49-a7edc0ebaef1.png" height="70%" width="70%" />

**TODO**

#### DDL
A transaction usually contains multiple **DDL** and **DML** statements. As mentioned in [Catalog](#Catalog), **Catalog** has its own transaction mechanism, and a **TAE** transaction contains both **DDL** and **DML**, so we treat the **Catalog** transaction as a sub-transaction of the **TAE** transaction.

<img src="https://user-images.githubusercontent.com/39627130/157159731-4b7015cf-621a-4436-90e4-cc9d5d930412.png" height="70%" width="70%" />

All **DDL** operations correspond to **Catalog** **DML** operations; see the [corresponding chapter](#Catalog) for details.

### Commit
As mentioned earlier, the data of a transaction is grouped by table, and each group of data is a combination of the following data types:
- <img src="https://latex.codecogs.com/svg.image?Batch_{}^{i}" title="Batch_{}^{i}" />. The i-th uncommitted batch
- <img src="https://latex.codecogs.com/svg.image?Bitmap_{}^{i}" title="Bitmap_{}^{i}" />. The delete bitmap of <img src="https://latex.codecogs.com/svg.image?Batch_{}^{i}" title="Batch_{}^{i}" />
- <img src="https://latex.codecogs.com/svg.image?DeleteNode_{blk}" title="DeleteNode_{blk}" />. The delete node of a committed block
- <img src="https://latex.codecogs.com/svg.image?UpdateNode_{blk}^{col}" title="UpdateNode_{blk}^{col}" />. The update node of a committed column block
- `CREATE_TABLE`
- `DROP_TABLE`

When committing a transaction, all combinations can be summed up in the following patterns:
1. **P1**: Only inserts
2. **P2**: Only deletes or updates to committed data
3. **P3**: Inserts, and deletes or updates to the inserted data
4. **P4**: Inserts, and deletes or updates to committed data
5. **P5**: Only DDL
6. **P6**: DDL and others

#### P1
1. Wrap each batch as a command with a sequence number
2. Append the batch to the statemachine with the annotated command context

#### P2
1. Wrap each update|delete node as a command with a sequence number
2. Update the update|delete node commit info

#### P3
1. If there is a delete bitmap for a batch, apply the delete bitmap first to generate a new batch
2. Same as [P1](#P1)

#### P4
1. Process deletes|updates first. Same as [P2](#P2)
2. Process inserts. Same as [P1](#P1)

#### P5
1. Wrap a command with a sequence number
2. Update the commit info

#### P6
1. All operations on a table before a `DROP TABLE` can be eliminated. Delete all related uncommitted batches and delete|update nodes, then proceed as in [P5](#P5)
2. `CREATE TABLE` should always be the first command, unless there is a `DROP TABLE` later

#### Commit Pipeline
**TODO**

### Schema Change
**TODO**

## Snapshot
**TODO**

## Split
**TODO**

## GC
1. Metadata compaction
   1) In-memory version chain compaction
   2) In-memory hard deleted metadata entry compaction
   3) Persisted data compaction
2. Stale data deletion
   1) Table data with a reference count equal to 0
   2) Log compaction

**TODO**