- Feature Name: Transactional Analytic Engine
- Status: In Progress
- Start Date: 2022-02-21
- Authors: [Xu Peng](https://github.com/XuPeng-SH)
- Implementation PR:
- Issue for this RFC: [2360](https://github.com/matrixorigin/matrixone/issues/2360)

# Summary

**TAE** (Transactional Analytic Engine) is designed for hybrid transactional and analytical query workloads, and can be used as the underlying storage engine of a database management system (DBMS) for hybrid transactional/analytical processing (HTAP).

# Guide-level design

## Terms
### Layout
- **Block**: Piece of a segment, and the minimum unit of table data. The maximum number of rows in a block is fixed
- **Segment**: Piece of a table, composed of blocks
- **Table**: Piece of a database, composed of segments
- **Database**: A collection of tables that share the same log space

**TODO**

### State
- **Transient Block**: Block whose row count has not reached the upper limit, or a block queued to be sorted and flushed
- **Sorted Block**: Block sorted by primary key
- **Unsorted Segment**: Segment that has not been merge-sorted
- **Sorted Segment**: Segment that has been merge-sorted

**TODO**

### Container
- **Vector**: Data fragment of a column in memory
- **Batch**: A combination of vectors in which the number of rows of each vector is aligned

**TODO**

## Data storage
### Table
**TAE** stores data represented as tables. Each table is bound to a schema consisting of a number of column definitions. Table data is organized as a log-structured merge-tree (LSM tree).

Currently, **TAE** is a three-level LSM tree, with levels called L0, L1 and L2. L0 is small and can be entirely resident in memory, whereas L1 and L2 are both definitely resident on disk. In **TAE**, L0 consists of transient blocks and L1 consists of sorted blocks. Incoming new data is always inserted into the latest transient block. If the insertion causes the block to exceed the maximum row count of a block, the block is sorted by primary key and flushed into L1 as a sorted block. If the number of sorted blocks exceeds the maximum number of blocks in a segment, the segment is sorted by primary key using merge sort.

L1 and L2 are organized into sorted runs of data. Each run contains data sorted by the primary key and can be represented on disk as a single file. There will be overlapping primary key ranges between sorted runs. The difference between L1 and L2 is that a run in L1 is a **block** while a run in L2 is a **segment**.

A segment can be compacted into a new segment if it has accumulated many updates (deletions). Segments can also be merged into a new segment. The scheduling behind this follows customizable strategies, mainly trading off write amplification against read amplification.

As described above, transient blocks can be entirely resident in memory, but not necessarily so. Because there will be many tables, and each table has transient blocks, keeping them all resident in memory would be very wasteful. In **TAE**, transient blocks from all tables share a dedicated fixed-size LRU cache. An evicted transient block is unloaded from memory and flushed as a transient block file. In practice, transient blocks are constantly flowing to L1 and the number of transient blocks per table at any given time is very small, so the active transient blocks will likely reside in memory even with a small-sized cache.
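
To make the write path above concrete, here is a minimal Go sketch of the L0 append flow. It is illustrative only and uses hypothetical names (`row`, `transientBlock`, `flushToL1`); it is not the TAE implementation.

```go
package example

import "sort"

// row is a toy record keyed by an integer primary key (illustrative only).
type row struct {
	PK  int
	Val []byte
}

// transientBlock models an L0 block with a fixed maximum row count.
type transientBlock struct {
	maxRows int
	rows    []row
}

// append inserts rows into the latest transient block. Once the block reaches
// its row limit, it is sorted by primary key and handed to L1 as a sorted block.
// Splitting an oversized batch across several blocks is omitted for brevity.
func (b *transientBlock) append(rs []row, flushToL1 func(sorted []row)) {
	b.rows = append(b.rows, rs...)
	if len(b.rows) < b.maxRows {
		return
	}
	sort.Slice(b.rows, func(i, j int) bool { return b.rows[i].PK < b.rows[j].PK })
	flushToL1(b.rows) // becomes a sorted block in L1
	b.rows = nil      // start a new transient block
}
```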

### Indexes
There is no table-level index in **TAE**; only segment-level and block-level indexes are available.

In **TAE**, there is a dedicated fixed-size LRU cache for all indexes. Compared with the original data, indexes occupy limited space, but they accelerate queries significantly and are called very frequently. A dedicated cache avoids a memory copy on each call.

#### Primary key index
**TAE** creates an index for each table's primary key by default. Its main functions are deduplication when inserting data and filtering by primary key. Deduplication is the critical path for data insertion. We need to make trade-offs in the following three aspects:
- Query performance
- Memory usage
- Match with the underlying data store layout

From the granularity of the index, we divide indexes into two categories: table-level indexes, and index sets composed of a series of partition indexes. For example, we can have a table-level B-tree index, or each segment can have its own B-tree index. The table data of **TAE** consists of multiple segments, and each segment starts unordered and only later becomes ordered. Compaction, merging, or splitting may take place afterwards. This scenario is very unfriendly to a table-level index, so the index of **TAE** is a segment-level index set.

There are two types of segments: appendable and non-appendable. For a non-appendable segment, the segment-level index is a two-level structure: a bloom filter and a zonemap. There are two options for the bloom filter: a segment-based bloom filter and a block-based bloom filter. The segment-based one is a better choice when the index can be fully resident in memory. An appendable segment consists of at least one appendable block plus multiple non-appendable blocks. The index of an appendable block is a memory-resident ART-tree plus a zonemap, while that of a non-appendable block is a bloom filter plus a zonemap.

<img src="https://user-images.githubusercontent.com/39627130/155945090-945442eb-25f6-4624-8c5a-442475328e49.png" height="70%" width="70%" />

The query flow chart of the segment index is as follows:

<img src="https://user-images.githubusercontent.com/39627130/155949839-3e771818-8fb0-41d5-8765-f81fdc650289.png" height="100%" width="100%" />

**TODO**

#### Secondary index
**TODO**

### Compression
**TAE** is a column-oriented data store, which is very friendly to data compression. It supports per-column compression codecs; currently only **LZ4** is used. You can easily obtain the meta information of compressed blocks. In **TAE**, the compression unit is a column of a block.

### Layout

#### Storage File Format
<img src="https://user-images.githubusercontent.com/39627130/158129098-dfe77d0b-c6dc-4323-a63e-b7b381f15949.png" height="100%" width="100%" />

##### Header
```
+-------------+--------------+-------------+----------------+------------------+------------------+
|  Magic (8B) | Version (2B) | Format (2B) | Reserved (32B) | MinPartSize (2B) | MaxPartSize (2B) |
+-------------+--------------+-------------+----------------+------------------+------------------+

Magic       = Engine identity (0x01346616). TAE only
Version     = File version
Format      = Layout format
Reserved    = 32 bytes reserved space
MinPartSize = Size of the smallest part unit. 1 stands for 1K bytes. If 4 is specified, the smallest part size is 4K bytes
MaxPartSize = Size of the biggest part. 1 stands for 1K bytes
```
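
As a concrete reading of the header layout, the sketch below encodes and decodes the 48-byte header with `encoding/binary`. The Go struct, field names and byte order are assumptions for illustration; the RFC does not prescribe them.

```go
package example

import (
	"bytes"
	"encoding/binary"
)

// fileHeader mirrors the header fields above: 8 + 2 + 2 + 32 + 2 + 2 = 48 bytes.
type fileHeader struct {
	Magic       uint64   // engine identity, e.g. 0x01346616
	Version     uint16   // file version
	Format      uint16   // layout format
	Reserved    [32]byte // reserved space
	MinPartSize uint16   // in units of 1K bytes
	MaxPartSize uint16   // in units of 1K bytes
}

// marshalHeader writes the fields in declaration order.
func marshalHeader(h fileHeader) ([]byte, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.BigEndian, h); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unmarshalHeader is the inverse of marshalHeader.
func unmarshalHeader(data []byte) (h fileHeader, err error) {
	err = binary.Read(bytes.NewReader(data), binary.BigEndian, &h)
	return
}
```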

##### Meta
```
Meta-1, Meta-2 = One is stale while the other is active

+----------+----------+-
| <Meta-1> | <Meta-2> |
+----------+----------+-
     |
     |
+---------------------------------------------------+
|                        Meta                       |
+---------------+-----------------+-----------------+
| Version (2B)  | PartOffset (4B) | Reserved (128B) |
+---------------+-----------------+-----------------+

Version    = Meta version
PartOffset = The first part position
Reserved   = 128 bytes reserved space
```

##### Part

```
+------------+---------------------+---- ... ----+
| Size (2B)  | NextPartOffset (4B) |   Payload   |
+------------+---------------------+---- ... ----+

Size           = Part size. 1 stands for 1K bytes
NextPartOffset = Next part pointer
Payload        = The size of the payload is ($Size * 1K - 2B - 4B)
```

#### Segment File

| Format | Value | Description |
| ------------ | -------- | ----- |
| SegmentFile | 0x10 | Segment file format |

```
+--------------------------------------------------------------------------------+
|                          Segment File Meta Part Payload                        |
+-----------+---------------+--------------+---------------+---------------------+
| Cols (2B) | BlockCnt (2B) | <ColumnMeta> | <IndexesMeta> | <BlockIndexesMeta>  |
+-----------+---------------+--------------+---------------+---------------------+
       |                           |               |                 |
       |                           |               |                 |
       |                           |               |   +-------------------------------------+
       |                           |               |   |           BlockIndexesMeta          |
       |                           |               |   +------------------+-------...--------+
       |                           |               |   |  <IndexesMeta>   |       ...        |
       |                           |               |   +------------------+-------...--------+
       |                           |               |
       |                           |    +-------------------------------------------------------+
       |                           |    |                      IndexesMeta                      |
       |                           |    +-------------------+------------+-------------+--...--+
       |                           |    | CompressAlgo (1B) | Count (2B) | <IndexMeta> |  ...  |
       |                           |    +-------------------+------------+-------------+--...--+
       |                           |                                           |
       |                           |                                           |
+----------------------------------+    +----------------+----------------+-----------------+------------+-----------+--------------+
|            ColumnMeta            |    |                                   IndexMeta                                               |
+-------------------+--------------+    +----------------+----------------+-----------------+------------+-----------+--------------+
| CompressAlgo (1B) | <BlocksMeta> |    | IndexType (1B) | ColumnIdx (2B) | PartOffset (4B) | Start (4B) | Size (4B) | RawSize (4B) |
+-------------------+--------------+    +----------------+----------------+-----------------+------------+-----------+--------------+
                           |
                           |
      +----------------------------------------------------+
      |                     BlocksMeta                     |
      +--------------+-------------+---...---+-------------+
      | <BlockMeta>  | <BlockMeta> |         | <BlockMeta> |
      +--------------+-------------+---...---+-------------+
                           |
                           |
      +----------------------------------------------------------+
      |                        BlockMeta                         |
      +------------------+------------+-----------+--------------+
      | PartOffset (4B)  | Start (4B) | Size (4B) | RawSize (4B) |
      +------------------+------------+-----------+--------------+
```

##### Mutation

Suppose the `MinPartSize` is `4`, which stands for `4K`. It starts with an empty segment (no parts).
Now we do the following in sequence; the part-size rule applied in each step is sketched after the list.

1. Append a block (Block0) of 3 columns, where the compressed sizes of the three columns are 3.4K, 7.1K, and 2.5K.
```
1) Calculate an optimal size for each column: 4K, 8K and 4K
2) Find unused candidate parts. Not found here.
3) Allocate new parts for these 3 columns
4) Flush these 3 columns to the specified parts
5) Flush meta
```

<img src="https://user-images.githubusercontent.com/39627130/158206129-09394ea4-4a24-4c7a-a167-bf61aa6113a7.png" height="100%" width="100%" />

2. Append a block (Block1) of 3 columns, where the compressed sizes of the three columns are 11.3K, 7.8K, and 3.5K.
```
1) Calculate an optimal size for each column: 12K, 8K and 4K
2) Find unused candidate parts. Not found here.
3) Allocate new parts for these 3 columns
4) Flush these 3 columns to the specified new parts
5) Flush meta
```
<img src="https://user-images.githubusercontent.com/39627130/158206785-e59a0212-e53c-475a-b34d-cdc0a9d781a5.png" height="100%" width="100%" />

3. There are some updates to the 3rd column of Block0, which is marked as `Block0-2`. Now we start to checkpoint the updates into the segment file. The compressed size of the updated `Block0-2` is 7.3K.
```
1) Calculate an optimal size: 8K
2) Find unused candidate parts. Not found here.
3) Allocate a new part
4) Flush to the specified new part
5) Flush meta
```

![image](https://user-images.githubusercontent.com/39627130/158207135-dfb052ab-8306-4282-9e55-d2c774b9ec29.png)

4. There are some updates to the first column of Block1, which is marked as `Block1-0`. Now we start to checkpoint the updates into the segment file. The compressed size of the updated `Block1-0` is 3.6K.
```
1) Calculate an optimal size: 4K
2) Find unused candidate parts. Unused Part(3, 1) found
3) Flush to the specified unused part
4) Flush meta
```

![image](https://user-images.githubusercontent.com/39627130/158209321-59404617-fa1b-4ac8-aecf-9f30ecffa200.png)

5. Append a block (Block2) of 3 columns, where the compressed sizes of the three columns are 2.8K, 3.1K, and 3.7K.
```
1) Calculate an optimal size for each column: 4K, 4K and 4K
2) Find unused candidate parts. Unused Part(4, 1), Part(5, 1), Part(6, 1) found
3) Flush these 3 columns to the specified unused parts
4) Flush meta
```

![image](https://user-images.githubusercontent.com/39627130/158207658-e674f419-c734-48c7-8a40-f4c2476e9ea4.png)
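
In every step above, the compressed size is rounded up to the next multiple of `MinPartSize` (3.4K to 4K, 7.1K to 8K, 11.3K to 12K), and an unused part of the target size is reused before a new part is allocated. A hedged Go sketch of that allocation rule, with a hypothetical `segment` bookkeeping type:

```go
package example

// optimalPartSize rounds a compressed size (in bytes) up to the next multiple
// of minPartSize (given in units of 1K), e.g. 3.4K -> 4K, 7.1K -> 8K, 11.3K -> 12K.
func optimalPartSize(compressedBytes, minPartSize int) int {
	unit := minPartSize * 1024
	return (compressedBytes + unit - 1) / unit * unit
}

// segment tracks unused parts by size so they can be reused before new parts
// are allocated (illustrative bookkeeping only).
type segment struct {
	unused map[int][]int // part size -> offsets of unused parts
	next   int           // offset for the next newly allocated part
}

// allocatePart prefers an unused part of exactly the requested size and only
// grows the file when none is available; the caller then flushes the meta.
func (s *segment) allocatePart(size int) (offset int) {
	if parts := s.unused[size]; len(parts) > 0 {
		offset = parts[len(parts)-1]
		s.unused[size] = parts[:len(parts)-1]
		return offset
	}
	offset = s.next
	s.next += size
	return offset
}
```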

## Buffer manager
The buffer manager is responsible for the allocation of buffer space. It handles all requests for data pages and temporary blocks of **TAE**.
1. Each page is bound to a buffer node with a unique node ID
2. A buffer node has two states:
   1) Loaded
   2) Unloaded
3. When a requestor **Pin**s a node:
   1) If the node is in the **Loaded** state, it increases the node reference count by 1 and wraps a node handle with the page address in memory
   2) If the node is in the **Unloaded** state, it first reads the page from disk|remote, increases the node reference count by 1 and wraps a node handle with the page address in memory. When there is no room left in the buffer, some victim node is unloaded to make room. The current replacement strategy is **LRU**
4. When a requestor **Unpin**s a node, it just calls **Close** on the node handle, which decreases the node reference count by 1. If the reference count drops to 0, the node becomes a candidate for eviction. A node with a reference count greater than 0 is never evicted (see the sketch below).

There are currently four buffer managers for different purposes in **TAE**:
1. Mutation buffer manager: A dedicated fixed-size buffer used by L0 transient blocks. Each block corresponds to a node in the buffer
2. SST buffer manager: A dedicated fixed-size buffer used by L1 and L2 blocks. Each column within a block corresponds to a node in the buffer
3. Index buffer manager: A dedicated fixed-size buffer used by indexes. Each block or segment index corresponds to a node in the buffer
4. Redo log buffer manager: A dedicated fixed-size buffer used by uncommitted transactions. Each transaction's local storage consists of at least one buffer node.
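
The **Pin**/**Unpin** protocol above is essentially reference counting plus lazy loading. The following Go sketch is illustrative only; `bufferNode` and its methods are not the real TAE API, and LRU eviction is left to the surrounding manager.

```go
package example

import "sync"

// bufferNode is a toy buffer node with the two states described above.
type bufferNode struct {
	mu       sync.Mutex
	loaded   bool
	refCount int
	page     []byte
}

// pin loads the page if necessary and increases the reference count.
// A real manager would evict LRU victims here when the buffer is full.
func (n *bufferNode) pin(load func() []byte) []byte {
	n.mu.Lock()
	defer n.mu.Unlock()
	if !n.loaded {
		n.page = load() // read from disk or remote
		n.loaded = true
	}
	n.refCount++
	return n.page
}

// unpin decreases the reference count; a node with refCount == 0 becomes an
// eviction candidate, and a pinned node (refCount > 0) is never evicted.
func (n *bufferNode) unpin() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.refCount--
}
```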

## LogStore
An embedded log-structured data store. It is used as the underlying driver of the **Catalog** and the **WAL**.

**TODO**

## WAL
Write-ahead logging (WAL) is the key to providing atomicity and durability. All modifications must be written to the log before being applied. In TAE, the REDO log does not need to record every write operation, but it must be recorded when the transaction is committed. Using the redo log buffer manager reduces I/O usage and avoids any I/O events for short transactions that may need to be rolled back due to various conflicts. It can also support long or large transactions.

### Log Entry

#### Entry Layout

<img src="https://user-images.githubusercontent.com/39627130/156019975-ea156716-79fa-4682-ab63-a9cbe0c07f33.png" height="40%" width="40%" />

#### Entry Header Layout
| Item | Size (Byte) | Scope | Description |
| ------------ | -------- | ----- | ------------------------------------------------------------ |
| `GroupId` | 4 | `All` | Specify the group id |
| `LSN` | 8 | `All` | Specify the log sequence number |
| `Length` | 4 | `All` | Specify the length of the entry |
| `Type` | 1 | `All` | Specify the entry type |

#### Entry Type

| Type | Datatype | Value | Description |
| ---------- | -------- | ----- | ---------------------------------------------------------- |
| `AC` | int8 | 0x10 | A committed transaction of complete write operations |
| `PC` | int8 | 0x11 | A committed transaction of partial write operations |
| `UC` | int8 | 0x12 | Partial write operations of an uncommitted transaction |
| `RB` | int8 | 0x13 | Rollback of a transaction |
| `CKP` | int8 | 0x40 | Checkpoint |

#### Transaction Log Entry
Most transactions have only one log entry; only long or large transactions may need to record multiple log entries. So the log of a transaction is either one or more `UC`-type log entries plus one `PC`-type log entry, or a single `AC`-type log entry. **TAE** assigns a dedicated group to log entries of type `UC`. Here is the transaction log of six committed transactions. <img src="https://latex.codecogs.com/svg.image?E_{2,3}" /> specifies the log entry in group 2 with LSN 3.

- <img src="https://latex.codecogs.com/svg.image?Txn_{1}=E_{2,1}\leftarrow&space;&space;E_{1,5}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{2}=E_{1,1}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{3}=E_{1,2}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{4}=E_{2,2}\leftarrow&space;&space;E_{1,4}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{5}=E_{1,3}\leftarrow&space;&space;E_{1,5}" />
- <img src="https://latex.codecogs.com/svg.image?Txn_{6}=E_{2,3}\leftarrow&space;&space;E_{2,4}\leftarrow&space;&space;E_{1,6}" />

<img src="https://user-images.githubusercontent.com/39627130/156754808-658a1f73-4635-4bd6-909f-5e930b6bcab6.png" height="100%" width="100%" />

A transaction log entry includes multiple nodes, and there are multiple types of nodes: delete info node, append info node, update info node, and so on. A node is an atomic command, which can be annotated as a sub-entry index of a committed entry. For example, there are 3 nodes in <img src="https://latex.codecogs.com/svg.image?Txn_{6}" />, which can be annotated as <img src="https://latex.codecogs.com/svg.image?\{E_{6-1},E_{6-2},E_{6-3}\}" title="\{E_{6-1},E_{6-2},E_{6-3}\}" />

##### Transaction Payload

<img src="https://user-images.githubusercontent.com/39627130/156021428-3282eb20-4e69-4ba0-a4ed-ba5409ac7b71.png" height="40%" width="40%" />

##### Meta Layout
| Item | Size (Byte) | Scope | Description |
| ------------ | -------- | -------- | ------------------------------------------------------------ |
| `Id` | 8 | `All` | Specify the transaction id |
| `InfoLen` | 2 | `All` | Specify the length of the transaction info |
| `Info` | - | `All` | Specify the transaction info |
| `PrevPtr` | 12 | `UC`,`PC`,`RB` | Specify the previous entry |
| `NodeCnt` | 4 | `All` | Specify the count of nodes |

##### Node Type
| Type | Datatype | Value | Description |
| ---------- | -------- | ----- | ------------------------------------------------------------ |
| `DML_A` | int8 | 0x1 | Append info |
| `DML_D` | int8 | 0x2 | Delete info |
| `DML_U` | int8 | 0x3 | Update info |
| `DDL_CT` | int8 | 0x4 | Create table info |
| `DDL_DT` | int8 | 0x5 | Drop table info |
| `DDL_CD` | int8 | 0x6 | Create database info |
| `DDL_DD` | int8 | 0x7 | Drop database info |
| `DDL_CI` | int8 | 0x8 | Create index info |
| `DDL_DI` | int8 | 0x9 | Drop index info |

##### Node Layout
| Item | Size (Byte) | Scope | Description |
| ------------ | -------- | -------- | ------------------------------------------------------------ |
| `Type` | 1 | `All` | Specify the type of node |
| `Len` | 4 | `All` | Specify the length of the node |
| `Buf` | - | `All` | Node payload |
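
Following the node layout table, a node can be framed as a 1-byte type, a 4-byte length and the payload. The sketch below is an assumed encoding for illustration (including the byte order), not the actual TAE codec.

```go
package example

import (
	"bytes"
	"encoding/binary"
)

// writeNode frames one transaction node as Type (1B) + Len (4B) + Buf,
// matching the node layout table above. Endianness is an assumption.
func writeNode(w *bytes.Buffer, nodeType uint8, buf []byte) error {
	if err := w.WriteByte(nodeType); err != nil {
		return err
	}
	if err := binary.Write(w, binary.BigEndian, uint32(len(buf))); err != nil {
		return err
	}
	_, err := w.Write(buf)
	return err
}
```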

#### Example
Here are 6 concurrent transactions, arranged chronologically from left to right. <img src="https://latex.codecogs.com/svg.image?W_{1-2}" title="W_{1-2}" /> specifies the second write operation of <img src="https://latex.codecogs.com/svg.image?Txn_{1}" title="Txn_{1}" />. <img src="https://latex.codecogs.com/svg.image?C_{1}" title="C_{1}" /> specifies the commit of <img src="https://latex.codecogs.com/svg.image?Txn_{1}" title="Txn_{1}" />. <img src="https://latex.codecogs.com/svg.image?A_{2}" title="A_{2}" /> specifies the rollback of <img src="https://latex.codecogs.com/svg.image?Txn_{2}" title="Txn_{2}" />.

<img src="https://user-images.githubusercontent.com/39627130/156098948-0c442ca1-6386-43a2-907e-22a55c012546.png" height="50%" width="50%" />

As mentioned in the chapter on transactions, all active transactions share a fixed-size memory space, which is managed by the buffer manager. When the remaining space is not enough, some transaction nodes will be unloaded. The first time a node is unloaded, it is saved to the redo log as a log entry, and when it is loaded again, the corresponding log entry is loaded from the redo log.

<img src="https://user-images.githubusercontent.com/39627130/156109116-eb2e5978-f8a1-491b-b13e-b14aa297d418.png" height="80%" width="80%" />

<img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" /> specifies the first transaction node of <img src="https://latex.codecogs.com/svg.image?Txn_{1}" title="Txn_{1}" />. At first, <img src="https://latex.codecogs.com/svg.image?Txn_{1}" title="Txn_{1}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{1-1}" title="W_{1-1}" /> into it.

1. <img src="https://latex.codecogs.com/svg.image?Txn_{2}" title="Txn_{2}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{2-1}" title="TN_{2-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{2-1}" title="W_{2-1}" /> into it. Add <img src="https://latex.codecogs.com/svg.image?W_{1-2}" title="W_{1-2}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" />.
2. <img src="https://latex.codecogs.com/svg.image?Txn_{3}" title="Txn_{3}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{3-1}" title="TN_{3-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{3-1}" title="W_{3-1}" /> into it.
3. <img src="https://latex.codecogs.com/svg.image?Txn_{4}" title="Txn_{4}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{4-1}" title="TN_{4-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{4-1}" title="W_{4-1}" /> into it. Add <img src="https://latex.codecogs.com/svg.image?W_{2-2}" title="W_{2-2}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{2-1}" title="TN_{2-1}" />. Up to now, no log entry has been generated.

<img src="https://user-images.githubusercontent.com/39627130/156114351-54b40ca9-358a-4df3-b46c-17ea583c963b.png" height="80%" width="80%" />

4. <img src="https://latex.codecogs.com/svg.image?Txn_{5}" title="Txn_{5}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{5-1}" title="TN_{5-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{5-1}" title="W_{5-1}" /> into it.
5. <img src="https://latex.codecogs.com/svg.image?Txn_{6}" title="Txn_{6}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{6-1}" title="TN_{6-1}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{6-1}" title="W_{6-1}" /> into it. Add <img src="https://latex.codecogs.com/svg.image?W_{3-2}" title="W_{3-2}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{3-1}" title="TN_{3-1}" />. Add <img src="https://latex.codecogs.com/svg.image?W_{2-3}" title="W_{2-3}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{2-1}" title="TN_{2-1}" />. Add <img src="https://latex.codecogs.com/svg.image?C_{5}" title="C_{5}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{5-1}" title="TN_{5-1}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{5-1}" title="TN_{5-1}" />. Add <img src="https://latex.codecogs.com/svg.image?C_{4}" title="C_{4}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{4-1}" title="TN_{4-1}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{4-1}" title="TN_{4-1}" />.
6. Unregister <img src="https://latex.codecogs.com/svg.image?TN_{4-1}" title="TN_{4-1}" /> and <img src="https://latex.codecogs.com/svg.image?TN_{5-1}" title="TN_{5-1}" /> from the buffer manager. Before adding <img src="https://latex.codecogs.com/svg.image?W_{3-3}" title="W_{3-3}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{3-1}" title="TN_{3-1}" />, there is no space left. <img src="https://latex.codecogs.com/svg.image?TN_{2-1}" title="TN_{2-1}" /> is chosen as a candidate for eviction and is unloaded as a log entry. Add <img src="https://latex.codecogs.com/svg.image?W_{3-3}" title="W_{3-3}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{3-1}" title="TN_{3-1}" />. <img src="https://latex.codecogs.com/svg.image?Txn_{2}" title="Txn_{2}" /> registers a transaction node <img src="https://latex.codecogs.com/svg.image?TN_{2-2}" title="TN_{2-2}" /> in the buffer manager and adds <img src="https://latex.codecogs.com/svg.image?W_{2-4}" title="W_{2-4}" /> into it. Add <img src="https://latex.codecogs.com/svg.image?C_{1}" title="C_{1}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{1-1}" title="TN_{1-1}" />. Add <img src="https://latex.codecogs.com/svg.image?C_{6}" title="C_{6}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{6-1}" title="TN_{6-1}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{6-1}" title="TN_{6-1}" />. Add <img src="https://latex.codecogs.com/svg.image?W_{2-5}" title="W_{2-5}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{2-2}" title="TN_{2-2}" />. Add <img src="https://latex.codecogs.com/svg.image?A_{2}" title="A_{2}" /> into <img src="https://latex.codecogs.com/svg.image?TN_{2-2}" title="TN_{2-2}" /> and create a log entry for <img src="https://latex.codecogs.com/svg.image?TN_{2-2}" title="TN_{2-2}" /> (for a rolled-back transaction, a log entry is only necessary if some of the previous operations have been persisted).

### Checkpoint
Typically, a checkpoint is a safe point from which the state machine can start applying log entries during restart. Entries before the checkpoint are no longer needed and will be physically destroyed at the appropriate time. A checkpoint can represent the equivalent of the collection of data in the range it indicates. For example, <img src="https://latex.codecogs.com/svg.image?CKP_{LSN=11}(-\infty,&space;10]" title="CKP_{LSN=11}(-\infty, 10]" /> is the equivalent of the entries from <img src="https://latex.codecogs.com/svg.image?E_{LSN=1}" title="E_{LSN=1}" /> to <img src="https://latex.codecogs.com/svg.image?E_{LSN=10}" title="E_{LSN=10}" />, and entries in that range are no longer needed. On restart, replay starts from the last checkpoint log entry <img src="https://latex.codecogs.com/svg.image?CKP_{LSN=11}(-\infty,&space;10]" title="CKP_{LSN=11}(-\infty, 10]" />.

<img src="https://user-images.githubusercontent.com/39627130/156356421-9bc93fb8-6e11-4e2c-aad0-e6772f4b6cf0.png" height="100%" width="100%" />

However, the checkpoint in **TAE** is different:

<img src="https://user-images.githubusercontent.com/39627130/156372311-1cee455b-ab97-406f-b6fe-abb14c097917.png" height="100%" width="100%" />

#### Dedicated Group
The log of **TAE** has different groups, and the LSN of each group is continuously and monotonically increasing. There is a dedicated group for checkpoint log entries.

#### Fuzzy Checkpoint
The range indicated by a typical checkpoint is always a continuous interval from the minimum value up to a certain LSN, such as <img src="https://latex.codecogs.com/svg.image?(-\infty,&space;4]" title="(-\infty, 4]" /> or <img src="https://latex.codecogs.com/svg.image?(-\infty,&space;10]" title="(-\infty, 10]" />. The interval of **TAE**, in contrast, does not need to be continuous, e.g. <img src="https://latex.codecogs.com/svg.image?\{[1,4],&space;[6,8]\}" title="\{[1,4], [6,8]\}" />. Furthermore, given that each committed entry is a collection of multiple subcommands, each checkpoint should be a collection of subcommand indexes <img src="https://latex.codecogs.com/svg.image?\{[E_{1},E_{4}],&space;\{E_{5-1}\},&space;[E_{6},E_{8}]\}" title="\{[E_{1},E_{4}], \{E_{5-1}\}, [E_{6},E_{8}]\}" />
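
Because a fuzzy checkpoint is a set of LSN intervals plus individual sub-command indexes rather than a single prefix, deciding whether an entry is already covered means testing membership in that set. A hypothetical Go sketch of such a structure:

```go
package example

// lsnRange is a closed interval of LSNs covered by a fuzzy checkpoint.
type lsnRange struct{ from, to uint64 }

// fuzzyCheckpoint covers whole entries via ranges and individual sub-commands
// of partially covered entries, e.g. {[1,4], {5-1}, [6,8]}.
type fuzzyCheckpoint struct {
	ranges      []lsnRange
	subCommands map[uint64][]uint32 // LSN -> covered sub-command indexes
}

// covers reports whether a whole entry is covered by the checkpoint.
func (c *fuzzyCheckpoint) covers(lsn uint64) bool {
	for _, r := range c.ranges {
		if lsn >= r.from && lsn <= r.to {
			return true
		}
	}
	return false
}
```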

## Catalog
**Catalog** is TAE's in-memory metadata manager. It manages all states of the engine, and its underlying driver is an embedded LogStore. The Catalog can be fully replayed from the underlying LogStore.
1. Storage layout info
2. Database and table schema info
3. DDL operation info

### Yet Another Database-Database
**Catalog** is the database of **TAE** metadata. As a database, it has the following characteristics:

1. In-memory database (data and indexes)
2. Row-oriented database
3. MVCC
4. RR isolation
5. DDL like `CREATE|DROP TABLE, CREATE|DROP DATABASE` is not supported. There are only a few built-in tables, and they cannot be deleted.

### Built-in Tables
The hierarchical relationship from top to bottom is: **Database**, **Table**, **Segment**, **Block**. **Catalog** is responsible for managing these resources and supports addition, deletion, modification and query.

<img src="https://user-images.githubusercontent.com/39627130/145529173-1c6ad8eb-84e2-4d7e-a49a-9085153f3436.png" height="60%" width="60%" />

**Catalog** creates a table for each resource, corresponding to **Database Entry**, **Table Entry**, **Segment Entry** and **Block Entry**.

<img src="https://user-images.githubusercontent.com/39627130/156906206-c1d905ef-ad5e-4c42-93cf-a38174dfd7be.png" height="80%" width="80%" />

### Transactional Operation
A **Catalog** transaction is actually a sub-transaction of the engine transaction. Each table has an in-memory primary key index, and each node in the index corresponds to a table row. Once there is any update on a row, a version chain is created for it.

<img src="https://user-images.githubusercontent.com/39627130/156907205-01ffab4b-5f44-4400-9d3b-5def876e7d49.png" height="50%" width="50%" />

#### Insert
1. Search the primary key index; if the same primary key exists, return a Duplicate error
2. Create a new index node (uncommitted)

#### Delete
1. Search the primary key index; if not found, return a NotFound error
2. If the row was already deleted (committed), return a NotFound error
3. If there is no version chain on the row, create a version chain, insert a delete node into the chain and return
4. Scan the version chain. If a delete node is found:
   - If the delete node is committed, return a NotFound error
   - If the delete node is uncommitted and belongs to a different transaction, return a `W-W` conflict error; otherwise return a NotFound error
5. Insert a delete node into the chain and return

#### Update
1. Search the primary key index; if not found, return a NotFound error
2. If the row was already deleted (committed), return a NotFound error
3. If there is no version chain on the row, create a version chain, insert an update node into the chain and return
4. Scan the version chain:
   - If a delete node is found
     - If the delete node is committed, return a NotFound error
     - If the delete node is uncommitted and belongs to a different transaction, return a `W-W` conflict error; otherwise return a NotFound error
   - If an update node is found
     - If the update node is uncommitted, return a `W-W` conflict error
5. Insert an update node into the chain and return

#### Query
1. Search the primary key index; if not found, return a NotFound error
2. If the row was already deleted (committed), return a NotFound error
3. If there is no version chain on the row:
   - If the row is uncommitted
     - If it belongs to the same transaction, return the row value
     - If it belongs to a different transaction, return a NotFound error
   - If the row is committed, return the row value
4. Scan the version chain:
   - If a delete node is found
     - If the delete node was committed before the query transaction started, return a NotFound error
     - If the delete node is uncommitted and belongs to the same transaction, return a NotFound error
   - If an update node is found
     - If the update node was committed before the query transaction started, return the row value
     - If the update node is uncommitted and belongs to the same transaction, return the row value
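
The Insert/Delete/Update/Query rules above all reduce to walking a row's version chain and comparing commit timestamps with the reading transaction. The Go sketch below captures only that visibility check under simplified, assumed types; it is not the Catalog implementation.

```go
package example

// versionNode is one node in a row's version chain (toy model).
type versionNode struct {
	txnID    uint64 // owner transaction
	commitTS uint64 // 0 means still uncommitted
	isDelete bool
	prev     *versionNode
}

// visibleVersion walks the chain from newest to oldest and returns the first
// node visible to a reader with start timestamp startTS and id txnID.
// It returns nil if the row is deleted or has no visible version
// (boundary handling at startTS is simplified).
func visibleVersion(head *versionNode, txnID, startTS uint64) *versionNode {
	for n := head; n != nil; n = n.prev {
		ownChange := n.commitTS == 0 && n.txnID == txnID
		committedBeforeRead := n.commitTS != 0 && n.commitTS <= startTS
		if !ownChange && !committedBeforeRead {
			continue // invisible: uncommitted by others, or committed after we started
		}
		if n.isDelete {
			return nil // a visible delete hides the row (NotFound)
		}
		return n
	}
	return nil
}
```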

#### Commit & Rollback
1. All uncommitted changes are stored in the transaction's in-memory store
2. All rows read in the transaction are recorded in the txn store
3. Commit pipeline
   - Prepare:
     - Check R-W anti-dependency. If violated, rollback
     - Update the commit info of all related uncommitted rows
   - Commit: flush and update the visible version
   - Rollback: remove all uncommitted rows

#### Checkpoint
Any operation within a sub-transaction corresponds to a sub-command in the engine's transaction log. So a commit or rollback of a **Catalog** transaction is a checkpoint of some sub-commands in the engine's transaction log.

#### Compaction
1. In-memory version chain pruning
2. Hard delete previously soft-deleted rows
3. Compact disk data

## Database (Column Families)
In **TAE**, a **Table** is a **Column Family** while a **Database** is a set of **Column Families**. The main idea behind **Column Families** is that they share the write-ahead log (they share a **Log Space**), so that we can implement **Database-level** atomic writes. The old **WAL** cannot be compacted when the mutable buffer of a **Table** is flushed, since it may contain live data from other **Tables**. It can only be compacted when the mutable buffers of all related **Tables** have been flushed.

**TAE** supports multiple **Databases**; that is, one **TAE** instance can work with multiple **Log Spaces**. Our **MatrixOne** DBMS is built upon multi-raft and each node only needs one **TAE** engine, with each raft group corresponding to a **Database**. It is complicated, and what makes it more complicated is that the engine shares the external **WAL** with the **Raft** log.

## Multi-Version Concurrency Control (MVCC)
**TAE** uses MVCC to provide snapshot isolation (SI) for individual transactions. For SI, the consistent read view of a transaction is determined by the transaction start time, so that data read within the transaction never reflects changes made by other simultaneous transactions. For example, for <img src="https://latex.codecogs.com/svg.image?Txn-2" title="Txn-2" />, the read view includes <img src="https://latex.codecogs.com/svg.image?[seg-2,&space;seg-4]" title="[seg-2, seg-4]" />, and the more fine-grained, block-level read view includes <img src="https://latex.codecogs.com/svg.image?[blk2-1,&space;blk2-3,&space;blk2-5,&space;blk4-1,&space;blk4-3,&space;blk4-5]" title="[blk2-1, blk2-3, blk2-5, blk4-1, blk4-3, blk4-5]" />.

<img src="https://user-images.githubusercontent.com/39627130/155067474-3303ef05-ad1c-4a4c-9bf5-50cb15e41c63.png" height="60%" width="60%" />

**TAE** provides value-level fine-grained optimistic concurrency control: only updates to the same row and same column conflict. A transaction uses the value versions that exist when the transaction begins, and no locks are placed on the data when it is read. When two transactions attempt to update the same value, the second transaction will fail due to a write-write conflict.
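
A transaction therefore fails at commit when another transaction has already committed a newer version of the same (row, column) value after it started, or still holds an uncommitted update on it. A hypothetical Go sketch of that value-level check:

```go
package example

import "errors"

var errWWConflict = errors.New("w-w conflict")

// columnUpdate describes the latest update to one (row, column) value.
type columnUpdate struct {
	txnID    uint64
	commitTS uint64 // 0 means uncommitted
}

// checkWW returns an error if writing the same value would conflict with
// another transaction, following the value-level rule described above.
func checkWW(latest *columnUpdate, txnID, startTS uint64) error {
	if latest == nil || latest.txnID == txnID {
		return nil // no previous writer, or our own earlier write
	}
	if latest.commitTS == 0 || latest.commitTS > startTS {
		return errWWConflict // uncommitted by others, or committed after we started
	}
	return nil
}
```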

### Read View

In **TAE**, a table includes multiple segments. A segment is the result of the combined action of multiple transactions, so a segment can be represented as <img src="https://latex.codecogs.com/svg.image?[T_{start},&space;T_{end}]" title="[T_{start}, T_{end}]" /> (<img src="https://latex.codecogs.com/svg.image?T_{start}" title="T_{start}" /> is the commit time of the oldest transaction while <img src="https://latex.codecogs.com/svg.image?T_{end}" /> is the commit time of the newest). Since a segment can be compacted into a new segment and segments can be merged into a new segment, we need to add a dimension to the segment representation to distinguish versions: <img src="https://latex.codecogs.com/svg.image?([T_{start},T_{end}],&space;[T_{create},T_{drop}])" title="([T_{start},T_{end}], [T_{create},T_{drop}])" /> (<img src="https://latex.codecogs.com/svg.image?T_{create}" title="T_{create}" /> is the segment create time while <img src="https://latex.codecogs.com/svg.image?T_{drop}" title="T_{drop}" /> is the segment drop time). <img src="https://latex.codecogs.com/svg.image?T_{drop}&space;=&space;0" title="T_{drop} = 0" /> means the segment is not dropped. The block representation is the same as that of the segment: <img src="https://latex.codecogs.com/svg.image?([T_{start},T_{end}],&space;[T_{create},T_{drop}])" title="([T_{start},T_{end}], [T_{create},T_{drop}])" />.

A transaction can be represented as <img src="https://latex.codecogs.com/svg.image?[Txn_{start},&space;Txn_{commit}]" title="[Txn_{start}, Txn_{commit}]" /> (<img src="https://latex.codecogs.com/svg.image?Txn_{start}" /> is the transaction start time while <img src="https://latex.codecogs.com/svg.image?Txn_{commit}" /> is the commit time). The read view of a transaction can be determined by the following formula:

<img src="https://latex.codecogs.com/svg.image?(Txn_{start}&space;\geqslant&space;T_{create})&space;\bigcap&space;((T_{drop}=&space;0)\bigcup&space;(T_{drop}>Txn_{start}))" title="(Txn_{start} \geqslant T_{create}) \bigcap ((T_{drop} = 0)\bigcup (T_{drop}>Txn_{start}))" />

When a transaction is committed, it is necessary to obtain a read view related to the commit time for deduplication:

<img src="https://latex.codecogs.com/svg.image?(Txn_{commit}&space;\geqslant&space;T_{create})&space;\bigcap&space;((T_{drop}=&space;0)\bigcup&space;(T_{drop}>Txn_{commit}))" title="(Txn_{commit} \geqslant T_{create}) \bigcap ((T_{drop}= 0)\bigcup (T_{drop}>Txn_{commit}))" />

For example, the read view of <img src="https://latex.codecogs.com/svg.image?Txn-2" title="Txn-2" /> includes <img src="https://latex.codecogs.com/svg.image?[seg1]" title="[seg1]" /> while the read view during commit includes <img src="https://latex.codecogs.com/svg.image?[seg1,seg2,seg3]" title="[seg1,seg2,seg3]" />.

<img src="https://user-images.githubusercontent.com/39627130/154995795-2c367e33-bafa-4e47-812d-f80d82594613.png" height="100%" width="100%" />

The block read view is similar to the segment read view.
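
The two formulas translate directly into a visibility predicate over a segment's or block's `(Tcreate, Tdrop)` pair; the only difference is which timestamp is passed in. A small Go rendering (the timestamp type is an assumption):

```go
package example

// visibleAt implements the read-view formula above:
// ts >= Tcreate AND (Tdrop == 0 OR Tdrop > ts).
// Pass the transaction start timestamp for reads, or the commit timestamp
// when building the deduplication read view at commit time.
func visibleAt(ts, createTS, dropTS uint64) bool {
	return ts >= createTS && (dropTS == 0 || dropTS > ts)
}
```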
<img src="https://latex.codecogs.com/svg.image?Block1_{L_{0}}" title="Block1_{L_{0}}" /> starts to sort @ <img src="https://latex.codecogs.com/svg.image?t_{11}" title="t_{11}" />,and its block read view is the baseline plus an uncommitted update node, which will be skipped. Sort and persist a block may take a long time. There are two committed transactions <img src="https://latex.codecogs.com/svg.image?\{Txn5,Txn6\}" title="\{Txn5,Txn6\}" /> and one uncommitted <img src="https://latex.codecogs.com/svg.image?\{Txn7\}" title="\{Txn7\}" /> before commiting sorted <img src="https://latex.codecogs.com/svg.image?Block2_{L_{1}}" title="Block2_{L_{1}}" />. When commiting <img src="https://latex.codecogs.com/svg.image?\{Txn7\}" title="\{Txn7\}" /> @ <img src="https://latex.codecogs.com/svg.image?t_{16}" title="t_{16}" />, it will fail because <img src="https://latex.codecogs.com/svg.image?Block1_{L_{0}}" title="Block1_{L_{0}}" /> has been terminated. Update nodes <img src="https://latex.codecogs.com/svg.image?\{Txn5,Txn6\}" title="\{Txn5,Txn6\}" /> that were committed in between <img src="https://latex.codecogs.com/svg.image?(t_{11},&space;t_{16})" title="(t_{11}, t_{16})" /> will be merged into a new update node and it will be committed together with <img src="https://latex.codecogs.com/svg.image?Block2_{L_{1}}" title="Block2_{L_{1}}" /> @ <img src="https://latex.codecogs.com/svg.image?t_{16}" title="t_{16}" />. 497 498 ![image](https://user-images.githubusercontent.com/39627130/155317195-483f7b67-48b1-4474-8555-315805492204.png) 499 500 #### Compaction As A Transactional Schema Change 501 A compaction is the termination of a series of blocks or segments, while atomically creating a new one (building index). It usually takes a long time compared to normal transactions and we don't want to block update or delete transactions on involved blocks or segments. Here we extend the content of the **read view** to include the metadata of blocks and segments into it. When commiting a normal transaction, once it is detected that the metadata of blocks (segments) corresponding to write operation has been changed (committed), it will fail. 502 503 For a compaction transaction, write operations include block (segment) soft deletion and addition. During the execution of the transaction, each write will detect a write-write conflict. Once there is a conflict, the transaction will be terminated in advance. 504 505 ## Transaction 506 A tuple <img src="https://latex.codecogs.com/svg.image?(T_{start},T_{commit})" title="(T_{start},T_{commit})" /> is the representation of a transaction where both elements are of type `uint64`, and <img src="https://latex.codecogs.com/svg.image?T_{start}" /> can be regarded as <img src="https://latex.codecogs.com/svg.image?TxnId" />. The gloabl timestamp is monotonically increasing continuously from 1, and after restarting, the previous value needs to be restored as a new starting point. When starting a transaction, the global timestamp is assigned to the transaction's <img src="https://latex.codecogs.com/svg.image?T_{start}" /> and incremented by 1. When a transaction is committed, the global timestamp is also assigned to its <img src="https://latex.codecogs.com/svg.image?T_{commit}" /> and incremented by 1. 507 508 ### Preprocessing 509 510 A transaction usually consists of multiple commands, and each command is usually trivial. When committing a transaction, we want to be able to preprocess some commands ahead of time. It is a requirement of the fuzzy checkpoint mechanism. 

### Preprocessing

A transaction usually consists of multiple commands, and each command is usually trivial. When committing a transaction, we want to be able to preprocess some commands ahead of time. This is a requirement of the fuzzy checkpoint mechanism.

#### Command
Here are all command types:

| Command | Datatype | Value | Description |
| ---------- | -------- | ----- | ---------------------------------------------------------- |
| `CREATE_DB` | int8 | 0x01 | Create a database |
| `DELETE_DB` | int8 | 0x02 | Delete a database |
| `CREATE_TABLE` | int8 | 0x13 | Create a table |
| `UPDATE_TABLE` | int8 | 0x14 | Update a table |
| `DELETE_TABLE` | int8 | 0x15 | Delete a table |
| `INSERT` | int8 | 0x30 | Insert rows |
| `UPDATE_COMMITTED` | int8 | 0x32 | Update a committed value |
| `DELETE_LOCAL` | int8 | 0x33 | Delete a row in the transaction local store |
| `DELETE_COMMITTED` | int8 | 0x34 | Delete a committed row |

#### Split Commands
The raw command list <img src="https://latex.codecogs.com/svg.image?&space;&space;&space;CMD_{1}^{t_{1}}&space;\to&space;CMD_{2}^{t_{2}}\to&space;CMD_{3}^{t_{3}}&space;\to&space;...&space;\to&space;CMD_{n}^{t_{n}}" title=" CMD_{1}^{t_{1}} \to CMD_{2}^{t_{2}}\to CMD_{3}^{t_{3}} \to ... \to CMD_{n}^{t_{n}}" /> of a transaction is in the order accepted. <img src="https://latex.codecogs.com/svg.image?n" title="n" /> is the command sequence and <img src="https://latex.codecogs.com/svg.image?t_{n}" title="t_{n}" /> is the command type. **Split** splits the command list into several lists, which improves parallelism during commit. **Split** is table-bounded: all commands applied to the same table are grouped into the same list.

#### Merge Commands
**Merge** transforms a command list <img src="https://latex.codecogs.com/svg.image?&space;&space;&space;CMD_{1}^{t_{1}}&space;\to&space;CMD_{2}^{t_{2}}\to&space;CMD_{3}^{t_{3}}&space;\to&space;...&space;\to&space;CMD_{n}^{t_{n}}" title=" CMD_{1}^{t_{1}} \to CMD_{2}^{t_{2}}\to CMD_{3}^{t_{3}} \to ... \to CMD_{n}^{t_{n}}" /> into a new list <img src="https://latex.codecogs.com/svg.image?&space;&space;CMD_{1}^{t_{1}}&space;\to&space;CMD_{2}^{t_{2}}\to&space;CMD_{3}^{t_{3}}&space;\to&space;...&space;\to&space;CMD_{m}^{t_{m}}&space;" title=" CMD_{1}^{t_{1}} \to CMD_{2}^{t_{2}}\to CMD_{3}^{t_{3}} \to ... \to CMD_{m}^{t_{m}} " />, where <img src="https://latex.codecogs.com/svg.image?m&space;<&space;n" title="m < n" />.

##### Rules
- Some types of commands cannot be merged
```
CREATE_DB, DELETE_DB, CREATE_TABLE, DELETE_TABLE
```
- Commands of different types can be merged
```
1. INSERT + INSERT => INSERT
2. INSERT + DELETE_LOCAL => INSERT
```
- Command dependency: prepend all `DELETE_COMMITTED` commands
```
INSERT, INSERT, DELETE_COMMITTED, INSERT => DELETE_COMMITTED, INSERT
```

##### Example
<img src="https://user-images.githubusercontent.com/39627130/157154690-d66b71f2-15c5-488d-867c-ec2604ab3c5d.png" height="90%" width="90%" />
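
A hedged Go sketch of the merge rules for one table's command list: `DELETE_COMMITTED` commands are collected and prepended, consecutive `INSERT`s are concatenated, and a `DELETE_LOCAL` is folded into the accumulated insert. The command model (integer row ids as payload) is a simplification for illustration, not the real command format.

```go
package example

// cmdType mirrors the command table above (values are illustrative in Go).
type cmdType int8

const (
	cmdInsert cmdType = iota
	cmdDeleteLocal
	cmdDeleteCommitted
	// DDL commands are never merged and are omitted here.
)

type command struct {
	typ  cmdType
	rows []int // toy payload: affected row ids
}

// mergeCommands applies the rules above to one table's command list.
func mergeCommands(cmds []command) []command {
	deletes := command{typ: cmdDeleteCommitted}
	insert := command{typ: cmdInsert}
	for _, c := range cmds {
		switch c.typ {
		case cmdDeleteCommitted:
			deletes.rows = append(deletes.rows, c.rows...)
		case cmdInsert:
			insert.rows = append(insert.rows, c.rows...)
		case cmdDeleteLocal:
			insert.rows = removeRows(insert.rows, c.rows)
		}
	}
	out := []command{}
	if len(deletes.rows) > 0 {
		out = append(out, deletes) // DELETE_COMMITTED commands come first
	}
	if len(insert.rows) > 0 {
		out = append(out, insert)
	}
	return out
}

// removeRows drops the given row ids from rows (toy helper).
func removeRows(rows, drop []int) []int {
	del := map[int]bool{}
	for _, r := range drop {
		del[r] = true
	}
	kept := rows[:0]
	for _, r := range rows {
		if !del[r] {
			kept = append(kept, r)
		}
	}
	return kept
}
```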

### DML
#### INSERT
##### Uncommitted
All inserted data is stored in transaction local storage before commit. Data in transaction local storage is grouped into tables, and an insert request inserts data into the target table. The data for each table contains one or more batches. If the amount of data in a request is particularly large, it is split into multiple batches: <img src="https://latex.codecogs.com/svg.image?request_{rows=n+20}&space;\overset{split}{\rightarrow}&space;\{batch_{rows=20},&space;batch_{rows=n}\}" />.

Suppose the maximum number of rows in a batch is 10. Here are the commands in a transaction:

```
1. INSERT 15 ROWS INTO TABLE1
```
- <img src="https://latex.codecogs.com/svg.image?request_{rows=15}^{1}&space;\overset{split}{\rightarrow}&space;\{batch_{rows=10}^{1},&space;batch_{rows=5}^{2}\}&space;" title="request_{rows=15}^{1} \overset{split}{\rightarrow} \{batch_{rows=10}^{1}, batch_{rows=5}^{2}\} " />
- <img src="https://latex.codecogs.com/svg.image?batch_{rows=10}^{1}&space;\overset{append}{\rightarrow}&space;Batch_{}^{1}" title="batch_{rows=10}^{1} \overset{append}{\rightarrow} Batch_{}^{1}" />, <img src="https://latex.codecogs.com/svg.image?batch_{rows=5}^{2}&space;\overset{append}{\rightarrow}&space;Batch_{}^{2}" title="batch_{rows=5}^{2} \overset{append}{\rightarrow} Batch_{}^{2}" />
- <img src="https://latex.codecogs.com/svg.image?Table_{}^{1}&space;=&space;\{Batch_{}^{1},&space;Batch_{}^{2}\}" title="Table_{}^{1} = \{Batch_{}^{1}, Batch_{}^{2}\}" />

```
2. INSERT 4 ROWS INTO TABLE2
```
- <img src="https://latex.codecogs.com/svg.image?request_{rows=4}^{2}&space;\overset{split}{\rightarrow}&space;\{batch_{rows=4}^{1}\}" title="request_{rows=4}^{2} \overset{split}{\rightarrow} \{batch_{rows=4}^{1}\}" />
- <img src="https://latex.codecogs.com/svg.image?batch_{rows=4}^{1}&space;\overset{append}{\rightarrow}&space;Batch_{}^{3}" title="batch_{rows=4}^{1} \overset{append}{\rightarrow} Batch_{}^{3}" />
- <img src="https://latex.codecogs.com/svg.image?Table_{}^{2}&space;=&space;\{Batch_{}^{3}\}" title="Table_{}^{2} = \{Batch_{}^{3}\}" />

```
3. INSERT 5 ROWS INTO TABLE1
```
- <img src="https://latex.codecogs.com/svg.image?request_{rows=5}^{3}&space;\overset{split}{\rightarrow}&space;\{batch_{rows=5}^{1}\}" title="request_{rows=5}^{3} \overset{split}{\rightarrow} \{batch_{rows=5}^{1}\}" />
- <img src="https://latex.codecogs.com/svg.image?batch_{rows=5}^{1}&space;\overset{append}{\rightarrow}&space;Batch_{}^{2}" title="batch_{rows=5}^{1} \overset{append}{\rightarrow} Batch_{}^{2}" />
- <img src="https://latex.codecogs.com/svg.image?Table_{}^{1}&space;=&space;\{Batch_{}^{1},&space;Batch_{}^{2}\}" title="Table_{}^{1} = \{Batch_{}^{1}, Batch_{}^{2}\}" />

```
Txn Local Storage
   |
   |--- Table1
   |     |
   |     |--- Batch1
   |     |--- Batch2
   |
   |--- Table2
         |
         |--- Batch3
```

##### Committed

<img src="https://user-images.githubusercontent.com/39627130/156866455-e9dff497-21a9-463b-be66-8f8f564544d9.png" height="80%" width="80%" />

#### UPDATE & DELETE
##### UPDATE & DELETE INSERTED DATA
If any deletes are applied to a batch in the transaction local store, a deletion bitmap is created for the batch. Any update is transformed into a delete plus an insert.

##### UPDATE & DELETE COMMITTED DATA
A delete history is created for blocks with any **DELETE** operation.
```go
type UpdateEntry struct {
	txn     *TxnEntry
	indexes []*LogIndex
}

type DeleteHistory struct {
	entries map[int32]*UpdateEntry
}
```

An **UPDATE** of the primary key is converted into a **DELETE** plus an **INSERT**, so the **UPDATE** we are talking about here is always an update to a non-primary-key column.
```go
type UpdateNode struct {
	UpdateEntry
	values map[int32]interface{}
}
```

<img src="https://user-images.githubusercontent.com/39627130/156788303-4fd2a2e4-6975-493e-8b5f-16f156fdc9dc.png" height="80%" width="80%" />

##### Version Chain

<img src="https://user-images.githubusercontent.com/39627130/159515818-ddbd5317-0e15-4d3e-8c49-a7edc0ebaef1.png" height="70%" width="70%" />

**TODO**

#### DDL
A transaction usually contains multiple **DDL** and **DML** statements. As mentioned in [Catalog](#Catalog), **Catalog** has its own transaction mechanism, and a **TAE** transaction contains both **DDL** and **DML**, so we treat the **Catalog** transaction as a sub-transaction of the **TAE** transaction.

<img src="https://user-images.githubusercontent.com/39627130/157159731-4b7015cf-621a-4436-90e4-cc9d5d930412.png" height="70%" width="70%" />

All **DDL** operations correspond to **Catalog** **DML** operations; see the [corresponding chapter](#Catalog) for details.

### Commit
As mentioned earlier, the data of a transaction is grouped by table, and each group of data is a combination of the following data types:
- <img src="https://latex.codecogs.com/svg.image?Batch_{}^{i}" title="Batch_{}^{i}" />. The i-th uncommitted batch
- <img src="https://latex.codecogs.com/svg.image?Bitmap_{}^{i}" title="Bitmap_{}^{i}" />. The delete bitmap of <img src="https://latex.codecogs.com/svg.image?Batch_{}^{i}" title="Batch_{}^{i}" />
- <img src="https://latex.codecogs.com/svg.image?DeleteNode_{blk}" title="DeleteNode_{blk}" />. The delete node of a committed block
- <img src="https://latex.codecogs.com/svg.image?UpdateNode_{blk}^{col}" title="UpdateNode_{blk}^{col}" />. The update node of a committed column block
- `CREATE_TABLE`
- `DROP_TABLE`

When committing a transaction, all combinations can be summed up in the following patterns:
1. **P1**: Only inserts
2. **P2**: Only deletes or updates to committed data
3. **P3**: Inserts, and deletes or updates to the inserted data
4. **P4**: Inserts, and deletes or updates to committed data
5. **P5**: Only DDL
6. **P6**: DDL and others

#### P1
1. Wrap each batch as a command with a sequence number
2. Append the batch to the statemachine with the annotated command context

#### P2
1. Wrap each update|delete node as a command with a sequence number
2. Update the update|delete node commit info

#### P3
1. If there is a delete bitmap for a batch, apply the delete bitmap first to generate a new batch
2. Same as [P1](#P1)

#### P4
1. Process deletes|updates first. Same as [P2](#P2)
2. Process inserts. Same as [P1](#P1)

#### P5
1. Wrap a command with a sequence number
2. Update the commit info

#### P6
1. All operations on a table before a `DROP TABLE` can be eliminated. Delete all related uncommitted batches and delete|update nodes, then proceed as in [P5](#P5)
2. `CREATE TABLE` should always be the first command, unless there is a `DROP TABLE` later

#### Commit Pipeline
**TODO**

### Schema Change
**TODO**

## Snapshot
**TODO**

## Split
**TODO**

## GC
1. Metadata compaction
   1) In-memory version chain compaction
   2) In-memory hard deleted metadata entry compaction
   3) Persisted data compaction
2. Stale data deletion
   1) Table data with a reference count equal to 0
   2) Log compaction

**TODO**