- Feature Name: Analytic Optimized Engine Overall Design
- Status: In Progress
- Start Date: 2021-05-10
- Authors: [Xu Peng](https://github.com/XuPeng-SH)
- Implementation PR: [#1335](https://github.com/matrixorigin/matrixone/pull/1335)
- Issue for this RFC: [#1320](https://github.com/matrixorigin/matrixone/pull/1320)

# Summary

**AOE** (Analytic Optimized Engine) is a storage engine designed for analytical query workloads. It can serve as the underlying storage engine of a database management system (DBMS) for online analytical processing (OLAP). Its design goals are:
- Cloud native
- Efficient processing of small and large inserts
- State-of-the-art data layout for vectorized execution engines
- Fine-grained, efficient and convenient memory management
- Convenient support for new index types, with dedicated index cache management
- Rigorous data management for concurrent queries and mutations
- No external dependencies

# Guide-level design

## Terms
### Layout
- **Block**: A piece of a segment and the smallest unit of table data. The maximum number of rows in a block is fixed.
- **Segment**: A piece of a table, composed of blocks. The maximum number of blocks in a segment is fixed.
- **Table**: A piece of a database, composed of segments.
- **Database**: A group of tables that share the same log space.

### State
- **Transient Block**: A block whose row count has not reached the upper limit, or a full block that is queued to be sorted and flushed.
- **Persisted Block**: A block that has been sorted and flushed.
- **Unclosed Segment**: A segment that has not been merge sorted.
- **Closed Segment**: A segment that has been merge sorted.

### Container
- **Vector**: A fragment of a column's data in memory.
- **Batch**: A group of vectors in which every vector has the same number of rows.

### Misc
- **Log space**: A raft group can be regarded as a log space.

## Data storage
### Table
**AOE** stores data as tables. Each table is bound to a schema that consists of a number of column definitions. Table data is organized as a log-structured merge-tree (LSM tree).

Currently, **AOE** uses a three-level LSM tree with levels L0, L1 and L2. L0 is small and can reside entirely in memory, whereas L1 and L2 always reside on disk. L0 consists of transient blocks and L1 consists of sorted blocks. Incoming data is always inserted into the latest transient block. If an insertion causes the block to exceed the maximum row count of a block, the block is sorted by primary key and flushed into L1 as a sorted block. If the number of sorted blocks exceeds the maximum block count of a segment, the segment is sorted by primary key using merge sort.
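
The flow from L0 to L2 can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the engine's code: the `Table` type, the `Append` method and the two limits are hypothetical, the table has a single integer primary key column, a level transition fires as soon as a limit is reached, and a plain sort stands in for a real merge sort.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical limits; the RFC only states that both maxima are fixed.
const (
	maxRowsPerBlock     = 4
	maxBlocksPerSegment = 2
)

// Table is a toy single-column table keyed by an int primary key.
type Table struct {
	transient []int   // L0: the latest transient block, unsorted
	l1        [][]int // L1: sorted blocks not yet merged into a segment
	l2        [][]int // L2: merge-sorted segments
}

// Append inserts a key into the latest transient block and triggers the
// flush and merge steps once the fixed limits are reached.
func (t *Table) Append(pk int) {
	t.transient = append(t.transient, pk)
	if len(t.transient) < maxRowsPerBlock {
		return
	}
	// The block is full: sort it by primary key and move it to L1.
	block := t.transient
	sort.Ints(block)
	t.l1 = append(t.l1, block)
	t.transient = nil
	if len(t.l1) < maxBlocksPerSegment {
		return
	}
	// Enough sorted blocks: combine them into one sorted segment in L2.
	var seg []int
	for _, b := range t.l1 {
		seg = append(seg, b...)
	}
	sort.Ints(seg) // stands in for a k-way merge of the sorted blocks
	t.l2 = append(t.l2, seg)
	t.l1 = nil
}

func main() {
	t := &Table{}
	for _, pk := range []int{9, 3, 7, 1, 8, 2, 6, 4, 5} {
		t.Append(pk)
	}
	fmt.Println("L0:", t.transient, "L1:", t.l1, "L2:", t.l2)
}
```

After nine inserts, eight keys have moved into one sorted L2 segment and the ninth sits in a new transient block.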

L1 and L2 are organized into sorted runs of data. Each run contains data sorted by the primary key and can be represented on disk as a single file. Primary key ranges may overlap between sorted runs. The difference between L1 and L2 is that a run in L1 is a **block** while a run in L2 is a **segment**.

As described above, transient blocks can reside entirely in memory, but they do not have to. Every table has its own transient blocks, and with many tables, keeping all of them in memory would waste a large amount of memory. In **AOE**, transient blocks from all tables share a dedicated fixed-size LRU cache. An evicted transient block is unloaded from memory and flushed to disk as a transient block file. In practice, transient blocks flow constantly into L1, so the number of transient blocks per table at any given time is very small, and the active transient blocks are likely to stay in memory even with a small cache.

### Indexes
**AOE** has no table-level indexes; only segment-level and block-level indexes are available. **AOE** supports dynamic index creation and deletion, and index creation is asynchronous. Indexes are only created on blocks and segments at L1 and L2, so transient blocks have no indexes.

**AOE** keeps a dedicated fixed-size LRU cache for all indexes. Compared with the original data, an index occupies little space, yet it speeds up queries considerably and is accessed very frequently. A dedicated cache avoids a memory copy on each access.

Currently, **AOE** uses two index types:
- **Zonemap**: Created automatically for all columns. Persisted.
- **BSI**: Must be explicitly defined for a column. Persisted.

**AOE** is designed so that new index types are easy to add.
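
To illustrate why a zonemap pays off despite its small size, here is a minimal sketch of the general technique: store the minimum and maximum of a column per block, then skip every block whose range cannot match a predicate. The names `ZoneMap`, `BuildZoneMap`, `MayOverlap` and `PruneBlocks` are hypothetical and are not taken from the AOE code base.

```go
package main

import "fmt"

// ZoneMap records the minimum and maximum value of one column in one block.
type ZoneMap struct{ Min, Max int64 }

// BuildZoneMap scans a non-empty column once and returns its zonemap.
func BuildZoneMap(col []int64) ZoneMap {
	zm := ZoneMap{Min: col[0], Max: col[0]}
	for _, v := range col[1:] {
		if v < zm.Min {
			zm.Min = v
		}
		if v > zm.Max {
			zm.Max = v
		}
	}
	return zm
}

// MayOverlap reports whether the block can contain a value in [lo, hi].
func (z ZoneMap) MayOverlap(lo, hi int64) bool {
	return lo <= z.Max && hi >= z.Min
}

// PruneBlocks returns the positions of the blocks that still have to be
// scanned for the predicate lo <= value <= hi.
func PruneBlocks(maps []ZoneMap, lo, hi int64) []int {
	var keep []int
	for i, zm := range maps {
		if zm.MayOverlap(lo, hi) {
			keep = append(keep, i)
		}
	}
	return keep
}

func main() {
	blocks := [][]int64{{1, 5, 9}, {20, 25, 31}, {8, 12, 40}}
	maps := make([]ZoneMap, len(blocks))
	for i, col := range blocks {
		maps[i] = BuildZoneMap(col)
	}
	// Block 0 covers [1, 9] and is skipped for the range [10, 22].
	fmt.Println(PruneBlocks(maps, 10, 22))
}
```

A zonemap can only rule blocks out; a block that passes the check may still contain no matching row and has to be scanned.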

### Compression
**AOE** is a column-oriented data store, which makes it well suited to data compression. It supports per-column compression codecs, although only **LZ4** is used at present. The meta information of compressed blocks is easy to obtain. The compression unit is a column of a block.

### Layout
#### Block
   ![image](https://user-images.githubusercontent.com/39627130/145402878-72f9aa0a-65f5-494a-96ff-c075065c1f01.png)

#### Segment
   ![image](https://user-images.githubusercontent.com/39627130/145402537-6500bcf4-5897-4dfa-b3fc-196d0c5835df.png)

## Buffer manager
The buffer manager is responsible for allocating buffer space. It handles all requests for data pages and temporary blocks in **AOE**.
1. Each page is bound to a buffer node with a unique node ID.
2. A buffer node has two states:
   1) Loaded
   2) Unloaded
3. When a requestor wants to **Pin** a node:
   1) If the node is in the **Loaded** state, the buffer manager increases the node's reference count by 1 and returns a node handle that wraps the page address in memory.
   2) If the node is in the **Unloaded** state, the buffer manager first reads the page from disk or remote storage, then increases the node's reference count by 1 and returns a node handle that wraps the page address in memory. When the buffer has no room left, a victim node is unloaded to make room. The current replacement strategy is **LRU**.
4. To **Unpin** a node, a requestor calls **Close** on the node handle, which decreases the node's reference count by 1. When the reference count reaches 0, the node becomes a candidate for eviction. A node with a reference count greater than 0 is never evicted.
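
The pin and unpin protocol above can be sketched as follows. This is a simplified, single-goroutine illustration under stated assumptions: `Manager`, `Handle`, `NewManager` and the `load` callback are hypothetical names, capacity is counted in nodes rather than bytes, and the callback stands in for reading a page from disk or remote storage.

```go
package main

import (
	"container/list"
	"errors"
	"fmt"
)

// node is one buffer node. It is loaded while elem is non-nil.
type node struct {
	refs int
	page []byte
	elem *list.Element // position in the LRU list; nil while unloaded
}

// Handle is returned by Pin and must be closed to unpin the node.
type Handle struct {
	n    *node
	Page []byte
}

// Close unpins the node. Calling it a second time is a no-op.
func (h *Handle) Close() {
	if h.n != nil {
		h.n.refs--
		h.n = nil
	}
}

// Manager is a fixed-capacity buffer with LRU replacement.
type Manager struct {
	capacity int // maximum number of loaded nodes
	nodes    map[uint64]*node
	lru      *list.List             // front = most recently pinned
	load     func(id uint64) []byte // stands in for a disk or remote read
}

func NewManager(capacity int, load func(uint64) []byte) *Manager {
	return &Manager{capacity: capacity, nodes: map[uint64]*node{}, lru: list.New(), load: load}
}

// Pin loads the node if necessary, bumps its reference count and
// returns a handle that wraps the page.
func (m *Manager) Pin(id uint64) (*Handle, error) {
	n, ok := m.nodes[id]
	if !ok {
		n = &node{}
		m.nodes[id] = n
	}
	if n.elem == nil { // Unloaded: make room, then load the page
		if m.lru.Len() >= m.capacity && !m.evictOne() {
			return nil, errors.New("buffer full: every loaded node is pinned")
		}
		n.page = m.load(id)
		n.elem = m.lru.PushFront(n)
	} else { // Loaded: only refresh the LRU position
		m.lru.MoveToFront(n.elem)
	}
	n.refs++
	return &Handle{n: n, Page: n.page}, nil
}

// evictOne unloads the least recently used node whose reference count is 0.
func (m *Manager) evictOne() bool {
	for e := m.lru.Back(); e != nil; e = e.Prev() {
		if n := e.Value.(*node); n.refs == 0 {
			m.lru.Remove(e)
			n.page, n.elem = nil, nil
			return true
		}
	}
	return false
}

func main() {
	m := NewManager(2, func(id uint64) []byte { return []byte{byte(id)} })
	h1, _ := m.Pin(1)
	h2, _ := m.Pin(2)
	_, err := m.Pin(3) // fails: both loaded nodes are pinned
	fmt.Println("pin 3 while 1 and 2 are pinned:", err)
	h1.Close() // node 1 becomes an eviction candidate
	h3, err := m.Pin(3)
	fmt.Println("pin 3 after unpinning 1:", err, h3.Page)
	h2.Close()
	h3.Close()
}
```

Returning an error when every loaded node is pinned is a choice made for this sketch; the RFC does not say how the engine behaves in that case.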

**AOE** currently has three buffer managers for different purposes:
1. Mutation buffer manager: A dedicated fixed-size buffer used by L0 transient blocks. Each block corresponds to a node in the buffer.
2. SST buffer manager: A dedicated fixed-size buffer used by L1 and L2 blocks. Each column within a block corresponds to a node in the buffer.
3. Index buffer manager: A dedicated fixed-size buffer used by indexes. Each block index or segment index corresponds to a node in the buffer.

## WAL
**Write-ahead logging** (WAL) is the key to providing **atomicity** and **durability**. All modifications must be written to a log before they are applied. **AOE** depends on an abstract **WAL** layer that sits on top of a more concrete **WAL** backend. A **WAL** can take one of two roles in **AOE**:
1. HolderRole: The backend **WAL** is used as an embedded **AOE** **WAL**.
2. BrokerRole: The incoming modification requests have already been written to an external log and are now being applied to **AOE**. The backend **WAL** is shared with the external **WAL**.
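
One way to picture the two roles is the sketch below. Only the role names come from the text; the `WAL` struct, its fields and the `Log` method are assumptions made for illustration. In `HolderRole` the engine appends to its own log and assigns the LSN, while in `BrokerRole` it writes nothing and simply adopts the LSN that the external log already assigned.

```go
package main

import "fmt"

// Role selects how the WAL layer treats incoming modifications.
type Role int

const (
	HolderRole Role = iota // AOE owns the log and writes every entry itself
	BrokerRole             // an external log already holds the entry
)

// WAL is a hypothetical front for a log backend.
type WAL struct {
	role    Role
	entries []string // embedded log, written only in HolderRole
	lastLSN uint64   // highest LSN known to be logged
}

// Log records one modification before it is applied and returns the LSN
// to apply it under. externalLSN is ignored in HolderRole.
func (w *WAL) Log(payload string, externalLSN uint64) uint64 {
	if w.role == HolderRole {
		w.entries = append(w.entries, payload)
		w.lastLSN++
	} else {
		// BrokerRole: the entry is already durable elsewhere.
		w.lastLSN = externalLSN
	}
	return w.lastLSN
}

func main() {
	holder := &WAL{role: HolderRole}
	holder.Log("insert a", 0)
	fmt.Println("holder:", holder.Log("insert b", 0), len(holder.entries))

	broker := &WAL{role: BrokerRole}
	fmt.Println("broker:", broker.Log("insert a", 42), len(broker.entries))
}
```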

### Share with external WAL
When a storage engine serves as the state machine of a raft group, a separate **WAL** inside the storage engine is unnecessary and only adds overhead. **AOE** is currently used as the underlying state machine of **MatrixOne**, which uses **Raft** consensus for replication. To share the external **Raft** log, **AOE** can use a default **WAL** backend with the role **BrokerRole**.

## Catalog
**Catalog** is **AOE**'s in-memory metadata manager. It manages all states of the engine, and its underlying driver is an embedded **LogStore**. **Catalog** implements a simple in-memory transactional database: it retains a complete version chain in memory, which is compacted once it is no longer referenced. **Catalog** can be fully replayed from the underlying **LogStore**. It records:
1. DDL operation information
2. Table schema information
3. Layout information

### Example
![image](https://user-images.githubusercontent.com/39627130/139570327-f484858c-347c-4100-b0cc-afd03c5e6e8d.png)
- There are 5 tables (IDs 1, 3, 5, 6 and 7) with the same table name "m", and only table 7 is active now.
- Table 1 was created @ timestamp 1, soft-deleted @ timestamp 2 and hard-deleted @ timestamp 9
- Table 3 was created @ timestamp 3, soft-deleted @ timestamp 4 and hard-deleted @ timestamp 5
- Table 5 was created @ timestamp 6, soft-deleted @ timestamp 7 and hard-deleted @ timestamp 8
- Table 6 was created @ timestamp 10, **replaced** @ timestamp 11 and hard-deleted @ timestamp 12
- Table 7 was created @ timestamp 11
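
The timestamps in this example suggest a simple visibility rule, sketched below. The rule itself is an assumption rather than something the RFC states: an entry is visible at timestamp `ts` if it was created at or before `ts` and was not yet soft-deleted or replaced at `ts`. `TableEntry`, `VisibleAt` and `Lookup` are hypothetical names.

```go
package main

import "fmt"

// TableEntry is a hypothetical catalog entry for one table version.
type TableEntry struct {
	ID            uint64
	Name          string
	CreatedAt     uint64
	HiddenAt      uint64 // soft-deleted or replaced; 0 means still active
	HardDeletedAt uint64 // 0 means not hard-deleted; relevant to GC only
}

// VisibleAt reports whether a reader at timestamp ts sees this entry.
func (e TableEntry) VisibleAt(ts uint64) bool {
	return e.CreatedAt <= ts && (e.HiddenAt == 0 || ts < e.HiddenAt)
}

// Lookup resolves a table name at a timestamp to a table ID.
func Lookup(entries []TableEntry, name string, ts uint64) (uint64, bool) {
	for _, e := range entries {
		if e.Name == name && e.VisibleAt(ts) {
			return e.ID, true
		}
	}
	return 0, false
}

// example holds the five entries named "m" from the list above.
var example = []TableEntry{
	{1, "m", 1, 2, 9},
	{3, "m", 3, 4, 5},
	{5, "m", 6, 7, 8},
	{6, "m", 10, 11, 12},
	{7, "m", 11, 0, 0},
}

func main() {
	for _, ts := range []uint64{1, 2, 3, 10, 11} {
		id, ok := Lookup(example, "m", ts)
		fmt.Printf("ts=%d id=%d found=%v\n", ts, id, ok)
	}
}
```

Under this rule, a reader at timestamp 10 resolves "m" to table 6, a reader at timestamp 11 resolves it to table 7, and a reader at timestamp 2 finds no table.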

## LogStore
**LogStore** is an embedded log-structured data store. It is used as the underlying driver of **Catalog** and **WAL**.

## Multi-Version Concurrency Control (MVCC)
For any update, **AOE** creates a new version of the data object instead of updating it in place. Concurrent read operations that started at an older timestamp can still see the old version.
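
A minimal sketch of this idea is a version chain in which every update prepends a new version and a reader picks the newest version committed at or before its own timestamp. The `Object` and `version` types are hypothetical, and the sketch is single-goroutine; a real engine would also need synchronization and garbage collection of old versions.

```go
package main

import "fmt"

// version is one immutable version of a data object.
type version struct {
	commitTS uint64
	value    string
	prev     *version // next older version
}

// Object keeps its versions as a chain, newest first.
type Object struct{ head *version }

// Update installs a new version instead of modifying the old one.
// ts is assumed to be greater than every earlier commit timestamp.
func (o *Object) Update(ts uint64, value string) {
	o.head = &version{commitTS: ts, value: value, prev: o.head}
}

// ReadAt returns the newest version committed at or before ts.
func (o *Object) ReadAt(ts uint64) (string, bool) {
	for v := o.head; v != nil; v = v.prev {
		if v.commitTS <= ts {
			return v.value, true
		}
	}
	return "", false
}

func main() {
	var obj Object
	obj.Update(10, "v1")
	obj.Update(20, "v2")
	old, _ := obj.ReadAt(15) // a reader that started before the second update
	cur, _ := obj.ReadAt(25)
	fmt.Println(old, cur)
}
```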

### Example
![image](https://user-images.githubusercontent.com/39627130/145431744-e3f52d23-7ae0-4356-801e-29807e9fc325.png)

## Database (Column Families)
In **AOE**, a **Table** is a **Column Family** and a **Database** is a set of **Column Families**. The main idea behind **Column Families** is that they share the write-ahead log (a shared **Log Space**), which makes **Database-level** atomic writes possible. The old **WAL** cannot be compacted as soon as the mutable buffer of one **Table** is flushed, because it may still contain live data from other **Tables**. It can only be compacted once the mutable buffers of all related **Tables** have been flushed.
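
The compaction rule for a shared log reduces to taking a minimum: the log may be compacted only up to the lowest LSN that every table has already flushed. The sketch below assumes each table tracks a single "flushed up to" LSN; `CompactableLSN` and the table names are hypothetical.

```go
package main

import "fmt"

// CompactableLSN returns the highest LSN up to which the shared log can be
// compacted, given the LSN each table has flushed up to. ok is false when
// there are no tables.
func CompactableLSN(flushed map[string]uint64) (lsn uint64, ok bool) {
	for _, f := range flushed {
		if !ok || f < lsn {
			lsn, ok = f, true
		}
	}
	return lsn, ok
}

func main() {
	flushed := map[string]uint64{"orders": 120, "users": 75, "events": 300}
	lsn, _ := CompactableLSN(flushed)
	// "users" still has unflushed data after LSN 75, so entries beyond
	// 75 must stay in the log.
	fmt.Println(lsn)
}
```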

**AOE** supports multiple **Databases**; that is, one **AOE** instance can work with multiple **Log Spaces**. The **MatrixOne** DBMS is built on multi-raft: each node needs only one **AOE** engine, and each raft group corresponds to a **Database**. This is already complicated, and it becomes more so because the engine shares the external **Raft** log as its **WAL**.

## Snapshot
**AOE** can create a snapshot of a **Database** at a given time or LSN. Building on **MVCC**, a snapshotter can fetch a database snapshot and dump all related data and metadata to a specified path. **AOE** can also install snapshot files to restore the state that existed when the snapshot was created.

## Split
**AOE** can split a **Database** into a specified number of **Databases**. Currently, a segment cannot be split. At the data layer, splitting is therefore just a reorganization of segments.
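
Because segments are never cut, a split only has to decide which target database each whole segment goes to. The sketch below assigns segment IDs round-robin; that policy and the `Split` function are assumptions for illustration, since the RFC does not say how segments are assigned.

```go
package main

import "fmt"

// Split distributes whole segments, identified by ID, across n target
// databases. No segment is ever divided. n must be greater than 0.
func Split(segmentIDs []uint64, n int) [][]uint64 {
	out := make([][]uint64, n)
	for i, id := range segmentIDs {
		out[i%n] = append(out[i%n], id)
	}
	return out
}

func main() {
	fmt.Println(Split([]uint64{1, 2, 3, 4, 5}, 2))
}
```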

## GC
1. Metadata compaction
   1) In-memory version chain compaction
   2) In-memory hard-deleted metadata entry compaction
   3) Persisted data compaction
2. Stale data deletion
   1) Table data with a reference count equal to 0
   2) Log compaction

# Future work
1. More index types
2. Per-column compression codecs
3. More LSM tree levels
4. Integrate some scan and filter operators
5. Support deletion