github.com/matrixorigin/matrixone@v1.2.0/docs/rfcs/20211210_aoe_overall_design.md

- Feature Name: Analytic Optimized Engine Overall Design
- Status: In Progress
- Start Date: 2021-05-10
- Authors: [Xu Peng](https://github.com/XuPeng-SH)
- Implementation PR: [#1335](https://github.com/matrixorigin/matrixone/pull/1335)
- Issue for this RFC: [#1320](https://github.com/matrixorigin/matrixone/pull/1320)

# Summary

**AOE** (Analytic Optimized Engine) is designed for analytical query workloads. It can serve as the underlying storage engine of a database management system (DBMS) for online analytical processing (OLAP). Its design goals are:
- Cloud native
- Efficient processing of both small and large inserts
- State-of-the-art data layout for vectorized execution engines
- Fine-grained, efficient and convenient memory management
- Convenient support for new index types, with dedicated index cache management
- Rigorous data management for concurrent queries and mutations
- No external dependencies

# Guide-level design

## Terms
### Layout
- **Block**: A piece of a segment and the smallest unit of table data. The maximum number of rows in a block is fixed.
- **Segment**: A piece of a table, composed of blocks. The maximum number of blocks in a segment is fixed.
- **Table**: A piece of a database, composed of segments.
- **Database**: A collection of tables that share the same log space.

### State
- **Transient Block**: A block whose row count has not yet reached the upper limit, or a full block that is queued to be sorted and flushed.
- **Persisted Block**: A block that has been sorted and flushed.
- **Unclosed Segment**: A segment that has not been merge sorted.
- **Closed Segment**: A segment that has been merge sorted.

### Container
- **Vector**: A fragment of a column's data in memory.
- **Batch**: A collection of vectors that all have the same number of rows.

### Misc
- **Log space**: A raft group can be considered a log space.

## Data storage
### Table
**AOE** stores data as tables. Each table is bound to a schema consisting of a number of column definitions. Table data is organized as a log-structured merge-tree (LSM tree).

Currently, **AOE** uses a three-level LSM tree with levels L0, L1 and L2. L0 is small and can be entirely resident in memory, whereas L1 and L2 always reside on disk. L0 consists of transient blocks and L1 consists of sorted blocks. Incoming data is always inserted into the latest transient block. When the block reaches the maximum row count of a block, it is sorted by primary key and flushed into L1 as a sorted block. When the number of sorted blocks reaches the maximum number of blocks of a segment, the segment is sorted by primary key using merge sort.

L1 and L2 are organized into sorted runs of data. Each run contains data sorted by the primary key and can be represented on disk as a single file. Primary key ranges may overlap between sorted runs. The difference between L1 and L2 is that a run in L1 is a **block** while a run in L2 is a **segment**.

As described above, transient blocks can be entirely resident in memory, but not necessarily so. There can be many tables, and each table has its own transient blocks; keeping all of them in memory at all times would waste a lot of memory. In **AOE**, transient blocks from all tables share a dedicated fixed-size LRU cache. An evicted transient block is unloaded from memory and flushed as a transient block file. In practice, transient blocks flow constantly to L1 and the number of transient blocks per table at any given time is very small, so the active transient blocks will likely stay in memory even with a small cache.

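The write path described in this section can be sketched as a toy model. This is only an illustration under stated assumptions, not AOE's implementation: rows are reduced to an `int64` primary key, and `Table`, `Insert`, `mergeBlocks`, `mergeTwo` and the two tiny limits are hypothetical names and values chosen for the example. A full transient block is sorted and moved to L1, and once L1 holds enough sorted blocks for a segment they are merge sorted into one L2 run.

```go
package main

import "sort"

// Hypothetical limits; the real values are not given in this document.
const (
	maxRowsPerBlock     = 4
	maxBlocksPerSegment = 2
)

// Table is a toy model of one table keyed by an int64 primary key.
type Table struct {
	transient []int64   // L0: the latest transient block, unsorted
	l1        [][]int64 // L1: sorted blocks of the unclosed segment
	l2        [][]int64 // L2: merge-sorted (closed) segments
}

// Insert appends a key to the transient block and triggers
// the flush and the segment merge when the limits are reached.
func (t *Table) Insert(key int64) {
	t.transient = append(t.transient, key)
	if len(t.transient) < maxRowsPerBlock {
		return
	}
	// The block is full: sort it by primary key and move it to L1.
	block := t.transient
	sort.Slice(block, func(i, j int) bool { return block[i] < block[j] })
	t.l1 = append(t.l1, block)
	t.transient = nil
	// The segment is full: merge its sorted blocks into one L2 run.
	if len(t.l1) == maxBlocksPerSegment {
		t.l2 = append(t.l2, mergeBlocks(t.l1))
		t.l1 = nil
	}
}

// mergeBlocks merges several sorted blocks into one sorted run.
func mergeBlocks(blocks [][]int64) []int64 {
	var out []int64
	for _, b := range blocks {
		out = mergeTwo(out, b)
	}
	return out
}

// mergeTwo is a two-way merge of two sorted slices.
func mergeTwo(a, b []int64) []int64 {
	out := make([]int64, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if a[i] <= b[j] {
			out = append(out, a[i])
			i++
		} else {
			out = append(out, b[j])
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}
```
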
### Indexes
There is no table-level index in **AOE**; only segment-level and block-level indexes are available. **AOE** supports dynamic index creation and deletion, and index creation is an asynchronous process. Indexes are only created on blocks and segments at L1 and L2; transient blocks do not have any index.

In **AOE**, there is a dedicated fixed-size LRU cache for all indexes. Compared with the original data, an index occupies little space, yet it speeds up queries significantly and is accessed very frequently. A dedicated cache avoids a memory copy on each access.

Currently, **AOE** uses two index types:
- **Zonemap**: Automatically created for all columns. Persisted.
- **BSI**: Must be explicitly defined for a column. Persisted.

New index types are easy to add to **AOE**.

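To make the zonemap idea concrete, here is a minimal sketch of how a min/max summary per block lets a range filter skip blocks without reading them. It assumes a single `int64` column; `ZoneMap`, `BuildZoneMap`, `MayContainRange` and `PruneBlocks` are hypothetical names for this example and do not describe AOE's actual index API.

```go
package main

// ZoneMap records the minimum and maximum value of one column
// within one block or segment.
type ZoneMap struct {
	Min, Max int64
}

// BuildZoneMap scans a non-empty column once.
func BuildZoneMap(col []int64) ZoneMap {
	zm := ZoneMap{Min: col[0], Max: col[0]}
	for _, v := range col[1:] {
		if v < zm.Min {
			zm.Min = v
		}
		if v > zm.Max {
			zm.Max = v
		}
	}
	return zm
}

// MayContainRange reports whether a value in [lo, hi] could exist
// in the indexed data. A false result is definitive; a true result
// only means the block still has to be scanned.
func (z ZoneMap) MayContainRange(lo, hi int64) bool {
	return lo <= z.Max && hi >= z.Min
}

// PruneBlocks returns the positions of the blocks that must still
// be scanned for the filter lo <= v <= hi.
func PruneBlocks(zms []ZoneMap, lo, hi int64) []int {
	var keep []int
	for i, z := range zms {
		if z.MayContainRange(lo, hi) {
			keep = append(keep, i)
		}
	}
	return keep
}
```
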
### Compression
**AOE** is a column-oriented data store, which makes it well suited to data compression. It supports per-column compression codecs; currently only **LZ4** is used. The meta information of compressed blocks is easy to obtain. In **AOE**, the compression unit is a column of a block.

### Layout
#### Block
![image](https://user-images.githubusercontent.com/39627130/145402878-72f9aa0a-65f5-494a-96ff-c075065c1f01.png)

#### Segment
![image](https://user-images.githubusercontent.com/39627130/145402537-6500bcf4-5897-4dfa-b3fc-196d0c5835df.png)

## Buffer manager
The buffer manager is responsible for the allocation of buffer space. It handles all requests for data pages and temporary blocks of **AOE**.
1. Each page is bound to a buffer node with a unique node ID.
2. A buffer node has two states:
   1) Loaded
   2) Unloaded
3. When a requestor **Pins** a node:
   1) If the node is in the **Loaded** state, the buffer manager increases the node reference count by 1 and wraps the in-memory page address in a node handle.
   2) If the node is in the **Unloaded** state, the buffer manager first reads the page from disk or remote storage, then increases the node reference count by 1 and wraps the in-memory page address in a node handle. When there is no room left in the buffer, a victim node is unloaded to make room. The current replacement strategy is **LRU**.
4. When a requestor **Unpins** a node, it just calls **Close** on the node handle, which decreases the node reference count by 1. When the reference count reaches 0, the node becomes a candidate for eviction. A node with a reference count greater than 0 is never evicted.

There are currently three buffer managers for different purposes in **AOE**:
1. Mutation buffer manager: A dedicated fixed-size buffer used by L0 transient blocks. Each block corresponds to a node in the buffer.
2. SST buffer manager: A dedicated fixed-size buffer used by L1 and L2 blocks. Each column within a block corresponds to a node in the buffer.
3. Index buffer manager: A dedicated fixed-size buffer used by indexes. Each block index or segment index corresponds to a node in the buffer.

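The Pin/Unpin protocol described in this section can be sketched as follows. This is a simplified, single-threaded illustration and not AOE's buffer manager: capacity is counted in nodes rather than bytes, the `load` callback stands in for a read from disk or remote storage, and `Manager`, `Node`, `Handle`, `NewManager` and `ErrNoRoom` are hypothetical names. Only nodes with a reference count of 0 sit in the LRU list, so a pinned node can never be chosen as a victim.

```go
package main

import (
	"container/list"
	"errors"
)

// ErrNoRoom is returned when every loaded node is pinned,
// so nothing can be evicted.
var ErrNoRoom = errors.New("buffer full: all loaded nodes are pinned")

// Node is a buffer node bound to one page.
type Node struct {
	id     uint64
	page   []byte
	loaded bool
	refs   int
	lruPos *list.Element // set only while the node is loaded and unpinned
}

// Handle wraps a pinned node and its in-memory page.
type Handle struct {
	mgr    *Manager
	node   *Node
	closed bool
}

// Manager is a fixed-capacity buffer manager.
type Manager struct {
	capacity int
	loaded   int
	nodes    map[uint64]*Node
	lru      *list.List             // eviction candidates, least recently used first
	load     func(id uint64) []byte // stands in for a disk or remote read
}

// NewManager creates a manager that keeps at most capacity nodes loaded.
func NewManager(capacity int, load func(id uint64) []byte) *Manager {
	return &Manager{capacity: capacity, nodes: map[uint64]*Node{}, lru: list.New(), load: load}
}

// Pin loads the node if needed, increases its reference count
// and returns a handle.
func (m *Manager) Pin(id uint64) (*Handle, error) {
	n, ok := m.nodes[id]
	if !ok {
		n = &Node{id: id}
		m.nodes[id] = n
	}
	if !n.loaded {
		if m.loaded >= m.capacity {
			front := m.lru.Front()
			if front == nil {
				return nil, ErrNoRoom
			}
			// Unload the least recently used unpinned node.
			victim := m.lru.Remove(front).(*Node)
			victim.page, victim.loaded, victim.lruPos = nil, false, nil
			m.loaded--
		}
		n.page, n.loaded = m.load(id), true
		m.loaded++
	} else if n.lruPos != nil {
		// A pinned node must not stay in the candidate list.
		m.lru.Remove(n.lruPos)
		n.lruPos = nil
	}
	n.refs++
	return &Handle{mgr: m, node: n}, nil
}

// Page returns the in-memory page of the pinned node.
func (h *Handle) Page() []byte { return h.node.page }

// Close unpins the node. At reference count 0 the node becomes
// an eviction candidate.
func (h *Handle) Close() {
	if h.closed {
		return
	}
	h.closed = true
	h.node.refs--
	if h.node.refs == 0 {
		h.node.lruPos = h.mgr.lru.PushBack(h.node)
	}
}
```
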
## WAL
**Write-ahead logging** (WAL) is the key to providing **atomicity** and **durability**. All modifications should be written to a log before they are applied. **AOE** depends on an abstract **WAL** layer on top of a more concrete **WAL** backend. There are two **WAL** roles in **AOE**:
1. HolderRole: The backend **WAL** is used as an **AOE** embedded **WAL**.
2. BrokerRole: The incoming modification requests have already been written to an external log and are now applied to **AOE**. The backend **WAL** is shared with the external **WAL**.

### Share with external WAL
When a storage engine is used as the state machine of a raft group, a **WAL** in the storage engine is unnecessary and would only add overhead. **AOE** is currently used as the underlying state machine of **MatrixOne**, which uses **Raft** consensus for replication. To share the external **Raft** log, **AOE** can use a default **WAL** backend with the role **BrokerRole**.

## Catalog
**Catalog** is **AOE**'s in-memory metadata manager. It manages all states of the engine, and its underlying driver is an embedded **LogStore**. **Catalog** implements a simple in-memory transactional database: it retains a complete version chain in memory and compacts versions that are no longer referenced. **Catalog** can be fully replayed from the underlying **LogStore**. It manages:
1. DDL operation info
2. Table schema info
3. Layout info

### Example
![image](https://user-images.githubusercontent.com/39627130/139570327-f484858c-347c-4100-b0cc-afd03c5e6e8d.png)
- There are 5 tables (id: 1, 3, 5, 6, 7) with the same table name "m" and only table 7 is active now.
- Table 1 is created @ timestamp 1, soft-deleted @ timestamp 2, hard-deleted @ timestamp 9
- Table 3 is created @ timestamp 3, soft-deleted @ timestamp 4, hard-deleted @ timestamp 5
- Table 5 is created @ timestamp 6, soft-deleted @ timestamp 7, hard-deleted @ timestamp 8
- Table 6 is created @ timestamp 10, **replaced** @ timestamp 11, hard-deleted @ timestamp 12
- Table 7 is created @ timestamp 11

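The timestamps in this example can be turned into a small lookup that answers "which table named `m` does a reader at timestamp `ts` see?". This is a hedged sketch, not the Catalog's data model: `TableEntry`, its fields, `figureEntries` and `VisibleAt` are hypothetical, and it assumes that an entry stops being visible at the timestamp where it is soft-deleted or replaced, while hard deletion only matters for garbage collection.

```go
package main

// TableEntry is one catalog entry for a table name.
// All field names are hypothetical.
type TableEntry struct {
	ID        uint64
	Name      string
	CreatedAt uint64
	EndedAt   uint64 // soft-deleted or replaced at this timestamp; 0 means still active
	HardDelAt uint64 // hard-deleted at this timestamp; 0 means not hard-deleted
}

// figureEntries encodes the five tables named "m" from the example.
func figureEntries() []TableEntry {
	return []TableEntry{
		{ID: 1, Name: "m", CreatedAt: 1, EndedAt: 2, HardDelAt: 9},
		{ID: 3, Name: "m", CreatedAt: 3, EndedAt: 4, HardDelAt: 5},
		{ID: 5, Name: "m", CreatedAt: 6, EndedAt: 7, HardDelAt: 8},
		{ID: 6, Name: "m", CreatedAt: 10, EndedAt: 11, HardDelAt: 12},
		{ID: 7, Name: "m", CreatedAt: 11},
	}
}

// VisibleAt returns the entry with the given name that a reader at
// timestamp ts sees. Assumption: an entry is visible when
// CreatedAt <= ts and it has not ended at or before ts.
func VisibleAt(entries []TableEntry, name string, ts uint64) (TableEntry, bool) {
	for _, e := range entries {
		if e.Name != name || e.CreatedAt > ts {
			continue
		}
		if e.EndedAt == 0 || ts < e.EndedAt {
			return e, true
		}
	}
	return TableEntry{}, false
}
```

Under these assumptions a reader at timestamp 10 resolves `m` to table 6, a reader at timestamp 11 resolves it to table 7, and a reader at timestamp 2 finds no table at all.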
## LogStore
An embedded log-structured data store. It is used as the underlying driver of **Catalog** and **WAL**.

## Multi-Version Concurrency Control (MVCC)
For any update, **AOE** creates a new version of the data object instead of updating it in place. Concurrent read operations that started at an older timestamp can still see the old version.

### Example
![image](https://user-images.githubusercontent.com/39627130/145431744-e3f52d23-7ae0-4356-801e-29807e9fc325.png)

## Database (Column Families)
In **AOE**, a **Table** is a **Column Family** while a **Database** is a set of **Column Families**. The main idea behind **Column Families** is that they share the write-ahead log (that is, they share a **Log Space**), so that **Database-level** atomic writes can be implemented. The old **WAL** cannot be compacted as soon as the mutable buffer of one **Table** is flushed, since it may still contain live data from other **Tables**. It can only be compacted when the mutable buffers of all related **Tables** have been flushed.

**AOE** supports multiple **Databases**; that is, one **AOE** instance can work with multiple **Log Spaces**. **MatrixOne** is built upon multi-raft: each node needs only one **AOE** engine, and each raft group corresponds to a **Database**. This is complicated, and it is made more complicated by the fact that the engine shares the external **WAL** with the **Raft** log.

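The WAL compaction rule for tables that share a log space can be illustrated with a small model. It is a sketch under stated assumptions rather than AOE's WAL code: log entries are identified by a monotonically increasing LSN starting at 1, each table remembers the LSNs that still live only in its mutable buffer, and `LogSpace`, `Append`, `Flush` and `CompactableUpTo` are hypothetical names. The log may be truncated only up to the LSN just before the oldest entry that any table has not flushed yet.

```go
package main

// LogSpace models one shared write-ahead log used by several tables.
type LogSpace struct {
	nextLSN uint64
	// pending maps a table name to the LSNs (ascending) of entries
	// that are still only in that table's mutable buffer.
	pending map[string][]uint64
}

// NewLogSpace creates an empty log space.
func NewLogSpace() *LogSpace {
	return &LogSpace{pending: map[string][]uint64{}}
}

// Append writes one log entry for the table and returns its LSN.
func (s *LogSpace) Append(table string) uint64 {
	s.nextLSN++
	s.pending[table] = append(s.pending[table], s.nextLSN)
	return s.nextLSN
}

// Flush marks the table's mutable buffer as persisted.
func (s *LogSpace) Flush(table string) {
	delete(s.pending, table)
}

// CompactableUpTo returns the largest LSN such that every log entry
// at or below it belongs to flushed data. Flushing a single table is
// not enough if another table still has an older unflushed entry.
func (s *LogSpace) CompactableUpTo() uint64 {
	safe := s.nextLSN
	for _, lsns := range s.pending {
		if len(lsns) > 0 && lsns[0]-1 < safe {
			safe = lsns[0] - 1
		}
	}
	return safe
}
```
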
## Snapshot
**AOE** can create a snapshot of a **Database** at a certain time or LSN. As described in **MVCC**, a snapshotter can fetch a database snapshot and dump all related data and metadata to a specified path. **AOE** can also install snapshot files to restore the state the **Database** had when the snapshot was created.

## Split
**AOE** can split a **Database** into a specified number of **Databases**. Currently, a segment is not splittable. At the data layer, a split is just a reorganization of segments.

## GC
1. Metadata compaction
   1) In-memory version chain compaction
   2) In-memory hard-deleted metadata entry compaction
   3) Persisted data compaction
2. Stale data deletion
   1) Table data with a reference count equal to 0
   2) Log compaction

# Future work
1. More index types
2. Per-column compression codec
3. More LSM tree levels
4. Integrate some scan and filter operators
5. Support deletion