# DB Sorter

- Author(s): [overvenus](https://github.com/overvenus)
- Tracking Issue: https://github.com/pingcap/tiflow/issues/3227

## Table of Contents

- [Introduction](#introduction)
- [Motivation or Background](#motivation-or-background)
- [Detailed Design](#detailed-design)
  - [Encoding](#encoding)
    - [Key](#key)
    - [Value](#value)
  - [GC](#gc)
  - [Snapshot and Iterator](#snapshot-and-iterator)
  - [Unexpected conditions](#unexpected-conditions)
  - [Latency](#latency)
- [Test Design](#test-design)
  - [Functional Tests](#functional-tests)
  - [Scenario Tests](#scenario-tests)
  - [Compatibility Tests](#compatibility-tests)
- [Impacts & Risks](#impacts--risks)
- [Investigation & Alternatives](#investigation--alternatives)
- [Unresolved Questions](#unresolved-questions)

## Introduction

This document provides a complete design for implementing the db sorter,
a resource-friendly sorter with predictable and controllable usage of CPU,
memory, on-disk files, open file descriptors, and goroutines.

## Motivation or Background

We have encountered issues <sup>[1]</sup> about resource exhaustion in TiCDC.
One of the main sources of consumption is the TiCDC sorter.

In the current architecture, the resources consumed by the sorter are
proportional to the number of tables, in terms of goroutines and CPU.

To support scenarios that replicate many tables, such as 100,000 tables, we
need a sorter that consumes only O(1) or O(logN) resources.

## Detailed Design

LevelDB is a fast on-disk key-value store that provides ordered key-value
iteration. It also has mature resource management for CPU, memory, on-disk
files, and open file descriptors. This matches the TiCDC sorter requirements.

To further limit consumption, TiCDC creates a fixed set of leveldb instances
that are shared by multiple tables.

The db sorter is driven by actors that run on a fixed-size goroutine pool.
This addresses goroutine management issues.

The db sorter is composed of five structs:

1. `DBActor` is a struct that reads (by taking iterators) and writes to leveldb
   directly. It is shared by multiple tables. It is driven by an actor.
2. `TableSorter` is a struct that implements the `Sorter` interface and manages
   table-level states. It forwards `Sorter.AddEntry` to `Writer` and forwards
   `Sorter.Output` to `Reader`.
3. `Writer` is a struct that collects unordered key-value change data and
   forwards it to `DBActor`. It is driven by an actor.
4. `Reader` is a struct that reads ordered key-value change data from iterators.
5. `Compactor` is a garbage collector for leveldb. It is shared by multiple
   tables. It is driven by an actor.

_Quantitative relationship between the above structs_

| Table | DBActor | TableSorter | Writer | Reader | Compactor |
| ----- | ------- | ----------- | ------ | ------ | --------- |
| N     | 1       | N           | N      | N      | 1         |

| Read Write Sequence                                                                              | Table Sorter Structs                                                                      |
| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| <img src="../media/db-sorter-sequence.svg?sanitize=true" alt="db-sorter-sequence" width="600"/>   | <img src="../media/db-sorter-class.svg?sanitize=true" alt="db-sorter-class" width="600"/>  |

### Encoding

#### Key

The following diagram shows the key encoding. Events are sorted by a randomly
generated unique ID, table ID, CRTs, start ts, OpType, and key.

```
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unique ID |  table ID   |    CRTs     |   start ts    |     | key (variable-length)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                        ^OpType(Put/Delete)
```

LevelDB sorts keys in ascending order, which is also the order in which the
table sorter reads events.

Let's say "A has higher priority than B" means A is sorted before B.
Unique ID has the highest priority. It is assigned when the table pipeline
starts, and it serves two purposes:

1. Preventing data conflicts after table rescheduling, e.g., a table being
   moved out and then moved back in.
2. Allowing table data to be deleted by the unique ID prefix after the table
   is scheduled out.

`CRTs` has higher priority than `start ts`, because TiCDC needs to output
events in commit order.
`start ts` has higher priority than key, because TiCDC needs to group events
of the same transaction, and `start ts` is the transaction ID.
`OpType` has higher priority than key, and `Delete` has higher priority than
`Put`, because a REPLACE SQL statement might change the key by deleting the
original key and putting a new key. TiCDC must execute the `Delete` first;
otherwise, data is lost.
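To make the ordering concrete, the following is a minimal, self-contained
sketch of how such a key can be assembled from big-endian, fixed-width fields
so that LevelDB's plain byte-wise comparison yields exactly the priority order
described above. The field widths, the `opDelete`/`opPut` constant values, and
the `encodeKey` helper are illustrative assumptions, not the exact TiCDC
implementation.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Illustrative OpType values: Delete encodes to a smaller value than Put,
// so that a Delete on a key sorts (and is executed) before a Put on it.
const (
	opDelete uint16 = 0
	opPut    uint16 = 1
)

// encodeKey assembles a sorter key. Every fixed-width field is big-endian,
// so lexicographic byte comparison orders keys by
// (unique ID, table ID, CRTs, start ts, OpType, key), matching the diagram.
func encodeKey(uniqueID uint32, tableID, crts, startTs uint64, op uint16, key []byte) []byte {
	buf := make([]byte, 30+len(key))
	binary.BigEndian.PutUint32(buf[0:4], uniqueID)
	binary.BigEndian.PutUint64(buf[4:12], tableID)
	binary.BigEndian.PutUint64(buf[12:20], crts)
	binary.BigEndian.PutUint64(buf[20:28], startTs)
	binary.BigEndian.PutUint16(buf[28:30], op)
	copy(buf[30:], key)
	return buf
}

func main() {
	// A Delete and a Put of the same row in the same transaction:
	// the Delete must come out of the sorter first.
	del := encodeKey(1, 45, 105, 100, opDelete, []byte("row-1"))
	put := encodeKey(1, 45, 105, 100, opPut, []byte("row-1"))
	fmt.Println(bytes.Compare(del, put) < 0) // true: Delete sorts before Put
}
```

Because every field before the raw key has a fixed width, no separators are
needed and a byte-wise comparison never crosses field boundaries.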
#### Value

The value is a binary representation of an event, encoded by MessagePack<sup>[2]</sup>.

```
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| event value (variable-length) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

### GC

Because all events are written to leveldb, TiCDC needs a GC mechanism to free
disk space in time.

The leveldb sorter adopts a `DeleteRange` approach, which bulk-deletes
key-values that have already been outputted. It minimizes the GC impact on
`Writer` and `Reader`, in terms of both read/write throughput and latency.

For table movement, we bulk-delete the table's data in the background after
the table is stopped.
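The key layout above makes it straightforward to express GC as key ranges.
The sketch below shows one possible way to derive the `[start, end)` ranges
for the two cases: reclaiming already-outputted events of a table below a
given commit ts, and dropping everything written under one unique ID after
the table is scheduled out. The `outputtedRange` and `uniqueIDRange` helpers
and the field widths are illustrative assumptions; the actual range-deletion
primitive used by `DBActor`/`Compactor` is not shown.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// outputtedRange returns the [start, end) key range of one table's events
// whose commit ts (CRTs) is below maxCRTs, i.e. events that have already
// been outputted and can be bulk-deleted in a single DeleteRange.
func outputtedRange(uniqueID uint32, tableID, maxCRTs uint64) (start, end []byte) {
	start = make([]byte, 12)
	binary.BigEndian.PutUint32(start[0:4], uniqueID)
	binary.BigEndian.PutUint64(start[4:12], tableID)

	end = make([]byte, 20)
	binary.BigEndian.PutUint32(end[0:4], uniqueID)
	binary.BigEndian.PutUint64(end[4:12], tableID)
	binary.BigEndian.PutUint64(end[12:20], maxCRTs)
	return start, end
}

// uniqueIDRange returns the [start, end) key range covering every event of
// one table pipeline instance, used to drop all of its data in the
// background after the table is scheduled out.
func uniqueIDRange(uniqueID uint32) (start, end []byte) {
	start = make([]byte, 4)
	binary.BigEndian.PutUint32(start, uniqueID)
	end = make([]byte, 4)
	// Overflow of uniqueID+1 is ignored here for brevity; a real
	// implementation would fall back to an open-ended range.
	binary.BigEndian.PutUint32(end, uniqueID+1)
	return start, end
}

func main() {
	// Table 45 was written under unique ID 1; events with CRTs below 105
	// have been outputted and can be reclaimed.
	start, end := outputtedRange(1, 45, 105)
	fmt.Printf("delete range [%x, %x)\n", start, end)

	start, end = uniqueIDRange(1)
	fmt.Printf("drop table instance [%x, %x)\n", start, end)
}
```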
### Snapshot and Iterator

The leveldb sorter limits the total number and the maximum alive time of
snapshots and iterators, because they pin memtables and keep obsolete SST
files on disk. Too many concurrent snapshots and iterators can easily cause
performance and stability issues.

### Unexpected conditions

Leveldb has two kinds of unexpected conditions:

1. Disk full: the disk is full, and TiCDC can no longer write data.
2. I/O error: the hardware reports an error, and data may be corrupted.

In both cases, TiCDC should stop changefeeds immediately.

### Latency

Data can only be read after it has been written to leveldb, which adds extra
latency to the changefeed replication lag. The latency ranges from
sub-millisecond to minutes (in case of write stall), depending on the upstream
write QPS.

As an optimization and future work, we can implement a storage layer that
keeps data in memory or on disk depending on the data size.

## Test Design

The leveldb sorter is an internal optimization. For tests, we focus on
scenario tests and benchmarks.

### Functional Tests

Regular unit tests and integration tests; no additional tests are required
(the proposal does not add or remove any functionality).

#### Unit test

Coverage should be more than 75% in newly added code.

### Scenario Tests

- Regular unit tests and integration tests.
- 1 TiCDC node and 12 TiKV nodes with 100K tables and 270K regions.
  - No OOM.
  - Changefeed lag should be less than 1 minute.

We will test the scenario of replicating 100,000 tables in one TiCDC node.

### Compatibility Tests

#### Compatibility with other features/components

Should be compatible with other features.

#### Upgrade Downgrade Compatibility

The sorter cleans up on-disk files when TiCDC starts, so there should be no
upgrade or downgrade compatibility issues.

## Impacts & Risks

N/A

## Investigation & Alternatives

N/A

## Unresolved Questions

N/A

[1]: https://github.com/pingcap/tiflow/issues/2698
[2]: https://msgpack.org