
# DB Sorter

- Author(s): [overvenus](https://github.com/overvenus)
- Tracking Issue: https://github.com/pingcap/tiflow/issues/3227

## Table of Contents

- [Introduction](#introduction)
- [Motivation or Background](#motivation-or-background)
- [Detailed Design](#detailed-design)
  - [Encoding](#encoding)
    - [Key](#key)
    - [Value](#value)
  - [GC](#gc)
  - [Snapshot and Iterator](#snapshot-and-iterator)
  - [Unexpected conditions](#unexpected-conditions)
  - [Latency](#latency)
- [Test Design](#test-design)
  - [Functional Tests](#functional-tests)
  - [Scenario Tests](#scenario-tests)
  - [Compatibility Tests](#compatibility-tests)
- [Impacts & Risks](#impacts--risks)
- [Investigation & Alternatives](#investigation--alternatives)
- [Unresolved Questions](#unresolved-questions)

## Introduction

This document provides a complete design for implementing the db sorter,
a resource-friendly sorter with predictable and controllable usage of CPU,
memory, on-disk files, open file descriptors, and goroutines.

## Motivation or Background

We have met issues <sup>[1]</sup> about resource exhaustion in TiCDC.
One of the main sources of consumption is the TiCDC sorter.

In the current architecture, the resources consumed by the sorter are
proportional to the number of tables, in terms of goroutines and CPU.

To support scenarios that replicate many tables, e.g., 100,000 tables, we
need a sorter that only consumes O(1) or O(logN) resources.

## Detailed Design

LevelDB is a fast on-disk key-value store that provides ordered key-value
iteration. It also has mature resource management for CPU, memory, on-disk
files, and open file descriptors. It matches the TiCDC sorter's requirements.

To further limit consumption, TiCDC creates a fixed set of LevelDB instances
that are shared by multiple tables.

The db sorter is driven by actors that run on a fixed-size goroutine pool.
This addresses goroutine management issues.

The db sorter is composed of five structs (sketched in Go after the tables
below):

1. `DBActor` is a struct that reads (by taking iterators) and writes to
   LevelDB directly. It is shared by multiple tables. It is driven by an
   actor.
2. `TableSorter` is a struct that implements the `Sorter` interface and
   manages table-level states. It forwards `Sorter.AddEntry` to `Writer` and
   forwards `Sorter.Output` to `Reader`.
3. `Writer` is a struct that collects unordered key-value change data and
   forwards it to `DBActor`. It is driven by an actor.
4. `Reader` is a struct that reads ordered key-value change data from
   iterators.
5. `Compactor` is a garbage collector for LevelDB. It is shared by multiple
   tables. It is driven by an actor.

_Quantitative relationship between the above structs_

| Table | DBActor | TableSorter | Writer | Reader | Compactor |
| ----- | ------- | ----------- | ------ | ------ | --------- |
| N     | 1       | N           | N      | N      | 1         |

| Read Write Sequence                                                                             | Table Sorter Structs                                                                      |
| ------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------ |
| <img src="../media/db-sorter-sequence.svg?sanitize=true" alt="db-sorter-sequence" width="600"/> | <img src="../media/db-sorter-class.svg?sanitize=true" alt="db-sorter-class" width="600"/> |
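
A minimal sketch of how these structs could fit together, assuming a
simplified actor interface; the type and method signatures below (e.g.,
`Message`, `Poll`) are illustrative assumptions, not the actual tiflow API:

```go
package sorter

// Message is an illustrative actor mailbox message carrying either
// unordered input entries or read requests.
type Message struct {
	// ...
}

// Actor is the unit scheduled on the fixed-size goroutine pool; DBActor,
// Writer, and Compactor implement it.
type Actor interface {
	// Poll handles a batch of mailbox messages; returning false stops
	// the actor.
	Poll(msgs []Message) (running bool)
}

// Sorter is the table-level interface implemented by TableSorter.
type Sorter interface {
	// AddEntry accepts an unordered entry and forwards it to Writer.
	AddEntry(entry Message)
	// Output returns ordered entries, served by Reader from iterators.
	Output() <-chan Message
}

// DBActor is shared by all tables; it alone reads from and writes to a
// LevelDB instance.
type DBActor struct{ /* LevelDB handle, pending write batches, ... */ }

// TableSorter wires one table's Writer and Reader together.
type TableSorter struct {
	writer *Writer // actor: batches unordered entries for DBActor
	reader *Reader // reads ordered entries from iterators
}

// Writer and Reader are per-table; Compactor, like DBActor, is shared.
type (
	Writer    struct{ /* ... */ }
	Reader    struct{ /* ... */ }
	Compactor struct{ /* ... */ }
)
```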

### Encoding

#### Key

The following diagram shows the key encoding. Events are sorted by a
randomly generated unique ID, table ID, CRTs, start ts, OpType, and key.

```
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   unique ID   |    table ID   |      CRTs     |   start ts    | |  key (variable-length)  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                                 ^OpType(Put/Delete)
```

LevelDB sorts keys in ascending order, so the table sorter reads events in
that same order.

Let's say "A has a higher priority than B" means A is sorted before B.
The unique ID has the highest priority. It is assigned when the table
pipeline starts, and it serves two purposes:

1. Prevent data conflicts after table rescheduling, e.g., move out/in.
2. Table data can be deleted by the unique ID prefix after tables are
   scheduled out.

`CRTs` has a higher priority than `start ts`, because TiCDC needs to output
events in commit order.
`start ts` has a higher priority than the key, because TiCDC needs to group
events in the same transaction, and `start ts` is the transaction ID.
`OpType` has a higher priority than the key, and `Delete` has a higher
priority than `Put`, because a REPLACE SQL statement might change the key by
deleting the original key and putting a new key. TiCDC must execute the
`Delete` first; otherwise, data is lost.
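
A minimal sketch of this encoding, assuming each fixed-width field in the
diagram is a big-endian uint64 and that the hypothetical `opDelete` byte is
chosen to sort before `opPut`; the field widths and OpType values are
assumptions, not the exact tiflow encoding:

```go
package encoding

import "encoding/binary"

// OpType byte values chosen so that Delete sorts before Put.
const (
	opDelete byte = 0x00
	opPut    byte = 0x01
)

// EncodeKey builds a sortable key. Big-endian integers preserve numeric
// order under LevelDB's default bytewise comparator, which is what makes
// the priority rules above fall out of plain key comparison.
func EncodeKey(uniqueID, tableID, crts, startTs uint64, op byte, key []byte) []byte {
	buf := make([]byte, 4*8+1+len(key))
	binary.BigEndian.PutUint64(buf[0:8], uniqueID)
	binary.BigEndian.PutUint64(buf[8:16], tableID)
	binary.BigEndian.PutUint64(buf[16:24], crts)
	binary.BigEndian.PutUint64(buf[24:32], startTs)
	buf[32] = op
	copy(buf[33:], key)
	return buf
}
```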

#### Value

The value is a binary representation of an event, encoded with
MessagePack<sup>[2]</sup>.

```
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| event value (variable-length) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
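
A minimal sketch of the value round trip, using the
`github.com/vmihailenco/msgpack/v5` package as one possible MessagePack
implementation; the `Event` fields are illustrative, not the actual tiflow
event model:

```go
package encoding

import "github.com/vmihailenco/msgpack/v5"

// Event stands in for the row change event stored as the value.
type Event struct {
	CRTs    uint64
	StartTs uint64
	Key     []byte
	Value   []byte
}

// EncodeValue serializes an event into MessagePack bytes.
func EncodeValue(e *Event) ([]byte, error) {
	return msgpack.Marshal(e)
}

// DecodeValue deserializes a value read back from an iterator.
func DecodeValue(b []byte) (*Event, error) {
	e := new(Event)
	if err := msgpack.Unmarshal(b, e); err != nil {
		return nil, err
	}
	return e, nil
}
```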

### GC

Because all events are written to LevelDB, TiCDC needs a GC mechanism to
free disk space in time.

The LevelDB sorter adopts a `DeleteRange` approach, which bulk-deletes
key-values that have been outputted. This minimizes the GC impact on
`Writer` and `Reader`, for both read/write throughput and latency.

For table movement, we bulk-delete data in the background after the table is
stopped, as sketched below.
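
A minimal sketch of deleting one table's data by its unique ID prefix, using
the `github.com/syndtr/goleveldb` API; since goleveldb exposes no native
`DeleteRange`, a batched iterate-and-delete stands in for it here:

```go
package gc

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

// deleteTableData removes every key written under a table's unique ID.
// util.BytesPrefix computes the [prefix, upper bound) key range.
func deleteTableData(db *leveldb.DB, uniqueIDPrefix []byte) error {
	iter := db.NewIterator(util.BytesPrefix(uniqueIDPrefix), nil)
	defer iter.Release()

	batch := new(leveldb.Batch)
	for iter.Next() {
		batch.Delete(iter.Key())
		// Flush periodically to bound the batch's memory usage.
		if batch.Len() >= 1024 {
			if err := db.Write(batch, nil); err != nil {
				return err
			}
			batch.Reset()
		}
	}
	if err := iter.Error(); err != nil {
		return err
	}
	return db.Write(batch, nil)
}
```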

### Snapshot and Iterator

The LevelDB sorter limits the total number and the maximum alive time of
snapshots and iterators, because they pin memtables and keep obsolete SST
files on disk. Too many concurrent snapshots and iterators can easily cause
performance and stability issues.
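
A minimal sketch of both limits, using a counting semaphore for the total
number and a timer for the maximum alive time; the limit values and helper
names are hypothetical:

```go
package sorter

import (
	"context"
	"sync"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
)

const (
	maxSnapshots = 4                // hypothetical concurrency limit
	maxAliveTime = 10 * time.Second // hypothetical max alive time
)

// sem is a counting semaphore; each open snapshot holds one token.
var sem = make(chan struct{}, maxSnapshots)

// acquireSnapshot blocks until a slot is free, opens a snapshot, and arms a
// timer that force-releases it if it outlives maxAliveTime. The returned
// function releases the snapshot early and is safe to call more than once.
func acquireSnapshot(ctx context.Context, db *leveldb.DB) (*leveldb.Snapshot, func(), error) {
	select {
	case sem <- struct{}{}:
	case <-ctx.Done():
		return nil, nil, ctx.Err()
	}
	snap, err := db.GetSnapshot()
	if err != nil {
		<-sem
		return nil, nil, err
	}
	var once sync.Once
	release := func() {
		once.Do(func() {
			snap.Release() // unpin memtables and obsolete SST files
			<-sem          // free the slot for other tables
		})
	}
	time.AfterFunc(maxAliveTime, release)
	return snap, release, nil
}
```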

### Unexpected conditions

LevelDB has two kinds of unexpected conditions:

1. Disk full: the disk is full, and TiCDC can no longer write data.
2. I/O error: the hardware reports an error, and data may be corrupted.

In either case, TiCDC should stop changefeeds immediately.

### Latency

Data can only be read after it has been written to LevelDB, which adds extra
latency to the changefeed replication lag. The latency ranges from
sub-millisecond to minutes (during write stalls), depending on the upstream
write QPS.

As a future optimization, we can implement a storage layer that keeps data
in memory or on disk depending on the data size.

## Test Design

The LevelDB sorter is an internal optimization. For tests, we focus on
scenario tests and benchmarks.

### Functional Tests

Regular unit tests and integration tests; no additional tests are required
(the proposal does not add or remove any functionality).

#### Unit test

Coverage should be more than 75% in newly added code.

### Scenario Tests

- Regular unit tests and integration tests.
- 1 TiCDC node and 12 TiKV nodes with 100K tables and 270K regions.
  - No OOM.
  - Changefeed lag should be less than 1 minute.

We will test the scenario of replicating 100,000 tables in one TiCDC node.

### Compatibility Tests

#### Compatibility with other features/components

Should be compatible with other features.

#### Upgrade Downgrade Compatibility

The sorter cleans up on-disk files when TiCDC starts, so there should be no
upgrade or downgrade compatibility issues.

## Impacts & Risks

N/A

## Investigation & Alternatives

N/A

## Unresolved Questions

N/A

[1]: https://github.com/pingcap/tiflow/issues/2698
[2]: https://msgpack.org