# Proposal: Async Checkpoint Flush

- Author(s):    [will](https://github.com/db-will)
- Last updated: 2021-10-12

## Abstract
This proposal introduces an asynchronous mechanism for flushing checkpoints in order to improve the performance of DM incremental replication.

Table of contents:

- [Background](#Background)

## Background
In DM's incremental replication, a checkpoint corresponds to the binlog position and other information of a binlog event that has been successfully parsed and migrated to the downstream. Currently, a checkpoint is updated after a DDL operation is successfully migrated, or 30 seconds after the last update.

Every 30 seconds, DM synchronously flushes checkpoints to the downstream database. The synchronous flush blocks DM from pulling and parsing upstream binlog events, which hurts DM's overall latency and throughput.

Since the checkpoint was not designed to provide exact replication progress (thanks to [safe mode](https://docs.pingcap.com/tidb-data-migration/stable/glossary#safe-mode)), it only needs to guarantee that all upstream binlog events before the saved checkpoint have been replicated. Therefore, instead of synchronously flushing the checkpoint every 30 seconds, we can flush it asynchronously without blocking DM from pulling or processing upstream binlog events, which improves DM's incremental replication throughput and latency.

Previously, [gmh](https://github.com/GMHDBJD) conducted performance testing to compare DM QPS with checkpoint flushing enabled and disabled. It showed a 15% ~ 30% QPS increase when checkpoint flushing is disabled. [glorv](https://github.com/glorv) created the initial proposal for async checkpoint flush. On the community side, there is a stale PR, [WIP: async flush checkpoints](https://github.com/pingcap/dm/pull/1627), but it has not been updated for a long time.

## Goals
With asynchronous checkpoint flush:
- we should see improvements in DM incremental replication QPS and latency.
- it should guarantee that all events before the checkpoint have been flushed to the downstream.
- it should be upgradable and compatible with previous versions.

## Current Implementation - Synchronous Checkpoint Flush
![DM synchronous checkpoint flush](../media/synchronously-checkpoint-flush.png)

As shown in the diagram above, the current bottleneck is the red box, "Add Commit Job to Queue": every 30 seconds, when a checkpoint flush is needed, it blocks further pulling and parsing of upstream binlog events.

### How Causality Works With Checkpoint
Here is an example of how DMLs generate causality relations:
```
create table t1 (id int not null primary key, username varchar(60) not null unique key, pass varchar(60) not null unique key);

insert into t1 values (3, 'a3', 'b3');
insert into t1 values (5, 'a5', 'b5');
insert into t1 values (7, 'a7', 'b7');
```

DM causality relations:
```
{
  "3.id.`only_dml`.`t1`":"3.id.`only_dml`.`t1`",
  "5.id.`only_dml`.`t1`":"5.id.`only_dml`.`t1`",
  "7.id.`only_dml`.`t1`":"7.id.`only_dml`.`t1`",
  "a3.username.`only_dml`.`t1`":"3.id.`only_dml`.`t1`",
  "a5.username.`only_dml`.`t1`":"5.id.`only_dml`.`t1`",
  "a7.username.`only_dml`.`t1`":"7.id.`only_dml`.`t1`",
  "b3.pass.`only_dml`.`t1`":"3.id.`only_dml`.`t1`",
  "b5.pass.`only_dml`.`t1`":"5.id.`only_dml`.`t1`",
  "b7.pass.`only_dml`.`t1`":"7.id.`only_dml`.`t1`",
}
```

For each DML generated from an upstream binlog event, DM creates a list of causality keys from the primary key and unique keys involved in the DML, and adds those keys into a global causality relation map. If a DML contains unique keys, the entries for those unique keys point to the primary key, as shown in the causality relation map above.
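
As a rough illustration, the key derivation could look like the sketch below. The function and field names are hypothetical and only mirror the `value.column.schema.table` layout from the example above; this is not DM's actual code.

```go
package sketch

import "fmt"

// buildCausalityKeys derives causality keys for one row change, using the
// value.column.`schema`.`table` layout shown in the example above.
// The primary-key key is always placed first.
func buildCausalityKeys(schema, table, pkCol string, uniqueCols []string, row map[string]string) []string {
	keys := []string{fmt.Sprintf("%s.%s.`%s`.`%s`", row[pkCol], pkCol, schema, table)}
	for _, col := range uniqueCols {
		keys = append(keys, fmt.Sprintf("%s.%s.`%s`.`%s`", row[col], col, schema, table))
	}
	return keys
}

// addToCausality records every key in the global relation map; non-PK keys
// point back to the primary-key key, as in the relation map above.
func addToCausality(relations map[string]string, keys []string) {
	for _, k := range keys {
		relations[k] = keys[0] // keys[0] is the primary-key key
	}
}
```

For example, inserting `(3, 'a3', 'b3')` would yield the keys `3.id...`, `a3.username...`, and `b3.pass...`, all mapping to the primary-key key `3.id...`, matching the relation map above.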

Causality is used to prevent DMLs related to the same primary key or unique key from being executed out of order. For example:
```
create table t1(id int primary key, sid varchar(60));
update t1 set sid='b' where id = 1;
update t1 set sid='c' where id = 1;
update t1 set sid='d' where id = 1;
```

With the causality mechanism, these DMLs are guaranteed to be executed in the same order as in the upstream database.
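
DMLs whose keys resolve to the same group are routed to the same job channel and thus stay in order; when a new DML's keys span different existing groups, a conflict is reported and pending jobs are flushed first. A minimal sketch of the conflict check, continuing the illustrative sketch above (not DM's exact code):

```go
package sketch

// detectConflict reports whether any of the new DML's keys already belongs to
// a different key group in the relation map. When it returns true, the
// pending jobs must be flushed before the new DML is queued, so rows sharing
// a primary/unique key keep their upstream execution order.
func detectConflict(relations map[string]string, keys []string) bool {
	var seenGroup string
	for _, k := range keys {
		if group, ok := relations[k]; ok {
			if seenGroup != "" && group != seenGroup {
				return true
			}
			seenGroup = group
		}
	}
	return false
}
```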

## Proposed Implementation - Asynchronous Checkpoint Flush
![DM asynchronous checkpoint flush](../media/asynchronously-checkpoint-flush.png)
The components colored orange are the new components added for async checkpoint flush.

Instead of blocking to flush the checkpoint, the idea of asynchronous checkpoint flush is to add a checkpoint snapshot message into each job channel and forward it to a checkpoint snapshot message processing unit, which continuously updates a central checkpoint snapshot map. At the same time, we periodically scan the central checkpoint snapshot map, flush the most recent checkpoint snapshot for which every job channel's snapshot message has been received, and clean up flushed checkpoints in the map.

The central checkpoint snapshot map uses a checkpoint snapshot id as the key and a map from job channel id to bool as the value.

The checkpoint snapshot id is a formatted timestamp (e.g. `0102 15:04:05`) recording when the snapshot was created, so snapshots can easily be sorted by creation time.

The job channel id to bool map indicates which job channels' checkpoint snapshot messages have been received. Once messages from all job channels have been received, the checkpoint snapshot can be flushed, as sketched below.
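
A minimal sketch of this bookkeeping, assuming a fixed, known number of job channels; the names and the exact flush policy below are illustrative, not a final design:

```go
package sketch

import "sort"

// snapshotID is a sortable, time-based identifier, e.g. "0102 15:04:05".
type snapshotID string

// snapshotTracker records, per checkpoint snapshot, which job channels have
// already delivered that snapshot's message.
type snapshotTracker struct {
	numChannels int
	pending     map[snapshotID]map[int]bool // snapshot id -> acked channel ids
}

// markReceived records that job channel ch has delivered snapshot id.
func (t *snapshotTracker) markReceived(id snapshotID, ch int) {
	if t.pending[id] == nil {
		t.pending[id] = make(map[int]bool)
	}
	t.pending[id][ch] = true
}

// latestReady returns the most recent snapshot acknowledged by every job
// channel and removes it, together with all older snapshots, from the map.
func (t *snapshotTracker) latestReady() (snapshotID, bool) {
	var ready []snapshotID
	for id, acked := range t.pending {
		if len(acked) == t.numChannels {
			ready = append(ready, id)
		}
	}
	if len(ready) == 0 {
		return "", false
	}
	sort.Slice(ready, func(i, j int) bool { return ready[i] < ready[j] })
	latest := ready[len(ready)-1]
	for id := range t.pending {
		if id <= latest {
			delete(t.pending, id)
		}
	}
	return latest, true
}
```

The background flush routine would call `latestReady` on its flush interval and persist the checkpoint position associated with the returned snapshot.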

However, one missing piece is that the causality structure is left filled with the causality keys of jobs that have already been flushed, so those stale keys can falsely trigger conflicts (and thus forced flushes) even though the conflicting keys' jobs have already been executed and flushed. We still need to explore which workloads could cause a serious performance impact on async checkpoint flush because of this.

### Causality Optimization
![DM asynchronous checkpoint flush with causality optimization](../media/asynchronously-checkpoint-flush-with-causality-opt.png)
The components colored orange are the new components added for async checkpoint flush with causality optimization.

When a batch of SQL jobs is executed in syncDML, we can collect the causality keys of those executed jobs into a separate "causality keys to delete" data structure (exact structure TBD). A separate routine runs in the background to purge the keys to delete and update the current causality map at a certain time interval.

Whenever we detect a causality conflict while processing an upstream binlog event, we can first apply the causality-to-delete keys to the (possibly out-of-date) causality map and then recheck whether the conflict still holds. Only if it does do we need to pause and flush all jobs in the channels.

This optimization keeps the causality map up to date and hence avoids forced flushes caused by conflicts on stale causality keys. A sketch of the idea follows.
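
A minimal sketch of the causality-to-delete structure, assuming the conflict path can reuse a `detectConflict`-style helper like the one sketched earlier; the RFC intentionally leaves the real data structure TBD:

```go
package sketch

import "sync"

// deletedKeys accumulates causality keys whose jobs have already been
// executed, so they can later be purged from the causality relation map.
type deletedKeys struct {
	mu   sync.Mutex
	keys []string
}

// add is called by syncDML after a batch of jobs has been executed.
func (d *deletedKeys) add(keys ...string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.keys = append(d.keys, keys...)
}

// purge removes the collected keys from the relation map and resets the
// collection. A background routine can call it on a timer, and the conflict
// path can call it before rechecking a suspected conflict.
func (d *deletedKeys) purge(relations map[string]string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, k := range d.keys {
		delete(relations, k)
	}
	d.keys = d.keys[:0]
}
```

On a suspected conflict, the syncer would call `purge` first and then re-run the conflict check; only if the conflict still holds does it fall back to pausing and flushing all job channels.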

## Testing
A separate testing doc needs to be created to fully design the test cases, configurations, workloads, failure points, etc. In this section, we describe the key points we need to evaluate for the performance and compatibility of the asynchronous checkpoint flush feature.

### Performance
For performance testing, we would like to evaluate the asynchronous checkpoint flush feature from the following aspects:
- Compared to the synchronous implementation, what is the performance gain?
- Find workloads/configs that easily violate the causality check, and measure how that impacts performance. Does the asynchronous method still perform better than the existing synchronous one?
- Try to find workloads/configs that make the checkpoint message queue busy/full, and check whether performance is impacted.
- Check the actual performance impact of different checkpoint message intervals and checkpoint flush intervals.

### Compatibility
For compatibility testing, we would like to evaluate the compatibility of async checkpoint flush with DM operations, safe mode, and previous DM versions.
- Evaluate whether DM keeps working correctly after being paused and restarted at different points
- Evaluate whether upgrading DM from a lower version to the current version works
- Evaluate whether async checkpoint flush is compatible with safe mode

## Metrics & Logs
Since extra components are introduced with asynchronous checkpoint flush, we need to add related metrics & logs to help us observe the DM cluster. The following design is subject to change during the development phase.

To get a better view of these new metrics and their impact on performance, we can add them to the [DM Grafana dashboard](https://github.com/pingcap/dm/blob/master/dm/dm-ansible/scripts/DM-Monitor-Professional.json).

### Checkpoint
#### Metrics
With the following metrics, we can measure the time span of a checkpoint snapshot message (see the sketch after this list):
- To understand the start time of a checkpoint snapshot message, record the checkpoint position when the message is pushed into the job channels
- To understand the end time of a checkpoint snapshot message, record the checkpoint position when the [checkpoint msg processing routine] finds that a recent checkpoint is ready to flush, while processing the checkpoint messages received from the channels
- To understand the flush time of a checkpoint, record the checkpoint when it is flushed in the [checkpoint flush routine]
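
A sketch of how such metrics might be declared with the Prometheus Go client; the metric names, labels, and buckets below are placeholders to be finalized during development:

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

var (
	// Time from a snapshot message being pushed into the job channels until
	// the snapshot is considered ready to flush.
	checkpointSnapshotReadyDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "dm",
			Subsystem: "syncer",
			Name:      "checkpoint_snapshot_ready_duration_seconds",
			Help:      "time from pushing a checkpoint snapshot msg to it being ready to flush",
			Buckets:   prometheus.DefBuckets,
		}, []string{"task"})

	// Time spent actually flushing the checkpoint to the downstream.
	checkpointFlushDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "dm",
			Subsystem: "syncer",
			Name:      "checkpoint_flush_duration_seconds",
			Help:      "time taken by the checkpoint flush routine to flush a snapshot",
			Buckets:   prometheus.DefBuckets,
		}, []string{"task"})
)

func init() {
	prometheus.MustRegister(checkpointSnapshotReadyDuration, checkpointFlushDuration)
}
```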
#### Logs
- Log the checkpoint when it is pushed into the job channels (need to evaluate the performance impact due to the large amount of logs generated)
- Log the checkpoint when it is ready to flush, i.e. after all job channels' messages have been received (need to evaluate the performance impact due to the large amount of logs generated)
- Log flushed and purged checkpoints
### Causality
#### Metrics
To understand how many causality keys are added, we record the number of causality keys added into the central causality structure; its sum gives the total number of causality keys added so far.

To understand how many causality keys are reset, we record the number of causality keys removed when the causality structure is reset.

Combining the two metrics above gives the number of live causality keys at any given moment.

Similar metrics apply to the causality-to-delete structure, and we also need to monitor its size.
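
Again as a placeholder-only sketch with the Prometheus Go client (names are illustrative); the live key count at any moment is roughly `added - reset`:

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

var (
	causalityKeysAdded = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "dm",
			Subsystem: "syncer",
			Name:      "causality_keys_added_total",
			Help:      "causality keys added to the central causality structure",
		}, []string{"task"})

	causalityKeysReset = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "dm",
			Subsystem: "syncer",
			Name:      "causality_keys_reset_total",
			Help:      "causality keys removed when the causality structure is reset",
		}, []string{"task"})
)

func init() {
	prometheus.MustRegister(causalityKeysAdded, causalityKeysReset)
}
```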
#### Logs
- Log whenever the causality or causality-to-delete structure is reset
- Log causality keys when they are added into the causality or causality-to-delete structure

Both pieces of information can be used for debugging causality-related issues.

## Milestones
### Milestone 1 - Asynchronous Flush Checkpoint
For the first milestone, we will work on the asynchronous checkpoint flush implementation and measure the impact of causality key conflicts.

In this stage, we focus on delivering code fast and testing it to understand its performance impact.

The major work for this milestone:
- Asynchronous checkpoint flush implementation
- Checkpoint-related metrics and logs
- Performance measurement of asynchronous checkpoint flush (especially under workloads that easily trigger causality conflicts); no need for comprehensive performance testing here, we focus on finding potential performance bottlenecks/impacts
- Control memory consumption when there are no causality conflicts for a long time

### Milestone 2 - Asynchronous Flush Checkpoint With Causality Optimization
For the second milestone, if we find that causality conflicts seriously hurt async checkpoint flush performance, we will work on the causality optimization.

In this stage, we focus on the code quality of the async checkpoint flush implementation. The major work for this milestone:
- Asynchronous checkpoint flush implementation with causality optimization
- Causality-related metrics and logs
- Performance measurement of the implementation with causality optimization (especially under workloads that easily trigger causality conflicts); no need for comprehensive performance testing here, we focus on finding potential performance bottlenecks/impacts
- Control memory consumption when there are no causality conflicts for a long time

### Milestone 3 - Testing
At this final stage, we will focus on comprehensive performance, compatibility, functionality, and integration testing of asynchronous checkpoint flush.
The major work for this milestone:
- Testing design doc for asynchronous checkpoint flush
- Configurable checkpoint flush interval, with support for dynamic updates
- Conduct the various tests on the feature
- Optionally, try DM with a mock customer workload