
# Continuous Data Validation

## Background

TiDB already has [`sync-diff-inspector`](https://docs.pingcap.com/tidb/stable/sync-diff-inspector-overview) to validate data for full data migration, but there is no such option for incremental data migration. We could run `sync-diff-inspector` during database downtime for maintenance to validate all incremental data migrated since the last full data validation, but the maintenance window is usually too short for `sync-diff-inspector` to complete, and by then it might be too late to find any incorrectly migrated rows. We need an alternative that validates incremental data in a more real-time way, so that incorrectly migrated rows are found earlier, and that puts less pressure on the upstream and downstream databases, so business queries can run normally during validation.

## Limitations

Some known limitations of this design:
- a table that needs validation must have a primary key or a non-null unique key
- there should only be compatible DDL operations in the binlog stream between the current syncer location and the validation location, so that the validator can use the current DDL in schema-tracker to validate row changes in historical binlog:
  - no operation that changes a primary key or a non-null unique key
  - no operation that changes column order
  - no drop column operation
  - a downstream table that needs validation cannot be dropped
- validation is not supported for tasks that enable extend-column
- validation is not supported for tasks that enable event filtering by expression
- TiDB implements floating point data types differently from MySQL, so we treat two floating point values as equal if their absolute difference is < 10^-6 (see the sketch after this list)
- validation is not supported for JSON and binary data types
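
To make the floating point rule above concrete, here is a minimal sketch of such a comparison; the package and helper name are illustrative, not DM's actual code:

```go
package validator

import "math"

// floatsEqual treats two floating point values as equal when their absolute
// difference is below 1e-6, since TiDB and MySQL may store slightly different
// representations. Illustrative helper, not DM's actual code.
func floatsEqual(a, b float64) bool {
	return math.Abs(a-b) < 1e-6
}
```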

## Concepts

- `row change`: the validator decodes upstream binlog into `row change`s (a rough struct sketch follows this list), each of which contains:
    - change type (`insert`/`update`/`delete`)
    - table name
    - data before the change (missing for an `insert` change)
    - data after the change (missing for a `delete` change)
- `failed row change`: a `row change` that has failed validation but hasn't been marked as an `error row change` yet
- `error row change`: a `failed row change` that keeps failing validation for long enough (`delay time`) is marked as an `error row change`
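
As a rough illustration, a `row change` could be represented by a struct like the one below; all names and field types are assumptions for this sketch, not DM's actual definitions:

```go
package validator

// ChangeType is the type of a row change decoded from binlog.
type ChangeType int

const (
	Insert ChangeType = iota
	Update
	Delete
)

// RowChange is one decoded row change; names here are illustrative only.
type RowChange struct {
	Type       ChangeType
	Table      string // qualified table name after routing and filtering
	BeforeData []any  // data before the change; nil for an insert
	AfterData  []any  // data after the change; nil for a delete
}
```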

## Detailed Design

### Life cycle of validator

The validator can be enabled together with the task or enabled on the fly; a task that enables the validator must have an incremental migration unit, i.e. syncer. The validator is deleted when the task is deleted. An enabled validator is in the `running` state; it can be stopped manually, or stopped automatically on an error it cannot recover from, subsequently turning to the `stopped` state. A validator in the `stopped` state can be started again and returns to the `running` state. A minimal sketch of these transitions follows.
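
Assuming illustrative stage names rather than DM's actual constants:

```go
package validator

// Stage is the validator's lifecycle state. Illustrative names only.
type Stage int

const (
	Running Stage = iota
	Stopped
)

// canTransition reports whether the lifecycle allows moving between stages:
// running -> stopped (manual stop or unrecoverable error), and
// stopped -> running (started again).
func canTransition(from, to Stage) bool {
	return (from == Running && to == Stopped) ||
		(from == Stopped && to == Running)
}
```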

### Validation process

1. The validator in `running` state pulls binlog from upstream and decodes it into `row change`s.
    - the validator only validates a `row change` that has already been migrated by syncer
2. After routing and filtering with the same rules as in syncer, each `row change` is dispatched to a validator `worker`. `row change`s of the same table and primary key are dispatched to the same `worker`.
3. The `worker` merges `row change`s by table and primary key, keeping only the last `row change` since a later change overrides the previous one, and puts it into `pending row changes` (see the sketch after this list).
4. After accumulating enough `row change`s or after a set interval, the `worker` queries the downstream to fetch the replicated data for those `row change`s, then compares each `row change` with its downstream counterpart:
    - for an `insert`/`update` type of `row change`, validation depends on the validation mode
      - in `full` validation mode, we compare them column by column
      - in `fast` validation mode, we only check the row's existence
    - for a `delete` type of `row change`, downstream should not contain that row
5. `row change`s that pass validation are removed from `pending row changes`; those that fail are marked as `failed row change`s and validated again after a set interval.
6. If a `failed row change` still doesn't pass validation after a set time (`delay time`) since its first validation, we mark it as an `error row change` and save it into the meta database.
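
As a rough illustration of steps 3-5, the sketch below reuses the illustrative `RowChange` type from the Concepts section; the key encoding, batching, and column comparison are simplified assumptions, not DM's actual implementation:

```go
package validator

// rowKey identifies a row by table and encoded primary key.
type rowKey struct {
	table string
	pk    string
}

// worker holds the `pending row changes` for the rows dispatched to it.
type worker struct {
	pending map[rowKey]*RowChange
}

// accept implements step 3: the last change to a row overrides earlier ones.
func (w *worker) accept(key rowKey, rc *RowChange) {
	w.pending[key] = rc
}

// validateBatch implements steps 4-5 against a snapshot of downstream rows
// fetched for the pending keys, using `full` mode for insert/update;
// `fast` mode would only check existence.
func (w *worker) validateBatch(downstream map[rowKey][]any) {
	for key, rc := range w.pending {
		row, exists := downstream[key]
		var ok bool
		switch rc.Type {
		case Delete:
			ok = !exists // the row must be absent downstream
		case Insert, Update:
			ok = exists && rowsEqual(rc.AfterData, row)
		}
		if ok {
			delete(w.pending, key) // step 5: passed, drop from pending
		}
		// Failed changes stay in pending as `failed row change`s and are
		// retried later; after `delay time` they become `error row change`s.
	}
}

// rowsEqual is a naive column comparison; real code would need type-aware
// comparison (e.g. the float tolerance mentioned in Limitations).
func rowsEqual(a, b []any) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```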

### False positives

An `error row change` produced by the validation process is not necessarily incorrectly migrated data; there are cases where a `row change` is marked falsely. Suppose some row keeps changing on upstream for a period longer than `delay time`: since it's marked as a `failed row change` from the first time it changes, the validator may falsely mark it as an `error row change`. In real-world scenarios this is not common.

To reduce the chance of false positives, the validator will not start marking `failed row change`s until it has reached the progress of syncer or until some `initial delay time` has passed.

### Validation checkpoint

The validator resumes validation from the previously saved location after failover or resuming. At a set interval, the validator saves the current location, the current `pending row changes`, and the `error row change`s into the meta database. A sketch of this snapshot follows.
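
Assuming the illustrative types from the earlier sketches, the persisted snapshot might look like this; the names and meta database schema here are assumptions:

```go
package validator

// checkpoint is the snapshot flushed to the meta database at a set interval.
// On failover or resuming, the validator reloads it and continues from
// Location. Illustrative only, not DM's actual schema.
type checkpoint struct {
	Location string                // binlog position validated up to
	Pending  map[rowKey]*RowChange // `pending row changes`
	Errors   []*RowChange          // `error row change`s
}
```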

### Self-protection

A validator `worker` caches `pending row changes` in memory, so to avoid potential OOM we add a self-protection mechanism: if there are too many `pending row changes`, or their overall size is too large, the validator stops itself automatically. A minimal sketch of this check follows.
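
The limits below are made-up placeholders for illustration, not DM's actual defaults:

```go
package validator

// Made-up placeholder limits for illustration only.
const (
	maxPendingRowCount = 500_000
	maxPendingRowSize  = 500 << 20 // 500 MiB
)

// shouldStop reports whether the validator must stop itself to avoid OOM
// because the in-memory `pending row changes` have grown too large.
func shouldStop(pendingCount, pendingBytes int) bool {
	return pendingCount > maxPendingRowCount || pendingBytes > maxPendingRowSize
}
```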