
# Continuous Data Validation

## Background

TiDB already has [`sync-diff-inspector`](https://docs.pingcap.com/tidb/stable/sync-diff-inspector-overview) to validate data for full data migration, but there is no such option for incremental data migration. We could run `sync-diff-inspector` during database downtime for maintenance to validate all incremental data migrated since the last full data validation, but the maintenance window is usually too short for `sync-diff-inspector` to complete, and by then it might be too late to find any incorrectly migrated rows. We need an alternative that validates incremental data in a more real-time way, so that incorrectly migrated rows are found earlier, and that puts less pressure on the upstream and downstream databases, so business queries can run normally during validation.

## Limitations

Some known limitations of this design:
- a table that needs validation must have a primary key or a non-null unique key
- there should only be compatible DDL operations in the binlog stream between the current syncer location and the validation location, so that the validator can use the current DDL in schema-tracker to validate row changes in historical binlog:
  - no operation that changes a primary key or a non-null unique key
  - no operation that changes column order
  - no drop column operation
  - a downstream table that needs validation cannot be dropped
- validation is not supported for tasks that enable extend-column
- validation is not supported for tasks that enable event filtering by expression
- TiDB implements floating point data types differently from MySQL, so we treat two floating point values as equal if their absolute difference is < 10^-6 (see the sketch after this list)
- validation is not supported for JSON and binary data types
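
To make the floating point rule above concrete, here is a minimal sketch of such a comparison; the package and helper name are illustrative, not DM's actual code:

```go
package validator

import "math"

// floatsEqual treats two floating point values as equal when their absolute
// difference is below 1e-6, since TiDB and MySQL may store slightly different
// representations. Illustrative helper, not DM's actual code.
func floatsEqual(a, b float64) bool {
	return math.Abs(a-b) < 1e-6
}
```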

## Concepts

- `row change`: the validator decodes upstream binlog into `row change`s (a rough struct sketch follows this list), each of which contains:
    - change type (`insert`/`update`/`delete`)
    - table name
    - data before the change (missing for an `insert` change)
    - data after the change (missing for a `delete` change)
- `failed row change`: a `row change` that has failed validation but hasn't been marked as an `error row change` yet
- `error row change`: a `failed row change` that keeps failing validation for long enough (`delay time`) is marked as an `error row change`
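
As a rough illustration, a `row change` could be represented by a struct like the one below; all names and field types are assumptions for this sketch, not DM's actual definitions:

```go
package validator

// ChangeType is the type of a row change decoded from binlog.
type ChangeType int

const (
	Insert ChangeType = iota
	Update
	Delete
)

// RowChange is one decoded row change; names here are illustrative only.
type RowChange struct {
	Type       ChangeType
	Table      string // qualified table name after routing and filtering
	BeforeData []any  // data before the change; nil for an insert
	AfterData  []any  // data after the change; nil for a delete
}
```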

## Detailed Design

### Life cycle of validator

The validator can be enabled together with the task or enabled on the fly; a task that enables the validator must have an incremental migration unit, i.e. syncer. The validator is deleted when the task is deleted. An enabled validator is in the `running` state; it can be stopped manually, or stopped automatically on an error it cannot recover from, subsequently turning to the `stopped` state. A validator in the `stopped` state can be started again and returns to the `running` state. A minimal sketch of these transitions follows.
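
Assuming illustrative stage names rather than DM's actual constants:

```go
package validator

// Stage is the validator's lifecycle state. Illustrative names only.
type Stage int

const (
	Running Stage = iota
	Stopped
)

// canTransition reports whether the lifecycle allows moving between stages:
// running -> stopped (manual stop or unrecoverable error), and
// stopped -> running (started again).
func canTransition(from, to Stage) bool {
	return (from == Running && to == Stopped) ||
		(from == Stopped && to == Running)
}
```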

### Validation process

1. The validator in `running` state pulls binlog from upstream and decodes it into `row change`s.
    - the validator only validates a `row change` that has already been migrated by syncer
2. After routing and filtering with the same rules as in syncer, each `row change` is dispatched to a validator `worker`. `row change`s of the same table and primary key are dispatched to the same `worker`.
3. The `worker` merges `row change`s by table and primary key, keeping only the last `row change` since a later change overrides the previous one, and puts it into `pending row changes` (see the sketch after this list).
4. After accumulating enough `row change`s or after a set interval, the `worker` queries the downstream to fetch the replicated data for those `row change`s, then compares each `row change` with its downstream counterpart:
    - for an `insert`/`update` type of `row change`, validation depends on the validation mode
      - in `full` validation mode, we compare them column by column
      - in `fast` validation mode, we only check the row's existence
    - for a `delete` type of `row change`, downstream should not contain that row
5. `row change`s that pass validation are removed from `pending row changes`; those that fail are marked as `failed row change`s and validated again after a set interval.
6. If a `failed row change` still doesn't pass validation after a set time (`delay time`) since its first validation, we mark it as an `error row change` and save it into the meta database.
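
As a rough illustration of steps 3-5, the sketch below reuses the illustrative `RowChange` type from the Concepts section; the key encoding, batching, and column comparison are simplified assumptions, not DM's actual implementation:

```go
package validator

// rowKey identifies a row by table and encoded primary key.
type rowKey struct {
	table string
	pk    string
}

// worker holds the `pending row changes` for the rows dispatched to it.
type worker struct {
	pending map[rowKey]*RowChange
}

// accept implements step 3: the last change to a row overrides earlier ones.
func (w *worker) accept(key rowKey, rc *RowChange) {
	w.pending[key] = rc
}

// validateBatch implements steps 4-5 against a snapshot of downstream rows
// fetched for the pending keys, using `full` mode for insert/update;
// `fast` mode would only check existence.
func (w *worker) validateBatch(downstream map[rowKey][]any) {
	for key, rc := range w.pending {
		row, exists := downstream[key]
		var ok bool
		switch rc.Type {
		case Delete:
			ok = !exists // the row must be absent downstream
		case Insert, Update:
			ok = exists && rowsEqual(rc.AfterData, row)
		}
		if ok {
			delete(w.pending, key) // step 5: passed, drop from pending
		}
		// Failed changes stay in pending as `failed row change`s and are
		// retried later; after `delay time` they become `error row change`s.
	}
}

// rowsEqual is a naive column comparison; real code would need type-aware
// comparison (e.g. the float tolerance mentioned in Limitations).
func rowsEqual(a, b []any) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```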

### False positives

An `error row change` produced by the validation process is not necessarily incorrectly migrated data; there are cases where a `row change` is marked falsely. Suppose some row keeps changing on upstream for a period longer than `delay time`: since it's marked as a `failed row change` from the first time it changes, the validator may falsely mark it as an `error row change`. In real-world scenarios this is not common.

To reduce the chance of false positives, the validator will not start marking `failed row change`s until it has reached the progress of syncer or until some `initial delay time` has passed.

### Validation checkpoint

The validator resumes validation from the previously saved location after failover or resuming. At a set interval, the validator saves the current location, the current `pending row changes`, and the `error row change`s into the meta database. A sketch of this snapshot follows.
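
Assuming the illustrative types from the earlier sketches, the persisted snapshot might look like this; the names and meta database schema here are assumptions:

```go
package validator

// checkpoint is the snapshot flushed to the meta database at a set interval.
// On failover or resuming, the validator reloads it and continues from
// Location. Illustrative only, not DM's actual schema.
type checkpoint struct {
	Location string                // binlog position validated up to
	Pending  map[rowKey]*RowChange // `pending row changes`
	Errors   []*RowChange          // `error row change`s
}
```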

### Self-protection

A validator `worker` caches `pending row changes` in memory, so to avoid potential OOM we add a self-protection mechanism: if there are too many `pending row changes`, or their overall size is too large, the validator stops itself automatically. A minimal sketch of this check follows.
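
The limits below are made-up placeholders for illustration, not DM's actual defaults:

```go
package validator

// Made-up placeholder limits for illustration only.
const (
	maxPendingRowCount = 500_000
	maxPendingRowSize  = 500 << 20 // 500 MiB
)

// shouldStop reports whether the validator must stop itself to avoid OOM
// because the in-memory `pending row changes` have grown too large.
func shouldStop(pendingCount, pendingBytes int) bool {
	return pendingCount > maxPendingRowCount || pendingBytes > maxPendingRowSize
}
```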