# TiCDC Design Documents

- Author(s): [hi-rustin](https://github.com/hi-rustin)
- Tracking Issue: https://github.com/pingcap/tiflow/issues/3082

## Table of Contents

- [Introduction](#introduction)
- [Motivation or Background](#motivation-or-background)
- [Detailed Design](#detailed-design)
- [Test Design](#test-design)
  - [Functional Tests](#functional-tests)
  - [Scenario Tests](#scenario-tests)
  - [Compatibility Tests](#compatibility-tests)
  - [Benchmark Tests](#benchmark-tests)
- [Impacts & Risks](#impacts--risks)
- [Investigation & Alternatives](#investigation--alternatives)
- [Unresolved Questions](#unresolved-questions)

## Introduction

This document provides a complete design for implementing a column selector in the TiCDC MQ sink. The feature is currently available only for the `canal-json` protocol.

## Motivation or Background

TiCDC is a change data capture tool for TiDB. It replicates change data from an upstream TiDB cluster to various downstream systems, including MySQL-compatible databases and message queue systems such as Kafka. When synchronizing data to Kafka, TiCDC currently replicates whole rows, but in some scenarios users need only a few of the columns that have changed.

To serve these scenarios, TiCDC needs to support synchronizing only specified columns. The key requirements of this feature are as follows:

- Support for synchronizing multiple columns
- Flexible column configuration support

## Detailed Design

This solution introduces a new configuration item that specifies which columns of which tables the sink synchronizes.

### Column selector configuration format

The configuration item is added to the TiCDC changefeed configuration file. It takes effect only when the protocol is `canal-json`; specifying it with any other protocol reports an error.

```toml
[sink]
dispatchers = [
    {matcher = ['test1.*', 'test2.*'], dispatcher = "ts"},
    {matcher = ['test3.*', 'test4.*'], dispatcher = "rowid"},
]

# Column selectors are accepted only with the canal-json protocol.
protocol = "canal-json"

column-selectors = [
    {matcher = ['test1.*', 'test2.*'], columns = ["Column selector expression"]},
    {matcher = ['test1.*', 'test2.*'], columns = ["Column selector expression"]},
]
```

The new configuration item, `column-selectors`, is an array of selectors. Each selector consists of a `matcher` array and a `columns` array, which allows multiple tables and multiple columns to be selected.

The `matcher` rules for tables are the same as the [TiDB table filter rules]. The column selector rules are explained in detail below.

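As a rough illustration of the rule that `column-selectors` is accepted only with `canal-json`, the following Go sketch validates a sink configuration. The `SinkConfig` and `ColumnSelector` types, the `Validate` method, and the error text are hypothetical names chosen for this example, not TiCDC's actual API. Rejecting empty `matcher` or `columns` arrays is also an assumption.

```go
package main

import "fmt"

// ColumnSelector mirrors one entry of `column-selectors`.
// The type name is hypothetical and used only for this sketch.
type ColumnSelector struct {
	Matcher []string // table filter rules, e.g. "test1.*"
	Columns []string // column selector expressions
}

// SinkConfig holds only the fields this sketch needs (hypothetical type).
type SinkConfig struct {
	Protocol        string
	ColumnSelectors []ColumnSelector
}

// Validate reports an error when column selectors are configured together
// with a protocol other than canal-json. As an extra assumption, it also
// rejects selectors whose matcher or columns array is empty.
func (c SinkConfig) Validate() error {
	if len(c.ColumnSelectors) == 0 {
		return nil // no selectors configured: columns are not filtered
	}
	if c.Protocol != "canal-json" {
		return fmt.Errorf("column-selectors is not supported for protocol %q", c.Protocol)
	}
	for i, s := range c.ColumnSelectors {
		if len(s.Matcher) == 0 || len(s.Columns) == 0 {
			return fmt.Errorf("column-selectors[%d]: matcher and columns must not be empty", i)
		}
	}
	return nil
}

func main() {
	cfg := SinkConfig{
		Protocol: "open-protocol",
		ColumnSelectors: []ColumnSelector{
			{Matcher: []string{"test1.*"}, Columns: []string{"id", "name"}},
		},
	}
	// Prints an error because the protocol is not canal-json.
	fmt.Println(cfg.Validate())
}
```

In this sketch a configuration without any selector passes validation under every protocol, which matches the statement that columns are not filtered when the item is absent.
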
### Column selector expression details

Column selector rules are parsed in much the same way as matcher rules, but they apply to columns instead of tables.

- Use column names directly
  - Only columns whose names match the rule exactly are accepted.
- Use wildcards
  - `*`: matches zero or more characters
  - `?`: matches one character
  - `[a-z]`: matches one character between "a" and "z" inclusively
  - `[!a-z]`: matches one character except "a" to "z"
  - "Character" here means a Unicode code point
- Exclusion
  - An `!` at the beginning of a rule means the pattern after it is used to exclude columns from being processed

Column names are case-insensitive on every platform, as are column aliases. Rules are applied in the same order as table filter rules: they are scanned from back to front, and the first rule that matches decides the outcome.

Some examples:

- `matcher = ['test1.student_*'], columns = ["id", "name"]`
  - For all tables prefixed with `student_` in the schema `test1`, only the `id` and `name` columns are synchronized.
- `matcher = ['test1.t1'], columns = ["*", "!name"]`
  - For the `test1.t1` table, all columns except `name` are synchronized.
- `matcher = ['test1.t2'], columns = ["src*", "!src1"]`
  - For the `test1.t2` table, all columns prefixed with `src` are synchronized, except for `src1`.
- `matcher = ['test1.t3'], columns = ["sdb?c"]`
  - For the `test1.t3` table, all columns of the form `sdb?c` are synchronized, where `?` stands for exactly one character, such as `sdb1c`, `sdboc`, or `sdb-c`.

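The matching rules above can be sketched in Go as follows. This is a minimal illustration, not TiCDC's implementation: `selectColumn` is a hypothetical helper, it leans on the standard library's `path.Match` for wildcards (which does not let `*` or `?` match `/`), and it assumes that a column matched by no rule is not synchronized.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// selectColumn reports whether column is selected by rules.
// Rules are scanned from back to front and the first rule that matches
// decides the outcome. A leading "!" turns a rule into an exclusion.
// Assumption: a column that matches no rule is not selected.
func selectColumn(rules []string, column string) (bool, error) {
	// Column names are case-insensitive, so compare in lower case.
	name := strings.ToLower(column)
	for i := len(rules) - 1; i >= 0; i-- {
		rule := strings.ToLower(rules[i])
		exclude := strings.HasPrefix(rule, "!")
		if exclude {
			rule = rule[1:]
		}
		// path.Match writes a negated class as "[^a-z]", while the rules in
		// this document use "[!a-z]". This simple rewrite is enough for a
		// sketch but does not handle every possible bracket expression.
		pattern := strings.ReplaceAll(rule, "[!", "[^")
		matched, err := path.Match(pattern, name)
		if err != nil {
			return false, err
		}
		if matched {
			return !exclude, nil
		}
	}
	return false, nil
}

func main() {
	rules := []string{"src*", "!src1"}
	for _, col := range []string{"src0", "SRC1", "dst"} {
		ok, err := selectColumn(rules, col)
		fmt.Println(col, ok, err)
	}
}
```

With the rules `["src*", "!src1"]`, this sketch selects `src0` and rejects both `SRC1` (excluded, compared case-insensitively) and `dst` (matched by no rule).
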
## Test Design

This functionality will be covered mainly by unit and integration tests. We also need to design a testing framework for it.

### Functional Tests

#### Unit test

Unit test coverage of the newly added code should exceed 75%.

#### Integration test

- All existing integration tests must pass when the changefeed has no column selector configuration.
- Build a new mock integration test framework to validate the column selector.

### Scenario Tests

N/A

### Compatibility Tests

#### Compatibility with other features/components

Because a column selector can filter out every column, an open protocol delete event may end up with empty data. If this happens, the message is dropped.

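A minimal sketch of this empty-event handling, assuming a hypothetical `Column` type and a `keep` predicate that stands in for the column selector:

```go
package main

import "fmt"

// Column is a hypothetical name/value pair standing in for one column of a
// row change event.
type Column struct {
	Name  string
	Value any
}

// filterColumns keeps the columns accepted by keep. The second result is
// true when no column survives, which signals that the message carrying
// these columns should be dropped instead of being sent downstream.
func filterColumns(cols []Column, keep func(name string) bool) ([]Column, bool) {
	kept := make([]Column, 0, len(cols))
	for _, c := range cols {
		if keep(c.Name) {
			kept = append(kept, c)
		}
	}
	return kept, len(kept) == 0
}

func main() {
	row := []Column{{Name: "id", Value: 1}, {Name: "name", Value: "alice"}}

	// A selector that rejects every column leaves nothing to send.
	_, drop := filterColumns(row, func(string) bool { return false })
	fmt.Println("drop message:", drop)
}
```

Returning the drop decision alongside the filtered columns keeps the caller in charge of what happens to the message.
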
#### Upgrade compatibility

Columns are not filtered unless this configuration is present, so after upgrading, simply add the configuration and create a new changefeed.

#### Downgrade compatibility

Older versions of TiCDC do not recognize the new configuration, so modify the configuration and remove the changefeed before downgrading.

### Benchmark Tests

N/A

## Impacts & Risks

N/A

## Investigation & Alternatives

N/A

## Unresolved Questions

How should we build a mock integration test framework to validate the filtering rules? The framework should also be usable for validating other encodings and message formats.

[tidb table filter rules]: https://docs.pingcap.com/tidb/stable/table-filter#syntax