# TiCDC Design Documents

- Author(s): [hi-rustin](https://github.com/hi-rustin)
- Tracking Issue: https://github.com/pingcap/tiflow/issues/3082

## Table of Contents

- [Introduction](#introduction)
- [Motivation or Background](#motivation-or-background)
- [Detailed Design](#detailed-design)
- [Test Design](#test-design)
  - [Functional Tests](#functional-tests)
  - [Scenario Tests](#scenario-tests)
  - [Compatibility Tests](#compatibility-tests)
  - [Benchmark Tests](#benchmark-tests)
- [Impacts & Risks](#impacts--risks)
- [Investigation & Alternatives](#investigation--alternatives)
- [Unresolved Questions](#unresolved-questions)

## Introduction

This document provides a complete design for implementing a column selector in the TiCDC MQ Sink. This feature is currently only available for the `canal-json` protocol.

## Motivation or Background

TiCDC is a change data capture tool for TiDB. It supports replicating change data from an upstream TiDB to various kinds of downstream systems, including MySQL-compatible databases and message queue systems such as Kafka. When synchronizing data to Kafka, we currently support synchronizing whole rows only, but in some user scenarios users need just a few of the columns that have changed.

In this scenario, we need to support synchronization of specified columns. The key requirements of this feature are as follows:

- Support for synchronizing multiple columns
- Flexible column configuration support

## Detailed Design

This solution introduces a new configuration item that specifies which columns of which tables the sink needs to synchronize.

### Column selector configuration format

This configuration will be added to the TiCDC changefeed configuration file. It only takes effect when the protocol is `canal-json`.
Adding this configuration under other protocols will report an error.

```toml
[sink]
dispatchers = [
    {matcher = ['test1.*', 'test2.*'], dispatcher = "ts"},
    {matcher = ['test3.*', 'test4.*'], dispatcher = "rowid"},
]

protocol = "canal-json"

column-selectors = [
    {matcher = ['test1.*', 'test2.*'], columns = ["Column selector expression"]},
    {matcher = ['test1.*', 'test2.*'], columns = ["Column selector expression"]},
]
```

Add a new selector array configuration item named `column-selectors`. Each item consists of a matcher array and a
columns array. This allows us to support multiple table and column selections.

The matcher rules for tables are the same as the [TiDB table filter rules]. The column selector rules are
explained in detail below.

### Column selector expression details

The syntax of column selector rules is similar to that of matcher rules, but it applies to columns.

- Use column names directly
  - Only columns whose names match the rules exactly will be accepted.
- Using wildcards
  - `*`: matches zero or more characters
  - `?`: matches one character
  - `[a-z]`: matches one character between "a" and "z" inclusively
  - `[!a-z]`: matches one character except "a" to "z"
  - `Character` here means a Unicode code point
- Exclusion
  - An `!` at the beginning of the rule means the pattern after it is used to exclude columns from being processed

Column names are not case-sensitive on any platform, nor are column aliases. The matching order is the same as for filter rules: rules are checked from the last to the first, and the first rule that matches decides the result.
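
The rules above can be sketched in a few lines of Python. This is only an illustration of the described semantics, not TiCDC code (TiCDC itself is written in Go). The helper name `select_columns` is hypothetical. The sketch assumes that Python's `fnmatch` wildcards (`*`, `?`, `[a-z]`, `[!a-z]`) behave like the wildcards listed above, and that a column matching no rule is not synchronized, which is inferred from the `columns = ["id", "name"]` example in this document.

```python
from fnmatch import fnmatchcase


def select_columns(columns, rules):
    """Return the columns kept by a list of column selector rules.

    Hypothetical helper for illustration only; it is not part of TiCDC.
    """
    selected = []
    for column in columns:
        name = column.lower()  # column names are not case-sensitive
        keep = False  # assumption: a column that matches no rule is dropped
        # Check rules from last to first; the first rule that matches decides.
        for rule in reversed(rules):
            exclude = rule.startswith("!")
            pattern = rule[1:] if exclude else rule
            if fnmatchcase(name, pattern.lower()):
                keep = not exclude
                break
        if keep:
            selected.append(column)
    return selected


# Keep every column except "name".
print(select_columns(["id", "name", "age"], ["*", "!name"]))  # ['id', 'age']
# Keep columns prefixed with "src", except "src1".
print(select_columns(["src1", "src2", "dst"], ["src*", "!src1"]))  # ['src2']
```

Lowercasing both the column name and the pattern before matching is one simple way to get case-insensitive behavior; a real implementation may handle this differently.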

Some examples:

- `matcher = ['test1.student_*'], columns = ["id", "name"]`
  - For all tables prefixed with `student_` in the schema named `test1`, only the `id` and `name` columns are synchronized
- `matcher = ['test1.t1'], columns = ["*", "!name"]`
  - For the `test1.t1` table, synchronize all columns except the `name` column
- `matcher = ['test1.t2'], columns = ["src*", "!src1"]`
  - For the `test1.t2` table, synchronize all columns prefixed with `src`, except for column `src1`
- `matcher = ['test1.t3'], columns = ["sdb?c"]`
  - For the `test1.t3` table, synchronize all columns of the form `sdb?c`, where `?` represents exactly one character, such as `sdb1c`, `sdboc`, `sdb-c`

## Test Design

This functionality will be mainly covered by unit and integration tests, and we also need to design a testing framework for it.

### Functional Tests

#### Unit test

Coverage of newly added code should be more than 75%.

#### Integration test

All existing integration tests must pass when a changefeed has no column selector configuration.
Build a new mock integration test framework to validate the column selector.

### Scenario Tests

N/A

### Compatibility Tests

#### Compatibility with other features/components

Because all columns may be filtered out, an open protocol delete event may contain no data. If this happens, we will discard the message.

#### Upgrade compatibility

Columns are not filtered without this configuration, so after the upgrade just add the configuration and create a new changefeed.

#### Downgrade compatibility

The new configuration is not recognized by older versions of TiCDC, so before downgrading you need to modify the configuration and remove the changefeed.

### Benchmark Tests

N/A

## Impacts & Risks

N/A

## Investigation & Alternatives

N/A

## Unresolved Questions

How should we build a mock integration test framework to validate filtering rules? The framework should also be usable for validating other encodings and message formats.

[tidb table filter rules]: https://docs.pingcap.com/tidb/stable/table-filter#syntax