# TiCDC Design Documents

- Author(s): [Zhao Yilin](http://github.com/leoppro), [Zhang Xiang](http://github.com/zhangyangyu)
- Tracking Issue: https://github.com/pingcap/tiflow/issues/5338

## Table of Contents

- [TiCDC Design Documents](#ticdc-design-documents)
  - [Table of Contents](#table-of-contents)
  - [Introduction](#introduction)
  - [Motivation or Background](#motivation-or-background)
  - [Detailed Design](#detailed-design)
    - [New Config Items](#new-config-items)
    - [flat-avro Schema Definition](#flat-avro-schema-definition)
      - [Key Schema](#key-schema)
      - [Value Schema](#value-schema)
    - [DML Events](#dml-events)
    - [Schema Change](#schema-change)
    - [Subject Name Strategy](#subject-name-strategy)
    - [ColumnValueBlock and Data Mapping](#columnvalueblock-and-data-mapping)
  - [Test Design](#test-design)
    - [Functional Tests](#functional-tests)
      - [CLI Tests](#cli-tests)
      - [Data Mapping Tests](#data-mapping-tests)
      - [DML Tests](#dml-tests)
      - [Schema Tests](#schema-tests)
      - [SubjectNameStrategy Tests](#subjectnamestrategy-tests)
    - [Compatibility Tests](#compatibility-tests)
  - [Impacts & Risks](#impacts--risks)
  - [Investigation & Alternatives](#investigation--alternatives)
  - [Unresolved Questions](#unresolved-questions)

## Introduction

This document provides a complete design for refactoring the existing Avro protocol implementation. A common Avro data format is defined in order to build data pathways to various streaming systems.

## Motivation or Background

Apache Avro™ is a data serialization system with rich data structures and a compact binary data format. Avro relies on schemas, which are managed by a schema registry. Avro is a common data format in streaming systems, supported by Confluent, Flink, Debezium, etc.
## Detailed Design

### New Config Items

| Config item                        | Option values          | Default | Explain                                                                                                                                                                                                                                                                                    |
| ---------------------------------- | ---------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| protocol                           | canal-json / flat-avro | -       | Specifies the format of messages written to Kafka.<br>The `flat-avro` option means using the Avro format designed in this document.                                                                                                                                                         |
| enable-tidb-extension              | true / false           | false   | Whether to append TiDB extension fields to Avro messages.                                                                                                                                                                                                                                   |
| schema-registry                    | -                      | -       | Specifies the schema registry endpoint.                                                                                                                                                                                                                                                     |
| avro-decimal-handling-mode         | precise / string       | precise | Specifies how TiCDC should handle values of DECIMAL columns:<br>`precise` encodes decimals as precise bytes.<br>`string` encodes values as formatted strings, which is easy to consume but loses the semantic information about the real type.                                               |
| avro-bigint-unsigned-handling-mode | long / string          | long    | Specifies how TiCDC should handle values of UNSIGNED BIGINT columns:<br>`long` represents values as Avro long (a 64-bit signed integer), which might overflow but is easy to use in consumers.<br>`string` represents values as strings, which is precise but needs to be parsed by consumers. |

### flat-avro Schema Definition

`flat-avro` is an alias of the `avro` protocol. It means that all column values are placed directly inside the message with no nesting. This structure is compatible with most Confluent sink connectors, but it cannot handle `old-value`. `rich-avro` is the opposite and is reserved for future needs.
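For illustration only, the config items above could be combined in a changefeed sink URI as shown below. The broker address, topic name, registry address, and changefeed ID are placeholders, and the exact flag syntax is a sketch of the `cdc cli` interface, not a normative reference:

```
cdc cli changefeed create --changefeed-id="avro-test" \
    --sink-uri="kafka://127.0.0.1:9092/topic-name?protocol=avro&enable-tidb-extension=true&avro-decimal-handling-mode=string&avro-bigint-unsigned-handling-mode=string&schema-registry=http://127.0.0.1:8081"
```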
#### Key Schema

```
{
    "name":"{{RecordName}}",
    "namespace":"{{Namespace}}",
    "type":"record",
    "fields":[
        {{ColumnValueBlock}},
        {{ColumnValueBlock}}
    ]
}
```

- `{{RecordName}}` represents the fully qualified table name.
- `{{ColumnValueBlock}}` represents a JSON block, which defines one column value of the key.
- The key only includes the valid index fields.

#### Value Schema

```
{
    "name":"{{RecordName}}",
    "namespace":"{{Namespace}}",
    "type":"record",
    "fields":[
        {{ColumnValueBlock}},
        {{ColumnValueBlock}},
        {
            "name":"_tidb_op",
            "type":"string"
        },
        {
            "name":"_tidb_commit_ts",
            "type":"long"
        },
        {
            "name":"_tidb_commit_physical_time",
            "type":"long"
        }
    ]
}
```

- `{{RecordName}}` represents the fully qualified table name.
- `{{ColumnValueBlock}}` represents a JSON block, which defines one column value of the row.
- `_tidb_op` is used to distinguish between INSERT and UPDATE events; the possible values are "c" / "u".
- `_tidb_commit_ts` represents the CommitTS of the transaction.
- `_tidb_commit_physical_time` represents the physical timestamp of the transaction.

When `enable-tidb-extension` is `true`, `_tidb_op`, `_tidb_commit_ts`, and `_tidb_commit_physical_time` will be appended to every Kafka value. When `enable-tidb-extension` is `false`, no extension field will be appended to Kafka values.

### DML Events

If `enable-tidb-extension` is `true`, the `_tidb_op` field of an INSERT event is "c" and that of an UPDATE event is "u".

If `enable-tidb-extension` is `false`, the `_tidb_op` field is not appended to the Kafka value, so there is no way to distinguish an INSERT event from an UPDATE event.

For a DELETE event, TiCDC sends the primary key value as the Kafka key, and the Kafka value is `null`.

### Schema Change

Avro detects schema changes at every DML event instead of at DDL events.
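To make the templates above concrete, the following sketch assembles a value schema for a hypothetical table `test.person(id INT PRIMARY KEY, name VARCHAR(255))` with `enable-tidb-extension` set to `true`. The type names, namespace, and helper functions here are illustrative assumptions, not the actual TiCDC implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// avroField models one entry of the "fields" array in the value schema.
type avroField struct {
	Name string      `json:"name"`
	Type interface{} `json:"type"`
}

// avroSchema models the top-level record schema.
type avroSchema struct {
	Name      string      `json:"name"`
	Namespace string      `json:"namespace"`
	Type      string      `json:"type"`
	Fields    []avroField `json:"fields"`
}

// buildValueSchema sketches how a value schema could be assembled from the
// column blocks of a table. The extension fields are appended at the end,
// as they would be when enable-tidb-extension is true.
func buildValueSchema(namespace, table string, columns []avroField) avroSchema {
	fields := append([]avroField{}, columns...)
	fields = append(fields,
		avroField{Name: "_tidb_op", Type: "string"},
		avroField{Name: "_tidb_commit_ts", Type: "long"},
		avroField{Name: "_tidb_commit_physical_time", Type: "long"},
	)
	return avroSchema{Name: table, Namespace: namespace, Type: "record", Fields: fields}
}

func main() {
	// Hypothetical column blocks for test.person; each column's
	// connect.parameters carries the TiDB type, per the data mapping table.
	cols := []avroField{
		{Name: "id", Type: map[string]interface{}{
			"type":               "int",
			"connect.parameters": map[string]string{"tidb_type": "INT"},
		}},
		{Name: "name", Type: map[string]interface{}{
			"type":               "string",
			"connect.parameters": map[string]string{"tidb_type": "TEXT"},
		}},
	}
	out, _ := json.MarshalIndent(buildValueSchema("test", "person", cols), "", "  ")
	fmt.Println(string(out))
}
```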
Whenever there is a schema change, the Avro codec tries to register a new schema version under the corresponding subject in the schema registry. Whether it succeeds depends on the schema evolution compatibility. The Avro codec does not address any compatibility issues; it simply propagates errors.

### Subject Name Strategy

The Avro codec only supports the default `TopicNameStrategy`. This means a Kafka topic can only accept a single schema. With the multi-topic ability in TiCDC, events from multiple tables could all be dispatched to one topic, which is not allowed under `TopicNameStrategy`. So we require that, for the Avro protocol, the topic rule in the dispatcher rules must contain both the `{schema}` and `{table}` placeholders, which means each table occupies its own Kafka topic.

### ColumnValueBlock and Data Mapping

A `ColumnValueBlock` has the following schema:

```
{
    "name":"{{ColumnName}}",
    "type":{
        "connect.parameters":{
            "tidb_type":"{{TIDB_TYPE}}"
        },
        "type":"{{AVRO_TYPE}}"
    }
}
```

| SQL TYPE                                           | TIDB_TYPE                    | AVRO_TYPE | Description                                                                                                               |
| -------------------------------------------------- | ---------------------------- | --------- | ------------------------------------------------------------------------------------------------------------------------- |
| TINYINT/BOOL/SMALLINT/MEDIUMINT/INT                | INT                          | int       | When unsigned, TIDB_TYPE is INT UNSIGNED. For SQL type INT UNSIGNED, AVRO_TYPE is long.                                    |
| BIGINT                                             | BIGINT                       | long      | When unsigned, TIDB_TYPE is BIGINT UNSIGNED. If `avro-bigint-unsigned-handling-mode` is string, AVRO_TYPE is string.       |
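The topic-rule requirement can be sketched as follows. Under `TopicNameStrategy`, the registry subjects are derived from the topic name as `<topic>-key` and `<topic>-value`; the `topicFor` helper and the example rule `cdc_{schema}_{table}` below are illustrative assumptions, not TiCDC's actual dispatcher code:

```go
package main

import (
	"fmt"
	"strings"
)

// topicFor expands a dispatcher topic rule. For the Avro protocol the rule
// must contain both the {schema} and {table} placeholders, so that every
// table gets its own topic (and hence its own pair of registry subjects).
func topicFor(rule, schema, table string) (string, error) {
	if !strings.Contains(rule, "{schema}") || !strings.Contains(rule, "{table}") {
		return "", fmt.Errorf("avro topic rule %q must contain {schema} and {table}", rule)
	}
	r := strings.NewReplacer("{schema}", schema, "{table}", table)
	return r.Replace(rule), nil
}

func main() {
	topic, err := topicFor("cdc_{schema}_{table}", "test", "person")
	if err != nil {
		panic(err)
	}
	fmt.Println(topic)            // the per-table Kafka topic
	fmt.Println(topic + "-key")   // subject of the key schema
	fmt.Println(topic + "-value") // subject of the value schema
}
```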
| TINYBLOB/BLOB/MEDIUMBLOB/LONGBLOB/BINARY/VARBINARY | BLOB                         | bytes     |                                                                                                                           |
| TINYTEXT/TEXT/MEDIUMTEXT/LONGTEXT/CHAR/VARCHAR     | TEXT                         | string    |                                                                                                                           |
| FLOAT/DOUBLE                                       | FLOAT/DOUBLE                 | double    |                                                                                                                           |
| DATE/DATETIME/TIMESTAMP/TIME                       | DATE/DATETIME/TIMESTAMP/TIME | string    |                                                                                                                           |
| YEAR                                               | YEAR                         | int       |                                                                                                                           |
| BIT                                                | BIT                          | bytes     | BIT has another `connect.parameters` entry `"length":"64"`.                                                               |
| JSON                                               | JSON                         | string    |                                                                                                                           |
| ENUM/SET                                           | ENUM/SET                     | string    | ENUM/SET has another `connect.parameters` entry `"allowed":"a,b,c"`.                                                      |
| DECIMAL                                            | DECIMAL                      | bytes     | This is the Avro decimal logical type, which has `scale` and `precision`. When `avro-decimal-handling-mode` is string, AVRO_TYPE is string. |

## Test Design

### Functional Tests

#### CLI Tests

- avro/flat-avro protocol
- avro/flat-avro protocol & true/false/invalid enable-tidb-extension
- avro/flat-avro protocol & precise/string/invalid avro-decimal-handling-mode
- avro/flat-avro protocol & long/string/invalid avro-bigint-unsigned-handling-mode
- avro/flat-avro protocol & valid/invalid schema-registry

#### Data Mapping Tests

- With protocol=avro&enable-tidb-extension=false&avro-decimal-handling-mode=precise&avro-bigint-unsigned-handling-mode=long, all generated schemas and data are correct.
- With enable-tidb-extension=true, the schema and value have the `_tidb_op`, `_tidb_commit_ts`, and `_tidb_commit_physical_time` fields.
- With avro-decimal-handling-mode=string, a decimal field generates string schema and data.
- With avro-bigint-unsigned-handling-mode=string, a bigint unsigned field generates string schema and data.

#### DML Tests

- Insert a row and check it in the downstream database.
- Update a row and check it in the downstream database.
- Delete a row and check it in the downstream database.

#### Schema Tests

- When the schema is not in the schema registry, a fresh new schema is created.
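The two `avro-bigint-unsigned-handling-mode` options in the table above can be sketched with a small encoder. The function name is hypothetical; the point is the overflow behavior of the `long` mode versus the lossless `string` mode:

```go
package main

import (
	"fmt"
	"strconv"
)

// encodeBigintUnsigned sketches the two handling modes for BIGINT UNSIGNED.
// "long" reinterprets the value as a signed 64-bit integer, so values above
// math.MaxInt64 wrap to negative numbers; "string" keeps the exact decimal
// representation at the cost of parsing on the consumer side.
func encodeBigintUnsigned(v uint64, mode string) interface{} {
	if mode == "string" {
		return strconv.FormatUint(v, 10)
	}
	return int64(v) // "long": may overflow into negative values
}

func main() {
	v := uint64(18446744073709551615) // max BIGINT UNSIGNED
	fmt.Println(encodeBigintUnsigned(v, "long"))   // -1
	fmt.Println(encodeBigintUnsigned(v, "string")) // 18446744073709551615
}
```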
- When the schema is already in the schema registry and passes the compatibility check, a new version is created.
- When the schema is already in the schema registry but cannot pass the compatibility check, an error is reported.

#### SubjectNameStrategy Tests

- When there is only the default topic, a changefeed can only replicate one table.
- When the topic rule is invalid, an error is reported.

### Compatibility Tests

N/A

## Impacts & Risks

N/A

## Investigation & Alternatives

N/A

## Unresolved Questions

N/A