# TiCDC Design Documents

- Author(s): [Zhao Yilin](http://github.com/leoppro), [Zhang Xiang](http://github.com/zhangyangyu)
- Tracking Issue: https://github.com/pingcap/tiflow/issues/5338

## Table of Contents

- [TiCDC Design Documents](#ticdc-design-documents)
  - [Table of Contents](#table-of-contents)
  - [Introduction](#introduction)
  - [Motivation or Background](#motivation-or-background)
  - [Detailed Design](#detailed-design)
    - [New Config Items](#new-config-items)
    - [flat-avro Schema Definition](#flat-avro-schema-definition)
      - [Key Schema](#key-schema)
      - [Value Schema](#value-schema)
    - [DML Events](#dml-events)
    - [Schema Change](#schema-change)
    - [Subject Name Strategy](#subject-name-strategy)
    - [ColumnValueBlock and Data Mapping](#columnvalueblock-and-data-mapping)
  - [Test Design](#test-design)
    - [Functional Tests](#functional-tests)
      - [CLI Tests](#cli-tests)
      - [Data Mapping Tests](#data-mapping-tests)
      - [DML Tests](#dml-tests)
      - [Schema Tests](#schema-tests)
      - [SubjectNameStrategy Tests](#subjectnamestrategy-tests)
    - [Compatibility Tests](#compatibility-tests)
  - [Impacts & Risks](#impacts--risks)
  - [Investigation & Alternatives](#investigation--alternatives)
  - [Unresolved Questions](#unresolved-questions)

## Introduction

This document provides a complete design for refactoring the existing Avro protocol implementation. A common Avro data format is defined in order to build data pathways to various streaming systems.

## Motivation or Background

Apache Avro™ is a data serialization system with rich data structures and a compact binary data format. Avro relies on schemas, which are managed by a schema registry. Avro is a common data format in streaming systems, supported by Confluent, Flink, Debezium, etc.
## Detailed Design

### New Config Items

| Config item                        | Option values          | Default | Explain                                                                                                                                                                                                                                                                                    |
| ---------------------------------- | ---------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| protocol                           | canal-json / flat-avro | -       | Specifies the format of messages written to Kafka.<br>The `flat-avro` option means using the Avro format designed in this document.                                                                                                                                                         |
| enable-tidb-extension              | true / false           | false   | Whether to append TiDB extension fields to Avro messages.                                                                                                                                                                                                                                   |
| schema-registry                    | -                      | -       | Specifies the schema registry endpoint.                                                                                                                                                                                                                                                     |
| avro-decimal-handling-mode         | precise / string       | precise | Specifies how TiCDC should handle values of DECIMAL columns:<br>`precise` encodes decimals as precise bytes.<br>`string` encodes values as formatted strings, which is easy to consume but loses the semantic information about the real type.                                               |
| avro-bigint-unsigned-handling-mode | long / string          | long    | Specifies how TiCDC should handle values of UNSIGNED BIGINT columns:<br>`long` represents values as Avro long (a 64-bit signed integer), which might overflow but is easy to use in consumers.<br>`string` represents values as strings, which is precise but needs to be parsed by consumers. |

### flat-avro Schema Definition

`flat-avro` is an alias of the `avro` protocol. It means that all column values are placed directly inside the message with no nesting. This structure is compatible with most Confluent sink connectors, but it cannot handle `old-value`. `rich-avro` is the opposite and is reserved for future needs.
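For illustration only, the config items above could be combined in a changefeed sink URI as shown below. The broker address, topic name, registry address, and changefeed ID are placeholders, and the exact flag syntax is a sketch of the `cdc cli` interface, not a normative reference:

```
cdc cli changefeed create --changefeed-id="avro-test" \
    --sink-uri="kafka://127.0.0.1:9092/topic-name?protocol=avro&enable-tidb-extension=true&avro-decimal-handling-mode=string&avro-bigint-unsigned-handling-mode=string&schema-registry=http://127.0.0.1:8081"
```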
#### Key Schema

```
{
    "name":"{{RecordName}}",
    "namespace":"{{Namespace}}",
    "type":"record",
    "fields":[
        {{ColumnValueBlock}},
        {{ColumnValueBlock}}
    ]
}
```

- `{{RecordName}}` represents the fully qualified table name.
- `{{ColumnValueBlock}}` represents a JSON block, which defines one column value of the key.
- The key only includes the valid index fields.

#### Value Schema

```
{
    "name":"{{RecordName}}",
    "namespace":"{{Namespace}}",
    "type":"record",
    "fields":[
        {{ColumnValueBlock}},
        {{ColumnValueBlock}},
        {
            "name":"_tidb_op",
            "type":"string"
        },
        {
            "name":"_tidb_commit_ts",
            "type":"long"
        },
        {
            "name":"_tidb_commit_physical_time",
            "type":"long"
        }
    ]
}
```

- `{{RecordName}}` represents the fully qualified table name.
- `{{ColumnValueBlock}}` represents a JSON block, which defines one column value of the row.
- `_tidb_op` is used to distinguish between INSERT and UPDATE events; the possible values are "c" / "u".
- `_tidb_commit_ts` represents the CommitTS of the transaction.
- `_tidb_commit_physical_time` represents the physical timestamp of the transaction.

When `enable-tidb-extension` is `true`, `_tidb_op`, `_tidb_commit_ts`, and `_tidb_commit_physical_time` will be appended to every Kafka value. When `enable-tidb-extension` is `false`, no extension field will be appended to Kafka values.

### DML Events

If `enable-tidb-extension` is `true`, the `_tidb_op` field of an INSERT event is "c" and that of an UPDATE event is "u".

If `enable-tidb-extension` is `false`, the `_tidb_op` field is not appended to the Kafka value, so there is no way to distinguish an INSERT event from an UPDATE event.

For a DELETE event, TiCDC sends the primary key value as the Kafka key, and the Kafka value is `null`.

### Schema Change

Avro detects schema changes at every DML event instead of at DDL events.
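To make the templates above concrete, the following sketch assembles a value schema for a hypothetical table `test.person(id INT PRIMARY KEY, name VARCHAR(255))` with `enable-tidb-extension` set to `true`. The type names, namespace, and helper functions here are illustrative assumptions, not the actual TiCDC implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// avroField models one entry of the "fields" array in the value schema.
type avroField struct {
	Name string      `json:"name"`
	Type interface{} `json:"type"`
}

// avroSchema models the top-level record schema.
type avroSchema struct {
	Name      string      `json:"name"`
	Namespace string      `json:"namespace"`
	Type      string      `json:"type"`
	Fields    []avroField `json:"fields"`
}

// buildValueSchema sketches how a value schema could be assembled from the
// column blocks of a table. The extension fields are appended at the end,
// as they would be when enable-tidb-extension is true.
func buildValueSchema(namespace, table string, columns []avroField) avroSchema {
	fields := append([]avroField{}, columns...)
	fields = append(fields,
		avroField{Name: "_tidb_op", Type: "string"},
		avroField{Name: "_tidb_commit_ts", Type: "long"},
		avroField{Name: "_tidb_commit_physical_time", Type: "long"},
	)
	return avroSchema{Name: table, Namespace: namespace, Type: "record", Fields: fields}
}

func main() {
	// Hypothetical column blocks for test.person; each column's
	// connect.parameters carries the TiDB type, per the data mapping table.
	cols := []avroField{
		{Name: "id", Type: map[string]interface{}{
			"type":               "int",
			"connect.parameters": map[string]string{"tidb_type": "INT"},
		}},
		{Name: "name", Type: map[string]interface{}{
			"type":               "string",
			"connect.parameters": map[string]string{"tidb_type": "TEXT"},
		}},
	}
	out, _ := json.MarshalIndent(buildValueSchema("test", "person", cols), "", "  ")
	fmt.Println(string(out))
}
```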
Whenever there is a schema change, the Avro codec tries to register a new schema version under the corresponding subject in the schema registry. Whether it succeeds depends on the schema evolution compatibility. The Avro codec does not address any compatibility issues; it simply propagates errors.

### Subject Name Strategy

The Avro codec only supports the default `TopicNameStrategy`. This means a Kafka topic can only accept a single schema. With the multi-topic ability in TiCDC, events from multiple tables could all be dispatched to one topic, which is not allowed under `TopicNameStrategy`. So we require that, for the Avro protocol, the topic rule in the dispatcher rules must contain both the `{schema}` and `{table}` placeholders, which means each table occupies its own Kafka topic.

### ColumnValueBlock and Data Mapping

A `ColumnValueBlock` has the following schema:

```
{
    "name":"{{ColumnName}}",
    "type":{
        "connect.parameters":{
            "tidb_type":"{{TIDB_TYPE}}"
        },
        "type":"{{AVRO_TYPE}}"
    }
}
```

| SQL TYPE                                           | TIDB_TYPE                    | AVRO_TYPE | Description                                                                                                               |
| -------------------------------------------------- | ---------------------------- | --------- | ------------------------------------------------------------------------------------------------------------------------- |
| TINYINT/BOOL/SMALLINT/MEDIUMINT/INT                | INT                          | int       | When unsigned, TIDB_TYPE is INT UNSIGNED. For SQL type INT UNSIGNED, AVRO_TYPE is long.                                    |
| BIGINT                                             | BIGINT                       | long      | When unsigned, TIDB_TYPE is BIGINT UNSIGNED. If `avro-bigint-unsigned-handling-mode` is string, AVRO_TYPE is string.       |
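The topic-rule requirement can be sketched as follows. Under `TopicNameStrategy`, the registry subjects are derived from the topic name as `<topic>-key` and `<topic>-value`; the `topicFor` helper and the example rule `cdc_{schema}_{table}` below are illustrative assumptions, not TiCDC's actual dispatcher code:

```go
package main

import (
	"fmt"
	"strings"
)

// topicFor expands a dispatcher topic rule. For the Avro protocol the rule
// must contain both the {schema} and {table} placeholders, so that every
// table gets its own topic (and hence its own pair of registry subjects).
func topicFor(rule, schema, table string) (string, error) {
	if !strings.Contains(rule, "{schema}") || !strings.Contains(rule, "{table}") {
		return "", fmt.Errorf("avro topic rule %q must contain {schema} and {table}", rule)
	}
	r := strings.NewReplacer("{schema}", schema, "{table}", table)
	return r.Replace(rule), nil
}

func main() {
	topic, err := topicFor("cdc_{schema}_{table}", "test", "person")
	if err != nil {
		panic(err)
	}
	fmt.Println(topic)            // the per-table Kafka topic
	fmt.Println(topic + "-key")   // subject of the key schema
	fmt.Println(topic + "-value") // subject of the value schema
}
```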
| TINYBLOB/BLOB/MEDIUMBLOB/LONGBLOB/BINARY/VARBINARY | BLOB                         | bytes     |                                                                                                                           |
| TINYTEXT/TEXT/MEDIUMTEXT/LONGTEXT/CHAR/VARCHAR     | TEXT                         | string    |                                                                                                                           |
| FLOAT/DOUBLE                                       | FLOAT/DOUBLE                 | double    |                                                                                                                           |
| DATE/DATETIME/TIMESTAMP/TIME                       | DATE/DATETIME/TIMESTAMP/TIME | string    |                                                                                                                           |
| YEAR                                               | YEAR                         | int       |                                                                                                                           |
| BIT                                                | BIT                          | bytes     | BIT has another `connect.parameters` entry `"length":"64"`.                                                               |
| JSON                                               | JSON                         | string    |                                                                                                                           |
| ENUM/SET                                           | ENUM/SET                     | string    | ENUM/SET has another `connect.parameters` entry `"allowed":"a,b,c"`.                                                      |
| DECIMAL                                            | DECIMAL                      | bytes     | This is the Avro decimal logical type, which has `scale` and `precision`. When `avro-decimal-handling-mode` is string, AVRO_TYPE is string. |

## Test Design

### Functional Tests

#### CLI Tests

- avro/flat-avro protocol
- avro/flat-avro protocol & true/false/invalid enable-tidb-extension
- avro/flat-avro protocol & precise/string/invalid avro-decimal-handling-mode
- avro/flat-avro protocol & long/string/invalid avro-bigint-unsigned-handling-mode
- avro/flat-avro protocol & valid/invalid schema-registry

#### Data Mapping Tests

- With protocol=avro&enable-tidb-extension=false&avro-decimal-handling-mode=precise&avro-bigint-unsigned-handling-mode=long, all generated schemas and data are correct.
- With enable-tidb-extension=true, the schema and value have the `_tidb_op`, `_tidb_commit_ts`, and `_tidb_commit_physical_time` fields.
- With avro-decimal-handling-mode=string, a decimal field generates string schema and data.
- With avro-bigint-unsigned-handling-mode=string, a bigint unsigned field generates string schema and data.

#### DML Tests

- Insert a row and check it in the downstream database.
- Update a row and check it in the downstream database.
- Delete a row and check it in the downstream database.

#### Schema Tests

- When the schema is not in the schema registry, a fresh new schema is created.
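The two `avro-bigint-unsigned-handling-mode` options in the table above can be sketched with a small encoder. The function name is hypothetical; the point is the overflow behavior of the `long` mode versus the lossless `string` mode:

```go
package main

import (
	"fmt"
	"strconv"
)

// encodeBigintUnsigned sketches the two handling modes for BIGINT UNSIGNED.
// "long" reinterprets the value as a signed 64-bit integer, so values above
// math.MaxInt64 wrap to negative numbers; "string" keeps the exact decimal
// representation at the cost of parsing on the consumer side.
func encodeBigintUnsigned(v uint64, mode string) interface{} {
	if mode == "string" {
		return strconv.FormatUint(v, 10)
	}
	return int64(v) // "long": may overflow into negative values
}

func main() {
	v := uint64(18446744073709551615) // max BIGINT UNSIGNED
	fmt.Println(encodeBigintUnsigned(v, "long"))   // -1
	fmt.Println(encodeBigintUnsigned(v, "string")) // 18446744073709551615
}
```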
- When the schema is already in the schema registry and passes the compatibility check, a new version is created.
- When the schema is already in the schema registry but cannot pass the compatibility check, an error is reported.

#### SubjectNameStrategy Tests

- When there is only the default topic, a changefeed can only replicate one table.
- When the topic rule is invalid, an error is reported.

### Compatibility Tests

N/A

## Impacts & Risks

N/A

## Investigation & Alternatives

N/A

## Unresolved Questions

N/A