github.com/pingcap/ticdc@v0.0.0-20220526033649-485a10ef2652/docs/design/2020-07-16-ticdc-avro-protocol-design.md

# Background

- [Apache Avro](https://avro.apache.org/) is a data serialization system that converts complex data structures to and from a compact binary representation. The Avro format is compact in the sense that it is *not* self-describing the way JSON is: the schema must be obtained separately from the data, which makes a centralized Schema Registry desirable in many deployments.

- [Kafka Connect](https://docs.confluent.io/current/connect/index.html) is a component of the Kafka platform that provides out-of-the-box integration of Kafka with other data sources and sinks, especially RDBMSes such as MySQL, PostgreSQL and many others. Kafka Connect supports Avro as the wire format of each Kafka message out of the box. To solve the aforementioned schema problem, Kafka Connect ships with the **Confluent Schema Registry**, which, through RESTful APIs, gives Kafka itself as well as other applications the ability to acquire and share the schemas of data transmitted in the Avro format.
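For reference, messages in the Schema Registry ecosystem are framed in Confluent's documented wire format: one zero magic byte, the 4-byte big-endian ID of the registered schema, then the Avro binary body. A minimal Go sketch of that framing (the function name and payload are illustrative, not TiCDC's actual code):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// envelope wraps an Avro-encoded payload in the Confluent wire format:
// one magic byte (0x00), the 4-byte big-endian schema ID assigned by the
// Schema Registry, then the Avro binary body.
func envelope(schemaID int32, avroPayload []byte) []byte {
	buf := make([]byte, 0, 5+len(avroPayload))
	buf = append(buf, 0x00) // magic byte
	id := make([]byte, 4)
	binary.BigEndian.PutUint32(id, uint32(schemaID))
	buf = append(buf, id...)
	return append(buf, avroPayload...)
}

func main() {
	// Hypothetical schema ID 42 and a two-byte Avro payload.
	msg := envelope(42, []byte{0x02, 0x61})
	fmt.Printf("% x\n", msg) // prints: 00 00 00 00 2a 02 61
}
```

A consumer (such as Kafka Connect's Avro converter) reads the schema ID from this header and fetches the matching schema from the Registry before decoding.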
     6  
# Feature

- TiCDC can now output data in the Avro format to Kafka. It automatically registers the schema of the relevant table(s) with a user-managed Confluent Schema Registry instance, and the Avro data is compatible with the JDBC sink connector of Kafka Connect.
- The user interface supports "avro" as a sink-uri parameter for the Kafka sink, and accepts "registry=http://..." as a parameter in "--opts". For example: `bin/cdc cli changefeed create --sink-uri "kafka://127.0.0.1:9092/testdb.test?protocol=avro" --opts registry="http://127.0.0.1:8081"`.
    10  
# Key Design Decisions

- Only the owner can update the Schema Registry.
- Processors retrieve the **latest** schema from the Registry.
- Processors maintain a local cache of the Avro schema(s) for the relevant tables. A cache item is invalidated if and only if the table schema's `updateTs` has changed.
- In order to keep the interface stable and at the same time remain compatible with Kafka Connect, the `AvroEventBatchEncoder` has a buffer that contains at most **one** pending message.
    16  
# Key Data Structures

### AvroEventBatchEncoder

Implements the interface `EventBatchEncoder`.

##### Caveats

- `AppendResolvedEvent` is a no-op because Kafka Connect does not expect such events.
- `AppendDDLEvent` *does not* emit any Kafka message, but it *does* update the Avro Schema Registry with the latest schema if necessary.
- `Size()` is always 0 or 1, which, albeit a slight violation of the expected semantics, clearly conveys whether the buffer is full.
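The buffer-of-one behaviour described above can be sketched as follows; this is a hypothetical simplification, not the actual `AvroEventBatchEncoder`:

```go
package main

import "fmt"

// avroBatchEncoder sketches the buffer-of-one design: the method names mirror
// the doc's EventBatchEncoder interface, but the bodies are illustrative.
type avroBatchEncoder struct {
	pending []byte // at most one pending encoded message
}

// AppendResolvedEvent is a no-op: Kafka Connect does not consume such events.
func (e *avroBatchEncoder) AppendResolvedEvent(ts uint64) {}

// AppendRowChangedEvent buffers exactly one message; the caller must flush
// once Size() reports 1 before appending the next row.
func (e *avroBatchEncoder) AppendRowChangedEvent(msg []byte) error {
	if len(e.pending) != 0 {
		return fmt.Errorf("buffer full: flush before appending")
	}
	e.pending = msg
	return nil
}

// Size reports 0 or 1, i.e. whether the buffer holds a pending message.
func (e *avroBatchEncoder) Size() int {
	if len(e.pending) == 0 {
		return 0
	}
	return 1
}

// Flush hands the pending message to the caller and empties the buffer.
func (e *avroBatchEncoder) Flush() []byte {
	msg := e.pending
	e.pending = nil
	return msg
}

func main() {
	enc := &avroBatchEncoder{}
	fmt.Println(enc.Size()) // prints: 0
	enc.AppendRowChangedEvent([]byte("row-1"))
	fmt.Println(enc.Size()) // prints: 1
	enc.Flush()
	fmt.Println(enc.Size()) // prints: 0
}
```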
    25  
### AvroSchemaManager

Provides basic operations on the Schema Registry.

Note:

- `NewAvroSchemaManager` takes a parameter `subjectSuffix string` because Kafka Connect looks up schemas by names of the form `{subject}-key` or `{subject}-value`.
- The `goavro.Codec` instance is cached to avoid re-parsing the JSON representation of an Avro schema on every use.
- `AvroSchemaManager` is tested with the help of `jarcoal/httpmock`, which intercepts requests sent via the default HTTP client and mocks an implementation of the Schema Registry.
# Known Limitations

- The Kafka message keys are the internal `rowid`, which is not very useful to users.
- Given that a changefeed can only write to one Kafka topic, capturing multiple tables could confuse Kafka sink connectors.