# Background

- [Apache Avro](https://avro.apache.org/) is a data serialization system that provides algorithms to convert complex data structures to and from compact binary representations. The Avro format is compact precisely because it is _not_ self-describing the way JSON is: the schema of the data must be acquired separately from the data itself, which makes a centralized Schema Registry a natural companion in many deployments.

- [Kafka Connect](https://docs.confluent.io/current/connect/index.html) is a component of the Kafka platform that aims to provide out-of-the-box integration of Kafka with other data sources & sinks, especially RDBMSes such as MySQL, PostgreSQL and many others. Kafka Connect has out-of-the-box support for Avro as the wire format of each Kafka message. To solve the aforementioned schema problem, Kafka Connect ships with the **Confluent Schema Registry**, which, through RESTful APIs, gives Kafka itself as well as other applications the ability to acquire and share the schemas of data to be transmitted in the Avro format.

# Feature

- TiCDC can now output data in Avro format to Kafka. It automatically registers the schema of the relevant table(s) with a user-managed Confluent Schema Registry instance, and the Avro data is compatible with the JDBC sink connector of Kafka Connect.
- The user interface supports "avro" as a sink-uri parameter for the Kafka sink, and accepts "registry=http://..." as a parameter in "--opts". For example, `bin/cdc cli changefeed create --sink-uri "kafka://127.0.0.1:9092/testdb.test?protocol=avro" --opts registry="http://127.0.0.1:8081"`.

# Key Design Decisions

- Only the owner can update the Schema Registry.
- Processors retrieve the **latest** schema from the Registry.
- Processors maintain a local cache of the Avro schema(s) for the relevant tables. Cache items are invalidated if and only if the table layout's `updateTs` has changed.
- In order to maintain a stable interface while remaining compatible with Kafka Connect, the `AvroEventBatchEncoder` has a buffer that contains at most **one** pending message.

# Key Data Structures

### AvroEventBatchEncoder

Implements the interface `EventBatchEncoder`.

##### Caveats

- `AppendResolvedEvent` is a no-op because Kafka Connect does not expect such events.
- `AppendDDLEvent` _does not_ emit any Kafka message, but _does_ update the Avro Schema Registry with the latest schema if necessary.
- `Size()` is always 0 or 1, which, albeit a slight violation of the expected semantics, clearly conveys whether the buffer is full, as sketched below.
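To make the buffering contract concrete, here is a minimal, hypothetical Go sketch of a one-message encoder. The type and method names (`oneMessageEncoder`, `message`, `Build`) are illustrative assumptions, not the actual TiCDC implementation; only the semantics (at most one pending message, `Size()` doubling as a "buffer full" flag, `AppendResolvedEvent` as a no-op) come from this document.

```go
package sketch

import "fmt"

// message is a stand-in for an encoded Kafka key/value pair.
type message struct {
	key, value []byte
}

// oneMessageEncoder illustrates the AvroEventBatchEncoder buffering
// contract described above: the buffer holds at most one pending
// message. (Hypothetical sketch; not the actual TiCDC type.)
type oneMessageEncoder struct {
	pending *message // nil means the buffer is empty
}

// AppendRowChangedEvent buffers one encoded row; the caller must
// drain the buffer before appending another event.
func (e *oneMessageEncoder) AppendRowChangedEvent(key, value []byte) error {
	if e.pending != nil {
		return fmt.Errorf("buffer full: drain before appending")
	}
	e.pending = &message{key: key, value: value}
	return nil
}

// AppendResolvedEvent is a no-op: Kafka Connect does not expect
// resolved-timestamp events.
func (e *oneMessageEncoder) AppendResolvedEvent(ts uint64) error { return nil }

// Size reports 0 or 1, i.e. whether the buffer holds a pending message.
func (e *oneMessageEncoder) Size() int {
	if e.pending == nil {
		return 0
	}
	return 1
}

// Build drains the buffer and returns the pending message, if any.
func (e *oneMessageEncoder) Build() []*message {
	if e.pending == nil {
		return nil
	}
	m := e.pending
	e.pending = nil
	return []*message{m}
}
```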
### AvroSchemaManager

Provides basic operations on the Schema Registry.

Notes:

- `NewAvroSchemaManager` takes a parameter `subjectSuffix string` because Kafka Connect is expected to look up schemas by names of the form `{subject}-key` or `{subject}-value`.
- The `goavro.Codec` instance is cached to avoid parsing the JSON representation of the Avro schemas each time.
- `AvroSchemaManager` is tested with the help of `jarcoal/httpmock`, which intercepts requests sent via the default HTTP client and mocks an implementation of the Schema Registry.

# Known Limitations

- The Kafka message keys are the internal `rowid`, which is not very useful to users.
- Given that a changefeed can only write to one Kafka topic, capturing multiple tables could confuse Kafka sink connectors.
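# Appendix: Schema Registry Interaction Sketch

For concreteness, the following is a minimal, hypothetical Go sketch of the behavior described in the `AvroSchemaManager` section: registering a schema under `{subject}{suffix}` and caching the parsed `goavro.Codec`, invalidated only when the table layout's `updateTs` changes. The type and method names (`schemaManager`, `Register`, `Codec`) are illustrative assumptions, not the actual TiCDC code; the REST endpoint and content type are the standard Confluent Schema Registry ones, and `github.com/linkedin/goavro/v2` is assumed for codec parsing.

```go
package sketch

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"

	"github.com/linkedin/goavro/v2"
)

// cachedCodec pairs a parsed Avro codec with the updateTs it was
// built from, so the cache can be invalidated on schema change.
type cachedCodec struct {
	codec    *goavro.Codec
	updateTs uint64
}

// schemaManager is a hypothetical sketch of AvroSchemaManager.
type schemaManager struct {
	registryURL   string
	subjectSuffix string // e.g. "-key" or "-value"
	cache         map[string]cachedCodec
}

func newSchemaManager(registryURL, subjectSuffix string) *schemaManager {
	return &schemaManager{
		registryURL:   registryURL,
		subjectSuffix: subjectSuffix,
		cache:         make(map[string]cachedCodec),
	}
}

// Register posts a new schema version to the Confluent Schema
// Registry (POST /subjects/{subject}/versions) and returns the
// registry-assigned schema ID.
func (m *schemaManager) Register(subject, schemaJSON string) (int, error) {
	body, err := json.Marshal(map[string]string{"schema": schemaJSON})
	if err != nil {
		return 0, err
	}
	url := fmt.Sprintf("%s/subjects/%s%s/versions", m.registryURL, subject, m.subjectSuffix)
	resp, err := http.Post(url, "application/vnd.schemaregistry.v1+json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("registry returned status %d", resp.StatusCode)
	}
	var result struct {
		ID int `json:"id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return 0, err
	}
	return result.ID, nil
}

// Codec returns a cached codec for the subject, re-parsing the
// schema JSON only when the table layout's updateTs has changed.
func (m *schemaManager) Codec(subject, schemaJSON string, updateTs uint64) (*goavro.Codec, error) {
	if c, ok := m.cache[subject]; ok && c.updateTs == updateTs {
		return c.codec, nil
	}
	codec, err := goavro.NewCodec(schemaJSON)
	if err != nil {
		return nil, err
	}
	m.cache[subject] = cachedCodec{codec: codec, updateTs: updateTs}
	return codec, nil
}
```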