
# Background

- [Apache Avro](https://avro.apache.org/) is a data serialization system that provides algorithms to convert complex data structures to and from compact binary representations. The Avro format is compact precisely because it is _not_ self-describing the way JSON is: the schema of the data must be acquired separately from the data itself, which makes a centralized Schema Registry desirable in some deployments (see the sketch at the end of this section).

- [Kafka Connect](https://docs.confluent.io/current/connect/index.html) is a component of the Kafka platform that aims to provide out-of-the-box integration of Kafka with other data sources & sinks, especially RDBMSes such as MySQL, PostgreSQL and many others. Kafka Connect has out-of-the-box support for Avro as the wire format of each Kafka message. To solve the aforementioned schema problem, Kafka Connect ships with the **Confluent Schema Registry**, which, through RESTful APIs, gives Kafka itself as well as other applications the ability to acquire and share the schemas of data transmitted in the Avro format.
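To make the schema-separation point concrete, here is a minimal sketch using `github.com/linkedin/goavro/v2` (the `goavro` library referenced later in this document): the encoded bytes carry only field values, so a reader needs the same schema to make sense of them. The record schema and field names below are purely illustrative.

```go
package main

import (
	"fmt"

	"github.com/linkedin/goavro/v2"
)

func main() {
	// The schema is defined, and would be shared, separately from the data.
	codec, err := goavro.NewCodec(`{
		"type": "record",
		"name": "Example",
		"fields": [
			{"name": "id", "type": "long"},
			{"name": "name", "type": "string"}
		]
	}`)
	if err != nil {
		panic(err)
	}

	// Encode: the binary form contains only field values, no names or types.
	bin, err := codec.BinaryFromNative(nil, map[string]interface{}{
		"id":   int64(1),
		"name": "alice",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes of Avro binary: %v\n", len(bin), bin)

	// Decode: without the schema, these bytes are meaningless to a reader.
	native, _, err := codec.NativeFromBinary(bin)
	if err != nil {
		panic(err)
	}
	fmt.Println(native)
}
```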
     6  
# Feature

- TiCDC can now output data in Avro format to Kafka. It automatically registers the schema of the relevant table(s) with a user-managed Confluent Schema Registry instance, and the Avro data is compatible with the JDBC sink connector of Kafka Connect.
- The user interface supports `avro` as a sink-uri parameter for the Kafka sink, and accepts `registry=http://...` as a parameter in `--opts`. For example: `bin/cdc cli changefeed create --sink-uri "kafka://127.0.0.1:9092/testdb.test?protocol=avro" --opts registry="http://127.0.0.1:8081"`.
    11  
# Key Design Decisions

- Only the owner can update the Schema Registry.
- Processors retrieve the **latest** schema from the Registry.
- Processors maintain a local cache of the Avro schema(s) for the relevant tables. A cache item is invalidated if and only if the `updateTs` of the table's schema has changed (see the sketch after this list).
- In order to maintain a stable interface and at the same time be compatible with Kafka Connect, the `AvroEventBatchEncoder` has a buffer that contains at most **one** pending message.
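The following is a minimal sketch of the cache-invalidation rule described above. The names (`schemaCache`, `cacheEntry`, `lookupSchema`, `fetchLatest`) are illustrative only, not the actual TiCDC implementation.

```go
package schemacache

import "github.com/linkedin/goavro/v2"

// cacheEntry is an illustrative cache item: a parsed schema plus the
// schema version it was built from.
type cacheEntry struct {
	codec    *goavro.Codec // parsed Avro schema, reused across rows
	updateTs uint64        // updateTs of the table schema the entry was built from
}

// schemaCache is a per-processor cache keyed by table name.
type schemaCache struct {
	entries map[string]*cacheEntry
}

// lookupSchema returns a cached codec for the table, refreshing it from
// the Schema Registry only when the table's updateTs has changed.
func (c *schemaCache) lookupSchema(
	table string,
	updateTs uint64,
	fetchLatest func(table string) (*goavro.Codec, error),
) (*goavro.Codec, error) {
	if entry, ok := c.entries[table]; ok && entry.updateTs == updateTs {
		return entry.codec, nil // cache hit: schema unchanged
	}
	// Cache miss or stale entry: fetch the latest schema from the Registry.
	codec, err := fetchLatest(table)
	if err != nil {
		return nil, err
	}
	c.entries[table] = &cacheEntry{codec: codec, updateTs: updateTs}
	return codec, nil
}
```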
    18  
# Key Data Structures

### AvroEventBatchEncoder

Implements the interface `EventBatchEncoder`.

##### Caveats

- `AppendResolvedEvent` is a no-op because Kafka Connect does not expect such events.
- `AppendDDLEvent` _does not_ emit any Kafka message, but it _does_ update the Avro Schema Registry with the latest schema if necessary.
- `Size()` always returns 0 or 1, which, albeit a slight violation of the expected semantics, clearly conveys whether the one-message buffer is full (a condensed sketch follows this list).
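Here is a condensed sketch of the single-slot buffer semantics described in these caveats. The field and method shapes are illustrative; the real `EventBatchEncoder` interface has more methods and may use different signatures.

```go
package encoder

// message is an illustrative stand-in for an encoded Kafka key/value pair.
type message struct {
	key, value []byte
}

// avroEventBatchEncoder buffers at most one pending message, so that the
// batch-oriented interface remains stable while each Kafka message stays
// individually consumable by Kafka Connect.
type avroEventBatchEncoder struct {
	pending *message // holds at most one encoded message
}

// Size reports 0 or 1: whether the single-slot buffer is occupied.
func (e *avroEventBatchEncoder) Size() int {
	if e.pending == nil {
		return 0
	}
	return 1
}

// AppendResolvedEvent is a no-op: Kafka Connect does not consume
// resolved-ts events.
func (e *avroEventBatchEncoder) AppendResolvedEvent(ts uint64) error {
	return nil
}
```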
    30  
### AvroSchemaManager

Provides basic operations on the Schema Registry.

Notes:

- `NewAvroSchemaManager` takes a parameter `subjectSuffix string` because Kafka Connect looks up schemas by names in the form `{subject}-key` or `{subject}-value`.
- The `goavro.Codec` instance is cached to avoid re-parsing the JSON representation of the Avro schema every time it is used.
- `AvroSchemaManager` is tested with the help of `jarcoal/httpmock`, which intercepts requests sent via the default HTTP client and mocks an implementation of the Schema Registry. A sketch of a schema lookup against the Registry's REST API follows these notes.
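Below is a minimal sketch of fetching the latest schema for a subject via the Confluent Schema Registry REST endpoint `GET /subjects/{subject}/versions/latest`. The helper name `lookupLatest` and the error-handling style are illustrative, not the actual `AvroSchemaManager` code; because the request goes through the default HTTP client, `jarcoal/httpmock` can intercept it in tests.

```go
package registry

import (
	"encoding/json"
	"fmt"
	"net/http"

	"github.com/linkedin/goavro/v2"
)

// registryResponse mirrors the Confluent Schema Registry reply for a
// versioned-schema lookup.
type registryResponse struct {
	Subject string `json:"subject"`
	Version int    `json:"version"`
	ID      int    `json:"id"`
	Schema  string `json:"schema"` // JSON representation of the Avro schema
}

// lookupLatest fetches the latest registered schema for topic+suffix and
// parses it into a goavro.Codec, which the caller should cache.
func lookupLatest(registryURL, topic, suffix string) (*goavro.Codec, error) {
	subject := topic + suffix // e.g. "testdb.test-value"
	url := fmt.Sprintf("%s/subjects/%s/versions/latest", registryURL, subject)

	resp, err := http.Get(url) // default client, so httpmock can intercept it
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var reply registryResponse
	if err := json.NewDecoder(resp.Body).Decode(&reply); err != nil {
		return nil, err
	}
	// Parsing the schema JSON is relatively expensive, hence the codec cache.
	return goavro.NewCodec(reply.Schema)
}
```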
    39  
# Known Limitations

- The Kafka message keys are the internal `rowid`, which is not very useful to users.
- Given that a changefeed can only write to one Kafka topic, capturing multiple tables could confuse Kafka sink connectors.