# Distributed Scheduling in TiCDC

## Background

TiCDC boasts high availability and horizontal scalability. To make this possible, TiCDC needs a distributed scheduling mechanism by which tables can be distributed across the nodes in the cluster and survive node failures. We still need a center of control, which we call the Owner, but the Owner should fail over very quickly if the node running it crashes.

At the beginning of the TiCDC project, we chose a solution that sends all information over Etcd, which is a distributed key-value store. However, this solution has proven problematic, as it fails to scale both horizontally and vertically. As a result, we created a solution using direct peer-to-peer messaging, which not only has semantics better suited to scheduling, but also performs and scales better.

## The Abstraction

To succinctly describe the algorithm used here, we need to abstract the TiCDC Owner and Processor. To simplify the matter, we will omit "changefeed" management here and suppose that multiple "changefeeds" are isolated from each other as far as scheduling is concerned.

- The Owner is a piece of code that can persist and restore a timestamp `global checkpoint`, which is guaranteed to be monotonically increasing and is a lower bound of the progress of all nodes on all tables. The Owner will call our `ScheduleDispatcher` periodically and supply it with both the latest `global checkpoint` and the list of tables that should currently be replicated.
- The Processor is a piece of code that actually replicates the tables. We can `Add` and `Remove` tables from it, and query the status of a table (both roles are sketched below).
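
The following is a minimal Go sketch of these two roles. The package name, the `Tick` method, `TableStatus`, and the type aliases are hypothetical and purely for illustration; TiCDC's actual interfaces differ.

```go
package scheduling

// Hypothetical aliases; the real definitions live elsewhere in TiCDC.
type (
	TableID = int64
	Ts      = uint64
)

// TableStatus describes a table's lifecycle on a processor (hypothetical).
type TableStatus int

const (
	TableAdding TableStatus = iota
	TableRunning
	TableRemoving
)

// ScheduleDispatcher is the Owner-side scheduling hook: it is called
// periodically with the latest global checkpoint and the tables that
// should currently be replicated.
type ScheduleDispatcher interface {
	Tick(globalCheckpoint Ts, shouldReplicate []TableID)
}

// Processor is the node-side worker: tables can be added and removed,
// and their status queried.
type Processor interface {
	AddTable(id TableID)
	RemoveTable(id TableID)
	Status(id TableID) TableStatus
}
```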

## The Protocol

The communication protocol between the Owner and the Processors is as follows:

### Message Types

#### DispatchTable

- Direction: Owner -> Processor

- Semantics: Tells the processor to start (or stop) replicating a table.

- ```go
  type DispatchTableMessage struct {
  	OwnerRev int64   `json:"owner-rev"`
  	ID       TableID `json:"id"`
  	IsDelete bool    `json:"is-delete"`
  }
  ```

#### DispatchTableResponse

- Direction: Processor -> Owner

- Semantics: Informs the owner that the processor has finished the dispatched operation (adding or removing) on the given table.

- ```go
  type DispatchTableResponseMessage struct {
  	ID TableID `json:"id"`
  }
  ```

#### Announce

- Direction: Owner -> Processor

- Semantics: Announces the election of the sender node as the owner.

- ```go
  type AnnounceMessage struct {
  	OwnerRev int64 `json:"owner-rev"`
  	// Sends the owner's version for compatibility check
  	OwnerVersion string `json:"owner-version"`
  }
  ```

#### Sync

- Direction: Processor -> Owner

- Semantics: Tells the newly elected owner the processor's state, or tells the owner that the processor has restarted.

- ```go
  type SyncMessage struct {
  	// Sends the processor's version for compatibility check
  	ProcessorVersion string

  	Running          []TableID
  	Adding           []TableID
  	Removing         []TableID
  }
  ```

#### Checkpoint

- Direction: Processor -> Owner

- Semantics: The processor reports its current watermarks to the owner.

- ```go
  type CheckpointMessage struct {
  	CheckpointTs Ts `json:"checkpoint-ts"`
  	ResolvedTs   Ts `json:"resolved-ts"`
  	// We can add more fields in the future
  }
  ```
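
Since the messages carry JSON struct tags, the snippet below illustrates what a `DispatchTable` message could look like when JSON-encoded. It continues the hypothetical `scheduling` package above and assumes the message structs are part of it; the actual encoding used by the peer-to-peer layer may differ.

```go
package scheduling

import (
	"encoding/json"
	"fmt"
)

// exampleEncode prints a JSON-encoded DispatchTableMessage, e.g.:
//
//	{"owner-rev":3,"id":42,"is-delete":false}
func exampleEncode() {
	msg := DispatchTableMessage{
		OwnerRev: 3,
		ID:       42,
		IsDelete: false,
	}
	data, err := json.Marshal(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```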

### Interaction

![](./media/scheduling_proto.svg)

The figure above shows the basic flow of interaction between the Owner and the Processors; a sketch of the owner-side bookkeeping follows the steps below.

1. **Owner** gets elected and announces its ownership to **Processor A**.
2. Similarly, **Owner** announces its ownership to **Processor B**.
3. **Processor A** reports to **Owner** its internal state, namely which tables are being added, removed, and run.
4. **Processor B** does the same.
5. **Owner** tells **Processor A** to start replicating _Table 1_.
6. **Owner** tells **Processor B** to start replicating _Table 2_.
7. **Processor A** reports that _Table 1_ has finished initializing and is now being replicated.
8. **Processor B** reports that _Table 2_ has finished initializing and is now being replicated.
9. **Processor A** sends its watermark.
10. **Processor B** sends its watermark too.
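
As a sketch of the owner-side bookkeeping implied by this flow, the code below continues the hypothetical `scheduling` package and assumes the message structs above are part of it. It is not TiCDC's actual scheduler.

```go
package scheduling

// CaptureID identifies a TiCDC node; a hypothetical alias for illustration.
type CaptureID = string

// captureStatus is the owner's record of one processor's tables and watermarks.
type captureStatus struct {
	running, adding, removing map[TableID]struct{}
	checkpointTs, resolvedTs  Ts
}

// ownerSketch mirrors the interaction steps above.
type ownerSketch struct {
	captures map[CaptureID]*captureStatus
}

func newOwnerSketch() *ownerSketch {
	return &ownerSketch{captures: make(map[CaptureID]*captureStatus)}
}

// onMessage handles the processor-to-owner messages of the flow above.
func (o *ownerSketch) onMessage(from CaptureID, msg any) {
	switch m := msg.(type) {
	case *SyncMessage:
		// Steps 3-4: learn the processor's current state before scheduling.
		st := &captureStatus{
			running:  make(map[TableID]struct{}),
			adding:   make(map[TableID]struct{}),
			removing: make(map[TableID]struct{}),
		}
		for _, id := range m.Running {
			st.running[id] = struct{}{}
		}
		for _, id := range m.Adding {
			st.adding[id] = struct{}{}
		}
		for _, id := range m.Removing {
			st.removing[id] = struct{}{}
		}
		o.captures[from] = st
	case *DispatchTableResponseMessage:
		// Steps 7-8: an add has finished (the table is now running) or a
		// remove has finished (the table is gone).
		st := o.captures[from]
		if st == nil {
			return // ignore captures that have not sent Sync yet
		}
		if _, ok := st.adding[m.ID]; ok {
			delete(st.adding, m.ID)
			st.running[m.ID] = struct{}{}
		} else {
			delete(st.removing, m.ID)
		}
	case *CheckpointMessage:
		// Steps 9-10: record the per-capture watermarks; the global
		// checkpoint is derived from the minimum over all captures.
		st := o.captures[from]
		if st == nil {
			return
		}
		st.checkpointTs = m.CheckpointTs
		st.resolvedTs = m.ResolvedTs
	}
}
```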

### Owner Switches

Because of TiCDC's high availability, the communication protocol needs to handle owner switches. The basic idea here is that, when a new owner takes over, it queries the states of all live processors and prepares to react to the processors' messages (especially `DispatchTableResponse` and `Checkpoint`) exactly as the previous owner would.

Moreover, if the old owner is unaware that it is no longer the owner and still tries to act as one, a processor will reject the old owner's commands once it has received at least one message from the new owner. The order of succession of owners is recorded in the `owner-rev` field carried by `Announce` and `DispatchTable`.

![](./media/scheduling_proto_owner_change.svg)
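
A minimal sketch of the processor-side `owner-rev` check described above, again in the hypothetical `scheduling` package; the real logic in TiCDC is more involved.

```go
package scheduling

// ownerTracker is a hypothetical processor-side record of the most recent
// owner seen so far.
type ownerTracker struct {
	latestOwnerRev int64
	latestOwnerID  CaptureID
}

// shouldAccept returns false for a command whose owner-rev is smaller than
// the largest one seen so far, i.e. a command from a superseded owner.
func (t *ownerTracker) shouldAccept(from CaptureID, ownerRev int64) bool {
	if ownerRev < t.latestOwnerRev {
		// A stale owner: reject its command.
		return false
	}
	if ownerRev > t.latestOwnerRev {
		// A newer owner has announced itself or sent a command: remember it.
		t.latestOwnerRev = ownerRev
		t.latestOwnerID = from
	}
	return from == t.latestOwnerID
}
```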

## Correctness Discussion

### Assumptions

Here are some basic assumptions needed before we discuss correctness.

- Etcd is correctly implemented, i.e., it does not violate its own safety guarantees. For example, we should never have two owners elected at the same time, and of two successive owners, the later one should hold a larger `owner-rev`.
- A processor stops writing to the downstream immediately after it loses its Etcd session. This is a very strong assumption and not entirely realistic, but if we accept the possibility of a processor writing to the downstream arbitrarily far into the future after it has lost its connection to Etcd, no protocol would be correct for our purpose.

One part of correctness is _safety_, which mainly consists of two properties:

1. Two processors do not write to the same downstream table at the same time (No Double Write).
2. The owner does not advance `checkpoint-ts` unless all tables are being replicated and all of them have reached that checkpoint (No Lost Table).

In addition to safety, _liveness_ is also in a broad sense part of correctness. Our liveness guarantee is simple:

- If the cluster is stable, i.e., no node crashes and no network isolation happens, replication will _eventually_ make progress.

In other words, the liveness guarantee says that _the cluster does not deadlock itself, and when everything is running and the network is working, the cluster eventually works as a whole_.

We will be focusing on _safety_ here because violations of it are more difficult to detect.

### Owner switches

- For _No Double Write_ to hold, the new owner must not assign a table again while the table is still running somewhere. Since the new owner only assigns tables after all captures registered in Etcd at some point (_T0_) have sent _Sync_ to it, it cannot reassign a table that is already running on any of these captures. The only remaining way for _No Double Write_ to be violated is for a table to be running, at some point _T1_ after _T0_, on a capture that has not sent _Sync_, which would imply that the capture went online after the new owner was elected. If the new owner itself had assigned that table, it would know the table is running and would not assign it again, so the table must have been assigned by an old owner. But _EtcdWorker_ does not allow the old owner to receive new capture information after the new owner is elected, so this is impossible.
- _No Lost Table_ is guaranteed because the owner advances the watermarks only if all captures have sent _Sync_ and reported their respective watermarks, as sketched below.
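
A sketch of this advancement rule, reusing the hypothetical `ownerSketch` and `Ts` from the earlier sketches: the owner takes the minimum checkpoint over all known captures and refuses to advance if any capture has not yet sent _Sync_ and reported a watermark (a zero `checkpointTs` stands for "not reported" in this sketch).

```go
package scheduling

// globalCheckpoint returns the candidate global checkpoint, i.e. the minimum
// of the per-capture checkpoints, and false if any capture has not yet sent
// Sync and reported a watermark (so the checkpoint must not be advanced).
func (o *ownerSketch) globalCheckpoint(allCaptures []CaptureID) (Ts, bool) {
	var minTs Ts
	first := true
	for _, id := range allCaptures {
		st, synced := o.captures[id]
		if !synced || st.checkpointTs == 0 {
			return 0, false
		}
		if first || st.checkpointTs < minTs {
			minTs = st.checkpointTs
			first = false
		}
	}
	return minTs, true
}
```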

### Processor restarts

Definition: a processor is said to have restarted if its internal state has been cleared but its capture ID has **not** changed.

- First assume that the system is correct before the restart. Since a restart only stops the replication of some tables, it cannot by itself create a _Double Write_. The restarted processor then sends _Sync_ to the owner, telling the owner that it is no longer replicating any table, and the owner re-dispatches the lost tables. So _No Double Write_ is not violated.
- _No Lost Table_ is not violated either, because a restarted processor does not replicate any table, which means it will not upload any _checkpoint_. In other words, the global checkpoint will not be advanced while the processor is restarting.