# Distributed Scheduling in TiCDC

## Background

TiCDC boasts high availability and horizontal scalability. To make this possible, TiCDC needs a distributed scheduling mechanism: a mechanism by which tables can be distributed across the nodes in the cluster and replication can survive node failures. We still need a center of control, which we call the Owner, but the Owner should fail over very quickly if the node running it crashes.

Early in the TiCDC project, we chose a solution that sends all information over Etcd, which is a distributed key-value store. However, this solution has proven to be problematic, as it fails to scale both horizontally and vertically. As a result, we created a solution using direct peer-to-peer messaging, which not only has semantics better suited to scheduling, but also performs and scales better.

## The Abstraction

To succinctly describe the algorithm used here, we need to abstract the TiCDC Owner and Processor. To simplify the matter, we omit "changefeed" management here, and suppose that multiple "changefeeds" are isolated from each other as far as scheduling is concerned.

- The Owner is a piece of code that can persist and restore a timestamp `global checkpoint`, which is guaranteed to be monotonically increasing and is a lower bound of the progress of all nodes on all tables. The Owner calls our `ScheduleDispatcher` periodically and supplies it with both the latest `global checkpoint` and the list of tables that should currently be replicating.
- The Processor is a piece of code that actually replicates the tables. We can `Add` and `Remove` tables from it, and query the status of a table.

## The Protocol

The communication protocol between the Owner and the Processors is as follows:

### Message Types

#### DispatchTable

- Direction: Owner -> Processor

- Semantics: Informs the processor to start (or stop) replicating a table.

- ```go
  type DispatchTableMessage struct {
      OwnerRev int64   `json:"owner-rev"`
      ID       TableID `json:"id"`
      IsDelete bool    `json:"is-delete"`
  }
  ```

#### DispatchTableResponse

- Direction: Processor -> Owner

- Semantics: Informs the owner that the processor has finished a table operation on the given table.

- ```go
  type DispatchTableResponseMessage struct {
      ID TableID `json:"id"`
  }
  ```

#### Announce

- Direction: Owner -> Processor

- Semantics: Announces the election of the sender node as the owner.

- ```go
  type AnnounceMessage struct {
      OwnerRev int64 `json:"owner-rev"`
      // Sends the owner's version for compatibility check
      OwnerVersion string `json:"owner-version"`
  }
  ```

#### Sync

- Direction: Processor -> Owner

- Semantics: Tells the newly elected owner the processor's state, or tells the owner that the processor has restarted.

- ```go
  type SyncMessage struct {
      // Sends the processor's version for compatibility check
      ProcessorVersion string

      Running  []TableID
      Adding   []TableID
      Removing []TableID
  }
  ```

#### Checkpoint

- Direction: Processor -> Owner

- Semantics: The processor reports to the owner its current watermarks.

- ```go
  type CheckpointMessage struct {
      CheckpointTs Ts `json:"checkpoint-ts"`
      ResolvedTs   Ts `json:"resolved-ts"`
      // We can add more fields in the future
  }
  ```
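To make the dispatch/acknowledge flow concrete, here is a minimal, hypothetical sketch of the processor side of `DispatchTable`. The `processor` struct and `sendToOwner` callback are invented for illustration and are not the actual TiCDC types; the add/remove operation is also treated as if it completed synchronously, whereas the real processor acknowledges only after the asynchronous operation finishes.

```go
package scheduling

// TableID and the message types mirror the definitions above.
type TableID = int64

type DispatchTableMessage struct {
	OwnerRev int64   `json:"owner-rev"`
	ID       TableID `json:"id"`
	IsDelete bool    `json:"is-delete"`
}

type DispatchTableResponseMessage struct {
	ID TableID `json:"id"`
}

// processor is a hypothetical, simplified processor state: the set of tables
// it is currently replicating, plus a callback for sending messages to the owner.
type processor struct {
	tables      map[TableID]struct{}
	sendToOwner func(msg interface{})
}

// onDispatchTable handles a DispatchTable message. Here the table operation is
// pretended to finish immediately, so the acknowledgment is sent right away.
func (p *processor) onDispatchTable(msg DispatchTableMessage) {
	if msg.IsDelete {
		delete(p.tables, msg.ID)
	} else {
		p.tables[msg.ID] = struct{}{}
	}
	// Acknowledge the finished operation so the owner can update its view.
	p.sendToOwner(DispatchTableResponseMessage{ID: msg.ID})
}
```

The important property illustrated here is that a `DispatchTableResponse` is sent only after the operation has actually completed, which is what lets the owner keep track of which table operations are still in flight.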
### Interaction

![](./media/scheduling_proto.svg)

The figure above shows the basic flow of interaction between the Owner and the Processors.

1. **Owner** gets elected and announces its ownership to **Processor A**.
2. Similarly, **Owner** announces its ownership to **Processor B**.
3. **Processor A** reports to **Owner** its internal state, which includes which tables are being added, removed, and run.
4. **Processor B** does the same.
5. **Owner** tells **Processor A** to start replicating _Table 1_.
6. **Owner** tells **Processor B** to start replicating _Table 2_.
7. **Processor A** reports that _Table 1_ has finished initializing and is now being replicated.
8. **Processor B** reports that _Table 2_ has finished initializing and is now being replicated.
9. **Processor A** sends its watermark.
10. **Processor B** sends its watermark too.

### Owner Switches

Because of TiCDC's high availability, the communication protocol needs to handle owner switches. The basic idea here is that, when a new owner takes over, it queries all alive processors' states and prepares to react to the processors' messages (especially `DispatchTableResponse` and `Checkpoint`) exactly as the previous owner would.

Moreover, if the old owner is ignorant of the fact that it is no longer the owner and still tries to act as the owner, a processor rejects the old owner's commands once it has received at least one message from the new owner. The order of owners' succession is recorded by the `owner-rev` field in `Announce` and `DispatchTable`.

![](./media/scheduling_proto_owner_change.svg)

## Correctness Discussion

### Assumptions

Here are some basic assumptions needed before we discuss correctness.

- Etcd is correctly implemented, i.e. it does not violate its own safety promises. For example, we should never have two owners elected at the same time, and of two successive owners, the later one holds the larger `owner-rev`.
- A processor stops writing to the downstream immediately after it loses its Etcd session. This is a very strong assumption and is not realistic. But if we accept the possibility of a processor writing to the downstream indefinitely even after it loses its connection to Etcd, no protocol would be correct for our purpose.

One part of correctness is _safety_, which mainly consists of two parts:

1. Two processors do not write to the same downstream table at the same time (No Double Write).
2. The owner does not advance `checkpoint-ts` unless all tables are being replicated and all of them have reached the `checkpoint` (No Lost Table).

In addition to safety, _liveness_ is also, in a broad sense, part of correctness. Our liveness guarantee is simple:

- If the cluster is stable, i.e., no node crashes and no network isolation happens, replication will _eventually_ make progress.

In other words, the liveness guarantee says that _the cluster does not deadlock itself, and when everything is running and the network is working, the cluster eventually works as a whole_.

We will be focusing on _safety_ here because violations of it are more difficult to detect.
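Before going through the arguments below, it may help to see the fencing rule from the Owner Switches section in code form. The following is a minimal sketch under the assumption that each processor remembers the largest `owner-rev` it has seen; `ownerFence` and `shouldAccept` are illustrative names, not the actual TiCDC implementation.

```go
package scheduling

// ownerFence is a hypothetical helper tracking the highest owner-rev a
// processor has seen. Any message carrying a smaller owner-rev must come
// from a deposed owner and is rejected.
type ownerFence struct {
	latestOwnerRev int64
}

// shouldAccept reports whether a message stamped with ownerRev (from Announce
// or DispatchTable) should be processed. Seeing a larger owner-rev means a
// newer owner has been elected, so the fence advances and smaller revisions
// are rejected from then on.
func (f *ownerFence) shouldAccept(ownerRev int64) bool {
	if ownerRev < f.latestOwnerRev {
		return false // stale owner, ignore its command
	}
	f.latestOwnerRev = ownerRev
	return true
}
```

Because Etcd guarantees that a later owner holds a larger `owner-rev`, accepting only non-decreasing revisions is enough to make a deposed owner's commands harmless.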
### Owner switches

- For _No Double Write_ to be violated, the new owner would have to assign a table again while the table is still running. But since the new owner only starts assigning tables once all captures registered in Etcd at some point (_T0_) have sent _Sync_ to it, the owner cannot reassign a table already running on any of these captures. The only remaining way to violate _No Double Write_ is for the table to be running, at some point _T1_ after _T0_, on a capture that was not registered at _T0_. That would imply the capture has come online after the new owner was elected, and since the new owner cannot have reassigned the table, it must have been an old owner that assigned it. But _EtcdWorker_ does not allow the old owner to receive new capture information after the new owner gets elected, so this is impossible.
- _No Lost Table_ is guaranteed because the owner advances the watermarks only if all captures have sent _Sync_ and have sent their respective watermarks.

### Processor restarts

Definition: a processor is said to have restarted if its internal state has been cleared but its capture ID has **not** changed.

- First assume that the system is correct before the restart. Since the restart only stops the replication of certain tables, it cannot create any _Double Write_. The restarted processor then sends _Sync_ to the owner, which tells the owner that it is no longer replicating any table, and the owner re-dispatches the lost tables. So _No Double Write_ is not violated.
- _No Lost Table_ is not violated either, because a restarted processor does not replicate any table, which means it will not upload any _checkpoint_. In other words, the global checkpoint will not be advanced during the restart of the processor.
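Finally, here is a small sketch of the "No Lost Table" rule used in both arguments above: the owner advances the global checkpoint only when every capture has sent _Sync_ and reported a watermark, and the new value is the minimum of the reported checkpoints. The `captureStatus` type and `advanceGlobalCheckpoint` function are hypothetical simplifications, not the actual TiCDC code; `Ts` is assumed to be a `uint64` TSO timestamp as in the messages above.

```go
package scheduling

import "math"

// Ts is the TSO timestamp type used in the messages above (assumed uint64 here).
type Ts = uint64

// captureStatus is a hypothetical per-capture view kept by the owner: whether
// the capture has answered Sync, and the latest checkpoint it has reported
// (0 meaning "no watermark received yet").
type captureStatus struct {
	synced       bool
	checkpointTs Ts
}

// advanceGlobalCheckpoint illustrates the "No Lost Table" rule: the global
// checkpoint moves forward only when every known capture has sent Sync and at
// least one watermark, and the new value is the minimum of all reported
// checkpoints, so it remains a lower bound of every capture's progress.
func advanceGlobalCheckpoint(captures map[string]captureStatus, current Ts) Ts {
	if len(captures) == 0 {
		return current
	}
	minTs := Ts(math.MaxUint64)
	for _, c := range captures {
		if !c.synced || c.checkpointTs == 0 {
			// Some capture has not caught up with the protocol yet;
			// advancing now could skip one of its tables.
			return current
		}
		if c.checkpointTs < minTs {
			minTs = c.checkpointTs
		}
	}
	if minTs < current {
		// The global checkpoint is monotonically increasing by definition.
		return current
	}
	return minTs
}
```

In this simplified model, a capture that has not yet re-reported its watermarks after a restart simply holds the global checkpoint in place, matching the reasoning in the _Processor restarts_ section.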