# Distributed Scheduling in TiCDC

## Background

TiCDC boasts high availability and horizontal scalability. To make this possible, TiCDC needs a distributed scheduling mechanism by which tables can be distributed across the nodes in the cluster and survive node failures. We still need a center of control, which we call the Owner, but the Owner should fail over very quickly if the node running it crashes.

At the beginning of the TiCDC project, we chose a solution that sends all information over Etcd, which is a distributed key-value store. However, this solution has proven problematic, as it fails to scale both horizontally and vertically. As a result, we created a solution using direct peer-to-peer messaging, which not only has semantics better suited to scheduling, but also performs and scales better.

## The Abstraction

To succinctly describe the algorithm used here, we need to abstract the TiCDC Owner and Processor. To simplify the matter, we will omit "changefeed" management here and suppose that multiple "changefeeds" are isolated from each other as far as scheduling is concerned.

- The Owner is a piece of code that can persist and restore a timestamp `global checkpoint`, which is guaranteed to be monotonically increasing and is a lower bound of the progress of all nodes on all tables. The Owner will call our `ScheduleDispatcher` periodically and supply it with both the latest `global checkpoint` and the list of tables that should currently be replicated.
- The Processor is a piece of code that actually replicates the tables. We can `Add` and `Remove` tables from it, and query the status of a table (both roles are sketched below).
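
The following is a minimal Go sketch of these two roles. The package name, the `Tick` method, `TableStatus`, and the type aliases are hypothetical and purely for illustration; TiCDC's actual interfaces differ.

```go
package scheduling

// Hypothetical aliases; the real definitions live elsewhere in TiCDC.
type (
	TableID = int64
	Ts      = uint64
)

// TableStatus describes a table's lifecycle on a processor (hypothetical).
type TableStatus int

const (
	TableAdding TableStatus = iota
	TableRunning
	TableRemoving
)

// ScheduleDispatcher is the Owner-side scheduling hook: it is called
// periodically with the latest global checkpoint and the tables that
// should currently be replicated.
type ScheduleDispatcher interface {
	Tick(globalCheckpoint Ts, shouldReplicate []TableID)
}

// Processor is the node-side worker: tables can be added and removed,
// and their status queried.
type Processor interface {
	AddTable(id TableID)
	RemoveTable(id TableID)
	Status(id TableID) TableStatus
}
```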

## The Protocol

The communication protocol between the Owner and the Processors is as follows:

### Message Types

#### DispatchTable

- Direction: Owner -> Processor

- Semantics: Tells the processor to start (or stop) replicating a table.

- ```go
  type DispatchTableMessage struct {
  	OwnerRev int64   `json:"owner-rev"`
  	ID       TableID `json:"id"`
  	IsDelete bool    `json:"is-delete"`
  }
  ```

#### DispatchTableResponse

- Direction: Processor -> Owner

- Semantics: Informs the owner that the processor has finished the dispatched operation (adding or removing) on the given table.

- ```go
  type DispatchTableResponseMessage struct {
  	ID TableID `json:"id"`
  }
  ```

#### Announce

- Direction: Owner -> Processor

- Semantics: Announces the election of the sender node as the owner.

- ```go
  type AnnounceMessage struct {
  	OwnerRev int64 `json:"owner-rev"`
  	// Sends the owner's version for compatibility check
  	OwnerVersion string `json:"owner-version"`
  }
  ```

#### Sync

- Direction: Processor -> Owner

- Semantics: Tells the newly elected owner the processor's state, or tells the owner that the processor has restarted.

- ```go
  type SyncMessage struct {
  	// Sends the processor's version for compatibility check
  	ProcessorVersion string

  	Running          []TableID
  	Adding           []TableID
  	Removing         []TableID
  }
  ```

#### Checkpoint

- Direction: Processor -> Owner

- Semantics: The processor reports its current watermarks to the owner.

- ```go
  type CheckpointMessage struct {
  	CheckpointTs Ts `json:"checkpoint-ts"`
  	ResolvedTs   Ts `json:"resolved-ts"`
  	// We can add more fields in the future
  }
  ```
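
Since the messages carry JSON struct tags, the snippet below illustrates what a `DispatchTable` message could look like when JSON-encoded. It continues the hypothetical `scheduling` package above and assumes the message structs are part of it; the actual encoding used by the peer-to-peer layer may differ.

```go
package scheduling

import (
	"encoding/json"
	"fmt"
)

// exampleEncode prints a JSON-encoded DispatchTableMessage, e.g.:
//
//	{"owner-rev":3,"id":42,"is-delete":false}
func exampleEncode() {
	msg := DispatchTableMessage{
		OwnerRev: 3,
		ID:       42,
		IsDelete: false,
	}
	data, err := json.Marshal(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```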

### Interaction

![](./media/scheduling_proto.svg)

The figure above shows the basic flow of interaction between the Owner and the Processors; a sketch of the owner-side bookkeeping follows the steps below.

1. **Owner** gets elected and announces its ownership to **Processor A**.
2. Similarly, **Owner** announces its ownership to **Processor B**.
3. **Processor A** reports to **Owner** its internal state, namely which tables are being added, removed, and run.
4. **Processor B** does the same.
5. **Owner** tells **Processor A** to start replicating _Table 1_.
6. **Owner** tells **Processor B** to start replicating _Table 2_.
7. **Processor A** reports that _Table 1_ has finished initializing and is now being replicated.
8. **Processor B** reports that _Table 2_ has finished initializing and is now being replicated.
9. **Processor A** sends its watermark.
10. **Processor B** sends its watermark too.
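
As a sketch of the owner-side bookkeeping implied by this flow, the code below continues the hypothetical `scheduling` package and assumes the message structs above are part of it. It is not TiCDC's actual scheduler.

```go
package scheduling

// CaptureID identifies a TiCDC node; a hypothetical alias for illustration.
type CaptureID = string

// captureStatus is the owner's record of one processor's tables and watermarks.
type captureStatus struct {
	running, adding, removing map[TableID]struct{}
	checkpointTs, resolvedTs  Ts
}

// ownerSketch mirrors the interaction steps above.
type ownerSketch struct {
	captures map[CaptureID]*captureStatus
}

func newOwnerSketch() *ownerSketch {
	return &ownerSketch{captures: make(map[CaptureID]*captureStatus)}
}

// onMessage handles the processor-to-owner messages of the flow above.
func (o *ownerSketch) onMessage(from CaptureID, msg any) {
	switch m := msg.(type) {
	case *SyncMessage:
		// Steps 3-4: learn the processor's current state before scheduling.
		st := &captureStatus{
			running:  make(map[TableID]struct{}),
			adding:   make(map[TableID]struct{}),
			removing: make(map[TableID]struct{}),
		}
		for _, id := range m.Running {
			st.running[id] = struct{}{}
		}
		for _, id := range m.Adding {
			st.adding[id] = struct{}{}
		}
		for _, id := range m.Removing {
			st.removing[id] = struct{}{}
		}
		o.captures[from] = st
	case *DispatchTableResponseMessage:
		// Steps 7-8: an add has finished (the table is now running) or a
		// remove has finished (the table is gone).
		st := o.captures[from]
		if st == nil {
			return // ignore captures that have not sent Sync yet
		}
		if _, ok := st.adding[m.ID]; ok {
			delete(st.adding, m.ID)
			st.running[m.ID] = struct{}{}
		} else {
			delete(st.removing, m.ID)
		}
	case *CheckpointMessage:
		// Steps 9-10: record the per-capture watermarks; the global
		// checkpoint is derived from the minimum over all captures.
		st := o.captures[from]
		if st == nil {
			return
		}
		st.checkpointTs = m.CheckpointTs
		st.resolvedTs = m.ResolvedTs
	}
}
```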

### Owner Switches

Because of TiCDC's high availability, the communication protocol needs to handle owner switches. The basic idea here is that, when a new owner takes over, it queries the states of all live processors and prepares to react to the processors' messages (especially `DispatchTableResponse` and `Checkpoint`) exactly as the previous owner would.

Moreover, if the old owner is unaware that it is no longer the owner and still tries to act as one, a processor will reject the old owner's commands once it has received at least one message from the new owner. The order of succession of owners is recorded in the `owner-rev` field carried by `Announce` and `DispatchTable`.

![](./media/scheduling_proto_owner_change.svg)
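
A minimal sketch of the processor-side `owner-rev` check described above, again in the hypothetical `scheduling` package; the real logic in TiCDC is more involved.

```go
package scheduling

// ownerTracker is a hypothetical processor-side record of the most recent
// owner seen so far.
type ownerTracker struct {
	latestOwnerRev int64
	latestOwnerID  CaptureID
}

// shouldAccept returns false for a command whose owner-rev is smaller than
// the largest one seen so far, i.e. a command from a superseded owner.
func (t *ownerTracker) shouldAccept(from CaptureID, ownerRev int64) bool {
	if ownerRev < t.latestOwnerRev {
		// A stale owner: reject its command.
		return false
	}
	if ownerRev > t.latestOwnerRev {
		// A newer owner has announced itself or sent a command: remember it.
		t.latestOwnerRev = ownerRev
		t.latestOwnerID = from
	}
	return from == t.latestOwnerID
}
```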

## Correctness Discussion

### Assumptions

Here are some basic assumptions needed before we discuss correctness.

- Etcd is correctly implemented, i.e., it does not violate its own safety guarantees. For example, we should never have two owners elected at the same time, and of two successive owners, the later one should hold a larger `owner-rev`.
- A processor stops writing to the downstream immediately after it loses its Etcd session. This is a very strong assumption and not entirely realistic, but if we accept the possibility of a processor writing to the downstream arbitrarily far into the future after it has lost its connection to Etcd, no protocol would be correct for our purpose.

One part of correctness is _safety_, which mainly consists of two properties:

1. Two processors do not write to the same downstream table at the same time (No Double Write).
2. The owner does not advance `checkpoint-ts` unless all tables are being replicated and all of them have reached that checkpoint (No Lost Table).

In addition to safety, _liveness_ is also in a broad sense part of correctness. Our liveness guarantee is simple:

- If the cluster is stable, i.e., no node crashes and no network isolation happens, replication will _eventually_ make progress.

In other words, the liveness guarantee says that _the cluster does not deadlock itself, and when everything is running and the network is working, the cluster eventually works as a whole_.

We will be focusing on _safety_ here because violations of it are more difficult to detect.

### Owner switches

- For _No Double Write_ to hold, the new owner must not assign a table again while the table is still running somewhere. Since the new owner only assigns tables after all captures registered in Etcd at some point (_T0_) have sent _Sync_ to it, it cannot reassign a table that is already running on any of these captures. The only remaining way for _No Double Write_ to be violated is for a table to be running, at some point _T1_ after _T0_, on a capture that has not sent _Sync_, which would imply that the capture went online after the new owner was elected. If the new owner itself had assigned that table, it would know the table is running and would not assign it again, so the table must have been assigned by an old owner. But _EtcdWorker_ does not allow the old owner to receive new capture information after the new owner is elected, so this is impossible.
- _No Lost Table_ is guaranteed because the owner advances the watermarks only if all captures have sent _Sync_ and reported their respective watermarks, as sketched below.
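
A sketch of this advancement rule, reusing the hypothetical `ownerSketch` and `Ts` from the earlier sketches: the owner takes the minimum checkpoint over all known captures and refuses to advance if any capture has not yet sent _Sync_ and reported a watermark (a zero `checkpointTs` stands for "not reported" in this sketch).

```go
package scheduling

// globalCheckpoint returns the candidate global checkpoint, i.e. the minimum
// of the per-capture checkpoints, and false if any capture has not yet sent
// Sync and reported a watermark (so the checkpoint must not be advanced).
func (o *ownerSketch) globalCheckpoint(allCaptures []CaptureID) (Ts, bool) {
	var minTs Ts
	first := true
	for _, id := range allCaptures {
		st, synced := o.captures[id]
		if !synced || st.checkpointTs == 0 {
			return 0, false
		}
		if first || st.checkpointTs < minTs {
			minTs = st.checkpointTs
			first = false
		}
	}
	return minTs, true
}
```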

### Processor restarts

Definition: a processor is said to have restarted if its internal state has been cleared but its capture ID has **not** changed.

- First assume that the system is correct before the restart. Since a restart only stops the replication of some tables, it cannot by itself create a _Double Write_. The restarted processor then sends _Sync_ to the owner, telling the owner that it is no longer replicating any table, and the owner re-dispatches the lost tables. So _No Double Write_ is not violated.
- _No Lost Table_ is not violated either, because a restarted processor does not replicate any table, which means it will not upload any _checkpoint_. In other words, the global checkpoint will not be advanced while the processor is restarting.