# Design doc: 2PC in Vitess

# Objective

Provide a mechanism to support atomic commits for distributed transactions across multiple Vitess databases. Transactions should either complete successfully or rollback completely.

# Background

Vitess distributed transactions have so far been Best Effort Commit (BEC). An application is allowed to send DMLs that go to different shards or keyspaces in a single transaction. When a commit is issued, Vitess tries to individually commit each db transaction that was initiated. However, if a database goes down in the middle of a commit, that part of the transaction is lost. Moreover, with the support of lookup vindexes, VTGates could themselves open distributed transactions from single statements issued by the app.

2PC is the de facto protocol for atomically committing distributed transactions. Unfortunately, this has been considered impractical, and has predominantly failed in the industry. There are a few reasons:

* A database that goes down in the middle of a 2PC commit would hold transactions in other databases hostage till it was recovered. This is now a solved problem due to replication and fast failovers.
* The ACID requirements of relational databases were too demanding and contentious for a pure implementation to practically scale.
* The industry standard distributed transaction protocol (XA) overreached on flexibility and became too chatty.
* Subpar schemes for transaction management: Some added too much additional overhead, and some paid lip service and defeated the reliability of 2PC.

This document intends to address the above concerns with some practical trade-offs.

Although MySQL supports the XA protocol, it’s been unusable due to bugs. Version 5.7 claims to have fixed them all, but the more common versions in use are 5.6 and below, and we need to make 2PC work for those versions also.
Even at 5.7, we still have to contend with the chattiness of XA, and the fact that it’s unused code.

The most critical component of the 2PC protocol is the `Prepare` functionality. There is actually a way to implement Prepare on top of a transactional system. This is explained in a [Vitess Blog](https://vitess.io/blog/2016-06-07-distributed-transactions-in-vitess/), which will be used as the foundation for this design.

Familiarity with the blog and the [2PC algorithm](http://c2.com/cgi/wiki?TwoPhaseCommit) is required to understand the rest of the document.

# Overview

Vitess will add a few variations to the traditional 2PC algorithm:

* There is a presumption that the Resource Managers (aka participants) have to know upfront that they’re involved in a 2PC transaction. Many of the APIs force the application to make this choice at the beginning of a transaction. This is actually not required. In the case of Vitess, a distributed transaction will start off just like before, with a normal Begin. It will be converted only if the application requests a 2PC commit. This approach allows us to optimize some common use cases.
* The 2PC algorithm does not specify how the Transaction Manager maintains the metadata. If you work through all the failure modes, it will become evident that the manager must also be an HA transactional system that must survive failures without data loss. Since the VTTablets are already built to be HA, there’s no reason to build yet another system. So, we’ll split the role of the Transaction Manager into two:
    * The Coordinator will be stateless and will orchestrate the work. VTGates are the perfect fit for this role.
    * One of the VTTablets will be designated as the Metadata Manager (MM). It will be used to store the metadata and perform the necessary state transitions.
* If we designate one of the participant VTTablets to be the MM, then that database can avoid the prepare phase: If you assume there are N participants, the typical explanation says that you perform prepares from 1->N, and then commit from 1->N. If we instead went from 1->N for prepare and N->1 for commit, then the N’th database would perform a Prepare->Decide to commit->Commit. Instead, we execute the DML needed to transition the metadata state to ‘Decide to Commit’ as part of the app transaction, and commit it. If the commit fails, then it’s treated as the prepare having failed. If the commit succeeds, then it’s treated as all three operations having succeeded.
* The Prepare functionality will be implemented as explained in the [blog](https://vitess.io/blog/2016-06-07-distributed-transactions-in-vitess/).

Combining the above changes allows us to keep the most common use case efficient: A transaction that affects only one database incurs no additional cost due to 2PC.

In the case of multi-db transactions, we can choose the participant with the highest number of statements to be the MM; that database will not incur the cost of going through the Prepare phase, and we also avoid requiring a separate transaction to persist the commit decision.

## ACID trade-offs

The core 2PC algorithm only guarantees Atomicity. Either the entire transaction commits, or it’s rolled back completely.

Consistency is an orthogonal property because it’s mainly related to making sure the values in the database don’t break relational rules.

Durability is guaranteed by each database, and the collective durability is inherited by the 2PC process.

Isolation requires additional work. If a client tries to read data in the middle of a distributed commit, it could see partial commits. In order to prevent this, databases put read locks on rows that are involved in a 2PC.
So, anyone that tries to read them will have to wait till the transaction is resolved. This type of locking is so contentious that it often defeats the purpose of distributing the data.

In reality, this level of Isolation guarantee is overkill for most code paths of an application. So, it’s more practical to relax this for the sake of scalability, and let the application use explicit locks where it thinks better Isolation is required.

On the other hand, Atomicity is critical; non-atomic transactions can result in partial commits, which is effectively corrupt data. As stated earlier, this is what we get from 2PC.

# Glossary

We introduced many terms in the previous sections. It’s time for a quick recap:

* Distributed Transaction: Any transaction that spans multiple databases is a distributed transaction. It does not imply any commit protocol.
* Best Effort Commit (BEC): This protocol is what’s currently supported by Vitess, where commits are sent to all participants. This could result in partial commits if there are failures during the process.
* Two-Phase Commit (2PC): This is the protocol that guarantees Atomic distributed commits.
* Coordinator: This is a stateless process that is responsible for initiating, resuming and completing a 2PC transaction. This role is fulfilled by the VTGates.
* Resource Manager aka Participant: Any database that’s involved in a distributed transaction. Only VTTablets can be participants.
* Metadata Manager (MM): The database responsible for storing the metadata and performing its state transitions. In Vitess, one of the participants will be designated as the MM.
* Watchdog: The watchdog looks for abandoned transactions and initiates the process to get them resolved.
* Distributed Transaction ID (DTID): A unique identifier for a 2PC transaction.
* VTTablet transaction id (VTID): This is the individual transaction ID for each VTTablet participant that contains the application’s statements to be committed/rolled back.
* Decision: This is the irreversible decision to either commit or rollback the transaction. Although confusing, this is also referred to as the ‘Commit Decision’. We’ll also indirectly refer to this as ‘Metadata state transition’. This is because a transaction undergoes many state changes. The Decision is a critical transition. So, it warrants its own name.

# Life of a 2PC transaction

* The application issues a Begin to VTGate. At this time, the Session proto is just updated to indicate that it’s in a transaction.
* The application sends DMLs to VTGate. As these DMLs are received, VTGate starts transactions against various VTTablets. The transaction id for each VTTablet (VTID) is stored in the Session proto.
* The application requests a 2PC. Until this point, there is no difference between a BEC and a 2PC. In the case of BEC, VTGate just sends the commit to all participating VTTablets. For 2PC, VTGate initiates and executes the workflow described in the subsequent steps.

## Prepare

* Generate a DTID.
* The VTTablet with the most DMLs is singled out as the MM. To this VTTablet, issue a CreateTransaction command with the DTID. This information will be monitored by the watchdogs.
* Issue a Prepare to all other VTTablets. Send the DTID as part of the prepare request.

## Commit

* Execute the 3-in-1 action of Prepare->Decide->Commit (StartCommit) for the MM VTTablet. This will change the metadata state to ‘Commit’.
* Issue CommitPrepared commands to all the prepared VTTablets using the DTID.
* Delete the transaction in the MM (ConcludeTransaction).

## Rollback

Any form of failure until the point of saving the commit decision will result in a decision to rollback.

* Transition the metadata state to ‘Rollback’.
* Issue RollbackPrepared commands to the prepared transactions using the DTID.
* If the original VTGate is still orchestrating, rollback the unprepared transactions using their VTIDs. The initial version will just execute RollbackPrepared on all participants with the assumption that any unprepared transactions will be rolled back by the transaction killer.
* Delete the transaction in the MM (ConcludeTransaction).

## Watchdog

A watchdog will kick in if a transaction remains unresolved for too long. If such a transaction is found, it will be in one of three states:

1. Prepare
2. Rollback
3. Commit

For #1 and #2, the Rollback workflow is initiated. For #3, the commit is resumed.

The following diagram illustrates the life-cycle of a Vitess transaction.

*(Diagram: life-cycle of a Vitess transaction.)*

A transaction generally starts off as a single DB transaction. It becomes a distributed transaction as soon as more than one VTTablet is affected. If the app issues a rollback, then all participants are simply rolled back. If a BEC is issued, then all transactions are individually committed. These actions are the same irrespective of single or distributed transactions.

In the case of a single DB transaction, a 2PC is also a BEC.

If a 2PC is issued to a distributed transaction, the new machinery kicks in. Actual metadata is created. The state starts off as ‘Prepare’ and remains so while Prepares are issued. In this state, only Prepares are allowed.

If Prepares are successful, then the state is transitioned to ‘Commit’. In the Commit state, only commits are allowed. By the guarantee given by the Prepare contract, all databases will eventually accept the commits.

Any failure during the Prepare state will result in the state being transitioned to ‘Rollback’. In this state, only rollbacks are allowed.
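
The state rules above can be summarized as a small state machine. The following is a minimal Python sketch, not Vitess code: the names `DTState`, `TRANSITIONS`, `ALLOWED_OP`, `transition` and `watchdog_action` are hypothetical and exist only for illustration. Only the three states, the operations allowed in each, and the watchdog rule are taken from the text.

```python
from enum import Enum


class DTState(Enum):
    """Metadata states named in the text; the Python names are illustrative."""
    PREPARE = "Prepare"
    COMMIT = "Commit"
    ROLLBACK = "Rollback"


# Legal metadata transitions. The Decision is the only way out of PREPARE,
# and it is irreversible, so COMMIT and ROLLBACK have no successors.
TRANSITIONS = {
    DTState.PREPARE: {DTState.COMMIT, DTState.ROLLBACK},
    DTState.COMMIT: set(),
    DTState.ROLLBACK: set(),
}

# The only participant operation allowed while the metadata is in each state.
ALLOWED_OP = {
    DTState.PREPARE: "Prepare",
    DTState.COMMIT: "CommitPrepared",
    DTState.ROLLBACK: "RollbackPrepared",
}


def transition(current, target):
    """Return the new state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def watchdog_action(state):
    """Workflow a watchdog starts for an abandoned transaction."""
    # Prepare and Rollback both lead to the rollback workflow; a recorded
    # Commit decision is resumed.
    return "resume commit" if state is DTState.COMMIT else "rollback"
```

Keeping the transition table separate from the per-state operation table mirrors the split in the text between the metadata state (owned by the MM) and the calls sent to the other participants.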

# Component interactions

In order to make 2PC work, the following pieces of functionality have to be built:

* DTID generation
* Prepare API
* Metadata Manager API
* Coordinator
* Watchdogs
* Client API
* Production support

The diagram below shows how the various components interact.

*(Diagram: interactions between the components.)*

The detailed design explains all the functionalities and interactions.

# Detailed Design

## DTID generation

Currently, transaction ids are issued by VTTablets (VTID), and those ids are considered local. In order to coordinate distributed transactions, a new system is needed to identify and track them. This is needed mainly so that the watchdog process can pick up an orphaned transaction and resolve it to completion.

The DTID will be generated by taking the VTID of the MM and prefixing it with the keyspace, shard info and a sequence to prevent collisions. If the MM’s VTID was ‘1234’ for keyspace ‘order’ and shard ‘40-80’, then the DTID would be ‘order:40-80:1234’. A collision could still happen if there is a failover and the new vttablet’s starting VTID had overlaps with the previous instance. To prevent this, the starting VTID of the vttablet will be adjusted to a value higher than any used by the prepared DTIDs.

## Prepare API

The Prepare API will be provided by VTTablet, and will follow the guidelines of the [blog](https://vitess.io/blog/2016-06-07-distributed-transactions-in-vitess/). It’s essentially three functions: Prepare, CommitPrepared and RollbackPrepared.

### Statement list and state

Every transaction will have to remember its statement list. VTTablet already records queries against each transaction (RecordQuery). However, it’s currently the original queries of the request. This has to be changed to the DMLs that are sent to the database.

The current RecordQuery functionality is mainly for troubleshooting and diagnostics.
So, it’s not very material if we changed it to record actual DMLs. It would remain equally useful.

### Schema

The tables will be in the \_vt database. All time stamps are represented as unix nanoseconds.

The redo_state table needs to support the following use cases:

* Prepare: Create row.
* Recover & repair tool: Fetch all transactions: full joined table scan.
* Resolve: Transition state for a DTID: update where dtid = :dtid and state = :prepared.
* Watchdog: Count unresolved transactions that are older than X: select where time_created < X.
* Delete a resolved transaction: delete where dtid = :dtid.

```
create table redo_state(
  dtid varbinary(512),
  state bigint, -- state can be 0: Failed, 1: Prepared.
  time_created bigint,
  primary key(dtid)
)
```

The redo_statement table is a detail of the redo_state table. It needs the ability to read the statements of a dtid in the correct order (by id), and the ability to delete all statements for a given dtid:

```
create table redo_statement(
  dtid varbinary(512),
  id bigint,
  statement mediumblob,
  primary key(dtid, id)
)
```

### Prepare

This function will take a DTID and a VTID as input.

* Get the tx conn for use, and move it to the prepared pool. If the prepared pool is full, rollback the transaction and return an error.
* Save the metadata into the redo logs as a separate transaction. If this step fails, the main transaction is also rolled back and an error is returned.

If VTTablet is asked to shut down or change state from primary, the code that waits for tx pool must internally rollback the prepared transactions and return them to the tx pool. Note that the rollback must happen only after the currently pending (non-prepared) transactions are resolved.
If a pending transaction is waiting on a lock held by a prepared transaction, it will eventually timeout and get rolled back.

Eventually, a different VTTablet will be transitioned to become the primary. At that point, it will recreate the unresolved transactions from redo logs. If the replays fail, we’ll raise an alert and start the query service anyway.

Typically, a replay is not expected to fail because vttablet does not allow writing to the database until the replays are done. Also, no external agent should be allowed to perform writes to MySQL, which is a loosely enforced Vitess requirement. Other vitess processes do write to MySQL directly, but they’re not the kind that interfere with the normal flow of transactions.

*Unresolved issue: If a resharding happens in the middle of a prepare, such a transaction potentially becomes multiple different transactions in a target shard. For now, this means that a resharding failover has to wait for all prepared transactions to be resolved. Special code has to be written in vttablet to handle this specific workflow.*

VTTablet always brackets DMLs with BEGIN-COMMIT. This will ensure that no autocommit statements can slip through if connections are inadvertently closed out of sequence.

### CommitPrepared

* Extract the transaction from the Prepare pool.
    * If transaction is in the failed pool, return an error.
    * If not found, return success (it was already resolved).
* As part of the current transaction (VTID), transition the state in redo_log to Committed and commit it.
    * On failure, move it to the failed pool. Subsequent commits will permanently fail.
* Return the conn to the tx pool.

### RollbackPrepared

* Delete the redo log entries for the dtid in a separate transaction.
* Extract the transaction from the Prepare pool, rollback and return the conn to the tx pool.
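
The three Prepare API functions above can be sketched against an in-memory stand-in. This is a hedged illustration in Python, not the VTTablet implementation: the class `PreparedTabletSketch`, its fields, the string states in `redo_log`, and the `simulate_failure` parameter are all assumptions made for the example. Real code works on MySQL connections and the redo tables rather than dictionaries.

```python
class PreparedTabletSketch:
    """In-memory stand-in for one VTTablet's prepare machinery (hypothetical)."""

    def __init__(self, prepared_pool_size):
        self.prepared_pool_size = prepared_pool_size
        self.active = {}     # VTID -> DML statements of an open transaction
        self.prepared = {}   # DTID -> DML statements held in the prepared pool
        self.failed = set()  # DTIDs moved to the failed pool
        self.redo_log = {}   # DTID -> {"state": ..., "statements": [...]}
        self.committed = []  # statements made durable by CommitPrepared

    def prepare(self, vtid, dtid):
        statements = self.active.pop(vtid)
        if len(self.prepared) >= self.prepared_pool_size:
            # Pool full: dropping the statements models the rollback.
            raise RuntimeError("prepared pool is full; transaction rolled back")
        self.prepared[dtid] = statements
        # The design saves this in a separate database transaction.
        self.redo_log[dtid] = {"state": "Prepared", "statements": list(statements)}

    def commit_prepared(self, dtid, simulate_failure=False):
        if dtid in self.failed:
            raise RuntimeError("transaction is in the failed pool")
        if dtid not in self.prepared:
            return  # not found: already resolved, so report success
        statements = self.prepared.pop(dtid)
        if simulate_failure:
            self.failed.add(dtid)  # subsequent commits fail permanently
            raise RuntimeError("commit failed")
        # The state change and the application statements commit together.
        self.redo_log[dtid]["state"] = "Committed"
        self.committed.extend(statements)

    def rollback_prepared(self, dtid):
        self.redo_log.pop(dtid, None)  # delete the redo entries first
        self.prepared.pop(dtid, None)  # then roll back; a no-op if unknown
```

Note that `commit_prepared` returns success for an unknown DTID, which is what lets a watchdog safely retry a commit that an earlier coordinator had already completed on this participant.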

## Metadata Manager API

The MM functionality is provided by VTTablet. This could be implemented as a separate service, but designating one of the participants to act as the manager gives us some optimization opportunities. The supported functions are CreateTransaction, StartCommit, SetRollback, and ConcludeTransaction.

### Schema

The transaction metadata will consist of two tables. It will need to fulfil the following use cases:

* CreateTransaction: Create row.
* Transition state: update where dtid = :dtid and state = :prepare.
* Resolve flow: select dt_state & dt_participant where dtid = :dtid.
* Watchdog: full joined table scan where time_created < X.
* Delete a resolved transaction: delete where dtid = :dtid.

```
create table dt_state(
  dtid varbinary(512),
  state bigint, -- state PREPARE, COMMIT, ROLLBACK as defined in the protobuf for TransactionMetadata.
  time_created bigint,
  primary key(dtid)
)
```

```
create table dt_participant(
  dtid varbinary(512),
  id bigint,
  keyspace varchar(256),
  shard varchar(256),
  primary key (dtid, id)
)
```

### CreateTransaction

This function creates a row in the dt_state table. The initial state will be PREPARE. A successful create begins the 2PC process. This will be followed by VTGate issuing prepares to the rest of the participants.

### StartCommit

This function can only be called for a transaction that’s not been abandoned. A watchdog that initiates a recovery will never make a decision to commit. This means that we can assume that the participant’s transaction (VTID) is still alive.

The function will issue a DML that will transition the state from PREPARE to COMMIT as part of the participant’s transaction (VTID). If not successful, it returns an error, which will be treated as failure to prepare, and will cause VTGate to rollback the rest of the transactions.

If successful, a commit is issued, which will also finalize the decision to commit the rest of the transactions.

### SetRollback

SetRollback transitions the state from PREPARE to ROLLBACK using an independent transaction. When this function is called, the MM’s transaction (VTID) may still be alive. So, we infer the transaction id from the dtid and perform a best effort rollback. If the transaction is not found, it’s a no-op.

### ConcludeTransaction

This function just deletes the row.

### ReadTransaction

This function returns the transaction info given the dtid.

### ReadTwopcInflight

This function returns all transaction metadata including the info in the redo logs.

## Coordinator

Because VTGate is already responsible for BEC, aka Commit(Atomic=false), it can naturally be extended to act as the coordinator for 2PC. It needs to support Commit(Atomic=true), and ResolveTransaction.

If there are operational errors before the commit decision, the transaction is rolled back. If the rollback fails, or if a failure happens after the commit decision, we give up. The watchdog will later pick it up and try to resolve it.

### Commit(Atomic=true)

This call is issued on an active transaction, whose Session info is known. The function will perform the workflow described in the life of a transaction:

* Identify a VTTablet as MM, and generate a DTID based on the identity of the MM.
* CreateTransaction on the MM
* Prepare on all other participants
* StartCommit on the MM
* CommitPrepared on all other participants
* ConcludeTransaction on the MM

Any non-operational failure before StartCommit will trigger the rollback workflow:

* SetRollback on the MM
* RollbackPrepared on all participants for which Prepare was sent
* Rollback on all other participants
* ConcludeTransaction on the MM

### ResolveTransaction

This function is called by a watchdog if a VTGate had failed to complete a transaction. It could be due to VTGate crashing, or other unrecoverable errors.

The function starts off with a ReadTransaction, and based on the state, it performs the following actions:

* Prepare: SetRollback and initiate rollback workflow.
* Rollback: initiate rollback workflow.
* Commit: initiate commit workflow.

Commit workflow:

* CommitPrepared on all participants.
* ConcludeTransaction on the MM.

Rollback workflow:

* RollbackPrepared on all participants.
* ConcludeTransaction on the MM.

## Watchdogs

The stateless VTGates are considered ephemeral and can fail at any time, which means that transactions could be abandoned in the middle of a distributed commit. To mitigate this, every primary vttablet will poll its dt_state table for distributed transactions that are lingering. If any such transaction is found, it invokes VTGate with that dtid for a Resolve to be retried.

_This is not a clean design because it introduces a backward dependency from VTTablet to VTGate. However, it saves us the need to create yet another server that will add to the overall complexity of the deployment. It was decided that this is a worthy trade-off._

## Client API

The client API change will be an additional flag to the Commit call, where the app can set Atomic to true or false.
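
The coordinator workflow for Commit(Atomic=true) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not VTGate code: `FakeTablet`, `make_dtid` and `commit_atomic` are hypothetical names, and the tablets only record the calls they receive. The DTID format follows the ‘order:40-80:1234’ example from the DTID generation section, and the last step deletes the MM’s row with ConcludeTransaction, as in the ‘Life of a 2PC transaction’ section.

```python
class FakeTablet:
    """Records the calls it receives; `fail_on` names a call that raises."""

    def __init__(self, keyspace, shard, vtid, fail_on=None):
        self.keyspace, self.shard, self.vtid = keyspace, shard, vtid
        self.fail_on = fail_on
        self.calls = []

    def call(self, name, txid):
        self.calls.append(name)
        if name == self.fail_on:
            raise RuntimeError(f"{name} failed on {self.keyspace}/{self.shard}")


def make_dtid(mm):
    # keyspace:shard:VTID of the MM, as in the document's example.
    return f"{mm.keyspace}:{mm.shard}:{mm.vtid}"


def commit_atomic(mm, others):
    """Run the 2PC workflow; return 'committed' or 'rolled back'."""
    dtid = make_dtid(mm)
    prepare_sent = []
    try:
        mm.call("CreateTransaction", dtid)
        for tablet in others:
            prepare_sent.append(tablet)  # Prepare was sent, even if it fails
            tablet.call("Prepare", dtid)
        mm.call("StartCommit", dtid)  # success here is the commit decision
    except RuntimeError:
        # Failure before the decision: run the rollback workflow.
        mm.call("SetRollback", dtid)
        for tablet in others:
            if tablet in prepare_sent:
                tablet.call("RollbackPrepared", dtid)
            else:
                tablet.call("Rollback", tablet.vtid)  # unprepared: use its VTID
        mm.call("ConcludeTransaction", dtid)
        return "rolled back"
    # After the decision, an error simply propagates; the watchdog is
    # expected to resume the commit later (not modelled here).
    for tablet in others:
        tablet.call("CommitPrepared", dtid)
    mm.call("ConcludeTransaction", dtid)
    return "committed"
```

For example, `commit_atomic(FakeTablet("order", "40-80", 1234), [FakeTablet("order", "-40", 7)])` drives the MM through CreateTransaction, StartCommit and ConcludeTransaction, and the other participant through Prepare and CommitPrepared.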

## Production support

Beyond the basic functionality, additional work is needed to make 2PC viable for production. The areas of concern are monitoring, tooling and configuration.

### Monitoring

To facilitate monitoring, new variables have to be exported.

VTTablet

* The Transactions hierarchy will be extended to report CommitPrepared and RollbackPrepared stats, which includes histograms. Since Prepare is an intermediate step, it will not be rolled up in this variable.
* For Prepare, two new variables will be created:
    * Prepare histogram will report prepare timings.
    * PrepareStatements histogram will report the number of statements for each Prepare.
* New histogram variables will be exported for all the new MM functions.
* LingeringCount is a gauge that reports if a transaction has been unresolved for too long. This most likely means that it’s repeatedly failing. So, an alert should be raised. This applies to prepared transactions also.
* Any unexpected errors during a 2PC will increment a counter for InternalErrors, which should already be set to raise an alert.

VTGate

* TwoPCTransactions will report Commit, Rollback, ResolveCommit and ResolveRollback stats. The Resolve subvars are for the ResolveTransaction function.
* TwoPCParticipants will report the transaction count and the ParticipantCount. This is a way to track the average number of participants per 2PC transaction.

### Tooling

For vttablet, a new URL, /twopcz, will display unresolved twopc transactions and transactions that are in the Prepare state. It will also provide buttons to force the following actions:

* Discard a Prepare that failed to commit.
* Force a commit or rollback of a prepared transaction.
* Resolve a distributed transaction.

# Data guarantees

Although the above workflows are foolproof, they do rely on the data guarantees provided by the underlying systems and the fact that prepared transactions can get killed only together with vttablet. Of these, one failure mode has to be visited: It’s possible that there’s data loss when a primary goes down and a new replica gets elected as the new primary. This loss is highly mitigated with semi-sync turned on, but it’s still possible. In such situations, we have to describe how 2PC will behave.

In all of the scenarios below, there is irrecoverable data loss. But the system needs to alert correctly, and we must be able to make best effort recovery and move on. For now, these scenarios require operator intervention, but the system could be made to automatically perform these as we gain confidence.

## Loss of MM’s transaction and metadata

Scenario: An MM VTTablet experiences a network partition, and the coordinator continues to commit transactions. Eventually, there’s a reparent and all these transactions are lost.

In this situation, it’s possible that the participants are in a prepared state, but if you looked for their metadata, you’ll not find it because it’s lost. These transactions will remain in the prepared state forever, holding locks. If this happened, a Lingering alert will be raised. An operator will then realize that there was data loss, and can manually rollback these transactions from the /twopcz dashboard.

## Loss of a Prepared transaction

The previous scenario could happen to one of the participants instead. If so, the 2PC transaction will become unresolvable because an attempt to commit the prepared transaction will repeatedly fail on the participant that lost the prepared transaction.

This situation will raise a 2PC Lingering transaction alert. The operator can force the 2PC transaction as resolved.

## Loss of MM’s transaction after commit decision

Scenario: Network partition happened after metadata was created. VTGate performs a StartCommit, succeeds in a few commits and crashes. Now, some transactions are in the prepared state. After the recovery, the metadata of the 2PC transaction is also in the Prepared state.

The watchdog will grab this transaction and invoke a ResolveTransaction. The VTGate will then make a decision to rollback, because all it sees is a 2PC in Prepare state. It will attempt to rollback all participants, while some might have already committed. A failure like this will be undetectable.

## Prepared transaction gets killed

It is possible for an external agent to kill the connection of a prepared transaction. If this happened, MySQL will roll it back. If the system is serving live traffic, it may make forward progress in such a way that the transaction may not be replayable, or may replay with a different outcome.

This is a very unlikely occurrence. But if something like this happens, then an alert will be raised when the coordinator finds that the transaction is missing. That transaction will be marked as Failed until an operator resolves it.

But if there’s a failover before the transaction is marked as failed, it will be resurrected over future transactions, possibly with incorrect changes. A failure like this will be undetectable.

# Testing Plan

The main workflow of 2PC is fairly straightforward and easy to test. What makes it complicated are the failure modes. But those have to be tested thoroughly. Otherwise, we’ll not be able to gain the confidence to take this to production.

Some important failure scenarios that must be tested are:

* Correct shutdown of vttablet when it has prepared transactions.
* Resurrection of prepared transactions when a vttablet becomes a primary.
* A reparent of a VTTablet that has prepared transactions.
  This is effectively tested by the previous two steps, but it will be nice as an integration test. It will be even nicer if we could go a step further and see if VTGate can still complete a transaction if a reparent happened in the middle of a commit.

# Innovation

This design has a bunch of innovative ideas. However, it’s possible that they’ve been used before under other circumstances, or even 2PC itself. Here’s a summary of all the new ideas in this document, some with more merit than others:

* Moving away from the heavyweight XA standard.
* Implementing Prepare functionality on top of a system that does not inherently support it.
* Storing the Metadata in a transactional engine and making the coordinator stateless.
* Storing the Metadata with one of the participants and avoiding the cost of a Prepare for that participant.
* Choosing to relax Isolation guarantees while maintaining Atomicity.