## Consistency, Isolation and Transactions in Heeus

The Heeus design uses the CQRS pattern, in which different data models are used for writes (by Commands) and reads (by Queries) ([HEEUSCQRS](../README.md#event-sourcing--cqrs)).

Microsoft states [[MSFTCQRS](https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs)] that one of the challenges in CQRS systems is eventual consistency of the Read Model:

> Eventual consistency. If you separate the read and write databases, the read data may be stale. The read model store must be updated to reflect changes to the write model store, and it can be difficult to detect when a user has issued a request based on stale read data
>
> Microsoft, CQRS pattern

This paper:

- Defines User Stories to be satisfied
- Analyzes existing literature about Consistency and Isolation
- Defines terms
- Proposes a way of handling Transactions, Consistency and Isolation in Heeus

## Content

- [User Stories](#user-stories)
- [Literature](#literature)
- [ANSI SQL-92](#ansi-sql-92)
- [A Critique of ANSI SQL Isolation Levels](#a-critique-of-ansi-sql-isolation-levels)
- [Prof. Abadi: Introduction to Transaction Isolation Levels](#prof-abadi-introduction-to-transaction-isolation-levels)
- [Prof.
Abadi: Correctness Anomalies under Serializable Isolation](#prof-abadi-correctness-anomalies-under-serializable-isolation)
- [Consistency levels in Azure Cosmos DB](#consistency-levels-in-azure-cosmos-db)
- [jepsen.io: Consistency Models](#jepsenio-consistency-models)

## User Stories

We believe that the User is interested in the following scenarios:

**Read ASAP**, consistency doesn't matter
- Client does not always see its own writes or writes from other clients
- Examples:
  - Read dashboard figures
  - Read journal (WLog) for building reports

**Read fresh data**
- Client sees its own writes and writes from other clients which happened before the read transaction started
  - Note that exact clock synchronization is impossible
- Client sees writes which committed during the read transaction
  - In fact this is ANSI READ COMMITTED as defined in [ANSISQLCRIT]
- Examples:
  - Read transaction history after making an order or payment (not used atm)

**Read data snapshot**
- Examples:
  - Read the BO state
  - Read the TablesOverview (//FIXME Why do we need a snapshot here?)
  - Read a complex document which consists of many records
    - ? Supported by KV-driver since such documents must be kept in one partition
  - Data enrichment (join with classifiers)
  - Backup


## Literature

A draft literature overview was done earlier in [MGINVART](inv-articles-consistency.md); the next sections analyze the following sources in more detail:

- [ANSISQL92](https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), [ANSI X3.135-1992, American National Standard for Information Systems - Database Language - SQL, November, 1992]
- [ANSISQLCRIT](https://arxiv.org/ftp/cs/papers/0701/0701157.pdf), A Critique of ANSI SQL Isolation Levels, Jun 1995, Microsoft Research
- [ABAISO](http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html), D.
Abadi, Introduction to Transaction Isolation Levels, blogspot.com, May 2019
- [ABACASER](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html), D. Abadi, Correctness Anomalies Under Serializable Isolation, blogspot.com, June 2019
- [COSMOS](https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels), Microsoft, Consistency levels in Azure Cosmos DB, 2022, microsoft.com
- [JEPSEN](https://jepsen.io/consistency), Consistency Models, jepsen.io, 2022

## ANSI SQL-92

Source: [[ANSISQL92]](https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), ANSI X3.135-1992, American National Standard for Information Systems - Database Language - SQL, November, 1992.

The isolation level specifies the kind of phenomena that can occur during the execution of concurrent SQL-transactions. The following phenomena are possible:
```
1) P1 ("Dirty read"): SQL-transaction T1 modifies a row.
   SQL-transaction T2 then reads that row before T1 performs a COMMIT.
   If T1 then performs a ROLLBACK, T2 will have read a row that was
   never committed and that may thus be considered to have never
   existed.

2) P2 ("Non-repeatable read"): SQL-transaction T1 reads a row.
   SQL-transaction T2 then modifies or deletes that row and performs
   a COMMIT. If T1 then attempts to reread the row, it may receive
   the modified value or discover that the row has been deleted.

3) P3 ("Phantom"): SQL-transaction T1 reads the set of rows N
   that satisfy some <search condition>. SQL-transaction T2 then
   executes SQL-statements that generate one or more rows that
   satisfy the <search condition> used by SQL-transaction T1. If
   SQL-transaction T1 then repeats the initial read with the same
   <search condition>, it obtains a different collection of rows.
```

Table 9, "SQL-transaction isolation levels and the three phenomena" specifies the phenomena that are possible and not possible for a given isolation level.
```
Level              P1             P2             P3
READ UNCOMMITTED   Possible       Possible       Possible
READ COMMITTED     Not Possible   Possible       Possible
REPEATABLE READ    Not Possible   Not Possible   Possible
SERIALIZABLE       Not Possible   Not Possible   Not Possible
```

> - The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable
> - A **serializable execution** is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions

(!!!) It is curious that the standard claims:
```
Significant new features are:
...
15) Support for transaction consistency levels
...
```

But "consistency levels" are not mentioned again anywhere in the standard.

## A Critique of ANSI SQL Isolation Levels

Source: [[ANSISQLCRIT](https://arxiv.org/ftp/cs/papers/0701/0701157.pdf)], A Critique of ANSI SQL Isolation Levels, Jun 1995, Microsoft Research
- Research paper by Microsoft, Sybase, UMass (University of Massachusetts Amherst)

**Phenomena** and **anomalies**:

> The concept of a phenomenon is not explicitly defined in the ANSI specifications, but the specifications suggest that phenomena are action subsequences that may lead to anomalous (perhaps nonserializable) behavior.

Phenomena/anomalies which are prevented by isolation are shown using the notation described in [ANSISQLCRIT]:

> Histories consisting of reads, writes, commits, and aborts can be written in a shorthand notation: “w1[x]” means a write by transaction 1 on data item x (which is how a data item is “modified”), and “r2[x]” represents a read of x by transaction 2.
Transaction 1 reading and writing a set of records satisfying predicate P is denoted by r1[P] and w1[P] respectively. Transaction 1’s commit and abort (ROLLBACK) are written “c1” and “a1”, respectively.

Phenomena defined by [ANSISQL92], when interpreted strictly, are called "anomalies":

- **A1**: w1[x]...r2[x]...(a1 and c2 in either order)
- **A2**: r1[x]...w2[x]...c2...r1[x]...c1
- **A3**: r1[P]...w2[y in P]...c2...r1[P]...c1

The paper suggests broad interpretations of these anomalies, adds extra phenomena, and argues that this is "a must":

- P0 (Dirty Write)
- P1 (Dirty Read)
- P2 (Fuzzy or Non-Repeatable Read)
- P3 (Phantom)
- P4 (Lost Update)
- P4C (Cursor Lost Update)
- A5A (Read Skew)
- A5B (Write Skew)

### P0 (Dirty Write)

> w1[x]...w2[x]...(c1 or a1)

> One reason why Dirty Writes are bad is that they can violate database consistency. For example consider the history w1[x]...w2[x]...w2[y]...c2...w1[y]...c1. T1's changes to y and T2's to x both “survive”. If T1 writes 1 in both x and y while T2 writes 2, the result will be x=2, y=1 violating x = y.
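
The Dirty Write history quoted above can be replayed mechanically. The following is a minimal sketch, not part of [ANSISQLCRIT]: the `replay` helper and the tuple layout `(transaction, item, value)` are assumptions made for illustration, and commits are ignored because only the order of writes matters here.

```python
# Replay of the Dirty Write history w1[x]...w2[x]...w2[y]...c2...w1[y]...c1.
# T1 writes 1 to both items, T2 writes 2 to both items.

def replay(writes):
    """Apply writes in the given order; the last write to an item wins."""
    state = {}
    for _txn, item, value in writes:
        state[item] = value
    return state

T1 = [("T1", "x", 1), ("T1", "y", 1)]
T2 = [("T2", "x", 2), ("T2", "y", 2)]

# Interleaving taken from the quoted history.
dirty = [("T1", "x", 1), ("T2", "x", 2), ("T2", "y", 2), ("T1", "y", 1)]

dirty_state = replay(dirty)
serial_states = [replay(T1 + T2), replay(T2 + T1)]

print(dirty_state)                   # {'x': 2, 'y': 1}
print(dirty_state in serial_states)  # False
```

Both serial orders leave x equal to y, while the interleaved history does not, which is why the constraint x = y is violated.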
### P1 (Dirty Read)

> A1: w1[x]...r2[x]...(a1 and c2 in either order)

> P1: w1[x]...r2[x]...(c1 or a1)

> Consider history H1, involving a $40 transfer between bank balance rows x and y, [x=50, y=50]:

> H1: r1[x=50]...w1[x=10]...r2[x=10]...r2[y=50]...c2...r1[y=50]...w1[y=90]...c1

- T2 gets an incorrect total balance of 60, which never existed: [x=50, y=50]...w1[x=10]...r2[x=10]...r2[y=50]
- History is non-serializable (interesting claim)
- History does not contain any phenomena from [ANSISQL92]
  - A1 requires an abort
  - A2 requires a double read
  - A3 requires a set operation

### P2 (Fuzzy or Non-Repeatable Read)

> A2: r1[x]...w2[x]...c2...r1[x]...c1

> P2: r1[x]...w2[x]...(c1 or a1)

> Similar arguments show that P2 should be taken as the ANSI intention rather than A2. A history that discriminates these two interpretations is:

> H2: r1[x=50]...r2[x=50]...w2[x=10]...r2[y=50]...w2[y=90]...c2...r1[y=90]...c1

- T1 gets an incorrect total balance of 140: [x=50, y=50]...r1[x=50]...w2[y=90]...r1[y=90]
- H2 is non-serializable
- History does not contain any phenomena from [ANSISQL92]

### P3 (Phantom)

> A3: r1[P]...w2[y in P]...c2...r1[P]...c1

> P3: r1[P]...w2[y in P]...(c1 or a1)

- Same rationale as for P2
- w2[y in P]: insert or delete an element which satisfies the P condition, or update an element so that it satisfies the P condition

### P4 (Lost Update)

> r1[x=1]...w2[x=10]...w1[x=1+1]...c1

> However, forbidding P2 also precludes P4, since w2[x] comes after r1[x] and before T1 commits or aborts. Therefore the anomaly P4 is useful in distinguishing isolation levels intermediate in strength between READ COMMITTED and REPEATABLE READ.
- READ COMMITTED << Cursor Stability << REPEATABLE READ
  - `<<` means "weaker" (prevents fewer anomalies)
- We believe the example is incorrect since a Dirty Write occurs here; it should be: r1[x=1]...w2[x=10]...c2...w1[x=1+1]...c1
  - See also the discussion [here](https://stackoverflow.com/questions/72850415/isolation-level-difference-between-dirty-write-and-lost-update)


### P4C (Cursor Lost Update)

> rc1[x=1]...w2[x=10]...w1[x=1+1]...c1

- Prevented by non-ANSI Cursor Stability
  - Pros: _short duration lock_ rather than _long duration lock_
  - Only the value under the cursor is protected
- Perhaps should be: rc1[x=1]...w2[x=10]...c2...w1[x=1+1]...c1
  - Otherwise a Dirty Write occurs
- Note that this is possible: rc1[x]...rc1[y]...w2[x]...c2...w1[y]...c1

### A5A (Read Skew)

> r1[x]...w2[x]...w2[y]...c2...r1[y]...(c1 or a1)

- Inconsistent pair x,y read by T1

> Clearly neither A5A nor A5B could arise in histories where P2 is precluded, since both A5A and A5B have T2 write a data item that has been previously read by an uncommitted T1. Thus, phenomena A5A and A5B are only useful for distinguishing isolation levels that are below REPEATABLE READ in strength.

### A5B (Write Skew)

> r1[x]...r2[y]...w1[y]...w2[x]...(c1 and c2 occur)

- Inconsistent pair x,y written by T1 and T2
- E.g.
  - constraint x*2 <= y
  - r1[x=3]...r2[y=4]...w1[y=6]...w2[x=2]...c1...c2...[x=2,y=6]


### Isolation Types

![image](https://user-images.githubusercontent.com/1348282/206463947-f4b6b3e8-8d27-4ed6-8653-6e1c38b8b9e0.png)

Strange things:

- Perhaps it is wrong that A5B is possible within Cursor Stability
- Note that according to this table Snapshot sometimes allows P3 (Phantom), whereas [ANSISQLCRIT] also claims that:
  > Perhaps most remarkable of all, Snapshot Isolation has no phantoms (in the strict sense of the ANSI definitions A3)
  - The clue may be that A3 is defined as `r1[P]...w2[y in P]...c2...r1[P]...c1` but the phenomenon to be prevented is (P3):
    - r1[P]...w2[y in P]...(c1 or a1)
  - But then P2 should also be "Sometimes Possible"?


## Prof. Abadi: Introduction to Transaction Isolation Levels

Source: [[ABAISO](http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html)], D. Abadi, Introduction to Transaction Isolation Levels, blogspot.com, May 2019
- Daniel Abadi is the Darnell-Kanal Professor of Computer Science at University of Maryland, College Park
- He is best known for:
  - the development of the storage and query execution engines of the C-Store (column-oriented database) prototype, which was commercialized by Vertica and eventually acquired by Hewlett-Packard in 2011
  - his HadoopDB research on fault tolerant scalable analytical database systems, which was commercialized by Hadapt and acquired by Teradata in 2014
  - deterministic, scalable, transactional, distributed systems such as Calvin, which is currently being commercialized by Fauna

> **Database isolation** refers to the ability of a database to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). The overarching goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions.
>
> The key point for our purposes is that we are defining **“perfect isolation”** as the ability of a system to run transactions in parallel, but in a way that is equivalent to as if they were running one after the other. In the SQL standard, this perfect isolation level is called **serializability**.

Notes:

- Preventing ALL [ANSISQL92] anomalies is NOT enough, e.g. Snapshot Isolation prevents all of them but is still not serializable
- **Perfect Isolation** does NOT require that a transaction which ran "earlier" in real time also comes "earlier" in the equivalent serial execution

## Prof. Abadi: Correctness Anomalies Under Serializable Isolation

Source: [[ABACASER](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html)], D. Abadi, Correctness Anomalies Under Serializable Isolation, blogspot.com, June 2019

> In the good old days of having a “database server” which is running on a single physical machine, serializable isolation was indeed sufficient, and database vendors never attempted to sell database software with stronger correctness guarantees than SERIALIZABLE. However, **as distributed and replicated database systems have started** to proliferate in the last few decades, **anomalies and bugs have started to appear** in applications even when running over a database system that guarantees serializable isolation. As a consequence, database system vendors **started to release systems with stronger correctness guarantees than serializable isolation**, which promise a lack of vulnerability to these newer anomalies. In this post, we will discuss several well known **bugs and anomalies in serializable distributed database systems**, and modern correctness guarantees that ensure avoidance of these anomalies.


[YB] gives the following definition of One Copy Serializability (1SR):

> The **One Copy Serializability** [7] is the highest correctness criterion for replica control protocols...
In order to achieve this correctness criterion, it is required that interleaved execution of transactions on replicas be equivalent to serial execution of those transactions on one copy of a database.

[ABACASER] introduces new anomalies which can be found even in 1SR systems and attributes them as follows:

> The next few sections describe some forms of time-travel anomalies that occur in distributed and/or replicated systems, and the types of application bugs that they may cause.

Two of these anomalies, though, may occur even in Heeus CE with storage backed by the bboltdb driver, therefore we do not link these anomalies solely with "distributed and/or replicated systems".

### Immortal Write

> ru: Бессмертная Запись

Anomaly:
- Real-time: w1[x=Daniel]...c1...w2[x=Danny]...c2...w3[x=Danger]...c3
- Serial order: w1[x=Daniel]...w3[x=Danger]...w2[x=Danny]
- w3 goes back in time (time-travel, anachronism)

Notes:
- Can be caused by async replication AND the Unsynchronized Clock problem
- The system can decide to do that for other reasons, since this does not violate the serializability guarantee
- When the “Danny” transaction and/or the other name-change transactions also perform a read to the database as part of the same transaction as the write to the name, the ability to time-travel without violating serializability becomes much more difficult. But for **“blind write” transactions** such as these examples, time-travel is **easy to accomplish**.

### Stale Read

> ru: Несвежее Чтение

Anomaly:
- Real-time: w1[x=50]...c1...w2[x=0]...c2...r3...c3
- Serial order: w1[x=50]...r3[x=50]...w2[x=0]
- r3 goes back in time (time-travel)

Reasons:
1. Async replication (distributed)
2. Unsynchronized Clock problem (distributed)
3. Projection update delay (single node)
4.
   System can decide to do that due to other reasons

### Causal Reverse

> ru: Обратная Причинность ("reverse causality"), "реверс козла" (jocular, literally "goat's reverse")

Anomaly (exchange x and y):
- Real-time: [x=1000000, y=0]...r1[x, y]...w2[x=0]...c2...w3[y=1000000]...c3...c1
- Serial order: w3[y=1000000]...r1[x=1000000, y=1000000]...w2[x=0]

"Real-life" scenario:
- User has 1000000 on account x and 0 on account y
- User withdraws 1000000 in cash from account x
- User deposits 1000000 in cash to account y

One example of a distributed database system that allows the causal reverse is CockroachDB (aka CRDB):
- CockroachDB partitions a database such that each partition commits writes and synchronously replicates data separately from other partitions
- Each write receives a timestamp based on the local clock on one of the servers within that partition
- In general, it is impossible to perfectly synchronize clocks across a large number of machines, so CockroachDB allows a maximum clock skew for which clocks across a deployment can differ
- It is possible in CockroachDB for a transaction to commit, and a later transaction to come along (that writes data to a different partition), that was caused by the earlier one (that started after the earlier one finished), and still receive an earlier timestamp than the earlier transaction.
- This enables a read (**in CockroachDB’s case, this read has to be sent to the system before the two write transactions**) to potentially see the write of the later transaction, but not the earlier one

### Preventing Serial Anomalies

[ABACASER] gives the following classification of serializable systems.

> In distributed and replicated database systems, this additional guarantee of “no time travel” on top of the other serializability guarantees is non-trivial, but has nonetheless been accomplished by several systems such as FaunaDB/Calvin, FoundationDB, and Spanner.
This high level of correctness is called **strict serializability**

**Strong Session Serializable** systems guarantee strict serializability of transactions within the same session, but otherwise only one-copy serializability
- Implementation example: "Sticky session", all requests routed to the same node

**Strong Write Serializable** systems guarantee strict serializability for all transactions that insert or update data, but only one-copy serializability for read-only transactions
- Implementation example: Read-only replica systems where all update transactions go to the master replica which processes them with strict serializability

**Strong Partition Serializable** systems guarantee strict serializability only on a per-partition basis
- Data is divided into a number of disjoint partitions
- Within each partition, transactions that access data within that partition are guaranteed to be strictly serializable
- (!!!) But otherwise, the system only guarantees one-copy serializability

|System Guarantee|Dirty read|Non-repeatable read|Phantom Read|Write Skew|Immortal write|Stale read|Causal reverse|
|--- |--- |--- |--- |--- |--- |--- |--- |
|READ UNCOMMITTED|Possible|Possible|Possible|Possible|Possible|Possible|Possible|
|READ COMMITTED|-|Possible|Possible|Possible|Possible|Possible|Possible|
|REPEATABLE READ|-|-|Possible|Possible|Possible|Possible|Possible|
|SNAPSHOT ISOLATION|-|-|-|Possible|Possible|Possible|Possible|
|SERIALIZABLE / ONE COPY SERIALIZABLE / STRONG SESSION SERIALIZABLE|-|-|-|-|Possible|Possible|Possible|
|STRONG WRITE SERIALIZABLE|-|-|-|-|-|**Possible**|-|
|STRONG PARTITION SERIALIZABLE|-|-|-|-|-|-|**Possible**|
|STRICT SERIALIZABLE|-|-|-|-|-|-|-|

N/B:
- Almost a "triangular matrix", except for the STRONG WRITE/PARTITION SERIALIZABLE rows
- The Read Skew phenomenon is excluded for some reason

## Consistency levels in Azure Cosmos DB

Source:
[[COSMOS](https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels)], Microsoft, Consistency levels in Azure Cosmos DB, 2022, microsoft.com
- [Уровни согласованности в Azure Cosmos DB](https://learn.microsoft.com/ru-ru/azure/cosmos-db/consistency-levels) (Russian version of the same article)
- (!!!) https://github.com/MicrosoftDocs/azure-docs.ru-ru/blob/live/articles/cosmos-db/consistency-levels.md

Consistency levels (with the Russian terms used in translations):

- Strong (ru: Сильная)
- Bounded staleness ([baʊndɪd ˈsteɪlnəs], ru: Запаздывание "lag", Ограниченное устаревание "bounded obsolescence", Ограниченная несвежесть "bounded staleness")
- Session (ru: Сеанс, Сессионная)
- Consistent prefix ([kənˈsɪstənt ˈpriːfɪks], ru: Согласованный префикс)
- Eventual (ru: Итоговая "final", Светлое будущее "bright future")

### Strong Consistency

- The reads are guaranteed to return the most recent committed version of an item
- A client never sees an uncommitted or partial write.

![image](https://user-images.githubusercontent.com/1348282/206690657-52c1b0d5-38f4-45fa-a462-58c1dfaeea4c.png)

### Bounded Staleness Consistency

It is a kind of "eventual consistency" with limited "eventuality".

- The reads might lag behind writes by at most "K" versions (that is, "updates") of an item or by "T" time interval, whichever is reached first
- Bounded staleness offers total global order outside of the "staleness window"
- Bounded staleness is frequently chosen by globally distributed applications that expect low write latencies but require total global order guarantee.
- When a client performs read operations within a region that accepts writes, the guarantees provided by bounded staleness consistency are identical to those guarantees by the strong consistency

![image](https://user-images.githubusercontent.com/1348282/206693168-a90ef23a-73ad-458a-a3a0-86d72282e3e4.png)

### Session Consistency

In session consistency, within a single client session reads are guaranteed to honor:

- consistent-prefix
- monotonic reads
- monotonic writes
- read-your-writes
- write-follows-reads
- ref.
  [JEPSEN](https://jepsen.io/consistency) for some definitions

This assumes a single "writer" session or sharing the session token for multiple writers

![image](https://user-images.githubusercontent.com/1348282/206694173-4da6aa79-07a0-4f43-92d4-b6b9fc6bd1ef.png)

### Consistent Prefix Consistency

> This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
>
> [[CLEPP] Martin Kleppmann, Designing Data-Intensive Applications](https://ebrary.net/64710/computer_science/consistent_prefix_reads)

![image](https://user-images.githubusercontent.com/1348282/206696438-b2fab6d7-0b29-47e2-a9bb-b1c8cd63b1d8.png)


### Eventual Consistency

![image](https://user-images.githubusercontent.com/1348282/206698358-6c3af76d-00ba-42b1-bf9e-8fe9e17ed5d0.png)

## jepsen.io: Consistency Models

Source: [[JEPSEN](https://jepsen.io/consistency)], Consistency Models, jepsen.io, 2022
- > Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc. We maintain an open source software library for systems testing, as well as blog posts and conference talks exploring particular systems’ failure modes
- > [2020-12-23](https://jepsen.io/analyses/scylla-4.2-rc3): Together with the ScyllaDB team, we found seven problems in Scylla, including lightweight transaction (LWT) split-brain in healthy clusters due to a.) incomplete row hashcodes and b.) multiple problems with membership changes. We also identified incomplete or inaccurate documentation, including claims that non-LWT operations were isolated and atomic, and undocumented rules about what kinds of membership operations were legal. Scylla has corrected almost all of these errors via patches and documentation; remaining cases of split-brain appear limited to concurrent membership changes.
431 432 433 434 [](https://jepsen.io/consistency) 435 436 On the left: 437 438 - > `x` is a transactional model: operations (usually termed “transactions”) can involve several primitive sub-operations performed in order 439 - > It is also a multi-object property: operations can act on multiple objects in the system 440 - > `x` does not impose any real-time, or even per-process constraints. If process A completes write w, then process B begins a read r, r is not necessarily guaranteed to observe w. 441 442 On the right: 443 444 - > `x` is a single-object model, but the scope of “an object” varies. Some systems provide linearizability on individual keys in a key-value store; others might provide linearizable operations on multiple keys in a table, or multiple tables in a database—but not between different tables or databases, respectively 445 - `x` impose some real-time constraints 446 447 We can come up with definitions: 448 - **Isolation**: preventing concurrent transactions execution anomalies 449 - **Consistency**: preventing serial execution anomalies 450 - Remember about "equivalent" 451 452 Suddenly (isolation): 453 454 - > Read uncommitted is a consistency model which **prohibits dirty writes**, where two transactions modify the same object concurrently before committing 455 - > Note that read uncommitted **does not impose any real-time constraints**. If process A completes write `w`, then process B begins a read `r`, `r` is not necessarily guaranteed to observe `w` 456 - > In fact, a process can fail to observe its own prior writes, if those writes occurred in different transactions 457 - > Read uncommitted can be totally available: in the presence of network partitions, every node can make progress 458 -> See "pathological orderings" 459 - > Like serializability, read uncommitted allows pathological orderings. For instance, a read uncommmitted database can always return the empty state for any reads, by appearing to execute those reads at time 0. 
  It can also discard write-only transactions by reordering them to execute at the very end of the history, after any reads. Operations like increments can also be discarded, assuming the result of the increment is never observed. Luckily, most implementations don’t seem to take advantage of these optimization opportunities
- Read Committed can see only some of the effects of a previous transaction


### Read Uncommitted

- > Read uncommitted is a consistency model which **prohibits dirty writes**, where two transactions modify the same object concurrently before committing
  - > P0 (Dirty Write): w1(x)...w2(x)
- > (!!!) Read uncommitted can be totally available: in the presence of network partitions, every node can make progress
- > (!!!) Note that read uncommitted **does not impose any real-time constraints**. If process A completes write `w`, then process B begins a read `r`, `r` is not necessarily guaranteed to observe `w`
- > (!!!) In fact, a process can fail to observe its own prior writes, if those writes occurred in different transactions.
- > (!!!) Like serializability, read uncommitted allows pathological orderings. For instance, a read uncommitted database can always return the empty state for any reads, by appearing to execute those reads at time 0. It can also discard write-only transactions by reordering them to execute at the very end of the history, after any reads. Operations like increments can also be discarded, assuming the result of the increment is never observed. Luckily, most implementations don’t seem to take advantage of these optimization opportunities.
- ??? Why does a transaction see its own writes?

### Read Committed

- > Read committed is a consistency model which strengthens read uncommitted by preventing dirty reads
  - > P1 (Dirty Read): w1(x)...r2(x)
- > Read committed can be totally available
- > Note that read committed **does not impose any real-time constraints**...
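
The missing real-time constraint can be illustrated with the Stale Read history from [ABACASER] (w1[x=50]...c1...w2[x=0]...c2...r3[x=50]...c3). The sketch below is a brute-force check written for this note, not an algorithm taken from [JEPSEN]: the tuple layout, the start/end times and the helper names `explains` and `respects_real_time` are assumptions chosen for illustration, and the approach only scales to a handful of transactions.

```python
from itertools import permutations

# Each transaction: (name, start, end, kind, value) on a single item x.
# Times are hypothetical; they only encode that T1 finished before T2
# started and T2 finished before T3 started.
HISTORY = [
    ("T1", 0, 1, "w", 50),
    ("T2", 2, 3, "w", 0),
    ("T3", 4, 5, "r", 50),  # r3 observes the stale value 50
]
FINAL_X = 0  # value left in the database after all three commit

def explains(order):
    """True if running `order` serially yields the same reads and final state."""
    x = None
    for _name, _start, _end, kind, value in order:
        if kind == "w":
            x = value
        elif x != value:
            return False
    return x == FINAL_X

def respects_real_time(order):
    """True if every transaction that finished before another began precedes it."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later[2] < earlier[1]:
                return False
    return True

serial = [[t[0] for t in o] for o in permutations(HISTORY) if explains(o)]
strict = [[t[0] for t in o] for o in permutations(HISTORY)
          if explains(o) and respects_real_time(o)]

print(serial)  # [['T1', 'T3', 'T2']]
print(strict)  # []
```

An equivalent serial order exists, so the history is serializable, but the only such order places T3 before T2 although T2 finished first; no order satisfies the real-time constraint, so the history is not strictly serializable.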
### Monotonic Atomic View

- > Monotonic atomic view is a consistency model which strengthens read committed by preventing transactions from observing some, but not all, of a previously committed transaction’s effects. Once a write from transaction T1 is observed by transaction T2, then all effects of T1 should be visible to T2.
- > Monotonic atomic view can be totally available
- > However, it **does not impose any real-time, or even per-process constraints**...

### Cursor Stability

- > Cursor stability is a consistency model which strengthens read committed by preventing lost updates. It introduces the concept of a cursor, which refers to a particular object being accessed by a transaction. Transactions may have multiple cursors. When a transaction reads an object using a cursor, that object cannot be modified by any other transaction until the cursor is released, or the transaction commits
  - rc1[x=1]...w2[x=10]...w1[x=1+1]...c1 ([ANSISQLCRIT])
- > Cursor stability cannot be totally available; in the presence of network partitions, some or all nodes may be unable to make progress.
- > However, it **does not impose any real-time, or even per-process constraints**...

### Repeatable Read

- > Repeatable read is closely related to serializability, but unlike serializable, it allows phantoms: if a transaction T1 reads a predicate, like "the set of all people with the name “Dikembe”", then another transaction T2 may create or modify a person with the name “Dikembe” before T1 commits. Individual objects are stable once read, but the predicate itself may not be
  - > P2 (Fuzzy Read): r1(x)...w2(x)
- > Repeatable read cannot be totally available
- > However, it **does not impose any real-time, or even per-process constraints**...

### Snapshot Isolation

- > It does not impose any real-time constraints.
  If process A completes write w, then process B begins a read r, r is not necessarily guaranteed to observe w.
- > Unlike serializability, which enforces a total order of transactions, snapshot isolation only forces a partial order: sub-operations in one transaction may interleave with those from other transactions.
- > The most notable phenomena allowed by snapshot isolation are write skews...
  - r1[x]...r2[y]...w1[y(x)]...w2[x(y)]...(c1 and c2 occur) ([ANSISQLCRIT])
- > ...and a read-only transaction anomaly, involving partially disjoint write sets
- > Note that read committed **does not impose any real-time constraints**...

**Read-only transaction anomaly**
- H3: **R2(X0,0) R2(Y0,0)** R1(Y0,0) W1(Y1,20) C1 _R3(X0,0) R3(Y1,20) C3_ **W2(X2,-11) C2** [[FOO](https://www.cs.umb.edu/~poneil/ROAnom.pdf)]
  - Final: Y = 20 and X = -11
- Two accounts X, Y
- T2 withdraws 10 from X; a penalty of 1 is applied for the overdraft (X+Y goes negative)
- T1 adds 20 to Y
- The result Y = 20 and X = -11 is equivalent to the serial order [T2, T1]
  - Whereas T3 reads Y = 20 and X = 0, which is impossible in that order
- The anomaly that arises in this history is that the read-only transaction T3 prints out X = 0 and Y = 20, while the final values are Y = 20 and X = -11
- See also [stackoverflow](https://stackoverflow.com/questions/68697789/read-only-transaction-anomaly), [johann.schleier-smith.com](https://johann.schleier-smith.com/blog/2016/01/06/analyzing-a-read-only-transaction-anomaly-under-snapshot-isolation.html), [muratbuffalo.blogspot.com](http://muratbuffalo.blogspot.com/2021/12/a-read-only-transaction-anomaly-under.html)

### Serializable

- > Informally, serializability means that transactions appear to have occurred in some total order.
- > Serializability does not require a per-process order between transactions. A process can observe a write, then fail to observe that same write in a subsequent transaction.
In fact, a process can fail to observe its own prior writes, if those writes occurred in different transactions.
- > However, it **does not impose any real-time, or even per-process constraints**...

### Strict Serializability

- > Informally, strict serializability (a.k.a. PL-SS, Strict 1SR, Strong 1SR) means that operations appear to have occurred in some order, consistent with the real-time ordering of those operations; e.g. if operation A completes before operation B begins, then A should appear to precede B in the serialization order
- > You can think of strict serializability as serializability’s total order of transactional multi-object operations, plus linearizability’s real-time constraints
- > Alternatively, you can think of a strict serializable database as a linearizable object in which the object’s state is the entire database

### Writes Follow Reads

- If a process reads a value v, which came from a write w1, and later performs write w2, then w2 must be visible after w1
- Avoids Immortal Write [ABACASER]
- A write operation by a process on a data item x following a previous read operation on x by the same process is guaranteed to take place on the same or a more recent value of x that was read [WIKICONS]
- Also known as session causality

### Monotonic Reads

- > if a process performs read r1, then r2, then r2 cannot observe a state prior to the writes which were reflected in r1; intuitively, reads cannot go backwards

### Monotonic Writes

- > if a process performs write w1, then w2, then all processes observe w1 before w2

### Read Your Writes

- > Requires that if a process performs a write w, then that same process performs a subsequent read r, then r must observe w’s effects.
- > Note that read your writes does not apply to operations performed by different processes

### PRAM

- > PRAM is exactly equivalent to read your writes, monotonic writes, and monotonic reads.

### Causal Consistency

- > Causal consistency captures the notion that causally-related operations should appear in the same order on all processes—though processes may disagree about the order of causally independent operations.
- > For example, consider a single object representing a chat between three people, where Attiya asks “shall we have lunch?”, and Barbarella & Cyrus respond with “yes”, and “no”, respectively. Causal consistency allows Attiya to observe “lunch?”, “yes”, “no”; and Barbarella to observe “lunch?”, “no”, “yes”. However, no participant ever observes “yes” or “no” prior to the question “lunch?”
- > Convergent causal systems require that the values of objects in the system converge to identical values, once the same operations are visible. In such a system, users could transiently observe “lunch”, “yes”; and “lunch”, “no”—but everyone would eventually agree on (to pick an arbitrary order) “lunch”, “yes”, “no”.


### Sequential Consistency

- Applies to all operations, not just causally related ones
- > Informally, sequential consistency implies that operations appear to take place in some total order, and that that order is consistent with the order of operations on each individual process
- Still no real-time constraints

### Linearizability

- Consistent with the real-time ordering
- > Linearizability is one of the strongest single-object consistency models, and implies that every operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations: e.g., if operation A completes before operation B begins, then B should logically take effect after A.
- > Linearizability is a single-object model, but the scope of “an object” varies.
Some systems provide linearizability on individual keys in a key-value store; others might provide linearizable operations on multiple keys in a table, or multiple tables in a database—but not between different tables or databases, respectively.
- > (!!!) When you need linearizability across multiple objects, try strict serializability

### Graph

On the left:
- "x" isolation is a transactional model: operations (usually termed “transactions”) can involve several primitive sub-operations performed in order. It is also a multi-object property: operations can act on multiple objects in the system.

## Draft: Consistency

- Workspace Consistency
- Registry - multi-region consistency


## Draft: Client-assisted Consistency

- Each record has a version represented by WLogOffset
- When client reads a record, it gets a WLogOffset
- When client sends a record, it sends a WLogOffset
- If client WLogOffset does not equal current

## References

- [[ABACASER](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html)], D. Abadi, Correctness Anomalies Under Serializable Isolation, blogspot.com, June 2019
- [[ABAISO](http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html)], D.
Abadi, Introduction to Transaction Isolation Levels, blogspot.com, May 2019
- [[ANSISQLCRIT](https://arxiv.org/ftp/cs/papers/0701/0701157.pdf)], A Critique of ANSI SQL Isolation Levels, Jun 1995, Microsoft Research
- [[ANSISQL92](https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt)], ANSI X3.135-1992, American National Standard for Information Systems — Database Language — SQL, November, 1992
- [[ANSISQL99](http://web.cecs.pdx.edu/~len/sql1999.pdf)], ANSI/ISO/IEC International Standard (IS) Database Language SQL — Part 2: Foundation (SQL/Foundation) «Part 2»
- [[CLEPP](https://ebrary.net/64591/computer_science/designing_data-intensive_applications_the_big_ideas_behind_reliable_scalable_and_maintainable_syst)], Martin Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, First Edition, March 2017, O'Reilly
- [[COSMOS](https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels)], Microsoft, Consistency levels in Azure Cosmos DB, 2022, microsoft.com
- [[FOO](https://www.cs.umb.edu/~poneil/ROAnom.pdf)], Alan Fekete, Elizabeth O'Neil, and Patrick O'Neil, A Read-Only Transaction Anomaly Under Snapshot Isolation, www.cs.umb.edu
- [[HEEUSCQRS](../README.md#event-sourcing--cqrs)], Heeus, Event Sourcing & CQRS, github.com
- [[JEPSEN](https://jepsen.io/consistency)], Consistency Models, jepsen.io, 2022
- [[MGINVART](inv-articles-consistency.md)], Maxim Geraskin, inv-articles-consistency.md, github.com, Oct 2022
- [[MSFTCQRS](https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs)], Microsoft, CQRS pattern, microsoft.com
- [[WIKICONS](https://en.wikipedia.org/wiki/Consistency_model)], Consistency model, en.wikipedia.org
- [[YB](https://eprints.soton.ac.uk/262096/1/reft.pdf)], D. Yadav, M.
Butler, Rigorous Design of Fault-Tolerant Transactions for Replicated Database Systems using Event B, School of Electronics and Computer Science, University of Southampton
- [[VV](https://arxiv.org/pdf/1512.00168.pdf)], Paolo Viotti, Marko Vukolic, Consistency in Non-Transactional Distributed Storage Systems, arxiv.org, 2016


## See also

- [README-v1.md](README-v1.md)
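
For readers who want to replay the H3 history from the Snapshot Isolation section, the following is a minimal sketch in Go. It is not Heeus code: the `store`, `txn` and `replayH3` names are hypothetical, and the toy multi-version store with a first-committer-wins check is only one common way to model snapshot isolation, assumed here for illustration. The interleaving and the account rules (T1 deposits 20 into Y, T2 withdraws 10 from X with a penalty of 1 when X+Y goes negative) are taken from the history as described in [FOO].

```go
package main

import "fmt"

// version is one committed value of a key, tagged with its commit timestamp.
type version struct {
	commitTS int
	value    int
}

// store is a toy multi-version key-value store (hypothetical, illustration only).
type store struct {
	clock    int
	versions map[string][]version
}

func newStore(initial map[string]int) *store {
	s := &store{versions: map[string][]version{}}
	for k, v := range initial {
		s.versions[k] = []version{{commitTS: 0, value: v}}
	}
	return s
}

// txn reads from the snapshot taken at begin and buffers writes until commit.
type txn struct {
	s       *store
	startTS int
	writes  map[string]int
}

func (s *store) begin() *txn {
	return &txn{s: s, startTS: s.clock, writes: map[string]int{}}
}

// read returns the transaction's own write, or else the latest version
// committed at or before the transaction's start timestamp.
func (t *txn) read(key string) int {
	if v, ok := t.writes[key]; ok {
		return v
	}
	vs := t.s.versions[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].commitTS <= t.startTS {
			return vs[i].value
		}
	}
	return 0
}

func (t *txn) write(key string, v int) { t.writes[key] = v }

// commit applies first-committer-wins: it fails if any written key already
// has a version committed after this transaction started.
func (t *txn) commit() bool {
	if len(t.writes) == 0 {
		return true // read-only transactions always commit
	}
	for k := range t.writes {
		vs := t.s.versions[k]
		if n := len(vs); n > 0 && vs[n-1].commitTS > t.startTS {
			return false
		}
	}
	t.s.clock++
	for k, v := range t.writes {
		t.s.versions[k] = append(t.s.versions[k], version{t.s.clock, v})
	}
	return true
}

// replayH3 runs the interleaving R2 R2 | R1 W1 C1 | R3 R3 C3 | W2 C2.
func replayH3() (t3X, t3Y, finalX, finalY int) {
	s := newStore(map[string]int{"X": 0, "Y": 0})

	t2 := s.begin() // T2 takes its snapshot first: X=0, Y=0
	x2, y2 := t2.read("X"), t2.read("Y")

	t1 := s.begin() // T1 deposits 20 into Y and commits
	t1.write("Y", t1.read("Y")+20)
	t1.commit()

	t3 := s.begin() // read-only T3 starts after T1 committed
	t3X, t3Y = t3.read("X"), t3.read("Y")
	t3.commit()

	newX := x2 - 10 // T2 withdraws 10 from X
	if newX+y2 < 0 {
		newX-- // overdraft penalty, decided on T2's stale snapshot
	}
	t2.write("X", newX)
	t2.commit() // disjoint write sets, so the commit succeeds

	final := s.begin()
	return t3X, t3Y, final.read("X"), final.read("Y")
}

func main() {
	t3X, t3Y, finalX, finalY := replayH3()
	fmt.Printf("T3 saw X=%d Y=%d; final X=%d Y=%d\n", t3X, t3Y, finalX, finalY)
}
```

In this sketch T3 observes X = 0 and Y = 20 while the final state is X = -11 and Y = 20. The final state corresponds to the serial order [T2, T1], yet T3 already sees the effect of T1 without the effect of T2, which is the anomaly the history illustrates.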