github.com/voedger/voedger@v0.0.0-20240520144910-273e84102129/design/consistency/README.md (about)

     1  ## Consistency, Isolation and Transactions in Heeus
     2  
     3  The Heeus design uses the CQRS design pattern, in which different data models are used for writes (by Commands) and reads (by Queries) ([HEEUSCQRS](../README.md#event-sourcing--cqrs)).
     4  
     5  Microsoft states [[MSFTCQRS](https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs)] that one of the challenges in CQRS systems is eventual consistency of the Read Model:
     6  
     7  > Eventual consistency. If you separate the read and write databases, the read data may be stale. The read model store must be updated to reflect changes to the write model store, and it can be difficult to detect when a user has issued a request based on stale read data
     8  >
     9  > Microsoft, CQRS pattern
    10  
    11  This paper:
    12  
    13  - Defines User Stories to be satisfied
    14  - Analyzes existing literature about Consistency and Isolation
    15  - Defines terms
    16  - Proposes a way to handle Transactions, Consistency and Isolation in Heeus
    17  
    18  ## Content
    19  
    20  - [User Stories](#user-stories)
    21  - [Literature](#literature)
    22    - [ANSI SQL-92](#ansi-sql-92)
    23    - [A Critique of ANSI SQL Isolation Levels](#a-critique-of-ansi-sql-isolation-levels)
    24    - [Prof. Abadi: Introduction to Transaction Isolation Levels](#prof-abadi-introduction-to-transaction-isolation-levels)
    25    - [Prof. Abadi: Correctness Anomalies under Serializable Isolation](#prof-abadi-correctness-anomalies-under-serializable-isolation)
    26    - [Consistency levels in Azure Cosmos DB](#consistency-levels-in-azure-cosmos-db)
    27    - [jepsen.io: Consistency Models](#jepsenio-consistency-models)
    28  
    29  ## User Stories
    30  
    31  We believe that the User is interested in the following scenarios:
    32  
    33  **Read ASAP**, consistency doesn't matter
    34  - Client does not always see its own writes or writes from other clients
    35  - Examples: 
    36    - Read dashboard figures
    37    - Read journal (WLog) for building reports
    38  
    39  **Read fresh data**
    40  - Client sees its own writes and writes from other clients which happened before the read transaction started
    41    - Note that exact clock synchronization is impossible
    42  - Client sees writes which are committed during the read transaction
    43  - In fact this is ANSI READ COMMITTED as defined in [ANSISQLCRIT]
    44  - Examples: 
    45    - Read transaction history after making an order or payment (not used at the moment)
    46  
    47  **Read data snapshot**
    48  - Examples:
    49    - Read the BO state
    50    - Read the TablesOverview (//FIXME Why do we need a snapshot here?)
    51    - Read a complex document which consists of many records
    52      - ? Supported by KV-driver since such documents must be kept in one partition
    53    - Data enrichment (join with classifiers)
    54    - Backup
    55  
    56  
    57  ## Literature
    58  
    59  A draft literature overview was done earlier in [MGINVART](inv-articles-consistency.md); the next sections analyze the following sources in more detail:
    60  
    61  - [ANSISQL92](https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), [ANSI X3.135-1992, American National Standard for Information Systems — Database Language — SQL, November, 1992]
    62  - [ANSISQLCRIT](https://arxiv.org/ftp/cs/papers/0701/0701157.pdf), A Critique of ANSI SQL Isolation Levels, Jun 1995, Microsoft Research
    63  - [ABAISO](http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html), D. Abadi, Introduction to Transaction Isolation Levels, blogspot.com, May 2019
    64  - [ABACASER](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html), D. Abadi, Correctness Anomalies Under Serializable Isolation, blogspot.com, June 2019
    65  - [COSMOS](https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels), Microsoft, Consistency levels in Azure Cosmos DB, 2022, microsoft.com
    66  - [JEPSEN](https://jepsen.io/consistency), Consistency Models, jepsen.io, 2022
    67  
    68  ## ANSI SQL-92
    69  
    70  Source: [[ANSISQL92]](https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), ANSI X3.135-1992, American National Standard for Information Systems — Database Language — SQL, November, 1992.
    71  
    72  The isolation level specifies the kind of phenomena that can occur during the execution of concurrent SQL-transactions. The following phenomena are possible:
    73  ```
    74           1) P1 ("Dirty read"): SQL-transaction T1 modifies a row. SQL-
    75              transaction T2 then reads that row before T1 performs a COMMIT.
    76              If T1 then performs a ROLLBACK, T2 will have read a row that was
    77              never committed and that may thus be considered to have never
    78              existed.
    79  
    80           2) P2 ("Non-repeatable read"): SQL-transaction T1 reads a row. SQL-
    81              transaction T2 then modifies or deletes that row and performs
    82              a COMMIT. If T1 then attempts to reread the row, it may receive
    83              the modified value or discover that the row has been deleted.
    84  
    85           3) P3 ("Phantom"): SQL-transaction T1 reads the set of rows N
    86              that satisfy some <search condition>. SQL-transaction T2 then
    87              executes SQL-statements that generate one or more rows that
    88              satisfy the <search condition> used by SQL-transaction T1. If
    89              SQL-transaction T1 then repeats the initial read with the same
    90              <search condition>, it obtains a different collection of rows.
    91  ```
    92  
    93  Table 9, "SQL-transaction isolation levels and the three phenomena" specifies the phenomena that are possible and not possible for a given isolation level.
    94  ```
    95           _Level__________________P1_________P2_________P3_______________________
    96          | READ UNCOMMITTED     | Possible | Possible | Possible                 |
    97          |                      |          |          |                          |
    98          | READ COMMITTED       | Not      | Possible | Possible                 |
    99          |                      | Possible |          |                          |  
   100          |                      |          |          |                          |  
   101          | REPEATABLE READ      | Not      | Not      | Possible                 |
   102          |                      | Possible | Possible |                          |
   103          |                      |          |          |                          |
   104          | SERIALIZABLE         | Not      | Not      | Not Possible             |
   105          |                      | Possible | Possible |                          |
   106  ```
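
Table 9 can also be read as a lookup structure. The following is a minimal Python sketch of that reading; the names `ANSI_ALLOWED`, `allows` and `weakest_level_preventing` are illustrative assumptions and are not part of the standard or of Heeus.

```python
# Table 9 of ANSI SQL-92 as data: the phenomena each isolation level still allows.
# Levels are listed from weakest to strongest (dicts keep insertion order).
ANSI_ALLOWED = {
    "READ UNCOMMITTED": {"P1", "P2", "P3"},
    "READ COMMITTED": {"P2", "P3"},
    "REPEATABLE READ": {"P3"},
    "SERIALIZABLE": set(),
}


def allows(level, phenomenon):
    """Return True if the phenomenon is possible at the given isolation level."""
    return phenomenon in ANSI_ALLOWED[level]


def weakest_level_preventing(phenomena):
    """Return the weakest level that allows none of the given phenomena."""
    return next(
        level
        for level, allowed in ANSI_ALLOWED.items()
        if allowed.isdisjoint(phenomena)
    )
```

For example, `weakest_level_preventing({"P2"})` walks the table top-down and stops at REPEATABLE READ, the first row where P2 is "Not Possible".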
   107  
   108  > - The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable
   109  > - A **serializable execution** is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions
   110  
   111  (!!!) It is curious that the standard claims:
   112  ```
   113           Significant new features are:
   114           ...
   115           15)Support for transaction consistency levels
   116           ...
   117  ```
   118  
   119  But "consistency levels" are not mentioned anywhere further in the standard.
   120  
   121  ## A Critique of ANSI SQL Isolation Levels
   122  
   123  Source: [[ANSISQLCRIT](https://arxiv.org/ftp/cs/papers/0701/0701157.pdf)], A Critique of ANSI SQL Isolation Levels, Jun 1995, Microsoft Research
   124  - Research paper by authors from Microsoft, Sybase and UMass (University of Massachusetts Boston)
   125  
   126  **Phenomena** and **anomalies**:
   127  
   128  > The concept of a phenomenon is not explicitly defined in the ANSI specifications, but the specifications suggest that phenomena are action subsequences that may lead to anomalous (perhaps nonserializable) behavior. 
   129  
   130  Phenomena/anomalies which are prevented by isolation are shown using notation described in [ANSISQLCRIT]:
   131  
   132  > Histories consisting of reads, writes, commits, and aborts can be written in a shorthand notation: “w1[x]” means a write by transaction 1 on data item x (which is how a data item is “modified”), and “r2[x]” represents a read of x by transaction 2. Transaction 1 reading and writing a set of records satisfying predicate P is denoted by r1[P] and w1[P] respectively. Transaction 1’s commit and abort (ROLLBACK) are written “c1” and “a1”, respectively. 
   133  
   134  The strict interpretations of the phenomena defined by [ANSISQL92] are called "anomalies":
   135  
   136  - **A1**: w1[x]...r2[x]...(a1 and c2 in either order)
   137  - **A2**: r1[x]...w2[x]...c2...r1[x]...c1
   138  - **A3**: r1[P]...w2[y in P]...c2...r1[P]...c1
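
To experiment with such histories it helps to parse the shorthand into structured operations. Below is a minimal Python sketch of a parser; the `Op` type and `parse_history` function are hypothetical helpers, and the sketch handles only concrete histories (operations separated by `...`), not alternatives such as "(a1 and c2 in either order)".

```python
import re
from typing import List, NamedTuple, Optional


class Op(NamedTuple):
    kind: str             # "r" (read), "w" (write), "c" (commit) or "a" (abort)
    txn: int              # transaction number
    item: Optional[str]   # data item or predicate; None for commit/abort
    value: Optional[str]  # text after "=", if the history gives a value


# Matches e.g. "w1[x]", "r2[x=50]", "w2[y in P]", "c1", "a2".
_OP = re.compile(r"(r|w|c|a)(\d+)(?:\[([^\]=]+)(?:=([^\]]*))?\])?")


def parse_history(history: str) -> List[Op]:
    """Parse a history such as 'w1[x]...r2[x]...c2' into a list of Op."""
    ops = []
    for token in history.split("..."):
        match = _OP.fullmatch(token.strip())
        if match is None:
            raise ValueError(f"cannot parse {token!r}")
        kind, txn, item, value = match.groups()
        ops.append(Op(kind, int(txn), item, value))
    return ops
```

For instance, `parse_history("r1[x]...w2[x]...c2...r1[x]...c1")` yields the five operations of A2 in order, which makes it straightforward to check a history against a pattern programmatically.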
   139  
   140  The paper suggests broad interpretations of these anomalies (P1-P3), adds extra phenomena, and argues that this is "a must":
   141  
   142  - P0 (Dirty Write)
   143  - P1 (Dirty Read)
   144  - P2 (Fuzzy or Non-Repeatable Read)
   145  - P3 (Phantom)
   146  - P4 (Lost Update)
   147  - P4C (Cursor Lost Update)
   148  - A5A (Read Skew)
   149  - A5B (Write Skew)
   150  
   151  ### P0 (Dirty Write)
   152  
   153  > w1[x]...w2[x]...(c1 or a1)
   154  
   155  > One reason why Dirty Writes are bad is that they can violate database consistency. For example consider the history w1[x]...w2[x]...w2[y]...c2...w1[y]...c1. T1's changes to y and T2's
   156  to x both “survive”. If T1 writes 1 in both x and y while T2 writes 2, the result will be x=2, y=1 violating x = y. 
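
The history from the quote can be replayed in a few lines. This is a minimal Python sketch that assumes no locking at all, so each write simply overwrites the previous value; `replay_writes` is an illustrative helper, not Heeus code.

```python
def replay_writes(history):
    """Apply writes in history order; a later write overwrites an earlier one."""
    state = {}
    for _txn, item, value in history:
        state[item] = value
    return state


# w1[x]...w2[x]...w2[y]...c2...w1[y]...c1, where T1 writes 1 and T2 writes 2.
# Each tuple is (transaction, item, value); commits do not change the state here.
history = [(1, "x", 1), (2, "x", 2), (2, "y", 2), (1, "y", 1)]
final = replay_writes(history)
```

The resulting state holds T2's value for x and T1's value for y, so the invariant x = y is broken even though both transactions committed.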
   157  
   158  
   159  ### P1 (Dirty Read)
   160  
   161  > A1: w1[x]...r2[x]...(a1 and c2 in either order)
   162  
   163  > P1: w1[x]...r2[x]...(c1 or a1)
   164  
   165  > Consider history H1, involving a $40 transfer between bank balance rows x and y, [x=50, y=50]: 
   166  
   167  > H1: r1[x=50]...w1[x=10]...r2[x=10]...r2[y=50]...c2...r1[y=50]...w1[y=90]...c1 
   168  
   169  - T2 gets an incorrect balance of 60, which never existed: [x=50, y=50]...w1[x=10]...r2[x=10]...r2[y=50]
   170  - History is non-serializable (interesting claim)
   171  - History does not contain any phenomena from [ANSISQL92]
   172    - A1 requires an abort
   173    - A2 requires a double read
   174    - A3 requires a set (predicate) operation
   175  
   176  ### P2 (Fuzzy or Non-Repeatable Read)
   177  
   178  > A2: r1[x]...w2[x]...c2...r1[x]...c1
   179  
   180  > P2: r1[x]...w2[x]...(c1 or a1)
   181  
   182  > Similar arguments show that P2 should be taken as the ANSI intention rather than A2. A history that discriminates these two interpretations is:
   183  
   184  > H2: r1[x=50]...r2[x=50]...w2[x=10]...r2[y=50]...w2[y=90]...c2...r1[y=90]...c1
   185  
   186  - T1 gets an incorrect balance of 140: [x=50, y=50]...r1[x=50]...w2[y=90]...r1[y=90]
   187  - H2 is non-serializable 
   188  - History does not contain any phenomena from [ANSISQL92]
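
The incorrect balances in H1 and H2 can be checked mechanically by summing the values that a given transaction reads. The sketch below is a Python illustration; `observed_total` is a hypothetical helper that works only on histories with explicit numeric values.

```python
import re

# Matches reads with explicit values, e.g. "r2[x=10]".
_READ = re.compile(r"r(\d+)\[(\w+)=(\d+)\]")


def observed_total(history: str, txn: int) -> int:
    """Sum the values that transaction `txn` reads in the given history."""
    return sum(
        int(value)
        for t, _item, value in _READ.findall(history)
        if int(t) == txn
    )


H1 = "r1[x=50]...w1[x=10]...r2[x=10]...r2[y=50]...c2...r1[y=50]...w1[y=90]...c1"
H2 = "r1[x=50]...r2[x=50]...w2[x=10]...r2[y=50]...w2[y=90]...c2...r1[y=90]...c1"

total_seen_by_t2_in_h1 = observed_total(H1, 2)
total_seen_by_t1_in_h2 = observed_total(H2, 1)
```

The true total is 100 in every committed state, so any other sum shows that the reading transaction observed a state that never existed.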
   189  
   190  ### P3 (Phantom)
   191  
   192  > A3: r1[P]...w2[y in P]...c2...r1[P]...c1
   193  
   194  > P3: r1[P]...w2[y in P]...(c1 or a1)
   195  
   196  - Same rationale as for P2
   197  - w2[y in P]: insert, delete an element which satisfies the P condition, or update an element so that it satisfies the P condition
   198  
   199  ### P4 (Lost Update)
   200  
   201  > r1[x=1]...w2[x=10]...w1[x=1+1]...c1
   202  
   203  > However, forbidding P2 also precludes P4, since w2[x] comes after r1[x] and before T1 commits or aborts. Therefore the anomaly P4 is useful in distinguishing isolation levels intermediate in strength between READ COMMITTED and REPEATABLE READ.
   204  - READ COMMITTED << Cursor Stability << REPEATABLE READ 
   205    - `<<` means "weaker" (prevents fewer anomalies)
   206  - We believe the example is incorrect, since a Dirty Write occurs here; it should be: r1[x=1]...w2[x=10]...c2...w1[x=1+1]...c1
   207    - See also discussion [here](https://stackoverflow.com/questions/72850415/isolation-level-difference-between-dirty-write-and-lost-update)
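
A Lost Update is easy to reproduce once the steps are written out sequentially. The Python sketch below replays the variant of the history in which T2 commits before T1 writes; `guarded_update` adds a compare-before-write check, which is only one illustrative way to detect the conflict and is an assumption of this example, not something prescribed by [ANSISQLCRIT].

```python
class Conflict(Exception):
    """Raised when a value changed between a transaction's read and its write."""


def lost_update():
    """Replay r1[x=1]...w2[x=10]...c2...w1[x=1+1]...c1 without any protection."""
    x = 1
    t1_read = x       # r1[x=1]
    x = 10            # w2[x=10]...c2
    x = t1_read + 1   # w1[x=1+1]...c1: T2's update is silently overwritten
    return x


def guarded_update():
    """Same history, but T1 validates its earlier read before writing."""
    x = 1
    t1_read = x       # r1[x=1]
    x = 10            # w2[x=10]...c2
    if x != t1_read:  # the value T1 based its decision on is no longer current
        raise Conflict("x changed since T1 read it")
    x = t1_read + 1
    return x
```

`lost_update()` returns 2, so the value 10 written by T2 disappears, whereas `guarded_update()` raises `Conflict` and leaves the decision (retry or abort) to the caller.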
   208  
   209  
   210  ### P4C (Cursor Lost Update)
   211  
   212  > rc1[x=1]...w2[x=10]...w1[x=1+1]...c1
   213  
   214  - Prevented by non-ANSI Cursor Stability
   215    - Pros: _short duration lock_ rather than _long duration lock_
   216  - Only value under cursor is protected
   217  - Perhaps should be: rc1[x=1]...w2[x=10]...c2...w1[x=1+1]...c1
   218    - Otherwise Dirty Write occurs
   219  - Note that this is possible: rc1[x]...rc1[y]...w2[x]...c2...w1[y]...c1
   220  
   221  ### A5A (Read Skew)
   222  
   223  > r1[x]...w2[x]...w2[y]...c2...r1[y]...(c1 or a1)
   224  
   225  - Inconsistent pair x,y read by T1
   226  
   227  > Clearly neither A5A nor A5B could arise in histories where P2 is precluded, since both A5A and A5B have T2 write a data item that has been previously read by an uncommitted T1. Thus, phenomena A5A and A5B are only useful for distinguishing isolation levels that are below REPEATABLE READ in strength.
   228  
   229  ### A5B (Write Skew)
   230  
   231  > r1[x]...r2[y]...w1[y]...w2[x]...(c1 and c2 occur)
   232  
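   233  - Inconsistent pair x,y written by T1 and T2
   234  - E.g. 
   235    - constraint x*2 <= y, initial state [x=1, y=4]
   236    - r1[x=1]...r2[y=4]...w1[y=2]...w2[x=2]...c1...c2...[x=2,y=2], which violates the constraint although each write is valid against the value its transaction has read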
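
Write Skew can be illustrated with two transactions that each validate the constraint against their own read and still break it together. The Python sketch below assumes the constraint x*2 <= y and an initial state of x=1, y=4; these concrete values are chosen for the illustration.

```python
def constraint_holds(x: int, y: int) -> bool:
    return x * 2 <= y


def write_skew():
    db = {"x": 1, "y": 4}                 # initial state satisfies x*2 <= y
    t1_x = db["x"]                        # r1[x=1]
    t2_y = db["y"]                        # r2[y=4]
    new_y, new_x = 2, 2
    assert constraint_holds(t1_x, new_y)  # T1 checks against the x it has read
    assert constraint_holds(new_x, t2_y)  # T2 checks against the y it has read
    db["y"] = new_y                       # w1[y=2]
    db["x"] = new_x                       # w2[x=2]
    return db                             # c1...c2


final_state = write_skew()
```

Both local checks pass, yet the committed state x=2, y=2 violates the constraint, which is exactly the A5B pattern r1[x]...r2[y]...w1[y]...w2[x].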
   237  
   238  
   239  ### Isolation Types
   240  
   241  ![](images/concur-anomalies.png)
   242  
   243  Strange things:
   244  
   245  - Perhaps it is wrong that A5B is possible within Cursor Stability
   246  - Note that according to this table Snapshot sometimes allows P3 (Phantom) whereas [ANSISQLCRIT] also claims that:
   247  > Perhaps most remarkable of all, Snapshot Isolation has no phantoms (in the strict sense of the ANSI definitions A3)
   248  - The clue may be that A3 is defined as `r1[P]...w2[y in P]...c2...r1[P]...c1` but the phenomenon to be prevented is (P3):
   249    - r1[P]...w2[y in P]...(c1 or a1)
   250  - But then P2 should also be "Sometimes Possible"?
   251  
   252  
   253  ## Prof. Abadi: Introduction to Transaction Isolation Levels
   254  
   255  Source: [[ABAISO](http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html)], D. Abadi, Introduction to Transaction Isolation Levels, blogspot.com, May 2019
   256  - Daniel Abadi is the Darnell-Kanal Professor of Computer Science at University of Maryland, College Park
   257  - He is best-known for the development of the storage and query execution engines of the C-Store (column-oriented database) prototype, which was commercialized by Vertica and eventually acquired by Hewlett-Packard in 2011, for his HadoopDB research on fault tolerant scalable analytical database systems which was commercialized by Hadapt and acquired by Teradata in 2014, and deterministic, scalable, transactional, distributed systems such as Calvin which is currently being commercialized by Fauna
   258  
   259  > **Database isolation** refers to the ability of a database to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). The overarching goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions.
   260  
   261  > The key point for our purposes is that we are defining **“perfect isolation”** as the ability of a system to run transactions in parallel, but in a way that is equivalent to as if they were running one after the other. In the SQL standard, this perfect isolation level is called **serializability**.
   262  
   263  Notes:
   264  
   265  - Preventing all [ANSISQL92] anomalies is NOT enough for perfect isolation: e.g. Snapshot Isolation prevents all of them but still allows Write Skew
   266  - **Perfect Isolation** does NOT require that a transaction which ran "earlier" in real time also comes "earlier" in the equivalent serial execution
   267  
   268  ## Prof. Abadi: Correctness Anomalies Under Serializable Isolation
   269  
   270  Source: [[ABACASER](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html)], D. Abadi, Correctness Anomalies Under Serializable Isolation, blogspot.com, June 2019
   271  
   272  > In the good old days of having a “database server” which is running on a single physical machine, serializable isolation was indeed sufficient, and database vendors never attempted to sell database software with stronger correctness guarantees than SERIALIZABLE. However, **as distributed and replicated database systems have started** to proliferate in the last few decades, **anomalies and bugs have started to appear** in applications even when running over a database system that guarantees serializable isolation. As a consequence, database system vendors **started to release systems with stronger correctness guarantees than serializable isolation**, which promise a lack of vulnerability to these newer anomalies. In this post, we will discuss several well known **bugs and anomalies in serializable distributed database systems**, and modern correctness guarantees that ensure avoidance of these anomalies. 
   273  
   274  
   275  [YB] gives the following definition of One Copy Serializability (1SR):
   276  
   277  > The **One Copy Serializability** [7] is the highest correctness criterion for replica control protocols... In order to achieve this correctness criterion, it is required that interleaved execution of transactions on replicas be equivalent to serial execution of those transactions on one copy of a database.
   278  
   279  [ABACASER] introduces new anomalies which can be found even in 1SR systems and attributes them as follows:
   280  
   281  > The next few sections describe some forms of time-travel anomalies that occur in distributed and/or replicated systems, and the types of application bugs that they may cause.
   282  
   283  Two of these anomalies, though, may occur even in Heeus CE with storage backed by the bboltdb driver; therefore we do not link these anomalies solely with "distributed and/or replicated systems".
   284  
   285  ### Immortal Write
   286  
   287  > ru: Бессмертная Запись
   288  
   289  Anomaly:
   290  - Real-time: w1[x=Daniel]...c1...w2[x=Danny]...c2...w3[x=Danger]...c3
   291  - Serial order: w1[x=Daniel]...w3[x=Danger]...w2[x=Danny]
   292    - w3 goes back in time (time-travel, анахронизм, anachronism)
   293  
   294  Notes:
   295  - Can be caused by async replication AND Unsynchronized Clock problem
   296  - System can decide to do that due to other reasons, since this does not violate serializability guarantee
   297  - When the “Danny” transaction and/or the other name-change transactions also perform a read to the database as part of the same transaction as the write to the name, the ability to time-travel without violating serializability becomes much more difficult. But for **“blind write” transactions** such as these examples, time-travel is **easy to accomplish**.
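
One way an Immortal Write can arise is last-writer-wins ordering by timestamps taken from unsynchronized clocks. The Python sketch below is a simplified model under that assumption; the timestamps are invented for the illustration.

```python
def last_writer_wins(writes):
    """Return the value with the highest timestamp (the last write in serial order)."""
    return max(writes, key=lambda w: w["ts"])["value"]


# Real-time order is Daniel, Danny, Danger. The node handling the third write
# has a lagging clock, so it assigns a timestamp lower than the second write's.
writes = [
    {"value": "Daniel", "ts": 100},
    {"value": "Danny", "ts": 200},
    {"value": "Danger", "ts": 150},  # issued last in real time
]

winner = last_writer_wins(writes)
serial_order = [w["value"] for w in sorted(writes, key=lambda w: w["ts"])]
```

The system orders the writes as Daniel, Danger, Danny, so "Danny" survives although "Danger" was written after it in real time. The result is still serializable because blind writes can be placed anywhere in the serial order.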
   298  
   299  ### Stale Read
   300  
   301  > ru: Несвежее Чтение
   302  
   303  Anomaly:
   304  - Real-time: w1[x=50]...c1...w2[x=0]...c2...r3[x=50]...c3
   305  - Serial order: w1[x=50]...r3[x=50]...w2[x=0]
   306    - r3 goes back in time (time-travel)
   307  
   308  Reasons:
   309  1. Async replication (distributed)
   310  2. Unsynchronized Clock problem (distributed)
   311  3. Projection update delay (single node)
   312  4. System can decide to do that due to other reasons
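
Reason 3 (projection update delay) is the one most relevant to a CQRS system on a single node. The following Python sketch models it with a write log and a read model that is updated by a separate step; the `Store` class and its methods are hypothetical and do not reflect the actual Heeus implementation.

```python
class Store:
    """Write model (a log of committed writes) plus a separately updated read model."""

    def __init__(self):
        self.log = []         # committed writes, in commit order
        self.projection = {}  # read model served to queries
        self.applied = 0      # number of log entries already projected

    def write(self, key, value):
        """Commit a write; the read model is NOT updated here."""
        self.log.append((key, value))

    def project(self):
        """Apply pending log entries to the read model (asynchronous in practice)."""
        for key, value in self.log[self.applied:]:
            self.projection[key] = value
        self.applied = len(self.log)

    def read(self, key):
        return self.projection.get(key)


store = Store()
store.write("x", 50)     # w1[x=50]...c1
store.project()
store.write("x", 0)      # w2[x=0]...c2
stale = store.read("x")  # r3 runs before the projector catches up
store.project()
fresh = store.read("x")  # the same read after the projection is updated
```

Here `stale` is 50 although the write of 0 has already been committed, while `fresh` is 0. No replication or clock skew is involved.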
   313  
   314  ### Causal Reverse
   315  
   316  > ru: Обратная Причинность, "реверс козла"
   317  
   318  Anomaly (1000000 is moved from x to y):
   319  - Real-time: [x=1000000, y=0]...r1[x, y]...w2[x=0]...c2...w3[y=1000000]...c3...c1
   320  - Serial order: w3[y=1000000]...r1[x=1000000, y=1000000]...w2[x=0]
   321  
   322  "Real-life" scenario:
   323    - User has 1000000 in accountx and 0 in accounty
   324    - User withdraws 1000000 in cash from accountx
   325    - User deposits 1000000 in cash into accounty
   326  
   327  One example of a distributed database system that allows the causal reverse is CockroachDB (aka CRDB):
   328  - CockroachDB partitions a database such that each partition commits writes and synchronously replicates data separately from other partitions
   329  - Each write receives a timestamp based on the local clock on one of the servers within that partition
   330  - In general, it is impossible to perfectly synchronize clocks across a large number of machines, so CockroachDB allows a maximum clock skew for which clocks across a deployment can differ
   331  - It is possible in CockroachDB for a transaction to commit, and a later transaction to come along (that writes data to a different partition), that was caused by the earlier one (that started after the earlier one finished), and still receive an earlier timestamp than the earlier transaction.
   332  - This enables a read (**in CockroachDB’s case, this read has to be sent to the system before the two write transactions**) to potentially see the write of the later transaction, but not the earlier one
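
The mechanism described above can be modelled with per-partition timestamps and a snapshot read at a fixed timestamp. The Python sketch below is a deliberately simplified model, not CockroachDB's actual algorithm; the partition names, timestamps and the `snapshot_read` helper are assumptions made for the illustration.

```python
def snapshot_read(partitions, read_ts):
    """Return, per key, the latest value whose commit timestamp is <= read_ts."""
    result = {}
    for versions in partitions.values():
        for key, history in versions.items():
            visible = [(ts, value) for ts, value in history if ts <= read_ts]
            if visible:
                result[key] = max(visible)[1]  # value with the highest timestamp
    return result


# Initial state, written at timestamp 0: x=1000000 on partition A, y=0 on B.
partitions = {
    "A": {"x": [(0, 1_000_000)]},
    "B": {"y": [(0, 0)]},
}
# w2[x=0] commits first in real time; partition A's clock stamps it 100.
partitions["A"]["x"].append((100, 0))
# w3[y=1000000] commits later in real time, but partition B's clock lags: stamp 90.
partitions["B"]["y"].append((90, 1_000_000))

seen = snapshot_read(partitions, read_ts=95)
```

A read at timestamp 95 sees the later deposit to y but not the earlier withdrawal from x, so it observes 1000000 on both accounts.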
   333  
   334  ### Preventing Serial Anomalies
   335  
   336  [ABACASER] gives the following classification of serializable systems.
   337  
   338  > In distributed and replicated database systems, this additional guarantee of “no time travel” on top of the other serializability guarantees is non-trivial, but has nonetheless been accomplished by several systems such as FaunaDB/Calvin, FoundationDB, and Spanner. This high level of correctness is called **strict serializability**
   339  
   340  **Strong Session Serializable** systems guarantee strict serializability of transactions within the same session, but otherwise only one-copy serializability
   341  - Implementation example: "Sticky session", all requests routed to the same node
   342  
   343  **Strong Write Serializable** systems guarantee strict serializability for all transactions that insert or update data, but only one-copy serializability for read-only transactions
   344  - Implementation example: Read-only replica systems where all update transactions go to the master replica which processes them with strict serializability
   345  
   346  **Strong Partition Serializable** systems guarantee strict serializability only on a per-partition basis
   347  - Data is divided into a number of disjoint partitions
   348  - Within each partition, transactions that access data within that partition are guaranteed to be strictly serializable
   349  - (!!!) But otherwise, the system only guarantees one-copy serializability
   350  
   351  |System Guarantee|Dirty read|Non-repeatable read|Phantom Read|Write Skew|Immortal write|Stale read|Causal reverse|
   352  |--- |--- |--- |--- |--- |--- |--- |--- |
   353  |READ UNCOMMITTED|Possible|Possible|Possible|Possible|Possible|Possible|Possible|
   354  |READ COMMITTED|-|Possible|Possible|Possible|Possible|Possible|Possible|
   355  |REPEATABLE READ|-|-|Possible|Possible|Possible|Possible|Possible|
   356  |SNAPSHOT ISOLATION|-|-|-|Possible|Possible|Possible|Possible|
   357  |SERIALIZABLE / ONE COPY SERIALIZABLE / STRONG SESSION SERIALIZABLE|-|-|-|-|Possible|Possible|Possible|
   358  |STRONG WRITE SERIALIZABLE|-|-|-|-|-|**Possible**|-|
   359  |STRONG PARTITION SERIALIZABLE|-|-|-|-|-|-|**Possible**|
   360  |STRICT SERIALIZABLE|-|-|-|-|-|-|-|
   361  
   362  N.B.:
   363  - Almost a "triangular matrix", except for the STRONG WRITE and STRONG PARTITION SERIALIZABLE rows
   364  - The Read Skew phenomenon is excluded for some reason
   365  
   366  ## Consistency levels in Azure Cosmos DB
   367  
   368  Source: [[COSMOS](https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels)], Microsoft, Consistency levels in Azure Cosmos DB, 2022, microsoft.com
   369  - [Уровни согласованности в Azure Cosmos DB](https://learn.microsoft.com/ru-ru/azure/cosmos-db/consistency-levels)
   370  - (!!!) https://github.com/MicrosoftDocs/azure-docs.ru-ru/blob/live/articles/cosmos-db/consistency-levels.md
   371  
   372  Consistency levels:
   373  
   374  - Strong (Сильная)
   375  - Bounded staleness ([baʊndɪd ˈsteɪlnəs], Запаздывание, Ограниченное устаревание, Ограниченная несвежесть)
   376  - Session (Сеанс, Сессионная)
   377  - Consistent prefix ([kənˈsɪstənt ˈpriːfɪks], Согласованный префикс)
   378  - Eventual (Итоговая, Светлое будущее)
   379  
   380  ### Strong Consistency
   381  
   382  - The reads are guaranteed to return the most recent committed version of an item
   383  - A client never sees an uncommitted or partial write.
   384  
   385  ![](images/strong-consistency.gif)
   386  
   387  ### Bounded Staleness Consistency
   388  
   389  It is a kind of "eventual consistency" with limited "eventuality".
   390  
   391  - The reads might lag behind writes by at most "K" versions (that is, "updates") of an item or by "T" time interval, whichever is reached first
   392  - Bounded staleness offers total global order outside of the "staleness window"
   393  - Bounded staleness is frequently chosen by globally distributed applications that expect low write latencies but require total global order guarantee. 
   394  - When a client performs read operations within a region that accepts writes, the guarantees provided by bounded staleness consistency are identical to those of strong consistency
   395  
   396  ![](images/bounded-staleness-consistency.gif)
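
The "K versions or T time interval" rule can be written as a simple predicate. The Python sketch below reflects one reading of the rule, namely that a replica is within bounds only while it is under both limits; the function and parameter names are illustrative and are not part of the Azure Cosmos DB API.

```python
def within_staleness_bound(lag_versions, lag_seconds, max_versions, max_seconds):
    """Return True if a replica with the given lag is still inside the staleness window.

    Assumption of this sketch: the bound is hit as soon as EITHER limit is
    exceeded ("whichever is reached first"), so both conditions must hold.
    """
    return lag_versions <= max_versions and lag_seconds <= max_seconds
```

With K=10 and T=5 seconds, a replica that is 5 versions and 1 second behind is inside the window, while a replica that is 11 versions behind is outside it regardless of the time lag.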
   397  
   398  ### Session Consistency
   399  
   400  In session consistency, within a single client session reads are guaranteed to honor:
   401  
   402  - consistent-prefix
   403  - monotonic reads 
   404  - monotonic writes
   405  - read-your-writes
   406  - write-follows-reads
   407  - ref. [JEPSEN](https://jepsen.io/consistency) for some definitions
   408  
   409  This assumes a single "writer" session or sharing the session token for multiple writers
   410  
   411  ![](images/session-consistency.gif)
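
Session guarantees such as read-your-writes and monotonic reads are commonly implemented with a session token that records how far the session has seen. The Python sketch below shows the idea with a plain version counter; the `Replica` and `Session` classes are hypothetical, and the real Cosmos DB session token has a different format.

```python
class Replica:
    """A copy of the data together with the version it has caught up to."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def apply(self, version, key, value):
        self.data[key] = value
        self.version = version


class Session:
    """Client session that remembers the highest version it has written or read."""

    def __init__(self):
        self.token = 0

    def write(self, primary, key, value):
        primary.apply(primary.version + 1, key, value)
        self.token = max(self.token, primary.version)

    def read(self, replica, key):
        if replica.version < self.token:
            # Serving this read would break read-your-writes or monotonic reads.
            raise RuntimeError("replica is behind the session token")
        self.token = max(self.token, replica.version)
        return replica.data.get(key)


primary, lagging = Replica(), Replica()
session = Session()
session.write(primary, "order", "paid")
```

After the write, `session.read(primary, "order")` returns "paid", while `session.read(lagging, "order")` raises until the lagging replica has applied version 1. Sharing the token between several writers extends the same guarantee to them.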
   412  
   413  ### Consistent Prefix Consistency
   414  
   415  > This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
   416  > 
   417  > [[CLEPP]  Martin Kleppmann, Designing Data-Intensive Applications](https://ebrary.net/64710/computer_science/consistent_prefix_reads)
   418  
   419  ![](images/consistent-prefix.gif)
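
The guarantee can be stated as a check on what a reader has observed: the observed writes must form a prefix of the global write order. A minimal Python sketch follows; `is_consistent_prefix` is an illustrative helper.

```python
def is_consistent_prefix(write_order, observed):
    """Return True if `observed` equals the first len(observed) writes, in order."""
    return list(observed) == list(write_order[:len(observed)])


writes = ["A", "B", "C"]
```

A reader may see nothing, `["A"]`, `["A", "B"]` or all three writes, but never `["A", "C"]` (a gap) or `["B", "A"]` (a reordering).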
   420  
   421  
   422  ### Eventual Consistency
   423  
   424  ![](images/eventual-consistency.gif)
   425  
   426  ## jepsen.io: Consistency Models
   427  
   428  Source: [[JEPSEN](https://jepsen.io/consistency)], Consistency Models, jepsen.io, 2022
   429  - > Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc. We maintain an open source software library for systems testing, as well as blog posts and conference talks exploring particular systems’ failure modes
   430  - > [2020-12-23](https://jepsen.io/analyses/scylla-4.2-rc3): Together with the ScyllaDB team, we found seven problems in Scylla, including lightweight transaction (LWT) split-brain in healthy clusters due to a.) incomplete row hashcodes and b.) multiple problems with membership changes. We also identified incomplete or inaccurate documentation, including claims that non-LWT operations were isolated and atomic, and undocumented rules about what kinds of membership operations were legal. Scylla has corrected almost all of these errors via patches and documentation; remaining cases of split-brain appear limited to concurrent membership changes.
   431  
   432  
   433  
   434  [![Consistency Models](images/consistency-models.png)](https://jepsen.io/consistency)
   435  
   436  On the left:
   437  
   438  - > `x` is a transactional model: operations (usually termed “transactions”) can involve several primitive sub-operations performed in order
   439  - > It is also a multi-object property: operations can act on multiple objects in the system
   440  - > `x` does not impose any real-time, or even per-process constraints. If process A completes write w, then process B begins a read r, r is not necessarily guaranteed to observe w. 
   441  
   442  On the right:
   443  
   444  - > `x` is a single-object model, but the scope of “an object” varies. Some systems provide linearizability on individual keys in a key-value store; others might provide linearizable operations on multiple keys in a table, or multiple tables in a database—but not between different tables or databases, respectively
   445  - `x` imposes some real-time constraints
   446  
   447  We can come up with definitions:
   448  - **Isolation**: preventing anomalies of concurrent transaction execution
   449  - **Consistency**: preventing anomalies of serial execution
   450    - Remember that serializability only requires equivalence to *some* serial execution
   451  
   452  Surprisingly (isolation):
   453  
   454  - Even Read Uncommitted **prohibits dirty writes**, yet it **does not impose any real-time constraints**, can be totally available and allows "pathological orderings"; the corresponding quotes are given in the Read Uncommitted section below
   460  - Read Committed may see only some of the effects of a previously committed transaction (this is what Monotonic Atomic View prevents)
   461  
   462  
   463  ### Read Uncommitted
   464  
   465  - > Read uncommitted is a consistency model which **prohibits dirty writes**, where two transactions modify the same object concurrently before committing
   466  - > P0 (Dirty Write): w1(x)...w2(x)
   467  - > (!!!) Read uncommitted can be totally available: in the presence of network partitions, every node can make progress
   468  - > (!!!) Note that read uncommitted **does not impose any real-time constraints**. If process A completes write `w`, then process B begins a read `r`, `r` is not necessarily guaranteed to observe `w`
   469  - > (!!!) In fact, a process can fail to observe its own prior writes, if those writes occurred in different transactions.
   470  - > (!!!) Like serializability, read uncommitted allows pathological orderings. For instance, a read uncommitted database can always return the empty state for any reads, by appearing to execute those reads at time 0. It can also discard write-only transactions by reordering them to execute at the very end of the history, after any reads. Operations like increments can also be discarded, assuming the result of the increment is never observed. Luckily, most implementations don’t seem to take advantage of these optimization opportunities.
   471  - ??? Why does a transaction see its own writes?
   472  
   473  ### Read Committed
   474  
   475  - > Read committed is a consistency model which strengthens read uncommitted by preventing dirty reads
   476  - > P1 (Dirty Read): w1(x)...r2(x)
   477  - > Read committed can be totally available
   478  - > Note that read committed **does not impose any real-time constraints**...
   479  
   480  ### Monotonic Atomic View
   481  
   482  - > Monotonic atomic view is a consistency model which strengthens read committed by preventing transactions from observing some, but not all, of a previously committed transaction’s effects. Once a write from transaction T1 is observed by transaction T2, then all effects of T1 should be visible to T2. 
   483  - > Monotonic atomic view can be totally available
   484  - > However, it **does not impose any real-time, or even per-process constraints**...
   485  
   486  ### Cursor Stability
   487  
   488  - > Cursor stability is a consistency model which strengthens read committed by preventing lost updates. It introduces the concept of a cursor, which refers to a particular object being accessed by a transaction. Transactions may have multiple cursors. When a transaction reads an object using a cursor, that object cannot be modified by any other transaction until the cursor is released, or the transaction commits
   489  - rc1[x=1]...w2[x=10]...w1[x=1+1]...c1 ([ANSISQLCRIT])
   490  - > Cursor stability cannot be totally available; in the presence of network partitions, some or all nodes may be unable to make progress.
   491  - > However, it **does not impose any real-time, or even per-process constraints**... 
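
The lost-update history above can be replayed on a toy store. This is only an illustrative sketch: `CursorStore`, `read_cursor` and `run_history` are hypothetical names, and the store refuses a conflicting write where a real system would block it until the cursor is released.

```python
class CursorStore:
    """Toy single-node store; `use_cursors` switches cursor locking on."""

    def __init__(self, use_cursors):
        self.data = {"x": 1}
        self.use_cursors = use_cursors
        self.cursor_owner = {}  # item -> transaction holding a cursor on it

    def read_cursor(self, txn, item):
        # Reading through a cursor protects the item until the owner commits.
        if self.use_cursors:
            self.cursor_owner[item] = txn
        return self.data[item]

    def write(self, txn, item, value):
        owner = self.cursor_owner.get(item)
        if owner is not None and owner != txn:
            return False  # assumption: refuse instead of blocking
        self.data[item] = value
        return True

    def commit(self, txn):
        # Release every cursor held by the committing transaction.
        for item in [i for i, o in self.cursor_owner.items() if o == txn]:
            del self.cursor_owner[item]


def run_history(use_cursors):
    """Replay rc1[x=1]...w2[x=10]...w1[x=1+1]...c1."""
    store = CursorStore(use_cursors)
    seen = store.read_cursor(1, "x")      # rc1[x=1]
    t2_applied = store.write(2, "x", 10)  # w2[x=10]
    store.write(1, "x", seen + 1)         # w1[x=1+1]
    store.commit(1)                       # c1
    return store.data["x"], t2_applied
```

Without cursors, T2's write is applied and then silently overwritten by T1 (the lost update). With cursors, T2's write is refused while T1 holds the cursor, so T2 has to retry after `c1` and nothing is lost.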
   492  
   493  ### Repeatable Read
   494  
   495  - > Repeatable read is closely related to serializability, but unlike serializable, it allows phantoms: if a transaction T1 reads a predicate, like "the set of all people with the name “Dikembe”", then another transaction T2 may create or modify a person with the name “Dikembe” before T1 commits. Individual objects are stable once read, but the predicate itself may not be
   496  - > P2 (Fuzzy Read): r1(x)...w2(x)
   497  - > Repeatable read cannot be totally available
   498  - > However, it **does not impose any real-time, or even per-process constraints**... 
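
The patterns P0, P1 and P2 quoted in the sections above can be detected mechanically in a history. The sketch below is a simplified illustration, not a full isolation checker: it assumes histories written as space-separated tokens such as `w1(x)`, `r2(x)`, `c1` (commit) and `a1` (abort), and it reports a phenomenon only when the second operation happens before the first transaction finishes. The names `parse` and `phenomena` are hypothetical.

```python
import re

# (kind of first operation, kind of second operation) -> phenomenon
PATTERNS = {
    ("w", "w"): "P0 dirty write",
    ("w", "r"): "P1 dirty read",
    ("r", "w"): "P2 fuzzy read",
}

OP = re.compile(r"([rwca])(\d+)(?:\((\w+)\))?")


def parse(history):
    """Turn 'w1(x) r2(x) c1' into [('w', 1, 'x'), ('r', 2, 'x'), ('c', 1, None)]."""
    ops = []
    for token in history.split():
        match = OP.fullmatch(token)
        if match is None:
            raise ValueError(f"bad operation: {token}")
        kind, txn, item = match.groups()
        ops.append((kind, int(txn), item))
    return ops


def phenomena(history):
    """Return the set of phenomena (P0, P1, P2) that occur in the history."""
    ops = parse(history)
    found = set()
    for i, (kind1, txn1, item1) in enumerate(ops):
        if kind1 not in ("r", "w"):
            continue
        for kind2, txn2, item2 in ops[i + 1:]:
            if txn2 == txn1 and kind2 in ("c", "a"):
                break  # the first transaction finished: no overlap after this
            if txn2 != txn1 and item2 == item1:
                name = PATTERNS.get((kind1, kind2))
                if name is not None:
                    found.add(name)
    return found
```

For example, `phenomena("w1(x) r2(x) c1 c2")` reports a dirty read, while `phenomena("w1(x) c1 r2(x) c2")` reports nothing because T1 commits before T2 reads.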
   499  
   500  ### Snapshot Isolation
   501  
   502  - > It does not impose any real-time constraints. If process A completes write w, then process B begins a read r, r is not necessarily guaranteed to observe w. 
   503  - > Unlike serializability, which enforces a total order of transactions, snapshot isolation only forces a partial order: sub-operations in one transaction may interleave with those from other transactions.
   504  - > The most notable phenomena allowed by snapshot isolation are write skews...
   505    - r1[x]...r2[y]...w1[y(x)]...w2[x(y)]...(c1 and c2 occur) ([ANSISQLCRIT])
   506  - > ...and a read-only transaction anomaly, involving partially disjoint write sets
   507  - Like read committed, snapshot isolation **does not impose any real-time constraints**
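
Write skew can be reproduced with a very small snapshot store. This is a sketch under stated assumptions: `SnapshotStore`, `withdraw` and `write_skew` are hypothetical names, conflicts are detected with a first-committer-wins check on written items only, and the invariant `x + y >= 0` is an example chosen for illustration.

```python
class SnapshotStore:
    """Toy snapshot isolation: reads come from a snapshot taken at begin()."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}  # commit time of last write
        self.clock = 0

    def begin(self):
        return {"start": self.clock, "snapshot": dict(self.data), "writes": {}}

    def commit(self, txn):
        # First-committer-wins: abort if a written item changed since begin().
        for item in txn["writes"]:
            if self.version[item] > txn["start"]:
                return False
        self.clock += 1
        for item, value in txn["writes"].items():
            self.data[item] = value
            self.version[item] = self.clock
        return True


def withdraw(txn, item, amount):
    """Withdraw only if the snapshot says x + y stays non-negative."""
    snap = txn["snapshot"]
    if snap["x"] + snap["y"] - amount >= 0:
        txn["writes"][item] = snap[item] - amount


def write_skew():
    store = SnapshotStore({"x": 50, "y": 50})
    t1, t2 = store.begin(), store.begin()
    withdraw(t1, "y", 90)  # T1 reads x and y, writes y
    withdraw(t2, "x", 90)  # T2 reads x and y, writes x
    return store.commit(t1), store.commit(t2), store.data
```

Both transactions commit because their write sets are disjoint, yet the final state `x = -40, y = -40` violates the invariant that each of them checked against its own snapshot.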
   508  
   509  **Read-only transaction anomaly**
   510  -  H3: **R2(X0,0) R2(Y0,0)** R1(Y0,0) W1(Y1,20) C1 _R3(X0,0) R3(Y1,20) C3_ **W2(X2,-11) C2** [[FOO](https://www.cs.umb.edu/~poneil/ROAnom.pdf)]
   511    - Final: Y = 20 and X = -11
   512    - Two accounts X, Y
   513    - T2 withdraws 10 from X; a penalty of 1 is charged for the overdraft because X+Y goes negative
   514    - T1 adds 20 to Y
   515    - The final state Y = 20 and X = -11 matches the serial order [T2, T1]
   516    - However, T3 reads Y = 20 and X = 0, a state that never exists in the serial order [T2, T1]
   517  - The anomaly is that the read-only transaction T3 prints X = 0 and Y = 20, while the final values are Y = 20 and X = -11
   518  - See also [stackoverflow](https://stackoverflow.com/questions/68697789/read-only-transaction-anomaly), [johann.schleier-smith.com](https://johann.schleier-smith.com/blog/2016/01/06/analyzing-a-read-only-transaction-anomaly-under-snapshot-isolation.html), [muratbuffalo.blogspot.com](http://muratbuffalo.blogspot.com/2021/12/a-read-only-transaction-anomaly-under.html)
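
The anomaly can be checked by brute force: enumerate every serial order of T1, T2 and T3, and compare the results with what history H3 produces under snapshot isolation. The sketch below hard-codes the account logic described above; the function names are hypothetical, and the snapshot execution is replayed by hand rather than by a real engine.

```python
from itertools import permutations

INITIAL = {"X": 0, "Y": 0}


def t1(state):
    """T1 adds 20 to Y."""
    return {**state, "Y": state["Y"] + 20}


def t2(state):
    """T2 withdraws 10 from X, plus a penalty of 1 if X+Y goes negative."""
    x = state["X"] - 10
    if x + state["Y"] < 0:
        x -= 1
    return {**state, "X": x}


def serial_outcomes():
    """All (what T3 reads, final state) pairs reachable by serial execution."""
    outcomes = set()
    for order in permutations(["T1", "T2", "T3"]):
        state, t3_read = dict(INITIAL), None
        for name in order:
            if name == "T1":
                state = t1(state)
            elif name == "T2":
                state = t2(state)
            else:
                t3_read = (state["X"], state["Y"])
        outcomes.add((t3_read, (state["X"], state["Y"])))
    return outcomes


def snapshot_h3():
    """Replay H3 step by step; each transaction reads its own snapshot."""
    committed = dict(INITIAL)
    t2_snapshot = dict(committed)               # R2(X0,0) R2(Y0,0)
    committed["Y"] = t1(committed)["Y"]         # R1(Y0,0) W1(Y1,20) C1
    t3_read = (committed["X"], committed["Y"])  # R3(X0,0) R3(Y1,20) C3
    committed["X"] = t2(t2_snapshot)["X"]       # W2(X2,-11) C2
    return t3_read, (committed["X"], committed["Y"])
```

`snapshot_h3()` yields a T3 read of `(0, 20)` together with the final state `(-11, 20)`, and that pair is not in `serial_outcomes()`. Without T3 the history is serializable as [T2, T1]; it is the read-only transaction that exposes the anomaly.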
   519  
   520  ### Serializable
   521  
   522  - > Informally, serializability means that transactions appear to have occurred in some total order.
   523  - > Serializability does not require a per-process order between transactions. A process can observe a write, then fail to observe that same write in a subsequent transaction. In fact, a process can fail to observe its own prior writes, if those writes occurred in different transactions.
   524  - > However, it **does not impose any real-time, or even per-process constraints**...
   525  
   526  ### Strict Serializability
   527  
   528  - > Informally, strict serializability (a.k.a. PL-SS, Strict 1SR, Strong 1SR) means that operations appear to have occurred in some order, consistent with the real-time ordering of those operations; e.g. if operation A completes before operation B begins, then A should appear to precede B in the serialization order
   529  - > You can think of strict serializability as serializability’s total order of transactional multi-object operations, plus linearizability’s real-time constraints
   530  - > Alternatively, you can think of a strict serializable database as a linearizable object in which the object’s state is the entire database
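
The real-time constraint can be expressed as a simple check on a proposed serialization order. This is a minimal sketch, assuming each operation is described by a hypothetical `(start, end)` interval on a shared clock; `respects_real_time` is an illustrative name.

```python
def respects_real_time(order, intervals):
    """True if `order` never places B before A when A completed before B began."""
    position = {name: index for index, name in enumerate(order)}
    for a, (_, a_end) in intervals.items():
        for b, (b_start, _) in intervals.items():
            if a_end < b_start and position[a] > position[b]:
                return False
    return True


# A finishes before B starts; C overlaps both.
INTERVALS = {"A": (0, 1), "B": (2, 3), "C": (0.5, 2.5)}
```

Under plain serializability any of the six orders of A, B and C is acceptable. Strict serializability additionally rejects orders such as `["B", "A", "C"]`, while still allowing C to be placed anywhere because it overlaps both A and B in real time.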
   531  
   532  ### Writes Follow Reads
   533  
   534  - If a process reads a value v, which came from a write w1, and later performs write w2, then w2 must be visible after w1
   535    - Avoids Immortal Write [ABACASER]
   536  - A write operation by a process on a data item x following a previous read operation on x by the same process is guaranteed to take place on the same or a more recent value of x that was read [WIKICONS]
   537  - Also known as session causality
   538  
   539  ### Monotonic Reads
   540  
   541  - > if a process performs read r1, then r2, then r2 cannot observe a state prior to the writes which were reflected in r1; intuitively, reads cannot go backwards
   542  
   543  ### Monotonic Writes
   544  
   545  - > if a process performs write w1, then w2, then all processes observe w1 before w2
   546  
   547  ### Read Your Writes
   548  
   549  - > Requires that if a process performs a write w, then that same process performs a subsequent read r, then r must observe w’s effects.
   550  - > Note that read your writes does not apply to operations performed by different processes
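
Monotonic reads and read your writes can both be checked on the trace of a single session. The sketch below is a simplification: it assumes one process, one object, and integer versions that grow with every write, so "observes w" becomes "reads a version not older than w". The name `session_violations` and the trace format are hypothetical.

```python
def session_violations(trace):
    """Check one process's trace on one object.

    trace: list of ('w', version) and ('r', version) events in session order,
    where version is the version written or the version a read observed.
    Returns a list of (event index, violated guarantee).
    """
    violations = []
    last_read = last_write = 0
    for index, (kind, version) in enumerate(trace):
        if kind == "w":
            last_write = max(last_write, version)
            continue
        if version < last_read:
            violations.append((index, "monotonic reads"))
        if version < last_write:
            violations.append((index, "read your writes"))
        last_read = max(last_read, version)
    return violations
```

A session that reads version 3 and then version 2 violates monotonic reads; a session that writes version 5 and then reads version 4 violates read your writes.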
   551  
   552  ### PRAM
   553  
   554  - > PRAM is exactly equivalent to read your writes, monotonic writes, and monotonic reads.
   555  
   556  ### Causal Consistency
   557  
   558  - > Causal consistency captures the notion that causally-related operations should appear in the same order on all processes—though processes may disagree about the order of causally independent operations.
   559    - > For example, consider a single object representing a chat between three people, where Attiya asks “shall we have lunch?”, and Barbarella & Cyrus respond with “yes”, and “no”, respectively. Causal consistency allows Attiya to observe “lunch?”, “yes”, “no”; and Barbarella to observe “lunch?”, “no”, “yes”. However, no participant ever observes “yes” or “no” prior to the question “lunch?”
   560  - > Convergent causal systems require that the values of objects in the system converge to identical values, once the same operations are visible. In such a system, users could transiently observe “lunch”, “yes”; and “lunch”, “no”—but everyone would eventually agree on (to pick an arbitrary order) “lunch”, “yes”, “no”.
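
The lunch example can be turned into a small check. This is an illustrative sketch only: the causal relation is given explicitly as a set of hypothetical `(cause, effect)` pairs, and `respects_causality` is an assumed name.

```python
def respects_causality(observed, happens_before):
    """True if no effect is observed without, or before, its cause."""
    position = {op: index for index, op in enumerate(observed)}
    for cause, effect in happens_before:
        if effect in position:
            if cause not in position or position[cause] > position[effect]:
                return False
    return True


# Both answers causally depend on the question, but not on each other.
HAPPENS_BEFORE = {("lunch?", "yes"), ("lunch?", "no")}
```

Both `["lunch?", "yes", "no"]` and `["lunch?", "no", "yes"]` pass, because the two answers are causally independent. Any order that shows "yes" or "no" before "lunch?" fails.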
   561  
   562  
   563  ### Sequential Consistency
   564  
   565  - Orders all operations, not just causally related ones
   566  - > Informally, sequential consistency implies that operations appear to take place in some total order, and that that order is consistent with the order of operations on each individual process
   567  - Still no real-time constraints
   568  
   569  ### Linearizability
   570  
   571  - Consistent with the real-time ordering
   572  - > Linearizability is one of the strongest single-object consistency models, and implies that every operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations: e.g., if operation A completes before operation B begins, then B should logically take effect after A.
   573  - > Linearizability is a single-object model, but the scope of “an object” varies. Some systems provide linearizability on individual keys in a key-value store; others might provide linearizable operations on multiple keys in a table, or multiple tables in a database—but not between different tables or databases, respectively.
   574  - > (!!!) When you need linearizability across multiple objects, try strict serializability
   575  
   576  ### Graph
   577  
   578  On the left:
   579  - "x" isolation is a transactional model: operations (usually termed “transactions”) can involve several primitive sub-operations performed in order. It is also a multi-object property: operations can act on multiple objects in the system.
   580  
   581  ## Draft: Consistency
   582  
   583  - Workspace Consistency
   584  - Registry - multi-region consistency
   585  
   586  
   587  ## Draft: Client-assisted Consistency
   588  
   589  - Each record has a version represented by WLogOffset
   590  - When client reads a record, it gets a WLogOffset
   591  - When client sends a record, it sends a WLogOffset
   592  - If client WLogOffset does not equal current
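
One possible reading of this draft is optimistic concurrency control keyed by WLogOffset. The sketch below is an assumption, not the Heeus design: the draft does not say what happens on a mismatch, so rejecting the write with a conflict is a guess, and `Workspace`, `Conflict`, `read`, `write` and `expected_offset` are hypothetical names.

```python
class Conflict(Exception):
    """Raised when the client's WLogOffset is stale (assumed behavior)."""


class Workspace:
    def __init__(self):
        self.wlog_offset = 0  # offset of the last event in the write log
        self.records = {}     # record id -> (value, offset of its last write)

    def read(self, record_id):
        """Return (value, WLogOffset) so the client can send the offset back."""
        return self.records[record_id]

    def write(self, record_id, value, expected_offset=None):
        """Apply the write only if the record has not changed since it was read.

        expected_offset is None for a record the client believes is new.
        """
        current_offset = self.records.get(record_id, (None, None))[1]
        if current_offset != expected_offset:
            raise Conflict(f"{record_id}: expected {expected_offset}, "
                           f"current {current_offset}")
        self.wlog_offset += 1
        self.records[record_id] = (value, self.wlog_offset)
        return self.wlog_offset
```

A client that read a record at offset 1 can update it once; a second update that still carries offset 1 is rejected, and the client has to re-read the record before retrying.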
   593  
   594  ## References
   595  
   596  - [[ABACASER](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html)], D. Abadi, Correctness Anomalies Under Serializable Isolation, blogspot.com, June 2019
   597  - [[ABAISO](http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html)], D. Abadi, Introduction to Transaction Isolation Levels, blogspot.com, May 2019
   598  - [[ANSISQLCRIT](https://arxiv.org/ftp/cs/papers/0701/0701157.pdf)], A Critique of ANSI SQL Isolation Levels, Jun 1995, Microsoft Research
   599  - [[ANSISQL92](https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt)], [ANSI X3.135-1992, American National Standard for Information Systems — Database Language — SQL, November, 1992]
   600  - [[ANSISQL99](http://web.cecs.pdx.edu/~len/sql1999.pdf)], ANSI/ISO/IEC International Standard (IS) Database Language SQL — Part 2: Foundation (SQL/Foundation) «Part 2»
   601  - [[CLEPP](https://ebrary.net/64591/computer_science/designing_data-intensive_applications_the_big_ideas_behind_reliable_scalable_and_maintainable_syst)], Martin Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, First Edition, March 2017, O'REILLY
   602  - [[COSMOS](https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels)], Microsoft, Consistency levels in Azure Cosmos DB, 2022, microsoft.com
   603  - [[FOO](https://www.cs.umb.edu/~poneil/ROAnom.pdf)], Alan Fekete, Elizabeth O'Neil, and Patrick O'Neil, A Read-Only Transaction Anomaly Under Snapshot Isolation, www.cs.umb.edu 
   604  - [[HEEUSCQRS](../README.md#event-sourcing--cqrs)], Heeus, Event Sourcing & CQRS, github.com
   605  - [[JEPSEN](https://jepsen.io/consistency)], Consistency Models, jepsen.io, 2022
   606  - [[MGINVART](inv-articles-consistency.md)], Maxim Geraskin, inv-articles-consistency.md, github.com, Oct 2022
   607  - [[MSFTCQRS](https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs)], Microsoft, CQRS pattern, microsoft.com
   608  - [[WIKICONS](https://en.wikipedia.org/wiki/Consistency_model)], Consistency model, en.wikipedia.org
   609  - [[YB](https://eprints.soton.ac.uk/262096/1/reft.pdf)], D. Yadav, M. Butler, Rigorous Design of Fault-Tolerant Transactions for Replicated Database Systems using Event B, School of Electronics and Computer Science, University of Southampton
   610  - [[VV](https://arxiv.org/pdf/1512.00168.pdf)], Paolo Viotti, Marko Vukolic, Consistency in Non-Transactional Distributed Storage Systems, arxiv.org, 2016
   611  
   612  
   613  ## See also
   614  
   615  - [README-v1.md](README-v1.md)