+++
date = "2017-03-20T22:25:17+11:00"
title = "Design Concepts"
+++

## Transactions: FAQ

Dgraph supports distributed ACID transactions through snapshot isolation.

### Can we do pre-writes only on leaders?

This seems like a good idea, but it has bad implications. If we did a pre-write
only in memory, and only on the leader, that pre-write would never make it to
the Raft log, or to disk; yet it would be considered successful.

Zero could then mark the transaction as committed; but this leader could go
down, or leadership could change. In such a case, we'd end up losing the
transaction altogether despite it having been considered committed.

Therefore, pre-writes do have to make it to disk. And if they must hit disk
anyway, it's better to propose them through a Raft group.

## Consistency Models
[Last updated: Mar 2018]
This section is based [on this
article](https://aphyr.com/posts/313-strong-consistency-models) by aphyr.

- **Sequential Consistency:** Different users would see updates at different times, but each user would see operations in order.

Dgraph has a client-side sequencing mode, which provides sequential consistency.

Here, let’s replace a “user” with a “client” (or a single process). In Dgraph, each client maintains a linearizable read map (linread map). Dgraph's data set is sharded into many "groups". Each group is a Raft group, where every write is done via a "proposal". You can think of a transaction in Dgraph as consisting of many group proposals.

The leader in a Raft group always has the most recent proposal, while replicas
could be behind the leader to varying degrees. You can determine this by just
looking at the latest applied proposal ID. A leader's proposal ID would be
greater than or equal to the applied proposal IDs of its replicas.

The `linread` map stores, per client, the max proposal ID seen for each group. If a client's
last read saw updates corresponding to proposal ID X, then the `linread` map
would store X for that group. The client then uses the `linread` map to
inform future reads, ensuring that the server servicing the request has
proposals >= X applied before servicing the read. Thus, all future reads,
irrespective of which replica they might hit, would see updates for proposals >=
X. Also, the `linread` map is updated continuously with the max seen proposal IDs
across all groups, as reads and writes are done across transactions (within that
client).

In short, this map ensures that updates made by the client, or seen by the
client, would never be *unseen*; in fact, they would be visible in a sequential
order. There might be jumps though: for example, if a value goes X → Y → Z, the
client might see X, then Z (and not see Y at all).
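
To make this concrete, here's a minimal client-side sketch of such a map. The `LinRead` type and its methods are hypothetical, for illustration only; they are not Dgraph's actual client API.

```
package client

import "sync"

// LinRead tracks, per group, the max Raft proposal ID this client has seen.
// Hypothetical sketch; the real Dgraph client carries this state in its
// request/response protos.
type LinRead struct {
	mu  sync.Mutex
	ids map[uint32]uint64 // group ID -> max applied proposal ID seen
}

// Update records the proposal IDs observed in a read or write response.
func (l *LinRead) Update(seen map[uint32]uint64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.ids == nil {
		l.ids = make(map[uint32]uint64)
	}
	for group, id := range seen {
		if id > l.ids[group] {
			l.ids[group] = id
		}
	}
}

// AttachTo copies the map into an outgoing read request, so that the serving
// replica can wait until it has applied proposals >= these IDs.
func (l *LinRead) AttachTo(req map[uint32]uint64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for group, id := range l.ids {
		req[group] = id
	}
}
```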

- **Linearizability:** Each op takes effect atomically at some point between its invocation and completion. Once an op is complete, it would be visible to all.

Dgraph supports server-side sequencing of updates, which provides
linearizability. Unlike sequential consistency, which provides sequencing per
client, this provides sequencing across all clients. This is necessary to make
transactions work across clients. Thus, once a transaction is committed,
it would be visible to all future readers, irrespective of client boundaries.

- **Causal consistency:** Dgraph does not have a concept of dependencies among transactions, so it does NOT order based on dependencies.
- **Serializable consistency:** Dgraph does NOT allow arbitrary reordering of transactions, but it does provide a linear order per key.

---

{{% notice "outdated" %}}Sections below this one are outdated. You will find the [Tour of Dgraph](https://tour.dgraph.io) a much more helpful resource.{{% /notice %}}

## Concepts

### Edges

The typical data format is an RDF [N-Quad](https://www.w3.org/TR/n-quads/), which is:

* `Subject, Predicate, Object, Label`, aka
* `Entity, Attribute, Other Entity / Value, Label`

Both terminologies are used interchangeably in our code. Dgraph considers edges to be directional,
i.e. from `Subject -> Object`. This is the direction in which queries are run.

{{% notice "tip" %}}Dgraph can automatically generate a reverse edge. If the user wants to run
queries in that direction, they would need to define the [reverse edge](/query-language#reverse-edges)
as part of the schema.{{% /notice %}}

Internally, the RDF N-Quad gets parsed into this format:

```
type DirectedEdge struct {
  Entity      uint64           // Subject UID.
  Attr        string           // Predicate.
  Value       []byte           // Object value, when the object is a literal.
  ValueType   uint32           // Type of the object value.
  ValueId     uint64           // Object UID, when the object is another entity.
  Label       string
  Lang        string           // Language tag of the value.
  Op          DirectedEdge_Op  // Set or Delete.
  Facets      []*facetsp.Facet
}
```

Note that irrespective of the input, both `Entity` and `Object/ValueId` get converted into `UID` format.

### Posting List
Conceptually, a posting list contains all the `DirectedEdges` corresponding to an `Attribute`, in the
following format:

```
Attribute: Entity -> sorted list of ValueId // Everything in uint64 representation.
```

So, for example, if we're storing a list of friends, such as:

Entity | Attribute| ValueId
-------|----------|--------
Me     | friend   | person0
Me     | friend   | person1
Me     | friend   | person2
Me     | friend   | person3

Then a posting list `friend` would be generated. Seeking for `Me` in this PL
would produce a list of friends, namely `[person0, person1, person2, person3]`.

The big advantage of having such a structure is that we have all the data needed to do one join in one
Posting List. This means a single RPC to
the machine serving that Posting List can perform a join, without any further
network calls, reducing joins to lookups.
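
Here's a toy in-memory sketch of that idea: if a posting list behaves like a map from subject UID to sorted object UIDs, each hop of a "friends of friends" traversal is a lookup rather than a join. The types and data are illustrative only, not Dgraph's internals.

```
package main

import "fmt"

// PostingList maps a subject UID to its sorted object UIDs; an illustrative
// stand-in for the real on-disk structure.
type PostingList map[uint64][]uint64

func main() {
	friend := PostingList{
		1: {2, 3},
		2: {4},
		3: {4, 5},
	}

	// Friends of friends of UID 1: every hop is a lookup in one posting
	// list, not a cross-table join.
	seen := map[uint64]bool{}
	for _, f := range friend[1] {
		for _, ff := range friend[f] {
			seen[ff] = true
		}
	}
	fmt.Println(len(seen), "friends of friends") // 2 friends of friends
}
```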

Implementation-wise, a `Posting List` is a list of `Postings`. This is how they look in
[Protocol Buffers]({{< relref "#protocol-buffers" >}}) format.
```
message Posting {
  fixed64 uid = 1;
  bytes value = 2;
  enum ValType {
    DEFAULT = 0;
    BINARY = 1;
    INT = 2; // We treat it as int64.
    FLOAT = 3;
    BOOL = 4;
    DATE = 5;
    DATETIME = 6;
    GEO = 7;
    UID = 8;
    PASSWORD = 9;
    STRING = 10;
  }
  ValType val_type = 3;
  enum PostingType {
    REF = 0;        // UID
    VALUE = 1;      // simple, plain value
    VALUE_LANG = 2; // value with specified language
    // VALUE_TIMESERIES = 3; // value from timeseries, with specified timestamp
  }
  PostingType posting_type = 4;
  bytes metadata = 5; // for VALUE_LANG: language, for VALUE_TIMESERIES: timestamp, etc.
  string label = 6;
  uint64 commit = 7;  // More inclination towards smaller values.
  repeated facetsp.Facet facets = 8;

  // TODO: op is only used temporarily. See if we can remove it from here.
  uint32 op = 12;
}

message PostingList {
  repeated Posting postings = 1;
  bytes checksum = 2;
  uint64 commit = 3; // More inclination towards smaller values.
}
```

There is typically more than one Posting in a PostingList.

The RDF Label is stored as `label` in each posting.
{{% notice "warning" %}}We don't currently retrieve the label via queries, but plan to use it in the future.{{% /notice %}}

### Badger
PostingLists are served via [Badger](https://github.com/dgraph-io/badger), given that the latter provides enough
knobs to decide how much data should be served out of memory, SSD or disk.
It also supports bloom filters on keys, which makes random lookups efficient.

To allow Badger full access to memory to optimize for caches, we'll have
one Badger instance per machine. Each instance would contain all the
posting lists served by the machine.

Posting Lists get stored in Badger in a key-value format, like so:
```
(Predicate, Subject) --> PostingList
```
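
As a rough sketch (Dgraph's actual key encoding lives in its `x` package and differs in detail), such a key could be composed like this, so that all postings for one predicate sit contiguously, sorted by subject UID:

```
package main

import (
	"encoding/binary"
	"fmt"
)

// dataKey builds an illustrative (Predicate, Subject) key. Big-endian keeps
// the byte order of the UID identical to its numeric order, so Badger's
// sorted iteration visits subjects in UID order.
func dataKey(predicate string, subject uint64) []byte {
	key := make([]byte, 0, len(predicate)+1+8)
	key = append(key, predicate...)
	key = append(key, 0) // separator between predicate and UID
	var uid [8]byte
	binary.BigEndian.PutUint64(uid[:], subject)
	return append(key, uid[:]...)
}

func main() {
	fmt.Printf("%x\n", dataKey("friend", 0x1))
}
```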

### Group

Every Alpha server belongs to a particular group, and each group is responsible for serving a
particular set of predicates. Multiple servers in a single group replicate the same data to achieve
high availability and redundancy of data.

Predicates are automatically assigned to each group based on which group first receives the
predicate. By default, predicates are periodically moved to different groups, based on heuristics,
to evenly distribute the data across the cluster. Predicates can also be moved manually
if desired.

In a future version, if a group gets too big, it could be split further. In this case, a single
`Predicate` essentially gets divided across two groups.

```
  Original Group:
            (Predicate, Sa..z)
  After split:
  Group 1:  (Predicate, Sa..i)
  Group 2:  (Predicate, Sj..z)
```

Note that keys are sorted in BadgerDB. So, the group split would be done in a way that maintains that
sorting order, i.e. it would be split so that the lexicographically earlier subjects end up
in one group, and the later ones in the second.

### Replication and Server Failure
Each group should typically be served by at least 3 servers, if available. In the case of a machine
failure, the other servers serving the same group can still handle the load.

### New Server and Discovery
A Dgraph cluster can detect new machines allocated to the [cluster](/deploy#cluster),
establish connections, and transfer a subset of existing predicates to them based on the groups served
by each new machine.

### Write Ahead Logs
A mutation that hits the database doesn't immediately make it to disk via BadgerDB. We avoid
regenerating the posting list too often, because all the postings need to be kept sorted, and that's
expensive. Instead, every mutation gets logged and synced to disk via append-only log files called
`write-ahead logs`. So, any acknowledged writes would always be on disk. This allows us to recover
from a system crash by replaying all the mutations since the last write to the `Posting List`.
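
As a minimal sketch of the idea (not Dgraph's actual WAL code), an append-only, length-prefixed log write with a sync barrier looks like this:

```
package main

import (
	"encoding/binary"
	"os"
)

// appendEntry appends a length-prefixed mutation record to the log and
// fsyncs, so the write survives a crash even before it reaches the
// posting list.
func appendEntry(f *os.File, mutation []byte) error {
	var size [4]byte
	binary.BigEndian.PutUint32(size[:], uint32(len(mutation)))
	if _, err := f.Write(size[:]); err != nil {
		return err
	}
	if _, err := f.Write(mutation); err != nil {
		return err
	}
	return f.Sync() // only acknowledge the write after this returns
}

func main() {
	f, err := os.OpenFile("wal.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := appendEntry(f, []byte(`set <0x1> <friend> <0x2> .`)); err != nil {
		panic(err)
	}
}
```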

### Mutations

{{% notice "outdated" %}}
This section needs to be improved.
{{% /notice %}}

In addition to being written to `Write Ahead Logs`, a mutation also gets stored in memory as an
overlay over the immutable `Posting list`, in a mutation layer. This mutation layer allows us to iterate
over `Posting`s as though they're sorted, without requiring the posting list to be re-created.

When a posting list has mutations in memory, it's considered a `dirty` posting list. Periodically,
we regenerate the immutable version and write it to BadgerDB. Note that the writes to BadgerDB are
asynchronous, which means they don't get flushed out to disk immediately; but that wouldn't lead
to data loss on a machine crash. When `Posting lists` are initialized, the write-ahead logs are
consulted, and any missing writes get applied.

Every time we regenerate a posting list, we also write the max commit log timestamp that was
included; this helps us figure out how far back to seek in the write-ahead logs when initializing
the posting list, the first time it's brought back into memory.
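
To illustrate the overlay, here's a toy merge of an immutable sorted UID list with a sorted in-memory mutation layer. It ignores deletions and the real posting structure; the names are illustrative only.

```
package main

import "fmt"

// mergeLayers iterates two sorted UID slices as one sorted stream, so the
// mutation layer can overlay the immutable list without regenerating it.
func mergeLayers(immutable, mutations []uint64) []uint64 {
	out := make([]uint64, 0, len(immutable)+len(mutations))
	i, j := 0, 0
	for i < len(immutable) && j < len(mutations) {
		switch {
		case immutable[i] < mutations[j]:
			out = append(out, immutable[i])
			i++
		case immutable[i] > mutations[j]:
			out = append(out, mutations[j])
			j++
		default: // present in both layers; emit once
			out = append(out, immutable[i])
			i, j = i+1, j+1
		}
	}
	out = append(out, immutable[i:]...)
	return append(out, mutations[j:]...)
}

func main() {
	fmt.Println(mergeLayers([]uint64{2, 5, 9}, []uint64{3, 5})) // [2 3 5 9]
}
```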

### Queries

Let's understand how query execution works by looking at an example.

```
{
  me(func: uid(0x1)) {
    pred_A
    pred_B {
      pred_B1
      pred_B2
    }
    pred_C {
      pred_C1
      pred_C2 {
        pred_C21
      }
    }
  }
}
```

Let's assume we have 3 Alpha instances, and instance id=2 receives this query. These are the steps (see the sketch after this list):

* Send queries to look up keys = `pred_A, 0x1`, `pred_B, 0x1`, and `pred_C, 0x1`. These predicates could
belong to 3 different groups, served by potentially different Alpha servers. So, this would typically
incur at most 3 network calls (equal to the number of predicates at this step).
* The above queries would return 3 lists of UIDs or values. The results for `pred_B` and `pred_C`
would be converted into queries for `pred_Bi` and `pred_Ci`.
* `pred_Bi` and `pred_Ci` would then cause at most 4 network calls, depending upon where these
predicates are located. The keys for `pred_Bi`, for example, would be `pred_Bi, res_pred_Bk`, where
`res_pred_Bk` is the list of resulting UIDs from `pred_B, 0x1`.
* Looking at `res_pred_C2`, you'll notice that this would be a list of lists, aka a list matrix. We
merge these lists into a sorted list with distinct elements to form the query for `pred_C21`.
* Another network call, depending upon where `pred_C21` lies, would again give us a list of
UID lists / values.
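
The per-predicate fan-out can be pictured as concurrent task lookups, roughly like this sketch (hypothetical names, not Dgraph's actual query processor):

```
package main

import (
	"fmt"
	"sync"
)

// task is one (predicate, uid list) lookup, destined for whichever group
// serves that predicate.
type task struct {
	pred string
	uids []uint64
}

// process stands in for the RPC to the Alpha server serving t.pred.
func process(t task) []uint64 { return t.uids }

func main() {
	tasks := []task{
		{"pred_A", []uint64{0x1}},
		{"pred_B", []uint64{0x1}},
		{"pred_C", []uint64{0x1}},
	}
	results := make([][]uint64, len(tasks))
	var wg sync.WaitGroup
	for i, t := range tasks {
		wg.Add(1)
		go func(i int, t task) { // one network call per predicate, in parallel
			defer wg.Done()
			results[i] = process(t)
		}(i, t)
	}
	wg.Wait()
	fmt.Println(results)
}
```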

If the query was run via the HTTP interface `/query`, this subgraph gets converted into JSON for
the reply to the client. If the query was run via the [gRPC](https://www.grpc.io/) interface using
the language [clients]({{< relref "clients/index.md" >}}), the subgraph gets converted to
[protocol buffer](https://developers.google.com/protocol-buffers/) format and then returned to the client.

### Network Calls
Compared to RAM or SSD access, network calls are slow.
Dgraph minimizes the number of network calls required to execute queries. As explained above, the
data sharding is done based on `predicate`, not `entity`. Thus, even if we have a large set of
intermediate results, they'd still only increase the payload of a network call, not the number of
network calls itself. In general, the number of network calls done in Dgraph is directly proportional
to the number of predicates in the query, or the complexity of the query, not the number of
intermediate or final results.

In the above example, we have eight predicates, so including a call to convert to UID, we'll
have at most nine network calls. The total number of entity results could be in the millions.

### Worker
In the Queries section, you saw how calls are made to query for `(predicate, uids)`. All those
network calls / local processing are done via workers. Each server exposes a
[gRPC](https://www.grpc.io) interface, which can then be called by the query processor to retrieve data.

### Worker Pool
A Worker Pool is just a pool of open TCP connections which can be reused by multiple goroutines.
This avoids having to create a new connection every time a network call needs to be made.
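
As a toy sketch of the idea (not Dgraph's actual pool implementation), a reusable connection pool can be a buffered channel of connections:

```
package pool

import "net"

// Pool hands out established TCP connections and takes them back for reuse,
// so goroutines don't pay the dial cost on every network call.
type Pool struct {
	conns chan net.Conn
	addr  string
}

func New(addr string, size int) *Pool {
	return &Pool{conns: make(chan net.Conn, size), addr: addr}
}

// Get returns an idle pooled connection, or dials a fresh one.
func (p *Pool) Get() (net.Conn, error) {
	select {
	case c := <-p.conns:
		return c, nil
	default:
		return net.Dial("tcp", p.addr)
	}
}

// Put returns a connection for reuse, closing it if the pool is full.
func (p *Pool) Put(c net.Conn) {
	select {
	case p.conns <- c:
	default:
		c.Close()
	}
}
```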

### Protocol Buffers
All data in Dgraph that is stored or transmitted is first converted into byte arrays through
serialization using [Protocol Buffers](https://developers.google.com/protocol-buffers/). When
the result is to be returned to the user, the protocol buffer object is traversed, and the JSON
object is formed.

## Minimizing network calls explained

To explain how Dgraph minimizes network calls, let's start with an example query we should be able
to run.

*Find all posts liked by friends of friends of mine over the last year, written by a popular author X.*

### SQL/NoSQL
In a distributed SQL/NoSQL database, this would require you to retrieve a lot of data.

Method 1:

* Find all the friends (~ 338 [friends](http://www.pewresearch.org/fact-tank/2014/02/03/6-new-facts-about-facebook/)).
* Find all their friends (~ 338 * 338 = 114,244 people).
* Find all the posts liked by these people over the last year (resulting set in millions).
* Intersect these posts with posts authored by person X.

Method 2:

* Find all posts written by popular author X over the last year (possibly thousands).
* Find all people who liked those posts (easily millions): `result set 1`.
* Find all your friends.
* Find all their friends: `result set 2`.
* Intersect `result set 1` with `result set 2`.

Both of these approaches would result in a lot of data going back and forth between the database and
the application; they would be slow to execute, or would require you to run an offline job.

### Dgraph
This is how it would run in Dgraph:

* Node X contains the posting list for predicate `friends`.
* Seek to the caller's userid in Node X **(1 RPC)**. Retrieve a list of friend uids.
* Do multiple seeks for each of the friend uids, to generate a list of friends-of-friends uids: `result set 1`.
* Node Y contains the posting list for predicate `posts_liked`.
* Ship `result set 1` to Node Y **(1 RPC)**, and do seeks to generate a list of all posts liked by
`result set 1`: `result set 2`.
* Node Z contains the posting list for predicate `author`.
* Ship `result set 2` to Node Z **(1 RPC)**. Seek to author X, and generate a list of posts authored
by X: `result set 3`.
* Intersect the two sorted lists, `result set 2` and `result set 3`: `result set 4`.
* Node N contains the names for all uids.
* Ship `result set 4` to Node N **(1 RPC)**, and convert uids to names by doing multiple seeks: `result set 5`.
* Ship `result set 5` back to the caller.

In 4-5 RPCs, we have figured out all the posts liked by friends of friends, written by popular author X.

This design allows vast scalability, and yet consistent production-level latencies,
to support running complicated queries requiring deep joins.

## RAFT

This section aims to explain the RAFT consensus algorithm in simple terms. The idea is to give you
just enough to understand the basic concepts, without going into explanations about why it
works correctly. For a detailed explanation of RAFT, please read the original thesis by
[Diego Ongaro](https://github.com/ongardie/dissertation).

### Term
Each election cycle is considered a **term**, during which there is a single leader
*(just like in a democracy)*. When a new election starts, the term number is increased. This is
straightforward and obvious, but is a critical factor for the correctness of the algorithm.

In rare cases, if no leader could be elected within an `ElectionTimeout`, that term can end without
a leader.

### Server States
Each server in the cluster can be in one of the following three states:

* Leader
* Follower
* Candidate

Generally, the servers are in the leader or follower state. When the leader crashes or communication
breaks down, the followers wait for an election timeout before converting to candidates. The
election timeout is randomized, which allows one of them to declare candidacy before the others.
The candidate votes for itself and waits for the majority of the cluster to vote for it as well.
If a follower hears from a candidate with a higher term than the current (*dead in this case*) leader,
it votes for it. The candidate who gets majority votes wins the election and becomes the leader.

The leader then tells the rest of the cluster about the result (<tt>Heartbeat</tt>
[Communication]({{< relref "#communication" >}})) and the other candidates become followers.
Again, the cluster goes back into the leader-follower model.

A leader could revert to being a follower without an election, if it finds another leader in the
cluster with a higher [Term]({{< relref "#term" >}}). This might happen in rare cases (network partitions).

### Communication
There is unidirectional RPC communication, from leader to followers. The followers never ping the
leader. The leader sends `AppendEntries` messages to the followers with logs containing state
updates. When the leader sends `AppendEntries` with zero logs, that's considered a
<tt>Heartbeat</tt>. The leader sends all followers <tt>Heartbeats</tt> at regular intervals.

If a follower doesn't receive a <tt>Heartbeat</tt> for the `ElectionTimeout` duration (generally between
150ms and 300ms), it converts its state to candidate (as mentioned in [Server States]({{< relref "#server-states" >}})).
It then requests votes by sending a `RequestVote` call to other servers. Again, if it gets
majority votes, the candidate becomes the leader. On becoming leader, it sends <tt>Heartbeats</tt>
to all other servers to establish its authority *(Cartman style, "Respect my authoritah!")*.
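
As a small sketch of the randomized timeout (illustrative only, not Dgraph's Raft implementation):

```
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout picks a random duration in [150ms, 300ms). Randomization
// makes it unlikely that two followers time out simultaneously and split
// the vote.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

func main() {
	timer := time.NewTimer(electionTimeout())
	heartbeat := time.Tick(50 * time.Millisecond) // the leader's heartbeats

	for i := 0; i < 5; i++ {
		select {
		case <-heartbeat:
			timer.Reset(electionTimeout()) // heartbeat received: stay follower
			fmt.Println("heartbeat; staying follower")
		case <-timer.C:
			fmt.Println("timed out; becoming candidate")
			return
		}
	}
}
```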

Every communication request contains a term number. If a server receives a request with a stale term
number, it rejects the request.

Raft believes in retrying RPCs indefinitely.

### Log Entries
Log Entries are numbered sequentially and contain a term number. An entry is considered **committed** if
it has been replicated to a majority of the servers.

On receiving a client request, the leader does four things (aka Log Replication; see the sketch after this list):

* Appends and persists the entry to its log.
* Issues `AppendEntries` in parallel to other servers.
* On majority replication, considers the entry committed and applies it to its state machine.
* Notifies followers that the entry is committed, so that they can apply it to their state machines.

A leader never overwrites or deletes its entries. There is a guarantee that if an entry is committed,
all future leaders will have it. A leader can, however, force-overwrite the followers' logs, so they
match the leader's logs *(elected democratically, but got a dictator)*.
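
A minimal sketch of an entry and the majority-commit check (illustrative, not Dgraph's Raft code):

```
package main

import (
	"fmt"
	"sort"
)

// Entry is a Raft log entry: its position, its election term, and the
// state-machine command it carries.
type Entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

// commitIndex returns the highest index replicated on a majority, given each
// server's highest replicated ("match") index, leader included.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	return sorted[len(sorted)/2] // the index held by a majority
}

func main() {
	// 5 servers with match indexes 7,7,5,3,3: index 5 is on 3 of 5 servers,
	// so entries up to 5 are committed.
	fmt.Println(commitIndex([]uint64{7, 7, 5, 3, 3})) // 5
}
```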

### Voting
Each server persists its current term and vote, so it doesn't end up voting twice in the same term.
On receiving a `RequestVote` RPC, the server denies its vote if its own log is more up-to-date than the
candidate's. It would also deny a vote if a minimum `ElectionTimeout` hasn't passed since the last
<tt>Heartbeat</tt> from the leader. Otherwise, it gives its vote and resets its `ElectionTimeout` timer.

The up-to-date property of logs is determined as follows (see the sketch below):

* Term number comparison
* Index number or log length comparison
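
Concretely, the comparison could look like this (a sketch of the rule from the Raft paper, not Dgraph code):

```
package main

import "fmt"

// atLeastAsUpToDate reports whether log A is at least as up-to-date as log B:
// compare the terms of the last entries first; on a tie, the longer log wins.
func atLeastAsUpToDate(lastTermA, lastIndexA, lastTermB, lastIndexB uint64) bool {
	if lastTermA != lastTermB {
		return lastTermA > lastTermB
	}
	return lastIndexA >= lastIndexB
}

func main() {
	// A candidate at (term 2, index 20) is denied by a voter whose log is at
	// (term 3, index 9): a higher last term beats a longer log.
	fmt.Println(atLeastAsUpToDate(2, 20, 3, 9)) // false
}
```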

{{% notice "tip" %}}To understand the above sections better, you can see this
[interactive visualization](http://thesecretlivesofdata.com/raft).{{% /notice %}}

### Cluster membership
Raft only allows single-server changes, i.e. only one server can be added or removed at a time.
This is achieved by cluster configuration changes. Cluster configurations are communicated using
special entries in `AppendEntries`.

The significant difference in how cluster configuration changes are applied, compared to how typical
[Log Entries]({{< relref "#log-entries" >}}) are applied, is that the followers don't wait for a
commitment confirmation from the leader before enabling them.

A server can respond to both `AppendEntries` and `RequestVote` without checking the current
configuration. This mechanism allows new servers to participate without officially being part of
the cluster. Without this mechanism, membership changes wouldn't work.

When a new server joins, it won't have any logs, so they need to be streamed to it. To ensure cluster
availability, Raft allows this server to join the cluster as a non-voting member. Once it's caught
up, voting can be enabled. This also allows the cluster to remove this server, in case it's too slow
to catch up, before it's given voting rights *(sort of like getting a green card to allow assimilation
before citizenship is awarded, providing voting rights)*.

{{% notice "tip" %}}If you want to add a few servers and remove a few servers, do the addition
before the removal. To bootstrap a cluster, start with one server to allow it to become the leader,
and then add servers to the cluster one-by-one.{{% /notice %}}

### Log Compaction
One way to compact the logs is snapshotting. As soon as the state machine is synced to disk, the
logs up to that point can be discarded.

### Clients
Clients must locate the cluster to interact with it. Various approaches can be used for discovery.

A client can randomly pick any server in the cluster. If the server isn't a leader, the request
should be rejected, and the leader information passed along. The client can then re-route its query
to the leader. Alternatively, the server can proxy the client's request to the leader.

When a client first starts up, it can register itself with the cluster using the `RegisterClient` RPC.
This creates a new client id, which is used for all subsequent RPCs.

### Linearizable Semantics

Servers must filter out duplicate requests. They can do this via session tracking, where they use
the client id and another request UID set by the client to avoid reprocessing duplicate requests.
RAFT also suggests storing responses along with the request UIDs, so the stored response can be
returned if a duplicate request is received.
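
A toy sketch of that session-based deduplication (illustrative names only):

```
package main

import "fmt"

type requestKey struct {
	clientID  uint64
	requestID uint64
}

// session caches responses by (client, request), so a retried RPC is answered
// from the cache instead of being applied to the state machine twice.
type session struct {
	responses map[requestKey][]byte
}

func (s *session) apply(clientID, requestID uint64, op func() []byte) []byte {
	key := requestKey{clientID, requestID}
	if resp, ok := s.responses[key]; ok {
		return resp // duplicate: reply with the stored response
	}
	resp := op() // first time: actually apply the request
	s.responses[key] = resp
	return resp
}

func main() {
	s := &session{responses: make(map[requestKey][]byte)}
	calls := 0
	op := func() []byte { calls++; return []byte("ok") }
	s.apply(1, 42, op)
	s.apply(1, 42, op) // a retry of the same request
	fmt.Println(calls) // 1: the op ran only once
}
```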

Linearizability requires the results of a read to reflect the latest committed write.
Serializability, on the other hand, allows stale reads.

### Read-only queries

To ensure linearizability of read-only queries run via the leader, the leader must take these steps (put together in the sketch after this list):

* The leader must have at least one committed entry in its term. This ensures its state is up to date.
*(C'mon! Now that you're in power, do something at least!)*
* The leader stores its latest commit index (the readIndex).
* The leader sends <tt>Heartbeats</tt> to the cluster and waits for ACKs from the majority. Now it knows
that it's still the leader. *(No successful coup. Yup, still the democratically elected dictator I was before!)*
* The leader waits for its state machine to advance to the readIndex.
* The leader can now run the queries against the state machine and reply to clients.
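
Put together as a self-contained sketch (everything here is an illustrative stand-in, not Dgraph's implementation):

```
package main

import "fmt"

type leader struct {
	commitIndex     uint64
	appliedIndex    uint64
	committedInTerm bool
}

// heartbeatQuorum stands in for a heartbeat round ACKed by a majority, which
// proves no other leader has been elected meanwhile.
func (l *leader) heartbeatQuorum() bool { return true }

func (l *leader) linearizableRead(query func() string) (string, error) {
	// 1. The leader must have committed at least one entry in its term.
	if !l.committedInTerm {
		return "", fmt.Errorf("no committed entry in current term yet")
	}
	readIndex := l.commitIndex // 2. remember the current commit index
	if !l.heartbeatQuorum() {  // 3. confirm we're still the leader
		return "", fmt.Errorf("lost leadership")
	}
	for l.appliedIndex < readIndex { // 4. wait for the state machine to catch up
		l.appliedIndex++ // a real server would block on its apply loop here
	}
	return query(), nil // 5. serve the read from the state machine
}

func main() {
	l := &leader{commitIndex: 10, appliedIndex: 8, committedInTerm: true}
	res, err := l.linearizableRead(func() string { return "answer" })
	fmt.Println(res, err)
}
```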

Read-only queries can also be serviced by followers to reduce the load on the leader. But this
could lead to stale results unless the follower confirms that its leader is still the real leader
(there could be a network partition). To do so, the follower would have to send a query to the
leader, and the leader would have to do steps 1-3. Then the follower can do steps 4-5.

Read-only queries would have to be batched up, and then RPCs would have to go to the leader for each
batch, which in turn would have to send further RPCs to the whole cluster. *(This is not scalable
without considerable optimizations to deal with latency.)*

**An alternative approach** would be to have the servers return the index corresponding to their
state machine. The client then keeps track of the maximum index it has received in replies so far,
and passes it along to the server with the next request. If a server's state machine hasn't reached the
index provided by the client, it will not service the request. This approach avoids inter-server
communication and is a lot more scalable; it's essentially the `linread` map described under
Consistency Models above. *(This approach does not guarantee linearizability, but
should converge quickly to the latest write.)*