+++
date = "2017-03-20T22:25:17+11:00"
title = "Design Concepts"
+++

## Transactions: FAQ

Dgraph supports distributed ACID transactions through snapshot isolation.

### Can we do pre-writes only on leaders?

This seems like a good idea, but it has bad implications. If we did a pre-write
only in memory, and only on the leader, then the pre-write wouldn't make it to
the Raft log or to disk, yet it would be considered successful.

Zero could then mark the transaction as committed; but this leader could go
down, or leadership could change. In such a case, we'd end up losing the
transaction altogether despite it having been considered committed.

Therefore, pre-writes do have to make it to disk. And if they have to hit disk,
it's better to propose them through a Raft group.

## Consistency Models
[Last updated: Mar 2018]
This is based on [this
article](https://aphyr.com/posts/313-strong-consistency-models) by aphyr.

- **Sequential Consistency:** Different users would see updates at different times, but each user would see operations in order.

Dgraph has a client-side sequencing mode, which provides sequential consistency.

Here, let’s replace a “user” with a “client” (or a single process). In Dgraph, each client maintains a linearizable read map (linread map). Dgraph's data set is sharded into many "groups". Each group is a Raft group, where every write is done via a "proposal." You can think of a transaction in Dgraph as consisting of many group proposals.

The leader in a Raft group always has the most recent proposal, while
replicas could be behind the leader to varying degrees. You can determine this
by just looking at the latest applied proposal ID. A leader's proposal ID would
be greater than or equal to some replicas' applied proposal ID.

The `linread` map stores, per client, the maximum proposal ID seen for each
group. If a client's last read had seen updates corresponding to proposal ID X,
then the `linread` map would store X for that group. The client would then use
the `linread` map to inform future reads, ensuring that the server servicing
the request has proposals >= X applied before servicing the read. Thus, all
future reads, irrespective of which replica they might hit, would see updates
for proposals >= X. Also, the `linread` map is updated continuously with the
max seen proposal IDs across all groups, as reads and writes are done across
transactions (within that client). A sketch of this client-side bookkeeping
follows this list.

In short, this map ensures that updates made by the client, or seen by the
client, would never be *unseen*; in fact, they would be visible in a sequential
order. There might be jumps though; for example, if a value goes X → Y → Z, the
client might see X, then Z (and not see Y at all).

- **Linearizability:** Each op takes effect atomically at some point between invocation and completion. Once an op is complete, it would be visible to all.

Dgraph supports server-side sequencing of updates, which provides
linearizability. Unlike sequential consistency, which provides sequencing per
client, this provides sequencing across all clients. This is necessary to make
transactions work across clients. Thus, once a transaction is committed,
it would be visible to all future readers, irrespective of client boundaries.

- **Causal consistency:** Dgraph does not have a concept of dependencies among transactions, so it does NOT order based on dependencies.
- **Serializable consistency:** Dgraph does NOT allow arbitrary reordering of transactions, but does provide a linear order per key.
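
Below is a minimal Go sketch of the client-side bookkeeping described for sequential consistency above. The `linRead` type and the `canServe` helper are illustrative only, not Dgraph's actual client API.

```
package main

import "fmt"

// linRead tracks, per client, the maximum proposal ID this client has
// seen for each group (group ID -> max applied proposal ID).
type linRead map[uint32]uint64

// observe records the proposal ID applied by the server that answered
// the last read or write, keeping only the maximum per group.
func (lr linRead) observe(group uint32, proposalID uint64) {
	if proposalID > lr[group] {
		lr[group] = proposalID
	}
}

// canServe is the server-side check implied above: a replica may serve
// a read only if it has applied at least the proposal ID the client
// has already seen for that group.
func canServe(applied, required uint64) bool {
	return applied >= required
}

func main() {
	lr := linRead{}
	lr.observe(1, 42) // a write to group 1 was applied at proposal 42

	// A later read sent to any replica of group 1 carries 42; the replica
	// must have applied proposals >= 42 before answering.
	fmt.Println(canServe(41, lr[1])) // false: replica is behind, wait or retry
	fmt.Println(canServe(43, lr[1])) // true: replica has caught up
}
```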
---

{{% notice "outdated" %}}Sections below this one are outdated. You will find [Tour of Dgraph](https://tour.dgraph.io) a much more helpful resource.{{% /notice %}}

## Concepts

### Edges

The typical data format is an RDF [N-Quad](https://www.w3.org/TR/n-quads/), which is:

* `Subject, Predicate, Object, Label`, aka
* `Entity, Attribute, Other Entity / Value, Label`

Both terminologies are used interchangeably in our code. Dgraph considers edges to be directional,
i.e. from `Subject -> Object`. This is the direction in which queries are run.

{{% notice "tip" %}}Dgraph can automatically generate a reverse edge. If the user wants to run
queries in that direction, they would need to define the [reverse edge](/query-language#reverse-edges)
as part of the schema.{{% /notice %}}

Internally, the RDF N-Quad gets parsed into this format.

```
type DirectedEdge struct {
	Entity    uint64
	Attr      string
	Value     []byte
	ValueType uint32
	ValueId   uint64
	Label     string
	Lang      string
	Op        DirectedEdge_Op // Set or Delete
	Facets    []*facetsp.Facet
}
```

Note that irrespective of the input, both `Entity` and `Object/ValueId` get converted into `UID` format.

### Posting List
Conceptually, a posting list contains all the `DirectedEdges` corresponding to an `Attribute`, in the
following format:

```
Attribute: Entity -> sorted list of ValueId // Everything in uint64 representation.
```

So, for example, if we're storing a list of friends, such as:

Entity | Attribute | ValueId
-------|-----------|--------
Me     | friend    | person0
Me     | friend    | person1
Me     | friend    | person2
Me     | friend    | person3


Then a posting list `friend` would be generated. Seeking for `Me` in this PL
would produce a list of friends, namely `[person0, person1, person2, person3]`.

The big advantage of having such a structure is that we have all the data to do one join in one
Posting List. This means one RPC to the machine serving that Posting List would result in a join,
without any further network calls, reducing joins to lookups.

Implementation-wise, a `Posting List` is a list of `Postings`. This is how they look in
[Protocol Buffers]({{< relref "#protocol-buffers" >}}) format.
```
message Posting {
	fixed64 uid = 1;
	bytes value = 2;
	enum ValType {
		DEFAULT = 0;
		BINARY = 1;
		INT = 2; // We treat it as int64.
		FLOAT = 3;
		BOOL = 4;
		DATE = 5;
		DATETIME = 6;
		GEO = 7;
		UID = 8;
		PASSWORD = 9;
		STRING = 10;
	}
	ValType val_type = 3;
	enum PostingType {
		REF = 0;        // UID
		VALUE = 1;      // simple, plain value
		VALUE_LANG = 2; // value with specified language
		// VALUE_TIMESERIES = 3; // value from timeseries, with specified timestamp
	}
	PostingType posting_type = 4;
	bytes metadata = 5; // for VALUE_LANG: Language, for VALUE_TIMESERIES: timestamp, etc.
	string label = 6;
	uint64 commit = 7;  // More inclination towards smaller values.
	repeated facetsp.Facet facets = 8;

	// TODO: op is only used temporarily. See if we can remove it from here.
	uint32 op = 12;
}

message PostingList {
	repeated Posting postings = 1;
	bytes checksum = 2;
	uint64 commit = 3; // More inclination towards smaller values.
}
```

There is typically more than one Posting in a PostingList.

The RDF Label is stored as `label` in each posting.
{{% notice "warning" %}}We don't currently retrieve the label via queries, but would use it in the future.{{% /notice %}}
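
To make the "joins become lookups" point above concrete, here is a minimal Go sketch of the conceptual `Attribute: Entity -> sorted list of ValueId` layout, using the `friend` example from the table. The `postingList` type is illustrative only; real posting lists are stored as the protobuf messages shown above.

```
package main

import "fmt"

// A conceptual posting list: for one attribute, map each entity UID to a
// sorted list of value UIDs. Everything is in uint64 (UID) representation.
type postingList map[uint64][]uint64

func main() {
	// The "friend" posting list from the table above.
	const me = 0x01
	friend := postingList{
		me: {0x10, 0x11, 0x12, 0x13}, // person0..person3, kept sorted
	}

	// One seek in the posting list answers "who are Me's friends?" -- the
	// join becomes a single lookup, with no further network calls needed
	// once we are on the machine serving this posting list.
	fmt.Println(friend[me]) // [16 17 18 19]
}
```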
### Badger
PostingLists are served via [Badger](https://github.com/dgraph-io/badger), given that the latter provides enough
knobs to decide how much data should be served out of memory, SSD or disk.
Also, it supports bloom filters on keys, which makes random lookups efficient.

To allow Badger full access to memory to optimize for caches, we'll have
one Badger instance per machine. Each instance would contain all the
posting lists served by the machine.

Posting Lists get stored in Badger in a key-value format, like so:
```
(Predicate, Subject) --> PostingList
```

### Group

Every Alpha server belongs to a particular group, and each group is responsible for serving a
particular set of predicates. Multiple servers in a single group replicate the same data to achieve
high availability and redundancy of data.

Predicates are automatically assigned to each group based on which group first receives the
predicate. By default, predicates are periodically moved to different groups based on heuristics,
to evenly distribute the data across the cluster. Predicates can also be moved manually if desired.

In a future version, if a group gets too big, it could be split further. In this case, a single
`Predicate` essentially gets divided across two groups.

```
Original Group:
            (Predicate, Sa..z)
After split:
Group 1:    (Predicate, Sa..i)
Group 2:    (Predicate, Sj..z)
```

Note that keys are sorted in BadgerDB. So, the group split would be done in a way that maintains that
sorting order, i.e. it would be split in a way where the lexicographically earlier subjects would be
in one group, and the later ones in the second.

### Replication and Server Failure
Each group should typically be served by at least 3 servers, if available. In the case of a machine
failure, the other servers serving the same group can still handle the load.

### New Server and Discovery
A Dgraph cluster can detect new machines allocated to the [cluster](/deploy#cluster),
establish connections, and transfer a subset of existing predicates to them based on the groups served
by each new machine.

### Write Ahead Logs
Every mutation, upon hitting the database, doesn't immediately make it to disk via BadgerDB. We avoid
re-generating the posting list too often, because all the postings need to be kept sorted, and that's
expensive. Instead, every mutation gets logged and synced to disk via append-only log files called
`write-ahead logs`. So, any acknowledged write would always be on disk. This allows us to recover
from a system crash by replaying all the mutations since the last write to the `Posting List`.

### Mutations

{{% notice "outdated" %}}
This section needs to be improved.
{{% /notice %}}

In addition to being written to `Write Ahead Logs`, a mutation also gets stored in memory as an
overlay over the immutable `Posting List` in a mutation layer. This mutation layer allows us to iterate
over `Posting`s as though they're sorted, without re-creating the posting list.

When a posting list has mutations in memory, it's considered a `dirty` posting list. Periodically,
we re-generate the immutable version and write it to BadgerDB. Note that the writes to BadgerDB are
asynchronous, which means they don't get flushed out to disk immediately, but that wouldn't lead
to data loss on a machine crash. When `Posting Lists` are initialized, the write-ahead logs are
consulted, and any missing writes are applied.

Every time we regenerate a posting list, we also write the max commit log timestamp that was
included -- this helps us figure out how far back to seek in the write-ahead logs when initializing
the posting list, the first time it's brought back into memory.
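
A minimal sketch of the idea behind the mutation layer: while iterating, merge a small sorted in-memory overlay with the immutable posting list, so postings appear in sorted order without regenerating the list. The function below is illustrative (it only handles added UIDs, not deletions) and is not Dgraph's actual implementation.

```
package main

import "fmt"

// mergeSorted iterates over an immutable, sorted posting list and a sorted
// in-memory mutation layer, yielding one sorted sequence without rebuilding
// the immutable list.
func mergeSorted(immutable, overlay []uint64) []uint64 {
	out := make([]uint64, 0, len(immutable)+len(overlay))
	i, j := 0, 0
	for i < len(immutable) && j < len(overlay) {
		switch {
		case immutable[i] < overlay[j]:
			out = append(out, immutable[i])
			i++
		case immutable[i] > overlay[j]:
			out = append(out, overlay[j])
			j++
		default: // same UID mutated in the overlay; take the overlay's version
			out = append(out, overlay[j])
			i, j = i+1, j+1
		}
	}
	out = append(out, immutable[i:]...)
	out = append(out, overlay[j:]...)
	return out
}

func main() {
	immutable := []uint64{2, 5, 9}
	overlay := []uint64{3, 5} // recent mutations, not yet re-generated into Badger
	fmt.Println(mergeSorted(immutable, overlay)) // [2 3 5 9]
}
```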
### Queries

Let's understand how query execution works by looking at an example.

```
{
    me(func: uid(0x1)) {
      pred_A
      pred_B {
        pred_B1
        pred_B2
      }
      pred_C {
        pred_C1
        pred_C2 {
          pred_C21
        }
      }
    }
}
```

Let's assume we have 3 Alpha instances, and instance id=2 receives this query. These are the steps:

* Send queries to look up keys = `pred_A, 0x1`, `pred_B, 0x1`, and `pred_C, 0x1`. These predicates could
belong to 3 different groups, served by potentially different Alpha servers. So, this would typically
incur at most 3 network calls (equal to the number of predicates at this step).
* The above queries would return 3 lists of UIDs or values. The results for `pred_B` and `pred_C`
would be converted into queries for `pred_Bi` and `pred_Ci`.
* `pred_Bi` and `pred_Ci` would then cause at most 4 network calls, depending upon where these
predicates are located. The keys for `pred_Bi`, for example, would be `pred_Bi, res_pred_Bk`, where
`res_pred_Bk` is the list of UIDs resulting from the `pred_B` lookup.
* Looking at `res_pred_C2`, you'll notice that this would be a list of lists, aka a list matrix. We
merge these lists of lists into a sorted list with distinct elements to form the query for `pred_C21`.
* Another network call, depending upon where `pred_C21` lies, would again give us a list of lists of
UIDs / values.

If the query was run via the HTTP interface `/query`, this subgraph gets converted into JSON for
replying to the client. If the query was run via the [gRPC](https://www.grpc.io/) interface using
the language [clients]({{< relref "clients/index.md" >}}), the subgraph gets converted to
[protocol buffer](https://developers.google.com/protocol-buffers/) format and then returned to the client.

### Network Calls
Compared to RAM or SSD access, network calls are slow.
Dgraph minimizes the number of network calls required to execute queries. As explained above, the
data sharding is done based on `predicate`, not `entity`. Thus, even if we have a large set of
intermediate results, they'd still only increase the payload of a network call, not the number of
network calls itself. In general, the number of network calls done in Dgraph is directly proportional
to the number of predicates in the query, or the complexity of the query, not the number of
intermediate or final results.

In the above example, we have eight predicates, and so, including a call to convert to UID, we'll
have at most nine network calls. The total number of entity results could be in millions.
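
To illustrate the claim that network calls grow with the number of predicates rather than the number of results, here is a hedged Go sketch of one level of query execution: one lookup per predicate, with the whole list of source UIDs carried in the payload of that single call. The `lookup` function and its signature are invented for illustration.

```
package main

import "fmt"

// lookup stands in for one network call to the group serving a predicate:
// it takes the full list of source UIDs in its payload and returns one
// sorted UID list per source UID. (Illustrative only.)
func lookup(pred string, srcUIDs []uint64) [][]uint64 {
	out := make([][]uint64, len(srcUIDs))
	for i := range srcUIDs {
		out[i] = []uint64{} // real results would come from the posting list
	}
	return out
}

// resolveLevel issues one call per predicate at this level of the query,
// regardless of how many UIDs are in srcUIDs.
func resolveLevel(preds []string, srcUIDs []uint64) (calls int) {
	for _, p := range preds {
		_ = lookup(p, srcUIDs)
		calls++
	}
	return calls
}

func main() {
	// First level of the example query: pred_A, pred_B, pred_C for uid 0x1.
	calls := resolveLevel([]string{"pred_A", "pred_B", "pred_C"}, []uint64{0x1})
	fmt.Println(calls) // 3, however large the UID payload gets
}
```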
### Worker
In the Queries section, you saw how calls were made to query for `(predicate, uids)`. All those
network calls / local processing are done via workers. Each server exposes a
[gRPC](https://www.grpc.io) interface, which can then be called by the query processor to retrieve data.

### Worker Pool
A Worker Pool is just a pool of open TCP connections which can be reused by multiple goroutines.
This avoids having to create a new connection every time a network call needs to be made.

### Protocol Buffers
All data in Dgraph that is stored or transmitted is first converted into byte arrays through
serialization using [Protocol Buffers](https://developers.google.com/protocol-buffers/). When
the result is to be returned to the user, the protocol buffer object is traversed, and the JSON
object is formed.

## Minimizing network calls explained

To explain how Dgraph minimizes network calls, let's start with an example query we should be able
to run.

*Find all posts liked by friends of friends of mine over the last year, written by a popular author X.*

### SQL/NoSQL
In a distributed SQL/NoSQL database, this would require you to retrieve a lot of data.

Method 1:

* Find all the friends (~ 338
[friends](http://www.pewresearch.org/fact-tank/2014/02/03/6-new-facts-about-facebook/)).
* Find all their friends (~ 338 * 338 = ~114,000 people).
* Find all the posts liked by these people over the last year (resulting set in millions).
* Intersect these posts with posts authored by person X.

Method 2:

* Find all posts written by popular author X over the last year (possibly thousands).
* Find all people who liked those posts (easily millions) `result set 1`.
* Find all your friends.
* Find all their friends `result set 2`.
* Intersect `result set 1` with `result set 2`.

Both of these approaches would result in a lot of data going back and forth between the database and
the application; they would be slow to execute, or would require you to run an offline job.

### Dgraph
This is how it would run in Dgraph:

* Node X contains the posting list for predicate `friends`.
* Seek to the caller's userid in Node X **(1 RPC)**. Retrieve a list of friend uids.
* Do multiple seeks for each of the friend uids, to generate a list of friends-of-friends uids. `result set 1`
* Node Y contains the posting list for predicate `posts_liked`.
* Ship result set 1 to Node Y **(1 RPC)**, and do seeks to generate a list of all posts liked by
result set 1. `result set 2`
* Node Z contains the posting list for predicate `author`.
* Ship result set 2 to Node Z **(1 RPC)**. Seek to author X, and generate a list of posts authored
by X. `result set 3`
* Intersect the two sorted lists, `result set 2` and `result set 3` (a minimal sketch of this step follows below). `result set 4`
* Node N contains names for all uids.
* Ship `result set 4` to Node N **(1 RPC)**, and convert uids to names by doing multiple seeks. `result set 5`
* Ship `result set 5` back to the caller.

In 4-5 RPCs, we have figured out all the posts liked by friends of friends, written by popular author X.

This design allows for vast scalability with consistent production-level latencies,
supporting complicated queries that require deep joins.
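
The walkthrough above intersects two sorted UID lists (`result set 2` and `result set 3`). Because posting lists keep UIDs sorted, that intersection is a single linear merge; here is a minimal sketch, with made-up UIDs:

```
package main

import "fmt"

// intersectSorted returns the UIDs present in both sorted lists, walking
// each list once -- the same idea used to intersect the result sets above.
func intersectSorted(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i, j = i+1, j+1
		}
	}
	return out
}

func main() {
	postsLikedByFoF := []uint64{3, 7, 9, 12, 40}  // result set 2
	postsByAuthorX := []uint64{7, 12, 15, 40, 99} // result set 3
	fmt.Println(intersectSorted(postsLikedByFoF, postsByAuthorX)) // [7 12 40] -> result set 4
}
```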
## RAFT

This section aims to explain the RAFT consensus algorithm in simple terms. The idea is to give you
just enough to understand the basic concepts, without going into explanations about why it works
correctly. For a detailed explanation of RAFT, please read the original thesis paper by
[Diego Ongaro](https://github.com/ongardie/dissertation).

### Term
Each election cycle is considered a **term**, during which there is a single leader
*(just like in a democracy)*. When a new election starts, the term number is increased. This is
straightforward and obvious but is a critical factor for the correctness of the algorithm.

In rare cases, if no leader could be elected within an `ElectionTimeout`, that term can end without
a leader.

### Server States
Each server in the cluster can be in one of the following three states:

* Leader
* Follower
* Candidate

Generally, the servers are in leader or follower state. When the leader crashes or the communication
breaks down, the followers will wait for an election timeout before converting to candidates. The
election timeout is randomized, which allows one of them to declare candidacy before the others.
The candidate votes for itself and waits for the majority of the cluster to vote for it as well.
If a follower hears from a candidate with a higher term than the current (*dead in this case*) leader,
it votes for it. The candidate who gets the majority of votes wins the election and becomes the leader.

The leader then tells the rest of the cluster about the result (<tt>Heartbeat</tt>
[Communication]({{< relref "#communication" >}})), and the other candidates become followers again.
The cluster then goes back to the leader-follower model.

A leader could revert to being a follower without an election, if it finds another leader in the
cluster with a higher [Term]({{< relref "#term" >}}). This might happen in rare cases (network partitions).

### Communication
There is unidirectional RPC communication from leader to followers. The followers never ping the
leader. The leader sends `AppendEntries` messages to the followers with logs containing state
updates. When the leader sends `AppendEntries` with zero logs, that's considered a
<tt>Heartbeat</tt>. The leader sends all followers <tt>Heartbeats</tt> at regular intervals.

If a follower doesn't receive a <tt>Heartbeat</tt> for the `ElectionTimeout` duration (generally between
150ms and 300ms), it converts its state to candidate (as mentioned in [Server States]({{< relref "#server-states" >}})).
It then requests votes by sending a `RequestVote` call to the other servers. Again, if it gets the
majority of votes, the candidate becomes the leader. On becoming the leader, it sends <tt>Heartbeats</tt>
to all other servers to establish its authority *(Cartman style, "Respect my authoritah!")*.

Every communication request contains a term number. If a server receives a request with a stale term
number, it rejects the request.

Raft believes in retrying RPCs indefinitely.

### Log Entries
Log Entries are numbered sequentially and contain a term number. An entry is considered **committed** if
it has been replicated to a majority of the servers.

On receiving a client request, the leader does four things (aka Log Replication):

* Appends the entry to its log and persists it.
* Issues `AppendEntries` in parallel to the other servers.
* On majority replication, considers the entry committed and applies it to its state machine.
* Notifies followers that the entry is committed so that they can apply it to their state machines.

A leader never overwrites or deletes its entries. There is a guarantee that if an entry is committed,
all future leaders will have it. A leader can, however, force-overwrite the followers' logs, so that
they match the leader's log *(elected democratically, but got a dictator)*.
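
A small sketch of the commit rule from the log replication steps above: counting its own persisted append plus follower acknowledgements, the leader considers an entry committed once a strict majority of the cluster has it. The function below is illustrative, not taken from Dgraph's Raft implementation.

```
package main

import "fmt"

// committed reports whether an entry replicated to `acks` servers
// (counting the leader itself) is committed in a cluster of `size` servers.
func committed(acks, size int) bool {
	return acks > size/2 // strict majority
}

func main() {
	// 5-server cluster: the leader appends locally (1) and hears back from
	// two followers (total 3) -- that's a majority, so the entry is
	// committed and can be applied to the state machine.
	fmt.Println(committed(3, 5)) // true
	fmt.Println(committed(2, 5)) // false: keep retrying AppendEntries
}
```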
### Voting
Each server persists its current term and vote, so it doesn't end up voting twice in the same term.
On receiving a `RequestVote` RPC, the server denies its vote if its log is more up-to-date than the
candidate's. It would also deny a vote if a minimum `ElectionTimeout` hasn't passed since the last
<tt>Heartbeat</tt> from the leader. Otherwise, it gives a vote and resets its `ElectionTimeout` timer.

The up-to-date property of logs is determined as follows:

* Term number comparison
* Index number or log length comparison

{{% notice "tip" %}}To understand the above sections better, you can see this
[interactive visualization](http://thesecretlivesofdata.com/raft).{{% /notice %}}

### Cluster membership
Raft only allows single-server changes, i.e. only one server can be added or deleted at a time.
This is achieved by cluster configuration changes. Cluster configurations are communicated using
special entries in `AppendEntries`.

The significant difference in how cluster configuration changes are applied, compared to how typical
[Log Entries]({{< relref "#log-entries" >}}) are applied, is that the followers don't wait for a
commitment confirmation from the leader before enabling the new configuration.

A server can respond to both `AppendEntries` and `RequestVote` without checking the current
configuration. This mechanism allows new servers to participate without officially being part of
the cluster. Without this feature, new servers could never join the cluster.

When a new server joins, it won't have any logs, and they need to be streamed. To ensure cluster
availability, Raft allows this server to join the cluster as a non-voting member. Once it's caught
up, voting can be enabled. This also allows the cluster to remove this server in case it's too slow
to catch up, before it is given voting rights *(sort of like getting a green card to allow assimilation
before citizenship is awarded, providing voting rights)*.


{{% notice "tip" %}}If you want to add a few servers and remove a few servers, do the addition
before the removal. To bootstrap a cluster, start with one server to allow it to become the leader,
and then add servers to the cluster one-by-one.{{% /notice %}}

### Log Compaction
One of the ways to compact logs is snapshotting. As soon as the state machine is synced to disk, the
logs can be discarded.

### Clients
Clients must locate the cluster to interact with it. Various approaches can be used for discovery.

A client can randomly pick any server in the cluster. If the server isn't the leader, the request
should be rejected, and the leader information passed along. The client can then re-route its query
to the leader. Alternatively, the server can proxy the client's request to the leader.

When a client first starts up, it can register itself with the cluster using the `RegisterClient` RPC.
This creates a new client id, which is used for all subsequent RPCs.
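
A minimal sketch of the client-side discovery described above: pick any server, and if it isn't the leader, re-route to the leader it points you at. The `reply` type and `send` function are invented for illustration.

```
package main

import "fmt"

// reply is what a contacted server might answer: either it handled the
// request, or it points the client at the current leader. (Illustrative.)
type reply struct {
	ok         bool
	leaderAddr string
}

// send pretends to deliver a request; only the leader accepts it.
func send(addr, leaderAddr string) reply {
	if addr == leaderAddr {
		return reply{ok: true}
	}
	return reply{ok: false, leaderAddr: leaderAddr}
}

func main() {
	servers := []string{"alpha1", "alpha2", "alpha3"}
	leader := "alpha3"

	addr := servers[0] // the client starts by picking any server
	for {
		r := send(addr, leader)
		if r.ok {
			fmt.Println("request served by", addr)
			return
		}
		addr = r.leaderAddr // rejected: re-route to the leader we were told about
	}
}
```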
### Linearizable Semantics

Servers must filter out duplicate requests. They can do this via session tracking, where they use
the client id and a request UID set by the client to avoid reprocessing duplicate requests.
RAFT also suggests storing responses along with the request UIDs, to reply in case a duplicate
request is received.

Linearizability requires the result of a read to reflect the latest committed write.
Serializability, on the other hand, allows stale reads.

### Read-only queries

To ensure linearizability of read-only queries run via the leader, the leader must take these steps:

* The leader must have at least one committed entry in its term. This ensures it is up to date.
*(C'mon! Now that you're in power, do something at least!)*
* The leader stores its latest commit index.
* The leader sends <tt>Heartbeats</tt> to the cluster and waits for ACKs from a majority. Now it knows
that it's still the leader. *(No successful coup. Yup, still the democratically elected dictator I was before!)*
* The leader waits for its state machine to advance to the readIndex.
* The leader can now run the queries against its state machine and reply to clients.

Read-only queries can also be serviced by followers to reduce the load on the leader. But this
could lead to stale results unless the follower confirms that its leader is still the real leader
(there could be a network partition).
To do so, it would have to send a query to the leader, and the leader would have to do steps 1-3.
Then the follower can do steps 4-5.

Read-only queries would have to be batched up, and then RPCs would have to go to the leader for each
batch, who in turn would have to send further RPCs to the whole cluster. *(This is not scalable
without considerable optimizations to deal with latency.)*

**An alternative approach** would be to have the servers return the index corresponding to their
state machine. The client can then keep track of the maximum index it has received from replies so
far, and pass it along to the server with the next request. If a server's state machine hasn't
reached the index provided by the client, it will not service the request. This approach avoids
inter-server communication and is a lot more scalable. *(This approach does not guarantee
linearizability, but should converge quickly to the latest write.)*
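
A hedged sketch of the alternative approach: the client remembers the largest applied index it has seen in replies and sends it with its next read, and a server that hasn't advanced to that index refuses to serve. This mirrors the `linread` map described at the top of this page; the names below are illustrative.

```
package main

import "fmt"

// serveRead applies the alternative approach: a server only answers if its
// state machine has advanced at least to the index the client last saw.
func serveRead(serverApplied, clientSeen uint64) (served bool, newClientSeen uint64) {
	if serverApplied < clientSeen {
		return false, clientSeen // not caught up: the client should wait or try elsewhere
	}
	return true, serverApplied // the reply also tells the client this server's applied index
}

func main() {
	var clientSeen uint64 // max applied index seen in replies so far

	// First read goes to a server whose state machine is at index 10.
	ok, clientSeen := serveRead(10, clientSeen)
	fmt.Println(ok, clientSeen) // true 10

	// The next read hits a lagging replica at index 8: it must not serve.
	ok, _ = serveRead(8, clientSeen)
	fmt.Println(ok) // false
}
```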