github.com/jbendotnet/noms@v0.0.0-20190904222105-c43e4293ea92/doc/intro.md

github.com/jbendotnet/noms@v0.0.0-20190904222105-c43e4293ea92/doc/intro.md (about)

     1  [Home](../README.md) »
     2  
     3  **Technical Overview**&nbsp; | &nbsp;[Use Cases](../README.md#use-cases)&nbsp; | &nbsp;[Command-Line Interface](cli-tour.md)&nbsp; | &nbsp;[Go bindings Tour](go-tour.md) | &nbsp;[Path Syntax](spelling.md)&nbsp; | &nbsp;[FAQ](faq.md)&nbsp;
     4  <br><br>
     5  # Noms Technical Overview
     6  
     7  Most conventional database systems share two central properties:
     8  
     9  1. Data is modeled as a single point-in-time. Once a transaction commits, the previous state of the database is either lost, or available only as a fallback by reconstructing from transaction logs.
    10  
    11  2. Data is modeled as a single source of truth. Even large-scale distributed databases which are internally a fault-tolerant network of nodes, present the abstraction to clients of being a single logical master, with which clients must coordinate in order to change state.
    12  
    13  Noms blends the properties of decentralized systems, such as [Git](https://git-scm.com/), with properties of traditional databases in order to create a general-purpose decentralized database, in which:
    14  
    15  1. Any peer’s state is as valid as any other.
    16  
    17  2. All commits of the database are retained and available at any time.
    18  
    19  3. Any peer is free to move forward independently of communication from any other—while retaining the ability to reconcile changes at some point in the future.
    20  
    21  4. The basic properties of structured databases (efficient queries, updates, and range scans) are retained.
    22  
    23  5. Diffs between any two sets of data can be computed efficiently.
    24  
    25  6. Synchronization between disconnected copies of the database can be performed efficiently and correctly.
    26  
    27  ## Basics
    28  
    29  As in Git, [Bitcoin](https://bitcoin.org/en/), [Ethereum](https://www.ethereum.org/), [IPFS](https://ipfs.io/), [Camlistore](https://camlistore.org/), [bup](https://bup.github.io/), and other systems, Noms models data as a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of nodes in which every node has a _hash_. A node's hash is derived from the values encoded in the node and (transitively) from the values encoded in all nodes which are reachable from that node.
    30  
    31  In other words, a Noms database is a single large [Merkle DAG](https://github.com/jbenet/random-ideas/issues/20).
    32  
    33  When two nodes have the same hash, they represent identical logical values and the respective subgraph of nodes reachable from each are topologically equivalent. Importantly, in Noms, the reverse is also true: a single logical value has one and only one hash. When two nodes have differnet hashes, they represent different logical values.
    34  
    35  Noms extends the ideas of prior systems to enable efficiently computing and reconciling differences, synchronizing state, and building indexes over large-scale, structured data.
    36  
    37  ## Databases and Datasets
    38  
    39  A _database_ is the top-level abstraction in Noms.
    40  
    41  A database has two responsibilities: it provides storage of [content-addressed](https://en.wikipedia.org/wiki/Content-addressable_storage) chunks of data, and it keeps track of zero or more _datasets_.
    42  
    43  A Noms database can be implemented on top of any underlying storage system that provides key/value storage with at least optional optimistic concurrency. We only use optimistic concurrency to store the current value of each dataset. Chunks themselves are immutable.
    44  
    45  We have implementations of Noms databases on top of our own file-backed store [Noms Block Store (NBS)](https://github.com/attic-labs/noms/tree/master/go/nbs) (usually used locally), our own [HTTP protocol](https://github.com/attic-labs/noms/blob/master/go/datas/database_server.go) (used for working with a remote database), [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), and [memory](https://github.com/attic-labs/noms/blob/master/go/chunks/memory_store.go) (mainly used for testing).
    46  
    47  Here's an example of creating an http-backed database using the [Go Noms SDK](go-tour.md):
    48  
    49  ```go
    50  package main
    51  
    52  import (
    53    "fmt"
    54    "os"
    55  
    56    "github.com/attic-labs/noms/go/spec"
    57  )
    58  
    59  func main() {
    60    sp, err := spec.ForDatabase("http://localhost:8000")
    61    if err != nil {
    62      fmt.Fprintf(os.Stderr, "Could not access database: %s\n", err)
    63      return
    64    }
    65    defer sp.Close()
    66  }
    67  ```
    68  
    69  A dataset is nothing more than a named pointer into the DAG. Consider the following command to copy the dataset named `foo` to the dataset named `bar` within a database:
    70  
    71  ```shell
    72  noms sync http://localhost:8000::foo http://localhost:8000::bar
    73  ```
    74  
    75  This command is trivial and causes basically zero IO. Noms first resolves the dataset name `foo` in `http://localhost:8000`. This results in a hash. Noms then checks whether that hash exists in the destination database (which in this case is the same as the source database), finds that it does, and then adds a new dataset pointing at that chunk.
    76  
    77  Syncs across database can be efficient by the same logic if the destination database already has all or most of the chunks required chunks.
    78  
    79  ## Time
    80  
    81  All data in Noms is immutable. Once a piece of data is stored, it is never changed. To represent state changes, Noms uses a progression of `Commit`  structures.
    82  
    83  [TODO - diagram]
    84  
    85  As in Git, Commits typically have one _parent_, which is the previous commit in time. But in the cases of merges, a Noms commit can have multiple parents.
    86  
    87  ### Chunks
    88  
    89  When a value is stored in Noms, it is stored as one or more chunks of data.  Chunk boundaries are typically created implicitly, as a way to store large collections efficiently (see [Prolly Trees](#prolly-trees-probabilistic-b-trees)). Programmers can also create explicit chunk boundaries using the `Ref` type (see [Types](#types )).
    90  
    91  [TODO - Diagram]
    92  
    93  Every chunk encodes a single logical value (which may be a component of another value and/or be composed of sub-values). Chunks are [addressed](https://en.wikipedia.org/wiki/Content-addressable_storage) in the Noms persistence layer by the hash of the value they encode.
    94  
    95  ## Types
    96  
    97  Noms is a typed system, meaning that every Noms value is classified into one of the following _types_:
    98  
    99  * `Boolean`
   100  * `Number` (arbitrary precision binary)
   101  * `String` (utf8-encoded)
   102  * `Blob` (raw binary data)
   103  * `Set<T>`
   104  * `List<T>`
   105  * `Map<K,V>`
   106  * Unions: `T|U|V|...`
   107  * `Ref<T>` (explicit out-of-line references)
   108  * `Struct` (user-defined record types, e.g., `Struct Person { name: String, age?: Number })`
   109  * `Type` (A value that stores a Noms type)
   110  
   111  Blobs, sets, lists, and maps can be gigantic - Noms will _chunk_ these types into reasonable sized parts internally for efficient storage, searching, and updating (see [Prolly Trees](#prolly-trees-probabilistic-b-trees) below for more on this).
   112  
   113  Strings, numbers, unions, and structs are not chunked, and should be used for "reasonably-sized" values. Use `Ref` if you need to force a particular value to be in a different chunk for some reason.
   114  
   115  Types serve several purposes in Noms:
   116  
   117  1. Most importantly, types make Noms data self-describing. You can use the `types.TypeOf` function on any Noms `Value`, no matter how large, and get a very precise description of the entire value and all values reachable from it. This allows software to interoperate without prior agreement or planning.
   118  
   119  2. Users of Noms can define their own structures and publish data that uses them. This allows for ad-hoc standardization of types within communities working on similar data.
   120  
   121  3. Types can be used _structurally_. A program can check incoming data against a required type. If the incoming root chunk matches the type, or is a superset of it, then the program can proceed with certainty of the shape of all accessible data. This enables richer interoperability between software, since schemas can be expanded over time as long as a compatible subset remains.
   122  
   123  4. Eventually, we plan to add type restrictions to datasets, which would enforce the allowed types that can be committed to a dataset. This would allow something akin to schema validation in traditional databases.
   124  
   125  ### Refs vs Hashes
   126  
   127  A _hash_ in Noms is just like the hashes used elsewhere in computing: a short string of bytes that uniquely identifies a larger value. Every value in Noms has a hash. Noms currently uses the [sha2-512](https://github.com/attic-labs/noms/blob/master/go/hash/hash.go#L7) hash function, but that can change in future versions of the system.
   128  
   129  A _ref_ is different in subtle, but important ways. A `Ref` is a part of the type system - a `Ref` is a value. Anywhere you can find a Noms value, you can find a `Ref`. For example, you can commit a `Ref<T>` to a dataset, but you can't commit a bare hash.
   130  
   131  The difference is that `Ref` carries the type of its target, along with the hash. This allows us to efficiently validate commits that include `Ref`, among other things.
   132  
   133  ### Type Accretion
   134  
   135  Noms is an immutable database, which leads to the question: How do you change the schema? If I have a dataset containing `Set<Number>`, and I later decide that it should be `Set<String>`, what do I do?
   136  
   137  You might say that you just commit the new type, but that would mean that users can't look at a dataset and understand what types previous versions contained, without manually exploring every one of those commits.
   138  
   139  We call our solution to this problem _Type Accretion_.
   140  
   141  If you construct a `Set` containing only `Number`s, its type will be `Set<Number>`. If you then insert a string into this set, the type of the resulting value is `Set<Number|String>`.
   142  
   143  This is usually completely implicit, done based on the data you store (you can set types explicitly though, which is useful in some cases).
   144  
   145  We do the same thing for datasets. If you commit a `Set<Number>`, the type of the commit we create for you is:
   146  
   147  ```go
   148  Struct Commit {
   149  	Value: Set<Number>
   150  	Parents: Set<Ref<Cycle<Commit>>>
   151  }
   152  ```
   153  
   154  This tells you that the current and all previous commits have values of type `Set<Number>`.
   155  
   156  But if you then commit a `Set<String>` to this same dataset, then the type of that commit will be:
   157  
   158  ```go
   159  Struct Commit {
   160  	Value: Set<String>
   161  	Parents: Set<Ref<Cycle<Commit>> |
   162  		Ref<Struct Commit {
   163  			Value: Set<Number>
   164  			Parents: Cycle<Commit>
   165  		}>>
   166  	}
   167  }
   168  ```
   169  
   170  This tells you that the dataset's current commit has a value of type `Set<String>` and that previous commits are either the same, or else have a value of type `Set<Number>`.
   171  
   172  Type accretion has a number of benefits related to schema changes:
   173  
   174  1. You can widen the type of any container (list, set, map) without rewriting any existing data. `Set<Struct { name: String }>` becomes `Set<Struct { name: String }> | Struct { name: String, age: Number }>>` and all existing data is reused.
   175  2. You can widen containers in ways that other databases wouldn't allow. For example, you can go from `Set<Number>` to `Set<Number|String>`. Existing data is still reused.
   176  3. You can change the type of a dataset in either direction - either widening or narrowing it, and the dataset remains self-documenting as to its current and previous types.
   177  
   178  ## Prolly Trees: Probabilistic B-Trees
   179  
   180  A critical invariant of Noms is that the same value will be represented by the same graph, having the same chunk boundaries, regardless of what past sequence of logical mutations resulted in the value. This is the essence of content-addressing and it is what makes deduplication, efficient sync, indexing, and and other features of Noms possible.
   181  
   182  But this invariant also rules out the use of classical B-Trees, because a B-Tree’s internal state depends upon its mutation history. In order to model large mutable collections in Noms, of the type where B-Trees would typically be used, Noms instead introduces _Prolly Trees_.
   183  
   184  A Prolly Tree is a [search tree](https://en.wikipedia.org/wiki/Search_tree) where the number of values stored in each node is determined probabilistically, based on the data which is stored in the tree.
   185  
   186  A Prolly Tree is similar in many ways to a B-Tree, except that the number of values in each node has a probabilistic average rather than an enforced upper and lower bound, and the set of values in each node is determined by the output of a rolling hash function over the values, rather than via split and join operations when upper and lower bounds are exceeded.
   187  
   188  ### Indexing and Searching with Prolly Trees
   189  
   190  Like B-Trees, Prolly Trees are sorted. Keys of type Boolean, Number, and String sort in their natural order. Other types sort by their hash.
   191  
   192  Because of this sorting, Noms collections can be used as efficient indexes, in the same manner as primary and secondary indexes in traditional databases.
   193  
   194  For example, say you want to quickly be able to find `Person` structs by their age. You could build a map of type `Map<Number, Set<Person>>`. This would allow you to quickly (~log<sub>k</sub>(n) seeks, where `k` is average prolly tree width, which is currently 64) find all the people of an exact age. But it would _also_ allow you to find all people within a range of ages efficiently (~num_results/log<sub>k</sub>(n) seeks), even if the ages are non-integral.
   195  
   196  Also, because Noms collections are ordered search trees, it is possible to implement set operations like union and intersect efficiently on them.
   197  
   198  So, for example, if you wanted to find all the people of a particular age AND having a particular hair color, you could construct a second map having type `Map<String, Set<Person>>`, and intersect the two sets.
   199  
   200  Over time, we plan to develop this basic capability into support for some kind of generalized query system.