============================================================================
Camlistore: Content-Addressable Multi-Layer, Indexed Store
============================================================================

This file contains old design notes.  They're correct in spirit, but shouldn't
be considered authoritative.

See http://camlistore.org/docs/


-=-=-=-=-=-=-=-=-=-=-=-=-=-
Design goals:
-=-=-=-=-=-=-=-=-=-=-=-=-=-

* Content storage & indexing & backup system
* No master node
* Anything can sync any which way, in any directed graph (cycles or not)
  (phone -> personal server <-> home machine <-> amazon <-> google, etc)
* No sync state, and no races or arguments over which version is latest
* Future-proof
* Very obvious/intuitive schema: easy to recover in the future, even
  if all docs/notes about Camlistore are lost, or the recoverer, five
  decades after I die, doesn't even know that Camlistore was being
  used.  It should be easy for future digital archaeologists to grok.

-=-=-=-=-=-=-=-=-=-=-=-=-=-
Design assumptions:
-=-=-=-=-=-=-=-=-=-=-=-=-=-

* disk is cheap and getting cheaper
* bandwidth is high and getting faster
* plentiful CPU & compression will fix size & redundancy of metadata

-=-=-=-=-=-=-=-=-=-=-=-=-=-
Layer 1:
-=-=-=-=-=-=-=-=-=-=-=-=-=-

* content-addressable blobs only
  - no notion of "files", filenames, dates, streams, encryption,
    permissions, metadata.
* immutable
* only operations:
  - store(digest, bytes)
  - check(digest) => bool (have it or not)
  - get(digest) => bytes
  - list([start_digest]) => [(digest[, size]), ...]+
* amenable to implementation on ordinary filesystems (e.g. ext3, vfat,
  ntfs) or on Amazon S3, BigTable, AppEngine Datastore, Azure, Hadoop
  HDFS, etc.
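
The four operations above can be sketched in Go roughly as follows.  This
is only an illustration, assuming an in-memory map as the backing store.
The names MemStore, SizedRef and digestOf are hypothetical, and two
behaviours are assumptions that these notes do not specify: Store rejects
bytes that don't match the given digest, and List treats start_digest as
inclusive.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"errors"
	"fmt"
	"sort"
)

// SizedRef pairs a digest with the size of its blob, as returned by List.
type SizedRef struct {
	Digest string
	Size   int
}

// MemStore is a hypothetical in-memory Layer 1 node: digest -> bytes.
type MemStore struct {
	blobs map[string][]byte
}

func NewMemStore() *MemStore {
	return &MemStore{blobs: make(map[string][]byte)}
}

// digestOf names the hash function in the digest itself ("sha1-<hex>").
func digestOf(b []byte) string {
	sum := sha1.Sum(b)
	return "sha1-" + hex.EncodeToString(sum[:])
}

// Store saves b under digest. Rejecting a mismatched digest is an assumption.
func (s *MemStore) Store(digest string, b []byte) error {
	if digestOf(b) != digest {
		return errors.New("digest does not match content")
	}
	if _, ok := s.blobs[digest]; !ok { // immutable: never overwrite
		s.blobs[digest] = append([]byte(nil), b...)
	}
	return nil
}

// Check reports whether the node has the blob.
func (s *MemStore) Check(digest string) bool {
	_, ok := s.blobs[digest]
	return ok
}

// Get returns the blob's bytes, or false if the node lacks it.
func (s *MemStore) Get(digest string) ([]byte, bool) {
	b, ok := s.blobs[digest]
	return b, ok
}

// List returns (digest, size) pairs in sorted order, from start onward.
func (s *MemStore) List(start string) []SizedRef {
	var out []SizedRef
	for d, b := range s.blobs {
		if d >= start {
			out = append(out, SizedRef{Digest: d, Size: len(b)})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Digest < out[j].Digest })
	return out
}

func main() {
	s := NewMemStore()
	data := []byte("abc")
	d := digestOf(data)
	fmt.Println(s.Store(d, data)) // <nil>
	fmt.Println(s.Check(d))       // true
	fmt.Println(len(s.List("")))  // 1
}
```

Nothing here knows about files or metadata; a filesystem or S3 backend
would only need to provide the same four methods.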

-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-
Schema of files/objects in Layer 1:
-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-

* Let's start by describing the storage of files that aren't self-describing,
  e.g. "some-notes.txt" (as opposed to a jpg file from a camera that
  likely contains EXIF data, addressed later...).  This file, for reference,
  is in doc/json-signing/example/some-notes.txt

* The bytes of file "some-notes.txt" are stored as-is in one blob,
  addressed as "sha1-8ba9e53cbc83c1be3835b94a3690c3b03de0b522".
  (note the explicit naming of the hash function as part of the name,
  for upgradability later, and so all parties involved know how to
  verify it...)

* The filename, stat(2) metadata (modtime, ctime, permissions, etc) now
  also need to be stored.  The key design point here is that file
  metadata is ALSO just a blob, content-addressed.  The blob is a JSON
  file (for human readability, compactness).  XML and Protocol Buffers
  were both also considered, but the former is too redundant, bloaty,
  tree-ish (overkill) and out of vogue, while Protocol Buffers don't
  stand up to the human readable future digital archaeologist test,
  and they're also not self-describing: the proto schema isn't declared
  in-line.

  This file would thus be represented by a JSON file, as seen in
  doc/json-signing/example/some-notes.txt.camli, and addressed as
  "sha1-7e7960756b39cd7da614e7edbcf1fa7d696eb660", its sha1sum. This identifier
  can be used in directory listings, etc. Note that camli files do not have any
  magical filename, as they're not typically stored with their filename. (They
  are in the doc/json-signing/example/ directory just to separate them out, but
  that's a rare case.) Instead, a camli JSON object is known as such if the
  bytes of the file begin exactly with the bytes:

        {"camliVersion"

  ... which lets upper layers know what it is, and how to index it.

  See the doc/schema/ directory for details on Camli JSON objects and their
  schema.

* Note that camli files can represent:

  -- files
  -- directories
  -- trees/snapshots (git-style)
  -- tags on other objects
  -- stars/ratings on other objects
  -- deletion claims/requests (since everything is immutable, you can
     only request a deletion, and wait/hope for GC later...)
  -- signed statements/claims on other objects
     (think decentralized commenting/starring on the web,
      verifying claims with webfinger lookups to find
      public keys to verify signatures)
  -- references to encrypted/split files
  -- etc... (extensible over time)
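
Two rules from this section lend themselves to a short Go sketch: a blob
name carries its hash function, so any party can verify the bytes, and a
camli JSON object is recognized purely by its leading bytes.  The function
names Verify and IsCamliJSON are hypothetical, the sha256 case is an
assumption added only to show how a later hash could be slotted in, and
the sample JSON is a stand-in rather than a real schema object.

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// camliPrefix is the exact byte prefix that marks a camli JSON object.
var camliPrefix = []byte(`{"camliVersion"`)

// IsCamliJSON reports whether blob begins exactly with the camli prefix.
func IsCamliJSON(blob []byte) bool {
	return bytes.HasPrefix(blob, camliPrefix)
}

// Verify reports whether data hashes to ref, where ref is "<hashname>-<hex>".
// The sha256 branch is hypothetical; these notes only use sha1.
func Verify(ref string, data []byte) (bool, error) {
	name, want, ok := strings.Cut(ref, "-")
	if !ok {
		return false, fmt.Errorf("malformed blob name %q", ref)
	}
	var got string
	switch name {
	case "sha1":
		sum := sha1.Sum(data)
		got = hex.EncodeToString(sum[:])
	case "sha256":
		sum := sha256.Sum256(data)
		got = hex.EncodeToString(sum[:])
	default:
		return false, fmt.Errorf("unknown hash function %q", name)
	}
	return got == want, nil
}

func main() {
	// "abc" is stand-in content, not the real some-notes.txt.
	ok, err := Verify("sha1-a9993e364706816aba3e25717850c26c9cd0d89d", []byte("abc"))
	fmt.Println(ok, err) // true <nil>

	fmt.Println(IsCamliJSON([]byte(`{"camliVersion": 1}`)))  // true
	fmt.Println(IsCamliJSON([]byte(` {"camliVersion": 1}`))) // false: leading space
}
```

Because the check is an exact byte prefix, even leading whitespace makes a
blob an ordinary opaque blob as far as upper layers are concerned.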

-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-
Syncing
-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-

-- nodes can push/pull between storage layers without thought.  No
   chance of overwriting stuff.

-- the assumption is that users control and trust and secure all their
   storage nodes: e.g. your phone, your home server, your internet
   server, your Amazon S3 node, your App Engine appid / datastore
   instance, etc.

-- users configure which nodes push/pull to which other nodes, forming
   their own sync topology.  For instance, your phone may not need a
   full copy of all content you've ever saved/produced... its primary
   goal in life is probably to quickly push out any unique content it
   produces (e.g. photos) to another machine for backup.  And maybe
   cache other recently-accessed content locally, but not worry about
   it being destroyed when you drop and break your phone.

-- no encryption is assumed at the Camli storage layer, though you may
   run a Camli storage node on an encrypted filesystem or blockdevice.
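
A one-way push between two nodes can be sketched in Go as below.  This is
a minimal illustration, not the actual sync implementation: the node type
and push function are hypothetical, each node is modelled as a plain map,
and the digests are fake placeholders.  The point is that list, check,
get and store are enough, with no per-peer sync state.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a storage node: digest -> bytes.
type node map[string][]byte

// push copies every blob that src has and dst lacks, and returns how many
// blobs were copied. Blobs are immutable and named by digest, so nothing is
// overwritten and no "latest version" needs to be tracked.
func push(src, dst node) int {
	copied := 0
	for digest, b := range src { // list
		if _, have := dst[digest]; have { // check
			continue
		}
		dst[digest] = b // get from src, store on dst
		copied++
	}
	return copied
}

func main() {
	// Placeholder digests; real ones would be "sha1-" plus 40 hex digits.
	phone := node{"sha1-aaaa": []byte("photo")}
	server := node{"sha1-bbbb": []byte("notes")}

	fmt.Println(push(phone, server)) // 1: the photo reaches the server
	fmt.Println(push(server, phone)) // 1: the notes reach the phone
	fmt.Println(push(phone, server)) // 0: nothing left to copy
}
```

Running push in either direction, in any order and any number of times,
converges on the union of the blobs, which is why cycles in the sync graph
are harmless.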

-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-
Indexing Layer
-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-

* scans/mapreduces over all blobs, provides higher-level APIs to list
  objects, list directories, see snapshots of trees at points in time,
  traverse graphs of objects (reverse indexing e.g. tags/stars/claims
  object<->object)

* ... TODO: document

-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-
Mid layer
-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-

* It'll often be the case that a client (e.g. your phone) knows about
  a file (e.g. a photo) and has its metadata, but doesn't have its raw
  JPEG blob bytes, which might be several MB, and slow to transfer
  over a wireless connection.  Camli storage nodes may also declare
  their support for helper APIs for when the client knows/assumes the
  type of a given blob.

  In addition to the operations in layer 1 above, you could also assume
  most Camli storage nodes would support an API such as:

     getThumbnail(blobName, [ ... sizeParams ... ]) -> JPEG thumbnail

  ... which would make mobile content browsers' lives easier.
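
One way a client could cope with nodes that may or may not offer such a
helper is sketched below in Go.  Everything here is an assumption made for
illustration: the BlobGetter and Thumbnailer interfaces, the width/height
size parameters, and the stub nodes are hypothetical, and the stub does no
real JPEG scaling.

```go
package main

import (
	"errors"
	"fmt"
)

// BlobGetter is the Layer 1 get operation.
type BlobGetter interface {
	Get(digest string) ([]byte, error)
}

// Thumbnailer is a hypothetical optional helper API. A node "declares
// support" simply by implementing it.
type Thumbnailer interface {
	GetThumbnail(blobName string, maxWidth, maxHeight int) ([]byte, error)
}

// fetchPreview asks for a thumbnail when the node supports it and falls
// back to fetching the full blob otherwise.
func fetchPreview(n BlobGetter, blobName string) ([]byte, string, error) {
	if t, ok := n.(Thumbnailer); ok {
		b, err := t.GetThumbnail(blobName, 128, 128)
		return b, "thumbnail", err
	}
	b, err := n.Get(blobName)
	return b, "full blob", err
}

// plainNode supports only Layer 1 get.
type plainNode map[string][]byte

func (p plainNode) Get(digest string) ([]byte, error) {
	b, ok := p[digest]
	if !ok {
		return nil, errors.New("no such blob")
	}
	return b, nil
}

// thumbNode adds the helper API on top of a plainNode.
type thumbNode struct{ plainNode }

func (t thumbNode) GetThumbnail(name string, w, h int) ([]byte, error) {
	if _, err := t.Get(name); err != nil {
		return nil, err
	}
	// A real node would decode and scale the JPEG; this stub returns a marker.
	return []byte(fmt.Sprintf("thumb %dx%d of %s", w, h, name)), nil
}

func main() {
	// Placeholder digest and content.
	blobs := plainNode{"sha1-cccc": []byte("several MB of JPEG")}

	_, kind, _ := fetchPreview(blobs, "sha1-cccc")
	fmt.Println(kind) // full blob

	_, kind, _ = fetchPreview(thumbNode{blobs}, "sha1-cccc")
	fmt.Println(kind) // thumbnail
}
```

Keeping the helper in a separate optional interface leaves the Layer 1
contract untouched for nodes that only store and serve raw blobs.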


TODO: finish documenting