============================================================================
Camlistore: Content-Addressable Multi-Layer, Indexed Store
============================================================================

This file contains old design notes. They're correct in spirit, but shouldn't
be considered authoritative.

See http://camlistore.org/docs/


-=-=-=-=-=-=-=-=-=-=-=-=-=-
Design goals:
-=-=-=-=-=-=-=-=-=-=-=-=-=-

* Content storage & indexing & backup system
* No master node
* Anything can sync any which way, in any directed graph (cycles or not)
  (phone -> personal server <-> home machine <-> amazon <-> google, etc)
* No sync state, and no races or arguments over the latest version
* Future-proof
* Very obvious/intuitive schema (easy to recover in the future, even
  if all docs/notes about Camlistore are lost, or the recoverer, five
  decades after I die, doesn't even know that Camlistore was being
  used....) It should be easy for future digital archaeologists to grok.

-=-=-=-=-=-=-=-=-=-=-=-=-=-
Design assumptions:
-=-=-=-=-=-=-=-=-=-=-=-=-=-

* disk is cheap and getting cheaper
* bandwidth is high and getting faster
* plentiful CPU & compression will fix size & redundancy of metadata

-=-=-=-=-=-=-=-=-=-=-=-=-=-
Layer 1:
-=-=-=-=-=-=-=-=-=-=-=-=-=-

* content-addressable blobs only
  - no notion of "files", filenames, dates, streams, encryption,
    permissions, metadata.
* immutable
* only operations:
  - store(digest, bytes)
  - check(digest) => bool (have it or not)
  - get(digest) => bytes
  - list([start_digest]) => [(digest[, size]), ...]+
* amenable to implementation on ordinary filesystems (e.g. ext3, vfat,
  ntfs) or on Amazon S3, BigTable, AppEngine Datastore, Azure, Hadoop
  HDFS, etc.
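
The four Layer 1 operations can be sketched as a toy in-memory store. This
is only an illustration, not Camlistore's implementation: the class name
MemoryBlobStore, the digest_of helper, the digest check inside store(), and
the choice to make list() start at (and include) start_digest are all
assumptions made for the example.

```python
import hashlib


class MemoryBlobStore:
    """Toy in-memory store offering only the four Layer 1 operations."""

    def __init__(self):
        self._blobs = {}  # digest string -> immutable bytes

    @staticmethod
    def digest_of(data):
        # Hypothetical helper: the hash function is named in the digest.
        return "sha1-" + hashlib.sha1(data).hexdigest()

    def store(self, digest, data):
        # Assumption: reject bytes that do not hash to the claimed digest.
        if self.digest_of(data) != digest:
            raise ValueError("digest does not match bytes")
        # Blobs are immutable, so storing an existing digest changes nothing.
        self._blobs.setdefault(digest, bytes(data))

    def check(self, digest):
        # True if this store has the blob, False otherwise.
        return digest in self._blobs

    def get(self, digest):
        # Raises KeyError if the blob is not present.
        return self._blobs[digest]

    def list(self, start_digest=""):
        # (digest, size) pairs in sorted order, from start_digest onward.
        return [(d, len(b)) for d, b in sorted(self._blobs.items())
                if d >= start_digest]
```

Because the store is keyed purely by digest and never overwrites, the same
four methods could in principle sit in front of a directory of files or a
cloud bucket without changing callers.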

-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-
Schema of files/objects in Layer 1:
-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-

* Let's start by describing the storage of files that aren't self-describing,
  e.g. "some-notes.txt" (as opposed to a jpg file from a camera, which
  likely contains EXIF data, addressed later...). This file, for reference,
  is in doc/json-signing/example/some-notes.txt

* The bytes of file "some-notes.txt" are stored as-is in one blob,
  addressed as "sha1-8ba9e53cbc83c1be3835b94a3690c3b03de0b522".
  (note the explicit naming of the hash function as part of the name,
  for upgradability later, and so all parties involved know how to
  verify it...)

* The filename, stat(2) metadata (modtime, ctime, permissions, etc) now
  also need to be stored. The key design point here is that file
  metadata is ALSO just a blob, content-addressed. The blob is a JSON
  file (for human readability, compactness). XML and Protocol Buffers
  were both also considered, but the former is too redundant, bloaty,
  tree-ish (overkill) and out of vogue, while Protocol Buffers don't
  stand up to the human readable future digital archaeologist test,
  and they're also not self-describing with the proto schema declared
  in-line.

  This file would thus be represented by a JSON file, as seen in
  docs/json-signing/example/some-notes.txt.camli, and addressed as
  "sha1-7e7960756b39cd7da614e7edbcf1fa7d696eb660", its sha1sum. This identifier
  can be used in directory listings, etc. Note that camli files do not have any
  magical filename, as they're not typically stored with their filename. (they
  are in the doc/json-signing/examples/ directory just to separate them out, but
  that's a rare case.) Instead, a camli JSON object is known as such if the
  bytes of the file begin exactly with the bytes:

      {"camliVersion"

  ... which lets upper layers know what it is, and how to index it.
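
  The two conventions above (hash name embedded in the blob name, and
  recognizing camli JSON by its leading bytes) can be sketched roughly as
  follows. The function names blob_ref and is_camli_json are invented for
  this example, and the sample JSON body is a placeholder rather than a
  real schema object.

```python
import hashlib

# A camli JSON object must begin with exactly these bytes.
CAMLI_PREFIX = b'{"camliVersion"'


def blob_ref(data, hash_name="sha1"):
    # Hypothetical helper: "<hash name>-<hex digest>", so the name itself
    # says which function a reader should use to verify the bytes.
    return hash_name + "-" + hashlib.new(hash_name, data).hexdigest()


def is_camli_json(data):
    # Exact prefix match: leading whitespace or a byte-order mark fails.
    return data.startswith(CAMLI_PREFIX)


notes = b"plain file contents\n"
meta = b'{"camliVersion": 1}'  # placeholder body, not a real schema blob

print(blob_ref(notes), is_camli_json(notes))
print(blob_ref(meta), is_camli_json(meta))
```

  Both kinds of blob are named the same way; only the prefix test tells an
  upper layer that one of them is metadata worth indexing.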

  See the doc/schema/ directory for details on Camli JSON objects and their
  schema.

* Note that camli files can represent:

  -- files
  -- directories
  -- trees/snapshots (git-style)
  -- tags on other objects
  -- stars/ratings on other objects
  -- deletion claims/requests (since everything is immutable, you can
     only request a deletion, and wait/hope for GC later...)
  -- signed statements/claims on other objects
     (think decentralized commenting/starring on the web,
     verifying claims with webfinger lookups to find
     public keys to verify signatures)
  -- references to encrypted/split files
  -- etc... (extensible over time)

-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-
Syncing
-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-

-- nodes can push/pull between storage layers without thought. No
   chance of overwriting stuff.

-- the assumption is that users control and trust and secure all their
   storage nodes: e.g. your phone, your home server, your internet
   server, your Amazon S3 node, your App Engine appid / datastore
   instance, etc.

-- users configure which nodes push/pull to which other nodes, forming
   their own sync topology. For instance, your phone may not need a
   full copy of all content you've ever saved/produced... its primary
   goal in life is probably to quickly push out any unique content it
   produces (e.g. photos) to another machine for backup. And maybe
   cache other recently-accessed content locally, but not worry about
   it being destroyed when you drop and break your phone.

-- no encryption is assumed at the Camli storage layer, though you may
   run a Camli storage node on an encrypted filesystem or blockdevice.
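
Why pushing needs no sync state can be seen in a small sketch. It assumes
each storage node is modeled as a plain dict from digest to bytes, standing
in for the store/check/get/list operations; the names blob_ref and push are
invented for the example and are not Camlistore APIs.

```python
import hashlib


def blob_ref(data):
    # Hypothetical helper: content-derived name for a blob.
    return "sha1-" + hashlib.sha1(data).hexdigest()


def push(src, dst):
    """Copy every blob that dst lacks from src; return the digests copied."""
    copied = []
    for digest in sorted(src):         # stands in for list()
        if digest not in dst:          # stands in for check(digest)
            dst[digest] = src[digest]  # stands in for get() then store()
            copied.append(digest)
    return copied


phone = {blob_ref(b"photo"): b"photo"}
server = {blob_ref(b"notes"): b"notes"}

push(phone, server)  # phone backs up its unique content
push(server, phone)  # optional: pull the rest back the other way
print(phone == server)
```

Since a digest always names the same bytes, a push can only add blobs and
never overwrite one, so running it again, or in any direction around a
cycle of nodes, is harmless.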

-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-
Indexing Layer
-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-

* scans/mapreduces over all blobs, provides higher-level APIs to list
  objects, list directories, see snapshots of trees at points in time,
  traverse graphs of objects (reverse indexing e.g. tags/stars/claims
  object<->object)

* ... TODO: document

-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-
Mid layer
-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-

* It'll often be the case that a client (e.g. your phone) knows about
  a file (e.g. a photo) and has its metadata, but doesn't have its raw
  JPEG blob bytes, which might be several MB, and slow to transfer
  over a wireless connection. Camli storage nodes may also declare
  their support for helper APIs for when the client knows/assumes the
  type of a given blob.

  In addition to the operations in layer 1 above, you could also assume
  most Camli storage nodes would support an API such as:

     getThumbnail(blobName, [ ... sizeParams .. ]) -> JPEG thumbnail

  .. which would make mobile content browsers' lives easier.


TODO: finish documenting