github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/storage/doc.go

// Copyright 2014 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

/*
Package storage provides low-level storage. It interacts with storage
backends (e.g. LevelDB, RocksDB, etc.) via the Engine interface. At
one level higher, MVCC provides multi-version concurrency control
capability on top of an Engine instance.

The Engine interface provides an API for key-value stores. InMem
implements an in-memory engine using a sorted map. RocksDB implements
an engine for data stored to local disk using RocksDB, a variant of
LevelDB.

MVCC provides a multi-version concurrency control system on top of an
engine. MVCC is the basis for Cockroach's support for distributed
transactions. It is intended for direct use from storage.Range
objects.

Notes on MVCC architecture

Each MVCC value contains a metadata key/value pair and one or more
version key/value pairs. The MVCC metadata key is the actual key for
the value, using the util/encoding.EncodeBytes scheme. The MVCC
metadata value is of type MVCCMetadata and contains the most recent
version timestamp and an optional roachpb.Transaction message. If
set, the most recent version of the MVCC value is a transactional
"intent". It also contains some information on the size of the most
recent version's key and value for efficient stat counter
computations. Note that it is not necessary to explicitly store the
MVCC metadata as its contents can be reconstructed from the most
recent versioned value as long as an intent is not present.
The implementation takes advantage of this and deletes the MVCC metadata
when possible.

Each MVCC version key/value pair has a key which is also
binary-encoded, but is suffixed with a decreasing, big-endian encoding
of the timestamp (eight bytes for the nanosecond wall time, followed
by four bytes for the logical time; metadata key/value pairs are the
exception, as their timestamp is implicit). The MVCC version value is
a message of type roachpb.Value. A deletion is indicated by an
empty value. Note that an empty roachpb.Value will encode to
a non-empty byte slice. The decreasing encoding on the timestamp sorts
the most recent version directly after the metadata key, which is
treated specially by the RocksDB comparator (by making the zero
timestamp sort first). This increases the likelihood that an
Engine.Get() of the MVCC metadata will get the same block containing
the most recent version, even if there are many versions. We rely on
getting the MVCC metadata key/value and then using it to directly get
the MVCC version using the metadata's most recent version timestamp.
This avoids using an expensive merge iterator to scan the most recent
version. It also allows us to leverage RocksDB's bloom filters.

The following is an example of the sort order for MVCC key/value pairs:

		...
		keyA: MVCCMetadata of keyA
		keyA_Timestamp_n: value of version_n
		keyA_Timestamp_n-1: value of version_n-1
		...
		keyA_Timestamp_0: value of version_0
		keyB: MVCCMetadata of keyB

The binary encoding used on the MVCC keys allows arbitrary keys to be
stored in the map (no restrictions on intermediate nul bytes, for
example), while still sorting lexicographically and guaranteeing that
all timestamp-suffixed MVCC version keys sort consecutively with the
metadata key.
We use an escape-based encoding which transforms all nul
("\x00") characters in the key and is terminated with the sequence
"\x00\x01", which is guaranteed to not occur elsewhere in the encoded
value. See util/encoding/encoding.go for more details.

We considered inlining the most recent MVCC version in the
MVCCMetadata. This would reduce the storage overhead of storing the
same key twice (which is small due to block compression), and the
runtime overhead of two separate DB lookups. On the other hand, all
writes that create a new version of an existing key would incur a
double write as the previous value is moved out of the MVCCMetadata
into its versioned key. Preliminary benchmarks have not shown enough
performance improvement to justify this change, although we may
revisit this decision if it turns out that multiple versions of the
same key are rare in practice.

However, we do allow inlining in order to use the MVCC interface to
store non-versioned values. It turns out that not everything which
Cockroach needs to store would be efficient or possible using MVCC.
Examples include transaction records, abort span entries, stats
counters, time series data, and system-local config values. However,
supporting a mix of encodings is problematic in terms of resulting
complexity. So Cockroach treats an MVCC timestamp of zero to mean an
inlined, non-versioned value. These values are replaced if they exist
on a Put operation and are cleared from the engine on a delete.
Importantly, zero-timestamped MVCC values may be merged, as is
necessary for stats counters and time series data.
*/
package storage