github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/storage/doc.go

// Copyright 2014 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

/*
Package storage provides low-level storage. It interacts with storage
backends (e.g. LevelDB, RocksDB) via the Engine interface. One level
higher, MVCC provides multi-version concurrency control capability on
top of an Engine instance.

The Engine interface provides an API for key-value stores. InMem
implements an in-memory engine using a sorted map. RocksDB implements
an engine for data stored on local disk using RocksDB, a variant of
LevelDB.
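
As a rough illustration of the idea (not the actual Engine API), the
following sketch shows a heavily reduced key-value interface backed by
an in-memory sorted structure. The names kvEngine, memEngine, Put, Get
and Scan as used here are assumptions made for this example only; the
real Engine interface has many more methods and different signatures.

```go
package main

import (
	"fmt"
	"sort"
)

// kvEngine is a hypothetical, heavily reduced stand-in for an engine
// interface. It is not the real Engine interface.
type kvEngine interface {
	Put(key, value []byte)
	Get(key []byte) ([]byte, bool)
	Scan(start, end []byte) []string
}

// memEngine keeps its keys in a sorted slice next to a lookup map.
type memEngine struct {
	keys []string
	data map[string][]byte
}

func newMemEngine() *memEngine {
	return &memEngine{data: make(map[string][]byte)}
}

// Put stores a copy of value under key, inserting new keys in sorted position.
func (m *memEngine) Put(key, value []byte) {
	k := string(key)
	if _, ok := m.data[k]; !ok {
		i := sort.SearchStrings(m.keys, k)
		m.keys = append(m.keys, "")
		copy(m.keys[i+1:], m.keys[i:])
		m.keys[i] = k
	}
	m.data[k] = append([]byte(nil), value...)
}

// Get returns the value stored under key, if any.
func (m *memEngine) Get(key []byte) ([]byte, bool) {
	v, ok := m.data[string(key)]
	return v, ok
}

// Scan returns the keys in [start, end) in sorted order.
func (m *memEngine) Scan(start, end []byte) []string {
	lo := sort.SearchStrings(m.keys, string(start))
	hi := sort.SearchStrings(m.keys, string(end))
	if lo > hi {
		return nil
	}
	return append([]string(nil), m.keys[lo:hi]...)
}

func main() {
	var e kvEngine = newMemEngine()
	e.Put([]byte("b"), []byte("2"))
	e.Put([]byte("a"), []byte("1"))
	e.Put([]byte("c"), []byte("3"))
	fmt.Println(e.Scan([]byte("a"), []byte("c"))) // prints [a b]
}
```

The sorted key order is what makes range scans cheap here; a disk-backed
engine would offer the same ordering guarantee through its own
iterators.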

MVCC provides a multi-version concurrency control system on top of an
engine. MVCC is the basis for Cockroach's support for distributed
transactions. It is intended for direct use from storage.Range
objects.

Notes on MVCC architecture

Each MVCC value contains a metadata key/value pair and one or more
version key/value pairs. The MVCC metadata key is the actual key for
the value, encoded using the util/encoding.EncodeBytes scheme. The
MVCC metadata value is of type MVCCMetadata and contains the most
recent version timestamp and an optional roachpb.Transaction message.
If the transaction is set, the most recent version of the MVCC value
is a transactional "intent". The metadata also records the size of the
most recent version's key and value for efficient stat counter
computations. Note that it is not necessary to explicitly store the
MVCC metadata, as its contents can be reconstructed from the most
recent versioned value as long as an intent is not present. The
implementation takes advantage of this and deletes the MVCC metadata
when possible.

Each MVCC version key/value pair has a key which is also
binary-encoded, but is suffixed with a decreasing, big-endian encoding
of the timestamp: eight bytes for the nanosecond wall time, followed
by four bytes for the logical time. Metadata key/value pairs carry no
such suffix; their timestamp is implicit. The MVCC version value is
a message of type roachpb.Value. A deletion is indicated by an
empty value. Note that an empty roachpb.Value will encode to
a non-empty byte slice. The decreasing encoding on the timestamp sorts
the most recent version directly after the metadata key, which is
treated specially by the RocksDB comparator (by making the zero
timestamp sort first). This increases the likelihood that an
Engine.Get() of the MVCC metadata will read the same block that
contains the most recent version, even if there are many versions. We
first get the MVCC metadata key/value and then use its most recent
version timestamp to get the MVCC version directly. This avoids using
an expensive merge iterator to scan for the most recent version. It
also allows us to leverage RocksDB's bloom filters.
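
The two-step read described above can be sketched as follows. This is
a toy model under stated assumptions: toyEngine, metadata, versionedKey
and getLatest are hypothetical names, plain Go maps stand in for the
engine, and the metadata is assumed to always be present (the sketch
does not model intents or the case where the metadata has been
deleted).

```go
package main

import "fmt"

// timestamp mirrors the wall time plus logical time pair described above.
type timestamp struct {
	wallNanos int64
	logical   int32
}

// versionedKey identifies one version of a key (hypothetical type).
type versionedKey struct {
	key string
	ts  timestamp
}

// metadata is a hypothetical, reduced stand-in for MVCCMetadata.
type metadata struct {
	latest timestamp
}

// toyEngine uses maps as stand-ins for point lookups in a real engine.
type toyEngine struct {
	meta     map[string]metadata
	versions map[versionedKey][]byte
}

func newToyEngine() *toyEngine {
	return &toyEngine{
		meta:     make(map[string]metadata),
		versions: make(map[versionedKey][]byte),
	}
}

// less reports whether a is older than b.
func less(a, b timestamp) bool {
	return a.wallNanos < b.wallNanos ||
		(a.wallNanos == b.wallNanos && a.logical < b.logical)
}

// put writes a version and advances the metadata if the version is newer.
func (e *toyEngine) put(key string, ts timestamp, value []byte) {
	e.versions[versionedKey{key: key, ts: ts}] = value
	if m, ok := e.meta[key]; !ok || less(m.latest, ts) {
		e.meta[key] = metadata{latest: ts}
	}
}

// getLatest performs two point lookups: metadata first, then the
// version named by the metadata's most recent timestamp.
func (e *toyEngine) getLatest(key string) ([]byte, bool) {
	m, ok := e.meta[key]
	if !ok {
		return nil, false
	}
	v, ok := e.versions[versionedKey{key: key, ts: m.latest}]
	return v, ok
}

func main() {
	e := newToyEngine()
	e.put("keyA", timestamp{wallNanos: 1}, []byte("v0"))
	e.put("keyA", timestamp{wallNanos: 2}, []byte("v1"))
	v, _ := e.getLatest("keyA")
	fmt.Printf("%s\n", v) // prints v1
}
```

Because both steps are point lookups on exact keys, neither needs an
iterator, which is the property the text above relies on for bloom
filter use.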

The following is an example of the sort order for MVCC key/value pairs:

		...
		keyA: MVCCMetadata of keyA
		keyA_Timestamp_n: value of version_n
		keyA_Timestamp_n-1: value of version_n-1
		...
		keyA_Timestamp_0: value of version_0
		keyB: MVCCMetadata of keyB

The binary encoding used on the MVCC keys allows arbitrary keys to be
stored in the map (no restrictions on intermediate nul bytes, for
example), while still sorting lexicographically and guaranteeing that
all timestamp-suffixed MVCC version keys sort consecutively with the
metadata key. We use an escape-based encoding which escapes all nul
("\x00") bytes in the key and terminates the encoded key with the
sequence "\x00\x01", which is guaranteed not to occur elsewhere in the
encoded value. See util/encoding/encoding.go for more details.

We considered inlining the most recent MVCC version in the
MVCCMetadata. This would reduce the storage overhead of storing the
same key twice (which is small due to block compression), and the
runtime overhead of two separate DB lookups. On the other hand, all
writes that create a new version of an existing key would incur a
double write as the previous value is moved out of the MVCCMetadata
into its versioned key. Preliminary benchmarks have not shown enough
performance improvement to justify this change, although we may
revisit this decision if it turns out that multiple versions of the
same key are rare in practice.

However, we do allow inlining in order to use the MVCC interface to
store non-versioned values. Not everything that Cockroach needs to
store can be stored efficiently, or at all, as versioned MVCC values.
Examples include transaction records, abort span entries, stats
counters, time series data, and system-local config values. Supporting
a mix of encodings, however, would add considerable complexity. So
Cockroach treats an MVCC timestamp of zero to mean an inlined,
non-versioned value. These values are replaced if they exist on a Put
operation and are cleared from the engine on a delete. Importantly,
zero-timestamped MVCC values may be merged, as is necessary for stats
counters and time series data.
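
The zero-timestamp convention can be illustrated with a toy model. All
names below (inlineStore, put, del, merge) are hypothetical, values are
reduced to int64 counters, merging is modeled as addition, and
versioned deletes are not modeled; this is a sketch of the rules stated
above, not of the real implementation.

```go
package main

import "fmt"

// timestamp mirrors the wall time plus logical time pair; the zero
// value marks an inlined, non-versioned value.
type timestamp struct {
	wallNanos int64
	logical   int32
}

// inlineStore is a hypothetical toy store with int64 values.
type inlineStore struct {
	inline   map[string]int64
	versions map[string]map[timestamp]int64
}

func newInlineStore() *inlineStore {
	return &inlineStore{
		inline:   make(map[string]int64),
		versions: make(map[string]map[timestamp]int64),
	}
}

// put replaces the inline value at timestamp zero, and otherwise adds
// a new version next to the existing ones.
func (s *inlineStore) put(key string, ts timestamp, v int64) {
	if ts == (timestamp{}) {
		s.inline[key] = v
		return
	}
	if s.versions[key] == nil {
		s.versions[key] = make(map[timestamp]int64)
	}
	s.versions[key][ts] = v
}

// del clears an inline value entirely; no tombstone is kept.
func (s *inlineStore) del(key string) {
	delete(s.inline, key)
}

// merge folds delta into the inline value, as a stats counter would.
func (s *inlineStore) merge(key string, delta int64) {
	s.inline[key] += delta
}

func main() {
	s := newInlineStore()
	s.put("counter", timestamp{}, 5)
	s.merge("counter", 3)
	fmt.Println(s.inline["counter"]) // prints 8
	s.put("counter", timestamp{}, 10)
	fmt.Println(s.inline["counter"]) // prints 10
	s.del("counter")
	_, ok := s.inline["counter"]
	fmt.Println(ok) // prints false
}
```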
*/
package storage