- Feature Name: Raft Proposal Sideloading for SSTable ingestion
- Status: completed
- Start Date: 2017-05-24
- Authors: Dan Harrison and Tobias Schottdorf, original suggestion Ben Darnell
- RFC PR: #16159
- Cockroach Issue: #16263

# Summary and non-goals

Allow (small) SSTable ingestions to be proposed without putting the SSTable in
the Raft log. An SSTable ingestion entails giving a file to RocksDB for direct
inclusion in its underlying LSM. For such proposals, only metadata is stored in
the Raft log, and the actual payload is written directly to the file system.
This happens transparently on each node, i.e. the proposal is sent over the
wire like a regular Raft proposal, with the payload inlined.

It is explicitly a non-goal to deal with (overly) large proposals, as none of
the Raft interfaces expect proposals of nontrivial size. It is not expected that
sideloadable proposals will exceed a few MB in size.

Specifying the creation and the details of ingestion of the SSTable has its own
subtleties and is left to a sister PR to appear shortly.

# Motivation

`RESTORE` needs a fast mechanism to ingest data into a Range. It wants to rely
on Raft for simplicity, but the naive approach of sending a (chunked)
`WriteBatch` has downsides:

1. all of the data is written to the Raft log, where it incurs a large write
   amplification, and
1. applying a large `WriteBatch` to RocksDB incurs another sizable write
   amplification factor: it is far better to link SSTables directly into the
   LSM.

Instead, `RESTORE` sends small (initially ~2MB) SSTables which are to be linked
directly into RocksDB. This eliminates the latter point above. However, it does
not address the former, and that is what this RFC is about: achieving the
optimal write amplification of 1 or 2 (the latter amounting to dodging some
technical difficulties discussed at the end).

Note also that housing the Raft log outside of RocksDB only addresses the former
point if it results in an SSTable being stored inside of the RocksDB directory,
which is likely to be a non-goal of that project.

## Detailed design

The mechanism proposed here is the following:

1. `storagebase.RaftCommand` gains fields `SideloadedHash []byte` (on
   `ReplicatedEvalResult`) and `SideloadedData []byte` (next to `WriteBatch`)
   which are unused for regular Raft proposals.
1. When a proposal is to be sideloaded, a regular proposal is generated, but
   with the sideloaded data and its hash populated. In the concrete case of
   `SSTable`, this means that `evalAddSSTable` creates an essentially empty
   proposal (perhaps accounting for an MVCC stats delta).
1. On its way into Raft, the `SideloadedData` bytes are then written to disk and
   stripped from the proposal. On disk, the payload is stored in the `Store`'s
   directory, accessible via its log index, using the following scheme (for an
   explanation of why this scheme was chosen, see later sections):

   ```
   <storeDir>/sideloaded_proposals/<rangeID>.<replicaID>/<logIndex>.<term>
   ```

   where the file contains the raw sideloaded data, e.g. an `SSTable` that will
   populate `(storagebase.RaftCommand).SideloadedData`.
1. Whenever the content of a sideloaded proposal is required, a `raftpb.Entry`
   is reconstituted from the corresponding file, while verifying the checksum in
   `SideloadedHash`. A failure of either operation is treated as fatal (a
   `ReplicaCorruptionError`). Note that restoring an entry is expensive: load
   the payload from disk, unmarshal the `RaftCommand`, compute the checksum and
   compare it to that stored in the `RaftCommand`, put the payload into the
   `RaftCommand`, marshal the `RaftCommand`. The Raft entries cache should help
   mitigate this cost, though we should watch the overhead closely to check
   whether we can perhaps eagerly populate the cache when proposing.
1. When such an entry is sent to a follower over the network
   (`sendRaftMessage`), the original proposal is sent (via the mechanism above).
   Note that this could happen via `append` or via the snapshotting mechanism
   (`sendSnapshot` or rather its child `iterateEntries`), which sends a part of
   the Raft log along with the snapshot.
1. Apply-time side effects have access to the payload, and can make use of it
   under the contract that they do not alter the on-disk file. For the SSTable
   use case, when a sideloaded entry is applied to the state machine, the
   corresponding file is copied into the RocksDB directory (hard-linking instead
   being an optimization discussed later) and, from there, ingested by RocksDB.
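
For concreteness, the following sketch shows how the on-disk location of a
payload could be derived from the scheme above. The helper name and signature
are illustrative only and are not taken from the actual implementation.

```go
package sideload

import (
	"fmt"
	"path/filepath"
)

// sideloadedFilename builds the on-disk location of a sideloaded payload,
// following <storeDir>/sideloaded_proposals/<rangeID>.<replicaID>/<logIndex>.<term>.
func sideloadedFilename(storeDir string, rangeID, replicaID, index, term uint64) string {
	return filepath.Join(
		storeDir,
		"sideloaded_proposals",
		fmt.Sprintf("%d.%d", rangeID, replicaID), // <rangeID>.<replicaID>
		fmt.Sprintf("%d.%d", index, term),        // <logIndex>.<term>
	)
}
```
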
## Determining whether a `raftpb.Entry` is sideloaded

The principal difficulty is sniffing sideloadable proposals from a
`(raftpb.Entry).Data` byte slice. This is necessary since sideloading happens at
that fairly low level, and unmarshalling every single message squarely in the
hot write path is clearly not an option. `(raftpb.Entry).Data` equals
`encodeRaftCommand(p.idKey, data)`, where `data` is a marshaled
`storagebase.RaftCommand`, resulting in

```
(raftpb.Entry).Data = append(raftCommandEncodingVersion, idKey, data)
```

We can't hide information in `data` as we would have to unmarshal it too often,
and so we make the idiomatic choice: sideloaded Raft proposals are sent using a
new `raftCommandEncodingVersion`. See the migration story below.

Armed with that, `(*Replica).append` and the other methods above can ensure that
there is no externally visible change to the way Raft proposals are handled, and
we can drop sideloaded proposals directly on disk, where they are accessible via
their log index.

TODO: use cmdID instead? Or can we do something else entirely?
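
A minimal sketch of what this sniffing could look like, assuming the
`version byte + command ID + payload` layout shown above; the constant names
and values are hypothetical, not the actual ones used by `encodeRaftCommand`:

```go
package sideload

// Hypothetical encoding versions; the real byte values are an internal detail
// of encodeRaftCommand and its decoder.
const (
	raftVersionStandard   byte = 1
	raftVersionSideloaded byte = 2
)

// sniffSideloadedRaftCommand reports whether an entry's Data was produced with
// the sideloaded raft command encoding version, looking only at the leading
// version byte and without unmarshalling the contained RaftCommand.
func sniffSideloadedRaftCommand(data []byte) bool {
	return len(data) > 0 && data[0] == raftVersionSideloaded
}
```
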
## Migration story

The feature will initially only be used for SSTable ingestion, which remains
behind a feature flag. An upgrade is carried out as follows:

1. rolling cluster restart to the new version
1. enable the feature flag
1. `RESTORE` now uses SSTable ingestion

A downgrade is slightly less stable but should work out OK in practice:

1. disable the feature flag
1. wait until it's clear that no SSTable ingestions are active anywhere (i.e.
   any such proposals have been applied by all replicas affected by them)
1. either wait until all ingestions have been purged from the log (hard to
   control) or be prepared to manually remove the sideload directory after the
   downgrade (to avoid stale files that are never cleaned up)
1. rolling cluster restart to the old version
1. in the unlikely event of a node crashing, upgrade that node again and go
   back to the first step

Due to the feature flag, rolling updates are possible as usual.

## Details on file creation and removal

There are three procedures that need to mutate sideloaded files. These are, in
increasing difficulty, replica GC, log truncation, and `append()`. The main
objectives here are making sure that we move the bulk of disk I/O outside of
critical locks, and that all files are eventually cleaned up.

### Replica GC

When a Replica is garbage collected, we know that if it is ever recreated, then
it will be at a higher `ReplicaID`. For that reason, the sideloaded proposals
are disambiguated by `ReplicaID`; we can postpone cleanup of these files until
we don't hold any locks. During node startup, we check for sideloaded proposals
that do not correspond to a Replica of their Store and remove them.

Concretely,

- after GC'ing the Replica, replica GC wholesale removes the directory
  `<storeDir>/sideloaded_proposals/<rangeID>.<replicaID>` after releasing all
  locks.
- on node startup, after all Replicas have been instantiated but before
  receiving traffic, delete those directories which do not correspond to a live
  replica. Assert that there is no directory for a replicaID larger than what we
  have instantiated.

### Log truncation

Similarly to replica GC, once the log is truncated to some new first index `i`,
we know that no older sideloaded data is ever going to be loaded again, and we
can lazily and without any locks unlink all files for which `index < i`
(regardless of the term).

In case of a crash before this cleanup, these files will be deleted either with
the next truncation or when the replica is garbage collected.

### append()

This is the interesting one. `append()` is called by Raft when it wants us to
store new entries into the log.

Once we commit the corresponding RocksDB batch, any sideloaded payloads must be
on disk as well, or an ill-timed crash would lead to log entries which are
acknowledged but have no payload associated with them, a situation that is
impossible to recover from (OK, not strictly impossible, since we crash before
sending a message out to the lease holder - we could remove the message from the
log again at node startup - but it's a bad idea). So we have to make sure that
all sideloaded payloads are written to disk before the batch passed to
`append()` is committed, and that obsolete payloads are removed *after*.

Initially, we will write the files directly in `append()`, which means they are
all on disk when the batch commits, but they are written under the `raftMu` and
`replicaMu` locks, which is not ideal. However, it should be relatively easy to
optimize this by eagerly creating the files much earlier, outside of the locks;
this is made possible since we disambiguate by term, and once a payload for a
given index and term has been written, it will not be changed in that term.

An interesting subcase arises when Raft wants us to **replace** our tail of the
log, which would then have a term bump associated with it. In particular, we may
need to replace a sideloaded entry with a different higher-term sideloaded
entry. Thanks to disambiguation by term, both payloads can exist side by side,
and we can first write the higher-term payload and then remove the replaced one.

We must tolerate the existence of a file with a higher term than any we know of,
though its existence should be short-lived as our Replica should learn about
that higher term shortly.

In summary, what we do is:

- write the new payloads as early as possible; initially in `append()`, but
  theoretically before acquiring any locks.
- tolerate existing files - they are identical (as a `(term, index)` can be
  assigned only one log entry). Note how this encourages the write-early
  optimization.
- remove outdated payloads as late as possible; initially after `append()`'s
  batch commits, later after releasing any locks.
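
To make this ordering concrete, here is a minimal sketch of the write, commit,
and cleanup sequence in `append()`. All types and helpers (`entry`, `batch`,
`writePayload`, `purgeReplaced`) are hypothetical stand-ins rather than the
actual storage package API.

```go
package sideload

// entry and batch are hypothetical stand-ins for the real types involved.
type entry struct {
	Index, Term uint64
	Sideloaded  bool
	Payload     []byte
}

type batch interface {
	PutEntry(e entry) error
	Commit() error
}

// appendEntries sketches the required ordering: sideloaded payloads become
// durable before the Raft log batch commits, and payloads replaced by this
// append (e.g. a higher-term entry at the same index) are removed only after.
func appendEntries(
	b batch,
	entries []entry,
	writePayload func(index, term uint64, payload []byte) error,
	purgeReplaced func([]entry) error,
) error {
	// 1. Write sideloaded payloads first, so that a crash after the batch
	//    commit can never leave acknowledged entries without their payload.
	for _, e := range entries {
		if e.Sideloaded {
			if err := writePayload(e.Index, e.Term, e.Payload); err != nil {
				return err
			}
		}
	}
	// 2. Append the (stripped) entries and commit the batch.
	for _, e := range entries {
		if err := b.PutEntry(e); err != nil {
			return err
		}
	}
	if err := b.Commit(); err != nil {
		return err
	}
	// 3. Only now remove payloads made obsolete by this append.
	return purgeReplaced(entries)
}
```
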
## Details on reconstituting a `raftpb.Entry`

Whenever a `raftpb.Entry` is required for transmission to a follower, we have to
reconstitute it (i.e. inline the payload). This is straightforward, though a bit
expensive:

- check the raft command encoding version (it can be sniffed from the `Data`
  slice); if it does not indicate a sideloaded entry, do nothing. Otherwise:
- decode the command and unmarshal the contained `cmd storagebase.RaftCommand`,
- load the on-disk payload into `cmd.SideloadedData` (term, replicaID and log
  index are known at this point) and compare its hash with
  `cmd.SideloadedHash`, failing on mismatch,
- marshal and encode the command into a new `raftpb.Entry`, using the new raft
  command version.

## Hash function

Speed matters in this application, and RocksDB internally uses CRC32. For that
reason, CRC32 is deemed sufficient here.
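
As an illustration of the load-and-verify step, the sketch below reads a
payload from disk and checks it against the checksum carried in the proposal.
The helper name, its signature, and the choice of the IEEE polynomial are
assumptions made for the example, not taken from the implementation.

```go
package sideload

import (
	"fmt"
	"hash/crc32"
	"os"
)

// loadAndVerifyPayload reads a sideloaded payload and compares its CRC32
// against the checksum stored in the proposal; the caller treats a mismatch
// as fatal (ReplicaCorruptionError).
func loadAndVerifyPayload(path string, expectedCRC uint32) ([]byte, error) {
	payload, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if actual := crc32.ChecksumIEEE(payload); actual != expectedCRC {
		return nil, fmt.Errorf(
			"sideloaded payload %s: checksum mismatch (have %08x, want %08x)",
			path, actual, expectedCRC)
	}
	return payload, nil
}
```
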
# Drawbacks

There is some overlap with the project to move the Raft log out of RocksDB.
However, the end goal of that project is not necessarily compatible with this
RFC, and the changes proposed here are agnostic of changes in the Raft log
backend.

# Unresolved questions

## Optimization for 1x write amplification: hard-link into RocksDB directory

If we are aiming high and want 1x write amplification, then we do not want to
copy the file into the RocksDB directory; we want to hard-link it there.
However, RocksDB may *alter* the SSTable. One case in which it does this is that
it may set an internal sequence number used for RocksDB transactions; this
happens when there are open RocksDB snapshots, for example.

Knowing that this can happen, we can avoid it: the SSTable is always created
with a zero sequence number, and we can ignore any updated on-disk sequence
number when reading the file from disk to treat it as a log entry.

However, we need to be very sure that RocksDB will not perform other
modifications to the file that could trip the consistency checker.

We'll exclude this from the initial implementation, though it seems
straightforward enough to add later without migration headaches.

### Complications of using hard links

The design above uses hard linking for SSTable ingestion to achieve minimal
write amplification. Hard links may not be supported across the board (think:
virtualized environments), and we crucially rely on the fact that a file is only
deleted after all referencing hard links have been removed. In these situations,
an extra copy can be made instead; this will be exposed as either a hidden knob
or a cluster setting.

## Usefulness of generalization

Sideloading could be useful for bulk INSERTs (non-RESTORE data loading) and
DeleteRange, or more generally for any proposal that's large enough to profit
from reduced write amplification compared to what we have today. However, moving
the Raft log out of RocksDB likely already addresses that suitably.