- Feature Name: Raft Proposal Sideloading for SSTable ingestion
- Status: completed
- Start Date: 2017-05-24
- Authors: Dan Harrison and Tobias Schottdorf, original suggestion Ben Darnell
- RFC PR: #16159
- Cockroach Issue: #16263

# Summary and non-goals

Allow (small) SSTable ingestions to be proposed without putting the SSTable in
the Raft log. An SSTable ingestion entails giving a file to RocksDB for direct
inclusion in its underlying LSM. For such proposals, only metadata is stored in
the Raft log, and the actual payload is written directly to the file system.
This happens transparently on each node, i.e. the proposal is sent over the
wire like a regular Raft proposal, with the payload inlined.

It is explicitly a non-goal to deal with (overly) large proposals, as none of
the Raft interfaces expect proposals of nontrivial size. It is not expected that
sideloadable proposals will exceed a few MB in size.

Specifying the creation and the details of ingestion of the SSTable has its own
subtleties and is left to a sister PR to appear shortly.

# Motivation

`RESTORE` needs a fast mechanism to ingest data into a Range. It wants to rely
on Raft for simplicity, but the naive approach of sending a (chunked)
`WriteBatch` has downsides:

1. all of the data is written to the Raft log, where it incurs a large write
   amplification, and
1. applying a large `WriteBatch` to RocksDB incurs another sizable write
   amplification factor: it is far better to link SSTables directly into the
   LSM.

Instead, `RESTORE` sends small (initially ~2MB) SSTables which are to be linked
directly into RocksDB. This eliminates the latter point above. However, it does
not address the former, and that is what this RFC is about: achieving the
optimal write amplification of 1 or 2 (the latter amounting to dodging some
technical difficulties discussed at the end).

Note also that housing the Raft log outside of RocksDB only addresses the former
point if it results in an SSTable being stored inside of the RocksDB directory,
which is likely to be a non-goal of that project.

## Detailed design

The mechanism proposed here is the following:

1. `storagebase.RaftCommand` gains fields `SideloadedHash []byte` (on
   `ReplicatedEvalResult`) and `SideloadedData []byte` (next to `WriteBatch`)
   which are unused for regular Raft proposals.
1. When a proposal is to be sideloaded, a regular proposal is generated, but
   with the sideloaded data and its hash populated. In the concrete case of
   `SSTable`, this means that `evalAddSSTable` creates an essentially empty
   proposal (perhaps accounting for an MVCC stats delta).
1. On its way into Raft, the `SideloadedData` bytes are then written to disk and
   stripped from the proposal. On disk, the payload is stored in the `Store`'s
   directory, accessible via its log index, using the following scheme (for an
   explanation of why this scheme was chosen, see later sections):

   ```
   <storeDir>/sideloaded_proposals/<rangeID>.<replicaID>/<logIndex>.<term>
   ```

   where the file contains the raw sideloaded data, e.g. an `SSTable` that will
   populate `(storagebase.RaftCommand).SideloadedData`.
1. Whenever the content of a sideloaded proposal is required, a `raftpb.Entry`
   is reconstituted from the corresponding file, while verifying the checksum in
   `SideloadedHash`. A failure of either operation is treated as fatal (a
   `ReplicaCorruptionError`). Note that restoring an entry is expensive: load
   the payload from disk, unmarshal the `RaftCommand`, compute the checksum and
   compare it to that stored in the `RaftCommand`, put the payload into the
   `RaftCommand`, marshal the `RaftCommand`. The Raft entries cache should help
   mitigate this cost, though we should watch the overhead closely to check
   whether we can perhaps eagerly populate the cache when proposing.
1. When such an entry is sent to a follower over the network
   (`sendRaftMessage`), the original proposal is sent (via the mechanism above).
   Note that this could happen via `append` or via the snapshotting mechanism
   (`sendSnapshot` or rather its child `iterateEntries`), which sends a part of
   the Raft log along with the snapshot.
1. Apply-time side effects have access to the payload, and can make use of it
   under the contract that they do not alter the on-disk file. For the SSTable
   use case, when a sideloaded entry is applied to the state machine, the
   corresponding file is copied into the RocksDB directory (hard-linking instead
   being an optimization discussed later) and, from there, ingested by RocksDB.
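
For concreteness, the following sketch shows how the on-disk location of a
payload could be derived from the scheme above. The helper name and signature
are illustrative only and are not taken from the actual implementation.

```go
package sideload

import (
	"fmt"
	"path/filepath"
)

// sideloadedFilename builds the on-disk location of a sideloaded payload,
// following <storeDir>/sideloaded_proposals/<rangeID>.<replicaID>/<logIndex>.<term>.
func sideloadedFilename(storeDir string, rangeID, replicaID, index, term uint64) string {
	return filepath.Join(
		storeDir,
		"sideloaded_proposals",
		fmt.Sprintf("%d.%d", rangeID, replicaID), // <rangeID>.<replicaID>
		fmt.Sprintf("%d.%d", index, term),        // <logIndex>.<term>
	)
}
```
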
## Determining whether a `raftpb.Entry` is sideloaded

The principal difficulty is sniffing sideloadable proposals from a
`(raftpb.Entry).Data` byte slice. This is necessary since sideloading happens at
that fairly low level, and unmarshalling every single message squarely in the
hot write path is clearly not an option. `(raftpb.Entry).Data` equals
`encodeRaftCommand(p.idKey, data)`, where `data` is a marshaled
`storagebase.RaftCommand`, resulting in

```
(raftpb.Entry).Data = append(raftCommandEncodingVersion, idKey, data)
```

We can't hide information in `data` as we would have to unmarshal it too often,
and so we make the idiomatic choice: sideloaded Raft proposals are sent using a
new `raftCommandEncodingVersion`. See the migration story below.

Armed with that, `(*Replica).append` and the other methods above can ensure that
there is no externally visible change to the way Raft proposals are handled, and
we can drop sideloaded proposals directly on disk, where they are accessible via
their log index.

TODO: use cmdID instead? Or can we do something else entirely?
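
A minimal sketch of what this sniffing could look like, assuming the
`version byte + command ID + payload` layout shown above; the constant names
and values are hypothetical, not the actual ones used by `encodeRaftCommand`:

```go
package sideload

// Hypothetical encoding versions; the real byte values are an internal detail
// of encodeRaftCommand and its decoder.
const (
	raftVersionStandard   byte = 1
	raftVersionSideloaded byte = 2
)

// sniffSideloadedRaftCommand reports whether an entry's Data was produced with
// the sideloaded raft command encoding version, looking only at the leading
// version byte and without unmarshalling the contained RaftCommand.
func sniffSideloadedRaftCommand(data []byte) bool {
	return len(data) > 0 && data[0] == raftVersionSideloaded
}
```
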
## Migration story

The feature will initially only be used for SSTable ingestion, which remains
behind a feature flag. An upgrade is carried out as follows:

1. rolling cluster restart to the new version
1. enable the feature flag
1. `RESTORE` now uses SSTable ingestion

A downgrade is slightly less stable but should work out OK in practice:

1. disable the feature flag
1. wait until it's clear that no SSTable ingestions are active anywhere (i.e.
   any such proposals have been applied by all replicas affected by them)
1. either wait until all ingestions have been purged from the log (hard to
   control) or be prepared to manually remove the sideload directory after the
   downgrade (to avoid stale files that are never cleaned up)
1. rolling cluster restart to the old version
1. in the unlikely event of a node crashing, upgrade that node again and go
   back to the first step

Due to the feature flag, rolling updates are possible as usual.

## Details on file creation and removal

There are three procedures that need to mutate sideloaded files. These are, in
increasing difficulty, replica GC, log truncation, and `append()`. The main
objectives here are making sure that we move the bulk of disk I/O outside of
critical locks, and that all files are eventually cleaned up.

### Replica GC

When a Replica is garbage collected, we know that if it is ever recreated, then
it will be at a higher `ReplicaID`. For that reason, the sideloaded proposals
are disambiguated by `ReplicaID`; we can postpone cleanup of these files until
we don't hold any locks. During node startup, we check for sideloaded proposals
that do not correspond to a Replica of their Store and remove them.

Concretely,

- after GC'ing the Replica, replica GC wholesale removes the directory
  `<storeDir>/sideloaded_proposals/<rangeID>.<replicaID>` after releasing all
  locks.
- on node startup, after all Replicas have been instantiated but before
  receiving traffic, delete those directories which do not correspond to a live
  replica. Assert that there is no directory for a replicaID larger than what we
  have instantiated.

### Log truncation

Similarly to replica GC, once the log is truncated to some new first index `i`,
we know that no older sideloaded data is ever going to be loaded again, and we
can lazily and without any locks unlink all files for which `index < i`
(regardless of the term).

In case of a crash before this cleanup, these files will be deleted either with
the next truncation or when the replica is garbage collected.

### append()

This is the interesting one. `append()` is called by Raft when it wants us to
store new entries into the log.

Once we commit the corresponding RocksDB batch, any sideloaded payloads must be
on disk as well, or an ill-timed crash would lead to log entries which are
acknowledged but have no payload associated with them, a situation that is
impossible to recover from (OK, not strictly impossible, since we crash before
sending a message out to the lease holder - we could remove the message from the
log again at node startup - but it's a bad idea). So we have to make sure that
all sideloaded payloads are written to disk before the batch passed to
`append()` is committed, and that obsolete payloads are removed *after*.

Initially, we will write the files directly in `append()`, which means they are
all on disk when the batch commits, but they are written under the `raftMu` and
`replicaMu` locks, which is not ideal. However, it should be relatively easy to
optimize this by eagerly creating the files much earlier, outside of the locks;
this is made possible since we disambiguate by term, and once a payload for a
given index and term has been written, it will not be changed in that term.

An interesting subcase arises when Raft wants us to **replace** our tail of the
log, which would then have a term bump associated with it. In particular, we may
need to replace a sideloaded entry with a different higher-term sideloaded
entry. Thanks to disambiguation by term, both payloads can exist side by side,
and we can first write the higher-term payload and then remove the replaced one.

We must tolerate the existence of a file with a higher term than any we know of,
though its existence should be short-lived as our Replica should learn about
that higher term shortly.

In summary, what we do is:

- write the new payloads as early as possible; initially in `append()`, but
  theoretically before acquiring any locks.
- tolerate existing files - they are identical (as a `(term, index)` can be
  assigned only one log entry). Note how this encourages the write-early
  optimization.
- remove outdated payloads as late as possible; initially after `append()`'s
  batch commits, later after releasing any locks.
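
To make this ordering concrete, here is a minimal sketch of the write, commit,
and cleanup sequence in `append()`. All types and helpers (`entry`, `batch`,
`writePayload`, `purgeReplaced`) are hypothetical stand-ins rather than the
actual storage package API.

```go
package sideload

// entry and batch are hypothetical stand-ins for the real types involved.
type entry struct {
	Index, Term uint64
	Sideloaded  bool
	Payload     []byte
}

type batch interface {
	PutEntry(e entry) error
	Commit() error
}

// appendEntries sketches the required ordering: sideloaded payloads become
// durable before the Raft log batch commits, and payloads replaced by this
// append (e.g. a higher-term entry at the same index) are removed only after.
func appendEntries(
	b batch,
	entries []entry,
	writePayload func(index, term uint64, payload []byte) error,
	purgeReplaced func([]entry) error,
) error {
	// 1. Write sideloaded payloads first, so that a crash after the batch
	//    commit can never leave acknowledged entries without their payload.
	for _, e := range entries {
		if e.Sideloaded {
			if err := writePayload(e.Index, e.Term, e.Payload); err != nil {
				return err
			}
		}
	}
	// 2. Append the (stripped) entries and commit the batch.
	for _, e := range entries {
		if err := b.PutEntry(e); err != nil {
			return err
		}
	}
	if err := b.Commit(); err != nil {
		return err
	}
	// 3. Only now remove payloads made obsolete by this append.
	return purgeReplaced(entries)
}
```
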
## Details on reconstituting a `raftpb.Entry`

Whenever a `raftpb.Entry` is required for transmission to a follower, we have to
reconstitute it (i.e. inline the payload). This is straightforward, though a bit
expensive:

- check the raft command encoding version (it can be sniffed from the `Data`
  slice); if it does not indicate a sideloaded entry, do nothing. Otherwise:
- decode the command and unmarshal the contained `cmd storagebase.RaftCommand`,
- load the on-disk payload into `cmd.SideloadedData` (term, replicaID and log
  index are known at this point) and compare its hash with
  `cmd.SideloadedHash`, failing on mismatch,
- marshal and encode the command into a new `raftpb.Entry`, using the new raft
  command version.

## Hash function

Speed matters in this application, and RocksDB internally uses CRC32. For that
reason, CRC32 is deemed sufficient here.
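
As an illustration of the load-and-verify step, the sketch below reads a
payload from disk and checks it against the checksum carried in the proposal.
The helper name, its signature, and the choice of the IEEE polynomial are
assumptions made for the example, not taken from the implementation.

```go
package sideload

import (
	"fmt"
	"hash/crc32"
	"os"
)

// loadAndVerifyPayload reads a sideloaded payload and compares its CRC32
// against the checksum stored in the proposal; the caller treats a mismatch
// as fatal (ReplicaCorruptionError).
func loadAndVerifyPayload(path string, expectedCRC uint32) ([]byte, error) {
	payload, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if actual := crc32.ChecksumIEEE(payload); actual != expectedCRC {
		return nil, fmt.Errorf(
			"sideloaded payload %s: checksum mismatch (have %08x, want %08x)",
			path, actual, expectedCRC)
	}
	return payload, nil
}
```
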
# Drawbacks

There is some overlap with the project to move the Raft log out of RocksDB.
However, the end goal of that project is not necessarily compatible with this
RFC, and the changes proposed here are agnostic of changes in the Raft log
backend.

# Unresolved questions

## Optimization for 1x write amplification: hard-link into RocksDB directory

If we are aiming high and want 1x write amplification, then we do not want to
copy the file into the RocksDB directory; we want to hard-link it there.
However, RocksDB may *alter* the SSTable. One case in which it does this is that
it may set an internal sequence number used for RocksDB transactions; this
happens when there are open RocksDB snapshots, for example.

Knowing that this can happen, we can avoid it: the SSTable is always created
with a zero sequence number, and we can ignore any updated on-disk sequence
number when reading the file from disk to treat it as a log entry.

However, we need to be very sure that RocksDB will not perform other
modifications to the file that could trip the consistency checker.

We'll exclude this from the initial implementation, though it seems
straightforward enough to add later without migration headaches.

### Complications of using hard links

The design above uses hard linking for SSTable ingestion to achieve minimal
write amplification. Hard links may not be supported across the board (think:
virtualized environments), and we crucially rely on the fact that a file is only
deleted after all referencing hard links have been removed. In these situations,
an extra copy can be made instead; this will be exposed as either a hidden knob
or a cluster setting.

## Usefulness of generalization

Sideloading could be useful for bulk INSERTs (non-RESTORE data loading) and
DeleteRange, or more generally for any proposal that's large enough to profit
from reduced write amplification compared to what we have today. However, moving
the Raft log out of RocksDB likely already addresses that suitably.