- Feature Name: Time Series Culling
- Status: in progress
- Start Date: 2016-08-29
- Authors: Matt Tracy
- RFC PR: #9343
- Cockroach Issue: #5910

# Summary
Currently, Time Series data recorded by CockroachDB for its own internal
metrics is retained indefinitely. High-resolution metrics data quickly loses
utility as it ages, consuming disk space and creating range-related overhead
without conferring an appropriate benefit.

The simplest solution to deal with this would be to build a system that deletes
time series data older than a certain threshold; however, this RFC suggests a
mechanism for "rolling up" old time series from the system into a lower
resolution that is still retained. This will allow us to keep some metrics
information indefinitely, which can be used for historical performance
evaluation, without needing to keep an unacceptably expensive amount of
information.

Fully realizing this solution has three components:

1. A distributed "culling" algorithm that occasionally searches for
high-resolution time series data older than a certain threshold and runs a
"roll-up" process on the discovered keys.
2. A "roll-up" process that computes low-resolution time series data from the
existing data in a high-resolution time series key, deleting the high-resolution
key in the process.
3. Modifications to the query system to utilize underlying data which is stored
at multiple resolutions (the query system currently supports only a single
resolution). This includes the use of data at different resolutions to serve a
single query.

# Motivation

In our test clusters, time series create a very large amount of data (on the
order of several gigabytes per week) which quickly loses utility as it ages.

To estimate how much data this is, we first observe the data usage of a single
time series. A single time series stores data as contiguous samples representing
ten-second intervals; all samples for a wall-clock hour are stored in a single
key. In the engine, the keys look like this:

| Key                                                      | Key Size | Value Size |
|----------------------------------------------------------|----------|------------|
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T17:00:00Z | 30       | 5670       |
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T18:00:00Z | 30       | 5535       |
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T19:00:00Z | 30       | 5046       |

The above is the data stored for one time series over three complete hours.
Notice the variation in the size of the values; this is due to the fact that
samples may be absent for some ten-second periods, due to the asynchronous
nature of this system. For our purposes, we will estimate the size of a single
hour of data for a single time series to be *5500* bytes, or 5.5K.

The total disk usage of high-resolution data on the cluster can thus be
estimated with the following function:

` Total bytes = [bytes per time series hour] * [# of time series per node] * [# of nodes] * [# of hours] `

Thus, data accumulates over time, and as more nodes are added (or if later
versions of CockroachDB add additional time series), the rate at which new time
series data accumulates increases linearly. As of this writing, each single-node
store records **242** time series. Thus, the data generated per hour on a
ten-node cluster is:

`Total Bytes (hour) = 5500 * 242 * 10 = 13310000 (12.69 MiB)`

After just one week:

`Total Bytes (week) = 12.69 MiB * 168 hours = 2.08 GiB`

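For reference, a minimal Go sketch of this estimate; the constants are the
figures assumed above, not values read from a live cluster:

```go
package main

import "fmt"

// Illustrative constants taken from the estimates above: ~5500 bytes per time
// series per hour, and 242 time series recorded per node.
const (
	bytesPerSeriesHour = 5500
	seriesPerNode      = 242
)

// estimatedBytes applies the formula:
// [bytes per ts hour] * [# of ts per node] * [# of nodes] * [# of hours].
func estimatedBytes(nodes, hours int) int {
	return bytesPerSeriesHour * seriesPerNode * nodes * hours
}

func main() {
	fmt.Println(estimatedBytes(10, 1))   // one hour, ten nodes: 13310000 bytes (~12.69 MiB)
	fmt.Println(estimatedBytes(10, 168)) // one week, ten nodes: ~2.08 GiB
}
```
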
As time passes, this data can represent a large share (or in the case of idle
clusters, the majority) of in-use data on the cluster. This data will also
continue to build indefinitely; a static CockroachDB cluster will eventually
consume all available disk space, even if no external data is written! With just
the current time series, a ten-node cluster will generate well over a hundred
gigabytes of metrics data over a single year
(`2.08 GiB/week * 52 weeks ≈ 108 GiB`).

The prompt culling of old data is thus a clear area of improvement for
CockroachDB. However, rather than simply deleting data older than a threshold,
this RFC proposes a solution which efficiently keeps metrics data for a longer
time span by downsampling it to a much lower resolution on disk.

To give these numbers some context: currently, all metrics on disk are stored in
a format which is downsampled to _ten second sample periods_; this is the
"high-resolution" data. We are looking to delete this data when it is older
than a certain threshold, which will likely be set in the range of _2-4 weeks_.
We also propose that, when this data is deleted, it is first downsampled further
into _one hour sample periods_; this is the "low-resolution" data. This data
will be kept for a much longer time, likely _6-12 months_, but perhaps longer.

In the lower resolution, each datapoint represents the same data as an _entire
slab_ of high-resolution data (at the ten-second resolution, data is stored in
slabs corresponding to a wall-clock hour; each slab contains up to 360 samples).
Thus, the expected data storage of the low resolution is approximately _180x
smaller_ than the high resolution (not 360x: each high-resolution sample
contains only a "sum" and "count" field, while each low-resolution sample adds
"min" and "max" values, doubling the per-sample size).

By keeping data at the low resolution, users will still be able to inspect
cluster performance over larger time scales, without requiring the storage of
an excessive amount of metrics data.

# Detailed design

## Culling algorithm

The culling algorithm is responsible for identifying high-resolution time series
keys that are older than a system-set threshold. Once identified, the keys are
passed into the rollup/delete process.

There are two primary design requirements of the culling algorithm:

1. From a single node, efficiently locating time series keys which need to be
culled.
2. Across the cluster, efficiently distributing the task of culling with minimal
coordination between nodes.

#### Locating Time Series Keys

Locating time series keys to be culled is not completely trivial due to the
construction of time series keys, which is as follows:
`[ts prefix][series name][timestamp][source]`

> Example: "ts/cr.node.sql.inserts/1473739200/1" would contain time series data
> for "cr.node.sql.inserts" on September 13th 2016 between 4am-5am UTC,
> specifically for node 1.

Because of this construction, which prioritizes name over timestamp, the most
recent time series data for series "A" would sort *before* the oldest time
series data for series "B". This means that we cannot simply cull the beginning
of the time series range.

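To make the ordering concrete, here is a minimal sketch; the `tsKey` helper is a
hypothetical, human-readable stand-in for the real binary key encoding:

```go
package main

import (
	"fmt"
	"sort"
)

// tsKey mimics the [ts prefix][series name][timestamp][source] layout using a
// slash-separated string; the real keys are binary-encoded.
func tsKey(name string, timestamp int64, source string) string {
	return fmt.Sprintf("ts/%s/%d/%s", name, timestamp, source)
}

func main() {
	keys := []string{
		tsKey("cr.node.sql.selects", 1473739200, "1"), // old data for series B
		tsKey("cr.node.sql.inserts", 1479168000, "1"), // recent data for series A
		tsKey("cr.node.sql.inserts", 1473739200, "1"), // old data for series A
	}
	sort.Strings(keys)
	// After sorting, every key for series A (including the most recent) comes
	// before the oldest key for series B, so expired data is not clustered at
	// the front of the time series key space.
	for _, k := range keys {
		fmt.Println(k)
	}
}
```
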
The simplest alternative would be to scan the entire time series range looking
for expired keys; however, this scan would be burdensome because most of the
keys it visits will not need to be culled. For a per-node time series being
recorded on a 10-node cluster with a 2 week retention period, we would expect to
retain (10 x 24 x 14) = *3360* keys that should not be culled. In a system that
maintains dozens, possibly hundreds of time series, this is a lot of data for
each node to scan on a regular basis.

However, this scan can be effectively distributed across the cluster by creating
a new *replica queue* which searches for time series keys. The new queue can
quickly determine if each range contains time series keys (by inspecting
start/end keys); for ranges that do contain time series keys, specific keys
can then be inspected at the engine level. This means that key inspections do
not require network calls, and the number of keys that can be inspected at once
is limited to the size of a range.

Once the queue discovers a range that contains time series keys, the scanning
process does not need to inspect every key on the range. The algorithm is as
follows (a sketch appears after the list):

1. Find the first time series key in the range (scan for [ts prefix]).
2. Deconstruct the key to retrieve its name.
3. Run the rollup/delete operation on all keys in the range:
    `[ts prefix][series name][0] - [ts prefix][series name][now - threshold]`
4. Find the next key on the range which contains data for a different time
series by searching for key `PrefixEnd([ts prefix][series name])`.
5. If a key was found in step 4, return to step 2 with that series name.
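
A minimal Go sketch of this loop follows; the `seriesIterator` interface and its
methods are hypothetical stand-ins for the real key-encoding and engine APIs:

```go
package culling

import "bytes"

// seriesIterator abstracts the engine-level operations the per-range culling
// scan needs. It is a hypothetical interface, not the real storage API.
type seriesIterator interface {
	// SeekGE positions the iterator at the first time series key at or after
	// the given key, returning false if there is none.
	SeekGE(key []byte) ([]byte, bool)
	// SeriesName decodes the series name embedded in a time series key.
	SeriesName(key []byte) (string, error)
	// SeriesSpan returns the [start, end) key span covering all data for the
	// named series with timestamps older than the retention threshold.
	SeriesSpan(name string, threshold int64) (start, end []byte)
	// SeriesPrefixEnd returns the first key sorting after every key of the
	// named series (i.e. PrefixEnd([ts prefix][series name])).
	SeriesPrefixEnd(name string) []byte
	// RollupAndDelete rolls the keys in [start, end) into the low-resolution
	// series and deletes the originals.
	RollupAndDelete(start, end []byte) error
}

// cullRange walks the time series keys within a single range, rolling up and
// deleting data older than the threshold while skipping over keys that are
// still inside the retention window.
func cullRange(it seriesIterator, rangeStart, rangeEnd []byte, threshold int64) error {
	key, ok := it.SeekGE(rangeStart) // step 1: first time series key in the range
	for ok && bytes.Compare(key, rangeEnd) < 0 {
		name, err := it.SeriesName(key) // step 2: recover the series name
		if err != nil {
			return err
		}
		// Step 3: roll up and delete everything older than the threshold.
		start, end := it.SeriesSpan(name, threshold)
		if err := it.RollupAndDelete(start, end); err != nil {
			return err
		}
		// Steps 4-5: jump directly to the next series, if any.
		key, ok = it.SeekGE(it.SeriesPrefixEnd(name))
	}
	return nil
}
```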

This algorithm will avoid scanning keys that do not need to be rolled up; this
is desirable, as once the culling algorithm is in place and has run once, the
majority of time series keys will *not* need to be culled.

The queue will be configured to run only on the range leader for a given range
in order to avoid duplicate work; however, this is *not* necessary for
correctness, as demonstrated in the [Rollup Algorithm](#rollup-algorithm)
section below.

The queue will initially be set to process replicas at the same rate as the
replica GC queue (as of this RFC, one range per 50 milliseconds).

##### Package Dependency

There is one particular complication to this method: *Go package dependencies*.
Knowledge of how to identify and cull time series keys is contained in the `ts`
package, but all logic for replica queues (and all current queues) lives in
`storage`, meaning that one of three things must happen:

+ `storage` can depend on `ts`. This seems to be trivially possible now, but may
be unintuitive to those trying to understand our code-base. For reference, the
`storage` package used to depend on the `sql` package in order to record event
logs, but this eventually became an impediment to new development and had to be
modified.
+ The queue logic could be implemented in `ts`, and `storage` could implement
an interface that allows it to use the `ts` code without a dependency (see the
sketch below).
+ Parts of the `ts` package could be split off into another package that can
intuitively live below `storage`. However, this is likely to be a considerable
portion of `ts` in order to properly implement rollups.

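As an illustration of the second option, `storage` could declare a small
interface that the `ts` package satisfies; the names below are hypothetical:

```go
package storage

import "context"

// timeSeriesCuller is a hypothetical interface that storage could declare and
// ts could implement, letting the culling queue call into ts without storage
// importing the ts package.
type timeSeriesCuller interface {
	// CullTimeSeries rolls up and deletes expired time series data within the
	// given key span (encoded start and end keys).
	CullTimeSeries(ctx context.Context, startKey, endKey []byte) error
}
```
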
Tentatively, we will attempt to use the first method and have `storage`
depend on `ts`; if this is indeed trivially possible, it will be the fastest
way to complete this project.

#### Culling low resolution data

Although the volume is much lower, low-resolution data will still build
up indefinitely unless it is culled. This data will also be culled by the same
algorithm outlined here; however, it will not be rolled up further, but will
simply be deleted.

## Rollup algorithm

The rollup algorithm is intended to be run on a single high-resolution key
identified by the culling algorithm. The algorithm is as follows:

1. Read the data in the key. Each key represents a "slab" of high-resolution
samples captured over a wall-clock hour (up to 360 samples per hour).
2. "Downsample" all of the data in the key into a single sample; the new sample
will have a sum, count, min and max, computed from the samples in the original
key (see the sketch after this list).
3. Write the computed sample as a low-resolution data point into the time series
system; this is exactly the same process used for currently recorded time
series, except it will write to a different key space (with a different key
prefix).
4. Delete the original high-resolution key.

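A minimal sketch of the downsampling step; the structs are illustrative (the
real samples are protobuf-encoded), and taking min and max over each sample's
average value is one plausible reading of step 2:

```go
package rollup

import "math"

// highResSample mirrors the two fields stored per ten-second sample.
type highResSample struct {
	Sum   float64
	Count uint32
}

// lowResSample is the rolled-up, one-hour sample, which adds min and max.
type lowResSample struct {
	Sum      float64
	Count    uint32
	Min, Max float64
}

// rollup collapses up to an hour's worth of high-resolution samples into a
// single low-resolution sample.
func rollup(samples []highResSample) lowResSample {
	if len(samples) == 0 {
		return lowResSample{}
	}
	out := lowResSample{Min: math.Inf(1), Max: math.Inf(-1)}
	for _, s := range samples {
		out.Sum += s.Sum
		out.Count += s.Count
		avg := s.Sum / float64(s.Count)
		out.Min = math.Min(out.Min, avg)
		out.Max = math.Max(out.Max, avg)
	}
	return out
}
```
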
This algorithm is safe to use even in the case where the same key is being
culled by multiple nodes at the same time, because steps 3 and 4 are
*idempotent*. The low-resolution sample generated by each node will be
identical, and the engine-level time series merging system currently discards
duplicate samples. The deletion of the high-resolution key may cause an error on
some of the nodes, but only because the key will have already been deleted.

The end result is that the culled high-resolution key is gone, but a single
sample (representing the entire hour) has been written into a low-resolution
time series with the same name and source.

## Querying Across Culling Boundary

The final component is to allow querying across the culling boundary;
that is, if an incoming time series query wants data from both sides of the
culling boundary, it will have to process data from two different resolutions.

There are no broad design decisions to make here; this is simply a matter
of modifying low-level iterators and querying slightly different data. This
component will likely be the most complicated to actually *write*, but it should
be somewhat easier to *test* than the above algorithms, as there is already
an existing test infrastructure for time series queries.

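As an illustration of the kind of change involved, a query spanning the boundary
could be split into two underlying reads, one per resolution. The types and the
boundary computation below are hypothetical; the real change lives in the query
system's low-level iterators:

```go
package query

// resolution identifies the on-disk sample period of a stored time series.
type resolution int

const (
	resolution10s resolution = iota // high resolution: ten-second samples
	resolution1h                    // low resolution: one-hour samples
)

// span is a [start, end) time interval, in nanoseconds since the Unix epoch.
type span struct {
	start, end int64
}

// resolutionSpans splits a query interval at the culling boundary: data older
// than the boundary is read from the low-resolution series, newer data from
// the high-resolution series.
func resolutionSpans(q span, cullingBoundary int64) map[resolution]span {
	out := map[resolution]span{}
	if q.start < cullingBoundary {
		out[resolution1h] = span{start: q.start, end: min64(q.end, cullingBoundary)}
	}
	if q.end > cullingBoundary {
		out[resolution10s] = span{start: max64(q.start, cullingBoundary), end: q.end}
	}
	return out
}

func min64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

func max64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}
```
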
## Implementation

This system can (and should) be implemented in three distinct phases:

1. The "culling" algorithm will be implemented, but will not roll up the data in
discovered keys; instead, it will simply *delete* the discovered time series by
issuing a DeleteRange command. This will provide the immediate benefit of
limiting the growth of time series data on the cluster.

2. The "rollup" algorithm will be implemented, generating low-resolution data
before deleting the high-resolution data. However, the low-resolution data will
not immediately be accessible for queries.

3. The query system will be modified to consider the low-resolution data.

# Drawbacks

+ Culling represents another periodic process which runs on each node, and which
can occasionally cause unexpected issues.

+ Depending on the exact layout of time series data across ranges, it is
possible that deleting time series could result in empty ranges. Specifically,
this can occur if a range contains data only for a single time series *and* the
subsequent range also contains data for that same time series. If this is a
common occurrence, it could result in a "trail" of ranges with no data, which
might add overhead to storage algorithms that scale with the number of ranges.

# Alternatives

### Alternative Location algorithm

As an alternative to the queue-based location algorithm, we could use a system
where each node maintains a list of time series it has written; given the name
of a series, it is easy to construct a scan range which will return all keys
that need to be culled:

`[ts prefix][series name][0] - [ts prefix][series name][(now - threshold)]`

This will return all keys in the series which are older than the threshold. Note
that this includes time series keys generated by any node, not just the current
node; this is acceptable, as the rollup algorithm can be run on any key from
any node.

This process can also be effectively distributed across nodes with the following
algorithm:

+ Each node's time series module maintains a list of time series it is
responsible for culling. This is initialized to a list of "retired" time series,
and is augmented each time the node writes a time series it has not written
before (in the currently running instance).
+ The time series module maintains a random permutation of the list; this
permutation is randomized again each time a new time series is added. This
should stabilize very quickly, as new time series are not currently added while
a node is running.
+ Each node will periodically attempt to cull data for a single time series;
this starts with the first name in the current permutation, and proceeds through
it in a loop.

In this way, each node eventually attempts to cull all time series (guaranteeing
that each is culled), but the individual nodes proceed through the series in a
random order; this helps to distribute the work across nodes and reduces the
chance of duplicate work. The total speed of work can be tuned by adjusting the
frequency of the per-node culling process.

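A minimal sketch of this per-node scheduling; the type and its methods are
hypothetical, since this alternative was not adopted:

```go
package culling

import "math/rand"

// cullSchedule tracks the series names a node is responsible for culling and
// the randomized order in which it visits them.
type cullSchedule struct {
	order []string // random permutation of known series names
	next  int      // index of the next series to cull
}

// add registers a newly written (or known retired) series name and reshuffles
// the visit order.
func (c *cullSchedule) add(name string) {
	c.order = append(c.order, name)
	rand.Shuffle(len(c.order), func(i, j int) {
		c.order[i], c.order[j] = c.order[j], c.order[i]
	})
	c.next = 0
}

// nextSeries returns the next series to cull, looping through the permutation.
func (c *cullSchedule) nextSeries() (string, bool) {
	if len(c.order) == 0 {
		return "", false
	}
	name := c.order[c.next]
	c.next = (c.next + 1) % len(c.order)
	return name, true
}
```
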
This alternative was rejected due to a complication that occurs when a time
series is "retired": we only know about a time series name if the currently
running process has recorded it, so if a time series is removed from the system,
its data will never be culled. Thus, we must also maintain a list of *retired*
time series names in the event that any are removed. This requires some manual
effort on the part of developers; the consequences for failing to do so are not
especially severe (a limited amount of old data will persist on the cluster),
but this is still considered inferior to the queue-based solution.

### Immediate Rollups

This was the original intention of the time series system: when a
high-resolution data sample is recorded, it is actually directly merged into
both the high-resolution AND the low-resolution time series. The engine-level
time series merging system would then be responsible for properly aggregating
multiple high-resolution samples into a single composite sample in the
low-resolution series.

The advantage of this method is that it does not require queries to use multiple
resolutions, and it allows for the delete-only culling process to be used.

Unfortunately, it is no longer possible due to recent changes which were
required by the replica consistency checker. The engine-level merge component no
longer aggregates samples; it decimates, keeping only the most recent sample
for a period. This was necessary to deal with the unfortunate reality of Raft
command replays.

### Opportunistic Rollups

Instead of rolling up high-resolution data when it is deleted, it would instead
be rolled up as soon as an entire hour of high-resolution samples has been
collected in a key. That is, at 5:01 it should be appropriate to roll up the
data stored in the 4:00 key. With this alternative, cross-resolution queries can
also be avoided and the delete-only culling method can be used.

However, this introduces additional complications and drawbacks:

+ When querying at low resolution, data from the most recent hour will not be
even partially available.
+ This requires maintaining additional metadata on the cluster about which
keys have already been rolled up.

# Unresolved questions