- Feature Name: Time Series Culling
- Status: in progress
- Start Date: 2016-08-29
- Authors: Matt Tracy
- RFC PR: #9343
- Cockroach Issue: #5910

# Summary
Currently, Time Series data recorded by CockroachDB for its own internal
metrics is retained indefinitely. High-resolution metrics data quickly loses
utility as it ages, consuming disk space and creating range-related overhead
without conferring an appropriate benefit.

The simplest solution to deal with this would be to build a system that deletes
time series data older than a certain threshold; however, this RFC suggests a
mechanism for "rolling up" old time series from the system into a lower
resolution that is still retained. This will allow us to keep some metrics
information indefinitely, which can be used for historical performance
evaluation, without needing to keep an unacceptably expensive amount of
information.

Fully realizing this solution has three components:

1. A distributed "culling" algorithm that occasionally searches for
high-resolution time series data older than a certain threshold and runs a
"roll-up" process on the discovered keys.
2. A "roll-up" process that computes low-resolution time series data from the
existing data in a high-resolution time series key, deleting the high-resolution
key in the process.
3. Modifications to the query system to utilize underlying data which is stored
at multiple resolutions (currently only supports a single resolution). This
includes the use of data at different resolutions to serve a single query.

# Motivation

In our test clusters, time series create a very large amount of data (on the
order of several gigabytes per week) which quickly loses utility as it ages.

To estimate how much data this is, we first observe the data usage of a single
time series. A single time series stores data as contiguous samples representing
ten-second intervals; all samples for a wall-clock hour are stored in a single
key. In the engine, the keys look like this:

| Key                                                       | Key Size | Value Size |
|-----------------------------------------------------------|----------|------------|
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T17:00:00Z  | 30       | 5670       |
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T18:00:00Z  | 30       | 5535       |
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T19:00:00Z  | 30       | 5046       |

The above is the data stored for one time series over three complete hours.
Notice the variation in the size of the values; this is due to the fact that
samples may be absent for some ten-second periods, due to the asynchronous
nature of this system. For our purposes, we will estimate the size of a single
hour of data for a single time series to be *5500* bytes, or 5.5K.

The total disk usage of high-resolution data on the cluster can thus be
estimated with the following function:

`Total bytes = [bytes per time series hour] * [# of time series per node] * [# of nodes] * [# of hours]`

Thus, data accumulates over time, and as more nodes are added (or if later
versions of cockroach add additional time series), the rate of new time series
data being accumulated increases linearly. As of this writing, each single-node
store records **242** time series. Thus, the bytes needed per hour on a ten-node
cluster is:

`Total Bytes (hour) = 5500 * 242 * 10 = 13310000 (12.69 MiB)`

After just one week:

`Total Bytes (week) = 12.69 MiB * 168 hours = 2.08 GiB`
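The same arithmetic, expressed as a small runnable sketch for trying other
cluster sizes and retention windows (the constants are the estimates from this
section, not measured values):

```go
package main

import "fmt"

// Rough storage estimate for high-resolution time series data, using the
// per-hour slab size and per-store series count estimated above.
const (
	bytesPerSeriesHour = 5500 // estimated size of one hour-long slab for one series
	seriesPerNode      = 242  // time series recorded by a single-node store
)

func estimateBytes(nodes, hours int) int {
	return bytesPerSeriesHour * seriesPerNode * nodes * hours
}

func main() {
	const mib = 1 << 20
	const gib = 1 << 30
	fmt.Printf("10 nodes, 1 hour: %.2f MiB\n", float64(estimateBytes(10, 1))/mib)
	fmt.Printf("10 nodes, 1 week: %.2f GiB\n", float64(estimateBytes(10, 24*7))/gib)
	fmt.Printf("10 nodes, 1 year: %.2f GiB\n", float64(estimateBytes(10, 24*365))/gib)
}
```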
As time passes, this data can represent a large share (or in the case of idle
clusters, the majority) of in-use data on the cluster. This data will also
continue to build indefinitely; a static CockroachDB cluster will eventually
consume all available disk space, even if no external data is written! With just
the current time series, a ten-node cluster will generate well over a hundred
gigabytes of metrics data over a single year, before accounting for replication.

The prompt culling of old data is thus a clear area of improvement for
CockroachDB. However, rather than simply deleting data older than a threshold,
this RFC proposes a solution which efficiently keeps metrics data for a longer
time span by downsampling it to a much lower resolution on disk.

To put some numbers on this: currently, all metrics on disk are stored in a
format which is downsampled to _ten second sample periods_; this is the
"high-resolution" data. We are looking to delete this data when it is older
than a certain threshold, which will likely be set in the range of _2-4 weeks_.
We also propose that, when this data is deleted, it is first downsampled further
into _one hour sample periods_; this is the "low-resolution" data. This data
will be kept for a much longer time, likely _6-12 months_, but perhaps longer.

In the lower resolution, each datapoint represents the same data as an _entire
slab_ of high-resolution data (at the ten second resolution, data is stored in
slabs corresponding to a wall-clock hour; each slab contains up to 360 samples).
Thus, the expected data storage of the low resolution is approximately _180x
smaller_ than the high resolution (not 360x, because the individual low-resolution
samples will include a "min" and "max" value not present at the high resolution;
the high-resolution keys only contain a "sum" and "count" field).

By keeping data at the low resolution, users will still be able to inspect
cluster performance over larger time scales, without requiring the storage of
an excessive amount of metrics data.

# Detailed design

## Culling algorithm

The culling algorithm is responsible for identifying high-resolution time series
keys that are older than a system-set threshold. Once identified, the keys are
passed into the rollup/delete process.

There are two primary design requirements of the culling algorithm:

1. From a single node, efficiently locating time series keys which need to be
culled.
2. Across the cluster, efficiently distributing the task of culling with minimal
coordination between nodes.

#### Locating Time Series Keys

Locating time series keys to be culled is not completely trivial due to the
construction of time series keys, which is thus:
`[ts prefix][series name][timestamp][source]`

> Example: "ts/cr.node.sql.inserts/1473739200/1" would contain time series data
> for "cr.node.sql.inserts" on September 13th 2016 between 4am-5am UTC,
> specifically for node 1.

Because of this construction, which prioritizes name over timestamp, the most
recent time series data for series "A" would sort *before* the oldest time
series data for series "B". This means that we cannot simply cull the beginning
of the time series range.
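A short sketch makes the ordering problem concrete. The keys below are a
simplified, human-readable rendering of the layout above (the real keys are
binary-encoded), but the sort-order property is the same:

```go
package main

import (
	"fmt"
	"sort"
)

// tsKey builds a simplified, readable version of a time series key:
// [ts prefix][series name][timestamp][source].
func tsKey(name string, hourTimestamp int64, source string) string {
	return fmt.Sprintf("ts/%s/%d/%s", name, hourTimestamp, source)
}

func main() {
	keys := []string{
		tsKey("cr.node.sql.inserts", 1473739200, "1"), // old data for series "A"
		tsKey("cr.node.sql.inserts", 1475000000, "1"), // recent data for series "A"
		tsKey("cr.store.replicas", 1473739200, "1"),   // old data for series "B"
	}
	sort.Strings(keys)
	// Even the most recent slab of series "A" sorts before the oldest slab of
	// series "B": expired data is interleaved throughout the time series key
	// range rather than clustered at its start.
	for _, k := range keys {
		fmt.Println(k)
	}
}
```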
The simplest alternative would be to scan the entire time series key range
looking for keys old enough to be culled; however, this is considered to be a
burdensome scan due to the number of keys that are retained and do not need to
be culled. For a per-node time series being recorded on a 10-node cluster with a
2-week retention period, we would expect to retain (10 x 24 x 14) = *3360* keys
that should not be culled. In a system that maintains dozens, possibly hundreds
of time series, this is a lot of data for each node to scan on a regular basis.

However, this scan can be effectively distributed across the cluster by creating
a new *replica queue* which searches for time series keys. The new queue can
quickly determine if each range contains time series keys (by inspecting
start/end keys); for ranges that do contain time series keys, specific keys
can then be inspected at the engine level. This means that key inspections do
not require network calls, and the number of keys that can be inspected at once
is limited to the size of a range.

Once the queue discovers a range that contains time series keys, the scanning
process does not need to inspect every key on the range. The algorithm is as
follows (a sketch of this loop appears at the end of this subsection):

1. Find the first time series key in the range (scan for [ts prefix]).
2. Deconstruct the key to retrieve its name.
3. Run the rollup/delete operation on all keys in the range:
`[ts prefix][series name][0] - [ts prefix][series name][now - threshold]`
4. Find the next key on the range which contains data for a different time
series by searching for key `PrefixEnd([ts prefix][series name])`.
5. If a key was found in step 4, return to step 2 with that series name.

This algorithm will avoid scanning keys that do not need to be rolled up; this
is desirable, as once the culling algorithm is in place and has run once, the
majority of time series keys will *not* need to be culled.

The queue will be configured to run only on the range leader for a given range
in order to avoid duplicate work; however, this is *not* necessary for
correctness, as demonstrated in the [Rollup Algorithm](#rollup-algorithm)
section below.

The queue will initially be set to process replicas at the same rate as the
replica GC queue (as of this RFC, one range per 50 milliseconds).
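The following is a self-contained sketch of the per-range loop above, using the
same simplified string keys as the earlier example and a sorted slice standing
in for a range's keys; the real implementation would seek an engine iterator
rather than searching a slice, and the helper names here are illustrative only:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const tsPrefix = "ts/"

// cullRange visits each distinct series stored on a "range" exactly once: it
// issues a rollup/delete over the series' expired span, then seeks directly to
// the next series instead of scanning every retained key in between.
func cullRange(keys []string, threshold int64) {
	// Step 1: find the first time series key in the range.
	i := sort.SearchStrings(keys, tsPrefix)
	for i < len(keys) && strings.HasPrefix(keys[i], tsPrefix) {
		// Step 2: deconstruct the key to retrieve the series name.
		name := strings.SplitN(strings.TrimPrefix(keys[i], tsPrefix), "/", 2)[0]
		seriesPrefix := tsPrefix + name + "/"

		// Step 3: roll up / delete everything in
		// [series prefix][0] - [series prefix][now - threshold].
		fmt.Printf("cull %s%d - %s%d\n", seriesPrefix, 0, seriesPrefix, threshold)

		// Steps 4-5: seek to PrefixEnd(series prefix); if a key for a different
		// series is found, loop back to step 2 with that series.
		i = sort.SearchStrings(keys, seriesPrefix+"\xff")
	}
}

func main() {
	keys := []string{
		"ts/cr.node.sql.inserts/1473739200",
		"ts/cr.node.sql.inserts/1475000000",
		"ts/cr.store.replicas/1473739200",
		"ts/cr.store.replicas/1475000000",
	}
	cullRange(keys, 1474000000)
}
```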
##### Package Dependency

There is one particular complication to this method: *go package dependency*.
Knowledge of how to identify and cull time series keys is contained in the `ts`
package, but all logic for replica queues (and all current queues) lives in
`storage`, meaning that one of three things must happen:

+ `storage` can depend on `ts`. This seems to be trivially possible now, but may
be unintuitive to those trying to understand our code-base. For reference, the
`storage` package used to depend on the `sql` package in order to record event
logs, but this eventually became an impediment to new development and had to be
modified.
+ The queue logic could be implemented in `ts`, and `storage` could implement
an interface that allows it to use the `ts` code without a dependency.
+ Parts of the `ts` package could be split off into another package that can
intuitively live below `storage`. However, this is likely to be a considerable
portion of `ts` in order to properly implement rollups.

Tentatively, we will be attempting to use the first method and have `storage`
depend on `ts`; if it is indeed trivially possible, this will be the fastest
method of completing this project.

#### Culling low resolution data

Although the volume is much lower, low-resolution data will still build
up indefinitely unless it is culled. This data will also be culled by the same
algorithm outlined here; however, it will not be rolled up further, but will
simply be deleted.

## Rollup algorithm

The rollup algorithm is intended to be run on a single high-resolution key
identified by the culling algorithm. The algorithm is as follows:

1. Read the data in the key. Each key represents a "slab" of high-resolution
samples captured over a wall-clock hour (up to 360 samples per hour).
2. "Downsample" all of the data in the key into a single sample; the new sample
will have a sum, count, min and max, computed from the samples in the original
key.
3. Write the computed sample as a low-resolution data point into the time series
system; this is exactly the same process used to record time series today,
except it will be writing to a different key space (with a different key
prefix).
4. Delete the original high-resolution key.

This algorithm is safe to use, even in the case where the same key is being
culled by multiple nodes at the same time; this is because steps 3 and 4 are
currently *idempotent*. The low-resolution sample generated by each node will be
identical, and the engine-level time series merging system currently discards
duplicate samples. The deletion of the high-resolution key may cause an error on
some of the nodes, but only because the key will have already been deleted.

The end result is that the culled high-resolution key is gone, but a single
sample (representing the entire hour) has been written into a low-resolution
time series with the same name and source.
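To illustrate step 2, here is a minimal sketch of the downsampling computation.
The structs are simplifications of the internal sample format, and deriving min
and max from each sample's average is an assumption made for the sketch; the
RFC only requires that the rolled-up sample carry sum, count, min, and max:

```go
package main

import (
	"fmt"
	"math"
)

// highResSample is a simplified stand-in for a ten-second sample in a
// high-resolution slab; these samples carry only a sum and a count.
type highResSample struct {
	Sum   float64
	Count uint32
}

// lowResSample is the rolled-up form: a single sample covering the whole hour,
// adding min and max fields.
type lowResSample struct {
	Sum      float64
	Count    uint32
	Min, Max float64
}

// rollup collapses an hour-long slab of high-resolution samples into one
// low-resolution sample, as in step 2 of the rollup algorithm.
func rollup(slab []highResSample) lowResSample {
	out := lowResSample{Min: math.Inf(1), Max: math.Inf(-1)}
	for _, s := range slab {
		out.Sum += s.Sum
		out.Count += s.Count
		avg := s.Sum / float64(s.Count)
		out.Min = math.Min(out.Min, avg)
		out.Max = math.Max(out.Max, avg)
	}
	return out
}

func main() {
	slab := []highResSample{{Sum: 12, Count: 1}, {Sum: 30, Count: 2}, {Sum: 7, Count: 1}}
	fmt.Printf("%+v\n", rollup(slab))
}
```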
## Querying Across Culling Boundary

The final component of this is to allow querying across the culling boundary;
that is, if an incoming time series query wants data from both sides of the
culling boundary, it will have to process data from two different resolutions.

There are no broad design decisions to make here; this is simply a matter
of modifying low-level iterators and querying slightly different data. This
component will likely be the most complicated to actually *write*, but it should
be somewhat easier to *test* than the above algorithms, as there is already
an existing test infrastructure for time series queries.

## Implementation

This system can (and should) be implemented in three distinct phases:

1. The "culling" algorithm will be implemented, but will not roll up the data in
discovered keys; instead, it will simply *delete* the discovered time series by
issuing a DeleteRange command. This will provide the immediate benefit of
limiting the growth of time series data on the cluster.

2. The "rollup" algorithm will be implemented, generating low-resolution data
before deleting the high-resolution data. However, the low-resolution data will
not immediately be accessible for queries.

3. The query system will be modified to consider the low-resolution data.

# Drawbacks

+ Culling represents another periodic process which runs on each node, which can
occasionally cause unexpected issues.

+ Depending on the exact layout of time series data across ranges, it is
possible that deleting time series could result in empty ranges. Specifically,
this can occur if a range contains data only for a single time series *and* the
subsequent range also contains data for that same time series. If this is a
common occurrence, it could result in a "trail" of ranges with no data, which
might add overhead to storage algorithms that scale with the number of ranges.

# Alternatives

### Alternative Location algorithm

As an alternative to the queue-based location algorithm, we could use a system
where each node maintains a list of time series it has written; given the name
of a series, it is easy to construct a scan range which will return all keys
that need to be culled:

`[ts prefix][series name][0] - [ts prefix][series name][(now - threshold)]`

This will return all keys in the series which are older than the threshold. Note
that this includes time series keys generated by any node, not just the current
node; this is acceptable, as the rollup algorithm can be run on any key from
any node.

This process can also be effectively distributed across nodes with the following
algorithm (sketched below):

+ Each node's time series module maintains a list of time series it is
responsible for culling. This is initialized to a list of "retired" time series,
and is augmented each time the node writes a time series it has not written
before (in the currently running instance).
+ The time series module maintains a random permutation of the list; this
permutation is randomized again each time a new time series is added. This
should normalize very quickly, as new time series are not currently added while
a node is running.
+ Each node will periodically attempt to cull data for a single time series;
this starts with the first name in the current permutation, and proceeds through
it in a loop.

In this way, each node eventually attempts to cull all time series (guaranteeing
that each is culled), but the individual nodes proceed through the series in a
random order - this helps to distribute the work across nodes, and helps to
avoid the chance of duplicate work. The total speed of work can be tuned by
adjusting the frequency of the per-node culling process.
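A minimal sketch of the per-node scheduling this alternative describes; the
type and helper names are illustrative only and do not exist in the codebase:

```go
package main

import (
	"fmt"
	"math/rand"
)

// culler holds the per-node state described above: the series names this node
// has written (plus any retired names), kept in a random permutation that the
// node walks in a loop.
type culler struct {
	names []string // random permutation of known series names
	next  int      // position of the next series to cull
}

// addSeries registers a newly written series and reshuffles the permutation.
func (c *culler) addSeries(name string) {
	c.names = append(c.names, name)
	rand.Shuffle(len(c.names), func(i, j int) {
		c.names[i], c.names[j] = c.names[j], c.names[i]
	})
	c.next = 0
}

// cullOne culls the next series in the permutation, wrapping around forever.
func (c *culler) cullOne() {
	if len(c.names) == 0 {
		return
	}
	name := c.names[c.next]
	c.next = (c.next + 1) % len(c.names)
	// The real work would scan and roll up
	// [ts prefix][name][0] - [ts prefix][name][(now - threshold)].
	fmt.Println("culling series", name)
}

func main() {
	c := &culler{}
	for _, name := range []string{"cr.node.sql.inserts", "cr.store.replicas"} {
		c.addSeries(name)
	}
	for i := 0; i < 4; i++ {
		c.cullOne() // would be driven by a periodic timer on a real node
	}
}
```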
This alternative was rejected due to a complication that occurs when a time
series is "retired"; we only know about a time series name if the currently
running process has recorded it. If a time series is removed from the system,
its data will never be culled. Thus, we must also maintain a list of *retired*
time series names in the event that any are removed. This requires some manual
effort on the part of developers; the consequences for failing to do so are not
especially severe (a limited amount of old data will persist on the cluster),
but this is still considered inferior to the queue-based solution.

### Immediate Rollups

This was the original intention of the time series system: when a
high-resolution data sample is recorded, it is actually directly merged into
both the high-resolution AND the low-resolution time series. The engine-level
time series merging system would then be responsible for properly aggregating
multiple high-resolution samples into a single composite sample in the
low-resolution series.

The advantage of this method is that it does not require queries to use multiple
resolutions, and it allows for the delete-only culling process to be used. This
was also the original design of the time series system.

Unfortunately, it is not currently possible due to recent changes which were
required by the replica consistency checker. The engine-level merge component no
longer aggregates samples; it decimates, discarding all but the most recent
sample for a period. This was necessary to deal with the unfortunate reality of
raft command replays.

### Opportunistic Rollups

Instead of rolling up high-resolution data only when it is deleted, it would
instead be rolled up as soon as an entire hour of high-resolution samples has
been collected in a key. That is, at 5:01 it should be appropriate to roll up
the data stored in the 4:00 key. With this alternative, cross-resolution queries
can also be avoided and the delete-only culling method can be used.

However, this introduces additional complications and drawbacks:

+ When querying at low resolution, data from the most recent hour will not be
even partially available.
+ This requires maintaining additional metadata on the cluster about which
keys have already been rolled up.

# Unresolved questions