- Feature Name: Time Series Culling
- Status: in progress
- Start Date: 2016-08-29
- Authors: Matt Tracy
- RFC PR: #9343
- Cockroach Issue: #5910

# Summary
Currently, Time Series data recorded by CockroachDB for its own internal
metrics is retained indefinitely. High-resolution metrics data quickly loses
utility as it ages, consuming disk space and creating range-related overhead
without conferring an appropriate benefit.

The simplest solution to deal with this would be to build a system that deletes
time series data older than a certain threshold; however, this RFC suggests a
mechanism for "rolling up" old time series from the system into a lower
resolution that is still retained. This will allow us to keep some metrics
information indefinitely, which can be used for historical performance
evaluation, without needing to keep an unacceptably expensive amount of
information.

Fully realizing this solution has three components:

1. A distributed "culling" algorithm that occasionally searches for
high-resolution time series data older than a certain threshold and runs a
"roll-up" process on the discovered keys.
2. A "roll-up" process that computes low-resolution time series data from the
existing data in a high-resolution time series key, deleting the high-resolution
key in the process.
3. Modifications to the query system to utilize underlying data which is stored
at multiple resolutions (the query system currently supports only a single
resolution). This includes the use of data at different resolutions to serve a
single query.

# Motivation

In our test clusters, time series create a very large amount of data (on the
order of several gigabytes per week) which quickly loses utility as it ages.

To estimate how much data this is, we first observe the data usage of a single
time series. A single time series stores data as contiguous samples representing
ten-second intervals; all samples for a wall-clock hour are stored in a single
key. In the engine, the keys look like this:

| Key                                                      | Key Size | Value Size |
|----------------------------------------------------------|----------|------------|
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T17:00:00Z | 30       | 5670       |
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T18:00:00Z | 30       | 5535       |
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T19:00:00Z | 30       | 5046       |

The above is the data stored for one time series over three complete hours.
Notice the variation in the size of the values; this is due to the fact that
samples may be absent for some ten-second periods, due to the asynchronous
nature of this system. For our purposes, we will estimate the size of a single
hour of data for a single time series to be *5500* bytes, or 5.5K.

The total disk usage of high-resolution data on the cluster can thus be
estimated with the following function:

` Total bytes = [bytes per time series hour] * [# of time series per node] * [# of nodes] * [# of hours] `

Thus, data accumulates over time, and as more nodes are added (or if later
versions of CockroachDB add additional time series), the rate at which new time
series data accumulates increases linearly. As of this writing, each single-node
store records **242** time series. Thus, the data generated per hour on a
ten-node cluster is:

`Total Bytes (hour) = 5500 * 242 * 10 = 13310000 (12.69 MiB)`

After just one week:

`Total Bytes (week) = 12.69 MiB * 168 hours = 2.08 GiB`

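For reference, a minimal Go sketch of this estimate; the constants are the
figures assumed above, not values read from a live cluster:

```go
package main

import "fmt"

// Illustrative constants taken from the estimates above: ~5500 bytes per time
// series per hour, and 242 time series recorded per node.
const (
	bytesPerSeriesHour = 5500
	seriesPerNode      = 242
)

// estimatedBytes applies the formula:
// [bytes per ts hour] * [# of ts per node] * [# of nodes] * [# of hours].
func estimatedBytes(nodes, hours int) int {
	return bytesPerSeriesHour * seriesPerNode * nodes * hours
}

func main() {
	fmt.Println(estimatedBytes(10, 1))   // one hour, ten nodes: 13310000 bytes (~12.69 MiB)
	fmt.Println(estimatedBytes(10, 168)) // one week, ten nodes: ~2.08 GiB
}
```
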
As time passes, this data can represent a large share (or in the case of idle
clusters, the majority) of in-use data on the cluster. This data will also
continue to build indefinitely; a static CockroachDB cluster will eventually
consume all available disk space, even if no external data is written! With just
the current time series, a ten-node cluster will generate well over a hundred
gigabytes of metrics data over a single year
(`2.08 GiB/week * 52 weeks ≈ 108 GiB`).

The prompt culling of old data is thus a clear area of improvement for
CockroachDB. However, rather than simply deleting data older than a threshold,
this RFC proposes a solution which efficiently keeps metrics data for a longer
time span by downsampling it to a much lower resolution on disk.

To give these numbers some context: currently, all metrics on disk are stored in
a format which is downsampled to _ten second sample periods_; this is the
"high-resolution" data. We are looking to delete this data when it is older
than a certain threshold, which will likely be set in the range of _2-4 weeks_.
We also propose that, when this data is deleted, it is first downsampled further
into _one hour sample periods_; this is the "low-resolution" data. This data
will be kept for a much longer time, likely _6-12 months_, but perhaps longer.

In the lower resolution, each datapoint represents the same data as an _entire
slab_ of high-resolution data (at the ten-second resolution, data is stored in
slabs corresponding to a wall-clock hour; each slab contains up to 360 samples).
Thus, the expected data storage of the low resolution is approximately _180x
smaller_ than the high resolution (not 360x: each high-resolution sample
contains only a "sum" and "count" field, while each low-resolution sample adds
"min" and "max" values, doubling the per-sample size).

By keeping data at the low resolution, users will still be able to inspect
cluster performance over larger time scales, without requiring the storage of
an excessive amount of metrics data.

# Detailed design

## Culling algorithm

The culling algorithm is responsible for identifying high-resolution time series
keys that are older than a system-set threshold. Once identified, the keys are
passed into the rollup/delete process.

There are two primary design requirements of the culling algorithm:

1. From a single node, efficiently locating time series keys which need to be
culled.
2. Across the cluster, efficiently distributing the task of culling with minimal
coordination between nodes.

#### Locating Time Series Keys

Locating time series keys to be culled is not completely trivial due to the
construction of time series keys, which is as follows:
`[ts prefix][series name][timestamp][source]`

> Example: "ts/cr.node.sql.inserts/1473739200/1" would contain time series data
> for "cr.node.sql.inserts" on September 13th 2016 between 4am-5am UTC,
> specifically for node 1.

Because of this construction, which prioritizes name over timestamp, the most
recent time series data for series "A" would sort *before* the oldest time
series data for series "B". This means that we cannot simply cull the beginning
of the time series range.

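To make the ordering concrete, here is a minimal sketch; the `tsKey` helper is a
hypothetical, human-readable stand-in for the real binary key encoding:

```go
package main

import (
	"fmt"
	"sort"
)

// tsKey mimics the [ts prefix][series name][timestamp][source] layout using a
// slash-separated string; the real keys are binary-encoded.
func tsKey(name string, timestamp int64, source string) string {
	return fmt.Sprintf("ts/%s/%d/%s", name, timestamp, source)
}

func main() {
	keys := []string{
		tsKey("cr.node.sql.selects", 1473739200, "1"), // old data for series B
		tsKey("cr.node.sql.inserts", 1479168000, "1"), // recent data for series A
		tsKey("cr.node.sql.inserts", 1473739200, "1"), // old data for series A
	}
	sort.Strings(keys)
	// After sorting, every key for series A (including the most recent) comes
	// before the oldest key for series B, so expired data is not clustered at
	// the front of the time series key space.
	for _, k := range keys {
		fmt.Println(k)
	}
}
```
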
The simplest alternative would be to scan the entire time series range looking
for expired keys; however, this scan would be burdensome because most of the
keys it visits will not need to be culled. For a per-node time series being
recorded on a 10-node cluster with a 2 week retention period, we would expect to
retain (10 x 24 x 14) = *3360* keys that should not be culled. In a system that
maintains dozens, possibly hundreds of time series, this is a lot of data for
each node to scan on a regular basis.

However, this scan can be effectively distributed across the cluster by creating
a new *replica queue* which searches for time series keys. The new queue can
quickly determine if each range contains time series keys (by inspecting
start/end keys); for ranges that do contain time series keys, specific keys
can then be inspected at the engine level. This means that key inspections do
not require network calls, and the number of keys that can be inspected at once
is limited to the size of a range.

Once the queue discovers a range that contains time series keys, the scanning
process does not need to inspect every key on the range. The algorithm is as
follows (a sketch appears after the list):

1. Find the first time series key in the range (scan for [ts prefix]).
2. Deconstruct the key to retrieve its name.
3. Run the rollup/delete operation on all keys in the range:
    `[ts prefix][series name][0] - [ts prefix][series name][now - threshold]`
4. Find the next key on the range which contains data for a different time
series by searching for key `PrefixEnd([ts prefix][series name])`.
5. If a key was found in step 4, return to step 2 with that series name.
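
A minimal Go sketch of this loop follows; the `seriesIterator` interface and its
methods are hypothetical stand-ins for the real key-encoding and engine APIs:

```go
package culling

import "bytes"

// seriesIterator abstracts the engine-level operations the per-range culling
// scan needs. It is a hypothetical interface, not the real storage API.
type seriesIterator interface {
	// SeekGE positions the iterator at the first time series key at or after
	// the given key, returning false if there is none.
	SeekGE(key []byte) ([]byte, bool)
	// SeriesName decodes the series name embedded in a time series key.
	SeriesName(key []byte) (string, error)
	// SeriesSpan returns the [start, end) key span covering all data for the
	// named series with timestamps older than the retention threshold.
	SeriesSpan(name string, threshold int64) (start, end []byte)
	// SeriesPrefixEnd returns the first key sorting after every key of the
	// named series (i.e. PrefixEnd([ts prefix][series name])).
	SeriesPrefixEnd(name string) []byte
	// RollupAndDelete rolls the keys in [start, end) into the low-resolution
	// series and deletes the originals.
	RollupAndDelete(start, end []byte) error
}

// cullRange walks the time series keys within a single range, rolling up and
// deleting data older than the threshold while skipping over keys that are
// still inside the retention window.
func cullRange(it seriesIterator, rangeStart, rangeEnd []byte, threshold int64) error {
	key, ok := it.SeekGE(rangeStart) // step 1: first time series key in the range
	for ok && bytes.Compare(key, rangeEnd) < 0 {
		name, err := it.SeriesName(key) // step 2: recover the series name
		if err != nil {
			return err
		}
		// Step 3: roll up and delete everything older than the threshold.
		start, end := it.SeriesSpan(name, threshold)
		if err := it.RollupAndDelete(start, end); err != nil {
			return err
		}
		// Steps 4-5: jump directly to the next series, if any.
		key, ok = it.SeekGE(it.SeriesPrefixEnd(name))
	}
	return nil
}
```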

This algorithm will avoid scanning keys that do not need to be rolled up; this
is desirable, as once the culling algorithm is in place and has run once, the
majority of time series keys will *not* need to be culled.

The queue will be configured to run only on the range leader for a given range
in order to avoid duplicate work; however, this is *not* necessary for
correctness, as demonstrated in the [Rollup Algorithm](#rollup-algorithm)
section below.

The queue will initially be set to process replicas at the same rate as the
replica GC queue (as of this RFC, one range per 50 milliseconds).

##### Package Dependency

There is one particular complication to this method: *Go package dependencies*.
Knowledge of how to identify and cull time series keys is contained in the `ts`
package, but all logic for replica queues (and all current queues) lives in
`storage`, meaning that one of three things must happen:

+ `storage` can depend on `ts`. This seems to be trivially possible now, but may
be unintuitive to those trying to understand our code-base. For reference, the
`storage` package used to depend on the `sql` package in order to record event
logs, but this eventually became an impediment to new development and had to be
modified.
+ The queue logic could be implemented in `ts`, and `storage` could implement
an interface that allows it to use the `ts` code without a dependency (see the
sketch below).
+ Parts of the `ts` package could be split off into another package that can
intuitively live below `storage`. However, this is likely to be a considerable
portion of `ts` in order to properly implement rollups.

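As an illustration of the second option, `storage` could declare a small
interface that the `ts` package satisfies; the names below are hypothetical:

```go
package storage

import "context"

// timeSeriesCuller is a hypothetical interface that storage could declare and
// ts could implement, letting the culling queue call into ts without storage
// importing the ts package.
type timeSeriesCuller interface {
	// CullTimeSeries rolls up and deletes expired time series data within the
	// given key span (encoded start and end keys).
	CullTimeSeries(ctx context.Context, startKey, endKey []byte) error
}
```
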
Tentatively, we will attempt to use the first method and have `storage`
depend on `ts`; if this is indeed trivially possible, it will be the fastest
way to complete this project.

#### Culling low resolution data

Although the volume is much lower, low-resolution data will still build
up indefinitely unless it is culled. This data will also be culled by the same
algorithm outlined here; however, it will not be rolled up further, but will
simply be deleted.

## Rollup algorithm

The rollup algorithm is intended to be run on a single high-resolution key
identified by the culling algorithm. The algorithm is as follows:

1. Read the data in the key. Each key represents a "slab" of high-resolution
samples captured over a wall-clock hour (up to 360 samples per hour).
2. "Downsample" all of the data in the key into a single sample; the new sample
will have a sum, count, min and max, computed from the samples in the original
key (see the sketch after this list).
3. Write the computed sample as a low-resolution data point into the time series
system; this is exactly the same process used for currently recorded time
series, except it will write to a different key space (with a different key
prefix).
4. Delete the original high-resolution key.

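A minimal sketch of the downsampling step; the structs are illustrative (the
real samples are protobuf-encoded), and taking min and max over each sample's
average value is one plausible reading of step 2:

```go
package rollup

import "math"

// highResSample mirrors the two fields stored per ten-second sample.
type highResSample struct {
	Sum   float64
	Count uint32
}

// lowResSample is the rolled-up, one-hour sample, which adds min and max.
type lowResSample struct {
	Sum      float64
	Count    uint32
	Min, Max float64
}

// rollup collapses up to an hour's worth of high-resolution samples into a
// single low-resolution sample.
func rollup(samples []highResSample) lowResSample {
	if len(samples) == 0 {
		return lowResSample{}
	}
	out := lowResSample{Min: math.Inf(1), Max: math.Inf(-1)}
	for _, s := range samples {
		out.Sum += s.Sum
		out.Count += s.Count
		avg := s.Sum / float64(s.Count)
		out.Min = math.Min(out.Min, avg)
		out.Max = math.Max(out.Max, avg)
	}
	return out
}
```
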
This algorithm is safe to use even in the case where the same key is being
culled by multiple nodes at the same time, because steps 3 and 4 are
*idempotent*. The low-resolution sample generated by each node will be
identical, and the engine-level time series merging system currently discards
duplicate samples. The deletion of the high-resolution key may cause an error on
some of the nodes, but only because the key will have already been deleted.

The end result is that the culled high-resolution key is gone, but a single
sample (representing the entire hour) has been written into a low-resolution
time series with the same name and source.

## Querying Across Culling Boundary

The final component is to allow querying across the culling boundary;
that is, if an incoming time series query wants data from both sides of the
culling boundary, it will have to process data from two different resolutions.

There are no broad design decisions to make here; this is simply a matter
of modifying low-level iterators and querying slightly different data. This
component will likely be the most complicated to actually *write*, but it should
be somewhat easier to *test* than the above algorithms, as there is already
an existing test infrastructure for time series queries.

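As an illustration of the kind of change involved, a query spanning the boundary
could be split into two underlying reads, one per resolution. The types and the
boundary computation below are hypothetical; the real change lives in the query
system's low-level iterators:

```go
package query

// resolution identifies the on-disk sample period of a stored time series.
type resolution int

const (
	resolution10s resolution = iota // high resolution: ten-second samples
	resolution1h                    // low resolution: one-hour samples
)

// span is a [start, end) time interval, in nanoseconds since the Unix epoch.
type span struct {
	start, end int64
}

// resolutionSpans splits a query interval at the culling boundary: data older
// than the boundary is read from the low-resolution series, newer data from
// the high-resolution series.
func resolutionSpans(q span, cullingBoundary int64) map[resolution]span {
	out := map[resolution]span{}
	if q.start < cullingBoundary {
		out[resolution1h] = span{start: q.start, end: min64(q.end, cullingBoundary)}
	}
	if q.end > cullingBoundary {
		out[resolution10s] = span{start: max64(q.start, cullingBoundary), end: q.end}
	}
	return out
}

func min64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

func max64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}
```
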
## Implementation

This system can (and should) be implemented in three distinct phases:

1. The "culling" algorithm will be implemented, but will not roll up the data in
discovered keys; instead, it will simply *delete* the discovered time series by
issuing a DeleteRange command. This will provide the immediate benefit of
limiting the growth of time series data on the cluster.

2. The "rollup" algorithm will be implemented, generating low-resolution data
before deleting the high-resolution data. However, the low-resolution data will
not immediately be accessible for queries.

3. The query system will be modified to consider the low-resolution data.

# Drawbacks

+ Culling represents another periodic process which runs on each node, and which
can occasionally cause unexpected issues.

+ Depending on the exact layout of time series data across ranges, it is
possible that deleting time series could result in empty ranges. Specifically,
this can occur if a range contains data only for a single time series *and* the
subsequent range also contains data for that same time series. If this is a
common occurrence, it could result in a "trail" of ranges with no data, which
might add overhead to storage algorithms that scale with the number of ranges.

# Alternatives

### Alternative Location algorithm

As an alternative to the queue-based location algorithm, we could use a system
where each node maintains a list of time series it has written; given the name
of a series, it is easy to construct a scan range which will return all keys
that need to be culled:

`[ts prefix][series name][0] - [ts prefix][series name][(now - threshold)]`

This will return all keys in the series which are older than the threshold. Note
that this includes time series keys generated by any node, not just the current
node; this is acceptable, as the rollup algorithm can be run on any key from
any node.

This process can also be effectively distributed across nodes with the following
algorithm:

+ Each node's time series module maintains a list of time series it is
responsible for culling. This is initialized to a list of "retired" time series,
and is augmented each time the node writes a time series it has not written
before (in the currently running instance).
+ The time series module maintains a random permutation of the list; this
permutation is randomized again each time a new time series is added. This
should stabilize very quickly, as new time series are not currently added while
a node is running.
+ Each node will periodically attempt to cull data for a single time series;
this starts with the first name in the current permutation, and proceeds through
it in a loop.

In this way, each node eventually attempts to cull all time series (guaranteeing
that each is culled), but the individual nodes proceed through the series in a
random order; this helps to distribute the work across nodes and reduces the
chance of duplicate work. The total speed of work can be tuned by adjusting the
frequency of the per-node culling process.

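A minimal sketch of this per-node scheduling; the type and its methods are
hypothetical, since this alternative was not adopted:

```go
package culling

import "math/rand"

// cullSchedule tracks the series names a node is responsible for culling and
// the randomized order in which it visits them.
type cullSchedule struct {
	order []string // random permutation of known series names
	next  int      // index of the next series to cull
}

// add registers a newly written (or known retired) series name and reshuffles
// the visit order.
func (c *cullSchedule) add(name string) {
	c.order = append(c.order, name)
	rand.Shuffle(len(c.order), func(i, j int) {
		c.order[i], c.order[j] = c.order[j], c.order[i]
	})
	c.next = 0
}

// nextSeries returns the next series to cull, looping through the permutation.
func (c *cullSchedule) nextSeries() (string, bool) {
	if len(c.order) == 0 {
		return "", false
	}
	name := c.order[c.next]
	c.next = (c.next + 1) % len(c.order)
	return name, true
}
```
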
This alternative was rejected due to a complication that occurs when a time
series is "retired": we only know about a time series name if the currently
running process has recorded it, so if a time series is removed from the system,
its data will never be culled. Thus, we must also maintain a list of *retired*
time series names in the event that any are removed. This requires some manual
effort on the part of developers; the consequences for failing to do so are not
especially severe (a limited amount of old data will persist on the cluster),
but this is still considered inferior to the queue-based solution.

### Immediate Rollups

This was the original intention of the time series system: when a
high-resolution data sample is recorded, it is actually directly merged into
both the high-resolution AND the low-resolution time series. The engine-level
time series merging system would then be responsible for properly aggregating
multiple high-resolution samples into a single composite sample in the
low-resolution series.

The advantage of this method is that it does not require queries to use multiple
resolutions, and it allows for the delete-only culling process to be used.

Unfortunately, it is no longer possible due to recent changes which were
required by the replica consistency checker. The engine-level merge component no
longer aggregates samples; it decimates, keeping only the most recent sample
for a period. This was necessary to deal with the unfortunate reality of Raft
command replays.

### Opportunistic Rollups

Instead of rolling up high-resolution data when it is deleted, it would instead
be rolled up as soon as an entire hour of high-resolution samples has been
collected in a key. That is, at 5:01 it should be appropriate to roll up the
data stored in the 4:00 key. With this alternative, cross-resolution queries can
also be avoided and the delete-only culling method can be used.

However, this introduces additional complications and drawbacks:

+ When querying at low resolution, data from the most recent hour will not be
even partially available.
+ This requires maintaining additional metadata on the cluster about which
keys have already been rolled up.

# Unresolved questions