github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/ts/doc.go (about) 1 // Copyright 2016 The Cockroach Authors. 2 // 3 // Use of this software is governed by the Business Source License 4 // included in the file licenses/BSL.txt. 5 // 6 // As of the Change Date specified in that file, in accordance with 7 // the Business Source License, use of this software will be governed 8 // by the Apache License, Version 2.0, included in the file 9 // licenses/APL.txt. 10 11 /* 12 Package ts provides a basic time series database on top of the underlying 13 CockroachDB key/value datastore. It is used to server basic metrics generated by 14 CockroachDB. 15 16 Storing time series data is a unique challenge for databases. Time series data 17 is typically generated at an extremely high volume, and is queried by providing 18 a range of time of arbitrary size, which can lead to an enormous amount of data 19 being scanned for a query. Many specialized time series databases already exist 20 to meet these challenges; those solutions are built on top of specialized 21 storage engines which are often unsuitable for general data storage needs, but 22 currently superior to CockroachDB for the purpose of time series data. 23 24 However, it is a broad goal of CockroachDB to provide a good experience for 25 developers, and out-of-the-box recording of internal metrics has proven to be a 26 good step towards that goal. This package provides a specialized time series 27 database, relatively narrow in scope, that can store this data with good 28 performance characteristics. 29 30 31 Organization Structure 32 33 Time series data is organized on disk according to two basic, sortable properties: 34 + Time series name (i.e "sql.operations.selects") 35 + Timestamp 36 37 This is optimized for querying data for a single series over multiple 38 timestamps: data for the same series at different timestamps is stored 39 contiguously. 40 41 42 Downsampling 43 44 The amount of data produced by time series sampling can be considerable; storing 45 every incoming data point with perfect fidelity can command a tremendous amount 46 of computing and storage resources. 47 48 However, in many use cases perfect fidelity is not necessary; the exact time a 49 sample was taken is unimportant, with the overall trend of the data over time 50 being far more important to analysis than the individual samples. 51 52 With this in mind, CockroachDB downsamples data before storing it; the original 53 timestamp for each data point in a series is not recorded. CockroachDB instead 54 divides time into contiguous slots of uniform length (currently 10 seconds); if 55 multiple data points for a series fall in the same slot, only the most recent 56 sample is kept. 57 58 In addition to the on-disk downsampling, queries may request further 59 downsampling for returned data. For example, a query may request one datapoint 60 be returned for every 10 minute interval, even though the data is stored 61 internally at a 10 second resolution; the downsampling is performed on the 62 server side before returning the data. One restriction is that a query cannot 63 request a downsampling period which is shorter than the smallest on-disk 64 resolution (e.g. one data point per second). 65 66 67 Slab Storage 68 69 In order to use key space efficiently, we pack data for multiple contiguous 70 samples into "slab" values, with data for each slab stored in a CockroachDB key. 71 This is done by again dividing time into contiguous slots, but with a longer 72 duration; this is known as the "slab duration". For example, CockroachDB 73 downsamples its internal data at a resolution of 10 seconds, but stores it with 74 a "slab duration" of 1 hour, meaning that all samples that fall in the same hour 75 are stored at the same key. This strategy helps reduce the number of keys 76 scanned during a query. 77 78 79 Source Keys 80 81 Another common use case of time series queries is the aggregation of multiple 82 series; for example, you may want to query the same metric (e.g. "queries per 83 second") across multiple machines on a cluster, and aggregate the result. 84 85 Specialized Time-series databases can often aggregate across arbitrary series; 86 however, CockroachDB is specialized for aggregation of the same series across 87 different machines or disks. 88 89 This is done by creating a "source key", typically a node or store ID, which is 90 an optional identifier that is separate from the series name itself. The source 91 key is appended to the key as a suffix, after the series name and timestamp; 92 this means that data that is from the same series and time period, but from 93 different nodes, will be stored contiguously in the key space. Data from all 94 sources in a series can thus be queried in a single scan. 95 96 97 Multiple resolutions 98 99 In order to save space on disk, the database stores older data for time series 100 at lower resolution, more commonly known as a "rollup". 101 102 Each single series is recorded initially at a resolution of 10 seconds - one 103 point is recorded for every ten second interval. Once the data has aged past a 104 configurable threshold (default 10 days), the data is "rolled" up so that it 105 has a single datapoint per 30 minute period. This 30-minute resolution data is 106 retained for 90 days by default. 107 108 Note that each rolled-up datapoint contains the first, last, min, max, sum, 109 count and variance of the original 10 second points used to create the 30 110 minute point; this means that any downsampler that could have been used on the 111 original data is still accessible in the rolled-up data. 112 113 114 Example 115 116 A hypothetical example from CockroachDB: we want to record the available 117 capacity of all stores in the cluster. 118 119 The series name is: cockroach.capacity.available 120 121 Data points for this series are automatically collected from all stores. When data points are 122 written, they are recorded with a source key of: [store id] 123 124 There are 3 stores which contain data: 1, 2 and 3. These are arbitrary and may 125 change over time. 126 127 Data is recorded for January 1st, 2016 between 10:05 pm and 11:05 pm. The data 128 is recorded at a 10 second resolution. 129 130 The data is recorded into keys structurally similar to the following: 131 132 tsd.cockroach.capacity.available.10s.403234.1 133 tsd.cockroach.capacity.available.10s.403234.2 134 tsd.cockroach.capacity.available.10s.403234.3 135 tsd.cockroach.capacity.available.10s.403235.1 136 tsd.cockroach.capacity.available.10s.403235.2 137 tsd.cockroach.capacity.available.10s.403235.3 138 139 Data for each source is stored in two keys: one for the 10 pm hour, and one 140 for the 11pm hour. Each key contains the tsd prefix, the series name, the 141 resolution (10s), a timestamp representing the hour, and finally the series key. The 142 keys will appear in the data store in the order shown above. 143 144 (Note that the keys will NOT be exactly as pictured above; they will be encoded 145 in a way that is more efficient, but is not readily human readable.) 146 */ 147 package ts