
---
title: "Capacity Planning"
linkTitle: "Capacity Planning"
weight: 10
slug: capacity-planning
---

You will want to estimate how many nodes are required, how many of
each component to run, and how much storage space will be required.
In practice, these will vary greatly depending on the metrics being
sent to Cortex.

Some key parameters are:

 1. The number of active series. If you have Prometheus already, you
 can query `prometheus_tsdb_head_series` to see this number.
 2. Sampling rate, e.g. a new sample for each series every minute
 (the default Prometheus [scrape_interval](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)).
 Multiply this by the number of active series to get the
 total rate at which samples will arrive at Cortex.
 3. The rate at which series are added and removed. This can be very
 high if you monitor objects that come and go - for example, if you run
 thousands of batch jobs lasting a minute or so and capture metrics
 with a unique ID for each one. [Read how to analyse this on
 Prometheus](https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality).
 4. How compressible the time-series data are. If a metric stays at
 the same value constantly, then Cortex can compress it very well, so
 12 hours of data sampled every 15 seconds would be around 2KB. On
 the other hand, if the value jumps around a lot, it might take 10KB.
 There are not currently any tools available to analyse this.
 5. How long you want to retain data for, e.g. 1 month or 2 years.

Other parameters which can become important if you have particularly
high values:

 6. Number of different series under one metric name.
 7. Number of labels per series.
 8. Rate and complexity of queries.
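The arithmetic behind parameter 2 is a simple multiplication; a minimal sketch (the input numbers below are made-up examples, not recommendations) might look like:

```python
# Rough ingest-rate estimate. Substitute your own values, e.g. from
# querying prometheus_tsdb_head_series on your existing Prometheus.

active_series = 2_000_000    # value of prometheus_tsdb_head_series
scrape_interval_s = 60       # seconds between samples per series

# One sample per series per scrape interval.
samples_per_sec = active_series / scrape_interval_s
print(f"~{samples_per_sec:,.0f} samples/sec arriving at Cortex")
```

With these example inputs, two million series scraped every minute works out to roughly 33,000 samples/sec.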

Now, some rules of thumb:

 1. Each million series in an ingester takes 15GB of RAM. The total number
 of series in ingesters is the number of active series times the
 replication factor. This is with the default of 12-hour chunks - the RAM
 required will reduce if you set `-ingester.max-chunk-age` lower
 (trading off more back-end database IO).
 There are some additional considerations when planning for ingester memory usage:
    1. Memory increases during write-ahead log (WAL) replay. [See Prometheus issue #6934](https://github.com/prometheus/prometheus/issues/6934#issuecomment-726039115). If you do not have enough memory for WAL replay, the ingester will not be able to restart successfully without intervention.
    2. Memory temporarily increases during resharding, since timeseries are briefly held on both the new and old ingesters. This means you should scale up the number of ingesters before memory utilization gets too high; otherwise you will not have the headroom to absorb the temporary increase.
 2. Each million series (including churn) consumes 15GB of chunk
 storage and 4GB of index, per day (so multiply by the retention
 period).
 3. The distributors' CPU utilization depends on the specific Cortex cluster
    setup, while they don't need much RAM. Typically, distributors are capable
    of processing between 20,000 and 100,000 samples/sec per CPU core. It's also
    highly recommended to configure Prometheus's `max_samples_per_send` to 1,000
    samples, in order to reduce the distributors' CPU utilization given the same
    total samples/sec throughput.
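Putting these rules of thumb together, a back-of-the-envelope sizing might look like the sketch below. All input values are illustrative; replace them with your own measurements.

```python
# Example capacity estimate using the rules of thumb above.

active_series = 10_000_000              # across the whole cluster
replication_factor = 3
retention_days = 30
series_per_day_incl_churn = 12_000_000  # active series plus daily churn
scrape_interval_s = 60                  # one sample per series per minute

# Rule 1: ~15GB of ingester RAM per million in-memory series, where
# in-memory series = active series * replication factor.
ingester_ram_gb = (active_series * replication_factor) / 1e6 * 15

# Rule 2: ~15GB chunk storage + 4GB index per million series per day,
# multiplied by the retention period.
storage_gb = series_per_day_incl_churn / 1e6 * (15 + 4) * retention_days

# Rule 3: 20,000-100,000 samples/sec per distributor CPU core;
# use the low end for a worst-case core count.
samples_per_sec = active_series / scrape_interval_s
distributor_cores_worst_case = samples_per_sec / 20_000

print(f"ingester RAM (total):       {ingester_ram_gb:,.0f} GB")
print(f"storage over {retention_days} days:       {storage_gb:,.0f} GB")
print(f"distributor cores (worst):  {distributor_cores_worst_case:,.1f}")
```

For this example cluster, that comes to 450GB of ingester RAM across the cluster, about 6.8TB of chunk and index storage over the retention period, and roughly 8-9 distributor cores at the pessimistic end of the throughput range.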

If you turn on compression between distributors and ingesters (for
example, to save on inter-zone bandwidth charges at AWS/GCP), they will use
significantly more CPU (approx. 100% more for the distributor and 50% more
for the ingester).
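The `max_samples_per_send` recommendation above is applied in Prometheus's `remote_write` queue configuration; a minimal fragment (the URL is a placeholder for your own Cortex push endpoint) might look like:

```yaml
remote_write:
  - url: http://cortex.example.com/api/v1/push   # placeholder endpoint
    queue_config:
      # Larger batches reduce per-sample distributor CPU cost
      # at the same total samples/sec throughput.
      max_samples_per_send: 1000
```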