---
type: proposal
title: Active Series Limiting for Hashring Topology
status: accepted
owner: saswatamcode
menu: proposals-accepted
---

## 1 Related links/tickets

* https://github.com/thanos-io/thanos/pull/5520
* https://github.com/thanos-io/thanos/pull/5333
* https://github.com/thanos-io/thanos/pull/5402
* https://github.com/thanos-io/thanos/issues/5404
* https://github.com/thanos-io/thanos/issues/4972

## 2 Why

Thanos is built to be a set of components that can be composed into a highly available metrics system with unlimited storage capacity. But to achieve true HA, we need to ensure that tenants in our write path cannot push too much data and cause issues. There need to be limits in place so that our ingestion systems maintain their level of Quality of Service (QoS) and only block the tenant that exceeds the limits.

With limiting, we also need tracking and configuration of those limits to use them reliably.

## 3 Pitfalls of the current solution

We run multiple Thanos Receive replicas in a hashring topology, to ingest metric data from multiple tenants via remote write requests and distribute the load evenly. This allows us to scalably process these requests and write tenant data into a local Receive TSDB for Thanos Querier to query. We also replicate write requests across multiple (usually three) Receive replicas, so that even when replicas are unavailable, we can still ingest the data.

While this composes a scalable and highly available system, sudden increased load, i.e. an increase in [active series](https://grafana.com/docs/grafana-cloud/fundamentals/active-series-and-dpm/) from any tenant, can cause Receive to hit its limits and cause incidents.

We could scale horizontally automatically during such increased load (once we implement [this](https://github.com/thanos-io/thanos/issues/4972)), but it is not yet safe to do so, plus a full solution cannot have unbounded cost scaling. Thus, some form of limits must be put in place to prevent such issues from occurring and causing incidents.

## 4 Audience

* Users who run Thanos at a large scale and would like to benefit from improved stability of the Thanos write path and get a grasp on the amount of data they ingest.

## 5 Goals

* Add a mechanism to get the number of active series per tenant in Thanos and be able to generate meta-monitoring metrics from that (which could even provide us with some “pseudo-billing”).
* Use the same implementation to limit the number of active series per tenant within Thanos and fail remote write requests once they go above a configured limit, for hashring topologies with multiple Receive replicas.
* Explain how such a limit will work for partial errors.

## 6 Non-Goals

* [Request-based limiting](https://github.com/thanos-io/thanos/issues/5404), i.e. the number of samples in a remote write request
* Per-replica-tenant limiting, which is already being discussed in this [PR](https://github.com/thanos-io/thanos/pull/5333)
* Using a [consistent hashing](https://github.com/thanos-io/thanos/issues/4972) implementation in Receive to make it easily scalable

## 7 How

Thanos Receive uses Prometheus TSDB and creates a separate TSDB database instance internally for each of its tenants. When a Receive replica gets a remote write request, it loops through the timeseries, hashes the labels with the tenant name as a prefix, and forwards the remote write request to other Receive nodes. Upon receiving samples in a remote write request from a tenant, the Receive node appends the samples to the in-memory head block of that tenant.
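
As a simplified illustration of the forwarding step described above, the following sketch shows tenant-prefixed hashing onto a set of hashring nodes. This is not Thanos's actual implementation (the real code has its own hash function, hashring configuration, and replication handling); `pickNode` and the FNV hash are illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickNode hashes a series' labels with the tenant name as a prefix and maps
// the result onto one of the hashring nodes, mirroring the idea described
// above. Thanos's real implementation differs in hash function and hashring
// handling; this is only a sketch.
func pickNode(tenant string, labels []string, nodes []string) string {
	h := fnv.New64a()
	h.Write([]byte(tenant))
	for _, l := range labels {
		h.Write([]byte(l))
	}
	return nodes[h.Sum64()%uint64(len(nodes))]
}

func main() {
	nodes := []string{"receive-0", "receive-1", "receive-2"}
	// The same tenant and series always hash to the same node, so each
	// tenant's series end up consistently routed across the hashring.
	a := pickNode("tenant-a", []string{`__name__="up"`, `job="node"`}, nodes)
	b := pickNode("tenant-a", []string{`__name__="up"`, `job="node"`}, nodes)
	fmt.Println(a == b)
}
```

The important property for this proposal is only that routing is deterministic per tenant and series, which is why a tenant's active series are spread (and replicated) across a known set of nodes.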

We can leverage this fact and generate statistics from the TSDB head block, which can give us an accurate idea of the active (head) series of a tenant. This is also exposed as a metric.

Thus, any remote write request can be failed completely, with a 429 status code and an appropriate error message. In certain approaches we can even check whether it would increase the number of active series above the configured limit for a tenant. Partially accepting write requests might lead to confusing results and error semantics, so we propose to avoid this and focus on client-side retries.

There are, however, a few challenges to this, as tenant metric data is distributed and replicated across multiple Thanos Receive replicas. Also, with a multi-replica setup, we have the concepts of per-replica-tenant and per-tenant limits, which can be defined as follows:

* **Per-replica-tenant limit**: The limit imposed on active series per tenant, per replica of Thanos Receive. An initial implementation of this is already WIP in this [PR](https://github.com/thanos-io/thanos/pull/5333). This can also be treated as the active series limit for non-hashring topologies or single-replica Receive setups.

* **Per-tenant limit**: The overall limit imposed on active series per tenant across all replicas of Thanos Receive. This is essentially what this proposal is for.

## 8 Proposal

In general, we would need three measures to impose a limit:

* The current count of active series for a tenant (across all replicas if it is a *per-tenant* limit)
* The user-configured limit (*per-tenant* and *per-replica-tenant* can be different).
We can assume this would be available as a user flag and would be the same for all tenants (in the initial implementation)
* The increase in the number of active series when a new tenant [remote write request](https://github.com/prometheus/prometheus/blob/v2.36.1/prompb/remote.proto#L22) would be ingested (this can be optional, as seen in the [meta-monitoring approach](#81-meta-monitoring-based-receive-router-validation))

There are a few ways in which we can achieve the outlined goals of this proposal and get the above measurements to impose a limit. The order of approaches is based on preference.

### 8.1 Meta-monitoring-based Receive Router Validation

We could leverage any meta-monitoring solution, which in the context of this proposal means any Prometheus Query API compatible solution capable of consuming the metrics exposed by all Thanos Receive instances. Such a query endpoint allows getting the (at most a scrape interval old) number of all active series per tenant via TSDB metrics like `prometheus_tsdb_head_series`, and limiting based on that value.

This approach would add validation logic within the Receive Router or RouterIngestor modes and can be optionally enabled via flags.

With such an approach, we do not need to calculate an increase based on requests, as this is handled by Receive instrumentation and the meta-monitoring solution. We only need to query the latest HEAD series value for a tenant, summed across all Receives, and limit remote write requests if the result of the instant query is greater than the configured limit.

The value of current active series for each tenant can be cached in a map, which is updated by a meta-monitoring query executed periodically. This map is referred to for `latestCurrentSeries`.

So if a user configures a *per-tenant* limit, say `globalSeriesLimit`, the resultant limiting equation here is simply `globalSeriesLimit >= latestCurrentSeries`, which is checked on each request.
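
A minimal sketch of this check, assuming a periodic goroutine has populated the cache from an instant query such as `sum by (tenant) (prometheus_tsdb_head_series)`. The type and method names here (`limitCache`, `underLimit`) are hypothetical, not Thanos's actual internals:

```go
package main

import (
	"fmt"
	"sync"
)

// limitCache caches the latest per-tenant active series counts returned by a
// periodic meta-monitoring instant query. All names are illustrative only.
type limitCache struct {
	mu                  sync.RWMutex
	latestCurrentSeries map[string]float64
	globalSeriesLimit   float64
}

// update is what the periodic query loop would call with fresh results.
func (c *limitCache) update(results map[string]float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.latestCurrentSeries = results
}

// underLimit implements globalSeriesLimit >= latestCurrentSeries: a tenant's
// requests are admitted while its cached series count is at or below the
// configured limit. Tenants with no cached value yet are admitted.
func (c *limitCache) underLimit(tenant string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cur, ok := c.latestCurrentSeries[tenant]
	return !ok || c.globalSeriesLimit >= cur
}

func main() {
	c := &limitCache{globalSeriesLimit: 1000}
	c.update(map[string]float64{"tenant-a": 800, "tenant-b": 1200})
	fmt.Println(c.underLimit("tenant-a")) // under the limit: admit
	fmt.Println(c.underLimit("tenant-b")) // over the limit: fail with 429
}
```

Because the cache is only as fresh as the last scrape and query, a tenant can overshoot the limit between refreshes; this is the "not very accurate" trade-off listed in the cons below.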

<img src="../img/meta-monitoring-validator.png" alt="Meta-monitoring-based Validator" width="800"/>

#### 8.1.1 Pros:

* Simpler compared to other solutions and easier to implement
* Fewer endpoint calls, so improved latency
* Relies on a system "external to Thanos", and doesn’t increase load on Receive
* Does not add much tenancy-based complexity to Thanos
* No need to merge statistics across replicas; this is handled by meta-monitoring
* Additional request-based rate limiting can be done within the same component
* In case the external meta-monitoring solution is down, we can fall back to per-replica-tenant limits
* Growing our instrumentation to improve the validator improves our observability as well
* Easy to iterate on. We can move from meta-monitoring to a different standalone solution if needed later on.

#### 8.1.2 Cons:

* Not very accurate
* We do not know the exact state of each TSDB; we only know the view of the meta-monitoring solution, which gets updated on every scrape
* We do not account for how much a remote write request will increase the number of active series; we only infer that from the query result after the fact
* Data replication (quorum-based) will likely cause inaccuracies in HEAD stat metrics
* Dependence on an external meta-monitoring system that is Prometheus API compatible. This is fairly easy and reliable with a local Prometheus setup, but if the user uses a 3rd-party system like central remote monitoring, this might be less trivial to set up.

#### 8.1.3 Why is this preferred?

This is the simplest solution that can be implemented within Thanos Receive and that can help us achieve best-effort limiting and stability. The fact that it does not rely on inter-Receive communication, which is very complex to implement, makes it a pragmatic solution.

A full-fledged reference implementation of this can be found here: https://github.com/thanos-io/thanos/pull/5520.

## 9 Alternatives

There are a few alternatives to what is proposed above.

### 9.1 Receive Router Validation

We can implement some new endpoints on Thanos Receive.

Firstly, we can take advantage of the `api/v1/status/tsdb` endpoint that is exposed by [Prometheus TSDB](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats) and has been implemented in Thanos Receive ([PR](https://github.com/thanos-io/thanos/pull/5402), which utilizes tenant headers to get local tenant TSDB stats in Receive).

In its current implementation, it can provide us stats for each local TSDB of a tenant, which contain a measure of active series (head series). It can return stats for all tenants in a Receive instance as well ([PR](https://github.com/thanos-io/thanos/pull/5470)). We can merge this across replicas to get the total number of active series a tenant has.

Furthermore, we also have each tenant’s [Appendable](https://pkg.go.dev/github.com/thanos-io/thanos/pkg/receive#Appendable) in multitsdb, which returns a Prometheus [storage.Appender](https://pkg.go.dev/github.com/prometheus/prometheus/storage#Appender), which can in turn give us a [storage.GetRef](https://pkg.go.dev/github.com/prometheus/prometheus/storage#GetRef.GetRef) interface. This helps us know whether a TSDB has a cached reference for a particular set of labels in its HEAD.

This GetRef interface returns a [SeriesRef](https://pkg.go.dev/github.com/prometheus/prometheus/storage#SeriesRef) when a set of labels is passed to it. If the SeriesRef is 0, it means that set of labels is not cached, and any sample with that set of labels will generate a new active series. This data can also be fetched from a new endpoint like `api/v1/getrefmap` and merged across replicas.

This approach would add validation logic within the Receive Router, which we can call the **“Validator”**.
This can be optionally enabled via flags, and a Validator can be used in front of a Receive hashring. This is where we can get data from hashring Receivers and merge them to limit remote write requests.

The implementation would be as follows:

* Implement a configuration option for the global series limit (which would be the same for each tenant initially), i.e. `globalSeriesLimit`
* Implement validation logic in Receive Router mode, which can recognize other Receive replicas, call the `api/v1/status/tsdb` endpoint on each replica with the `all_tenants=true` query parameter, and merge the counts of active series of a tenant, i.e. `currentSeries`
* Implement an endpoint in Receive, `api/v1/getrefmap`, which, when provided with a tenant id and a remote write request, returns a map of SeriesRef and labelsets
* We can then merge this with maps from other replicas and get the number of series for which `SeriesRef == 0` on all replicas. This is the increase in the number of active series if the remote write request is ingested, i.e. `increaseOnRequest`. For example,

<img src="../img/get-ref-map.png" alt="SeriesRef Map merge across replicas" width="600"/>

* The above merged results may be exposed as metrics by the Validator
* Each remote write request is first intercepted by a Validator, which performs the above and calculates whether the request is under the limit.
* [Request-based limits](https://github.com/thanos-io/thanos/issues/5404) can also be implemented with such an approach.

So, the limiting equation in this case becomes `globalSeriesLimit >= currentSeries + increaseOnRequest`.
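
The merge step above can be sketched as follows: a labelset only contributes to `increaseOnRequest` if its SeriesRef is 0 on every replica, i.e. no replica already tracks that series. The function name and string-encoded labelset keys here are illustrative assumptions, not the actual `api/v1/getrefmap` wire format:

```go
package main

import "fmt"

// increaseOnRequest counts how many series in a remote write request would be
// new across the whole hashring. Each map in refMaps is one replica's
// labelset -> SeriesRef response; a missing key yields 0, meaning that
// replica has no cached reference for the series.
func increaseOnRequest(refMaps []map[string]uint64, series []string) int {
	inc := 0
	for _, ls := range series {
		newEverywhere := true
		for _, m := range refMaps {
			if m[ls] != 0 { // some replica already tracks this series
				newEverywhere = false
				break
			}
		}
		if newEverywhere {
			inc++
		}
	}
	return inc
}

func main() {
	replicaA := map[string]uint64{`{job="node"}`: 7} // cached on replica A
	replicaB := map[string]uint64{}                  // no cached refs
	series := []string{`{job="node"}`, `{job="api"}`}
	inc := increaseOnRequest([]map[string]uint64{replicaA, replicaB}, series)
	// Only {job="api"} has SeriesRef == 0 on every replica.
	fmt.Println(inc)
}
```

The Validator would then admit the request only if `globalSeriesLimit >= currentSeries + increaseOnRequest`.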

<img src="../img/receive-validator.png" alt="Receive Validator" width="800"/>

We treat `api/v1/status/tsdb` & `api/v1/getrefmap` as two different endpoints throughout this proposal, but exposing a gRPC API that combines the two might be much more suitable here, for example:

```proto
/// Limit represents an API on Thanos Receive, which is used for limiting remote write requests based on active series.
service Limit {
  /// Status returns various cardinality statistics about any Receive tenant TSDB.
  rpc Status(Tenant) returns (TSDBStatus);

  /// GetRefMap returns a map of ZLabelSet and SeriesRef.
  rpc GetRefMap(WriteRequest) returns (SeriesRefMap);
}

message SeriesRefMap {
  map<ZLabelSet, uint64> series_ref_map = 1 [(gogoproto.nullable) = false];
}
```

#### 9.1.1 Pros:

* Would result in more accurate measurements to limit on; however, data replication would still make `api/v1/status/tsdb` [inaccurate](https://github.com/thanos-io/thanos/pull/5402#discussion_r893434246)
* It considers the exact amount of current active series for a tenant, as it calls the status API each time
* It considers how much the number of active series would increase after a remote write request
* No new TSDB-related changes; it utilizes interfaces that are already present
* Simple proxy-like solution, as an optional component
* Does not change the existing way in which Receive nodes communicate with each other
* Additional request-based rate limiting can be done within the same component

#### 9.1.2 Cons:

* Adding a new component to manage
* Increased tenant complexity in Thanos due to new APIs in Receive which need to account for tenants
* Many endpoint calls on each remote write request received, only for limiting
* Non-trivial increase in latency
* Can scale due to the new component being stateless, but this can lead to more endpoint calls on Receive nodes in the hashring

### 9.2 Per-Receive Validation

We can implement the same new endpoints as mentioned in the previous approach on Thanos Receive, but do the merging and checking operations on each Receive node in the hashring, i.e. change the existing Router and Ingestor modes to handle the same limiting logic.

The implementation would be as follows:

* Implement a configuration option for the global series limit (which would be the same for each tenant initially), i.e. `globalSeriesLimit`
* Implement a method so that each replica of Receive can call `api/v1/status/tsdb` of other replicas for a particular tenant and merge the counts of HEAD series, i.e. `currentSeries`
* Implement an endpoint in Receive, `api/v1/getrefmap`, which, when provided with a tenant id and a remote write request, returns a map of SeriesRef and labelsets
* We can then merge this with maps from other replicas and get the number of series for which `SeriesRef == 0` on all replicas. This is the increase in the number of active series if the remote write request is ingested, i.e. `increaseOnRequest`.
* The above merged results may be exposed as metrics
* When any Receive gets a remote write request, it performs the above and calculates whether the request is under the limit.

So, the limiting equation in this case is the same as before: `globalSeriesLimit >= currentSeries + increaseOnRequest`.

<img src="../img/per-receive.png" alt="Per-Receive Validation" width="800"/>

The option of using gRPC instead of two API calls each time is also valid here.

#### 9.2.1 Pros:

* Would result in more accurate measurements to limit on; however, data replication would still make `api/v1/status/tsdb` [inaccurate](https://github.com/thanos-io/thanos/pull/5402#discussion_r893434246)
* It considers the exact amount of active series for a tenant, as it calls the status API each time
* It considers how much the number of active series would increase after a remote write request
* No new TSDB-related changes; it utilizes interfaces that are already present

#### 9.2.2 Cons:

* Increased tenant complexity in Thanos due to new APIs which need to account for tenants
* Many endpoint calls on each remote write request received, only for limiting
* Non-trivial increase in latency
* Difficult to scale up/down
* Adds more complexity to how Receivers in the hashring communicate with each other

### 9.3 Only local limits

An alternative could be to just not limit active series globally and make do with local limits only.

### 9.4 Make scaling-up non-disruptive

[Consistent hashing](https://github.com/thanos-io/thanos/issues/4972) might be implemented and the problems with sharding sorted out, which would make adding Receive replicas to the hashring a non-disruptive operation, so that solutions like HPA can be used, making scale-up/down operations easy to the point where limits are not needed.

### 9.5 Implement somewhere else (e.g. Observatorium)

Not implementing this within Thanos, but rather using some other API gateway-like component, which can parse remote write requests, maintain running counts of active series for all tenants, and limit based on that. A particular example of a project where this could be implemented is [Observatorium](https://github.com/observatorium/observatorium).

## 10 Open Questions

* Is there a particular way to get an accurate count of HEAD series across multiple replicas of Receive, when the replication factor is greater than zero?
* Any alternatives to GetRef which would be easier to merge across replicas?