---
title: "Shuffle sharding on the read path"
linkTitle: "Shuffle sharding on the read path"
weight: 1
slug: shuffle-sharding-on-the-read-path
---

- Authors: @pracucci, @tomwilkie, @pstibrany
- Reviewers:
- Date: August 2020
- Status: Proposed, partially implemented

## Background

Cortex currently supports sharding of tenants to a subset of the ingesters on the write path ([PR #1947](https://github.com/cortexproject/cortex/pull/1947)).

This feature is called “subring”, because it computes a subset of the nodes registered to the hash ring. The aim of this feature is to improve isolation between tenants and reduce the number of tenants impacted by an outage.

This approach is similar to the techniques described in [Amazon’s Shuffle Sharding article](https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/), but currently suffers from a non-random selection of nodes (*proposed solution below*).

Cortex can be **configured** with a default subring size, which can then be [customized on a per-tenant basis](https://cortexmetrics.io/docs/configuration/configuration-file/#limits_config). The per-tenant configuration is live reloaded during runtime and applied without restarting the Cortex process.

The subring sharding currently supports only the write path; the read path is not shuffle sharding aware. For example, with RF=3 an outage of more than one ingester will affect all tenants, and a tenant running particularly heavy queries can affect all other tenants.

## Goals

The Cortex **read path should support shuffle sharding to isolate** the impact of an outage in the cluster. The shard size must be dynamically configurable on a per-tenant basis during runtime.

This deliverable involves introducing shuffle sharding in:
- **Query-frontend → Querier** (for queries sharding) [PR #3113](https://github.com/cortexproject/cortex/pull/3113)
- **Querier → Store-gateway** (for blocks sharding) [PR #3069](https://github.com/cortexproject/cortex/pull/3069)
- **Querier → Ingesters** (for queries on recent data)
- **Ruler** (for rule and alert evaluation)

### Prerequisite: fix subring shuffling

The solution is implemented in https://github.com/cortexproject/cortex/pull/3090.

#### The problem

The subring is a subset of nodes that should be used for a specific tenant.

The current subring implementation doesn’t shuffle tenants across nodes. Given a tenant ID, it finds the first node owning the hash(tenant ID) token and then it picks N distinct consecutive nodes walking the ring clockwise.

For example, in a cluster with 6 nodes (numbered 1-6) and a replication factor of 3, three tenants (A, B, C) could have the following shards:

Tenant ID | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6
----------|--------|--------|--------|--------|--------|-------
A         | x      | x      | x      |        |        |
B         |        | x      | x      | x      |        |
C         |        |        | x      | x      | x      |

#### Proposal

We propose to build the subring by picking N distinct, random nodes registered in the ring, using the following algorithm:

1. SID = tenant ID
2. SID = hash(SID)
3. Look for the node owning the token range containing FNV-1a(SID)
4. Loop to (2) until we’ve found N distinct nodes (where N is the shard size)

*hash() function to be decided. The required property is to be strong enough to not generate loops across multiple subsequent hashings of the previous hash.*

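Below is a minimal, self-contained sketch of this selection. It is not the actual Cortex implementation: the `ring`, `node` and `subring` names are made up for illustration, FNV-1a is used for the chained hash() (which is still to be decided), and an iteration cap stands in for a proper guard against hash loops.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type node struct {
	addr  string
	token uint32
}

type ring struct {
	nodes []node // sorted by token
}

// nodeByToken returns the node owning the token range containing t,
// i.e. the first node with token >= t, wrapping around the ring.
func (r ring) nodeByToken(t uint32) node {
	i := sort.Search(len(r.nodes), func(i int) bool { return r.nodes[i].token >= t })
	if i == len(r.nodes) {
		i = 0
	}
	return r.nodes[i]
}

func fnv32a(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// subring picks n distinct nodes by repeatedly hashing the previous hash
// (step 2) and looking up the owner of the resulting token (step 3).
func subring(r ring, tenantID string, n int) []node {
	selected := map[string]node{}
	sid := tenantID // 1. SID = tenant ID

	// The iteration cap guards against (unlikely) loops in the hash chain.
	for i := 0; i < 1000 && len(selected) < n && len(selected) < len(r.nodes); i++ {
		sid = fmt.Sprint(fnv32a(sid))       // 2. SID = hash(SID)
		owner := r.nodeByToken(fnv32a(sid)) // 3. node owning FNV-1a(SID)
		selected[owner.addr] = owner        // 4. keep distinct nodes only
	}

	out := make([]node, 0, len(selected))
	for _, v := range selected {
		out = append(out, v)
	}
	return out
}

func main() {
	var r ring
	for i := 0; i < 6; i++ {
		r.nodes = append(r.nodes, node{
			addr:  fmt.Sprintf("ingester-%d", i+1),
			token: uint32(i) * 715827882, // tokens spread evenly over the ring
		})
	}
	fmt.Println(subring(r, "tenant-a", 3))
}
```
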
## Query-frontend → Queriers shuffle sharding

Implemented in https://github.com/cortexproject/cortex/pull/3113.

### How querier runs query-frontend jobs

Today **each** querier connects to **each** query-frontend instance, and calls a single “Process” method via gRPC.

“Process” is a bi-directional streaming gRPC method: the server-to-client stream is used for sending requests from the query-frontend to the querier, and the client-to-server stream for returning results from the querier to the query-frontend. NB this is the opposite of what might be considered normal. The query-frontend scans all its queues with pending query requests, and picks a query to execute based on a fair schedule between tenants.

The query request is then sent to an idle querier worker over the stream opened in the Process method, and the query-frontend then waits for a response from the querier. This loop repeats until the querier disconnects.

### Proposal

To support shuffle sharding, query-frontends will keep a list of connected queriers, and randomly (but consistently between query-frontends) choose N of them to distribute requests to. When a query-frontend looks for the next request to send to a given querier, it will only consider tenants that “belong” to that querier.

To choose N queriers for a tenant, we propose to use a simple algorithm:

1. Sort all queriers by their ID
2. SID = tenant ID
3. SID = hash(SID)
4. Pick the querier from the sorted list with:<br />
index = FNV-1a(SID) % number of queriers
5. Loop to (3) until we’ve found N distinct queriers (where N is the shard size), stopping early if there aren’t enough queriers

*hash() function to be decided. The required property is to be strong enough to not generate loops across multiple subsequent hashings of the previous hash.*

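Below is a minimal sketch of this selection, assuming a hypothetical `pickQueriers` helper and reusing FNV-1a as the (still to be decided) chained hash():

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func fnv32a(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// pickQueriers returns up to n distinct querier IDs for the given tenant. It
// produces the same result on every query-frontend that sees the same set of
// connected queriers.
func pickQueriers(connected []string, tenantID string, n int) []string {
	querierIDs := append([]string(nil), connected...)
	sort.Strings(querierIDs) // 1. sort all queriers by their ID

	var selected []string
	seen := map[string]bool{}
	sid := tenantID // 2. SID = tenant ID

	// Stop early if there aren't enough queriers; the iteration cap guards
	// against (unlikely) loops in the hash chain.
	for i := 0; i < 1000 && len(selected) < n && len(selected) < len(querierIDs); i++ {
		sid = fmt.Sprint(fnv32a(sid)) // 3. SID = hash(SID)
		idx := int(fnv32a(sid) % uint32(len(querierIDs)))
		if id := querierIDs[idx]; !seen[id] {
			seen[id] = true
			selected = append(selected, id)
		}
	}
	return selected
}

func main() {
	queriers := []string{"querier-1", "querier-2", "querier-3", "querier-4", "querier-5"}
	fmt.Println(pickQueriers(queriers, "tenant-a", 2))
}
```
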
### Properties

- **Stability:** this will produce the same result on all query-frontends as long as all queriers are connected to all query-frontends.
- **Simplicity:** no external dependencies.
- **No consistent hashing:** adding/removing queriers will cause “resharding” of tenants between queriers. While in general that’s not a desirable property, queriers are stateless so it doesn’t seem to matter in this case.

### Implementation notes

- **Caching:** once the list of queriers to use for a tenant is computed in the query-frontend, it is cached in memory until queriers are added or removed. Per-tenant cache entries will have a TTL to discard tenants not “seen” for a while.
- **Querier ID:** query-frontends currently don’t have any identity for queriers. We need to introduce the sending of a unique ID (eg. hostname) by the querier to the query-frontend when it calls the “Process” method.
- **Backward-compatibility:** when querier shuffle sharding is enabled, the system expects that both query-frontend and querier run a compatible version. A cluster version upgrade will require rolling out new query-frontends and queriers first, and then enabling shuffle sharding.
- **UI:** we propose to expose the current state of the query-frontend through a new endpoint which should display:
  - Which queriers are connected to the query-frontend
  - Are there any “old” queriers that are receiving requests from all tenants?
  - Mapping of tenants to queriers. Note that this mapping may only be available for tenants with pending requests on a given query-frontend, and therefore be very dynamic.

### Configuration

- **Shard size** will be configurable on a per-tenant basis via the existing “runtime-configuration” mechanism (limits overrides). Changing a value for a tenant needs to invalidate the cached per-tenant queriers.
- The queriers shard size will be a different setting than the one used for writes.

### Evaluated alternatives

#### Use the subring

An alternative option would be using the subring. This implies having queriers register to the hash ring and query-frontend instances use the ring client to find the queriers subring for each tenant.

This solution looks like it adds more complexity without any actual benefit.

#### Change query-frontend → querier architecture

A completely different approach would be to introduce a place where starting queriers would register (eg. DNS-based service discovery), and let query-frontends discover queriers from this central registry.

A possible benefit would be that queriers don’t need to initiate connections to all query-frontends, but query-frontends would only connect to queriers for which they have actual pending requests. However, this would be a significant redesign of how query-frontend / querier communication works.

## Querier → Store-gateway shuffle sharding

Implemented in https://github.com/cortexproject/cortex/pull/3069.

### Introduction

As of today, the store-gateway supports blocks sharding with customizable replication factor (defaults to 3). Blocks of a single tenant are sharded across all store-gateway instances and so to execute a query the querier may touch any store-gateway in the cluster.

The current sharding implementation is based on a **hash ring** formed by store-gateway instances.

### Proposal

The proposed solution to add shuffle sharding support to the store-gateway is to **leverage the existing hash ring** to build a per-tenant **subring**, which is then used both by the querier and the store-gateway to know which store-gateway a block belongs to.

### Configuration

- Shuffle sharding can be enabled in the **store-gateway configuration.** It supports a **default sharding factor,** which is **overridable on a per-tenant basis** and live reloaded during runtime (using the existing limits config).
- The querier already requires the store-gateway configuration when blocks sharding is enabled. Similarly, when shuffle sharding is enabled the querier will require the store-gateway shuffle sharding configuration as well.

### Implementation notes

When shuffle sharding is enabled:

- The **store-gateway** `syncUsersBlocks()` will build a tenant’s subring for each tenant found scanning the bucket and will skip any tenant not belonging to its shard.<br />
Likewise, `ShardingMetadataFilter` will first build a **tenant’s subring** and then use the existing logic to filter out blocks not belonging to the store-gateway instance itself. The tenant ID can be read from the block’s meta.json.
- The **querier** `blocksStoreReplicationSet.GetClientsFor()` will first build a **tenant’s subring** and then use the existing logic to find out which store-gateway instance each requested block belongs to (see the sketch below).

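To illustrate the ownership check shared by both components, here is a minimal sketch of filtering blocks within a tenant’s subring. The `subring`, `ownerOf` and `filterOwnedBlocks` names are hypothetical, the subring is simplified to a token-sorted list of instances, and replication is ignored; the real Cortex logic lives in the existing ring code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type instance struct {
	id    string
	token uint32
}

// subring is a tenant's subring, simplified to instances sorted by token.
type subring []instance

// ownerOf returns the ID of the instance owning the token range containing t.
func (s subring) ownerOf(t uint32) string {
	i := sort.Search(len(s), func(i int) bool { return s[i].token >= t })
	if i == len(s) {
		i = 0
	}
	return s[i].id
}

func fnv32a(v string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(v))
	return h.Sum32()
}

// filterOwnedBlocks keeps only the blocks owned by this store-gateway
// instance within the tenant's subring (replication ignored for brevity).
func filterOwnedBlocks(s subring, instanceID string, blockIDs []string) []string {
	var owned []string
	for _, blockID := range blockIDs {
		if s.ownerOf(fnv32a(blockID)) == instanceID {
			owned = append(owned, blockID)
		}
	}
	return owned
}

func main() {
	// A tenant's subring of 3 store-gateways out of a larger cluster.
	s := subring{
		{"store-gateway-2", 1 << 30},
		{"store-gateway-5", 2 << 30},
		{"store-gateway-7", 3 << 30},
	}
	blocks := []string{"block-1", "block-2", "block-3", "block-4"}
	fmt.Println(filterOwnedBlocks(s, "store-gateway-5", blocks))
}
```
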
### Evaluated alternatives

*Given the store-gateways already form a ring and building the shuffle sharding based on the ring (like in the write path) doesn’t introduce extra operational complexity, we haven’t discussed alternatives.*

## Querier → Ingesters shuffle sharding

We’re currently discussing/evaluating different options.

### Problem

Cortex must guarantee query correctness; transiently incorrect results may be cached and returned forever. The main problem to solve when introducing ingesters shuffle sharding on the read path is to make sure that a querier fetches data from all ingesters having at least 1 sample for a given tenant.

The problem to solve is: how can a querier efficiently find which ingesters have data for a given tenant? Each option must consider changes to the set of ingesters and changes to each tenant’s subring size.

### Proposal: use only the information contained in the ring

*This section describes an alternative approach. Discussion is still ongoing.*

The idea is for the queriers to be able to deduce which ingesters could possibly hold data for a given tenant by just consulting the ring (and the per-tenant subring sizes). We posit that this is possible with only a single piece of extra information: a single timestamp per ingester saying when the ingester first joined the ring.

#### Scenario: ingester scale up

When a new ingester is added to the ring, there will be a set of user subrings that see a change: an ingester being removed, and a new one being added. We need to guarantee that, for some time period (the block flush interval), the ingester removed from the subring is still consulted for queries.

To do this, during the subring selection, if we encounter an ingester added within the time period, we will add it to the subring but continue node selection as before, in effect selecting an extra ingester:

```go
var (
    subringSize   int    // the tenant's shard size
    selectedNodes []Node
    deadline      = time.Now().Add(-flushWindow) // ingesters added after this are considered "new"
)

for len(selectedNodes) < subringSize {
    token := random.Next()
    node := getNodeByToken(token)
    for {
        // Skip nodes already selected, walking the ring clockwise.
        if containsNode(selectedNodes, node) {
            node = node.Next()
            continue
        }
        // Recently added ingester: select it, but also grow the subring
        // so that an additional (older) ingester is selected as well.
        if node.Added.After(deadline) {
            subringSize++
            selectedNodes = append(selectedNodes, node)
            node = node.Next()
            continue
        }
        selectedNodes = append(selectedNodes, node)
        break
    }
}
```

#### Scenario: ingester scale down

When an ingester is permanently removed from the ring, it will flush its data to the object store and the subrings containing the removed ingester will gain a “new” ingester. Queries consult the store and merge the results with those from the ingesters, so no data will be missed.

Queriers and store-gateways will discover newly flushed blocks on the next sync (`-blocks-storage.bucket-store.sync-interval`, default 5 minutes).
Multiple ingesters should not be scaled down within this interval.

To improve read performance, queriers and rulers are usually configured with a non-zero value of the `-querier.query-store-after` option.
This option makes queriers and rulers consult **only** ingesters when running queries within the specified time window (eg. 12h).
During scale-down this value needs to be lowered in order to let queriers and rulers use the flushed blocks from the storage.

#### Scenario: increase size of a tenant’s subring

Node selection for subrings is stable: increasing the size of a subring is guaranteed to only add new nodes to it (and not remove any). Hence, if a tenant’s subring is increased in size, the queriers will notice the config change and start consulting the new ingesters.

#### Scenario: decreasing size of a tenant’s subring

If a tenant’s subring decreases in size, there is currently no way for the queriers to know how big the subring was previously, and hence they will potentially miss an ingester with data for that tenant.

This is deemed an infrequent operation that we considered banning, but we have a proposal for how we might make it possible:

The proposal is to have separate read and write subring sizes in the config. The read subring will not be allowed to be smaller than the write subring. When reducing the size of a tenant’s subring, operators must first reduce the write subring, and then, two hours later when the blocks have been flushed, the read subring. In the majority of cases the read subring will not need to be specified, as it will default to the write subring size.

### Considered alternative #1: Ingesters expose list of tenants

A possible solution could be keeping in the querier an in-memory data structure mapping each ingester to the list of tenants for which it has some data. This data structure would be constructed at querier startup, and then periodically updated, combining two pieces of information:

1. The current state of the ring
2. The list of tenants directly exposed by each ingester (via a dedicated gRPC call)

#### Scenario: new querier starts up

When a querier starts up, and before getting ready:

1. It scans all ingesters (discovered via the ring) and fetches the list of tenants for which each ingester has some data
2. For each found tenant (unique list of tenant IDs across all ingester responses), the querier looks at the current state of the ring and adds to the map the list of ingesters currently assigned to the tenant shard, even if they don’t hold any data yet (because they may start receiving series shortly)

Then the querier watches the ingester ring and rebuilds the in-memory map whenever the ring topology changes.

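Below is a minimal sketch of building this ingester-to-tenants mapping (steps 1-2 above). The `buildTenantMap` helper and its inputs are hypothetical stand-ins for the per-ingester gRPC responses and the ring lookup:

```go
package main

import "fmt"

// buildTenantMap returns, for each tenant, the set of ingesters the querier
// must query: ingesters that reported data for the tenant (via gRPC) plus
// ingesters currently assigned to the tenant's shard in the ring (which may
// not hold any data yet).
func buildTenantMap(
	ingesterTenants map[string][]string, // ingester ID -> tenants it reported
	shardIngesters func(tenant string) []string, // tenant -> ingesters in its current shard
) map[string]map[string]bool {
	out := map[string]map[string]bool{}
	add := func(tenant, ingester string) {
		if out[tenant] == nil {
			out[tenant] = map[string]bool{}
		}
		out[tenant][ingester] = true
	}

	// 1. Tenants reported by each ingester.
	for ingester, tenants := range ingesterTenants {
		for _, tenant := range tenants {
			add(tenant, ingester)
		}
	}
	// 2. Ingesters currently assigned to each found tenant's shard.
	for tenant := range out {
		for _, ingester := range shardIngesters(tenant) {
			add(tenant, ingester)
		}
	}
	return out
}

func main() {
	reported := map[string][]string{
		"ingester-1": {"tenant-a"},
		"ingester-2": {"tenant-a"},
	}
	shard := func(tenant string) []string { return []string{"ingester-2", "ingester-3"} }
	fmt.Println(buildTenantMap(reported, shard)) // tenant-a -> ingesters 1, 2 and 3
}
```
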
#### Scenario: querier receives a query for an unknown tenant

A new tenant starts remote writing to the cluster. The querier doesn’t have it in its in-memory map, so it adds the tenant to the map on the fly, just looking at the current state of the ring.

#### Scenario: ingester scale up / down

When a new ingester is added to / removed from the ring, the ring topology changes and queriers will update the in-memory map.

#### Scenario: per-tenant shard size increases

Queriers periodically (every 1m) reload the limits config file. When a tenant shard size change is detected, the querier updates the in-memory map for the affected tenant.

**Issue:** some time series data may be missing from queries for up to 1m.

#### Edge case: queriers notice the ring topology change before distributors

Consider the following scenario:

1. Tenant A’s shard is composed of ingesters 1,2,3,4,5,6
2. Tenant A is remote writing 1 single series, which gets replicated to ingesters 1,2,3
3. The ring topology changes and tenant A’s shard becomes ingesters 1,2,3,7,8,9
4. The querier notices the ring topology change and updates the in-memory map. Given tenant A’s series were only on ingesters 1,2,3, the querier maps tenant A to ingesters 1,2,3 (because of what it received from ingesters via gRPC) and 7,8,9 (because of the current state of the ring)
5. The distributor hasn’t updated its ring state yet
6. Tenant A remote writes 1 **new** series, which gets replicated to ingesters 4,5,6
7. The distributor updates its ring state
8. **Race condition:** the querier will not know that ingesters 4,5,6 contain tenant A’s data until the next sync

### Considered alternative #2: streaming updates from ingesters to queriers

*This section describes an alternative approach.*

#### Current state

As of today, queriers discover ingesters via the ring:

- **Ingesters** register to the ring (and update their heartbeat timestamp), and queriers watch the ring, keeping an in-memory copy of the latest ingesters ring state.
- **Queriers** use the in-memory ring state to discover all ingesters that should be queried at query time.

#### Proposal

The proposal is to expose a new gRPC endpoint on ingesters, which allows queriers to receive a stream of real time updates from ingesters about the tenants for which an ingester currently has time series data.

From the querier side:

- At **startup** the querier discovers all existing ingesters. For each ingester, the querier calls the ingester’s gRPC endpoint WatchTenants() (to be created). As soon as the WatchTenants() rpc is called, the ingester sends the entire set of tenants to the querier and then sends incremental updates (tenant added to or removed from the ingester) while the WatchTenants() stream connection is alive.
- If the querier **loses the connection** to an ingester, it will automatically retry (with backoff) while the ingester is within the ring.
- The querier **watches the ring** to discover added/removed ingesters. When an ingester is added, the querier adds the ingester to the pool of ingesters whose state should be monitored via WatchTenants().
- At **query time,** the querier looks for all ingesters within the ring. There are two options:
  1. The querier knows the state of the ingester: the ingester will be queried only if it contains data for the query’s tenant.
  2. The querier doesn’t know the state of the ingester (eg. because it was just registered to the ring and WatchTenants() hasn’t succeeded yet): the ingester will be queried anyway (correctness first).
- The querier will fine tune [gRPC keepalive](https://godoc.org/google.golang.org/grpc/keepalive) settings to ensure a lost connection between the querier and an ingester is detected early and retried.

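Below is a minimal sketch of the query-time decision (the two options above), where `knownTenants` stands in for the per-ingester state learned via the hypothetical WatchTenants() stream:

```go
package main

import "fmt"

// ingestersToQuery returns the ingesters from the ring that must be queried
// for the given tenant. knownTenants maps an ingester ID to the set of tenants
// it reported via the stream; an ingester missing from the map has unknown
// state and is queried anyway (correctness first).
func ingestersToQuery(ringIngesters []string, knownTenants map[string]map[string]bool, tenant string) []string {
	var out []string
	for _, ingester := range ringIngesters {
		tenants, known := knownTenants[ingester]
		if !known || tenants[tenant] {
			out = append(out, ingester)
		}
	}
	return out
}

func main() {
	ring := []string{"ingester-1", "ingester-2", "ingester-3"}
	known := map[string]map[string]bool{
		"ingester-1": {"tenant-a": true},
		"ingester-2": {"tenant-b": true},
		// ingester-3 just joined the ring: no WatchTenants() state yet.
	}
	fmt.Println(ingestersToQuery(ring, known, "tenant-a")) // [ingester-1 ingester-3]
}
```
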
#### Trade-offs

Pros:

- The querier logic used to find ingesters for a tenant’s shard **does not require watching the overrides** config file (containing the tenant shard size override). Watching the file in the querier is problematic because of the introduced delays (ConfigMap update and Cortex file polling), which could lead to distributors applying changes before queriers.
- The querier **never uses the current state of the ring** as a source of information to detect which ingesters have data for a specific tenant. This information comes directly from the ingesters themselves, which makes the implementation less likely to be subject to race conditions.

Cons:

- Each querier needs to open a gRPC connection to each ingester. Given gRPC supports multiplexing, the underlying TCP connection could be the same connection used to fetch samples from ingesters at query time, basically having 1 single TCP connection between a querier and an ingester.
- The “Edge case: queriers notice the ring topology change before distributors” described in alternative #1 can still happen in case of delays in the propagation of the state update from an ingester to queriers:
  - Short delay: a short delay (a few seconds) shouldn’t be a real problem. From the final user perspective, there’s no real difference between this edge case and a delay of a few seconds in the ingestion path (eg. Prometheus remote write lagging behind a few seconds). In the real case of Prometheus remote writing to Cortex, there’s no easy way to know whether the latest samples are missing because they have not been remote written yet by Prometheus or because of a delay in the propagation of this information between ingesters and queriers.
  - Long delay: in case of networking issues propagating the state update from an ingester to the querier, the gRPC keepalive will trigger (because of the failed ping-pong) and the querier will drop the failing ingester’s in-memory data, so the ingester will always be queried for any query until the state stream is re-established.

## Ruler sharding

### Introduction

The ruler currently supports rule groups sharding across a pool of rulers. When sharding is enabled, rulers form a hash ring and each ruler uses the ring to check if it should evaluate a specific rule group.

At a polling interval (defaults to 1 minute), the ruler:

- Lists all the bucket objects to find all rule groups (listing is done specifying an empty delimiter, so it returns objects at any depth)
- For each discovered rule group, hashes the object key and checks whether it belongs to the range of tokens assigned to the ruler itself. If not, the rule group is discarded; otherwise it’s kept for evaluation.

### Proposal

We propose to introduce shuffle sharding in the ruler as well, leveraging the already existing hash ring used by the current sharding implementation.

The **configuration** will be extended to allow configuring:

- Enable/disable shuffle sharding
- Default shard size
- Per-tenant overrides (reloaded at runtime)

When shuffle sharding is enabled:

- The ruler lists (ListBucketV2) the tenants for which rule groups are stored in the bucket
- The ruler filters out tenants not belonging to its shard
- For each tenant belonging to its shard, the ruler does a ListBucketV2 call with the “<tenant-id>/” prefix and an empty delimiter to find all the rule groups, which are then evaluated in the ruler (see the sketch below)

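Below is a minimal sketch of this per-tenant listing and filtering. The `bucket` interface, `memBucket`, `belongsToShard` and `ruleGroupsToEvaluate` names are hypothetical, and the shard membership check is stubbed out; the real check would be done against the rulers hash ring:

```go
package main

import (
	"fmt"
	"strings"
)

type bucket interface {
	// list returns object keys under the given prefix, using an empty
	// delimiter (i.e. recursive listing), e.g. "<tenant>/<namespace>/<group>".
	list(prefix string) []string
}

// memBucket is a tiny in-memory bucket used only to make the sketch runnable.
type memBucket []string

func (b memBucket) list(prefix string) []string {
	var out []string
	for _, key := range b {
		if strings.HasPrefix(key, prefix) {
			out = append(out, key)
		}
	}
	return out
}

// belongsToShard is assumed to check whether this ruler is part of the
// tenant's shuffle shard built from the rulers hash ring.
func belongsToShard(rulerID, tenantID string, shardSize int) bool {
	return tenantID == "tenant-a" // placeholder for the real ring lookup
}

// ruleGroupsToEvaluate returns, for each tenant in this ruler's shard,
// the rule group object keys this ruler must load and evaluate.
func ruleGroupsToEvaluate(b bucket, rulerID string, shardSize int, tenants []string) map[string][]string {
	out := map[string][]string{}
	for _, tenant := range tenants {
		if !belongsToShard(rulerID, tenant, shardSize) {
			continue // another ruler's shard evaluates this tenant
		}
		out[tenant] = b.list(tenant + "/")
	}
	return out
}

func main() {
	b := memBucket{"tenant-a/ns1/group1", "tenant-a/ns1/group2", "tenant-b/ns1/group1"}
	fmt.Println(ruleGroupsToEvaluate(b, "ruler-1", 2, []string{"tenant-a", "tenant-b"}))
}
```
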
The ruler re-syncs the rule groups from the bucket whenever one of the following conditions happens:

1. Periodic interval (configurable)
2. Ring topology changes
3. The configured shard size of a tenant has changed

### Other notes

- The “subring” implementation is unoptimized. We will optimize it as part of this work to make sure no performance degradation is introduced when using the subring vs the normal ring.