---
title: "Shuffle sharding on the read path"
linkTitle: "Shuffle sharding on the read path"
weight: 1
slug: shuffle-sharding-on-the-read-path
---

- Author: @pracucci, @tomwilkie, @pstibrany
- Reviewers:
- Date: August 2020
- Status: Proposed, partially implemented

## Background

Cortex currently supports sharding of tenants to a subset of the ingesters on the write path [PR](https://github.com/cortexproject/cortex/pull/1947).

This feature is called “subring”, because it computes a subset of the nodes registered to the hash ring. The aim of this feature is to improve isolation between tenants and reduce the number of tenants impacted by an outage.

This approach is similar to the techniques described in [Amazon’s Shuffle Sharding article](https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/), but currently suffers from a non-random selection of nodes (*proposed solution below*).

Cortex can be **configured** with a default subring size, which can then be [customized on a per-tenant basis](https://cortexmetrics.io/docs/configuration/configuration-file/#limits_config). The per-tenant configuration is live reloaded during runtime and applied without restarting the Cortex process.

The subring sharding currently supports only the write path; the read path is not shuffle sharding aware. For example, an outage of more than one ingester with RF=3 will affect all tenants, and a tenant issuing particularly heavy queries can affect all tenants.

## Goals

The Cortex **read path should support shuffle sharding to isolate** the impact of an outage in the cluster. The shard size must be dynamically configurable on a per-tenant basis during runtime.

This deliverable involves introducing shuffle sharding in:
- **Query-frontend → Querier** (for queries sharding) [PR #3113](https://github.com/cortexproject/cortex/pull/3113)
- **Querier → Store-gateway** (for blocks sharding) [PR #3069](https://github.com/cortexproject/cortex/pull/3069)
- **Querier → Ingesters** (for queries on recent data)
- **Ruler** (for rule and alert evaluation)

### Prerequisite: fix subring shuffling

The solution is implemented in https://github.com/cortexproject/cortex/pull/3090.

#### The problem

The subring is a subset of nodes that should be used for a specific tenant.

The current subring implementation doesn’t shuffle tenants across nodes. Given a tenant ID, it finds the first node owning the hash(tenant ID) token and then picks N distinct consecutive nodes walking the ring clockwise.

For example, in a cluster with 6 nodes (numbered 1-6) and a replication factor of 3, three tenants (A, B, C) could have the following shards:

Tenant ID | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6
----------|--------|--------|--------|--------|--------|-------
A         | x      | x      | x      |        |        |
B         |        | x      | x      | x      |        |
C         |        |        | x      | x      | x      |

#### Proposal

We propose to build the subring picking N distinct and random nodes registered in the ring, using the following algorithm:

1. SID = tenant ID
2. SID = hash(SID)
3. Look for the node owning the token range containing FNV-1a(SID)
4. Loop to (2) until we’ve found N distinct nodes (where N is the shard size)

*The hash() function is still to be decided. The required property is that it is strong enough to not generate loops across multiple subsequent hashings of the previous hash.*
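The following is a minimal Go sketch of the algorithm above, not the implementation from the linked PR. It assumes a hypothetical `Ring` type exposing a `GetNodeByToken()` lookup, and uses FNV-1a as a stand-in for the yet-to-be-decided hash() function.

```go
package ring

import (
	"hash/fnv"
	"strconv"
)

// Node is a hypothetical representation of an instance registered to the ring.
type Node struct {
	ID string
}

// Ring is a hypothetical interface over the hash ring.
type Ring interface {
	// GetNodeByToken returns the node owning the token range containing the given token.
	GetNodeByToken(token uint32) Node
}

func fnv1a(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// shuffleShard picks n distinct, pseudo-randomly selected nodes for the given
// tenant. It assumes the ring contains at least n distinct nodes and that the
// hash function doesn't generate loops.
func shuffleShard(r Ring, tenantID string, n int) []Node {
	selected := map[string]Node{}
	sid := tenantID

	for len(selected) < n {
		sid = strconv.FormatUint(uint64(fnv1a(sid)), 10) // step 2: SID = hash(SID)
		node := r.GetNodeByToken(fnv1a(sid))             // step 3: node owning the token range containing FNV-1a(SID)
		selected[node.ID] = node                         // step 4: duplicates are ignored, so we loop until n distinct nodes are found
	}

	nodes := make([]Node, 0, n)
	for _, node := range selected {
		nodes = append(nodes, node)
	}
	return nodes
}
```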
### Query-frontend → Queriers shuffle sharding

Implemented in https://github.com/cortexproject/cortex/pull/3113.

### How querier runs query-frontend jobs

Today **each** querier connects to **each** query-frontend instance, and calls a single “Process” method via gRPC.

“Process” is a bi-directional streaming gRPC method – using the server-to-client stream for sending requests from the query-frontend to the querier, and the client-to-server stream for returning results from the querier to the query-frontend. NB this is the opposite of what might be considered normal. The query-frontend scans all its queues with pending query requests, and picks a query to execute based on a fair schedule between tenants.

The query request is then sent to an idle querier worker over the stream opened in the Process method, and the query-frontend then waits for a response from the querier. This loop repeats until the querier disconnects.

### Proposal

To support shuffle sharding, query-frontends will keep a list of connected queriers, and randomly (but consistently between query-frontends) choose N of them to distribute requests to. When the query-frontend looks for the next request to send to a given querier, it will only consider tenants that “belong” to that querier.

To choose N queriers for a tenant, we propose to use a simple algorithm:

1. Sort all queriers by their ID
2. SID = tenant ID
3. SID = hash(SID)
4. Pick the querier from the list of sorted queriers with:<br />
   index = FNV-1a(SID) % number of queriers
5. Loop to (3) until we’ve found N distinct queriers (where N is the shard size), stopping early if there aren’t enough queriers

*The hash() function is still to be decided. The required property is that it is strong enough to not generate loops across multiple subsequent hashings of the previous hash.*
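A minimal Go sketch of this selection, reusing the `fnv1a()` helper from the earlier sketch as a placeholder for hash() (and additionally using the standard `sort` and `strconv` packages). The function name and signature are illustrative only, not the implementation from the linked PR.

```go
// shuffleShardQueriers returns up to shardSize distinct querier IDs for the
// tenant, given the IDs (eg. hostnames) of all connected queriers. It assumes
// the placeholder hash doesn't generate loops.
func shuffleShardQueriers(tenantID string, querierIDs []string, shardSize int) []string {
	sorted := append([]string(nil), querierIDs...)
	sort.Strings(sorted) // step 1: sort all queriers by their ID

	selected := map[string]struct{}{}
	result := []string{}
	sid := tenantID

	// Stop early if there aren't enough queriers to fill the shard.
	for len(result) < shardSize && len(result) < len(sorted) {
		sid = strconv.FormatUint(uint64(fnv1a(sid)), 10) // steps 2-3: SID = hash(SID)
		id := sorted[int(fnv1a(sid))%len(sorted)]        // step 4: index = FNV-1a(SID) % number of queriers
		if _, ok := selected[id]; ok {
			continue // step 5: loop until N distinct queriers are found
		}
		selected[id] = struct{}{}
		result = append(result, id)
	}

	return result
}
```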
### Properties

- **Stability:** this will produce the same result on all query-frontends as long as all queriers are connected to all query-frontends.
- **Simplicity:** no external dependencies.
- **No consistent hashing:** adding/removing queriers will cause “resharding” of tenants between queriers. While in general that’s not a desirable property, queriers are stateless so it doesn’t seem to matter in this case.

### Implementation notes

- **Caching:** once the list of queriers to use for a tenant is computed in the query-frontend, it is cached in memory until queriers are added or removed. Per-tenant cache entries will have a TTL to discard tenants not “seen” for a while.
- **Querier ID:** query-frontends currently don’t have any identity for queriers. We need to introduce sending of a unique ID (eg. hostname) by the querier to the query-frontend when it calls the “Process” method.
- **Backward-compatibility:** when querier shuffle sharding is enabled, the system expects both the query-frontend and the querier to run a compatible version. A cluster version upgrade will require rolling out the new query-frontends and queriers first, and then enabling shuffle sharding.
- **UI:** we propose to expose the current state of the query-frontend through a new endpoint which should display:
  - Which queriers are connected to the query-frontend
  - Whether there are any “old” queriers that are receiving requests from all tenants
  - Mapping of tenants to queriers. Note that this mapping may only be available for tenants with pending requests on a given query-frontend, and may therefore be very dynamic.

### Configuration

- **Shard size** will be configurable on a per-tenant basis via the existing “runtime-configuration” mechanism (limits overrides). Changing the value for a tenant needs to invalidate the cached per-tenant queriers.
- The queriers shard size will be a different setting than the one used for writes.

### Evaluated alternatives

#### Use the subring

An alternative option would be using the subring. This implies having queriers register to the hash ring and query-frontend instances using the ring client to find the queriers subring for each tenant.

This solution looks like it adds more complexity without any actual benefit.

#### Change query-frontend → querier architecture

A completely different approach would be to introduce a place where starting queriers would register (eg. DNS-based service discovery), and let query-frontends discover queriers from this central registry.

A possible benefit would be that queriers don’t need to initiate connections to all query-frontends; query-frontends would only connect to queriers for which they have actual pending requests. However, this would be a significant redesign of how the query-frontend / querier communication works.

## Querier → Store-gateway shuffle sharding

Implemented in https://github.com/cortexproject/cortex/pull/3069.

### Introduction

As of today, the store-gateway supports blocks sharding with a customizable replication factor (defaults to 3). Blocks of a single tenant are sharded across all store-gateway instances, so to execute a query the querier may touch any store-gateway in the cluster.

The current sharding implementation is based on a **hash ring** formed by the store-gateway instances.

### Proposal

The proposed solution to add shuffle sharding support to the store-gateway is to **leverage the existing hash ring** to build a per-tenant **subring**, which is then used both by the querier and the store-gateway to know which store-gateway a block belongs to.

### Configuration

- Shuffle sharding can be enabled in the **store-gateway configuration.** It supports a **default sharding factor,** which is **overridable on a per-tenant basis** and live reloaded during runtime (using the existing limits config).
- The querier already requires the store-gateway configuration when blocks sharding is enabled. Similarly, when shuffle sharding is enabled the querier will require the store-gateway shuffle sharding configuration as well.

### Implementation notes

When shuffle sharding is enabled:

- The **store-gateway** `syncUsersBlocks()` will build a tenant’s subring for each tenant found scanning the bucket and will skip any tenant not belonging to its shard.<br />
  Likewise, ShardingMetadataFilter will first build a **tenant’s subring** and then use the existing logic to filter out blocks not belonging to the store-gateway instance itself. The tenant ID can be read from the block’s meta.json.
- The **querier** `blocksStoreReplicationSet.GetClientsFor()` will first build a **tenant’s subring** and then use the existing logic to find out which store-gateway instance each requested block belongs to (see the sketch after this list).
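A minimal sketch of the querier-side flow under this proposal. The interface and method names below (`ReadRing`, `GetReplicasForKey()`) are hypothetical simplifications and do not match the actual `blocksStoreReplicationSet.GetClientsFor()` signature; the point is only the order of operations: build the tenant’s subring first, then resolve each block within that subring.

```go
// ReadRing is a hypothetical, simplified view of the store-gateway ring.
type ReadRing interface {
	// ShuffleShard returns the per-tenant subring of the given size.
	ShuffleShard(tenantID string, size int) ReadRing
	// GetReplicasForKey returns the addresses of the instances owning the given key (eg. a block ID).
	GetReplicasForKey(key string) []string
}

// storeGatewaysForBlocks maps each requested block to the store-gateway
// instances expected to hold it, within the tenant's subring.
func storeGatewaysForBlocks(r ReadRing, tenantID string, blockIDs []string, shardSize int) map[string][]string {
	// Build the per-tenant subring first, so only the store-gateways in the
	// tenant's shard are considered.
	subring := r.ShuffleShard(tenantID, shardSize)

	// Then reuse the existing replication logic to find the instances each
	// block belongs to.
	gatewaysByBlock := make(map[string][]string, len(blockIDs))
	for _, blockID := range blockIDs {
		gatewaysByBlock[blockID] = subring.GetReplicasForKey(blockID)
	}
	return gatewaysByBlock
}
```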
### Evaluated alternatives

*Given the store-gateways already form a ring, and building the shuffle sharding on top of the ring (like in the write path) doesn’t introduce extra operational complexity, we haven’t discussed alternatives.*

## Querier → Ingesters shuffle sharding

We’re currently discussing/evaluating different options.

### Problem

Cortex must guarantee query correctness; transiently incorrect results may be cached and returned forever. The main problem to solve when introducing ingesters shuffle sharding on the read path is to make sure that a querier fetches data from all ingesters having at least 1 sample for a given tenant.

The problem to solve is: how can a querier efficiently find which ingesters have data for a given tenant? Each option must account for changes to the set of ingesters and changes to each tenant’s subring size.

### Proposal: use only the information contained in the ring

*This section describes an alternative approach. Discussion is still ongoing.*

The idea is for the queriers to be able to deduce which ingesters could possibly hold data for a given tenant by just consulting the ring (and the per-tenant subring sizes). We posit that this is possible with only a single piece of extra information: a single timestamp per ingester saying when the ingester first joined the ring.

#### Scenario: ingester scale up

When a new ingester is added to the ring, there will be a set of tenant subrings that see a change: an ingester being removed, and a new one being added. We need to guarantee that for some time period (the block flush interval), the removed ingester is also consulted for queries.

To do this, if during the subring selection we encounter an ingester added within the time period, we add it to the subring but continue node selection as before – in effect, selecting an extra ingester:

```go
var (
	// subringSize starts at the tenant's configured shard size and grows by one
	// for every recently added ingester selected below.
	subringSize   = shardSize
	selectedNodes []Node
	deadline      = time.Now().Add(-flushWindow)
)

for len(selectedNodes) < subringSize {
	token := random.Next()
	node := getNodeByToken(token)
	for {
		if containsNode(selectedNodes, node) { // helper checking whether the node was already selected
			node = node.Next()
			continue
		}
		if node.Added.After(deadline) {
			// The node joined the ring within the flush window: select it, but
			// also increase the subring size to select one extra node.
			subringSize++
			selectedNodes = append(selectedNodes, node)
			node = node.Next()
			continue
		}
		selectedNodes = append(selectedNodes, node)
		break
	}
}
```

#### Scenario: ingester scale down

When an ingester is permanently removed from the ring it will flush its data to the object store, and the subrings containing the removed ingester will gain a “new” ingester. Queries consult the store and merge the results with those from the ingesters, so no data will be missed.

Queriers and store-gateways will discover newly flushed blocks on the next sync (`-blocks-storage.bucket-store.sync-interval`, default 5 minutes).
Multiple ingesters should not be scaled down within this interval.

To improve read performance, queriers and rulers are usually configured with a non-zero value of the `-querier.query-store-after` option.
This option makes queriers and rulers consult **only** ingesters when running queries within the specified time window (eg. 12h).
During scale-down this needs to be lowered in order to let queriers and rulers use flushed blocks from the storage.

#### Scenario: increase size of a tenant’s subring

Node selection for subrings is stable – increasing the size of a subring is guaranteed to only add new nodes to it (and not remove any nodes). Hence, if a tenant’s subring is increased in size, the queriers will notice the config change and start consulting the new ingesters.

#### Scenario: decreasing size of a tenant’s subring

If a tenant’s subring decreases in size, there is currently no way for the queriers to know how big the ring was previously, and hence they will potentially miss an ingester with data for that tenant.

This is deemed an infrequent operation that we considered banning, but we have a proposal for how we might make it possible:

The proposal is to have separate read subring and write subring sizes in the config. The read subring will not be allowed to be smaller than the write subring. When reducing the size of a tenant’s subring, operators must first reduce the write subring, and then, two hours later when the blocks have been flushed, the read subring. In the majority of cases the read subring will not need to be specified, as it will default to the write subring size.
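A small sketch of how the separate read and write sizes could be resolved; the field and function names below are hypothetical and not part of the current Cortex limits configuration.

```go
// shardingLimits is a hypothetical view of the per-tenant limits.
type shardingLimits struct {
	WriteSubringSize int // subring size used on the write path
	ReadSubringSize  int // 0 means "not specified: default to the write subring size"
}

// readSubringSize returns the subring size to use on the read path. The read
// subring is never allowed to be smaller than the write subring.
func readSubringSize(l shardingLimits) int {
	if l.ReadSubringSize < l.WriteSubringSize {
		return l.WriteSubringSize
	}
	return l.ReadSubringSize
}
```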
### Considered alternative #1: Ingesters expose list of tenants

A possible solution could be keeping in the querier an in-memory data structure mapping each ingester to the list of tenants for which it has some data. This data structure would be constructed at querier startup, and then periodically updated, combining two pieces of information:

1. The current state of the ring
2. The list of tenants directly exposed by each ingester (via a dedicated gRPC call)

#### Scenario: new querier starts up

When a querier starts up and before getting ready:

1. It scans all ingesters (discovered via the ring) and fetches the list of tenants for which each ingester has some data
2. For each found tenant (unique list of tenant IDs across all ingester responses), the querier looks at the current state of the ring and adds to the map the list of ingesters currently assigned to the tenant shard, even if they don’t hold any data yet (because they may start receiving series shortly)

Then the querier watches the ingester ring and rebuilds the in-memory map whenever the ring topology changes.
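A minimal sketch of how this map could be built. Everything here (the types, the `shardForTenant` lookup, and the shape of the ingesters’ answers) is a hypothetical simplification of the flow described above, not an actual Cortex API.

```go
// ingesterTenants is the answer of the hypothetical "list tenants" gRPC call.
type ingesterTenants struct {
	addr    string   // ingester address
	tenants []string // tenants for which the ingester reported having data
}

// buildTenantIngestersMap returns, for each known tenant, the ingesters a
// querier should consult: the ones reporting data for the tenant plus the ones
// currently assigned to the tenant's shard (shardForTenant reads the ring).
func buildTenantIngestersMap(responses []ingesterTenants, shardForTenant func(tenantID string) []string) map[string][]string {
	set := map[string]map[string]struct{}{}
	add := func(tenant, addr string) {
		if set[tenant] == nil {
			set[tenant] = map[string]struct{}{}
		}
		set[tenant][addr] = struct{}{}
	}

	// 1. Map each tenant to the ingesters which reported data for it.
	for _, resp := range responses {
		for _, tenant := range resp.tenants {
			add(tenant, resp.addr)
		}
	}

	// 2. Also add the ingesters currently assigned to each tenant's shard,
	//    even if they don't hold any data yet.
	for tenant := range set {
		for _, addr := range shardForTenant(tenant) {
			add(tenant, addr)
		}
	}

	// Flatten into the final map: tenant ID -> ingester addresses to query.
	out := make(map[string][]string, len(set))
	for tenant, addrs := range set {
		for addr := range addrs {
			out[tenant] = append(out[tenant], addr)
		}
	}
	return out
}
```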
#### Scenario: querier receives a query for an unknown tenant

A new tenant starts remote writing to the cluster. The querier doesn’t have it in its in-memory map, so it adds the tenant to the map on the fly, just looking at the current state of the ring.

#### Scenario: ingester scale up / down

When a new ingester is added to / removed from the ring, the ring topology changes and queriers will update the in-memory map.

#### Scenario: per-tenant shard size increases

Queriers periodically (every 1m) reload the limits config file. When a tenant shard size change is detected, the querier updates the in-memory map for the affected tenant.

**Issue:** some time series data may be missing from queries for up to 1m.

#### Edge case: queriers notice the ring topology change before distributors

Consider the following scenario:

1. Tenant A shard is composed of ingesters 1,2,3,4,5,6
2. Tenant A is remote writing 1 single series, which gets replicated to ingesters 1,2,3
3. The ring topology changes and tenant A shard is ingesters 1,2,3,7,8,9
4. The querier notices the ring topology change and updates the in-memory map. Given tenant A series were only on ingesters 1,2,3, the querier maps tenant A to ingesters 1,2,3 (based on what it received from ingesters via gRPC) and 7,8,9 (based on the current state of the ring)
5. The distributor hasn’t updated the ring state yet
6. Tenant A remote writes 1 **new** series, which gets replicated to ingesters 4,5,6
7. The distributor updates the ring state
8. **Race condition:** the querier will not know that ingesters 4,5,6 contain tenant A data until the next sync

### Considered alternative #2: streaming updates from ingesters to queriers

*This section describes an alternative approach.*

#### Current state

As of today, queriers discover ingesters via the ring:

- **Ingesters** register (and update their heartbeat timestamp) to the ring, and queriers watch the ring, keeping an in-memory copy of the latest ingesters ring state.
- **Queriers** use the in-memory ring state to discover all ingesters that should be queried at query time.

#### Proposal

The proposal is to expose a new gRPC endpoint on ingesters, which allows queriers to receive a stream of real time updates from ingesters about the tenants for which an ingester currently has time series data.

From the querier side:

- At **startup** the querier discovers all existing ingesters. For each ingester, the querier calls the ingester’s gRPC endpoint WatchTenants() (to be created; see the sketch after this list). As soon as the WatchTenants() rpc is called, the ingester sends the entire set of tenants to the querier and then sends incremental updates (tenant added or removed from the ingester) while the WatchTenants() stream connection is alive.
- If the querier **loses the connection** to an ingester, it will automatically retry (with backoff) while the ingester is within the ring.
- The querier **watches the ring** to discover added/removed ingesters. When an ingester is added, the querier adds the ingester to the pool of ingesters whose state should be monitored via WatchTenants().
- At **query time,** the querier looks for all ingesters within the ring. There are two options:
  1. The querier knows the state of the ingester: the ingester will be queried only if it contains data for the query’s tenant.
  2. The querier doesn’t know the state of the ingester (eg. because it was just registered to the ring and WatchTenants() hasn’t succeeded yet): the ingester will be queried anyway (correctness first).
- The querier will fine tune [gRPC keepalive](https://godoc.org/google.golang.org/grpc/keepalive) settings to ensure a lost connection between the querier and an ingester is detected early and retried.
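A minimal sketch of the querier-side watch loop. WatchTenants() does not exist yet, so the client interface and update message below are hypothetical; the sketch only illustrates the intended flow: receive the initial full set, apply incremental updates, and retry with backoff on any failure.

```go
import (
	"context"
	"time"
)

// tenantUpdate is a hypothetical message on the WatchTenants() stream.
type tenantUpdate struct {
	TenantID string
	Removed  bool
}

type tenantStream interface {
	Recv() (*tenantUpdate, error)
}

// ingesterClient is a hypothetical client exposing the new endpoint.
type ingesterClient interface {
	WatchTenants(ctx context.Context) (tenantStream, error)
}

// watchIngesterTenants keeps the querier's view of one ingester's tenants up to
// date. The initial full set arrives as a series of "added" updates, followed
// by incremental updates while the stream is alive.
func watchIngesterTenants(ctx context.Context, client ingesterClient, tenants map[string]struct{}, backoff time.Duration) {
	for ctx.Err() == nil {
		stream, err := client.WatchTenants(ctx)
		if err != nil {
			time.Sleep(backoff) // retry with backoff while the ingester is within the ring
			continue
		}

		for {
			update, err := stream.Recv()
			if err != nil {
				break // lost connection (or keepalive failure): retry
			}
			if update.Removed {
				delete(tenants, update.TenantID)
			} else {
				tenants[update.TenantID] = struct{}{}
			}
		}

		// Without an up-to-date state the ingester must be queried for every
		// tenant (correctness first), so drop the stale entries before retrying.
		for tenant := range tenants {
			delete(tenants, tenant)
		}
	}
}
```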
#### Trade-offs

Pros:

- The querier logic used to find the ingesters for a tenant’s shard **does not require watching the overrides** config file (containing the tenant shard size overrides). Watching the file in the querier is problematic because of the introduced delays (ConfigMap update and Cortex file polling), which could lead to distributors applying changes before queriers.
- The querier **never uses the current state of the ring** as a source of information to detect which ingesters have data for a specific tenant. This information comes directly from the ingesters themselves, which makes the implementation less likely to be subject to race conditions.

Cons:

- Each querier needs to open a gRPC connection to each ingester. Given gRPC supports multiplexing, the underlying TCP connection could be the same connection used to fetch samples from ingesters at query time, so there would basically be a single TCP connection between a querier and an ingester.
- The “Edge case: queriers notice the ring topology change before distributors” described in alternative #1 can still happen in case of delays in the propagation of the state update from an ingester to queriers:
  - Short delay: a short delay (a few seconds) shouldn’t be a real problem. From the final user perspective, there’s no real difference between this edge case and a delay of a few seconds in the ingestion path (eg. Prometheus remote write lagging behind a few seconds). In the real case of Prometheus remote writing to Cortex, there’s no easy way to know whether the latest samples are missing because they have not been remote written yet by Prometheus or because of a delay in the propagation of this information between ingesters and queriers.
  - Long delay: in case of a networking issue propagating the state update from an ingester to the querier, the gRPC keepalive will trigger (because of the failed ping-pong) and the querier will remove the failing ingester’s in-memory data, so the ingester will always be queried for any query until the state stream is re-established.

## Ruler sharding

### Introduction

The ruler currently supports rule groups sharding across a pool of rulers. When sharding is enabled, rulers form a hash ring and each ruler uses the ring to check if it should evaluate a specific rule group.

At a polling interval (defaults to 1 minute), the ruler:

- Lists all the bucket objects to find all rule groups (listing is done specifying an empty delimiter so it returns objects at any depth)
- For each discovered rule group, hashes the object key and checks if it belongs to the range of tokens assigned to the ruler itself. If not, the rule group is discarded, otherwise it’s kept for evaluation.

### Proposal

We propose to introduce shuffle sharding in the ruler as well, leveraging the already existing hash ring used by the current sharding implementation.

The **configuration** will be extended to allow configuring:

- Enable/disable shuffle sharding
- Default shard size
- Per-tenant overrides (reloaded at runtime)

When shuffle sharding is enabled:

- The ruler lists (ListBucketV2) the tenants for which rule groups are stored in the bucket
- The ruler filters out tenants not belonging to its shard
- For each tenant belonging to its shard, the ruler does a ListBucketV2 call with the “<tenant-id>/” prefix and an empty delimiter to find all the rule groups, which are then evaluated in the ruler

The ruler re-syncs the rule groups from the bucket whenever one of the following conditions happens:

1. The periodic sync interval (configurable) elapses
2. The ring topology changes
3. The configured shard size of a tenant has changed

### Other notes

- The “subring” implementation is unoptimized. We will optimize it as part of this work to make sure no performance degradation is introduced when using the subring vs the normal ring.