---
title: "Shuffle sharding and zone awareness"
linkTitle: "Shuffle sharding and zone awareness"
weight: 1
slug: shuffle-sharding-and-zone-awareness
---

- Author: @pracucci, @tomwilkie, @pstibrany
- Reviewers:
- Date: August 2020
- Status: Accepted, implemented in [PR #3090](https://github.com/cortexproject/cortex/pull/3090)

## Shuffle sharding and zone awareness

### Background

Cortex shards the received series across all available ingesters. In a multi-tenant cluster, each tenant’s series are sharded across all ingesters. This allows the series to be horizontally scaled across the pool of ingesters, but it also suffers from some issues:

1. Given every tenant writes series to all ingesters, there’s no isolation between tenants - a single misbehaving tenant can affect the whole cluster.
2. Each ingester needs an open TSDB per tenant, which has a significant memory overhead. The larger the number of tenants, the higher the TSDB memory overhead, regardless of the number of series stored in each TSDB.
3. Similarly, the number of blocks uploaded to the storage every 2 hours is a function of the number of open TSDBs in each ingester. A cluster with a large number of small tenants will upload a very large number of blocks - each one very small - increasing the number of API calls against the storage bucket.

Cortex currently supports sharding a tenant to a subset of the ingesters on the write path ([PR #1947](https://github.com/cortexproject/cortex/pull/1947)), using a feature called “**subring**”. However, the current subring implementation suffers from two issues:

1. **No zone awareness:** it doesn’t guarantee that the selected instances are balanced across availability zones.
2. **No shuffling:** the implementation is based on the hash ring and selects N consecutive instances in the ring. This means that, instead of minimizing the likelihood that two tenants share the same instances, it increases it. To provide good isolation between tenants, we want to minimize the chances that two tenants share the same instances.

### Goal

The goal of this work is to add “shuffling” and “zone-awareness” to the subring built for a given tenant, honoring the following properties:

- **Stability:** given the same ring, the algorithm always generates the same subring for a given tenant, even when run on different machines.
- [**Consistency:**](https://en.wikipedia.org/wiki/Consistent_hashing) when the ring is resized, only n/m series are remapped on average (where n is the number of series and m is the number of instances in the ring).
- **Shuffling:** probabilistically and for a large enough cluster, ensure every tenant gets a different set of instances, with a reduced number of overlapping instances between any two tenants, to improve failure isolation.
- **Zone-awareness (balanced):** the subring built for each tenant contains a balanced number of instances for each availability zone. Selecting the same number of instances in each zone is an important property because we want to preserve the balance of in-memory series across ingesters. Having fewer replicas in one zone would mean more load per instance in that zone, which is something we want to avoid.

### Proposal

This proposal is based on [Amazon’s Shuffle Sharding article](https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/), and the algorithm has been inspired by the shuffle-sharding implementation in the [AWS Route53 Infima library](https://github.com/awslabs/route53-infima/blob/master/src/main/java/com/amazonaws/services/route53/infima/SimpleSignatureShuffleSharder.java).

Given a tenant and a shard size S (the number of instances the tenant’s data/workload should be sharded to), we build a subring selecting N instances from each zone, where N = ceil(S / number of zones). The shard size S is required to be a multiple of the number of zones, in order to select an equal number of instances from each zone.

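As a minimal sketch of this computation (the `shardSizePerZone` helper is illustrative, not the actual Cortex code, and assumes `fmt` is imported):

```go
// shardSizePerZone returns how many instances to select from each zone,
// i.e. N = ceil(S / number of zones). Since S must be a multiple of the
// number of zones, the ceiling division always yields a balanced split.
func shardSizePerZone(shardSize, numZones int) (int, error) {
	if shardSize%numZones != 0 {
		return 0, fmt.Errorf("shard size %d is not a multiple of the number of zones %d", shardSize, numZones)
	}
	return (shardSize + numZones - 1) / numZones, nil
}
```
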
To do it, we **treat each zone as a separate ring** and select N unique instances from each zone. The instance selection process works as follows (a Go sketch is shown after the list):

1. Generate a seed based on the tenant ID.
2. Initialise a pseudo-random number generator with the tenant’s seed. The random generator must guarantee predictable numbers given the same input seed.
3. Generate a sequence of N random numbers, where N is the number of instances to select from the zone. Each random number is used as a “token” to look up instances in the ring. For each random number:
   1. Look up the instance holding that token in the ring.
   2. If the instance has not been previously selected, then pick it.
   3. If the instance was previously selected (we call this a “collision”), then continue walking the ring clockwise until we find an instance which has not been selected yet.
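
The following Go sketch illustrates these steps for a single zone’s ring. The `instanceToken` type, the FNV-based seed and the `shuffleShard` helper are simplified assumptions for illustration, not the actual Cortex implementation, and the sketch assumes the zone holds at least `n` distinct instances:

```go
package main

import (
	"hash/fnv"
	"math/rand"
	"sort"
)

// instanceToken pairs a ring token with the instance owning it: a
// simplified stand-in for the real ring data structures.
type instanceToken struct {
	token    uint32
	instance string
}

// shuffleShard selects n unique instances from one zone's ring.
func shuffleShard(tenantID string, ring []instanceToken, n int) []string {
	// 1. Generate a seed based on the tenant ID.
	h := fnv.New64a()
	h.Write([]byte(tenantID))

	// 2. Initialise a pseudo-random number generator with the tenant's
	// seed: given the same seed, it always yields the same sequence.
	rnd := rand.New(rand.NewSource(int64(h.Sum64())))

	// Tokens must be sorted so we can binary-search a token and walk
	// the ring clockwise from it.
	sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })

	selected := make(map[string]bool, n)
	shard := make([]string, 0, n)

	// 3. Generate n random numbers, each used as a token to look up an
	// instance in the ring.
	for len(shard) < n {
		token := rnd.Uint32()

		// 3.1. Look up the instance holding the token: the first ring
		// token >= the random token, wrapping around the ring.
		i := sort.Search(len(ring), func(j int) bool { return ring[j].token >= token })

		// 3.2. Pick the instance if not selected yet; 3.3. on a
		// collision, continue walking the ring clockwise.
		for ; ; i++ {
			inst := ring[i%len(ring)].instance
			if !selected[inst] {
				selected[inst] = true
				shard = append(shard, inst)
				break
			}
		}
	}

	return shard
}
```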

### Guaranteed properties

#### Stability

The same tenant ID always generates the same seed. Given the same seed, the pseudo-random number generator always generates the same sequence of numbers.

This guarantees that, given the same ring, we generate the exact same subring for a given tenant.
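
For example, completing the hypothetical `shuffleShard` sketch above with a `main` function (and `fmt` added to its imports), the same tenant resolves to the same subring on every run, on every machine:

```go
func main() {
	ring := []instanceToken{
		{1, "I1"}, {8, "I1"}, {15, "I1"},
		{5, "I2"}, {11, "I2"}, {19, "I2"},
		{9, "I3"}, {13, "I3"}, {21, "I3"},
	}

	// Both calls print the same instances in the same order: the random
	// sequence is fully determined by the tenant ID.
	fmt.Println(shuffleShard("tenant-a", ring, 2))
	fmt.Println(shuffleShard("tenant-a", ring, 2))
}
```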

#### Consistency

The consistency property is honored by two aspects of the algorithm:

1. The quantity of random numbers generated is always equal to the shard size S, even in case of “collisions”. A collision happens when the instance holding the random token has already been picked and we need to select a different instance which has not been picked yet.
2. In case of a collision, we select the “next” instance by continuing to walk the ring, instead of generating another random number.

##### Example adding an instance to the ring

Let’s consider an initial ring with 3 instances and 1 zone (for simplicity):

- I1 - Tokens: 1,  8, 15
- I2 - Tokens: 5, 11, 19
- I3 - Tokens: 9, 13, 21

With a shard size S = 2, the random sequence looks up:

- 3 (I2)
- 6 (I1)

Then we add a new instance and the **updated ring** is:

- I1 - Tokens: 1,  8, 15
- I2 - Tokens: 5, 11, 19
- I3 - Tokens: 9, 13, 21
- I4 - Tokens: 4,  7, 17

Now, let’s compare two different algorithms to resolve collisions:

- Using the random generator:<br />
Random sequence = 3 (**I4**), 6 (I4 - collision, so an additional random number is generated), 12 (**I3**)<br />
**all instances are different** (I4, I3) compared to the initial subring (I2, I1)
- Walking the ring:<br />
Random sequence = 3 (**I4**), 6 (I4 - collision, next is **I1**)<br />
**only 1 instance is different** (I4, I1) compared to the initial subring (I2, I1)
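
The walking behaviour can be reproduced with a small helper that resolves a pre-computed token sequence against the ring (illustrative only: the real algorithm derives the tokens from the tenant’s seeded generator, as in the earlier sketch; this reuses the `instanceToken` type plus the `sort` and `fmt` imports):

```go
// lookupWalking resolves a sequence of tokens against the ring, walking
// clockwise past already-selected instances on collisions.
func lookupWalking(ring []instanceToken, tokens []uint32) []string {
	sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })

	selected := map[string]bool{}
	var shard []string
	for _, token := range tokens {
		i := sort.Search(len(ring), func(j int) bool { return ring[j].token >= token })
		for ; ; i++ {
			inst := ring[i%len(ring)].instance
			if !selected[inst] {
				selected[inst] = true
				shard = append(shard, inst)
				break
			}
		}
	}
	return shard
}

func main() {
	initial := []instanceToken{
		{1, "I1"}, {8, "I1"}, {15, "I1"},
		{5, "I2"}, {11, "I2"}, {19, "I2"},
		{9, "I3"}, {13, "I3"}, {21, "I3"},
	}
	updated := append([]instanceToken{{4, "I4"}, {7, "I4"}, {17, "I4"}}, initial...)

	fmt.Println(lookupWalking(initial, []uint32{3, 6})) // [I2 I1]
	fmt.Println(lookupWalking(updated, []uint32{3, 6})) // [I4 I1] - only I2 changed to I4
}
```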

#### Shuffling

Except when resolving collisions, the algorithm doesn’t walk the ring to find the next instance, but uses a sequence of random numbers. This guarantees that instances are shuffled between different tenants when building the subring.

#### Zone-awareness

We treat each zone as a separate ring and select an equal number of instances from each zone. This guarantees a fair balance of instances between zones.
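
Building on the hypothetical helpers sketched above, the zone-aware composition could look as follows (`shuffleShardZoneAware` and the map-based zone representation are assumptions for illustration, not the real Cortex API):

```go
// shuffleShardZoneAware builds the tenant's subring by treating each zone
// as an independent ring and selecting the same number of instances from
// each one.
func shuffleShardZoneAware(tenantID string, zones map[string][]instanceToken, shardSize int) ([]string, error) {
	n, err := shardSizePerZone(shardSize, len(zones))
	if err != nil {
		return nil, err
	}

	// Iterate zones in a deterministic order so the result is stable.
	names := make([]string, 0, len(zones))
	for name := range zones {
		names = append(names, name)
	}
	sort.Strings(names)

	var shard []string
	for _, name := range names {
		// N unique instances are selected from each zone, keeping the
		// subring balanced across availability zones.
		shard = append(shard, shuffleShard(tenantID, zones[name], n)...)
	}
	return shard, nil
}
```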

### Proof of concept

We’ve built a [reference implementation](https://github.com/cortexproject/cortex/pull/3090) of the proposed algorithm, to test the properties described above.

In particular, we’ve observed that the [actual distribution](https://github.com/cortexproject/cortex/pull/3090/files#diff-121ffce90aa9932f6b87ffd138e0f36aR281) of matching instances between different tenants is very close to the [theoretical one](https://docs.google.com/spreadsheets/d/1FXbiWTXi6bdERtamH-IfmpgFq1fNL4GP_KX_yJvbRi4/edit), and that the consistency and stability properties are both honored.