# Data Distribution Algorithm

## Background

The **main requirement** for a distribution algorithm is that profile series of the same tenant service should be
co-located, as spatial locality is crucial for compaction and query performance. The distribution algorithm should
not aim to place all profiles of a specific tenant service in a dedicated shard; instead, it should distribute them
among the optimal number of shards.

Distributors must be aware of availability zones and should only route profiles to segment writers in the home AZ.
Depending on the environment, crossing AZ network boundaries usually incurs penalties: cloud providers may charge
for cross-AZ traffic, while in on-premises deployments, AZs may represent different data centers with high-latency
connections.

The number of shards and segment writers is not constant and may change over time. The distribution algorithm should
aim to minimize data re-balancing when such changes occur. Nevertheless, we do not perform actual data re-balancing:
data written to a shard remains there until it is compacted or deleted. The only reason for minimizing re-balancing
is to optimize data locality; this concerns both data in transit, as segment writers are sensitive to the variance of
the datasets, and data at rest, as this is crucial for compaction efficiency and query performance in the end.

## Overview

* Profiles are *distributed* among segment writers based on the profile labels.
* Profile labels _must_ include the `service_name` label, which denotes the dataset the profile belongs to.
* Each profile belongs to a tenant.

Choosing a placement for a profile is a three-step process:

1. Finding *m* suitable locations from the total of *N* options using the request `tenant_id`.
2. Finding *n* suitable locations from the total of *m* options using the `service_name` label.
3. Finding the exact location *s* from the total of *n* options.

Where:

* **N** is the total number of shards in the deployment.
* **m** – the tenant shard limit – is configured explicitly.
* **n** – the dataset shard limit – is selected dynamically, based on the observed ingestion rate and patterns.

The number of shards in the deployment is determined by the number of nodes in the deployment:
* We seek to minimize the number of shards to optimize the cost of the solution: as we flush segments per shard,
the number of shards directly affects the number of write operations to the object storage.
* Experimentally, we found that a conservative processing rate is approximately 8 MB/s per core, depending on the
processor and the network bandwidth (thus, 128 cores should generally be enough to handle 1 GB/s). This unit is
recommended as a quantifier of the deployment size and the shard size.

Due to the nature of continuous profiling, it is usually beneficial to keep the same profile series on the same shard,
as this allows for more optimal utilization of the TSDB index (the inverted index used for searching by labels).
However, data is often distributed across profile series unevenly; using a series label hash as the distribution key
at any of the steps above may lead to significant data skews. To mitigate this, we propose to employ adaptive load
balancing: use `fingerprint mod n` as the distribution key at step 3 by default, and switch to `random(n)` when a
skew is observed.

In case of a failure, the next suitable segment writer is selected (from the *n* options available to the tenant service,
increasing the number if needed). The shard identifier is specified explicitly in the request to the segment writer to
maintain data locality in case of transient failures and rollouts.
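The three-step selection can be sketched in Go. This is an illustrative sketch only: `pickShard`, the FNV-1a label hashing, and the way the subring offsets are combined are assumptions, not the actual implementation; jump consistent hash (described in the implementation section) is used to place the subrings.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jump implements jump consistent hash (Lamping & Veach): it maps a key to a
// bucket in [0, numBuckets) with minimal movement when numBuckets grows.
func jump(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// hash64 hashes a string label with FNV-1a (an arbitrary choice for the sketch).
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// pickShard walks the three steps: tenant subring, dataset subring, exact shard.
// N is the total number of shards, m the tenant shard limit, n the dataset
// shard limit; fingerprint stands in for the series label hash.
func pickShard(tenantID, serviceName string, fingerprint uint64, N, m, n int32) int32 {
	tenantOffset := jump(hash64(tenantID), N)     // step 1: place the tenant subring (size m)
	datasetOffset := jump(hash64(serviceName), m) // step 2: place the dataset subring (size n) within it
	s := int32(fingerprint % uint64(n))           // step 3: fingerprint mod n (the default strategy)
	return (tenantOffset + datasetOffset + s) % N // subrings wrap around the ring
}

func main() {
	fmt.Println(pickShard("tenant-a", "checkout-service", 0xbeef, 12, 8, 4))
}
```

Because every step is a pure function of the request labels and the ring size, two distributors with the same view of the ring pick the same shard without coordination.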
The proposed approach assumes that two requests with the same distribution key may end up in different shards.
This should be a rare occurrence, but such placement is expected.

## Implementation

The existing ring implementation is used for discovery: the underlying [memberlist](https://github.com/hashicorp/memberlist)
library is used to maintain the list of the segment-writer service instances:

```mermaid
graph RL
    Lifecycler-.->Memberlist
    Memberlist-.->Ring

    subgraph SegmentWriter["segment-writer"]
        Lifecycler["lifecycler"]
    end

    subgraph Memberlist["memberlist"]
    end

    subgraph Distributor["distributor"]
        Ring["ring"]
    end
```

Instead of using the ring for the actual placement, the distributor builds its own view of the ring, which is then used to
determine the placement of the keys (profiles). The main reason for this is that the existing ring implementation is not
well suited for the proposed algorithm, as it does not provide a way to map a key to a specific shard.

In accordance with the algorithm, for each key (profile), we need to identify a subset of shards allowed for the tenant,
and a subset of shards allowed for the dataset. [Jump consistent hash](https://arxiv.org/pdf/1406.2294) is used to pick
the subring position:

```cpp
int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
    }
    return b;
}
```

The function ensures _balance_, which essentially states that objects are evenly distributed among buckets, and
_monotonicity_, which says that when the number of buckets is increased, objects move only from old buckets to new
buckets, thus doing no unnecessary rearrangement.
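The monotonicity property can be checked empirically with a direct Go port of the function (a sketch; the bucket counts and key range are arbitrary):

```go
package main

import "fmt"

// jump is a direct Go port of JumpConsistentHash above.
func jump(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

func main() {
	// When growing from 10 to 11 buckets, a key either stays in its bucket
	// or moves to the new bucket (10); it never moves between old buckets.
	moved, stayed := 0, 0
	for key := uint64(0); key < 100000; key++ {
		before, after := jump(key, 10), jump(key, 11)
		switch {
		case before == after:
			stayed++
		case after == 10:
			moved++
		default:
			panic("monotonicity violated: key moved between old buckets")
		}
	}
	// Balance implies roughly 1/11 of the keys move to the new bucket.
	fmt.Printf("stayed=%d moved=%d\n", stayed, moved)
}
```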
The diagram below illustrates how a specific key (profile) can be mapped to a specific shard and node:
1. The first subring (tenant shards) starts at offset 3, and its size is 8 (configured explicitly).
2. The second subring (dataset shards) starts at offset 1 within the parent subring (tenant) and includes 4 shards (determined dynamically).

```mermaid
block-beta
    columns 15

    nodes["nodes"]:2
    space
    node_a["node A"]:4
    node_b["node B"]:4
    node_c["node C"]:4

    shards["ring"]:2
    space
    shard_0["0"]
    shard_1["1"]
    shard_2["2"]
    shard_3["3"]
    shard_4["4"]
    shard_5["5"]
    shard_6["6"]
    shard_7["7"]
    shard_8["8"]
    shard_9["9"]
    shard_10["10"]
    shard_11["11"]

    tenant["tenant"]:2
    space:4
    ts_3["3"]
    ts_4["4"]
    ts_5["5"]
    ts_6["6"]
    ts_7["7"]
    ts_8["8"]
    ts_9["9"]
    space:2

    dataset["dataset"]:2
    space:5
    ds_4["4"]
    ds_5["5"]
    ds_6["6"]
    ds_7["7"]
    space:4
```

Such placement can create hot spots: in this specific example, all the dataset shards end up on the same node, which may
lead to uneven load distribution and poses problems in case of node failures. For example, if node B fails, all the
requests that target it would be routed to node A (or C), which may lead to a cascading failure.

To mitigate this, shards are mapped to instances through a separate mapping table. The mapping table is updated every
time the number of nodes changes, but it preserves the existing mapping as much as possible.
```mermaid
block-beta
    columns 15

    shards["ring"]:2
    space
    shard_0["0"]
    shard_1["1"]
    shard_2["2"]
    shard_3["3"]
    shard_4["4"]
    shard_5["5"]
    shard_6["6"]
    shard_7["7"]
    shard_8["8"]
    shard_9["9"]
    shard_10["10"]
    shard_11["11"]

    tenant["tenant"]:2
    space:4
    ts_3["3"]
    ts_4["4"]
    ts_5["5"]
    ts_6["6"]
    ts_7["7"]
    ts_8["8"]
    ts_9["9"]
    space:2

    dataset["dataset"]:2
    space:5
    ds_4["4"]
    ds_5["5"]
    ds_6["6"]
    ds_7["7"]
    space:4

    space:15

    mapping["mapping"]:2
    space
    map_4["4"]
    map_11["11"]
    map_5["5"]
    map_2["2"]
    map_3["3"]
    map_0["0"]
    map_7["7"]
    map_9["9"]
    map_8["8"]
    map_10["10"]
    map_1["1"]
    map_6["6"]

    space:15

    m_shards["shards"]:2
    space
    m_shard_0["0"]
    m_shard_1["1"]
    m_shard_2["2"]
    m_shard_3["3"]
    m_shard_4["4"]
    m_shard_5["5"]
    m_shard_6["6"]
    m_shard_7["7"]
    m_shard_8["8"]
    m_shard_9["9"]
    m_shard_10["10"]
    m_shard_11["11"]

    nodes["nodes"]:2
    space
    node_a["node A"]:4
    node_b["node B"]:4
    node_c["node C"]:4

    style m_shard_3 fill:#969,stroke:#333,stroke-width:4px
    style m_shard_0 fill:#969,stroke:#333,stroke-width:4px
    style m_shard_7 fill:#969,stroke:#333,stroke-width:4px
    style m_shard_9 fill:#969,stroke:#333,stroke-width:4px

    ds_4 --> map_3
    map_3 --> m_shard_3

    ds_5 --> map_0
    map_0 --> m_shard_0

    ds_6 --> map_7
    map_7 --> m_shard_7

    ds_7 --> map_9
    map_9 --> m_shard_9
```

In the current implementation, the mapping is a simple permutation generated with a predefined random seed using the
Fisher-Yates shuffle: when N shards are added or removed, at most N shards are moved to a different node. Ideally, the
random permutation should distribute the shards uniformly across the nodes.
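A minimal sketch of such a seeded permutation in Go (the seed value and function name are illustrative, not the actual implementation):

```go
package main

import (
	"fmt"
	"math/rand"
)

// shardMapping returns a permutation of [0, numShards) produced by a
// Fisher-Yates shuffle seeded with a predefined value. Because the sequence
// of swaps is fully determined by the seed, every distributor instance
// computes the same table without any coordination.
func shardMapping(numShards int, seed int64) []int {
	r := rand.New(rand.NewSource(seed))
	m := make([]int, numShards)
	for i := range m {
		m[i] = i
	}
	for i := numShards - 1; i > 0; i-- {
		j := r.Intn(i + 1)
		m[i], m[j] = m[j], m[i]
	}
	return m
}

func main() {
	fmt.Println(shardMapping(12, 42)) // deterministic permutation of shards 0..11
}
```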
Now, if node B fails, its shards are distributed among the remaining nodes, which ensures that the load remains
evenly distributed even in case of a failure. In our example:
1. Suppose we selected shard 6.
2. It is routed to node B via the mapping table.
3. An attempt to store a profile in shard 6 fails because of the node failure.
4. We pick the next location (7), which is then mapped to node C.
5. Until node B is back, writes to shard 6 will be routed to node C.

> The use of contiguous shard ranges in subrings minimizes the number of datasets affected by a topology change:
> the ones that overlap the parent ring boundaries are affected the most, and it is expected that such a dataset may
> change its mapping entirely. However, this impact is preferable to the alternative, where a larger number of datasets
> is affected in a more subtle way.

## Placement management

Placement is managed by the Placement Manager, which resides in the metastore. The Placement Manager is a singleton and
runs only on the Raft leader node.

The Placement Manager keeps track of dataset statistics based on the metadata records received from the segment-writer
service instances. Currently, the only metric that affects placement is the dataset size after it is written in the wire
format.

The Placement Manager builds placement rules at regular intervals, which are then used by the distributor to determine
the placement for each received profile. Since actual data re-balancing is not performed, the placement rules are not
synchronized across the distributor instances.
```mermaid
graph LR
    Distributor==>SegmentWriter
    PlacementAgent-.-PlacementRules
    SegmentWriter-->|metadata|PlacementManager
    SegmentWriter==>|data|Segments
    PlacementManager-.->PlacementRules

    subgraph Distributor["distributor"]
        PlacementAgent
    end

    subgraph Metastore["metastore"]
        PlacementManager
    end

    subgraph ObjectStore["object store"]
        PlacementRules(placement rules)
        Segments(segments)
    end

    subgraph SegmentWriter["segment-writer"]
    end
```

Placement rules are defined in the [protobuf format](placement/adaptive_placement/adaptive_placementpb/adaptive_placement.proto).

As of now, placement rules do not include the exact shards and mappings to nodes. Instead, they specify how many shards
are allocated for a specific dataset and tenant, and which load balancing strategy should be used: `fingerprint mod` or
`round robin`. In the future, placement management might be extended to include direct shard-to-node mappings, thus
implementing directory-based sharding.

A number of basic heuristics determine the minimal sufficient number of shards for a dataset, with minimal control
options. Specifically, due to a substantial lag in the feedback loop (up to tens of seconds), shard allocation
is pessimistic and may lead to over-allocation of shards if a persistent burst trend is observed. Conversely, when the
observed data rate decreases, the number of shards is not reduced immediately.
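For illustration only, a toy heuristic in this spirit might look like the following sketch; the function name, the per-shard throughput constant (the 8 MB/s unit from the overview), and the shrink-by-one policy are all assumptions, and the real heuristics are more involved:

```go
package main

import "fmt"

// shardThroughput is the conservative per-shard processing rate from the
// overview (~8 MB/s); an assumption used only for this sketch.
const shardThroughput = 8 << 20

// targetShards is a deliberately pessimistic toy heuristic: it scales up
// immediately to cover the observed rate, but shrinks by at most one shard
// per evaluation interval, mirroring the lagging feedback loop.
func targetShards(observedBytesPerSec, prevShards int) int {
	// Ceiling division: how many shards the observed rate requires.
	n := (observedBytesPerSec + shardThroughput - 1) / shardThroughput
	if n < 1 {
		n = 1
	}
	if n < prevShards {
		n = prevShards - 1 // decrease gradually, never all at once
	}
	return n
}

func main() {
	fmt.Println(targetShards(20<<20, 1)) // burst to 20 MB/s: grow to 3 shards
	fmt.Println(targetShards(1<<20, 3))  // rate dropped: shrink to 2, not 1
}
```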