# Data Distribution Algorithm

## Background

The **main requirement** for a distribution algorithm is that profile series of the same tenant service should be
co-located, as spatial locality is crucial for compaction and query performance. The distribution algorithm should
not aim to place all profiles of a specific tenant service in a dedicated shard; instead, it should distribute them
among the optimal number of shards.

Distributors must be aware of availability zones and should only route profiles to segment writers in the home AZ.
Depending on the environment, crossing AZ network boundaries usually incurs penalties: cloud providers may charge
for cross-AZ traffic, while in on-premises deployments, AZs may represent different data centers with high-latency
connections.

The number of shards and segment writers is not constant and may change over time. The distribution algorithm should
aim to minimize data re-balancing when such changes occur. Nevertheless, we do not perform actual data re-balancing:
data written to a shard remains there until it is compacted or deleted. The only reason for minimizing re-balancing
is to optimize data locality; this concerns both data in transit, as segment writers are sensitive to the variance of
the datasets, and data at rest, as this is crucial for compaction efficiency and query performance in the end.

## Overview

* Profiles are *distributed* among segment writers based on the profile labels.
* Profile labels _must_ include the `service_name` label, which denotes the dataset the profile belongs to.
* Each profile belongs to a tenant.

Choosing a placement for a profile is a three-step process:

1. Finding *m* suitable locations from the total of *N* options using the request `tenant_id`.
2. Finding *n* suitable locations from the total of *m* options using the `service_name` label.
3. Finding the exact location *s* from the total of *n* options.

Where:

* **N** is the total number of shards in the deployment.
* **m** – the tenant shard limit – is configured explicitly.
* **n** – the dataset shard limit – is selected dynamically, based on the observed ingestion rate and patterns.

The number of shards in the deployment is determined by the number of nodes in the deployment:
* We seek to minimize the number of shards to optimize the cost of the solution: as we flush segments per shard,
the number of shards directly affects the number of write operations to the object storage.
* Experimentally, we found that a conservative processing rate is approximately 8 MB/s per core, depending on the
processor and the network bandwidth (thus, 128 cores should generally be enough to handle 1 GB/s). This unit is
recommended as a quantifier of the deployment size and the shard size.

Due to the nature of continuous profiling, it is usually beneficial to keep the same profile series on the same shard,
as this allows for more optimal utilization of the TSDB index (the inverted index used for searching by labels).
However, data is often distributed across profile series unevenly; using a series label hash as the distribution key
at any of the steps above may lead to significant data skews. To mitigate this, we propose to employ adaptive load
balancing: use `fingerprint mod n` as the distribution key at step 3 by default, and switch to `random(n)` when a
skew is observed.

In case of a failure, the next suitable segment writer is selected (from the *n* options available to the tenant service,
increasing the number if needed). The shard identifier is specified explicitly in the request to the segment writer to
maintain data locality in case of transient failures and rollouts.
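The three-step selection can be sketched in Go. This is an illustrative sketch only: `pickShard`, the FNV-1a label hashing, and the way the subring offsets are combined are assumptions, not the actual implementation; jump consistent hash (described in the implementation section) is used to place the subrings.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jump implements jump consistent hash (Lamping & Veach): it maps a key to a
// bucket in [0, numBuckets) with minimal movement when numBuckets grows.
func jump(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// hash64 hashes a string label with FNV-1a (an arbitrary choice for the sketch).
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// pickShard walks the three steps: tenant subring, dataset subring, exact shard.
// N is the total number of shards, m the tenant shard limit, n the dataset
// shard limit; fingerprint stands in for the series label hash.
func pickShard(tenantID, serviceName string, fingerprint uint64, N, m, n int32) int32 {
	tenantOffset := jump(hash64(tenantID), N)     // step 1: place the tenant subring (size m)
	datasetOffset := jump(hash64(serviceName), m) // step 2: place the dataset subring (size n) within it
	s := int32(fingerprint % uint64(n))           // step 3: fingerprint mod n (the default strategy)
	return (tenantOffset + datasetOffset + s) % N // subrings wrap around the ring
}

func main() {
	fmt.Println(pickShard("tenant-a", "checkout-service", 0xbeef, 12, 8, 4))
}
```

Because every step is a pure function of the request labels and the ring size, two distributors with the same view of the ring pick the same shard without coordination.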
The proposed approach assumes that two requests with the same distribution key may end up in different shards.
This should be a rare occurrence, but such placement is expected.

## Implementation

The existing ring implementation is used for discovery: the underlying [memberlist](https://github.com/hashicorp/memberlist)
library is used to maintain the list of the segment-writer service instances:

```mermaid
graph RL
    Lifecycler-.->Memberlist
    Memberlist-.->Ring

    subgraph SegmentWriter["segment-writer"]
        Lifecycler["lifecycler"]
    end

    subgraph Memberlist["memberlist"]
    end

    subgraph Distributor["distributor"]
        Ring["ring"]
    end
```

Instead of using the ring for the actual placement, the distributor builds its own view of the ring, which is then used to
determine the placement of the keys (profiles). The main reason for this is that the existing ring implementation is not
well suited for the proposed algorithm, as it does not provide a way to map a key to a specific shard.

In accordance with the algorithm, for each key (profile), we need to identify a subset of shards allowed for the tenant,
and a subset of shards allowed for the dataset. [Jump consistent hash](https://arxiv.org/pdf/1406.2294) is used to pick
the subring position:

```cpp
int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
    }
    return b;
}
```

The function ensures _balance_, which essentially states that objects are evenly distributed among buckets, and
_monotonicity_, which says that when the number of buckets is increased, objects move only from old buckets to new
buckets, thus doing no unnecessary rearrangement.
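The monotonicity property can be checked empirically with a direct Go port of the function (a sketch; the bucket counts and key range are arbitrary):

```go
package main

import "fmt"

// jump is a direct Go port of JumpConsistentHash above.
func jump(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

func main() {
	// When growing from 10 to 11 buckets, a key either stays in its bucket
	// or moves to the new bucket (10); it never moves between old buckets.
	moved, stayed := 0, 0
	for key := uint64(0); key < 100000; key++ {
		before, after := jump(key, 10), jump(key, 11)
		switch {
		case before == after:
			stayed++
		case after == 10:
			moved++
		default:
			panic("monotonicity violated: key moved between old buckets")
		}
	}
	// Balance implies roughly 1/11 of the keys move to the new bucket.
	fmt.Printf("stayed=%d moved=%d\n", stayed, moved)
}
```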
The diagram below illustrates how a specific key (profile) can be mapped to a specific shard and node:
1. The first subring (tenant shards) starts at offset 3, and its size is 8 (configured explicitly).
2. The second subring (dataset shards) starts at offset 1 within the parent subring (tenant) and includes 4 shards (determined dynamically).

```mermaid
block-beta
    columns 15

    nodes["nodes"]:2
    space
    node_a["node A"]:4
    node_b["node B"]:4
    node_c["node C"]:4

    shards["ring"]:2
    space
    shard_0["0"]
    shard_1["1"]
    shard_2["2"]
    shard_3["3"]
    shard_4["4"]
    shard_5["5"]
    shard_6["6"]
    shard_7["7"]
    shard_8["8"]
    shard_9["9"]
    shard_10["10"]
    shard_11["11"]

    tenant["tenant"]:2
    space:4
    ts_3["3"]
    ts_4["4"]
    ts_5["5"]
    ts_6["6"]
    ts_7["7"]
    ts_8["8"]
    ts_9["9"]
    space:2

    dataset["dataset"]:2
    space:5
    ds_4["4"]
    ds_5["5"]
    ds_6["6"]
    ds_7["7"]
    space:4
```

Such placement can create hot spots: in this specific example, all the dataset shards end up on the same node, which may
lead to uneven load distribution and poses problems in case of node failures. For example, if node B fails, all the
requests that target it would be routed to node A (or C), which may lead to a cascading failure.

To mitigate this, shards are mapped to instances through a separate mapping table. The mapping table is updated every
time the number of nodes changes, but it preserves the existing mapping as much as possible.
```mermaid
block-beta
    columns 15

    shards["ring"]:2
    space
    shard_0["0"]
    shard_1["1"]
    shard_2["2"]
    shard_3["3"]
    shard_4["4"]
    shard_5["5"]
    shard_6["6"]
    shard_7["7"]
    shard_8["8"]
    shard_9["9"]
    shard_10["10"]
    shard_11["11"]

    tenant["tenant"]:2
    space:4
    ts_3["3"]
    ts_4["4"]
    ts_5["5"]
    ts_6["6"]
    ts_7["7"]
    ts_8["8"]
    ts_9["9"]
    space:2

    dataset["dataset"]:2
    space:5
    ds_4["4"]
    ds_5["5"]
    ds_6["6"]
    ds_7["7"]
    space:4

    space:15

    mapping["mapping"]:2
    space
    map_4["4"]
    map_11["11"]
    map_5["5"]
    map_2["2"]
    map_3["3"]
    map_0["0"]
    map_7["7"]
    map_9["9"]
    map_8["8"]
    map_10["10"]
    map_1["1"]
    map_6["6"]

    space:15

    m_shards["shards"]:2
    space
    m_shard_0["0"]
    m_shard_1["1"]
    m_shard_2["2"]
    m_shard_3["3"]
    m_shard_4["4"]
    m_shard_5["5"]
    m_shard_6["6"]
    m_shard_7["7"]
    m_shard_8["8"]
    m_shard_9["9"]
    m_shard_10["10"]
    m_shard_11["11"]

    nodes["nodes"]:2
    space
    node_a["node A"]:4
    node_b["node B"]:4
    node_c["node C"]:4

    style m_shard_3 fill:#969,stroke:#333,stroke-width:4px
    style m_shard_0 fill:#969,stroke:#333,stroke-width:4px
    style m_shard_7 fill:#969,stroke:#333,stroke-width:4px
    style m_shard_9 fill:#969,stroke:#333,stroke-width:4px

    ds_4 --> map_3
    map_3 --> m_shard_3

    ds_5 --> map_0
    map_0 --> m_shard_0

    ds_6 --> map_7
    map_7 --> m_shard_7

    ds_7 --> map_9
    map_9 --> m_shard_9
```

In the current implementation, the mapping is a simple permutation generated with a predefined random seed using the
Fisher-Yates shuffle: when N shards are added or removed, at most N shards are moved to a different node. Ideally, the
random permutation should distribute the shards uniformly across the nodes.
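A minimal sketch of such a seeded permutation in Go (the seed value and function name are illustrative, not the actual implementation):

```go
package main

import (
	"fmt"
	"math/rand"
)

// shardMapping returns a permutation of [0, numShards) produced by a
// Fisher-Yates shuffle seeded with a predefined value. Because the sequence
// of swaps is fully determined by the seed, every distributor instance
// computes the same table without any coordination.
func shardMapping(numShards int, seed int64) []int {
	r := rand.New(rand.NewSource(seed))
	m := make([]int, numShards)
	for i := range m {
		m[i] = i
	}
	for i := numShards - 1; i > 0; i-- {
		j := r.Intn(i + 1)
		m[i], m[j] = m[j], m[i]
	}
	return m
}

func main() {
	fmt.Println(shardMapping(12, 42)) // deterministic permutation of shards 0..11
}
```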
Now, if node B fails, its shards are distributed among the remaining nodes, which ensures that the load remains
evenly distributed even in case of a failure. In our example:
1. Suppose we selected shard 6.
2. It is routed to node B via the mapping table.
3. An attempt to store a profile in shard 6 fails because of the node failure.
4. We pick the next location (7), which is then mapped to node C.
5. Until node B is back, writes to shard 6 will be routed to node C.

> The use of contiguous shard ranges in subrings minimizes the number of datasets affected by a topology change:
> the ones that overlap the parent ring boundaries are affected the most, and it is expected that such a dataset may
> change its mapping entirely. However, this impact is preferable to the alternative, where a larger number of datasets
> is affected in a more subtle way.

## Placement management

Placement is managed by the Placement Manager, which resides in the metastore. The Placement Manager is a singleton and
runs only on the Raft leader node.

The Placement Manager keeps track of dataset statistics based on the metadata records received from the segment-writer
service instances. Currently, the only metric that affects placement is the dataset size after it is written in the wire
format.

The Placement Manager builds placement rules at regular intervals, which are then used by the distributor to determine
the placement for each received profile. Since actual data re-balancing is not performed, the placement rules are not
synchronized across the distributor instances.
```mermaid
graph LR
    Distributor==>SegmentWriter
    PlacementAgent-.-PlacementRules
    SegmentWriter-->|metadata|PlacementManager
    SegmentWriter==>|data|Segments
    PlacementManager-.->PlacementRules

    subgraph Distributor["distributor"]
        PlacementAgent
    end

    subgraph Metastore["metastore"]
        PlacementManager
    end

    subgraph ObjectStore["object store"]
        PlacementRules(placement rules)
        Segments(segments)
    end

    subgraph SegmentWriter["segment-writer"]
    end
```

Placement rules are defined in the [protobuf format](placement/adaptive_placement/adaptive_placementpb/adaptive_placement.proto).

As of now, placement rules do not include the exact shards and mappings to nodes. Instead, they specify how many shards
are allocated for a specific dataset and tenant, and which load balancing strategy should be used: `fingerprint mod` or
`round robin`. In the future, placement management might be extended to include direct shard-to-node mappings, thus
implementing directory-based sharding.

A number of basic heuristics determine the minimal sufficient number of shards for a dataset, with minimal control
options. Specifically, due to a substantial lag in the feedback loop (up to tens of seconds), shard allocation
is pessimistic and may lead to over-allocation of shards if a persistent burst trend is observed. Conversely, when the
observed data rate decreases, the number of shards is not reduced immediately.
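For illustration only, a toy heuristic in this spirit might look like the following sketch; the function name, the per-shard throughput constant (the 8 MB/s unit from the overview), and the shrink-by-one policy are all assumptions, and the real heuristics are more involved:

```go
package main

import "fmt"

// shardThroughput is the conservative per-shard processing rate from the
// overview (~8 MB/s); an assumption used only for this sketch.
const shardThroughput = 8 << 20

// targetShards is a deliberately pessimistic toy heuristic: it scales up
// immediately to cover the observed rate, but shrinks by at most one shard
// per evaluation interval, mirroring the lagging feedback loop.
func targetShards(observedBytesPerSec, prevShards int) int {
	// Ceiling division: how many shards the observed rate requires.
	n := (observedBytesPerSec + shardThroughput - 1) / shardThroughput
	if n < 1 {
		n = 1
	}
	if n < prevShards {
		n = prevShards - 1 // decrease gradually, never all at once
	}
	return n
}

func main() {
	fmt.Println(targetShards(20<<20, 1)) // burst to 20 MB/s: grow to 3 shards
	fmt.Println(targetShards(1<<20, 3))  // rate dropped: shrink to 2, not 1
}
```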