# Block Compaction

## Background

The Pyroscope ingestion pipeline is designed to gather data in memory as small segments, which are periodically flushed
to object storage, along with the metadata entries being added to the metastore index. Depending on the configuration
and deployment scale, the number of segments created per second can increase significantly, reaching millions of objects
per hour or day. This can lead to performance degradation in the query path due to high read amplification caused by the
large number of small segments. In addition to read amplification, a high number of metadata entries can also lead to
performance degradation across the entire cluster, impacting the write path as well.

The new background compaction process helps mitigate this by merging small segments into larger ones, aiming to reduce
the number of objects a query needs to fetch from object storage.

# Compaction Service

The compaction service is responsible for planning compaction jobs, scheduling their execution, and updating the
metastore index with the results. The compaction service resides within the metastore component, while the compaction
worker is a separate service designed to scale horizontally.

The compaction service relies on the Raft protocol to guarantee consistency across the replicas. The diagram below
illustrates the interaction between the compaction worker and the compaction service: workers poll the service on a
regular basis to request new compaction jobs and report status updates.

A status update is processed by the leader node in two steps, each of which is a Raft command committed to the log:
1. First, the leader prepares the plan update – compaction job state changes based on the reported status updates.
   This is a read-only operation that never modifies the node state.
2. The leader proposes the plan update: all the replicas must apply the planned changes to their state in an idempotent
   way, if the proposal is accepted (committed to the Raft log).

Critical sections are guaranteed to be executed serially in the context of the Raft state machine and by the same
leader (within the same *term*), and atomically from the cluster's perspective. If the prepared compaction plan update
is not accepted by the Raft log, the plan update is discarded, and the new leader will propose a new plan.

The two-step process ensures that all the replicas use the same compaction plan, regardless of their internal state,
as long as the replicas can apply the `UpdateCompactionPlan` change. This holds even if the compaction algorithm
(the `GetCompactionPlanUpdate` step) differs across the replicas during an ongoing migration – a version upgrade or
downgrade.

> As of now, both steps are committed to the Raft log. However, as an optimization, the first step – preparation,
> can be implemented as a **Linearizable Read** through **Read Index** (which we already use in metadata queries)
> to avoid unnecessary replication of the read-only operation. This approach is already used by the metadata index
> cleaner: leader read with a follow-up proposal.
>
> However, unlike cleanup, compaction is a more complex operation, and the serialization guarantees provided
> by Raft command execution flow help avoid many potential issues with concurrent read/write access.

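
The two-step flow described above can be illustrated with a toy state machine. This is only a sketch under stated assumptions: `Job`, `PlanUpdate`, `FSM`, `Prepare`, and `Apply` are hypothetical names invented for the example and are not the metastore's actual types. `Prepare` stands in for the read-only `GetCompactionPlanUpdate` step and `Apply` for the idempotent `UpdateCompactionPlan` step.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical, simplified job record held by every replica.
type Job struct {
	Status string // "unassigned" or "in_progress"
	Token  uint64 // fencing token set at assignment
}

// PlanUpdate is a hypothetical stand-in for the prepared plan update.
type PlanUpdate struct {
	Assigned  map[string]uint64 // job name -> fencing token
	Completed []string          // jobs to remove from the schedule
}

// FSM is a toy replica state machine.
type FSM struct{ jobs map[string]*Job }

func NewFSM(names ...string) *FSM {
	f := &FSM{jobs: map[string]*Job{}}
	for _, n := range names {
		f.jobs[n] = &Job{Status: "unassigned"}
	}
	return f
}

// Prepare is the read-only step: it inspects the state and returns the
// planned changes without modifying anything.
func (f *FSM) Prepare(index uint64, completed []string, capacity int) PlanUpdate {
	u := PlanUpdate{Assigned: map[string]uint64{}}
	done := map[string]bool{}
	for _, n := range completed {
		if _, ok := f.jobs[n]; ok {
			u.Completed = append(u.Completed, n)
			done[n] = true
		}
	}
	var free []string
	for n, j := range f.jobs {
		if j.Status == "unassigned" && !done[n] {
			free = append(free, n)
		}
	}
	sort.Strings(free) // map iteration order is random; keep the plan deterministic
	for _, n := range free {
		if len(u.Assigned) >= capacity {
			break
		}
		u.Assigned[n] = index
	}
	return u
}

// Apply is the state-changing step. It only sets absolute values,
// so applying the same update twice leaves the state unchanged.
func (f *FSM) Apply(u PlanUpdate) {
	for _, n := range u.Completed {
		delete(f.jobs, n)
	}
	for n, token := range u.Assigned {
		if j, ok := f.jobs[n]; ok {
			j.Status, j.Token = "in_progress", token
		}
	}
}

func main() {
	leader, follower := NewFSM("a", "b", "c"), NewFSM("a", "b", "c")
	u := leader.Prepare(42, []string{"c"}, 1)
	for _, replica := range []*FSM{leader, follower} {
		replica.Apply(u)
		replica.Apply(u) // replaying the same update is harmless
	}
	fmt.Println(len(leader.jobs), len(follower.jobs), follower.jobs["a"].Token)
}
```

Because the follower never runs `Prepare`, it ends up with the same schedule as the leader even if its own planning logic were different, which is the property the two-step process is meant to provide.
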

```mermaid
sequenceDiagram
    participant W as Compaction Worker

    box Compaction Service
        participant H as Endpoint
        participant R as Raft
    end

    loop
        W ->>+H: PollCompactionJobsRequest

        critical
            critical FSM state read
                H ->>R: GetCompactionPlanUpdate
                create participant U as Plan Update
                R ->>U: 
                U ->>+S: Job status updates
                Note right of U: Job ownership is protected with<br>leases with fencing token
                S ->>-U: Job state changes
                U ->>+S: Assign jobs
                S ->>-U: Job state changes
                U ->>+P: Create jobs
                Note right of U: New jobs are created if<br>workers have enough capacity
                P ->>P: Dequeue blocks<br>and load tombstones
                P ->>-U: New jobs
                U ->>+S: Add jobs
                S ->>-U: Job state changes
                destroy U
                U ->>R: CompactionPlanUpdate
                R ->>H: CompactionPlanUpdate
            end

            critical FSM state update
                H ->>R: UpdateCompactionPlan
                R ->>S: Update schedule<br>(new, completed, assigned, reassigned jobs)
                R ->>P: Remove source blocks from the planner queue (new jobs)
                R ->>I: Replace source blocks in the index (completed jobs)<br>and create tombstones for deleted
                I ->>+C: Add new blocks
                C ->>C: Enqueue
                C ->>-I: 
                I ->>R: 
                R ->>H: CompactionPlanUpdate
            end
        end

        H ->> W: PollCompactionJobsResponse
    end

    box FSM
        participant C as Compactor
        participant P as Planner
        participant S as Scheduler
        participant I as Metadata Index
    end
```

---

# Job Planner

The compactor is responsible for maintaining a queue of source blocks eligible for compaction. Currently, this queue
is a simple doubly-linked FIFO structure, populated with new block batches as they are added to the index. In the
current implementation, a new compaction job is created once a sufficient number of blocks has been enqueued.
Compaction jobs are planned on demand when requests are received from the compaction service.

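
As a rough illustration of the queue described above, the sketch below keeps blocks in a doubly-linked FIFO list and cuts a job only when enough blocks are queued. The `Planner` type, its methods, and the `jobSize` threshold are assumptions made for the example, not the actual implementation. For brevity the blocks are removed as soon as the job is planned, whereas in the flow shown in the diagram the removal happens in the state update step.

```go
package main

import (
	"container/list"
	"fmt"
)

// Planner is a hypothetical, minimal block queue.
type Planner struct {
	queue   *list.List // FIFO of block IDs, oldest at the front
	jobSize int        // assumed threshold: blocks needed to create a job
}

func NewPlanner(jobSize int) *Planner {
	return &Planner{queue: list.New(), jobSize: jobSize}
}

// Enqueue appends a batch of new blocks to the back of the queue.
func (p *Planner) Enqueue(blocks ...string) {
	for _, b := range blocks {
		p.queue.PushBack(b)
	}
}

// Plan is called on demand. It returns the oldest jobSize blocks as a new
// job, or false if not enough blocks have been enqueued yet.
func (p *Planner) Plan() ([]string, bool) {
	if p.jobSize <= 0 || p.queue.Len() < p.jobSize {
		return nil, false
	}
	job := make([]string, 0, p.jobSize)
	for len(job) < p.jobSize {
		front := p.queue.Front()
		job = append(job, p.queue.Remove(front).(string))
	}
	return job, true
}

func main() {
	p := NewPlanner(2)
	p.Enqueue("block-1")
	if _, ok := p.Plan(); !ok {
		fmt.Println("not enough blocks yet")
	}
	p.Enqueue("block-2", "block-3")
	job, _ := p.Plan()
	fmt.Println("job:", job, "left in queue:", p.queue.Len())
}
```
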

The queue is segmented by the `Tenant`, `Shard`, and `Level` attributes of the block metadata entries, meaning that
a block compaction never crosses these boundaries. This segmentation helps avoid unnecessary compactions of unrelated
blocks. However, the downside is that blocks are never compacted across different shards, which can lead to suboptimal
compaction results. Due to the dynamic data placement, it is possible for a tenant to be placed on a shard for only a
short period of time. As a result, the data in that shard may not be compacted with other data from the same tenant.

Cross-shard compaction is to be implemented as a future enhancement. The observed impact of the limitation is moderate.

## Data Layout

Profiling data from each service (identified by the `service_name` label) is stored as a separate dataset within a block.

The block layout is composed of a collection of non-overlapping, independent datasets, each containing distinct data.
At compaction, matching datasets from different blocks are merged: their tsdb index, symbols, and profile tables are
merged and rewritten to a new block, to optimize the data for efficient reading.

---

# Job Scheduler

The scheduler implements the basic **Small Job First** strategy: blocks of lower levels are considered smaller than
blocks of higher levels, and their compaction is prioritized. This is justifiable because the smaller blocks affect
read amplification more than the larger blocks, and the compaction of smaller blocks is more efficient.

---

Compaction jobs are assigned to workers in the order of their priority.

Internally, the scheduler maintains a priority queue of jobs for each compaction level. Jobs of lower levels are
assigned first, and the scheduler does not consider jobs of higher levels until all eligible jobs of lower levels are
assigned.


The priority is determined by several factors:
1. Compaction level.
2. Status (enum order).
    - `COMPACTION_STATUS_UNSPECIFIED`: unassigned jobs.
    - `COMPACTION_STATUS_IN_PROGRESS`: in-progress jobs. The first job that can't be reassigned is a sentinel:
      no more jobs are eligible for assignment at this level.
3. Failures: jobs with fewer failures are prioritized.
4. Lease expiration time: the job with the earliest lease expiration time is considered first.

See [Job Status Description](#job-status-description) for more details.

> The challenge is that we don't know the capacity of our worker fleet in advance, and we have no control over them;
> they can appear and disappear at any time. Another problem is that in some failure modes, such as unavailability or
> lack of compaction workers, or temporary unavailability of the metastore service, the number of blocks to be compacted
> may reach significant levels (millions).
>
> Therefore, we use an adaptive approach to keep the scheduler's job queue short while ensuring the compaction
> workers are fully utilized. In every request, the worker specifies how many free slots it has available for new jobs.
> As the compaction procedure is a synchronous CPU-bound task, we use the number of logical CPU cores as the worker's max
> capacity and decrement it for each in-progress compaction job. When a new request arrives, it specifies the current
> worker's capacity, which serves as evidence that the entire worker fleet has enough resources to handle at least
> this number of jobs. Thus, for every request, we try to enqueue a number of jobs equal to the reported capacity.
>
> Over time, this ensures good balance between the number of jobs in the queue and the worker capacity utilization,
> even if there are millions of blocks to compact.

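
The four priority factors listed above can be read as a lexicographic comparison. The sketch below is one possible way to express it; the `Job` struct, the `Less` function, and the numeric enum values are assumptions for illustration and do not mirror the real scheduler types.

```go
package main

import (
	"fmt"
	"sort"
)

// Status values are assumed to follow the enum order used in the text:
// unassigned jobs sort before in-progress ones.
type Status int

const (
	StatusUnspecified Status = iota // unassigned
	StatusInProgress
)

// Job is a hypothetical view of a scheduled job.
type Job struct {
	Name           string
	Level          uint32
	Status         Status
	Failures       uint32
	LeaseExpiresAt int64 // timestamp taken from the Raft log entry
}

// Less reports whether a should be considered before b:
// level, then status, then failures, then lease expiration.
func Less(a, b Job) bool {
	if a.Level != b.Level {
		return a.Level < b.Level
	}
	if a.Status != b.Status {
		return a.Status < b.Status
	}
	if a.Failures != b.Failures {
		return a.Failures < b.Failures
	}
	return a.LeaseExpiresAt < b.LeaseExpiresAt
}

// Order returns job names in priority order without modifying the input.
func Order(jobs []Job) []string {
	sorted := append([]Job(nil), jobs...)
	sort.SliceStable(sorted, func(i, j int) bool { return Less(sorted[i], sorted[j]) })
	names := make([]string, len(sorted))
	for i, j := range sorted {
		names[i] = j.Name
	}
	return names
}

func main() {
	jobs := []Job{
		{Name: "d", Level: 1},
		{Name: "c", Level: 0, Status: StatusInProgress, Failures: 1, LeaseExpiresAt: 10},
		{Name: "b", Level: 0, Status: StatusInProgress, LeaseExpiresAt: 50},
		{Name: "a", Level: 0},
		{Name: "e", Level: 0, Status: StatusInProgress, LeaseExpiresAt: 20},
	}
	fmt.Println(Order(jobs))
}
```

A single sorted slice is used here only to keep the example short; the text describes a separate priority queue per compaction level, which yields the same ordering.
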

---

## Job Ownership

Distributed locking implementation is inspired by [The Chubby lock service](https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf)
and [Leases: An Efficient Fault-Tolerant Mechanism
for Distributed File Cache Consistency](https://dl.acm.org/doi/pdf/10.1145/74851.74870). The implementation is based on
the Raft protocol.

Ownership of a compaction job is granted to a compaction worker for a specified period – a *lease*:
> A lease is a contract that gives its holder specified rights over property for a limited period of time.

The real-time clock of the worker and the scheduler cannot be used; instead, the timestamp of the Raft log entry,
assigned by the Raft leader when the entry is appended to the log, serves as the reference point in time.

> The fact that leases are allocated by the current leader allows for spurious *lease invalidation* when the leader
> changes and the clock skew exceeds the lease duration. This is acceptable because jobs will be reassigned repeatedly,
> and the occurrence of the event should be very rare. However, the solution does not tolerate clock skews exceeding
> the job lease duration (which is 15 seconds by default).

The log entry index is used as the [fencing token](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html)
of protected resources (compaction jobs).

The Raft log entry index is a monotonically increasing integer, guaranteed to be unique for each command.
Each time a job is assigned to a worker, the worker is provided with the current Raft log index as the fencing token,
which is also assigned to the job. For subsequent requests, the worker must provide the fencing token it was given at
assignment. The ownership of the job is confirmed if the provided token is greater than or equal to the job's token.
The job's token may change if the job is reassigned to another worker, and the new token is derived from the current
Raft log index, which is guaranteed to be greater.

> Token authentication is not enforced in this design, as the system operates in a trusted environment with cooperative
> workers. However, a malicious worker could specify an arbitrary token. In the future, we may consider implementing a
> basic authentication mechanism based on cryptographic signatures to further ensure the integrity of token usage.
>
> This is an advisory locking mechanism, meaning resources are not automatically restricted from access when the lock
> is not acquired. Consequently, a client might choose to delete source blocks associated with a compaction job or
> continue processing the job even without holding the lease. This behavior, however, should be avoided in the worker
> implementation.

## Procedures

### Assignment

When a worker requests a new assignment, the scheduler must find the highest-priority job that is not assigned yet,
and assign it to the worker. When a job is assigned, the worker is given a lease with a deadline.
The worker should refresh the lease before it expires.

### Lease Refresh

The worker must send a status update to the scheduler to refresh the lease.
The scheduler must update the lease expiration time if the worker still owns the job.

### Reassignment

The scheduler may revoke a job if the worker does not send the status update within the lease duration.

When a new assignment is requested by a worker, the scheduler inspects in-progress jobs and checks if the
lease duration has expired. If the lease has expired, the job is reassigned to the worker that requested the
new assignment.

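
The lease and fencing-token rules above can be sketched as follows. This is a simplified model under stated assumptions: `Command`, `Job`, `Assign`, `Refresh`, and `ReassignIfExpired` are hypothetical names, timestamps are plain integers in seconds, and the 15-second lease duration is the default mentioned in the text.

```go
package main

import "fmt"

// leaseDuration is the default job lease duration, in seconds.
const leaseDuration int64 = 15

// Command is a hypothetical view of a Raft log entry: its index serves as
// the fencing token and its timestamp as the only clock.
type Command struct {
	Index     uint64
	Timestamp int64
}

// Job is a hypothetical scheduled job.
type Job struct {
	Token          uint64
	LeaseExpiresAt int64
}

// Assign grants the job to a worker and returns the fencing token.
func Assign(j *Job, c Command) uint64 {
	j.Token = c.Index
	j.LeaseExpiresAt = c.Timestamp + leaseDuration
	return j.Token
}

// Refresh extends the lease if the worker still owns the job, that is,
// if its token is greater than or equal to the job's token.
func Refresh(j *Job, c Command, workerToken uint64) bool {
	if workerToken < j.Token {
		return false // the job was reassigned; the worker should stop
	}
	j.LeaseExpiresAt = c.Timestamp + leaseDuration
	return true
}

// ReassignIfExpired hands the job to a new worker when the command
// timestamp exceeds the lease expiration time.
func ReassignIfExpired(j *Job, c Command) (uint64, bool) {
	if c.Timestamp <= j.LeaseExpiresAt {
		return 0, false
	}
	return Assign(j, c), true
}

func main() {
	job := &Job{}
	first := Assign(job, Command{Index: 5, Timestamp: 1000})
	fmt.Println("refresh by owner:", Refresh(job, Command{Index: 6, Timestamp: 1010}, first))
	second, ok := ReassignIfExpired(job, Command{Index: 8, Timestamp: 1030})
	fmt.Println("reassigned:", ok, "new token:", second)
	fmt.Println("refresh by old owner:", Refresh(job, Command{Index: 9, Timestamp: 1031}, first))
}
```

Since the new token is taken from a later log index, any update carrying the old token compares as smaller and is rejected without the scheduler having to notify the previous owner.
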

---

If the timestamp of the current Raft log entry (command) exceeds the job `lease_expires_at` timestamp,
the scheduler must revoke the job:
1. Set the status to `COMPACTION_STATUS_IN_PROGRESS`.
2. Allocate a new lease with an expiration period calculated starting from the current command timestamp.
3. Set the fencing token to the current command index (guaranteed to be higher than the job fencing token).

---

The worker instance that has lost the job is not notified immediately. If the worker reports an update for a job that it
is not assigned to, or if the job is not found (for example, if it has been completed by another worker), the scheduler
does not allocate a new lease; the worker should stop processing. This mechanism prevents the worker from processing
jobs unnecessarily.

If the worker is not capable of executing the job, it may abandon the job without further notifications. The scheduler
will eventually reassign the job to another worker. The lost job might be reassigned to the same worker instance if that
instance detects the loss before others do: abandoned jobs are assigned to the first worker that requests new
assignments when no unassigned jobs are available.

There is no explicit mechanism for reporting a failure from the worker. In fact, the scheduler must not rely on error
reports from workers, as jobs that cause workers to crash would yield no reports at all.

To avoid infinite reassignment loops, the scheduler keeps track of reassignments (failures) for each job. If the number
of failures exceeds a set threshold, the job is not reassigned and remains at the bottom of the queue. Once the cause of
failure is resolved, the error limit can be temporarily increased to reprocess these jobs.

The scheduler queue has a size limit. Typically, the only scenario in which this limit is reached is when the compaction
process is not functioning correctly (e.g., due to a bug in the compaction procedure), preventing blocks from being
compacted and resulting in many jobs remaining in a failed state. Once the queue size limit is reached, failed jobs are
evicted, meaning the corresponding blocks will never be compacted. This may increase read amplification for data queries
and bloat the metadata index. Therefore, the limit should be large enough. The recommended course of action is to roll
back or fix the bug and restart the compaction process, temporarily increasing the error limit if necessary.

### Job Completion

When the worker reports a successful completion of the job, the scheduler must remove the job from the schedule and
notify the planner about the completion.

## Job Status Description

The diagram below depicts the state machine of the job status.

```mermaid
stateDiagram-v2
    [*] --> Unassigned : Create Job
    Unassigned --> InProgress : Assign Job
    InProgress --> Success : Job Completed
    InProgress --> LeaseExpired: Job Lease Expires
    LeaseExpired: Abandoned Job

    LeaseExpired --> Excluded: Failure Threshold Exceeded
    Excluded: Faulty Job

    Success --> [*] : Remove Job from Schedule
    LeaseExpired --> InProgress : Reassign Job

    Unassigned : COMPACTION_STATUS_UNSPECIFIED
    InProgress : COMPACTION_STATUS_IN_PROGRESS
    Success : COMPACTION_STATUS_SUCCESS

    LeaseExpired : COMPACTION_STATUS_IN_PROGRESS
    Excluded: COMPACTION_STATUS_IN_PROGRESS
```

### Communication

### Scheduler to Worker

| Status                          | Description                                                                         |
|---------------------------------|-------------------------------------------------------------------------------------|
| `COMPACTION_STATUS_UNSPECIFIED` | Not allowed.                                                                        |
| `COMPACTION_STATUS_IN_PROGRESS` | Job lease refresh. The worker should refresh the new lease before the new deadline. |
| `COMPACTION_STATUS_SUCCESS`     | Not allowed.                                                                        |
| ---                             | No lease refresh from the scheduler. The worker should stop processing.             |

### Worker to Scheduler

| Status                          | Description                                                                                                                             |
|---------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| `COMPACTION_STATUS_UNSPECIFIED` | Not allowed.                                                                                                                            |
| `COMPACTION_STATUS_IN_PROGRESS` | Job lease refresh. The scheduler must extend the lease of the job, if the worker still owns it.                                         |
| `COMPACTION_STATUS_SUCCESS`     | The job has been successfully completed. The scheduler must remove the job from the schedule and communicate the update to the planner. |

### Notes

* Job status `COMPACTION_STATUS_UNSPECIFIED` is never sent over the wire between the scheduler and workers.
* Job in `COMPACTION_STATUS_IN_PROGRESS` cannot be reassigned if its failure counter exceeds the threshold.
* Job in `COMPACTION_STATUS_SUCCESS` is removed from the schedule immediately.
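
In the state diagram, three logical states (in progress, abandoned, and faulty) share the same `COMPACTION_STATUS_IN_PROGRESS` status. The sketch below shows how the logical state could be derived from the stored status, the lease expiration time, and the failure counter. The `Job` struct, the `State` function, the numeric enum values, and the `maxFailures` parameter are assumptions for illustration only.

```go
package main

import "fmt"

// Status mirrors the three statuses named in the text; the numeric values
// are an assumption.
type Status int

const (
	StatusUnspecified Status = iota
	StatusInProgress
	StatusSuccess
)

// Job is a hypothetical scheduled job.
type Job struct {
	Status         Status
	LeaseExpiresAt int64 // compared against the Raft command timestamp
	Failures       int
}

// State returns the logical state from the diagram. now is the timestamp of
// the current Raft command, maxFailures is the assumed failure threshold.
func State(j Job, now int64, maxFailures int) string {
	switch j.Status {
	case StatusUnspecified:
		return "Unassigned"
	case StatusSuccess:
		return "Success"
	case StatusInProgress:
		if now <= j.LeaseExpiresAt {
			return "InProgress"
		}
		if j.Failures > maxFailures {
			return "Excluded" // faulty job: never reassigned
		}
		return "LeaseExpired" // abandoned job: eligible for reassignment
	default:
		return "Unknown"
	}
}

func main() {
	job := Job{Status: StatusInProgress, LeaseExpiresAt: 100}
	fmt.Println(State(job, 50, 3))
	fmt.Println(State(job, 150, 3))
	job.Failures = 4
	fmt.Println(State(job, 150, 3))
}
```
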