github.com/grafana/pyroscope@v1.18.0/pkg/metastore/compaction/README.md

# Block Compaction

## Background

The Pyroscope ingestion pipeline is designed to gather data in memory as small segments, which are periodically flushed
to object storage, along with the metadata entries being added to the metastore index. Depending on the configuration
and deployment scale, the number of segments created per second can increase significantly, reaching millions of objects
per hour or day. This can lead to performance degradation in the query path due to high read amplification caused by the
large number of small segments. In addition to read amplification, a high number of metadata entries can also lead to
performance degradation across the entire cluster, impacting the write path as well.

The new background compaction process helps mitigate this by merging small segments into larger ones, aiming to reduce
the number of objects a query needs to fetch from object storage.

# Compaction Service

The compaction service is responsible for planning compaction jobs, scheduling their execution, and updating the
metastore index with the results. The compaction service resides within the metastore component, while the compaction
worker is a separate service designed to scale horizontally.

The compaction service relies on the Raft protocol to guarantee consistency across the replicas. The diagram below
illustrates the interaction between the compaction worker and the compaction service: workers poll the service on a
regular basis to request new compaction jobs and report status updates.

A status update is processed by the leader node in two steps, each of which is a Raft command committed to the log:
1. First, the leader prepares the plan update: compaction job state changes based on the reported status updates.
   This is a read-only operation that never modifies the node state.
2. The leader proposes the plan update: all the replicas must apply the planned changes to their state in an idempotent
   way, if the proposal is accepted (committed to the Raft log).

Critical sections are guaranteed to be executed serially in the context of the Raft state machine and by the same
leader (within the same *term*), and atomically from the cluster's perspective. If the prepared compaction plan update
is not accepted by the Raft log, the update plan is discarded, and the new leader will propose a new plan.

The two-step process ensures that all the replicas use the same compaction plan, regardless of their internal state,
as long as the replicas can apply the `UpdateCompactionPlan` change. This holds even if the compaction algorithm
(the `GetCompactionPlanUpdate` step) differs across the replicas during an ongoing migration, such as a version upgrade
or downgrade.

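The two-step flow can be sketched in a few lines of Go. This is a toy model, not the metastore code: `planUpdate`, `fsm`, `prepare`, `apply`, and `propose` are hypothetical names standing in for the `GetCompactionPlanUpdate` and `UpdateCompactionPlan` commands, and the term check is a simplification of what Raft guarantees.

```go
package main

import "fmt"

// planUpdate is a hypothetical stand-in for the compaction plan update:
// the job state changes computed by the leader in the first step.
type planUpdate struct {
	Term     uint64   // Raft term in which the update was prepared
	Assigned []string // names of the jobs to mark as assigned
}

// fsm is a toy replica state.
type fsm struct {
	order    []string        // job names in queue order
	assigned map[string]bool // job name -> already assigned
}

func newFSM(names ...string) *fsm {
	return &fsm{order: names, assigned: make(map[string]bool)}
}

// prepare is the read-only step: it inspects the state and returns the
// planned changes without modifying anything.
func (f *fsm) prepare(term uint64, capacity int) planUpdate {
	u := planUpdate{Term: term}
	for _, name := range f.order {
		if len(u.Assigned) == capacity {
			break
		}
		if !f.assigned[name] {
			u.Assigned = append(u.Assigned, name)
		}
	}
	return u
}

// apply is the write step that every replica runs for a committed update.
// Applying the same update twice leaves the state unchanged (idempotent).
func (f *fsm) apply(u planUpdate) {
	for _, name := range u.Assigned {
		f.assigned[name] = true
	}
}

// propose discards an update prepared in a different term; otherwise all
// replicas apply exactly the same planned changes.
func propose(currentTerm uint64, u planUpdate, replicas ...*fsm) bool {
	if u.Term != currentTerm {
		return false
	}
	for _, r := range replicas {
		r.apply(u)
	}
	return true
}

func main() {
	leader, follower := newFSM("a", "b", "c"), newFSM("a", "b", "c")

	u := leader.prepare(1, 2)
	fmt.Println(propose(1, u, leader, follower), follower.assigned["b"]) // true true

	stale := leader.prepare(1, 1) // prepared in term 1, proposed in term 2
	fmt.Println(propose(2, stale, leader, follower), follower.assigned["c"]) // false false
}
```
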
> As of now, both steps are committed to the Raft log. However, as an optimization, the first step – preparation,
> can be implemented as a **Linearizable Read** through **Read Index** (which we already use in metadata queries)
> to avoid unnecessary replication of the read-only operation. This approach is already used by the metadata index
> cleaner: leader read with a follow-up proposal.
>
> However, unlike cleanup, compaction is a more complex operation, and the serialization guarantees provided
> by Raft command execution flow help avoid many potential issues with concurrent read/write access.

```mermaid
sequenceDiagram
    participant W as Compaction Worker

    box Compaction Service
        participant H as Endpoint
        participant R as Raft
    end

    loop
        W ->>+H: PollCompactionJobsRequest

        critical
            critical FSM state read
                H ->>R: GetCompactionPlanUpdate
                create participant U as Plan Update
                R ->>U: 
                U ->>+S: Job status updates
                Note right of U: Job ownership is protected with leases with fencing token
                S ->>-U: Job state changes
                U ->>+S: Assign jobs
                S ->>-U: Job state changes
                U ->>+P: Create jobs
                Note right of U: New jobs are created if workers have enough capacity
                P ->>P: Dequeue blocks and load tombstones
                P ->>-U: New jobs
                U ->>+S: Add jobs
                S ->>-U: Job state changes
                destroy U
                U ->>R: CompactionPlanUpdate
                R ->>H: CompactionPlanUpdate
            end

            critical FSM state update
                H ->>R: UpdateCompactionPlan
                R ->>S: Update schedule (new, completed, assigned, reassigned jobs)
                R ->>P: Remove source blocks from the planner queue (new jobs)
                R ->>I: Replace source blocks in the index (completed jobs) and create tombstones for deleted
                I ->>+C: Add new blocks
                C ->>C: Enqueue
                C ->>-I: 
                I ->>R: 
                R ->>H: CompactionPlanUpdate
            end
        end

        H ->> W: PollCompactionJobsResponse
    end

    box FSM
        participant C as Compactor
        participant P as Planner
        participant S as Scheduler
        participant I as Metadata Index
    end
```

---

# Job Planner

The compactor is responsible for maintaining a queue of source blocks eligible for compaction. Currently, this queue
is a simple doubly-linked FIFO structure, populated with new block batches as they are added to the index. In the
current implementation, a new compaction job is created once a sufficient number of blocks have been enqueued.
Compaction jobs are planned on demand when requests are received from the compaction service.

The queue is segmented by the `Tenant`, `Shard`, and `Level` attributes of the block metadata entries, meaning that
a block compaction never crosses these boundaries. This segmentation helps avoid unnecessary compactions of unrelated
blocks. However, the downside is that blocks are never compacted across different shards, which can lead to suboptimal
compaction results. Due to the dynamic data placement, it is possible for a tenant to be placed on a shard for only a
short period of time. As a result, the data in that shard may not be compacted with other data from the same tenant.

Cross-shard compaction is to be implemented as a future enhancement. The observed impact of the limitation is moderate.

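A minimal sketch of such a segmented queue is shown below. It is an illustration only: `queueKey`, `planner`, and `batchSize` are hypothetical names, block IDs are made up, and a slice stands in for the doubly-linked list mentioned above.

```go
package main

import "fmt"

// queueKey mirrors the segmentation attributes: blocks with different
// keys are never compacted together.
type queueKey struct {
	Tenant string
	Shard  uint32
	Level  uint32
}

// planner keeps one FIFO of block IDs per key.
type planner struct {
	batchSize int // hypothetical "sufficient number" of blocks per job; must be positive
	queues    map[queueKey][]string
}

func newPlanner(batchSize int) *planner {
	return &planner{batchSize: batchSize, queues: make(map[queueKey][]string)}
}

// enqueue appends a block to the queue of its tenant, shard, and level.
func (p *planner) enqueue(k queueKey, blockID string) {
	p.queues[k] = append(p.queues[k], blockID)
}

// planJob dequeues the oldest batchSize blocks for the key, or returns nil
// if not enough blocks have accumulated yet.
func (p *planner) planJob(k queueKey) []string {
	q := p.queues[k]
	if len(q) < p.batchSize {
		return nil
	}
	job := append([]string(nil), q[:p.batchSize]...)
	p.queues[k] = q[p.batchSize:]
	return job
}

func main() {
	p := newPlanner(2)
	a := queueKey{Tenant: "a", Shard: 1, Level: 0}
	b := queueKey{Tenant: "a", Shard: 2, Level: 0}

	p.enqueue(a, "block-1")
	p.enqueue(b, "block-2")   // same tenant, different shard: separate queue
	fmt.Println(p.planJob(a)) // []: only one block in this queue so far

	p.enqueue(a, "block-3")
	fmt.Println(p.planJob(a)) // [block-1 block-3]
}
```

In this model `block-2` stays queued on its own shard until another block arrives there, which is the cross-shard limitation described above.
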
## Data Layout

Profiling data from each service (identified by the `service_name` label) is stored as a separate dataset within a block.

The block layout is composed of a collection of non-overlapping, independent datasets, each containing distinct data.
At compaction, matching datasets from different blocks are merged: their tsdb index, symbols, and profile tables are
merged and rewritten to a new block, to optimize the data for efficient reading.

---

# Job Scheduler

The scheduler implements the basic **Small Job First** strategy: blocks of lower levels are considered smaller than
blocks of higher levels, and their compaction is prioritized. This is justifiable because the smaller blocks affect
read amplification more than the larger blocks, and the compaction of smaller blocks is more efficient.

---

Compaction jobs are assigned to workers in the order of their priority.

Internally, the scheduler maintains a priority queue of jobs for each compaction level. Jobs of lower levels are
assigned first, and the scheduler does not consider jobs of higher levels until all eligible jobs of lower levels are
assigned.

The priority is determined by several factors:
1. Compaction level.
2. Status (enum order).
    - `COMPACTION_STATUS_UNSPECIFIED`: unassigned jobs.
    - `COMPACTION_STATUS_IN_PROGRESS`: in-progress jobs. The first job that can't be reassigned is a sentinel:
      no more jobs are eligible for assignment at this level.
3. Failures: jobs with fewer failures are prioritized.
4. Lease expiration time: the job with the earliest lease expiration time is considered first.

See [Job Status Description](#job-status-description) for more details.

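The four factors can be read as a lexicographic comparison. The sketch below assumes a flat list and a hypothetical `job` struct; the scheduler described above keeps a separate priority queue per level, so a single sort is only a simplification that yields the same ordering.

```go
package main

import (
	"fmt"
	"sort"
)

type status int

const (
	statusUnspecified status = iota // COMPACTION_STATUS_UNSPECIFIED: unassigned
	statusInProgress                // COMPACTION_STATUS_IN_PROGRESS
)

// job holds only the fields that take part in the comparison.
type job struct {
	Name           string
	Level          uint32
	Status         status
	Failures       uint32
	LeaseExpiresAt int64 // hypothetical unit: nanoseconds of the Raft entry timestamp
}

// less reports whether a should be considered before b:
// level, then status (enum order), then failures, then lease expiration.
func less(a, b job) bool {
	if a.Level != b.Level {
		return a.Level < b.Level
	}
	if a.Status != b.Status {
		return a.Status < b.Status
	}
	if a.Failures != b.Failures {
		return a.Failures < b.Failures
	}
	return a.LeaseExpiresAt < b.LeaseExpiresAt
}

// byPriority returns job names in assignment order without mutating the input.
func byPriority(jobs []job) []string {
	sorted := append([]job(nil), jobs...)
	sort.SliceStable(sorted, func(i, j int) bool { return less(sorted[i], sorted[j]) })
	names := make([]string, len(sorted))
	for i, j := range sorted {
		names[i] = j.Name
	}
	return names
}

func main() {
	jobs := []job{
		{Name: "l1-new", Level: 1},
		{Name: "l0-expiring", Level: 0, Status: statusInProgress, LeaseExpiresAt: 10},
		{Name: "l0-failed", Level: 0, Status: statusInProgress, Failures: 2, LeaseExpiresAt: 5},
		{Name: "l0-new", Level: 0},
	}
	fmt.Println(byPriority(jobs)) // [l0-new l0-expiring l0-failed l1-new]
}
```
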
> The challenge is that we don't know the capacity of our worker fleet in advance, and we have no control over them;
> they can appear and disappear at any time. Another problem is that in some failure modes, such as unavailability or
> lack of compaction workers, or temporary unavailability of the metastore service, the number of blocks to be compacted
> may reach significant levels (millions).
>
> Therefore, we use an adaptive approach to keep the scheduler's job queue short while ensuring the compaction
> workers are fully utilized. In every request, the worker specifies how many free slots it has available for new jobs.
> As the compaction procedure is a synchronous CPU-bound task, we use the number of logical CPU cores as the worker's max
> capacity and decrement it for each in-progress compaction job. When a new request arrives, it specifies the current
> worker's capacity, which serves as evidence that the entire worker fleet has enough resources to handle at least
> this number of jobs. Thus, for every request, we try to enqueue a number of jobs equal to the reported capacity.
>
> Over time, this ensures good balance between the number of jobs in the queue and the worker capacity utilization,
> even if there are millions of blocks to compact.

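The capacity arithmetic amounts to the following. Both function names are hypothetical, and the sketch assumes the planner can report how many jobs it could create right now.

```go
package main

import (
	"fmt"
	"runtime"
)

// freeSlots returns how many new jobs a worker can accept: its maximum
// capacity minus the jobs it is already running, never below zero.
func freeSlots(maxCapacity, inProgress int) int {
	if n := maxCapacity - inProgress; n > 0 {
		return n
	}
	return 0
}

// jobsToEnqueue models the scheduler side: for each request it tries to
// enqueue as many new jobs as the reported capacity, bounded by the number
// of jobs the planner could create at this moment.
func jobsToEnqueue(reportedCapacity, plannable int) int {
	if reportedCapacity < 0 {
		return 0
	}
	if reportedCapacity < plannable {
		return reportedCapacity
	}
	return plannable
}

func main() {
	// Logical CPU cores serve as the worker's max capacity.
	maxCapacity := runtime.NumCPU()
	fmt.Println("free slots here with one job running:", freeSlots(maxCapacity, 1))

	// A worker with 8 cores and 3 running jobs reports 5 free slots, so at
	// most 5 jobs are enqueued even if a million blocks are waiting.
	fmt.Println(jobsToEnqueue(freeSlots(8, 3), 1000000)) // 5
}
```
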
---

## Job Ownership

Distributed locking implementation is inspired by [The Chubby lock service](https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf)
and [Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://dl.acm.org/doi/pdf/10.1145/74851.74870).
The implementation is based on the Raft protocol.

Ownership of a compaction job is granted to a compaction worker for a specified period – a *lease*:
> A lease is a contract that gives its holder specified rights over property for a limited period of time.

The real-time clock of the worker and the scheduler cannot be used; instead, the timestamp of the Raft log entry,
assigned by the Raft leader when the entry is appended to the log, serves as the reference point in time.

> The fact that leases are allocated by the current leader allows for spurious *lease invalidation* when the leader
> changes and the clock skew exceeds the lease duration. This is acceptable because jobs will be reassigned repeatedly,
> and the occurrence of the event should be very rare. However, the solution does not tolerate clock skews exceeding
> the job lease duration (which is 15 seconds by default).

The log entry index is used as the [fencing token](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html)
of protected resources (compaction jobs).

The Raft log entry index is a monotonically increasing integer, guaranteed to be unique for each command.
Each time a job is assigned to a worker, the worker is provided with the current Raft log index as the fencing token,
which is also assigned to the job. For subsequent requests, the worker must provide the fencing token it was given at
assignment. The ownership of the job is confirmed if the provided token is greater than or equal to the job's token.
The job's token may change if the job is reassigned to another worker, and the new token is derived from the current
Raft log index, which is guaranteed to be greater.

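The ownership check reduces to a single comparison. A minimal sketch, with a hypothetical `job` type and made-up log indexes:

```go
package main

import "fmt"

// job is a protected resource guarded by a fencing token.
type job struct {
	Name  string
	Token uint64 // Raft log index of the command that last assigned the job
}

// assign hands the job to a worker and returns the fencing token the worker
// must present in later requests. raftIndex is the index of the command
// being applied, so it only ever grows.
func (j *job) assign(raftIndex uint64) uint64 {
	j.Token = raftIndex
	return raftIndex
}

// owns reports whether a worker presenting workerToken still owns the job.
func (j *job) owns(workerToken uint64) bool {
	return workerToken >= j.Token
}

func main() {
	j := &job{Name: "job-1"}

	first := j.assign(100)     // worker A gets the job at log index 100
	fmt.Println(j.owns(first)) // true

	second := j.assign(250)                    // lease expired: reassigned to worker B
	fmt.Println(j.owns(first), j.owns(second)) // false true
}
```

Worker A's token stays at 100 while the job's token moves to 250, so any late update from worker A is rejected.
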
> Token authentication is not enforced in this design, as the system operates in a trusted environment with cooperative
> workers. However, malicious workers can arbitrarily specify a token. In the future, we may consider implementing a
> basic authentication mechanism based on cryptographic signatures to further ensure the integrity of token usage.
>
> This is an advisory locking mechanism, meaning resources are not automatically restricted from access when the lock
> is not acquired. Consequently, a client might choose to delete source blocks associated with a compaction job or
> continue processing the job even without holding the lease. This behavior, however, should be avoided in the worker
> implementation.

## Procedures

### Assignment

When a worker requests a new assignment, the scheduler must find the highest-priority job that is not assigned yet,
and assign it to the worker. When a job is assigned, the worker is given a lease with a deadline.
The worker should refresh the lease before it expires.

### Lease Refresh

The worker must send a status update to the scheduler to refresh the lease.
The scheduler must update the lease expiration time if the worker still owns the job.

### Reassignment

The scheduler may revoke a job if the worker does not send the status update within the lease duration.

When a new assignment is requested by a worker, the scheduler inspects in-progress jobs and checks whether their
leases have expired. If a lease has expired, the job is reassigned to the worker that requested the new
assignment.

---

If the timestamp of the current Raft log entry (command) exceeds the job `lease_expires_at` timestamp,
the scheduler must revoke the job:
1. Set the status to `COMPACTION_STATUS_IN_PROGRESS`.
2. Allocate a new lease with an expiration period calculated starting from the current command timestamp.
3. Set the fencing token to the current command index (guaranteed to be higher than the job fencing token).

---

The worker instance that has lost the job is not notified immediately. If the worker reports an update for a job that it
is not assigned to, or if the job is not found (for example, if it has been completed by another worker), the scheduler
does not allocate a new lease; the worker should stop processing. This mechanism prevents the worker from processing
jobs unnecessarily.

If the worker is not capable of executing the job, it may abandon the job without further notifications. The scheduler
will eventually reassign the job to another worker. The lost job might be reassigned to the same worker instance if that
instance detects the loss before others do: abandoned jobs are assigned to the first worker that requests new
assignments when no unassigned jobs are available.

There is no explicit mechanism for reporting a failure from the worker. In fact, the scheduler must not rely on error
reports from workers, as jobs that cause workers to crash would yield no reports at all.

To avoid infinite reassignment loops, the scheduler keeps track of reassignments (failures) for each job. If the number
of failures exceeds a set threshold, the job is not reassigned and remains at the bottom of the queue. Once the cause of
failure is resolved, the error limit can be temporarily increased to reprocess these jobs.

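As a rough illustration of the threshold, the sketch below filters jobs by a hypothetical `maxFailures` limit. Jobs over the limit are skipped but not removed, and raising the limit makes them eligible again.

```go
package main

import "fmt"

type job struct {
	Name     string
	Failures uint32 // number of reassignments so far
}

// reassignable returns the names of jobs whose failure count does not exceed
// maxFailures. Skipped jobs stay in the input slice, mirroring faulty jobs
// that remain in the queue without being reassigned.
func reassignable(jobs []job, maxFailures uint32) []string {
	var names []string
	for _, j := range jobs {
		if j.Failures > maxFailures {
			continue
		}
		names = append(names, j.Name)
	}
	return names
}

func main() {
	jobs := []job{
		{Name: "a", Failures: 0},
		{Name: "b", Failures: 4},
		{Name: "c", Failures: 3},
	}
	fmt.Println(reassignable(jobs, 3)) // [a c]
	fmt.Println(reassignable(jobs, 5)) // [a b c]
}
```
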
The scheduler queue has a size limit. Typically, the only scenario in which this limit is reached is when the compaction
process is not functioning correctly (e.g., due to a bug in the compaction procedure), preventing blocks from being
compacted and resulting in many jobs remaining in a failed state. Once the queue size limit is reached, failed jobs are
evicted, meaning the corresponding blocks will never be compacted. This may cause read amplification in data queries
and bloat the metadata index, so the limit should be large enough to make eviction unlikely. The recommended course of
action is to roll back or fix the bug and restart the compaction process, temporarily increasing the error limit if
necessary.

### Job Completion

When the worker reports a successful completion of the job, the scheduler must remove the job from the schedule and
notify the planner about the completion.

## Job Status Description

The diagram below depicts the state machine of the job status.

```mermaid
stateDiagram-v2
    [*] --> Unassigned : Create Job
    Unassigned --> InProgress : Assign Job
    InProgress --> Success : Job Completed
    InProgress --> LeaseExpired: Job Lease Expires
    LeaseExpired: Abandoned Job

    LeaseExpired --> Excluded: Failure Threshold Exceeded
    Excluded: Faulty Job

    Success --> [*] : Remove Job from Schedule
    LeaseExpired --> InProgress : Reassign Job

    Unassigned : COMPACTION_STATUS_UNSPECIFIED
    InProgress : COMPACTION_STATUS_IN_PROGRESS
    Success : COMPACTION_STATUS_SUCCESS

    LeaseExpired : COMPACTION_STATUS_IN_PROGRESS
    Excluded: COMPACTION_STATUS_IN_PROGRESS
```

### Communication

### Scheduler to Worker

| Status                          | Description                                                                         |
|---------------------------------|-------------------------------------------------------------------------------------|
| `COMPACTION_STATUS_UNSPECIFIED` | Not allowed.                                                                        |
| `COMPACTION_STATUS_IN_PROGRESS` | Job lease refresh. The worker should refresh the new lease before the new deadline. |
| `COMPACTION_STATUS_SUCCESS`     | Not allowed.                                                                        |
| ---                             | No lease refresh from the scheduler. The worker should stop processing.             |

### Worker to Scheduler

| Status                          | Description                                                                                                                             |
|---------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| `COMPACTION_STATUS_UNSPECIFIED` | Not allowed.                                                                                                                            |
| `COMPACTION_STATUS_IN_PROGRESS` | Job lease refresh. The scheduler must extend the lease of the job, if the worker still owns it.                                         |
| `COMPACTION_STATUS_SUCCESS`     | The job has been successfully completed. The scheduler must remove the job from the schedule and communicate the update to the planner. |

### Notes

* Job status `COMPACTION_STATUS_UNSPECIFIED` is never sent over the wire between the scheduler and workers.
* Job in `COMPACTION_STATUS_IN_PROGRESS` cannot be reassigned if its failure counter exceeds the threshold.
* Job in `COMPACTION_STATUS_SUCCESS` is removed from the schedule immediately.
317
318 * Job status `COMPACTION_STATUS_UNSPECIFIED` is never sent over the wire between the scheduler and workers.
319 * Job in `COMPACTION_STATUS_IN_PROGRESS` cannot be reassigned if its failure counter exceeds the threshold.
320 * Job in `COMPACTION_STATUS_SUCCESS` is removed from the schedule immediately.