github.com/grafana/pyroscope@v1.18.0/pkg/metastore/compaction/README.md (about)

# Block Compaction

## Background

The Pyroscope ingestion pipeline is designed to gather data in memory as small segments, which are periodically flushed
to object storage, along with the metadata entries being added to the metastore index. Depending on the configuration
and deployment scale, the number of segments created per second can increase significantly, reaching millions of objects
per hour or day. This can lead to performance degradation in the query path due to high read amplification caused by the
large number of small segments. In addition to read amplification, a high number of metadata entries can also lead to
performance degradation across the entire cluster, impacting the write path as well.

The new background compaction process helps mitigate this by merging small segments into larger ones, aiming to reduce
the number of objects a query needs to fetch from object storage.

# Compaction Service

The compaction service is responsible for planning compaction jobs, scheduling their execution, and updating the
metastore index with the results. The compaction service resides within the metastore component, while the compaction
worker is a separate service designed to scale horizontally.

The compaction service relies on the Raft protocol to guarantee consistency across the replicas. The diagram below
illustrates the interaction between the compaction worker and the compaction service: workers poll the service on a
regular basis to request new compaction jobs and report status updates.

A status update is processed by the leader node in two steps, each of which is a Raft command committed to the log:
1. First, the leader prepares the plan update – compaction job state changes based on the reported status updates.
This is a read-only operation that never modifies the node state.
2. The leader proposes the plan update: all the replicas must apply the planned changes to their state in an idempotent
way, if the proposal is accepted (committed to the Raft log).

Critical sections are guaranteed to be executed serially in the context of the Raft state machine and by the same
leader (within the same *term*), and atomically from the cluster's perspective. If the prepared compaction plan update
is not accepted by the Raft log, the update plan is discarded, and the new leader will propose a new plan.

The two-step process ensures that all the replicas use the same compaction plan, regardless of their internal state,
as long as the replicas can apply the `UpdateCompactionPlan` change. This is true even if the compaction algorithm
(the `GetCompactionPlanUpdate` step) changes across the replicas during an ongoing migration – a version upgrade or
downgrade.
> As of now, both steps are committed to the Raft log. However, as an optimization, the first step – preparation –
> can be implemented as a **Linearizable Read** through **Read Index** (which we already use in metadata queries)
> to avoid unnecessary replication of the read-only operation. This approach is already used by the metadata index
> cleaner: leader read with a follow-up proposal.
>
> However, unlike cleanup, compaction is a more complex operation, and the serialization guarantees provided
> by the Raft command execution flow help avoid many potential issues with concurrent read/write access.

```mermaid
sequenceDiagram
    participant W as Compaction Worker

    box Compaction Service
        participant H as Endpoint
        participant R as Raft
    end

    loop
        W ->>+H: PollCompactionJobsRequest

        critical
            critical FSM state read
                H ->>R: GetCompactionPlanUpdate
                create participant U as Plan Update
                    R ->>U: 
                    U ->>+S: Job status updates
                    Note right of U: Job ownership is protected with<br>leases with fencing token
                    S ->>-U: Job state changes
                    U ->>+S: Assign jobs
                    S ->>-U: Job state changes
                    U ->>+P: Create jobs
                    Note right of U: New jobs are created if<br>workers have enough capacity
                        P ->>P: Dequeue blocks<br>and load tombstones
                    P ->>-U: New jobs
                    U ->>+S: Add jobs
                    S ->>-U: Job state changes
                destroy U
                U ->>R: CompactionPlanUpdate
                R ->>H: CompactionPlanUpdate
            end

            critical FSM state update
                H ->>R: UpdateCompactionPlan
                    R ->>S: Update schedule<br>(new, completed, assigned, reassigned jobs)
                    R ->>P: Remove source blocks from the planner queue (new jobs)
                    R ->>I: Replace source blocks in the index (completed jobs)<br>and create tombstones for deleted
                        I ->>+C: Add new blocks
                            C ->>C: Enqueue
                        C ->>-I: 
                    I ->>R: 
                R ->>H: CompactionPlanUpdate
            end
        end

        H ->> W: PollCompactionJobsResponse
    end

    box FSM
        participant C as Compactor
        participant P as Planner
        participant S as Scheduler
        participant I as Metadata Index
    end
```

---

# Job Planner

The compactor is responsible for maintaining a queue of source blocks eligible for compaction. Currently, this queue
is a simple doubly-linked FIFO structure, populated with new block batches as they are added to the index. In the
current implementation, a new compaction job is created once a sufficient number of blocks has been enqueued.
Compaction jobs are planned on demand when requests are received from the compaction service.

The queue is segmented by the `Tenant`, `Shard`, and `Level` attributes of the block metadata entries, meaning that
a block compaction never crosses these boundaries. This segmentation helps avoid unnecessary compactions of unrelated
blocks. However, the downside is that blocks are never compacted across different shards, which can lead to suboptimal
compaction results. Due to dynamic data placement, it is possible for a tenant to be placed on a shard for only a
short period of time. As a result, the data in that shard may not be compacted with other data from the same tenant.

Cross-shard compaction is to be implemented as a future enhancement; the observed impact of this limitation is moderate.

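The queue segmentation can be sketched as a map of per-key FIFO queues. The names below (`compactionKey`, `planJob`)
and the threshold value are illustrative, not the actual implementation:

```go
package main

import "fmt"

// compactionKey mirrors the attributes by which the planner queue is
// segmented; a compaction job never mixes blocks across these boundaries.
type compactionKey struct {
	Tenant string
	Shard  uint32
	Level  uint32
}

// queues is a sketch of the per-key block queues (block IDs only).
var queues = map[compactionKey][]string{}

// enqueue appends a block to the queue for its tenant/shard/level.
func enqueue(k compactionKey, block string) {
	queues[k] = append(queues[k], block)
}

// planJob cuts a job from a queue once enough blocks have accumulated.
// minBlocks is an illustrative threshold, not the real default.
func planJob(k compactionKey, minBlocks int) ([]string, bool) {
	q := queues[k]
	if len(q) < minBlocks {
		return nil, false
	}
	job := q[:minBlocks]
	queues[k] = q[minBlocks:]
	return job, true
}

func main() {
	k := compactionKey{Tenant: "t1", Shard: 3, Level: 0}
	enqueue(k, "block-a")
	enqueue(k, "block-b")
	if job, ok := planJob(k, 2); ok {
		fmt.Println(job)
	}
}
```

Because the key includes `Level`, the output of one compaction (a higher-level block) lands in a different queue than
its source blocks, which is what makes level-by-level compaction possible.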
## Data Layout

Profiling data from each service (identified by the `service_name` label) is stored as a separate dataset within a block.

The block layout is composed of a collection of non-overlapping, independent datasets, each containing distinct data.
At compaction, matching datasets from different blocks are merged: their tsdb index, symbols, and profile tables are
merged and rewritten to a new block, to optimize the data for efficient reading.

---

# Job Scheduler

The scheduler implements the basic **Small Job First** strategy: blocks of lower levels are considered smaller than
blocks of higher levels, and their compaction is prioritized. This is justifiable because the smaller blocks affect
read amplification more than the larger blocks, and the compaction of smaller blocks is more efficient.

---

Compaction jobs are assigned to workers in the order of their priority.

Internally, the scheduler maintains a priority queue of jobs for each compaction level. Jobs of lower levels are
assigned first, and the scheduler does not consider jobs of higher levels until all eligible jobs of lower levels are
assigned.

The priority is determined by several factors:
1. Compaction level.
2. Status (enum order).
   - `COMPACTION_STATUS_UNSPECIFIED`: unassigned jobs.
   - `COMPACTION_STATUS_IN_PROGRESS`: in-progress jobs. The first job that can't be reassigned is a sentinel:
     no more jobs are eligible for assignment at this level.
3. Failures: jobs with fewer failures are prioritized.
4. Lease expiration time: the job with the earliest lease expiration time is considered first.

See [Job Status Description](#job-status-description) for more details.

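The priority factors above can be expressed as a single ordering function. This is a minimal sketch; the field names
and the integer encoding of the status enum are illustrative, not the actual types:

```go
package main

import "fmt"

// job carries only the fields that determine scheduling priority.
// Status follows the enum order: 0 = UNSPECIFIED (unassigned),
// 1 = IN_PROGRESS.
type job struct {
	Level          int
	Status         int
	Failures       int
	LeaseExpiresAt int64 // zero for unassigned jobs
}

// less reports whether a should be scheduled before b, applying the
// priority factors in order: level, status, failures, lease expiration.
func less(a, b job) bool {
	if a.Level != b.Level {
		return a.Level < b.Level // lower levels first
	}
	if a.Status != b.Status {
		return a.Status < b.Status // unassigned before in-progress
	}
	if a.Failures != b.Failures {
		return a.Failures < b.Failures // fewer failures first
	}
	return a.LeaseExpiresAt < b.LeaseExpiresAt // earliest lease first
}

func main() {
	unassigned := job{Level: 0, Status: 0}
	expired := job{Level: 0, Status: 1, LeaseExpiresAt: 100}
	fmt.Println(less(unassigned, expired)) // unassigned jobs go first
}
```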
> The challenge is that we don't know the capacity of our worker fleet in advance, and we have no control over it;
> workers can appear and disappear at any time. Another problem is that in some failure modes, such as unavailability
> or lack of compaction workers, or temporary unavailability of the metastore service, the number of blocks to be
> compacted may reach significant levels (millions).
>
> Therefore, we use an adaptive approach to keep the scheduler's job queue short while ensuring the compaction
> workers are fully utilized. In every request, the worker specifies how many free slots it has available for new jobs.
> As the compaction procedure is a synchronous CPU-bound task, we use the number of logical CPU cores as the worker's
> max capacity and decrement it for each in-progress compaction job. When a new request arrives, it specifies the
> current worker's capacity, which serves as evidence that the entire worker fleet has enough resources to handle at
> least this number of jobs. Thus, for every request, we try to enqueue a number of jobs equal to the reported capacity.
>
> Over time, this ensures a good balance between the number of jobs in the queue and worker capacity utilization,
> even if there are millions of blocks to compact.
---

## Job Ownership

The distributed locking implementation is inspired by [The Chubby lock service](https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf)
and [Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://dl.acm.org/doi/pdf/10.1145/74851.74870).
The implementation is based on the Raft protocol.

Ownership of a compaction job is granted to a compaction worker for a specified period – a *lease*:
> A lease is a contract that gives its holder specified rights over property for a limited period of time.

The real-time clock of the worker and the scheduler cannot be used; instead, the timestamp of the Raft log entry,
assigned by the Raft leader when the entry is appended to the log, serves as the reference point in time.

> The fact that leases are allocated by the current leader allows for spurious *lease invalidation* when the leader
> changes and the clock skew exceeds the lease duration. This is acceptable because jobs will be reassigned repeatedly,
> and the occurrence of the event should be very rare. However, the solution does not tolerate clock skews exceeding
> the job lease duration (which is 15 seconds by default).

The log entry index is used as the [fencing token](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html)
of protected resources (compaction jobs).

The Raft log entry index is a monotonically increasing integer, guaranteed to be unique for each command.
Each time a job is assigned to a worker, the worker is provided with the current Raft log index as the fencing token,
which is also assigned to the job. For subsequent requests, the worker must provide the fencing token it was given at
assignment. The ownership of the job is confirmed if the provided token is greater than or equal to the job's token.
The job's token may change if the job is reassigned to another worker, and the new token is derived from the current
Raft log index, which is guaranteed to be greater.

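The ownership check itself is a single comparison. This sketch uses an illustrative function name; the invariant it
relies on is stated above: a reassignment always records a strictly higher Raft log index on the job.

```go
package main

import "fmt"

// ownsJob checks a worker's fencing token against the token currently
// recorded for the job. Ownership is confirmed if the worker's token is
// greater than or equal to the job's token: reassignment bumps the
// job's token to a higher Raft log index, so a stale owner fails here.
func ownsJob(workerToken, jobToken uint64) bool {
	return workerToken >= jobToken
}

func main() {
	fmt.Println(ownsJob(42, 42)) // original owner: still valid
	fmt.Println(ownsJob(42, 57)) // job reassigned at index 57: stale token
}
```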
> Token authentication is not enforced in this design, as the system operates in a trusted environment with cooperative
> workers. However, a malicious worker can arbitrarily specify a token. In the future, we may consider implementing a
> basic authentication mechanism based on cryptographic signatures to further ensure the integrity of token usage.
>
> This is an advisory locking mechanism, meaning resources are not automatically restricted from access when the lock
> is not acquired. Consequently, a client might choose to delete source blocks associated with a compaction job or
> continue processing the job even without holding the lease. This behavior, however, should be avoided in the worker
> implementation.

## Procedures

### Assignment

When a worker requests a new assignment, the scheduler must find the highest-priority job that is not assigned yet
and assign it to the worker. When a job is assigned, the worker is given a lease with a deadline.
The worker should refresh the lease before it expires.

### Lease Refresh

The worker must send a status update to the scheduler to refresh the lease.
The scheduler must update the lease expiration time if the worker still owns the job.

### Reassignment

The scheduler may revoke a job if the worker does not send a status update within the lease duration.

When a new assignment is requested by a worker, the scheduler inspects in-progress jobs and checks whether the
lease has expired. If it has, the job is reassigned to the worker that requested the new assignment.

---

If the timestamp of the current Raft log entry (command) exceeds the job's `lease_expires_at` timestamp,
the scheduler must revoke the job:
1. Set the status to `COMPACTION_STATUS_IN_PROGRESS`.
2. Allocate a new lease with an expiration period calculated starting from the current command timestamp.
3. Set the fencing token to the current command index (guaranteed to be higher than the job's fencing token).

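The revocation steps above can be sketched as a single function. The type and field names are illustrative (the real
state lives in the FSM); note how the command timestamp, not a real-time clock, drives the decision:

```go
package main

import "fmt"

// jobState holds the scheduler-side fields relevant to lease handling.
type jobState struct {
	Status         string
	Token          uint64
	LeaseExpiresAt int64
}

// maybeRevoke applies the revocation steps when the current command
// timestamp is past the job's lease expiration: keep the in-progress
// status, allocate a new lease starting from the command timestamp,
// and bump the fencing token to the current command index.
func maybeRevoke(j *jobState, cmdIndex uint64, cmdTime, leaseDuration int64) bool {
	if cmdTime <= j.LeaseExpiresAt {
		return false // lease still valid; nothing to do
	}
	j.Status = "COMPACTION_STATUS_IN_PROGRESS"
	j.LeaseExpiresAt = cmdTime + leaseDuration
	j.Token = cmdIndex // the command index exceeds any previously issued token
	return true
}

func main() {
	j := &jobState{Status: "COMPACTION_STATUS_IN_PROGRESS", Token: 10, LeaseExpiresAt: 100}
	revoked := maybeRevoke(j, 25, 115, 15)
	fmt.Println(revoked, j.Token, j.LeaseExpiresAt) // revoked: token 25, lease 130
}
```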
---

The worker instance that has lost the job is not notified immediately. If the worker reports an update for a job that it
is not assigned to, or if the job is not found (for example, if it has been completed by another worker), the scheduler
does not allocate a new lease; the worker should stop processing. This mechanism prevents the worker from processing
jobs unnecessarily.

If the worker is not capable of executing the job, it may abandon the job without further notification. The scheduler
will eventually reassign the job to another worker. The lost job might be reassigned to the same worker instance if that
instance detects the loss before others do: abandoned jobs are assigned to the first worker that requests new
assignments when no unassigned jobs are available.

There is no explicit mechanism for reporting a failure from the worker. In fact, the scheduler must not rely on error
reports from workers, as jobs that cause workers to crash would yield no reports at all.

To avoid infinite reassignment loops, the scheduler keeps track of reassignments (failures) for each job. If the number
of failures exceeds a set threshold, the job is not reassigned and remains at the bottom of the queue. Once the cause of
failure is resolved, the error limit can be temporarily increased to reprocess these jobs.

The scheduler queue has a size limit. Typically, the only scenario in which this limit is reached is when the compaction
process is not functioning correctly (e.g., due to a bug in the compaction procedure), preventing blocks from being
compacted and resulting in many jobs remaining in a failed state. Once the queue size limit is reached, failed jobs are
evicted, meaning the corresponding blocks will never be compacted. This may cause read amplification in data queries
and bloat the metadata index. Therefore, the limit should be large enough. The recommended course of action is to roll
back or fix the bug and restart the compaction process, temporarily increasing the error limit if necessary.

### Job Completion

When the worker reports successful completion of the job, the scheduler must remove the job from the schedule and
notify the planner about the completion.

## Job Status Description

The diagram below depicts the state machine of the job status.

```mermaid
stateDiagram-v2
    [*] --> Unassigned : Create Job
    Unassigned --> InProgress : Assign Job
    InProgress --> Success : Job Completed
    InProgress --> LeaseExpired: Job Lease Expires
    LeaseExpired: Abandoned Job

    LeaseExpired --> Excluded: Failure Threshold Exceeded
    Excluded: Faulty Job

    Success --> [*] : Remove Job from Schedule
    LeaseExpired --> InProgress : Reassign Job

    Unassigned : COMPACTION_STATUS_UNSPECIFIED
    InProgress : COMPACTION_STATUS_IN_PROGRESS
    Success : COMPACTION_STATUS_SUCCESS

    LeaseExpired : COMPACTION_STATUS_IN_PROGRESS
    Excluded: COMPACTION_STATUS_IN_PROGRESS
```

## Communication

### Scheduler to Worker

| Status                          | Description                                                                         |
|---------------------------------|-------------------------------------------------------------------------------------|
| `COMPACTION_STATUS_UNSPECIFIED` | Not allowed.                                                                        |
| `COMPACTION_STATUS_IN_PROGRESS` | Job lease refresh. The worker should refresh the new lease before the new deadline. |
| `COMPACTION_STATUS_SUCCESS`     | Not allowed.                                                                        |
| *(job absent from response)*    | No lease refresh from the scheduler. The worker should stop processing.             |

### Worker to Scheduler

| Status                          | Description                                                                                                                             |
|---------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| `COMPACTION_STATUS_UNSPECIFIED` | Not allowed.                                                                                                                            |
| `COMPACTION_STATUS_IN_PROGRESS` | Job lease refresh. The scheduler must extend the lease of the job, if the worker still owns it.                                         |
| `COMPACTION_STATUS_SUCCESS`     | The job has been successfully completed. The scheduler must remove the job from the schedule and communicate the update to the planner. |

### Notes

* Job status `COMPACTION_STATUS_UNSPECIFIED` is never sent over the wire between the scheduler and workers.
* A job in `COMPACTION_STATUS_IN_PROGRESS` cannot be reassigned if its failure counter exceeds the threshold.
* A job in `COMPACTION_STATUS_SUCCESS` is removed from the schedule immediately.