github.com/grafana/pyroscope@v1.18.0/pkg/metastore/index/README.md

# Metadata Index

The metadata index stores metadata entries for objects located in the data store. In essence, it is a document store built on top of a key-value database.

It is implemented using BoltDB as the underlying key-value store, with Raft providing replication via consensus. BoltDB was chosen for its simplicity and efficiency in this use case – a single writer, concurrent readers, and ACID transactions. For better performance, the index database can be stored on an in-memory volume: it is recovered from the Raft log and snapshot on startup, so durable storage is not required for the index itself.

## Metadata entries

Data objects, more commonly called _blocks_ in the codebase, are stored in the data store (object storage) and are identified by a unique identifier (ULID).

A block is a collection of _datasets_ that share certain properties, such as tenant and shard identifiers, time range, creation time, and more. Simplified, a metadata entry looks like this:

```proto
struct BlockMeta {
    uint32    format
    string    id
    int32     tenant
    uint32    shard
    []Dataset datasets
    []string  string_table
}
```

```proto
struct Dataset {
    uint32   format
    int32    tenant
    int32    name
    []uint64 table_of_contents
    []int32  labels
}
```

A dataset's content is defined by the `format` field, which indicates its binary format. The `table_of_contents` field is a list of offsets that point to data sections within the dataset, allowing for efficient access to specific parts of the data. The table of contents is specific to the dataset format.

Metadata labels allow specifying additional attributes that can then be used for filtering and querying. Labels are represented as a slice of `int32` values that refer to strings in the metadata entry's string table.
The slice is a sequence of length-prefixed key-value (KV) pairs:

```
len(2) | k1 | v1 | k2 | v2 | len(3) | k1 | v3 | k2 | v4 | k3 | v5
```

Refer to the [`BlockMeta`](../../../api/metastore/v1/types.proto) protobuf definition to learn more about the metadata format.

The metadata entry is also included in the object itself; its offset is specified in the `metadata_offset` attribute. If the offset is not known, the metadata can be retrieved from the object using the footer structure:

```
Offset  | Size       | Description
--------|------------|------------------------------------------
0       | data       | Object data
--------|------------|------------------------------------------
data    | metadata   | Protobuf-encoded metadata
end-8   | be_uint32  | Size of the raw metadata
end-4   | be_uint32  | CRC32 of the raw metadata and size
```

## Structure

The index is partitioned by time – each partition covers a 6-hour window. Within each partition, data is organized by tenant and shard:

```
Partition (6h window)
├── Tenant A
│   ├── Shard 0
│   ├── Shard 1
│   └── Shard N
└── Tenant B
    ├── Shard 0
    └── Shard N
```

Metadata entries are stored in shard buckets as key-value pairs, where the key is the block ID (ULID) and the value is the serialized block metadata. The block identifier is a ULID whose timestamp represents the block's creation time. However, blocks span data ranges defined by the actual timestamps of the data they contain (specified in the block metadata). When blocks are compacted together (merged), the output block identifier uses the timestamp of the oldest block in the input set and reflects the actual time range of the compacted data.

Every block is assigned to a shard at [data distribution](../../ingester/client/distributor/README.md) time, and this assignment never changes.
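Returning to the label encoding described earlier, the length-prefixed KV layout can be decoded with a short routine. This is a minimal sketch: `decodeLabels` and its signature are illustrative, not the actual Pyroscope API.

```go
package main

import "fmt"

// decodeLabels expands the length-prefixed label slice into groups of
// key-value string pairs, resolving indices through the string table.
// Layout: len(n) | k1 | v1 | ... | kn | vn | len(m) | ...
func decodeLabels(labels []int32, strings []string) ([]map[string]string, error) {
	var out []map[string]string
	for i := 0; i < len(labels); {
		n := int(labels[i]) // number of KV pairs in this group
		i++
		if i+2*n > len(labels) {
			return nil, fmt.Errorf("truncated label group at offset %d", i)
		}
		pairs := make(map[string]string, n)
		for j := 0; j < n; j++ {
			k, v := labels[i], labels[i+1]
			pairs[strings[k]] = strings[v]
			i += 2
		}
		out = append(out, pairs)
	}
	return out, nil
}

func main() {
	table := []string{"service_name", "frontend", "profile_type", "cpu"}
	// Two groups: {service_name=frontend} and {service_name=frontend, profile_type=cpu}.
	labels := []int32{1, 0, 1, 2, 0, 1, 2, 3}
	groups, err := decodeLabels(labels, table)
	if err != nil {
		panic(err)
	}
	fmt.Println(groups)
}
```

Length-prefixing lets a single slice carry one label group per dataset use case without a per-group field in the protobuf message.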
The assigned shard identifier is stored in the block metadata entry and is used to locate the block within the tenant bucket.

Shard-level structures:
- To save space, strings in block metadata are deduplicated using a dictionary (`StringTable`).
- Each shard maintains a small index for efficient filtering (`ShardIndex`).
  * The index indicates the time range of the shard's data (min and max timestamps).

```
Partition
└── Tenant
    └── Shard
        ├── .index
        ├── .strings
        ├── 01JYB4J3P5YFCZ80XRG11RMNEK => Block Metadata Entry
        └── 01JYB4JNHARQZYPKR01W46EB54 => Block Metadata Entry
```

The index uses several caches for performance:
- The shard cache keeps shard indexes and string tables in memory.
- The block cache stores decoded metadata entries.

## Index Writes

Index writes are performed by the `segment-writer` service, which is responsible for writing metadata entries to the index.

The write process spans multiple components and involves Raft consensus:

```mermaid
sequenceDiagram
    participant SW as segment-writer

    box Index Service
        participant H as Endpoint
        participant R as Raft
    end

    box FSM
        participant T as Tombstones
        participant MI as Metadata Index
        participant C as Compactor
    end

    SW->>+H: AddBlock(BlockMeta)
    H->>+R: Propose ADD_BLOCK_METADATA
    R-->>+T: Exists?
    Note over T: Reject if block was <br/>already compacted
    T-->>-R: 
    R->>MI: InsertBlock(BlockMeta)
    MI-->>R: 
    R-->>+C: Compact
    Note over C: Add block to <br/>compaction queue
    C-->>-R: 
    R-->>-H: 
    H-->>-SW: 
```

A tombstone check is necessary to prevent adding metadata for blocks that have already been compacted and removed from the index. This situation can occur if the writer fails to receive a response from the index, even though the entry was already added to a compaction job and processed.
During compaction, source objects are not removed immediately but only after a configured delay – long enough to cover the expected retry window. Refer to the [compaction](#compaction) section for more details.

### Dead Letter Queue

If block metadata cannot be added to the index by the client, the metadata may be written to a DLQ in object storage. The recovery process scans for these entries every 15 seconds and attempts to re-add them to the index. Note that use of the DLQ may result in a "stale reads" phenomenon, in which a read fails to observe the effects of a completed write. If strongly consistent (linearizable) reads are required, the client should not use the DLQ.

## Index Queries

Queries access the index through the `ConsistentRead` API, which implements the _linearizable read_ pattern. This ensures that the replica reflects the most recent state agreed upon by the Raft consensus at the moment of access, and that any previous writes are visible to the read operation. Refer to the [implementation](../raftnode/node_read.go) for details. This approach enables Raft _follower_ replicas to serve queries: in practice, both the leader and followers serve queries.

Index queries are performed by the `query-frontend` service, which is responsible for locating data objects for a given query.
```mermaid
sequenceDiagram
    participant QF as query-frontend

    box Query Service
        participant H as Endpoint
    end

    box Raft
        participant N as Local
        participant L as Leader
    end

    box FSM
        participant MI as Metadata Index
    end

    QF->>+H: QueryMetadata(Query)
    critical Linearizable Read
        H->>N: ConsistentRead
        N->>L: ReadIndex
        Note over L: Leader check
        L-->>N: CommitIndex + Term
        Note over N: Wait CommitIndex applied
        H->>+MI: QueryMetadata(Query)
        Note over MI: Read state
        MI-->>-H: 
        N-->>H: 
    end
    H-->>-QF: 
```

The `QueryService` offers two APIs: one for metadata entry queries and one for dataset label queries.

The first allows querying metadata entries based on criteria such as tenant, shard, time range, and labels, using a Prometheus-like selector syntax.

The second provides a way to query dataset labels in the form of Prometheus series labels, without accessing the data objects themselves. For example, a typical query might list all dataset names or a subset of attributes.

Example query:

```go
query := MetadataQuery{
    Expr:      `{service_name="frontend"}`,
    StartTime: time.Now().Add(-time.Hour),
    EndTime:   time.Now(),
    Tenant:    []string{"tenant-1"},
    Labels:    []string{"profile_type"},
}
```

The query will return all metadata entries with datasets that match the specified criteria, preserving the `profile_type` label if present.

## Retention

### Compaction

Compaction is the process of merging multiple blocks into a single block to reduce the number of objects in the data store and to improve query performance by consolidating data. This improves data locality and reduces the read amplification factor. Compaction is also crucial for the metadata index itself: without it, metastore performance degrades quickly – within hours – and the service may become inoperable.
Compaction is performed by the `compaction-worker` service, which is orchestrated by the Compaction Service implemented in the metastore. Refer to the [compaction documentation](../compaction/README.md) for more details.

When compacted blocks are added to the index, the metadata entries of the source blocks are replaced immediately, while the data objects are removed only after a configured delay to prevent interference with in-flight queries. Tombstone entries are created in the metastore to keep track of objects that need to be removed. Eventually, tombstones are included in a compaction job, and the compaction worker removes the source objects from the data store.

### Retention Policies

Retention policies are applied in a coarse-grained manner: individual blocks are not evaluated for deletion. Instead, entire partitions are removed when required by the retention configuration. A partition is identified by a key comprising its time range, tenant ID, and shard ID.

#### Time-based Retention Policy

Currently, only a time-based retention policy is implemented: it deletes partitions older than a specified duration. Retention is based on the block creation time to support data backfilling scenarios. However, data timestamps are also respected: a block is only removed if its upper boundary has passed the retention period.

Time-based retention policies are tenant-specific and can be configured per tenant.

### Cleanup

The cleaner component, running on the Raft leader node, is responsible for enforcing the retention policies. It deletes partitions from the index and generates _tombstones_ that are handled later, during compaction.
The diagram below illustrates the cleanup process:

```mermaid
sequenceDiagram
    participant C as cleaner

    box Index Service
        participant H as Endpoint
        participant R as Raft
    end

    box FSM
        participant MI as Metadata Index
        participant T as Tombstones
    end

    Note over C: Periodic cleanup trigger
    C->>+H: TruncateIndex(policy)
    critical
        H->>MI: List partitions (ConsistentRead)
        Note over MI: Read state
        MI-->>H: 
        Note over H: Apply retention policy<br/>Generate tombstones
        H->>+R: Propose TRUNCATE_INDEX
        R->>MI: Delete partitions
        R->>T: Add tombstones
        R-->>-H: 
    end
    H-->>-C: 
```

This is a two-phase process:
1. The cleaner lists all partitions in the index and applies the retention policy to determine which partitions should be deleted. Unlike the `QueryService`, the cleaner reads only the local state, on the assumption that it runs on the Raft leader node: it still uses the `ConsistentRead` interface, but it only communicates with the local node.
2. The cleaner proposes a Raft command to delete the partitions and generate _tombstones_ for the data objects that need to be removed. The proposal is rejected by followers if the leader has changed since the read operation. This ensures that the proposal reflects the most recent state of the index and was created by the acting leader, eliminating the possibility of conflicts.