github.com/grafana/pyroscope@v1.18.0/pkg/metastore/index/README.md

# Metadata Index

The metadata index stores metadata entries for objects located in the data store. In essence, it is a document store built on top of a key-value database.

It is implemented using BoltDB as the underlying key-value store, with Raft providing replication via consensus. BoltDB was chosen for its simplicity and efficiency in this use case – a single writer, concurrent readers, and ACID transactions. For better performance, the index database can be stored on an in-memory volume: it is recovered from the Raft log and snapshot on startup, so durable storage is not required for the index itself.

## Metadata entries

Data objects, more commonly called _blocks_ in the codebase, are stored in the data store (object storage) and are identified by a unique identifier (ULID).

A block is a collection of _datasets_ that share certain properties, such as tenant and shard identifiers, time range, creation time, and more. Simplified, a metadata entry looks like this:

```proto
struct BlockMeta {
    uint32    format
    string    id
    int32     tenant
    uint32    shard
    []Dataset datasets
    []string  string_table
}
```

```proto
struct Dataset {
    uint32   format
    int32    tenant
    int32    name
    []uint64 table_of_contents
    []int32  labels
}
```

A dataset's content is defined by the `format` field, which indicates its binary format. The `table_of_contents` field is a list of offsets that point to data sections within the dataset, allowing for efficient access to specific parts of the data. The table of contents is specific to the dataset format.

Metadata labels allow specifying additional attributes that can then be used for filtering and querying. Labels are represented as a slice of `int32` values that refer to strings in the metadata entry's string table.
The slice is a sequence of length-prefixed key-value (KV) pairs:

```
len(2) | k1 | v1 | k2 | v2 | len(3) | k1 | v3 | k2 | v4 | k3 | v5
```

Refer to the [`BlockMeta`](../../../api/metastore/v1/types.proto) protobuf definition to learn more about the metadata format.

The metadata entry is also included in the object itself; its offset is specified in the `metadata_offset` attribute. If the offset is not known, the metadata can be retrieved from the object using the footer structure:

```
Offset  | Size       | Description
--------|------------|------------------------------------------
0       | data       | Object data
--------|------------|------------------------------------------
data    | metadata   | Protobuf-encoded metadata
end-8   | be_uint32  | Size of the raw metadata
end-4   | be_uint32  | CRC32 of the raw metadata and size
```

## Structure

The index is partitioned by time – each partition covers a 6-hour window. Within each partition, data is organized by tenant and shard:

```
Partition (6h window)
├── Tenant A
│   ├── Shard 0
│   ├── Shard 1
│   └── Shard N
└── Tenant B
    ├── Shard 0
    └── Shard N
```

Metadata entries are stored in shard buckets as key-value pairs, where the key is the block ID (ULID) and the value is the serialized block metadata. The block identifier is a ULID whose timestamp represents the block's creation time. However, blocks span data ranges defined by the actual timestamps of the data they contain (specified in the block metadata). When blocks are compacted together (merged), the output block identifier uses the timestamp of the oldest block in the input set and reflects the actual time range of the compacted data.

Every block is assigned to a shard at [data distribution](../../ingester/client/distributor/README.md) time, and this assignment never changes.
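Returning to the label encoding described earlier, the length-prefixed KV layout can be decoded with a short routine. This is a minimal sketch: `decodeLabels` and its signature are illustrative, not the actual Pyroscope API.

```go
package main

import "fmt"

// decodeLabels expands the length-prefixed label slice into groups of
// key-value string pairs, resolving indices through the string table.
// Layout: len(n) | k1 | v1 | ... | kn | vn | len(m) | ...
func decodeLabels(labels []int32, strings []string) ([]map[string]string, error) {
	var out []map[string]string
	for i := 0; i < len(labels); {
		n := int(labels[i]) // number of KV pairs in this group
		i++
		if i+2*n > len(labels) {
			return nil, fmt.Errorf("truncated label group at offset %d", i)
		}
		pairs := make(map[string]string, n)
		for j := 0; j < n; j++ {
			k, v := labels[i], labels[i+1]
			pairs[strings[k]] = strings[v]
			i += 2
		}
		out = append(out, pairs)
	}
	return out, nil
}

func main() {
	table := []string{"service_name", "frontend", "profile_type", "cpu"}
	// Two groups: {service_name=frontend} and {service_name=frontend, profile_type=cpu}.
	labels := []int32{1, 0, 1, 2, 0, 1, 2, 3}
	groups, err := decodeLabels(labels, table)
	if err != nil {
		panic(err)
	}
	fmt.Println(groups)
}
```

Length-prefixing lets a single slice carry one label group per dataset use case without a per-group field in the protobuf message.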
The assigned shard identifier is stored in the block metadata entry and is used to locate the block within the tenant bucket.

Shard-level structures:
- To save space, strings in block metadata are deduplicated using a dictionary (`StringTable`).
- Each shard maintains a small index for efficient filtering (`ShardIndex`).
  * The index indicates the time range of the shard's data (min and max timestamps).

```
Partition
└── Tenant
    └── Shard
        ├── .index
        ├── .strings
        ├── 01JYB4J3P5YFCZ80XRG11RMNEK => Block Metadata Entry
        └── 01JYB4JNHARQZYPKR01W46EB54 => Block Metadata Entry
```

The index uses several caches for performance:
- The shard cache keeps shard indexes and string tables in memory.
- The block cache stores decoded metadata entries.

## Index Writes

Index writes are performed by the `segment-writer` service, which is responsible for writing metadata entries to the index.

The write process spans multiple components and involves Raft consensus:

```mermaid
sequenceDiagram
    participant SW as segment-writer

    box Index Service
        participant H as Endpoint
        participant R as Raft
    end

    box FSM
        participant T as Tombstones
        participant MI as Metadata Index
        participant C as Compactor
    end

    SW->>+H: AddBlock(BlockMeta)
    H->>+R: Propose ADD_BLOCK_METADATA
    R-->>+T: Exists?
    Note over T: Reject if block was <br/>already compacted
    T-->>-R: 
    R->>MI: InsertBlock(BlockMeta)
    MI-->>R: 
    R-->>+C: Compact
    Note over C: Add block to <br/>compaction queue
    C-->>-R: 
    R-->>-H: 
    H-->>-SW: 
```

A tombstone check is necessary to prevent adding metadata for blocks that have already been compacted and removed from the index. This situation can occur if the writer fails to receive a response from the index, even though the entry was already added to a compaction job and processed.
During compaction, source objects are not removed immediately but only after a configured delay – long enough to cover the expected retry window. Refer to the [compaction](#compaction) section for more details.

### Dead Letter Queue

If block metadata cannot be added to the index by the client, the metadata may be written to a DLQ in object storage. The recovery process scans for these entries every 15 seconds and attempts to re-add them to the index. Note that use of the DLQ may result in a "stale reads" phenomenon, in which a read fails to observe the effects of a completed write. If strongly consistent (linearizable) reads are required, the client should not use the DLQ.

## Index Queries

Queries access the index through the `ConsistentRead` API, which implements the _linearizable read_ pattern. This ensures that the replica reflects the most recent state agreed upon by the Raft consensus at the moment of access, and that any previous writes are visible to the read operation. Refer to the [implementation](../raftnode/node_read.go) for details. This approach enables Raft _follower_ replicas to serve queries: in practice, both the leader and followers serve queries.

Index queries are performed by the `query-frontend` service, which is responsible for locating data objects for a given query.
```mermaid
sequenceDiagram
    participant QF as query-frontend

    box Query Service
        participant H as Endpoint
    end

    box Raft
        participant N as Local
        participant L as Leader
    end

    box FSM
        participant MI as Metadata Index
    end

    QF->>+H: QueryMetadata(Query)
    critical Linearizable Read
        H->>N: ConsistentRead
        N->>L: ReadIndex
        Note over L: Leader check
        L-->>N: CommitIndex + Term
        Note over N: Wait CommitIndex applied
        H->>+MI: QueryMetadata(Query)
        Note over MI: Read state
        MI-->>-H: 
        N-->>H: 
    end
    H-->>-QF: 
```

The `QueryService` offers two APIs: one for metadata entry queries and one for dataset label queries.

The first allows querying metadata entries based on criteria such as tenant, shard, time range, and labels, using a Prometheus-like selector syntax.

The second provides a way to query dataset labels in the form of Prometheus series labels, without accessing the data objects themselves. For example, a typical query might list all dataset names or a subset of attributes.

Example query:

```go
query := MetadataQuery{
    Expr:      `{service_name="frontend"}`,
    StartTime: time.Now().Add(-time.Hour),
    EndTime:   time.Now(),
    Tenant:    []string{"tenant-1"},
    Labels:    []string{"profile_type"},
}
```

The query will return all metadata entries with datasets that match the specified criteria, preserving the `profile_type` label if present.

## Retention

### Compaction

Compaction is the process of merging multiple blocks into a single block to reduce the number of objects in the data store and to improve query performance by consolidating data. This improves data locality and reduces the read amplification factor. Compaction is also crucial for the metadata index itself: without it, metastore performance degrades quickly – within hours – and the service may become inoperable.
Compaction is performed by the `compaction-worker` service, which is orchestrated by the Compaction Service implemented in the metastore. Refer to the [compaction documentation](../compaction/README.md) for more details.

When compacted blocks are added to the index, the metadata entries of the source blocks are replaced immediately, while the data objects are removed only after a configured delay to prevent interference with in-flight queries. Tombstone entries are created in the metastore to keep track of objects that need to be removed. Eventually, tombstones are included in a compaction job, and the compaction worker removes the source objects from the data store.

### Retention Policies

Retention policies are applied in a coarse-grained manner: individual blocks are not evaluated for deletion. Instead, entire partitions are removed when required by the retention configuration. A partition is identified by a key comprising its time range, tenant ID, and shard ID.

#### Time-based Retention Policy

Currently, only a time-based retention policy is implemented: it deletes partitions older than a specified duration. Retention is based on the block creation time to support data backfilling scenarios. However, data timestamps are also respected: a block is only removed if its upper boundary has passed the retention period.

Time-based retention policies are tenant-specific and can be configured per tenant.

### Cleanup

The cleaner component, running on the Raft leader node, is responsible for enforcing the retention policies. It deletes partitions from the index and generates _tombstones_ that are handled later, during compaction.
The diagram below illustrates the cleanup process:

```mermaid
sequenceDiagram
    participant C as cleaner

    box Index Service
        participant H as Endpoint
        participant R as Raft
    end

    box FSM
        participant MI as Metadata Index
        participant T as Tombstones
    end

    Note over C: Periodic cleanup trigger
    C->>+H: TruncateIndex(policy)
    critical
        H->>MI: List partitions (ConsistentRead)
        Note over MI: Read state
        MI-->>H: 
        Note over H: Apply retention policy<br/>Generate tombstones
        H->>+R: Propose TRUNCATE_INDEX
        R->>MI: Delete partitions
        R->>T: Add tombstones
        R-->>-H: 
    end
    H-->>-C: 
```

This is a two-phase process:
1. The cleaner lists all partitions in the index and applies the retention policy to determine which partitions should be deleted. Unlike the `QueryService`, the cleaner reads only the local state, on the assumption that it runs on the Raft leader node: it still uses the `ConsistentRead` interface, but it only communicates with the local node.
2. The cleaner proposes a Raft command to delete the partitions and generate _tombstones_ for the data objects that need to be removed. The proposal is rejected by followers if the leader has changed since the read operation. This ensures that the proposal reflects the most recent state of the index and was created by the acting leader, eliminating the possibility of conflicts.