# Pyroscope v2

We're working on the next major version of **Pyroscope** – a complete architectural redesign focused on improving
scalability, performance, and cost-efficiency. The biggest change in Pyroscope v2 is how it handles storage: data
is now written directly to object storage, removing the need for local disks in ingesters. For single-node
deployments, a local file system can still be used as object storage, but this setup isn't supported in
microservices mode.

We've also **decoupled the write and query paths**. This means each path can scale independently, so even the heaviest
queries won't interfere with ingestion performance. The read path can now scale to hundreds of instances instantly.
Looking ahead, we're exploring a serverless query backend to make querying even more cost-effective. Compaction,
a previous bottleneck, has also been overhauled. The new design supports significantly higher throughput and
scalability, allowing hundreds of tenants to ingest thousands of profiles per second – without compromising performance.

This is made possible by a dedicated control plane that orchestrates data placement and compaction. To ensure high
availability and fault tolerance, the control plane uses Raft consensus and is the only component that requires
persistent local storage. In the future, we plan to transition this to a serverless model as well – making Pyroscope
fully stateless and diskless.

> **Note:** This project is currently under active testing. Some features may not yet be fully implemented or stable.
## Getting started

If you want to evaluate the new version, we recommend using the Kubernetes setup. Pyroscope can be deployed as usual,
using the Helm chart and the values file located in the `tools/dev/v2` directory.

# Architecture Overview

Pyroscope is designed to be a scalable and cost-effective solution for storing and querying profiling data.
The architecture is built around the following goals:
 - High write throughput
 - Cost-effective storage
 - Scalable query performance
 - Low operational overhead

In order to achieve these goals, Pyroscope uses a distributed architecture consisting of several components that work
together to ingest, store, and query profiling data. We aim to minimize the number of stateful components and design
the data storage to operate without local disks, relying entirely on object storage.

The high-level components of the architecture include:

```mermaid
graph TD

%% Entry points %%
    subgraph entry_points[" "]
        ingest_entry["Ingest Path"]:::entry_ingest --> distributor
        query_entry["Query Path"]:::entry_query --> query_frontend
    end

%% Components %%

    distributor -->|writes to| segment_writer
    segment_writer -->|updates| metastore
    segment_writer -->|creates segments| object_storage

    metastore -->|coordinates| compaction_worker
    compaction_worker -->|compacts| object_storage

    query_frontend -->|invokes| query_backend
    query_backend -->|reads from| object_storage
    query_frontend -->|queries| metastore

    distributor["distributor"]
    segment_writer["segment-writer"]
    metastore["metastore"]
    compaction_worker["compaction-worker"]
    query_backend["query-backend"]
    query_frontend["query-frontend"]

%% Object Storage %%
    subgraph object_storage["object storage"]
        segments
        blocks
    end

%% Data Flow Colors %%
%% Entry edges: purple for ingest, blue for query %%
    linkStyle 0 stroke:#a855f7,stroke-width:2px
    linkStyle 1 stroke:#3b82f6,stroke-width:2px

%% Purple: ingestion path %%
    linkStyle 2,3,4 stroke:#a855f7,stroke-width:2px
%% Purple: compaction process %%
    linkStyle 6 stroke:#a855f7,stroke-width:2px
%% Blue: query path %%
    linkStyle 7,8,9 stroke:#3b82f6,stroke-width:2px

%% Styling %%
    classDef entry_ingest stroke:#a855f7,stroke-width:2px,font-weight:bold
    classDef entry_query stroke:#3b82f6,stroke-width:2px,font-weight:bold
```

## Ingestion

Profiles are ingested through the Push RPC API and the HTTP `/ingest` API to distributors. The write path includes
the distributor and segment-writer services: both are stateless, diskless, and scale horizontally with high efficiency.

Profile ingest requests are randomly distributed among distributors, which then route them to segment-writers
so that profiles from the same application are co-located. This ensures that profiles likely to be queried
together are stored together. You can find a detailed description of the distribution algorithm in the
distributor documentation.
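
The routing idea can be sketched roughly as follows. This is an illustrative stand-in, not the actual placement
algorithm (the function name, the key format, and the FNV hash choice are assumptions for the example); the real
algorithm is described in the distributor documentation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickShard deterministically maps all profiles of one tenant service to the
// same shard, so that data likely to be queried together is written together.
// Illustrative sketch only; the real distributor logic is more involved.
func pickShard(tenant, service string, shards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenant))
	h.Write([]byte{0}) // separator to avoid key collisions
	h.Write([]byte(service))
	return h.Sum32() % shards
}

func main() {
	// Profiles from the same service always land on the same shard.
	fmt.Println(pickShard("tenant-a", "checkout", 8))
	fmt.Println(pickShard("tenant-a", "checkout", 8))
}
```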

The segment-writer service accumulates profiles in small blocks (segments) and writes them to object storage while
updating the block index with metadata of newly added objects. Each writer produces a _single object per shard_
containing data of _all tenant services_ per shard; this approach minimizes the number of write operations to the
object storage, optimizing the cost of the solution.
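
The one-object-per-shard approach can be pictured with a minimal sketch (the types and method names here are
hypothetical, not the actual segment-writer API): many datasets accumulate in one per-shard segment, so a flush
costs a single PUT regardless of how many tenant services it contains.

```go
package main

import "fmt"

// Profile is a minimal stand-in for an ingested profile.
type Profile struct {
	Tenant  string
	Service string
	Data    []byte
}

// Segment accumulates profiles destined for one shard. All tenant services
// on the shard share a single segment object, so one flush results in one
// write to object storage regardless of the number of services.
type Segment struct {
	Shard    uint32
	datasets map[string][]Profile // keyed by tenant/service
}

func NewSegment(shard uint32) *Segment {
	return &Segment{Shard: shard, datasets: make(map[string][]Profile)}
}

func (s *Segment) Append(p Profile) {
	key := p.Tenant + "/" + p.Service
	s.datasets[key] = append(s.datasets[key], p)
}

// Flush would serialize all datasets into one object; here it just reports
// what a single PUT would contain.
func (s *Segment) Flush() (objects, datasets int) {
	return 1, len(s.datasets)
}

func main() {
	seg := NewSegment(3)
	seg.Append(Profile{Tenant: "a", Service: "checkout"})
	seg.Append(Profile{Tenant: "a", Service: "payments"})
	seg.Append(Profile{Tenant: "b", Service: "api"})
	objects, datasets := seg.Flush()
	fmt.Printf("one flush: %d object, %d datasets\n", objects, datasets)
}
```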

Ingestion clients are blocked until data is durably stored in object storage and an entry for the object is
created in the metadata index. By default, ingestion is synchronous, with median latency expected to be
less than 500ms using default settings and popular object storage providers such as Amazon S3, Google Cloud Storage,
and Azure Blob Storage.
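
The synchronous acknowledgement can be sketched like this (an assumption-laden illustration, not the actual code:
`putObject` and `updateIndex` stand in for the real storage write and metastore update): the client is only acked
after both steps succeed.

```go
package main

import "fmt"

// putObject and updateIndex are placeholders for the real object storage
// write and the metastore index update.
func putObject(data []byte) error   { return nil }
func updateIndex(meta string) error { return nil }

// ingest blocks the caller until the segment is durable in object storage
// AND indexed in the metastore; only then is the client acknowledged.
func ingest(data []byte, meta string) error {
	done := make(chan error, 1)
	go func() {
		if err := putObject(data); err != nil {
			done <- err
			return
		}
		done <- updateIndex(meta)
	}()
	return <-done // caller is blocked here until durability is confirmed
}

func main() {
	if err := ingest([]byte("profile"), "segment-meta"); err != nil {
		fmt.Println("ingest failed:", err)
		return
	}
	fmt.Println("acked: segment durable and indexed")
}
```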

You can learn more about the write path in the [distributor documentation](../segmentwriter/client/distributor/README.md).

## Metastore

The metastore service is responsible for maintaining the metadata index and coordinating the compaction process.
This is the only stateful component in the architecture, and it uses local disk as durable storage: even a large-scale
cluster only needs a few gigabytes of disk space for the metadata index. The metastore service uses the Raft protocol
for consensus and replication.

The metadata index includes information about data objects stored in object storage and their contents, such
as time ranges and datasets containing profiling data for particular services.

The metastore service is designed to be highly available and fault-tolerant. In a cluster of three nodes, it can
tolerate the loss of a single node, and in a cluster of five nodes, it can tolerate the loss of two nodes.
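
These numbers follow directly from Raft's majority-quorum rule: a cluster of `n` voters needs `n/2 + 1` of them
alive, so it tolerates `n - (n/2 + 1)` failures.

```go
package main

import "fmt"

// tolerableFailures returns how many node losses a Raft cluster of n voting
// members can survive while still forming a majority quorum.
func tolerableFailures(n int) int {
	quorum := n/2 + 1
	return n - quorum
}

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("%d nodes: quorum %d, tolerates %d failure(s)\n",
			n, n/2+1, tolerableFailures(n))
	}
}
```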

You can learn more about the metadata index in the [metastore index documentation](../metastore/index/README.md).

## Compaction

The number of objects created in storage can reach millions per hour. This can severely degrade query performance due
to high read amplification and excessive calls to object storage. Additionally, a high number of metadata entries can
degrade performance across the entire cluster, impacting the write path as well.

To ensure high query performance, data objects are compacted in the background. The compaction-worker service is
responsible for merging small segments into larger blocks, which are then written back to object storage. Compaction
workers compact data as soon as possible after it's written to object storage, with median time to the
first compaction not exceeding 15 seconds.
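
The core effect of a compaction job can be sketched as follows (an illustration under assumed names; the real worker
also merges indexes, deduplicates symbols, and rewrites the profile data itself): many small same-level objects are
replaced by one larger object at the next level, shrinking both the object count and the metadata index.

```go
package main

import "fmt"

// Obj is a minimal stand-in for a segment or block in object storage.
type Obj struct {
	Level int   // 0 = freshly written segment, higher = compacted block
	Size  int64 // bytes
}

// compact merges a batch of same-level objects into a single object at the
// next compaction level. Illustrative sketch only.
func compact(batch []Obj) Obj {
	if len(batch) == 0 {
		return Obj{}
	}
	out := Obj{Level: batch[0].Level + 1}
	for _, o := range batch {
		out.Size += o.Size
	}
	return out
}

func main() {
	segments := []Obj{{0, 4 << 20}, {0, 3 << 20}, {0, 5 << 20}}
	block := compact(segments)
	fmt.Printf("3 segments -> 1 block: level=%d size=%dMiB\n",
		block.Level, block.Size>>20)
}
```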

Compaction workers are coordinated by the metastore service, which maintains the metadata index and schedules compaction
jobs. Compaction workers are stateless and do not require any local storage.

You can learn more about the compaction process in the [compaction documentation](../metastore/compaction/README.md).

## Querying

Profiling data is queried through the Query API available in the query-frontend service.

A regular flame graph query that users see in the UI may require fetching many gigabytes of data from storage. Moreover,
the raw profiling data requires very expensive post-processing to be displayed in flame graph format. Pyroscope addresses
this challenge through adaptive data placement, which minimizes the number of objects that need to be read to satisfy a
query, and through high parallelism in query execution.

The query frontend is responsible for preliminary query planning and for routing the query to the query-backend service.
Data objects are located using the metastore service, which maintains the metadata index.

Queries are executed by the query-backend service with high parallelism. Query execution is represented as a graph
in which the results of sub-queries are combined and optimized. This minimizes network overhead and enables horizontal
scalability of the read path without needing traditional disk-based solutions or even a caching layer.
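
The key property that makes the execution graph work can be illustrated with a toy example (the flat
stack-to-value representation here is an assumption for the sketch, not the actual wire format): partial
flame-graph aggregates combine associatively and commutatively, so sub-query results can be merged pairwise
at any node of the graph before reaching the frontend.

```go
package main

import "fmt"

// merge combines two partial flame-graph aggregates (stack -> total value)
// produced by independent sub-queries. Because the operation is associative
// and commutative, partial results can be combined at any node of the
// execution graph, in any order. Illustrative sketch only.
func merge(a, b map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(a)+len(b))
	for k, v := range a {
		out[k] += v
	}
	for k, v := range b {
		out[k] += v
	}
	return out
}

func main() {
	// Two backend instances return partial aggregates for the same query.
	part1 := map[string]int64{"main;handler": 120, "main;gc": 30}
	part2 := map[string]int64{"main;handler": 80}
	total := merge(part1, part2)
	fmt.Println(total["main;handler"], total["main;gc"])
}
```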

Both query-frontend and query-backend are stateless services that can scale out to hundreds of instances.
In future versions, we plan to add a serverless query-backend option.