
---
title: "Cortex Architecture"
linkTitle: "Architecture"
weight: 2
slug: architecture
---

Cortex consists of multiple horizontally scalable microservices. Each microservice uses the most appropriate technique for horizontal scaling; most are stateless and can handle requests for any user, while some (namely the [ingesters](#ingester)) are semi-stateful and depend on consistent hashing. This document provides a basic overview of Cortex's architecture.

<p align="center"><img src="../images/architecture.png" alt="Cortex Architecture"></p>

## The role of Prometheus

Prometheus instances scrape samples from various targets and then push them to Cortex (using Prometheus' [remote write API](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations)). That remote write API emits batched [Snappy](https://google.github.io/snappy/)-compressed [Protocol Buffer](https://developers.google.com/protocol-buffers/) messages inside the body of an HTTP `POST` request.

Cortex requires that each HTTP request bear a header specifying a tenant ID for the request. Request authentication and authorization are handled by an external reverse proxy.
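
For illustration, here is a minimal sketch of what such a request could look like on the wire. It assumes the gogo-generated `prompb` types from Prometheus and the `github.com/golang/snappy` package; the payload framing headers follow the remote write convention, and `X-Scope-OrgID` is the tenant ID header Cortex expects (typically injected by the authenticating reverse proxy rather than by Prometheus itself).

```go
package writeclient

import (
	"bytes"
	"net/http"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// buildWriteRequest sketches how a remote-write client (or a proxy in front of
// Cortex) could frame the write request described above.
func buildWriteRequest(url, tenantID string, req *prompb.WriteRequest) (*http.Request, error) {
	raw, err := proto.Marshal(req) // Protocol Buffer encoding of the batched samples
	if err != nil {
		return nil, err
	}
	compressed := snappy.Encode(nil, raw) // Snappy block compression

	httpReq, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
	// Tenant ID header required by Cortex; an authenticating reverse proxy
	// would normally add this after authorizing the request.
	httpReq.Header.Set("X-Scope-OrgID", tenantID)
	return httpReq, nil
}
```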

Incoming samples (writes from Prometheus) are handled by the [distributor](#distributor) while incoming reads (PromQL queries) are handled by the [querier](#querier) or optionally by the [query frontend](#query-frontend).

## Storage

Cortex currently supports two storage engines to store and query the time series:

- Chunks (deprecated)
- Blocks

The two engines mostly share the same Cortex architecture, with a few differences outlined in the rest of the document.

### Chunks storage (deprecated)

The chunks storage stores each single time series into a separate object called a _Chunk_. Each Chunk contains the samples for a given period (defaults to 12 hours). Chunks are then indexed by time range and labels, in order to provide fast lookups across many (potentially millions of) Chunks.

For this reason, the chunks storage consists of:

* An index for the Chunks. This index can be backed by:
  * [Amazon DynamoDB](https://aws.amazon.com/dynamodb)
  * [Google Bigtable](https://cloud.google.com/bigtable)
  * [Apache Cassandra](https://cassandra.apache.org)
* An object store for the Chunk data itself, which can be:
  * [Amazon DynamoDB](https://aws.amazon.com/dynamodb)
  * [Google Bigtable](https://cloud.google.com/bigtable)
  * [Apache Cassandra](https://cassandra.apache.org)
  * [Amazon S3](https://aws.amazon.com/s3)
  * [Google Cloud Storage](https://cloud.google.com/storage/)
  * [Microsoft Azure Storage](https://azure.microsoft.com/en-us/services/storage/)

For more information, please check out the [Chunks storage](./chunks-storage/_index.md) documentation.

### Blocks storage

The blocks storage is based on [Prometheus TSDB](https://prometheus.io/docs/prometheus/latest/storage/): it stores each tenant's time series into its own TSDB, which writes out its series to an on-disk Block (defaults to a 2h block range period). Each Block is composed of a few files storing the chunks and the block index.
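
As a rough illustration of the 2h block range, the sketch below computes which block interval a sample timestamp falls into; the names are illustrative, not Cortex or TSDB APIs.

```go
package blocks

import "time"

// blockRange is the default TSDB block range mentioned above.
const blockRange = 2 * time.Hour

// blockBounds returns the [minT, maxT) interval of the Block a sample
// timestamp falls into, assuming blocks are aligned to the block range.
func blockBounds(ts time.Time) (minT, maxT time.Time) {
	minT = ts.Truncate(blockRange) // round down to the start of the 2h range
	return minT, minT.Add(blockRange)
}
```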

The TSDB chunk files contain the samples for multiple series. The series inside the chunk files are then indexed by a per-block index, which maps metric names and labels to time series in the chunk files.

The blocks storage doesn't require a dedicated storage backend for the index. The only requirement is an object store for the Block files, which can be:

* [Amazon S3](https://aws.amazon.com/s3)
* [Google Cloud Storage](https://cloud.google.com/storage/)
* [Microsoft Azure Storage](https://azure.microsoft.com/en-us/services/storage/)
* [OpenStack Swift](https://wiki.openstack.org/wiki/Swift) (experimental)
* [Local Filesystem](https://thanos.io/storage.md/#filesystem) (single node only)

For more information, please check out the [Blocks storage](./blocks-storage/_index.md) documentation.

## Services

Cortex has a service-based architecture, in which the overall system is split up into a variety of components, each performing a specific task. These components run separately and in parallel. Cortex can alternatively run in a single process mode, where all components are executed within a single process. The single process mode is particularly handy for local testing and development.

Cortex is, for the most part, a shared-nothing system. Each layer of the system can run multiple instances of each component, and those instances don't coordinate or communicate with each other within that layer.

The Cortex services are:

- [Distributor](#distributor)
- [Ingester](#ingester)
- [Querier](#querier)
- [Query frontend](#query-frontend) (optional)
- [Ruler](#ruler) (optional)
- [Alertmanager](#alertmanager) (optional)
- [Configs API](#configs-api) (optional)

### Distributor

The **distributor** service is responsible for handling incoming samples from Prometheus. It's the first stop in the write path for series samples. Once the distributor receives samples from Prometheus, each sample is validated for correctness and to ensure that it is within the configured tenant limits, falling back to default ones in case limits have not been overridden for the specific tenant. Valid samples are then split into batches and sent to multiple [ingesters](#ingester) in parallel.

The validation done by the distributor includes (see the sketch after this list):

- The metric label names are formally correct
- The configured max number of labels per metric is respected
- The configured max length of a label name and value is respected
- The timestamp is not older/newer than the configured min/max time range
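
A minimal sketch of these checks follows, with illustrative limit values; in Cortex the real limits are per-tenant and configurable.

```go
package validation

import (
	"fmt"
	"regexp"
	"time"
)

// Illustrative limits; these are not Cortex's actual defaults.
const (
	maxLabelNamesPerSeries = 30
	maxLabelNameLength     = 1024
	maxLabelValueLength    = 2048
	creationGracePeriod    = 10 * time.Minute      // how far in the future a sample may be
	maxSampleAge           = 14 * 24 * time.Hour   // how far in the past a sample may be
)

// Prometheus label name format.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

type label struct{ Name, Value string }

// validateSeries applies the checks listed above to a single incoming series.
func validateSeries(labels []label, ts, now time.Time) error {
	if len(labels) > maxLabelNamesPerSeries {
		return fmt.Errorf("series has %d labels, limit is %d", len(labels), maxLabelNamesPerSeries)
	}
	for _, l := range labels {
		if !labelNameRE.MatchString(l.Name) {
			return fmt.Errorf("invalid label name %q", l.Name)
		}
		if len(l.Name) > maxLabelNameLength || len(l.Value) > maxLabelValueLength {
			return fmt.Errorf("label %q exceeds the configured length limits", l.Name)
		}
	}
	if ts.Before(now.Add(-maxSampleAge)) || ts.After(now.Add(creationGracePeriod)) {
		return fmt.Errorf("sample timestamp %v is outside the accepted time range", ts)
	}
	return nil
}
```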

Distributors are **stateless** and can be scaled up and down as needed.

#### High Availability Tracker

The distributor features a **High Availability (HA) Tracker**. When enabled, the distributor deduplicates incoming samples from redundant Prometheus servers. This allows you to have multiple HA replicas of the same Prometheus servers writing the same series to Cortex, and then to deduplicate these series in the Cortex distributor.

The HA Tracker deduplicates incoming samples based on a cluster and a replica label. The cluster label uniquely identifies the cluster of redundant Prometheus servers for a given tenant, while the replica label uniquely identifies the replica within the Prometheus cluster. Incoming samples are considered duplicated (and thus dropped) if received from any replica which is not the current primary within a cluster.

The HA Tracker requires a key-value (KV) store to coordinate which replica is currently elected. The distributor will only accept samples from the current leader. Samples carrying only one of the two labels, or neither, are accepted by default and never deduplicated.
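
A minimal sketch of the accept/drop decision, assuming the elected replica per tenant/cluster pair is read from the KV store; the names are illustrative, not the actual HA Tracker code.

```go
package hatracker

// electedReplica would be looked up from the KV store (Consul or etcd), where
// distributors coordinate which replica is currently the leader for a given
// tenant/cluster pair.
type electedReplica func(tenant, cluster string) (replica string, ok bool)

// acceptSample implements the rule described above: samples missing the
// cluster or replica label are always accepted; otherwise only samples from
// the currently elected replica are kept, the rest are dropped as duplicates.
func acceptSample(tenant, cluster, replica string, elected electedReplica) bool {
	if cluster == "" || replica == "" {
		return true // one or both labels missing: never deduplicated
	}
	leader, ok := elected(tenant, cluster)
	if !ok {
		return true // no leader elected yet; a real tracker would try to elect one
	}
	return replica == leader
}
```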

The supported KV stores for the HA Tracker are:

* [Consul](https://www.consul.io)
* [Etcd](https://etcd.io)

Note: Memberlist is not supported. A memberlist-based KV store propagates updates using gossip, which is too slow for HA purposes: the result is that different distributors may see a different Prometheus server as the elected HA replica, which is definitely not desirable.

For more information, please refer to [config for sending HA pairs data to Cortex](guides/ha-pair-handling.md) in the documentation.

#### Hashing

Distributors use consistent hashing, in conjunction with a configurable replication factor, to determine which ingester instance(s) should receive a given series.

Cortex supports two hashing strategies:

1. Hash the metric name and tenant ID (default)
2. Hash the metric name, labels and tenant ID (enabled with `-distributor.shard-by-all-labels=true`)

The trade-off associated with the latter is that writes are more balanced across ingesters, but each query needs to talk to all ingesters, since a metric could be spread across multiple ingesters given different label sets.
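
A minimal sketch of the two strategies using a 32-bit FNV hash; the exact hash function and byte layout used by Cortex may differ.

```go
package sharding

import (
	"hash/fnv"
	"sort"
)

type label struct{ Name, Value string }

// shardByMetricName is the default strategy: hash only the tenant ID and the
// metric name, so all series of a metric map to the same token range.
func shardByMetricName(tenantID, metricName string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	h.Write([]byte(metricName))
	return h.Sum32()
}

// shardByAllLabels corresponds to -distributor.shard-by-all-labels=true: the
// full label set participates in the hash, so series of the same metric can
// land on different ingesters.
func shardByAllLabels(tenantID string, labels []label) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	// Sort labels so the hash is stable regardless of input order.
	sort.Slice(labels, func(i, j int) bool { return labels[i].Name < labels[j].Name })
	for _, l := range labels {
		h.Write([]byte(l.Name))
		h.Write([]byte(l.Value))
	}
	return h.Sum32()
}
```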

#### The hash ring

A hash ring (stored in a key-value store) is used to achieve consistent hashing for the series sharding and replication across the ingesters. All [ingesters](#ingester) register themselves into the hash ring with a set of tokens they own; each token is a random unsigned 32-bit number. Each incoming series is [hashed](#hashing) in the distributor and then pushed to the ingester owning the token range for the series hash number, plus N-1 subsequent ingesters in the ring, where N is the replication factor.

To do the hash lookup, distributors find the smallest appropriate token whose value is larger than the [hash of the series](#hashing). When the replication factor is larger than 1, the next subsequent tokens (clockwise in the ring) that belong to different ingesters will also be included in the result.

The effect of this hash setup is that each token that an ingester owns is responsible for a range of hashes. If there are three tokens with values 0, 25, and 50, then a hash of 3 would be given to the ingester that owns the token 25; the ingester owning token 25 is responsible for the hash range of 1-25.
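
A minimal sketch of this lookup, with illustrative names; the real ring implementation also takes ingester state into account (for example, skipping `UNHEALTHY` ingesters).

```go
package ring

import "sort"

type token struct {
	Value    uint32
	Ingester string // address of the ingester that registered this token
}

// replicationSet returns the ingesters that should receive a series with the
// given hash: the owner of the smallest token larger than the hash, plus the
// next distinct ingesters clockwise until the replication factor is reached.
func replicationSet(tokens []token, hash uint32, replicationFactor int) []string {
	sort.Slice(tokens, func(i, j int) bool { return tokens[i].Value < tokens[j].Value })

	// Smallest token whose value is larger than the series hash.
	start := sort.Search(len(tokens), func(i int) bool { return tokens[i].Value > hash })

	seen := map[string]bool{}
	var out []string
	for i := 0; len(out) < replicationFactor && i < len(tokens); i++ {
		t := tokens[(start+i)%len(tokens)] // wrap around the ring (clockwise)
		if seen[t.Ingester] {
			continue // same ingester owns several tokens; only count it once
		}
		seen[t.Ingester] = true
		out = append(out, t.Ingester)
	}
	return out
}
```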

The supported KV stores for the hash ring are:

* [Consul](https://www.consul.io)
* [Etcd](https://etcd.io)
* Gossip [memberlist](https://github.com/hashicorp/memberlist)

#### Quorum consistency

Since all distributors share access to the same hash ring, write requests can be sent to any distributor and you can set up a stateless load balancer in front of them.

To ensure consistent query results, Cortex uses [Dynamo-style](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) quorum consistency on reads and writes. This means that the distributor will wait for a positive response from at least one half plus one of the ingesters it sends the sample to before successfully responding to the Prometheus write request.
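
A minimal sketch of the quorum arithmetic: with a replication factor of N, a write succeeds once N/2 + 1 ingesters have acknowledged it. This is illustrative, not the actual Cortex code.

```go
package quorum

import "errors"

// writeQuorum waits on the per-ingester results of a replicated write and
// succeeds once one half plus one of the replicas have acknowledged it.
// results carries one error (or nil) per ingester the sample was sent to.
func writeQuorum(results <-chan error, replicationFactor int) error {
	need := replicationFactor/2 + 1         // e.g. 2 out of 3
	maxFailures := replicationFactor - need // e.g. 1 out of 3

	var successes, failures int
	for i := 0; i < replicationFactor; i++ {
		if err := <-results; err != nil {
			failures++
			if failures > maxFailures {
				return errors.New("write quorum not reached")
			}
			continue
		}
		successes++
		if successes >= need {
			return nil // quorum reached; respond to Prometheus successfully
		}
	}
	return errors.New("write quorum not reached")
}
```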

#### Load balancing across distributors

We recommend randomly load balancing write requests across distributor instances. For example, if you're running Cortex in a Kubernetes cluster, you could run the distributors as a Kubernetes [Service](https://kubernetes.io/docs/concepts/services-networking/service/).

### Ingester

The **ingester** service is responsible for writing incoming series to a [long-term storage backend](#storage) on the write path and returning in-memory series samples for queries on the read path.

Incoming series are not immediately written to the storage but kept in memory and periodically flushed to the storage (by default, 12 hours for the chunks storage and 2 hours for the blocks storage). For this reason, the [queriers](#querier) may need to fetch samples both from ingesters and long-term storage while executing a query on the read path.

Ingesters contain a **lifecycler** which manages the lifecycle of an ingester and stores the **ingester state** in the [hash ring](#the-hash-ring). Each ingester could be in one of the following states:

- **`PENDING`**<br />
  The ingester has just started. While in this state, the ingester doesn't receive either write or read requests, and could be waiting for a time series data transfer from another ingester if running the chunks storage and the [hand-over](guides/ingesters-rolling-updates.md#chunks-storage-with-wal-disabled-hand-over) is enabled.
- **`JOINING`**<br />
  The ingester is starting up and joining the ring. While in this state the ingester doesn't receive either write or read requests. The ingester will join the ring using tokens received from a leaving ingester as part of the [hand-over](guides/ingesters-rolling-updates.md#chunks-storage-with-wal-disabled-hand-over) process (if enabled), otherwise it could load tokens from disk (if `-ingester.tokens-file-path` is configured) or generate a set of new random ones. Finally, the ingester optionally observes the ring for token conflicts and then, once any conflict is resolved, moves to the `ACTIVE` state.
- **`ACTIVE`**<br />
  The ingester is up and running. While in this state the ingester can receive both write and read requests.
- **`LEAVING`**<br />
  The ingester is shutting down and leaving the ring. While in this state the ingester doesn't receive write requests, but it can still receive read requests.
- **`UNHEALTHY`**<br />
  The ingester has failed to heartbeat to the ring's KV store. While in this state, distributors skip the ingester while building the replication set for incoming series, and the ingester does not receive write or read requests.

_The ingester states are internally used for different purposes, including the series hand-over process supported by the chunks storage. For more information about it, please check out the [Ingester hand-over](guides/ingesters-rolling-updates.md#chunks-storage-with-wal-disabled-hand-over) documentation._
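
The states above can be summarised as a simple enum. This is only a sketch restating the list, not the actual ring types used by Cortex.

```go
package lifecycle

// State of an ingester as advertised in the hash ring.
type State int

const (
	PENDING   State = iota // just started; no read/write traffic yet
	JOINING                // joining the ring, loading or generating tokens
	ACTIVE                 // serving both writes and reads
	LEAVING                // shutting down; reads only
	UNHEALTHY              // failed to heartbeat the ring's KV store
)

// acceptsWrites reports whether distributors should send samples to an
// ingester in the given state.
func (s State) acceptsWrites() bool { return s == ACTIVE }

// acceptsReads reports whether queriers may query an ingester in the
// given state.
func (s State) acceptsReads() bool { return s == ACTIVE || s == LEAVING }
```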

Ingesters are **semi-stateful**.

#### Ingesters failure and data loss

If an ingester process crashes or exits abruptly, all the in-memory series that have not yet been flushed to the long-term storage will be lost. There are two main ways to mitigate this failure mode:

1. Replication
2. Write-ahead log (WAL)

**Replication** is used to hold multiple (typically 3) replicas of each time series in the ingesters. If the Cortex cluster loses an ingester, the in-memory series held by the lost ingester are also replicated to at least one other ingester. In the event of a single ingester failure, no time series samples will be lost, while in the event of a multiple-ingester failure, time series may be lost if the failure affects all the ingesters holding the replicas of a specific time series.

The **write-ahead log** (WAL) is used to write all incoming series samples to a persistent disk until they're flushed to the long-term storage. In the event of an ingester failure, a subsequent process restart will replay the WAL and recover the in-memory series samples.

Contrary to replication alone, and provided the data on the persistent disk is not lost, in the event of a multiple-ingester failure each ingester will recover the in-memory series samples from its WAL upon a subsequent restart. Replication is still recommended in order to avoid temporary failures on the read path in the event of a single ingester failure.

The WAL for the chunks storage is disabled by default, while it's always enabled for the blocks storage.

#### Ingesters write de-amplification

Ingesters store recently received samples in-memory in order to perform write de-amplification. If the ingesters immediately wrote received samples to the long-term storage, the system would be very difficult to scale due to the very high pressure on the storage. For this reason, the ingesters batch and compress samples in-memory and periodically flush them out to the storage.
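
A minimal sketch of the idea, with illustrative names and structure: samples are appended in memory and only flushed as one large batch once the flush period has elapsed, so the storage sees one large write instead of one tiny write per scraped sample.

```go
package ingester

import "time"

type sample struct {
	TimestampMs int64
	Value       float64
}

// memorySeries accumulates samples for one series in memory between flushes.
type memorySeries struct {
	samples    []sample
	lastFlush  time.Time
	flushEvery time.Duration // e.g. 2 hours for blocks, 12 hours for chunks
}

func (m *memorySeries) append(s sample, flush func([]sample)) {
	if m.lastFlush.IsZero() {
		m.lastFlush = time.Now() // start the first flush period on the first sample
	}
	m.samples = append(m.samples, s)
	if time.Since(m.lastFlush) < m.flushEvery {
		return // keep batching in memory
	}
	flush(m.samples) // one large, compressible write to the long-term storage
	m.samples = m.samples[:0]
	m.lastFlush = time.Now()
}
```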

Write de-amplification is the main source of Cortex's low total cost of ownership (TCO).

### Querier

The **querier** service handles queries using the [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) query language.

Queriers fetch series samples both from the ingesters and long-term storage: the ingesters hold the in-memory series which have not yet been flushed to the long-term storage. Because of the replication factor, the querier may receive duplicated samples; to resolve this, for a given time series the querier internally **deduplicates** samples with the exact same timestamp.
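
A minimal sketch of this deduplication for a single series, assuming samples fetched from several ingesters and the long-term storage; names are illustrative.

```go
package querier

import "sort"

type sample struct {
	TimestampMs int64
	Value       float64
}

// mergeAndDedupe merges the samples of the same series returned by several
// ingesters (and the long-term storage) and drops duplicates that share the
// exact same timestamp, which happens because of the replication factor.
func mergeAndDedupe(replicas ...[]sample) []sample {
	var all []sample
	for _, r := range replicas {
		all = append(all, r...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].TimestampMs < all[j].TimestampMs })

	out := make([]sample, 0, len(all))
	for _, s := range all {
		if n := len(out); n > 0 && s.TimestampMs == out[n-1].TimestampMs {
			continue // duplicate sample from another replica
		}
		out = append(out, s)
	}
	return out
}
```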

Queriers are **stateless** and can be scaled up and down as needed.

### Query frontend

The **query frontend** is an **optional service** providing the querier's API endpoints and can be used to accelerate the read path. When the query frontend is in place, incoming query requests should be directed to the query frontend instead of the queriers. The querier service will still be required within the cluster, in order to execute the actual queries.

The query frontend internally performs some query adjustments and holds queries in an internal queue. In this setup, queriers act as workers which pull jobs from the queue, execute them, and return the results to the query frontend for aggregation. Queriers need to be configured with the query frontend address (via the `-querier.frontend-address` CLI flag) in order to allow them to connect to the query frontends.

Query frontends are **stateless**. However, due to how the internal queue works, it's recommended to run a few query frontend replicas to reap the benefit of fair scheduling. Two replicas should suffice in most cases.

Flow of the query in the system when using the query frontend:

1) Query is received by the query frontend, which can optionally split it or serve it from the cache.
2) Query frontend stores the query into an in-memory queue, where it waits for some querier to pick it up.
3) Querier picks up the query and executes it.
4) Querier sends the result back to the query frontend, which then forwards it to the client.

#### Queueing

The query frontend queuing mechanism is used to (see the sketch after this list):

* Ensure that large queries, which could cause an out-of-memory (OOM) error in the querier, will be retried on failure. This allows administrators to under-provision memory for queries, or optimistically run more small queries in parallel, which helps to reduce the TCO.
* Prevent multiple large requests from being convoyed on a single querier by distributing them across all queriers using a first-in/first-out (FIFO) queue.
* Prevent a single tenant from denial-of-service-ing (DoSing) other tenants by fairly scheduling queries between tenants.
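
A minimal sketch of the fairness property, using one FIFO per tenant and round-robin selection across tenants; this is illustrative, the actual queue implementation differs.

```go
package queue

type query struct {
	Tenant string
	Expr   string
}

// fairQueue keeps one FIFO per tenant and hands queries out round-robin
// across tenants, so a single tenant flooding the queue cannot starve others.
type fairQueue struct {
	tenants []string           // round-robin order of tenants seen so far
	queues  map[string][]query // per-tenant FIFO
	next    int
}

func newFairQueue() *fairQueue {
	return &fairQueue{queues: map[string][]query{}}
}

func (q *fairQueue) enqueue(item query) {
	if _, ok := q.queues[item.Tenant]; !ok {
		q.tenants = append(q.tenants, item.Tenant)
	}
	q.queues[item.Tenant] = append(q.queues[item.Tenant], item)
}

// dequeue returns the next query, visiting tenants in turn and preserving
// FIFO order within each tenant.
func (q *fairQueue) dequeue() (query, bool) {
	for range q.tenants {
		tenant := q.tenants[q.next%len(q.tenants)]
		q.next++
		if items := q.queues[tenant]; len(items) > 0 {
			q.queues[tenant] = items[1:]
			return items[0], true
		}
	}
	return query{}, false
}
```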

#### Splitting

The query frontend splits multi-day queries into multiple single-day queries, executing these queries in parallel on downstream queriers and stitching the results back together again. This prevents large (multi-day) queries from causing out-of-memory issues in a single querier and helps to execute them faster.
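
A minimal sketch of day-based splitting, with illustrative names:

```go
package frontend

import "time"

type timeRange struct {
	Start, End time.Time
}

// splitByDay splits a multi-day query range into single-day sub-ranges, which
// the query frontend can execute in parallel on downstream queriers and then
// stitch back together.
func splitByDay(r timeRange) []timeRange {
	const day = 24 * time.Hour
	var out []timeRange
	for start := r.Start; start.Before(r.End); {
		end := start.Truncate(day).Add(day) // end of the current UTC day
		if end.After(r.End) {
			end = r.End
		}
		out = append(out, timeRange{Start: start, End: end})
		start = end
	}
	return out
}
```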

#### Caching

The query frontend supports caching query results and reuses them on subsequent queries. If the cached results are incomplete, the query frontend calculates the required subqueries and executes them in parallel on downstream queriers. The query frontend can optionally align queries with their step parameter to improve the cacheability of the query results. The result cache is compatible with any Cortex caching backend (currently Memcached, Redis, and an in-memory cache).
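
A minimal sketch of step alignment: rounding the start and end timestamps down to a multiple of the step, so that repeated queries over a sliding window map to the same cacheable sub-queries. This is illustrative, not the actual cache key logic.

```go
package resultscache

// alignToStep rounds a range query's start and end timestamps (in
// milliseconds) down to a multiple of the step, improving cacheability.
func alignToStep(startMs, endMs, stepMs int64) (alignedStart, alignedEnd int64) {
	return startMs - startMs%stepMs, endMs - endMs%stepMs
}
```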

### Query Scheduler

The query scheduler is an **optional** service that moves the internal queue from the query frontend into a separate component.
This enables independent scaling of query frontends and queues (query schedulers).

In order to use the query scheduler, both the query frontend and the queriers must be configured with the query scheduler address
(using the `-frontend.scheduler-address` and `-querier.scheduler-address` options respectively).

The flow of the query in the system changes when using the query scheduler:

1) Query is received by the query frontend, which can optionally split it or serve it from the cache.
2) Query frontend forwards the query to a random query scheduler process.
3) Query scheduler stores the query into an in-memory queue, where it waits for some querier to pick it up.
4) Querier picks up the query and executes it.
5) Querier sends the result back to the query frontend, which then forwards it to the client.

Query schedulers are **stateless**. It is recommended to run two replicas to make sure queries can still be serviced while one replica is restarting.

### Ruler

The **ruler** is an **optional service** executing PromQL queries for recording rules and alerts. The ruler requires a database storing the recording rules and alerts for each tenant.

Ruler is **semi-stateful** and can be scaled horizontally.
Running rules internally have state, as does the ring the rulers initiate.
However, if the rulers all fail and restart,
Prometheus alert rules have a feature where an alert is restored and returned to a firing state
if it would have been active within its `for` period.
However, there would be gaps in the series generated by the recording rules.

### Alertmanager

The **alertmanager** is an **optional service** responsible for accepting alert notifications from the [ruler](#ruler), deduplicating and grouping them, and routing them to the correct notification channel, such as email, PagerDuty or OpsGenie.

The Cortex alertmanager is built on top of the [Prometheus Alertmanager](https://prometheus.io/docs/alerting/alertmanager/), adding multi-tenancy support. Like the [ruler](#ruler), the alertmanager requires a database storing the per-tenant configuration.

Alertmanager is **semi-stateful**.
The Alertmanager persists information about silences and active alerts to its disk.
If all of the alertmanager nodes failed simultaneously, there would be a loss of data.

### Configs API

The **configs API** is an **optional service** managing the configuration of Rulers and Alertmanagers.
It provides APIs to get/set/update the ruler and alertmanager configurations and store them in the backend.
Currently supported backends are PostgreSQL and in-memory.

Configs API is **stateless**.