---
title: Components
weight: 30
---
# Components

![components_diagram](../loki_architecture_components.svg)

## Distributor

The **distributor** service is responsible for handling incoming streams from
clients. It's the first stop in the write path for log data. Once the
distributor receives a set of streams, each stream is validated for correctness
and to ensure that it is within the configured tenant (or global) limits. Valid
data is then split into batches and sent to multiple [ingesters](#ingester)
in parallel.

It is important that a load balancer sits in front of the distributors in order to properly balance traffic across them.

The distributor is a stateless component. This makes it easy to scale and to offload as much work as possible from the ingesters, which are the most critical component on the write path. The ability to scale these validation operations independently means that Loki can also protect itself against denial of service attacks (either malicious or not) that could otherwise overload the ingesters. Distributors act like the bouncer at the front door, ensuring everyone is appropriately dressed and has an invitation. The distributor also allows us to fan out writes according to our replication factor.

### Validation

The first step the distributor takes is to ensure that all incoming data conforms to the specification. This includes checking that the labels are valid Prometheus labels and ensuring that the timestamps aren't too old or too new and that the log lines aren't too long.
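The checks described above can be sketched roughly as follows. This is a minimal illustration, not Loki's implementation: the function name and the three limit constants are hypothetical, and the label-name pattern is the classic one from the Prometheus data model.

```python
import re
import time

# Classic Prometheus label-name pattern (from the Prometheus data model).
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Hypothetical limits, chosen for illustration only.
MAX_LINE_BYTES = 256 * 1024
MAX_AGE_SECONDS = 7 * 24 * 3600
MAX_FUTURE_SECONDS = 10 * 60


def validate_entry(labels, timestamp, line, now=None):
    """Return a list of problems; an empty list means the entry is valid."""
    now = time.time() if now is None else now
    problems = []
    for name in labels:
        if not LABEL_NAME_RE.fullmatch(name):
            problems.append(f"invalid label name: {name!r}")
    if timestamp < now - MAX_AGE_SECONDS:
        problems.append("timestamp too old")
    if timestamp > now + MAX_FUTURE_SECONDS:
        problems.append("timestamp too new")
    if len(line.encode("utf-8")) > MAX_LINE_BYTES:
        problems.append("line too long")
    return problems
```

Returning every problem at once, rather than stopping at the first one, is a design choice of this sketch only.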

### Preprocessing

Currently the only way the distributor mutates incoming data is by normalizing labels. What this means is making `{foo="bar", bazz="buzz"}` equivalent to `{bazz="buzz", foo="bar"}`, or in other words, sorting the labels. This allows Loki to cache and hash them deterministically.
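As a small illustration of why sorting makes hashing deterministic, the sketch below renders a label set as a canonical string. The function name `normalize_labels` is hypothetical and is not part of Loki.

```python
def normalize_labels(labels):
    """Render a label set as a canonical string with label names sorted."""
    pairs = ", ".join(
        f'{name}="{value}"' for name, value in sorted(labels.items())
    )
    return "{" + pairs + "}"
```

Both `normalize_labels({"foo": "bar", "bazz": "buzz"})` and `normalize_labels({"bazz": "buzz", "foo": "bar"})` produce the same string, so any hash computed over it is the same for both orderings.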

### Rate limiting

The distributor can also rate limit incoming logs based on the maximum per-tenant bitrate. It does this by checking a per-tenant limit and dividing it by the current number of distributors. This allows the rate limit to be specified per tenant at the cluster level and enables us to scale the distributors up or down and have the per-distributor limit adjust accordingly. For instance, say we have 10 distributors and tenant A has a 10MB/s rate limit. Each distributor will allow up to 1MB/s before limiting. Now, say another large tenant joins the cluster and we need to spin up 10 more distributors. The now 20 distributors will adjust their rate limits for tenant A to `(10MB / 20 distributors) = 500KB/s`. This is how global limits allow much simpler and safer operation of the Loki cluster.
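The arithmetic in this example can be written out as a short sketch. The function name is hypothetical, and `MB` is assumed here to mean 1,000,000 bytes.

```python
MB = 1_000_000  # assumption: decimal megabytes


def per_distributor_limit(tenant_limit_bytes, active_distributors):
    """Share of a tenant's cluster-wide rate limit enforced by one distributor."""
    if active_distributors < 1:
        raise ValueError("at least one active distributor is required")
    return tenant_limit_bytes / active_distributors
```

With a 10MB/s tenant limit, `per_distributor_limit(10 * MB, 10)` gives 1MB/s per distributor, and `per_distributor_limit(10 * MB, 20)` gives 500KB/s.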

**Note: The distributor uses the `ring` component under the hood to register itself amongst its peers and get the total number of active distributors. This is a different "key" than the ingesters use in the ring and comes from the distributor's own [ring configuration](../../../configuration#distributor_config).**

### Forwarding

Once the distributor has performed all of its validation duties, it forwards data to the ingester component, which is ultimately responsible for acknowledging the write.

#### Replication factor

In order to mitigate the chance of _losing_ data on any single ingester, the distributor will forward writes to a _replication_factor_ of them. Generally, this is `3`. Replication allows for ingester restarts and rollouts without failing writes and adds additional protection from data loss for some scenarios. Loosely, for each label set (called a _stream_) that is pushed to a distributor, it will hash the labels and use the resulting value to look up `replication_factor` ingesters in the `ring` (which is a subcomponent that exposes a [distributed hash table](https://en.wikipedia.org/wiki/Distributed_hash_table)). It will then try to write the same data to all of them. This will error if fewer than a _quorum_ of writes succeed. A quorum is defined as `floor(replication_factor / 2) + 1`. So, for our `replication_factor` of `3`, we require that two writes succeed. If fewer than two writes succeed, the distributor returns an error and the write can be retried.
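The quorum rule is simple enough to state in code. The sketch below only restates the formula above; the function names are hypothetical.

```python
def quorum(replication_factor):
    """Minimum number of successful writes: floor(replication_factor / 2) + 1."""
    return replication_factor // 2 + 1


def write_succeeds(replication_factor, successful_writes):
    """True when enough ingesters acknowledged the write to form a quorum."""
    return successful_writes >= quorum(replication_factor)
```

For a `replication_factor` of `3`, `quorum(3)` is `2`, so a write acknowledged by only one ingester is reported as failed.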

**Caveat: There's also an edge case where we acknowledge a write if 2 of the 3 ingesters do, which means that in the case where only 2 writes succeed, we can lose only one ingester before suffering data loss.**

Replication factor isn't the only thing that prevents data loss, though, and arguably these days its main purpose is to allow writes to continue uninterrupted during rollouts and restarts. The `ingester` component now includes a [write ahead log](https://en.wikipedia.org/wiki/Write-ahead_logging) which persists incoming writes to disk to ensure they're not lost as long as the disk isn't corrupted. The complementary nature of replication factor and WAL ensures data isn't lost unless there are significant failures in both mechanisms (i.e. multiple ingesters die and lose/corrupt their disks).

### Hashing

Distributors use consistent hashing in conjunction with a configurable
replication factor to determine which instances of the ingester service should
receive a given stream.

A stream is a set of logs associated with a tenant and a unique labelset. The
stream is hashed using both the tenant ID and the labelset, and the hash is
then used to find the ingesters to send the stream to.

A hash ring stored in [Consul](https://www.consul.io) is used to achieve
consistent hashing; all [ingesters](#ingester) register themselves into the hash
ring with a set of tokens they own. Each token is a random unsigned 32-bit
number. Along with a set of tokens, ingesters register their state into the
hash ring. Ingesters in the `JOINING` and `ACTIVE` states may receive write
requests, while `ACTIVE` and `LEAVING` ingesters may receive read requests. When
doing a hash lookup, distributors only use tokens for ingesters that are in the
appropriate state for the request.

To do the hash lookup, distributors find the smallest appropriate token whose
value is larger than the hash of the stream. When the replication factor is
larger than 1, the next subsequent tokens (clockwise in the ring) that belong to
different ingesters will also be included in the result.

The effect of this hash setup is that each token that an ingester owns is
responsible for a range of hashes. If there are three tokens with values 0, 25,
and 50, then a hash of 3 would be given to the ingester that owns the token 25;
the ingester owning token 25 is responsible for the hash range of 1-25.
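The lookup can be sketched with a sorted token list and a binary search. This is an illustration under stated assumptions, not the ring's actual code: the `lookup` function and the `(token, ingester)` representation are hypothetical, filtering by ingester state is left out, and how a hash exactly equal to a token is treated is an assumption (here it moves on to the next token, following the "larger than" wording).

```python
import bisect


def lookup(ring, stream_hash, replication_factor):
    """Return up to `replication_factor` distinct ingesters for a hash.

    `ring` is a list of (token, ingester) pairs. The search starts at the
    smallest token larger than `stream_hash` and walks clockwise, wrapping
    around, skipping tokens whose ingester was already selected.
    """
    ring = sorted(ring)
    tokens = [token for token, _ in ring]
    start = bisect.bisect_right(tokens, stream_hash)
    owners = []
    for offset in range(len(ring)):
        _, ingester = ring[(start + offset) % len(ring)]
        if ingester not in owners:
            owners.append(ingester)
        if len(owners) == replication_factor:
            break
    return owners
```

With tokens 0, 25, and 50 owned by three different ingesters, a hash of 3 resolves to the owner of token 25 first, and a hash larger than 50 wraps around to the owner of token 0.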

### Quorum consistency

Since all distributors share access to the same hash ring, write requests can be
sent to any distributor.

To ensure consistent query results, Loki uses
[Dynamo-style](https://www.cs.princeton.edu/courses/archive/fall15/cos518/studpres/dynamo.pdf)
quorum consistency on reads and writes. This means that the distributor waits
for a positive response from at least one half plus one of the ingesters it
sends the data to before responding to the client that initiated the send.

## Ingester

The **ingester** service is responsible for writing log data to long-term
storage backends (DynamoDB, S3, Cassandra, etc.) on the write path and returning
log data for in-memory queries on the read path.

Ingesters contain a _lifecycler_ which manages the lifecycle of an ingester in
the hash ring. Each ingester is in one of the following states: `PENDING`,
`JOINING`, `ACTIVE`, `LEAVING`, or `UNHEALTHY`:

1. `PENDING` is an ingester's state when it is waiting for a handoff from
   another ingester that is `LEAVING`.
   **Deprecated: the WAL (write ahead log) supersedes this feature.**

1. `JOINING` is an ingester's state when it is currently inserting its tokens
   into the ring and initializing itself. It may receive write requests for
   tokens it owns.

1. `ACTIVE` is an ingester's state when it is fully initialized. It may receive
   both write and read requests for tokens it owns.

1. `LEAVING` is an ingester's state when it is shutting down. It may receive
   read requests for data it still has in memory.

1. `UNHEALTHY` is an ingester's state when it has failed to heartbeat to
   Consul. `UNHEALTHY` is set by the distributor when it periodically checks the ring.

Each log stream that an ingester receives is built up into a set of many
"chunks" in memory and flushed to the storage backend at a configurable
interval.

Chunks are compressed and marked as read-only when:

1. The current chunk has reached capacity (a configurable value).
1. Too much time has passed without the current chunk being updated.
1. A flush occurs.

Whenever a chunk is compressed and marked as read-only, a writable chunk takes
its place.
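The three conditions listed above can be expressed as a single predicate. This is only a sketch: the `Chunk` fields and the parameter names `max_size_bytes` and `max_idle_seconds` are hypothetical stand-ins for the configurable values the text mentions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    size_bytes: int      # bytes currently held by the chunk
    last_updated: float  # time of the last append, in seconds


def should_cut(chunk, now, max_size_bytes, max_idle_seconds, flushing=False):
    """True when the chunk should be compressed and marked read-only."""
    reached_capacity = chunk.size_bytes >= max_size_bytes
    idle_too_long = now - chunk.last_updated > max_idle_seconds
    return reached_capacity or idle_too_long or flushing
```

When `should_cut` returns true, the ingester would start a fresh writable chunk for the stream, as the paragraph above describes.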

If an ingester process crashes or exits abruptly, all the data that has not yet
been flushed could be lost. Loki is usually configured to keep multiple
replicas (usually 3) of each log to mitigate this risk.

When a flush occurs to a persistent storage provider, the chunk is hashed based
on its tenant, labels, and contents. This means that multiple ingesters with the
same copy of data will not write the same data to the backing store twice, but
if any write failed to one of the replicas, multiple differing chunk objects
will be created in the backing store. See [Querier](#querier) for how data is
deduplicated.

### Timestamp Ordering

Loki can be configured to [accept out-of-order writes](../../configuration/#accept-out-of-order-writes).

When not configured to accept out-of-order writes, the ingester validates that
log lines are received in timestamp-ascending order: each log line must have a
timestamp later than that of the line before it. When the ingester receives a
log line that doesn't follow this order, the line is rejected and an error is
returned to the user.

Logs from each unique set of labels are built up into "chunks" in memory and
then flushed to the storage backend.

If an ingester process crashes or exits abruptly, all the data that has not yet
been flushed could be lost. Loki is usually configured with a [Write Ahead Log](../../operations/storage/wal), which can be _replayed_ on restart, as well as with a `replication_factor` (usually 3) of each log to mitigate this risk.

When not configured to accept out-of-order writes,
all lines pushed to Loki for a given stream (unique combination of
labels) must have a newer timestamp than the line received before it. There are,
however, two cases for handling logs for the same stream with identical
nanosecond timestamps:

1. If the incoming line exactly matches the previously received line (matching
   both the previous timestamp and log text), the incoming line will be treated
   as an exact duplicate and ignored.

2. If the incoming line has the same timestamp as the previous line but
   different content, the log line is accepted. This means it is possible to
   have two different log lines for the same timestamp.
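The ordering rule and its two equal-timestamp cases can be summarized in a small decision function. This is a sketch for the case where out-of-order writes are not accepted; the function name and the string results are hypothetical.

```python
def check_entry(previous, incoming):
    """Decide what to do with an incoming (timestamp_ns, line) pair.

    `previous` is the last accepted (timestamp_ns, line) pair for the same
    stream, or None if the stream has no entries yet.
    """
    if previous is None:
        return "accept"
    prev_ts, prev_line = previous
    ts, line = incoming
    if ts < prev_ts:
        return "reject"   # older than the previous line: out of order
    if ts == prev_ts and line == prev_line:
        return "ignore"   # exact duplicate of the previous line
    return "accept"       # newer, or same timestamp with different content
```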

### Handoff - Deprecated in favor of the [WAL](../../operations/storage/wal)

By default, when an ingester is shutting down and tries to leave the hash ring,
it will wait to see if a new ingester tries to enter before flushing and will
try to initiate a handoff. The handoff will transfer all of the tokens and
in-memory chunks owned by the leaving ingester to the new ingester.

Before joining the hash ring, ingesters will wait in `PENDING` state for a
handoff to occur. After a configurable timeout, ingesters in the `PENDING` state
that have not received a transfer will join the ring normally, inserting a new
set of tokens.

This process is used to avoid flushing all chunks when shutting down, which is a
slow process.

### Filesystem Support

While ingesters do support writing to the filesystem through BoltDB, this only
works in single-process mode as [queriers](#querier) need access to the same
back-end store and BoltDB only allows one process to have a lock on the DB at a
given time.

## Query frontend

The **query frontend** is an **optional service** providing the querier's API endpoints and can be used to accelerate the read path. When the query frontend is in place, incoming query requests should be directed to the query frontend instead of the queriers. The querier service is still required within the cluster in order to execute the actual queries.

The query frontend internally performs some query adjustments and holds queries in an internal queue. In this setup, queriers act as workers which pull jobs from the queue, execute them, and return the results to the query frontend for aggregation. Queriers need to be configured with the query frontend address (via the `-querier.frontend-address` CLI flag) in order to allow them to connect to the query frontends.

Query frontends are **stateless**. However, due to how the internal queue works, it's recommended to run a few query frontend replicas to reap the benefit of fair scheduling. Two replicas should suffice in most cases.

### Queueing

The query frontend queuing mechanism is used to:

- Ensure that large queries that could cause an out-of-memory (OOM) error in the querier are retried on failure. This allows administrators to under-provision memory for queries, or optimistically run more small queries in parallel, which helps to reduce the total cost of ownership (TCO).
- Prevent multiple large requests from being convoyed on a single querier by distributing them across all queriers using a first-in/first-out (FIFO) queue.
- Prevent a single tenant from denial-of-service-ing (DoSing) other tenants by fairly scheduling queries between tenants.
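One simple way to picture fair scheduling between tenants is a FIFO queue per tenant, served round-robin. The sketch below is an assumption-laden illustration, not the query frontend's actual scheduler; the `FairQueue` class and its methods are hypothetical.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Per-tenant FIFO queues, served one tenant at a time in rotation."""

    def __init__(self):
        self._queues = OrderedDict()  # tenant -> deque of pending queries

    def enqueue(self, tenant, query):
        self._queues.setdefault(tenant, deque()).append(query)

    def dequeue(self):
        """Return (tenant, query) for the next job, or None if idle."""
        if not self._queues:
            return None
        tenant, queue = next(iter(self._queues.items()))
        query = queue.popleft()
        del self._queues[tenant]
        if queue:
            # Move the tenant to the back so other tenants get a turn.
            self._queues[tenant] = queue
        return tenant, query
```

If tenant `a` enqueues three queries and tenant `b` then enqueues one, workers pulling from this queue see `a`, `b`, `a`, `a`: the single query from `b` is not stuck behind all of `a`'s queries.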

### Splitting

The query frontend splits larger queries into multiple smaller queries, executing these queries in parallel on downstream queriers and stitching the results back together again. This prevents large (multi-day, etc.) queries from causing out-of-memory issues in a single querier and helps to execute them faster.
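Splitting by time can be sketched as cutting a query's time range into consecutive sub-ranges. This is a simplified illustration: the function name and the fixed `interval` parameter are hypothetical, and any alignment of the sub-ranges to interval boundaries is not modeled.

```python
def split_range(start, end, interval):
    """Split [start, end) into consecutive sub-ranges of at most `interval`."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    ranges = []
    cursor = start
    while cursor < end:
        upper = min(cursor + interval, end)
        ranges.append((cursor, upper))
        cursor = upper
    return ranges
```

Each sub-range can then be executed as its own query on a downstream querier, with the results concatenated in time order.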

### Caching

#### Metric Queries

The query frontend supports caching metric query results and reuses them on subsequent queries. If the cached results are incomplete, the query frontend calculates the required subqueries and executes them in parallel on downstream queriers. The query frontend can optionally align queries with their step parameter to improve the cacheability of the query results. The result cache is compatible with any Loki caching backend (currently memcached, redis, and an in-memory cache).

#### Log Queries - Coming soon!

Caching of log (filter, regexp) queries is under active development.

## Querier

The **querier** service handles queries using the [LogQL](../../logql/) query
language, fetching logs both from the ingesters and from long-term storage.

Queriers query all ingesters for in-memory data before falling back to
running the same query against the backend store. Because of the replication
factor, it is possible that the querier may receive duplicate data. To resolve
this, the querier internally **deduplicates** data that has the same nanosecond
timestamp, label set, and log message.
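Deduplication on these three fields can be sketched as follows. The function name and the `(timestamp_ns, labels, line)` tuple layout are hypothetical; the sketch only illustrates the equality rule stated above.

```python
def deduplicate(entries):
    """Drop entries whose timestamp, label set, and line were already seen.

    `entries` is an iterable of (timestamp_ns, labels, line) tuples, where
    `labels` is a dict. The first occurrence of each entry is kept.
    """
    seen = set()
    unique = []
    for timestamp_ns, labels, line in entries:
        # Sort the labels so that their order does not affect equality.
        key = (timestamp_ns, tuple(sorted(labels.items())), line)
        if key not in seen:
            seen.add(key)
            unique.append((timestamp_ns, labels, line))
    return unique
```

Two replicas returning the same line for the same stream and timestamp collapse into one entry, while two different lines that share a timestamp are both kept.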