     1  # Receiver
     2  
     3  The `thanos receive` command implements the [Prometheus Remote Write API](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write). It builds on top of the existing Prometheus TSDB and retains its usefulness while extending its functionality with long-term storage, horizontal scalability, and downsampling. Prometheus instances are configured to continuously write metrics to it, and Thanos Receive uploads TSDB blocks to an object storage bucket every 2 hours by default. Thanos Receive exposes the StoreAPI so that [Thanos Queriers](query.md) can query received metrics in real time.
     4  
     5  We recommend this component to users who can only push metrics into Thanos, for example due to air-gapped or egress-only environments. Please note the [various pros and cons of pushing metrics](https://docs.google.com/document/d/1H47v7WfyKkSLMrR8_iku6u9VB73WrVzBHb2SB6dL9_g/edit#heading=h.2v27snv0lsur).
     6  
     7  Thanos Receive supports multi-tenancy by using labels. See [Multi-tenancy documentation here](../operating/multi-tenancy.md).
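
For example, a Prometheus instance can identify itself as a tenant by attaching a tenant header to its remote write requests. A minimal sketch (the tenant name `team-a` is hypothetical; the header name must match the Receiver's `--receive.tenant-header` flag, which defaults to `THANOS-TENANT`):

```yaml
remote_write:
- url: http://<thanos-receive-container-ip>:10908/api/v1/receive
  headers:
    # Must match --receive.tenant-header on the Receiver (default: THANOS-TENANT).
    THANOS-TENANT: team-a
```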
     8  
     9  Thanos Receive supports ingesting [exemplars](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exemplars) via remote-write. By default, the exemplars are silently discarded as `--tsdb.max-exemplars` is set to `0`. To enable exemplars storage, set the `--tsdb.max-exemplars` flag to a non-zero value. It exposes the ExemplarsAPI so that the [Thanos Queriers](query.md) can query the stored exemplars. Take a look at the documentation for [exemplars storage in Prometheus](https://prometheus.io/docs/prometheus/latest/disabled_features/#exemplars-storage) to know more about it.
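
For example, exemplar storage can be enabled by starting Receive with the flag set to a positive value (the cap of `100000` below is purely illustrative; all other flags keep their defaults):

```bash
# Keep up to 100000 exemplars per tenant in memory; 0 (the default) disables exemplars storage.
thanos receive --tsdb.max-exemplars 100000
```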
    10  
    11  For more information please check out [initial design proposal](../proposals-done/201812-thanos-remote-receive.md). For further information on tuning Prometheus Remote Write [see remote write tuning document](https://prometheus.io/docs/practices/remote_write/).
    12  
    13  > NOTE: As a block producer, it's important to set correct "external labels" that will identify data blocks across Thanos clusters. See [external labels](../storage.md#external-labels) docs for details.
    14  
    15  ## Series distribution algorithms
    16  
    17  The Receive component currently supports two algorithms for distributing timeseries across Receive nodes, which can be selected using the `--receive.hashrings-algorithm` flag.
    18  
    19  ### Ketama (recommended)
    20  
    21  The Ketama algorithm is a consistent hashing scheme which enables stable scaling of Receivers without the drawbacks of the `hashmod` algorithm. This is the recommended algorithm for all new installations.
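
The algorithm can be selected globally with `--receive.hashrings-algorithm=ketama`, or per hashring via the `algorithm` field in the hashring configuration file. A minimal sketch (the endpoints are placeholders):

```json
[
    {
        "algorithm": "ketama",
        "endpoints": [
            "127.0.0.1:10907",
            "127.0.0.1:11907",
            "127.0.0.1:12907"
        ]
    }
]
```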
    22  
    23  If you are using the `hashmod` algorithm and wish to migrate to `ketama`, the simplest and safest way would be to set up a new pool of receivers with `ketama` hashrings and start remote-writing to them. Provided you are on the latest Thanos version, old receivers will flush their TSDBs after the configured retention period and will upload blocks to object storage. Once you have verified that is done, decommission the old receivers.
    24  
    25  ### Hashmod (discouraged)
    26  
    27  This algorithm uses a `hashmod` function over all labels to decide which receiver is responsible for a given timeseries. This is the default algorithm due to historical reasons. However, its usage for new Receive installations is discouraged since adding new Receiver nodes leads to series churn and memory usage spikes.
    28  
    29  ### Hashring management and autoscaling in Kubernetes
    30  
    31  The [Thanos Receive Controller](https://github.com/observatorium/thanos-receive-controller) project aims to automate hashring management when running Thanos in Kubernetes. In combination with the Ketama hashring algorithm, this controller can also be used to keep hashrings up to date when Receivers are scaled automatically using an HPA or [Keda](https://keda.sh/).
    32  
    33  ## TSDB stats
    34  
    35  Thanos Receive supports getting TSDB stats using the `/api/v1/status/tsdb` endpoint. Use the `THANOS-TENANT` HTTP header to get stats for individual tenants. Use the `limit` query parameter to tweak the number of stats to return (the default is 10). The output format of the endpoint is compatible with [Prometheus API](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats).
    36  
    37  Note that each Thanos Receive will only expose local stats and replicated series will not be included in the response.
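
For example, stats for a single tenant could be fetched like this (a sketch; the host and tenant name are placeholders, and the port is the Receiver's `--http-address` port):

```bash
# Return the top 5 stats entries for tenant "team-a" from this Receiver's local TSDB.
curl -H "THANOS-TENANT: team-a" "http://<thanos-receive-host>:10909/api/v1/status/tsdb?limit=5"
```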
    38  
    39  ## Tenant lifecycle management
    40  
    41  Tenants in Receivers are created dynamically and do not need to be provisioned upfront. When a new value is detected in the tenant HTTP header, Receivers will provision and start managing an independent TSDB for that tenant. TSDB blocks that are sent to S3 will contain a unique `tenant_id` label which can be used to compact blocks independently for each tenant.
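
For illustration, the `meta.json` of an uploaded block carries the tenant as an external label alongside the Receiver's other external labels. An abbreviated, hypothetical sketch (the label values follow the example configuration further below):

```json
{
    "thanos": {
        "labels": {
            "receive_cluster": "eu1",
            "receive_replica": "0",
            "tenant_id": "team-a"
        },
        "source": "receive"
    }
}
```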
    42  
    43  A Receiver will automatically decommission a tenant once new samples have not been seen for longer than the `--tsdb.retention` period configured for the Receiver. The tenant decommission process includes flushing all in-memory samples for that tenant to disk, sending all unsent blocks to S3, and removing the tenant TSDB from the filesystem. If a tenant receives new samples after being decommissioned, a new TSDB will be created for the tenant.
    44  
    45  Note that because of the built-in decommissioning process, the semantics of the `--tsdb.retention` flag in the Receiver are different from those in Prometheus. For Receivers, `--tsdb.retention=t` indicates that the data for a tenant will be kept for `t` amount of time, whereas in Prometheus, `--tsdb.retention=t` denotes that the last `t` duration of data will be maintained in TSDB. In other words, Prometheus will keep the last `t` duration of data even when it stops getting new samples.
    46  
    47  ## Example
    48  
    49  ```bash
    50  thanos receive \
    51      --tsdb.path "/path/to/receive/data/dir" \
    52      --grpc-address 0.0.0.0:10907 \
    53      --http-address 0.0.0.0:10909 \
    54      --receive.replication-factor 1 \
    55      --label "receive_replica=\"0\"" \
    56      --label "receive_cluster=\"eu1\"" \
    57      --receive.local-endpoint 127.0.0.1:10907 \
    58      --receive.hashrings-file ./data/hashring.json \
    59      --remote-write.address 0.0.0.0:10908 \
    60      --objstore.config-file "bucket.yml"
    61  ```
    62  
    63  An example `remote_write` Prometheus configuration:
    64  
    65  ```yaml
    66  remote_write:
    67  - url: http://<thanos-receive-container-ip>:10908/api/v1/receive
    68  ```
    69  
    70  where `<thanos-receive-container-ip>` is an IP address reachable by the Prometheus server.
    71  
    72  The example content of `bucket.yml`:
    73  
    74  ```yaml mdox-exec="go run scripts/cfggen/main.go --name=gcs.Config"
    75  type: GCS
    76  config:
    77    bucket: ""
    78    service_account: ""
    79  prefix: ""
    80  ```
    81  
    82  The example content of `hashring.json`:
    83  
    84  ```json
    85  [
    86      {
    87          "endpoints": [
    88              "127.0.0.1:10907",
    89              "127.0.0.1:11907",
    90              "127.0.0.1:12907"
    91          ]
    92      }
    93  ]
    94  ```
    95  
    96  With this configuration, each receiver listens for remote write requests on `<ip>:10908/api/v1/receive` and forwards them to the correct node in the hashring, as needed for tenancy and replication.
    97  
    98  ### AZ-aware Ketama hashring (experimental)
    99  
   100  To ensure an even spread of replication over nodes in different availability zones, you can include an AZ definition in your hashring config. If, for example, we have a 6-node cluster spread over 3 availability zones (A, B and C), we could use the following example `hashring.json`:
   101  
   102  ```json
   103  [
   104      {
   105          "endpoints": [
   106            {
   107              "address": "127.0.0.1:10907",
   108              "az": "A"
   109            },
   110            {
   111              "address": "127.0.0.1:11907",
   112              "az": "B"
   113            },
   114            {
   115              "address": "127.0.0.1:12907",
   116              "az": "C"
   117            },
   118            {
   119              "address": "127.0.0.1:13907",
   120              "az": "A"
   121            },
   122            {
   123              "address": "127.0.0.1:14907",
   124              "az": "B"
   125            },
   126            {
   127              "address": "127.0.0.1:15907",
   128              "az": "C"
   129            }
   130          ]
   131      }
   132  ]
   133  ```
   134  
   135  This is only supported for the Ketama algorithm.
   136  
   137  **NOTE:** This feature is made available from v0.32 onwards. Receive can still operate with `endpoints` set to an array of IP strings in ketama mode. But to use AZ-aware hashring, you would need to migrate your existing hashring (and surrounding automation) to the new JSON structure mentioned above.
   138  
   139  ## Limits & gates (experimental)
   140  
   141  Thanos Receive has some limits and gates that can be configured to control resource usage. Here's the difference between limits and gates:
   142  
   143  - **Limits**: if a request hits any configured limit the client will receive an error response from the server.
   144  - **Gates**: if a request hits a gate without capacity it will wait until the gate's capacity is replenished to be processed. It doesn't trigger an error response from the server.
   145  
   146  To configure the gates and limits you can use one of the two options:
   147  
   148  - `--receive.limits-config-file=<file-path>`: where `<file-path>` is the path to the YAML file. Any modification to the indicated file will trigger a configuration reload. If the updated configuration is invalid an error will be logged and it won't replace the previous valid configuration.
   149  - `--receive.limits-config=<content>`: where `<content>` is the content of the YAML file.
   150  
   151  By default all the limits and gates are **disabled**.
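
For example, a reloadable limits file can be wired in at startup like this (a sketch; the file path is illustrative):

```bash
# Receive watches limits.yaml and reloads it on change; invalid updates are rejected and logged.
thanos receive --receive.limits-config-file ./limits.yaml
```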
   152  
   153  ### Understanding the configuration file
   154  
   155  The configuration file follows a few standards:
   156  
   157  1. The value `0` (zero) is used to explicitly define "there is no limit" (infinite limit).
   158  2. In the configuration of default limits (in the `default` section) or global limits (in the `global` section), a value that is not present means "no limit".
   159  3. In the configuration of per tenant limits (in the `tenants` section), a value that is not present means they are the same as the default.
   160  
   161  All the configuration for the remote write endpoint of Receive is contained in the `write` key. Inside it there are 3 subsections:
   162  
   163  - `global`: limits, gates and/or options that apply across all requests.
   164  - `default`: the default values for limits in case a given tenant doesn't have any specified.
   165  - `tenants`: the limits for a given tenant.
   166  
   167  For a Receive instance with a configuration like the one below, it's understood that:
   168  
   169  1. The Receive instance has a max concurrency of 30.
   170  2. The Receive instance has head series limiting enabled as it has `meta_monitoring_.*` options in `global`.
   171  3. The Receive instance has some default request limits as well as head series limits that apply to all tenants, **unless** a given tenant has their own limits (i.e. the `acme` tenant, and partially the `ajax` tenant).
   172  4. Tenant `acme` has no request limits, but has a higher head_series limit.
   173  5. Tenant `ajax` has a request series limit of 50000 and a samples limit of 500. Their request size bytes limit is inherited from the default (1024 bytes), and their head series limit is also inherited from the default (1000).
   174  
   175  The next sections explain what each configuration value means.
   176  
   177  ```yaml mdox-exec="cat pkg/receive/testdata/limits_config/good_limits.yaml"
   178  write:
   179    global:
   180      max_concurrency: 30
   181      meta_monitoring_url: "http://localhost:9090"
   182      meta_monitoring_limit_query: "sum(prometheus_tsdb_head_series) by (tenant)"
   183    default:
   184      request:
   185        size_bytes_limit: 1024
   186        series_limit: 1000
   187        samples_limit: 10
   188      head_series_limit: 1000
   189    tenants:
   190      acme:
   191        request:
   192          size_bytes_limit: 0
   193          series_limit: 0
   194          samples_limit: 0
   195        head_series_limit: 2000
   196      ajax:
   197        request:
   198          series_limit: 50000
   199          samples_limit: 500
   200  ```
   201  
   202  **IMPORTANT**: this feature is experimental and a work-in-progress. It might change in the near future, i.e. configuration might move to a file (to allow easy configuration of different request limits per tenant) or its structure could change.
   203  
   204  ### Remote write request limits
   205  
   206  Thanos Receive supports setting limits on the incoming remote write request sizes. These limits help prevent a single tenant from sending large requests that could crash the Receive.
   207  
   208  These limits are applied per request and can be configured within the `request` key:
   209  
   210  - `size_bytes_limit`: the maximum body size.
   211  - `series_limit`: the maximum amount of series in a single remote write request.
   212  - `samples_limit`: the maximum amount of samples in a single remote write request (summed from all series).
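
As a sketch (the tenant name and the values are illustrative), a limits file enforcing only request limits could look like this:

```yaml
write:
  default:
    request:
      size_bytes_limit: 1048576  # reject request bodies larger than 1 MiB
      series_limit: 1000
      samples_limit: 10000
  tenants:
    team-a:
      request:
        samples_limit: 0         # this tenant has no samples limit
```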
   213  
   214  Any request above these limits will cause a 413 HTTP response (*Entity Too Large*) and should not be retried without modifications.
   215  
   216  Currently a 413 HTTP response will cause data loss at the client, as no client (Prometheus included) will break down a 413-rejected request into smaller ones. The recommendation is to monitor these errors in the client and contact the owners of your Receive instance for more information on its configured limits.
   217  
   218  Future work that can improve this scenario:
   219  
   220  - Proper handling of 413 responses in clients, given Receive can somehow communicate which limit was reached.
   221  - Including in the 413 response the current limits that apply to the tenant.
   222  
   223  By default, all these limits are disabled.
   224  
   225  ### Remote write request gates
   226  
   227  The available request gates in Thanos Receive can be configured within the `global` key:
   228  - `max_concurrency`: the maximum amount of remote write requests that will be concurrently worked on. Any request that would exceed this limit will be accepted, but will wait until the gate allows it to be processed.
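
A minimal gate-only configuration might look like this (the value is illustrative):

```yaml
write:
  global:
    # Work on at most 30 remote write requests at once; excess requests wait instead of failing.
    max_concurrency: 30
```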
   229  
   230  ## Active Series Limiting (experimental)
   231  
   232  Thanos Receive, in Router or RouterIngestor mode, supports limiting tenant active (head) series to maintain the system's stability. It uses any Prometheus Query API compatible meta-monitoring solution that consumes the metrics exposed by all receivers in the Thanos system. Such a query endpoint makes it possible to get the number of active series per tenant (at most a scrape interval old), which is then compared with a configured limit before ingesting any tenant's remote write request. In case a tenant has gone above the limit, their remote write requests fail fully.
   233  
   234  Every Receive Router/RouterIngestor node queries meta-monitoring for the active series of all tenants every 15 seconds and caches the results in a map. This cached result is used to limit all incoming remote write requests.
   235  
   236  To use the feature, one should specify the following limiting config options:
   237  
   238  Under `global`:
   239  - `meta_monitoring_url`: Specifies Prometheus Query API compatible meta-monitoring endpoint.
   240  - `meta_monitoring_limit_query`: Option to specify PromQL query to execute against meta-monitoring. If not specified it is set to `sum(prometheus_tsdb_head_series) by (tenant)` by default.
   241  - `meta_monitoring_http_client`: Optional YAML field specifying HTTP client config for meta-monitoring.
   242  
   243  Under `default` and per `tenant`:
   244  - `head_series_limit`: Specifies the total number of active (head) series for any tenant, across all replicas (including data replication), allowed by Thanos Receive. Set to 0 for unlimited.
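
Putting these together, a minimal active series limiting configuration could look like this (the meta-monitoring URL and the limit are illustrative):

```yaml
write:
  global:
    meta_monitoring_url: "http://meta-prometheus.example.com:9090"
    # Optional; defaults to sum(prometheus_tsdb_head_series) by (tenant).
    meta_monitoring_limit_query: "sum(prometheus_tsdb_head_series) by (tenant)"
  default:
    head_series_limit: 10000  # 0 would mean unlimited
```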
   245  
   246  NOTE:
   247  - It is possible that Receive ingests more active series than the specified limit, as it relies on meta-monitoring, which may not always have the latest data on a tenant's current number of active series.
   248  - Thanos Receive performs best-effort limiting. In case meta-monitoring is down/unreachable, Thanos Receive will not impose limits and will only log errors about meta-monitoring being unreachable. The same applies when one of the receivers cannot be scraped.
   249  - Support for different limit configuration for different tenants is planned for the future.
   250  
   251  ## Flags
   252  
   253  ```$ mdox-exec="thanos receive --help"
   254  usage: thanos receive [<flags>]
   255  
   256  Accept Prometheus remote write API requests and write to local tsdb.
   257  
   258  Flags:
   259        --grpc-address="0.0.0.0:10901"
   260                                   Listen ip:port address for gRPC endpoints
   261                                   (StoreAPI). Make sure this address is routable
   262                                   from other components.
   263        --grpc-grace-period=2m     Time to wait after an interrupt received for
   264                                   GRPC Server.
   265        --grpc-server-max-connection-age=60m
   266                                   The grpc server max connection age. This
   267                                   controls how often to re-establish connections
   268                                   and redo TLS handshakes.
   269        --grpc-server-tls-cert=""  TLS Certificate for gRPC server, leave blank to
   270                                   disable TLS
   271        --grpc-server-tls-client-ca=""
   272                                   TLS CA to verify clients against. If no
   273                                   client CA is specified, there is no client
   274                                   verification on server side. (tls.NoClientCert)
   275        --grpc-server-tls-key=""   TLS Key for the gRPC server, leave blank to
   276                                   disable TLS
   277        --hash-func=               Specify which hash function to use when
   278                                   calculating the hashes of produced files.
   279                                   If no function has been specified, it does not
   280                                   happen. This permits avoiding downloading some
   281                                   files twice albeit at some performance cost.
   282                                   Possible values are: "", "SHA256".
   283    -h, --help                     Show context-sensitive help (also try
   284                                   --help-long and --help-man).
   285        --http-address="0.0.0.0:10902"
   286                                   Listen host:port for HTTP endpoints.
   287        --http-grace-period=2m     Time to wait after an interrupt received for
   288                                   HTTP Server.
   289        --http.config=""           [EXPERIMENTAL] Path to the configuration file
   290                                   that can enable TLS or authentication for all
   291                                   HTTP endpoints.
   292        --label=key="value" ...    External labels to announce. This flag will be
   293                                   removed in the future when handling multiple
   294                                   tsdb instances is added.
   295        --log.format=logfmt        Log format to use. Possible options: logfmt or
   296                                   json.
   297        --log.level=info           Log filtering level.
   298        --objstore.config=<content>
   299                                   Alternative to 'objstore.config-file'
   300                                   flag (mutually exclusive). Content of
   301                                   YAML file that contains object store
   302                                   configuration. See format details:
   303                                   https://thanos.io/tip/thanos/storage.md/#configuration
   304        --objstore.config-file=<file-path>
   305                                   Path to YAML file that contains object
   306                                   store configuration. See format details:
   307                                   https://thanos.io/tip/thanos/storage.md/#configuration
   308        --receive.default-tenant-id="default-tenant"
   309                                   Default tenant ID to use when none is provided
   310                                   via a header.
   311        --receive.grpc-compression=snappy
   312                                   Compression algorithm to use for gRPC requests
   313                                   to other receivers. Must be one of: snappy,
   314                                   none
   315        --receive.hashrings=<content>
   316                                   Alternative to 'receive.hashrings-file' flag
   317                                   (lower priority). Content of file that contains
   318                                   the hashring configuration.
   319        --receive.hashrings-algorithm=hashmod
   320                                   The algorithm used when distributing series in
   321                                   the hashrings. Must be one of hashmod, ketama.
   322                                   Will be overwritten by the tenant-specific
   323                                   algorithm in the hashring config.
   324        --receive.hashrings-file=<path>
   325                                   Path to file that contains the hashring
   326                                   configuration. A watcher is initialized
   327                                   to watch changes and update the hashring
   328                                   dynamically.
   329        --receive.hashrings-file-refresh-interval=5m
   330                                   Refresh interval to re-read the hashring
   331                                   configuration file. (used as a fallback)
   332        --receive.local-endpoint=RECEIVE.LOCAL-ENDPOINT
   333                                   Endpoint of local receive node. Used to
   334                                   identify the local node in the hashring
   335                                   configuration. If it's empty AND hashring
   336                                   configuration was provided, it means that
   337                                   receive will run in RoutingOnly mode.
   338        --receive.relabel-config=<content>
   339                                   Alternative to 'receive.relabel-config-file'
   340                                   flag (mutually exclusive). Content of YAML file
   341                                   that contains relabeling configuration.
   342        --receive.relabel-config-file=<file-path>
   343                                   Path to YAML file that contains relabeling
   344                                   configuration.
   345        --receive.replica-header="THANOS-REPLICA"
   346                                   HTTP header specifying the replica number of a
   347                                   write request.
   348        --receive.replication-factor=1
   349                                   How many times to replicate incoming write
   350                                   requests.
   351        --receive.tenant-certificate-field=
   352                                   Use TLS client's certificate field to
   353                                   determine tenant for write requests.
   354                                   Must be one of organization, organizationalUnit
   355                                   or commonName. This setting will cause the
   356                                   receive.tenant-header flag value to be ignored.
   357        --receive.tenant-header="THANOS-TENANT"
   358                                   HTTP header to determine tenant for write
   359                                   requests.
   360        --receive.tenant-label-name="tenant_id"
   361                                   Label name through which the tenant will be
   362                                   announced.
   363        --remote-write.address="0.0.0.0:19291"
   364                                   Address to listen on for remote write requests.
   365        --remote-write.client-server-name=""
   366                                   Server name to verify the hostname
   367                                   on the returned TLS certificates. See
   368                                   https://tools.ietf.org/html/rfc4366#section-3.1
   369        --remote-write.client-tls-ca=""
   370                                   TLS CA Certificates to use to verify servers.
   371        --remote-write.client-tls-cert=""
   372                                   TLS Certificates to use to identify this client
   373                                   to the server.
   374        --remote-write.client-tls-key=""
   375                                   TLS Key for the client's certificate.
   376        --remote-write.server-tls-cert=""
   377                                   TLS Certificate for HTTP server, leave blank to
   378                                   disable TLS.
   379        --remote-write.server-tls-client-ca=""
   380                                   TLS CA to verify clients against. If no
   381                                   client CA is specified, there is no client
   382                                   verification on server side. (tls.NoClientCert)
   383        --remote-write.server-tls-key=""
   384                                   TLS Key for the HTTP server, leave blank to
   385                                   disable TLS.
   386        --request.logging-config=<content>
   387                                   Alternative to 'request.logging-config-file'
   388                                   flag (mutually exclusive). Content
   389                                   of YAML file with request logging
   390                                   configuration. See format details:
   391                                   https://thanos.io/tip/thanos/logging.md/#configuration
   392        --request.logging-config-file=<file-path>
   393                                   Path to YAML file with request logging
   394                                   configuration. See format details:
   395                                   https://thanos.io/tip/thanos/logging.md/#configuration
   396        --store.limits.request-samples=0
   397                                   The maximum samples allowed for a single
   398                                   Series request. The Series call fails if
   399                                   this limit is exceeded. 0 means no limit.
   400                                   NOTE: For efficiency the limit is internally
   401                                   implemented as 'chunks limit' considering each
   402                                   chunk contains a maximum of 120 samples.
   403        --store.limits.request-series=0
   404                                   The maximum series allowed for a single Series
   405                                   request. The Series call fails if this limit is
   406                                   exceeded. 0 means no limit.
   407        --tracing.config=<content>
   408                                   Alternative to 'tracing.config-file' flag
   409                                   (mutually exclusive). Content of YAML file
   410                                   with tracing configuration. See format details:
   411                                   https://thanos.io/tip/thanos/tracing.md/#configuration
   412        --tracing.config-file=<file-path>
   413                                   Path to YAML file with tracing
   414                                   configuration. See format details:
   415                                   https://thanos.io/tip/thanos/tracing.md/#configuration
   416        --tsdb.allow-overlapping-blocks
   417                                   Allow overlapping blocks, which in turn enables
   418                                   vertical compaction and vertical query merge.
   419                                   Does not do anything, enabled all the time.
   420        --tsdb.max-exemplars=0     Enables support for ingesting exemplars and
   421                                   sets the maximum number of exemplars that will
   422                                   be stored per tenant. In case the exemplar
   423                                   storage becomes full (number of stored
   424                                   exemplars becomes equal to max-exemplars),
   425                                   ingesting a new exemplar will evict the oldest
   426                                   exemplar from storage. 0 (or less) value of
   427                                   this flag disables exemplars storage.
   428        --tsdb.no-lockfile         Do not create lockfile in TSDB data directory.
   429                                   In any case, the lockfiles will be deleted on
   430                                   next startup.
   431        --tsdb.path="./data"       Data directory of TSDB.
   432        --tsdb.retention=15d       How long to retain raw samples on local
   433                                   storage. 0d - disables the retention
   434                                   policy (i.e. infinite retention).
   435                                   For more details on how retention is
   436                                   enforced for individual tenants, please
   437                                   refer to the Tenant lifecycle management
   438                                   section in the Receive documentation:
   439                                   https://thanos.io/tip/components/receive.md/#tenant-lifecycle-management
   440        --tsdb.too-far-in-future.time-window=0s
   441                                   [EXPERIMENTAL] Configures the allowed time
   442                                   window for ingesting samples too far in the
   443                                   future. Disabled (0s) by default. Please note
   444                                   that enabling this flag will reject samples
   445                                   further in the future than the receive node's
   446                                   local NTP time plus the configured duration,
   447                                   allowing for clock skew in remote write clients.
   448        --tsdb.wal-compression     Compress the tsdb WAL.
   449        --version                  Show application version.
   450  
   451  ```