github.com/thanos-io/thanos@v0.32.5/docs/proposals-done/202001-thanos-query-health-handling.md (about)

     1  ---
     2  type: proposal
     3  title: Thanos Query store nodes healthiness handling
     4  status: complete
     5  owner: GiedriusS
     6  menu: proposals-done
     7  ---
     8  
     9  ### Related Tickets
    10  
    11  * https://github.com/thanos-io/thanos/issues/1651 (Response caching for Thanos)
    12  * https://github.com/thanos-io/thanos/pull/2072 (query: add 'sticky' nodes)
    13  * https://github.com/cortexproject/cortex/pull/1974 (frontend: implement cache control)
    14  * https://github.com/thanos-io/thanos/pull/1984 (api/v1: add cache-control header)
    15  
    16  ## Summary
    17  
    18  This proposal document describes how currently the healthiness of store nodes is handled and why it should be changed (e.g. to improve response caching and avoid surprises).
    19  
    20  It explores a few options that are available - weighs their pros and cons, and finally proposes one final variant which we believe is the best out of the possible ones.
    21  
    22  ## Motivation
    23  
    24  Currently Thanos Query updates the list of healthy store nodes every 5 seconds. It does that by sending the `Info()` call via gRPC. The last successful check is noted in the `LastCheck` field. At this point if it fails then we note the error and remove it from the set of active store nodes. If that succeeds then it becomes a part of the active store set.
    25  
    26  After `--store.unhealthy-timeout` passes since `LastCheck` then it also gets removed from the UI. At this point we would forget about it completely.
    27  
    28  Every time a query is executed, we consult the active store set and send the query to all of the nodes which match according to the external labels and the min/max times that they advertise. However, if a node goes down according to our previous definition then in 5 seconds we will not send anything to it anymore. This means that we won't even be able to control this via the partial response options since no query is sent to those nodes in the first place.
    29  
    30  This is problematic in the cache of end-response caching. If we know that certain StoreAPI nodes should always be up then we always want to have not just an error (if partial response is disabled) but also the appropriate `Cache-Control` header in the response. But right now we would only get it for maximum 5 seconds after a store node would go down.
    31  
    32  Thus, this logic needs to be changed somehow. There are a few possible options:
    33  
    34  1. `--store.unhealthy-timeout` could be made to apply to this case as well - we could still consider it a part of the active store set while it is still visible in the UI.
    35  2. Another option could be introduced such as `--store.hold-timeout` which would be `--store.unhealthy-timeout`'s brother and we would hold the StoreAPI nodes for `max(hold_timeout, unhealthy_timeout)`.
    36  3. Another option such as `--store.strict-mode` could be introduced which means that we would always retain the last information of the StoreAPI nodes of the last successful check.
    37  4. The StoreAPI node specification format that is used in `--store` could be extended to include another flag which would let specify the previous option per-specific node.
    38  5. Instead of extending the specification format, we could move the same inforamtion to the command line options themselves. This would increase the explicitness of this new mode i.e. that it only applies to statically defined nodes.
    39  
    40  Lets look through their pros and cons:
    41  
    42  * In general it would be nice to avoid adding new options so the first option seems the most attractive in this regard but it is also the most invasive one.
    43  * Second option increases the configuration complexity the most since you would have to think about the other option `--store.unhealthy-timeout` as well while setting it.
    44  * The last option is the most complex from code's perspective but it is also least invasive; however it has some downsides like the syntax for specifying store nodes becomes really ugly and hard to understand because we would have not only the DNS query type in there but also a special marker to enable that mode.
    45  
    46  If we were to graph these choices in terms of their incisiveness and complexity it would look something like this:
    47  
    48  ```text
    49  Most incisive / Least Complex ------------ Least incisive / Most Complex
    50  #1           #2                                      #4
    51             #3    #5
    52  ```
    53  
    54  After careful consideration and with the rationale in this proposal, we have decided to go with the fifth option. It should provide a sweet spot between being too invasive and providing our users the ability to fall-back to the old behavior.
    55  
    56  ## Goals
    57  
    58  * Update the health check logic to properly handle the case when a node disappears in terms of a caching layer.
    59  
    60  ## No Goals
    61  
    62  * Fixing the cache handling in the cases where a **new** StoreAPI gets added to the active store set.
    63  
    64  This deserves a separate discussion and/or proposal. The issue when adding a completely new store node via service discovery is that a new node may suddenly provide new information in the past. In this paragraph when we are saying "new" it means new in terms of the data that it provides. Generally over time only a limited number of Prometheus instances will be providing data that can only change in the future (relatively to the current time).
    65  
    66  When this happens, we will need to most likely somehow signal the caching layer that it needs to drop (some of the) results cache that it has depending on:
    67  
    68  * Time ranges of a new node.
    69  * Its external labels.
    70  * The data that the node itself has by using some kind of hashing mechanism.
    71  
    72  The way this will need to be done should be as generic as possible so the design and solution of this is still an open question that this proposal does not solve.
    73  
    74  ## Verification
    75  
    76  * Unit tests which would fire up a dummy StoreAPI and check different scenarios.
    77  * Ad-hoc testing.
    78  
    79  ## Proposal
    80  
    81  * Add a new flag to Thanos Query `--store-strict` which will only accept statically specified nodes and Thanos Query will always retain the last successfully retrieved information of them via the `Info()` gRPC method. Thus, they will always be considered as part of the active store set.
    82  
    83  ## Risk
    84  
    85  * Users might have problems removing the store nodes from the active store set since they will be there forever with this option set. However, one might argue that if the nodes go down then something like DNS service discovery needs to be used which would dynamically add and remove those nodes.
    86  
    87  ## Work Plan
    88  
    89  * Implement the new flag `--store-strict` in Thanos Query which will only accept statically defined nodes that will be permanently kept around. It is optional to use so there will be no surprises when upgrading.
    90  * Implement tests with dummy store nodes.
    91  * Document the new behavior.
    92  
    93  ## Future Work
    94  
    95  * Handle the case when a new node appears in terms of the end-result cache i.e. when using SD:
    96    1. Need to somehow signal the upper layer to clear the end-result cache. Ideally we would only clear the relevant parts.
    97    2. Perhaps some kind of hashing can be used for this.