github.com/thanos-io/thanos@v0.32.5/docs/proposals-done/202005-query-logging.md (about)

     1  ---
     2  type: proposal
     3  title: Query Logging for Thanos
     4  status: approved
     5  owner: yashrsharma44
     6  menu: proposals-done
     7  ---
     8  
     9  ### Related Tickets
    10  
    11  * Add Query Logging: https://github.com/thanos-io/thanos/issues/1706
    12  
    13  ## Summary
    14  
    15  This proposal describes configuring a new internal feature: *Query logging* for various types of queries in **Thanos**.
    16  
    17  We will go through the problem statement, use-cases and potential solution for the same.
    18  
    19  ## Problem Statement / Motivation
    20  
    21  Conceptually and at a very high level, **Thanos** is a durable and cheap database, storing a very large amount of metric data in external object storage. This has the possible implication of having expensive data queries if the amount of data required to be queried is large. Even after local optimisations and efficient indexing, still querying a large amount of data is very resource-intensive, and unless the user knows what he/she is doing, it can be very expensive.
    22  
    23  Different types of Query Logging are useful as a feature in Thanos, which would help in improving the developer experience.
    24  
    25  One of such feature is **Active Query Logging** which helps in tracking all the active queries made in a database, and helps the users in knowing/debugging what queries made were resource-intensive, and how they can use the details to further optimise the database, or from the user’s perspective, to draw out the line from making such a request.
    26  
    27  A similar feature exists in **Prometheus**, here is a description of an active Query Log [[1]](https://www.robustperception.io/what-queries-were-running-when-prometheus-died) -
    28  
    29  ```powershell
    30  level=info ts=2019-08-28T14:30:09.142Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.12.0, branch=HEAD, revision=43acd0e2e93f9f70c49b2267efa0124f1e759e86)"
    31  level=info ts=2019-08-28T14:30:09.142Z caller=main.go:330 build_context="(go=go1.12.8, user=root@7a9dbdbe0cc7, date=20190818-13:53:16)"
    32  level=info ts=2019-08-28T14:30:09.142Z caller=main.go:331 host_details="(Linux 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 mari (none))"
    33  level=info ts=2019-08-28T14:30:09.142Z caller=main.go:332 fd_limits="(soft=1000000, hard=1000000)"
    34  level=info ts=2019-08-28T14:30:09.142Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
    35  level=info ts=2019-08-28T14:30:09.143Z caller=query_logger.go:74 component=activeQueryTracker msg="These queries didn't finish in prometheus' last run:" queries="[{\"query\":\"changes(changes(prometheus_http_request_duration_seconds_bucket[1h:1s])[1h:1s])\",\"timestamp_sec\":1567002604}]"
    36  level=info ts=2019-08-28T14:30:09.144Z caller=main.go:654 msg="Starting TSDB ..."level=info
    37  ```
    38  
    39  Here, we can see that this log was taken before the Prometheus instance died unnaturally, and the query logged under the component **ActiveQueryTracker**, logs the respective query that did the damage. Even though it does not pin-points the exact issue that created the problem, still it is quite helpful in considering this issue as one of the potential queries that led to the unnatural death of the Prometheus instance.
    40  
    41  Another requirement is using **Audit and Adaptive Logging** in Thanos. This is a useful feature, as the former helps in aggregating all the queries made until now, and that can be utilised to track down and provide a comprehensive overview of what is happening. Audit logs can help in capturing the flow of requests made, thus we can track line to line flow of queries which can be helpful in describing the observed behaviour of a request. More info - https://rollout.io/blog/audit-logs/
    42  
    43  ```log
    44      {
    45        protoPayload: {
    46          @type: "type.googleapis.com/google.cloud.audit.AuditLog",
    47          status: {},
    48          authenticationInfo: {
    49            principalEmail: "user@example.com"
    50          },
    51          serviceName: "appengine.googleapis.com",
    52          methodName: "SetIamPolicy",
    53          authorizationInfo: [...],
    54          serviceData: {
    55            @type: "type.googleapis.com/google.appengine.legacy.AuditData",
    56            policyDelta: { bindingDeltas: [
    57                action: "ADD",
    58                role: "roles/logging.privateLogViewer",
    59                member: "user:user@example.com"
    60            ], }
    61          },
    62          request: {
    63            resource: "my-gcp-project-id",
    64            policy: { bindings: [...], }
    65          },
    66          response: {
    67            bindings: [
    68              {
    69                role: "roles/logging.privateLogViewer",
    70                members: [ "user:user@example.com" ]
    71              }
    72            ],
    73          }
    74        },
    75        insertId: "53179D9A9B559.AD6ACC7.B40604EF",
    76        resource: {
    77          type: "gcp_app",
    78          labels: { project_id: "my-gcp-project-id" }
    79        },
    80        timestamp: "2019-05-27T16:24:56.135Z",
    81        severity: "NOTICE",
    82        logName: "projects/my-gcp-project-id/logs/cloudaudit.googleapis.com%2Factivity",
    83      }
    84  Taken from - https://cloud.google.com/logging/docs/audit/understanding-audit-logs
    85  ```
    86  
    87  *Example of an audit logging*
    88  
    89  **Adaptive Logging** is quite useful, as we can log all queries based upon fulfilling a certain criteria/filters, and later those can be used for inspecting abnormal queries and improving the user experience overall.
    90  
    91  ## Proposal
    92  
    93  ### Query Logging in Thanos
    94  
    95  The proposal for Query Logging has been divided into two major parts - Audit Logging and Adaptive Logging. Audit and Adaptive Logging would be implemented across the Store API level, and would also cover the HTTP APIs of querier and other components that use HTTP APIs.
    96  
    97  ![](../img/thanos_log_limit.png)
    98  
    99  Each of the use case of the sub-part would be described and a possible implementation would be discussed -
   100  
   101  #### Using Middlewares
   102  
   103  Since Middlewares are a good chunk of the implementation, it makes sense to discuss briefly about -
   104  
   105  * How we are going to use the middlewares in our logging.
   106  * How can users can configure it?
   107  * Discuss the policy of logging by the middlewares.
   108  
   109  **Usage of Middlewares** : As it has been suggested by the diagram above, we plan to put in interceptors/middlewares to intercept the connection for a server implementation.
   110  
   111  The middlewares helps in intercepting the queries flowing before a client, and all the queries leaving the server.
   112  
   113  The middlewares provide a nice abstraction to log the queries because we have the following objective from the logging -
   114  
   115  * Log all the queries.
   116  * Log some specific queries.
   117  * Logging at a certain level(Debug / Info).
   118  * Have a same request-id for a given request.
   119  
   120  All of the goals are easily possible using grpc middlewares, as the grpc-ecosystem has a nice set of middlewares that makes the above possible, so we can just plug and play our middlewares
   121  
   122  For HTTP, all these nice middlewares are not present, so we need to code up our implementation performing the same logic.
   123  
   124  **User specific configurations** : Currently we are going to provide the following user facing configurations for using the request logger -
   125  * Use pre-defined adaptive logging policy.
   126  * The Adaptive Logging policy plans to include logging of only requests that return an error, or crosses a specific duration of time.
   127  * The level for all requests is `INFO`. Error, however would be logged at the `ERROR` level.
   128  * We would add in a flag to turn on/off request logging.
   129  * Request Logging would be turned off by default, and default logging would be in stdout. For logging all the requests, we would accept a flag for the enabling the same.
   130  * *Currently the policy for request logging is hard-coded, may be extended as a user based policy later.*
   131  
   132  **Policy of Request Logging** : We are going to use the following middleware decider for achieving the above objectives -
   133  
   134  * [`logging.WithDecider`](https://github.com/grpc-ecosystem/go-grpc-middleware/blob/v2.0.0-rc.5/interceptors/logging/options.go#L51) - This `option` would help us in deciding the logic whether a given request should be logged or not. This should help in logging specific / all queries. It is all pre request interception.
   135  * [`logging.WithLevels`](https://github.com/grpc-ecosystem/go-grpc-middleware/blob/v2.0.0-rc.5/interceptors/logging/options.go#L58) - This `option` would help us in fixating the level of query that we might want to log. So based on a query hitting a certain criteria, we can switch the levels of it.
   136  * For using the request-id, we can store the request-id in the metadata of the `context`, and while logging, we can use it.
   137  * Same equivalent for the HTTP can be implemented for mirroring the logic of *gRPC middlewares*.
   138  * So we would generally have request logging and options:
   139    1. No request logging.
   140    2. Log Finish call of the request.
   141    3. Log start and finish call - Audit Log.
   142  
   143  #### Request Logging
   144  
   145  **Query Logging** in Thanos would be implemented as a one logger, **Request Logger**. This is done in considering that Adaptive is a specific case of Audit Logging, and would make sense to keep the similar logical components together, rather than creating another distinction between them. We might however discuss the specific sub-cases of the request logging, namely adaptive and audit logging for enumerating the purpose of it.
   146  
   147  #### Use case 1: Audit Logging
   148  
   149  **Audit logging** is a kind of logging, where we log in every internal API requests. Typically, rather than logging in specific queries, we ought to log in every query that has been made until now.
   150  
   151  From a developer’s perspective, audit logs can keep you sane by giving you some insight into how a complex system arrived at its current state. Developers are typically fond of state diagrams and deterministic outcomes. An audit log that shows that a person or process applied a change can give developers valuable insight.
   152  
   153  #### Implementation:
   154  
   155  * Currently, we are looking to implement Audit Logging at the **StoreAPI** level.
   156  * We will use a **middleware** to intercept all the queries made to the StoreAPI.
   157  * As StoreAPI is interconnected with each of the other components, we can leverage this to provide a Global overview of different queries which would be useful for auditing purposes.
   158  * We would use a custom decider for logging all the requests.
   159  
   160  #### Use case 2: Adaptive Logging
   161  
   162  **Adaptive logging** is totally different from the Audit Logging. Out here in adaptive logging, we would log only certain queries which *satisfies a certain policy/satisfies a certain condition*. Possible filters could be, queries making an invalid request, latency time crossing a certain barrier, and so on.
   163  
   164  #### Implementation:
   165  
   166  * Along with audit logging, we are looking to implement the Adaptive Logging at the **StoreAPI** level.
   167  * We can leverage **middlewares** to intercept and filter the queries for logging, which satisfies the given policy.
   168  * We can achieve **Global overview** of the logs, due to the StoreAPI being interconnected with the other components.
   169  * We would use a custom decider to log certain methods, based on the logging policy defined.
   170  
   171  #### Use case 3:
   172  
   173  #### Common Use case to all
   174  
   175  * **Tracking queries across different components** The current implementation provides addition for tracking requests across different components of **Thanos**. Since we are logging the queries at the **StoreAPI** level, and those queries are flowing from one component to another, the only way to track them would be to have a request-id, which would help in tracking the logs of an individual query between different component. Another use-case would be to have a correlation of logs based on queries, so the request-id also serves the same purpose.
   176  
   177  * **Implementation**
   178  
   179    * Since **Thanos** has a solid tracing infrastructure in place, we can **re-use the tracing-id** for a given request, and thus we can track the request across the different components.
   180  * In case tracing is disabled, we can use our implementation of generating a request-id on the fly for a given request.
   181  
   182  ### Alternatives
   183  
   184  #### Use Active Queries in StoreAPI
   185  
   186  This is not possible due to the **distributed nature** of Thanos. It has a lot of moving parts, and with the current implementation of *ActiveQueryLogging*, it is best used as a local logger for each component.
   187  
   188  #### Adding logfiles for storing the logs of request logging separately
   189  
   190  The implementation of request logger makes it logical to log the request logs into a separate file. However, we should not embed the log writing logic inside Thanos, described here - https://12factor.net/logs
   191  
   192  #### Don't add anything
   193  
   194  This proposal adds in a new feature for Thanos, so there isn't any alternative until now that would solve our problem.
   195  
   196  ### Non Goals
   197  
   198  * Query logging part would help in debugging the internal requests made.
   199  * This provides ample of insights about possible bottle-neck of individual components of Thanos.
   200  * Active Query Logging -
   201  
   202  *Currently, the implementation of Active Query Logging seems flaky, as it is quite difficult to achieve a consistent logging of active queries after a forced shutdown. Considering the probability of logs correctly getting logged, the cons of the logging outweighs the pros, and hence would be sensible to discuss the implementation after the initial goals are achieved.*
   203  
   204  As the name suggests, this logger logs all the current active logs that are running in a component. This type of logging is aimed at providing logs of the active queries that are running currently in an component. This logger would be **local** **to each component**, and would run as a **standalone for each component** in contrast to the other two loggers. This logger would help in debugging queries which led to the component instanced *OOM* killed, or helping in tracking queries which are taking too long. The purpose of this logger can clash with the above two, but this logger is solely focused on providing active queries that are running, much like Prometheus does.
   205  
   206  #### Implementation of Active Query Tracking
   207  
   208  ```go
   209  
   210  // This would be local to each component, so each component would log all the active queries
   211  type ActiveQueryLogging struct {
   212  
   213  	// mmapped file to store the queries.
   214  	mmappedFile []byte
   215  	// channel to generate the next available index, much like Python’s generator indexing
   216  	getNextIndex chan int
   217  	// logger, different from the usual one
   218  	logger log.Logger
   219  }
   220  ```
   221  
   222  This interface is designed with some ideas from a similar interface designed for *Prometheus Query Logger*[[2]](https://prometheus.io/docs/guides/query-log/). Here is a rough algorithm that would implement the Query Logging in Thanos -
   223  
   224  ```txt
   225  1) Thanos receives a query.
   226  2) It calls the index_generator, which is a Python-style generator that will generate natural numbers from 0(or 1) indicating the byte index at which to put the information about the query in the log file. This would ensure that the key remains unique for each of the queries logged.
   227  3) The log file would be a memory-mapped file, which would help in accessing a random position for logging the query in constant time.
   228  ```
   229  
   230  This algorithm has been heavily derived from this pull request[[3]](https://github.com/prometheus/prometheus/pull/5794) and would suggest referring this for the implementation of the same.
   231  
   232  ## Work Plan
   233  
   234  1. Complete the proposal.
   235  2. Roll out grpc-middlewareV2.
   236  3. Implement a grpc-interceptor for Store API currently(this might be extended for other APIs as well).
   237  4. Write up the grpc logging middleware along with the policies.
   238  5. Write up an equivalent http middleware for the logging of requests in http APIs.
   239  6. Finish up with the tests if required.