# context.Context and how it relates to logging and tracing in CockroachDB

Original author: Andrei Matei

## Introduction

You might have noticed that much of our codebase passes around
`context.Context` to functions all over the place. You can go a long
way without understanding exactly what the point is (until you can’t),
but that’s no way to live. You might have heard that we don’t like
`context.TODO()` or that “storing `Contexts` in structs is bad” and
you might be wondering what that’s about. This document explains
Context-related topics.

## What is a `context.Context`?

`context.Context` is an interface defined in the `context` package of
Go’s standard library.

The package documentation says: *Package context defines the Context
type, which carries deadlines, cancelation signals, and other
request-scoped values across API boundaries and between processes.*

So, the idea is that a `Context` represents an *operation*’s context -
deadline, cancellation, and values that we want to pass implicitly to
all the function calls executed on behalf of that operation (this last
use case may be achieved with thread-locals in other programming
languages). The `context` package does not attempt to define what an
operation is - that’s up to the programmer. Typical operations are
serving an RPC call, serving a SQL query, or periodic background
activities such as replica rebalancing. An important point is that
operations and `Contexts` are supposed to be hierarchical - an
operation can have sub-operations which may inherit the parent’s
deadline/cancellation/values or may define their own; that’s why the
`context` package makes it easy to create *derived* (or *forked*)
`Contexts` - child `Contexts` inheriting and augmenting the
parent’s properties.

This is the interface definition:

    // A Context carries a deadline, cancelation signal, and request-scoped values
    // across API boundaries. Its methods are safe for simultaneous use by multiple
    // goroutines.
    type Context interface {
        // Done returns a channel that is closed when this Context is canceled
        // or times out.
        Done() <-chan struct{}

        // Err indicates why this context was canceled, after the Done channel
        // is closed.
        Err() error

        // Deadline returns the time when this Context will be canceled, if any.
        Deadline() (deadline time.Time, ok bool)

        // Value returns the value associated with key or nil if none.
        Value(key interface{}) interface{}
    }
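
For illustration, here is a minimal, self-contained sketch (plain Go,
not CockroachDB code) of how child `Contexts` are derived from a root;
the `opKey` type is made up for the example:

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // opKey is a hypothetical key type for a request-scoped value.
    type opKey struct{}

    func main() {
        // The root of the hierarchy.
        ctx := context.Background()

        // Fork a child carrying a request-scoped value.
        ctx = context.WithValue(ctx, opKey{}, "sql txn")

        // Fork again, adding a deadline and cancellation.
        ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
        defer cancel()

        // The child inherits the parent's value and adds its own deadline.
        fmt.Println(ctx.Value(opKey{})) // sql txn
        <-ctx.Done()                    // closed when the deadline expires
        fmt.Println(ctx.Err())          // context deadline exceeded
    }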

## Motivations for contexts in CockroachDB

At Cockroach Labs, we use `Contexts` extensively, as a result of
successive pushes by different people to improve logging and tracing.

Logging is about keeping a global history of the activity on a node
or cluster; improvements there aim to make messages more
informative. Tracing is about keeping the cause-effect relationship
related to individual activities, e.g. individual client requests;
improvements there aim to increase the accuracy/usefulness of the
network of cause-effect relations tracked by individual traces.
**Go contexts enable CockroachDB to provide both services using a common interface.**

We like passing `Contexts` everywhere, thus
*informing each layer of the code about the context in which every
function is being executed*. The main benefits we get are that all our
log messages are prefixed with information about the component that’s
logging them, and that all the logging messages also become part of
the current operation’s trace.

### Contexts for logging

Let’s look at an example of a component’s information being part of
log messages. The following call:

    log.Infof(ctx, "request range lease (attempt #%d)", attempt)

results in the following message being output:

    I170312 15:02:30.732131 179 storage/replica.go:1046  [n1,s1,r1/1:/{Min-Table/0},@c420498a80] request range lease (attempt #1)

Notice how the prefix `[n1,s1,r1/1:/{Min-Table/0},@c420498a80]` is not
part of the logging command. It’s just prepended auto-magically
through the `ctx` argument that’s passed to the logging method - the
`ctx` had been populated with values describing the replica executing
the operation that led to the range lease acquisition (node 1, store
1, range 1 / replica 1, covering keys *Min→Table/0*). This annotation
happens because, at the point where an operation entered the replica
code module, we forked a child context containing all these
annotations and used that child context for all downstream
code. Everything logged in the replica module and downstream of it
will carry information about *which particular replica* the operation
is using. Isn’t that neat?

Let’s look at an example of log messages containing information specific to the current operation:

    I170312 15:17:12.602218 149 kv/dist_sender.go:1142  [client=127.0.0.1:52149,user=root,n1] sending batch 1 CPut, 1 BeginTxn, 1 EndTxn to range 22

Here we see information about the client (IP address and username)
performing the SQL statement that eventually led to the `DistSender`
performing a `Batch` RPC call. This is achieved by having each user
SQL session create a context with these annotations and using it (and
children of it) for everything being executed on behalf of that
session at any level in the code (at least executed locally on the
session’s gateway node; see the discussion below about distributed
tracing for how this relates to remote code execution).
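
To make this concrete, here is a hedged sketch (not verbatim
CockroachDB code) of annotating a session-level `Context` with
`log.WithLogTag`, which is discussed in the technical notes below; the
tag names and the `clientAddr`/`username` variables are illustrative:

    // Annotate the session's root context once; every log.Infof(ctx, ...)
    // made downstream with this ctx (or a child of it) picks up the tags.
    ctx := context.Background()
    ctx = log.WithLogTag(ctx, "client", clientAddr)
    ctx = log.WithLogTag(ctx, "user", username)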

### Contexts for tracing

Besides logging, *tracing* is also a big reason why we plumb contexts
everywhere. A *trace* is a structured view of (messages printed on
behalf of) an operation.

It contains **all the logging messages printed by code servicing that
operation**, plus timing information and information about the
sequential or parallel nature of the execution of sub-operations.


| **What gets emitted \ where it is visible**  | **Log files**              | **Trace visualisation** |
| -------------------------------------------- | -------------------------- | ----------------------- |
| **log.Info, log.Warn, log.Error, log.Fatal** | ✓                          | ✓                       |
| **Log tags**                                 | ✓                          | ✓                       |
| **log.VEvent(ctx, N, …)**                    | *only if `--vmodule` ≥ N*  | ✓                       |
| **Span boundaries**                          |                            | ✓                       |
| **Span hierarchy**                           |                            | ✓                       |


The logging messages are the same as the ones that make it to the
unstructured text log files that we discussed above - the messages
produced by calls to `log.{Info,Warning,Error,VEvent}f?`.

A trace consists of a tree of spans: each *span* represents a
sub-operation (except the root one, which represents the whole
operation) and can contain child spans. Spans contain log messages and
their timing information. When we “visualize a trace”, we visualize
the tree of spans that make up that trace, up to the root span, and
all their messages.

In the rest of this section we’ll talk about how to see tracing
information for CRDB operations and subsequently what the code around
tracing looks like.

Let’s look at a Lightstep trace of a SQL query (Lightstep is a
distributed trace collector and viewer; see the section on tracing
below). Here’s the root span (trimmed):

![contexts/sshot1.png](contexts/sshot1.png)

We can see log messages coming from the sql module and the kv module:
the executor starts executing a `SELECT`, acquires a table lease,
looks up a range lease, sends a batch with a `Scan` request, and (way
later (*+1160us*), after receiving the results of the `Scan`) does
some memory accounting.

The blue part of the timeline that’s not covered by the yellow line
corresponds to time spent in child spans; let’s look at these spans
below:


![contexts/sshot2.png](contexts/sshot2.png)

We can see two spans, one child of the other, related to performing an
RPC. The `grpcTransport.SendNext` one corresponds to the client side
of the call, and the `node.Batch` one corresponds to the server
side. Through the magic of distributed tracing, the server could be
the same process as the client or a remote node - the trace would look
just the same! This is incredibly powerful - we can see log messages
from all layers of the code, executed locally or even remotely, on
behalf of a high-level operation that we define (e.g. in this case,
executing a user’s `SELECT` query). We can see what code paths were
taken, what caches were hit, what nodes were involved, how long
different sub-operations took.

Besides Lightstep, there are other ways to see “tracing
information”. Spans and their events also show up in the
`debug/requests` endpoint; as opposed to Lightstep, which only receives
spans once they are finished, the `debug/requests` endpoint can be
used to look at spans that are “in progress”. The `debug/requests`
endpoint is implemented via `net/trace` using `netTraceIntegrator`
(this is called once whenever a span is created, and it returns an
object which receives span “events”).

Another way to see tracing information is through the `debug/events`
endpoint. This is also implemented via `net/trace` and it is used for
“top level” events that are not associated with a tracing span
(e.g. queue events, top-level SQL events). The tracing infrastructure
allows embedding a `trace.EventLog` in a `Context`; events go to that
`EventLog` automatically unless a span is also embedded in the
context.

There’s also `SET TRACING = on`, which uses “snowball tracing” for
collecting the distributed trace of one query and displaying it as SQL
result rows. Snowball tracing is a trace mode where trace information
is stored in memory, instead of being sent to a trace collector, so
that the application can inspect its own trace later; this is the
mechanism used by session tracing to present the execution trace back
to the SQL client.

There’s also the `COCKROACH_TRACE_SQL` env var, which causes the
distributed traces of all queries taking longer than a set time to be
collected in memory and then dumped into the logs. This is expensive
because it enables snowball tracing for all queries, and may flood the
logs. As of 03/2017, I (Andrei) am working on a new mechanism allowing
a user to enable/disable tracing on a SQL session by issuing special
SQL statements, and then making the collected traces available through
a virtual table.

All these tracing capabilities are enabled by the proliferation of
`Contexts`.

The mechanism for this is threefold (see the sketch below):

- We use `Spans` from package `opentracing`, which represent the notion
  of "current operation" in the application (e.g. a current client
  request, a method call, whatever), and bind them to a `Tracer` object
  which is able to receive events for that span (e.g. a trace
  collector).
- `Context` objects can carry a linked list of key/value pairs; we
  store `Span` objects in `Contexts` to represent the stacking/nesting
  of "current operations" in CockroachDB.
- Whenever a logging method is invoked, the logging module inspects
  the `Context` object, checks whether it contains a span, and if it
  does, extracts the innermost span from the `Context` and sends a
  copy of the log event to its tracer before printing it to the log
  file.
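
As a minimal sketch of this span-in-`Context` plumbing (the
`opentracing.ContextWithSpan`/`opentracing.SpanFromContext` helpers are
the real ones from the `opentracing-go` package; `otlog` aliases
`github.com/opentracing/opentracing-go/log`, and the surrounding names
are illustrative):

    // Start a span for the operation and embed it in the context.
    sp := tracer.StartSpan("my operation")
    ctx = opentracing.ContextWithSpan(ctx, sp)

    // Much later, deep in the call stack, a logging-like facility can
    // recover the span from the context and send it the event too.
    if sp := opentracing.SpanFromContext(ctx); sp != nil {
        sp.LogFields(otlog.String("event", "something happened"))
    }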


### Creating traces programmatically

When thinking about how our operations should be traced, the first
decision we need to make is *what constitutes an* ***operation*** *for
the purposes of tracing*. See the section “What operations do we track
in crdb? Where do we track traces and spans?” below for a discussion
about that.

Here’s what the code looks like. The Executor creates a root span:

    // Create a root span for this SQL txn.
    tracer := e.cfg.AmbientCtx.Tracer
    sp = tracer.StartSpan("sql txn")
    // Put the new span in the context.
    ctx = opentracing.ContextWithSpan(ctx, sp)
    // sp needs to be Finish()ed by the state machine when this transaction
    // is committed or rolled back.

The sub-operations create sub-spans and fork the `Context`:

    ctx, span := tracing.ChildSpan(ctx, "join reader")
    defer tracing.FinishSpan(span)

### Context cancellation

Contexts make distributed cancellation easy.
A context object has a broadcast “cancellation” channel which
client components can check to see if the work is still “useful”.
For example, a context for the worker of a network connection
will be canceled by `net.Conn` if the remote connection
is closed, and the worker can check that periodically and spontaneously
abort, instead of checking the error status of trying to send
data through the connection.

This is really cool for things like backup/restore or distSQL that
kick off a lot of work throughout the cluster.
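
A minimal sketch of a cancellation-aware worker loop (plain Go, not
CockroachDB code; `job` and `process` are made up for the example):

    func worker(ctx context.Context, jobs <-chan job) error {
        for {
            select {
            case <-ctx.Done():
                // The operation was canceled (or timed out) upstream;
                // abort instead of doing useless work.
                return ctx.Err()
            case j := <-jobs:
                process(j)
            }
        }
    }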

Of note:

- a cancellable sub-context of a cancellable context means there is a
  goroutine spawned under you for propagating the parent’s cancellation,
  so it can be expensive.
- it’s still a bit unclear how context cancellation crosses gRPC and
  libpq boundaries. People are working on this.


## Technical notes

### Integration between logging and `Contexts`

All the logging methods (`log.{Infof,VEventf,…}`) take a `Context` and
[extract the “log tags” from it](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/structured.go#L111)
and prepend them to the messages. The *log tags* are the prefixes we
saw (e.g. `n1,s1,r1/1:/{Min-Table/0},@c420498a80`).

To add an operation-specific tag, use the `log.WithLogTag()` method
(e.g. `ctx = log.WithLogTag(ctx, "range-lookup", nil /* value */)`).

To add a component-specific tag (e.g. the node id or replica id), we
prefer a different pattern. These tags are specific to an instance of
an object used to perform part of an operation - e.g. a `replica`
instance. The same `replica` instance is used across many operations,
but it always wants to provide the same annotations. And different
instances want to provide different annotations. We use an
`AmbientContext` - a container for a bunch of tags that an operation’s
`Context` will be annotated with.

For example, a `replica` initializes its `AmbientContext`
[like so](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/storage/replica.go#L576):

    // Add replica log tag - the value is rangeStr.String().
    r.AmbientContext.AddLogTag("r", &r.rangeStr)
    // Add replica pointer value.
    r.AmbientContext.AddLogTag("@", fmt.Sprintf("%x", unsafe.Pointer(r)))

And then all entry points into an instance use the `AmbientContext` to
annotate the operation’s context. For example, `replica.Send()`
[does](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/storage/replica.go#L1416):

    func (r *Replica) Send(
            ctx context.Context, ba roachpb.BatchRequest,
    ) (*roachpb.BatchResponse, *roachpb.Error) {
            // Add the range log tag.
            ctx = r.AnnotateCtx(ctx)
            ...

We usually embed the `AmbientContext` in the objects that contain it
and we say, for example, that a `replica` *annotates* all the
`Contexts` that go into it. For the history and evolution of the
`AmbientContext` interface, see Radu’s
[Enriching Log Messages Using Go Contexts](https://www.cockroachlabs.com/blog/enriching-log-messages-using-go-contexts/)
blog post.

### Integration between tracing and `Contexts`

For tracing, we use the OpenTracing interface - an attempt to
standardize tracing interfaces in different languages such that
different *tracers* can be plugged in. A *tracer*
(`opentracing.Tracer`) is an interface to objects that know how to
create spans; tracers also contain or interface with *collectors* -
components that collect traces and allow users to inspect them.

CockroachDB currently contains the LightStep tracer (a collector
developed by the people behind OpenTracing that sends traces to the
LightStep cloud service). Clients that want to use this need to have a
LightStep account (which is not free), and be willing to have their
traces sent to this third-party cloud service. The collector can be
enabled with the `COCKROACH_LIGHTSTEP_TOKEN=<project-specific token>`
env var.

Besides LightStep, CockroachDB also has its own tracer which, together
with custom code around our RPC boundaries, enables marshalling spans
from an RPC server to the RPC client by serializing them in the gRPC
response protos. We call this **“snowball tracing”** (a high-level
client enables it by marking a request as “traced”, and then lower
level rpc clients propagate this marker with their calls, and the
trace grows like a snowball… Or something.). This mechanism enables
session tracing and all the other trace collection features.

If the Lightstep integration is enabled, crdb uses both the Lightstep
and the internal tracers simultaneously - through a
[`TeeTracer`](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/tracing/tee_tracer.go#L28).

As we’ve seen, all log messages also go to the current distributed
trace (if any). The logging code [passes messages to the
tracer](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/trace.go#L132)
through the OpenTracing interface. The connection with that interface
is a `Span` which is passed around - you’ve guessed it - in the
`Context`. The logging code [extracts
it](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/trace.go#L110)
from there.

### What operations do we track in crdb? Where do we track traces and spans?

We define operations at multiple levels in crdb for tracing and
logging. These different levels correspond to different
`Contexts`. For example, the whole of a user’s SQL session uses the
same top-level `Context`, for the benefit of sharing those client
logging tags (`client=127.0.0.1:52149,user=root`), and for appearing
in `/debug/events` as a single trace (for `net/trace`’s definition of
a trace). The integration with `/debug/events` is done by putting a
[`trace.EventLog` in the session’s `Context`](https://github.com/cockroachdb/cockroach/blob/e05f9ec8e47922c3ffb20570883da17544566953/pkg/sql/session.go#L199). (FIXME:
explain what this does. Radu says: “The most important usecase for
debug/events is finding an operation that is stuck (you can’t see
those in lightstep). All net/trace does is take events and present
them in those debug/ HTTP endpoints.”)

When talking about tracing, you generally want operations to make
sense from a user’s perspective, as ultimately we hope users will make
use of this info - e.g. a SQL query is a good candidate; an internal
operation like acquiring some lease would not make sense to a user in
isolation. There are some technical considerations too: a trace should
not be too long-lived, as it generally only becomes visible when it’s
done. Individual spans (sub-operations) should also not be too long or
contain too many log messages, as messages pertaining to a span are
buffered in memory until the span is closed. Also, spans cannot be
used after they’re closed, so each span needs to clearly outlive all
the work that’s supposed to be done inside it (which can get tricky
when that work spawns goroutines which inherit the `Context` and don’t
have very clear termination points - e.g. the `DistSender` trying to
issue overlapping RPCs to multiple replicas). Another thing to keep in
mind is that spans are only collected once they are `Finish()`ed.
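
For instance, here is a sketch of one way to make a span clearly
outlive work spawned on goroutines (it uses the `tracing` helpers shown
earlier; `replicas`, `replica` and `send` are made up for the example):

    ctx, span := tracing.ChildSpan(ctx, "parallel sends")
    var wg sync.WaitGroup
    for _, r := range replicas {
        wg.Add(1)
        go func(r replica) {
            defer wg.Done()
            send(ctx, r) // may log into the span
        }(r)
    }
    wg.Wait()                // no goroutine can touch the span past this point
    tracing.FinishSpan(span) // so only now is it finished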

At the moment, for SQL processing, an operation is a SQL
transaction. This allows us to produce traces that include automatic
transaction retries. So, a SQL session would have been too long and a
SQL statement too short. Sub-operations may be acquiring a lease,
resolving a key to a range, or all the work done by individual DistSQL
processors. So, we create a *root span* for each SQL transaction
(rather than a root span with a larger lifetime, such as one for the
long-lived SQL session). This is done by the sql
[`Executor`](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/sql/session.go#L326). This
means that the highest level grouping that we’ll see in LightStep for
SQL operations is a SQL transaction.

[ Side note: “is this a wise choice?” Things like data collection of
SQL result sets can span multiple transactions. Don’t we want instead
to span over a batch of SQL code received by a single message in
pgwire? The reason why we do it is that it is useful to see retries in
a single trace, so we do it at the txn level. It hasn’t been a problem
yet… When you’re using Lightstep for collection (or also the net.trace
integrations), you’re relying on it to be fast and asynchronous and
maybe drop log messages when too many have been accumulated. When we
talk about other trace collection mechanisms (e.g. session tracing),
we’re assuming that the user is going to be judicious about not
tracing operations that are too massive. ]

Within a SQL transaction, there are many sub-operations. At the time
of this writing, we’re a little shy about opening child spans. We only
do it in `grpcTransport` when we’re [about to
perform](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/kv/transport.go#L173)
an RPC and on the [server side of the
RPC](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/server/node.go#L883)
(the situation is better in DistSQL, where we also perform more
asynchronous actions, so separating logs into different spans becomes
more important).

Our `Contexts` get derived a lot more for logging and cancellation
purposes than for tracing spans. Every layer that has an
`AmbientContext` (e.g. node/store/replica) technically creates a child
`Context`, but we don’t necessarily think of these as representing
logical “sub-operations” - we’re simply trying to put more information
in the following log messages. In other cases, we do explicitly fork a
`Context` to identify a sub-operation. For example, the
`rangeDescriptorCache` identifies a sub-operation worthy of having a
log tag:

    ctx = log.WithLogTag(ctx, "range-lookup", nil)
    descs, prefetched, pErr := rdc.db.RangeLookup(ctx, metadataKey, desc, useReverseScan)

In cases like this, we should probably also be creating a child span,
so that opentracing also displays the range lookup more prominently as
a logically separate operation.
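
Concretely, that could look like the following sketch, combining the
two helpers already shown above (the span name is illustrative):

    ctx = log.WithLogTag(ctx, "range-lookup", nil)
    ctx, span := tracing.ChildSpan(ctx, "range lookup")
    defer tracing.FinishSpan(span)
    descs, prefetched, pErr := rdc.db.RangeLookup(ctx, metadataKey, desc, useReverseScan)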

## FAQs

**What’s wrong with storing a `Context` in a struct?**

A couple of campaigns have been fought for removing `Context` struct
members, and instead switching to explicitly plumbing a `ctx Context`
as an argument to function calls.

Generally speaking, we don’t like storing a `Context` in a struct
because it tends to not represent accurately the intention of the
programmer. `Contexts` are always tied to operations, whereas object
instances generally are not.

Depending on who you ask, there are some exceptions and shades of gray:

1. Some structs are just “bags of arguments” for some method
   calls. This is a good pattern - if a method or group of methods has 30
   arguments, you probably want to combine them in a struct. For example,
   [`EvalContext`](https://github.com/cockroachdb/cockroach/blob/e05f9ec/pkg/sql/parser/eval.go#L1605)
   is such a struct and it contains a `Context`. These structs don’t have
   any lifetime - they just exist for one method call. And the `Context`
   is not a special argument in this regard; it can stay in the struct
   like everything else.

2. Some structs are supposed to have all their code be executed
   concurrently with the instantiator and the callers of the different
   methods. You create them and spawn some goroutine that calls some
   method representing an event loop, and separately other components
   communicate with the instance either by sending things on channels or
   by calling methods on it. You may want all methods invoked on an
   instance to be considered as running on behalf of the instantiator’s
   operation (or, rather, you probably want everything about that
   instance to be considered a sub-operation). For example, DistSQL
   processors get scheduled on behalf of a SQL query, live for the
   duration of that query, run concurrently with other processors and
   communicate with them by method calls. These processors save a
   `Context` as part of their initialization and use it throughout their
   lifetime.

   I guess the key here is that the lifetime of these processors is very clearly dictated by a rigid execution model.

3. Some operations have lifetimes that don’t map in a very obvious way
   to a stack frame. For example, if we consider a SQL transaction an
   operation (technically, a sub-operation of a SQL session operation),
   that transaction lives across potentially many interactions between
   the server and the SQL client, and in fact may end in the middle of
   one of these interactions (if a txn is ended in the middle of a batch
   of statements). In this case, the current context is logically part of
   the session’s state machine, and so we do store it in the session’s
   state.

4. … 47 others

The rule of thumb is:

- if some state struct is clearly and unambiguously tied to the
  operation that the context is attached to, then it is safe to have
  the context stored in the struct; however
- when any refactoring/extension causes any of the following to happen,
  the context should probably get out of the struct ASAP:

  - **The struct’s lifetime becomes decoupled from the operation**,
    e.g. it is reused across multiple logical operations or there
    can be logical state changes while the struct is alive — then
    the **conceptual encapsulation doesn’t hold any more** and the
    programmer must start to think about when the context in the
    struct must be updated.

  - **Some API using this struct also takes a context as explicit
    argument** (e.g. because it needs to implement some interface) —
    we generally want to **avoid having to decide** whether to use
    the context given as argument or the context in the struct.

  - **The operation is divided in sub-operations via spans** for the
    purpose of logging/tracing — this is actually a combination of
    both points above.

To avoid thinking about this altogether, we generally suggest that
`Contexts` live on some stack frame whose duration is the same as the
operation we’re interested in, and that they be passed explicitly
as function arguments from that point downwards.


### What about background operations?

Besides interactive request-response processing that our servers do
(e.g. for SQL processing), there’s also background work that happens
periodically. For example, our queues do periodic processing. The
queues have an `AmbientContext` that they use to annotate
`context.Background()` in their main loop (generally inside a
`stopper.RunWorker()`).

It’s important to not put a tracing span in this context, since the
trace would be very long-lived and all the messages going to this span
would be buffered in memory. Moreover, since a span is only collected
when it’s closed, we might never collect all these messages if the
server doesn’t shut down cleanly. Of course, spans can be added to
derived contexts for more specific, shorter sub-operations -
e.g. rebalancing a particular range.

We may add a `trace.EventLog` to these contexts, since Go’s
`debug/events` endpoint is meant for long-running operations. For some
reason, we seem to have stopped doing that for our queues.
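
A sketch of the pattern described above, with hypothetical names
(`q.AmbientContext`, `q.incoming`, `q.process` and the stopper-style
worker API are all illustrative):

    // Annotate the root context once with the queue's tags. Note that
    // we do NOT attach a tracing span here - the worker lives as long
    // as the server.
    ctx := q.AmbientContext.AnnotateCtx(context.Background())
    stopper.RunWorker(func() {
        for {
            select {
            case <-stopper.ShouldStop():
                return
            case repl := <-q.incoming:
                // A short sub-operation can get its own span in a
                // derived context.
                ctx, span := tracing.ChildSpan(ctx, "process replica")
                q.process(ctx, repl)
                tracing.FinishSpan(span)
            }
        }
    })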

### What’s `context.Background()`? What’s `context.TODO()` and what do I do when I see one?

As we’ve seen, `Contexts` are hierarchical, which means there has
to be a root `Context`. There are, in fact, two roots provided by the
`context` package - `context.Background()` and
`context.TODO()`. These are empty `Contexts` - no values, deadline
or cancellation.

We want to use `context.Background()` as the root for most of our
other `Contexts` - operation contexts should be derived from it, and
background work contexts should be derived from it.

`context.TODO()` is technically equivalent, except we (and Go
itself) use it to mark places where a proper `Context` has not yet
been plumbed. So when you see this guy, it means that someone was
lazy when `Context`-ifying some code module and didn’t plumb a
proper operation context. We don’t want to use this for new non-test
code.

In tests we do use `context.TODO()` for the tests’ operations. We
also sometimes use `context.Background()`; no pattern has emerged
yet. Ideally each test would create one (or more) `Contexts`
annotated with the test’s name and perhaps a span. But there
currently is little benefit in doing that. I (Andrei) had a WIP a
while ago for collecting traces from tests and dumping them into a
file that can then be read and displayed by AppDash. (FIXME(andrei)
add link) It was really cool but I didn’t clean it up. If we did
this, then using proper `Contexts` in tests would be beneficial.
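
If we did want annotated test contexts today, a minimal sketch might
look like this (`runTestOperations` is a hypothetical helper; the tag
arguments follow the `log.WithLogTag` examples above):

    func TestRangeLease(t *testing.T) {
        // Tag everything this test logs with the test's name.
        ctx := log.WithLogTag(context.Background(), "test", t.Name())
        runTestOperations(ctx, t)
    }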

FIXME: integrate side note: “Nathan: I just opened
https://github.com/cockroachdb/cockroach/pull/14128 to address this
first point. We should be using context.Background in tests, never
context.TODO. Andrei: well it’s not technically Background() work. In
fact, you can argue that Background() would be even worse, as it’d
combine the test’s operations with background server activity.”

## Alternatives

An alternative to passing a `Context` around to most functions in the
codebase that has been suggested is adding some type of thread-local
storage to the goroutines. This would involve augmenting the runtime
(which can be done by modifying the goroutine struct inside the Go
sources and adding an accessor method in the runtime library for
applications to use), or getting access to a goroutine id and
maintaining a map in userspace from id → storage. A possible advantage
would be less typing and less runtime overhead (from copying the
`Context` on all the stack frames). These ideas have not gained much
traction for several reasons:


1. they’d involve some “magic” that is not yet vetted by the Go team
2. we’d need to maintain some non-standard Go stuff
3. `Contexts` are becoming more and more universal in the Go ecosystem
   - e.g. our dependencies and the std lib. Doing something different in
   CockroachDB might make interfacing with external code harder. (FIXME:
   clarify/strengthen argument)
4. avoiding hidden state. When a context is passed through as an arg,
   it’s usually easy to trace where a problematic context is coming
   from. With goroutine-local storage, the context is the result of
   whichever code last set it; it may be much harder to figure out
   which code that is.
5. the alternative solution is not trivial when a function needs to
   set up a “child” context for another function it calls - we would need
   some infrastructure to save the current context on the stack and restore
   it before the function returns. Forgetting to restore it could lead to
   hard-to-debug issues.
6. there are open questions about what to do when spawning a
   goroutine. Simply inheriting the context is problematic when contexts
   contain spans - the lifetime of the span is usually up to the initial
   goroutine (which `Finish()`es the span), so (in general) two
   goroutines can’t share the same span. That’s why we have things like
   `RunAsyncTask`/`ForkCtxSpan`. It’s not clear either what to do with
   goroutines that, in their lifetime, service many different
   “operations.”

## Future directions

We’d also like to embed a graphical viewer of distributed traces in
CockroachDB - one such open-source tool is AppDash.