# context.Context and how it relates to logging and tracing in CockroachDB

Original author: Andrei Matei

## Introduction

You might have noticed that much of our codebase passes around
`context.Context` to functions all over the place. You can go a long
way without understanding exactly what the point is (until you can’t),
but that’s no way to live. You might have heard that we don’t like
`context.TODO()` or that “storing `Contexts` in structs is bad” and
you might be wondering what that’s about. This document serves as an
explanation of Context-related topics.

## What is a `context.Context`?

`context.Context` is an interface defined in the `context` package of
Go’s standard library.

The package documentation says: *Package context defines the Context
type, which carries deadlines, cancelation signals, and other
request-scoped values across API boundaries and between processes.*

So, the idea is that a `Context` represents an *operation*’s context -
a deadline, a cancellation signal, and values that we want to pass
implicitly to all the function calls executed on behalf of that
operation (this last use case may be achieved with thread-locals in
other programming languages). The `context` package does not attempt
to define what an operation is - that’s up to the programmer. Typical
operations are serving an RPC call, serving a SQL query, or periodic
background activities such as replica rebalancing. An important point
is that operations, and thus `Contexts`, are supposed to be
hierarchical - an operation can have sub-operations which may inherit
the parent’s deadline/cancellation/values or may define their own;
that’s why the `context` package makes it easy to create *derived*
(or *forked*) `Contexts` - child `Contexts` inheriting and augmenting
the parent’s properties.

This is the interface definition:

    // A Context carries a deadline, cancelation signal, and request-scoped values
    // across API boundaries. Its methods are safe for simultaneous use by multiple
    // goroutines.
    type Context interface {
        // Done returns a channel that is closed when this Context is canceled
        // or times out.
        Done() <-chan struct{}

        // Err indicates why this context was canceled, after the Done channel
        // is closed.
        Err() error

        // Deadline returns the time when this Context will be canceled, if any.
        Deadline() (deadline time.Time, ok bool)

        // Value returns the value associated with key or nil if none.
        Value(key interface{}) interface{}
    }
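To make the derivation concrete, here is a minimal, runnable sketch
(standard library only) of how child `Contexts` inherit and augment a
parent’s deadline, cancellation and values:

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    type ctxKey string

    func main() {
        // Root context: no values, no deadline, never canceled.
        root := context.Background()

        // Derive an operation context: it inherits from root and adds a
        // deadline and a request-scoped value.
        ctx, cancel := context.WithTimeout(root, 50*time.Millisecond)
        defer cancel() // always release the context's resources
        ctx = context.WithValue(ctx, ctxKey("user"), "root")

        // Fork a sub-operation context: it sees the parent's value and
        // deadline, and is canceled when the parent is.
        sub, subCancel := context.WithCancel(ctx)
        defer subCancel()

        fmt.Println(sub.Value(ctxKey("user"))) // "root", inherited
        <-sub.Done()                           // closes when the timeout fires
        fmt.Println(sub.Err())                 // context.DeadlineExceeded
    }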
## Motivations for contexts in CockroachDB

At Cockroach Labs, we use `Contexts` extensively, as a result of
cascading pushes by different people to improve logging and tracing.

Logging is about keeping a global history of the activity on a node
or cluster; improvements there aim to make messages more
informative. Tracing is about keeping the cause-effect relationship
related to individual activities, e.g. individual client requests;
improvements there aim to increase the accuracy/usefulness of the
network of cause-effect relations tracked by individual traces.
**Go contexts enable CockroachDB to provide both services using a common interface.**

We like passing `Contexts` everywhere, thus *informing each layer of
the code about the context in which every function is being
executed*. The main benefits we get are that all our log messages are
prefixed with information about the component that’s logging them,
and that all the logging messages also become part of the current
operation’s trace.

### Contexts for logging

Let’s look at an example of a component’s information being part of
log messages. The following call:

    log.Infof(ctx, "request range lease (attempt #%d)", attempt)

results in the following message being output:

    I170312 15:02:30.732131 179 storage/replica.go:1046 [n1,s1,r1/1:/{Min-Table/0},@c420498a80] request range lease (attempt #1)

Notice how the prefix `[n1,s1,r1/1:/{Min-Table/0},@c420498a80]` is not
part of the logging command. It’s just prepended auto-magically
through the `ctx` argument that’s passed to the logging method - the
`ctx` had been populated with values describing the replica executing
the operation that led to the range lease acquisition (node 1, store
1, range 1 / replica 1, covering keys *Min→Table/0*). This annotation
happens because, at the point where an operation entered the replica
code module, we forked a child context containing all these
annotations and used that child context for all downstream
code. Everything logged in the replica module and downstream of it
will carry information about *which particular replica* the operation
is using. Isn’t that neat?

Let’s look at an example of log messages containing information specific to the current operation:

    I170312 15:17:12.602218 149 kv/dist_sender.go:1142 [client=127.0.0.1:52149,user=root,n1] sending batch 1 CPut, 1 BeginTxn, 1 EndTxn to range 22

Here we see information about the client (IP address and username)
performing the SQL statement that eventually led to the `DistSender`
performing a `Batch` RPC call. This is achieved by having each user
SQL session create a context with these annotations and using it (and
children of it) for everything being executed on behalf of that
session at any level in the code (at least executed locally on the
session’s gateway node; see the discussion below about distributed
tracing for how this relates to remote code execution).

### Contexts for tracing

Besides logging, *tracing* is also a big reason why we plumb contexts
everywhere. A *trace* is a structured view of (messages printed on
behalf of) an operation.

It contains **all the logging messages printed by code servicing that
operation**, plus timing information and information about the
sequential or parallel nature of the execution of sub-operations.

| **What gets emitted \ where it is visible**  | **Log files**                  | **Trace visualisation** |
| -------------------------------------------- | ------------------------------ | ----------------------- |
| **log.Info, log.Warn, log.Error, log.Fatal** | ✓                              | ✓                       |
| **Log tags**                                 | ✓                              | ✓                       |
| **log.VEvent(ctx, N, …)**                    | *only if --vmodule=N or above* | ✓                       |
| **Span boundaries**                          |                                | ✓                       |
| **Span hierarchy**                           |                                | ✓                       |

The logging messages are the same as the ones that make it to the
unstructured text log files that we discussed above - the messages
produced by calls to `log.{Info,Warning,Error,VEvent}f?`.

A trace consists of a tree of spans: each *span* represents a
sub-operation (except the root one) and can contain child spans. Spans
contain log messages and their timing information. When we “visualize
a trace”, we visualize the tree of spans that make up that trace, up
to the root span, and all their messages.
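As a sketch of how such a tree comes about (the function names here
are hypothetical; `tracing.ChildSpan`, `tracing.FinishSpan` and
`log.VEvent` are the real helpers that appear later in this note),
the span tree simply mirrors the nesting of child spans in the code:

    // Each ChildSpan call creates a new span, parented under whatever
    // span the incoming ctx carries, and returns a ctx carrying it.
    func scanTable(ctx context.Context) {
        ctx, sp := tracing.ChildSpan(ctx, "scan table") // child of the root span
        defer tracing.FinishSpan(sp)
        log.VEvent(ctx, 2, "starting scan") // recorded in sp, with timing info
        resolveRange(ctx)
    }

    func resolveRange(ctx context.Context) {
        ctx, sp := tracing.ChildSpan(ctx, "resolve range") // grandchild of the root
        defer tracing.FinishSpan(sp)
        log.VEvent(ctx, 2, "cache miss")
    }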
In the rest of this section we’ll talk about how to see tracing
information for CRDB operations and subsequently what the code around
tracing looks like.

Let’s look at a LightStep trace of a SQL query (LightStep is a
distributed trace collector and viewer; see the section on tracing
below). Here’s the root span (trimmed):

*(screenshot: the root span of the trace, with its log messages and timeline)*

We can see log messages coming from the sql module and the kv module:
the executor starts executing a `SELECT`, acquires a table lease,
looks up a range lease, sends a batch with a `Scan` request, and (way
later, at *+1160us*, after receiving the results of the `Scan`) does
some memory accounting.

The blue part of the timeline that’s not covered by the yellow line
corresponds to time spent in child spans; let’s look at these spans
below:

*(screenshot: the two child spans for the RPC, client side and server side)*

We can see two spans, one the child of the other, related to
performing an RPC. The `grpcTransport.SendNext` one corresponds to the
client side of the call, and the `node.Batch` one corresponds to the
server side. Through the magic of distributed tracing, the server
could be the same process as the client or a remote node - the trace
would look just the same! This is incredibly powerful - we can see log
messages from all layers of the code, executed locally or even
remotely, on behalf of a high-level operation that we define (e.g. in
this case, executing a user’s `SELECT` query). We can see what code
paths were taken, what caches were hit, what nodes were involved, and
how long different sub-operations took.

Besides LightStep, there are other ways to see “tracing
information”. Spans and their events also show up in the
`debug/requests` endpoint; as opposed to LightStep, which only receives
spans once they are finished, the `debug/requests` endpoint can be
used to look at spans that are “in progress”. The `debug/requests`
endpoint is implemented via `net/trace` using `netTraceIntegrator`
(a hook that is called whenever a span is created, and that returns an
object which receives the span’s “events”).

Another way to see tracing information is through the `debug/events`
endpoint. This is also implemented via `net/trace` and it is used for
“top level” events that are not associated with a tracing span
(e.g. queue events, top-level SQL events). The tracing infrastructure
allows embedding a `trace.EventLog` in a `Context`; events go to that
`EventLog` automatically unless a span is also embedded in the
context.
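For a sense of the `net/trace` side of this, here is a minimal sketch
using the `golang.org/x/net/trace` API directly (the crdb helper that
embeds the `EventLog` in a `Context` lives in `util/log`; its exact
shape is not shown here):

    import "golang.org/x/net/trace"

    // Events logged here become visible under /debug/events, grouped by
    // family ("sql") and by title (one entry per session).
    el := trace.NewEventLog("sql", "client=127.0.0.1:52149")
    defer el.Finish() // declare the event log complete
    el.Printf("executing: %s", "SELECT 1")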
There’s also `SET TRACING = on`, which uses “snowball tracing” for
collecting the distributed trace of one query and displaying it as SQL
result rows. Snowball tracing is a trace mode in which trace
information is stored in memory, instead of being sent to a trace
collector, so that the application can inspect its own trace later;
this is the mechanism used by session tracing to present the execution
trace back to the SQL client.

There’s also the `COCKROACH_TRACE_SQL` env var, which causes the
distributed traces of all queries taking longer than a set time to be
collected in memory and then dumped into the logs. This is expensive
because it enables snowball tracing for all queries, and may flood the
logs. As of 03/2017, I (Andrei) am working on a new mechanism allowing
a user to enable/disable tracing on a SQL session by issuing special
SQL statements, and then making the collected traces available through
a virtual table.

All these tracing capabilities are enabled by the proliferation of
`Contexts`.

The mechanism for this is threefold:

- we use `Spans` from package `opentracing`, which represent the
  notion of a “current operation” in the application (e.g. a current
  client request, a method call, whatever), and bind them to a
  `Tracer` object which is able to receive events for that span
  (e.g. a trace collector).
- `Context` objects can carry a linked list of key/value pairs; we
  store `Span` objects in the `Context` to represent the
  stacking/nesting of “current operations” in CockroachDB.
- whenever a logging method is invoked, the logging module inspects
  the `Context`, checks whether it contains a span and, if it does,
  sends a copy of the log event to that span’s tracer before printing
  it to the log file.

### Creating traces programmatically

When thinking about how our operations should be traced, the first
decision we need to make is *what constitutes an* ***operation*** *for
the purposes of tracing*. See the section **What** ***operations***
**do we track in crdb? Where do we track traces and spans?** below for
a discussion about that.

Here’s what the code looks like. The Executor creates a root span:

    // Create a root span for this SQL txn.
    tracer := e.cfg.AmbientCtx.Tracer
    sp = tracer.StartSpan("sql txn")
    // Put the new span in the context.
    ctx = opentracing.ContextWithSpan(ctx, sp)
    // sp needs to be Finish()ed by the state machine when this transaction
    // is committed or rolled back.

The sub-operations create sub-spans and fork the `Context`:

    ctx, span := tracing.ChildSpan(ctx, "join reader")
    defer tracing.FinishSpan(span)

### Context cancellation

Contexts make distributed cancellation easy. A context object has a
broadcast “cancellation” channel which client components can check to
see if the work is still “useful” (see the sketch at the end of this
section). For example, the context for the worker serving a network
connection can be canceled when the remote end closes the connection;
the worker can then check the cancellation channel periodically and
spontaneously abort, instead of only discovering the closure through
an error when it next tries to send data on the connection.

This is really cool for things like backup/restore or DistSQL, which
kick off a lot of work throughout the cluster.

Of note:

- a cancellable sub-context of a cancellable context means there is a
  goroutine spawned under you for propagating the parent’s
  cancellation, so it can be expensive.
- it’s still a bit unclear how context cancellation crosses gRPC and
  libpq boundaries. People are working on this.
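Here is a minimal, runnable sketch of that cancellation pattern
(standard library only; the job channel is hypothetical):

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // worker processes jobs until its context is canceled, at which point
    // it aborts spontaneously instead of waiting to fail on a send/receive.
    func worker(ctx context.Context, jobs <-chan int) error {
        for {
            select {
            case <-ctx.Done():
                return ctx.Err() // context.Canceled or context.DeadlineExceeded
            case j := <-jobs:
                fmt.Println("processing job", j)
            }
        }
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        jobs := make(chan int)
        go func() {
            time.Sleep(10 * time.Millisecond)
            cancel() // e.g. the network connection was closed
        }()
        fmt.Println(worker(ctx, jobs)) // prints "context canceled"
    }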
## Technical notes

### Integration between logging and `Contexts`

All the logging methods (`log.{Infof,VEventf,…}`) take a `Context` and
[extract the “log tags” from it](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/structured.go#L111)
and prepend them to the messages. The *log tags* are the prefixes we
saw (e.g. `n1,s1,r1/1:/{Min-Table/0},@c420498a80`).

To add an operation-specific tag, use the `log.WithLogTag()` method
(e.g. `ctx = log.WithLogTag(ctx, "range-lookup", nil /* value */)`).

To add a component-specific tag (e.g. the node id or replica id), we
prefer a different pattern. These tags are specific to an instance of
an object used to perform part of an operation - e.g. a `replica`
instance. The same `replica` instance is used across many operations,
and it always wants to provide the same annotations; different
instances want to provide different annotations. We use an
`AmbientContext` - a container for a bunch of tags that an operation’s
`Context` will be annotated with.

For example, a `replica` initializes its `AmbientContext`
[like so](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/storage/replica.go#L576):

    // Add replica log tag - the value is rangeStr.String().
    r.AmbientContext.AddLogTag("r", &r.rangeStr)
    // Add replica pointer value.
    r.AmbientContext.AddLogTag("@", fmt.Sprintf("%x", unsafe.Pointer(r)))

And then all entry points into an instance use the `AmbientContext` to
annotate the operation’s context. For example, `replica.Send()`
[does](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/storage/replica.go#L1416):

    func (r *Replica) Send(
        ctx context.Context, ba roachpb.BatchRequest,
    ) (*roachpb.BatchResponse, *roachpb.Error) {
        // Add the range log tag.
        ctx = r.AnnotateCtx(ctx)
        ...

We usually embed the `AmbientContext` in the objects that contain it
and we say, for example, that a `replica` *annotates* all the
`Contexts` that go into it. For the history and evolution of the
`AmbientContext` interface, see Radu’s
[Enriching Log Messages Using Go Contexts](https://www.cockroachlabs.com/blog/enriching-log-messages-using-go-contexts/)
blog post.
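To illustrate the whole pattern, here is a hypothetical component (its
name and tags are made up; `AddLogTag` and `AnnotateCtx` are the real
methods shown above):

    // A component that annotates every Context entering it.
    type rangeScanner struct {
        log.AmbientContext
    }

    func newRangeScanner(nodeID int) *rangeScanner {
        s := &rangeScanner{}
        s.AddLogTag("scanner", nil) // tag with no value: prints as "scanner"
        s.AddLogTag("n", nodeID)    // tag with a value: prints as e.g. "n1"
        return s
    }

    // Every entry point annotates the operation's ctx before doing work,
    // so all downstream log messages carry this component's tags.
    func (s *rangeScanner) scan(ctx context.Context) {
        ctx = s.AnnotateCtx(ctx)
        log.Info(ctx, "starting scan") // prefixed with [scanner,n1,...]
    }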
### Integration between tracing and `Contexts`

For tracing, we use the OpenTracing interface - an attempt to
standardize tracing interfaces in different languages such that
different *tracers* can be plugged in. A *tracer*
(`opentracing.Tracer`) is an interface to objects that know how to
create spans; tracers also contain or interface with *collectors* -
components that collect traces and allow users to inspect them.

CockroachDB currently contains the LightStep tracer (a collector
developed by the people behind OpenTracing that sends traces to the
LightStep cloud service). Clients that want to use this need to have a
LightStep account (which is not free) and be willing to have their
traces sent to this third-party cloud service. The collector can be
enabled with the `COCKROACH_LIGHTSTEP_TOKEN=<project-specific token>`
env var.

Besides LightStep, CockroachDB also has its own tracer which, together
with custom code around our RPC boundaries, enables marshalling spans
from an RPC server to the RPC client by serializing them in the gRPC
response protos. We call this **“snowball tracing”** (a high-level
client enables it by marking a request as “traced”, and then
lower-level RPC clients propagate this marker with their calls, and
the trace grows like a snowball… Or something.). This mechanism
enables session tracing and all the other trace collection features.

If the LightStep integration is enabled, crdb uses both the LightStep
and the internal tracers simultaneously - through a
[`TeeTracer`](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/tracing/tee_tracer.go#L28).

As we’ve seen, all log messages also go to the current distributed
trace (if any). The logging code [passes messages to the
tracer](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/trace.go#L132)
through the OpenTracing interface. The connection with that interface
is a `Span` which is passed around - you’ve guessed it - in the
`Context`. The logging code [extracts
it](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/trace.go#L110)
from there.

### What operations do we track in crdb? Where do we track traces and spans?

We define operations at multiple levels in crdb for tracing and
logging. These different levels correspond to different
`Contexts`. For example, the whole of a user’s SQL session uses the
same top-level `Context`, for the benefit of sharing those client
logging tags (`client=127.0.0.1:52149,user=root`), and for appearing
in `/debug/events` as a single trace (for `net/trace`’s definition of
a trace). The integration with `/debug/events` is done by putting a
[`trace.EventLog` in the session’s `Context`](https://github.com/cockroachdb/cockroach/blob/e05f9ec8e47922c3ffb20570883da17544566953/pkg/sql/session.go#L199). (FIXME:
explain what this does. Radu says: “The most important use case for
debug/events is finding an operation that is stuck (you can’t see
those in LightStep). All net/trace does is take events and present
them in those debug/ HTTP endpoints.”)

When talking about tracing, you generally want operations to make
sense from a user’s perspective, as ultimately we hope users will make
use of this info - e.g. a SQL query is a good candidate; an internal
operation like acquiring some lease would not make sense to a user in
isolation. There are some technical considerations too: a trace should
not be too long-lived, as it generally only becomes visible when it’s
done. Individual spans (sub-operations) should also not be too long or
contain too many log messages, as messages pertaining to a span are
buffered in memory until the span is closed. Also, spans cannot be
used after they’re closed, so each span needs to clearly outlive all
the work that’s supposed to be done inside it (which can get tricky
when that work spawns goroutines which inherit the `Context` and don’t
have very clear termination points - e.g. the `DistSender` trying to
issue overlapping RPCs to multiple replicas). Another thing to keep in
mind is that spans are only collected once they are `Finish()`ed.
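Here is a minimal sketch of that last pitfall (the goroutine structure
is hypothetical; the helpers are the ones shown earlier):

    // The span carried by ctx must not be Finish()ed while goroutines
    // that inherited ctx may still log into it.
    ctx, sp := tracing.ChildSpan(ctx, "parallel work")
    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            log.VEventf(ctx, 2, "subtask %d", i) // records into sp
        }(i)
    }
    wg.Wait()              // wait for every user of sp...
    tracing.FinishSpan(sp) // ...before closing it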
At the moment, for SQL processing, an operation is a SQL
transaction. This allows us to produce traces that include automatic
transaction retries. So, a SQL session would have been too long and a
SQL statement too short. Sub-operations may be acquiring a lease,
resolving a key to a range, or all the work done by individual DistSQL
processors. So, we create a *root span* for each SQL transaction (so,
with a shorter lifetime than the long-lived SQL session). This is done
by the sql
[`Executor`](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/sql/session.go#L326). This
means that the highest-level grouping that we’ll see in LightStep for
SQL operations is a SQL transaction.

[ Side note: "is this a wise choice?" Things like data collection of
SQL result sets can span multiple transactions. Don’t we want instead
to span over a batch of SQL code received by a single message in
pgwire? The reason why we do it is that it is useful to see retries in
a single trace, so we do it at the txn level. It hasn’t been a problem
yet… When you’re using LightStep for collection (or also the net/trace
integrations), you’re relying on it to be fast and asynchronous and
maybe drop log messages when too many have been accumulated. When we
talk about other trace collection mechanisms (e.g. session tracing),
we’re assuming that the user is going to be judicious about not
tracing operations that are too massive. ]

Within a SQL transaction, there are many sub-operations. At the time
of this writing, we’re a little shy about opening child spans. We only
do it in `grpcTransport` when we’re [about to
perform](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/kv/transport.go#L173)
an RPC and on the [server side of the
RPC](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/server/node.go#L883)
(the situation is better in DistSQL, where we also perform more
asynchronous actions, so separating logs into different spans becomes
more important).

Our `Contexts` get derived a lot more for logging and cancellation
purposes than for tracing spans. Every layer that has an
`AmbientContext` (e.g. node/store/replica) technically creates a child
`Context`, but we don’t necessarily think of these as representing
logical “sub-operations” - we’re simply trying to put more information
in the following log messages. In other cases, we do explicitly fork a
`Context` to identify a sub-operation. For example, the
`rangeDescriptorCache` identifies a sub-operation worthy of having a
log tag:

    ctx = log.WithLogTag(ctx, "range-lookup", nil)
    descs, prefetched, pErr := rdc.db.RangeLookup(ctx, metadataKey, desc, useReverseScan)

In cases like this, we should probably also be creating a child span,
so that opentracing also displays the range lookup more prominently as
a logically separate operation.
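Concretely, that might look something like the following sketch (not
code from the tree; it just combines the two helpers we’ve seen):

    // Tag the ctx for the log files *and* open a child span so the lookup
    // shows up as its own node in the trace tree.
    ctx = log.WithLogTag(ctx, "range-lookup", nil)
    ctx, sp := tracing.ChildSpan(ctx, "range lookup")
    defer tracing.FinishSpan(sp)
    descs, prefetched, pErr := rdc.db.RangeLookup(ctx, metadataKey, desc, useReverseScan)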
## FAQs

**What’s wrong with storing a `Context` in a struct?**

A couple of campaigns have been fought for removing `Context` struct
members, and instead switching to explicitly plumbing a `ctx Context`
as an argument to function calls.

Generally speaking, we don’t like storing a `Context` in a struct
because it tends to not represent accurately the intention of the
programmer. `Contexts` are always tied to operations, whereas object
instances generally are not.

Depending on who you ask, there are some exceptions and shades of gray
(case 2 is sketched in code at the end of this answer):

1. Some structs are just “bags of arguments” for some method
calls. This is a good pattern - if a method or group of methods has 30
arguments, you probably want to combine them in a struct. For example,
[`EvalContext`](https://github.com/cockroachdb/cockroach/blob/e05f9ec/pkg/sql/parser/eval.go#L1605)
is such a struct and it contains a `Context`. These structs don’t have
any lifetime - they just exist for one method call. And the `Context`
is not a special argument in this regard; it can stay in the struct
like everything else.

2. Some structs are supposed to have all their code be executed
concurrently with the instantiator and the callers of the different
methods. You create them and spawn some goroutine that calls some
method representing an event loop, and separately other components
communicate with the instance either by sending things on channels or
by calling methods on it. You may want all methods invoked on an
instance to be considered as running on behalf of the instantiator’s
operation (or, rather, you probably want everything about that
instance to be considered a sub-operation). For example, DistSQL
processors get scheduled on behalf of a SQL query, live for the
duration of that query, run concurrently with other processors and
communicate with them by method calls. These processors save a
`Context` as part of their initialization and use it throughout this
lifetime.

I guess the key here is that the lifetime of these processors is very
clearly dictated by a rigid execution model.

3. Some operations have lifetimes that don’t map in a very obvious way
to a stack frame. For example, if we consider a SQL transaction an
operation (technically, a sub-operation of a SQL session operation),
that transaction lives across potentially many interactions between
the server and the SQL client, and in fact may end in the middle of
one of these interactions (if a txn is ended in the middle of a batch
of statements). In this case, the current context is logically part of
the session’s state machine, and so we do store it in the session’s
state.

4. … 47 others

The rule of thumb is:

- if some state struct is clearly and unambiguously tied to the
  operation that the context is attached to, then it is safe to have
  the context stored in the struct; however
- when any refactoring/extension causes any of the following to
  happen, the context should probably get out of the struct ASAP:

  - **The struct’s lifetime becomes decoupled from the operation**,
    e.g. it is reused across multiple logical operations or there
    can be logical state changes while the struct is alive - then
    the **conceptual encapsulation doesn’t hold any more** and the
    programmer must start to think about when the context in the
    struct must be updated.

  - **Some API using this struct also takes a context as explicit
    argument** (e.g. because it needs to implement some interface) -
    we generally want to **avoid having to decide** whether to use
    the context given as argument or the context in the struct.

  - **The operation is divided in sub-operations via spans** for the
    purpose of logging/tracing - this is actually a combination of
    both points above.

To avoid thinking about this altogether, we generally suggest that
`Contexts` live on some stack frame whose duration is the same as the
operation we’re interested in, and we want it to be passed explicitly
as function arguments from that point downwards.
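Here is a hypothetical DistSQL-style processor illustrating case 2
above (all names made up):

    // A processor whose whole lifetime is one well-delimited operation:
    // the Context is captured at construction and used until run returns.
    type copier struct {
        ctx  context.Context // the query's ctx; valid for the flow's lifetime
        rows <-chan string
    }

    func newCopier(ctx context.Context, rows <-chan string) *copier {
        return &copier{ctx: ctx, rows: rows}
    }

    // run is the event loop, spawned in its own goroutine by the scheduler.
    func (c *copier) run() {
        for {
            select {
            case <-c.ctx.Done():
                return // the query was canceled
            case r, ok := <-c.rows:
                if !ok {
                    return // input exhausted; the operation is over
                }
                _ = r // process the row on behalf of the query's operation
            }
        }
    }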
### What about background operations?

Besides the interactive request-response processing that our servers
do (e.g. for SQL processing), there’s also background work that
happens periodically. For example, our queues do periodic
processing. The queues have an `AmbientContext` that they use to
annotate `context.Background()` in their main loop (generally inside a
`stopper.RunWorker()`).

It’s important to not put a tracing span in this context, since the
trace would be very long-lived and all the messages going to this span
would be buffered in memory. Moreover, since a span is only collected
when it’s closed, we might never collect all these messages if the
server doesn’t shut down cleanly. Of course, spans can be added to
derived contexts for more specific, shorter sub-operations -
e.g. rebalancing a particular range.

We may add a `net/trace` `EventLog` to these contexts, since the
`debug/events` endpoint is meant for long-running operations. For some
reason, we seem to have stopped doing that for our queues.

### What’s `context.Background()`? What’s `context.TODO()` and what do I do when I see one?

As we’ve seen, `Contexts` are hierarchical, which means there has
to be a root `Context`. There are, in fact, two roots provided by the
`context` package - `context.Background()` and
`context.TODO()`. These are empty `Contexts` - no values, deadline
or cancellation.

We want to use `context.Background()` as the root for most of our
other `Contexts` - both operation contexts and background-work
contexts should be derived from it.

`context.TODO()` is technically equivalent, except we (and Go
itself) use it to mark places where a proper `Context` has not yet
been plumbed to. So when you see this guy, it means that someone was
lazy when `Context`-ifying some code module and didn’t plumb a
proper operation context. We don’t want to use this for new non-test
code.
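A minimal, runnable sketch of the distinction (`doWork` is a stand-in
for properly plumbed code):

    package main

    import "context"

    // A proper entry point: derive the operation's context from Background.
    func startOperation() {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()
        doWork(ctx)
    }

    // Not yet Context-ified: TODO marks the unplumbed call site so that
    // it's easy to find and fix later.
    func legacyEntryPoint() {
        doWork(context.TODO())
    }

    func doWork(ctx context.Context) { _ = ctx /* ... */ }

    func main() {
        startOperation()
        legacyEntryPoint()
    }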
In tests we do use `context.TODO()` for the tests’ operations. We
also sometimes use `context.Background()`; no pattern has emerged
yet. Ideally each test would create one (or more) `Contexts`
annotated with the test’s name and perhaps a span. But there
currently is little benefit in doing that. I (Andrei) had a WIP a
while ago for collecting traces from tests and dumping them into a
file that can then be read and displayed by AppDash. (FIXME(andrei)
add link) It was really cool but I didn’t clean it up. If we did
this, then using proper `Contexts` in tests would be beneficial.

FIXME: integrate side note: “Nathan: I just opened
https://github.com/cockroachdb/cockroach/pull/14128 to address this
first point. We should be using context.Background in tests, never
context.TODO. Andrei: well it’s not technically Background() work. In
fact, you can argue that Background() would be even worse, as it’d
combine the test’s operations with background server activity.”

## Alternatives

An alternative that has been suggested to passing a `Context` around
to most functions in the codebase is adding some type of thread-local
storage to the goroutines. This would involve augmenting the runtime
(which can be done by modifying the goroutine struct inside the Go
sources and adding an accessor method in the runtime library for
applications to use), or getting access to a goroutine id and
maintaining a map in userspace from id → storage. A possible advantage
would be less typing and less runtime overhead (from copying the
`Context` on all the stack frames). These ideas have not gained much
traction, for several reasons:

1. they’d involve some “magic” that is not yet vetted by the Go team
2. we’d need to maintain some non-standard Go stuff
3. `Contexts` are becoming more and more universal in the Go
ecosystem - e.g. in our dependencies and the standard library. Doing
something different in CockroachDB might make interfacing with
external code harder. (FIXME: clarify/strengthen argument)
4. avoiding hidden state. When a context is passed through as an arg,
it’s usually easy to trace where a problematic context is coming
from. With goroutine-local storage, the context is the result of
whichever code last set it; it may be much harder to figure out
which code that is.
5. the alternative solution is not trivial when a function needs to
set up a “child” context for another function it calls - we would need
some infrastructure to save the current context on the stack and
restore it before the function returns. Forgetting to restore it could
lead to hard-to-debug issues.
6. there are open questions about what to do when spawning a
goroutine. Simply inheriting the context is problematic when contexts
contain spans - the lifetime of the span is usually up to the initial
goroutine (which `Finish()`es the span), so (in general) two
goroutines can’t share the same span. That’s why we have things like
`RunAsyncTask`/`ForkCtxSpan`. It’s not clear either what we do with
goroutines that, in their lifetime, service many different
“operations”.

## Future directions

We’d also like to embed a graphical viewer of distributed traces in
CockroachDB - one open-source option is AppDash.