# context.Context and how it relates to logging and tracing in CockroachDB

Original author: Andrei Matei

## Introduction

You might have noticed that much of our codebase passes around
`context.Context` to functions all over the place. You can go a long
way without understanding exactly what the point is (until you can’t),
but that’s no way to live. You might have heard that we don’t like
`context.TODO()` or that “storing `Contexts` in structs is bad” and
you might be wondering what that’s about. This document serves as an
explanation of Context-related topics.

## What is a `context.Context`?

`context.Context` is an interface defined in the `context` package of
Go’s standard library.

The package documentation says: *Package context defines the Context
type, which carries deadlines, cancelation signals, and other
request-scoped values across API boundaries and between processes.*

So, the idea is that a `Context` represents an *operation*’s context -
a deadline, a cancellation signal, and values that we want to pass
implicitly to all the function calls executed on behalf of that
operation (this last use case may be achieved with thread-locals in
other programming languages). The `context` package does not attempt
to define what an operation is - that’s up to the programmer. Typical
operations are serving an RPC call, serving a SQL query, or periodic
background activities such as replica rebalancing. An important point
is that operations, and thus `Contexts`, are supposed to be
hierarchical - an operation can have sub-operations which may inherit
the parent’s deadline/cancellation/values or may define their own;
that’s why the `context` package makes it easy to create *derived*
(or *forked*) `Contexts` - child `Contexts` inheriting and augmenting
the parent’s properties.

This is the interface definition:

    // A Context carries a deadline, cancelation signal, and request-scoped values
    // across API boundaries. Its methods are safe for simultaneous use by multiple
    // goroutines.
    type Context interface {
        // Done returns a channel that is closed when this Context is canceled
        // or times out.
        Done() <-chan struct{}

        // Err indicates why this context was canceled, after the Done channel
        // is closed.
        Err() error

        // Deadline returns the time when this Context will be canceled, if any.
        Deadline() (deadline time.Time, ok bool)

        // Value returns the value associated with key or nil if none.
        Value(key interface{}) interface{}
    }
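To make the derivation concrete, here is a minimal, runnable sketch
(standard library only) of how child `Contexts` inherit and augment a
parent’s deadline, cancellation and values:

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    type ctxKey string

    func main() {
        // Root context: no values, no deadline, never canceled.
        root := context.Background()

        // Derive an operation context: it inherits from root and adds a
        // deadline and a request-scoped value.
        ctx, cancel := context.WithTimeout(root, 50*time.Millisecond)
        defer cancel() // always release the context's resources
        ctx = context.WithValue(ctx, ctxKey("user"), "root")

        // Fork a sub-operation context: it sees the parent's value and
        // deadline, and is canceled when the parent is.
        sub, subCancel := context.WithCancel(ctx)
        defer subCancel()

        fmt.Println(sub.Value(ctxKey("user"))) // "root", inherited
        <-sub.Done()                           // closes when the timeout fires
        fmt.Println(sub.Err())                 // context.DeadlineExceeded
    }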
## Motivations for contexts in CockroachDB

At Cockroach Labs, we use `Contexts` extensively, as a result of
cascading pushes by different people to improve logging and tracing.

Logging is about keeping a global history of the activity on a node
or cluster; improvements there aim to make messages more
informative. Tracing is about keeping the cause-effect relationship
related to individual activities, e.g. individual client requests;
improvements there aim to increase the accuracy/usefulness of the
network of cause-effect relations tracked by individual traces.
**Go contexts enable CockroachDB to provide both services using a common interface.**

We like passing `Contexts` everywhere, thus *informing each layer of
the code about the context in which every function is being
executed*. The main benefits we get are that all our log messages are
prefixed with information about the component that’s logging them,
and that all the logging messages also become part of the current
operation’s trace.

### Contexts for logging

Let’s look at an example of a component’s information being part of
log messages. The following call:

    log.Infof(ctx, "request range lease (attempt #%d)", attempt)

results in the following message being output:

    I170312 15:02:30.732131 179 storage/replica.go:1046 [n1,s1,r1/1:/{Min-Table/0},@c420498a80] request range lease (attempt #1)

Notice how the prefix `[n1,s1,r1/1:/{Min-Table/0},@c420498a80]` is not
part of the logging command. It’s just prepended auto-magically
through the `ctx` argument that’s passed to the logging method - the
`ctx` had been populated with values describing the replica executing
the operation that led to the range lease acquisition (node 1, store
1, range 1 / replica 1, covering keys *Min→Table/0*). This annotation
happens because, at the point where an operation entered the replica
code module, we forked a child context containing all these
annotations and used that child context for all downstream
code. Everything logged in the replica module and downstream of it
will carry information about *which particular replica* the operation
is using. Isn’t that neat?

Let’s look at an example of log messages containing information specific to the current operation:

    I170312 15:17:12.602218 149 kv/dist_sender.go:1142 [client=127.0.0.1:52149,user=root,n1] sending batch 1 CPut, 1 BeginTxn, 1 EndTxn to range 22

Here we see information about the client (IP address and username)
performing the SQL statement that eventually led to the `DistSender`
performing a `Batch` RPC call. This is achieved by having each user
SQL session create a context with these annotations and using it (and
children of it) for everything being executed on behalf of that
session at any level in the code (at least executed locally on the
session’s gateway node; see the discussion below about distributed
tracing for how this relates to remote code execution).

### Contexts for tracing

Besides logging, *tracing* is also a big reason why we plumb contexts
everywhere. A *trace* is a structured view of (messages printed on
behalf of) an operation.

It contains **all the logging messages printed by code servicing that
operation**, plus timing information and information about the
sequential or parallel nature of the execution of sub-operations.

| **What gets emitted \ where it is visible**  | **Log files**                  | **Trace visualisation** |
| -------------------------------------------- | ------------------------------ | ----------------------- |
| **log.Info, log.Warn, log.Error, log.Fatal** | ✓                              | ✓                       |
| **Log tags**                                 | ✓                              | ✓                       |
| **log.VEvent(ctx, N, …)**                    | *only if --vmodule=N or above* | ✓                       |
| **Span boundaries**                          |                                | ✓                       |
| **Span hierarchy**                           |                                | ✓                       |

The logging messages are the same as the ones that make it to the
unstructured text log files that we discussed above - the messages
produced by calls to `log.{Info,Warning,Error,VEvent}f?`.

A trace consists of a tree of spans: each *span* represents a
sub-operation (except the root one) and can contain child spans. Spans
contain log messages and their timing information. When we “visualize
a trace”, we visualize the tree of spans that make up that trace, up
to the root span, and all their messages.
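As a sketch of how such a tree comes about (the function names here
are hypothetical; `tracing.ChildSpan`, `tracing.FinishSpan` and
`log.VEvent` are the real helpers that appear later in this note),
the span tree simply mirrors the nesting of child spans in the code:

    // Each ChildSpan call creates a new span, parented under whatever
    // span the incoming ctx carries, and returns a ctx carrying it.
    func scanTable(ctx context.Context) {
        ctx, sp := tracing.ChildSpan(ctx, "scan table") // child of the root span
        defer tracing.FinishSpan(sp)
        log.VEvent(ctx, 2, "starting scan") // recorded in sp, with timing info
        resolveRange(ctx)
    }

    func resolveRange(ctx context.Context) {
        ctx, sp := tracing.ChildSpan(ctx, "resolve range") // grandchild of the root
        defer tracing.FinishSpan(sp)
        log.VEvent(ctx, 2, "cache miss")
    }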
In the rest of this section we’ll talk about how to see tracing
information for CRDB operations and subsequently what the code around
tracing looks like.

Let’s look at a LightStep trace of a SQL query (LightStep is a
distributed trace collector and viewer; see the section on tracing
below). Here’s the root span (trimmed):

*(screenshot: the root span of the trace, with its log messages and timeline)*

We can see log messages coming from the sql module and the kv module:
the executor starts executing a `SELECT`, acquires a table lease,
looks up a range lease, sends a batch with a `Scan` request, and (way
later, at *+1160us*, after receiving the results of the `Scan`) does
some memory accounting.

The blue part of the timeline that’s not covered by the yellow line
corresponds to time spent in child spans; let’s look at these spans
below:

*(screenshot: the two child spans for the RPC, client side and server side)*

We can see two spans, one the child of the other, related to
performing an RPC. The `grpcTransport.SendNext` one corresponds to the
client side of the call, and the `node.Batch` one corresponds to the
server side. Through the magic of distributed tracing, the server
could be the same process as the client or a remote node - the trace
would look just the same! This is incredibly powerful - we can see log
messages from all layers of the code, executed locally or even
remotely, on behalf of a high-level operation that we define (e.g. in
this case, executing a user’s `SELECT` query). We can see what code
paths were taken, what caches were hit, what nodes were involved, and
how long different sub-operations took.

Besides LightStep, there are other ways to see “tracing
information”. Spans and their events also show up in the
`debug/requests` endpoint; as opposed to LightStep, which only receives
spans once they are finished, the `debug/requests` endpoint can be
used to look at spans that are “in progress”. The `debug/requests`
endpoint is implemented via `net/trace` using `netTraceIntegrator`
(a hook that is called whenever a span is created, and that returns an
object which receives the span’s “events”).

Another way to see tracing information is through the `debug/events`
endpoint. This is also implemented via `net/trace` and it is used for
“top level” events that are not associated with a tracing span
(e.g. queue events, top-level SQL events). The tracing infrastructure
allows embedding a `trace.EventLog` in a `Context`; events go to that
`EventLog` automatically unless a span is also embedded in the
context.
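For a sense of the `net/trace` side of this, here is a minimal sketch
using the `golang.org/x/net/trace` API directly (the crdb helper that
embeds the `EventLog` in a `Context` lives in `util/log`; its exact
shape is not shown here):

    import "golang.org/x/net/trace"

    // Events logged here become visible under /debug/events, grouped by
    // family ("sql") and by title (one entry per session).
    el := trace.NewEventLog("sql", "client=127.0.0.1:52149")
    defer el.Finish() // declare the event log complete
    el.Printf("executing: %s", "SELECT 1")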
There’s also `SET TRACING = on`, which uses “snowball tracing” for
collecting the distributed trace of one query and displaying it as SQL
result rows. Snowball tracing is a trace mode in which trace
information is stored in memory, instead of being sent to a trace
collector, so that the application can inspect its own trace later;
this is the mechanism used by session tracing to present the execution
trace back to the SQL client.

There’s also the `COCKROACH_TRACE_SQL` env var, which causes the
distributed traces of all queries taking longer than a set time to be
collected in memory and then dumped into the logs. This is expensive
because it enables snowball tracing for all queries, and may flood the
logs. As of 03/2017, I (Andrei) am working on a new mechanism allowing
a user to enable/disable tracing on a SQL session by issuing special
SQL statements, and then making the collected traces available through
a virtual table.

All these tracing capabilities are enabled by the proliferation of
`Contexts`.

The mechanism for this is threefold:

- we use `Spans` from package `opentracing`, which represent the
  notion of a “current operation” in the application (e.g. a current
  client request, a method call, whatever), and bind them to a
  `Tracer` object which is able to receive events for that span
  (e.g. a trace collector).
- `Context` objects can carry a linked list of key/value pairs; we
  store `Span` objects in the `Context` to represent the
  stacking/nesting of “current operations” in CockroachDB.
- whenever a logging method is invoked, the logging module inspects
  the `Context`, checks whether it contains a span and, if it does,
  sends a copy of the log event to that span’s tracer before printing
  it to the log file.

### Creating traces programmatically

When thinking about how our operations should be traced, the first
decision we need to make is *what constitutes an* ***operation*** *for
the purposes of tracing*. See the section **What** ***operations***
**do we track in crdb? Where do we track traces and spans?** below for
a discussion about that.

Here’s what the code looks like. The Executor creates a root span:

    // Create a root span for this SQL txn.
    tracer := e.cfg.AmbientCtx.Tracer
    sp = tracer.StartSpan("sql txn")
    // Put the new span in the context.
    ctx = opentracing.ContextWithSpan(ctx, sp)
    // sp needs to be Finish()ed by the state machine when this transaction
    // is committed or rolled back.

The sub-operations create sub-spans and fork the `Context`:

    ctx, span := tracing.ChildSpan(ctx, "join reader")
    defer tracing.FinishSpan(span)

### Context cancellation

Contexts make distributed cancellation easy. A context object has a
broadcast “cancellation” channel which client components can check to
see if the work is still “useful” (see the sketch at the end of this
section). For example, the context for the worker serving a network
connection can be canceled when the remote end closes the connection;
the worker can then check the cancellation channel periodically and
spontaneously abort, instead of only discovering the closure through
an error when it next tries to send data on the connection.

This is really cool for things like backup/restore or DistSQL, which
kick off a lot of work throughout the cluster.

Of note:

- a cancellable sub-context of a cancellable context means there is a
  goroutine spawned under you for propagating the parent’s
  cancellation, so it can be expensive.
- it’s still a bit unclear how context cancellation crosses gRPC and
  libpq boundaries. People are working on this.
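Here is a minimal, runnable sketch of that cancellation pattern
(standard library only; the job channel is hypothetical):

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // worker processes jobs until its context is canceled, at which point
    // it aborts spontaneously instead of waiting to fail on a send/receive.
    func worker(ctx context.Context, jobs <-chan int) error {
        for {
            select {
            case <-ctx.Done():
                return ctx.Err() // context.Canceled or context.DeadlineExceeded
            case j := <-jobs:
                fmt.Println("processing job", j)
            }
        }
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        jobs := make(chan int)
        go func() {
            time.Sleep(10 * time.Millisecond)
            cancel() // e.g. the network connection was closed
        }()
        fmt.Println(worker(ctx, jobs)) // prints "context canceled"
    }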
## Technical notes

### Integration between logging and `Contexts`

All the logging methods (`log.{Infof,VEventf,…}`) take a `Context` and
[extract the “log tags” from it](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/structured.go#L111)
and prepend them to the messages. The *log tags* are the prefixes we
saw (e.g. `n1,s1,r1/1:/{Min-Table/0},@c420498a80`).

To add an operation-specific tag, use the `log.WithLogTag()` method
(e.g. `ctx = log.WithLogTag(ctx, "range-lookup", nil /* value */)`).

To add a component-specific tag (e.g. the node id or replica id), we
prefer a different pattern. These tags are specific to an instance of
an object used to perform part of an operation - e.g. a `replica`
instance. The same `replica` instance is used across many operations,
and it always wants to provide the same annotations; different
instances want to provide different annotations. We use an
`AmbientContext` - a container for a bunch of tags that an operation’s
`Context` will be annotated with.

For example, a `replica` initializes its `AmbientContext`
[like so](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/storage/replica.go#L576):

    // Add replica log tag - the value is rangeStr.String().
    r.AmbientContext.AddLogTag("r", &r.rangeStr)
    // Add replica pointer value.
    r.AmbientContext.AddLogTag("@", fmt.Sprintf("%x", unsafe.Pointer(r)))

And then all entry points into an instance use the `AmbientContext` to
annotate the operation’s context. For example, `replica.Send()`
[does](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/storage/replica.go#L1416):

    func (r *Replica) Send(
        ctx context.Context, ba roachpb.BatchRequest,
    ) (*roachpb.BatchResponse, *roachpb.Error) {
        // Add the range log tag.
        ctx = r.AnnotateCtx(ctx)
        ...

We usually embed the `AmbientContext` in the objects that contain it
and we say, for example, that a `replica` *annotates* all the
`Contexts` that go into it. For the history and evolution of the
`AmbientContext` interface, see Radu’s
[Enriching Log Messages Using Go Contexts](https://www.cockroachlabs.com/blog/enriching-log-messages-using-go-contexts/)
blog post.
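To illustrate the whole pattern, here is a hypothetical component (its
name and tags are made up; `AddLogTag` and `AnnotateCtx` are the real
methods shown above):

    // A component that annotates every Context entering it.
    type rangeScanner struct {
        log.AmbientContext
    }

    func newRangeScanner(nodeID int) *rangeScanner {
        s := &rangeScanner{}
        s.AddLogTag("scanner", nil) // tag with no value: prints as "scanner"
        s.AddLogTag("n", nodeID)    // tag with a value: prints as e.g. "n1"
        return s
    }

    // Every entry point annotates the operation's ctx before doing work,
    // so all downstream log messages carry this component's tags.
    func (s *rangeScanner) scan(ctx context.Context) {
        ctx = s.AnnotateCtx(ctx)
        log.Info(ctx, "starting scan") // prefixed with [scanner,n1,...]
    }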
### Integration between tracing and `Contexts`

For tracing, we use the OpenTracing interface - an attempt to
standardize tracing interfaces in different languages such that
different *tracers* can be plugged in. A *tracer*
(`opentracing.Tracer`) is an interface to objects that know how to
create spans; tracers also contain or interface with *collectors* -
components that collect traces and allow users to inspect them.

CockroachDB currently contains the LightStep tracer (a collector
developed by the people behind OpenTracing that sends traces to the
LightStep cloud service). Clients that want to use this need to have a
LightStep account (which is not free) and be willing to have their
traces sent to this third-party cloud service. The collector can be
enabled with the `COCKROACH_LIGHTSTEP_TOKEN=<project-specific token>`
env var.

Besides LightStep, CockroachDB also has its own tracer which, together
with custom code around our RPC boundaries, enables marshalling spans
from an RPC server to the RPC client by serializing them in the gRPC
response protos. We call this **“snowball tracing”** (a high-level
client enables it by marking a request as “traced”, and then
lower-level RPC clients propagate this marker with their calls, and
the trace grows like a snowball… Or something.). This mechanism
enables session tracing and all the other trace collection features.

If the LightStep integration is enabled, crdb uses both the LightStep
and the internal tracers simultaneously - through a
[`TeeTracer`](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/tracing/tee_tracer.go#L28).

As we’ve seen, all log messages also go to the current distributed
trace (if any). The logging code [passes messages to the
tracer](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/trace.go#L132)
through the OpenTracing interface. The connection with that interface
is a `Span` which is passed around - you’ve guessed it - in the
`Context`. The logging code [extracts
it](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/util/log/trace.go#L110)
from there.

### What operations do we track in crdb? Where do we track traces and spans?

We define operations at multiple levels in crdb for tracing and
logging. These different levels correspond to different
`Contexts`. For example, the whole of a user’s SQL session uses the
same top-level `Context`, for the benefit of sharing those client
logging tags (`client=127.0.0.1:52149,user=root`), and for appearing
in `/debug/events` as a single trace (for `net/trace`’s definition of
a trace). The integration with `/debug/events` is done by putting a
[`trace.EventLog` in the session’s `Context`](https://github.com/cockroachdb/cockroach/blob/e05f9ec8e47922c3ffb20570883da17544566953/pkg/sql/session.go#L199). (FIXME:
explain what this does. Radu says: “The most important use case for
debug/events is finding an operation that is stuck (you can’t see
those in LightStep). All net/trace does is take events and present
them in those debug/ HTTP endpoints.”)

When talking about tracing, you generally want operations to make
sense from a user’s perspective, as ultimately we hope users will make
use of this info - e.g. a SQL query is a good candidate; an internal
operation like acquiring some lease would not make sense to a user in
isolation. There are some technical considerations too: a trace should
not be too long-lived, as it generally only becomes visible when it’s
done. Individual spans (sub-operations) should also not be too long or
contain too many log messages, as messages pertaining to a span are
buffered in memory until the span is closed. Also, spans cannot be
used after they’re closed, so each span needs to clearly outlive all
the work that’s supposed to be done inside it (which can get tricky
when that work spawns goroutines which inherit the `Context` and don’t
have very clear termination points - e.g. the `DistSender` trying to
issue overlapping RPCs to multiple replicas). Another thing to keep in
mind is that spans are only collected once they are `Finish()`ed.
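Here is a minimal sketch of that last pitfall (the goroutine structure
is hypothetical; the helpers are the ones shown earlier):

    // The span carried by ctx must not be Finish()ed while goroutines
    // that inherited ctx may still log into it.
    ctx, sp := tracing.ChildSpan(ctx, "parallel work")
    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            log.VEventf(ctx, 2, "subtask %d", i) // records into sp
        }(i)
    }
    wg.Wait()              // wait for every user of sp...
    tracing.FinishSpan(sp) // ...before closing it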
At the moment, for SQL processing, an operation is a SQL
transaction. This allows us to produce traces that include automatic
transaction retries. So, a SQL session would have been too long and a
SQL statement too short. Sub-operations may be acquiring a lease,
resolving a key to a range, or all the work done by individual DistSQL
processors. So, we create a *root span* for each SQL transaction (so,
with a shorter lifetime than the long-lived SQL session). This is done
by the sql
[`Executor`](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/sql/session.go#L326). This
means that the highest-level grouping that we’ll see in LightStep for
SQL operations is a SQL transaction.

[ Side note: "is this a wise choice?" Things like data collection of
SQL result sets can span multiple transactions. Don’t we want instead
to span over a batch of SQL code received by a single message in
pgwire? The reason why we do it is that it is useful to see retries in
a single trace, so we do it at the txn level. It hasn’t been a problem
yet… When you’re using LightStep for collection (or also the net/trace
integrations), you’re relying on it to be fast and asynchronous and
maybe drop log messages when too many have been accumulated. When we
talk about other trace collection mechanisms (e.g. session tracing),
we’re assuming that the user is going to be judicious about not
tracing operations that are too massive. ]

Within a SQL transaction, there are many sub-operations. At the time
of this writing, we’re a little shy about opening child spans. We only
do it in `grpcTransport` when we’re [about to
perform](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/kv/transport.go#L173)
an RPC and on the [server side of the
RPC](https://github.com/cockroachdb/cockroach/blob/281777256787cef83e7b1a8485648010442ffe48/pkg/server/node.go#L883)
(the situation is better in DistSQL, where we also perform more
asynchronous actions, so separating logs into different spans becomes
more important).

Our `Contexts` get derived a lot more for logging and cancellation
purposes than for tracing spans. Every layer that has an
`AmbientContext` (e.g. node/store/replica) technically creates a child
`Context`, but we don’t necessarily think of these as representing
logical “sub-operations” - we’re simply trying to put more information
in the following log messages. In other cases, we do explicitly fork a
`Context` to identify a sub-operation. For example, the
`rangeDescriptorCache` identifies a sub-operation worthy of having a
log tag:

    ctx = log.WithLogTag(ctx, "range-lookup", nil)
    descs, prefetched, pErr := rdc.db.RangeLookup(ctx, metadataKey, desc, useReverseScan)

In cases like this, we should probably also be creating a child span,
so that opentracing also displays the range lookup more prominently as
a logically separate operation.
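Concretely, that might look something like the following sketch (not
code from the tree; it just combines the two helpers we’ve seen):

    // Tag the ctx for the log files *and* open a child span so the lookup
    // shows up as its own node in the trace tree.
    ctx = log.WithLogTag(ctx, "range-lookup", nil)
    ctx, sp := tracing.ChildSpan(ctx, "range lookup")
    defer tracing.FinishSpan(sp)
    descs, prefetched, pErr := rdc.db.RangeLookup(ctx, metadataKey, desc, useReverseScan)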
## FAQs

**What’s wrong with storing a `Context` in a struct?**

A couple of campaigns have been fought for removing `Context` struct
members, and instead switching to explicitly plumbing a `ctx Context`
as an argument to function calls.

Generally speaking, we don’t like storing a `Context` in a struct
because it tends to not represent accurately the intention of the
programmer. `Contexts` are always tied to operations, whereas object
instances generally are not.

Depending on who you ask, there are some exceptions and shades of gray
(case 2 is sketched in code at the end of this answer):

1. Some structs are just “bags of arguments” for some method
calls. This is a good pattern - if a method or group of methods has 30
arguments, you probably want to combine them in a struct. For example,
[`EvalContext`](https://github.com/cockroachdb/cockroach/blob/e05f9ec/pkg/sql/parser/eval.go#L1605)
is such a struct and it contains a `Context`. These structs don’t have
any lifetime - they just exist for one method call. And the `Context`
is not a special argument in this regard; it can stay in the struct
like everything else.

2. Some structs are supposed to have all their code be executed
concurrently with the instantiator and the callers of the different
methods. You create them and spawn some goroutine that calls some
method representing an event loop, and separately other components
communicate with the instance either by sending things on channels or
by calling methods on it. You may want all methods invoked on an
instance to be considered as running on behalf of the instantiator’s
operation (or, rather, you probably want everything about that
instance to be considered a sub-operation). For example, DistSQL
processors get scheduled on behalf of a SQL query, live for the
duration of that query, run concurrently with other processors and
communicate with them by method calls. These processors save a
`Context` as part of their initialization and use it throughout this
lifetime.

I guess the key here is that the lifetime of these processors is very
clearly dictated by a rigid execution model.

3. Some operations have lifetimes that don’t map in a very obvious way
to a stack frame. For example, if we consider a SQL transaction an
operation (technically, a sub-operation of a SQL session operation),
that transaction lives across potentially many interactions between
the server and the SQL client, and in fact may end in the middle of
one of these interactions (if a txn is ended in the middle of a batch
of statements). In this case, the current context is logically part of
the session’s state machine, and so we do store it in the session’s
state.

4. … 47 others

The rule of thumb is:

- if some state struct is clearly and unambiguously tied to the
  operation that the context is attached to, then it is safe to have
  the context stored in the struct; however
- when any refactoring/extension causes any of the following to
  happen, the context should probably get out of the struct ASAP:

  - **The struct’s lifetime becomes decoupled from the operation**,
    e.g. it is reused across multiple logical operations or there
    can be logical state changes while the struct is alive - then
    the **conceptual encapsulation doesn’t hold any more** and the
    programmer must start to think about when the context in the
    struct must be updated.

  - **Some API using this struct also takes a context as explicit
    argument** (e.g. because it needs to implement some interface) -
    we generally want to **avoid having to decide** whether to use
    the context given as argument or the context in the struct.

  - **The operation is divided in sub-operations via spans** for the
    purpose of logging/tracing - this is actually a combination of
    both points above.

To avoid thinking about this altogether, we generally suggest that
`Contexts` live on some stack frame whose duration is the same as the
operation we’re interested in, and we want it to be passed explicitly
as function arguments from that point downwards.
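Here is a hypothetical DistSQL-style processor illustrating case 2
above (all names made up):

    // A processor whose whole lifetime is one well-delimited operation:
    // the Context is captured at construction and used until run returns.
    type copier struct {
        ctx  context.Context // the query's ctx; valid for the flow's lifetime
        rows <-chan string
    }

    func newCopier(ctx context.Context, rows <-chan string) *copier {
        return &copier{ctx: ctx, rows: rows}
    }

    // run is the event loop, spawned in its own goroutine by the scheduler.
    func (c *copier) run() {
        for {
            select {
            case <-c.ctx.Done():
                return // the query was canceled
            case r, ok := <-c.rows:
                if !ok {
                    return // input exhausted; the operation is over
                }
                _ = r // process the row on behalf of the query's operation
            }
        }
    }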
### What about background operations?

Besides the interactive request-response processing that our servers
do (e.g. for SQL processing), there’s also background work that
happens periodically. For example, our queues do periodic
processing. The queues have an `AmbientContext` that they use to
annotate `context.Background()` in their main loop (generally inside a
`stopper.RunWorker()`).

It’s important to not put a tracing span in this context, since the
trace would be very long-lived and all the messages going to this span
would be buffered in memory. Moreover, since a span is only collected
when it’s closed, we might never collect all these messages if the
server doesn’t shut down cleanly. Of course, spans can be added to
derived contexts for more specific, shorter sub-operations -
e.g. rebalancing a particular range.

We may add a `net/trace` `EventLog` to these contexts, since the
`debug/events` endpoint is meant for long-running operations. For some
reason, we seem to have stopped doing that for our queues.

### What’s `context.Background()`? What’s `context.TODO()` and what do I do when I see one?

As we’ve seen, `Contexts` are hierarchical, which means there has
to be a root `Context`. There are, in fact, two roots provided by the
`context` package - `context.Background()` and
`context.TODO()`. These are empty `Contexts` - no values, deadline
or cancellation.

We want to use `context.Background()` as the root for most of our
other `Contexts` - both operation contexts and background-work
contexts should be derived from it.

`context.TODO()` is technically equivalent, except we (and Go
itself) use it to mark places where a proper `Context` has not yet
been plumbed to. So when you see this guy, it means that someone was
lazy when `Context`-ifying some code module and didn’t plumb a
proper operation context. We don’t want to use this for new non-test
code.
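A minimal, runnable sketch of the distinction (`doWork` is a stand-in
for properly plumbed code):

    package main

    import "context"

    // A proper entry point: derive the operation's context from Background.
    func startOperation() {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()
        doWork(ctx)
    }

    // Not yet Context-ified: TODO marks the unplumbed call site so that
    // it's easy to find and fix later.
    func legacyEntryPoint() {
        doWork(context.TODO())
    }

    func doWork(ctx context.Context) { _ = ctx /* ... */ }

    func main() {
        startOperation()
        legacyEntryPoint()
    }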
In tests we do use `context.TODO()` for the tests’ operations. We
also sometimes use `context.Background()`; no pattern has emerged
yet. Ideally each test would create one (or more) `Contexts`
annotated with the test’s name and perhaps a span. But there
currently is little benefit in doing that. I (Andrei) had a WIP a
while ago for collecting traces from tests and dumping them into a
file that can then be read and displayed by AppDash. (FIXME(andrei)
add link) It was really cool but I didn’t clean it up. If we did
this, then using proper `Contexts` in tests would be beneficial.

FIXME: integrate side note: “Nathan: I just opened
https://github.com/cockroachdb/cockroach/pull/14128 to address this
first point. We should be using context.Background in tests, never
context.TODO. Andrei: well it’s not technically Background() work. In
fact, you can argue that Background() would be even worse, as it’d
combine the test’s operations with background server activity.”

## Alternatives

An alternative that has been suggested to passing a `Context` around
to most functions in the codebase is adding some type of thread-local
storage to the goroutines. This would involve augmenting the runtime
(which can be done by modifying the goroutine struct inside the Go
sources and adding an accessor method in the runtime library for
applications to use), or getting access to a goroutine id and
maintaining a map in userspace from id → storage. A possible advantage
would be less typing and less runtime overhead (from copying the
`Context` on all the stack frames). These ideas have not gained much
traction, for several reasons:

1. they’d involve some “magic” that is not yet vetted by the Go team
2. we’d need to maintain some non-standard Go stuff
3. `Contexts` are becoming more and more universal in the Go
ecosystem - e.g. in our dependencies and the standard library. Doing
something different in CockroachDB might make interfacing with
external code harder. (FIXME: clarify/strengthen argument)
4. avoiding hidden state. When a context is passed through as an arg,
it’s usually easy to trace where a problematic context is coming
from. With goroutine-local storage, the context is the result of
whichever code last set it; it may be much harder to figure out
which code that is.
5. the alternative solution is not trivial when a function needs to
set up a “child” context for another function it calls - we would need
some infrastructure to save the current context on the stack and
restore it before the function returns. Forgetting to restore it could
lead to hard-to-debug issues.
6. there are open questions about what to do when spawning a
goroutine. Simply inheriting the context is problematic when contexts
contain spans - the lifetime of the span is usually up to the initial
goroutine (which `Finish()`es the span), so (in general) two
goroutines can’t share the same span. That’s why we have things like
`RunAsyncTask`/`ForkCtxSpan`. It’s not clear either what we do with
goroutines that, in their lifetime, service many different
“operations”.

## Future directions

We’d also like to embed a graphical viewer of distributed traces in
CockroachDB - one open-source option is AppDash.