github.com/onflow/flow-go@v0.35.7-crescendo-preview.23-atree-inlining/CodingConventions.md (about) 1 # Coding Style Guidelines for Flow Core Protocol 2 3 Below, we discuss code-style conventions for Flow's core-protocol implementation. They are _guidelines_ with the goal to 4 increase readability, uniformity, and maintainability of the code base. 5 6 We believe that sticking to these guidelines will increase code quality and reduce bugs in many situations. In some 7 cases, you will find that applying our guidelines does not provide an immediate benefit for the specific code you are 8 working on. Nevertheless, uniform conventions and sticking to established design patterns often makes it easier for 9 others to work with your code. 10 11 #### They are guidelines, not a law 12 13 Coding guidelines are tools: they are designed to be widely useful but generally have limitations as well. If you find 14 yourself in a situation where you feel that sticking to a guideline would have significant negative effects on a piece 15 of code 16 (e.g. bloats it significantly, makes it highly error-prone) you have the freedom to break with the conventions. In this 17 case, please include a comment in the code with a brief motivation for your decision. This will help reviewers and 18 prevent others from going through the same learning process that led to your decision to break with the conventions. 19 20 21 ## Motivation and Context 22 ### A Flow Node 23 24 On a high level, a Flow node consists of various data-processing components that receive information from node-internal 25 and external sources. We model the data flow inside a Flow node as a graph, aka the **data flow graph**: 26 * a data-processing component forms a vertex; 27 * two components exchanging messages is represented as an edge. 28 29 Generally, an individual vertex (data processing component) has a dedicated pool of go-routines for processing its work. 30 Inbound data packages are appended to queue(s), and the vertex's internal worker threads process the inbound messages from the queue(s). 31 In this design, there is an upper limit to the CPU resources an individual vertex can consume. A vertex with `k` go-routines as workers can 32 at most keep `k` cores busy. The benefit of this design is that the node can remain responsive even if one (or a few) of 33 its data processing vertices are completely overwhelmed. 34 35 In its entirety, a node is composed of many data flow vertices, that together form the node's data flow graph. The vertices 36 within the node typically trust each other; while external messages are untrusted. Depending on the inputs it consumes, a vertex 37 must carefully differentiate between trusted and untrusted inputs. 38 39 ⇒ **A vertex (data processing component within a node) implements the [`Component` interface](https://github.com/onflow/flow-go/blob/57f89d4e96259f08fe84163c91ecd32484401b45/module/component/component.go#L22)**. 40 41 For example, [`VoteAggregator`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L29) is 42 a data flow vertex within a consensus nodes, whose job it is to collect incoming votes from other consensus nodes. 43 Hence, the corresponding function [`AddVote(vote *model.Vote)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L37) 44 treats all inputs as untrusted. Furthermore, `VoteAggregator` consumes auxiliary inputs 45 ([`AddBlock(block *model.Proposal)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L45), 46 [`InvalidBlock(block *model.Proposal)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L51), 47 [`PruneUpToView(view uint64)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L57)) 48 from the node's main consensus logic, which is trusted. 49 50 51 ### External Byzantine messages result in benign sentinel errors 52 53 Any input that is coming from another node is untrusted and the vertex processing should gracefully handle any arbitrarily malicious input. 54 Under no circumstances should a byzantine input result in the vertex's internal state being corrupted. 55 Per convention, failure case due to a byzantine inputs are represented by specifically-typed 56 [sentinel errors](https://pkg.go.dev/errors#New) (for further details, see error handling section). 57 58 Byzantine inputs are one particular case, where a function returns a 'benign error'. The critical property for an error 59 to be benign is that the component returning it is still fully functional, despite encountering the error condition. All 60 benign errors should be handled within the vertex. As part of implementing error handling, developers must consider which 61 error conditions are benign in the context of the current component. 62 63 ### Invalid internal messages result in exceptions 64 65 Any input that is coming from another internal component is trusted by default. 66 If the internal message is malformed, this should be considered a critical exception. 67 68 **Exception:** some internal components forward external messages without verification. 69 For example, the `sync` engine forwards `Block` messages to the `compliance` engine without verifying that the block is valid. 70 In this case, validity of the forwarded `Block` is not part of the contract between the `sync` and `compliance` engines, 71 so an invalid forwarded `Block` should not be considered an invalid internal message. 72 73 ## Error handling 74 75 This is a highly-compressed summary 76 of [Error Handling in Flow (Notion)](https://www.notion.so/dapperlabs/Error-Handling-in-Flow-a7692d60681c4063b48851c0dbc6ed10) 77 78 A blockchain is a security-first system. Therefore, encountering a problem and continuing on a "best effort" basis is 79 *not* an option. There is a clear definition of the happy path (i.e. the block must be in storage). Any deviation of this 80 happy path is either 81 82 1. a protocol violation (from a different node), which itself is clearly defined 83 2. is an uncovered edge case (which could be exploited in the worst case to compromise the system) 84 3. a corrupted internal state of the node 85 86 ### Best Practice Guidelines 87 88 1. **Errors are part of your API:** If an error is returned by a function, the error's meaning is clearly documented in the function's GoDoc. We conceptually differentiate between the following two classes of errors: 89 90 1. _benign error: the component returning the error is still fully functional despite the error_. 91 Benign failure cases are represented as typed sentinel errors 92 ([basic errors](https://pkg.go.dev/errors#New) and [higher-level errors](https://dev.to/tigorlazuardi/go-creating-custom-error-wrapper-and-do-proper-error-equality-check-11k7)), 93 so we can do type checks. 94 2. _exception: the error is a potential symptom of internal state corruption_. 95 For example, a failed sanity check. In this case, the error is most likely fatal. 96 <br /><br /> 97 98 * Documentation of error returns should be part of every interface. We also encourage to copy the higher-level interface documentation to every implementation 99 and possibly extend it with implementation-specific comments. In particular, it simplifies maintenance work on the implementation, when the API-level contracts 100 are directly documented above the implementation. 101 * Adding a new sentinel often means that the higher-level logic has to gracefully handle an additional error case, potentially triggering slashing etc. 102 Therefore, changing the set of specified sentinel errors is generally considered a breaking API change. 103 104 105 2. **All errors beyond the specified, benign sentinel errors are considered unexpected failures, i.e. a symptom of potential state corruption.** 106 * We employ a fundamental principle of [High Assurance Software Engineering](https://www.researchgate.net/publication/228563190_High_Assurance_Software_Development), 107 where we treat everything beyond the known benign errors as critical failures. In unexpected failure cases, we assume that the vertex's in-memory state has been 108 broken and proper functioning is no longer guaranteed. The only safe route of recovery is to restart the vertex from a previously persisted, safe state. 109 Per convention, a vertex should throw any unexpected exceptions using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go). 110 * Many components in our BFT system can return benign errors (type (i)) and irrecoverable exceptions (type (ii)) 111 112 3. **Whether a particular error is benign or an exception depends on the caller's context. Errors _cannot_ be categorized as benign or exception based on their type alone.** 113 114 ![Error Handling](/docs/ErrorHandling.png) 115 116 * For example, consider `storage.ErrNotFound` that could be returned by the storage lager, when querying a block by ID 117 (method [`Headers.ByBlockID(flow.Identifier)`](https://github.com/onflow/flow-go/blob/a918616c7b541b772c254e7eaaae3573561e6c0a/storage/headers.go#L15-L18)). 118 In many cases, `storage.ErrNotFound` is expected, for instance if a node is receiving a new block proposal and checks whether the parent has already been ingested 119 or needs to be requested from a different node. In contrast, if we are querying a block that we know is already finalized and the storage returns a `storage.ErrNotFound` 120 something is badly broken. 121 * Use the special `irrecoverable.exception` [error type](https://github.com/onflow/flow-go/blob/master/module/irrecoverable/exception.go#L7-L26) 122 to denote an unexpected error (and strip any sentinel information from the error stack). 123 124 This is for any scenario when a higher-level function is interpreting a sentinel returned from a lower-level function as an exception. 125 To construct an example, lets look at our `storage.Blocks` API, which has a [`ByHeight` method](https://github.com/onflow/flow-go/blob/a918616c7b541b772c254e7eaaae3573561e6c0a/storage/blocks.go#L24-L26) 126 to retrieve _finalized_ blocks by height. The following could be a hypothetical implementation: 127 ```golang 128 // ByHeight retrieves the finalized block for the given height. 129 // From the perspective of the storage layer, the following errors are benign: 130 // - storage.ErrNotFound if no finalized block at the given height is known 131 func ByHeight(height uint64) (*flow.Block, error) { 132 // Step 1: retrieve the ID of the finalized block for the given height. We expect 133 // `storage.ErrNotFound` during normal operations, if no block at height has been 134 // finalized. We just bubble this sentinel up, as it already has the expected type. 135 blockID, err := retrieveBlockIdByHeight(height) 136 if err != nil { 137 return nil, fmt.Errorf("could not query block by height: %w", err) 138 } 139 140 // Step 2: retrieve full block by ID. Function `retrieveBlockByID` returns 141 // `storage.ErrNotFound` in case no block with the given ID is known. In other parts 142 // of the code that also use `retrieveBlockByID`, this would be expected during normal 143 // operations. However, here we are querying a block, which the storage layer has 144 // already indexed. Failure to retrieve the block implies our storage is corrupted. 145 block, err := retrieveBlockByID(blockID) 146 if err != nil { 147 // We cannot bubble up `storage.ErrNotFound` as this would hide this irrecoverable 148 // failure behind a benign sentinel error. We use the `Exception` wrapper, which 149 // also implements the error `interface` but provides no `Unwrap` method. Thereby, 150 // the `err`s sentinel type is hidden from upstream type checks, and consequently 151 // classified as unexpected, i.e. irrecoverable exceptions. 152 return nil, irrecoverable.NewExceptionf("storage inconsistency, failed to" + 153 "retrieve full block for indexed and finalized block %x: %w", blockID, err) 154 } 155 return block, nil 156 } 157 ``` 158 Functions **may** use `irrecoverable.NewExceptionf` when: 159 - they are interpreting any error returning from a 3rd party module as unexpected 160 - they are reacting to an unexpected condition internal to their stack frame and returning a generic error 161 162 Functions **must** usd `irrecoverable.NewExceptionf` when: 163 - they are interpreting any documented sentinel error returned from a flow-go module as unexpected 164 165 For brief illustration, let us consider some function body, in which there are multiple subsequent calls to other lower-level functions. 166 In most scenarios, a particular sentinel type is either always or never expected during normal operations. If it is expected, 167 then the sentinel type should be documented. If it is consistently not expected, the error should _not_ be mentioned in the 168 function's godoc. In the absence of positive affirmation that `error` is an expected and benign sentinel, the error is to be 169 treated as an irrecoverable exception. So if a sentinel type `T` is consistently not expected throughout the function's body, make 170 sure the sentinel `T` is not mentioned in the function's godoc. The latter is fully sufficient to classify `T` as an irrecoverable 171 exception. 172 173 174 5. _Optional Simplification for components that solely return benign errors._ 175 * In this case, you _can_ use untyped errors to represent benign error cases (e.g. using `fmt.Errorf`). 176 * By using untyped errors, the code would be _breaking with our best practice guideline_ that benign errors should be represented as typed sentinel errors. 177 Therefore, whenever all returned errors are benign, please clearly document this _for each public functions individually_. 178 For example, a statement like the following would be sufficient: 179 ```golang 180 // This function errors if XYZ was not successful. All returned errors are 181 // benign, as the function is side-effect free and failures are simply a no-op. 182 ``` 183 184 185 ### Hands-on suggestions 186 187 * Avoid generic errors, such as 188 ```golang 189 return fmt.Errorf("x failed") 190 ``` 191 * Use [sentinel errors](https://pkg.go.dev/errors#New): 192 ```golang 193 ErrXFailed := errors.New("x failed") 194 195 // foo does abc. 196 // Expected error returns during normal operations: 197 // * ErrXFailed: if x failed 198 func foo() err { 199 ... 200 return fmt.Errorf("details about failure: %w", ErrXFailed) 201 } 202 ``` 203 - If some operation _can_ fail, create specific sentinel errors. 204 - **Document the types of _all_ errors that are expected during normal operation in goDoc**. This enables the 205 calling code to specifically check for these errors, and decide how to handle them. 206 * Errors are bubbled up the call stack and wrapped to create a meaningful trace: 207 ```golang 208 // bar does ... 209 // Expected error returns during normal operations: 210 // * XFailedErr: if x failed 211 func bar() err { 212 ... 213 err := foo() 214 if err != nil { 215 return fmt.Errorf("failed to do foo: %w", err) 216 } 217 ... 218 } 219 ``` 220 * Handle errors at a level, where you still have enough context to decide whether the error is expected during normal 221 operations. 222 * Errors of unexpected types are indicators that the node's internal state might be corrupted. 223 224 225 ### Anti-Pattern 226 227 Continuing on a best-effort basis is not an option, i.e. the following is an anti-pattern in the context of Flow: 228 229 ```golang 230 err := foo() 231 if err != nil { 232 log.Error().Err(err).Msg("foo failed") 233 return 234 } 235 ``` 236 237 There are _rare_ instances, where this might be acceptable. For example, when attempting to send a message via the 238 network to another node and the networking layer errored. In this case, please include a comment with an explanation, 239 why it is acceptable to keep going, even if the other component returned an error. For this example, it is acceptable to handle the error by logging because: 240 * we expect transient errors from this component (networking layer) during normal operation 241 * the networking layer uses a 3rd party library, which does not expose specific and exhaustive sentinel errors for expected failure conditions 242 * either it is not critical that the message is successfully sent at all, or it will be retried later on 243 244 ### Prioritize Safety Over Liveness 245 246 **Ideally, a vertex should restart** (from a known good state), **when it encounters an unexpected error.** 247 Per convention, a vertex should throw any unexpected exceptions using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go). 248 249 <!-- TODO(alex) expand this section --> 250 251 If this is out of scope, we _prioritize safety over liveness_. This means that we rather crash the node than continue on a best-effort basis. 252 When in doubt, use the following as a fall-back: 253 ```golang 254 err := foo() 255 if errors.Is(err, XFailedErr) { 256 // expected error 257 return 258 } 259 if err != nil { 260 log.Fatal().Err(err).Msg("unexpected internal error") 261 return 262 } 263 ``` 264 265 The error-handling guidelines are set up such that they minimize technical debt, even if we don't support component 266 restarts right away. 267 268 * The code is already differentiating expected from unexpected errors in the core business logic and we documented them 269 accordingly. 270 * The code handles the expected errors at the appropriate level, only propagating unexpected errors. 271 272 As a first step, you can crash the node when an unexpected error hits the `Engine` level (for more details, see `Engine` 273 section of this doc). If you have followed the error handling guidelines, leaving the restart logic for the component as 274 technical debt for later hopefully does not add much overhead. In contrast, if you log errors and continue on a 275 best-effort basis, you also leave the _entire_ logic for differentiating errors as tech debt. When adding this logic 276 later, you need to revisit the _entire_ business logic again. 277 278 279 # Appendix 280 281 ## Engines (interface deprecated) 282 We model the data flow inside a node as a 'data flow graph'. An `Engine` is the deprecated interface for a vertex within the node's data flow graph. 283 New vertices should implement the [`Component` interface](https://github.com/onflow/flow-go/blob/57f89d4e96259f08fe84163c91ecd32484401b45/module/component/component.go#L22) 284 and throw unexpected errors using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go). 285 286 See [related tech debt issue](https://github.com/dapperlabs/flow-go/issues/6361) for more information on the deprecation. 287 288 Generally, we consider node-internal components as _trusted_ and external nodes as untrusted data sources. 289 The `Engine` API differentiates between trusted and untrusted inputs: 290 291 * **Trusted inputs** from other components within the same node are handed to the `Engine` through 292 ```golang 293 // SubmitLocal submits an event originating on the local node. 294 SubmitLocal(event interface{}) 295 296 // ProcessLocal processes an event originating on the local node. 297 ProcessLocal(event interface{}) error 298 ``` 299 * **Untrusted inputs** from other nodes are handed to the `Engine` via 300 ```golang 301 // Submit submits the given event from the node with the given origin ID 302 // for processing in a non-blocking manner. It returns instantly and logs 303 // a potential processing error internally when done. 304 Submit(channel Channel, originID flow.Identifier, event interface{}) 305 306 // Process processes the given event from the node with the given origin ID 307 // in a blocking manner. It returns the potential processing error when 308 // done. 309 Process(channel Channel, originID flow.Identifier, event interface{}) error 310 ``` 311 312 Furthermore, engines offer synchronous and asynchronous processing of their inputs: 313 314 * Methods `engine.ProcessLocal` and `engine.Process` run their logic in the caller's routine. Therefore, the core 315 business logic can return errors, which we should capture and propagate. 316 * Methods `engine.SubmitLocal` and `engine.Submit` run the business logic asynchronously. (no error return) 317 * Note: computationally very cheap and predominantly non-blocking computation should _still_ be executed in the 318 calling routine. This is beneficial to avoid a cascade of go-routine launches when stacking multiple engines. 319 320 ### Guidelines 321 322 * In the error handling within an `Engine`, differentiate between trusted and untrusted inputs. For example, most 323 engines expect specific input types. Inputs with incompatible type should result in a specific sentinel error ( 324 e.g. `InvalidMessageType`). 325 For messages that come from the network (methods `Submit` and `Process`), we expect invalid types. But if we get an 326 invalid message type from a trusted local engine: 327 (methods `SubmitLocal` and `ProcessLocal`), something is broken (critical failure). 328 ```golang 329 func (e *engine) process(event interface{}) error { 330 switch v := event.(type) { 331 ... 332 default: 333 return fmt.Errorf("invalid input type %T: %w", event, InvalidMessageType) 334 } 335 } 336 337 func (e *engine) Process(chan network.Channel, originID flow.Identifier, event interface{}) error { 338 err := e.process(event) 339 if err != nil { 340 if errors.Is(err, InvalidMessageType) { 341 // this is EXPECTED during normal operations 342 } 343 } 344 } 345 346 func (e *engine) ProcessLocal(event interface{}) { 347 err := e.process(event) 348 if err != nil { 349 if errors.Is(err, InvalidMessageType) { 350 // this is a CRITICAL BUG 351 } 352 } 353 } 354 ``` 355 * avoid swallowing errors. The following is generally _not_ acceptable 356 ```golang 357 func (e *engine) Submit(event interface{}) error { 358 e.unit.Launch(func() { 359 err := e.process(event) 360 if err != nil { 361 engine.LogError(e.log, err) 362 } 363 }) 364 } 365 ``` 366 Instead, implement: 367 ```golang 368 func (e *engine) Submit(chan network.Channel, originID flow.Identifier, event interface{}) { 369 e.unit.Launch(func() { 370 err := e.process(event) 371 if errors.Is(err, InvalidMessageType) { 372 // invalid input: ignore or slash 373 return 374 } 375 if err != nil { 376 // unexpected input: for now we prioritize safety over liveness and just crash 377 // TODO: restart engine from known good state 378 e.log.Fatal().Err(err).Msg("unexpected internal error") 379 } 380 }) 381 } 382 ```