github.com/onflow/flow-go@v0.35.7-crescendo-preview.23-atree-inlining/CodingConventions.md

github.com/onflow/flow-go@v0.35.7-crescendo-preview.23-atree-inlining/CodingConventions.md (about)

     1  # Coding Style Guidelines for Flow Core Protocol
     2  
     3  Below, we discuss code-style conventions for Flow's core-protocol implementation. They are _guidelines_ with the goal to
     4  increase readability, uniformity, and maintainability of the code base.
     5  
     6  We believe that sticking to these guidelines will increase code quality and reduce bugs in many situations. In some
     7  cases, you will find that applying our guidelines does not provide an immediate benefit for the specific code you are
     8  working on. Nevertheless, uniform conventions and sticking to established design patterns often makes it easier for
     9  others to work with your code.
    10  
    11  #### They are guidelines, not a law
    12  
    13  Coding guidelines are tools: they are designed to be widely useful but generally have limitations as well. If you find
    14  yourself in a situation where you feel that sticking to a guideline would have significant negative effects on a piece
    15  of code
    16  (e.g. bloats it significantly, makes it highly error-prone) you have the freedom to break with the conventions. In this
    17  case, please include a comment in the code with a brief motivation for your decision. This will help reviewers and
    18  prevent others from going through the same learning process that led to your decision to break with the conventions.
    19  
    20  
    21  ## Motivation and Context
    22  ### A Flow Node
    23  
    24  On a high level, a Flow node consists of various data-processing components that receive information from node-internal
    25  and external sources. We model the data flow inside a Flow node as a graph, aka the **data flow graph**:
    26  * a data-processing component forms a vertex;
    27  * two components exchanging messages is represented as an edge.
    28  
    29  Generally, an individual vertex (data processing component) has a dedicated pool of go-routines for processing its work.
    30  Inbound data packages are appended to queue(s), and the vertex's internal worker threads process the inbound messages from the queue(s).
    31  In this design, there is an upper limit to the CPU resources an individual vertex can consume. A vertex with `k` go-routines as workers can
    32  at most keep `k` cores busy. The benefit of this design is that the node can remain responsive even if one (or a few) of
    33  its data processing vertices are completely overwhelmed.
    34  
    35  In its entirety, a node is composed of many data flow vertices, that together form the node's data flow graph. The vertices
    36  within the node typically trust each other; while external messages are untrusted. Depending on the inputs it consumes, a vertex
    37  must carefully differentiate between trusted and untrusted inputs.
    38  
    39  ⇒ **A vertex (data processing component within a node) implements the [`Component` interface](https://github.com/onflow/flow-go/blob/57f89d4e96259f08fe84163c91ecd32484401b45/module/component/component.go#L22)**.
    40  
    41  For example, [`VoteAggregator`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L29) is
    42  a data flow vertex within a consensus nodes, whose job it is to collect incoming votes from other consensus nodes.
    43  Hence, the corresponding function [`AddVote(vote *model.Vote)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L37)
    44  treats all inputs as untrusted. Furthermore, `VoteAggregator` consumes auxiliary inputs
    45  ([`AddBlock(block *model.Proposal)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L45),
    46  [`InvalidBlock(block *model.Proposal)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L51),
    47  [`PruneUpToView(view uint64)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L57))
    48  from the node's main consensus logic, which is trusted.
    49  
    50  
    51  ### External Byzantine messages result in benign sentinel errors
    52  
    53  Any input that is coming from another node is untrusted and the vertex processing should gracefully handle any arbitrarily malicious input.
    54  Under no circumstances should a byzantine input result in the vertex's internal state being corrupted.
    55  Per convention, failure case due to a byzantine inputs are represented by specifically-typed
    56  [sentinel errors](https://pkg.go.dev/errors#New) (for further details, see error handling section).
    57  
    58  Byzantine inputs are one particular case, where a function returns a 'benign error'. The critical property for an error
    59  to be benign is that the component returning it is still fully functional, despite encountering the error condition. All
    60  benign errors should be handled within the vertex. As part of implementing error handling, developers must consider which
    61  error conditions are benign in the context of the current component.
    62  
    63  ### Invalid internal messages result in exceptions
    64  
    65  Any input that is coming from another internal component is trusted by default. 
    66  If the internal message is malformed, this should be considered a critical exception.
    67  
    68  **Exception:** some internal components forward external messages without verification. 
    69  For example, the `sync` engine forwards `Block` messages to the `compliance` engine without verifying that the block is valid.
    70  In this case, validity of the forwarded `Block` is not part of the contract between the `sync` and `compliance` engines,
    71  so an invalid forwarded `Block` should not be considered an invalid internal message.
    72  
    73  ## Error handling
    74  
    75  This is a highly-compressed summary
    76  of [Error Handling in Flow (Notion)](https://www.notion.so/dapperlabs/Error-Handling-in-Flow-a7692d60681c4063b48851c0dbc6ed10)
    77  
    78  A blockchain is a security-first system. Therefore, encountering a problem and continuing on a "best effort" basis is
    79  *not* an option. There is a clear definition of the happy path (i.e. the block must be in storage). Any deviation of this
    80  happy path is either
    81  
    82  1. a protocol violation (from a different node), which itself is clearly defined
    83  2. is an uncovered edge case (which could be exploited in the worst case to compromise the system)
    84  3. a corrupted internal state of the node
    85  
    86  ### Best Practice Guidelines
    87  
    88  1. **Errors are part of your API:** If an error is returned by a function, the error's meaning is clearly documented in the function's GoDoc. We conceptually differentiate between the following two classes of errors:
    89  
    90     1. _benign error: the component returning the error is still fully functional despite the error_.  
    91         Benign failure cases are represented as typed sentinel errors
    92         ([basic errors](https://pkg.go.dev/errors#New) and [higher-level errors](https://dev.to/tigorlazuardi/go-creating-custom-error-wrapper-and-do-proper-error-equality-check-11k7)),
    93         so we can do type checks.  
    94     2. _exception: the error is a potential symptom of internal state corruption_.  
    95        For example, a failed sanity check. In this case, the error is most likely fatal.
    96        <br /><br />
    97  
    98     * Documentation of error returns should be part of every interface. We also encourage to copy the higher-level interface documentation to every implementation
    99       and possibly extend it with implementation-specific comments. In particular, it simplifies maintenance work on the implementation, when the API-level contracts
   100       are directly documented above the implementation.
   101     * Adding a new sentinel often means that the higher-level logic has to gracefully handle an additional error case, potentially triggering slashing etc.
   102       Therefore, changing the set of specified sentinel errors is generally considered a breaking API change. 
   103  
   104  
   105  2. **All errors beyond the specified, benign sentinel errors are considered unexpected failures, i.e. a symptom of potential state corruption.**       
   106     * We employ a fundamental principle of [High Assurance Software Engineering](https://www.researchgate.net/publication/228563190_High_Assurance_Software_Development),
   107  where we treat everything beyond the known benign errors as critical failures. In unexpected failure cases, we assume that the vertex's in-memory state has been
   108  broken and proper functioning is no longer guaranteed. The only safe route of recovery is to restart the vertex from a previously persisted, safe state.
   109  Per convention, a vertex should throw any unexpected exceptions using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go).
   110     * Many components in our BFT system can return benign errors (type (i)) and irrecoverable exceptions (type (ii))
   111  
   112  3. **Whether a particular error is benign or an exception depends on the caller's context. Errors _cannot_ be categorized as benign or exception based on their type alone.**
   113  
   114     ![Error Handling](/docs/ErrorHandling.png)
   115  
   116     * For example, consider `storage.ErrNotFound` that could be returned by the storage lager, when querying a block by ID
   117     (method [`Headers.ByBlockID(flow.Identifier)`](https://github.com/onflow/flow-go/blob/a918616c7b541b772c254e7eaaae3573561e6c0a/storage/headers.go#L15-L18)).
   118     In many cases, `storage.ErrNotFound` is expected, for instance if a node is receiving a new block proposal and checks whether the parent has already been ingested
   119     or needs to be requested from a different node. In contrast, if we are querying a block that we know is already finalized and the storage returns a `storage.ErrNotFound`
   120     something is badly broken. 
   121     * Use the special `irrecoverable.exception` [error type](https://github.com/onflow/flow-go/blob/master/module/irrecoverable/exception.go#L7-L26)
   122       to denote an unexpected error (and strip any sentinel information from the error stack).
   123     
   124       This is for any scenario when a higher-level function is interpreting a sentinel returned from a lower-level function as an exception.
   125       To construct an example, lets look at our `storage.Blocks` API, which has a [`ByHeight` method](https://github.com/onflow/flow-go/blob/a918616c7b541b772c254e7eaaae3573561e6c0a/storage/blocks.go#L24-L26)
   126       to retrieve _finalized_ blocks by height. The following could be a hypothetical implementation:
   127        ```golang
   128        // ByHeight retrieves the finalized block for the given height.
   129        // From the perspective of the storage layer, the following errors are benign:
   130        //   - storage.ErrNotFound if no finalized block at the given height is known
   131        func ByHeight(height uint64) (*flow.Block, error) {
   132            // Step 1: retrieve the ID of the finalized block for the given height. We expect
   133            // `storage.ErrNotFound` during normal operations, if no block at height has been
   134            // finalized. We just bubble this sentinel up, as it already has the expected type.
   135            blockID, err := retrieveBlockIdByHeight(height)
   136            if err != nil {
   137                return nil, fmt.Errorf("could not query block by height: %w", err)
   138            }
   139  
   140            // Step 2: retrieve full block by ID. Function `retrieveBlockByID` returns
   141            // `storage.ErrNotFound` in case no block with the given ID is known. In other parts
   142            // of the code that also use `retrieveBlockByID`, this would be expected during normal
   143            // operations. However, here we are querying a block, which the storage layer has
   144            // already indexed. Failure to retrieve the block implies our storage is corrupted.
   145            block, err := retrieveBlockByID(blockID)
   146            if err != nil {
   147                // We cannot bubble up `storage.ErrNotFound` as this would hide this irrecoverable
   148                // failure behind a benign sentinel error. We use the `Exception` wrapper, which
   149                // also implements the error `interface` but provides no `Unwrap` method. Thereby,
   150                // the `err`s sentinel type is hidden from upstream type checks, and consequently 
   151                // classified as unexpected, i.e. irrecoverable exceptions. 
   152                return nil, irrecoverable.NewExceptionf("storage inconsistency, failed to" +
   153                          "retrieve full block for indexed and finalized block %x: %w", blockID, err)
   154            }
   155            return block, nil
   156        }
   157        ```
   158       Functions **may** use `irrecoverable.NewExceptionf` when:
   159         - they are interpreting any error returning from a 3rd party module as unexpected
   160         - they are reacting to an unexpected condition internal to their stack frame and returning a generic error
   161    
   162       Functions **must** usd `irrecoverable.NewExceptionf` when:
   163         - they are interpreting any documented sentinel error returned from a flow-go module as unexpected
   164  
   165       For brief illustration, let us consider some function body, in which there are multiple subsequent calls to other lower-level functions.
   166       In most scenarios, a particular sentinel type is either always or never expected during normal operations.  If it is expected,
   167       then the sentinel type should be documented. If it is consistently not expected, the error should _not_ be mentioned in the 
   168       function's godoc. In the absence of positive affirmation that `error` is an expected and benign sentinel, the error is to be
   169       treated as an irrecoverable exception. So if a sentinel type `T` is consistently not expected throughout the function's body, make 
   170       sure the sentinel `T` is not mentioned in the function's godoc. The latter is fully sufficient to classify `T` as an irrecoverable
   171       exception.   
   172  
   173  
   174  5. _Optional Simplification for components that solely return benign errors._
   175     * In this case, you _can_ use untyped errors to represent benign error cases (e.g. using `fmt.Errorf`).
   176     * By using untyped errors, the code would be _breaking with our best practice guideline_ that benign errors should be represented as typed sentinel errors.
   177  Therefore, whenever all returned errors are benign, please clearly document this _for each public functions individually_.
   178  For example, a statement like the following would be sufficient:
   179        ```golang
   180        // This function errors if XYZ was not successful. All returned errors are 
   181        // benign, as the function is side-effect free and failures are simply a no-op.  
   182        ```
   183  
   184  
   185  ### Hands-on suggestions
   186  
   187  * Avoid generic errors, such as
   188    ```golang
   189    return fmt.Errorf("x failed")
   190    ```
   191  * Use [sentinel errors](https://pkg.go.dev/errors#New):
   192    ```golang
   193    ErrXFailed := errors.New("x failed")
   194  
   195    // foo does abc.
   196    // Expected error returns during normal operations:
   197    //  * ErrXFailed: if x failed
   198    func foo() err {
   199       ...
   200       return fmt.Errorf("details about failure: %w", ErrXFailed)
   201    }
   202    ```
   203      - If some operation _can_ fail, create specific sentinel errors.
   204      - **Document the types of _all_ errors that are expected during normal operation in goDoc**. This enables the
   205        calling code to specifically check for these errors, and decide how to handle them.
   206  * Errors are bubbled up the call stack and wrapped to create a meaningful trace:
   207    ```golang
   208    // bar does ...
   209    // Expected error returns during normal operations:
   210    //  * XFailedErr: if x failed
   211    func bar() err {
   212       ...
   213       err := foo()
   214       if err != nil {
   215          return fmt.Errorf("failed to do foo: %w", err)
   216       }
   217       ...
   218    }
   219    ```
   220  * Handle errors at a level, where you still have enough context to decide whether the error is expected during normal
   221    operations.
   222  * Errors of unexpected types are indicators that the node's internal state might be corrupted.
   223  
   224  
   225  ### Anti-Pattern
   226  
   227  Continuing on a best-effort basis is not an option, i.e. the following is an anti-pattern in the context of Flow:
   228  
   229  ```golang
   230  err := foo()
   231  if err != nil {
   232     log.Error().Err(err).Msg("foo failed")
   233     return
   234  }
   235  ```
   236  
   237  There are _rare_ instances, where this might be acceptable. For example, when attempting to send a message via the
   238  network to another node and the networking layer errored. In this case, please include a comment with an explanation,
   239  why it is acceptable to keep going, even if the other component returned an error. For this example, it is acceptable to handle the error by logging because:
   240  * we expect transient errors from this component (networking layer) during normal operation
   241  * the networking layer uses a 3rd party library, which does not expose specific and exhaustive sentinel errors for expected failure conditions
   242  * either it is not critical that the message is successfully sent at all, or it will be retried later on
   243  
   244  ### Prioritize Safety Over Liveness
   245  
   246  **Ideally, a vertex should restart** (from a known good state), **when it encounters an unexpected error.**
   247  Per convention, a vertex should throw any unexpected exceptions using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go).
   248  
   249  <!-- TODO(alex) expand this section -->
   250  
   251  If this is out of scope, we _prioritize safety over liveness_. This means that we rather crash the node than continue on a best-effort basis.
   252  When in doubt, use the following as a fall-back:
   253  ```golang
   254  err := foo()
   255  if errors.Is(err, XFailedErr) {
   256     // expected error
   257     return
   258  }
   259  if err != nil {
   260     log.Fatal().Err(err).Msg("unexpected internal error")
   261     return
   262  }
   263  ```
   264  
   265  The error-handling guidelines are set up such that they minimize technical debt, even if we don't support component
   266  restarts right away.
   267  
   268  * The code is already differentiating expected from unexpected errors in the core business logic and we documented them
   269    accordingly.
   270  * The code handles the expected errors at the appropriate level, only propagating unexpected errors.
   271  
   272  As a first step, you can crash the node when an unexpected error hits the `Engine` level (for more details, see `Engine`
   273  section of this doc). If you have followed the error handling guidelines, leaving the restart logic for the component as
   274  technical debt for later hopefully does not add much overhead. In contrast, if you log errors and continue on a
   275  best-effort basis, you also leave the _entire_ logic for differentiating errors as tech debt. When adding this logic
   276  later, you need to revisit the _entire_ business logic again.
   277  
   278  
   279  # Appendix
   280  
   281  ## Engines (interface deprecated)
   282  We model the data flow inside a node as a 'data flow graph'. An `Engine` is the deprecated interface for a vertex within the node's data flow graph. 
   283  New vertices should implement the [`Component` interface](https://github.com/onflow/flow-go/blob/57f89d4e96259f08fe84163c91ecd32484401b45/module/component/component.go#L22)
   284  and throw unexpected errors using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go).
   285  
   286  See [related tech debt issue](https://github.com/dapperlabs/flow-go/issues/6361) for more information on the deprecation.
   287  
   288  Generally, we consider node-internal components as _trusted_ and external nodes as untrusted data sources. 
   289  The `Engine` API differentiates between trusted and untrusted inputs:
   290  
   291  * **Trusted inputs** from other components within the same node are handed to the `Engine` through
   292    ```golang
   293    // SubmitLocal submits an event originating on the local node.
   294    SubmitLocal(event interface{})
   295  
   296    // ProcessLocal processes an event originating on the local node.
   297    ProcessLocal(event interface{}) error
   298    ```
   299  * **Untrusted inputs** from other nodes are handed to the `Engine` via
   300    ```golang
   301    // Submit submits the given event from the node with the given origin ID
   302    // for processing in a non-blocking manner. It returns instantly and logs
   303    // a potential processing error internally when done.
   304    Submit(channel Channel, originID flow.Identifier, event interface{})
   305  
   306    // Process processes the given event from the node with the given origin ID
   307    // in a blocking manner. It returns the potential processing error when
   308    // done.
   309    Process(channel Channel, originID flow.Identifier, event interface{}) error
   310    ```
   311  
   312  Furthermore, engines offer synchronous and asynchronous processing of their inputs:
   313  
   314  * Methods `engine.ProcessLocal` and `engine.Process` run their logic in the caller's routine. Therefore, the core
   315    business logic can return errors, which we should capture and propagate.
   316  * Methods `engine.SubmitLocal` and `engine.Submit` run the business logic asynchronously. (no error return)
   317      * Note: computationally very cheap and predominantly non-blocking computation should _still_ be executed in the
   318        calling routine. This is beneficial to avoid a cascade of go-routine launches when stacking multiple engines.
   319  
   320  ### Guidelines
   321  
   322  * In the error handling within an `Engine`, differentiate between trusted and untrusted inputs. For example, most
   323    engines expect specific input types. Inputs with incompatible type should result in a specific sentinel error (
   324    e.g. `InvalidMessageType`).    
   325    For messages that come from the network (methods `Submit` and `Process`), we expect invalid types. But if we get an
   326    invalid message type from a trusted local engine:
   327    (methods `SubmitLocal` and `ProcessLocal`), something is broken (critical failure).
   328    ```golang
   329    func (e *engine) process(event interface{}) error {
   330      switch v := event.(type) {
   331         ...
   332         default:
   333            return fmt.Errorf("invalid input type %T: %w", event, InvalidMessageType)
   334      }
   335    }
   336  
   337    func (e *engine) Process(chan network.Channel, originID flow.Identifier, event interface{}) error {
   338      err := e.process(event)
   339      if err != nil {
   340          if errors.Is(err, InvalidMessageType) {
   341              // this is EXPECTED during normal operations
   342          }
   343      }
   344    }
   345  
   346    func (e *engine) ProcessLocal(event interface{}) {
   347      err := e.process(event)
   348      if err != nil {
   349          if errors.Is(err, InvalidMessageType) {
   350              // this is a CRITICAL BUG
   351          }
   352      }
   353    }
   354    ```
   355  * avoid swallowing errors. The following is generally _not_ acceptable
   356    ```golang
   357    func (e *engine) Submit(event interface{}) error {
   358      e.unit.Launch(func() {
   359          err := e.process(event)
   360          if err != nil {
   361              engine.LogError(e.log, err)
   362          }
   363      })
   364    }
   365    ```
   366    Instead, implement:
   367    ```golang
   368    func (e *engine) Submit(chan network.Channel, originID flow.Identifier, event interface{}) {
   369      e.unit.Launch(func() {
   370          err := e.process(event)
   371          if errors.Is(err, InvalidMessageType) {
   372              // invalid input: ignore or slash
   373              return
   374          }
   375          if err != nil {
   376              // unexpected input: for now we prioritize safety over liveness and just crash
   377              // TODO: restart engine from known good state
   378              e.log.Fatal().Err(err).Msg("unexpected internal error")
   379          }
   380      })
   381    }
   382    ```