github.com/koko1123/flow-go-1@v0.29.6/CodingConventions.md (about)

     1  # Coding Style Guidelines for Flow Core Protocol
     2  
     3  Below, we discuss code-style conventions for Flow's core-protocol implementation. They are _guidelines_ with the goal to
     4  increase readability, uniformity, and maintainability of the code base.
     5  
     6  We believe that sticking to these guidelines will increase code quality and reduce bugs in many situations. In some
     7  cases, you will find that applying our guidelines does not provide an immediate benefit for the specific code you are
     8  working on. Nevertheless, uniform conventions and sticking to established design patterns often makes it easier for
     9  others to work with your code.
    10  
    11  #### They are guidelines, not a law
    12  
    13  Coding guidelines are tools: they are designed to be widely useful but generally have limitations as well. If you find
    14  yourself in a situation where you feel that sticking to a guideline would have significant negative effects on a piece
    15  of code
    16  (e.g. bloats it significantly, makes it highly error-prone) you have the freedom to break with the conventions. In this
    17  case, please include a comment in the code with a brief motivation for your decision. This will help reviewers and
    18  prevent others from going through the same learning process that led to your decision to break with the conventions.
    19  
    20  
    21  ## Motivation and Context
    22  ### A Flow Node
    23  
    24  On a high level, a Flow node consists of various data-processing components that receive information from node-internal
    25  and external sources. We model the data flow inside a Flow node as a graph, aka the **data flow graph**:
    26  * a data-processing component forms a vertex;
    27  * two components exchanging messages is represented as an edge.
    28  
    29  Generally, an individual vertex (data processing component) has a dedicated pool of go-routines for processing its work.
    30  Inbound data packages are appended to queue(s), and the vertex's internal worker threads process the inbound messages from the queue(s).
    31  In this design, there is an upper limit to the CPU resources an individual vertex can consume. A vertex with `k` go-routines as workers can
    32  at most keep `k` cores busy. The benefit of this design is that the node can remain responsive even if one (or a few) of
    33  its data processing vertices are completely overwhelmed.
    34  
    35  In its entirety, a node is composed of many data flow vertices, that together form the node's data flow graph. The vertices
    36  within the node typically trust each other; while external messages are untrusted. Depending on the inputs it consumes, a vertex
    37  must carefully differentiate between trusted and untrusted inputs.
    38  
    39  ⇒ **A vertex (data processing component within a node) implements the [`Component` interface](https://github.com/onflow/flow-go/blob/57f89d4e96259f08fe84163c91ecd32484401b45/module/component/component.go#L22)**.
    40  
    41  For example, [`VoteAggregator`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L29) is
    42  a data flow vertex within a consensus nodes, whose job it is to collect incoming votes from other consensus nodes.
    43  Hence, the corresponding function [`AddVote(vote *model.Vote)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L37)
    44  treats all inputs as untrusted. Furthermore, `VoteAggregator` consumes auxiliary inputs
    45  ([`AddBlock(block *model.Proposal)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L45),
    46  [`InvalidBlock(block *model.Proposal)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L51),
    47  [`PruneUpToView(view uint64)`](https://github.com/onflow/flow-go/blob/3899d335d5a6ce3cbbb9fde4f7af780d6bec8c9a/consensus/hotstuff/vote_aggregator.go#L57))
    48  from the node's main consensus logic, which is trusted.
    49  
    50  
    51  ### External Byzantine messages result in benign sentinel errors
    52  
    53  Any input that is coming from another node is untrusted and the vertex processing should gracefully handle any arbitrarily malicious input.
    54  Under no circumstances should a byzantine input result in the vertex's internal state being corrupted.
    55  Per convention, failure case due to a byzantine inputs are represented by specifically-typed
    56  [sentinel errors](https://pkg.go.dev/errors#New) (for further details, see error handling section).
    57  
    58  Byzantine inputs are one particular case, where a function returns a 'benign error'. The critical property for an error
    59  to be benign is that the component returning it is still fully functional, despite encountering the error condition. All
    60  benign errors should be handled within the vertex. As part of implementing error handling, developers must consider which
    61  error conditions are benign in the context of the current component.
    62  
    63  ### Invalid internal messages result in exceptions
    64  
    65  Any input that is coming from another internal component is trusted by default. 
    66  If the internal message is malformed, this should be considered a critical exception.
    67  
    68  **Exception:** some internal components forward external messages without verification. 
    69  For example, the `sync` engine forwards `Block` messages to the `compliance` engine without verifying that the block is valid.
    70  In this case, validity of the forwarded `Block` is not part of the contract between the `sync` and `compliance` engines,
    71  so an invalid forwarded `Block` should not be considered an invalid internal message.
    72  
    73  ## Error handling
    74  
    75  This is a highly-compressed summary
    76  of [Error Handling in Flow (Notion)](https://www.notion.so/dapperlabs/Error-Handling-in-Flow-a7692d60681c4063b48851c0dbc6ed10)
    77  
    78  A blockchain is a security-first system. Therefore, encountering a problem and continuing on a "best effort" basis is
    79  *not* an option. There is a clear definition of the happy path (i.e. the block must be in storage). Any deviation of this
    80  happy path is either
    81  
    82  1. a protocol violation (from a different node), which itself is clearly defined
    83  2. is an uncovered edge case (which could be exploited in the worst case to compromise the system)
    84  3. a corrupted internal state of the node
    85  
    86  ### Best Practice Guidelines
    87  
    88  1. **Errors are part of your API:** If an error is returned by a function, the error's meaning is clearly documented in the function's GoDoc. We conceptually differentiate between the following two classes of errors:
    89  
    90     1. _benign error: the component returning the error is still fully functional despite the error_.  
    91         Benign failure cases are represented as typed sentinel errors
    92         ([basic errors](https://pkg.go.dev/errors#New) and [higher-level errors](https://dev.to/tigorlazuardi/go-creating-custom-error-wrapper-and-do-proper-error-equality-check-11k7)),
    93         so we can do type checks.  
    94     2. _exception: the error a potential symptom of internal state corruption_.  
    95        For example, a failed sanity check. In this case, the error is most likely fatal.
    96        <br /><br />
    97  
    98     * Documentation of error returns should be part of every interface. We also encourage to copy the higher-level interface documentation to every implementation
    99       and possibly extend it with implementation-specific comments. In particular, it simplifies maintenance work on the implementation, when the API-level contracts
   100       are directly documented above the implementation.
   101     * Adding a new sentinel often means that the higher-level logic has to gracefully handle an additional error case, potentially triggering slashing etc.
   102       Therefore, changing the set of specified sentinel errors is generally considered a breaking API change. 
   103  
   104  
   105  2. **All errors beyond the specified, benign sentinel errors ere considered unexpected failures, i.e. a symptom for potential state corruption.**       
   106     * We employ a fundamental principle of [High Assurance Software Engineering](https://www.researchgate.net/publication/228563190_High_Assurance_Software_Development),
   107  where we treat everything beyond the known benign errors as critical failures. In unexpected failure cases, we assume that the vertex's in-memory state has been
   108  broken and proper functioning is no longer guaranteed. The only safe route of recovery is to restart the vertex from a previously persisted, safe state.
   109  Per convention, a vertex should throw any unexpected exceptions using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go).
   110     * Many components in our BFT system can return benign errors (type (i)) and exceptions (type (ii))
   111  
   112      
   113  3. _Optional Simplification for components that solely return benign errors._
   114     * In this case, you _can_ use untyped errors to represent benign error cases (e.g. using `fmt.Errorf`).
   115     * By using untyped errors, the code would be _breaking with our best practice guideline_ that benign errors should be represented as typed sentinel errors.
   116  Therefore, whenever all returned errors are benign, please clearly document this _for each public functions individually_.
   117  For example, a statement like the following would be sufficient:
   118        ```golang
   119        // This function errors if XYZ was not successful. All returned errors are 
   120        // benign, as the function is side-effect free and failures are simply a no-op.  
   121        ```
   122  
   123  
   124  ### Hands-on suggestions
   125  
   126  * avoid generic errors, such as
   127    ```golang
   128    return fmt.Errorf("x failed")
   129    ```
   130  * Use [sentinel errors](https://pkg.go.dev/errors#New):
   131    ```golang
   132    ErrXFailed := errors.New("x failed")
   133  
   134    // foo does abc.
   135    // Expected error returns during normal operations:
   136    //  * ErrXFailed: if x failed
   137    func foo() err {
   138       ...
   139       return fmt.Errorf("details about failure: %w", ErrXFailed)
   140    }
   141    ```
   142      - If some operation _can_ fail, create specific sentinel errors.
   143      - **Document the types of _all_ errors that are expected during normal operation in goDoc**. This enables the
   144        calling code to specifically check for these errors, and decide how to handle them.
   145  * Errors are bubbled up the call stack and wrapped to create a meaningful trace:
   146    ```golang
   147    // bar does ...
   148    // Expected error returns during normal operations:
   149    //  * XFailedErr: if x failed
   150    func bar() err {
   151       ...
   152       err := foo()
   153       if err != nil {
   154          return fmt.Errorf("failed to do foo: %w", err)
   155       }
   156       ...
   157    }
   158    ```
   159  * Handle errors at a level, where you still have enough context to decide whether the error is expected during normal
   160    operations.
   161  * Errors of unexpected types are indicators that the node's internal state might be corrupted.
   162  
   163  ### Anti-Pattern
   164  
   165  Continuing on a best-effort basis is not an option, i.e. the following is an anti-pattern in the context of Flow:
   166  
   167  ```golang
   168  err := foo()
   169  if err != nil {
   170     log.Error().Err(err).Msg("foo failed")
   171     return
   172  }
   173  ```
   174  
   175  There are _rare_ instances, where this might be acceptable. For example, when attempting to send a message via the
   176  network to another node and the networking layer errored. In this case, please include a comment with an explanation,
   177  why it is acceptable to keep going, even if the other component returned an error. For this example, it is acceptable to handle the error by logging because:
   178  * we expect transient errors from this component (networking layer) during normal operation
   179  * the networking layer uses a 3rd party library, which does not expose specific and exhaustive sentinel errors for expected failure conditions
   180  * either it is not critical that the message is successfully sent at all, or it will be retried later on
   181  
   182  ### Prioritize Safety Over Liveness
   183  
   184  **Ideally, a vertex should restart** (from a known good state), **when it encounters an unexpected error.**
   185  Per convention, a vertex should throw any unexpected exceptions using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go).
   186  
   187  <!-- TODO(alex) expand this section -->
   188  
   189  If this is out of scope, we _prioritize safety over liveness_. This means that we rather crash the node than continue on a best-effort basis.
   190  When in doubt, use the following as a fall-back:
   191  ```golang
   192  err := foo()
   193  if errors.Is(err, XFailedErr) {
   194     // expected error
   195     return
   196  }
   197  if err != nil {
   198     log.Fatal().Err(err).Msg("unexpected internal error")
   199     return
   200  }
   201  ```
   202  
   203  The error-handling guidelines are set up such that they minimize technical debt, even if we don't support component
   204  restarts right away.
   205  
   206  * The code is already differentiating expected from unexpected errors in the core business logic and we documented them
   207    accordingly.
   208  * The code handles the expected errors at the appropriate level, only propagating unexpected errors.
   209  
   210  As a first step, you can crash the node when an unexpected error hits the `Engine` level (for more details, see `Engine`
   211  section of this doc). If you have followed the error handling guidelines, leaving the restart logic for the component as
   212  technical debt for later hopefully does not add much overhead. In contrast, if you log errors and continue on a
   213  best-effort basis, you also leave the _entire_ logic for differentiating errors as tech debt. When adding this logic
   214  later, you need to revisit the _entire_ business logic again.
   215  
   216  
   217  # Appendix
   218  
   219  ## Engines (interface deprecated)
   220  We model the data flow inside a node as a 'data flow graph'. An `Engine` is the deprecated interface for a vertex within the node's data flow graph. 
   221  New vertices should implement the [`Component` interface](https://github.com/onflow/flow-go/blob/57f89d4e96259f08fe84163c91ecd32484401b45/module/component/component.go#L22)
   222  and throw unexpected errors using the related [irrecoverable context](https://github.com/onflow/flow-go/blob/277b6515add6136946913747efebd508f0419a25/module/irrecoverable/irrecoverable.go).
   223  
   224  See [related tech debt issue](https://github.com/dapperlabs/flow-go/issues/6361) for more information on the deprecation.
   225  
   226  Generally, we consider node-internal components as _trusted_ and external nodes as untrusted data sources. 
   227  The `Engine` API differentiates between trusted and untrusted inputs:
   228  
   229  * **Trusted inputs** from other components within the same node are handed to the `Engine` through
   230    ```golang
   231    // SubmitLocal submits an event originating on the local node.
   232    SubmitLocal(event interface{})
   233  
   234    // ProcessLocal processes an event originating on the local node.
   235    ProcessLocal(event interface{}) error
   236    ```
   237  * **Untrusted inputs** from other nodes are handed to the `Engine` via
   238    ```golang
   239    // Submit submits the given event from the node with the given origin ID
   240    // for processing in a non-blocking manner. It returns instantly and logs
   241    // a potential processing error internally when done.
   242    Submit(channel Channel, originID flow.Identifier, event interface{})
   243  
   244    // Process processes the given event from the node with the given origin ID
   245    // in a blocking manner. It returns the potential processing error when
   246    // done.
   247    Process(channel Channel, originID flow.Identifier, event interface{}) error
   248    ```
   249  
   250  Furthermore, engines offer synchronous and asynchronous processing of their inputs:
   251  
   252  * Methods `engine.ProcessLocal` and `engine.Process` run their logic in the caller's routine. Therefore, the core
   253    business logic can return errors, which we should capture and propagate.
   254  * Methods `engine.SubmitLocal` and `engine.Submit` run the business logic asynchronously. (no error return)
   255      * Note: computationally very cheap and predominantly non-blocking computation should _still_ be executed in the
   256        calling routine. This is beneficial to avoid a cascade of go-routine launches when stacking multiple engines.
   257  
   258  ### Guidelines
   259  
   260  * In the error handling within an `Engine`, differentiate between trusted and untrusted inputs. For example, most
   261    engines expect specific input types. Inputs with incompatible type should result in a specific sentinel error (
   262    e.g. `InvalidMessageType`).    
   263    For messages that come from the network (methods `Submit` and `Process`), we expect invalid types. But if we get an
   264    invalid message type from a trusted local engine:
   265    (methods `SubmitLocal` and `ProcessLocal`), something is broken (critical failure).
   266    ```golang
   267    func (e *engine) process(event interface{}) error {
   268      switch v := event.(type) {
   269         ...
   270         default:
   271            return fmt.Errorf("invalid input type %T: %w", event, InvalidMessageType)
   272      }
   273    }
   274  
   275    func (e *engine) Process(chan network.Channel, originID flow.Identifier, event interface{}) error {
   276      err := e.process(event)
   277      if err != nil {
   278          if errors.Is(err, InvalidMessageType) {
   279              // this is EXPECTED during normal operations
   280          }
   281      }
   282    }
   283  
   284    func (e *engine) ProcessLocal(event interface{}) {
   285      err := e.process(event)
   286      if err != nil {
   287          if errors.Is(err, InvalidMessageType) {
   288              // this is a CRITICAL BUG
   289          }
   290      }
   291    }
   292    ```
   293  * avoid swallowing errors. The following is generally _not_ acceptable
   294    ```golang
   295    func (e *engine) Submit(event interface{}) error {
   296      e.unit.Launch(func() {
   297          err := e.process(event)
   298          if err != nil {
   299              engine.LogError(e.log, err)
   300          }
   301      })
   302    }
   303    ```
   304    Instead, implement:
   305    ```golang
   306    func (e *engine) Submit(chan network.Channel, originID flow.Identifier, event interface{}) {
   307      e.unit.Launch(func() {
   308          err := e.process(event)
   309          if errors.Is(err, InvalidMessageType) {
   310              // invalid input: ignore or slash
   311              return
   312          }
   313          if err != nil {
   314              // unexpected input: for now we prioritize safety over liveness and just crash
   315              // TODO: restart engine from known good state
   316              e.log.Fatal().Err(err).Msg("unexpected internal error")
   317          }
   318      })
   319    }
   320    ```