# Component Interface (Core Protocol)

| Status        | Proposed                                                  |
|:------------- |:--------------------------------------------------------- |
| **FLIP #**    | [1167](https://github.com/onflow/flow-go/pull/1167)       |
| **Author(s)** | Simon Zhu (simon.zhu@dapperlabs.com)                      |
| **Sponsor**   | Simon Zhu (simon.zhu@dapperlabs.com)                      |
| **Updated**   | 9/16/2021                                                 |

## Objective

This FLIP proposes separating the API through which components are started from the API through which they expose their status.

## Current Implementation

The [`ReadyDoneAware`](https://github.com/onflow/flow-go/blob/7763000ba5724bb03f522380e513b784b4597d46/module/common.go#L6) interface is the API through which components / modules are currently started and stopped. Calling the `Ready` method should start the component and return a channel that will close when startup has completed, and `Done` should be the corresponding method to shut down the component.

### Potential problems

The current `ReadyDoneAware` interface is misleading: judging by the name, one might expect it to be used only to check the state of a component. However, in almost all current implementations the `Ready` method is used both to start the component *and* to check when it has started up, and similarly for the `Done` method.

This introduces issues of concurrency safety / idempotency, as most implementations do not properly handle the case where the `Ready` or `Done` methods are called more than once. See [this example](https://github.com/onflow/flow-go/pull/1026).

[Clearer documentation](https://github.com/onflow/flow-go/pull/1032) and a new [`LifecycleManager`](https://github.com/onflow/flow-go/pull/1031) component were introduced as a step towards fixing this by providing concurrency-safety for components implementing `ReadyDoneAware`, but this still does not provide a clear separation between the ability to start / stop a component and the ability to check its state. A component usually only needs to be started once, whereas multiple other components may wish to check its state.

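To make the idempotency problem concrete, here is a minimal, self-contained sketch. The types `naiveComponent` and `onceComponent` and the helper `startsAfterTwoCalls` are hypothetical and are not taken from `flow-go`; they only illustrate why a `Ready` method that also starts the component must be guarded against repeated calls.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// naiveComponent is a hypothetical component whose Ready method both starts
// the component and reports readiness, so every call repeats the startup work.
type naiveComponent struct {
	starts int32
}

func (c *naiveComponent) Ready() <-chan struct{} {
	ready := make(chan struct{})
	go func() {
		atomic.AddInt32(&c.starts, 1) // startup side effect, repeated on every call
		close(ready)
	}()
	return ready
}

func (c *naiveComponent) Starts() int32 { return atomic.LoadInt32(&c.starts) }

// onceComponent guards the startup logic with sync.Once, so repeated calls to
// Ready return the same channel and the startup work runs only once.
type onceComponent struct {
	once   sync.Once
	ready  chan struct{}
	starts int32
}

func newOnceComponent() *onceComponent {
	return &onceComponent{ready: make(chan struct{})}
}

func (c *onceComponent) Ready() <-chan struct{} {
	c.once.Do(func() {
		go func() {
			atomic.AddInt32(&c.starts, 1)
			close(c.ready)
		}()
	})
	return c.ready
}

func (c *onceComponent) Starts() int32 { return atomic.LoadInt32(&c.starts) }

// starter is the minimal view of a component needed by this demonstration.
type starter interface {
	Ready() <-chan struct{}
	Starts() int32
}

// startsAfterTwoCalls calls Ready twice, waits for both returned channels to
// close, and reports how many times the startup work ran.
func startsAfterTwoCalls(c starter) int32 {
	first, second := c.Ready(), c.Ready()
	<-first
	<-second
	return c.Starts()
}

func main() {
	fmt.Println("naive:", startsAfterTwoCalls(&naiveComponent{}))
	fmt.Println("once: ", startsAfterTwoCalls(newOnceComponent()))
}
```

Guarding with `sync.Once` removes the double start, but it does not address the separation of capabilities discussed in this section: every caller of `Ready` can still trigger startup.
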
## Proposal

Moving forward, we will add a new `Startable` interface in addition to the existing `ReadyDoneAware`:
```golang
// Startable provides an interface to start a component. Once started, the component
// can be stopped by cancelling the given context.
type Startable interface {
  // Start starts the component. Any errors encountered during startup should be returned
  // directly, whereas irrecoverable errors encountered while the component is running
  // should be thrown with the given SignalerContext.
  // This method should only be called once, and subsequent calls should return ErrMultipleStartup.
  Start(irrecoverable.SignalerContext) error
}
```
Components which implement this interface are passed a `SignalerContext` upon startup, which they can use to propagate any irrecoverable errors they encounter up to their parent via `SignalerContext.Throw`. The parent can then handle these errors however it likes, for example by restarting the component, logging the error, or propagating the error to its own parent.

```golang
// We define a constrained interface to provide a drop-in replacement for context.Context
// including in interfaces that compose it.
type SignalerContext interface {
  context.Context
  Throw(err error) // delegates to the signaler
  sealed()         // private, to constrain builder to using WithSignaler
}

// private, to force context derivation / WithSignaler
type signalerCtx struct {
  context.Context
  *Signaler
}

func (sc signalerCtx) sealed() {}

// the One True Way of getting a SignalerContext
func WithSignaler(parent context.Context) (SignalerContext, <-chan error) {
  sig, errChan := NewSignaler()
  return &signalerCtx{parent, sig}, errChan
}

// Signaler sends the error out.
type Signaler struct {
  errChan   chan error
  errThrown *atomic.Bool
}

func NewSignaler() (*Signaler, <-chan error) {
  errChan := make(chan error, 1)
  return &Signaler{
    errChan:   errChan,
    errThrown: atomic.NewBool(false),
  }, errChan
}

// Throw is a narrow drop-in replacement for panic, log.Fatal, log.Panic, etc
// anywhere there's something connected to the error channel. It only sends
// the first error it is called with to the error channel. There are various
// options as to how subsequent errors can be handled.
// Throw never returns to its caller: the calling goroutine exits via runtime.Goexit.
func (s *Signaler) Throw(err error) {
  defer runtime.Goexit()
  if s.errThrown.CAS(false, true) {
    s.errChan <- err
    close(s.errChan)
  } else {
    // Another thread, possibly from the same component, has already thrown
    // an irrecoverable error to this Signaler. Any subsequent irrecoverable
    // errors can either be logged or ignored, as the parent will already
    // be taking steps to remediate the first error.
  }
}
```

> For more details about `SignalerContext` and `ErrMultipleStartup`, see [#1275](https://github.com/onflow/flow-go/pull/1275) and [#1355](https://github.com/onflow/flow-go/pull/1355/).

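The behaviour of `Throw` can be checked with a small standalone program. The sketch below is not the `flow-go` implementation: `signaler`, `newSignaler` and `runWorkers` are hypothetical stand-ins, and it assumes the standard library's `sync/atomic` in place of the `atomic` package used above so that it has no external dependencies. It illustrates two properties: only the first thrown error reaches the error channel, and code following a call to `throw` never runs because the calling goroutine exits.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// signaler mirrors the Signaler above using only the standard library.
type signaler struct {
	errChan   chan error
	errThrown int32
}

func newSignaler() (*signaler, <-chan error) {
	errChan := make(chan error, 1)
	return &signaler{errChan: errChan}, errChan
}

// throw delivers only the first error it receives, then terminates the
// calling goroutine. Deferred calls of that goroutine still run.
func (s *signaler) throw(err error) {
	defer runtime.Goexit()
	if atomic.CompareAndSwapInt32(&s.errThrown, 0, 1) {
		s.errChan <- err
		close(s.errChan)
	}
}

// runWorkers starts n (n >= 1) hypothetical workers that all throw an error.
// It returns every error that reached the channel and the number of workers
// that kept running after calling throw.
func runWorkers(n int) (received []error, continued int32) {
	sig, errChan := newSignaler()

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done() // still executed when the goroutine exits via Goexit
			sig.throw(fmt.Errorf("worker %d failed", id))
			atomic.AddInt32(&continued, 1) // not reached: throw does not return
		}(i)
	}
	wg.Wait()

	// The channel is closed after the first error, so this loop terminates.
	for err := range errChan {
		received = append(received, err)
	}
	return received, atomic.LoadInt32(&continued)
}

func main() {
	received, continued := runWorkers(3)
	fmt.Println("errors received:", len(received))
	fmt.Println("workers that continued after throw:", continued)
}
```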
To start a component, a `SignalerContext` must first be created and passed to its `Start` method:

```golang
var parentCtx context.Context // this is the context for the routine which manages the component
var childComponent component.Component

ctx, cancel := context.WithCancel(parentCtx)

// create a SignalerContext and return an error channel which can be used to receive
// any irrecoverable errors thrown with the Signaler
signalerCtx, errChan := irrecoverable.WithSignaler(ctx)

// start the child component; errors encountered during startup are returned directly
if err := childComponent.Start(signalerCtx); err != nil {
  cancel()
  // handle the startup error...
}

// launch goroutine to handle errors thrown from the child component
go func() {
  select {
  case err := <-errChan: // error thrown by child component
    cancel()
    // handle the error...
  case <-parentCtx.Done(): // canceled by parent
    // perform any necessary cleanup...
  }
}()
```

With all of this in place, the semantics of `ReadyDoneAware` can be redefined so that it is used only to check a component's state (i.e., to wait for startup / shutdown to complete):
```golang
type ReadyDoneAware interface {
  // Ready returns a channel that will close when component startup has completed.
  Ready() <-chan struct{}
  // Done returns a channel that will close when component shutdown has completed.
  Done() <-chan struct{}
}
```

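As a rough illustration of the redefined semantics, an observer can be handed only the `ReadyDoneAware` view of a component and wait on it with a deadline. In the sketch below, `stubComponent`, `newStub`, `waitReady` and `demo` are hypothetical names introduced for this example, and the interface is repeated so that the program compiles on its own.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ReadyDoneAware is repeated from above so that the sketch is self-contained.
type ReadyDoneAware interface {
	Ready() <-chan struct{}
	Done() <-chan struct{}
}

// stubComponent is a hypothetical component whose channels are closed by its owner.
type stubComponent struct {
	ready chan struct{}
	done  chan struct{}
}

func newStub() *stubComponent {
	return &stubComponent{ready: make(chan struct{}), done: make(chan struct{})}
}

func (s *stubComponent) Ready() <-chan struct{} { return s.ready }
func (s *stubComponent) Done() <-chan struct{}  { return s.done }

// waitReady blocks until c reports that startup has completed or ctx expires.
// It receives only the status interface, so it can neither start nor stop c.
func waitReady(ctx context.Context, c ReadyDoneAware) error {
	select {
	case <-c.Ready():
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo waits once before and once after the owner marks the component ready.
func demo() (before, after error) {
	c := newStub()

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	before = waitReady(ctx, c) // startup has not completed, so the deadline expires

	close(c.ready) // the owner signals that startup has completed
	after = waitReady(context.Background(), c)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println("before ready:", before)
	fmt.Println("after ready: ", after)
}
```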
Finally, we can define a `Component` interface which combines both of these interfaces:
```golang
type Component interface {
  Startable
  ReadyDoneAware
}
```

A component will now be started by passing a `SignalerContext` to its `Start` method, and can be stopped by cancelling the `Context`. If a component needs to start up subcomponents, it can create child `Context`s from this `Context` and pass those to the subcomponents.
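
The shutdown path described above relies on standard `context` behaviour: cancelling a parent context cancels every context derived from it. The following minimal sketch uses only the standard library; `cancelTree` and the anonymous subcomponent goroutines are hypothetical and stand in for real subcomponents.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// cancelTree starts n hypothetical subcomponents, each running with its own
// child context derived from a shared parent. It then cancels only the parent
// and returns how many subcomponents observed the cancellation and shut down.
func cancelTree(n int) int {
	parent, cancel := context.WithCancel(context.Background())

	var wg sync.WaitGroup
	stopped := make(chan int, n)
	for i := 0; i < n; i++ {
		child, childCancel := context.WithCancel(parent)
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			defer childCancel() // release the child context's resources
			<-child.Done()      // closed as soon as the parent is cancelled
			stopped <- id
		}(i)
	}

	cancel()  // a single call on the parent stops the whole tree
	wg.Wait() // wait until every subcomponent has shut down
	close(stopped)
	return len(stopped)
}

func main() {
	fmt.Println("subcomponents stopped:", cancelTree(3))
}
```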

### Motivations
- `Context`s are the standard way of doing goroutine lifecycle management in Go, and adhering to standards helps eliminate confusion and ambiguity for anyone interacting with the `flow-go` codebase. This is especially true now that we are beginning to provide APIs and interfaces for third parties to interact with the codebase (e.g., DPS).
  - Even to someone unfamiliar with our codebase (but familiar with Go idioms), it is clear how a method signature like `Start(context.Context) error` will behave. A method signature like `Ready()` is not so clear.
- This promotes a hierarchical supervision paradigm, where each `Component` is equipped with a fresh signaler to its parent at launch, and is thus supervised by its parent for any irrecoverable errors it may encounter (the call to `WithSignaler` replaces the signaler in a parent context). As a consequence, sub-components started by a component have that component as their supervisor, which handles their irrecoverable failures, and so on.
  - If context propagation is done properly, there is no need to worry about any cleanup code in the `Done` method. Cancelling the context for a component will automatically cancel all subcomponents / child routines in the component tree, and we do not have to explicitly call `Done` on each and every subcomponent to trigger their shutdown.
  - This allows us to separate the capability to check a component's state from the capability to start / stop it. We may want to give multiple other components the capability to check its state, without giving them the capability to start or stop it. Here is an [example](https://github.com/onflow/flow-go/blob/b50f0ffe054103a82e4aa9e0c9e4610c2cbf2cc9/engine/common/splitter/network/network.go#L112) of where this would be useful.
  - This provides a clearer way of defining ownership of components, and hence may potentially eliminate the need to deal with concurrency-safety altogether. Whoever creates a component should be responsible for starting it, and therefore they should be the only one with access to its `Startable` interface. If each component only has a single parent that is capable of starting it, then we should never run into concurrency issues.

## Implementation (WIP)
* Lifecycle management logic for components can be further abstracted into a `RunComponent` helper function:

  ```golang
  type ComponentFactory func() (Component, error)

  // OnError reacts to an irrecoverable error
  // It is meant to inspect the error, determining its type and seeing if e.g. a restart or some other measure is suitable,
  // and then return an ErrorHandlingResult indicating how RunComponent should proceed.
  // Before returning, it could also:
  // - panic (in sandboxnet / benchmark)
  // - log in various Error channels and / or send telemetry ...
  type OnError = func(err error) ErrorHandlingResult

  type ErrorHandlingResult int

  const (
    ErrorHandlingRestart ErrorHandlingResult = iota
    ErrorHandlingStop
  )

  // RunComponent repeatedly starts components returned from the given ComponentFactory, shutting them
  // down when they encounter irrecoverable errors and passing those errors to the given error handler.
  // If the given context is cancelled, it will wait for the current component instance to shutdown
  // before returning.
  // The returned error is either:
  // - The context error if the context was canceled
  // - The last error handled if the error handler returns ErrorHandlingStop
  // - An error returned from componentFactory while generating an instance of component
  // - An error returned from the component's Start method
  func RunComponent(ctx context.Context, componentFactory ComponentFactory, handler OnError) error {
    // reference to per-run signals for the component
    var component Component
    var cancel context.CancelFunc
    var done <-chan struct{}
    var irrecoverableErr <-chan error

    start := func() error {
      var err error

      component, err = componentFactory()
      if err != nil {
        return err // failure to generate the component, should be handled out-of-band because a restart won't help
      }

      // context used to run the component
      var runCtx context.Context
      runCtx, cancel = context.WithCancel(ctx)

      // signaler context used for irrecoverables
      var signalCtx irrecoverable.SignalerContext
      signalCtx, irrecoverableErr = irrecoverable.WithSignaler(runCtx)

      if err := component.Start(signalCtx); err != nil {
        cancel() // release the run context, since the component never started
        return err
      }

      done = component.Done()

      return nil
    }

    stop := func() {
      // shutdown the component and wait until it's done
      cancel()
      <-done
    }

    for {
      select {
      case <-ctx.Done():
        return ctx.Err()
      default:
      }

      if err := start(); err != nil {
        return err // failure to start
      }

      select {
      case <-ctx.Done():
        stop()
        return ctx.Err()
      case err := <-irrecoverableErr:
        stop()

        // send error to the handler
        switch result := handler(err); result {
        case ErrorHandlingRestart:
          continue
        case ErrorHandlingStop:
          return err
        default:
          panic(fmt.Sprintf("invalid error handling result: %v", result))
        }
      case <-done:
        // Without this additional select, there is a race condition here where the done channel
        // could have been closed as a result of an irrecoverable error being thrown, so that when
        // the scheduler yields control back to this goroutine, both channels are available to read
        // from. If this last case happens to be chosen at random to proceed instead of the one
        // above, then we would return as if the component shutdown gracefully, when in fact it
        // encountered an irrecoverable error.
        select {
        case err := <-irrecoverableErr:
          switch result := handler(err); result {
          case ErrorHandlingRestart:
            continue
          case ErrorHandlingStop:
            return err
          default:
            panic(fmt.Sprintf("invalid error handling result: %v", result))
          }
        default:
        }

        // Similarly, the done channel could have closed as a result of the context being canceled.
        select {
        case <-ctx.Done():
          return ctx.Err()
        default:
        }

        // clean completion
        return nil
      }
    }
  }
  ```

  > Note: this is now implemented in [#1275](https://github.com/onflow/flow-go/pull/1275) and [#1355](https://github.com/onflow/flow-go/pull/1355), and an example can be found [here](https://github.com/onflow/flow-go/blob/24406ed3fde7661cb1df84a25755cedf041a1c50/module/irrecoverable/irrecoverable_example_test.go).
* We may be able to encapsulate a lot of the boilerplate code involved in handling startup / shutdown of worker routines into a single `ComponentManager` struct:

  ```golang
  type ReadyFunc func()

  // ComponentWorker represents a worker routine of a component
  type ComponentWorker func(ctx irrecoverable.SignalerContext, ready ReadyFunc)

  // ComponentManagerBuilder provides a mechanism for building a ComponentManager
  type ComponentManagerBuilder interface {
    // AddWorker adds a worker routine for the ComponentManager
    AddWorker(ComponentWorker) ComponentManagerBuilder

    // Build builds and returns a new ComponentManager instance
    Build() *ComponentManager
  }

  // ComponentManager is used to manage the worker routines of a Component
  type ComponentManager struct {
    started        *atomic.Bool
    ready          chan struct{}
    done           chan struct{}
    shutdownSignal <-chan struct{}

    workers []ComponentWorker
  }

  // Start initiates the ComponentManager by launching all worker routines.
  // As required by Startable, it returns ErrMultipleStartup if it is called more than once.
  func (c *ComponentManager) Start(parent irrecoverable.SignalerContext) error {
    // only start once
    if c.started.CAS(false, true) {
      ctx, cancel := context.WithCancel(parent)
      signalerCtx, errChan := irrecoverable.WithSignaler(ctx)
      c.shutdownSignal = ctx.Done()

      // launch goroutine to propagate irrecoverable error
      go func() {
        select {
        case err := <-errChan:
          cancel() // shutdown all workers

          // we propagate the error directly to the parent because a failure in a
          // worker routine is considered irrecoverable
          parent.Throw(err)
        case <-c.done:
          // Without this additional select, there is a race condition here where the done channel
          // could be closed right after an irrecoverable error is thrown, so that when the scheduler
          // yields control back to this goroutine, both channels are available to read from. If this
          // second case happens to be chosen at random to proceed, then we would return and silently
          // ignore the error.
          select {
          case err := <-errChan:
            cancel()
            parent.Throw(err)
          default:
          }
        }
      }()

      var workersReady sync.WaitGroup
      var workersDone sync.WaitGroup
      workersReady.Add(len(c.workers))
      workersDone.Add(len(c.workers))

      // launch workers
      for _, worker := range c.workers {
        worker := worker
        go func() {
          defer workersDone.Done()
          var readyOnce sync.Once
          worker(signalerCtx, func() {
            readyOnce.Do(func() {
              workersReady.Done()
            })
          })
        }()
      }

      // launch goroutine to close ready channel
      go c.waitForReady(&workersReady)

      // launch goroutine to close done channel
      go c.waitForDone(&workersDone)

      return nil
    }

    return module.ErrMultipleStartup
  }

  func (c *ComponentManager) waitForReady(workersReady *sync.WaitGroup) {
    workersReady.Wait()
    close(c.ready)
  }

  func (c *ComponentManager) waitForDone(workersDone *sync.WaitGroup) {
    workersDone.Wait()
    close(c.done)
  }

  // Ready returns a channel which is closed once all the worker routines have been launched and are ready.
  // If any worker routines exit before they indicate that they are ready, the channel returned from Ready will never close.
  func (c *ComponentManager) Ready() <-chan struct{} {
    return c.ready
  }

  // Done returns a channel which is closed once the ComponentManager has shut down.
  // This happens when all worker routines have shut down (either gracefully or by throwing an error).
  func (c *ComponentManager) Done() <-chan struct{} {
    return c.done
  }

  // ShutdownSignal returns a channel that is closed when shutdown has commenced.
  // This can happen either if the ComponentManager's context is canceled, or a worker routine encounters
  // an irrecoverable error.
  // If this is called before Start, a nil channel will be returned.
  func (c *ComponentManager) ShutdownSignal() <-chan struct{} {
    return c.shutdownSignal
  }
  ```

  Components that want to implement `Component` can use this `ComponentManager` to simplify implementation:

  ```golang
  type FooComponent struct {
    *component.ComponentManager
  }

  func NewFooComponent(foo fooType) *FooComponent {
    f := &FooComponent{}

    cmb := component.NewComponentManagerBuilder().
      AddWorker(f.childRoutine).
      AddWorker(f.childRoutineWithFooParameter(foo))

    f.ComponentManager = cmb.Build()

    return f
  }

  // childRoutine matches the ComponentWorker signature, so it can be passed to AddWorker directly.
  func (f *FooComponent) childRoutine(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) {
    ready() // signal that this worker has finished starting up

    for {
      select {
      case <-ctx.Done():
        return
      default:
        // do work...
      }
    }
  }

  // childRoutineWithFooParameter returns a ComponentWorker which closes over foo.
  func (f *FooComponent) childRoutineWithFooParameter(foo fooType) component.ComponentWorker {
    return func(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) {
      ready() // signal that this worker has finished starting up

      for {
        select {
        case <-ctx.Done():
          return
        default:
          // do work with foo...

          // encounter irrecoverable error
          ctx.Throw(errors.New("fatal error"))
        }
      }
    }
  }
  ```

  > Note: this is now implemented in [#1355](https://github.com/onflow/flow-go/pull/1355)
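
Both `RunComponent` and `ComponentManager.Start` above guard against the same race: when the done channel and the error channel are ready at the same time, Go's `select` picks one of the ready cases at random. The standalone sketch below reproduces that situation. The functions `naiveOutcome`, `checkedOutcome` and `missedErrors` are hypothetical and exist only for this illustration; `checkedOutcome` applies the nested non-blocking `select` used in the code above.

```go
package main

import (
	"errors"
	"fmt"
)

// naiveOutcome reports nil if the done case is selected, even when an error
// is already waiting on errChan.
func naiveOutcome(done <-chan struct{}, errChan <-chan error) error {
	select {
	case err := <-errChan:
		return err
	case <-done:
		return nil
	}
}

// checkedOutcome adds the nested non-blocking select: after done fires, it
// polls errChan once more before concluding that the shutdown was clean.
func checkedOutcome(done <-chan struct{}, errChan <-chan error) error {
	select {
	case err := <-errChan:
		return err
	case <-done:
		select {
		case err := <-errChan:
			return err
		default:
			return nil
		}
	}
}

// missedErrors runs the given outcome function over trials in which both
// channels are ready before the select executes, and counts how often the
// pending error was silently dropped.
func missedErrors(trials int, outcome func(<-chan struct{}, <-chan error) error) int {
	missed := 0
	for i := 0; i < trials; i++ {
		done := make(chan struct{})
		errChan := make(chan error, 1)
		errChan <- errors.New("irrecoverable error")
		close(done)

		if outcome(done, errChan) == nil {
			missed++
		}
	}
	return missed
}

func main() {
	const trials = 1000
	fmt.Println("errors missed without nested select:", missedErrors(trials, naiveOutcome))
	fmt.Println("errors missed with nested select:   ", missedErrors(trials, checkedOutcome))
}
```

The count for the naive version varies between runs because the choice among ready cases is random, whereas the nested `select` never drops a pending error in this setup.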