# Component Interface (Core Protocol)

| Status        | Proposed                                            |
| :------------ | :-------------------------------------------------- |
| **FLIP #**    | [1167](https://github.com/onflow/flow-go/pull/1167) |
| **Author(s)** | Simon Zhu (simon.zhu@dapperlabs.com)                |
| **Sponsor**   | Simon Zhu (simon.zhu@dapperlabs.com)                |
| **Updated**   | 9/16/2021                                           |

## Objective

FLIP to separate the API through which components are started from the API through which they expose their status.

## Current Implementation

The [`ReadyDoneAware`](https://github.com/onflow/flow-go/blob/7763000ba5724bb03f522380e513b784b4597d46/module/common.go#L6) interface provides an interface through which components / modules can be started and stopped. Calling the `Ready` method should start the component and return a channel that will close when startup has completed, and `Done` should be the corresponding method to shut down the component.

### Potential problems

The current `ReadyDoneAware` interface is misleading, as by the name one might expect that it is only used to check the state of a component. However, in almost all current implementations the `Ready` method is used to both start the component *and* check when it has started up, and similarly for the `Done` method.

This introduces issues of concurrency safety / idempotency, as most implementations do not properly handle the case where the `Ready` or `Done` methods are called more than once. See [this example](https://github.com/onflow/flow-go/pull/1026).

[Clearer documentation](https://github.com/onflow/flow-go/pull/1032) and a new [`LifecycleManager`](https://github.com/onflow/flow-go/pull/1031) component were introduced as a step towards fixing this by providing concurrency-safety for components implementing `ReadyDoneAware`, but this still does not provide a clear separation between the ability to start / stop a component and the ability to check its state. A component usually only needs to be started once, whereas multiple other components may wish to check its state.

## Proposal

Moving forward, we will add a new `Startable` interface in addition to the existing `ReadyDoneAware`:
```golang
// Startable provides an interface to start a component. Once started, the component
// can be stopped by cancelling the given context.
type Startable interface {
	// Start starts the component. Any errors encountered during startup should be returned
	// directly, whereas irrecoverable errors encountered while the component is running
	// should be thrown with the given SignalerContext.
	// This method should only be called once, and subsequent calls should return ErrMultipleStartup.
	Start(irrecoverable.SignalerContext) error
}
```
Components which implement this interface are passed in a `SignalerContext` upon startup, which they can use to propagate any irrecoverable errors they encounter up to their parent via `SignalerContext.Throw`. The parent can then choose to handle these errors however they like, including restarting the component, logging the error, propagating the error to their own parent, etc.

```golang
// We define a constrained interface to provide a drop-in replacement for context.Context
// including in interfaces that compose it.
type SignalerContext interface {
	context.Context
	Throw(err error) // delegates to the signaler
	sealed()         // private, to constrain builder to using WithSignaler
}

// private, to force context derivation / WithSignaler
type signalerCtx struct {
	context.Context
	*Signaler
}

func (sc signalerCtx) sealed() {}

// the One True Way of getting a SignalerContext
func WithSignaler(parent context.Context) (SignalerContext, <-chan error) {
	sig, errChan := NewSignaler()
	return &signalerCtx{parent, sig}, errChan
}

// Signaler sends the error out.
type Signaler struct {
	errChan   chan error
	errThrown *atomic.Bool
}

func NewSignaler() (*Signaler, <-chan error) {
	errChan := make(chan error, 1)
	return &Signaler{
		errChan:   errChan,
		errThrown: atomic.NewBool(false),
	}, errChan
}

// Throw is a narrow drop-in replacement for panic, log.Fatal, log.Panic, etc
// anywhere there's something connected to the error channel. It only sends
// the first error it is called with to the error channel; there are various
// options as to how subsequent errors can be handled.
func (s *Signaler) Throw(err error) {
	defer runtime.Goexit()
	if s.errThrown.CAS(false, true) {
		s.errChan <- err
		close(s.errChan)
	} else {
		// Another thread, possibly from the same component, has already thrown
		// an irrecoverable error to this Signaler. Any subsequent irrecoverable
		// errors can either be logged or ignored, as the parent will already
		// be taking steps to remediate the first error.
	}
}
```

> For more details about `SignalerContext` and `ErrMultipleStartup`, see [#1275](https://github.com/onflow/flow-go/pull/1275) and [#1355](https://github.com/onflow/flow-go/pull/1355/).

To start a component, a `SignalerContext` must be created to start it with:

```golang
var parentCtx context.Context // this is the context for the routine which manages the component
var childComponent component.Component

ctx, cancel := context.WithCancel(parentCtx)

// create a SignalerContext and return an error channel which can be used to receive
// any irrecoverable errors thrown with the Signaler
signalerCtx, errChan := irrecoverable.WithSignaler(ctx)

// start the child component
if err := childComponent.Start(signalerCtx); err != nil {
	cancel()
	// handle the startup error...
}

// launch goroutine to handle errors thrown from the child component
go func() {
	select {
	case err := <-errChan: // error thrown by child component
		cancel()
		// handle the error...
	case <-parentCtx.Done(): // canceled by parent
		// perform any necessary cleanup...
	}
}()
```

With all of this in place, the semantics of `ReadyDoneAware` can be redefined to only be used to check a component's state (i.e. wait for startup / shutdown to complete):
```golang
type ReadyDoneAware interface {
	// Ready returns a channel that will close when component startup has completed.
	Ready() <-chan struct{}
	// Done returns a channel that will close when component shutdown has completed.
	Done() <-chan struct{}
}
```

Finally, we can define a `Component` interface which combines both of these interfaces:
```golang
type Component interface {
	Startable
	ReadyDoneAware
}
```

A component will now be started by passing a `SignalerContext` to its `Start` method, and can be stopped by cancelling the `Context`. If a component needs to start up subcomponents, it can create child `Context`s from this `Context` and pass those to the subcomponents.

### Motivations
- `Context`s are the standard way of doing goroutine lifecycle management in Go, and adhering to standards helps eliminate confusion and ambiguity for anyone interacting with the `flow-go` codebase. This is especially true now that we are beginning to provide APIs and interfaces for third parties to interact with the codebase (e.g. DPS).
- Even to someone unfamiliar with our codebase (but familiar with Go idioms), it is clear how a method signature like `Start(context.Context) error` will behave. A method signature like `Ready()` is not so clear.
- This promotes a hierarchical supervision paradigm, where each `Component` is equipped with a fresh signaler to its parent at launch, and is thus supervised by its parent for any irrecoverable errors it may encounter (the call to `WithSignaler` replaces the signaler in a parent context). As a consequence, sub-components themselves started by a component have it as a supervisor, which handles their irrecoverable failures, and so on.
- If context propagation is done properly, there is no need to worry about any cleanup code in the `Done` method. Cancelling the context for a component will automatically cancel all subcomponents / child routines in the component tree, and we do not have to explicitly call `Done` on each and every subcomponent to trigger their shutdown.
- This allows us to separate the capability to check a component's state from the capability to start / stop it. We may want to give multiple other components the capability to check its state, without giving them the capability to start or stop it. Here is an [example](https://github.com/onflow/flow-go/blob/b50f0ffe054103a82e4aa9e0c9e4610c2cbf2cc9/engine/common/splitter/network/network.go#L112) of where this would be useful.
- This provides a clearer way of defining ownership of components, and hence may eliminate the need to deal with concurrency-safety altogether. Whoever creates a component should be responsible for starting it, and therefore they should be the only one with access to its `Startable` interface. If each component only has a single parent that is capable of starting it, then we should never run into concurrency issues.

## Implementation (WIP)
* Lifecycle management logic for components can be further abstracted into a `RunComponent` helper function:

```golang
type ComponentFactory func() (Component, error)

// OnError reacts to an irrecoverable error
// It is meant to inspect the error, determining its type and seeing if e.g. a restart or some other measure is suitable,
// and then return an ErrorHandlingResult indicating how RunComponent should proceed.
// Before returning, it could also:
// - panic (in sandboxnet / benchmark)
// - log in various Error channels and / or send telemetry ...
type OnError = func(err error) ErrorHandlingResult

type ErrorHandlingResult int

const (
	ErrorHandlingRestart ErrorHandlingResult = iota
	ErrorHandlingStop
)

// RunComponent repeatedly starts components returned from the given ComponentFactory, shutting them
// down when they encounter irrecoverable errors and passing those errors to the given error handler.
// If the given context is cancelled, it will wait for the current component instance to shutdown
// before returning.
// The returned error is either:
// - The context error if the context was canceled
// - The last error handled if the error handler returns ErrorHandlingStop
// - An error returned from componentFactory while generating an instance of component
// - An error returned from the component's Start method
func RunComponent(ctx context.Context, componentFactory ComponentFactory, handler OnError) error {
	// reference to per-run signals for the component
	var component Component
	var cancel context.CancelFunc
	var done <-chan struct{}
	var irrecoverableErr <-chan error

	start := func() error {
		var err error

		component, err = componentFactory()
		if err != nil {
			return err // failure to generate the component, should be handled out-of-band because a restart won't help
		}

		// context used to run the component
		var runCtx context.Context
		runCtx, cancel = context.WithCancel(ctx)

		// signaler context used for irrecoverables
		var signalCtx irrecoverable.SignalerContext
		signalCtx, irrecoverableErr = irrecoverable.WithSignaler(runCtx)

		// errors returned directly from Start are startup failures
		if err := component.Start(signalCtx); err != nil {
			cancel()
			return err
		}

		done = component.Done()

		return nil
	}

	stop := func() {
		// shutdown the component and wait until it's done
		cancel()
		<-done
	}

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}

		if err := start(); err != nil {
			return err // failure to start
		}

		select {
		case <-ctx.Done():
			stop()
			return ctx.Err()
		case err := <-irrecoverableErr:
			stop()

			// send error to the handler
			switch result := handler(err); result {
			case ErrorHandlingRestart:
				continue
			case ErrorHandlingStop:
				return err
			default:
				panic(fmt.Sprintf("invalid error handling result: %v", result))
			}
		case <-done:
			// Without this additional select, there is a race condition here where the done channel
			// could have been closed as a result of an irrecoverable error being thrown, so that when
			// the scheduler yields control back to this goroutine, both channels are available to read
			// from. If this last case happens to be chosen at random to proceed instead of the one
			// above, then we would return as if the component shutdown gracefully, when in fact it
			// encountered an irrecoverable error.
			select {
			case err := <-irrecoverableErr:
				switch result := handler(err); result {
				case ErrorHandlingRestart:
					continue
				case ErrorHandlingStop:
					return err
				default:
					panic(fmt.Sprintf("invalid error handling result: %v", result))
				}
			default:
			}

			// Similarly, the done channel could have closed as a result of the context being canceled.
			select {
			case <-ctx.Done():
				return ctx.Err()
			default:
			}

			// clean completion
			return nil
		}
	}
}
```

> Note: this is now implemented in [#1275](https://github.com/onflow/flow-go/pull/1275) and [#1355](https://github.com/onflow/flow-go/pull/1355), and an example can be found [here](https://github.com/onflow/flow-go/blob/24406ed3fde7661cb1df84a25755cedf041a1c50/module/irrecoverable/irrecoverable_example_test.go).

* We may be able to encapsulate a lot of the boilerplate code involved in handling startup / shutdown of worker routines into a single `ComponentManager` struct:

```golang
type ReadyFunc func()

// ComponentWorker represents a worker routine of a component
type ComponentWorker func(ctx irrecoverable.SignalerContext, ready ReadyFunc)

// ComponentManagerBuilder provides a mechanism for building a ComponentManager
type ComponentManagerBuilder interface {
	// AddWorker adds a worker routine for the ComponentManager
	AddWorker(ComponentWorker) ComponentManagerBuilder

	// Build builds and returns a new ComponentManager instance
	Build() *ComponentManager
}

// ComponentManager is used to manage the worker routines of a Component
type ComponentManager struct {
	started        *atomic.Bool
	ready          chan struct{}
	done           chan struct{}
	shutdownSignal <-chan struct{}

	workers []ComponentWorker
}

// Start initiates the ComponentManager by launching all worker routines.
// It returns module.ErrMultipleStartup if it is called more than once, as
// required by the Startable interface.
func (c *ComponentManager) Start(parent irrecoverable.SignalerContext) error {
	// only start once
	if !c.started.CAS(false, true) {
		return module.ErrMultipleStartup
	}

	ctx, cancel := context.WithCancel(parent)
	signalerCtx, errChan := irrecoverable.WithSignaler(ctx)
	c.shutdownSignal = ctx.Done()

	// launch goroutine to propagate irrecoverable error
	go func() {
		select {
		case err := <-errChan:
			cancel() // shutdown all workers

			// we propagate the error directly to the parent because a failure in a
			// worker routine is considered irrecoverable
			parent.Throw(err)
		case <-c.done:
			// Without this additional select, there is a race condition here where the done channel
			// could be closed right after an irrecoverable error is thrown, so that when the scheduler
			// yields control back to this goroutine, both channels are available to read from. If this
			// second case happens to be chosen at random to proceed, then we would return and silently
			// ignore the error.
			select {
			case err := <-errChan:
				cancel()
				parent.Throw(err)
			default:
			}
		}
	}()

	var workersReady sync.WaitGroup
	var workersDone sync.WaitGroup
	workersReady.Add(len(c.workers))
	workersDone.Add(len(c.workers))

	// launch workers
	for _, worker := range c.workers {
		worker := worker
		go func() {
			defer workersDone.Done()
			var readyOnce sync.Once
			worker(signalerCtx, func() {
				readyOnce.Do(func() {
					workersReady.Done()
				})
			})
		}()
	}

	// launch goroutine to close ready channel
	go c.waitForReady(&workersReady)

	// launch goroutine to close done channel
	go c.waitForDone(&workersDone)

	return nil
}

func (c *ComponentManager) waitForReady(workersReady *sync.WaitGroup) {
	workersReady.Wait()
	close(c.ready)
}

func (c *ComponentManager) waitForDone(workersDone *sync.WaitGroup) {
	workersDone.Wait()
	close(c.done)
}

// Ready returns a channel which is closed once all the worker routines have been launched and are ready.
// If any worker routines exit before they indicate that they are ready, the channel returned from Ready will never close.
func (c *ComponentManager) Ready() <-chan struct{} {
	return c.ready
}

// Done returns a channel which is closed once the ComponentManager has shut down.
// This happens when all worker routines have shut down (either gracefully or by throwing an error).
func (c *ComponentManager) Done() <-chan struct{} {
	return c.done
}

// ShutdownSignal returns a channel that is closed when shutdown has commenced.
// This can happen either if the ComponentManager's context is canceled, or a worker routine encounters
// an irrecoverable error.
// If this is called before Start, a nil channel will be returned.
func (c *ComponentManager) ShutdownSignal() <-chan struct{} {
	return c.shutdownSignal
}
```

Components that want to implement `Component` can use this `ComponentManager` to simplify implementation:

```golang
type FooComponent struct {
	*component.ComponentManager
}

func NewFooComponent(foo fooType) *FooComponent {
	f := &FooComponent{}

	cmb := component.NewComponentManagerBuilder().
		AddWorker(f.childRoutine).
		AddWorker(f.childRoutineWithFooParameter(foo))

	f.ComponentManager = cmb.Build()

	return f
}

func (f *FooComponent) childRoutine(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) {
	ready() // signal that this worker has finished starting up

	for {
		select {
		case <-ctx.Done():
			return
		default:
			// do work...
		}
	}
}

func (f *FooComponent) childRoutineWithFooParameter(foo fooType) component.ComponentWorker {
	return func(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) {
		ready() // signal that this worker has finished starting up

		for {
			select {
			case <-ctx.Done():
				return
			default:
				// do work with foo...

				// encounter irrecoverable error
				ctx.Throw(errors.New("fatal error!"))
			}
		}
	}
}
```

> Note: this is now implemented in [#1355](https://github.com/onflow/flow-go/pull/1355)
