
# ADR 033: pubsub 2.0

Author: Anton Kaliaev (@melekes)

## Changelog

02-10-2018: Initial draft

16-01-2019: Second version based on our conversation with Jae

17-01-2019: Third version explaining how new design solves current issues

25-01-2019: Fourth version to treat buffered and unbuffered channels differently

## Context

Since the initial version of the pubsub, a number of issues have been
raised: [#951], [#1879], [#1880]. Some of them are high-level issues
questioning the core design choices made. Others are minor and mostly concern
the interface of the `Subscribe()` / `Publish()` functions.

### Sync vs Async

Now, when publishing a message to subscribers, we can do it in a goroutine:

_using channels for data transmission_
```go
for _, subscriber := range subscribers {
    out := subscriber.outc
    go func() {
        out <- msg
    }()
}
```

_by invoking callback functions_
```go
for _, subscriber := range subscribers {
    go subscriber.callbackFn()
}
```

This gives us greater performance and allows us to avoid the "slow client
problem" (when other subscribers have to wait for a slow subscriber). A pool
of goroutines can be used to avoid uncontrolled memory growth.

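
For illustration, such a pool could be bounded with a semaphore channel. This
is only a sketch; it follows the `subscribers`/`outc` shape of the pseudocode
above, and the pool size is arbitrary:

```go
// Bound the number of in-flight send goroutines with a semaphore channel.
sem := make(chan struct{}, 100)

for _, subscriber := range subscribers {
    out := subscriber.outc
    sem <- struct{}{} // blocks once 100 sends are already in flight
    go func() {
        defer func() { <-sem }()
        out <- msg
    }()
}
```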

In certain cases, this is what you want. But in our case, because we need
strict ordering of events (if event A was published before B, the guaranteed
delivery order will be A -> B), we can't publish each message in a new
goroutine.

We could also have a goroutine per subscriber, although we'd need to be
careful with the number of subscribers. It is also more difficult to
implement, and it is unclear whether we would benefit from it, because we
would be forced to create N additional channels to distribute each message to
these goroutines.

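
A minimal sketch of that per-subscriber-goroutine variant might look as
follows; the extra internal channel per subscriber (`inc` here) is
hypothetical:

```go
// Assumed: each subscriber has an internal channel `inc` feeding its own
// goroutine, in addition to the client-facing `outc`.
for _, s := range subscribers {
    s := s
    go func() {
        for m := range s.inc {
            s.outc <- m // per-subscriber ordering is preserved
        }
    }()
}

// Publishing still has to fan each message out to N internal channels.
for _, s := range subscribers {
    s.inc <- msg
}
```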

### Non-blocking send

There is also the question of whether we should have a non-blocking send.
Currently, sends are blocking, so publishing to one client can block on
publishing to another. This means a slow or unresponsive client can halt the
system. Instead, we can use a non-blocking send:

```go
for _, subscriber := range subscribers {
    out := subscriber.outc
    select {
    case out <- msg:
    default:
        log("subscriber %v buffer is full, skipping...", subscriber)
    }
}
```

This fixes the "slow client problem", but there is no way for a slow client
to know if it has missed a message. We could return a second channel and
close it to indicate subscription termination. On the other hand, if we are
going to stick with blocking sends, **devs must always ensure the
subscriber's handling code does not block**, which is a hard task to put on
their shoulders.

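
On the subscriber side, the "second channel" idea would look roughly like
this (the names are illustrative; the Decision below formalizes it as
`Canceled()`):

```go
select {
case msg := <-outc:
    // handle msg
case <-canceledc:
    // the pubsub closed canceledc after dropping a message for us,
    // so we may have missed events and should resubscribe
}
```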

An intermediate option is to run a pool of goroutines for a single message
and wait for all of them to finish (sketched below). This solves the "slow
client problem", but we would still have to wait `max(goroutine_X_time)`
before we could publish the next message.

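
A sketch of this option using `sync.WaitGroup`, again following the
hypothetical `subscribers`/`outc` shape used above:

```go
var wg sync.WaitGroup
for _, subscriber := range subscribers {
    out := subscriber.outc
    wg.Add(1)
    go func() {
        defer wg.Done()
        out <- msg
    }()
}
// The next message cannot go out until the slowest subscriber has accepted
// this one, i.e. we wait max(goroutine_X_time).
wg.Wait()
```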

### Channels vs Callbacks

Yet another question is whether we should use channels for message
transmission or call subscriber-defined callback functions. Callback
functions give subscribers more flexibility - you can use mutexes in there,
channels, spawn goroutines, anything you really want. But they also carry
local scope, which can result in memory leaks and/or increased memory usage.

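
For comparison, a callback-based API might look roughly like this
(hypothetical, not the chosen design):

```go
// Whatever the callback closure captures ("local scope") stays alive for as
// long as the subscription does.
type Callback func(msg interface{}, tags TagMap)

func (s *Server) SubscribeWithCallback(ctx context.Context, clientID string, query Query, cb Callback) error {
    // register cb; on publish, the server would invoke cb for every matching message
    return nil
}
```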

Go channels are the de facto standard for carrying data between goroutines.

### Why does `Subscribe()` accept an `out` channel?

Because in our tests, we create buffered channels (cap: 1). Alternatively, we
can make capacity an argument and return a channel.

## Decision

### MsgAndTags

Use a `MsgAndTags` struct on the subscription channel to indicate which tags
the message matched.

```go
type MsgAndTags struct {
    Msg  interface{}
    Tags TagMap
}
```

### Subscription Struct

Change the `Subscribe()` function to return a `Subscription` struct:

```go
type Subscription struct {
    // private fields
}

func (s *Subscription) Out() <-chan MsgAndTags
func (s *Subscription) Canceled() <-chan struct{}
func (s *Subscription) Err() error
```

`Out()` returns a channel onto which messages and tags are published.
`Unsubscribe`/`UnsubscribeAll` does not close the channel, to prevent clients
from receiving a nil message.

`Canceled()` returns a channel that is closed when the subscription is
terminated; it is meant to be used in a select statement.

If the channel returned by `Canceled()` is not yet closed, `Err()` returns
nil. If the channel is closed, `Err()` returns a non-nil error explaining
why: `ErrUnsubscribed` if the subscriber chose to unsubscribe,
`ErrOutOfCapacity` if the subscriber was not pulling messages fast enough and
the channel returned by `Out()` became full. After `Err()` returns a non-nil
error, successive calls to `Err()` return the same error.

```go
subscription, err := pubsub.Subscribe(...)
if err != nil {
    // ...
}
for {
    select {
    case msgAndTags := <-subscription.Out():
        // ...
    case <-subscription.Canceled():
        return subscription.Err()
    }
}
```
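
A slow client can act on the cancellation reason, for example by
resubscribing after `ErrOutOfCapacity`. The `consume` helper below is
hypothetical:

```go
for {
    subscription, err := pubsub.Subscribe(ctx, clientID, query)
    if err != nil {
        return err
    }
    // consume blocks until the subscription is canceled and returns its Err().
    if err := consume(subscription); err == ErrUnsubscribed {
        return nil // the client chose to unsubscribe
    }
    // ErrOutOfCapacity (or another error): drop stale state and resubscribe.
}
```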

### Capacity and Subscriptions

Make the `Out()` channel buffered (with capacity 1) by default. In most
cases, we want to terminate the slow subscriber. Only in rare cases do we
want to block the pubsub (e.g. when debugging consensus). This should lower
the chances of the pubsub being frozen.

```go
// outCap can be used to set capacity of Out channel
// (1 by default, must be greater than 0).
Subscribe(ctx context.Context, clientID string, query Query, outCap ...int) (Subscription, error) {
```
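
For example (the client IDs and the capacity value are illustrative):

```go
// default Out() capacity of 1
sub1, err := pubsub.Subscribe(ctx, "consensus-reactor", query)
if err != nil {
    // ...
}

// a larger buffer for a subscriber that reads in bursts
sub2, err := pubsub.Subscribe(ctx, "websocket-client", query, 100)
if err != nil {
    // ...
}
```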

Use a different function for an unbuffered channel:

```go
// Subscription uses an unbuffered channel. Publishing will block.
SubscribeUnbuffered(ctx context.Context, clientID string, query Query) (Subscription, error) {
```

`SubscribeUnbuffered` should not be exposed to users.

### Blocking/Nonblocking

The publisher should treat these kinds of channels separately.
It should block on unbuffered channels (for use with internal consensus
events in the consensus tests) and not block on the buffered ones. If a
client is too slow to keep up with its messages, its subscription is
terminated:

```go
for _, subscription := range subscriptions {
    out := subscription.outChan
    if cap(out) == 0 {
        // block on unbuffered channel
        out <- msg
    } else {
        // don't block on buffered channels
        select {
        case out <- msg:
        default:
            // set the error, notify on the cancel chan
            subscription.err = fmt.Errorf("client is too slow for msg %v", msg)
            close(subscription.cancelChan)

            // ... unsubscribe and close out
        }
    }
}
```

### How does the new design solve the current issues?

[#951] ([#1880]):

Because of the non-blocking send, a deadlock is no longer possible. If a
client stops reading messages, it will be removed.

[#1879]:

`MsgAndTags` is now used instead of a plain message.

### Future problems and their possible solutions

[#2826]

One question I am still pondering: how to prevent the pubsub from slowing
down consensus. We can increase the pubsub queue size (which is 0 now). It is
also probably a good idea to limit the total number of subscribers.

This can be done automatically. Say we set the queue size to 1000 and, when
it is >= 80% full, refuse new subscriptions.

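
A sketch of that guard inside `Subscribe` (the `queue` field and the error
value are hypothetical):

```go
// Refuse new subscriptions while the publish queue is at or above 80% of its
// capacity.
if len(s.queue) >= cap(s.queue)*80/100 {
    return Subscription{}, ErrServerOverloaded
}
```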

## Status

Implemented

## Consequences

### Positive

- more idiomatic interface
- subscribers know which tags the message was published with
- subscribers are aware of the reason their subscription was canceled

### Negative

- (since v1) no concurrency when it comes to publishing messages

### Neutral

[#951]: https://github.com/tendermint/tendermint/issues/951
[#1879]: https://github.com/tendermint/tendermint/issues/1879
[#1880]: https://github.com/tendermint/tendermint/issues/1880
[#2826]: https://github.com/tendermint/tendermint/issues/2826