# ADR 033: pubsub 2.0

Author: Anton Kaliaev (@melekes)

## Changelog

02-10-2018: Initial draft

16-01-2019: Second version based on our conversation with Jae

17-01-2019: Third version explaining how new design solves current issues

25-01-2019: Fourth version to treat buffered and unbuffered channels differently

## Context

Since the initial version of the pubsub, a number of issues have been
raised: [#951], [#1879], [#1880]. Some of them are high-level issues questioning the
core design choices made. Others are minor and mostly about the interface of
`Subscribe()` / `Publish()` functions.

### Sync vs Async

When publishing a message to subscribers, we could do it in a goroutine:

_using channels for data transmission_
```go
for _, subscriber := range subscribers {
    out := subscriber.outc
    go func() {
        out <- msg
    }()
}
```

_by invoking callback functions_
```go
for _, subscriber := range subscribers {
    go subscriber.callbackFn()
}
```

This gives us greater performance and allows us to avoid the "slow client problem"
(when other subscribers have to wait for a slow subscriber). A pool of
goroutines can be used to avoid uncontrolled memory growth.

In certain cases, this is what you want. But in our case, because we need
strict ordering of events (if event A was published before B, the guaranteed
delivery order will be A -> B), we can't publish each message in a new goroutine.

We could also have a goroutine per subscriber, although we'd need to be careful
with the number of subscribers. It's also more difficult to implement, and it's
unclear whether we'd benefit from it (because we'd be forced to create N additional
channels to distribute messages to these goroutines).

### Non-blocking send

There is also the question of whether we should have a non-blocking send.
Currently, sends are blocking, so publishing to one client can block on
publishing to another. This means a slow or unresponsive client can halt the
system. Instead, we can use a non-blocking send:

```go
for _, subscriber := range subscribers {
    out := subscriber.outc
    select {
    case out <- msg:
    default:
        log.Printf("subscriber %v buffer is full, skipping...", subscriber)
    }
}
```

This fixes the "slow client problem", but there is no way for a slow client to
know if it has missed a message. We could return a second channel and close it
to indicate subscription termination. On the other hand, if we're going to
stick with blocking sends, **devs must always ensure the subscriber's handling code
does not block**, which is a hard task to put on their shoulders.

An interim option is to run a pool of goroutines for a single message and wait for
all of them to finish. This solves the "slow client problem", but we'd still
have to wait `max(goroutine_X_time)` before we can publish the next message.

### Channels vs Callbacks

Yet another question is whether we should use channels for message transmission or
call subscriber-defined callback functions. Callback functions give subscribers
more flexibility - you can use mutexes, channels, spawn goroutines, anything you
really want. But they also carry local scope, which can result in memory leaks
and/or increased memory usage.

Go channels are the de-facto standard for carrying data between goroutines.
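To illustrate the contrast, here is a small, runnable sketch; the `channelSubscriber` and `callbackSubscriber` types below are hypothetical illustrations, not part of the pubsub package:

```go
package main

import "fmt"

// channelSubscriber receives messages on a channel it owns.
type channelSubscriber struct {
	outc chan string
}

// callbackSubscriber hands the publisher a function to invoke.
type callbackSubscriber struct {
	callbackFn func(msg string)
}

func main() {
	chSub := channelSubscriber{outc: make(chan string, 1)}
	cbSub := callbackSubscriber{callbackFn: func(msg string) {
		// The callback closes over local state; anything captured here
		// stays alive for as long as the subscription does.
		fmt.Println("callback got:", msg)
	}}

	msg := "NewBlock"

	// Channel delivery: the publisher only sends; back-pressure and ordering
	// are governed by the channel itself.
	chSub.outc <- msg
	fmt.Println("channel got:", <-chSub.outc)

	// Callback delivery: the publisher runs subscriber-defined code, so a
	// slow or blocking callback directly slows down the publisher.
	cbSub.callbackFn(msg)
}
```

With channel delivery the publisher never executes subscriber code, whereas a blocking or state-capturing callback runs inside the publisher - which is the flexibility vs. memory/ordering trade-off described above.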
### Why does `Subscribe()` accept an `out` channel?

Because in our tests, we create buffered channels (cap: 1). Alternatively, we
can make capacity an argument and return a channel.

## Decision

### MsgAndTags

Use a `MsgAndTags` struct on the subscription channel to indicate what tags the
msg matched.

```go
type MsgAndTags struct {
    Msg  interface{}
    Tags TagMap
}
```

### Subscription Struct

Change the `Subscribe()` function to return a `Subscription` struct:

```go
type Subscription struct {
    // private fields
}

func (s *Subscription) Out() <-chan MsgAndTags
func (s *Subscription) Canceled() <-chan struct{}
func (s *Subscription) Err() error
```

`Out()` returns a channel onto which messages and tags are published.
`Unsubscribe`/`UnsubscribeAll` does not close the channel, to prevent clients from
receiving a nil message.

`Canceled()` returns a channel that's closed when the subscription is terminated;
it is meant to be used in a select statement.

If the channel returned by `Canceled()` is not closed yet, `Err()` returns nil.
If the channel is closed, `Err()` returns a non-nil error explaining why:
`ErrUnsubscribed` if the subscriber chose to unsubscribe,
`ErrOutOfCapacity` if the subscriber is not pulling messages fast enough and the channel returned by `Out()` became full.
After `Err()` returns a non-nil error, successive calls to `Err()` return the same error.

```go
subscription, err := pubsub.Subscribe(...)
if err != nil {
    // ...
}
for {
    select {
    case msgAndTags := <-subscription.Out():
        // ...
    case <-subscription.Canceled():
        return subscription.Err()
    }
}
```

### Capacity and Subscriptions

Make the `Out()` channel buffered (with capacity 1) by default. In most cases, we want to
terminate the slow subscriber. Only in rare cases do we want to block the pubsub
(e.g. when debugging consensus). This should lower the chances of the pubsub
being frozen.

```go
// outCap can be used to set capacity of Out channel
// (1 by default, must be greater than 0).
Subscribe(ctx context.Context, clientID string, query Query, outCap ...int) (Subscription, error) {
```

Use a different function for an unbuffered channel:

```go
// Subscription uses an unbuffered channel. Publishing will block.
SubscribeUnbuffered(ctx context.Context, clientID string, query Query) (Subscription, error) {
```

`SubscribeUnbuffered` should not be exposed to users.

### Blocking/Nonblocking

The publisher should treat these two kinds of channels separately.
It should block on unbuffered channels (for use with internal consensus events
in the consensus tests) and not block on the buffered ones. If a client is too
slow to keep up with its messages, its subscription is terminated:

```go
for _, subscription := range subscriptions {
    out := subscription.outChan
    if cap(out) == 0 {
        // block on unbuffered channel
        out <- msg
    } else {
        // don't block on buffered channels
        select {
        case out <- msg:
        default:
            // set the error, notify on the cancel chan
            subscription.err = fmt.Errorf("client is too slow for msg")
            close(subscription.cancelChan)

            // ... unsubscribe and close out
        }
    }
}
```
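To make this policy concrete, here is a minimal, runnable sketch. It assumes a hypothetical `subscription` struct with `out`, `cancelChan`, and `err` fields; the names are illustrative, not the actual pubsub implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// subscription is a hypothetical stand-in for the real Subscription type.
type subscription struct {
	out        chan string
	cancelChan chan struct{}
	err        error
}

// publish delivers msg to every live subscription: it blocks on unbuffered
// channels and terminates buffered subscriptions whose channels are full.
func publish(subs []*subscription, msg string) {
	for _, s := range subs {
		if s.err != nil {
			continue // already terminated
		}
		if cap(s.out) == 0 {
			// Unbuffered (internal use only): block until the subscriber
			// receives the message, preserving strict ordering.
			s.out <- msg
			continue
		}
		// Buffered: never block the publisher. A full buffer means the
		// subscriber is too slow, so its subscription is canceled.
		select {
		case s.out <- msg:
		default:
			s.err = errors.New("client is too slow: Out channel is full")
			close(s.cancelChan)
			// ... a real implementation would also unsubscribe it here
		}
	}
}

func main() {
	slow := &subscription{out: make(chan string, 1), cancelChan: make(chan struct{})}

	publish([]*subscription{slow}, "A") // fills the buffer
	publish([]*subscription{slow}, "B") // buffer still full -> canceled

	<-slow.cancelChan // closed by publish, so this returns immediately
	fmt.Println("canceled:", slow.err)
	fmt.Println("last delivered:", <-slow.out) // "A"
}
```

A real implementation would also remove the terminated subscription from its internal index so later publishes skip it; the `err` check at the top of the loop only approximates that.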
### How does this new design solve the current issues?

[#951] ([#1880]):

Because of the non-blocking send, a deadlock is no longer possible. If a
client stops reading messages, it will be removed.

[#1879]:

`MsgAndTags` is now used instead of a plain message.

### Future problems and their possible solutions

[#2826]

One question I am still pondering: how to prevent the pubsub from slowing
down consensus. We can increase the pubsub queue size (which is 0 now). Also,
it's probably a good idea to limit the total number of subscribers.

This can be done automatically. Say we set the queue size to 1000 and, when it's >=
80% full, refuse new subscriptions.

## Status

Implemented

## Consequences

### Positive

- more idiomatic interface
- subscribers know what tags the msg was published with
- subscribers are aware of the reason their subscription was canceled

### Negative

- (since v1) no concurrency when it comes to publishing messages

### Neutral

[#951]: https://github.com/tendermint/tendermint/issues/951
[#1879]: https://github.com/tendermint/tendermint/issues/1879
[#1880]: https://github.com/tendermint/tendermint/issues/1880
[#2826]: https://github.com/tendermint/tendermint/issues/2826