# Switch

The switch is a core component of the p2p layer.
It manages the procedures for [dialing peers](#dialing-peers) and
[accepting](#accepting-peers) connections from peers, which are actually
implemented by the [transport](./transport.md).
It also manages the reactors, i.e., protocols implemented by the node that
interact with its peers.
Once a connection with a peer is established, the peer is [added](#add-peer) to
the switch and all registered reactors.
Reactors may also instruct the switch to [stop a peer](#stop-peer), namely
disconnect from it.
The switch, in this case, makes sure that the peer is removed from all
registered reactors.

## Dialing peers

Dialing a peer is implemented by the `DialPeerWithAddress` method.

This method is invoked by the [peer manager](./peer_manager.md#ensure-peers)
to dial a peer address and establish a connection with an outbound peer.

The switch keeps a single dialing routine per peer ID.
This is ensured by keeping a synchronized map `dialing` with the IDs of peers
that the switch is currently dialing.
A peer ID is added to `dialing` when the `DialPeerWithAddress` method is called
for that peer, and it is removed when the method returns for whatever reason.
The method returns immediately when invoked for a peer whose ID is already in
the `dialing` structure.

The actual dialing is implemented by the [`Dial`](./transport.md#dial) method
of the transport configured for the switch, in the `addOutboundPeerWithConfig`
method.
If the transport succeeds in establishing a connection, the returned `Peer` is
added to the switch using the [`addPeer`](#add-peer) method.
This operation can fail, returning an error. In this case, the switch invokes
the transport's [`Cleanup`](./transport.md#cleanup) method to clean up any resources
associated with the peer.
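
The single-dial-per-peer rule described above can be illustrated with a small, self-contained sketch. It is not the switch's code: the `dialGuard` type, the `dial` helper, and the `errAlreadyDialing` error are hypothetical names chosen for this example, and the dialing itself is replaced by a caller-supplied function.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAlreadyDialing is a hypothetical stand-in for the error returned when a
// dial to the same peer ID is already in progress.
var errAlreadyDialing = errors.New("already dialing peer")

// dialGuard is a hypothetical type sketching the role of the `dialing` map.
type dialGuard struct {
	mtx     sync.Mutex
	dialing map[string]struct{}
}

func newDialGuard() *dialGuard {
	return &dialGuard{dialing: make(map[string]struct{})}
}

// dial runs dialFn for peerID unless a dial to that peer is in progress.
// The ID is removed from the map when dial returns, whatever the outcome.
func (g *dialGuard) dial(peerID string, dialFn func() error) error {
	g.mtx.Lock()
	if _, ok := g.dialing[peerID]; ok {
		g.mtx.Unlock()
		return errAlreadyDialing
	}
	g.dialing[peerID] = struct{}{}
	g.mtx.Unlock()

	defer func() {
		g.mtx.Lock()
		delete(g.dialing, peerID)
		g.mtx.Unlock()
	}()
	return dialFn()
}

func main() {
	g := newDialGuard()
	err := g.dial("peer-1", func() error {
		// A nested call for the same ID models a second, overlapping dial.
		return g.dial("peer-1", func() error { return nil })
	})
	fmt.Println("overlapping dial:", err)
	fmt.Println("later dial:", g.dial("peer-1", func() error { return nil }))
}
```

The lock is released before `dialFn` runs, so a slow dial to one peer does not block dials to other peers.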

If the transport fails to establish a connection with a peer that is configured
as a persistent peer, the switch spawns a routine to [reconnect to the peer](#reconnect-to-peer).
If the peer is already in the `reconnecting` state, the spawned routine has no
effect and returns immediately.
This is in fact a likely scenario, as the `reconnectToPeer` routine relies on
this same `DialPeerWithAddress` method for dialing peers.

### Manual operation

The `DialPeersAsync` method receives a list of peer addresses (strings)
and dials all of them in parallel.
It is invoked in two situations:

- In the [setup](https://github.com/aakash4dev/cometbft/blob/v0.34.x/node/node.go#L987)
  of a node, to establish connections with every configured persistent peer
- In the RPC package, to implement two unsafe RPC commands, not used in production:
  [`DialSeeds`](https://github.com/aakash4dev/cometbft/blob/v0.34.x/rpc/core/net.go#L47) and
  [`DialPeers`](https://github.com/aakash4dev/cometbft/blob/v0.34.x/rpc/core/net.go#L87)

The received list of peer addresses to dial is parsed into `NetAddress` instances.
In case of parsing errors, the method returns. An exception is made for
DNS resolution `ErrNetAddressLookup` errors, which do not interrupt the procedure.

As the peer addresses provided to this method are typically not known by the node,
contrary to the addresses dialed using the `DialPeerWithAddress` method,
they are added to the node's address book, which is persisted to disk.

The switch dials the provided peers in parallel.
The list of peer addresses is randomly shuffled, and for each peer a routine is
spawned.
Each routine sleeps for a random interval, up to 3 seconds, then invokes the
`DialPeerWithAddress` method that actually dials the peer.

### Reconnect to peer

The `reconnectToPeer` method is invoked when a connection attempt to a peer fails,
and the peer is configured as a persistent peer.

The `reconnecting` synchronized map keeps the peers in this state, identified
by their IDs (string).
This should ensure that a single instance of this method is running for each
peer at any time.
The peer is kept in this map while this method is running for it: it is set at
the beginning, and removed when the method returns for whatever reason.
If the peer is already in the `reconnecting` state, nothing is done.

The remainder of the method performs multiple connection attempts to the peer,
via the `DialPeerWithAddress` method.
If a connection attempt succeeds, the method returns and the routine finishes.
The same applies when an `ErrCurrentlyDialingOrExistingAddress` error is
returned by the dialing method, as it indicates that the peer is already connected
or that another routine is attempting to (re)connect to it.

A first set of connection attempts is done at (about) regular intervals.
More precisely, between two attempts, the switch waits for an interval of
`reconnectInterval`, hard-coded to 5 seconds, plus a random jitter up to
`dialRandomizerIntervalMilliseconds`, hard-coded to 3 seconds.
At most `reconnectAttempts` attempts, hard-coded to 20, are made using this
regular-interval approach.

A second set of connection attempts is done with exponentially increasing
intervals.
The base interval `reconnectBackOffBaseSeconds` is hard-coded to 3 seconds,
which is also the increasing factor.
The exponentially increasing dialing interval is also adjusted by a random
jitter up to `dialRandomizerIntervalMilliseconds`.
At most `reconnectBackOffAttempts` attempts, hard-coded to 10, are made using this approach.

> Note: the first sleep interval, to which a random jitter is applied, is 1 second,
> not `reconnectBackOffBaseSeconds`, as the first exponent is `0`...

## Accepting peers

The `acceptRoutine` method is a persistent routine that handles connections
accepted by the transport configured for the switch.

The [`Accept`](./transport.md#accept) method of the configured transport
returns a `Peer` with which an inbound connection was established.
The switch accepts a new peer if the maximum number of inbound peers has not
been reached, or if the peer is configured as an _unconditional peer_.
The maximum number of inbound peers is determined by the `MaxNumInboundPeers`
configuration parameter, whose default value is `40`.

If accepted, the peer is added to the switch using the [`addPeer`](#add-peer) method.
If the switch does not accept the established incoming connection, or if the
`addPeer` method returns an error, the switch invokes the transport's
[`Cleanup`](./transport.md#cleanup) method to clean up any resources associated
with the peer.

The transport's `Accept` method can also return a number of errors.
Errors of `ErrRejected` or `ErrFilterTimeout` types are ignored,
an `ErrTransportClosed` causes the accept routine to be interrupted,
while other errors cause the routine to panic.

> TODO: which errors can cause the routine to panic?

## Add peer

The `addPeer` method adds a peer to the switch,
either after dialing (by `addOutboundPeerWithConfig`, called by `DialPeerWithAddress`)
a peer and establishing an outbound connection,
or after accepting (`acceptRoutine`) a peer and establishing an inbound connection.

The first step is to invoke the `filterPeer` method.
It checks whether the peer is already in the set of connected peers,
and whether any of the configured `peerFilter` methods reject the peer.
If the peer is already present or it is rejected by any filter, the `addPeer`
method fails and returns an error.

Then, the new peer is started, added to the set of connected peers, and added
to all reactors.
More precisely, the new peer's information is first provided to every
reactor (`InitPeer` method).
Next, the peer's sending and receiving routines are started, and the peer is
added to the set of connected peers.
These two operations can fail, causing `addPeer` to return an error.
Then, in the absence of previous errors, the peer is added to every reactor (`AddPeer` method).

> Adding the peer to the peer set returns an `ErrSwitchDuplicatePeerID` error
> when a peer with the same ID is already present.
>
> TODO: Starting a peer could be reduced as starting the MConn with that peer?

## Stop peer

There are two methods for stopping a peer, namely disconnecting from it, and
removing it from the table of connected peers.

The `StopPeerForError` method is invoked to stop a peer due to an external
error, which is provided to the method as a generic "reason".

The `StopPeerGracefully` method stops a peer in the absence of errors or, more
precisely, without providing the switch with any "reason" for that.

In both cases the `Peer` instance is stopped, the peer is removed from all
registered reactors, and finally from the list of connected peers.

> Issue <https://github.com/tendermint/tendermint/issues/3338> is mentioned in
> the internal `stopAndRemovePeer` method explaining why removing the peer from
> the list of connected peers is the last action taken.

When there is a "reason" for stopping the peer (`StopPeerForError` method)
and the peer is a persistent peer, the method creates a routine to attempt
reconnecting to the peer address, using the `reconnectToPeer` method.
If the peer is an outbound peer, the peer's address is known, since the switch
has dialed the peer.
Otherwise, the peer address is retrieved from the `NodeInfo` instance from the
connection handshake.

## Add reactor

The `AddReactor` method registers a `Reactor` to the switch.

The reactor is associated with the set of channel IDs it employs.
Two reactors (in the same node) cannot share the same channel ID.

There is a callback to the reactor, in which the switch passes itself to the
reactor.

## Remove reactor

The `RemoveReactor` method unregisters a `Reactor` from the switch.

The reactor is disassociated from the set of channel IDs it employs.

There is a callback to the reactor, in which the switch passes `nil` to the
reactor.

## OnStart

This is a `BaseService` method.

All registered reactors are started.

The switch's `acceptRoutine` is started.

## OnStop

This is a `BaseService` method.

All (connected) peers are stopped and removed from the peer list using the
`stopAndRemovePeer` method.

All registered reactors are stopped.

## Broadcast

This method broadcasts a message on a channel, by sending the message in
parallel to all connected peers.

The method spawns a routine for each connected peer, invoking the `Send` method
provided by each `Peer` instance with the provided message and channel ID.
The return values (booleans) of these calls are redirected to a channel that is
returned by the method.

> TODO: detail where this method is invoked:
>
> - By the consensus protocol, in `broadcastNewRoundStepMessage`,
>   `broadcastNewValidBlockMessage`, and `broadcastHasVoteMessage`
> - By the state sync protocol
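
The fan-out performed by the broadcast procedure can be sketched as follows. The `peer` interface, the `fakePeer` type, and the `broadcast` function are hypothetical and simplified for this example: the real `Peer` interface and the switch's method signature may differ, and closing the result channel once all sends have returned is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// peer is a hypothetical, minimal stand-in for the `Peer` interface.
type peer interface {
	Send(chID byte, msg []byte) bool
}

// fakePeer reports a fixed result for every Send call.
type fakePeer struct{ ok bool }

func (p fakePeer) Send(chID byte, msg []byte) bool { return p.ok }

// broadcast sends msg on chID to every peer in parallel and returns a channel
// carrying one boolean per peer. The channel is buffered so that no sender
// blocks, and it is closed once every Send call has returned.
func broadcast(peers []peer, chID byte, msg []byte) <-chan bool {
	results := make(chan bool, len(peers))
	var wg sync.WaitGroup
	for _, p := range peers {
		wg.Add(1)
		go func(p peer) {
			defer wg.Done()
			results <- p.Send(chID, msg)
		}(p)
	}
	go func() {
		wg.Wait()
		close(results)
	}()
	return results
}

func main() {
	peers := []peer{fakePeer{true}, fakePeer{false}, fakePeer{true}}
	succeeded := 0
	for ok := range broadcast(peers, 0x20, []byte("hello")) {
		if ok {
			succeeded++
		}
	}
	fmt.Println("successful sends:", succeeded)
}
```

Returning the channel rather than waiting lets the caller decide whether to collect the results or ignore them.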