# Configuring and operating a Raft ordering service

**Audience**: *Raft ordering node admins*

## Conceptual overview

For a high level overview of the concept of ordering and how the supported
ordering service implementations (including Raft) work, check out our conceptual
documentation on the [Ordering Service](./orderer/ordering_service.html).

To learn about the process of setting up an ordering node --- including the
creation of a local MSP and the creation of a genesis block --- check out our
documentation on [Setting up an ordering node](orderer_deploy.html).

## Configuration

While every Raft node must be added to the system channel, a node does not need
to be added to every application channel. Additionally, you can add and remove
nodes from a channel dynamically without affecting the other nodes, a process
described in the Reconfiguration section below.

Raft nodes identify each other using TLS pinning, so in order to impersonate a
Raft node, an attacker needs to obtain the **private key** of its TLS
certificate. As a result, it is not possible to run a Raft node without a valid
TLS configuration.

A Raft cluster is configured in two planes:

* **Local configuration**: Governs node specific aspects, such as TLS
  communication, replication behavior, and file storage.

* **Channel configuration**: Defines the membership of the Raft cluster for the
  corresponding channel, as well as protocol specific parameters such as
  heartbeat frequency, leader timeouts, and more.

Recall that each channel runs its own instance of the Raft protocol. Thus, a
Raft node must be referenced in the configuration of each channel it belongs to
by adding its server and client TLS certificates (in `PEM` format) to the channel
config. This ensures that when other nodes receive a message from it, they can
securely confirm the identity of the node that sent the message.

The following section from `configtx.yaml` shows three Raft nodes (also called
“consenters”) in the channel:

```
Consenters:
    - Host: raft0.example.com
      Port: 7050
      ClientTLSCert: path/to/ClientTLSCert0
      ServerTLSCert: path/to/ServerTLSCert0
    - Host: raft1.example.com
      Port: 7050
      ClientTLSCert: path/to/ClientTLSCert1
      ServerTLSCert: path/to/ServerTLSCert1
    - Host: raft2.example.com
      Port: 7050
      ClientTLSCert: path/to/ClientTLSCert2
      ServerTLSCert: path/to/ServerTLSCert2
```

Note: an orderer will be listed as a consenter in the system channel as well as
in any application channels it is joined to.

When the channel config block is created, the `configtxgen` tool reads the paths
to the TLS certificates and replaces the paths with the corresponding bytes of
the certificates.

### Local configuration

The `orderer.yaml` has two configuration sections that are relevant for Raft
orderers: **Cluster**, which determines the TLS communication configuration, and
**Consensus**, which determines where Write Ahead Logs and Snapshots are stored.

**Cluster parameters:**

By default, the Raft service runs on the same gRPC server as the client facing
server (which is used to send transactions or pull blocks), but it can be
configured to have a separate gRPC server with a separate port.
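
As a rough sketch, a dedicated cluster listener might be configured in the
`General.Cluster` section of `orderer.yaml` as follows; the address, port, and
file paths below are placeholders, not recommended values:

```
General:
    Cluster:
        # Dedicated listener for intra-cluster (Raft) traffic.
        ListenAddress: 0.0.0.0
        ListenPort: 7051
        # TLS key pair presented to other cluster nodes on this listener.
        ServerCertificate: path/to/cluster-server.crt
        ServerPrivateKey: path/to/cluster-server.key
        # TLS key pair used when dialing other cluster nodes.
        ClientCertificate: path/to/cluster-client.crt
        ClientPrivateKey: path/to/cluster-client.key
```

Each of these keys is described below.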

A separate cluster listener is useful for cases where you want TLS certificates
issued by the organizational CAs, but used only by the cluster nodes to
communicate among themselves, and TLS certificates issued by a public TLS CA for
the client facing API.

* `ClientCertificate`, `ClientPrivateKey`: The file path of the client TLS
  certificate and corresponding private key.
* `ListenPort`: The port the cluster listens on.
  It must be the same as `consenters[i].Port` in the channel configuration.
  If blank, the port is the same as the orderer general port (`general.listenPort`).
* `ListenAddress`: The address the cluster service is listening on.
* `ServerCertificate`, `ServerPrivateKey`: The TLS server certificate key pair
  which is used when the cluster service is running on a separate gRPC server
  (different port).

Note: `ListenPort`, `ListenAddress`, `ServerCertificate`, `ServerPrivateKey` must
be either set together or unset together.
If they are unset, they are inherited from the general TLS section,
for example `general.tls.{privateKey, certificate}`.
When general TLS is disabled:

- Use a different `ListenPort` than the orderer general port
- Properly configure TLS root CAs in the channel configuration.

There are also hidden configuration parameters for `general.cluster` which can be
used to further fine tune the cluster communication or replication mechanisms:

* `SendBufferSize`: Regulates the number of messages in the egress buffer.
* `DialTimeout`, `RPCTimeout`: Specify the timeouts of creating connections and
  establishing streams.
* `ReplicationBufferSize`: The maximum number of bytes that can be allocated
  for each in-memory buffer used for block replication from other cluster nodes.
  Each channel has its own memory buffer. Defaults to `20971520` which is `20MB`.
* `PullTimeout`: The maximum duration the ordering node will wait for a block
  to be received before it aborts. Defaults to five seconds.
* `ReplicationRetryTimeout`: The maximum duration the ordering node will wait
  between two consecutive replication attempts. Defaults to five seconds.
* `ReplicationBackgroundRefreshInterval`: The time between two consecutive
  attempts to replicate existing channels that this node was added to, or
  channels that this node failed to replicate in the past. Defaults to five
  minutes.
* `TLSHandshakeTimeShift`: If the TLS certificates of the ordering nodes
  expire and are not replaced in time (see TLS certificate rotation below),
  communication between them cannot be established, and it will be impossible
  to send new transactions to the ordering service.
  To recover from such a scenario, it is possible to make TLS handshakes
  between ordering nodes consider the time to be shifted backwards by the
  amount configured in `TLSHandshakeTimeShift`.
  This setting only applies when a separate cluster listener is in use. If
  the cluster service is sharing the orderer's main gRPC server, then instead
  specify `TLSHandshakeTimeShift` in the `General.TLS` section.

**Consensus parameters:**

* `WALDir`: The location at which Write Ahead Logs for `etcd/raft` are stored.
  Each channel will have its own subdirectory named after the channel ID.
* `SnapDir`: Specifies the location at which snapshots for `etcd/raft` are stored.
  Each channel will have its own subdirectory named after the channel ID.
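
As a minimal sketch, these two parameters might appear in `orderer.yaml` as
follows; the directory paths shown are only an example, so use locations
appropriate for your deployment:

```
Consensus:
    # Write Ahead Logs; one subdirectory is created per channel.
    WALDir: /var/hyperledger/production/orderer/etcdraft/wal
    # Raft snapshots; one subdirectory is created per channel.
    SnapDir: /var/hyperledger/production/orderer/etcdraft/snapshot
```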

There are also two hidden configuration parameters that can each be set by adding
them to the Consensus section in the `orderer.yaml`:

* `EvictionSuspicion`: The cumulative period of time of channel eviction
  suspicion that triggers the node to pull blocks from other nodes and see if it
  has been evicted from the channel in order to confirm its suspicion. If the
  suspicion is confirmed (the inspected block doesn't contain the node's TLS
  certificate), the node halts its operation for that channel. A node suspects
  its channel eviction when it doesn't know about any elected leader and cannot
  be elected as leader in the channel. Defaults to 10 minutes.
* `TickIntervalOverride`: If set, this value will be preferred over the tick
  interval configured in all channels where this ordering node is a consenter.
  This value should be set only with great care, as a mismatch in tick interval
  across orderers could result in a loss of quorum for one or more channels.

### Channel configuration

Apart from the (already discussed) consenters, the Raft channel configuration has
an `Options` section which relates to protocol specific knobs. It is currently
not possible to change these values dynamically while a node is running. The
node has to be reconfigured and restarted.

The only exception is `SnapshotIntervalSize`, which can be adjusted at runtime.

Note: It is recommended to avoid changing the following values, as a misconfiguration
might lead to a state where a leader cannot be elected at all (i.e., if the
`TickInterval` and `ElectionTick` are extremely low). Situations where a leader
cannot be elected are impossible to resolve, as leaders are required to make
changes. Because of such dangers, we suggest not tuning these parameters for most
use cases.

* `TickInterval`: The time interval between two `Node.Tick` invocations.
* `ElectionTick`: The number of `Node.Tick` invocations that must pass between
  elections. That is, if a follower does not receive any message from the leader
  of the current term before `ElectionTick` has elapsed, it will become a
  candidate and start an election.
  * `ElectionTick` must be greater than `HeartbeatTick`.
* `HeartbeatTick`: The number of `Node.Tick` invocations that must pass between
  heartbeats. That is, a leader sends heartbeat messages to maintain its
  leadership every `HeartbeatTick` ticks.
* `MaxInflightBlocks`: Limits the max number of in-flight append blocks during
  the optimistic replication phase.
* `SnapshotIntervalSize`: Defines the number of bytes after which a snapshot is
  taken.

## Reconfiguration

The Raft orderer supports dynamic (meaning, while the channel is being serviced)
addition and removal of nodes as long as only one node is added or removed at a
time. Note that your cluster must be operational and able to achieve consensus
before you attempt to reconfigure it. For instance, if you have three nodes, and
two nodes fail, you will not be able to reconfigure your cluster to remove those
nodes. Similarly, if you have one failed node in a channel with three nodes, you
should not attempt to rotate a certificate, as this would induce a second fault.
As a rule, you should never attempt any configuration changes to the Raft
consenters, such as adding or removing a consenter or rotating a consenter's
certificate, unless all consenters are online and healthy.
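
For illustration only, adding a fourth node to the three-node example shown
earlier would mean extending the consenter set of each relevant channel with an
entry like the hypothetical `raft3.example.com` below (written in `configtx.yaml`
notation; on a live channel the change is submitted as a channel configuration
update transaction, in which the certificates appear as bytes rather than file
paths):

```
Consenters:
    # ... the three existing consenters remain unchanged ...
    - Host: raft3.example.com
      Port: 7050
      ClientTLSCert: path/to/ClientTLSCert3
      ServerTLSCert: path/to/ServerTLSCert3
```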

If you do decide to make such a change, it is recommended to only attempt it
during a maintenance cycle. Problems are most likely to occur when a
configuration change is attempted in clusters with only a few nodes while a node
is down. For example, if you have three nodes in your consenter set and one of
them is down, it means you have two out of three nodes alive. If you extend the
cluster to four nodes while in this state, you will have only two out of four
nodes alive, which is not a quorum. The fourth node won't be able to onboard
because nodes can only onboard to functioning clusters (unless the total size of
the cluster is one or two).

So by extending a cluster of three nodes to four nodes (while only two are
alive) you are effectively stuck until the original offline node is resurrected.

Adding a new node to a Raft cluster is done by:

1. **Adding the TLS certificates** of the new node to the channel through a
   channel configuration update transaction. Note: the new node must be added to
   the system channel before being added to one or more application channels.
2. **Fetching the latest config block** of the system channel from an orderer node
   that's part of the system channel.
3. **Ensuring that the node that will be added is part of the system channel**
   by checking that the config block that was fetched includes the certificate of
   the (soon to be) added node.
4. **Starting the new Raft node** with the path to the config block in the
   `General.BootstrapFile` configuration parameter.
5. **Waiting for the Raft node to replicate the blocks** from existing nodes for
   all channels its certificates have been added to. After this step has been
   completed, the node begins servicing the channel.
6. **Adding the endpoint** of the newly added Raft node to the channel
   configuration of all channels.

It is possible to add a node that is already running (and already participates in
some channels) to an additional channel while the node itself is running. To do
this, simply add the node’s certificate to the channel config of that channel. The
node will autonomously detect its addition to the new channel (this is governed by
`ReplicationBackgroundRefreshInterval`, five minutes by default, so reboot the node
if you want it to detect the new channel more quickly) and will pull the channel
blocks from an orderer in the channel, and then start the Raft instance for that
chain.

After it has successfully done so, the channel configuration can be updated to
include the endpoint of the new Raft orderer.

Removing a node from a Raft cluster is done by:

1. Removing its endpoint from the channel config for all channels, including
   the system channel controlled by the orderer admins.
2. Removing its entry (identified by its certificates) from the channel
   configuration for all channels. Again, this includes the system channel.
3. Shutting down the node.

Removing a node from a specific channel, but keeping it servicing other channels,
is done by:

1. Removing its endpoint from the channel config for the channel.
2. Removing its entry (identified by its certificates) from the channel
   configuration.
3. The second phase causes:
   * The remaining orderer nodes in the channel to cease communicating with
     the removed orderer node in the context of the removed channel. They might
     still be communicating on other channels.
   * The node that is removed from the channel to autonomously detect its
     removal, either immediately or after `EvictionSuspicion` time has passed
     (10 minutes by default), and shut down its Raft instance.

### TLS certificate rotation for an orderer node

All TLS certificates have an expiration date that is determined by the issuer.
These expiration dates can range from 10 years from the date of issuance to as
little as a few months, so check with your issuer. Before the expiration date,
you will need to rotate these certificates on the node itself and on every
channel the node is joined to, including the system channel.

For each channel the node participates in:

1. Update the channel configuration with the new certificates.
2. Replace its certificates in the file system of the node.
3. Restart the node.

Because a node can only have a single TLS certificate key pair, the node will be
unable to service channels its new certificates have not been added to during
the update process, degrading its fault tolerance capacity. Because of this,
**once the certificate rotation process has been started, it should be completed
as quickly as possible.**

If for some reason the rotation of the TLS certificates has started but cannot
be completed in all channels, it is advised to rotate the TLS certificates back
to what they were and attempt the rotation again later.

### Certificate expiration related authentication

Whenever a client with an identity that has an expiration date (such as an
identity based on an x509 certificate) sends a transaction to the orderer, the
orderer checks whether the identity has expired, and if so, rejects the
transaction submission.

However, it is possible to configure the orderer to ignore expiration of
identities by enabling the `General.Authentication.NoExpirationChecks`
configuration option in the `orderer.yaml`.

This should be done only under extreme circumstances, where the certificates of
the administrators have expired and, as a result, it is not possible to send the
configuration updates that would replace the administrator certificates with
renewed ones, because the config transactions signed by the existing (expired)
administrator certificates are rejected. After updating the channel, it is
recommended to change back to the default configuration, which enforces
expiration checks on identities.

## Metrics

For a description of the Operations Service and how to set it up, check out
[our documentation on the Operations Service](operations_service.html).

For a list of the metrics that are gathered by the Operations Service, check out
our [reference material on metrics](metrics_reference.html).

While the metrics you prioritize will have a lot to do with your particular use
case and configuration, there are a few metrics in particular you might want to
monitor:

* `consensus_etcdraft_is_leader`: identifies which node in the cluster is
  currently the leader. If no nodes have this set, you have lost quorum.
* `consensus_etcdraft_data_persist_duration`: indicates how long write operations
  to the Raft cluster's persistent write ahead log take. For protocol safety,
  messages must be persisted durably, calling `fsync` where appropriate, before
  they can be shared with the consenter set.
  If this value begins to climb, this
  node may not be able to participate in consensus (which could lead to a
  service interruption for this node and possibly the network).
* `consensus_etcdraft_cluster_size` and `consensus_etcdraft_active_nodes`: these
  channel metrics help track the "active" nodes (which, as it sounds, are the
  nodes that are currently contributing to the cluster, as compared to the total
  number of nodes in the cluster). If the number of active nodes falls below a
  majority of the nodes in the cluster, quorum will be lost and the ordering
  service will stop processing blocks on the channel.

## Troubleshooting

* The more stress you put on your nodes, the more you might have to change certain
  parameters. As with any system, computer or mechanical, stress can lead to a drag
  in performance. As we noted in the conceptual documentation, leader elections in
  Raft are triggered when follower nodes do not receive either a "heartbeat"
  message or an "append" message that carries data from the leader for a certain
  amount of time. Because Raft nodes share the same communication layer across
  channels (this does not mean they share data --- they do not!), if a Raft node is
  part of the consenter set in many channels, you might want to lengthen the amount
  of time it takes to trigger an election to avoid inadvertent leader elections.

<!--- Licensed under Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/) -->