# Configuring and operating a Raft ordering service

**Audience**: *Raft ordering node admins*

## Conceptual overview

For a high level overview of the concept of ordering and how the supported
ordering service implementations (including Raft) work, check out our
conceptual documentation on the [Ordering Service](./orderer/ordering_service.html).

To learn about the process of setting up an ordering node --- including the
creation of a local MSP and the creation of a genesis block --- check out our
documentation on [Setting up an ordering node](orderer_deploy.html).

## Configuration

While every Raft node must be added to the system channel, a node does not need
to be added to every application channel. Additionally, you can add a node to, or
remove a node from, a channel dynamically without affecting the other nodes, a
process described in the Reconfiguration section below.

Raft nodes identify each other using TLS pinning, so in order to impersonate a
Raft node, an attacker needs to obtain the **private key** of its TLS
certificate. As a result, it is not possible to run a Raft node without a valid
TLS configuration.

A Raft cluster is configured in two planes:

  * **Local configuration**: Governs node specific aspects, such as TLS
  communication, replication behavior, and file storage.

  * **Channel configuration**: Defines the membership of the Raft cluster for the
  corresponding channel, as well as protocol specific parameters such as heartbeat
  frequency, leader timeouts, and more.

Recall that each channel runs its own instance of the Raft protocol. Thus, a
Raft node must be referenced in the configuration of each channel it belongs to
by adding its server and client TLS certificates (in `PEM` format) to the channel
config. This ensures that when other nodes receive a message from it, they can
securely confirm the identity of the node that sent the message.

The following section from `configtx.yaml` shows three Raft nodes (also called
“consenters”) in the channel:

```
       Consenters:
            - Host: raft0.example.com
              Port: 7050
              ClientTLSCert: path/to/ClientTLSCert0
              ServerTLSCert: path/to/ServerTLSCert0
            - Host: raft1.example.com
              Port: 7050
              ClientTLSCert: path/to/ClientTLSCert1
              ServerTLSCert: path/to/ServerTLSCert1
            - Host: raft2.example.com
              Port: 7050
              ClientTLSCert: path/to/ClientTLSCert2
              ServerTLSCert: path/to/ServerTLSCert2
```

Note: an orderer will be listed as a consenter in the system channel as well as
in any application channel it is joined to.

When the channel config block is created, the `configtxgen` tool reads the paths
to the TLS certificates, and replaces the paths with the corresponding bytes of
the certificates.

### Local configuration

The `orderer.yaml` has two configuration sections that are relevant for Raft
orderers:

**Cluster**, which determines the TLS communication configuration, and
**Consensus**, which determines where Write Ahead Logs and Snapshots are
stored.

**Cluster parameters:**

By default, the Raft cluster service runs on the same gRPC server as the client
facing server (which is used to send transactions or pull blocks), but it can be
configured to have a separate gRPC server with a separate port.

This is useful for cases where you want TLS certificates issued by the
organizational CAs, but used only by the cluster nodes to communicate among each
other, and TLS certificates issued by a public TLS CA for the client facing API.

  * `ClientCertificate`, `ClientPrivateKey`: The file path of the client TLS certificate
  and corresponding private key.
  * `ListenPort`: The port the cluster listens on.
  It must be the same as `consenters[i].Port` in the channel configuration.
  If blank, the port is the same as the orderer general port (`general.listenPort`).
  * `ListenAddress`: The address the cluster service is listening on.
  * `ServerCertificate`, `ServerPrivateKey`: The TLS server certificate key pair
  which is used when the cluster service is running on a separate gRPC server
  (different port).

Note: `ListenPort`, `ListenAddress`, `ServerCertificate`, `ServerPrivateKey` must
be either set together or unset together.
If they are unset, they are inherited from the general TLS section,
for example `general.tls.{privateKey, certificate}`.
When general TLS is disabled:
 - Use a different `ListenPort` than the orderer general port.
 - Properly configure TLS root CAs in the channel configuration.
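
For example, a dedicated cluster listener might be configured in `orderer.yaml`
along the lines of the following sketch (the paths, address, and port are
illustrative, not defaults):

```
General:
  Cluster:
    # TLS client key pair this node presents when dialing other consenters.
    ClientCertificate: /path/to/cluster/client.crt
    ClientPrivateKey: /path/to/cluster/client.key
    # Dedicated intra-cluster listener, separate from the client facing port.
    # This port is what must appear as this node's Port in the channel's
    # Consenters list.
    ListenAddress: 0.0.0.0
    ListenPort: 7051
    ServerCertificate: /path/to/cluster/server.crt
    ServerPrivateKey: /path/to/cluster/server.key
```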

There are also hidden configuration parameters for `general.cluster` which can be
used to further fine-tune the cluster communication or replication mechanisms:

  * `SendBufferSize`: Regulates the number of messages in the egress buffer.
  * `DialTimeout`, `RPCTimeout`: Specify the timeouts for creating connections and
  establishing streams, respectively.
  * `ReplicationBufferSize`: The maximum number of bytes that can be allocated
  for each in-memory buffer used for block replication from other cluster nodes.
  Each channel has its own memory buffer. Defaults to `20971520`, which is 20 MB.
  * `PullTimeout`: The maximum duration the ordering node will wait for a block
  to be received before it aborts. Defaults to five seconds.
  * `ReplicationRetryTimeout`: The maximum duration the ordering node will wait
  between two consecutive replication attempts. Defaults to five seconds.
  * `ReplicationBackgroundRefreshInterval`: The time between two consecutive
  attempts to replicate existing channels that this node was added to, or
  channels that this node failed to replicate in the past. Defaults to five
  minutes.
  * `TLSHandshakeTimeShift`: If the TLS certificates of the ordering nodes
  expire and are not replaced in time (see TLS certificate rotation below),
  communication between them cannot be established, and it will be impossible
  to send new transactions to the ordering service.
  To recover from such a scenario, it is possible to make TLS handshakes
  between ordering nodes consider the time to be shifted backwards by the
  amount configured in `TLSHandshakeTimeShift`.
  This setting only applies when a separate cluster listener is in use. If
  the cluster service is sharing the orderer's main gRPC server, then instead
  specify `TLSHandshakeTimeShift` in the `General.TLS` section.
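
A minimal sketch of how these knobs might be set under `General.Cluster` in
`orderer.yaml` (the values are either the documented defaults mentioned above or
purely illustrative):

```
General:
  Cluster:
    SendBufferSize: 10                        # illustrative value
    DialTimeout: 5s                           # illustrative value
    RPCTimeout: 7s                            # illustrative value
    ReplicationBufferSize: 20971520           # 20 MB, the documented default
    PullTimeout: 5s                           # documented default
    ReplicationRetryTimeout: 5s               # documented default
    ReplicationBackgroundRefreshInterval: 5m  # documented default
    TLSHandshakeTimeShift: 0s                 # no time shift
```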

**Consensus parameters:**

  * `WALDir`: The location at which Write Ahead Logs for `etcd/raft` are stored.
  Each channel will have its own subdirectory named after the channel ID.
  * `SnapDir`: The location at which snapshots for `etcd/raft` are stored.
  Each channel will have its own subdirectory named after the channel ID.

There are also two hidden configuration parameters that can each be set by adding
them to the `Consensus` section in the `orderer.yaml` (a combined sketch follows
the list below):

  * `EvictionSuspicion`: The cumulative period of time of channel eviction
  suspicion that triggers the node to pull blocks from other nodes and see if it
  has been evicted from the channel in order to confirm its suspicion. If the
  suspicion is confirmed (the inspected block doesn't contain the node's TLS
  certificate), the node halts its operation for that channel. A node suspects
  its channel eviction when it doesn't know about any elected leader and cannot
  itself be elected as leader in the channel. Defaults to 10 minutes.
  * `TickIntervalOverride`: If set, this value will be preferred over the tick
  interval configured in all channels where this ordering node is a consenter.
  This value should be set only with great care, as a mismatch in tick interval
  across orderers could result in a loss of quorum for one or more channels.
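
A sketch of a `Consensus` section combining the standard and hidden parameters
(the paths are illustrative; `EvictionSuspicion` is shown at its documented
default, and `TickIntervalOverride` is commented out because it should normally
be left unset):

```
Consensus:
  # Write Ahead Logs and snapshots; each channel gets its own subdirectory.
  WALDir: /var/hyperledger/production/orderer/etcdraft/wal
  SnapDir: /var/hyperledger/production/orderer/etcdraft/snapshot
  # Hidden parameters:
  EvictionSuspicion: 10m
  # TickIntervalOverride: 500ms   # overrides the channel-configured TickInterval; use with care
```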

### Channel configuration

Apart from the (already discussed) consenters, the Raft channel configuration has
an `Options` section which relates to protocol specific knobs. It is currently
not possible to change these values dynamically while a node is running. The
node has to be reconfigured and restarted.

The only exception is `SnapshotIntervalSize`, which can be adjusted at runtime.

Note: It is recommended to avoid changing the following values, as a misconfiguration
might lead to a state where a leader cannot be elected at all (i.e., if the
`TickInterval` and `ElectionTick` are extremely low). Situations where a leader
cannot be elected are impossible to resolve, as leaders are required to make
changes. Because of such dangers, we suggest not tuning these parameters for most
use cases.

  * `TickInterval`: The time interval between two `Node.Tick` invocations.
  * `ElectionTick`: The number of `Node.Tick` invocations that must pass between
  elections. That is, if a follower does not receive any message from the leader
  of the current term before `ElectionTick` has elapsed, it will become a candidate
  and start an election. `ElectionTick` must be greater than `HeartbeatTick`.
  * `HeartbeatTick`: The number of `Node.Tick` invocations that must pass between
  heartbeats. That is, a leader sends heartbeat messages to maintain its
  leadership every `HeartbeatTick` ticks.
  * `MaxInflightBlocks`: Limits the maximum number of in-flight append blocks during
  the optimistic replication phase.
  * `SnapshotIntervalSize`: Defines the number of bytes after which a snapshot is taken.
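
These knobs live in the `EtcdRaft` section of `configtx.yaml`, alongside the
consenters. The following sketch uses values in line with the sample
`configtx.yaml` shipped with Fabric (treat them as illustrative rather than a
recommendation):

```
    EtcdRaft:
        Consenters:
            # ... as shown in the Configuration section above ...
        Options:
            TickInterval: 500ms
            ElectionTick: 10
            HeartbeatTick: 1
            MaxInflightBlocks: 5
            SnapshotIntervalSize: 16 MB
```

With these values, a follower waits `ElectionTick * TickInterval` = 5 seconds
without hearing from the leader before starting an election, and the leader
sends a heartbeat every `HeartbeatTick * TickInterval` = 500 ms.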

## Reconfiguration

The Raft orderer supports dynamic (meaning, while the channel is being serviced)
addition and removal of nodes as long as only one node is added or removed at a
time. Note that your cluster must be operational and able to achieve consensus
before you attempt to reconfigure it. For instance, if you have three nodes, and
two nodes fail, you will not be able to reconfigure your cluster to remove those
nodes. Similarly, if you have one failed node in a channel with three nodes, you
should not attempt to rotate a certificate, as this would induce a second fault.
As a rule, you should never attempt any configuration change to the Raft
consenters, such as adding or removing a consenter or rotating a consenter's
certificate, unless all consenters are online and healthy.

If you do decide to change these parameters, it is recommended to only attempt
such a change during a maintenance cycle. Problems are most likely to occur when
a configuration change is attempted in clusters with only a few nodes while a node
is down. For example, if you have three nodes in your consenter set and one of them
is down, it means you have two out of three nodes alive. If you extend the cluster
to four nodes while in this state, you will have only two out of four nodes alive,
which is not a quorum. The fourth node won't be able to onboard because nodes can
only onboard to functioning clusters (unless the total size of the cluster is
one or two).

So by extending a cluster of three nodes to four nodes (while only two are
alive) you are effectively stuck until the original offline node is resurrected.

Adding a new node to a Raft cluster is done by:

  1. **Adding the TLS certificates** of the new node to the channel through a
  channel configuration update transaction. Note: the new node must be added to
  the system channel before being added to one or more application channels.
  2. **Fetching the latest config block** of the system channel from an orderer node
  that's part of the system channel.
  3. **Ensuring that the node that will be added is part of the system channel**
  by checking that the config block that was fetched includes the certificate of
  the soon-to-be-added node.
  4. **Starting the new Raft node** with the path to the config block in the
  `General.BootstrapFile` configuration parameter (see the sketch after this list).
  5. **Waiting for the Raft node to replicate the blocks** from existing nodes for
  all channels its certificates have been added to. After this step has been
  completed, the node begins servicing the channel.
  6. **Adding the endpoint** of the newly added Raft node to the channel
  configuration of all channels.
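
For step 4, the relevant `orderer.yaml` settings might look like the following
sketch (the block path is illustrative, and `BootstrapMethod: file` is assumed
here so that the node bootstraps from a block on disk):

```
General:
  # Bootstrap the new node from the latest config block of the system channel
  # (fetched in step 2) rather than from a genesis block.
  BootstrapMethod: file
  BootstrapFile: /path/to/system-channel-latest-config.block
```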

It is possible to add a node that is already running (and participates in some
channels already) to a channel while the node itself is running. To do this, simply
add the node’s certificate to the channel config of the channel. The node will
autonomously detect its addition to the new channel (detection happens within
five minutes by default; if you want the node to detect the new channel more
quickly, reboot the node), pull the channel blocks from an orderer in the
channel, and then start the Raft instance for that chain.

After it has successfully done so, the channel configuration can be updated to
include the endpoint of the new Raft orderer.

Removing a node from a Raft cluster is done by:

  1. Removing its endpoint from the channel config for all channels, including
  the system channel controlled by the orderer admins.
  2. Removing its entry (identified by its certificates) from the channel
  configuration for all channels. Again, this includes the system channel.
  3. Shutting down the node.

Removing a node from a specific channel, but keeping it servicing other channels,
is done by:

  1. Removing its endpoint from the channel config for the channel.
  2. Removing its entry (identified by its certificates) from the channel
  configuration.
  3. The second phase causes:
     * The remaining orderer nodes in the channel to cease communicating with
     the removed orderer node in the context of the removed channel. They might
     still be communicating on other channels.
     * The node that is removed from the channel to autonomously detect its
     removal, either immediately or after `EvictionSuspicion` time has passed
     (10 minutes by default), and shut down its Raft instance.

### TLS certificate rotation for an orderer node

All TLS certificates have an expiration date that is determined by the issuer.
These expiration dates can range from 10 years from the date of issuance to as
little as a few months, so check with your issuer. Before the expiration date,
you will need to rotate these certificates on the node itself and in every channel
the node is joined to, including the system channel.

For each channel the node participates in:

  1. Update the channel configuration with the new certificates.
  2. Replace its certificates in the file system of the node.
  3. Restart the node.

Because a node can only have a single TLS certificate key pair, the node will be
unable to service channels its new certificates have not yet been added to during
the update process, degrading its fault tolerance. Because of this,
**once the certificate rotation process has been started, it should be completed
as quickly as possible.**

If for some reason the rotation of the TLS certificates has started but cannot
be completed in all channels, it is advised to rotate the TLS certificates back to
what they were and attempt the rotation again later.

### Certificate expiration related authentication

Whenever a client with an identity that has an expiration date (such as an identity based on an x509 certificate)
sends a transaction to the orderer, the orderer checks whether the identity has expired, and if
so, rejects the transaction submission.

However, it is possible to configure the orderer to ignore expiration of identities by enabling
the `General.Authentication.NoExpirationChecks` configuration option in the `orderer.yaml`.

This should be done only under extreme circumstances, where the certificates of the administrators
have expired and, because of this, it is not possible to send configuration updates to replace the
administrator certificates with renewed ones: the config transactions signed by the existing
administrators are rejected because their certificates have expired.
After updating the channel, it is recommended to revert to the default configuration, which enforces
expiration checks on identities.
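
A sketch of the temporary override in `orderer.yaml` (remember to revert it once
the emergency config update has been applied):

```
General:
  Authentication:
    # Temporarily accept expired identities so that administrators with expired
    # certificates can still submit the config update that rotates them.
    NoExpirationChecks: true
```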

## Metrics

For a description of the Operations Service and how to set it up, check out
[our documentation on the Operations Service](operations_service.html).

For a list of the metrics that are gathered by the Operations Service, check out
our [reference material on metrics](metrics_reference.html).
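
If the orderer does not already expose metrics, a minimal sketch of the relevant
`orderer.yaml` settings might look like this (the listen address shown is only
an example):

```
Operations:
  # The operations endpoint serves /healthz, /logspec, and (with the
  # prometheus provider) /metrics.
  ListenAddress: 127.0.0.1:8443
Metrics:
  Provider: prometheus
```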

While the metrics you prioritize will have a lot to do with your particular use
case and configuration, there are a few metrics in particular you might want to
monitor:

* `consensus_etcdraft_is_leader`: identifies which node in the cluster is
   currently the leader. If no node has this set, you have lost quorum.
* `consensus_etcdraft_data_persist_duration`: indicates how long write operations
   to the Raft cluster's persistent write ahead log take. For protocol safety,
   messages must be persisted durably, calling `fsync` where appropriate, before
   they can be shared with the consenter set. If this value begins to climb, this
   node may not be able to participate in consensus (which could lead to a
   service interruption for this node and possibly the network).
* `consensus_etcdraft_cluster_size` and `consensus_etcdraft_active_nodes`: these
   channel metrics help track the "active" nodes (which, as it sounds, are the nodes that
   are currently contributing to the cluster, as compared to the total number of
   nodes in the cluster). If the number of active nodes falls below a majority of
   the nodes in the cluster, quorum will be lost and the ordering service will
   stop processing blocks on the channel.

## Troubleshooting

* The more stress you put on your nodes, the more you might have to change certain
parameters. As with any system, computer or mechanical, stress can lead to a drag
in performance. As we noted in the conceptual documentation, leader elections in
Raft are triggered when follower nodes do not receive either a "heartbeat"
message or an "append" message that carries data from the leader for a certain
amount of time. Because Raft nodes share the same communication layer across
channels (this does not mean they share data --- they do not!), if a Raft node is
part of the consenter set in many channels, you might want to lengthen the amount
of time it takes to trigger an election to avoid inadvertent leader elections, as
in the sketch below.
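
For example, with the sample values shown earlier (`TickInterval: 500ms`,
`ElectionTick: 10`), a follower waits roughly 5 seconds without leader contact
before starting an election. One way to lengthen that window, shown here purely
as an illustration, is to raise `ElectionTick` in the channel's `EtcdRaft`
options:

```
        Options:
            TickInterval: 500ms
            ElectionTick: 20      # ~10 seconds without leader contact before an election
            HeartbeatTick: 1
            MaxInflightBlocks: 5
            SnapshotIntervalSize: 16 MB
```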

<!--- Licensed under Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/) -->