---
order: 3
---

# Proposer-Based Timestamps Runbook

Version v0.36 of Tendermint added new constraints for the timestamps included in
each block created by Tendermint. The new constraints mean that validators may
fail to produce valid blocks or may issue `nil` `prevotes` for proposed blocks
depending on the configuration of the validator's local clock.

## What is this document for?

This document provides a set of actionable steps for application developers and
node operators to diagnose and fix issues related to clock synchronization and
configuration of the Proposer-Based Timestamps [SynchronyParams](https://github.com/tendermint/tendermint/blob/master/spec/core/data_structures.md#synchronyparams).

Use this runbook if you observe that validators are frequently voting `nil` for a block that the rest
of the network votes for or if validators are frequently producing block proposals
that are not voted for by the rest of the network.

## Requirements

To use this runbook, you must be running a node that has the [Prometheus metrics endpoint enabled](https://github.com/tendermint/tendermint/blob/master/docs/nodes/metrics.md)
and the Tendermint RPC endpoint enabled and accessible.

It is strongly recommended to also run a Prometheus metrics collector to gather and
analyze metrics from the Tendermint node.

## Debugging a Single Node

If you observe that a single validator is frequently failing to produce blocks or
voting nil for proposals that other validators vote for and suspect it may be
related to clock synchronization, use the following steps to debug and correct the issue.

### Check Timely Metric

Tendermint exposes a histogram metric for the difference between the timestamp in the proposal
and the time read from the node's local clock when the proposal is received.

The histogram exposes multiple metrics on the Prometheus `/metrics` endpoint, called:
* `tendermint_consensus_proposal_timestamp_difference_bucket`
* `tendermint_consensus_proposal_timestamp_difference_sum`
* `tendermint_consensus_proposal_timestamp_difference_count`

Each metric is also labeled with the key `is_timely`, which can have a value of
`true` or `false`.

#### From the Prometheus Collector UI

If you are running a Prometheus collector, navigate to the query web interface and select the 'Graph' tab.

Issue a query for the following:

```
tendermint_consensus_proposal_timestamp_difference_count{is_timely="false"}
  / ignoring(is_timely)
tendermint_consensus_proposal_timestamp_difference_count{is_timely="true"}
```

This query will graph the ratio of proposals the node considered untimely to those it
considered timely. The `ignoring(is_timely)` modifier lets the two series match even
though their `is_timely` labels differ. If the ratio is increasing, it means that your node is consistently
seeing more proposals that are far from its local clock. If this is the case, you should
check to make sure your local clock is properly synchronized to NTP.

#### From the `/metrics` URL

If you are not running a Prometheus collector, navigate to the `/metrics` endpoint
exposed on the Prometheus metrics port with `curl` or a browser.

Search for the `tendermint_consensus_proposal_timestamp_difference_count` metric.
This metric is labeled with `is_timely`. Compare the value of
`tendermint_consensus_proposal_timestamp_difference_count` where `is_timely="false"`
with the value where `is_timely="true"`. Refresh the endpoint and observe whether the value for `is_timely="false"`
is growing.
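
As a rough illustration of this check, the following Python sketch sums the `_count` samples by `is_timely` from two `/metrics` dumps and reports how much the untimely count grew between them. The sample text, including the `chain_id` label and all numbers, is made up for illustration, and the label set your node emits may differ. The helper names `timely_counts` and `untimely_growth` are hypothetical.

```python
import re

METRIC = "tendermint_consensus_proposal_timestamp_difference_count"
# Matches one sample line: metric name, label set in braces, then the value.
LINE_RE = re.compile(
    r"^" + re.escape(METRIC) + r"\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)"
)
IS_TIMELY_RE = re.compile(r'is_timely="(true|false)"')

# Made-up /metrics excerpts taken at two points in time (illustrative only).
SAMPLE_BEFORE = """\
# TYPE tendermint_consensus_proposal_timestamp_difference histogram
tendermint_consensus_proposal_timestamp_difference_count{chain_id="test",is_timely="false"} 2
tendermint_consensus_proposal_timestamp_difference_count{chain_id="test",is_timely="true"} 40
"""
SAMPLE_AFTER = """\
# TYPE tendermint_consensus_proposal_timestamp_difference histogram
tendermint_consensus_proposal_timestamp_difference_count{chain_id="test",is_timely="false"} 5
tendermint_consensus_proposal_timestamp_difference_count{chain_id="test",is_timely="true"} 41
"""


def timely_counts(metrics_text):
    """Sum the _count samples per is_timely value in a /metrics text dump."""
    counts = {"true": 0.0, "false": 0.0}
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # comment line or a different metric
        label = IS_TIMELY_RE.search(match.group("labels"))
        if label is not None:
            counts[label.group(1)] += float(match.group("value"))
    return counts


def untimely_growth(before, after):
    """Number of additional untimely proposals seen between two dumps."""
    return after["false"] - before["false"]


if __name__ == "__main__":
    before = timely_counts(SAMPLE_BEFORE)
    after = timely_counts(SAMPLE_AFTER)
    print("untimely proposals since last check:", untimely_growth(before, after))
```

In practice, the two dumps would come from two successive requests to the node's `/metrics` endpoint rather than from hard-coded strings.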

If you observe that `is_timely="false"` is growing, it means that your node is consistently
seeing proposals that are far from its local clock. If this is the case, you should check
to make sure your local clock is properly synchronized to NTP.

### Checking Clock Sync

NTP configuration and tooling is very specific to the operating system and distribution
that your validator node is running. This guide assumes you have `timedatectl` installed with
[chrony](https://chrony.tuxfamily.org/), a popular tool for interacting with time
synchronization on Linux distributions. If you are using an operating system or
distribution with a different time synchronization mechanism, please consult the
documentation for your operating system to check the status and re-synchronize the daemon.

#### Check if NTP is Enabled

```shell
$ timedatectl
```

From the output, ensure that `NTP service` is `active`. If `NTP service` is `inactive`, run:

```shell
$ timedatectl set-ntp true
```

Re-run the `timedatectl` command and verify that the change has taken effect.

#### Check if Your NTP Daemon is Synchronized

Check the status of your local `chrony` NTP daemon by running the following:

```shell
$ chronyc tracking
```

If the `chrony` daemon is running, you will see output that indicates its current status.
If the `chrony` daemon is not running, restart it and re-run `chronyc tracking`.

The `System time` field of the response should show a value that is much smaller than 100
milliseconds.

If the value is very large, restart the `chronyd` daemon.

## Debugging a Network

If you observe that a network is frequently failing to produce blocks and suspect
it may be related to clock synchronization, use the following steps to debug and correct the issue.

### Check Prevote Message Delay

Tendermint exposes metrics that help determine how synchronized the clocks on a network are.

These metrics are visible on the Prometheus `/metrics` endpoint and are called:
* `tendermint_consensus_quorum_prevote_delay`
* `tendermint_consensus_full_prevote_delay`

These metrics calculate the difference between the timestamp in the proposal message and
the timestamp of a prevote that was issued during consensus.

The `tendermint_consensus_quorum_prevote_delay` metric is the interval in seconds
between the proposal timestamp and the timestamp of the earliest prevote that
achieved a quorum during the prevote step.

The `tendermint_consensus_full_prevote_delay` metric is the interval in seconds
between the proposal timestamp and the timestamp of the latest prevote in a round
where 100% of the validators voted.

#### From the Prometheus Collector UI

If you are running a Prometheus collector, navigate to the query web interface and select the 'Graph' tab.

Issue a query for the following:

```
sum(tendermint_consensus_quorum_prevote_delay) by (proposer_address)
```

This query will graph the difference in seconds for each proposer on the network.

If the value is much larger for some proposers, then the issue is likely related to the clock
synchronization of their nodes. Contact those proposers and ensure that their nodes
are properly connected to NTP using the steps for [Debugging a Single Node](#debugging-a-single-node).

If the value is relatively similar for all proposers, you should next compare this
value to the `SynchronyParams` values for the network. Continue to the [Checking
Synchrony](#checking-synchrony) steps.

#### From the `/metrics` URL

If you are not running a Prometheus collector, navigate to the `/metrics` endpoint
exposed on the Prometheus metrics port.

Search for the `tendermint_consensus_quorum_prevote_delay` metric. There will be one
entry of this metric for each `proposer_address`. If the value of this metric is
much larger for some proposers, then the issue is likely related to synchronization of their
nodes with NTP. Contact those proposers and ensure that their nodes are properly connected
to NTP using the steps for [Debugging a Single Node](#debugging-a-single-node).

If the values are relatively similar for all proposers, you should next compare
them to the `SynchronyParams` for the network. Continue
to the [Checking Synchrony](#checking-synchrony) steps.

### Checking Synchrony

To determine the currently configured `SynchronyParams` for your network, issue a
request to your node's RPC endpoint. For a node running locally with the RPC server
exposed on port `26657`, run the following command:

```shell
$ curl localhost:26657/consensus_params
```

The JSON output will contain a field named `synchrony`, with the following structure:

```json
{
  "precision": "500000000",
  "message_delay": "3000000000"
}
```

The `precision` and `message_delay` values returned are listed in nanoseconds.
In the example above, the precision is 500ms and the message delay is 3s.
Remember, `tendermint_consensus_quorum_prevote_delay` is listed in seconds.
If the `tendermint_consensus_quorum_prevote_delay` value approaches the sum of `precision` and `message_delay`,
then the values selected for these parameters are too small. Your application will
need to be modified to update the `SynchronyParams` to have larger values.

### Updating SynchronyParams

The `SynchronyParams` are `ConsensusParameters`, which means they are set and updated
by the application running alongside Tendermint.
Updates to these parameters must
be returned by the application in its response to the `FinalizeBlock` ABCI method call.

If the application was built using the CosmosSDK, then these parameters can be updated
programmatically using a governance proposal. For more information, see the [CosmosSDK
documentation](https://hub.cosmos.network/main/governance/submitting.html#sending-the-transaction-that-submits-your-governance-proposal).

If the application does not implement a way to update the consensus parameters
programmatically, then the application itself must be updated to do so. More information on updating
the consensus parameters via ABCI can be found in the [FinalizeBlock documentation](https://github.com/tendermint/tendermint/blob/master/spec/abci++/abci++_methods_002_draft.md#finalizeblock).
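
Before changing the parameters, it can help to redo the comparison from the Checking Synchrony steps with the units made explicit. The Python sketch below converts the nanosecond strings from the `synchrony` object to seconds and checks an observed `tendermint_consensus_quorum_prevote_delay` against their sum. The function names and the 0.8 threshold used to decide what "approaches" means are assumptions for illustration, not values defined by Tendermint.

```python
import json

NANOS_PER_SECOND = 1_000_000_000

# The `synchrony` object from the example above; values are nanosecond strings.
EXAMPLE_SYNCHRONY = json.loads(
    '{"precision": "500000000", "message_delay": "3000000000"}'
)


def synchrony_bound_seconds(synchrony):
    """Return precision + message_delay, converted from nanoseconds to seconds."""
    precision_ns = int(synchrony["precision"])
    message_delay_ns = int(synchrony["message_delay"])
    return (precision_ns + message_delay_ns) / NANOS_PER_SECOND


def approaches_bound(quorum_prevote_delay_s, synchrony, threshold=0.8):
    """True if the observed delay (seconds) reaches `threshold` of the bound.

    The default threshold of 0.8 is a hypothetical cut-off chosen for this
    sketch; pick one that suits your network.
    """
    return quorum_prevote_delay_s >= threshold * synchrony_bound_seconds(synchrony)


if __name__ == "__main__":
    bound = synchrony_bound_seconds(EXAMPLE_SYNCHRONY)
    print("precision + message_delay (s):", bound)
    for observed in (0.4, 3.1):
        print(observed, "approaches bound:", approaches_bound(observed, EXAMPLE_SYNCHRONY))
```

With the example parameters the bound is 3.5 seconds, so a quorum prevote delay of a few hundred milliseconds leaves ample margin, while a delay above roughly 3 seconds suggests the parameters should be raised.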