---
order: 3
---

# Proposer-Based Timestamps Runbook

Version v0.36 of Tendermint added new constraints for the timestamps included in
each block created by Tendermint. The new constraints mean that validators may
fail to produce valid blocks or may issue `nil` prevotes for proposed blocks
depending on the configuration of the validator's local clock.

## What is this document for?

This document provides a set of actionable steps for application developers and
node operators to diagnose and fix issues related to clock synchronization and
configuration of the Proposer-Based Timestamps [SynchronyParams](https://github.com/tendermint/tendermint/blob/master/spec/core/data_structures.md#synchronyparams).

Use this runbook if you observe that validators are frequently voting `nil` for a block that the rest
of the network votes for or if validators are frequently producing block proposals
that are not voted for by the rest of the network.

## Requirements

To use this runbook, you must be running a node that has the [Prometheus metrics endpoint enabled](https://github.com/tendermint/tendermint/blob/master/docs/nodes/metrics.md)
and the Tendermint RPC endpoint enabled and accessible.

It is strongly recommended to also run a Prometheus metrics collector to gather and
analyze metrics from the Tendermint node.

## Debugging a Single Node

If you observe that a single validator is frequently failing to produce blocks or
voting `nil` for proposals that other validators vote for, and you suspect it may be
related to clock synchronization, use the following steps to debug and correct the issue.

### Check Timely Metric

Tendermint exposes a histogram metric for the difference between the timestamp in the proposal
and the time read from the node's local clock when the proposal is received.

The histogram exposes multiple metrics on the Prometheus `/metrics` endpoint:

* `tendermint_consensus_proposal_timestamp_difference_bucket`
* `tendermint_consensus_proposal_timestamp_difference_sum`
* `tendermint_consensus_proposal_timestamp_difference_count`

Each metric is also labeled with the key `is_timely`, which can have a value of
`true` or `false`.

#### From the Prometheus Collector UI

If you are running a Prometheus collector, navigate to the query web interface and select the 'Graph' tab.

Issue a query for the following:

```
tendermint_consensus_proposal_timestamp_difference_count{is_timely="false"} /
tendermint_consensus_proposal_timestamp_difference_count{is_timely="true"}
```

This query graphs the ratio of proposals the node considered untimely to those it
considered timely. If the ratio is increasing, your node is consistently
seeing more proposals whose timestamps are far from its local clock. If this is the case,
check that your local clock is properly synchronized to NTP.

#### From the `/metrics` URL

If you are not running a Prometheus collector, navigate to the `/metrics` endpoint
exposed on the Prometheus metrics port with `curl` or a browser.

Search for the `tendermint_consensus_proposal_timestamp_difference_count` metric.
This metric is labeled with `is_timely`. Compare the value of
`tendermint_consensus_proposal_timestamp_difference_count` where `is_timely="false"`
with the value where `is_timely="true"`. Refresh the endpoint and observe whether the
`is_timely="false"` count is growing.

If the `is_timely="false"` count is growing, your node is consistently
seeing proposals whose timestamps are far from its local clock. If this is the case,
check that your local clock is properly synchronized to NTP.
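
The manual comparison above can also be scripted. The following is a minimal sketch and not part of Tendermint: the function name `untimely_ratio` is hypothetical, and it assumes that each sample line on `/metrics` ends with the sample value (no trailing timestamp).

```shell
# untimely_ratio (hypothetical helper): read Prometheus text-format metrics on
# stdin and print the untimely count, the timely count, and their ratio.
untimely_ratio() {
  awk '
    /^tendermint_consensus_proposal_timestamp_difference_count[{]/ {
      # Assumption: the sample value is the last field on the line.
      if ($0 ~ /is_timely="false"/) untimely += $NF
      if ($0 ~ /is_timely="true"/)  timely   += $NF
    }
    END {
      if (timely == 0) { printf "untimely=%d timely=0 ratio=n/a\n", untimely; exit }
      printf "untimely=%d timely=%d ratio=%.4f\n", untimely, timely, untimely / timely
    }'
}
```

For example, `curl -s localhost:26660/metrics | untimely_ratio` prints one summary line, assuming the Prometheus endpoint listens on `localhost:26660` (adjust the host and port for your node). Run it a few times and watch whether the ratio keeps rising.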

### Checking Clock Sync

NTP configuration and tooling are very specific to the operating system and distribution
that your validator node is running. This guide assumes you have `timedatectl` installed with
[chrony](https://chrony.tuxfamily.org/), a popular tool for interacting with time
synchronization on Linux distributions. If you are using an operating system or
distribution with a different time synchronization mechanism, please consult the
documentation for your operating system to check the status and re-synchronize the daemon.

#### Check if NTP is Enabled

```shell
$ timedatectl
```

From the output, ensure that `NTP service` is `active`. If `NTP service` is `inactive`, run:

```shell
$ timedatectl set-ntp true
```

Re-run the `timedatectl` command and verify that the change has taken effect.

#### Check if Your NTP Daemon is Synchronized

Check the status of your local `chrony` NTP daemon by running the following:

```shell
$ chronyc tracking
```

If the `chrony` daemon is running, you will see output that indicates its current status.
If the `chrony` daemon is not running, restart it and re-run `chronyc tracking`.

The `System time` field of the response should show a value that is much smaller than 100
milliseconds.

If the value is very large, restart the `chronyd` daemon.
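
The `System time` check can be automated along the following lines. This is a sketch under stated assumptions: the function name `check_system_time` is hypothetical, the offset is assumed to be the fourth whitespace-separated field of the `System time` line, and the default limit of 0.1 seconds is only the upper bound mentioned above (a healthy node should be far below it).

```shell
# check_system_time (hypothetical helper): read `chronyc tracking` output on
# stdin and report whether the System time offset is below a limit in seconds.
# Exit status: 0 = within limit, 1 = offset too large, 2 = field not found.
check_system_time() {
  limit="${1:-0.1}"
  awk -v limit="$limit" '
    # Assumed line shape: "System time     : 0.000012345 seconds fast of NTP time"
    /^System time/ { found = 1; offset = $4 }
    END {
      if (!found) { print "no System time field found"; exit 2 }
      if (offset + 0 < limit + 0) { print "ok offset=" offset "s"; exit 0 }
      print "drift offset=" offset "s"; exit 1
    }'
}
```

Usage: `chronyc tracking | check_system_time`, or pass a stricter limit such as `check_system_time 0.01`.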

## Debugging a Network

If you observe that a network is frequently failing to produce blocks and suspect
it may be related to clock synchronization, use the following steps to debug and correct the issue.

### Check Prevote Message Delay

Tendermint exposes metrics that help determine how synchronized the clocks on a network are.

These metrics are visible on the Prometheus `/metrics` endpoint and are called:

* `tendermint_consensus_quorum_prevote_delay`
* `tendermint_consensus_full_prevote_delay`

These metrics measure the difference between the timestamp in the proposal message and
the timestamp of a prevote that was issued during consensus.

The `tendermint_consensus_quorum_prevote_delay` metric is the interval in seconds
between the proposal timestamp and the timestamp of the earliest prevote that
achieved a quorum during the prevote step.

The `tendermint_consensus_full_prevote_delay` metric is the interval in seconds
between the proposal timestamp and the timestamp of the latest prevote in a round
where 100% of the validators voted.
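
To make the two definitions concrete, the sketch below computes both delays from a list of prevotes. It is an illustration only, not Tendermint's implementation: the function name `prevote_delays` and the input format (one line per prevote, holding the seconds elapsed since the proposal timestamp and the voting power of the validator) are hypothetical. It also assumes that a quorum means strictly more than two-thirds of the total voting power and that every validator voted.

```shell
# prevote_delays (hypothetical helper): stdin has one line per prevote in the
# form "<seconds_after_proposal> <voting_power>". Prints both delays.
prevote_delays() {
  LC_ALL=C sort -n | awk '
    { t[NR] = $1; power[NR] = $2; total += $2 }
    END {
      for (i = 1; i <= NR; i++) {
        seen += power[i]
        # Assumption: quorum is strictly more than 2/3 of total voting power.
        if (3 * seen > 2 * total) { quorum = t[i]; break }
      }
      # The full delay is the latest prevote, since all validators voted.
      printf "quorum_prevote_delay=%s full_prevote_delay=%s\n", quorum, t[NR]
    }'
}
```

With four equally weighted validators whose prevotes are timestamped 0.2, 0.3, 0.4, and 1.5 seconds after the proposal, the quorum delay is 0.4 seconds (the third prevote crosses two-thirds) while the full delay is 1.5 seconds. A single slow or poorly synchronized validator therefore shows up in the full delay but not in the quorum delay.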

#### From the Prometheus Collector UI

If you are running a Prometheus collector, navigate to the query web interface and select the 'Graph' tab.

Issue a query for the following:

```
sum(tendermint_consensus_quorum_prevote_delay) by (proposer_address)
```

This query graphs the quorum prevote delay in seconds for each proposer on the network.

If the value is much larger for some proposers, then the issue is likely related to the clock
synchronization of their nodes. Contact those proposers and ensure that their nodes
are properly connected to NTP using the steps for [Debugging a Single Node](#debugging-a-single-node).

If the value is relatively similar for all proposers, compare this
value to the `SynchronyParams` values for the network. Continue to the
[Checking Synchrony](#checking-synchrony) steps.

#### From the `/metrics` URL

If you are not running a Prometheus collector, navigate to the `/metrics` endpoint
exposed on the Prometheus metrics port.

Search for the `tendermint_consensus_quorum_prevote_delay` metric. There will be one
entry of this metric for each `proposer_address`. If the value of this metric is
much larger for some proposers, then the issue is likely related to synchronization of their
nodes with NTP. Contact those proposers and ensure that their nodes are properly connected
to NTP using the steps for [Debugging a Single Node](#debugging-a-single-node).

If the values are relatively similar for all proposers, compare them
to the `SynchronyParams` for the network. Continue
to the [Checking Synchrony](#checking-synchrony) steps.
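
When there are many proposers, scanning the raw `/metrics` output by eye is error-prone. The following sketch lists the delay per proposer, largest first. The function name `prevote_delay_by_proposer` is hypothetical, and the code assumes that each sample line ends with the sample value.

```shell
# prevote_delay_by_proposer (hypothetical helper): read Prometheus text-format
# metrics on stdin and print "<delay_seconds> <proposer_address>" per proposer,
# sorted with the largest delay first.
prevote_delay_by_proposer() {
  awk '
    /^tendermint_consensus_quorum_prevote_delay[{]/ {
      if (match($0, /proposer_address="[^"]*"/)) {
        # Skip the 18-character prefix proposer_address=" and the closing quote.
        addr = substr($0, RSTART + 18, RLENGTH - 19)
        # Assumption: the sample value is the last field on the line.
        print $NF, addr
      }
    }' | LC_ALL=C sort -rn
}
```

Usage: `curl -s localhost:26660/metrics | prevote_delay_by_proposer`, assuming the Prometheus endpoint listens on `localhost:26660`. Proposers that stand out at the top of the list are the ones to contact first.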

### Checking Synchrony

To determine the currently configured `SynchronyParams` for your network, issue a
request to your node's RPC endpoint. For a node running locally with the RPC server
exposed on port `26657`, run the following command:

```shell
$ curl localhost:26657/consensus_params
```

The JSON output will contain a field named `synchrony`, with the following structure:

```json
{
  "precision": "500000000",
  "message_delay": "3000000000"
}
```

The `precision` and `message_delay` values are returned in nanoseconds.
In the example above, the precision is 500ms and the message delay is 3s.
Remember that `tendermint_consensus_quorum_prevote_delay` is reported in seconds.
If the `tendermint_consensus_quorum_prevote_delay` value approaches the sum of `precision` and `message_delay`
(3.5 seconds in this example), then the values selected for these parameters are too small. Your application will
need to be modified to update the `SynchronyParams` to have larger values.
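
The unit conversion is easy to get wrong by hand, so the comparison can be sketched as a small helper. The function name `synchrony_headroom` is hypothetical. It takes `precision` and `message_delay` in nanoseconds, as returned by the RPC endpoint, and an observed `tendermint_consensus_quorum_prevote_delay` value in seconds.

```shell
# synchrony_headroom (hypothetical helper)
# Usage: synchrony_headroom PRECISION_NS MESSAGE_DELAY_NS PREVOTE_DELAY_SECONDS
# Prints the bound (precision + message_delay) in seconds, the observed delay,
# and the share of the bound that the delay consumes.
synchrony_headroom() {
  awk -v p="$1" -v d="$2" -v q="$3" 'BEGIN {
    bound = (p + d) / 1000000000   # nanoseconds to seconds
    printf "bound=%.3fs delay=%.3fs used=%.0f%%\n", bound, q, 100 * q / bound
  }'
}
```

For the example values above and an observed delay of 2.8 seconds, `synchrony_headroom 500000000 3000000000 2.8` reports a bound of 3.5 seconds with 80% of it used. How close to 100% is too close is a judgment call for your network.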

### Updating SynchronyParams

The `SynchronyParams` are `ConsensusParams`, which means they are set and updated
by the application running alongside Tendermint. Updates to these parameters must
be returned by the application in its response to the `FinalizeBlock` ABCI method call.

If the application was built using the Cosmos SDK, then these parameters can be updated
programmatically using a governance proposal. For more information, see the [Cosmos SDK
documentation](https://hub.cosmos.network/main/governance/submitting.html#sending-the-transaction-that-submits-your-governance-proposal).

If the application does not implement a way to update the consensus parameters
programmatically, then the application itself must be updated to do so. More information on updating
the consensus parameters via ABCI can be found in the [FinalizeBlock documentation](https://github.com/tendermint/tendermint/blob/master/spec/abci++/abci++_methods_002_draft.md#finalizeblock).