---
order: 1
parent:
  title: Tendermint Core QA Results v0.37.x
  description: This is a report on the results obtained when running TM v0.37.x on testnets
  order: 4
---

# Tendermint Core QA Results v0.37.x

## Issues discovered

During this iteration of the QA process, the following issues were found:

* (critical, fixed) [\#9533] - This bug caused full nodes to sometimes get stuck
  when blocksyncing, requiring a manual restart to unblock them. Importantly,
  this bug was also present in v0.34.x, and the fix was backported in
  [\#9534].
* (critical, fixed) [\#9539] - `loadtime` is very likely to include more than
  one "=" character in transactions, which is rejected by the e2e application.
* (critical, fixed) [\#9581] - An absent Prometheus label makes CometBFT crash
  when Prometheus metric collection is enabled.
* (non-critical, not fixed) [\#9548] - Full nodes can go over 50 connected
  peers, which is not intended by the default configuration.
* (non-critical, not fixed) [\#9537] - With the default mempool cache setting,
  duplicated transactions are not rejected when gossiped and eventually flood
  all mempools. The 200 node testnets were thus run with a cache size of 200000
  (as opposed to the default 10000); see the configuration sketch after this list.

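A minimal sketch of the corresponding `config.toml` section is shown below. The `[mempool]` section and its `cache_size` key follow the standard configuration layout; only the value `200000` reflects what was actually used in the 200 node testnets.

```toml
[mempool]
# Size of the cache of recently seen transactions, in number of transactions.
# The default is 10000; the 200 node testnets used 200000 so that duplicated
# transactions gossiped by peers are still recognized and rejected.
cache_size = 200000
```
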
## 200 Node Testnet

### Finding the Saturation Point

The first goal is to identify the saturation point and compare it with the baseline (v0.34.x).
For further details, see [this paragraph](CometBFT-QA-34.md#finding-the-saturation-point)
in the baseline version.

The following table summarizes the results for v0.37.x, for the different experiments
(extracted from file [`v037_report_tabbed.txt`](img37/200nodes_tm037/v037_report_tabbed.txt)).

The X axis of this table is `c`, the number of connections created by the load runner process to the target node.
The Y axis of this table is `r`, the rate, or number of transactions issued per second on each connection.

|        |  c=1  |  c=2  |  c=4  |
| :---   | ----: | ----: | ----: |
| r=25   |  2225 | 4450  | 8900  |
| r=50   |  4450 | 8900  | 17800 |
| r=100  |  8900 | 17800 | 35600 |
| r=200  | 17800 | 35600 | 38660 |

For comparison, this is the table for the baseline version.

|        |  c=1  |  c=2  |  c=4  |
| :---   | ----: | ----: | ----: |
| r=25   |  2225 | 4450  | 8900  |
| r=50   |  4450 | 8900  | 17800 |
| r=100  |  8900 | 17800 | 35400 |
| r=200  | 17800 | 35600 | 37358 |

The saturation point is beyond the diagonal:

* `r=200,c=2`
* `r=100,c=4`

which is at the same place as the baseline. For more details on the saturation point, see
[this paragraph](CometBFT-QA-34.md#finding-the-saturation-point) in the baseline version.
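
As a sanity check on the table, the sketch below (ours, not part of the QA tooling) compares each cell against the transaction count expected without saturation, assuming load is injected for roughly 89 seconds per experiment, as the unsaturated cells suggest (e.g., 25 tx/s × 1 connection × 89 s = 2225). The helper name and duration constant are assumptions.

```go
package main

import "fmt"

// expectedTxs returns the number of transactions an experiment should process
// if the network is not saturated: rate per connection * connections * duration.
func expectedTxs(rate, connections, durationSec int) int {
	return rate * connections * durationSec
}

func main() {
	const durationSec = 89 // assumption inferred from the unsaturated table cells
	rates := []int{25, 50, 100, 200}
	conns := []int{1, 2, 4}
	// Observed values copied from the v0.37.x table above.
	observed := map[[2]int]int{
		{25, 1}: 2225, {25, 2}: 4450, {25, 4}: 8900,
		{50, 1}: 4450, {50, 2}: 8900, {50, 4}: 17800,
		{100, 1}: 8900, {100, 2}: 17800, {100, 4}: 35600,
		{200, 1}: 17800, {200, 2}: 35600, {200, 4}: 38660,
	}
	for _, r := range rates {
		for _, c := range conns {
			exp := expectedTxs(r, c, durationSec)
			obs := observed[[2]int{r, c}]
			// A cell is saturated when fewer transactions were processed than sent.
			fmt.Printf("r=%d,c=%d: expected %5d, observed %5d, saturated=%v\n",
				r, c, exp, obs, obs < exp)
		}
	}
}
```

Running it flags only `r=200,c=4` as saturated, consistent with the statement that the saturation point lies beyond the diagonal.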

The experiment chosen to examine Prometheus metrics is the same as in the baseline:
**`r=200,c=2`**.

The load runner's CPU load was negligible (near 0) when running `r=200,c=2`.

### Examining latencies

The method described [here](method.md) allows us to plot the latencies of transactions
for all experiments.

![all-latencies](img37/200nodes_tm037/v037_200node_latencies.png)

The data seen in the plot is similar to that of the baseline.

![all-latencies-bl](img34/v034_200node_latencies.png)

Therefore, for further details on these plots,
see [this paragraph](CometBFT-QA-34.md#examining-latencies) in the baseline version.

The following plot summarizes average latencies versus overall throughputs
across different numbers of WebSocket connections to the node into which
transactions are being loaded.

![latency-vs-throughput](img37/200nodes_tm037/v037_latency_throughput.png)

This is similar to the baseline plot:

![latency-vs-throughput-bl](img34/v034_latency_throughput.png)

### Prometheus Metrics on the Chosen Experiment

As mentioned [above](#finding-the-saturation-point), the chosen experiment is `r=200,c=2`.
This section further examines key metrics for this experiment extracted from Prometheus data.

#### Mempool Size

The mempool size, a count of the number of transactions in the mempool, was shown to be stable and homogeneous
at all full nodes. It did not exhibit any unconstrained growth.
The plot below shows the evolution over time of the cumulative number of transactions in all full nodes' mempools
at any given time.

![mempool-cumulative](img37/200nodes_tm037/v037_r200c2_mempool_size.png)

The plot below shows the evolution of the average over all full nodes, which oscillates between 1500 and 2000 outstanding transactions.

![mempool-avg](img37/200nodes_tm037/v037_r200c2_mempool_size_avg.png)

The peaks observed coincide with the moments when some nodes reached round 1 of consensus (see below).
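
The average shown above can also be pulled directly from the Prometheus data collected during the experiment. The sketch below is one way to do it with the Prometheus Go client; the server address, the time window, and the `tendermint_mempool_size` metric name (i.e., the default `tendermint` namespace) are assumptions about the concrete setup, not part of the report.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// Address of the Prometheus server that scraped the testnet (assumption).
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Average mempool size over all full nodes; the metric name assumes the
	// default "tendermint" namespace and may differ in other setups.
	query := `avg(tendermint_mempool_size)`
	// Time window of the r=200,c=2 experiment (placeholder values).
	window := v1.Range{
		Start: time.Now().Add(-10 * time.Minute),
		End:   time.Now(),
		Step:  10 * time.Second,
	}

	result, warnings, err := promAPI.QueryRange(ctx, query, window)
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}

	// QueryRange returns a matrix: one sample stream per series in the result.
	matrix, ok := result.(model.Matrix)
	if !ok {
		panic("unexpected result type")
	}
	for _, stream := range matrix {
		for _, sample := range stream.Values {
			fmt.Printf("%s avg mempool size: %.0f\n",
				sample.Timestamp.Time().Format(time.RFC3339), float64(sample.Value))
		}
	}
}
```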

**These plots yield similar results to the baseline**:

![mempool-cumulative-bl](img34/v034_r200c2_mempool_size.png)

![mempool-avg-bl](img34/v034_r200c2_mempool_size_avg.png)

#### Peers

The number of peers was stable at all nodes.
It was higher for the seed nodes (around 140) than for the rest (between 16 and 78).

![peers](img37/200nodes_tm037/v037_r200c2_peers.png)

Just as in the baseline, the fact that non-seed nodes reach more than 50 peers is due to [\#9548].

**This plot yields similar results to the baseline**:

![peers-bl](img34/v034_r200c2_peers.png)

#### Consensus Rounds per Height

Most heights took just one round, but some nodes needed to advance to round 1 at some point.

![rounds](img37/200nodes_tm037/v037_r200c2_rounds.png)

**This plot yields slightly better results than the baseline**:

![rounds-bl](img34/v034_r200c2_rounds.png)

#### Blocks Produced per Minute, Transactions Processed per Minute

The blocks produced per minute are the gradient of this plot.

![heights](img37/200nodes_tm037/v037_r200c2_heights.png)

Over a period of 2 minutes, the height goes from 477 to 524.
This results in an average of 23.5 blocks produced per minute.

The transactions processed per minute are the gradient of this plot.

![total-txs](img37/200nodes_tm037/v037_r200c2_total-txs.png)

Over a period of 2 minutes, the total goes from 64525 to 100125 transactions,
resulting in 17800 transactions per minute. However, we can see in the plot that
all transactions in the load are processed well before the two minutes elapse.
If we adjust the time window to when transactions are actually processed (approx. 90 seconds),
we obtain 23733 transactions per minute.
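
The per-minute figures quoted here are just the slope between two samples of the cumulative plots; the small sketch below reproduces the arithmetic using the height and transaction counts reported in this section (the helper is ours, not part of the QA tooling).

```go
package main

import "fmt"

// perMinute computes the slope of a cumulative counter between two samples.
func perMinute(startCount, endCount, windowSeconds float64) float64 {
	return (endCount - startCount) / (windowSeconds / 60)
}

func main() {
	// Blocks: height goes from 477 to 524 over 2 minutes.
	fmt.Printf("blocks/min: %.1f\n", perMinute(477, 524, 120)) // 23.5
	// Transactions: 64525 to 100125 over the same 2 minutes...
	fmt.Printf("txs/min (2 min window): %.0f\n", perMinute(64525, 100125, 120)) // 17800
	// ...but the load is fully processed in roughly 90 seconds.
	fmt.Printf("txs/min (90 s window): %.0f\n", perMinute(64525, 100125, 90)) // ~23733
}
```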

**These plots yield similar results to the baseline**:

![heights-bl](img34/v034_r200c2_heights.png)

![total-txs-bl](img34/v034_r200c2_total-txs.png)

#### Memory Resident Set Size

Resident Set Size of all monitored processes is plotted below.

![rss](img37/200nodes_tm037/v037_r200c2_rss.png)

The average over all processes oscillates around 380 MiB and does not demonstrate unconstrained growth.

![rss-avg](img37/200nodes_tm037/v037_r200c2_rss_avg.png)

**These plots yield similar results to the baseline**:

![rss-bl](img34/v034_r200c2_rss.png)

![rss-avg-bl](img34/v034_r200c2_rss_avg.png)

#### CPU utilization

The best metric from Prometheus to gauge CPU utilization in a Unix machine is `load1`,
as it usually appears in the
[output of `top`](https://www.digitalocean.com/community/tutorials/load-average-in-linux).

![load1](img37/200nodes_tm037/v037_r200c2_load1.png)

It is contained below 5 on most nodes.

**This plot yields similar results to the baseline**:

![load1-bl](img34/v034_r200c2_load1.png)

### Test Result

**Result: PASS**

Date: 2022-10-14

Version: 1cf9d8e276afe8595cba960b51cd056514965fd1

## Rotating Node Testnet

We use the same load as in the baseline: `c=4,r=800`.

Just as in the baseline tests, the version of CometBFT used for these tests is affected by [\#9539].
See this paragraph in the [baseline report](CometBFT-QA-34.md#rotating-node-testnet) for further details.
Finally, note that this setup allows for a fairer comparison between this version and the baseline.

### Latencies

The plot of all latencies can be seen below.

![rotating-all-latencies](img37/200nodes_tm037/v037_rotating_latencies.png)

It is similar to the baseline.

![rotating-all-latencies-bl](img34/v034_rotating_latencies_uniq.png)

Note that we are comparing against the baseline plot with _unique_
transactions. This is because the problem with duplicate transactions
detected during the baseline experiment did not show up for `v0.37`,
although this is _not_ proof that the problem is absent in `v0.37`.

### Prometheus Metrics

The set of metrics shown here matches those shown for the baseline (`v0.34`) for the same experiment.
We also show the baseline results for comparison.

#### Blocks and Transactions per minute

The blocks produced per minute are the gradient of this plot.

![rotating-heights](img37/200nodes_tm037/v037_rotating_heights.png)

Over a period of 4446 seconds, the height goes from 5 to 3323.
This results in an average of 45 blocks produced per minute,
which is similar to the baseline, shown below.

![rotating-heights-bl](img34/v034_rotating_heights.png)

The following two plots show only the heights reported by ephemeral nodes.
The second plot is the baseline plot for comparison.

![rotating-heights-ephe](img37/200nodes_tm037/v037_rotating_heights_ephe.png)

![rotating-heights-ephe-bl](img34/v034_rotating_heights_ephe.png)

By the length of the segments, we can see that ephemeral nodes in `v0.37`
catch up slightly faster.

The transactions processed per minute are the gradient of this plot.

![rotating-total-txs](img37/200nodes_tm037/v037_rotating_total-txs.png)

Over a period of 3852 seconds, the total goes from 597 to 267298 transactions in one of the validators,
resulting in 4154 transactions per minute, which is slightly lower than the baseline,
although the baseline had to deal with duplicate transactions.

For comparison, this is the baseline plot.

![rotating-total-txs-bl](img34/v034_rotating_total-txs.png)

#### Peers

The plot below shows the evolution of the number of peers throughout the experiment.

![rotating-peers](img37/200nodes_tm037/v037_rotating_peers.png)

This is the baseline plot, for comparison.

![rotating-peers-bl](img34/v034_rotating_peers.png)

The plotted values and their evolution are comparable in both plots.

For further details on these plots, see the baseline report.

#### Memory Resident Set Size

The average Resident Set Size (RSS) over all processes looks slightly more stable
on `v0.37` (first plot) than on the baseline (second plot).

![rotating-rss-avg](img37/200nodes_tm037/v037_rotating_rss_avg.png)

![rotating-rss-avg-bl](img34/v034_rotating_rss_avg.png)

The memory taken by the validators and the ephemeral nodes when they are up is comparable (not shown in the plots),
just as observed in the baseline.

#### CPU utilization

The plots below show metric `load1` for all nodes, for `v0.37` followed by the baseline.

![rotating-load1](img37/200nodes_tm037/v037_rotating_load1.png)

![rotating-load1-bl](img34/v034_rotating_load1.png)

In both cases, it is contained under 5 most of the time, which is considered normal load.
The green line in the `v0.37` plot and the purple line in the baseline plot (`v0.34`)
correspond to the validators receiving all transactions, via RPC, from the load runner process.
In both cases, they oscillate around 5 (normal load). The main difference is that other
nodes are generally less loaded in `v0.37`.

### Test Result

**Result: PASS**

Date: 2022-10-10

Version: 155110007b9d8b83997a799016c1d0844c8efbaf

[\#9533]: https://github.com/tendermint/tendermint/pull/9533
[\#9534]: https://github.com/tendermint/tendermint/pull/9534
[\#9539]: https://github.com/tendermint/tendermint/issues/9539
[\#9548]: https://github.com/tendermint/tendermint/issues/9548
[\#9537]: https://github.com/tendermint/tendermint/issues/9537
[\#9581]: https://github.com/tendermint/tendermint/issues/9581