---
order: 1
parent:
  title: Tendermint Core QA Results v0.37.x
  description: This is a report on the results obtained when running TM v0.37.x on testnets
  order: 4
---

# Tendermint Core QA Results v0.37.x

## Issues discovered

During this iteration of the QA process, the following issues were found:

* (critical, fixed) [\#9533] - This bug caused full nodes to sometimes get stuck
  when blocksyncing, requiring a manual restart to unblock them. Importantly,
  this bug was also present in v0.34.x and the fix was also backported in
  [\#9534].
* (critical, fixed) [\#9539] - `loadtime` is very likely to include more than
  one `=` character in transactions, which is rejected by the e2e application.
* (critical, fixed) [\#9581] - An absent Prometheus label makes CometBFT crash
  when Prometheus metric collection is enabled.
* (non-critical, not fixed) [\#9548] - Full nodes can go over 50 connected
  peers, which is not intended by the default configuration.
* (non-critical, not fixed) [\#9537] - With the default mempool cache setting,
  duplicated transactions are not rejected when gossiped and eventually flood
  all mempools. The 200 node testnets were thus run with a cache size of 200000
  (as opposed to the default 10000).

## 200 Node Testnet

### Finding the Saturation Point

The first goal is to identify the saturation point and compare it with the baseline (v0.34.x).
For further details, see [this paragraph](CometBFT-QA-34.md#finding-the-saturation-point)
in the baseline version.

The following table summarizes the results for v0.37.x, for the different experiments
(extracted from file [`v037_report_tabbed.txt`](img37/200nodes_tm037/v037_report_tabbed.txt)).

The X axis of this table is `c`, the number of connections created by the load runner process to the target node.
The Y axis of this table is `r`, the rate, or number of transactions issued per second.

|       |   c=1 |   c=2 |   c=4 |
| :---  | ----: | ----: | ----: |
| r=25  |  2225 |  4450 |  8900 |
| r=50  |  4450 |  8900 | 17800 |
| r=100 |  8900 | 17800 | 35600 |
| r=200 | 17800 | 35600 | 38660 |

For comparison, this is the table with the baseline version.

|       |   c=1 |   c=2 |   c=4 |
| :---  | ----: | ----: | ----: |
| r=25  |  2225 |  4450 |  8900 |
| r=50  |  4450 |  8900 | 17800 |
| r=100 |  8900 | 17800 | 35400 |
| r=200 | 17800 | 35600 | 37358 |

The saturation point lies beyond the diagonal defined by the following entries:

* `r=200,c=2`
* `r=100,c=4`

which is at the same place as in the baseline. For more details on the saturation point, see
[this paragraph](CometBFT-QA-34.md#finding-the-saturation-point) in the baseline version.
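As a quick sanity check of the tables above: every cell below the saturation point equals `c * r * 89`, which suggests a load window of roughly 89 seconds per experiment (e.g., 25 tx/s × 1 connection × 89 s = 2225). The sketch below recomputes the expected totals; note that the 89-second window is inferred from the table entries, not stated explicitly in this report.

```go
package main

import "fmt"

func main() {
	// Load window inferred from the table itself (2225 / 25 = 89); the report
	// does not state the experiment duration explicitly.
	const durationSeconds = 89

	rates := []int{25, 50, 100, 200} // r: txs issued per second on each connection
	conns := []int{1, 2, 4}          // c: connections from the load runner to the target node

	for _, r := range rates {
		for _, c := range conns {
			fmt.Printf("r=%d,c=%d: expected %d txs\n", r, c, r*c*durationSeconds)
		}
	}
	// All cells match the v0.37.x table except r=200,c=4 (expected 71200,
	// observed 38660), i.e., the one cell beyond the saturation point.
}
```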
The experiment chosen to examine Prometheus metrics is the same as in the baseline:
**`r=200,c=2`**.

The load runner's CPU load was negligible (near 0) when running `r=200,c=2`.

### Examining latencies

The method described [here](method.md) allows us to plot the latencies of transactions
for all experiments.

![all-latencies](img37/200nodes_tm037/v037_200node_latencies.png)

The data seen in the plot is similar to that of the baseline.

![all-latencies-bl](img34/v034_200node_latencies.png)

Therefore, for further details on these plots,
see [this paragraph](CometBFT-QA-34.md#examining-latencies) in the baseline version.

The following plot summarizes average latencies versus overall throughputs
across different numbers of WebSocket connections to the node into which
transactions are being loaded.

![latency-vs-throughput](img37/200nodes_tm037/v037_latency_throughput.png)

This is similar to the baseline plot:

![latency-vs-throughput-bl](img34/v034_latency_throughput.png)

### Prometheus Metrics on the Chosen Experiment

As mentioned [above](#finding-the-saturation-point), the chosen experiment is `r=200,c=2`.
This section further examines key metrics for this experiment, extracted from Prometheus data.

#### Mempool Size

The mempool size, a count of the number of transactions in the mempool, was shown to be stable and homogeneous
at all full nodes. It did not exhibit any unconstrained growth.
The plot below shows the evolution over time of the cumulative number of transactions inside all full nodes' mempools
at a given time.

![mempool-cumulative](img37/200nodes_tm037/v037_r200c2_mempool_size.png)

The plot below shows the evolution of the average over all full nodes, which oscillates between 1500 and 2000 outstanding transactions.

![mempool-avg](img37/200nodes_tm037/v037_r200c2_mempool_size_avg.png)

The peaks observed coincide with the moments when some nodes reached round 1 of consensus (see below).

**These plots yield similar results to the baseline**:

![mempool-cumulative-bl](img34/v034_r200c2_mempool_size.png)

![mempool-avg-bl](img34/v034_r200c2_mempool_size_avg.png)

#### Peers

The number of peers was stable at all nodes.
It was higher for the seed nodes (around 140) than for the rest (between 16 and 78).

![peers](img37/200nodes_tm037/v037_r200c2_peers.png)

Just as in the baseline, the fact that non-seed nodes reach more than 50 peers is due to [\#9548].

**This plot yields similar results to the baseline**:

![peers-bl](img34/v034_r200c2_peers.png)

#### Consensus Rounds per Height

Most heights took just one round (round 0), but some nodes needed to advance to round 1 at some point.

![rounds](img37/200nodes_tm037/v037_r200c2_rounds.png)

**This plot yields slightly better results than the baseline**:

![rounds-bl](img34/v034_r200c2_rounds.png)

#### Blocks Produced per Minute, Transactions Processed per Minute

The blocks produced per minute are the gradient of this plot.

![heights](img37/200nodes_tm037/v037_r200c2_heights.png)

Over a period of 2 minutes, the height goes from 477 to 524.
This results in an average of 23.5 blocks produced per minute.

The transactions processed per minute are the gradient of this plot.

![total-txs](img37/200nodes_tm037/v037_r200c2_total-txs.png)

Over a period of 2 minutes, the total goes from 64525 to 100125 transactions,
resulting in 17800 transactions per minute. However, we can see in the plot that
all transactions in the load are processed well before the two minutes have elapsed.
If we adjust the time window to the interval when transactions were actually being processed (approx. 90 seconds),
we obtain 23733 transactions per minute.
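The three rates above are simple endpoint arithmetic over the plots. A minimal sketch, using the endpoint values quoted in the text, reproduces all three figures:

```go
package main

import "fmt"

func main() {
	// Endpoints read off the plots for experiment r=200,c=2, as quoted above.
	const (
		h0, h1         = 477.0, 524.0      // block height at the start/end of the window
		tx0, tx1       = 64525.0, 100125.0 // cumulative transactions at the same endpoints
		fullWindow     = 2.0               // minutes
		adjustedWindow = 1.5               // minutes (~90 s during which txs were actually processed)
	)

	fmt.Printf("blocks/min: %.1f\n", (h1-h0)/fullWindow)                      // 23.5
	fmt.Printf("txs/min (2-min window): %.0f\n", (tx1-tx0)/fullWindow)        // 17800
	fmt.Printf("txs/min (adjusted ~90 s): %.0f\n", (tx1-tx0)/adjustedWindow)  // 23733
}
```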
**These plots yield similar results to the baseline**:

![heights-bl](img34/v034_r200c2_heights.png)

![total-txs-bl](img34/v034_r200c2_total-txs.png)

#### Memory Resident Set Size

The Resident Set Size of all monitored processes is plotted below.

![rss](img37/200nodes_tm037/v037_r200c2_rss.png)

The average over all processes oscillates around 380 MiB and does not demonstrate unconstrained growth.

![rss-avg](img37/200nodes_tm037/v037_r200c2_rss_avg.png)

**These plots yield similar results to the baseline**:

![rss-bl](img34/v034_r200c2_rss.png)

![rss-avg-bl](img34/v034_r200c2_rss_avg.png)

#### CPU utilization

The best metric from Prometheus to gauge CPU utilization in a Unix machine is `load1`,
as it usually appears in the
[output of `top`](https://www.digitalocean.com/community/tutorials/load-average-in-linux).

![load1](img37/200nodes_tm037/v037_r200c2_load1.png)

It is contained below 5 on most nodes.

**This plot yields similar results to the baseline**:

![load1-bl](img34/v034_r200c2_load1.png)
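For readers who want to reproduce the `load1` panels from raw monitoring data, Prometheus's range-query HTTP API can be hit directly. This is a minimal sketch, assuming a Prometheus server at `localhost:9090` and that the hosts' load average is scraped as node_exporter's `node_load1`; the metric name, server address, and time window used in the QA testnets may differ.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Assumptions: Prometheus at localhost:9090; load average exported as node_load1.
	base := "http://localhost:9090/api/v1/query_range"
	params := url.Values{
		"query": {"node_load1"},
		"start": {"2022-10-14T00:00:00Z"}, // hypothetical window; adjust to the experiment's time range
		"end":   {"2022-10-14T00:10:00Z"},
		"step":  {"15s"},
	}

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // JSON matrix: one time series per node, ready for plotting
}
```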
### Test Result

**Result: PASS**

Date: 2022-10-14

Version: 1cf9d8e276afe8595cba960b51cd056514965fd1

## Rotating Node Testnet

We use the same load as in the baseline: `c=4,r=800`.

Just as in the baseline tests, the version of CometBFT used for these tests is affected by [\#9539].
See this paragraph in the [baseline report](CometBFT-QA-34.md#rotating-node-testnet) for further details.
Finally, note that this setup allows for a fairer comparison between this version and the baseline.

### Latencies

The plot of all latencies can be seen here.

![rotating-all-latencies](img37/200nodes_tm037/v037_rotating_latencies.png)

It is similar to that of the baseline.

![rotating-all-latencies-bl](img34/v034_rotating_latencies_uniq.png)

Note that we are comparing against the baseline plot with _unique_
transactions. This is because the problem with duplicate transactions
detected during the baseline experiment did not show up for `v0.37`,
which is _not_ proof that the problem is absent in `v0.37`.

### Prometheus Metrics

The set of metrics shown here matches the one shown for the baseline (`v0.34`) for the same experiment.
We also show the baseline results for comparison.

#### Blocks and Transactions per minute

The blocks produced per minute are the gradient of this plot.

![rotating-heights](img37/200nodes_tm037/v037_rotating_heights.png)

Over a period of 4446 seconds, the height goes from 5 to 3323.
This results in an average of 45 blocks produced per minute,
which is similar to the baseline, shown below.

![rotating-heights-bl](img34/v034_rotating_heights.png)

The following two plots show only the heights reported by ephemeral nodes.
The second plot is the baseline plot, for comparison.

![rotating-heights-ephe](img37/200nodes_tm037/v037_rotating_heights_ephe.png)

![rotating-heights-ephe-bl](img34/v034_rotating_heights_ephe.png)

From the length of the segments, we can see that ephemeral nodes in `v0.37`
catch up slightly faster.

The transactions processed per minute are the gradient of this plot.

![rotating-total-txs](img37/200nodes_tm037/v037_rotating_total-txs.png)

Over a period of 3852 seconds, the total goes from 597 to 267298 transactions in one of the validators,
resulting in 4154 transactions per minute, which is slightly lower than the baseline,
although the baseline had to deal with duplicate transactions.

For comparison, this is the baseline plot.

![rotating-total-txs-bl](img34/v034_rotating_total-txs.png)

#### Peers

The plot below shows the evolution of the number of peers throughout the experiment.

![rotating-peers](img37/200nodes_tm037/v037_rotating_peers.png)

This is the baseline plot, for comparison.

![rotating-peers-bl](img34/v034_rotating_peers.png)

The plotted values and their evolution are comparable in both plots.

For further details on these plots, see the baseline report.

#### Memory Resident Set Size

The average Resident Set Size (RSS) over all processes looks slightly more stable
on `v0.37` (first plot) than on the baseline (second plot).

![rotating-rss-avg](img37/200nodes_tm037/v037_rotating_rss_avg.png)

![rotating-rss-avg-bl](img34/v034_rotating_rss_avg.png)

The memory taken by the validators and the ephemeral nodes when they are up is comparable (not shown in the plots),
just as observed in the baseline.

#### CPU utilization

The plot shows metric `load1` for all nodes.

![rotating-load1](img37/200nodes_tm037/v037_rotating_load1.png)

![rotating-load1-bl](img34/v034_rotating_load1.png)

In both cases, it is contained under 5 most of the time, which is considered normal load.
The green line in the `v0.37` plot and the purple line in the baseline plot (`v0.34`)
correspond to the validators receiving all transactions, via RPC, from the load runner process.
In both cases, they oscillate around 5 (normal load). The main difference is that other
nodes are generally less loaded in `v0.37`.

### Test Result

**Result: PASS**

Date: 2022-10-10

Version: 155110007b9d8b83997a799016c1d0844c8efbaf

[\#9533]: https://github.com/tendermint/tendermint/pull/9533
[\#9534]: https://github.com/tendermint/tendermint/pull/9534
[\#9537]: https://github.com/tendermint/tendermint/issues/9537
[\#9539]: https://github.com/tendermint/tendermint/issues/9539
[\#9548]: https://github.com/tendermint/tendermint/issues/9548
[\#9581]: https://github.com/tendermint/tendermint/issues/9581