---
order: 1
parent:
  title: Method
  order: 1
---

# Method

This document provides a detailed description of the QA process.
It is intended to be used by engineers reproducing the experimental setup for future tests of CometBFT.

The (first iteration of the) QA process as described [in the RELEASES.md document][releases]
was applied to version v0.34.x in order to have a set of results acting as benchmarking baseline.
This baseline is then compared with results obtained in later versions.

Out of the testnet-based test cases described in [the releases document][releases] we focused on two of them:
_200 Node Test_, and _Rotating Nodes Test_.

[releases]: https://github.com/cometbft/cometbft/blob/v0.37.x/RELEASES.md#large-scale-testnets

## Software Dependencies

### Infrastructure Requirements to Run the Tests

* An account at Digital Ocean (DO), with a high droplet limit (>202)
* The machine to orchestrate the tests should have the following installed:
    * A clone of the [testnet repository][testnet-repo]
        * This repository contains all the scripts mentioned in the remainder of this section
    * [Digital Ocean CLI][doctl]
    * [Terraform CLI][Terraform]
    * [Ansible CLI][Ansible]

[testnet-repo]: https://github.com/cometbft/qa-infra
[Ansible]: https://docs.ansible.com/ansible/latest/index.html
[Terraform]: https://www.terraform.io/docs
[doctl]: https://docs.digitalocean.com/reference/doctl/how-to/install/

### Requirements for Result Extraction

* Matlab or Octave
* [Prometheus][prometheus] server installed
* blockstore DB of one of the full nodes in the testnet
* Prometheus DB

[prometheus]: https://prometheus.io/

## 200 Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. [If you haven't done it before]
   Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnets/testnet200.toml` onto `testnet.toml` (do NOT commit this change).
3. Set the variable `VERSION_TAG` in the `Makefile` to the git hash that is to be tested.
    * If you are running the base test, which implies a homogeneous network (all nodes running the same version),
      make sure the makefile variable `VERSION2_WEIGHT` is set to 0.
    * If you are running a mixed network, set the variable `VERSION_TAG2` to the other version you want deployed
      in the network. Then, adjust the weight variables `VERSION_WEIGHT` and `VERSION2_WEIGHT` to configure the
      desired proportion of nodes running each of the two configured versions.
4. Follow steps 5-10 of the `README.md` to configure and start the 200 node testnet.
    * WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests (see step 9).
5. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `cometbft_consensus_height` metric.
   All nodes should be increasing their heights.
6. You now need to start the load runner that will produce the transaction load.
    * If you don't know the saturation load of the version you are testing, you need to discover it.
        * `ssh` into `testnet-load-runner`, then copy the script `script/200-node-loadscript.sh` and run it from the load runner node.
            * Before running it, you need to edit the script to provide the IP address of a full node.
              This node will receive all transactions from the load runner node.
            * The script takes about 40 minutes to run: it runs 90-second-long experiments in a loop with different loads.
    * If you already know the saturation load, you can simply run the test (several times) for 90 seconds with a load somewhat
      below saturation (see the example command after this list):
        * set the makefile variables `ROTATE_CONNECTIONS` and `ROTATE_TX_RATE` to values that will produce the desired transaction load,
        * set `ROTATE_TOTAL_TIME` to 90 (seconds),
        * run `make runload` and wait for it to complete. You may want to run this several times so the data from different runs can be compared.
7. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
    * Alternatively, you may want to run `make retrieve-prometheus-data` and `make retrieve-blockstore` separately.
      The end result will be the same.
    * `make retrieve-blockstore` accepts the following values in the makefile variable `RETRIEVE_TARGET_HOST`:
        * `any` (the default): picks a full node and retrieves the blockstore from that node only.
        * `all`: retrieves the blockstore from all full nodes; this is extremely slow and consumes plenty of bandwidth,
          so use it with care.
        * the name of a particular full node (e.g., `validator01`): retrieves the blockstore from that node only.
8. Verify that the data was collected without errors:
    * at least one blockstore DB for a CometBFT validator,
    * the Prometheus database from the Prometheus node,
    * for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s).
9. **Run `make terraform-destroy`**
    * Don't forget to type `yes`! Otherwise you're in trouble.
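For example, a sub-saturation run (step 6) might look like this; the connection count and rate below are placeholders, to be replaced with values below the saturation point you measured:

```bash
# Hypothetical load values; use numbers below your measured saturation point.
make runload ROTATE_CONNECTIONS=2 ROTATE_TX_RATE=200 ROTATE_TOTAL_TIME=90
```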
### Result Extraction

The method for extracting the results described here is highly manual (and exploratory) at this stage.
The CometBFT team should improve it at every iteration to increase the amount of automation.

#### Steps

1. Unzip the blockstore into a directory.
2. Extract the latency report and the raw latencies for all the experiments. Run these commands from the directory containing the blockstore:

    ```bash
    mkdir results
    go run github.com/cometbft/cometbft/test/loadtime/cmd/report@f1aaa436d --database-type goleveldb --data-dir ./ > results/report.txt
    go run github.com/cometbft/cometbft/test/loadtime/cmd/report@f1aaa436d --database-type goleveldb --data-dir ./ --csv results/raw.csv
    ```

3. File `report.txt` contains an unordered list of experiments with varying concurrent connections and transaction rate.
    * If you are looking for the saturation point:
        * Create files `report01.txt`, `report02.txt`, and `report04.txt` by copying, for each experiment in file `report.txt`,
          its related lines to the file whose name matches the experiment's number of connections, for example:

          ```bash
          for cnum in 1 2 4; do
            grep "Connections: $cnum" results/report.txt -B 2 -A 10 > results/report0$cnum.txt
          done
          ```

        * Sort the experiments in `report01.txt` in ascending tx rate order. Likewise for `report02.txt` and `report04.txt`.
    * Otherwise just keep `report.txt`, and skip step 4.
4. Generate file `report_tabbed.txt` by showing the contents of `report01.txt`, `report02.txt`, and `report04.txt` side by side
   (see the example command below).
   This effectively creates a table where rows are a particular tx rate and columns are a particular number of websocket connections.
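One way to produce this side-by-side view, assuming the `results/report0*.txt` files from step 3, is `pr` in merge mode (the width is arbitrary):

```bash
# Merge the per-connection reports into parallel columns.
pr -m -t -w 210 results/report01.txt results/report02.txt results/report04.txt \
  > results/report_tabbed.txt
```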
5. Extract the raw latencies from file `raw.csv` using the following bash loop, run from inside the `results` directory. This creates a `.csv` file and a `.dat` file per experiment.
   The format of the `.dat` files is amenable to loading them as matrices in Octave.
    * Adapt the values of the `for` loop variables according to the experiments that you ran (check `report.txt`).
    * Adapt `report*.txt` to the files you produced in step 3.

```bash
uuids=($(cat report01.txt report02.txt report04.txt | grep '^Experiment ID: ' | awk '{ print $3 }'))
c=0   # bash arrays are zero-indexed
rm -f *.dat
for i in 01 02 04; do
    for j in 0025 0050 0100 0200; do
        echo $i $j $c "${uuids[$c]}"
        filename=c${i}_r${j}
        grep "${uuids[$c]}" raw.csv > ${filename}.csv
        tr , ' ' < ${filename}.csv | awk '{ print $2, $3 }' > ${filename}.dat
        c=$((c + 1))
    done
done
```

6. Enter Octave.
7. Load all `.dat` files generated in step 5 into matrices using this Octave code snippet:

```octave
conns = { "01"; "02"; "04" };
rates = { "0025"; "0050"; "0100"; "0200" };
for i = 1:length(conns)
    for j = 1:length(rates)
        filename = strcat("c", conns{i}, "_r", rates{j}, ".dat");
        load("-ascii", filename);
    endfor
endfor
```

8. Set the variable `release` to the current release undergoing QA:

```octave
release = "v0.34.x";
```

9. Generate a plot with all (or some) experiments, where the X axis is the experiment time,
   and the Y axis is the latency of transactions.
   The following snippet plots all experiments:

```octave
legends = {};
hold off;
for i = 1:length(conns)
    for j = 1:length(rates)
        data_name = strcat("c", conns{i}, "_r", rates{j});
        l = strcat("c=", conns{i}, " r=", rates{j});
        m = eval(data_name);
        plot((m(:,1) - min(m(:,1))) / 1e+9, m(:,2) / 1e+9, ".");
        hold on;
        legends(1, end+1) = l;
    endfor
endfor
legend(legends, "location", "northeastoutside");
xlabel("experiment time (s)");
ylabel("latency (s)");
t = sprintf("200-node testnet - %s", release);
title(t);
```

10. Consider adjusting the axes, e.g., in case you want to compare your results to the baseline:

```octave
axis([0, 100, 0, 30], "tic");
```

11. Use Octave's GUI menu to save the plot (e.g., as `.png`).

12. Repeat steps 9 and 10 to obtain as many plots as deemed necessary.

13. To generate a latency vs throughput plot, using the raw CSV file generated
    in step 2, follow the instructions for the [`latency_throughput.py`] script.
    This plot is useful to visualize the saturation point.

[`latency_throughput.py`]: ../../scripts/qa/reporting/README.md#Latency-vs-Throughput-Plotting

14. Alternatively, follow the instructions for the [`latency_plotter.py`] script.
    This script generates a series of plots per experiment and configuration that may
    help with visualizing Latency vs Throughput variation.

[`latency_plotter.py`]: ../../scripts/qa/reporting/README.md#Latency-vs-Throughput-Plotting-version-2

#### Extracting Prometheus Metrics

1. Stop the Prometheus server if it is running as a service (e.g., a `systemd` unit).
2. Unzip the Prometheus database retrieved from the testnet, and move it to replace the
   local Prometheus database (see the sketch after this list).
3. Start the Prometheus server and make sure no error logs appear at startup.
4. Identify the time window you want to plot in your graphs.
5. Execute the [`prometheus_plotter.py`] script for the time window.

[`prometheus_plotter.py`]: ../../scripts/qa/reporting/README.md#prometheus-metrics
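As an illustration, on a host where Prometheus runs as a `systemd` unit named `prometheus` and keeps its TSDB under `/var/lib/prometheus/data` (both assumptions, as is the archive's internal layout; adjust to your installation), steps 1-3 might look like this:

```bash
# Assumed unit name and data path; adjust to your local Prometheus setup.
sudo systemctl stop prometheus
unzip prometheus.zip -d ./prometheus-data                       # DB retrieved from the testnet
sudo mv /var/lib/prometheus/data /var/lib/prometheus/data.bak   # keep the old DB around
sudo mv ./prometheus-data /var/lib/prometheus/data
sudo chown -R prometheus:prometheus /var/lib/prometheus/data    # if Prometheus runs as its own user
sudo systemctl start prometheus
sudo journalctl -u prometheus -n 50 --no-pager                  # verify no errors at startup
```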
## Rotating Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. [If you haven't done it before]
   Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnet_rotating.toml` onto `testnet.toml` (do NOT commit this change).
3. Set the variable `VERSION_TAG` to the git hash that is to be tested.
4. Run `make terraform-apply EPHEMERAL_SIZE=25`.
    * WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests.
5. Follow steps 6-10 of the `README.md` to configure and start the "stable" part of the rotating node testnet.
6. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric.
   All nodes should be increasing their heights.
7. On a different shell,
    * run `make runload ROTATE_CONNECTIONS=X ROTATE_TX_RATE=Y`
    * `X` and `Y` should reflect a load below the saturation point (see, e.g.,
      [this paragraph](CometBFT-QA-34.md#finding-the-saturation-point) for further info).
8. Run `make rotate` to start the script that creates the ephemeral nodes, and kills them when they are caught up.
    * WARNING: If you run this command from your laptop, the laptop needs to be up and connected for the full length
      of the experiment.
9. When the height of the chain reaches 3000, stop the `make runload` script.
10. When the rotate script has made two iterations (i.e., all ephemeral nodes have caught up twice)
    after height 3000 was reached, stop `make rotate`.
11. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
12. Verify that the data was collected without errors:
    * at least one blockstore DB for a CometBFT validator,
    * the Prometheus database from the Prometheus node,
    * for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s).
13. **Run `make terraform-destroy`**

Steps 8 to 10 are highly manual at the moment and will be improved in future iterations.

### Result Extraction

In order to obtain a latency plot, follow the instructions above for the 200 node experiment, but:

* The `report.txt` file contains only one experiment.
* Therefore, there is no need for any `for` loops (a single-experiment sketch is given at the end of this section).

As for Prometheus, the same method as for the 200 node experiment can be applied.
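For example, a single-experiment variant of the extraction loop from step 5 of the 200 node instructions, run from the directory containing `report.txt` and `raw.csv` (the output file name is arbitrary), could look like this:

```bash
# Only one experiment: extract its UUID and dump (time, latency) pairs directly.
uuid=$(grep '^Experiment ID: ' report.txt | awk '{ print $3 }')
grep "$uuid" raw.csv | tr , ' ' | awk '{ print $2, $3 }' > rotating.dat
```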