
---
order: 1
title: Method
---

# Method

This document provides a detailed description of the QA process.
It is intended to be used by engineers reproducing the experimental setup for future tests of Tendermint.

The (first iteration of the) QA process described [in the RELEASES.md document][releases]
was applied to version v0.34.x in order to produce a set of results that act as a benchmarking baseline.
This baseline is then compared with results obtained in later versions.

Out of the testnet-based test cases described in [the releases document][releases], we focused on two:
the _200 Node Test_ and the _Rotating Nodes Test_.

[releases]: https://github.com/tendermint/tendermint/blob/v0.37.x/RELEASES.md#large-scale-testnets

## Software Dependencies

### Infrastructure Requirements to Run the Tests

* An account at Digital Ocean (DO), with a high droplet limit (>202)
* The machine that orchestrates the tests should have the following installed:
    * A clone of the [testnet repository][testnet-repo]
        * This repository contains all the scripts mentioned in the remainder of this section
    * [Digital Ocean CLI][doctl]
    * [Terraform CLI][Terraform]
    * [Ansible CLI][Ansible]

[testnet-repo]: https://github.com/interchainio/tendermint-testnet
[Ansible]: https://docs.ansible.com/ansible/latest/index.html
[Terraform]: https://www.terraform.io/docs
[doctl]: https://docs.digitalocean.com/reference/doctl/how-to/install/

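Before provisioning anything, it can be worth verifying that the orchestrating machine actually has these CLIs on its `PATH`. The following is a convenience sketch, not part of the testnet repository; the `check_tools` helper name is illustrative.

```shell
#!/bin/sh
# check_tools reports which of the given CLIs are missing from the PATH.
# The default invocation below matches the requirements listed above.
check_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
  else
    echo "all tools present"
  fi
}

check_tools doctl terraform ansible
```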
### Requirements for Result Extraction

* MATLAB or Octave
* A [Prometheus][prometheus] server installation
* The blockstore DB of one of the full nodes in the testnet
* The Prometheus DB

[prometheus]: https://prometheus.io/

## 200 Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. [If you haven't done it before]
   Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnets/testnet200.toml` onto `testnet.toml` (do NOT commit this change)
3. Set the variable `VERSION_TAG` in the `Makefile` to the git hash that is to be tested.
4. Follow steps 5-10 of the `README.md` to configure and start the 200 node testnet
    * WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests (see step 9)
5. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric.
   All nodes should be increasing their heights.
6. `ssh` into the `testnet-load-runner`, then copy the script `script/200-node-loadscript.sh` and run it from the load runner node.
    * Before running it, you need to edit the script to provide the IP address of a full node.
      This node will receive all transactions from the load runner node.
    * The script takes about 40 minutes to run.
    * It runs 90-second-long experiments in a loop with different loads.
7. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine
8. Verify that the data was collected without errors
    * at least one blockstore DB for a Tendermint validator
    * the Prometheus database from the Prometheus node
    * for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s)
9. **Run `make terraform-destroy`**
    * Don't forget to type `yes` at the confirmation prompt! Otherwise the droplets will keep running (and accruing charges).

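The sanity check of step 5 can also be performed from the command line against Prometheus's HTTP API. The helper below is a hypothetical convenience, not part of the testnet scripts; `9090` is Prometheus's default port and `/api/v1/query` its standard instant-query endpoint.

```shell
#!/bin/sh
# prom_query_url builds the instant-query URL for Prometheus's HTTP API.
# The helper name is illustrative; 9090 is Prometheus's default port.
prom_query_url() {
  host="$1"; metric="$2"
  echo "http://${host}:9090/api/v1/query?query=${metric}"
}

# Usage (requires curl and network access to the Prometheus node):
#   curl -s "$(prom_query_url "$PROMETHEUS_IP" tendermint_consensus_height)"
```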
### Result Extraction

The method for extracting the results described here is highly manual (and exploratory) at this stage.
The Core team should improve it at every iteration to increase the amount of automation.

#### Steps

1. Unzip the blockstore into a directory
2. Extract the latency report and the raw latencies for all the experiments. Run these commands from the directory containing the blockstore
    * `go run github.com/tendermint/tendermint/test/loadtime/cmd/report@3ec6e424d --database-type goleveldb --data-dir ./ > results/report.txt`
    * `go run github.com/tendermint/tendermint/test/loadtime/cmd/report@3ec6e424d --database-type goleveldb --data-dir ./ --csv results/raw.csv`
3. File `report.txt` contains an unordered list of experiments with varying concurrent connections and transaction rate
    * Create files `report01.txt`, `report02.txt`, `report04.txt` and, for each experiment in file `report.txt`,
      copy its related lines to the file whose name matches the number of connections.
    * Sort the experiments in `report01.txt` in ascending tx rate order. Likewise for `report02.txt` and `report04.txt`.
4. Generate file `report_tabbed.txt` by showing the contents of `report01.txt`, `report02.txt`, and `report04.txt` side by side
   * This effectively creates a table where rows are a particular tx rate and columns are a particular number of websocket connections.
5. Extract the raw latencies from file `raw.csv` using the following bash loop. This creates a `.csv` file and a `.dat` file per experiment.
   The format of the `.dat` files is amenable to loading them as matrices in Octave.

    ```bash
    uuids=($(cat report01.txt report02.txt report04.txt | grep '^Experiment ID: ' | awk '{ print $3 }'))
    c=1
    for i in 01 02 04; do
      for j in 0025 0050 0100 0200; do
        echo $i $j $c "${uuids[$c]}"
        filename=c${i}_r${j}
        grep ${uuids[$c]} raw.csv > ${filename}.csv
        cat ${filename}.csv | tr , ' ' | awk '{ print $2, $3 }' > ${filename}.dat
        c=$(expr $c + 1)
      done
    done
    ```

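    As an optional extra check (an addition, not part of the original procedure), you can verify that each generated `.dat` file really has the two numeric columns Octave expects; the `check_dat` helper below is illustrative.

    ```shell
    # check_dat succeeds only if every line of the given file has exactly two fields
    check_dat() {
      awk 'NF != 2 { exit 1 }' "$1"
    }

    for f in c*_r*.dat; do
      [ -e "$f" ] || continue          # skip if the glob matched nothing
      check_dat "$f" || echo "malformed: $f"
    done
    ```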
6. Enter Octave
7. Load all `.dat` files generated in step 5 into matrices using this Octave code snippet

    ```octave
    conns =  { "01"; "02"; "04" };
    rates =  { "0025"; "0050"; "0100"; "0200" };
    for i = 1:length(conns)
      for j = 1:length(rates)
        filename = strcat("c", conns{i}, "_r", rates{j}, ".dat");
        load("-ascii", filename);
      endfor
    endfor
    ```

8. Set the variable `release` to the release currently undergoing QA

    ```octave
    release = "v0.34.x";
    ```

9. Generate a plot with all (or some) experiments, where the x axis is the experiment time
   and the y axis is the latency of transactions.
   The following snippet plots all experiments.

    ```octave
    legends = {};
    hold off;
    for i = 1:length(conns)
      for j = 1:length(rates)
        data_name = strcat("c", conns{i}, "_r", rates{j});
        l = strcat("c=", conns{i}, " r=", rates{j});
        m = eval(data_name); plot((m(:,1) - min(m(:,1))) / 1e+9, m(:,2) / 1e+9, ".");
        hold on;
        legends(1, end+1) = l;
      endfor
    endfor
    legend(legends, "location", "northeastoutside");
    xlabel("experiment time (s)");
    ylabel("latency (s)");
    t = sprintf("200-node testnet - %s", release);
    title(t);
    ```

10. Consider adjusting the axes, for instance if you want to compare your results to the baseline

    ```octave
    axis([0, 100, 0, 30], "tic");
    ```

11. Use Octave's GUI menu to save the plot (e.g., as `.png`)

12. Repeat steps 9 and 10 to obtain as many plots as deemed necessary.

13. To generate a latency vs. throughput plot using the raw CSV file generated
    in step 2, follow the instructions for the [`latency_throughput.py`] script.

[`latency_throughput.py`]: ../../scripts/qa/reporting/README.md

#### Extracting Prometheus Metrics

1. Stop the Prometheus server if it is running as a service (e.g., a `systemd` unit).
2. Unzip the Prometheus database retrieved from the testnet, and move it to replace the
   local Prometheus database.
3. Start the Prometheus server and make sure no error logs appear at startup.
4. In the Prometheus web interface, enter queries for the metrics you want to gather or plot.

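Steps 1-3 amount to pointing a local server at the unzipped TSDB directory. The sketch below only assembles the invocation; the `prom_cmd` helper and the directory name are assumptions, while `--config.file` and `--storage.tsdb.path` are standard Prometheus 2.x flags.

```shell
#!/bin/sh
# prom_cmd assembles the Prometheus server invocation for step 3.
# The TSDB directory is whatever you unzipped the testnet database to;
# the helper name and directory are assumptions, the flags are standard.
prom_cmd() {
  tsdb_dir="$1"
  echo "prometheus --config.file=prometheus.yml --storage.tsdb.path=${tsdb_dir}"
}

# Usage:
#   unzip prometheus.zip -d ./testnet-tsdb
#   $(prom_cmd ./testnet-tsdb)
```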
## Rotating Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. [If you haven't done it before]
   Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnet_rotating.toml` onto `testnet.toml` (do NOT commit this change)
3. Set the variable `VERSION_TAG` to the git hash that is to be tested.
4. Run `make terraform-apply EPHEMERAL_SIZE=25`
    * WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests
5. Follow steps 6-10 of the `README.md` to configure and start the "stable" part of the rotating node testnet
6. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric.
   All nodes should be increasing their heights.
7. On a different shell,
    * run `make runload ROTATE_CONNECTIONS=X ROTATE_TX_RATE=Y`
    * `X` and `Y` should reflect a load below the saturation point (see, e.g.,
      [this paragraph](./v034/README.md#finding-the-saturation-point) for further info)
8. Run `make rotate` to start the script that creates the ephemeral nodes, and kills them when they are caught up.
    * WARNING: If you run this command from your laptop, the laptop needs to be up and connected for the full length
      of the experiment.
9. When the height of the chain reaches 3000, stop the `make runload` script.
10. When the rotate script has made two iterations (i.e., all ephemeral nodes have caught up twice)
    after height 3000 was reached, stop `make rotate`.
11. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine
12. Verify that the data was collected without errors
    * at least one blockstore DB for a Tendermint validator
    * the Prometheus database from the Prometheus node
    * for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s)
13. **Run `make terraform-destroy`**

Steps 8 to 10 are highly manual at the moment and will be improved in future iterations.

### Result Extraction

In order to obtain a latency plot, follow the instructions above for the 200 node experiment, but note:

* The `report.txt` file contains only one experiment
* Therefore, there is no need for any `for` loops

As for Prometheus, the same method as for the 200 node experiment can be applied.
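Concretely, the extraction loop from step 5 of the 200 node procedure reduces to a single pass, along these lines (the `extract_single` helper and the `rotating.dat` output name are illustrative assumptions; the column handling matches the existing bash loop):

```shell
#!/bin/sh
# Single-experiment variant of the raw-latency extraction: pick the one
# experiment UUID out of the report and dump its (timestamp, latency) pairs.
# extract_single and the output name rotating.dat are assumed names.
extract_single() {
  report="$1"; csv="$2"; out="$3"
  uuid=$(grep '^Experiment ID: ' "$report" | awk '{ print $3 }')
  grep "$uuid" "$csv" | tr , ' ' | awk '{ print $2, $3 }' > "$out"
}

# Usage: extract_single report.txt raw.csv rotating.dat
```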