github.com/badrootd/nibiru-cometbft@v0.37.5-0.20240307173500-2a75559eee9b/docs/rfc/rfc-023-semi-permanent-testnet.md

# RFC 023: Semi-permanent Testnet

## Changelog

- 2022-07-28: Initial draft (@mark-rushakoff)
- 2022-07-29: Renumber to 023, minor clarifications (@mark-rushakoff)

## Abstract

This RFC discusses a long-lived testnet, owned and operated by the Tendermint engineers.
By owning and operating a production-like testnet,
the team who develops Tendermint becomes more capable of discovering bugs that
only arise in production-like environments.
They also build expertise in operating Tendermint;
this will help guide the development of Tendermint towards operator-friendly design.

The RFC details a rough roadmap towards a semi-permanent testnet, some of the considered tradeoffs,
and the expected outcomes from following this roadmap.

## Background

The author's understanding -- which is limited as a new contributor to the Tendermint project --
is that Tendermint has largely been developed as a library for other projects to consume.
Of course effort has been spent on unit tests, end-to-end tests, and integration tests.
But whether developing a library or an application,
there is no substitute for putting the software under a production-like load.

First, there are classes of bugs that are unrealistic to discover in environments
that do not resemble production.
But perhaps more importantly, there are "operational features" that are best designed
by the authors of a given piece of software.
For instance, does the software have sufficient observability built-in?
Are the reported metrics useful?
Are the log messages clear and sufficiently detailed, without being too noisy?

Furthermore, if the library authors are not only building --
but also maintaining and operating -- an application built on top of their library,
the authors will have a greatly increased confidence that their library's API
is appropriate for other application authors.

Once the decision has been made to run and operate a service,
one of the next strategic questions is that of deploying said service.
The author strongly holds the opinion that, when possible,
a continuous delivery model offers the most compelling set of advantages:
- The code on a particular branch (likely `main` or `master`) is exactly what is,
  or what will very soon be, running in production
- There are no manual steps involved in deploying -- other than merging your pull request,
  which you had to do anyway
- A bug discovered in production can be rapidly confirmed as fixed in production

In summary, if the Tendermint authors build, maintain, and continuously deliver an application
intended to serve as a long-lived testnet, they will be able to state with confidence:
- We operate the software in a production-like environment and we have observed it to be
  stable and performant to our requirements
- We have discovered issues in production before any external parties have consumed our software,
  and we have addressed said issues
- We have successfully used the observability tooling built into our software
  (perhaps in conjunction with other off-the-shelf tooling)
  to diagnose and debug issues in production

## Discussion

This section discusses several aspects of maintaining a testnet for Tendermint.

### Number of testnets

There should probably be one testnet per maintained branch of Tendermint,
i.e. one for the `main` branch
and one per `v0.N.x` branch that the authors maintain.

There may also exist testnets for long-lived feature branches.

We may eventually discover that there is good reason to run more than one testnet for a branch,
perhaps due to a significant configuration variation.

### Testnet lifecycle

The document has used the terms "long-lived" and "semi-permanent" somewhat interchangeably.
The intent of the testnet being discussed in this RFC is to exist indefinitely;
but there is a practical understanding that some testnet instances
will be retired for a variety of reasons.
For instance, once a release branch is no longer supported,
its corresponding testnet should be torn down.

In general, new commits to branches with corresponding testnets
should result in an in-place upgrade of all nodes in the testnet
without any data loss and without requiring new configuration.
The mechanism for achieving this is outside the scope of this RFC.

However, it is also expected that there will be
breaking changes during the development of the `main` branch.
For instance, suppose there is an unreleased feature involving storage on disk,
and the developers need to change the storage format.
It should be at the developers' discretion whether it is feasible and worthwhile
to introduce an intermediate commit that translates the old format to the new format,
or if it would be preferable to just destroy the testnet and start from scratch
without any data in the old format.

Similarly, if a developer inadvertently pushes a breaking change to an unreleased feature,
they are free to make a judgement call between reverting the change,
adding a commit to allow a forward migration,
or simply forcing the testnet to be recreated.

### Testnet maintenance investment

While there is certainly engineering effort required to build the tooling and infrastructure
to get the testnets up and running,
the intent is that a running testnet requires no manual upkeep under normal conditions.

It is expected that a subset of the Tendermint engineers are familiar with and engaged in
writing the software to maintain and build the testnet infrastructure,
but the rest of the team should not need any involvement in authoring that code.

The testnets should be configured to send notifications for events requiring triage,
such as a chain halt or a node OOMing.
The time investment necessary to address the underlying issues for those kinds of events
is unpredictable.

Aside from triaging exceptional events, an engineer may choose to spend some time
collecting metrics or profiles from testnet nodes to check performance details
before and after a particular change;
or they may inspect logs associated with an expected behavior change.
But during day-to-day work, engineers are not expected to spend any considerable time
directly interacting with the testnets.

If we discover that there are any routine actions engineers must take against the testnet
that take any substantial focused time,
those actions should be automated to a one-line command as much as is reasonable.

### Testnet MVP

The minimum viable testnet meets the following requirements:

- The testnet self-updates following a new commit pushed to Tendermint's `main` branch on GitHub
  (there are some omitted steps here, such as CI building appropriate binaries and
  somehow notifying the testnet that a new build is available)
- The testnet runs the Tendermint KV store for MVP
- The testnet operators are notified if:
    - Any node's process exits for any reason other than a restart for a new binary
    - Any node stops updating blocks, and by extension if a chain halt occurs
    - No other observability will be considered for MVP
- The testnet has a minimum of 1 full node and 3 validators
- The testnet has a reasonably low, constant throughput of transactions -- say 30 tx/min --
  and the testnet operators are notified if that throughput drops below 75% of target
  sustained over 5 minutes
- The testnet only needs to run in a single datacenter/cloud-region for MVP,
  i.e. running in multiple datacenters is out of scope for MVP
- The testnet is running directly on VMs or compute instances;
  while Kubernetes or other orchestration frameworks may offer many significant advantages,
  the Tendermint engineers should not be required to learn those tools in order to
  perform basic debugging

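
As an illustration only, the throughput notification rule from the list above could be sketched as follows.
The RFC does not prescribe an implementation; the `sample` and `throughputAlert` types,
the `notifyMinutes` helper, and the one-sample-per-minute cadence are all assumptions made for this sketch.
Only the 30 tx/min target, the 75% threshold, and the 5-minute window come from the text.

```go
package main

import (
	"fmt"
	"time"
)

// Values taken from the MVP requirements; everything else is hypothetical.
const (
	targetTxPerMin = 30.0
	alertFraction  = 0.75
	sustainWindow  = 5 * time.Minute
)

// sample is one hypothetical throughput measurement.
type sample struct {
	at       time.Time
	txPerMin float64
}

// throughputAlert tracks how long throughput has stayed below the threshold.
type throughputAlert struct {
	belowSince time.Time // zero while throughput is healthy
}

// observe records a sample and reports whether operators should be notified,
// i.e. throughput has been below 75% of target for at least the sustain window.
func (a *throughputAlert) observe(s sample) bool {
	if s.txPerMin >= targetTxPerMin*alertFraction {
		a.belowSince = time.Time{} // recovered: reset the streak
		return false
	}
	if a.belowSince.IsZero() {
		a.belowSince = s.at
	}
	return s.at.Sub(a.belowSince) >= sustainWindow
}

// notifyMinutes feeds one sample per minute and returns the minute indices
// at which a notification would fire.
func notifyMinutes(rates []float64) []int {
	start := time.Date(2022, 7, 28, 12, 0, 0, 0, time.UTC)
	var a throughputAlert
	var fired []int
	for i, r := range rates {
		s := sample{at: start.Add(time.Duration(i) * time.Minute), txPerMin: r}
		if a.observe(s) {
			fired = append(fired, i)
		}
	}
	return fired
}

func main() {
	// Throughput dips to 20 tx/min at minute 1 and recovers at minute 7.
	fmt.Println(notifyMinutes([]float64{30, 20, 20, 20, 20, 20, 20, 30})) // prints [6]
}
```

A short dip that recovers within the window resets the streak, so it never produces a notification.
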
### Testnet medium-term goals

The medium-term goals are intended to be achievable within the 6-12 month time range
following the launch of MVP.
These goals could realistically be roadmapped following the launch of the MVP testnet.

- The `main` testnet has more than 20 nodes (completely arbitrary -- 5x more than 1+3 at MVP)
- In addition to the `main` testnet,
  there is at least one testnet associated with one release branch
- The testnet no longer is simply running the Tendermint KV store;
  now it is built on a more complex, custom application
  that deliberately exercises a greater portion of the Tendermint stack
- Each testnet is spread across at least two cloud providers,
  in order to communicate over a network more closely resembling use of Tendermint in "real" chains
- The node updates have some "jitter",
  with some nodes updating immediately when a new build is available,
  and others delaying up to perhaps 30-60 minutes
- The team has published some form of dashboards that have served well for debugging,
  which external parties can copy/modify to their needs
    - The dashboards must include metrics published by Tendermint nodes;
      there should be both OS- or runtime-level metrics such as memory in use,
      and application-level metrics related to the underlying blockchain
    - "Published" in this context is more in the spirit of "shared with the community",
      not "produced a supported open source tool" --
      this could be published to GitHub with a warning that no support is offered,
      or it could simply be a blog post detailing what has worked for the Tendermint developers
    - The dashboards will likely be implemented on free and open source tooling,
      but that is not a hard requirement if paid software is more appropriate
- The team has produced a reference model of a log aggregation stack that external parties can use
    - Similar to the "published" dashboards, this only needs to be "shared" rather than "supported"
- Chaos engineering has begun to be integrated into the testnets
  (this could be periodic CPU limiting or deliberate network interference, etc.
  but it probably would not be filesystem corruption)
- Each testnet has at least one node running a build with the Go race detector enabled
- The testnet contains some kind of generalized notification system built in:
    - Tendermint code grows "watchdog" systems built in to validate things like
      subsystems have not deadlocked; e.g. if the watchdog can't acquire and immediately release
      a particular mutex once in every 5-minute period, it is near certain that the target
      subsystem has deadlocked, and an alert must be sent to the engineering team.
      (Outside of the testnet, the watchdogs could be disabled, or they could panic on failure.)
    - The notification system does some deduplication to minimize spam on system failure

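
The mutex watchdog described in the list above might look roughly like the following.
This is a minimal sketch, not Tendermint code: the `subsystem` type and the `probe` function are hypothetical names,
and polling with `sync.Mutex.TryLock` (available since Go 1.18) is one possible approach, chosen here
so that the watchdog itself cannot block on a deadlocked subsystem.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// subsystem stands in for any component guarded by a mutex (hypothetical).
type subsystem struct {
	mu sync.Mutex
}

// probe reports whether s.mu could be acquired and immediately released
// at least once before period elapses. It polls with TryLock rather than
// Lock so the calling goroutine never hangs on a deadlocked subsystem.
func probe(s *subsystem, period, poll time.Duration) bool {
	deadline := time.Now().Add(period)
	for {
		if s.mu.TryLock() {
			s.mu.Unlock()
			return true
		}
		if !time.Now().Before(deadline) {
			return false // never acquired within the period: likely deadlocked
		}
		time.Sleep(poll)
	}
}

func main() {
	// The RFC suggests a 5-minute period; short values keep the demo fast.
	const period, poll = 50 * time.Millisecond, 5 * time.Millisecond

	healthy := &subsystem{}
	fmt.Println("healthy subsystem ok:", probe(healthy, period, poll))

	stuck := &subsystem{}
	stuck.mu.Lock() // simulate a deadlock: the lock is never released
	fmt.Println("stuck subsystem ok:", probe(stuck, period, poll))
}
```

In a testnet build, a failed probe would feed the notification system; elsewhere it could be disabled or made to panic, as the text notes.
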
### Testnet long-term vision

The long-term vision includes goals that are not necessary for short- or medium-term success,
but which would support building an increasingly stable and performant product.
These goals would generally be beyond the one-year plan,
and therefore they would not be part of initial planning.

- There is a centralized dashboard to get a quick overview of all testnets,
  or at least one centralized dashboard per testnet,
  showing TBD basic information
- Testnets include cloud spot instances which periodically and abruptly join and leave the network
- The testnets are a heterogeneous mixture of straight VMs and Docker containers,
  thereby more closely representing production blockchains
- Testnets have some manner of continuous profiling,
  so that we can produce an apples-to-apples comparison of CPU/memory cost of particular operations

### Testnet non-goals

There are some things we are explicitly not trying to achieve with long-lived testnets:

- The Tendermint engineers will NOT be responsible for the testnets' availability
  outside of working hours; there will not be any kind of on-call schedule
- As a result of the 8x5 support noted in the previous point,
  there will be NO guarantee of uptime or availability for any testnet
- The testnets will NOT be used to gate pull requests;
  that responsibility belongs to unit tests, end-to-end tests, and integration tests
- Similarly, the testnet will NOT be used to automate any changes back into Tendermint source code;
  we will not automatically create a revert commit due to a failed rollout, for instance
- The testnets are NOT intended to have participation from machines outside of the
  Tendermint engineering team's control, as the Tendermint engineers are expected
  to have full access to any instance where they may need to debug an issue
- While there will certainly be individuals within the Tendermint engineering team
  who will continue to build out their individual "devops" skills to produce
  the infrastructure for the testnet, it is NOT a goal that every Tendermint engineer
  is even _familiar_ with the tech stack involved, whether it is Ansible, Terraform,
  Kubernetes, etc.
  As a rule of thumb, all engineers should be able to get shell access on any given instance
  and should have access to the instance's logs.
  Little if any further operational skills will be expected.
- The testnets are not intended to be _created_ for one-off experiments.
  While there is nothing wrong with an engineer directly interacting with a testnet
  to try something out,
  a testnet comes with a considerable amount of "baggage", so end-to-end or integration tests
  are closer to the intent for "trying something to see what happens".
  Direct interaction should be limited to standard blockchain operations,
  _not_ modifying configuration of nodes.
- Likewise, the purpose of the testnet is not to run specific "tests" per se,
  but rather to demonstrate that Tendermint blockchains as a whole are stable
  under a production load.
  Of course we will inject faults periodically, but the intent is to observe and prove that
  the testnet is resilient to those faults.
  It would be the responsibility of a lower-level test to demonstrate e.g.
  that the network continues when a single validator disappears without warning.
- The testnet descriptions in this document are scoped only to building directly on Tendermint;
  integrating with the Cosmos SDK, or any other third-party library, is out of scope

### Team outcomes as a result of maintaining and operating a testnet

Finally, this section reiterates what team growth we expect by running semi-permanent testnets.

- Confidence that Tendermint is stable under a particular production-like load
- Familiarity with typical production behavior of Tendermint, e.g. what the logs look like,
  what the memory footprint looks like, and what kind of throughput is reasonable
  for a network of a particular size
- Comfort and familiarity in manually inspecting a misbehaving or failing node
- Confidence that Tendermint ships sufficient tooling for external users
  to operate their nodes
- Confidence that Tendermint exposes useful metrics, and comfort interpreting those metrics
- Useful reference documentation that gives operators confidence to run Tendermint nodes