# RFC 023: Semi-permanent Testnet

## Changelog

- 2022-07-28: Initial draft (@mark-rushakoff)
- 2022-07-29: Renumber to 023, minor clarifications (@mark-rushakoff)

## Abstract

This RFC discusses a long-lived testnet, owned and operated by the Tendermint engineers.
By owning and operating a production-like testnet,
the team who develops Tendermint becomes more capable of discovering bugs that
only arise in production-like environments.
They also build expertise in operating Tendermint;
this will help guide the development of Tendermint towards operator-friendly design.

The RFC details a rough roadmap towards a semi-permanent testnet, some of the considered tradeoffs,
and the expected outcomes from following this roadmap.

## Background

The author's understanding -- which is limited as a new contributor to the Tendermint project --
is that Tendermint development has been largely treated as a library for other projects to consume.
Of course effort has been spent on unit tests, end-to-end tests, and integration tests.
But whether developing a library or an application,
there is no substitute for putting the software under a production-like load.

First, there are classes of bugs that are unrealistic to discover in environments
that do not resemble production.
But perhaps more importantly, there are "operational features" that are best designed
by the authors of a given piece of software.
For instance, does the software have sufficient observability built-in?
Are the reported metrics useful?
Are the log messages clear and sufficiently detailed, without being too noisy?

Furthermore, if the library authors are not only building --
but also maintaining and operating -- an application built on top of their library,
the authors will have a greatly increased confidence that their library's API
is appropriate for other application authors.

Once the decision has been made to run and operate a service,
one of the next strategic questions is that of deploying said service.
The author strongly holds the opinion that, when possible,
a continuous delivery model offers the most compelling set of advantages:
- The code on a particular branch (likely `main` or `master`) is exactly what is,
  or what will very soon be, running in production
- There are no manual steps involved in deploying -- other than merging your pull request,
  which you had to do anyway
- A bug discovered in production can be rapidly confirmed as fixed in production

In summary, if the Tendermint authors build, maintain, and continuously deliver an application
intended to serve as a long-lived testnet, they will be able to state with confidence:
- We operate the software in a production-like environment and we have observed it to be
  stable and performant to our requirements
- We have discovered issues in production before any external parties have consumed our software,
  and we have addressed said issues
- We have successfully used the observability tooling built into our software
  (perhaps in conjunction with other off-the-shelf tooling)
  to diagnose and debug issues in production

## Discussion

This section discusses several aspects of maintaining a testnet for Tendermint.

### Number of testnets

There should probably be one testnet per maintained branch of Tendermint,
i.e. one for the `main` branch
and one per `v0.N.x` branch that the authors maintain.

There may also exist testnets for long-lived feature branches.

We may eventually discover that there is good reason to run more than one testnet for a branch,
perhaps due to a significant configuration variation.

### Testnet lifecycle

The document has used the terms "long-lived" and "semi-permanent" somewhat interchangeably.
The intent of the testnet being discussed in this RFC is to exist indefinitely;
but there is a practical understanding that there will be testnet instances
which will be retired for a variety of reasons.
For instance, once a release branch is no longer supported,
its corresponding testnet should be torn down.

In general, new commits to branches with corresponding testnets
should result in an in-place upgrade of all nodes in the testnet
without any data loss and without requiring new configuration.
The mechanism for achieving this is outside the scope of this RFC.

However, it is also expected that there will be
breaking changes during the development of the `main` branch.
For instance, suppose there is an unreleased feature involving storage on disk,
and the developers need to change the storage format.
It should be at the developers' discretion whether it is feasible and worthwhile
to introduce an intermediate commit that translates the old format to the new format,
or if it would be preferable to just destroy the testnet and start from scratch
without any data in the old format.

Similarly, if a developer inadvertently pushes a breaking change to an unreleased feature,
they are free to make a judgement call between reverting the change,
adding a commit to allow a forward migration,
or simply forcing the testnet to be recreated.

### Testnet maintenance investment

While there is certainly engineering effort required to build the tooling and infrastructure
to get the testnets up and running,
the intent is that a running testnet requires no manual upkeep under normal conditions.

It is expected that a subset of the Tendermint engineers are familiar with and engaged in
writing the software to maintain and build the testnet infrastructure,
but the rest of the team should not need any involvement in authoring that code.

The testnets should be configured to send notifications for events requiring triage,
such as a chain halt or a node OOMing.
The time investment necessary to address the underlying issues for those kinds of events
is unpredictable.

Aside from triaging exceptional events, an engineer may choose to spend some time
collecting metrics or profiles from testnet nodes to check performance details
before and after a particular change;
or they may inspect logs associated with an expected behavior change.
But during day-to-day work, engineers are not expected to spend any considerable time
directly interacting with the testnets.

If we discover that there are any routine actions engineers must take against the testnet
that take any substantial focused time,
those actions should be automated to a one-line command as much as is reasonable.
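
As a rough illustration of the kind of notification rules this section has in mind,
the following Python sketch checks for a stalled node and for low transaction throughput.
It is only a sketch under stated assumptions: `NodeSample`, `is_stalled`, and `throughput_too_low`
are hypothetical names, how heights and per-minute transaction counts are collected is left open,
and the default figures simply mirror the MVP targets listed in the next section
(30 tx/min, 75% of target, 5 minutes).
Reading "sustained over 5 minutes" as "every one of the last five per-minute counts is below the threshold"
is one possible interpretation, not something this RFC specifies.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NodeSample:
    """One observation of a node (hypothetical type for this sketch)."""
    timestamp: float  # seconds since an arbitrary epoch
    height: int       # latest block height the node reported


def is_stalled(samples: Sequence[NodeSample], max_idle_seconds: float) -> bool:
    """Return True if the reported height has not changed for max_idle_seconds.

    `samples` must be ordered by timestamp, oldest first.
    """
    if not samples:
        return False
    latest = samples[-1]
    # Find the earliest sample in the trailing run that shares the latest height.
    first_seen = latest.timestamp
    for sample in reversed(samples):
        if sample.height != latest.height:
            break
        first_seen = sample.timestamp
    return latest.timestamp - first_seen >= max_idle_seconds


def throughput_too_low(
    tx_per_minute: Sequence[int],
    target: float = 30.0,    # assumed target rate, tx/min
    fraction: float = 0.75,  # alert below this fraction of the target
    window: int = 5,         # number of trailing minutes that must all be low
) -> bool:
    """Return True if each of the last `window` per-minute counts is below the threshold."""
    if len(tx_per_minute) < window:
        return False  # not enough data to call it "sustained"
    threshold = target * fraction
    return all(count < threshold for count in tx_per_minute[-window:])
```

Keeping the checks as pure functions over already-collected samples means they can be exercised
without a running testnet; wiring them to a real data source and a notification channel is left out on purpose.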

### Testnet MVP

The minimum viable testnet meets this set of features:

- The testnet self-updates following a new commit pushed to Tendermint's `main` branch on GitHub
  (there are some omitted steps here, such as CI building appropriate binaries and
  somehow notifying the testnet that a new build is available)
- The testnet runs the Tendermint KV store for MVP
- The testnet operators are notified if:
  - Any node's process exits for any reason other than a restart for a new binary
  - Any node stops updating blocks, and by extension if a chain halt occurs
  - No other observability will be considered for MVP
- The testnet has a minimum of 1 full node and 3 validators
- The testnet has a reasonably low, constant throughput of transactions -- say 30 tx/min --
  and the testnet operators are notified if that throughput drops below 75% of target
  sustained over 5 minutes
- The testnet only needs to run in a single datacenter/cloud-region for MVP,
  i.e. running in multiple datacenters is out of scope for MVP
- The testnet is running directly on VMs or compute instances;
  while Kubernetes or other orchestration frameworks may offer many significant advantages,
  the Tendermint engineers should not be required to learn those tools in order to
  perform basic debugging

### Testnet medium-term goals

The medium-term goals are intended to be achievable within the 6-12 month time range
following the launch of MVP.
These goals could realistically be roadmapped following the launch of the MVP testnet.

- The `main` testnet has more than 20 nodes (completely arbitrary -- 5x more than 1+3 at MVP)
- In addition to the `main` testnet,
  there is at least one testnet associated with one release branch
- The testnet is no longer simply running the Tendermint KV store;
  now it is built on a more complex, custom application
  that deliberately exercises a greater portion of the Tendermint stack
- Each testnet is spread across at least two cloud providers,
  in order to communicate over a network more closely resembling use of Tendermint in "real" chains
- The node updates have some "jitter",
  with some nodes updating immediately when a new build is available,
  and others delaying up to perhaps 30-60 minutes
- The team has published some form of dashboards that have served well for debugging,
  which external parties can copy/modify to their needs
  - The dashboards must include metrics published by Tendermint nodes;
    there should be both OS- or runtime-level metrics such as memory in use,
    and application-level metrics related to the underlying blockchain
  - "Published" in this context is more in the spirit of "shared with the community",
    not "produced a supported open source tool" --
    this could be published to GitHub with a warning that no support is offered,
    or it could simply be a blog post detailing what has worked for the Tendermint developers
  - The dashboards will likely be implemented on free and open source tooling,
    but that is not a hard requirement if paid software is more appropriate
- The team has produced a reference model of a log aggregation stack that external parties can use
  - Similar to the "published" dashboards, this only needs to be "shared" rather than "supported"
- Chaos engineering has begun being integrated into the testnets
  (this could be periodic CPU limiting or deliberate network interference, etc.
  but it probably would not be filesystem corruption)
- Each testnet has at least one node running a build with the Go race detector enabled
- The testnet contains some kind of generalized notification system built in:
  - Tendermint code grows built-in "watchdog" systems to validate, for example,
    that subsystems have not deadlocked; e.g. if the watchdog can't acquire and immediately release
    a particular mutex once in every 5-minute period, it is near certain that the target
    subsystem has deadlocked, and an alert must be sent to the engineering team.
    (Outside of the testnet, the watchdogs could be disabled, or they could panic on failure.)
  - The notification system does some deduplication to minimize spam on system failure

### Testnet long-term vision

The long-term vision includes goals that are not necessary for short- or medium-term success,
but which would support building an increasingly stable and performant product.
These goals would generally be beyond the one-year plan,
and therefore they would not be part of initial planning.

- There is a centralized dashboard to get a quick overview of all testnets,
  or at least one centralized dashboard per testnet,
  showing TBD basic information
- Testnets include cloud spot instances which periodically and abruptly join and leave the network
- The testnets are a heterogeneous mixture of straight VMs and Docker containers,
  thereby more closely representing production blockchains
- Testnets have some manner of continuous profiling,
  so that we can produce an apples-to-apples comparison of CPU/memory cost of particular operations

### Testnet non-goals

There are some things we are explicitly not trying to achieve with long-lived testnets:

- The Tendermint engineers will NOT be responsible for the testnets' availability
  outside of working hours; there will not be any kind of on-call schedule
- As a result of the 8x5 support noted in the previous point,
  there will be NO guarantee of uptime or availability for any testnet
- The testnets will NOT be used to gate pull requests;
  that responsibility belongs to unit tests, end-to-end tests, and integration tests
- Similarly, the testnet will NOT be used to automate any changes back into Tendermint source code;
  we will not automatically create a revert commit due to a failed rollout, for instance
- The testnets are NOT intended to have participation from machines outside of the
  Tendermint engineering team's control, as the Tendermint engineers are expected
  to have full access to any instance where they may need to debug an issue
- While there will certainly be individuals within the Tendermint engineering team
  who will continue to build out their individual "devops" skills to produce
  the infrastructure for the testnet, it is NOT a goal that every Tendermint engineer
  is even _familiar_ with the tech stack involved, whether it is Ansible, Terraform,
  Kubernetes, etc.
  As a rule of thumb, all engineers should be able to get shell access on any given instance
  and should have access to the instance's logs.
  Little if any further operational skills will be expected.
- The testnets are not intended to be _created_ for one-off experiments.
  While there is nothing wrong with an engineer directly interacting with a testnet
  to try something out,
  a testnet comes with a considerable amount of "baggage", so end-to-end or integration tests
  are closer to the intent for "trying something to see what happens".
  Direct interaction should be limited to standard blockchain operations,
  _not_ modifying configuration of nodes.
- Likewise, the purpose of the testnet is not to run specific "tests" per se,
  but rather to demonstrate that Tendermint blockchains as a whole are stable
  under a production load.
  Of course we will inject faults periodically, but the intent is to observe and prove that
  the testnet is resilient to those faults.
  It would be the responsibility of a lower-level test to demonstrate e.g.
  that the network continues when a single validator disappears without warning.
- The testnet descriptions in this document are scoped only to building directly on Tendermint;
  integrating with the Cosmos SDK, or any other third-party library, is out of scope

### Team outcomes as a result of maintaining and operating a testnet

Finally, this section reiterates what team growth we expect by running semi-permanent testnets.

- Confidence that Tendermint is stable under a particular production-like load
- Familiarity with typical production behavior of Tendermint, e.g. what the logs look like,
  what the memory footprint looks like, and what kind of throughput is reasonable
  for a network of a particular size
- Comfort and familiarity in manually inspecting a misbehaving or failing node
- Confidence that Tendermint ships sufficient tooling for external users
  to operate their nodes
- Confidence that Tendermint exposes useful metrics, and comfort interpreting those metrics
- Useful reference documentation that gives operators confidence to run Tendermint nodes
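
As a closing illustration, the "watchdog" idea from the medium-term goals
(acquire and immediately release a particular mutex once per period, and alert on failure)
could be sketched roughly as follows.
This is written in Python only to keep the example short; it is not how Tendermint implements or plans to implement watchdogs.
`LockWatchdog`, `check_lock_once`, and the `on_failure` callback are hypothetical names,
and the period is a parameter so that the 5-minute figure from the text is just one possible setting.

```python
import threading
from typing import Callable


def check_lock_once(lock: threading.Lock, timeout_seconds: float) -> bool:
    """Try to acquire `lock` within the timeout and release it immediately.

    Returns False if the lock could not be acquired in time.
    """
    acquired = lock.acquire(timeout=timeout_seconds)
    if acquired:
        lock.release()
    return acquired


class LockWatchdog:
    """Periodically checks that a subsystem's lock can still be taken (hypothetical sketch)."""

    def __init__(
        self,
        name: str,
        lock: threading.Lock,
        period_seconds: float,
        on_failure: Callable[[str], None],
    ) -> None:
        self.name = name
        self.lock = lock
        self.period_seconds = period_seconds
        self.on_failure = on_failure  # e.g. send an alert; could also raise outside the testnet

    def run_once(self) -> bool:
        """Run a single check; call `on_failure` with the subsystem name if it fails."""
        ok = check_lock_once(self.lock, self.period_seconds)
        if not ok:
            self.on_failure(self.name)
        return ok

    def run_until(self, stop: threading.Event) -> None:
        """Keep checking once per period until `stop` is set."""
        while not stop.wait(self.period_seconds):
            self.run_once()
```

A caller might construct `LockWatchdog("mempool", mempool_lock, 300.0, send_alert)` and run `run_until` on a daemon thread,
where `mempool_lock` and `send_alert` stand in for whatever the real subsystem and notification channel would be.
Deduplicating repeated alerts, as the notification-system bullet suggests, would sit behind `on_failure` and is not shown here.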