=======================================
RFC 020: Tendermint Onboarding Projects
=======================================

.. contents::
   :backlinks: none

Changelog
---------

- 2022-03-30: Initial draft. (@tychoish)
- 2022-04-25: Imported document to tendermint repository. (@tychoish)

Overview
--------

This document describes a collection of projects that might be good for new
engineers joining the Tendermint Core team. These projects mostly describe
features that we'd be very excited to see land in the code base, but that are
intentionally outside of the critical path of a release on the roadmap, and
have the following properties that we think make good on-boarding projects:

- require relatively little context for the project or its history beyond a
  more isolated area of the code.

- provide exposure to different areas of the codebase, so new team members
  will have reason to explore the code base, build relationships with people
  on the team, and gain experience with more than one area of the system.

- be of moderate size, striking a healthy balance between trivial or
  mechanical changes (which provide little insight) and large intractable
  changes that require deeper insight than is available during onboarding to
  address well. A good size project should have natural touchpoints or
  check-ins.

Projects
--------

Before diving into one of these projects, have a conversation with your
onboarding buddy about the project or the aspects of Tendermint that you're
excited to work on. This will help make sure that these issues are still
relevant, help you get any context, understand known pitfalls, and confirm a
high level approach or design (if relevant).
On-boarding buddies should be prepared to do some design work before someone
joins the team.

The descriptions that follow provide some basic background and attempt to
describe the user stories and the potential impact of these projects.

E2E Test Systems
~~~~~~~~~~~~~~~~

Tendermint's E2E framework makes it possible to run small test networks with
different Tendermint configurations, and make sure that the system works. The
tests run Tendermint in a separate binary, and the system provides some very
high level protection against making changes that could break Tendermint in
otherwise difficult to detect ways.

Working on the E2E system is a good place to get introduced to the Tendermint
codebase, particularly for developers who are newer to Go, as the E2E
system (generator, runner, etc.) is distinct from the rest of Tendermint and
comparatively quite small, so it may be easier to begin making changes in this
area. At the same time, because the E2E system exercises *all* of Tendermint,
work in this area is a good way to get introduced to various components of the
system.

Configurable E2E Workloads
++++++++++++++++++++++++++

All E2E tests use the same workload (e.g. generated transactions, submitted to
different nodes in the network), which has been tuned empirically to provide a
gentle but consistent parallel load that all E2E tests can pass. Ideally, the
workload generator could be configurable to have different shapes of work
(bursty, different transaction sizes, weighted to different nodes, etc.) and
even perhaps further parameterized within a basic shape, which would make it
possible to use our existing test infrastructure to answer different questions
about the performance or capability of the system.

The work would involve adding a new parameter to the E2E test manifest,
creating an option (e.g. "legacy") for the current load generation model,
extracting configuration options for the current load generation, prototyping
implementations of alternate load generation, and running some preliminary
tests using the new tools.

Byzantine E2E Workloads
+++++++++++++++++++++++

There are two main kinds of integration tests in Tendermint: the E2E test
framework, and a collection of integration tests that masquerade as
unit tests. While some of this expansion of test scope is (potentially)
inevitable, the masquerading unit tests (e.g. ``consensus.byzantine_test.go``)
end up being difficult to understand, difficult to maintain, and unreliable.

One solution to this would be to modify the E2E ABCI application to allow it
to inject byzantine behavior, and to make this a configurable aspect of a test
network, so that we can provoke Byzantine behavior in a "real" system and then
observe that evidence is constructed. This would make it possible to remove
the legacy tests entirely once the new tests have proven themselves.

Abstract Orchestration Framework
++++++++++++++++++++++++++++++++

The orchestration of e2e test processes is presently done using docker
compose, which works well, but has proven a bit limiting as all processes need
to run on a single machine, and the log aggregation functions are confusing at
best.

This project would replace the current orchestration with something more
generic, potentially maintaining the current system, but also allowing the e2e
tests to manage processes using k8s. There are a few "local" k8s frameworks
(e.g. kind and k3s) which might be useful for our current testing model, but
hopefully we could use this new implementation with other k8s systems for more
flexible distributed test orchestration.

Improve Operational Experience of ``run-multiple.sh``
+++++++++++++++++++++++++++++++++++++++++++++++++++++

The e2e test runner currently runs a single test, and in most cases we manage
the test cases using a shell script that ensures cleanup of entire test
suites. This is a bit difficult to maintain and makes reproduction of test
cases more awkward than it should be. The e2e ``runner`` itself should provide
equivalent functionality to ``run-multiple.sh``: ensure cleanup of test cases,
collect and process output, and be able to manage entire suites of cases.

It might also be useful to implement an e2e test orchestrator that runs all
tendermint instances in a single process, using "real" networks for faster
feedback and iteration during development.

In addition to being a bit easier to maintain, a more capable runner
implementation would make it easier to collect data from test runs, and would
improve debuggability and reporting.

Fan-Out For CI E2E Tests
++++++++++++++++++++++++

While there is some parallelism in the execution of e2e tests, each e2e test
job must build a tendermint e2e image, which takes about 5 minutes of CPU time
per task, a substantial overhead given the size of each of the runs.

We'd like to be able to reduce the amount of overhead per e2e test while
keeping the cycle time for working with the tests very low, while also
maintaining a reasonable level of test coverage. This is an impossible
tradeoff, in some ways, but the percentage of overhead at the moment is large
enough that we can make some material progress with a moderate amount of time.

Most of this work has to do with modifying github actions configuration and
e2e artifact (docker) building to reduce redundant work.
Eventually, when we can drop the requirement for CGo storage engines, it will
be possible to (cross) compile tendermint locally, and then inject the binary
into the docker container, which would reduce a lot of the build-time
complexity. In the meantime, we can move more in this direction, or have
runtime flags to disable CGo dependencies for local development.

Remove Panics
~~~~~~~~~~~~~

There are lots of places in the code base that can panic, and these panics
would not be particularly well handled. While in some cases panics are the
right answer, in many cases the panics were just added to simplify downstream
error checking, and could easily be converted to errors.

The `Don't Panic RFC
<https://github.com/tendermint/tendermint/blob/master/docs/rfc/rfc-008-do-not-panic.MD>`_
covers some of the background and approach.

While the changes in this project are relatively rote, this will provide
exposure to lots of different areas of the codebase as well as insight into
how different areas of the codebase interact with each other, as well as
experience with the test suites and infrastructure.

Implement more Expressive ABCI Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tendermint maintains two very simple ABCI applications (a KV application used
for basic testing, and a slightly more advanced test application used in the
end-to-end tests). Writing an application would provide a new engineer with
useful experience using Tendermint that mirrors the experience of downstream
users.

This is more of an exploratory project, but could include providing common
interfaces on top of Tendermint consensus for other well known protocols or
tools (e.g. ``etcd``, a DNS server, or some other tool).

Self-Regulating Reactors
~~~~~~~~~~~~~~~~~~~~~~~~

Currently reactors (the internal processes that are responsible for the higher
level behavior of Tendermint) can be started and stopped, but have no
provision for being paused. These additional semantics may allow Tendermint to
pause reactors (and avoid processing their messages, etc.) and allow better
coordination in the future.

While this is a big project, it's possible to break this apart into many
smaller projects: make p2p channels pauseable, add pause/un-pause hooks to the
service implementation and machinery, and finally modify the reactor
implementations to take advantage of these additional semantics.

This project would give an engineer some exposure to the p2p layer of the
code, as well as to various aspects of the reactor implementations.

Metrics
~~~~~~~

Tendermint has a metrics system that is relatively underutilized, and figuring
out ways to capture and organize the metrics to provide value to users might
provide an interesting set of projects for new engineers on Tendermint.

Convert Logs to Metrics
+++++++++++++++++++++++

Because the tendermint logs tend to be quite verbose and not particularly
actionable, most users largely ignore the logging or run at very low
verbosity. While the log statements in the code do describe useful events,
taken as a whole the system is not particularly tractable, and particularly at
the Debug level, not useful. One solution to this problem is to identify log
messages that might be better expressed as metrics (e.g. incrementing a
counter for certain kinds of errors).

One approach might be to look at various logging statements, particularly
debug statements or errors that are logged but not returned, and see if
they're convertible to counters or other metrics.

Expose Metrics to Tests
+++++++++++++++++++++++

The existing Tendermint test suites replace the metrics infrastructure with
no-op implementations, which means that tests can neither verify that metrics
are ever recorded nor use metrics to observe events in the system. Writing an
implementation, for testing, that makes it possible to record metrics and
provides an API for introspecting this data, as well as potentially writing
tests that take advantage of this type, could be useful.

Logging Metrics
+++++++++++++++

In some systems, the logging system itself can provide some interesting
insights for operators: having metrics that track the number of messages at
different levels, as well as the total number of messages, can act as a canary
for the system as a whole.

This should be achievable by adding an interceptor layer within the logging
package itself that can add metrics to the existing system.