github.com/ari-anchor/sei-tendermint@v0.0.0-20230519144642-dc826b7b56bb/docs/rfc/rfc-020-onboarding-projects.rst

=======================================
RFC 020: Tendermint Onboarding Projects
=======================================

.. contents::
   :backlinks: none

Changelog
---------

- 2022-03-30: Initial draft. (@tychoish)
- 2022-04-25: Imported document to tendermint repository. (@tychoish)

Overview
--------

This document describes a collection of projects that might be good for new
engineers joining the Tendermint Core team. These projects mostly describe
features that we'd be very excited to see land in the codebase, but that are
intentionally outside the critical path of a release on the roadmap. Each
has the following properties, which we think make for a good onboarding
project:

- It requires relatively little context about the project or its history
  beyond a relatively isolated area of the code.

- It provides exposure to different areas of the codebase, so new team members
  have reason to explore the code, build relationships with people on the
  team, and gain experience with more than one area of the system.

- It is of moderate size, striking a healthy balance between trivial or
  mechanical changes (which provide little insight) and large, intractable
  changes that require deeper insight than is available during onboarding to
  address well. A well-sized project should have natural touchpoints or
  check-ins.

Projects
--------

Before diving into one of these projects, talk with your onboarding buddy
about the project, or about the aspects of Tendermint that you're excited to
work on. This will help make sure that these issues are still relevant, help
you get context and understand known pitfalls, and confirm a high-level
approach or design (if relevant). Onboarding buddies should be prepared to do
some design work before someone joins the team.

The descriptions that follow provide some basic background and attempt to
describe the user stories and the potential impact of these projects.

E2E Test Systems
~~~~~~~~~~~~~~~~

Tendermint's E2E framework makes it possible to run small test networks with
different Tendermint configurations and make sure that the system works. The
tests run Tendermint in a separate binary, and the system provides some very
high-level protection against changes that could break Tendermint in ways
that are otherwise difficult to detect.

Working on the E2E system is a good way to get introduced to the Tendermint
codebase, particularly for developers who are newer to Go: the E2E system
(generator, runner, etc.) is distinct from the rest of Tendermint and
comparatively small, so it may be easier to begin making changes in this
area. At the same time, because the E2E system exercises *all* of Tendermint,
work in this area is a good way to get introduced to various components of the
system.

Configurable E2E Workloads
++++++++++++++++++++++++++

All E2E tests use the same workload (e.g. generated transactions submitted to
different nodes in the network), which has been tuned empirically to provide a
gentle but consistent parallel load that all E2E tests can pass. Ideally, the
workload generator would be configurable to produce different shapes of work
(bursty, different transaction sizes, weighted to different nodes, etc.) and
perhaps even be further parameterized within a basic shape. This would make it
possible to use our existing test infrastructure to answer different questions
about the performance or capability of the system.

The work would involve adding a new parameter to the E2E test manifest,
creating an option (e.g. "legacy") for the current load generation model,
extracting configuration options for the current load generation,
prototyping implementations of alternate load generation, and running some
preliminary tests using the tools.
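As a rough illustration of what a configurable workload might look like, the sketch below models a load "shape" as a small configuration struct. All of the names here (``LoadShape``, ``LoadConfig``, ``TxCount``, and the ``bursty`` parameters) are hypothetical and are not part of the existing E2E manifest; the point is only that the current behavior becomes one named option among several.

```go
package main

import "fmt"

// LoadShape names a hypothetical workload profile. These names are
// illustrative and are not part of the existing E2E manifest.
type LoadShape string

const (
	ShapeLegacy LoadShape = "legacy" // constant load on every tick
	ShapeBursty LoadShape = "bursty" // idle, with periodic spikes
)

// LoadConfig is a hypothetical manifest section describing the workload.
type LoadConfig struct {
	Shape       LoadShape
	BatchSize   int // transactions per tick for the base load
	BurstEvery  int // for "bursty": spike on every Nth tick
	BurstFactor int // for "bursty": multiplier applied during a spike
}

// TxCount returns how many transactions to submit on the given tick.
func (c LoadConfig) TxCount(tick int) (int, error) {
	switch c.Shape {
	case ShapeLegacy, "":
		// An empty shape falls back to the current behavior.
		return c.BatchSize, nil
	case ShapeBursty:
		if c.BurstEvery <= 0 {
			return 0, fmt.Errorf("bursty load requires BurstEvery > 0")
		}
		if tick%c.BurstEvery == 0 {
			return c.BatchSize * c.BurstFactor, nil
		}
		return 0, nil
	default:
		return 0, fmt.Errorf("unknown load shape %q", c.Shape)
	}
}

func main() {
	cfg := LoadConfig{Shape: ShapeBursty, BatchSize: 10, BurstEvery: 5, BurstFactor: 4}
	for tick := 0; tick < 6; tick++ {
		n, err := cfg.TxCount(tick)
		if err != nil {
			panic(err)
		}
		fmt.Printf("tick %d: submit %d txs\n", tick, n)
	}
}
```

Treating an empty shape as "legacy" is one way to keep existing manifests working unchanged while new shapes are prototyped.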

Byzantine E2E Workloads
+++++++++++++++++++++++

There are two main kinds of integration tests in Tendermint: the E2E test
framework, and a collection of integration tests that masquerade as unit
tests. While some of this expansion of test scope is (potentially)
inevitable, the masquerading unit tests (e.g. ``consensus.byzantine_test.go``)
end up being difficult to understand, difficult to maintain, and unreliable.

One solution would be to modify the E2E ABCI application so that it can
inject Byzantine behavior, and to make this a configurable aspect of a test
network. It would then be possible to provoke Byzantine behavior in a "real"
system and observe that evidence is constructed. This would make it possible
to remove the legacy tests entirely once the new tests have proven themselves.

Abstract Orchestration Framework
++++++++++++++++++++++++++++++++

The orchestration of e2e test processes is presently done using Docker
Compose, which works well but has proven a bit limiting: all processes need
to run on a single machine, and the log aggregation functions are confusing at
best.

This project would replace the current orchestration with something more
generic, potentially maintaining the current system but also allowing the e2e
tests to manage processes using k8s. There are a few "local" k8s frameworks
(e.g. kind and k3s) that might be useful for our current testing model, but
hopefully we could also use this new implementation with other k8s systems
for more flexible distributed test orchestration.
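One way to picture "something more generic" is a small interface that the runner depends on, with one implementation per backend. The sketch below is purely illustrative: the ``Orchestrator`` interface and its method names are assumptions, not an existing Tendermint API, and the only implementation shown is an in-memory fake.

```go
package main

import (
	"fmt"
	"sort"
)

// Orchestrator is a hypothetical abstraction over the process manager used
// by the e2e runner (Docker Compose today, possibly k8s later).
type Orchestrator interface {
	Start(node string) error
	Stop(node string) error
	Running() []string
}

// memOrchestrator is an in-memory fake, useful only to show the contract.
type memOrchestrator struct {
	running map[string]bool
}

func newMemOrchestrator() *memOrchestrator {
	return &memOrchestrator{running: make(map[string]bool)}
}

func (m *memOrchestrator) Start(node string) error {
	if m.running[node] {
		return fmt.Errorf("node %q is already running", node)
	}
	m.running[node] = true
	return nil
}

func (m *memOrchestrator) Stop(node string) error {
	if !m.running[node] {
		return fmt.Errorf("node %q is not running", node)
	}
	delete(m.running, node)
	return nil
}

// Running returns the names of running nodes in sorted order.
func (m *memOrchestrator) Running() []string {
	names := make([]string, 0, len(m.running))
	for name := range m.running {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// startAll is backend-agnostic: it depends only on the interface.
func startAll(o Orchestrator, nodes []string) error {
	for _, n := range nodes {
		if err := o.Start(n); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var o Orchestrator = newMemOrchestrator()
	if err := startAll(o, []string{"validator02", "validator01"}); err != nil {
		panic(err)
	}
	fmt.Println(o.Running())
}
```

A Docker Compose backend and a k8s backend would then be two further implementations of the same interface, and the existing system could be kept as the default.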

Improve Operational Experience of ``run-multiple.sh``
+++++++++++++++++++++++++++++++++++++++++++++++++++++

The e2e test runner currently runs a single test, and in most cases we manage
the test cases using a shell script that ensures cleanup of entire test
suites. This is a bit difficult to maintain and makes reproduction of test
cases more awkward than it should be. The e2e ``runner`` itself should provide
functionality equivalent to ``run-multiple.sh``: ensure cleanup of test cases,
collect and process output, and manage entire suites of cases.

It might also be useful to implement an e2e test orchestrator that runs all
Tendermint instances in a single process, using "real" networks, for faster
feedback and iteration during development.

In addition to being a bit easier to maintain, a more capable runner
implementation would make it easier to collect data from test runs and would
improve debuggability and reporting.
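The core of what the shell script provides (run every case, always clean up, collect results) is straightforward to express in Go. The following is a minimal sketch under stated assumptions: ``TestCase``, ``Result``, and ``RunSuite`` are hypothetical names and do not reflect the current ``runner`` code.

```go
package main

import (
	"errors"
	"fmt"
)

// TestCase is a hypothetical stand-in for one e2e testnet run.
type TestCase struct {
	Name    string
	Run     func() error
	Cleanup func() error
}

// Result records the outcome of one case.
type Result struct {
	Name string
	Err  error
}

// runCase runs a single case and always invokes its cleanup, even when
// Run returns an error or panics.
func runCase(tc TestCase) (res Result) {
	res.Name = tc.Name
	defer func() {
		if r := recover(); r != nil {
			res.Err = fmt.Errorf("panic: %v", r)
		}
		if tc.Cleanup == nil {
			return
		}
		if cerr := tc.Cleanup(); cerr != nil {
			if res.Err == nil {
				res.Err = cerr
			} else {
				res.Err = fmt.Errorf("%v; cleanup: %v", res.Err, cerr)
			}
		}
	}()
	res.Err = tc.Run()
	return res
}

// RunSuite runs every case in order and collects the results, so one
// failing case does not prevent later cases from running.
func RunSuite(cases []TestCase) []Result {
	results := make([]Result, 0, len(cases))
	for _, tc := range cases {
		results = append(results, runCase(tc))
	}
	return results
}

func main() {
	cleaned := 0
	cleanup := func() error { cleaned++; return nil }
	results := RunSuite([]TestCase{
		{Name: "ok", Run: func() error { return nil }, Cleanup: cleanup},
		{Name: "fails", Run: func() error { return errors.New("boom") }, Cleanup: cleanup},
	})
	for _, r := range results {
		fmt.Printf("%s: %v\n", r.Name, r.Err)
	}
	fmt.Println("cleanups run:", cleaned)
}
```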

Fan-Out For CI E2E Tests
++++++++++++++++++++++++

While there is some parallelism in the execution of e2e tests, each e2e test
job must build a Tendermint e2e image, which takes about 5 minutes of CPU time
per task. That is a significant overhead given the size of each of the runs.

We'd like to reduce the amount of overhead per e2e test while keeping the
cycle time for working with the tests very low and maintaining a reasonable
level of test coverage. This is an impossible tradeoff in some ways, but the
percentage of overhead at the moment is large enough that we can make some
material progress with a moderate amount of time.

Most of this work has to do with modifying the GitHub Actions configuration
and e2e artifact (Docker) building to reduce redundant work. Eventually, when
we can drop the requirement for CGo storage engines, it will be possible to
(cross) compile Tendermint locally and then inject the binary into the Docker
container, which would reduce a lot of the build-time complexity. In the
meantime, we can move further in this direction, or add runtime flags to
disable CGo dependencies for local development.

Remove Panics
~~~~~~~~~~~~~

There are lots of places in the codebase that can panic, and these panics
would not be handled particularly well. While in some cases a panic is the
right answer, in many cases panics were added just to simplify downstream
error checking and could easily be converted to errors.

The `Don't Panic RFC
<https://github.com/tendermint/tendermint/blob/master/docs/rfc/rfc-008-do-not-panic.MD>`_
covers some of the background and approach.

While the changes in this project are relatively rote, the work provides
exposure to lots of different areas of the codebase, insight into how those
areas interact with each other, and experience with the test suites and
infrastructure.
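The typical conversion looks something like the sketch below. The functions are invented for illustration (``mustHeight`` and ``validateHeight`` do not exist in Tendermint); they only show the general shape of replacing a panic with a returned error that callers can inspect.

```go
package main

import (
	"errors"
	"fmt"
)

// mustHeight shows the pattern to remove: invalid input crashes the process.
func mustHeight(h int64) int64 {
	if h <= 0 {
		panic("height must be positive")
	}
	return h
}

// ErrInvalidHeight lets callers detect this failure with errors.Is.
var ErrInvalidHeight = errors.New("height must be positive")

// validateHeight is the converted form: the caller decides how to react.
func validateHeight(h int64) (int64, error) {
	if h <= 0 {
		return 0, fmt.Errorf("%w: got %d", ErrInvalidHeight, h)
	}
	return h, nil
}

func main() {
	if _, err := validateHeight(-1); errors.Is(err, ErrInvalidHeight) {
		fmt.Println("rejected:", err)
	}
}
```

Most of the effort in a change like this is in the callers, which now have to propagate or handle the error. That is also where the exposure to other parts of the codebase comes from.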

Implement More Expressive ABCI Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tendermint maintains two very simple ABCI applications: a KV application used
for basic testing, and a slightly more advanced test application used in the
end-to-end tests. Writing an application would give a new engineer useful
experience with Tendermint that mirrors the experience of downstream users.

This is more of an exploratory project, but could include providing common
interfaces on top of Tendermint consensus for other well-known protocols or
tools (e.g. ``etcd``), a DNS server, or some other tool.

Self-Regulating Reactors
~~~~~~~~~~~~~~~~~~~~~~~~

Currently reactors (the internal processes that are responsible for the higher
level behavior of Tendermint) can be started and stopped, but have no
provision for being paused. Adding pause semantics may allow Tendermint to
pause reactors (and avoid processing their messages, etc.) and allow better
coordination in the future.

While this is a big project, it's possible to break it apart into many
smaller projects: make p2p channels pausable, add pause/unpause hooks to the
service implementation and machinery, and finally modify the reactor
implementations to take advantage of these additional semantics.

This project would give an engineer some exposure to the p2p layer of the
code, as well as to various aspects of the reactor implementations.
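To make the first of these steps more concrete, here is a toy sketch of a pausable channel. The ``Pausable`` interface and the ``pausableChannel`` type are assumptions made for illustration and do not correspond to the existing service or p2p types. A real design would also need to decide whether messages that arrive while paused are dropped or buffered; this sketch simply refuses them.

```go
package main

import (
	"fmt"
	"sync"
)

// Pausable is a hypothetical extension of a start/stop service interface.
type Pausable interface {
	Pause()
	Unpause()
	IsPaused() bool
}

// pausableChannel is a toy stand-in for a p2p channel: while paused, it
// refuses new messages instead of handing them to the reactor.
type pausableChannel struct {
	mu     sync.Mutex
	paused bool
	inbox  []string
}

// Compile-time check that the toy channel satisfies the interface.
var _ Pausable = (*pausableChannel)(nil)

func (c *pausableChannel) Pause() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.paused = true
}

func (c *pausableChannel) Unpause() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.paused = false
}

func (c *pausableChannel) IsPaused() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.paused
}

// Deliver reports whether the message was accepted.
func (c *pausableChannel) Deliver(msg string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.paused {
		return false
	}
	c.inbox = append(c.inbox, msg)
	return true
}

// Len returns the number of accepted messages.
func (c *pausableChannel) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.inbox)
}

func main() {
	ch := &pausableChannel{}
	fmt.Println(ch.Deliver("vote-1")) // accepted
	ch.Pause()
	fmt.Println(ch.Deliver("vote-2")) // refused while paused
	ch.Unpause()
	fmt.Println(ch.Deliver("vote-3")) // accepted again
	fmt.Println("accepted:", ch.Len())
}
```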

Metrics
~~~~~~~

Tendermint has a metrics system that is relatively underutilized, and figuring
out ways to capture and organize the metrics to provide value to users might
provide an interesting set of projects for new engineers on Tendermint.

Convert Logs to Metrics
+++++++++++++++++++++++

Because the Tendermint logs tend to be quite verbose and not particularly
actionable, most users largely ignore the logging or run at very low
verbosity. While the log statements in the code do describe useful events,
taken as a whole the output is not particularly tractable and, particularly at
the Debug level, not useful. One solution to this problem is to identify log
messages that might be better expressed as metrics (e.g. incrementing a
counter for certain kinds of errors).

One approach might be to look at various logging statements, particularly
debug statements or errors that are logged but not returned, and see if
they're convertible to counters or other metrics.

Expose Metrics to Tests
+++++++++++++++++++++++

The existing Tendermint test suites replace the metrics infrastructure with
no-op implementations, which means that tests can neither verify that metrics
are ever recorded nor use metrics to observe events in the system. It could
be useful to write an implementation, for testing, that makes it possible to
record metrics and provides an API for introspecting this data, and
potentially to write tests that take advantage of this type.
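A recording implementation for tests could be quite small. The sketch below assumes a deliberately minimal ``Counter`` interface; it is hypothetical and may not match the metrics types Tendermint actually uses. ``Recorder`` and ``handleMessages`` are likewise invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a minimal, hypothetical metrics interface.
type Counter interface {
	Add(delta float64)
}

// Recorder is a test double that stores counter values in memory.
type Recorder struct {
	mu     sync.Mutex
	values map[string]float64
}

func NewRecorder() *Recorder {
	return &Recorder{values: make(map[string]float64)}
}

// Counter returns a Counter that records into r under the given name.
func (r *Recorder) Counter(name string) Counter {
	return recordedCounter{rec: r, name: name}
}

// Value reports the current value of the named counter (0 if never used).
func (r *Recorder) Value(name string) float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.values[name]
}

type recordedCounter struct {
	rec  *Recorder
	name string
}

func (c recordedCounter) Add(delta float64) {
	c.rec.mu.Lock()
	defer c.rec.mu.Unlock()
	c.rec.values[c.name] += delta
}

// handleMessages is a stand-in for code under test: it counts rejected
// (empty) messages on the supplied counter.
func handleMessages(msgs []string, rejected Counter) {
	for _, m := range msgs {
		if m == "" {
			rejected.Add(1)
		}
	}
}

func main() {
	rec := NewRecorder()
	handleMessages([]string{"a", "", "b", ""}, rec.Counter("rejected_messages"))
	fmt.Println(rec.Value("rejected_messages"))
}
```

A test can then assert on ``rec.Value(...)`` directly, which is exactly what the no-op implementations make impossible.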

Logging Metrics
+++++++++++++++

In some systems, the logging system itself can provide some interesting
insights for operators: metrics that track the number of messages at
different levels, as well as the total number of messages, can act as a canary
for the system as a whole.

This should be achievable by adding an interceptor layer within the logging
package itself that can add metrics to the existing system.
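An interceptor of this kind can be written as a wrapper around a logger. The sketch below uses a simplified ``Logger`` interface that is an assumption for illustration, not the interface of Tendermint's logging package, and it keeps counts in a plain map where a real version would update metrics.

```go
package main

import (
	"fmt"
	"sync"
)

// Logger is a simplified, hypothetical logging interface.
type Logger interface {
	Debug(msg string)
	Info(msg string)
	Error(msg string)
}

// CountingLogger wraps another Logger and counts messages per level
// before passing each message through unchanged.
type CountingLogger struct {
	next   Logger
	mu     sync.Mutex
	counts map[string]int
}

func NewCountingLogger(next Logger) *CountingLogger {
	return &CountingLogger{next: next, counts: make(map[string]int)}
}

func (l *CountingLogger) record(level string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[level]++
}

func (l *CountingLogger) Debug(msg string) { l.record("debug"); l.next.Debug(msg) }
func (l *CountingLogger) Info(msg string)  { l.record("info"); l.next.Info(msg) }
func (l *CountingLogger) Error(msg string) { l.record("error"); l.next.Error(msg) }

// Count returns the number of messages seen at one level.
func (l *CountingLogger) Count(level string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.counts[level]
}

// Total returns the number of messages seen at all levels.
func (l *CountingLogger) Total() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	total := 0
	for _, n := range l.counts {
		total += n
	}
	return total
}

// discard is a Logger that drops every message.
type discard struct{}

func (discard) Debug(string) {}
func (discard) Info(string)  {}
func (discard) Error(string) {}

func main() {
	log := NewCountingLogger(discard{})
	log.Info("started")
	log.Error("peer disconnected")
	log.Error("peer disconnected")
	fmt.Println("errors:", log.Count("error"), "total:", log.Total())
}
```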