# RFC 019: Configuration File Versioning

## Changelog

- 19-Apr-2022: Initial draft (@creachadair)
- 20-Apr-2022: Updates from review feedback (@creachadair)

## Abstract

Updating configuration settings is an essential part of upgrading an existing
node to a new version of the Tendermint software.  Unfortunately, it is also
currently a very manual process. This document discusses some of the history of
changes to the config format, actions we've taken to improve the tooling for
configuration upgrades, and additional steps we may want to consider.

## Background

A Tendermint node reads configuration settings at startup from a TOML formatted
text file, typically named `config.toml`. The contents of this file are defined
by the [`github.com/tendermint/tendermint/config`][config-pkg] package.

Although many settings in this file remain valid from one version of Tendermint
to the next, new versions of Tendermint often add, update, and remove settings.
These changes often require manual intervention by operators who are upgrading
their nodes.

I propose that we provide better tools and documentation to help operators
make configuration changes correctly during version upgrades.  Ideally, as much
as possible of any configuration file update should be automated, and where
that is not possible or practical, we should provide clear, explicit directions
for what steps need to be taken manually. Moreover, when the node discovers
incorrect or invalid configuration, we should improve the diagnostics it emits
so that the operator can quickly and easily find the relevant documentation,
without having to grep through source code.

## Discussion

By convention, we are supposed to document required changes to the config file
in the `UPGRADING.md` file for the release that introduces them.  Although we
have mostly done this, the level of detail in the upgrading instructions is
often insufficient for an operator to correctly update their file.

The updates vary widely in complexity: Operators may need to add new required
settings, update obsolete values for existing settings, move or rename existing
settings within the file, or remove obsolete settings (which are thus invalid).
Here are a few examples of each of these cases:

- **New required settings:** Tendermint v0.35 added a new top-level `mode`
  setting that determines whether a node runs as a validator, a full node, or a
  seed node.  The default value is `"full"`, which means the operator of a
  validator must manually add `mode = "validator"` (or set the `--mode` flag on
  the command line) for their node to come up in the correct mode.

- **Updated obsolete values:** Tendermint v0.35 removed support for versions
  `"v1"` and `"v2"` of the blocksync (formerly "fastsync") protocol, requiring
  any node using either of those values to update to `"v0"`.

- **Moved/renamed settings:** Version v0.34 moved the top-level `pprof_laddr`
  setting under the `[rpc]` section.

  Version v0.35 renamed every setting in the file from `snake_case` to
  `kebab-case`, moved the top-level `fast_sync` setting into the `[blocksync]`
  section (itself renamed from `[fastsync]`), and moved all the top-level
  `priv-validator-*` settings under a new `[priv-validator]` section with their
  prefix trimmed off.

- **Removed obsolete settings:** Version v0.34 removed the `index_all_keys` and
  `index_keys` settings from the `[tx_index]` section; version v0.35 removed
  the `wal-dir` setting from the `[mempool]` section, and version v0.36 removed
  the `[blocksync]` section entirely.

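To make the first case concrete, here is a minimal shell sketch of a check an
operator could run before upgrading a validator to v0.35. It only tests whether
a key appears before the first `[section]` header. The function name
`has_top_level_key` is invented for this illustration and is not part of any
Tendermint tool.

```shell
# Succeed if KEY is set at the top level of a TOML file, that is, before
# the first [section] header. Hypothetical helper, for illustration only.
# Usage: has_top_level_key FILE KEY
has_top_level_key() {
    awk -v key="$2" '
        /^[ \t]*\[/ { exit }              # first section header: stop looking
        /^[ \t]*#/  { next }              # skip comment lines
        {
            split($0, kv, "=")            # kv[1] is the text before "="
            name = kv[1]
            gsub(/[ \t]/, "", name)       # trim whitespace around the key
            if (name == key) { found = 1; exit }
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

For example, `has_top_level_key config.toml mode || echo "mode is not set"`
would flag a file that still needs `mode = "validator"` added by hand.
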
While many of these changes are mentioned in the config section of the upgrade
instructions, some are not mentioned at all, or are hidden in other parts of
the doc. For instance, the v0.34 `pprof_laddr` change was documented only as an
RPC flag change. (A savvy reader might realize that the flag `--rpc.pprof_laddr`
implies a corresponding config section, but the note omits the related detail
that there was already a top-level setting that has since been renamed.)  The
lesson here is not that the docs are bad, but that prose is not the most
efficient format for conveying detailed changes like this. The upgrading
instructions are still valuable for the human reader to understand what to
expect.

### Concrete Steps

As part of the v0.36 development cycle, we spent some time reverse-engineering
the configuration changes since the v0.34 release and built an experimental
command-line tool called [`confix`][confix], whose job it is to automatically
update the settings in a `config.toml` file to the latest version.  We also
backported a version of this tool into the v0.35.x branch at release v0.35.4.

This tool should work fine for configuration files created by Tendermint v0.34
and later, but does not (yet) know how to handle changes from prior versions of
Tendermint. Part of the difficulty for older versions is simply logistical: To
figure out which changes to apply, we need to understand something about the
version that made the file, as well as the version we're converting it to.

> **Discussion point:** In the future we might want to consider incorporating
> this tool into the node CLI directly, but we're keeping it separate for now
> until we can get some feedback from operators.

For the experiment, we handled version detection by carefully searching the
history of config format changes for shibboleths to bound the version: For
example, the `[fastsync]` section was added in Tendermint v0.32 and renamed
`[blocksync]` in Tendermint v0.35. So if we see a `[fastsync]` section, we have
some confidence that the file was created by v0.32, v0.33, or v0.34.

But such signals are delicate: The `[blocksync]` section was removed in v0.36,
so if we do not find `[fastsync]`, we cannot conclude from that alone that the
file is from v0.31 or earlier -- we have to look for corroborating details.
While such "sniffing" tactics are fine for an experiment, they aren't as robust
as we might like.

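The sniffing heuristic described above can be sketched as a small shell
function. This is a deliberately naive illustration, not the logic `confix`
actually uses. The function name `guess_config_versions` and its output strings
are made up for the example.

```shell
# Guess which Tendermint releases could have written a config file, using
# only the two section names discussed above. Hypothetical helper.
# Usage: guess_config_versions FILE
guess_config_versions() {
    if grep -q '^[[:space:]]*\[fastsync\]' "$1"; then
        # [fastsync] was added in v0.32 and renamed in v0.35.
        echo "v0.32-v0.34"
    elif grep -q '^[[:space:]]*\[blocksync\]' "$1"; then
        # [blocksync] replaced it in v0.35 and was removed in v0.36.
        echo "v0.35"
    else
        # Neither section: v0.31 or earlier, or v0.36 or later.
        echo "unknown"
    fi
}
```

The `unknown` branch is the delicate case: a file with neither section could
come from either end of the history, so a real tool needs additional signals
before choosing a conversion.
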
This is especially relevant for configuration files that may have already been
manually upgraded across several versions by the time we are asked to update
them again.  Another related concern is that we'd like to make sure conversion
is idempotent, so that it would be safe to rerun the tool over an
already-converted file without breaking anything.

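One way to check idempotence in practice is to run a conversion twice and
compare the results. The helper below is a minimal sketch: `is_idempotent` is a
hypothetical name, and it assumes the converter can be run as a filter that
reads a config on standard input and writes the result to standard output,
which may not match how `confix` is actually invoked.

```shell
# Succeed if applying a converter a second time changes nothing.
# The converter is any command that reads a config on stdin and writes
# the converted config on stdout (an assumption of this sketch).
# Usage: is_idempotent FILE CONVERTER [ARGS...]
is_idempotent() {
    input=$1
    shift
    once=$(mktemp) && twice=$(mktemp) || return 2
    # First pass over the original, second pass over the first result.
    "$@" < "$input" > "$once" && "$@" < "$once" > "$twice"
    status=$?
    if [ "$status" -eq 0 ]; then
        cmp -s "$once" "$twice"       # identical output means idempotent
        status=$?
    fi
    rm -f "$once" "$twice"
    return "$status"
}
```
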
### Config Versioning

One obvious tactic we could use for future releases is to add a version marker
to the config file. This would give tools like `confix` (and the node itself) a
way to calibrate their expectations. Rather than being a version for the file
itself, however, this version marker would indicate which version of Tendermint
is needed to read the file.

Provisionally, this might look something like:

```toml
# The minimum version of Tendermint compatible with the contents of
# this configuration file.
config-version = 'v0.35'
```

When initializing a new node, Tendermint would populate this field with its own
version (e.g., `v0.36`). When conducting an upgrade, tools like `confix` can
then use this to decide which conversions are valid, and then update the value
accordingly. After converting a file marked `'v0.35'` to `'v0.37'`, the
conversion tool sets the file's `config-version` to reflect its compatibility.

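If a marker like this were adopted, reading it back would be straightforward.
The sketch below assumes the key is spelled `config-version` and lives at the
top level of the file, as in the provisional example; `read_config_version` is
a hypothetical helper name.

```shell
# Print the value of the top-level config-version key, or nothing if the
# file has no such key. Surrounding quotes are removed.
# Usage: read_config_version FILE
read_config_version() {
    awk -v q="'" '
        /^[ \t]*\[/ { exit }                    # only look at top-level keys
        /^[ \t]*config-version[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")            # drop the "key =" part
            sub(/[ \t]*(#.*)?$/, "")            # drop any trailing comment
            gsub("[\"" q "]", "")               # drop single or double quotes
            print
            exit
        }
    ' "$1"
}
```

An empty result would indicate a file written before the marker existed, which
is the case that still requires sniffing.
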
> **Discussion point:** This example presumes we would keep config files
> compatible within a given release cycle, e.g., all of v0.36.x. We could also
> use patch numbers here, if we think there's some reason to permit changes
> that would require config file edits at that granularity. I don't think we
> should, but that's a design question to consider.

Upon seeing an up-to-date version marker, the conversion tool can simply exit
with a diagnostic like "this file is already up-to-date", rather than sniffing
the keyspace and potentially introducing errors. In addition, this would let a
tool detect config files that are _newer_ than the one it understands, and
issue a safe diagnostic rather than doing something wrong.  Besides avoiding
potentially unsafe conversions, the marker would also serve as human-readable
documentation that the file is up-to-date for a given version.

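The three outcomes described above (convert, already up to date, or too new to
handle) can be sketched as a comparison of the file's marker against the newest
version a tool understands. The names `version_key` and `plan_conversion` and
the result strings are hypothetical, and the sketch assumes markers of the form
`vMAJOR.MINOR`, ignoring any patch number.

```shell
# Turn a version like "v0.35" or "v0.35.4" into a comparable integer built
# from its major and minor parts. The patch number, if any, is ignored.
version_key() {
    v=${1#v}                # strip the leading "v"
    major=${v%%.*}          # text before the first dot
    rest=${v#*.}            # text after the first dot
    minor=${rest%%.*}       # minor number, without any patch suffix
    echo $((major * 1000 + minor))
}

# Decide what a tool that understands TOOL_VERSION should do with a file
# marked FILE_VERSION.
# Usage: plan_conversion FILE_VERSION TOOL_VERSION
plan_conversion() {
    file_key=$(version_key "$1")
    tool_key=$(version_key "$2")
    if [ "$file_key" -eq "$tool_key" ]; then
        echo "up-to-date"   # nothing to do
    elif [ "$file_key" -lt "$tool_key" ]; then
        echo "convert"      # apply the known conversions
    else
        echo "too-new"      # refuse rather than guess
    fi
}
```

For instance, `plan_conversion v0.35 v0.37` prints `convert`, while
`plan_conversion v0.38 v0.37` prints `too-new`.
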
Adding a config version would not address the problem of how to convert files
created by older versions of Tendermint, but it would at least help us build
more robust config tooling going forward.

### Stability and Change

In light of the discussion so far, it is natural to examine why we make so many
changes to the configuration file from one version to the next, and whether we
could reduce friction by being more conservative about what we make
configurable, what config changes we make over time, and how we roll them out.

Some changes, like renaming everything from snake case to kebab case, are
entirely gratuitous. We could safely agree not to make those kinds of changes.
Apart from that obvious case, however, many other configuration settings
provide value to node operators in cases where there is no simple, universal
setting that matches every application.

Taking a high-level view, there are several broad reasons why we might want to
make changes to configuration settings:

- **Lessons learned:** Configuration settings are a good way to try things out
  in production, before making more invasive changes to the consensus protocol.

  For example, up until Tendermint v0.35, consensus timeouts were specified as
  per-node configuration settings (e.g., `timeout-precommit` et al.).  This
  allowed operators to tune these values for the needs of their network, but
  had the downside that individually-misconfigured nodes could stall consensus.

  Based on that experience, these timeouts have been deprecated in Tendermint
  v0.36 and converted to consensus parameters, to be consistent across all
  nodes in the network.

- **Migration & experimentation:** Introducing new features and updating old
  features can complicate migration for existing users of the software.
  Temporary or "experimental" configuration settings can be a valuable way to
  mitigate that friction.

  For example, Tendermint v0.36 introduces a new RPC event subscription
  endpoint (see [ADR 075][adr075]) that will eventually replace the existing
  websocket-based interface. To give users time to migrate, v0.36 adds an
  `experimental-disable-websocket` setting, defaulted to `false`, that allows
  operators to selectively disable the websocket API for testing purposes
  during the conversion. This setting is designed to be removed in v0.37, when
  the old interface is no longer supported.

- **Ongoing maintenance:** Sometimes configuration settings become obsolete,
  and the cost of removing them trades off against the potential risks of
  leaving a non-functional or deprecated knob hooked up indefinitely.

  For example, Tendermint v0.35 deprecated two alternate implementations of the
  blocksync protocol, one of which was deleted entirely (`v1`) and one of which
  was scheduled for removal (`v2`). The `blocksync.version` setting, which had
  been added as a migration aid, became obsolete and needed to be updated.

  Despite our best intentions, sometimes engineering designs do not work out.
  It's just as important to leave room to back out of changes we have since
  reconsidered, as it is to support migrations forward onto new and improved
  code.

- **Clarity and legibility:** Besides configuring the software, another
  important purpose of a config file is to document intent for the humans who
  operate and maintain the software. Operators need to adjust settings to keep
  the node running, and developers need to know what options were in use when
  something goes wrong so they can diagnose and fix bugs.  The legibility of a
  config file as a _human_ artifact is thus also important.

  For example, Tendermint v0.35 moved settings related to validator private
  keys from the top-level section of the configuration file to their own
  designated `[priv-validator]` section. Although this change did not make any
  difference to the meaning of those settings, it made the organization of the
  file easier to understand, and allowed the names of the individual settings
  to be simplified (e.g., `priv-validator-key-file` became simply `key-file` in
  the new section).

  Although such changes are "gratuitous" with respect to the software, there is
  often value in making things more legible for the humans. While there is no
  simple rule to define the line, the Potter Stewart principle can be used with
  due care.

Keeping these examples in mind, we can and should take reasonable steps to
avoid churn in the configuration file across versions where we can. However, we
must also accept that part of the reason for _having_ a config file is to allow
us flexibility elsewhere in the design.  On that basis, we should not attempt
to be too dogmatic about config changes either. Unlike changes in the block
protocol, for example, which affect every user of every network that adopts
them, config changes are relatively self-contained.

There are a few guiding principles I think we can use to strike a sensible
balance:

1. **No gratuitous changes.** Aesthetic changes that do not enhance legibility,
   avert confusion, or clarify documentation should be entirely avoided.

2. **Prefer mechanical changes.** Whenever it is practical, change settings in
   a way that can be updated by a tool without operator judgement. This implies
   finding safe, universal defaults for new settings, and not changing the
   default values of existing settings.

   Even if that means we have to make multiple changes (e.g., add a new setting
   in the current version, deprecate the old one, and remove the old one in the
   next version) it's preferable if we can mechanize each step.

3. **Clearly signal intent.** When adding temporary or experimental settings,
   they should be clearly named and documented as such. Use long names and
   suggestive prefixes (e.g., `experimental-*`) so that they stand out when
   read in the config file or printed in logs.

   Relatedly, using temporary or experimental settings should cause the
   software to emit diagnostic logs at runtime. These log messages should be
   easy to grep for, and should contain pointers to more complete documentation
   (say, issue numbers or URLs) that the operator can read, as well as a hint
   about when the setting is expected to become invalid. For example:

   ```
   WARNING: Websocket RPC access is deprecated and will be removed in
   Tendermint v0.37. See https://tinyurl.com/adr075 for more information.
   ```

4. **Consider both directions.** When adding a configuration setting, take some
   time during the implementation process to think about how the setting could
   be removed, as well as how it will be rolled out. This applies even for
   settings we imagine should be permanent. Experience may cause us to rethink
   our original design intent more broadly than we expected.

   This does not mean we have to spend a long time picking nits over the design
   of every setting; merely that we should convince ourselves we _could_ undo
   it without making too big a mess later. Even a little extra effort up front
   can sometimes save a lot.

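As an illustration of the third principle, a suggestive prefix makes
experimental settings easy to find mechanically. The sketch below scans a
config file for keys that start with `experimental-` and prints one grep-able
warning per key. The function name and the message wording are placeholders
invented for this example; a real node would emit such warnings through its
own logger and point at the specific documentation for each setting.

```shell
# Print one warning line per setting whose name starts with "experimental-".
# The message text is a placeholder, not an actual Tendermint log line.
# Usage: warn_experimental_settings FILE
warn_experimental_settings() {
    grep -E '^[[:space:]]*experimental-[A-Za-z0-9_-]*[[:space:]]*=' "$1" |
    while IFS='=' read -r key rest; do
        # Trim the whitespace around the key name.
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        echo "WARNING: setting \"$key\" is experimental and may be removed in a future release."
    done
}
```
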
## References

- [Tendermint `config` package][config-pkg]
- [`confix` command-line tool][confix]
- [`condiff` command-line tool][condiff]
- [Configuration update plan][plan]
- [ADR 075: RPC Event Subscription Interface][adr075]

[config-pkg]: https://godoc.org/github.com/tendermint/tendermint/config
[confix]: https://github.com/tendermint/tendermint/blob/v0.37.x/scripts/confix
[condiff]: https://github.com/tendermint/tendermint/blob/v0.37.x/scripts/confix/condiff
[plan]: https://github.com/tendermint/tendermint/blob/v0.37.x/scripts/confix/plan.go
[testdata]: https://github.com/tendermint/tendermint/blob/v0.37.x/scripts/confix/testdata
[adr075]: https://github.com/tendermint/tendermint/blob/v0.37.x/docs/architecture/adr-075-rpc-subscription.md

## Appendix: Research Notes

Discovering when various configuration settings were added, updated, and
removed turns out to be surprisingly tedious.  To solve this puzzle, we had to
answer the following questions:

1. What changes were made between v0.x and v0.y? This is further complicated by
   cases where we have backported config changes into the middle of an earlier
   release cycle (e.g., `psql-conn` from v0.35.x into v0.34.13).

2. When during the development cycle were those changes made? This allows us to
   recognize features that were backported into a previous release.

3. What were the default values of the changed settings, and did they change at
   all during or across the release boundary?

Each step of the [configuration update plan][plan] is commented with a link to
one or more PRs where that change was made. The sections below discuss how we
found these references.

### Tracking Changes Across Releases

To figure out what changed between two releases, we built a tool called
[`condiff`][condiff], which performs a "keyspace" diff of two TOML documents.
This diff respects the structure of the TOML file, but ignores comments, blank
lines, and configuration values, so that we can see what was added and removed.

To use it, run:

```shell
go run ./scripts/confix/condiff old.toml new.toml
```

This tool works on any TOML documents, but for our purposes we needed
Tendermint `config.toml` files. The easiest way to get these is to build the
node binary for your version of interest, run `tendermint init` on a clean home
directory, and copy the generated config file out. The [`testdata`][testdata]
directory for the `confix` tool has configs generated from the heads of each
release branch from v0.31 through v0.35.

If you want to reproduce this yourself, it looks something like this:

```shell
# Example for Tendermint v0.32.
git checkout --track origin/v0.32.x
go get golang.org/x/sys/unix
go mod tidy
make build
rm -fr -- tmhome
./build/tendermint --home=tmhome init
cp tmhome/config/config.toml config-v32.toml
```

Be advised that the further back you go, the more idiosyncrasies you will
encounter. For example, Tendermint v0.31 and earlier predate Go modules (v0.31
used dep), and lack backport branches. You may also need to edit some Makefile
rules once you get back into the v0.2x releases.

Note that when diffing config files across the v0.34/v0.35 gap, the swap from
`snake_case` to `kebab-case` makes it look like everything changed. The
`condiff` tool has a `-desnake` flag that normalizes all the keys to kebab case
in both inputs before comparison.

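For readers who want a feel for what a keyspace diff involves, the following is
a rough shell approximation. It is not how `condiff` is implemented, and it
handles only simple `key = value` lines and `[section]` headers (no dotted
keys, inline tables, or multi-line values). The names `toml_keys` and
`keyspace_diff` are invented here, and the optional `desnake` argument only
imitates the effect of the `-desnake` flag.

```shell
# Print every key in a TOML file as "section.key" (or just "key" at the
# top level), one per line, sorted. If the second argument is "desnake",
# underscores are rewritten to hyphens first.
# Usage: toml_keys FILE [desnake]
toml_keys() {
    awk -v desnake="$2" '
        /^[ \t]*\[/ {                          # [section] header
            section = $0
            gsub(/[ \t]/, "", section)
            sub(/^\[/, "", section)
            sub(/\].*$/, "", section)
            next
        }
        /^[ \t]*[A-Za-z0-9_-]+[ \t]*=/ {       # key = value (value ignored)
            key = $0
            sub(/[ \t]*=.*/, "", key)
            sub(/^[ \t]*/, "", key)
            name = (section == "" ? key : section "." key)
            if (desnake == "desnake") gsub(/_/, "-", name)
            print name
        }
    ' "$1" | LC_ALL=C sort
}

# Print keys found only in OLD prefixed with "-" and keys found only in
# NEW prefixed with "+".
# Usage: keyspace_diff OLD NEW [desnake]
keyspace_diff() {
    kd_old=$(mktemp) && kd_new=$(mktemp) || return 2
    toml_keys "$1" "$3" > "$kd_old"
    toml_keys "$2" "$3" > "$kd_new"
    LC_ALL=C comm -23 "$kd_old" "$kd_new" | sed 's/^/-/'
    LC_ALL=C comm -13 "$kd_old" "$kd_new" | sed 's/^/+/'
    rm -f "$kd_old" "$kd_new"
}
```

Without normalization, a file that differs only in `snake_case` versus
`kebab-case` reports every key as both removed and added; passing `desnake`
makes those pairs compare equal.
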
### Locating Additions and Deletions

To figure out when a configuration setting was added or removed, your tool of
choice is `git bisect`. The only tricky part is finding the endpoints for the
search.  If the transition happened within a release, you can use that
release's backport branch as the endpoint (if it has one, e.g., `v0.35.x`).

However, the start point can be more problematic. The backport branches are not
ancestors of `master` or of each other, which means you need to find some point
in history _prior_ to the change but still attached to the mainline. For recent
releases there is a dev root (e.g., `v0.35.0-dev`, `v0.34.0-dev1`, etc.). These
are not named consistently, but you can usually grep the output of `git tag` to
find them.

In the worst case you could try starting from the root commit of the repo, but
that turns out not to work in all cases. We've done some branching shenanigans
over the years that mean the root is not a direct ancestor of all our release
branches. When you find this you will probably swear a lot. I did.

Once you have a start and end point (say, `v0.35.0-dev` and `master`), you can
bisect in the usual way. I use `git grep` on the `config` directory to check
whether the case I am looking for is present. For example, to find when the
`[fastsync]` section was removed:

```shell
# Setup:
git checkout master
git bisect start
git bisect bad                 # it's not present on tip of master.
git bisect good v0.34.0-dev1   # it was present at the start of v0.34.
```

```shell
# Now repeat this until it gives you a specific commit:
if git grep -q '\[fastsync\]' config ; then git bisect good ; else git bisect bad ; fi
```

The above example finds where a setting was removed. To find where a setting
was added, do the same thing except reverse the sense of the test (`if ! git
grep -q ...`).

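
The repeated test can also be handed to `git bisect run`, which treats exit
status 0 as good and status 1 as bad; that matches `git grep -q`, which exits 0
on a match and 1 otherwise. The wrapper below is a sketch with a made-up name
(`first_commit_without`). It assumes a clean working tree and that the good
revision is an ancestor of the bad one.

```shell
# Print the first commit in which PATTERN no longer appears under DIR.
# Hypothetical wrapper around "git bisect run", for illustration only.
# Usage: first_commit_without PATTERN DIR GOOD_REV BAD_REV
first_commit_without() {
    pattern=$1 dir=$2 good=$3 bad=$4
    git bisect start "$bad" "$good" >/dev/null || return 1
    # A match means the setting is still present, so the commit is "good";
    # no match means it has been removed, so the commit is "bad".
    git bisect run git grep -q -e "$pattern" -- "$dir" >/dev/null
    # When the bisection finishes, refs/bisect/bad names the first bad commit.
    git rev-parse refs/bisect/bad
    git bisect reset >/dev/null 2>&1
}
```

With the endpoints from the example above, this would be invoked as
`first_commit_without '\[fastsync\]' config v0.34.0-dev1 master`.
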