# RFC 019: Configuration File Versioning

## Changelog

- 19-Apr-2022: Initial draft (@creachadair)
- 20-Apr-2022: Updates from review feedback (@creachadair)

## Abstract

Updating configuration settings is an essential part of upgrading an existing
node to a new version of the Tendermint software. Unfortunately, it is also
currently a very manual process. This document discusses some of the history of
changes to the config format, actions we've taken to improve the tooling for
configuration upgrades, and additional steps we may want to consider.

## Background

A Tendermint node reads configuration settings at startup from a TOML-formatted
text file, typically named `config.toml`. The contents of this file are defined
by the [`github.com/tendermint/tendermint/config`][config-pkg] package.

Although many settings in this file remain valid from one version of Tendermint
to the next, new versions of Tendermint often add, update, and remove settings.
These changes often require manual intervention by operators who are upgrading
their nodes.

I propose that we provide better tools and documentation to help operators
make configuration changes correctly during version upgrades. Ideally, as much
as possible of any configuration file update should be automated, and where
that is not possible or practical, we should provide clear, explicit directions
for what steps need to be taken manually. Moreover, when the node discovers
incorrect or invalid configuration, we should improve the diagnostics it emits
so that the operator can quickly and easily find the relevant documentation,
without having to grep through source code.

## Discussion

By convention, we are supposed to document required changes to the config file
in the `UPGRADING.md` file for the release that introduces them. Although we
have mostly done this, the level of detail in the upgrading instructions is
often insufficient for an operator to correctly update their file.

The updates vary widely in complexity: Operators may need to add new required
settings, update obsolete values for existing settings, move or rename existing
settings within the file, or remove obsolete settings (which are thus invalid).
Here are a few examples of each of these cases:

- **New required settings:** Tendermint v0.35 added a new top-level `mode`
  setting that determines whether a node runs as a validator, a full node, or a
  seed node. The default value is `"full"`, which means the operator of a
  validator must manually add `mode = "validator"` (or set the `--mode` flag on
  the command line) for their node to come up in the correct mode.

- **Updated obsolete values:** Tendermint v0.35 removed support for versions
  `"v1"` and `"v2"` of the blocksync (formerly "fastsync") protocol, requiring
  any node using either of those values to update to `"v0"`.

- **Moved/renamed settings:** Version v0.34 moved the top-level `pprof_laddr`
  setting under the `[rpc]` section.

  Version v0.35 renamed every setting in the file from `snake_case` to
  `kebab-case`, moved the top-level `fast_sync` setting into the `[blocksync]`
  section (itself renamed from `[fastsync]`), and moved all the top-level
  `priv-validator-*` settings under a new `[priv-validator]` section with their
  prefix trimmed off.

- **Removed obsolete settings:** Version v0.34 removed the `index_all_keys` and
  `index_keys` settings from the `[tx_index]` section; version v0.35 removed
  the `wal-dir` setting from the `[mempool]` section, and version v0.36 removed
  the `[blocksync]` section entirely.
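
The mechanical parts of the v0.35 changes listed above can be sketched in a few
lines. The following Python is only an illustration that operates on an
already-parsed config (a nested dict). It is not how `confix` is implemented,
the function names are hypothetical, and the sample values are made up. A real
tool must also preserve comments and formatting, which this sketch ignores.

```python
from typing import Any

# Prefix trimmed from settings moved into [priv-validator] (per the v0.35 change).
PREFIX = "priv-validator-"


def to_kebab(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with every key (including section names) in kebab-case."""
    out: dict[str, Any] = {}
    for key, value in config.items():
        if isinstance(value, dict):
            value = to_kebab(value)  # recurse into sections
        out[key.replace("_", "-")] = value
    return out


def move_priv_validator(config: dict[str, Any]) -> dict[str, Any]:
    """Move top-level priv-validator-* keys into a [priv-validator] section."""
    out: dict[str, Any] = {}
    # Start from any existing section so that rerunning the step is harmless.
    section = dict(config.get("priv-validator", {}))
    for key, value in config.items():
        if key == "priv-validator":
            continue
        if key.startswith(PREFIX):
            section[key[len(PREFIX):]] = value  # trim the prefix
        else:
            out[key] = value
    if section:
        out["priv-validator"] = section
    return out


# Illustrative v0.34-style input (values are placeholders).
old = {
    "moniker": "node0",
    "priv_validator_key_file": "config/priv_validator_key.json",
    "tx_index": {"indexer": "kv"},
}
new = move_priv_validator(to_kebab(old))
```

Both steps are written so that applying them to their own output changes
nothing, which is the idempotence property a conversion tool would want.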

While many of these changes are mentioned in the config section of the upgrade
instructions, some are not mentioned at all, or are hidden in other parts of
the doc. For instance, the v0.34 `pprof_laddr` change was documented only as an
RPC flag change. (A savvy reader might realize that the flag
`--rpc.pprof_laddr` implies a corresponding config section, but the doc omits
the related detail that there was a top-level setting that's been renamed.)
The lesson here is not that the docs are bad, but that prose is not the most
efficient format to convey detailed changes like this. The upgrading
instructions are still valuable for the human reader to understand what to
expect.

### Concrete Steps

As part of the v0.36 development cycle, we spent some time reverse-engineering
the configuration changes since the v0.34 release and built an experimental
command-line tool called [`confix`][confix], whose job it is to automatically
update the settings in a `config.toml` file to the latest version. We also
backported a version of this tool into the v0.35.x branch at release v0.35.4.

This tool should work fine for configuration files created by Tendermint v0.34
and later, but does not (yet) know how to handle changes from prior versions of
Tendermint. Part of the difficulty for older versions is simply logistical: To
figure out which changes to apply, we need to understand something about the
version that made the file, as well as the version we're converting it to.

> **Discussion point:** In the future we might want to consider incorporating
> this into the node CLI directly, but we're keeping it separate for now until
> we can get some feedback from operators.

For the experiment, we handled this by carefully searching the history of
config format changes for shibboleths to bound the version: For example, the
`[fastsync]` section was added in Tendermint v0.32 and renamed `[blocksync]` in
Tendermint v0.35. So if we see a `[fastsync]` section, we have some confidence
that the file was created by v0.32, v0.33, or v0.34.

But such signals are delicate: The `[blocksync]` section was removed in v0.36,
so if we do not find `[fastsync]`, we cannot conclude from that alone that the
file is from v0.31 or earlier -- we have to look for corroborating details.
While such "sniffing" tactics are fine for an experiment, they aren't as robust
as we might like.

This is especially relevant for configuration files that may have already been
manually upgraded across several versions by the time we are asked to update
them again. Another related concern is that we'd like to make sure conversion
is idempotent, so that it would be safe to rerun the tool over an
already-converted file without breaking anything.

### Config Versioning

One obvious tactic we could use for future releases is to add a version marker
to the config file. This would give tools like `confix` (and the node itself) a
way to calibrate their expectations. Rather than being a version for the file
itself, however, this version marker would indicate which version of Tendermint
is needed to read the file.

Provisionally, this might look something like:

```toml
# The minimum version of Tendermint compatible with the contents of
# this configuration file.
config-version = 'v0.35'
```

When initializing a new node, Tendermint would populate this field with its own
version (e.g., `v0.36`). When conducting an upgrade, tools like `confix` can
then use this to decide which conversions are valid, and then update the value
accordingly.
After converting a file marked `'v0.35'` to `'v0.37'`, the
conversion tool sets the file's `config-version` to reflect its compatibility.

> **Discussion point:** This example presumes we would keep config files
> compatible within a given release cycle, e.g., all of v0.36.x. We could also
> use patch numbers here, if we think there's some reason to permit changes
> that would require config file edits at that granularity. I don't think we
> should, but that's a design question to consider.

Upon seeing an up-to-date version marker, the conversion tool can simply exit
with a diagnostic like "this file is already up-to-date", rather than sniffing
the keyspace and potentially introducing errors. In addition, this would let a
tool detect config files that are _newer_ than the version it understands, and
issue a safe diagnostic rather than doing something wrong. Plus, besides
avoiding potentially unsafe conversions, this would also serve as
human-readable documentation that the file is up-to-date for a given version.

Adding a config version would not address the problem of how to convert files
created by older versions of Tendermint, but it would at least help us build
more robust config tooling going forward.

### Stability and Change

In light of the discussion so far, it is natural to examine why we make so many
changes to the configuration file from one version to the next, and whether we
could reduce friction by being more conservative about what we make
configurable, what config changes we make over time, and how we roll them out.

Some changes, like renaming everything from snake case to kebab case, are
entirely gratuitous. We could safely agree not to make those kinds of changes.
Apart from that obvious case, however, many other configuration settings
provide value to node operators in cases where there is no simple, universal
setting that matches every application.

Taking a high-level view, there are several broad reasons why we might want to
make changes to configuration settings:

- **Lessons learned:** Configuration settings are a good way to try things out
  in production, before making more invasive changes to the consensus protocol.

  For example, up until Tendermint v0.35, consensus timeouts were specified as
  per-node configuration settings (e.g., `timeout-precommit` et al.). This
  allowed operators to tune these values for the needs of their network, but
  had the downside that individually-misconfigured nodes could stall consensus.

  Based on that experience, these timeouts have been deprecated in Tendermint
  v0.36 and converted to consensus parameters, to be consistent across all
  nodes in the network.

- **Migration & experimentation:** Introducing new features and updating old
  features can complicate migration for existing users of the software.
  Temporary or "experimental" configuration settings can be a valuable way to
  mitigate that friction.

  For example, Tendermint v0.36 introduces a new RPC event subscription
  endpoint (see [ADR 075][adr075]) that will eventually replace the existing
  websocket-based interface. To give users time to migrate, v0.36 adds an
  `experimental-disable-websocket` setting, defaulted to `false`, that allows
  operators to selectively disable the websocket API for testing purposes
  during the conversion. This setting is designed to be removed in v0.37, when
  the old interface is no longer supported.

- **Ongoing maintenance:** Sometimes configuration settings become obsolete,
  and the cost of removing them trades off against the potential risks of
  leaving a non-functional or deprecated knob hooked up indefinitely.

  For example, Tendermint v0.35 deprecated two alternate implementations of the
  blocksync protocol, one of which was deleted entirely (`v1`) and one of which
  was scheduled for removal (`v2`). The `blocksync.version` setting, which had
  been added as a migration aid, became obsolete and needed to be updated.

  Despite our best intentions, sometimes engineering designs do not work out.
  It's just as important to leave room to back out of changes we have since
  reconsidered, as it is to support migrations forward onto new and improved
  code.

- **Clarity and legibility:** Besides configuring the software, another
  important purpose of a config file is to document intent for the humans who
  operate and maintain the software. Operators need to adjust settings to keep
  the node running, and developers need to know what options were in use when
  something goes wrong so they can diagnose and fix bugs. The legibility of a
  config file as a _human_ artifact is thus also important.

  For example, Tendermint v0.35 moved settings related to validator private
  keys from the top-level section of the configuration file to their own
  designated `[priv-validator]` section. Although this change did not make any
  difference to the meaning of those settings, it made the organization of the
  file easier to understand, and allowed the names of the individual settings
  to be simplified (e.g., `priv-validator-key-file` became simply `key-file` in
  the new section).

  Although such changes are "gratuitous" with respect to the software, there is
  often value in making things more legible for the humans.
  While there is no
  simple rule to define the line, the Potter Stewart principle can be used with
  due care.

Keeping these examples in mind, we can and should take reasonable steps to
avoid churn in the configuration file across versions where we can. However, we
must also accept that part of the reason for _having_ a config file is to allow
us flexibility elsewhere in the design. On that basis, we should not attempt
to be too dogmatic about config changes either. Unlike changes in the block
protocol, for example, which affect every user of every network that adopts
them, config changes are relatively self-contained.

There are a few guiding principles I think we can use to strike a sensible
balance:

1. **No gratuitous changes.** Aesthetic changes that do not enhance legibility,
   avert confusion, or clarify documentation should be avoided entirely.

2. **Prefer mechanical changes.** Whenever it is practical, change settings in
   a way that can be updated by a tool without operator judgement. This implies
   finding safe, universal defaults for new settings, and not changing the
   default values of existing settings.

   Even if that means we have to make multiple changes (e.g., add a new setting
   in the current version, deprecate the old one, and remove the old one in the
   next version) it's preferable if we can mechanize each step.

3. **Clearly signal intent.** When adding temporary or experimental settings,
   they should be clearly named and documented as such. Use long names and
   suggestive prefixes (e.g., `experimental-*`) so that they stand out when
   read in the config file or printed in logs.

   Relatedly, using temporary or experimental settings should cause the
   software to emit diagnostic logs at runtime.
   These log messages should be
   easy to grep for, and should contain pointers to more complete documentation
   (say, issue numbers or URLs) that the operator can read, as well as a hint
   about when the setting is expected to become invalid. For example:

   ```
   WARNING: Websocket RPC access is deprecated and will be removed in
   Tendermint v0.37. See https://tinyurl.com/adr075 for more information.
   ```

4. **Consider both directions.** When adding a configuration setting, take some
   time during the implementation process to think about how the setting could
   be removed, as well as how it will be rolled out. This applies even for
   settings we imagine should be permanent. Experience may cause us to rethink
   our original design intent more broadly than we expected.

   This does not mean we have to spend a long time picking nits over the design
   of every setting; merely that we should convince ourselves we _could_ undo
   it without making too big a mess later. Even a little extra effort up front
   can sometimes save a lot.
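
Principle 2 pairs naturally with the `config-version` marker proposed under
Config Versioning: if a tool can read the marker, it can decide what to do
without operator judgement. Here is a minimal sketch in Python, assuming the
marker has the form `vMAJOR.MINOR` as in the provisional example. The function
names and the returned action strings are hypothetical, not part of `confix`.

```python
import re


def parse_version(marker: str) -> tuple[int, int]:
    """Parse a marker such as 'v0.35' into a comparable (major, minor) pair."""
    match = re.fullmatch(r"v(\d+)\.(\d+)", marker)
    if match is None:
        raise ValueError(f"unrecognized config-version: {marker!r}")
    return int(match.group(1)), int(match.group(2))


def plan_conversion(config: dict, target: str) -> str:
    """Decide what a conversion tool should do with a parsed config.

    Returns one of four (hypothetical) action names:
      "sniff"      - no marker present; fall back to keyspace sniffing
      "up-to-date" - nothing to do
      "too-new"    - file is newer than the tool understands; refuse to edit
      "convert"    - apply conversions, then update the marker to `target`
    """
    marker = config.get("config-version")
    if marker is None:
        return "sniff"
    have, want = parse_version(marker), parse_version(target)
    if have == want:
        return "up-to-date"
    if have > want:
        return "too-new"
    return "convert"
```

Comparing the parsed numbers rather than the raw strings matters: as text,
`'v0.9'` sorts after `'v0.10'`, which would send the tool down the wrong
branch.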

## References

- [Tendermint `config` package][config-pkg]
- [`confix` command-line tool][confix]
- [`condiff` command-line tool][condiff]
- [Configuration update plan][plan]
- [ADR 075: RPC Event Subscription Interface][adr075]

[config-pkg]: https://godoc.org/github.com/tendermint/tendermint/config
[confix]: https://github.com/tendermint/tendermint/blob/v0.37.x/scripts/confix
[condiff]: https://github.com/tendermint/tendermint/blob/v0.37.x/scripts/confix/condiff
[plan]: https://github.com/tendermint/tendermint/blob/v0.37.x/scripts/confix/plan.go
[testdata]: https://github.com/tendermint/tendermint/blob/v0.37.x/scripts/confix/testdata
[adr075]: https://github.com/tendermint/tendermint/blob/v0.37.x/docs/architecture/adr-075-rpc-subscription.md

## Appendix: Research Notes

Discovering when various configuration settings were added, updated, and
removed turns out to be surprisingly tedious. To solve this puzzle, we had to
answer the following questions:

1. What changes were made between v0.x and v0.y? This is further complicated by
   cases where we have backported config changes into the middle of an earlier
   release cycle (e.g., `psql-conn` from v0.35.x into v0.34.13).

2. When during the development cycle were those changes made? This allows us to
   recognize features that were backported into a previous release.

3. What were the default values of the changed settings, and did they change at
   all during or across the release boundary?

Each step of the [configuration update plan][plan] is commented with a link to
one or more PRs where that change was made. The sections below discuss how we
found these references.

### Tracking Changes Across Releases

To figure out what changed between two releases, we built a tool called
[`condiff`][condiff], which performs a "keyspace" diff of two TOML documents.
This diff respects the structure of the TOML file, but ignores comments, blank
lines, and configuration values, so that we can see what was added and removed.

To use it, run:

```shell
go run ./scripts/confix/condiff old.toml new.toml
```

This tool works on any TOML document, but for our purposes we needed
Tendermint `config.toml` files. The easiest way to get these is to build the
node binary for your version of interest, run `tendermint init` on a clean home
directory, and copy the generated config file out. The [`testdata`][testdata]
directory for the `confix` tool has configs generated from the heads of each
release branch from v0.31 through v0.35.

If you want to reproduce this yourself, it looks something like this:

```shell
# Example for Tendermint v0.32.
git checkout --track origin/v0.32.x
go get golang.org/x/sys/unix
go mod tidy
make build
rm -fr -- tmhome
./build/tendermint --home=tmhome init
cp tmhome/config/config.toml config-v32.toml
```

Be advised that the further back you go, the more idiosyncrasies you will
encounter. For example, Tendermint v0.31 and earlier predate Go modules (v0.31
used dep), and lack backport branches. And you may need to do some editing of
Makefile rules once you get back into the 20s.

Note that when diffing config files across the v0.34/v0.35 gap, the swap from
`snake_case` to `kebab-case` makes it look like everything changed. The
`condiff` tool has a `-desnake` flag that normalizes all the keys to kebab case
in both inputs before comparison.

### Locating Additions and Deletions

To figure out when a configuration setting was added or removed, your tool of
choice is `git bisect`. The only tricky part is finding the endpoints for the
search.
If the transition happened within a release, you can use that
release's backport branch as the endpoint (if it has one, e.g., `v0.35.x`).

However, the start point can be more problematic. The backport branches are not
ancestors of `master` or of each other, which means you need to find some point
in history _prior_ to the change but still attached to the mainline. For recent
releases there is a dev root (e.g., `v0.35.0-dev`, `v0.34.0-dev1`, etc.). These
are not named consistently, but you can usually grep the output of `git tag` to
find them.

In the worst case you could try starting from the root commit of the repo, but
that turns out not to work in all cases. We've done some branching shenanigans
over the years that mean the root is not a direct ancestor of all our release
branches. When you find this you will probably swear a lot. I did.

Once you have a start and end point (say, `v0.35.0-dev` and `master`), you can
bisect in the usual way. I use `git grep` on the `config` directory to check
whether the case I am looking for is present. For example, to find when the
`[fastsync]` section was removed:

```shell
# Setup:
git checkout master
git bisect start
git bisect bad                # it's not present on tip of master.
git bisect good v0.34.0-dev1  # it was present at the start of v0.34.
```

```shell
# Now repeat this until it gives you a specific commit:
if git grep -q '\[fastsync\]' config ; then git bisect good ; else git bisect bad ; fi
```

The above example finds where a config setting was removed. To find where a
setting was added, do the same thing but reverse the sense of the test
(`if ! git grep -q ...`).
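
The repeated manual step can also be handed to `git bisect run`, which treats
an exit status of 0 as "good" and 1 as "bad". The sketch below is one way to
write such a predicate in Python. The script name `find_change.py`, the
function names, and the `removal`/`addition` argument are all hypothetical; the
only git behavior relied on is that `git grep -q` exits 0 when it finds a
match.

```python
import subprocess
import sys

GOOD, BAD = 0, 1  # exit statuses understood by `git bisect run`


def bisect_verdict(present: bool, looking_for: str) -> int:
    """Map "is the pattern present in this commit?" to a bisect verdict.

    looking_for="removal":  present means good (the setting is still there).
    looking_for="addition": present means bad (the setting already exists).
    """
    if looking_for not in ("removal", "addition"):
        raise ValueError(f"unknown search kind: {looking_for!r}")
    if looking_for == "removal":
        return GOOD if present else BAD
    return BAD if present else GOOD


def pattern_present(pattern: str, path: str = "config") -> bool:
    """Report whether `git grep` finds the pattern under the given path."""
    result = subprocess.run(["git", "grep", "-q", "-e", pattern, "--", path])
    return result.returncode == 0


# Act as a bisect predicate only when invoked as: <script> <kind> <pattern>
if __name__ == "__main__" and len(sys.argv) == 3:
    kind, pattern = sys.argv[1], sys.argv[2]
    sys.exit(bisect_verdict(pattern_present(pattern), kind))
```

After the same setup as above, this would be driven with something like
`git bisect run python3 find_change.py removal '\[fastsync\]'`, with the script
kept outside the work tree so that checkouts do not disturb it.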