# RFC 002: Interprocess Communication (IPC) in Tendermint

## Changelog

- 08-Sep-2021: Initial draft (@creachadair).

## Abstract

Communication in Tendermint among consensus nodes, applications, and operator
tools uses a variety of message formats and transport mechanisms. In some
cases there are multiple options. Having all these options complicates both
the code and the developer experience, and can hide bugs. To support a more
robust, trustworthy, and usable system, we should document which communication
paths are essential, which could be removed or reduced in scope, and what we
can improve for the most important use cases.

This document proposes a variety of possible improvements of varying size and
scope. Specific design proposals should get their own documentation.

## Background

The Tendermint state replication engine has a complex IPC footprint.

1. Consensus nodes communicate with each other using a networked peer-to-peer
   message-passing protocol.

2. Consensus nodes communicate with the application whose state is being
   replicated via the [Application BlockChain Interface (ABCI)][abci].

3. Consensus nodes export a network-accessible [RPC service][rpc-service] to
   support operations (bootstrapping, debugging) and synchronization of
   [light clients][light-client]. This interface is also used by the
   [`tendermint` CLI][tm-cli].

4. Consensus nodes export a gRPC service exposing a subset of the methods of
   the RPC service described in (3). This was intended to simplify the
   implementation of tools that already use gRPC to communicate with an
   application (via the Cosmos SDK), and want to talk to the consensus node
   without implementing yet another RPC protocol.

   The gRPC interface to the consensus node has been deprecated and is slated
   for removal in the forthcoming Tendermint v0.36 release.

5. Consensus nodes may optionally communicate with a "remote signer" that
   holds a validator key and can provide public keys and signatures to the
   consensus node. One of the stated goals of this configuration is to allow
   the signer to be run on a private network, separate from the consensus
   node, so that a compromise of the consensus node from the public network
   would be less likely to expose validator keys.

## Discussion: Transport Mechanisms

### Remote Signer Transport

A remote signer communicates with the consensus node in one of two ways:

1. "Raw": Using a TCP or Unix-domain socket which carries varint-prefixed
   protocol buffer messages. In this mode, the consensus node is the server,
   and the remote signer is the client.

   This mode has been deprecated, and is intended to be removed.

2. gRPC: This mode uses the same protobuf messages as "raw" mode, but uses a
   standard encrypted gRPC HTTP/2 stub as the transport. In this mode, the
   remote signer is the server and the consensus node is the client.

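For reference, the framing used by the "raw" transport is simple enough to
sketch in a few lines of Go. This is a minimal illustration of varint-prefixed
message framing, not Tendermint's actual implementation: the helper names are
hypothetical, and a real message would be a wire-format protobuf rather than a
plain byte string. The ABCI "socket protocol" described below uses a variation
of the same framing.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"log"
)

// writeFrame writes one uvarint length prefix followed by the message bytes.
func writeFrame(w io.Writer, msg []byte) error {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(msg)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readFrame reads one length prefix and then the message it describes.
func readFrame(r *bufio.Reader) ([]byte, error) {
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	msg := make([]byte, size)
	if _, err := io.ReadFull(r, msg); err != nil {
		return nil, err
	}
	return msg, nil
}

func main() {
	// Round-trip one frame through an in-memory pipe standing in for a
	// TCP or Unix-domain socket.
	pr, pw := io.Pipe()
	go func() {
		defer pw.Close()
		if err := writeFrame(pw, []byte("wire-format protobuf would go here")); err != nil {
			log.Print(err)
		}
	}()
	msg, err := readFrame(bufio.NewReader(pr))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", msg)
}
```
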
### ABCI Transport

In ABCI, the _application_ is the server, and the Tendermint consensus engine
is the client. Most applications implement the server using the
[Cosmos SDK][cosmos-sdk], which handles low-level details of the ABCI
interaction and provides a higher-level interface to the rest of the
application. The SDK is written in Go.

Beneath the SDK, the application communicates with Tendermint core in one of
two ways:

- In-process direct calls (for applications written in Go and compiled against
  the Tendermint code). This is an optimization for the common case where an
  application is written in Go, to save on the overhead of marshaling and
  unmarshaling requests and responses within the same process:
  [`abci/client/local_client.go`][local-client]

- A custom remote procedure protocol built on wire-format protobuf messages
  using a socket (the "socket protocol"):
  [`abci/server/socket_server.go`][socket-server]

The SDK also provides a [gRPC service][sdk-grpc] accessible from outside the
application, allowing clients to broadcast transactions to the network, look
up transactions, and simulate transaction costs.

### RPC Transport

The consensus node RPC service allows callers to query consensus parameters
(genesis data, transactions, commits), node status (network info, health
checks), application state (`abci_query`, `abci_info`), mempool state, and
other attributes of the node and its application. The service also provides
methods allowing transactions and evidence to be injected ("broadcast") into
the blockchain.

The RPC service is exposed in several ways:

- HTTP GET: Queries may be sent as URI parameters, with method names in the
  path.

- HTTP POST: Queries may be sent as JSON-RPC request messages in the body of
  an HTTP POST request. The server uses a custom implementation of JSON-RPC
  that is not fully compatible with the [JSON-RPC 2.0 spec][json-rpc], but
  handles the common cases.

- Websocket: Queries may be sent as JSON-RPC request messages via a websocket.
  This transport uses more or less the same JSON-RPC plumbing as the HTTP POST
  handler.

  The websocket endpoint also includes three methods that are _only_ exported
  via websocket, which appear to support event subscription.

- gRPC: A subset of queries may be issued in protocol buffer format to the
  gRPC interface described above under (4). As noted, this endpoint is
  deprecated and will be removed in v0.36.

### Opportunities for Simplification

**Claim:** There are too many IPC mechanisms.

The preponderance of ABCI usage is via the Cosmos SDK, which means the
application and the consensus node are compiled together into a single binary,
and the consensus node calls the ABCI methods of the application directly as
Go functions.

We also need a true IPC transport to support ABCI applications _not_ written
in Go. There are several known applications written in Rust, for example,
including [Anoma](https://github.com/anoma/anoma), Penumbra,
[Oasis](https://github.com/oasisprotocol/oasis-core), Twilight, and
[Nomic](https://github.com/nomic-io/nomic). Ideally we will have at most one
such transport "built-in": more esoteric cases can be handled by a custom
proxy. Pragmatically, gRPC is probably the right choice here.

The primary consumers of the multi-headed "RPC service" today are the light
client and the `tendermint` command-line client. There is probably some local
use via curl, but I expect that is mostly ad hoc. Ethan reports that nodes are
often configured with the ports to the RPC service blocked, which is good for
security but complicates use by the light client.

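To make the HTTP transports above concrete, here is a minimal Go sketch of the
same RPC method issued both ways, assuming a node listening on the default
local RPC address (`localhost:26657`). The exact shape of the responses varies
by version, so treat this as illustrative rather than normative:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

// fetch prints the body of a response, or exits if the request failed
// (e.g. no node is listening locally).
func fetch(resp *http.Response, err error) {
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}

func main() {
	// HTTP GET: the method name in the URI path, arguments (if any)
	// as URI parameters.
	fetch(http.Get("http://localhost:26657/abci_info"))

	// HTTP POST: the same method expressed as a JSON-RPC request in
	// the request body.
	req := `{"jsonrpc":"2.0","id":1,"method":"abci_info","params":{}}`
	fetch(http.Post("http://localhost:26657", "application/json", bytes.NewBufferString(req)))
}
```
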
### Context: Remote Signer Issues

Since the remote signer needs a secure communication channel to exchange keys
and signatures, and is expected to run truly remotely from the node (i.e., on
a separate physical server), there is not a whole lot we can do here. We
should finish the deprecation and removal of the "raw" socket protocol between
the consensus node and remote signers, but the use of gRPC is appropriate.

The main improvement we can make is to simplify the implementation quite a
bit, once we no longer need to support both "raw" and gRPC transports.

### Context: ABCI Issues

In the original design of ABCI, the presumption was that all access to the
application should be mediated by the consensus node. The idea is that outside
access could change application state and corrupt the consensus process, which
depends on the application being deterministic. Of course, even without
outside access an application could behave nondeterministically, but allowing
other programs to send it requests was seen as courting trouble.

Conversely, users noted that most of the time, tools written for a particular
application don't want to talk to the consensus module directly. The
application "owns" the state machine the consensus engine is replicating, so
tools that care about application state should talk to the application.
Otherwise, they would have to bake in knowledge about Tendermint (e.g., its
interfaces and data structures) just because of the mediation.

For clients to talk directly to the application, however, there is another
concern: The consensus node is the ABCI _client_, so it is inconvenient for
the application to "push" work into the consensus module via ABCI itself. The
current implementation works around this by calling the consensus node's RPC
service, which exposes an `ABCIQuery` kitchen-sink method that gives the
application a way to poke ABCI messages in the other direction.

Without this RPC method, you could work around this (at least in principle) by
having the consensus module "poll" the application for work that needs doing,
but that has unsatisfactory implications for performance and robustness, as
well as being harder to understand.

There has apparently been discussion about making communication between the
consensus node and the application more bidirectional, but this issue seems to
still be unresolved.

Another complication of ABCI is that it requires the application (server) to
maintain [four separate connections][abci-conn]: one for "consensus"
operations (BeginBlock, EndBlock, DeliverTx, Commit), one for "mempool"
operations, one for "query" operations, and one for "snapshot" (state
synchronization) operations. The rationale seems to have been that these
groups of operations should be able to proceed concurrently with each other.
In practice, coordinating state updates across the separate streams creates a
very complex state management problem. While application authors in Go are
mostly insulated from that complexity by the Cosmos SDK, the plumbing to
maintain those separate streams is complicated and hard to understand, and we
suspect it contains subtle concurrency bugs and/or lock contention issues that
affect performance and are difficult to pin down.

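The next paragraph argues that routing everything through a single connection
would simplify this. As a point of comparison, here is a deliberately
simplified, hypothetical Go sketch (not Tendermint's code) of requests from
the four categories multiplexed onto one ordered stream, which is what would
let a server make explicit per-category scheduling and locking decisions:

```go
package main

import "fmt"

// Category labels the four groups of ABCI operations that today run on
// separate connections.
type Category int

const (
	Consensus Category = iota
	Mempool
	Query
	Snapshot
)

// Request tags each operation with its category so that a single stream
// can carry all four kinds of traffic.
type Request struct {
	Cat  Category
	Name string // e.g. "DeliverTx", "CheckTx", "Info", ...
}

// serve drains one multiplexed queue. Because every request arrives on a
// single ordered stream, the handler can decide explicitly which
// operations need to be serialized, rather than locking conservatively
// across four independent connections.
func serve(reqs <-chan Request) {
	for r := range reqs {
		switch r.Cat {
		case Consensus:
			fmt.Println("serialize against committed state:", r.Name)
		default:
			fmt.Println("schedule more optimistically:", r.Name)
		}
	}
}

func main() {
	reqs := make(chan Request, 4)
	reqs <- Request{Consensus, "DeliverTx"}
	reqs <- Request{Mempool, "CheckTx"}
	reqs <- Request{Query, "Info"}
	reqs <- Request{Snapshot, "ListSnapshots"}
	close(reqs)
	serve(reqs)
}
```
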
Even without changing the semantics of any ABCI operations, this code could be
made smaller and easier to debug by separating the management of concurrency
and locking from the IPC transport: if all requests and responses are routed
through one connection, the server can explicitly maintain priority queues for
requests and responses, and make less-conservative decisions about when locks
are (or aren't) required to synchronize state access. With independent queues,
the server must lock conservatively, and no optimistic scheduling is
practical.

This would be a tedious implementation change, but should be achievable
without breaking any of the existing interfaces. More importantly, it could
potentially address a lot of difficult concurrency and performance problems we
currently see anecdotally but have difficulty isolating because of how
intertwined these separate message streams are at runtime.

TODO: Impact of ABCI++ for this topic?

### Context: RPC Issues

The RPC system serves several masters, and has a complex surface area. I
believe some improvements can be made by separating these concerns.

The Tendermint light client currently uses the RPC service to look up blocks
and transactions, and to forward ABCI queries to the application. The light
client proxy uses the RPC service via a websocket. The Cosmos IBC relayer also
uses the RPC service via websocket to watch for transaction events, and uses
the `ABCIQuery` method to fetch information and proofs for posted
transactions.

Some work is already underway toward using P2P message passing rather than RPC
to synchronize light client state with the rest of the network. IBC relaying,
however, requires access to the event system, which is currently not
accessible except via the RPC interface. Event subscription _could_ be exposed
via P2P, but that is a larger project, since it adds P2P communication load
and might thus have an impact on the performance of consensus.

If event subscription can be moved into the P2P network, we could entirely
remove the websocket transport, even for clients that still need access to the
RPC service. Until then, we may still be able to reduce the scope of the
websocket endpoint to _only_ event subscription, by moving uses of the RPC
server as a proxy to ABCI over to the gRPC interface.

The RPC server still makes sense for local bootstrapping and operations, but
it can be further simplified. Here are some specific proposals:

- Remove the HTTP GET interface entirely.

- Simplify the JSON-RPC plumbing to remove unnecessary reflection and
  wrapping.

- Remove the gRPC interface (this is already planned for v0.36).

- Separate the websocket interface from the rest of the RPC service, and
  restrict it to event subscription only (a sketch of that usage follows
  below).

Eventually we should try to remove the websocket interface entirely, but we
will need to revisit that (probably in a new RFC) once we've done some of the
easier things.

These changes would preserve the ability of operators to issue queries with
curl (but would require using JSON-RPC instead of URI parameters). That would
be a little less user-friendly, but for a use case that should not be that
prevalent.

These changes would also preserve compatibility with existing JSON-RPC based
code paths like the `tendermint` CLI and the light client (even ahead of
further work to remove that dependency).

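For reference, the event-subscription usage that the websocket endpoint would
retain looks roughly like the following. This is a hedged sketch using the
third-party `gorilla/websocket` package; the `subscribe` method and `tm.event`
query syntax follow Tendermint's documented RPC, and the address assumes a
local node with the default configuration:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// Connect to the node's websocket endpoint.
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:26657/websocket", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Subscribe to new-block events via a JSON-RPC request.
	sub := `{"jsonrpc":"2.0","id":1,"method":"subscribe","params":{"query":"tm.event='NewBlock'"}}`
	if err := conn.WriteMessage(websocket.TextMessage, []byte(sub)); err != nil {
		log.Fatal(err)
	}

	// Print event notifications as they arrive.
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(string(msg))
	}
}
```
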
**Design goal:** An operator should be able to disable non-local access to the
RPC server on any node in the network without impairing the network's ability
to provide state replication, including service to light clients.

**Design principle:** All communication required to implement and monitor the
consensus network should use P2P, including the various synchronizations.

### Options for ABCI Transport

The majority of current usage is in Go, and the majority of that is mediated
by the Cosmos SDK, which uses the "direct call" interface. There is probably
some opportunity to clean up the implementation of that code, notably by
inverting which interface is at the "top" of the abstraction stack (currently
it acts like an RPC interface, and escape-hatches into the direct call).
However, this general approach works fine and doesn't need to be fundamentally
changed.

For applications _not_ written in Go, the two remaining options are the
"socket" protocol (another variation on varint-prefixed protobuf messages over
an unstructured stream) and gRPC. It would be nice if we could get rid of one
of these to reduce (unneeded?) optionality.

Since both the socket protocol and gRPC depend on protocol buffers, the
"socket" protocol is the most obvious choice to remove. While gRPC is more
complex, the set of languages that _have_ protobuf support but _lack_ gRPC
support is small. Moreover, gRPC is already widely used in the rest of the
ecosystem (including the Cosmos SDK).

If some use case did arise later that can't work with gRPC, it would not be
too difficult for that application author to write a little proxy (in Go) that
bridges the convenient SDK APIs into a simpler protocol than gRPC.

**Design principle:** It is better for an uncommon special case to carry the
burdens of its specialness than to bake an escape hatch into the
infrastructure.

**Recommendation:** We should deprecate and remove the socket protocol.

### Options for RPC Transport

[ADR 057][adr-57] proposes using gRPC for the Tendermint RPC implementation.
This is still possible, but if we are able to simplify and decouple the
concerns as described above, I do not think it should be necessary.

While JSON-RPC is not the best possible RPC protocol for all situations, it
has some advantages over gRPC for our domain. Specifically:

- It is easy to call JSON-RPC manually from the command line, which helps with
  a common use of the RPC service: local debugging and operations.

  Relatedly: JSON is relatively easy for humans to read and write, and it can
  be easily copied and pasted to share sample queries and debugging results in
  chat, issue comments, and so on. Ideally, the RPC service will not be used
  for activities where the cost of a text protocol outweighs its legibility
  and manual usability benefits.

- gRPC has an enormous dependency footprint for both clients and servers, and
  many of the features it provides to support security and performance
  (encryption, compression, streaming, etc.) are mostly irrelevant to local
  use.
  Tendermint already needs to include a gRPC client for the remote signer,
  but if we can avoid the need for a _client_ to depend on gRPC, that is a win
  for usability.

- If we intend to migrate light clients off RPC to use P2P entirely, there is
  no advantage to forcing a temporary migration to gRPC along the way; and
  once the light client is not dependent on the RPC service, the efficiency of
  the protocol is much less important.

- We can still get the benefits of generated data types using protocol
  buffers, even without using gRPC:

  - Protobuf defines a standard JSON encoding for all message types, so
    languages with protobuf support do not need to worry about type-mapping
    oddities (see the sketch at the end of this section).

  - Using JSON means that even languages _without_ good protobuf support can
    implement the protocol with a bit more work, and I expect this situation
    to be rare.

Even if a language lacks a good standard JSON-RPC mechanism, the protocol is
lightweight and can be implemented by simple send/receive over TCP or
Unix-domain sockets with no need for code generation, encryption, etc. gRPC
uses a complex HTTP/2-based transport that is not easily replicated.

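As a small illustration of the point about the standard JSON encoding, the
`protojson` package renders any protobuf message as its spec-defined JSON
form. The sketch below uses a well-known type in place of a Tendermint message
for brevity; the same mechanism applies to generated request and response
types:

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	// Any proto.Message works here; Timestamp stands in for a real
	// RPC request or response type.
	ts := timestamppb.New(time.Date(2021, 9, 8, 0, 0, 0, 0, time.UTC))
	out, err := protojson.Marshal(ts)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // prints "2021-09-08T00:00:00Z"
}
```
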
### Future Work

The background and proposals sketched above focus on the existing structure of
Tendermint and improvements we can make in the short term. It is worthwhile to
also consider options for longer-term, broader changes to the IPC ecosystem.
The following outlines some ideas at a high level:

- **Consensus service:** Today, the application and the consensus node are
  nominally connected only via ABCI. Tendermint was originally designed with
  the assumption that all communication with the application should be
  mediated by the consensus node. Based on further experience, however, the
  design goal is now that the _application_ should be the mediator of
  application state.

  As noted above, however, ABCI is a client/server protocol, with the
  application as the server. For outside clients that turns out to have been a
  good choice, but it complicates the relationship between the application and
  the consensus node: previously transactions were entered via the node; now
  they are entered via the app.

  We have worked around this by using the Tendermint RPC service to give the
  application a "back channel" to the consensus node, so that it can push
  transactions back into the consensus network. But the RPC service exposes a
  lot of other functionality, too, including event subscription, block and
  transaction queries, and a lot of node status information.

  Even if we can't easily "fix" the orientation of the ABCI relationship, we
  could improve isolation by splitting out the parts of the RPC service that
  the application needs as a back channel, and sharing those _only_ with the
  application. By defining a "consensus service", we could give the
  application a way to talk back that is limited to only the capabilities it
  needs. This approach has the benefit that we could do it without breaking
  existing use, and if we later did "fix" the ABCI directionality, we could
  drop the special case without disrupting the rest of the RPC interface.

- **Event service:** Right now, the IBC relayer relies on the Tendermint RPC
  service to provide a stream of block and transaction events, which it uses
  to discover which transactions need relaying to other chains. While I think
  that event subscription should eventually be handled via P2P, we could gain
  some immediate benefit by splitting out event subscription from the rest of
  the RPC service.

  In this model, an event subscription service would be exposed on the public
  network, but on a different endpoint. This would remove the need for the RPC
  service to support the websocket protocol, and would allow operators to
  isolate potentially sensitive status query results from the public network.

  At the moment the relayers also use the RPC service to get block data for
  synchronization, but work is already in progress to handle that concern via
  the P2P layer. Once that's done, event subscription could be separated.

Separating parts of the existing RPC service is not without cost: it might
require additional connection endpoints, for example, though it is also not
too difficult for multiple otherwise-independent services to share a
connection.

In return, though, it would become easier to reduce transport options, and for
operators to independently control access to sensitive data. Considering the
viability and implications of these ideas is beyond the scope of this RFC, but
they are documented here since they follow from the background we have already
discussed.

## References

[abci]: https://github.com/tendermint/tendermint/tree/master/spec/abci
[rpc-service]: https://docs.tendermint.com/master/rpc/
[light-client]: https://docs.tendermint.com/master/tendermint-core/light-client.html
[tm-cli]: https://github.com/tendermint/tendermint/tree/master/cmd/tendermint
[cosmos-sdk]: https://github.com/cosmos/cosmos-sdk/
[local-client]: https://github.com/tendermint/tendermint/blob/master/abci/client/local_client.go
[socket-server]: https://github.com/tendermint/tendermint/blob/master/abci/server/socket_server.go
[sdk-grpc]: https://pkg.go.dev/github.com/cosmos/cosmos-sdk/types/tx#ServiceServer
[json-rpc]: https://www.jsonrpc.org/specification
[abci-conn]: https://github.com/tendermint/tendermint/blob/master/spec/abci/apps.md#state
[adr-57]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-057-RPC.md