
# RFC 002: Interprocess Communication (IPC) in Tendermint

## Changelog

- 08-Sep-2021: Initial draft (@creachadair).

## Abstract

Communication in Tendermint among consensus nodes, applications, and operator
tools uses a variety of message formats and transport mechanisms.  In some
cases there are multiple options. Having all these options complicates both the
code and the developer experience, and hides bugs. To support a more robust,
trustworthy, and usable system, we should document which communication paths
are essential, which could be removed or reduced in scope, and what we can
improve for the most important use cases.

This document proposes a variety of possible improvements of varying size and
scope. Specific design proposals should get their own documentation.

## Background

The Tendermint state replication engine has a complex IPC footprint.

1. Consensus nodes communicate with each other using a networked peer-to-peer
   message-passing protocol.

2. Consensus nodes communicate with the application whose state is being
   replicated via the [Application BlockChain Interface (ABCI)][abci].

3. Consensus nodes export a network-accessible [RPC service][rpc-service] to
   support operations (bootstrapping, debugging) and synchronization of
   [light clients][light-client]. This interface is also used by the
   [`tendermint` CLI][tm-cli].

4. Consensus nodes export a gRPC service exposing a subset of the methods of
   the RPC service described by (3). This was intended to simplify the
   implementation of tools that already use gRPC to communicate with an
   application (via the Cosmos SDK) and want to also talk to the consensus
   node without implementing yet another RPC protocol.

   The gRPC interface to the consensus node has been deprecated and is slated
   for removal in the forthcoming Tendermint v0.36 release.

5. Consensus nodes may optionally communicate with a "remote signer" that holds
   a validator key and can provide public keys and signatures to the consensus
   node. One of the stated goals of this configuration is to allow the signer
   to be run on a private network, separate from the consensus node, so that a
   compromise of the consensus node from the public network would be less
   likely to expose validator keys.

## Discussion: Transport Mechanisms

### Remote Signer Transport

A remote signer communicates with the consensus node in one of two ways:

1. "Raw": Using a TCP or Unix-domain socket which carries varint-prefixed
   protocol buffer messages. In this mode, the consensus node is the server,
   and the remote signer is the client.

   This mode has been deprecated and is intended to be removed.

2. gRPC: This mode uses the same protobuf messages as "Raw" mode, but uses a
   standard encrypted gRPC HTTP/2 stub as the transport. In this mode, the
   remote signer is the server and the consensus node is the client.

### ABCI Transport

In ABCI, the _application_ is the server, and the Tendermint consensus engine
is the client.  Most applications implement the server using the [Cosmos SDK][cosmos-sdk],
which handles low-level details of the ABCI interaction and provides a
higher-level interface to the rest of the application. The SDK is written in Go.

Beneath the SDK, the application communicates with Tendermint core in one of
two ways:

- In-process direct calls (for applications written in Go and compiled against
  the Tendermint code).  This is an optimization for the common case where an
  application is written in Go, to save on the overhead of marshaling and
  unmarshaling requests and responses within the same process:
  [`abci/client/local_client.go`][local-client]

- A custom remote procedure protocol built on wire-format protobuf messages
  using a socket (the "socket protocol"): [`abci/server/socket_server.go`][socket-server]

The SDK also provides a [gRPC service][sdk-grpc] accessible from outside the
application, allowing clients to broadcast transactions to the network, look up
transactions, and simulate transaction costs.

### RPC Transport

The consensus node RPC service allows callers to query consensus parameters
(genesis data, transactions, commits), node status (network info, health
checks), application state (abci_query, abci_info), mempool state, and other
attributes of the node and its application. The service also provides methods
allowing transactions and evidence to be injected ("broadcast") into the
blockchain.

The RPC service is exposed in several ways:

- HTTP GET: Queries may be sent as URI parameters, with method names in the path.

- HTTP POST: Queries may be sent as JSON-RPC request messages in the body of an
  HTTP POST request.  The server uses a custom implementation of JSON-RPC that
  is not fully compatible with the [JSON-RPC 2.0 spec][json-rpc], but handles
  the common cases.

- Websocket: Queries may be sent as JSON-RPC request messages via a websocket.
  This transport uses more or less the same JSON-RPC plumbing as the HTTP POST
  handler.

  The websocket endpoint also includes three methods that are _only_ exported
  via websocket, which appear to support event subscription.

- gRPC: A subset of queries may be issued in protocol buffer format to the gRPC
  interface described above under (4). As noted, this endpoint is deprecated
  and will be removed in v0.36.

### Opportunities for Simplification

**Claim:** There are too many IPC mechanisms.

The preponderance of ABCI usage is via the Cosmos SDK, which means the
application and the consensus node are compiled together into a single binary,
and the consensus node calls the ABCI methods of the application directly as Go
functions.

We also need a true IPC transport to support ABCI applications _not_ written in
Go.  There are several known applications written in Rust, for example,
including [Anoma](https://github.com/anoma/anoma), Penumbra,
[Oasis](https://github.com/oasisprotocol/oasis-core), Twilight, and
[Nomic](https://github.com/nomic-io/nomic). Ideally we will have at most one
such transport "built-in": more esoteric cases can be handled by a custom proxy.
Pragmatically, gRPC is probably the right choice here.

The primary consumers of the multi-headed "RPC service" today are the light
client and the `tendermint` command-line client. There is probably some local
use via curl, but I expect that is mostly ad hoc. Ethan reports that nodes are
often configured with the ports to the RPC service blocked, which is good for
security but complicates use by the light client.

### Context: Remote Signer Issues

Since the remote signer needs a secure communication channel to exchange keys
and signatures, and is expected to run truly remotely from the node (i.e., on a
separate physical server), there is not a whole lot we can do here. We should
finish the deprecation and removal of the "raw" socket protocol between the
consensus node and remote signers, but the use of gRPC is appropriate.

The main improvement we can make is to simplify the implementation quite a bit,
once we no longer need to support both "raw" and gRPC transports.

### Context: ABCI Issues

In the original design of ABCI, the presumption was that all access to the
application should be mediated by the consensus node. The idea was that outside
access could change application state and corrupt the consensus process, which
relies on the application being deterministic. Of course, even without outside
access an application could behave nondeterministically, but allowing other
programs to send it requests was seen as courting trouble.

Conversely, users noted that most of the time, tools written for a particular
application don't want to talk to the consensus module directly. The
application "owns" the state machine the consensus engine is replicating, so
tools that care about application state should talk to the application.
Otherwise, they would have to bake in knowledge about Tendermint (e.g., its
interfaces and data structures) just because of the mediation.

For clients to talk directly to the application, however, there is another
concern: The consensus node is the ABCI _client_, so it is inconvenient for the
application to "push" work into the consensus module via ABCI itself.  The
current implementation works around this by calling the consensus node's RPC
service, which exposes an `ABCIQuery` kitchen-sink method that gives the
application a way to poke ABCI messages in the other direction.

Without this RPC method, you could work around the problem (at least in
principle) by having the consensus module "poll" the application for work that
needs to be done, but that has unsatisfactory implications for performance and
robustness, as well as being harder to understand.

There has apparently been discussion about making communication between the
consensus node and the application more bidirectional, but this issue seems to
still be unresolved.

Another complication of ABCI is that it requires the application (server) to
maintain [four separate connections][abci-conn]: One for "consensus" operations
(BeginBlock, EndBlock, DeliverTx, Commit), one for "mempool" operations, one
for "query" operations, and one for "snapshot" (state synchronization) operations.
The rationale seems to have been that these groups of operations should be able
to proceed concurrently with each other. In practice, it results in a very complex
state management problem to coordinate state updates between the separate streams.
While application authors in Go are mostly insulated from that complexity by the
Cosmos SDK, the plumbing to maintain those separate streams is complicated, hard
to understand, and we suspect it contains concurrency bugs and/or lock contention
issues affecting performance that are subtle and difficult to pin down.

Even without changing the semantics of any ABCI operations, this code could be
made smaller and easier to debug by separating the management of concurrency
and locking from the IPC transport: If all requests and responses are routed
through one connection, the server can explicitly maintain priority queues for
requests and responses, and make less-conservative decisions about when locks
are (or aren't) required to synchronize state access. With independent queues,
the server must lock conservatively, and no optimistic scheduling is practical.

This would be a tedious implementation change, but should be achievable without
breaking any of the existing interfaces. More importantly, it could potentially
address a lot of difficult concurrency and performance problems we currently
see anecdotally but have difficulty isolating because of how intertwined these
separate message streams are at runtime.

TODO: Impact of ABCI++ for this topic?

### Context: RPC Issues

The RPC system serves several masters, and has a complex surface area. I
believe there are some improvements that can be realized by separating some of
these concerns.

The Tendermint light client currently uses the RPC service to look up blocks
and transactions, and to forward ABCI queries to the application.  The light
client proxy uses the RPC service via a websocket. The Cosmos IBC relayer also
uses the RPC service via websocket to watch for transaction events, and uses
the `ABCIQuery` method to fetch information and proofs for posted transactions.

Some work is already underway toward using P2P message passing rather than RPC
to synchronize light client state with the rest of the network.  IBC relaying,
however, requires access to the event system, which is currently not accessible
except via the RPC interface. Event subscription _could_ be exposed via P2P,
but that is a larger project since it adds P2P communication load, and might
thus have an impact on the performance of consensus.

If event subscription can be moved into the P2P network, we could entirely
remove the websocket transport, even for clients that still need access to the
RPC service. Until then, we may still be able to reduce the scope of the
websocket endpoint to _only_ event subscription, by moving uses of the RPC
server as a proxy to ABCI over to the gRPC interface.

Having the RPC server still makes sense for local bootstrapping and operations,
but the server can be further simplified. Here are some specific proposals:

- Remove the HTTP GET interface entirely.

- Simplify the JSON-RPC plumbing to remove unnecessary reflection and wrapping.

- Remove the gRPC interface (this is already planned for v0.36).

- Separate the websocket interface from the rest of the RPC service, and
  restrict it to only event subscription.

  Eventually we should try to remove the websocket interface entirely, but we
  will need to revisit that (probably in a new RFC) once we've done some of the
  easier things.

These changes would preserve the ability of operators to issue queries with
curl (but would require using JSON-RPC instead of URI parameters). That would
be a little less user-friendly, but for a use case that should not be that
prevalent.

These changes would also preserve compatibility with existing JSON-RPC based
code paths like the `tendermint` CLI and the light client (even ahead of
further work to remove that dependency).

**Design goal:** An operator should be able to disable non-local access to the
RPC server on any node in the network without impairing the network's ability
to provide state replication services, including to light clients.

**Design principle:** All communication required to implement and monitor the
consensus network should use P2P, including the various synchronizations.

### Options for ABCI Transport

The majority of current usage is in Go, and the majority of that is mediated by
the Cosmos SDK, which uses the "direct call" interface. There is probably some
opportunity to clean up the implementation of that code, notably by inverting
which interface is at the "top" of the abstraction stack (currently it acts
like an RPC interface, and escape-hatches into the direct call). However, this
general approach works fine and doesn't need to be fundamentally changed.

For applications _not_ written in Go, the two remaining options are the
"socket" protocol (another variation on varint-prefixed protobuf messages over
an unstructured stream) and gRPC. It would be nice if we could get rid of one
of these to reduce (unneeded?) optionality.

Since both the socket protocol and gRPC depend on protocol buffers, the
"socket" protocol is the most obvious choice to remove. While gRPC is more
complex, the set of languages that _have_ protobuf support but _lack_ gRPC
support is small. Moreover, gRPC is already widely used in the rest of the
ecosystem (including the Cosmos SDK).

If some use case did arise later that can't work with gRPC, it would not be too
difficult for that application author to write a little proxy (in Go) that
bridges the convenient SDK APIs into a simpler protocol than gRPC.

**Design principle:** It is better for an uncommon special case to carry the
burdens of its specialness than to bake an escape hatch into the infrastructure.

**Recommendation:** We should deprecate and remove the socket protocol.

### Options for RPC Transport

[ADR 057][adr-57] proposes using gRPC for the Tendermint RPC implementation.
This is still possible, but if we are able to simplify and decouple the
concerns as described above, I do not think it should be necessary.

While JSON-RPC is not the best possible RPC protocol for all situations, it has
some advantages over gRPC for our domain. Specifically:

- It is easy to call JSON-RPC manually from the command line, which helps with
  a common use of the RPC service: local debugging and operations.

  Relatedly: JSON is relatively easy for humans to read and write, and it can
  be easily copied and pasted to share sample queries and debugging results in
  chat, issue comments, and so on. Ideally, the RPC service will not be used
  for activities where the costs of a text protocol outweigh its legibility
  and manual usability benefits.

- gRPC has an enormous dependency footprint for both clients and servers, and
  many of the features it provides to support security and performance
  (encryption, compression, streaming, etc.) are mostly irrelevant to local
  use. Tendermint already needs to include a gRPC client for the remote signer,
  but if we can avoid the need for a _client_ to depend on gRPC, that is a win
  for usability.

- If we intend to migrate light clients off RPC to use P2P entirely, there is
  no advantage to forcing a temporary migration to gRPC along the way; and once
  the light client is not dependent on the RPC service, the efficiency of the
  protocol is much less important.

- We can still get the benefits of generated data types using protocol buffers,
  even without using gRPC:

  - Protobuf defines a standard JSON encoding for all message types, so
    languages with protobuf support do not need to worry about type mapping
    oddities.

  - Using JSON means that even languages _without_ good protobuf support can
    implement the protocol with a bit more work, and I expect this situation to
    be rare.

Even if a language lacks a good standard JSON-RPC mechanism, the protocol is
lightweight and can be implemented by simple send/receive over TCP or
Unix-domain sockets with no need for code generation, encryption, etc. gRPC
uses a complex HTTP/2 based transport that is not easily replicated.

### Future Work

The background and proposals sketched above focus on the existing structure of
Tendermint and improvements we can make in the short term. It is worthwhile to
also consider options for longer-term, broader changes to the IPC ecosystem.
The following outlines some ideas at a high level:

- **Consensus service:** Today, the application and the consensus node are
  nominally connected only via ABCI. Tendermint was originally designed with
  the assumption that all communication with the application should be mediated
  by the consensus node.  Based on further experience, however, the design goal
  is now that the _application_ should be the mediator of application state.

  As noted above, however, ABCI is a client/server protocol, with the
  application as the server. For outside clients that turns out to have been a
  good choice, but it complicates the relationship between the application and
  the consensus node: previously transactions were entered via the node, now
  they are entered via the app.

  We have worked around this by using the Tendermint RPC service to give the
  application a "back channel" to the consensus node, so that it can push
  transactions back into the consensus network. But the RPC service exposes a
  lot of other functionality, too, including event subscription, block and
  transaction queries, and a lot of node status information.

  Even if we can't easily "fix" the orientation of the ABCI relationship, we
  could improve isolation by splitting out the parts of the RPC service that
  the application needs as a back channel, and sharing those _only_ with the
  application. By defining a "consensus service", we could give the application
  a way to talk back that is limited to only the capabilities it needs. This
  approach has the benefit that we could do it without breaking existing use,
  and if we later did "fix" the ABCI directionality, we could drop the special
  case without disrupting the rest of the RPC interface.

- **Event service:** Right now, the IBC relayer relies on the Tendermint RPC
  service to provide a stream of block and transaction events, which it uses to
  discover which transactions need relaying to other chains.  While I think
  that event subscription should eventually be handled via P2P, we could gain
  some immediate benefit by splitting out event subscription from the rest of
  the RPC service.

  In this model, an event subscription service would be exposed on the public
  network, but on a different endpoint. This would remove the need for the RPC
  service to support the websocket protocol, and would allow operators to
  isolate potentially sensitive status query results from the public network.

  At the moment the relayers also use the RPC service to get block data for
  synchronization, but work is already in progress to handle that concern via
  the P2P layer. Once that's done, event subscription could be separated.

Separating parts of the existing RPC service is not without cost: It might
require additional connection endpoints, for example, though it is also not too
difficult for multiple otherwise-independent services to share a connection.

In return, though, it would become easier to reduce transport options and for
operators to independently control access to sensitive data. Considering the
viability and implications of these ideas is beyond the scope of this RFC, but
they are documented here since they follow from the background we have already
discussed.

## References

[abci]: https://github.com/tendermint/tendermint/tree/master/spec/abci
[rpc-service]: https://docs.tendermint.com/master/rpc/
[light-client]: https://docs.tendermint.com/master/tendermint-core/light-client.html
[tm-cli]: https://github.com/tendermint/tendermint/tree/master/cmd/tendermint
[cosmos-sdk]: https://github.com/cosmos/cosmos-sdk/
[local-client]: https://github.com/tendermint/tendermint/blob/master/abci/client/local_client.go
[socket-server]: https://github.com/tendermint/tendermint/blob/master/abci/server/socket_server.go
[sdk-grpc]: https://pkg.go.dev/github.com/cosmos/cosmos-sdk/types/tx#ServiceServer
[json-rpc]: https://www.jsonrpc.org/specification
[abci-conn]: https://github.com/tendermint/tendermint/blob/master/spec/abci/apps.md#state
[adr-57]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-057-RPC.md