# Control Message Validation Inspector Overview

## Component Overview
The Control Message Validation Inspector (`ControlMsgValidationInspector`) is an injectable component responsible for asynchronous inspection of incoming GossipSub RPCs.
It is developed and maintained entirely within the Flow blockchain codebase and is injected into the GossipSub protocol of libp2p at node startup.
All incoming RPC messages pass through this inspector to ensure their validity and compliance with the Flow protocol semantics.

The inspector performs two primary functions:
1. **RPC truncation (blocking)**: It truncates incoming RPC messages, if needed, to prevent excessive resource consumption. This is done by sampling the messages and reducing their size to a configurable threshold.
2. **RPC inspection (aka validation) (non-blocking)**: It inspects (validates) the truncated or original RPC messages for compliance with the Flow protocol semantics. This includes validation of message structure, topic, sender, and other relevant attributes.

The figure below shows a high-level overview of the Control Message Validation Inspector and its interaction with the GossipSub protocol and the Flow node.
The blue box represents the GossipSub protocol, which handles the pub-sub messaging system and is an external dependency of the Flow node.
The green boxes represent the components of the Flow node's networking layer that are involved in the inspection and processing of incoming RPC messages.
The steps marked with an asterisk (*) are performed concurrently, while the rest are performed sequentially.
As shown in the figure, GossipSub passes an incoming RPC message to the Control Message Validation Inspector, which performs the blocking truncation and then queues the RPC for asynchronous, non-blocking inspection.
As soon as the RPC is queued for inspection, it is also handed back to the GossipSub protocol for further processing. The results of the inspection are used for internal metrics, logging, and feedback to the GossipSub scoring system.
Once GossipSub has processed the RPC, it passes the message to the libp2p node component of the Flow node's networking layer, which processes the message and forwards it to the rest of the Flow node.
Note that the validation process is non-blocking, hence even a malformed RPC is allowed to proceed through the GossipSub protocol to the Flow node.
However, based on the result of the asynchronous inspection, the message may be scored negatively, and the sender may be penalized in the peer scoring system.
The rationale is that, post truncation, as long as the RPC size is within the configured limits, a single (or a few) non-compliant RPCs do not drastically affect the system's health, so the RPCs are allowed to proceed for further processing.
What matters is the persistent behavior of the sender: the sender's reputation and future message propagation are _eventually_ affected by the inspection results.

![rpc-inspection-process.png](rpc-inspection-process.png)
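
The two-phase flow described above (truncate on the hot path, then queue for asynchronous inspection) can be sketched as follows. This is a minimal illustration, not the actual `ControlMsgValidationInspector` code: the `rpc` and `inspector` types, the `maxGrafts` limit, and the choice to skip inspection when the queue is full are all assumptions made for the example.

```go
package main

import "fmt"

// rpc is a simplified, hypothetical stand-in for a GossipSub RPC.
type rpc struct {
	from   string
	grafts []string // topic IDs carried by GRAFT control messages
}

// inspector is a hypothetical sketch of the two-phase flow.
type inspector struct {
	maxGrafts int
	queue     chan rpc // drained by asynchronous inspection workers
}

// Inspect runs on the GossipSub hot path. It truncates the RPC in place
// (blocking) and then tries to enqueue it for asynchronous inspection
// without ever waiting. It reports whether the RPC was queued.
func (i *inspector) Inspect(r *rpc) bool {
	if len(r.grafts) > i.maxGrafts {
		// For brevity this keeps a prefix; the README describes random sampling.
		r.grafts = r.grafts[:i.maxGrafts]
	}
	select {
	case i.queue <- *r:
		return true
	default:
		// Assumption: when the queue is full, inspection is skipped so that
		// GossipSub is never blocked.
		return false
	}
}

func main() {
	insp := &inspector{maxGrafts: 2, queue: make(chan rpc, 1)}
	r := &rpc{from: "peer-1", grafts: []string{"topic-a", "topic-b", "topic-c"}}
	queued := insp.Inspect(r)
	fmt.Println(len(r.grafts), queued) // 2 true
}
```

In either case the caller continues with the (possibly truncated) RPC right after `Inspect` returns, which is what keeps the validation phase non-blocking.
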

## What is an RPC?
RPC stands for Remote Procedure Call. In the context of GossipSub, it is a message that one peer sends to another over the GossipSub protocol.
The message is encoded as a protobuf message and communicates information about the state of the network, such as topic membership and message propagation.
It encapsulates the various types of messages and commands that peers exchange to implement the GossipSub protocol, a pub-sub (publish-subscribe) messaging system.
Remember that the purpose of GossipSub is to efficiently disseminate messages to interested subscribers in the network without requiring a central broker or server.
Here is what an RPC message looks like in the context of GossipSub:
```go
type RPC struct {
	Subscriptions        []*RPC_SubOpts  `protobuf:"bytes,1,rep,name=subscriptions" json:"subscriptions,omitempty"`
	Publish              []*Message      `protobuf:"bytes,2,rep,name=publish" json:"publish,omitempty"`
	Control              *ControlMessage `protobuf:"bytes,3,opt,name=control" json:"control,omitempty"`
	XXX_NoUnkeyedLiteral struct{}        `json:"-"`
	XXX_unrecognized     []byte          `json:"-"`
	XXX_sizecache        int32           `json:"-"`
}
```

Here's a breakdown of the components within GossipSub's `RPC` struct:
1. **Subscriptions (`[]*RPC_SubOpts`)**: This field contains a list of subscription options (`RPC_SubOpts`).
    Each `RPC_SubOpts` represents a peer's intent to subscribe to or unsubscribe from a topic.
    This allows peers to dynamically adjust their interest in various topics and manage their subscription list.
2. **Publish (`[]*Message`)**: The `Publish` field contains a list of messages that the peer wishes to publish (or gossip) to the network.
    Each `Message` is intended for a specific topic, and peers subscribed to that topic should receive the message.
    This field is essential for the dissemination of information and data across the network.
3. **Control (`*ControlMessage`)**: The `Control` field holds a control message, which contains the control information required for the operation of the GossipSub protocol.
    This can include information about grafting (joining a mesh for a topic), pruning (leaving a mesh),
    and other control signals related to the maintenance and optimization of the pub-sub network.
    The control messages play a crucial role in the mesh overlay maintenance, ensuring efficient and reliable message propagation.
4. **XXX Fields**: These fields (`XXX_NoUnkeyedLiteral`, `XXX_unrecognized`, and `XXX_sizecache`) are generated by the protobuf compiler and are not directly used by the GossipSub protocol.
    They are used internally by the protobuf library for purposes such as caching and ensuring correct marshalling and unmarshalling of the protobuf data.

### Closer Look at the Control Message
In GossipSub, a Control Message is part of the `RPC` structure and plays a crucial role in maintaining and optimizing the network.
It contains several fields, each corresponding to a different type of control information.
The primary purpose of these control messages is to manage the mesh overlay that underpins the GossipSub protocol,
ensuring efficient and reliable message propagation.

At their core, the control messages maintain the mesh overlay for each topic, allowing peers to join and leave the mesh as their interests and network connectivity change.
The control messages include the following types:

1. **IHAVE (`[]*ControlIHave`)**: The `IHAVE` messages are used to advertise to peers that the sender has certain messages.
    This is part of the message propagation mechanism.
    When a peer receives an `IHAVE` message and is interested in the advertised messages (because it doesn't have them yet),
    it can request those messages from the sender using an `IWANT` message.

2. **IWANT (`[]*ControlIWant`)**: The `IWANT` messages are requests sent to peers to ask for specific messages previously
    advertised in an `IHAVE` message.
    This mechanism ensures that messages propagate through the network,
    reaching interested subscribers even if they are not directly connected to the message's original publisher.

3. **GRAFT (`[]*ControlGraft`)**: The `GRAFT` messages are used to express the sender's intention to join the mesh for a specific topic.
    In GossipSub, each peer maintains a local mesh for each topic it is interested in.
    Each local mesh is a subset of the peers in the network that are interested in the same topic. The complete mesh for a topic is formed by the union of all local meshes, which must be connected to ensure efficient message propagation
    (the peer scoring ensures that the mesh is well-connected and that peers are not overloaded with messages).
    Sending a `GRAFT` message is a way to join the local mesh of a peer, indicating that the sender wants to receive and forward messages for the specific topic.

4. **PRUNE (`[]*ControlPrune`)**: Conversely, `PRUNE` messages are sent when a peer wants to leave the local mesh for a specific topic.
    This could be because the peer is no longer interested in the topic or is optimizing its network connections.
    Upon receiving a `PRUNE` message, peers remove the sender from their mesh for the specific topic.

```go
type ControlMessage struct {
	Ihave                []*ControlIHave `protobuf:"bytes,1,rep,name=ihave" json:"ihave,omitempty"`
	Iwant                []*ControlIWant `protobuf:"bytes,2,rep,name=iwant" json:"iwant,omitempty"`
	Graft                []*ControlGraft `protobuf:"bytes,3,rep,name=graft" json:"graft,omitempty"`
	Prune                []*ControlPrune `protobuf:"bytes,4,rep,name=prune" json:"prune,omitempty"`
	XXX_NoUnkeyedLiteral struct{}        `json:"-"`
	XXX_unrecognized     []byte          `json:"-"`
	XXX_sizecache        int32           `json:"-"`
}
```
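
To make the `IHAVE`/`IWANT` exchange concrete, here is a small sketch of how a receiver might decide which advertised message IDs to request. It is only an illustration under simplifying assumptions: message IDs are plain strings, the local cache is a map, and the function name `wantedIDs` is hypothetical rather than part of GossipSub or Flow.

```go
package main

import "fmt"

// wantedIDs returns the advertised message IDs that are not yet in the local
// cache, preserving their order and skipping IDs that were already requested.
func wantedIDs(seen map[string]bool, advertised []string) []string {
	var want []string
	requested := make(map[string]bool)
	for _, id := range advertised {
		if seen[id] || requested[id] {
			continue
		}
		requested[id] = true
		want = append(want, id)
	}
	return want
}

func main() {
	seen := map[string]bool{"m1": true} // messages we already hold
	// An IHAVE advertising four IDs, one of them repeated.
	fmt.Println(wantedIDs(seen, []string{"m1", "m2", "m3", "m2"})) // [m2 m3]
}
```

The returned IDs are what the receiver would place into an `IWANT` control message addressed to the peer that sent the `IHAVE`.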

## Why is RPC Inspection Necessary?
In the context of the Flow blockchain, RPC inspection is necessary for the following reasons:
1. **Security**: The inspection process mitigates potential security risks such as spamming, message replay attacks, or malicious content dissemination, and provides complementary feedback to the internal GossipSub scoring system.

2. **Resource Management**: By validating and potentially truncating incoming RPC messages, the system manages its computational and memory resources more effectively.
    This prevents resource exhaustion attacks where an adversary might attempt to overwhelm the system by sending a large volume of non-compliant or oversized messages.

3. **Metrics and Monitoring**: The inspection process provides valuable insights into the network's health and performance.
    By monitoring the incoming RPC messages, the system can collect metrics and statistics about message propagation, topic membership, and other relevant network attributes.

## RPC Truncation (Blocking)
The Control Message Validation Inspector is responsible for truncating incoming RPC messages to prevent excessive resource consumption. This is done by sampling the messages and reducing their size to a configurable threshold.
The truncation is performed entirely in a blocking manner, i.e., at the entry point of GossipSub through an injected interceptor, so the incoming RPC messages are modified before they are further processed by the GossipSub protocol.
Truncation is applied to different components of the RPC message, specifically the control message types (`GRAFT`, `PRUNE`, `IHAVE`, `IWANT`) and their respective message IDs.
It is triggered when the count of messages or message IDs exceeds the configured thresholds, ensuring that the system resources are not overwhelmed.
When the number of messages or message IDs exceeds the threshold, a random sample of messages or message IDs is selected, and the rest are discarded.
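
The sampling step above can be sketched as a shuffle followed by a cut. This is a minimal sketch, assuming items are plain strings and using the standard library's `math/rand.Shuffle`; the function name `truncate` and the threshold value are illustrative and not taken from the Flow codebase.

```go
package main

import (
	"fmt"
	"math/rand"
)

// truncate keeps a uniformly random sample of at most limit items and
// discards the rest. Inputs at or below the limit are returned unchanged.
// Note that it reorders the input slice in place.
func truncate(items []string, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	if len(items) <= limit {
		return items
	}
	rand.Shuffle(len(items), func(i, j int) {
		items[i], items[j] = items[j], items[i]
	})
	return items[:limit]
}

func main() {
	grafts := []string{"topic-a", "topic-b", "topic-c", "topic-d", "topic-e"}
	kept := truncate(grafts, 3)
	fmt.Println(len(kept)) // 3
}
```

Sampling at random, rather than keeping a fixed prefix, means a sender cannot choose which of its entries survive truncation by ordering them.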

### Message vs Message ID Truncation
In the context of GossipSub RPC inspection, there is a subtle distinction between the count of messages and the count of message IDs:

1. **Count of Messages:**
    - This refers to the number of control messages (like `GRAFT`, `PRUNE`, `IHAVE`, `IWANT`) that are part of the `ControlMessage` structure within an RPC message, i.e., the length of the `Graft`, `Prune`, `Ihave`, and `Iwant` slice fields.
    - Each control message type serves a different purpose in the GossipSub protocol (e.g., `GRAFT` for joining a mesh for a topic, `PRUNE` for leaving a mesh).
    - When we talk about the "count of messages," we're referring to how many individual control messages of each type are included in the RPC.
    - Truncation based on the count of messages ensures that the number of control messages of each type doesn't exceed a configured threshold, so the receiving peer is not overwhelmed with too many control instructions at once.

2. **Count of Message IDs:**
    - This refers to the number of unique identifiers for actual published messages that are referenced within control messages like `IHAVE` and `IWANT`.
    - `IHAVE` messages contain IDs of messages that the sender has and is announcing to peers. `IWANT` messages contain IDs of messages that the sender wants from peers.
    - Each _individual_ `IHAVE` or `IWANT` control message can reference multiple message IDs. The "count of message IDs" is the total number of such IDs contained within each `IHAVE` or `IWANT` control message.
    - Truncation based on the count of message IDs ensures that each `IHAVE` or `IWANT` control message doesn't reference an excessively large number of messages. This prevents a scenario where a peer is asked to process an overwhelming number of message requests at once, which could lead to resource exhaustion.
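
The distinction between the two counts can be illustrated with a two-level truncation of `IHAVE` messages: first the number of control messages, then the number of message IDs inside each remaining one. This is a sketch under assumptions: the `iHave` and `limits` types, the field names, and the `sample` helper are hypothetical simplifications, and the code requires Go 1.18+ for generics.

```go
package main

import (
	"fmt"
	"math/rand"
)

// iHave is a simplified, hypothetical stand-in for a ControlIHave message.
type iHave struct {
	topic      string
	messageIDs []string
}

// limits holds the two hypothetical thresholds discussed above.
type limits struct {
	maxMessages   int // max number of IHAVE control messages per RPC
	maxMessageIDs int // max number of message IDs per IHAVE message
}

// sample keeps a random subset of at most limit items (reorders in place).
func sample[T any](items []T, limit int) []T {
	if limit < 0 {
		limit = 0
	}
	if len(items) <= limit {
		return items
	}
	rand.Shuffle(len(items), func(i, j int) {
		items[i], items[j] = items[j], items[i]
	})
	return items[:limit]
}

// truncateIHaves applies message-count truncation first, then message-ID
// truncation within each remaining IHAVE message.
func truncateIHaves(ihaves []iHave, l limits) []iHave {
	ihaves = sample(ihaves, l.maxMessages)
	for i := range ihaves {
		ihaves[i].messageIDs = sample(ihaves[i].messageIDs, l.maxMessageIDs)
	}
	return ihaves
}

func main() {
	var ihaves []iHave
	for t := 0; t < 5; t++ {
		ihaves = append(ihaves, iHave{
			topic:      fmt.Sprintf("topic-%d", t),
			messageIDs: []string{"id-1", "id-2", "id-3", "id-4"},
		})
	}
	out := truncateIHaves(ihaves, limits{maxMessages: 2, maxMessageIDs: 3})
	fmt.Println(len(out), len(out[0].messageIDs)) // 2 3
}
```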

## RPC Validation (Non-Blocking)
The Control Message Validation Inspector is also responsible for inspecting the truncated or original RPC messages for compliance with the Flow protocol semantics.
The inspection is done post truncation and is entirely non-blocking, i.e., it does not prevent the further processing of the RPC messages by the GossipSub protocol.
In other words, the RPC messages are passed on to the GossipSub protocol after truncation, regardless of whether they pass the inspection or not.
At the same time, each incoming RPC message is queued for asynchronous inspection, and the results of the inspection are used for internal metrics, logging, and feedback to the GossipSub scoring system.
This means that even a non-compliant RPC message is allowed to proceed through the GossipSub protocol to the Flow node. However, based on the result of the asynchronous inspection,
the message may be scored negatively, and the sender may be penalized in the peer scoring system. Hence, its future messages may be de-prioritized or ignored by the GossipSub protocol.
This follows the principle that, post truncation, as long as the RPC size is within the configured limits, a single (or a few) non-compliant RPCs do not drastically affect the system's health,
so the RPCs are allowed to proceed for further processing. However, the sender's reputation and future message propagation are affected by the inspection results.

The queued RPCs are picked up by a pool of workers, and the inspection is performed in parallel with the GossipSub protocol's processing of the RPC messages.
Each RPC message is inspected for the attributes listed below sequentially, and as soon as a non-compliance is detected, the inspection process is terminated with a failure result. A failure result
causes an _invalid control message notification_ (`p2p.InvCtrlMsgNotif`) to be sent to the `GossipSubAppSpecificScoreRegistry`, which uses it to penalize the sender in the peer scoring system.
The `GossipSubAppSpecificScoreRegistry` is a Flow-level component that decides part of each peer's score based on its Flow-specific behavior.
It directly provides feedback to the GossipSub protocol for scoring the peers.
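
The worker-pool arrangement described above can be sketched as follows. This is a simplified illustration and not the Flow implementation: the `rpc` and `notification` types, the single "unknown topic" check, and the `notify` callback (standing in for delivery of a `p2p.InvCtrlMsgNotif` to the score registry) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// rpc is a simplified, hypothetical view of a queued RPC.
type rpc struct {
	from   string
	topics []string
}

// notification is a hypothetical stand-in for an invalid control message
// notification.
type notification struct {
	peer   string
	reason string
}

// inspect returns a non-empty reason at the first non-compliance it finds.
func inspect(r rpc, known map[string]bool) string {
	for _, t := range r.topics {
		if !known[t] {
			return "unknown topic " + t
		}
	}
	return ""
}

// runWorkers starts n workers that drain the queue concurrently. Every failed
// inspection is reported through notify. It returns once the queue has been
// closed and fully drained.
func runWorkers(n int, queue <-chan rpc, known map[string]bool, notify func(notification)) {
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range queue {
				if reason := inspect(r, known); reason != "" {
					notify(notification{peer: r.from, reason: reason})
				}
			}
		}()
	}
	wg.Wait()
}

func main() {
	queue := make(chan rpc, 2)
	queue <- rpc{from: "peer-1", topics: []string{"blocks"}}
	queue <- rpc{from: "peer-2", topics: []string{"bogus"}}
	close(queue)

	var mu sync.Mutex // notify may be called from several workers at once
	var notifs []notification
	runWorkers(2, queue, map[string]bool{"blocks": true}, func(n notification) {
		mu.Lock()
		defer mu.Unlock()
		notifs = append(notifs, n)
	})
	fmt.Println(len(notifs), notifs[0].peer) // 1 peer-2
}
```

Because the workers only read from the queue, the producer side (the blocking truncation path) never waits on the outcome of an inspection.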

The [order of inspections for a single RPC](https://github.com/onflow/flow-go/blob/master/network/p2p/inspector/validation/control_message_validation_inspector.go#L270-L323) is as follows. Note that in the
descriptions below, when we say that an RPC is flagged as invalid or that the inspection process is terminated with a failure result, it means that an _invalid control message notification_ is sent to the `GossipSubAppSpecificScoreRegistry`, which
then uses it to penalize the sender in the peer scoring system.
1. `GRAFT` messages validation: Each RPC may contain one or more `GRAFT` messages. Each `GRAFT` message contains a topic ID indicating the mesh the peer wants to join.
    The validation process iterates through each `GRAFT` message received in the (potentially truncated) RPC.
    For each `GRAFT` message, the topic ID is validated to ensure it corresponds to a valid and recognized topic within the Flow network.
    Topic validation might involve checking if the topic is known, if it's within the scope of the peer's interests or subscriptions, and if it aligns with the network's current configuration (e.g., checking against the active spork ID).
    If the topic is cluster-prefixed, additional validations ensure that the topic is part of the active cluster IDs.
    If even one topic ID is invalid or unrecognized, the `GRAFT` message is flagged as invalid, and the inspection process is terminated with a failure result.
    In the future we may relax this condition to allow for a certain number of invalid topics, but for now, a single invalid topic results in a failure.
    The inspection process also keeps track of the topics seen in the `GRAFT` messages of the same RPC.
    If a topic is repeated (i.e., if there are duplicate topics in the `GRAFT` messages of the same RPC), this is usually a sign of a protocol violation or misbehavior.
    The validation process counts these duplicates and, if the number exceeds a certain threshold, flags the RPC message as invalid and terminates the inspection process with a failure result.
    Note that all `GRAFT` messages in the same (potentially truncated) RPC are validated together, without any sampling, as the number of `GRAFT` messages is assumed to be small, and validating
    them is not assumed to be resource-intensive.
2. `PRUNE` messages validation: Similar to `GRAFT`s, each RPC may contain one or more `PRUNE` messages. Each `PRUNE` message contains a topic ID indicating the mesh the peer wants to leave.
    The validation process iterates through each `PRUNE` message received in the (potentially truncated) RPC.
    For each `PRUNE` message, the topic ID is validated to ensure it corresponds to a valid and recognized topic within the Flow network.
    Topic validation might involve checking if the topic is known, if it's within the scope of the peer's interests or subscriptions, and if it aligns with the network's current configuration (e.g., checking against the active spork ID).
    If the topic is cluster-prefixed, additional validations ensure that the topic is part of the active cluster IDs.
    If even one topic ID is invalid or unrecognized, the `PRUNE` message is flagged as invalid, and the inspection process is terminated with a failure result.
    In the future we may relax this condition to allow for a certain number of invalid topics, but for now, a single invalid topic results in a failure.
    The inspection process also keeps track of the topics seen in the `PRUNE` messages of the same RPC.
    If a topic is repeated (i.e., if there are duplicate topics in the `PRUNE` messages of the same RPC), this is usually a sign of a protocol violation or misbehavior.
    The validation process counts these duplicates and, if the number exceeds a certain threshold, flags the RPC message as invalid and terminates the inspection process with a failure result.
    Note that all `PRUNE` messages in the same (potentially truncated) RPC are validated together, without any sampling, as the number of `PRUNE` messages is assumed to be small, and validating
    them is not assumed to be resource-intensive.
3. `IWANT` messages validation: Each RPC may contain one or more `IWANT` messages. Each `IWANT` message contains a list of message IDs that the sender wants from the receiver in response to an `IHAVE` message.
    The validation process iterates through each `IWANT` message received in the (potentially truncated) RPC.
    For each `IWANT` message, the message IDs are validated to ensure that each one corresponds to a message ID that the local node recently advertised to the sender in an `IHAVE` message.
    We define an `IWANT` cache miss as the event in which an `IWANT` message ID does not correspond to a recently advertised `IHAVE` message ID.
    When the number of `IWANT` cache misses exceeds a certain threshold, the `IWANT` message is flagged as invalid, and the inspection process is terminated with a failure result.
    The inspection process also keeps track of the message IDs seen in the `IWANT` messages of the same RPC.
    If a message ID is repeated (i.e., if there are duplicate message IDs in the `IWANT` messages of the same RPC), this is usually a sign of a protocol violation or misbehavior.
    The validation process counts these duplicates and, if the number exceeds a certain threshold, flags the RPC message as invalid and terminates the inspection process with a failure result.
    Note that all `IWANT` messages in the same (potentially truncated) RPC are validated together, without any sampling, as the number of `IWANT` messages is assumed to be small, and validating
    them is not assumed to be resource-intensive.
4. `IHAVE` messages validation: Each RPC may contain one or more `IHAVE` messages. Each `IHAVE` message contains a list of message IDs that the sender has and is advertising to the receiver.
    The validation process iterates through each `IHAVE` message received in the (potentially truncated) RPC.
    Each `IHAVE` message is composed of a topic ID as well as the list of message IDs advertised for that topic.
    Each topic ID is validated to ensure it corresponds to a valid and recognized topic within the Flow network.
    Topic validation might involve checking if the topic is known, if it's within the scope of the peer's interests or subscriptions, and if it aligns with the network's current configuration (e.g., checking against the active spork ID).
    If the topic is cluster-prefixed, additional validations ensure that the topic is part of the active cluster IDs.
    If even one topic ID is invalid or unrecognized, the `IHAVE` message is flagged as invalid, and the inspection process is terminated with a failure result.
    The inspection process also keeps track of the topics seen in the `IHAVE` messages of the same RPC. When a topic is repeated (i.e., if there are duplicate topics in the `IHAVE` messages of the same RPC), this is usually a sign of a protocol violation or misbehavior.
    The validation process counts these duplicates and, if the number exceeds a certain threshold, flags the RPC message as invalid and terminates the inspection process with a failure result.
    The message IDs advertised in the `IHAVE` messages are also validated to ensure there are no duplicates. When a message ID is repeated (i.e., if there are duplicate message IDs in the `IHAVE` messages of the same RPC), this is usually a sign of a protocol violation or misbehavior.
    The validation process counts these duplicates and, if the number exceeds a certain threshold, flags the RPC message as invalid and terminates the inspection process with a failure result.
    Note that all `IHAVE` messages in the same (potentially truncated) RPC are validated together, without any sampling, as the number of `IHAVE` messages is assumed to be small, and validating
    them is not assumed to be resource-intensive.
5. `Publish` messages validation: Each RPC contains a list of `Publish` messages that are intended to be gossiped to the network.
    To validate the `Publish` messages of an RPC, the inspector samples a subset of the `Publish` messages in the (potentially truncated) RPC and validates each sampled message for compliance with the Flow protocol semantics.
    Sampling avoids adding excessive computational overhead to the inspection process, as the number of `Publish` messages in an RPC can be large, and validating each message can be resource-intensive.
    The validation of each `Publish` message checks (1) whether the sender is a valid (staked) Flow node,
    (2) whether the topic ID is valid based on the Flow protocol semantics, and (3) whether the local peer has a valid subscription to the topic.
    Failure in any of these steps results in a validation error for the `Publish` message.
    However, a validation error for a single `Publish` message does not cause the inspection process to terminate with a failure result for the entire RPC.
    Rather, the inspection process continues to validate the rest of the sampled `Publish` messages.
    Once the entire sample is validated, the inspection process terminates with success if the number of validation errors is within a certain threshold.
    Otherwise, when the number of validation errors exceeds the threshold, the inspection process terminates with a failure result, which
    causes an _invalid control message notification_ to be sent to the `GossipSubAppSpecificScoreRegistry`, which then uses it to penalize the sender in the peer scoring system.
    As this is the last step in the inspection process, an RPC that reaches this step has passed all the previous inspections and is only being validated for its `Publish` messages.
    Hence, the result of this step determines the final result of the inspection process.
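
The fail-fast ordering of the steps above can be sketched as a chain of checks that stops at the first failure. This is a deliberately reduced illustration, not the Flow implementation: the `rpc` and `config` types and all threshold names are hypothetical, only topics are modelled, topic validity is reduced to a lookup in a set of known topics, and the `IWANT` and `IHAVE` steps are omitted for brevity.

```go
package main

import "fmt"

// rpc is a simplified, hypothetical view of an RPC: only topics are modelled.
type rpc struct {
	graftTopics   []string
	pruneTopics   []string
	publishTopics []string // assumed to be already sampled
}

// config holds hypothetical thresholds for the checks below.
type config struct {
	knownTopics         map[string]bool
	duplicateThreshold  int // tolerated duplicate topics per control message type
	publishErrThreshold int // tolerated invalid Publish messages
}

// validateTopics fails on the first unknown topic, or as soon as the number
// of duplicate topics exceeds the threshold.
func validateTopics(kind string, topics []string, c config) error {
	seen := make(map[string]bool)
	duplicates := 0
	for _, t := range topics {
		if !c.knownTopics[t] {
			return fmt.Errorf("%s: unknown topic %q", kind, t)
		}
		if seen[t] {
			duplicates++
			if duplicates > c.duplicateThreshold {
				return fmt.Errorf("%s: %d duplicate topics exceed threshold %d",
					kind, duplicates, c.duplicateThreshold)
			}
		}
		seen[t] = true
	}
	return nil
}

// validatePublish tolerates individual errors and fails only when their
// total exceeds the threshold.
func validatePublish(topics []string, c config) error {
	errCount := 0
	for _, t := range topics {
		if !c.knownTopics[t] {
			errCount++
		}
	}
	if errCount > c.publishErrThreshold {
		return fmt.Errorf("Publish: %d errors exceed threshold %d",
			errCount, c.publishErrThreshold)
	}
	return nil
}

// inspect runs the checks in order and stops at the first failure.
func inspect(r rpc, c config) error {
	if err := validateTopics("GRAFT", r.graftTopics, c); err != nil {
		return err
	}
	if err := validateTopics("PRUNE", r.pruneTopics, c); err != nil {
		return err
	}
	return validatePublish(r.publishTopics, c)
}

func main() {
	c := config{
		knownTopics:         map[string]bool{"blocks": true, "votes": true},
		duplicateThreshold:  1,
		publishErrThreshold: 1,
	}
	fmt.Println(inspect(rpc{graftTopics: []string{"blocks", "votes"}}, c))
	fmt.Println(inspect(rpc{graftTopics: []string{"bogus"}}, c))
}
```

A non-nil error from `inspect` corresponds to the point at which an invalid control message notification would be raised for the sender.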