github.com/alexdevranger/node-1.8.27@v0.0.0-20221128213301-aa5841e41d2d/swarm/pss/writeup.md

github.com/alexdevranger/node-1.8.27@v0.0.0-20221128213301-aa5841e41d2d/swarm/pss/writeup.md (about)

     1  ## PSS tests failures explanation
     2  
     3  This document aims to explain the changes in https://github.com/ethersphere/go-ethereum/pull/126 and how those changes affect the pss_test.go TestNetwork tests.
     4  
     5  ### Problem
     6  
     7  When running the TestNetwork test, execution sometimes:
     8  
     9  - deadlocks
    10  - panics
    11  - failures with wrong result, such as:
    12  
    13  ```
    14  $ go test -v ./swarm/pss -cpu 4 -run TestNetwork
    15  ```
    16  
    17  ```
    18  --- FAIL: TestNetwork (68.13s)
    19      --- FAIL: TestNetwork/3/10/4/sim (68.13s)
    20          pss_test.go:697: 7 of 10 messages received
    21          pss_test.go:700: 3 messages were not received
    22  FAIL
    23  ```
    24  
    25  Moreover execution almost always deadlocks with `sim` adapter, and `sock` adapter (when buffer is low), but is mostly stable with `exec` and `tcp` adapters.
    26  
    27  ### Findings and Fixes
    28  
    29  #### 1. Addressing panics
    30  
    31  Panics were caused due to concurrent map read/writes and unsynchronised access to shared memory by multiple goroutines. This is visible when running the test with the `-race` flag.
    32  
    33  ```
    34  go test -race -v ./swarm/pss -cpu 4 -run TestNetwork
    35  
    36    1 ==================
    37    2 WARNING: DATA RACE
    38    3 Read at 0x00c424d456a0 by goroutine 1089:
    39    4   github.com/alexdevranger/node-1.8.27/swarm/pss.(*Pss).forward.func1()
    40    5       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/swarm/pss/pss.go:654 +0x44f
    41    6   github.com/alexdevranger/node-1.8.27/swarm/network.(*Kademlia).eachConn.func1()
    42    7       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/swarm/network/kademlia.go:350 +0xc9
    43    8   github.com/alexdevranger/node-1.8.27/pot.(*Pot).eachNeighbour.func1()
    44    9       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/pot/pot.go:599 +0x59
    45    ...
    46  
    47   28
    48   29 Previous write at 0x00c424d456a0 by goroutine 829:
    49   30   github.com/alexdevranger/node-1.8.27/swarm/pss.(*Pss).Run()
    50   31       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/swarm/pss/pss.go:192 +0x16a
    51   32   github.com/alexdevranger/node-1.8.27/swarm/pss.(*Pss).Run-fm()
    52   33       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/swarm/pss/pss.go:185 +0x63
    53   34   github.com/alexdevranger/node-1.8.27/p2p.(*Peer).startProtocols.func1()
    54   35       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/p2p/peer.go:347 +0x8b
    55   ...
    56  ```
    57  
    58  ##### Current solution
    59  
    60  Adding a mutex around all shared data.
    61  
    62  #### 2. Failures with wrong result
    63  
    64  The validation phase of the TestNetwork test is done using an RPC subscription:
    65  
    66  ```
    67      ...
    68  	triggerChecks := func(trigger chan enode.ID, id enode.ID, rpcclient *rpc.Client) error {
    69  		msgC := make(chan APIMsg)
    70  		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    71  		defer cancel()
    72  		sub, err := rpcclient.Subscribe(ctx, "pss", msgC, "receive", hextopic)
    73  		...
    74  ```
    75  
    76  By design the RPC uses a subscription buffer with a max length. When this length is reached, the subscription is dropped. The current config value is not suitable for stress tests.
    77  
    78  ##### Current solution
    79  
    80  Increase the max length of the RPC subscription buffer.
    81  
    82  ```
    83  const (
    84  	// Subscriptions are removed when the subscriber cannot keep up.
    85  	//
    86  	// This can be worked around by supplying a channel with sufficiently sized buffer,
    87  	// but this can be inconvenient and hard to explain in the docs. Another issue with
    88  	// buffered channels is that the buffer is static even though it might not be needed
    89  	// most of the time.
    90  	//
    91  	// The approach taken here is to maintain a per-subscription linked list buffer
    92  	// shrinks on demand. If the buffer reaches the size below, the subscription is
    93  	// dropped.
    94  	maxClientSubscriptionBuffer = 20000
    95  )
    96  ```
    97  
    98  #### 3. Deadlocks
    99  
   100  Deadlocks are triggered when using:
   101  
   102  - `sim` adapter - synchronous, unbuffered channel
   103  - `sock` adapter - asynchronous, buffered channel (when using a 1K buffer)
   104  
   105  No deadlocks were triggered when using:
   106  
   107  - `tcp` adapter - asynchronous, buffered channel
   108  - `exec` adapter - asynchronous, buffered channel
   109  
   110  Ultimately the deadlocks happen due to blocking `pp.Send()` call at:
   111  
   112  // attempt to send the message
   113  err := pp.Send(msg)
   114  if err != nil {
   115  log.Debug(fmt.Sprintf("%v: failed forwarding: %v", sendMsg, err))
   116  return true
   117  }
   118  
   119  `p2p` request handling is synchronous (as discussed at https://github.com/ethersphere/go-ethereum/issues/130), `pss` is also synchronous, therefore if two nodes happen to be processing a request, while at the same time waiting for response on `pp.Send(msg)`, deadlock occurs.
   120  
   121  `pp.Send(msg)` is only blocking when the underlying adapter is blocking (read `sim` or `sock`) or the buffer of the connection is full.
   122  
   123  ##### Current solution
   124  
   125  Make no assumption on the undelying connection, and call `pp.Send` asynchronously in a go-routine.
   126  
   127  Alternatively, get rid of the `sim` and `sock` adapters, and use `tcp` adapter for testing.