## PSS test failures explanation

This document aims to explain the changes in https://github.com/ethersphere/go-ethereum/pull/126 and how those changes affect the pss_test.go TestNetwork tests.

### Problem

When running the TestNetwork test, execution sometimes:

- deadlocks
- panics
- fails with a wrong result, such as:

```
$ go test -v ./swarm/pss -cpu 4 -run TestNetwork
```

```
--- FAIL: TestNetwork (68.13s)
    --- FAIL: TestNetwork/3/10/4/sim (68.13s)
        pss_test.go:697: 7 of 10 messages received
        pss_test.go:700: 3 messages were not received
FAIL
```

Moreover, execution almost always deadlocks with the `sim` adapter and with the `sock` adapter (when its buffer is low), but is mostly stable with the `exec` and `tcp` adapters.

### Findings and Fixes

#### 1. Addressing panics

Panics were caused by concurrent map reads/writes and unsynchronised access to shared memory by multiple goroutines. This is visible when running the test with the `-race` flag.

```
go test -race -v ./swarm/pss -cpu 4 -run TestNetwork

1 ==================
2 WARNING: DATA RACE
3 Read at 0x00c424d456a0 by goroutine 1089:
4   github.com/alexdevranger/node-1.8.27/swarm/pss.(*Pss).forward.func1()
5       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/swarm/pss/pss.go:654 +0x44f
6   github.com/alexdevranger/node-1.8.27/swarm/network.(*Kademlia).eachConn.func1()
7       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/swarm/network/kademlia.go:350 +0xc9
8   github.com/alexdevranger/node-1.8.27/pot.(*Pot).eachNeighbour.func1()
9       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/pot/pot.go:599 +0x59
...

28
29 Previous write at 0x00c424d456a0 by goroutine 829:
30   github.com/alexdevranger/node-1.8.27/swarm/pss.(*Pss).Run()
31       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/swarm/pss/pss.go:192 +0x16a
32   github.com/alexdevranger/node-1.8.27/swarm/pss.(*Pss).Run-fm()
33       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/swarm/pss/pss.go:185 +0x63
34   github.com/alexdevranger/node-1.8.27/p2p.(*Peer).startProtocols.func1()
35       /Users/nonsense/code/src/github.com/alexdevranger/node-1.8.27/p2p/peer.go:347 +0x8b
...
```

##### Current solution

Add a mutex around all shared data.

#### 2. Failures with wrong results

The validation phase of the TestNetwork test is done using an RPC subscription:

```
...
triggerChecks := func(trigger chan enode.ID, id enode.ID, rpcclient *rpc.Client) error {
    msgC := make(chan APIMsg)
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    sub, err := rpcclient.Subscribe(ctx, "pss", msgC, "receive", hextopic)
...
```

By design, the RPC client uses a per-subscription buffer with a maximum length. When this length is reached, the subscription is dropped. The current config value is not suitable for stress tests.

##### Current solution

Increase the max length of the RPC subscription buffer.

```
const (
    // Subscriptions are removed when the subscriber cannot keep up.
    //
    // This can be worked around by supplying a channel with sufficiently sized buffer,
    // but this can be inconvenient and hard to explain in the docs. Another issue with
    // buffered channels is that the buffer is static even though it might not be needed
    // most of the time.
    //
    // The approach taken here is to maintain a per-subscription linked-list buffer that
    // shrinks on demand. If the buffer reaches the size below, the subscription is
    // dropped.
    maxClientSubscriptionBuffer = 20000
)
```
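For context, the sketch below shows the consumer side of such a subscription. It is not the test code itself: the `consume` function, the simplified `APIMsg` fields, and the import path (which follows the fork layout seen in the stack traces above) are illustrative stand-ins. It shows why the buffer size matters under stress: every notification not yet read from `msgC` sits in the client's per-subscription buffer, and once that buffer overflows, the subscription is closed and the reason arrives on `sub.Err()`.

```go
package pssdemo

import (
	"context"

	"github.com/alexdevranger/node-1.8.27/rpc"
)

// APIMsg is a simplified stand-in for the pss API message type received by the test.
type APIMsg struct {
	Msg []byte
}

// consume subscribes to pss "receive" notifications and drains them.
// If this loop falls behind, undelivered notifications accumulate in the RPC
// client's per-subscription buffer; once maxClientSubscriptionBuffer is
// exceeded, the subscription is dropped and the error shows up on sub.Err().
func consume(client *rpc.Client, hextopic string) error {
	msgC := make(chan APIMsg)
	sub, err := client.Subscribe(context.Background(), "pss", msgC, "receive", hextopic)
	if err != nil {
		return err
	}
	defer sub.Unsubscribe()

	for {
		select {
		case msg := <-msgC:
			_ = msg // slow handling here is what fills the buffer during stress tests
		case err := <-sub.Err():
			return err // subscription was dropped, e.g. because the buffer overflowed
		}
	}
}
```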
#### 3. Deadlocks

Deadlocks are triggered when using:

- `sim` adapter - synchronous, unbuffered channel
- `sock` adapter - asynchronous, buffered channel (when using a 1K buffer)

No deadlocks were triggered when using:

- `tcp` adapter - asynchronous, buffered channel
- `exec` adapter - asynchronous, buffered channel

Ultimately the deadlocks happen due to the blocking `pp.Send()` call at:

```
// attempt to send the message
err := pp.Send(msg)
if err != nil {
    log.Debug(fmt.Sprintf("%v: failed forwarding: %v", sendMsg, err))
    return true
}
```

`p2p` request handling is synchronous (as discussed in https://github.com/ethersphere/go-ethereum/issues/130), and `pss` is also synchronous; therefore, if two nodes are each processing a request while at the same time waiting for a response on `pp.Send(msg)`, a deadlock occurs.

`pp.Send(msg)` only blocks when the underlying adapter is blocking (read: `sim` or `sock`) or when the buffer of the connection is full.

##### Current solution

Make no assumptions about the underlying connection, and call `pp.Send` asynchronously in a goroutine, as sketched below.

Alternatively, get rid of the `sim` and `sock` adapters, and use the `tcp` adapter for testing.
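The following is a minimal, self-contained sketch of the asynchronous send, not the actual patch: `sender` is a stand-in interface for the peer type behind `pp` in the snippet above, `forwardAsync` is a made-up name, and the import path follows the fork layout seen in the stack traces. It also shows the trade-off: once the send runs in its own goroutine, a failure can only be logged; it can no longer make the enclosing `eachConn` callback return `true` and try the next peer.

```go
package pssdemo

import (
	"fmt"

	"github.com/alexdevranger/node-1.8.27/log"
)

// sender captures just the part of the peer API that the forward loop uses,
// mirroring the pp.Send(msg) call in the snippet above.
type sender interface {
	Send(msg interface{}) error
}

// forwardAsync moves the potentially blocking Send into its own goroutine, so
// the Kademlia eachConn iteration never blocks on a peer that is itself
// waiting on us. The error can only be logged here, not acted upon.
func forwardAsync(pp sender, msg interface{}, sendMsg string) {
	go func() {
		if err := pp.Send(msg); err != nil {
			log.Debug(fmt.Sprintf("%v: failed forwarding: %v", sendMsg, err))
		}
	}()
}
```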