
# Snapshot Engine

The snapshot engine is responsible for collecting all in-memory state that exists across all the other engines/services. The saved state can then be used to restore a node to a particular block height by propagating this state back into each engine. This can be done either using a local snapshot, where an existing node is restarted, or via a network snapshot, where a new node joins and is gifted a snapshot by other nodes in the network.

Each engine that needs to save state registers itself with the snapshot engine via a call to `engine.AddProviders()` and exposes itself through the following interface:

```go
type StateProvider interface {
	Namespace() SnapshotNamespace
	Keys() []string
	GetState(key string) ([]byte, []StateProvider, error)
	LoadState(ctx context.Context, pl *Payload) ([]StateProvider, error)
	Stopped() bool
}
```

Then, at every snapshot-block, the snapshot engine collects all the state from each registered provider and saves it to disk.

## Identifying state to snapshot

When we talk about an engine's state we mean "fields in an engine's data structure". More specifically, fields that hold data that exists across multiple blocks.

For example, if we had an engine that looked like this:

```go
type SomeEngine struct {

    cfg  Config
    log *logging.Logger

    // track orders, id -> order
    orders map[string]*types.Order

    // registered callbacks for whenever something happens
    cbs map[string]func() error
}
```

The important field that needs saving into a snapshot is `orders`. The fields `cfg` and `log` are only configuration fields and so are not relevant. For `cbs`, the onus is on the subscriber to re-register their callback when they restore from a snapshot, so _this_ engine's snapshot need not worry about it.

### Gotcha 1: Cannot include validator-only state in a snapshot

Given that both validator and non-validator nodes take snapshots, and the hash of a snapshot is included in the commit-hash, if any state that is only present in _validator_ nodes is added to a snapshot then all non-validator nodes will fall out of consensus.

An example of this is the Ethereum-event-forwarder, which only validator nodes run. The state it contains is the Ethereum block-height that was last checked for events, but we cannot save this into a snapshot. We instead handle it by saving the Ethereum block-height of the last `ChainEvent` sent into core, which is a transaction all node types will see. This will not be the last Ethereum block-height checked, but it is good enough.

### Gotcha 2: Cannot include single-node state in a snapshot

This is similar to gotcha 1 but worth mentioning explicitly. Some engines that send validator-commands back into the network keep track of whether they should retry, based on whether sending in that transaction was successful. This state is personal to an individual node and cannot be saved to the snapshot: it would cause a different snapshot hash from every other node, and the node would fall out of consensus.

For example, the notary engine keeps track of whether the node needs to try sending in a node-signature again. The notary engine also keeps track of which nodes it has received signatures from. Therefore, when restoring from a snapshot, its retry-state can be repopulated indirectly based on whether the node's own signature is in the set of received signatures. If it is not there, then it needs to retry.
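
The notary example can be sketched as follows; this is not the real notary engine API, just a hypothetical illustration of deriving the node-local retry flag from snapshotted, network-wide state:

```go
package main

import "fmt"

// notaryState is a hypothetical sketch of restored notary state: the set of
// node IDs whose signatures have been received IS saved in the snapshot,
// while the per-node retry flag is NOT.
type notaryState struct {
	receivedSigs map[string]bool // node ID -> signature received
}

// needsRetry derives the node-local retry flag from the shared, snapshotted
// state: if our own signature is not in the received set, we must resend it.
func (n *notaryState) needsRetry(selfID string) bool {
	return !n.receivedSigs[selfID]
}

func main() {
	restored := &notaryState{receivedSigs: map[string]bool{"node-a": true}}
	fmt.Println(restored.needsRetry("node-a")) // false: signature already seen
	fmt.Println(restored.needsRetry("node-b")) // true: must resend
}
```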

### Gotcha 3: Remember to re-register callbacks between engines

The links between engines, whether that be the registration of callbacks or other things, are not state that can be saved into a snapshot, but should be restored via re-registration when loading from a snapshot.

For example, a market subscribes to oracles for both termination and settlement. When a market is restored from a snapshot it must re-subscribe to those oracles, but must do so depending on the market's restored state, e.g. a terminated market should only re-subscribe to the settlement oracle.
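
A sketch of this restore-time re-subscription logic; all names here are hypothetical, not the real market engine API:

```go
package main

import "fmt"

// marketStatus is a hypothetical restored market state.
type marketStatus int

const (
	statusActive marketStatus = iota
	statusTerminated
)

// resubscribe returns which oracle subscriptions to re-register after a
// snapshot restore: a terminated market has already consumed its termination
// oracle, so it only re-subscribes to settlement.
func resubscribe(status marketStatus) []string {
	if status == statusTerminated {
		return []string{"settlement"}
	}
	return []string{"termination", "settlement"}
}

func main() {
	fmt.Println(resubscribe(statusActive))     // both subscriptions restored
	fmt.Println(resubscribe(statusTerminated)) // settlement only
}
```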

### Gotcha 4: Trying to use complex logic to deduce whether a field needs adding to the snapshot

It's not worth it. Your assessment is probably wrong and will result in a horrible bug that presents itself 5 weeks later at the worst possible moment. Unless it is _plainly obvious_ that a field has a lifetime of less than a block, or it can _trivially_ be derived from another field, just add it to the snapshot.

## Snapshot tests

Snapshot testing is in a good place. We have lots of layers that check for particular types of issues. The flavours of snapshot tests that exist today are:

- Unit-tests
- System-tests
- Snapshot soak tests
- Snapshot pipeline

### Unit-tests

Each engine that is a snapshot provider should have unit-tests that verify the roundtrip of saving and restoring the snapshot state.

Writing an effective unit-test for an engine's snapshot involves checking three things:

- completeness: all fields are saved and restored identically
- determinism: the same state serialises to the same bytes, always
- engine connections: subscriptions/callbacks to other engines are re-established

#### Testing completeness

The best way to check for completeness is to do the following:

- create an engine with some state
- call `.GetState()` to get the serialised state `b1`
- create a second engine and load in `b1`
- assert that all the fields in both engines are equal, e.g. `assert.Equal(t, eng1.GetOrders(), eng2.GetOrders())`

#### Testing determinism

The best way to check for determinism is to do the following:

- create an engine with some state
- call `.GetState()` to get the serialised state `b1`
- create a second engine and load in `b1`
- call `.GetState()` on the second engine to get `b2`
- assert that `b1 == b2`

The main cause of non-determinism is converting a `map -> slice`. Given that maps are unordered, the resultant slice must be sorted by the map's keys for it to serialise to the same byte string. This is why checking for completeness is not a sufficient test for determinism: the map will still restore exactly even though the serialised bytes can differ. Equally, checking that the snapshot is deterministic is not a sufficient test for completeness. For example, if we had a field of type `time.Time{}` and saved its `t.Unix()` value in the snapshot, the snapshot would be reliably deterministic, but the restored value would have lost the nanoseconds and not be identical to before.
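
The map-ordering pitfall can be demonstrated in isolation. A minimal sketch (not the real serialisation code) of sorting a map's keys before serialising, so the bytes are stable across calls:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// serialise converts a map to a slice of key/value pairs sorted by key, then
// marshals it. Without the sort, the slice order would follow Go's randomised
// map iteration order and the bytes could differ from run to run.
func serialise(balances map[string]int) ([]byte, error) {
	keys := make([]string, 0, len(balances))
	for k := range balances {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	type pair struct {
		Key   string `json:"key"`
		Value int    `json:"value"`
	}
	pairs := make([]pair, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, pair{k, balances[k]})
	}
	return json.Marshal(pairs)
}

func main() {
	m := map[string]int{"bob": 2, "alice": 1, "carol": 3}
	b1, _ := serialise(m)
	b2, _ := serialise(m)
	fmt.Println(string(b1) == string(b2)) // always true, thanks to the sort
	fmt.Println(string(b1))
}
```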

#### Testing engine connections

The best way to check that subscriptions are restored is to do the following:

1. Create the first business engine and snapshot engine.
2. Add some state to the business engine.
3. Generate a local snapshot.
4. Add some more state to the business engine.
5. Get the state for all keys from the business engine and save the result in a map.
6. Close these first engines.
7. Create a second business engine and snapshot engine.
8. Restore the local snapshot.
9. Verify that the hashes produced by the first and second snapshot engines match.
10. Add more state in exactly the same way as step 4.
11. Repeat step 5 but for the second business engine.
12. Compare the contents of the maps from steps 5 and 11.

Below is a pseudo-code-ish example of what a snapshot unit-test should look like:

```go
package business_test

import (
	"testing"
	"time"

	"code.vegaprotocol.io/vega/core/integration/stubs"
	"code.vegaprotocol.io/vega/core/snapshot"
	"code.vegaprotocol.io/vega/core/stats"
	vgtest "code.vegaprotocol.io/vega/libs/test"
	"code.vegaprotocol.io/vega/logging"
	"code.vegaprotocol.io/vega/paths"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestEngineSnapshot(t *testing.T) {
	ctx := vgtest.VegaContext("chainid", 100)

	log := logging.NewTestLogger()

	// It's important to use the same path so the engines pick up the same
	// database.
	vegaPath := paths.New(t.TempDir())

	now := time.Now()
	timeService := stubs.NewTimeStub()
	timeService.SetTime(now)

	statsData := stats.New(log, stats.NewDefaultConfig())

	// Create the first engines.
	businessEngine1 := NewBusinessEngine()

	// Do not use the in-memory implementation! Use LevelDB to be as close as
	// possible to production.
	snapshotEngine1, err := snapshot.NewEngine(vegaPath, snapshot.DefaultConfig(), log, timeService, statsData.Blockchain)
	require.NoError(t, err)

	// This is to avoid double-closing the engine.
	closeSnapshotEngine1 := vgtest.OnlyOnce(snapshotEngine1.Close)
	defer closeSnapshotEngine1()

	snapshotEngine1.AddProviders(businessEngine1)

	// No snapshot yet, does nothing.
	require.NoError(t, snapshotEngine1.Start(ctx))

	// This will help us verify the engines are correctly wired when
	// restoring the state.
	populateState(t, businessEngine1)

	// Take the first snapshot. Saved locally.
	// Call `SnapshotNow()`, not `Snapshot()`, as the latter is async and
	// might create a flaky test, or a data race.
	hash1, err := snapshotEngine1.SnapshotNow(ctx)
	require.NoError(t, err)

	// This will help us detect drift between the business engines after
	// a restoration and an update.
	populateMoreState(t, businessEngine1)

	// Manually snapshot the first business engine state, AFTER the second
	// update.
	state1 := map[string][]byte{}
	for _, key := range businessEngine1.Keys() {
		state, additionalProvider, err := businessEngine1.GetState(key)
		require.NoError(t, err)
		assert.Empty(t, additionalProvider)
		state1[key] = state
	}

	// Close the first engines now, so the second snapshot engine is the only
	// one connecting to the database. This is the closest to the production
	// setup.
	closeSnapshotEngine1()

	// Create the second engines.
	businessEngine2 := NewBusinessEngine()
	snapshotEngine2, err := snapshot.NewEngine(vegaPath, snapshot.DefaultConfig(), log, timeService, statsData.Blockchain)
	require.NoError(t, err)
	defer snapshotEngine2.Close()

	snapshotEngine2.AddProviders(businessEngine2)

	// This triggers the state restoration from the local snapshot.
	require.NoError(t, snapshotEngine2.Start(ctx))

	// Compare the hash after restoration, to ensure it produces the same
	// result.
	hash2, _, _ := snapshotEngine2.Info()
	require.Equal(t, hash1, hash2)

	// Reproduce exactly the same state modification on the second business
	// engine as on the first one, AFTER the snapshot.
	populateMoreState(t, businessEngine2)

	// Manually snapshot the second business engine state, AFTER the second
	// update.
	state2 := map[string][]byte{}
	for _, key := range businessEngine2.Keys() {
		state, additionalProvider, err := businessEngine2.GetState(key)
		require.NoError(t, err)
		assert.Empty(t, additionalProvider)
		state2[key] = state
	}

	// Attempt to detect any drift in the data.
	// If the data doesn't match, check for missing or non-deterministic data
	// in the snapshot.
	for key := range state1 {
		assert.Equalf(t, state1[key], state2[key], "Key %q does not have the same data", key)
	}
}
```

### System tests

System-tests exist that directly flex snapshots in known troublesome situations, and also check more functional aspects of snapshots (e.g. that they are produced in line with the network parameter, and that we only save as many as set in the config file). These exist in the test file `snapshot_test.py` in the system-test repo.

There are also tests that do not _directly_ test snapshot behaviour but where snapshots are used by the feature under test, for example validators-joining-and-leaving and protocol-upgrade tests. These exist across almost all of the system-tests marked as `network_infra`.

#### How to debug a failure

For any run of a system-test, the block-data and vega home directories are saved as artefacts. They can be downloaded and used to replay the chain locally, and to then perform the same snapshot restore. The block of the failing snapshot can be found in the logs of the node that failed to restart.

### Snapshot soak tests

The "snapshot soak tests" are run at the end of every overnight full system-test run. They take the resultant chain data generated by running the full test suite, replay the chain, and then attempt to restore from every snapshot that was taken during the lifetime of the chain. The benefit of these tests is that they check snapshots created during obscure transient states, which are harder to dream up when writing snapshot system-tests or unit-tests.

It also means that our effective coverage of snapshots mirrors the system-test AC coverage, and as new system-tests for features are written we automatically test that the snapshots for those features also work.

#### How to debug a failure

Reproducing a failed soak-test locally is very easy as you can use the same script as the CI. The steps are:

- Download the `testnet` folder of artefacts from the system-test run that produced the bad snapshot
- Clone the `system-tests` repo and find the script `tests/soak-test/run.py`
- Run the script to first replay the chain: `poetry run python3 run.py --tm-home=tendermint/node2 --vega-home=vega/node2 --vega-binary=../vega --replay`. **It's important to use `node2` as it's a non-validator node.**
- It will write log files from the node to `node-0.log` and `err-node-0.log`
- Restart the node from the problem snapshot: `poetry run python3 run.py --tm-home=tendermint/node2 --vega-home=vega/node2 --vega-binary=../vega --block BLOCK_NUM`. **It's important to use `node2` as it's a non-validator node.**
- It will write log files from the node to `node-BLOCK_NUM.log` and `err-node-BLOCK_NUM.log`
- Compare the two logs to see where state has diverged

### Snapshot pipelines

A recurring Jenkins pipeline exists that repeatedly joins a network using statesync snapshots. The pipeline runs every 10 minutes on all of our internal networks (devnet1, stagnet1, testnet) as well as mainnet. There is a Slack channel `#snapshot-notify` that shows the results.

The pipeline exists to verify that snapshots work in a more realistic environment, where the volume of state is more representative of what we would expect on a real Vega network.

#### How to debug a failure

The snapshot pipeline jobs store as artefacts the block-data and the snapshot that was loaded from. This allows you to replay the partial chain in the same way locally and reproduce any failure. By finding the last _successful_ snapshot pipeline job, those artefacts can be used to replay the partial chain from a working snapshot, allowing comparison between logs to find where state started to diverge.

### Event-diff soak tests

These tests check that the events sent out by core are always sent in the same order. As the system tests run, `node2` writes all the events it sends to a data node out to a file. At the end of the test run the chain is replayed and another file containing all events is produced. These two files are then diffed. If it fails on Jenkins you will see output that looks like the following:

```
[2023-10-09T15:24:45.689Z] === Starting event-file diff
[2023-10-09T15:24:45.689Z] Differences found between: /jenkins/workspace/common/system-tests-wrapper/networkdata/testnet/vega/node2/eventlog.evt /jenkins/workspace/common/system-tests-wrapper/networkdata/testnet/vega/node2/eventlog-replay.evt
```

#### How to debug a failure

The two event files that were diffed will be saved as artefacts in Jenkins, and the first step is to download them locally. From there they can be parsed into human-readable JSON using the following vega tool, and then diffed:

```
vega tools events --out=original.out --events=eventlog.evt

vega tools events --out=replay.out --events=eventlog-replay.evt

diff original.out replay.out
```

Note that for events created during a full system-test run, both the parsing and the diff can take some time.

The diff can then be used to hunt down which block produces different events, and which event type it is. For example, if the diff flagged up an event like the one below:

```
{
   "id": "4615-75",
   "block": "953A2BC530B192B78CA5D4228C377BF3C66FEA65F0C4AF93B0DDBE7AFDE036A7",
   "type": 11,
   "vote": {
      "party_id": "7c2860d661607c3e51df31f7fae478acceb6ad0f45ef0d044b74d37cf7f78ebc",
      "value": 2,
      "proposal_id": "2372a4901660ace8d4e5b9e318754abdbf959454610878c13fd8f73317ebacbd",
      "timestamp": "1696858255513756974",
      "total_governance_token_balance": "9749958450000000000",
      "total_governance_token_weight": "1",
      "total_equity_like_share_weight": "0"
   },
   "version": 1,
   "chain_id": "testnet-001",
   "tx_hash": "953A2BC530B192B78CA5D4228C377BF3C66FEA65F0C4AF93B0DDBE7AFDE036A7"
}
```

we know that it was emitted in block `4615` and was a `vote` event. From here we can look through all calls to `events.NewVoteEvent()` in core and look for places where we may be iterating over a map, or where sorting may be insufficient, which would cause a difference in event order.
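
The usual culprit looks like this; a contrived sketch (not real core code) where the emission order is derived from map iteration, made stable by sorting the keys first:

```go
package main

import (
	"fmt"
	"sort"
)

// emitVotes returns the order in which vote events would be emitted for a
// set of parties. Iterating the map directly would give a randomised order
// that differs between the original run and the replay; sorting the keys
// first makes the emission order stable.
func emitVotes(votes map[string]int) []string {
	parties := make([]string, 0, len(votes))
	for party := range votes {
		parties = append(parties, party)
	}
	sort.Strings(parties) // without this, replay order can differ
	return parties
}

func main() {
	votes := map[string]int{"party-c": 2, "party-a": 2, "party-b": 1}
	fmt.Println(emitVotes(votes)) // stable order, every run
}
```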