code.vegaprotocol.io/vega@v0.79.0/datanode/docs/schema_design_principles.md (about) 1 # Schema Design Principles 2 3 Our database is generally 'insert only', so when a particular entity changes state, we *add* a new row in the database, we don't update an existing one. 4 5 This is important for both being able to audit what happened in the past, and also for the functioning of decentralised history 6 7 The way this is implemented is that most tables have a `Foo` *table* and a `FooCurrent` *view*, where `Foo` has every single change and `FooCurrent` is a database view which selects the most recent version. 8 9 For example with proposals we have a table: 10 11 ```sql 12 CREATE TABLE proposals( 13 id BYTEA NOT NULL, 14 state proposal_state NOT NULL, 15 vega_time TIMESTAMP WITH TIME ZONE NOT NULL, 16 .... 17 ) 18 ``` 19 20 And a corresponding view 21 22 ``` 23 CREATE VIEW proposals_current AS ( 24 SELECT DISTINCT ON (id) * FROM proposals ORDER BY id, vega_time DESC 25 ); 26 ``` 27 28 So for a proposal with id 0x01 that was initially `STATE_OPEN` and then `STATE_PASSED` in the `proposals` table we might have 29 30 | id | state | vega_time | 31 | --- | --- | --- | 32 | 0x01 | STATE_OPEN | 2023-01-01 00:00 | 33 | 0x01 | STATE_PASSED | 2023-01-02 00:00 | 34 35 But selecting from `proposals_current` you would receive, for each distinct ID, the row with the highest `vega_time`: 36 37 | id | state | vega_time | 38 | --- | --- | --- | 39 | 0x01 | STATE_PASSED | 2023-01-02 00:00 | 40 41 ## Performance Implications 42 43 The design above is great for insert performance, because we don’t have to go looking for rows to update when changes come along. However, without a bit of care can have quite a surprising effect on the performance when querying the `xxx_current` views. 44 45 If you are filtering on a field that is not part of the `DISTINCT ON(yyy)` clause, the database has no choice but to materialise the entire view before querying it. 46 47 For example, if you were to query 48 49 ```sql 50 SELECT * FROM proposals_current WHERE state='STATE_OPEN' 51 ``` 52 53 The only way that can be determined is to 54 55 - First get the current version of each proposal (e.g. the set of rows with unique `id`s corresponding to the highest `vega_time` for each `id`) 56 - Then apply your filter on `state` to that materialized view. 57 58 Postgres **cannot reverse** the order of those operations because otherwise you would be asking the question ** 59 60 > *show me the highest version of a proposal in which it’s state was `OPEN`* 61 > 62 63 instead of the intended 64 65 > *show me, out of the most recent versions of all proposals, which of those has state `OPEN`* 66 > 67 68 If the size of that materialised view is large, even though your subsequent filtering might well reduce it to a handful of rows, the query will be slow. 69 70 Additionally the materialization of the view doesn’t have any indexes on it. 71 72 ### Workaround #1 - DISTINCT ON 73 74 If the column being filtered is part of the the `DISTINCT ON` clause Postgres ***can*** push the filter down into the table scan. For example this is doesn’t require materializing the whole view: 75 76 ```sql 77 SELECT * FROM proposals_current WHERE id='\x01' 78 ``` 79 80 Because `id` is part of the `DISTINCT ON` clause, it is safe for Postgres to apply your filter to the table itself rather than on the result of the view. It can use any relevant indexes on the table to do that quickly. 81 82 If you are sure that a particular column will never change for a particular `id` , you can add it to the `DISTINCT ON` clause of the view - for example: 83 84 ```sql 85 CREATE VIEW proposals_current AS ( 86 SELECT DISTINCT ON (id, reference) * FROM proposals ORDER BY id, reference, vega_time DESC 87 ); 88 ``` 89 90 And that will greatly improve the performance of a query like 91 92 ```sql 93 SELECT * FROM proposals_current WHERE reference='foo' 94 ``` 95 96 ### Workaround #2- Temporal tables maintained by triggers 97 98 If you want speedy queries on large tables where you want to filter by columns that **do** change, there is no choice but to make your inserts do some updating as well. 99 100 We do this on our orders table. It’s set up like this: 101 102 ```sql 103 CREATE TABLE orders ( 104 id BYTEA NOT NULL, 105 vega_time TIMESTAMP WITH TIME ZONE NOT NULL, 106 current BOOLEAN NOT NULL DEFAULT FALSE, 107 ); 108 ``` 109 110 The `current` field is the extra special sauce. 111 112 - For row corresponding to the most recent update to an order, it is always `true`. 113 - For a row corresponding to a previous state of an order, it is set to `false` 114 - A trigger handles updating of `current` on the previous most current previous version when an insert is made to the table 115 116 This allows us to now create our `orders_current` view as follows: 117 118 ```sql 119 CREATE VIEW orders_current AS ( 120 SELECT * FROM orders WHERE current = true 121 ); 122 ``` 123 124 This view doesn’t suffer from the same issues of not being able to push down filters into the table scan that views created with `DISTINCT ON` has. You can even create indexes that only contain rows where `current=true`: 125 126 ```sql 127 CREATE INDEX ON orders (market_id) where current=true; 128 ``` 129 130 Which keeps the size of the index down and makes querying the latest version of orders by `market_id` as fast as if the table didn’t have all the historical versions in there as well. 131 132 **Downsides** 133 134 - `INSERT` becomes considerably slower, and `UPDATE` of existing rows generate dead tuples in the database that cause bloat and need to be `VACUUM`-ed later. 135 - Network history has to take into account this flag and ensure that after restoring from history segments the table is processed such that only the latest 136 - version of the order is market as snapshots and sharing, and requires special handling there to work. 137 138 ### Workaround #3 - Separate ‘audit’ tables 139 140 The idea here is to keep a table with only the most recent versions of things, and a separate table with all this changes. This is quite a common pattern, but we currently don’t do this as we weren’t able to get good enough insert performance.