github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20160426_upsert.md (about) 1 - Feature Name: UPSERT 2 - Status: completed 3 - Start Date: 2016-04-26 4 - Authors: Daniel Harrison 5 - RFC PR: [#6331](https://github.com/cockroachdb/cockroach/pull/6331) 6 - Cockroach Issue: [#1962](https://github.com/cockroachdb/cockroach/issues/1962) 7 8 # Summary 9 10 `UPSERT` allows a user to atomically either insert a row, or on the basis of the 11 row already existing, update that existing row instead. Examples include MySQL's 12 `ON DUPLICATE KEY UPDATE` and `REPLACE`, PostgreSQL's `ON CONFLICT DO UPDATE` or 13 VoltDB's `UPSERT` statement. 14 15 16 # Motivation 17 18 The absence of this feature from Postgres was a long-standing complaint before 19 being implemented in 9.5. Additionally, it's difficult to use a two column table 20 as a key/value store without `UPSERT` (as we suggest in response to requests to 21 make our kv layer public). 22 23 24 # Detailed design 25 26 Both a long form `INSERT INTO kv VALUES (1, 'one') ON CONFLICT index DO UPDATE 27 SET v = 'one'` and a short form `UPSERT INTO kv VALUES(1, 'one')` will be 28 introduced. In contrast to MySQL's `REPLACE`, the latter will behave as 29 syntactic sugar for the long form for the primary index. The short form will be 30 implemented first. 31 32 A row is considered in conflict if the specified index is violated by its 33 insertion. 34 35 In practice, upserts tend to be either insert or update heavy. A user's song 36 ratings would be mostly inserts but a set of counters would be mostly updates. 37 There are two options for upsert implementation, one optimizing the insert case 38 and one the update case. Whichever we pick will be wrong for some uses. 39 40 * __Optimize for updates__. Query rows are processed a batch at a time to 41 prevent unbounded memory usage. A fetch is run to determine which rows have 42 conflicts. If there are conflicts and the index is a secondary index, a second 43 fetch is run to get the values needed to compute the update statements. Then 44 each row is either inserted or updated as appropriate. 45 46 Example (primary key upsert): 47 ```sql 48 INSERT INTO counters (name, val) VALUES('foo', 1) ON CONFLICT name DO UPDATE SET value = value + 1; 49 ``` 50 51 The primary key was provided, so the fields needed to compute the update are 52 fetched directly. If the row didn't exist, we're safe to insert. If it did, a 53 standard update is run with the fetched values. 54 55 Example (secondary key upsert): 56 ```sql 57 INSERT INTO song_ratings (id, user, song, rating) VALUES(unique_rowid(), 'foo', 'bar', 4) ON CONFLICT song DO UPDATE SET rating = 4; 58 ``` 59 60 The key for the specified index is constructed and fetched. If it doesn't 61 exist, we're safe to insert. Otherwise, the primary key is extracted from the 62 fetched entry and the values necessary to compute the update are fetched in a 63 second pass. Finally, the update is run. 64 65 * __Optimize for inserts__. Query rows are processed a batch at a time to 66 prevent unbounded memory usage. It is optimistically assumed that insertion will 67 run with no conflicts, but CPut (or maybe InitPut) is used to find any rows that 68 do conflict (as is already done in `INSERT`). The batch is executed. If 69 conflicts occur, a follow-up batch is constructed to update any rows that 70 conflicted. 71 72 In the fully general `ON CONFLICT DO UPDATE` case, the insert and update 73 clauses can address even completely different sets of fields. Either a batch has 74 to be used per row (which would be too unperformant) or some or all of the 75 inserted values have to be rolled back when the row conflicts. This requires the 76 previous values. We can't query them (they've already been overwritten) which 77 means they'll all need to be returned from the kv layer. 78 79 The current implementation of CPut will fail a batch after the first violation 80 and so is unsuitable for implementing the insert optimized version. InitPut is 81 also close to, but not exactly, what is needed. 82 83 The update optimized implementation is considerably simpler to implement. It's 84 possible that in the future we'll want to implement both and let the user hint 85 which one should be used. Note that there is a case we'll specialize for 86 performance: when `UPSERT`ing a row which has values for all the columns in 87 the table and there are no secondary indexes, we can do it entirely with `Put`s. 88 89 90 # Drawbacks 91 92 * Having two strategies with different runtime trade-offs is unfortunate. 93 94 95 # Alternatives 96 97 * __Optimize for inserts.__ Wait for the Freeze/Unfreeze work to land, then do 98 the CPut/InitPut strategy described above. 99 100 * __Only implement the long form.__ When the short form is sufficient, it's more 101 clear, and once both are implemented there will be little additional code 102 required to keep the short form. 103 104 * __Implement `REPLACE` with MySQL's semantics__ (a `DELETE` followed by an 105 `INSERT`) instead of the short form. There are three key differences: 106 107 * When `REPLACE` encounters a conflict, a normal `DELETE` is done and any 108 applicable `ON DELETE CASCADE` will be triggered. This is fairly surprising and 109 has historically been a source of confusion. 110 111 * If some table fields are unspecified, the update will overwrite them with 112 default values, while `UPSERT` will leave them unchanged. 113 114 * `REPLACE` counts the update case as 2 "affected rows", while `UPSERT` will 115 count it as 1. 116 117 If the user really wants this behavior, a `DELETE` followed by an `INSERT` in 118 a transaction is identical. 119 120 121 # Unresolved questions 122 123 * In contrast to `REPLACE`, table constraints will be enforced after the logical 124 insert or update. TODO(dan): Add details on how this will work once 125 [#6309](https://github.com/cockroachdb/cockroach/pull/6309) is fleshed out.