Datastore implementation internals
==================================

This document contains internal implementation details for this in-memory
version of datastore. It's mostly helpful for understanding what's going on in
this implementation, but it can also reveal some insight into how the real
appengine datastore works (though note that the specific encodings are
different).

Additionally, note that this implementation cheats by moving some of the Key
bytes into the table (collection) names (like the namespace, property name for
the builtin indexes, etc.). The real implementation contains these bytes in the
table row keys, I think.


Internal datastore key/value collection schema
----------------------------------------------

The datastore implementation here uses several different tables ('collections')
to manage state for the data. The schema for these tables is enumerated below
to make the code a bit easier to reason about.

All datastore user objects (Keys, Properties, PropertyMaps, etc.) are
serialized using `go.chromium.org/luci/gae/service/datastore/serialize`, which
in turn uses the primitives available in `go.chromium.org/luci/common/cmpbin`.
The encodings are important for understanding why the schemas below sort
correctly when compared only using `bytes.Compare` (aka `memcmp`). This doc
will assume that you're familiar with those encodings, but will point out where
we diverge from the stock encodings.

All encoded Property values used in memory store Keys (i.e. index rows) are
serialized using the settings `serialize.WithoutContext` and
`datastore.ShouldIndex`.

### Primary table

The primary table maps datastore keys to entities.

- Name: `"ents:" + namespace`
- Key: serialized datastore.Property containing the entity's datastore.Key
- Value: serialized datastore.PropertyMap

This table also encodes values for the following special keys:

- Every entity root (i.e. a Key with nil Parent()) with key K has:
  - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group__` and an id of `1`. The
    value has a single property `__version__`, which contains the version
    number of this entity group. This is used to detect transaction conflicts.
  - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group_ids__` and an id of `1`. The
    value has a single property `__version__`, which contains the last
    automatically allocated entity ID for entities within this entity group.
- A root entity with the key `Key("__entity_group_ids__", 1)` which contains
  the same `__version__` property, and indicates the last automatically
  allocated entity ID for root entities.

### Compound Index table

The next table keeps track of all the user-added 'compound' index descriptions
(not the content for the indexes). There is a row in this table for each
compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.

- Name: `"idx"`
- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy
  slice in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
- Value: empty

The Key format here requires some special attention. Say you started with a
compound IndexDefinition of:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "Something", Direction: DESCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
      }
    }

After prepping it for the table, it would be equivalent to:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "__key__", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Something", Direction: DESCENDING},
      }
    }

The reason for doing this will be covered in the `Query Planning` section, but
it boils down to allowing the query planner to use this encoded table to
intelligently scan for potentially useful compound indexes.
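To make the reversal concrete, here's a minimal Go sketch of what a
`PrepForIdxTable`-style normalization does. The types are simplified stand-ins
invented for this example, not the real `datastore` package:

    package main

    import "fmt"

    // Simplified stand-ins for the real datastore types.
    type Direction bool

    const (
        ASCENDING  Direction = false
        DESCENDING Direction = true
    )

    type IndexColumn struct {
        Property  string
        Direction Direction
    }

    type IndexDefinition struct {
        Kind     string
        Ancestor bool
        SortBy   []IndexColumn
    }

    // prepForIdxTable mimics the normalization: append the implied trailing
    // __key__ column, then reverse the whole SortBy slice, so that indexes
    // sharing a suffix share a prefix in the "idx" table and can be found
    // with a bytewise prefix scan.
    func prepForIdxTable(id IndexDefinition) IndexDefinition {
        cols := append(append([]IndexColumn(nil), id.SortBy...),
            IndexColumn{Property: "__key__", Direction: ASCENDING})
        for i, j := 0, len(cols)-1; i < j; i, j = i+1, j-1 {
            cols[i], cols[j] = cols[j], cols[i]
        }
        return IndexDefinition{Kind: id.Kind, Ancestor: id.Ancestor, SortBy: cols}
    }

    func main() {
        idx := IndexDefinition{
            Kind:     "Foo",
            Ancestor: true,
            SortBy: []IndexColumn{
                {Property: "Something", Direction: DESCENDING},
                {Property: "Else", Direction: ASCENDING},
                {Property: "Cool", Direction: ASCENDING},
            },
        }
        // SortBy comes out as: __key__, Cool, Else, Something (DESCENDING),
        // matching the prepped example above.
        fmt.Printf("%+v\n", prepForIdxTable(idx))
    }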
### Index Tables

Every index (both builtin and compound) has one index table per namespace,
which contains one row for every entry in the index.

- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
- Key: concatenated datastore.Property values, one per SortBy column in the
  IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
  DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF).
- Value: empty

If the IndexDefinition has `Ancestor: true`, then the very first column of the
Key contains the partial Key for the entity. So if an entity has the datastore
key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor
column of the indexes that it matches.

#### Builtin (automatic) indexes

The following indexes are automatically created for an entity with a key
`/Kind,*`, for every property named "Foo" which has at least one `ShouldIndex`
value:

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: DESCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}

Index updates
-------------

(Note that this is a LARGE departure from how the production appengine
datastore does this. This model only works because the implementation is not
distributed, and not journaled. The real datastore does index updates in
parallel and is generally pretty fancy compared to this.)

Index updates are pretty straightforward. On a mutation to the primary entity
table, we take the old entity value (remember that entity values are
PropertyMaps) and the new entity value, create index entries for both, merge
them, and apply the deltas to the affected index tables (i.e. entries that
exist for the old entity, but not the new one, are deleted; entries that exist
for the new entity, but not the old one, are added).

Index generation works (given a slice of indexes) by:

* serializing all ShouldIndex Properties in the PropertyMap to get a
  `map[name][]serializedProperty`.
* for each index idx:
  * if idx's columns contain properties that are not in the map, skip idx.
  * make a `[][]serializedProperty`, where each serializedProperty slice
    corresponds with the IndexColumn of idx.SortBy.
    * duplicated values for multi-valued properties are skipped.
  * generate a `[]byte` row which is the concatenation of one value from each
    `[]serializedProperty`, permuting through all combinations (as sketched
    below). If the SortBy column is DESCENDING, make sure to invert (XOR 0xFF)
    the serializedProperty value!
  * add that generated `[]byte` row to the index's corresponding table.
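As a concrete illustration of the row-generation step, here's a hypothetical
sketch with plain byte slices standing in for serialized Properties (all names
and types here are invented for the example, not the real implementation):

    package main

    import "fmt"

    // invert flips every byte (XOR 0xFF) so that DESCENDING columns sort in
    // reverse order under bytes.Compare.
    func invert(b []byte) []byte {
        out := make([]byte, len(b))
        for i, c := range b {
            out[i] = c ^ 0xFF
        }
        return out
    }

    // indexRows returns the concatenation of one serialized value per
    // column, permuting through all combinations. descending[i] says
    // whether column i is DESCENDING.
    func indexRows(cols [][][]byte, descending []bool) [][]byte {
        rows := [][]byte{{}} // start with a single empty prefix
        for i, vals := range cols {
            next := make([][]byte, 0, len(rows)*len(vals))
            for _, prefix := range rows {
                for _, v := range vals {
                    if descending[i] {
                        v = invert(v)
                    }
                    row := append(append([]byte(nil), prefix...), v...)
                    next = append(next, row)
                }
            }
            rows = next
        }
        return rows
    }

    func main() {
        // duck has two values; goose has one. A two-column index
        // (duck ASC, goose DESC) gets 2*1 = 2 rows.
        cols := [][][]byte{
            {{0x01}, {0x02}}, // pretend-serialized duck values
            {{0x10}},         // pretend-serialized goose value
        }
        for _, r := range indexRows(cols, []bool{false, true}) {
            fmt.Printf("% x\n", r) // prints "01 ef" and "02 ef"
        }
    }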
Note that we choose to serialize all permutations of the saved entity. This is
so that we can use repeated-column indexes to fill queries which use a subset
of the columns. E.g. if we have the index `duck,duck,duck,goose`, we can
theoretically use it to fill a query for `duck=1,duck=2,goose>"canadian"`, by
pasting 1 or 2 as the value for the 3rd `duck` column. This simplifies index
selection at the expense of larger indexes. However, it means that if you have
the entity:

    duck = 1, 2, 3, 4
    goose = "færøske"

It generates the following index entries:

    duck=1,duck=1,duck=1,goose="færøske"
    duck=1,duck=1,duck=2,goose="færøske"
    duck=1,duck=1,duck=3,goose="færøske"
    duck=1,duck=1,duck=4,goose="færøske"
    duck=1,duck=2,duck=1,goose="færøske"
    duck=1,duck=2,duck=2,goose="færøske"
    duck=1,duck=2,duck=3,goose="færøske"
    duck=1,duck=2,duck=4,goose="færøske"
    duck=1,duck=3,duck=1,goose="færøske"
    ... a lot ...
    duck=4,duck=4,duck=4,goose="færøske"

This is a very large number of index rows (i.e. an 'exploding index')!

An alternate design would be to only generate unique permutations of elements
where the index has repeated columns of a single property. This only makes
sense because it's illegal to have both an equality and an inequality filter
on the same property, under the current constraints of appengine (allowing
that wouldn't be completely ridiculous in general, if inequality constraints
meant the same thing as equality constraints, but it would lead to a
multi-dimensional query, which can be quite slow and is very difficult to
scale without application knowledge). If we do this, it also means that we
need to SORT the equality filter values when generating the prefix (so that
the least-valued equality constraint is first). If we did this, then the
generated index rows for the above entity would be:

    duck=1,duck=2,duck=3,goose="færøske"
    duck=1,duck=2,duck=4,goose="færøske"
    duck=1,duck=3,duck=4,goose="færøske"
    duck=2,duck=3,duck=4,goose="færøske"

Which would be a LOT more compact. It may be worth implementing this
restriction later, simply for the memory savings when indexing multi-valued
properties.

If this technique is used, there's also room to unambiguously index entities
with repeated equivalent values. E.g. if duck=1,1,2,3,4, then you could see a
row in the index like:

    duck=1,duck=1,duck=2,goose="færøske"

Which would allow you to query for "an entity which has duck values equal to
1, 1 and 2". Currently such a query is not possible to execute (it would be
equivalent to "an entity which has duck values equal to 1 and 2").

Query planning
--------------

Now that we have all of our data tabulated, let's plan some queries. The
high-level algorithm works like this:

* Generate a suffix format from the user's query (see the sketch after this
  list) which looks like:
  * orders (including the inequality as the first order, if any)
  * projected fields which aren't explicitly referenced in the orders (we
    assume ASCENDING order for them), in the order that they were projected.
  * `__key__` (implied ascending, unless the query's last sort order is for
    `__key__`, in which case it's whatever order the user specified)
* Reverse the order of this suffix format, and serialize it into an
  IndexDefinition, along with the query's Kind and Ancestor values. This does
  what PrepForIdxTable did when we added the Index in the first place.
* Use this serialized reversed index to find compound indexes which might
  match, by looking up rows in the "idx" table which begin with this
  serialized reversed index.
* Generate every builtin index for the inequality + equality filter
  properties, and see if they match too.
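Here's a small, hypothetical sketch of the suffix-format generation from the
first step. The types are simplified (the real planner tracks considerably
more state than this):

    package main

    import "fmt"

    type IndexColumn struct {
        Property   string
        Descending bool
    }

    type query struct {
        orders     []IndexColumn // inequality property first, if any
        projection []string      // projected fields, in projection order
    }

    func suffixFormat(q query) []IndexColumn {
        suffix := append([]IndexColumn(nil), q.orders...)
        seen := map[string]bool{}
        for _, c := range suffix {
            seen[c.Property] = true
        }
        // Projected fields not already referenced by an order are assumed
        // ASCENDING, in projection order.
        for _, p := range q.projection {
            if !seen[p] {
                suffix = append(suffix, IndexColumn{Property: p})
                seen[p] = true
            }
        }
        // __key__ is implied ascending unless the last order was on it.
        if !seen["__key__"] {
            suffix = append(suffix, IndexColumn{Property: "__key__"})
        }
        return suffix
    }

    func main() {
        q := query{
            orders:     []IndexColumn{{Property: "Val", Descending: true}},
            projection: []string{"Other"},
        }
        // [{Val true} {Other false} {__key__ false}]
        fmt.Println(suffixFormat(q))
    }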
An index is a potential match if its suffix *exactly* matches the suffix
format, and it contains *only* columns which appear in the query (an index is
rejected if, for example, it contains a column which doesn't appear in the
query as an equality or inequality filter).

The index search continues until:

* We find at least one matching index; AND
* The combination of all matching indexes accounts for every equality filter
  at least once.

If we fail to find sufficient indexes to fulfill the query, we generate an
index description that *could* be sufficient by concatenating all missing
equality filters, in ascending order, followed by concatenating the suffix
format that we generated for this query. We then suggest this new index to
the user for them to add by returning an error containing the generated
IndexDefinition. Note that the user is not REQUIRED to add this exact index;
they could choose to add bits and pieces of it, extend existing indexes in
order to cover the missing columns, invert the direction of some of the
equality columns, etc.

Recall that equality filters are expressed as
`map[propName][]serializedProperty`. We'll refer to this mapping as the
'constraint' mapping below.

To actually come up with the final index selection, we sort all the matching
indexes from greatest number of columns to least. We add the 0th index (the
one with the greatest number of columns) unconditionally. We then keep adding
indexes which contain one or more of the remaining constraints, until we have
no more constraints to satisfy.

Adding an index entails determining which columns in that index correspond to
equality columns, and which ones correspond to inequality/order/projection
columns. Recall that the inequality/order/projection columns are all the same
for all of the potential indexes (i.e. they all share the same *suffix
format*). We can use this to just iterate over the index's SortBy columns
which we'll use for equality filters. For each equality column, we remove a
corresponding value from the constraints map. In the event that we _run out_
of constraints for a given column, we simply _pick an arbitrary value_ from
the original equality filter mapping and use that. This is valid to do
because, after all, they're equality filters.
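A hypothetical sketch of that constraint-consumption step (invented names and
simplified bookkeeping; the real code also handles direction inversion and
much more):

    package main

    import (
        "bytes"
        "fmt"
    )

    // prefixFor consumes one outstanding constraint per equality column;
    // 'constraints' maps a property name to its not-yet-consumed equality
    // values, and 'original' keeps every equality value so we can reuse one
    // when constraints run dry.
    func prefixFor(eqCols []string, constraints, original map[string][][]byte) []byte {
        var buf bytes.Buffer
        for _, col := range eqCols {
            vals := constraints[col]
            var v []byte
            if len(vals) > 0 {
                // Consume one outstanding constraint for this property.
                v, constraints[col] = vals[0], vals[1:]
            } else {
                // Out of constraints: any of the original equality values
                // is valid here, since they're all equality filters.
                v = original[col][0]
            }
            buf.Write(v)
        }
        return buf.Bytes()
    }

    func main() {
        original := map[string][][]byte{"duck": {{0x01}, {0x02}}}
        constraints := map[string][][]byte{"duck": {{0x01}, {0x02}}}
        // An index with three repeated duck columns: the third column
        // reuses an arbitrary original value. Prints "01 02 01".
        fmt.Printf("% x\n", prefixFor([]string{"duck", "duck", "duck"}, constraints, original))
    }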
Note that for compound indexes, the ancestor key counts as an equality filter,
and if the compound index has `Ancestor: true`, then we implicitly put the
ancestor as if it were the first SortBy column. For satisfying Ancestor
queries with built-in indexes, see the next section.

Once we've got our list of constraints for this index, we concatenate them all
together to get the *prefix* for this index. When iterating over this index,
we only ever want to see index rows whose prefix exactly matches this. Unlike
the suffix format, the prefix is per-index (remember that ALL indexes in the
query must have the same suffix format).

Finally, we set the 'start' and 'end' values for all chosen indexes to either
the Start and End cursors, or the Greater-Than and Less-Than values for the
inequality. The Cursors contain values for every suffix column, and the
inequality only contains a value for the first suffix column. If both cursors
and an inequality are specified, we take the smaller set of both (the
combination which will return the fewest rows).

That's about it for index selection! See Query Execution for how we actually
use the selected indexes to run a query.

### Ancestor queries and Built-in indexes

You may have noticed that the built-in indexes can be used for Ancestor
queries with equality filters, but they don't start with the magic Ancestor
column!

There's a trick that you can do if the suffix format for the query is just
`__key__` though (i.e. the query only contains equality filters, and/or an
inequality filter on `__key__`). You can serialize the datastore key that
you're planning to use for the Ancestor query, then chop off the terminating
null byte from the encoding, and then use this as additional prefix bytes for
this index. So if the builtin for the "Val" property has the column format of:

    {Property: "Val"}, {Property: "__key__"}

And your query holds Val as an equality filter, you can serialize the
ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
an index row:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP

(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1

Omitting the STOP which normally terminates the Key encoding. Using this
prefix will only return index rows which are `/Kind,1` or its children. (A toy
sketch of this prefix construction appears at the end of this section.)

"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
Remember that this trick only works if the prefix before the `__key__` is
*entirely* composed of equality filters. Also recall that if you ONLY use
equality filters and Ancestor (and possibly an inequality on `__key__`), then
you can always satisfy the query from the built-in indexes! While you
technically could do it with a compound index, there's not really a point to
doing so. To remain faithful to the production datastore implementation, we
don't implement this trick for anything other than the built-in indexes.
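Here's that prefix construction as a toy Go sketch, using made-up byte
encodings rather than the real `serialize`/`cmpbin` formats:

    package main

    import (
        "bytes"
        "fmt"
    )

    // ancestorPrefix concatenates the equality values, then appends the
    // encoded ancestor key with its trailing STOP byte chopped off, turning
    // an exact key match into a "this key or any of its children" prefix.
    func ancestorPrefix(eqValues [][]byte, encodedAncestorKey []byte) []byte {
        var buf bytes.Buffer
        for _, v := range eqValues {
            buf.Write(v)
        }
        buf.Write(bytes.TrimSuffix(encodedAncestorKey, []byte{0x00}))
        return buf.Bytes()
    }

    func main() {
        val100 := []byte{0x12, 0x64}                        // pretend PTInt ++ 100
        key := []byte{0x0A, 'K', 'i', 'n', 'd', 0x01, 0x00} // pretend PTKey ++ "Kind" ++ 1 ++ STOP
        // Prints the prefix without the trailing 00 byte.
        fmt.Printf("% x\n", ancestorPrefix([][]byte{val100}, key))
    }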
### Cursor format

Cursors work by containing values for each of the columns in the suffix, in
the order and Direction specified by the suffix. In fact, cursors are just
encoded versions of the []IndexColumn used for the 'suffix format', followed
by the raw bytes of the suffix for that particular row (incremented by 1 bit).

This means that technically you can port cursors between any queries which
share precisely the same suffix format, regardless of other query options,
even if the index planner ends up choosing different indexes to use from the
first query to the second. No state is maintained in the service
implementation for cursors.

I suspect that this is a more liberal version of cursors than the production
appengine's, but I haven't verified one way or the other.

Query execution
---------------

Last but not least, we need to actually execute the query. After figuring out
which indexes to use with what prefixes and start/end values, we essentially
have a list of index subsets, all sorted the same way. To pull the values out,
we start by iterating the first index in the list, grabbing its suffix value,
and trying to iterate from that suffix in the second, third, fourth, etc.
index.

If any index iterates past that suffix, we start back at the 0th index with
that suffix, and continue to try to find a matching row. Doing this will end
up skipping large portions of all of the indexes in the list. This is the
algorithm known as "zigzag merge join", and you can find talks on it from some
of the appengine folks. It has very good algorithmic running time and tends to
scale with the number of full matches, rather than the size of all of the
indexes involved. (A toy sketch of this loop appears at the end of this
document.)

A hit occurs when all of the iterators have precisely the same suffix. This
hit suffix is then decoded using the suffix format information. The very last
column of the suffix will always be the datastore key. The suffix is then used
to call back to the user, according to the query type:

* keys-only queries just directly return the Key
* projection queries return the projected fields from the decoded suffix.
  Remember how we added all the projections after the orders? This is why. The
  projected values are pulled directly from the index, instead of going to the
  main entity table.
* normal queries pull the decoded Key from the "ents" table, and return that
  entity to the user.
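To close, here's a hypothetical, self-contained sketch of the zigzag merge
join, with in-memory sorted slices standing in for the real index iterators:

    package main

    import (
        "bytes"
        "fmt"
        "sort"
    )

    // seek returns the first suffix in idx that is >= target, or nil if the
    // index is exhausted.
    func seek(idx [][]byte, target []byte) []byte {
        i := sort.Search(len(idx), func(i int) bool {
            return bytes.Compare(idx[i], target) >= 0
        })
        if i == len(idx) {
            return nil
        }
        return idx[i]
    }

    // zigzag yields every suffix present in ALL indexes.
    func zigzag(indexes [][][]byte) [][]byte {
        var hits [][]byte
        cur := []byte{} // smallest possible suffix
        for {
            matched := true
            for _, idx := range indexes {
                got := seek(idx, cur)
                if got == nil {
                    return hits // some index is exhausted; no more hits
                }
                if !bytes.Equal(got, cur) {
                    // This index skipped past cur; restart from got.
                    cur, matched = got, false
                    break
                }
            }
            if matched {
                hits = append(hits, cur)
                // Advance to the next possible suffix after the hit.
                cur = append(append([]byte(nil), cur...), 0x00)
            }
        }
    }

    func main() {
        a := [][]byte{{1}, {3}, {5}, {7}}
        b := [][]byte{{3}, {4}, {5}}
        fmt.Println(zigzag([][][]byte{a, b})) // prints [[3] [5]]
    }

Note how the loop never walks either index element by element: each mismatch
re-seeks every iterator, which is what lets the running time scale with the
number of full matches rather than the total size of the indexes.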