Datastore implementation internals
==================================

This document contains internal implementation details for this in-memory
version of datastore. It's mostly helpful for understanding what's going on in
this implementation, but it can also reveal some insight into how the real
appengine datastore works (though note that the specific encodings are
different).

Additionally, note that this implementation cheats by moving some of the Key
bytes into the table (collection) names (like the namespace, property name for
the builtin indexes, etc.). The real implementation contains these bytes in the
table row keys, I think.


Internal datastore key/value collection schema
----------------------------------------------

The datastore implementation here uses several different tables ('collections')
to manage state for the data. The schema for these tables is enumerated below
to make the code a bit easier to reason about.

All datastore user objects (Keys, Properties, PropertyMaps, etc.) are
serialized using `go.chromium.org/luci/gae/service/datastore/serialize`, which
in turn uses the primitives available in `go.chromium.org/luci/common/cmpbin`.
The encodings are important for understanding why the schemas below sort
correctly when compared only using `bytes.Compare` (aka `memcmp`). This doc
will assume that you're familiar with those encodings, but will point out where
we diverge from the stock encodings.

All encoded Property values used in memory store Keys (i.e. index rows) are
serialized using the settings `serialize.WithoutContext` and
`datastore.ShouldIndex`.

### Primary table

The primary table maps datastore keys to entities.

- Name: `"ents:" + namespace`
- Key: serialized datastore.Property containing the entity's datastore.Key
- Value: serialized datastore.PropertyMap

This table also encodes values for the following special keys:

- Every entity root (i.e. a Key with nil Parent()) with key K has:
  - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group__` and an id of `1`. The
    value has a single property `__version__`, which contains the version
    number of this entity group. This is used to detect transaction conflicts.
  - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group_ids__` and an id of `1`. The
    value has a single property `__version__`, which contains the last
    automatically allocated entity ID for entities within this entity group.
- A root entity with the key `Key("__entity_group_ids__", 1)` which contains
  the same `__version__` property, and indicates the last automatically
  allocated entity ID for root entities.

### Compound Index table

The next table keeps track of all the user-added 'compound' index descriptions
(not the content for the indexes). There is a row in this table for each
compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.

- Name: `"idx"`
- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy
  slice in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
- Value: empty

The Key format here requires some special attention. Say you started with a
compound IndexDefinition of:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "Something", Direction: DESCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
      }
    }

After prepping it for the table, it would be equivalent to:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "__key__", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Something", Direction: DESCENDING},
      }
    }

The reason for doing this will be covered in the `Query Planning` section, but
it boils down to allowing the query planner to use this encoded table to
intelligently scan for potentially useful compound indexes.
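To make the reversal concrete, here's a minimal Go sketch of what a
`PrepForIdxTable`-style normalization does. The types are simplified stand-ins
invented for this example, not the real `datastore` package:

    package main

    import "fmt"

    // Simplified stand-ins for the real datastore types.
    type Direction bool

    const (
        ASCENDING  Direction = false
        DESCENDING Direction = true
    )

    type IndexColumn struct {
        Property  string
        Direction Direction
    }

    type IndexDefinition struct {
        Kind     string
        Ancestor bool
        SortBy   []IndexColumn
    }

    // prepForIdxTable mimics the normalization: append the implied trailing
    // __key__ column, then reverse the whole SortBy slice, so that indexes
    // sharing a suffix share a prefix in the "idx" table and can be found
    // with a bytewise prefix scan.
    func prepForIdxTable(id IndexDefinition) IndexDefinition {
        cols := append(append([]IndexColumn(nil), id.SortBy...),
            IndexColumn{Property: "__key__", Direction: ASCENDING})
        for i, j := 0, len(cols)-1; i < j; i, j = i+1, j-1 {
            cols[i], cols[j] = cols[j], cols[i]
        }
        return IndexDefinition{Kind: id.Kind, Ancestor: id.Ancestor, SortBy: cols}
    }

    func main() {
        idx := IndexDefinition{
            Kind:     "Foo",
            Ancestor: true,
            SortBy: []IndexColumn{
                {Property: "Something", Direction: DESCENDING},
                {Property: "Else", Direction: ASCENDING},
                {Property: "Cool", Direction: ASCENDING},
            },
        }
        // SortBy comes out as: __key__, Cool, Else, Something (DESCENDING),
        // matching the prepped example above.
        fmt.Printf("%+v\n", prepForIdxTable(idx))
    }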
### Index Tables

Every index (both builtin and compound) has one index table per namespace,
which contains one row for every entry in the index.

- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
- Key: concatenated datastore.Property values, one per SortBy column in the
  IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
  DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF).
- Value: empty

If the IndexDefinition has `Ancestor: true`, then the very first column of the
Key contains the partial Key for the entity. So if an entity has the datastore
key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor
column of the indexes that it matches.

#### Builtin (automatic) indexes

The following indexes are automatically created for an entity with a key
`/Kind,*`, for every property named "Foo" which has at least one `ShouldIndex`
value:

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: DESCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}

Index updates
-------------

(Note that this is a LARGE departure from how the production appengine
datastore does this. This model only works because the implementation is not
distributed, and not journaled. The real datastore does index updates in
parallel and is generally pretty fancy compared to this.)

Index updates are pretty straightforward. On a mutation to the primary entity
table, we take the old entity value (remember that entity values are
PropertyMaps) and the new entity value, create index entries for both, merge
them, and apply the deltas to the affected index tables (i.e. entries that
exist for the old entity, but not the new one, are deleted; entries that exist
for the new entity, but not the old one, are added).

Index generation works (given a slice of indexes) by:

* serializing all ShouldIndex Properties in the PropertyMap to get a
  `map[name][]serializedProperty`.
* for each index idx:
  * if idx's columns contain properties that are not in the map, skip idx.
  * make a `[][]serializedProperty`, where each serializedProperty slice
    corresponds with the IndexColumn of idx.SortBy.
    * duplicated values for multi-valued properties are skipped.
  * generate a `[]byte` row which is the concatenation of one value from each
    `[]serializedProperty`, permuting through all combinations (as sketched
    below). If the SortBy column is DESCENDING, make sure to invert (XOR 0xFF)
    the serializedProperty value!
  * add that generated `[]byte` row to the index's corresponding table.
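As a concrete illustration of the row-generation step, here's a hypothetical
sketch with plain byte slices standing in for serialized Properties (all names
and types here are invented for the example, not the real implementation):

    package main

    import "fmt"

    // invert flips every byte (XOR 0xFF) so that DESCENDING columns sort in
    // reverse order under bytes.Compare.
    func invert(b []byte) []byte {
        out := make([]byte, len(b))
        for i, c := range b {
            out[i] = c ^ 0xFF
        }
        return out
    }

    // indexRows returns the concatenation of one serialized value per
    // column, permuting through all combinations. descending[i] says
    // whether column i is DESCENDING.
    func indexRows(cols [][][]byte, descending []bool) [][]byte {
        rows := [][]byte{{}} // start with a single empty prefix
        for i, vals := range cols {
            next := make([][]byte, 0, len(rows)*len(vals))
            for _, prefix := range rows {
                for _, v := range vals {
                    if descending[i] {
                        v = invert(v)
                    }
                    row := append(append([]byte(nil), prefix...), v...)
                    next = append(next, row)
                }
            }
            rows = next
        }
        return rows
    }

    func main() {
        // duck has two values; goose has one. A two-column index
        // (duck ASC, goose DESC) gets 2*1 = 2 rows.
        cols := [][][]byte{
            {{0x01}, {0x02}}, // pretend-serialized duck values
            {{0x10}},         // pretend-serialized goose value
        }
        for _, r := range indexRows(cols, []bool{false, true}) {
            fmt.Printf("% x\n", r) // prints "01 ef" and "02 ef"
        }
    }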
Note that we choose to serialize all permutations of the saved entity. This is
so that we can use repeated-column indexes to fill queries which use a subset
of the columns. E.g. if we have the index `duck,duck,duck,goose`, we can
theoretically use it to fill a query for `duck=1,duck=2,goose>"canadian"`, by
pasting 1 or 2 as the value for the 3rd `duck` column. This simplifies index
selection at the expense of larger indexes. However, it means that if you have
the entity:

    duck = 1, 2, 3, 4
    goose = "færøske"

It generates the following index entries:

    duck=1,duck=1,duck=1,goose="færøske"
    duck=1,duck=1,duck=2,goose="færøske"
    duck=1,duck=1,duck=3,goose="færøske"
    duck=1,duck=1,duck=4,goose="færøske"
    duck=1,duck=2,duck=1,goose="færøske"
    duck=1,duck=2,duck=2,goose="færøske"
    duck=1,duck=2,duck=3,goose="færøske"
    duck=1,duck=2,duck=4,goose="færøske"
    duck=1,duck=3,duck=1,goose="færøske"
    ... a lot ...
    duck=4,duck=4,duck=4,goose="færøske"

This is a very large number of index rows (i.e. an 'exploding index')!

An alternate design would be to only generate unique permutations of elements
where the index has repeated columns of a single property. This only makes
sense because it's illegal to have both an equality and an inequality filter
on the same property, under the current constraints of appengine (allowing
that wouldn't be completely ridiculous in general, if inequality constraints
meant the same thing as equality constraints, but it would lead to a
multi-dimensional query, which can be quite slow and is very difficult to
scale without application knowledge). If we do this, it also means that we
need to SORT the equality filter values when generating the prefix (so that
the least-valued equality constraint is first). If we did this, then the
generated index rows for the above entity would be:

    duck=1,duck=2,duck=3,goose="færøske"
    duck=1,duck=2,duck=4,goose="færøske"
    duck=1,duck=3,duck=4,goose="færøske"
    duck=2,duck=3,duck=4,goose="færøske"

Which would be a LOT more compact. It may be worth implementing this
restriction later, simply for the memory savings when indexing multi-valued
properties.

If this technique is used, there's also room to unambiguously index entities
with repeated equivalent values. E.g. if duck=1,1,2,3,4, then you could see a
row in the index like:

    duck=1,duck=1,duck=2,goose="færøske"

Which would allow you to query for "an entity which has duck values equal to
1, 1 and 2". Currently such a query is not possible to execute (it would be
equivalent to "an entity which has duck values equal to 1 and 2").

Query planning
--------------

Now that we have all of our data tabulated, let's plan some queries. The
high-level algorithm works like this:

* Generate a suffix format from the user's query (see the sketch after this
  list) which looks like:
  * orders (including the inequality as the first order, if any)
  * projected fields which aren't explicitly referenced in the orders (we
    assume ASCENDING order for them), in the order that they were projected.
  * `__key__` (implied ascending, unless the query's last sort order is for
    `__key__`, in which case it's whatever order the user specified)
* Reverse the order of this suffix format, and serialize it into an
  IndexDefinition, along with the query's Kind and Ancestor values. This does
  what PrepForIdxTable did when we added the Index in the first place.
* Use this serialized reversed index to find compound indexes which might
  match, by looking up rows in the "idx" table which begin with this
  serialized reversed index.
* Generate every builtin index for the inequality + equality filter
  properties, and see if they match too.
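Here's a small, hypothetical sketch of the suffix-format generation from the
first step. The types are simplified (the real planner tracks considerably
more state than this):

    package main

    import "fmt"

    type IndexColumn struct {
        Property   string
        Descending bool
    }

    type query struct {
        orders     []IndexColumn // inequality property first, if any
        projection []string      // projected fields, in projection order
    }

    func suffixFormat(q query) []IndexColumn {
        suffix := append([]IndexColumn(nil), q.orders...)
        seen := map[string]bool{}
        for _, c := range suffix {
            seen[c.Property] = true
        }
        // Projected fields not already referenced by an order are assumed
        // ASCENDING, in projection order.
        for _, p := range q.projection {
            if !seen[p] {
                suffix = append(suffix, IndexColumn{Property: p})
                seen[p] = true
            }
        }
        // __key__ is implied ascending unless the last order was on it.
        if !seen["__key__"] {
            suffix = append(suffix, IndexColumn{Property: "__key__"})
        }
        return suffix
    }

    func main() {
        q := query{
            orders:     []IndexColumn{{Property: "Val", Descending: true}},
            projection: []string{"Other"},
        }
        // [{Val true} {Other false} {__key__ false}]
        fmt.Println(suffixFormat(q))
    }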
An index is a potential match if its suffix *exactly* matches the suffix
format, and it contains *only* columns which appear in the query (an index is
rejected if, for example, it contains a column which doesn't appear in the
query as an equality or inequality filter).

The index search continues until:

* We find at least one matching index; AND
* The combination of all matching indexes accounts for every equality filter
  at least once.

If we fail to find sufficient indexes to fulfill the query, we generate an
index description that *could* be sufficient by concatenating all missing
equality filters, in ascending order, followed by concatenating the suffix
format that we generated for this query. We then suggest this new index to
the user for them to add by returning an error containing the generated
IndexDefinition. Note that the user is not REQUIRED to add this exact index;
they could choose to add bits and pieces of it, extend existing indexes in
order to cover the missing columns, invert the direction of some of the
equality columns, etc.

Recall that equality filters are expressed as
`map[propName][]serializedProperty`. We'll refer to this mapping as the
'constraint' mapping below.

To actually come up with the final index selection, we sort all the matching
indexes from greatest number of columns to least. We add the 0th index (the
one with the greatest number of columns) unconditionally. We then keep adding
indexes which contain one or more of the remaining constraints, until we have
no more constraints to satisfy.

Adding an index entails determining which columns in that index correspond to
equality columns, and which ones correspond to inequality/order/projection
columns. Recall that the inequality/order/projection columns are all the same
for all of the potential indexes (i.e. they all share the same *suffix
format*). We can use this to just iterate over the index's SortBy columns
which we'll use for equality filters. For each equality column, we remove a
corresponding value from the constraints map. In the event that we _run out_
of constraints for a given column, we simply _pick an arbitrary value_ from
the original equality filter mapping and use that. This is valid to do
because, after all, they're equality filters.
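A hypothetical sketch of that constraint-consumption step (invented names and
simplified bookkeeping; the real code also handles direction inversion and
much more):

    package main

    import (
        "bytes"
        "fmt"
    )

    // prefixFor consumes one outstanding constraint per equality column;
    // 'constraints' maps a property name to its not-yet-consumed equality
    // values, and 'original' keeps every equality value so we can reuse one
    // when constraints run dry.
    func prefixFor(eqCols []string, constraints, original map[string][][]byte) []byte {
        var buf bytes.Buffer
        for _, col := range eqCols {
            vals := constraints[col]
            var v []byte
            if len(vals) > 0 {
                // Consume one outstanding constraint for this property.
                v, constraints[col] = vals[0], vals[1:]
            } else {
                // Out of constraints: any of the original equality values
                // is valid here, since they're all equality filters.
                v = original[col][0]
            }
            buf.Write(v)
        }
        return buf.Bytes()
    }

    func main() {
        original := map[string][][]byte{"duck": {{0x01}, {0x02}}}
        constraints := map[string][][]byte{"duck": {{0x01}, {0x02}}}
        // An index with three repeated duck columns: the third column
        // reuses an arbitrary original value. Prints "01 02 01".
        fmt.Printf("% x\n", prefixFor([]string{"duck", "duck", "duck"}, constraints, original))
    }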
Note that for compound indexes, the ancestor key counts as an equality filter,
and if the compound index has `Ancestor: true`, then we implicitly put the
ancestor as if it were the first SortBy column. For satisfying Ancestor
queries with built-in indexes, see the next section.

Once we've got our list of constraints for this index, we concatenate them all
together to get the *prefix* for this index. When iterating over this index,
we only ever want to see index rows whose prefix exactly matches this. Unlike
the suffix format, the prefix is per-index (remember that ALL indexes in the
query must have the same suffix format).

Finally, we set the 'start' and 'end' values for all chosen indexes to either
the Start and End cursors, or the Greater-Than and Less-Than values for the
inequality. The Cursors contain values for every suffix column, and the
inequality only contains a value for the first suffix column. If both cursors
and an inequality are specified, we take the smaller set of both (the
combination which will return the fewest rows).

That's about it for index selection! See Query Execution for how we actually
use the selected indexes to run a query.

### Ancestor queries and Built-in indexes

You may have noticed that the built-in indexes can be used for Ancestor
queries with equality filters, but they don't start with the magic Ancestor
column!

There's a trick that you can do if the suffix format for the query is just
`__key__` though (i.e. the query only contains equality filters, and/or an
inequality filter on `__key__`). You can serialize the datastore key that
you're planning to use for the Ancestor query, then chop off the terminating
null byte from the encoding, and then use this as additional prefix bytes for
this index. So if the builtin for the "Val" property has the column format of:

    {Property: "Val"}, {Property: "__key__"}

And your query holds Val as an equality filter, you can serialize the
ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
an index row:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP

(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1

Omitting the STOP which normally terminates the Key encoding. Using this
prefix will only return index rows which are `/Kind,1` or its children. (A toy
sketch of this prefix construction appears at the end of this section.)

"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
Remember that this trick only works if the prefix before the `__key__` is
*entirely* composed of equality filters. Also recall that if you ONLY use
equality filters and Ancestor (and possibly an inequality on `__key__`), then
you can always satisfy the query from the built-in indexes! While you
technically could do it with a compound index, there's not really a point to
doing so. To remain faithful to the production datastore implementation, we
don't implement this trick for anything other than the built-in indexes.
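Here's that prefix construction as a toy Go sketch, using made-up byte
encodings rather than the real `serialize`/`cmpbin` formats:

    package main

    import (
        "bytes"
        "fmt"
    )

    // ancestorPrefix concatenates the equality values, then appends the
    // encoded ancestor key with its trailing STOP byte chopped off, turning
    // an exact key match into a "this key or any of its children" prefix.
    func ancestorPrefix(eqValues [][]byte, encodedAncestorKey []byte) []byte {
        var buf bytes.Buffer
        for _, v := range eqValues {
            buf.Write(v)
        }
        buf.Write(bytes.TrimSuffix(encodedAncestorKey, []byte{0x00}))
        return buf.Bytes()
    }

    func main() {
        val100 := []byte{0x12, 0x64}                        // pretend PTInt ++ 100
        key := []byte{0x0A, 'K', 'i', 'n', 'd', 0x01, 0x00} // pretend PTKey ++ "Kind" ++ 1 ++ STOP
        // Prints the prefix without the trailing 00 byte.
        fmt.Printf("% x\n", ancestorPrefix([][]byte{val100}, key))
    }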
### Cursor format

Cursors work by containing values for each of the columns in the suffix, in
the order and Direction specified by the suffix. In fact, cursors are just
encoded versions of the []IndexColumn used for the 'suffix format', followed
by the raw bytes of the suffix for that particular row (incremented by 1 bit).

This means that technically you can port cursors between any queries which
share precisely the same suffix format, regardless of other query options,
even if the index planner ends up choosing different indexes to use from the
first query to the second. No state is maintained in the service
implementation for cursors.

I suspect that this is a more liberal version of cursors than the production
appengine's, but I haven't verified one way or the other.

Query execution
---------------

Last but not least, we need to actually execute the query. After figuring out
which indexes to use with what prefixes and start/end values, we essentially
have a list of index subsets, all sorted the same way. To pull the values out,
we start by iterating the first index in the list, grabbing its suffix value,
and trying to iterate from that suffix in the second, third, fourth, etc.
index.

If any index iterates past that suffix, we start back at the 0th index with
that suffix, and continue to try to find a matching row. Doing this will end
up skipping large portions of all of the indexes in the list. This is the
algorithm known as "zigzag merge join", and you can find talks on it from some
of the appengine folks. It has very good algorithmic running time and tends to
scale with the number of full matches, rather than the size of all of the
indexes involved. (A toy sketch of this loop appears at the end of this
document.)

A hit occurs when all of the iterators have precisely the same suffix. This
hit suffix is then decoded using the suffix format information. The very last
column of the suffix will always be the datastore key. The suffix is then used
to call back to the user, according to the query type:

* keys-only queries just directly return the Key
* projection queries return the projected fields from the decoded suffix.
  Remember how we added all the projections after the orders? This is why. The
  projected values are pulled directly from the index, instead of going to the
  main entity table.
* normal queries pull the decoded Key from the "ents" table, and return that
  entity to the user.
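To close, here's a hypothetical, self-contained sketch of the zigzag merge
join, with in-memory sorted slices standing in for the real index iterators:

    package main

    import (
        "bytes"
        "fmt"
        "sort"
    )

    // seek returns the first suffix in idx that is >= target, or nil if the
    // index is exhausted.
    func seek(idx [][]byte, target []byte) []byte {
        i := sort.Search(len(idx), func(i int) bool {
            return bytes.Compare(idx[i], target) >= 0
        })
        if i == len(idx) {
            return nil
        }
        return idx[i]
    }

    // zigzag yields every suffix present in ALL indexes.
    func zigzag(indexes [][][]byte) [][]byte {
        var hits [][]byte
        cur := []byte{} // smallest possible suffix
        for {
            matched := true
            for _, idx := range indexes {
                got := seek(idx, cur)
                if got == nil {
                    return hits // some index is exhausted; no more hits
                }
                if !bytes.Equal(got, cur) {
                    // This index skipped past cur; restart from got.
                    cur, matched = got, false
                    break
                }
            }
            if matched {
                hits = append(hits, cur)
                // Advance to the next possible suffix after the hit.
                cur = append(append([]byte(nil), cur...), 0x00)
            }
        }
    }

    func main() {
        a := [][]byte{{1}, {3}, {5}, {7}}
        b := [][]byte{{3}, {4}, {5}}
        fmt.Println(zigzag([][][]byte{a, b})) // prints [[3] [5]]
    }

Note how the loop never walks either index element by element: each mismatch
re-seeks every iterator, which is what lets the running time scale with the
number of full matches rather than the total size of the indexes.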