Datastore implementation internals
==================================

This document contains internal implementation details for this in-memory
version of datastore. It's mostly helpful for understanding what's going on in
this implementation, but it can also reveal some insight into how the real
appengine datastore works (though note that the specific encodings are
different).

Additionally, note that this implementation cheats by moving some of the Key
bytes into the table (collection) names (like the namespace, property name for
the builtin indexes, etc.). The real implementation contains these bytes in the
table row keys, I think.


Internal datastore key/value collection schema
----------------------------------------------

The datastore implementation here uses several different tables ('collections')
to manage state for the data. The schema for these tables is enumerated below
to make the code a bit easier to reason about.

All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized
using `go.chromium.org/luci/gae/service/datastore/serialize`, which in turn uses
the primitives available in `go.chromium.org/luci/common/cmpbin`. The encodings
are important to understanding why the schemas below sort correctly when
compared only using `bytes.Compare` (aka `memcmp`). This doc will assume that
you're familiar with those encodings, but will point out where we diverge from
the stock encodings.

All encoded Property values used in memory store Keys (i.e. index rows) are
serialized using the settings `serialize.WithoutContext` and
`datastore.ShouldIndex`.

### Primary table

The primary table maps datastore keys to entities.

- Name: `"ents:" + namespace`
- Key: serialized datastore.Property containing the entity's datastore.Key
- Value: serialized datastore.PropertyMap

This table also encodes values for the following special keys:

- Every entity root (e.g. a Key with nil Parent()) with key K has:
  - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group__` and an id of `1`. The value
    has a single property `__version__`, which contains the version number of
    this entity group. This is used to detect transaction conflicts.
  - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group_ids__` and an id of `1`. The
    value has a single property `__version__`, which contains the last
    automatically allocated entity ID for entities within this entity group.
- A root entity with the key `Key("__entity_group_ids__",1)` which contains the
  same `__version__` property, and indicates the last automatically allocated
  entity ID for root entities.

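The conflict-detection role of `__version__` can be sketched as a toy: a
counter per entity group that bumps on every mutation, with a transaction
failing if the version it observed at start has since changed. The names
`groupVersions`, `bump`, and `conflicts` below are illustrative, not the real
API.

```go
package main

import "fmt"

// groupVersions is an illustrative sketch (not the real implementation):
// per-entity-group version counters, keyed by the root entity's key.
type groupVersions map[string]int64

// bump records a mutation to any entity under the given root, incrementing
// the group's __version__.
func (g groupVersions) bump(rootKey string) { g[rootKey]++ }

// conflicts reports whether the group changed since snapshot was read.
func (g groupVersions) conflicts(rootKey string, snapshot int64) bool {
	return g[rootKey] != snapshot
}

func main() {
	g := groupVersions{}
	snapshot := g["/Some,1"] // version observed at transaction start
	g.bump("/Some,1")        // a concurrent mutation bumps the version
	fmt.Println(g.conflicts("/Some,1", snapshot)) // true: transaction retries
}
```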
### Compound Index table

The next table keeps track of all the user-added 'compound' index descriptions
(not the content for the indexes). There is a row in this table for each
compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.

- Name: `"idx"`
- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice
  in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
- Value: empty

The Key format here requires some special attention. Say you started with
a compound IndexDefinition of:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "Something", Direction: DESCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
      }
    }

After prepping it for the table, it would be equivalent to:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "__key__", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Something", Direction: DESCENDING},
      }
    }

The reason for doing this will be covered in the `Query Planning` section, but
it boils down to allowing the query planner to use this encoded table to
intelligently scan for potentially useful compound indexes.
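The prep step shown above can be sketched as a slice transformation: the
implied ascending `__key__` column comes first, followed by the original
SortBy columns in reverse order. The `col` type is a simplified stand-in for
`datastore.IndexColumn`, and `prepForIdxTable` is a hypothetical name.

```go
package main

import "fmt"

// col is a simplified stand-in for datastore.IndexColumn.
type col struct {
	Prop string
	Desc bool
}

// prepForIdxTable mirrors the transformation in the example above: prepend
// the implied ascending __key__ column, then append the original SortBy
// columns in reverse order.
func prepForIdxTable(sortBy []col) []col {
	out := []col{{Prop: "__key__"}}
	for i := len(sortBy) - 1; i >= 0; i-- {
		out = append(out, sortBy[i])
	}
	return out
}

func main() {
	sortBy := []col{{"Something", true}, {"Else", false}, {"Cool", false}}
	fmt.Println(prepForIdxTable(sortBy)) // __key__, Cool, Else, Something
}
```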

### Index Tables

Every index (both builtin and compound) has one index table per namespace,
which contains one row per index entry.

- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
- Key: concatenated datastore.Property values, one per SortBy column in the
  IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
  DESCENDING, the serialized Property is inverted (i.e. XOR 0xFF).
- Value: empty

If the IndexDefinition has `Ancestor: true`, then the very first column of the
Key contains the partial Key for the entity. So if an entity has the datastore
key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor
column of indexes that it matches.
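The DESCENDING inversion can be sketched as a byte-wise XOR; `invert` is an
illustrative name, not the real helper:

```go
package main

import (
	"bytes"
	"fmt"
)

// invert flips every byte of a serialized property value so that
// bytes.Compare on the inverted form yields the reverse of the natural
// sort order (illustrative sketch).
func invert(b []byte) []byte {
	out := make([]byte, len(b))
	for i, c := range b {
		out[i] = c ^ 0xFF
	}
	return out
}

func main() {
	a, b := []byte{0x01, 0x02}, []byte{0x01, 0x03}
	fmt.Println(bytes.Compare(a, b))                 // -1: a sorts before b
	fmt.Println(bytes.Compare(invert(a), invert(b))) // 1: order is reversed
}
```

Because the inversion is applied per serialized value, rows still sort
correctly under a single `bytes.Compare` over the whole concatenated key.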

#### Builtin (automatic) indexes

The following indexes are automatically created for an entity with a key
`/Kind,*`, for every property (with `ShouldIndex` values) named "Foo":

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: DESCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}

Index updates
-------------

(Note that this is a LARGE departure from how the production appengine datastore
does this. This model only works because the implementation is not distributed,
and not journaled. The real datastore does index updates in parallel and is
generally pretty fancy compared to this.)

Index updates are pretty straightforward. On a mutation to the primary entity
table, we take the old entity value (remember that entity values are
PropertyMaps) and the new entity value, create index entries for both, diff
them, and apply the deltas to the affected index tables (i.e. entries that
exist in the old entity, but not the new one, are deleted; entries that exist
in the new entity, but not the old one, are added).
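The delta step can be sketched like so, with index rows represented as
strings for brevity; `indexDeltas` is a hypothetical helper:

```go
package main

import "fmt"

// indexDeltas computes which index rows to delete (present only in the old
// entity's rows) and which to add (present only in the new entity's rows).
func indexDeltas(oldRows, newRows map[string]bool) (toDelete, toAdd []string) {
	for r := range oldRows {
		if !newRows[r] {
			toDelete = append(toDelete, r)
		}
	}
	for r := range newRows {
		if !oldRows[r] {
			toAdd = append(toAdd, r)
		}
	}
	return
}

func main() {
	oldRows := map[string]bool{"Foo=1": true, "Foo=2": true}
	newRows := map[string]bool{"Foo=2": true, "Foo=3": true}
	del, add := indexDeltas(oldRows, newRows)
	fmt.Println(del, add) // [Foo=1] [Foo=3]
}
```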

Index generation works (given a slice of indexes `[]Idxs`) by:

* serializing all ShouldIndex Properties in the PropertyMap to get a
  `map[name][]serializedProperty`.
* for each index idx
  * if idx's columns contain properties that are not in the map, skip idx
  * make a `[][]serializedProperty`, where each serializedProperty slice
    corresponds with the IndexColumn of idx.SortBy
    * duplicated values for multi-valued properties are skipped.
  * generate a `[]byte` row which is the concatenation of one value from each
    `[]serializedProperty`, permuting through all combinations. If the SortBy
    column is DESCENDING, make sure to invert (XOR 0xFF) the serializedProperty
    value!
  * add that generated `[]byte` row to the index's corresponding table.
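The permutation step above can be sketched as a cartesian product over the
per-column value slices. Strings stand in for serialized property bytes, and
the DESCENDING inversion is omitted; `indexRows` is an illustrative name.

```go
package main

import "fmt"

// indexRows permutes through all combinations, taking one value from each
// column's slice of serialized values, producing one index row per
// combination.
func indexRows(cols [][]string) [][]string {
	rows := [][]string{{}}
	for _, vals := range cols {
		var next [][]string
		for _, row := range rows {
			for _, v := range vals {
				extended := append(append([]string{}, row...), v)
				next = append(next, extended)
			}
		}
		rows = next
	}
	return rows
}

func main() {
	// One row per combination: two duck values and one goose value yield 2
	// rows; three repeated duck columns over 4 values would yield 64.
	fmt.Println(len(indexRows([][]string{{"1", "2"}, {"g"}}))) // 2
}
```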

Note that we choose to serialize all permutations of the saved entity. This is
so that we can use repeated-column indexes to fill queries which use a subset of
the columns. E.g. if we have the index `duck,duck,duck,goose`, we can
theoretically use it to fill a query for `duck=1,duck=2,goose>"canadian"`, by
pasting 1 or 2 as the value for the 3rd `duck` column. This simplifies index
selection at the expense of larger indexes. However, it means that if you have
the entity:

    duck = 1, 2, 3, 4
    goose = "færøske"

It generates the following index entries:

    duck=1,duck=1,duck=1,goose="færøske"
    duck=1,duck=1,duck=2,goose="færøske"
    duck=1,duck=1,duck=3,goose="færøske"
    duck=1,duck=1,duck=4,goose="færøske"
    duck=1,duck=2,duck=1,goose="færøske"
    duck=1,duck=2,duck=2,goose="færøske"
    duck=1,duck=2,duck=3,goose="færøske"
    duck=1,duck=2,duck=4,goose="færøske"
    duck=1,duck=3,duck=1,goose="færøske"
    ... a lot ...
    duck=4,duck=4,duck=4,goose="færøske"

This is a very large number of index rows (i.e. an 'exploding index')!

An alternate design would be to only generate unique permutations of elements
where the index has repeated columns of a single property. This only makes
sense because it's illegal to have an equality and an inequality on the same
property under the current constraints of appengine (though it's not completely
ridiculous in general, if inequality constraints meant the same thing as
equality constraints; however, that would lead to a multi-dimensional query,
which can be quite slow and is very difficult to scale without application
knowledge). If we do this, it also means that we need to SORT the equality
filter values when generating the prefix (so that the least-valued equality
constraint is first). If we did this, then the generated index rows for the
above entity would be:

    duck=1,duck=2,duck=3,goose="færøske"
    duck=1,duck=2,duck=4,goose="færøske"
    duck=1,duck=3,duck=4,goose="færøske"
    duck=2,duck=3,duck=4,goose="færøske"

Which would be a LOT more compact. It may be worth implementing this restriction
later, simply for the memory savings when indexing multi-valued properties.
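The alternate design amounts to emitting the sorted k-element combinations of
a property's values instead of all k-permutations. A sketch of that (not
implemented in this package; `combinations` is a hypothetical name):

```go
package main

import (
	"fmt"
	"sort"
)

// combinations emits the sorted k-element combinations of vals, which is
// what an index with k repeated columns of one property would store under
// the proposed restriction.
func combinations(vals []string, k int) [][]string {
	sort.Strings(vals)
	var out [][]string
	var rec func(start int, cur []string)
	rec = func(start int, cur []string) {
		if len(cur) == k {
			out = append(out, append([]string{}, cur...))
			return
		}
		for i := start; i < len(vals); i++ {
			rec(i+1, append(cur, vals[i]))
		}
	}
	rec(0, nil)
	return out
}

func main() {
	// duck = 1, 2, 3, 4 over three repeated columns: 4 rows instead of 64.
	fmt.Println(combinations([]string{"1", "2", "3", "4"}, 3))
}
```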

If this technique is used, there's also room to unambiguously index entities
with repeated equivalent values. E.g. if duck=1,1,2,3,4, then you could see
a row in the index like:

    duck=1,duck=1,duck=2,goose="færøske"

Which would allow you to query for "an entity which has duck values equal to 1,
1 and 2". Currently such a query is not possible to execute (it would be
equivalent to "an entity which has duck values equal to 1 and 2").

Query planning
--------------

Now that we have all of our data tabulated, let's plan some queries. The
high-level algorithm works like this:

* Generate a suffix format from the user's query which looks like:
  * orders (including the inequality as the first order, if any)
  * projected fields which aren't explicitly referenced in the orders (we
    assume ASCENDING order for them), in the order that they were projected.
  * `__key__` (implied ascending, unless the query's last sort order is for
    `__key__`, in which case it's whatever order the user specified)
* Reverse the order of this suffix format, and serialize it into an
  IndexDefinition, along with the query's Kind and Ancestor values. This
  does what PrepForIdxTable did when we added the Index in the first place.
* Use this serialized reversed index to find compound indexes which might
  match by looking up rows in the "idx" table which begin with this serialized
  reversed index.
* Generate every builtin index for the inequality + equality filter
  properties, and see if they match too.
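The suffix-format step can be sketched like so, assuming `orders` already has
the inequality (if any) as its first element. The `order` type and
`suffixFormat` name are illustrative stand-ins, not the real API.

```go
package main

import "fmt"

// order is a simplified stand-in for datastore.IndexColumn.
type order struct {
	Prop string
	Desc bool
}

// suffixFormat builds the suffix: the query's orders, then any projected
// fields not already referenced (assumed ascending), then __key__ if it is
// not already present.
func suffixFormat(orders []order, projections []string) []order {
	out := append([]order{}, orders...)
	seen := map[string]bool{}
	for _, o := range out {
		seen[o.Prop] = true
	}
	for _, p := range projections {
		if !seen[p] {
			out = append(out, order{Prop: p})
			seen[p] = true
		}
	}
	if !seen["__key__"] {
		out = append(out, order{Prop: "__key__"})
	}
	return out
}

func main() {
	fmt.Println(suffixFormat([]order{{Prop: "Cost", Desc: true}}, []string{"Name"}))
	// [{Cost true} {Name false} {__key__ false}]
}
```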

An index is a potential match if its suffix *exactly* matches the suffix format,
and it contains *only* sort orders which appear in the query (i.e. the index
must not contain a column which doesn't appear as an equality or inequality
filter).

The index search continues until:

* We find at least one matching index; AND
* The combination of all matching indexes accounts for every equality filter
  at least once.

If we fail to find sufficient indexes to fulfill the query, we generate an index
description that *could* be sufficient by concatenating all missing equality
filters, in ascending order, followed by concatenating the suffix format that we
generated for this query. We then suggest this new index to the user for them to
add by returning an error containing the generated IndexDefinition. Note that
the user is not REQUIRED to add this exact index; they could choose to add bits
and pieces of it, extend existing indexes in order to cover the missing columns,
invert the direction of some of the equality columns, etc.

Recall that equality filters are expressed as
`map[propName][]serializedProperty`. We'll refer to this mapping as the
'constraint' mapping below.

To actually come up with the final index selection, we sort all the matching
indexes from greatest number of columns to least. We add the 0th index (the one
with the greatest number of columns) unconditionally. We then keep adding
indexes which contain one or more of the remaining constraints, until we have no
more constraints to satisfy.
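That greedy loop can be sketched as follows, with each candidate index
reduced to the property names of its equality columns. This is a plausible
reading of the prose above, not the actual implementation; all names are
illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// chooseIndexes sorts candidates by column count (descending), takes the
// first unconditionally, then keeps taking any index that satisfies at
// least one remaining equality constraint, until none remain.
func chooseIndexes(candidates [][]string, equalityProps []string) [][]string {
	sort.SliceStable(candidates, func(i, j int) bool {
		return len(candidates[i]) > len(candidates[j])
	})
	remaining := map[string]bool{}
	for _, p := range equalityProps {
		remaining[p] = true
	}
	var chosen [][]string
	for i, idx := range candidates {
		covers := false
		for _, p := range idx {
			if remaining[p] {
				covers = true
				delete(remaining, p)
			}
		}
		if i == 0 || covers {
			chosen = append(chosen, idx)
		}
		if len(remaining) == 0 {
			break
		}
	}
	return chosen
}

func main() {
	chosen := chooseIndexes([][]string{{"C"}, {"A", "B"}}, []string{"A", "B", "C"})
	fmt.Println(chosen) // [[A B] [C]]
}
```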

Adding an index entails determining which columns in that index correspond to
equality columns, and which ones correspond to inequality/order/projection
columns. Recall that the inequality/order/projection columns are all the same
for all of the potential indexes (i.e. they all share the same *suffix format*).
We can use this to just iterate over the index's SortBy columns which we'll use
for equality filters. For each equality column, we remove a corresponding value
from the constraints map. In the event that we _run out_ of constraints for a
given column, we simply _pick an arbitrary value_ from the original equality
filter mapping and use that. This is valid to do because, after all, they're
equality filters.

Note that for compound indexes, the ancestor key counts as an equality filter,
and if the compound index has `Ancestor: true`, then we implicitly treat the
ancestor as if it were the first SortBy column. For satisfying Ancestor queries
with built-in indexes, see the next section.

Once we've got our list of constraints for this index, we concatenate them all
together to get the *prefix* for this index. When iterating over this index, we
only ever want to see index rows whose prefix exactly matches this. Unlike the
suffix format, the prefix is per-index (remember that ALL indexes in the
query must have the same suffix format).

Finally, we set the 'start' and 'end' values for all chosen indexes to either
the Start and End cursors, or the Greater-Than and Less-Than values for the
inequality. The Cursors contain values for every suffix column, and the
inequality only contains a value for the first suffix column. If both cursors
and an inequality are specified, we take the smaller set of both (the
combination which will return the fewest rows).
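Taking "the smaller set of both" can be sketched with `bytes.Compare`: the
start becomes the larger of the two lower bounds, and the end the smaller of
the two upper bounds (nil meaning unbounded). These helper names are
illustrative, not the real implementation.

```go
package main

import (
	"bytes"
	"fmt"
)

// maxBytes returns the larger of two lower bounds (nil = unbounded).
func maxBytes(a, b []byte) []byte {
	if a == nil {
		return b
	}
	if b == nil || bytes.Compare(a, b) >= 0 {
		return a
	}
	return b
}

// minBytes returns the smaller of two upper bounds (nil = unbounded).
func minBytes(a, b []byte) []byte {
	if a == nil {
		return b
	}
	if b == nil || bytes.Compare(a, b) <= 0 {
		return a
	}
	return b
}

// tighterRange combines cursor bounds and inequality bounds into the scan
// range that returns the fewest rows.
func tighterRange(curStart, curEnd, ineqStart, ineqEnd []byte) (start, end []byte) {
	return maxBytes(curStart, ineqStart), minBytes(curEnd, ineqEnd)
}

func main() {
	start, end := tighterRange([]byte{1}, []byte{9}, []byte{2}, []byte{8})
	fmt.Println(start, end) // [2] [8]
}
```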

That's about it for index selection! See Query Execution for how we actually use
the selected indexes to run a query.

### Ancestor queries and Built-in indexes

You may have noticed that the built-in indexes can be used for Ancestor queries
with equality filters, but they don't start with the magic Ancestor column!

There's a trick that you can do if the suffix format for the query is just
`__key__` though (e.g. the query only contains equality filters, and/or an
inequality filter on `__key__`). You can serialize the datastore key that you're
planning to use for the Ancestor query, then chop off the terminating null byte
from the encoding, and then use this as additional prefix bytes for this index.
So if the builtin for the "Val" property has the column format of:

    {Property: "Val"}, {Property: "__key__"}

And your query holds Val as an equality filter, you can serialize the
ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
an index row:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP

(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1

Omitting the STOP which normally terminates the Key encoding. Using this prefix
will only return index rows which are `/Kind,1` or its children.
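The byte-chopping part of the trick is simple to sketch. The key bytes used
below are a toy stand-in, not the real cmpbin key encoding, and
`ancestorPrefix` is a hypothetical name.

```go
package main

import "fmt"

// ancestorPrefix drops the terminating STOP (0x00) byte from a serialized
// ancestor key, leaving prefix bytes that match the ancestor itself and
// every one of its children.
func ancestorPrefix(serializedKey []byte) []byte {
	if n := len(serializedKey); n > 0 && serializedKey[n-1] == 0x00 {
		return serializedKey[:n-1]
	}
	return serializedKey
}

func main() {
	// Toy encoding of /Kind,1: "Kind" ++ 1 ++ STOP.
	key := []byte{'K', 'i', 'n', 'd', 1, 0x00}
	fmt.Println(ancestorPrefix(key)) // [75 105 110 100 1]
}
```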

"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
Remember that this trick only works if the prefix before the `__key__` is
*entirely* composed of equality filters. Also recall that if you ONLY use
equality filters and Ancestor (and possibly an inequality on `__key__`), then
you can always satisfy the query from the built-in indexes! While you
technically could do it with a compound index, there's not really a point to
doing so. To remain faithful to the production datastore implementation, we
don't implement this trick for anything other than the built-in indexes.

### Cursor format

Cursors work by containing values for each of the columns in the suffix, in the
order and Direction specified by the suffix. In fact, cursors are just encoded
versions of the []IndexColumn used for the 'suffix format', followed by the
raw bytes of the suffix for that particular row (incremented by 1 bit).

This means that technically you can port cursors between any queries which share
precisely the same suffix format, regardless of other query options, even if the
index planner ends up choosing different indexes to use from the first query to
the second. No state is maintained in the service implementation for cursors.

I suspect that this is a more liberal version of cursors than how the production
appengine implements them, but I haven't verified one way or the other.

Query execution
---------------

Last but not least, we need to actually execute the query. After figuring out
which indexes to use with what prefixes and start/end values, we essentially
have a list of index subsets, all sorted the same way. To pull the values out,
we start by iterating the first index in the list, grabbing its suffix value,
and trying to iterate from that suffix in the second, third, fourth, etc. index.

If any index iterates past that suffix, we start back at the 0th index with that
suffix, and continue to try to find a matching row. Doing this will end up
skipping large portions of all of the indexes in the list. This is the algorithm
known as "zigzag merge join", and you can find talks on it from some of the
appengine folks. It has very good algorithmic running time and tends to scale
with the number of full matches, rather than the size of all of the indexes
involved.
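The zigzag merge join can be sketched over sorted in-memory slices, with
strings standing in for index-row suffixes; `zigzagMerge` is an illustrative
name, not the real function.

```go
package main

import "fmt"

// zigzagMerge walks several sorted lists of suffixes in lockstep: it finds
// the largest suffix among the iterators' current positions, seeks every
// iterator forward to at least that suffix, and records a hit whenever all
// iterators land on exactly the same suffix.
func zigzagMerge(lists [][]string) []string {
	pos := make([]int, len(lists))
	var hits []string
	for {
		// Find the largest suffix among the iterators' current positions.
		target := ""
		for i, l := range lists {
			if pos[i] >= len(l) {
				return hits // any exhausted iterator ends the join
			}
			if l[pos[i]] > target {
				target = l[pos[i]]
			}
		}
		// Seek every iterator forward to at least that suffix; this is the
		// step that skips large portions of each index.
		matched := true
		for i, l := range lists {
			for pos[i] < len(l) && l[pos[i]] < target {
				pos[i]++
			}
			if pos[i] >= len(l) {
				return hits
			}
			if l[pos[i]] != target {
				matched = false
			}
		}
		if matched {
			hits = append(hits, target)
			for i := range lists {
				pos[i]++
			}
		}
	}
}

func main() {
	lists := [][]string{{"a", "b", "c"}, {"b", "c", "d"}}
	fmt.Println(zigzagMerge(lists)) // [b c]
}
```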

A hit occurs when all of the iterators have precisely the same suffix. This hit
suffix is then decoded using the suffix format information. The very last column
of the suffix will always be the datastore key. The suffix is then used to call
back to the user, according to the query type:

* keys-only queries just directly return the Key
* projection queries return the projected fields from the decoded suffix.
  Remember how we added all the projections after the orders? This is why. The
  projected values are pulled directly from the index, instead of going to the
  main entity table.
* normal queries pull the decoded Key from the "ents" table, and return that
  entity to the user.