github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20171020_inverted_indexes.md

github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20171020_inverted_indexes.md (about)

     1  # Inverted Indexes RFC
     2  - Feature Name: Inverted Indexes
     3  - Status: accepted
     4  - Start Date: 9/22/2017
     5  - Authors: Masha Schneider, Jordan Lewis
     6  - RFC PR: #18992
     7  - Cockroach Issue: #2969
     8  
     9  # Summary
    10  
    11  Add infrastructure for inverted indexes in CockroachDB, which will permit
    12  indexing arbitrarily-nested primitives in JSON column types in the short term,
    13  array columns in the medium term, and full-text indexing of text columns and
    14  geospatial indexes in the long term.
    15  
    16  # Motivation
    17  
    18  We’ve committed to supporting JSON column types to make CockroachDB useful for
    19  schema-less data, and for compatibility with Postgres applications that use the
    20  `JSONB` column type. (n.b. this document discusses JSON columns and values, but
    21  in CockroachDB they will be referred to as either `JSON` or `JSONB` depending
    22  on the outcome from the `JSON` implementation RFC) Having JSON columns on their
    23  own is useful, but their usefulness is severely curtailed without an efficient
    24  way to index them, and CockroachDB currently only provides indexes that permit
    25  searching based on comparison to a prefix of a column value. In order to
    26  support JSON in a way that is useful to our customers, we need to be able to
    27  index JSON documents in a more fine-grained way than comparison on value
    28  prefix.
    29  
    30  Specifically, the list of use cases that we’ve seen for indexing JSON columns
    31  includes:
    32  
    33  - Filtering JSON columns based on existence of top-level keys. For example,
    34    “give me all rows whose JSON column contains the key `foo` at the top level”.
    35  - Filtering JSON columns based on primitive values of top-level keys. For
    36    example, “give me all rows whose JSON column contains the key `foo` at the
    37    top level, and its value is equal to `bar`”.
    38  - Filtering JSON columns based on primitive values of deeply-nested keys. For
    39    example, “give me all rows whose JSON column has the path `foo.bar.baz =
    40    qux`”.
    41  
    42  While out of scope for immediate implementation, here are some other related
    43  questions that are not possible to efficiently answer with CockroachDB’s
    44  current indexing scheme:
    45  
    46  - Filtering array columns based on what array elements they contain. For
    47    example, “give me all rows whose array column contains the integer 3”.
    48  - Filtering text columns based on what words they contain. For example, “give
    49    me all rows whose text column contain the words “CockroachDB rocks”.
    50  
    51  None of the questions posed by these use cases are efficiently answerable by
    52  CockroachDB’s conventional indexes without full table scans, since they are not
    53  satisfiable by ordinary value prefix comparison operators. We’d like to be able
    54  to answer these questions and others like them in an efficient way; inverted
    55  indexes provide a way to do so.
    56  
    57  # Guide-level explanation
    58  ## What is an inverted index?
    59  
    60  All of the above questions share a common structure: they’re all filters on
    61  components of tokenizable data. Arrays are tokenized by element; text is
    62  tokenized by splitting on whitespace and stop words; JSON is tokenized by
    63  key-value pairs.
    64  
    65  An inverted index exploits the tokenizable nature of a data type to permit
    66  efficient searches of the following form:
    67  
    68  
    69  > Give me all of the rows of table `t` whose tokenizable column `c` contains
    70  > tokens that satisfy my constraints.
    71  
    72  Being able to answer this general kind of question efficiently is extremely
    73  powerful, and allows us to satisfy the near-term use cases for JSON above.
    74  
    75  ----------
    76  
    77  **Aside: Why is it called an inverted index?**
    78  A common point of confusion around inverted indexes is their name. What makes
    79  an inverted index any more inverted than a secondary index, which points from a
    80  column value back to the primary key of table?
    81  
    82  Originally, inverted indexes were designed for full text search over documents.
    83  In that domain, the forward index allows you to retrieve the list of words
    84  associated with a particular document ID. To implement full text search, you
    85  need to reverse the mapping: instead of pointing from document ID to a list of
    86  words, you would point from word to list of document IDs.
    87  
    88  This inversion of the direction of the index is where the name inverted index
    89  comes from.
    90  
    91  ----------
    92  ## Review: Ordinary Indexes in CockroachDB
    93  
    94  An ordinary primary or secondary index in CockroachDB maps one or several
    95  column values to an **index key**. In the KV layer this is represented as a key
    96  with the **index key** and maybe some other other column values. The KV layer
    97  is sorted, which makes retrieval of index values fast when you know the index
    98  key. Consider the following table:
    99  
   100  
   101      CREATE TABLE t (
   102            keyCol  INT PRIMARY KEY,
   103            jsonCol JSONB
   104      );
   105  
   106  In a simplified view of the KV layer, this table would look like the following:
   107  
   108  | **Key (table id, index id, value)** | **Value**                                    |
   109  | ----------------------------------- | -------------------------------------------- |
   110  | `t/1/1`                             | `{"x": "a", "y": 7}` |
   111  | `t/1/2`                             | `{"x": "b", "z": {"a": true, "b": false}}`   |
   112  | `t/1/3`                             | `{"y": 7, "z": {"b": false}}`                |
   113  
   114  In this case, the keys in the KV have a prefix containing the table name, the
   115  index key and the name of the column for which the value is stored in the value
   116  field. This makes it very fast to run queries where the index key is in the
   117  `where` clause. For example:
   118  
   119      SELECT jsonCol FROM t where keyCol = x;
   120  
   121  ## Why are ordinary secondary indexes insufficient?
   122  
   123  An ordinary index works very well when you’re searching based on prefixes of
   124  sorted data. However, if we want to be able to query over tokenizable data like
   125  in the motivation section, this approach doesn’t save enough information about
   126  the individual components. We need something more sophisticated because the
   127  component of interest might not be a prefix on the whole data structure.
   128  
   129  For example, suppose we wanted to query the table above for all rows where
   130  `json``Col->z->b = false`. At a glance it’s easy to see that both rows `2` and
   131  `3` satisfy that constraint.
   132  
   133  An ordinary secondary index built by storing the entire `jsonCol` as the key
   134  and the `keyCol` as the value would not help us answer this question
   135  efficiently, since it would be sorted lexicographically by the `valueCol` and
   136  the JSON `"z"` key of the value is not necessarily the first JSON key of
   137  the value. We could get around this if we had computed indexes, by creating an
   138  index on `valueCol->z`, but in order to do that we would need to know that we
   139  were indexing on `z` beforehand. Users of unstructured data types would like
   140  their indexes to work on the whole space *without* specifying the keys to index
   141  on.
   142  
   143  ## How does inverted indexing solve the problem?
   144  
   145  At a high level without going into detail about the physical key encoding, an
   146  inverted index on the above data would look like the following:
   147  
   148  | **Key (json path + value)** | **Value (rows whose** `**valueCol**` **contains the key)** |
   149  | --------------------------- | ---------------------------------------------------------- |
   150  | `x:a`                       | `1`                                                        |
   151  | `x:b`                       | `2`                                                        |
   152  | `y:7`                       | `1`                                                        |
   153  | `y:7`                       | `3`                                                        |
   154  | `z:a:true`                  | `2`                                                        |
   155  | `z:b:false`                 | `2`                                                        |
   156  | `z:b:false`                 | `3`                                                        |
   157  
   158  The question we posed above `valueCol->z->b = false` is now incredibly natural
   159  to answer with this type of index. Now it’s a straightforward lookup into the
   160  KV, just the way ordinary indexes are.
   161  
   162  This general approach to indexing is usable on any tokenizable datatype, such
   163  as arrays, text, and even XML, although we won't be adding support for any of
   164  these datatypes in the near term.
   165  
   166  # Reference-level explanation
   167  
   168  ## Detailed Design: Key encoding
   169  
   170  CockroachDB's inverted index for JSON will encode the full JSON path for every
   171  leaf value in a JSON document, similar to Postgres' GIN index with the
   172  `jsonb_path_ops` modifier.
   173  
   174  Postgres supports two flavors of inverted indexes for JSON `jsonb_ops` and
   175  `jsonb_path_ops`. Roughly, the difference is that `jsonb_ops` collapses the
   176  hierarchy of the JSON document, whereas `jsonb_path_ops` preserves it. The
   177  result is that `jsonb_ops` produces a larger and less precise index more similar
   178  to an ordinary full-text index.
   179  
   180  **`jsonb_ops`**
   181  
   182  The first method, `jsonb_ops`, works by adding every JSON key and value from
   183  the indexed column into the index as a key. It’s essentially no different from
   184  full-text search - keys and values are treated almost identically, and all of
   185  the hierarchical information from the input JSON is discarded. Given the
   186  following JSON:
   187  
   188      {"x": "b", "z": {"a": true, "b": false}}
   189  
   190  the keys produced would be:
   191  
   192      Ka
   193      Kb
   194      Kx
   195      Kz
   196      Vb
   197      Vtrue
   198      Vfalse
   199  
   200  To find the results for the following containment query,
   201  
   202      SELECT * from t WHERE jsonCol @> {"x":"b", "z:"{"a":true}}
   203  
   204  we would have to look up the index key for every key and value in the input to
   205  get a set of primary keys per object, then take the intersection of those sets,
   206  and finally re-check the containment condition for all of the rows in that
   207  intersection to make sure that the structure is correct. This is inefficient in
   208  terms of both time and space. In this example, it’s obvious that a very large
   209  number of rows can have `true` as a value in their JSON column so we would have
   210  to perform a join on a very large data set. The resultant index will also be
   211  quite large on disk, since every individual key and value gets indexed.
   212  
   213  **`jsonb_path_ops`**
   214  
   215  The second type of JSON index that Postgres supports is called
   216  `jsonb_path_ops`. This type of index encodes the full JSON path for every leaf
   217  value into the index key. For the example above, the resultant index keys would
   218  be the following:
   219  
   220      x:b
   221      z:a:true
   222      z:b:false
   223  
   224  This type of index produces significantly fewer keys, only one per leaf from
   225  original JSON, not one per token. Performing lookups on this encoding is very
   226  efficient for full path queries because it’s just one lookup.
   227  
   228  From user feedback we’ve received, users are mostly interested in running
   229  queries supported by the second type. The second type is also more space and
   230  time efficient. Therefore, we propose implementing the second variety of index
   231  initially. Once we implement full text search, we will have the first variety
   232  for free using JSON keys and values as the tokens instead of words.
   233  
   234  **Encoding the index**
   235  
   236  Inverted indexes are traditionally encoded as a “postings list”, where each key
   237  points to a single compound value containing a list of matching primary index
   238  ids. Traditional RDBMS systems can encode this compound value efficiently by
   239  using their internal row id representations in the list. Because CockroachDB
   240  doesn’t have internal row ids, we would have to implement a traditional
   241  postings list by storing a list of all missing subcomponents of matching
   242  primary keys in the RocksDB value.
   243  
   244  There are two disadvantages of doing it this way. Firstly, naive insertion and
   245  deletion would be very inefficient. When modifying the index for a JSON object
   246  we would have to read everything in the KV value, modify it, and then write it
   247  all back, for every single leaf in the JSON. This produces more MVCC garbage on
   248  that set of keys, is more expensive over the wire, and would cause contention
   249  on the set of highly-accessed and frequently written keys. Secondly, since
   250  there’s no bound on the number of rows that might contain a particular path,
   251  one individual key’s value could easily grow past the soft 64 megabyte value
   252  limit for RocksDB. The only way to get around that limit would be to add logic
   253  to split and merge values, adding unnecessary complexity.
   254  
   255  Fortunately, CockroachDB has already solved this problem for non-unique
   256  secondary indexes where we encode missing primary key components as part of
   257  the index key. Inverted indexes can reuse this approach to avoid the
   258  disadvantages of the posting list. We can encode both the full JSON path and
   259  the missing primary key components into the index key.
   260  
   261  For example, a row with primary key `pk` and JSON column `{"a": "b", "c": "d"}`
   262  in table `1` would produce the following keys in an inverted index with id `2`:
   263  
   264      /1/2/a/b/pk
   265      /1/2/c/d/pk
   266  
   267  As an implementation note, Postgres deviates from this idea by storing a hash
   268  of all of values of the path components as its index key to save space. This
   269  space savings comes at the cost of losing a lot of index capabilities: sub-path
   270  matching, top-level-key matching, and range scans of path and leaf values. This
   271  approach isn’t useful for us, since all of our keys are stored in sorted order
   272  and RocksDB does prefix compression on adjacent keys, so we don’t have to
   273  sacrifice those neat capabilities to save space.
   274  
   275  **Physical encoding**
   276  
   277  JSON values contain two collection types:
   278  
   279  - Objects: collections of key/value pairs
   280  - Arrays: Ordered lists of values
   281  
   282  We will encode these two structures into index keys in a very similar way. The
   283  full key schema looks like this:
   284  
   285  `/table id/index id/(json object key | arr)…/json leaf value/pk columns…/`
   286  
   287  where `arr` is a separator indicating that the next key is part of an array.
   288  There’s only one leaf value per index key, so the start of the primary key
   289  columns always follows the leaf value, which is detectable by its encoding. The
   290  value component of an index KV is left unused. In the future, inverted indexes
   291  could be extended by adding stored columns into the value component like
   292  secondary indexes currently allow with `STORING`.
   293  
   294  For example, a row with primary key `pk`, JSON column `{"a": "b", "c": "d"}` and
   295  text column "foo" in table `1` would produce the following key-value pairs in an
   296  inverted index with id `2` that's marked to store the text column:
   297  
   298      /1/2/a/b/pk -> "foo"
   299      /1/2/c/d/pk -> "foo"
   300  
   301  **Key encoding JSON objects**
   302  
   303  Encoding JSON objects into the index key is straightforward. Every key in a
   304  path is written to the index key after the index id and is prefixed with a type
   305  tag. Object keys don’t need ordinary type tags, because they can only be
   306  strings. However, we still need to add some sort of tag to distinguish them
   307  from leaf values, which will need to be encoded with a type tag. We’ll pick
   308  `NOTNULL`, which is not used by the leaf value encoding.
   309  
   310  JSON leaf values have four types, `null`, boolean, string or number. Since we
   311  don’t know what they are when reading the values back from a key, we’ll need to
   312  tag them with types. CockroachDB already uses type tags for its value encoding,
   313  so we can reuse those here.
   314  
   315  For example, if we had a table with id `1` with columns `keyCol` and `jsonCol`
   316  containing the following values:
   317  
   318  | keyCol | jsonCol                                    |
   319  | ------ | ------------------------------------------ |
   320  | `pk1`  | `{"x": "b", "z": {"a": true, "b": false}}` |
   321  | `pk2`  | `{"x": "b", "y": null}`                    |
   322  
   323  and an inverted index on `jsonCol` with index id `2`, we’d encode the following
   324  index keys:
   325  
   326      1/2/x/b/pk1
   327      1/2/x/b/pk2
   328      1/2/y/null/pk2
   329      1/2/z/a/true/pk1
   330      1/2/z/b/false/pk1
   331  
   332  To reduce clutter, we've dropped the type tags from our examples. For
   333  completeness, the final index key above would look like the following with all
   334  type tags included:
   335  
   336      1/2/NOTNULL/z/NOTNULL/b/BOOL/false/pk1
   337  
   338  **Key encoding JSON Arrays**
   339  
   340  JSON arrays are ordered lists of other JSON values. For example:
   341  
   342      ["a", 3, [4, 5, 4], [false, true], {foo:bar}]
   343  
   344  To encode array membership into an index key, the `ARRAY` type tag is reused to
   345  indicate that the subsequent key component is part of an array. With `arr`
   346  representing the `ARRAY` type tag and assuming the primary key is `pk1`, the
   347  example above would produce the following index keys:
   348  
   349  
   350      1/2/arr/3/pk1
   351      1/2/arr/a/pk1
   352      1/2/arr/arr/4/pk1
   353      1/2/arr/arr/5/pk1
   354      1/2/arr/arr/false/pk1
   355      1/2/arr/arr/true/pk1
   356      1/2/arr/foo/bar/pk1
   357  
   358  We don’t actually distinguish arrays with an index or name, just a level of
   359  nesting. To understand why, consider the following query:
   360  
   361      SELECT * from t where jsonCol @> [[false]];
   362  
   363  This should return the JSON above because there is indeed a value `false`
   364  nested two deep in an array. However, if we included the indices of the array
   365  values, we wouldn’t be able to disambiguate between the inner array `[4,5]` and
   366  `[false, true]`. If we just add a prefix for array containment, we don’t have
   367  this problem.
   368  
   369  ## Detailed design: operators, index selection, index management
   370  
   371  **Supported Operators**
   372  
   373  Postgres provides the following indexable operators on JSON:
   374  
   375  | @> | Does the left JSON value contain the right JSON path/value entries at the top level? |
   376  | -- | ------------------------------------------------------------------------------------ |
   377  | ?  | Does the string exist as a top-level key within the JSON value?                      |
   378  | ?& | Do all of these array strings exist as top-level keys?                               |
   379  | ?| | Do any of these array strings exist as top-level keys?                               |
   380  
   381  Note that, unlike Postgres’ `json_path_ops` index, the proposed index will
   382  support all of the operators above, not just the first one. The reason for this
   383  extra support is that the proposed encoding is lossless with respect to the
   384  structure of the input JSON, unlike Postgres’ lossy hash encoding; searching
   385  for a top-level key is as simple as performing a range scan. For example, given
   386  the column value `{a: {c: 3}}` and the resultant index key `.../a/c/3/`,
   387  completing the query `col ? a` merely requires checking to see if any rows are
   388  available in the keyspan `/a -> /b`.
   389  
   390  Even though supporting this feature isn't in scope for initial implementation,
   391  the proposed index key encoding permits range queries and partial path matches
   392  that are not even expressible by Postgres’s `@>` operator semantics. For
   393  example, the query `SELECT * FROM j WHERE col->a->>x > 3`, though inexpressible
   394  via Postgres’ indexable JSON operators, is easily expressed via a scan over
   395  keyspan `/a/3 -> /b`.
   396  
   397  **Index Creation**
   398  The syntax of creating an inverted index on a `JSONB` column type should match
   399  Postgres, for compatibility with existing Postgres applications.
   400  
   401  Therefore, we must support the following syntax even though we don’t support
   402  Postgres’ GIN indexes in general:
   403  
   404      CREATE INDEX <optional name> ON <table> USING GIN(<column> <opt_ops_specifier>)
   405  
   406  where `<opt_ops_specifier>` can be only `jsonb_path_ops`, since that’s the
   407  index mode we’ve chosen to support.
   408  
   409  For ergonomics, we might want to also support a syntax like the following:
   410  
   411      CREATE INVERTED INDEX <optional name> ON <table> (<column> <opt_ops_specifier>)
   412  
   413  A future extension to inverted indexes to permit stored columns could have the
   414  `STORING` syntax that we currently implement for secondary indexes:
   415  
   416      CREATE INVERTED INDEX <optional name> ON <table> (<column> <opt_ops_specifier>) STORING (<cols>)
   417  
   418  The fact that an index is inverted will be stored in its index descriptor,
   419  which will inform the index encoder/decoders and index selection algorithms how
   420  to behave correctly.
   421  
   422  Inverted indexes will be physically backfilled and updated in the same fashion
   423  as ordinary secondary indexes.
   424  
   425  **Index Selection**
   426  
   427  Once we have an inverted index created, when do we choose to use it?
   428  
   429  If the query contains a constraint against an indexed JSON column that uses any
   430  of the above operators, we add the inverted index to the set of index
   431  candidates.
   432  
   433  In general, the priority of indexes during index selection is determined by
   434  sorting the index constraints by *selectivity*, an estimate of how good the
   435  index will be at constraining the number of rows it returns given the
   436  operations that constrain it. At the time of this writing, selectivity is
   437  determined by a heuristic that takes into account the number of constrained
   438  index columns, whether or not the index is covering, and the order of the
   439  index. Notably, there’s no effort made to guess how many rows will be returned
   440  by a given constraint, since we don’t currently collect table statistics.
   441  
   442  Since the inverted index might have a very large number of matching rows per
   443  index key, in general we have to assume that it has fairly low selectivity.
   444  It’s not exactly clear how to weight an inverted index versus an
   445  incompletely-constrained unique index on the same table, since we don’t have
   446  table statistics that inform us of the selectivity of a partial match of the
   447  other index. If a user is performing a containment query on an indexed column,
   448  though, we can probably make the assumption that they want to use the index.
   449  So, we’ll weight a JSON column constraint quite high - but lower than a
   450  fully-constrained unique index.
   451  
   452  To give users recourse in the case where these heuristics cause an undesirable
   453  index to be picked, the explicit index selection operator `table@index` will
   454  also be supported for inverted indexes.
   455  
   456  If we guessed that keys in JSON values roughly follow a uniform distribution,
   457  one might be able to guess the selectivity of a JSON inverted index constraint
   458  by counting the number of keys in the path to search, with the rationale that
   459  the longer the path constraint, the less likely it is that documents contain
   460  that path. However, it’s clear that JSON values aren’t in general uniformly
   461  distributed, since it’s very common to have "hot paths" in a particular JSON
   462  schema, so this estimation doesn’t seem all that useful.
   463  
   464  **Index usage**
   465  
   466  Unlike ordinary secondary indexes, inverted indexes behave differently when
   467  they are constrained multiple times. The idea is that, since the index is
   468  inverted and contains all of the primary keys that contain the value in the
   469  constraint, multiple constraints are implemented by checking each constraint
   470  individually and intersecting the resultant primary keys. For example, the
   471  containment query `col @> '3' AND col @> 'b'` requires 2 index scans: one for
   472  all the rows that contain `3`, and one for all the rows that contain `b`. The
   473  resultant row keys are then intersected to produce the final result.
   474  
   475  `OR` constraints, on the other hand, result in the output sets being unioned as
   476  normal.
   477  
   478  Containment queries that search for objects that contain multiple leaf values
   479  are similar - they can be treated as a conjunction of constraints that each
   480  contain one full path to a leaf value. For example, the query
   481  `col @> '{"animal": "dog", "color": "green"}'` requires 2 index scans: one for
   482  all the rows that contain `{"animal": "dog"}`, and one for all the rows that
   483  contain `{"color": "green"}`. The results are intersected like above.
   484  
   485  Simple array containment queries are similar as well: the query
   486  `col @> '{"a": [1,2]}'` produces 2 index scans, one for `/a/1` and one for
   487  `/a/2/`. The results are intersected as above.
   488  
   489  Queries that contain multiple constraints within a single JSON object or array
   490  past the first level of nesting require special handling: for each result in
   491  the intersection of the index scans, an index join against the primary table
   492  must be performed to re-check the query’s index condition. To understand why,
   493  observe the following examples and adversarial inputs:
   494  
   495  | query                              | adversarial input           |
   496  | ---------------------------------- | ----------------------------|
   497  | `col @> '[{"a": 1, "b": 2}]'`      | `[{"a": 1}, {"b": 2}]`      |
   498  | `col @> '[[1,2]]'`                 | `[[1], [2]]`                |
   499  
   500  
   501  In each of those cases, an intersection of index scans for each of the paths
   502  through the query object will both include rows with the adversarial input, but
   503  the adversarial input does not actually contain the query. To resolve this
   504  issue, each row in the intersection must be rechecked against the index
   505  condition.
   506  
   507  **Match intersection algorithm**
   508  
   509  As mentioned above, the fundamental new operation required for performing a
   510  search on an inverted index is intersecting 1 or more sets of primary keys and
   511  using that as the input to an index scan. This operation will be built into a
   512  new plan node that sits in between a scan node and an index join node to avoid
   513  polluting the pre-existing scan node implementations.
   514  
   515  The inverted index intersection node will handle unioning the output of `OR`
   516  constraints, and intersecting the output of `AND` constraints as described
   517  above. Specifically, this node will retrieve all of the matching primary keys,
   518  deduplicate/intersect/union them as necessary, and use the output to perform an
   519  index join against the primary index.
   520  
   521  Since the primary keys for each inverted index prefix (everything up to the
   522  primary key) are sorted, intersection and union can be performed with a cheap
   523  in-memory merge operation.
   524  
   525  The sorting of these primary key sets could be used in a future implementation
   526  step to transform this naive index join into a merge join, but that won't be
   527  done as part of the initial implementation.
   528  
   529  ## Drawbacks
   530  
   531  Inverted indexes are very large, since they generate an unbounded number of
   532  index keys for every value. Implementing ordinary secondary indexes over
   533  computed columns (which are not yet implemented as of the time of writing)
   534  would produce a smaller index, at the cost of requiring that the customers know
   535  ahead of time which keys they need an index on.
   536  
   537  ## Rationale and Alternatives
   538  
   539  **Rejected JSON Array Encoding**
   540  For key encoding arrays we considered leaving out the array special character
   541  prefix altogether. For the JSON object in the example above, the following keys
   542  would be encoded:
   543  
   544      table/jsonCol/3/pk
   545      table/jsonCol/4/pk
   546      table/jsonCol/5/pk
   547      table/jsonCol/a/pk
   548      table/jsonCol/false/pk
   549      table/jsonCol/foo/bar/pk
   550      table/jsonCol/true/pk
   551  
   552  The advantage of this approach is a shorter key. The disadvantage is that we
   553  would have to do more disambiguation for a given query because the nesting
   554  information would be lost. For example, searching for containment of `[[1]]`
   555  in an index on a table with values `[1,2]` and `[[1,2]]` would generate a
   556  keyspan that selected both rows, since the nesting information was thrown away
   557  on index creation.
   558  
   559  ## Future work
   560  
   561  This RFC leaves the door open for several exciting features:
   562  
   563  1. Inverted indexes on other column types. The key encoding described above can
   564     mostly be reused for other column types, as can the index manipulation and
   565     result set intersection code.
   566  2. Efficient use of JSON inverted indexes for queries that search for key
   567     existence below the top level. More index selection analysis would be
   568     required, but we can support a filter like `WHERE v->a ? b` with the proposed
   569     index structure.
   570  3. Efficient use of JSON inverted indexes for range scans on leaf values. The
   571     `@>` operator doesn’t support this, but a filter like `WHERE v->>a > 10`
   572     could utilize a range scan over the index since it’s sorted.
   573  
   574  ## Unresolved questions
   575  
   576  **Index weighting**
   577  
   578  How exactly should we weight the selectivity of constraints on inverted indexes
   579  versus ordinary indexes?
   580  
   581  For example, the following query includes a partially constrained secondary
   582  index alongside a constrained inverted index.
   583  
   584      CREATE TABLE a (a INT, b INT, c JSON, PRIMARY KEY (a, b));
   585      CREATE INVERTED INDEX ON a(c);
   586  
   587      SELECT * FROM a WHERE a = 3 AND c @> {"foo": "bar"};
   588  
   589  It's not clear which filter has better selectivity, since we don't have table
   590  statistics to inform us of the distribution of rows in `a`.
   591  
   592  **Use cases for path unaware inverted indexes**
   593  
   594  It's not clear to us whether there will be demand for path unaware inverted
   595  indexes on JSON columns.