github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20200421_geospatial.md (about)

     1  - Feature Name: Geospatial
     2  - Status: in-progress
     3  - Start Date: 2020-04-01
     4  - Authors: danhhz otan rytaft sumeerbhola
     5  - RFC PR: [#47762](https://github.com/cockroachdb/cockroach/pull/47762)
     6  - Cockroach Issue: [#19313](https://github.com/cockroachdb/cockroach/issues/19313)
     7  
     8  Table of contents:
     9  
    10  - [Summary](#Summary)
    11  - [Guide-level explanation](#Guide-level-explanation)
    12  - [Reference-level explanation](#Reference-level-explanation)
    13    * [SQL Types & Functions](#SQL-Types--Functions)
    14    * [Indexing](#Indexing)
    15    * [End-to-End Example](#End-to-End-Example)
    16    * [Package Structure](#Package-Structure)
    17    * [Telemetry](#Telemetry)
    18    * [Testing](#Testing)
    19    * [Performance Benchmarking](#Performance-Benchmarking)
    20  - [Unresolved questions](#Unresolved-questions)
    21  
    22  # Summary
    23  
    24  This RFC proposes the addition of geospatial features to CockroachDB, similar to
    25  what is supported by PostGIS. These features are often requested by users. At the time
    26  of writing, support for geospatial features is the most requested feature in
    27  CockroachDB.
    28  
    29  The approach leverages open-source geometry libraries used by PostGIS whenever
    30  possible. For indexing, the approach diverges from PostGIS by dividing the space
    31  into cells of decreasing size, which fits into the totally ordered key-value
    32  model of CockroachDB. Early experiments suggest that this is competitive in
    33  performance with R-trees whilst allowing for horizontal scaling.
    34  
    35  # Guide-level explanation
    36  
    37  Our geospatial implementation will be compatible with PostGIS. There
    38  are a number of SQL geospatial implementations that we could have
    39  modeled ourselves after as well as the OGC SQL standard or we could
    40  have developed our own new interface. We've selected PostGIS
    41  compatibility largely for two reasons:
    42  
    43  1. When polled directly, a large fraction of our users asked for
    44     PostGIS compatibility.
    45  2. We've found time and time again that being a drop in replacement
    46     for Postgres is an important aid to adoption. Given that the
    47     open-source geospatial community has built most of its tooling around
    48     PostGIS, this has the potential to be an even larger impact than
    49     usual.
    50  
    51  Additionally, both PostGIS and SQL Server largely follow the OGC SQL
    52  standard, so this is not really as opinionated a decision as it would
    53  otherwise be. However, notably, where PostGIS behavior diverges from
    54  the OGC SQL standard, we've opted to prioritize the PostGIS behavior.
    55  
    56  As a result, our initial user-facing geospatial footprint will be a
    57  subset of PostGIS's footprint. We'll certainly expand this footprint
    58  over time, but given that some PostGIS features are rarely used
    59  (raster) and others are deprecated, it's likely that we'll never cover
    60  all of it. See below for the exact initial footprint we'll be aiming
    61  for.
    62  
    63  The specs we refer to in this document are:
    64  
    65  * OGC: [v1.1](http://portal.opengeospatial.org/files/?artifact_id=13228)
    66  ** NOTE: For 20.2, we will aim to support v1.1 as PostGIS does
    67  * SQL/MM: [ISO 13249-3](http://jtc1sc32.org/doc/N1101-1150/32N1107-WD13249-3--spatial.pdf)
    68  
    69  Our approach will leverage the open-source geometry libraries used by
    70  PostGIS whenever possible. These libraries are widely used and tested,
    71  and additionally allow us to match PostGIS behavior for geometric
    72  corner cases. For indexing, we will diverge from PostGIS, which uses
    73  R-trees. The totally ordered key-value model used in CockroachDB
    74  indexes allows for horizontal scaling and continuous repartitioning,
    75  and we need to preserve these fundamental characteristics of the
    76  system. We will use a divide-the-space approach that divides space
    77  into cells of decreasing size, and turns the cell space into a linear
    78  ordering using a space-filling curve. Shapes are indexed by a set of
    79  cells using an inverted index.
    80  
    81  # Reference-level explanation
    82  
    83  Geospatial is exposed through two new SQL types: GEOMETRY (planar) and
    84  GEOGRAPHY (sphere/spheroid, which only works on latitude/longitude
    85  SRID systems). SQL Datums can be either of these types, as can SQL table
    86  columns.
    87  
    88  An individual datum of type GEOMETRY or GEOGRAPHY will be one of
    89  several "shapes". The initial set of these we'll support (and is all
    90  that is supported by GEOGRAPHY) is:
    91  
    92  * POINT
    93  * LINESTRING
    94  * POLYGON
    95  * MULTIPOINT
    96  * MULTILINESTRING
    97  * MULTIPOLYGON
    98  * GEOMETRYCOLLECTION
    99  * GEOMETRY ("shape", which is different than the "type", which allows
   100  any of the above types to be referenced - more later)
   101  
   102  In addition, geospatial types and datums can have Spatial Reference
   103  System Identifiers (SRIDs) which specify which projection was applied
   104  to their coordinate values. At a base level, we will look to support
   105  SRID 0 (undefined) and 4326 (vanilla lat/lng geography on a WGS84
   106  spheroid). Supporting SRIDs outside of these two is a stretch task for
   107  the 20.2 release which we will detail in this RFC.
   108  
   109  A number of new SQL builtin functions are added which operate on
   110  GEOMETRY and GEOGRAPHY. Some of these functions can be accelerated by
   111  maintaining a spatial index over a table, which is exposed as a SQL
   112  index. The optimizer will be aware of these indexes and use them in
   113  planning.
   114  
   115  
   116  ## SQL Types & Functions
   117  
   118  ### SQL Types
   119  
   120  We will introduce two new data types to support geospatial data:
   121  
   122  * Geometry - for planar functionality compliant with the OGC spec. For
   123    20.2, we will look at only supporting 2D geometries, but still
   124    support ingesting Z- and M- coordinates. For 20.2, Geometry will be
   125    a wrapper around the
   126    [twpayne/go-geom](https://github.com/twpayne/go-geom) library
   127    which allows us to easily transform the datum for use in
   128    the GEOS/SFCGAL libraries that have the builtins we require (see the
   129    builtins section), whilst also allowing some simple operations to be
   130    done without converting to GEOS/SFCGAL. We will gradually
   131    switch to our own abstraction as we look to support curves/3D
   132    geometries in future versions, which is lacking in twpayne/go-geom.
   133  
   134  * Geography - for a subset of functionality of geometry that does
   135    calculations on a spheroid using geodesic lines, operating only on
   136    lat/lng-based projections (see the "Spatial Reference Identifiers
   137    (SRIDs) and Projections" section). This is not part of the OGC
   138    spec. For 20.2, Geography will mostly wrap around the
   139    [S2 Geometry](https://s2geometry.io/) Region interface and
   140    GeographicLib. The S2 Geometry will give us sphere math, as well as
   141    operations which do not require physical real world
   142    values. GeographicLib will perform necessary spheroid math
   143    operations which require real world values (i.e. in metres for
   144    ST_Area, ST_DWithin, ST_Distance, ST_Perimeter, or in degrees for
   145    ST_Azimuth).
   146  
   147  Note we will not support box2d and box3d PostGIS types and operators,
   148  as they are designed to work with the R-Tree index which we are not
   149  supporting.
   150  
   151  
   152  #### Supported Geometry Types (Shapes)
   153  
   154  For each data type, we will be able to support the "geometry types" as
   155  defined in Section 6.2.6 of the OGC spec. We will internally call
   156  these geometry types "shapes". The full hierachy is from the OGC
   157  spec is as follows.
   158  
   159  ![OGC Shape Hierachy](20200421_geospatial/shape_hierachy.png?raw=true "OGC Spec Hierachy")
   160  
   161  We are planning to support the following:
   162  
   163  * Initially, we will look to support the following shapes that are
   164    applicable to both Geometry and Geography in line with the OGC and
   165    SQL/MM specifications:
   166    * POINT
   167    * LINESTRING,
   168    * POLYGON
   169    * MULTIPOINT
   170    * MULTILINESTRING
   171    * MULTIPOLYGON
   172    * GEOMETRY (wildcard shape; can be any shape except another GEOMETRY) 
   173    * GEOMETRYCOLLECTION (can be a collection of all the MULTI- shapes)
   174    
   175  * In the future, we will look to support the following shapes for the
   176    Geometry data type only, in line with PostGIS support:
   177    * 2.5D (all 2D with Z- and M- coordinates and 3D datatypes)
   178    * 2D:
   179      * CURVE
   180      * MULTICURVE
   181      * TRIANGLE
   182    * 3D: 
   183      * TIN (triangulated irregular network)
   184      * SURFACE
   185      * MULTISURFACE
   186      * POLYHEDRALSURFACE
   187  
   188  
   189  #### OIDs
   190  
   191  The OIDs for these data types do not exist in `lib/pq/oid` as PostGIS
   192  is an extension which only adds this metadata when the extension is
   193  installed in the CLI. We hence have to define our own OIDs for
   194  geospatial data types.
   195  
   196  We will hardcode them in a number range which fits in the range that
   197  can fit within an int4. We should not use an OID range higher than
   198  int4 as this would be incompatible with Postgres.
   199  
   200  
   201  #### Casts
   202  
   203  PostGIS allows Geography types to be implicitly cast into Geometry
   204  types in addition to explicit casts. However, Geometry types can only
   205  be explicitly cast to Geography types.
   206  
   207  We will be able to support the explicit casts easily. However, as
   208  implicit casts are not supported in Cockroach (yet), we may have to
   209  "double define" a few definitions of functions that cast Geography to
   210  Geometry to emulate PostGIS. We will aim to support this as
   211  "pseudo-implicit casts" for commonly used functions that are only
   212  defined for Geometry (e.g. ST_SRID). The user also has the option to
   213  explicitly cast Geography to Geometry before the builtins if we miss
   214  any.
   215  
   216  Casts between other supported types (`text` and `bytea`) will also
   217  be initially supported via explicit casts, with implicit casts or
   218  pseudo-implicit casts mentioned above implemented at a later date.
   219  
   220  
   221  #### Column Definitions
   222  
   223  In line with PostGIS, we will support the following column definition
   224  mechanisms for GEOMETRY:
   225  
   226  * `ADD COLUMN <colname> GEOMETRY` - this allows any kind of geometry
   227    to be added. As no SRID is specified in the column definition, the
   228    geometry by default will not have a SRID (which is defined as SRID =
   229    0).
   230  
   231  * `ADD COLUMN <colname> GEOMETRY(shape)` - this allows only the shape
   232    defined to be stored in this column. However, the GEOMETRY shape
   233    allows any shape to be stored in the column, and the
   234    GEOMETRYCOLLECTION shape allows any MULTI- prefixed shape to be
   235    stored in the column. As above, any SRID is allowed in this column.
   236  
   237  * `ADD COLUMN <colname> GEOMETRY(shape, srid)` - applies both the
   238    shape restriction above, as well as only allowing a specific SRID to
   239    be stored in the column. If an SRID is not defined with the Datum
   240    (i.e. has an SRID of 0), it's SRID will implicitly change to the
   241    SRID in the given column. Note having a column of GEOMETRY(shape, 0)
   242    where the SRID is 0 allows any SRID to be stored.
   243  
   244  We will mirror the same definitions for GEOGRAPHY as we do for
   245  GEOMETRY with one key difference - the default SRID will be 4326
   246  (representing vanilla lat/lng on a WGS84 sphere), instead of leaving
   247  it with an SRID of 0.
   248  
   249  For 20.2, operations such as `ALTER TABLE ... SET DATA TYPE ...` may not be
   250  permitted, but will be extended in later versions.
   251  
   252  
   253  #### Index Creation
   254  
   255  To maintain compatibility with PostGIS and existing tools, the syntax
   256  to create indexes will involve the same syntax as using GiST, which will
   257  initialize the indexes with the S2-backed inverted index underneath.
   258  
   259  This PostGIS-backed index will look like the following:
   260  
   261  ```
   262  CREATE INDEX [indexname] ON [tablename] USING GIST ( [geospatial_column] );
   263  ```
   264  
   265  If we decide to support user-adjustable S2 parameters, we could
   266  support a syntax like the following:
   267  
   268  ```
   269  CREATE INVERTED INDEX [indexname] ON [tablename] ( [geospatial_column] ) WITH COVERING (min_level = 0, max_level = 30, ….)
   270  ```
   271  
   272  For 2D planar geometry there will be an axis-aligned rectangular bound
   273  that all shapes that want index acceleration should fit in. The bound
   274  will be specified at index creation time. We could adopt a syntax
   275  similar to SQL Server, which also requires a rectangular [bounding
   276  box](https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview?view=sql-server-ver15#the-bounding-box).
   277  Shapes can exceed this bounding box, but those shapes will not benefit
   278  from index acceleration. To support this syntax, we will introduce a
   279  WITH:
   280  
   281  ``` 
   282  CREATE INVERTED INDEX [indexname] ON [tablename] ( [geospatial_column] ) WITH BOUNDING BOX (x_min, y_min, x_max, y_max)
   283  ```
   284  
   285  The default bounding box for 2D geometry will be decided after
   286  consultation with users.
   287  
   288  #### Disk Serialization
   289  
   290  For Geometry and Geography types, we will serialize the data as
   291  a protobuf that contains the EWKB. This allows us flexibility in
   292  changing libraries later, as well as encoding the necessary type
   293  and shape metadata. The protobuf will appear as follows:
   294  
   295  ```
   296  message SpatialObject {
   297    // EWKB is always stored in littleEndian order.
   298    bytes ewkb = 1;
   299    // SRID is denormalized from the EWKB.
   300    int32 srid = 2;
   301    // Shape is the Shape that is stored in the database.
   302    Shape shape = 3;
   303    // BBox bounding box for optimizing certain operations
   304    // depending on performance benchmarks.
   305    BoundingBox bbox = 4;
   306  }
   307  
   308  enum Shape {
   309    GEOMETRY = 1;
   310    POINT = 2;
   311    // ... etc.
   312  };
   313  
   314  message BoundingBox {
   315    float64 min_x = 1;
   316    float64 min_y = 2;
   317    float64 max_x = 3;
   318    float64 max_y = 4;
   319  }
   320  ```
   321  
   322  It is worth noting we could serialize Geography types with the S2
   323  library. The S2 library would be fastest to transform serialization
   324  and deserialization into S2 types (compared to EWKB). However, it is
   325  missing important metadata such as 2.5D/3D (Z-) and M- coordinates, as
   326  well as original shape and SRID metadata. There is a "Compressed"
   327  serialization option for S2 objects which further improves disk usage,
   328  however it is not yet ported to Go.
   329  
   330  We will not support indexing geospatial types in default primary keys
   331  and unique secondary indexes (as we do not for JSONB and ARRAYs today). This is
   332  because we will not be able to match the PostGIS definition as it's based
   333  on a hash of its internal data structure, which means we will
   334  not be able to be a "drop-in" replacement here. If we could decide
   335  on our own ordering, it will be based on the raw EWKB byte ordering.
   336  This may change subject to further user consultation before the upcoming
   337  v20.2 release.
   338  
   339  We will however support using inverted indexes for geospatial indexing
   340  -- see the indexing section for details.
   341  
   342  #### Display
   343  
   344  In line with PostGIS, we will display values from SELECT as EWKB-hex
   345  strings.
   346  
   347  
   348  ### Builtins
   349  
   350  PostGIS supported functionality is defined in [Section 14.11 of their
   351  documentation](https://postgis.net/docs/PostGIS_Special_Functions_Index.html#PostGIS_TypeFunctionMatrix),
   352  which is a superset of those available in section 7 of the OGC
   353  spec. For 20.2, we will aim to support all the 2D geometry (except for
   354  CURVE) and geography functionality. The priority will be for functions
   355  available for use for indexing and functions already implemented by
   356  GEOS/S2/GeographicLib.
   357  
   358  Most operations involving multiple Geometries and/or Geographies can
   359  only succeed if the SRIDs between the two data types are the same.
   360  
   361  
   362  #### Geometry (2D)
   363  
   364  PostGIS mainly defers difficult 2D geometry functionality to the
   365  [GEOS](https://trac.osgeo.org/geos/) library - however, some operations
   366  are done in PostGIS itself if it is not supported in GEOS.
   367  
   368  We will aim to use CGO to interface with the GEOS library in
   369  20.2. This is the most straightforward and quick way to set up as the
   370  GEOS library specifically targets implementing the OGC spec.  This
   371  provides us with coverage for many of the mathematical predicates
   372  required for 2D geometry and lets us build a lot of functionality
   373  quickly. We will eat the CGO overhead to be able to ship quickly.
   374  
   375  We will use 
   376  [dlopen](http://man7.org/linux/man-pages/man3/dlopen.3.html) and
   377  [dlsym](https://pubs.opengroup.org/onlinepubs/009695399/functions/dlsym.html) to
   378  interface with the GEOS library. This prevents us from statically
   379  linking the GEOS library - instead, a CRDB user must install GEOS into
   380  their environment. As a corollary of this limitation:
   381  
   382  * We will initially not be supporting certain 2D Geometry features on
   383    CRDB on Windows for the 20.2 (we can decide to use the
   384    [LoadLibrary](https://docs.microsoft.com/en-gb/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibrarya?redirectedfrom=MSDN)
   385    function in the future).
   386  
   387  * Users will have to specify the path of their GEOS install when
   388    running a cockroach binary. This should be set as an array of paths
   389    which points to the location where the C bindings may be installed,
   390    with a sensible default that covers most operating system default
   391    locations. This is a potential source of confusion, which should be
   392    well documented.
   393    * We should include the GEOS library in the docker containers we ship.
   394    * We will include GEOS in the tarball and include instructions
   395      from the installation page to copy configure GEOS if Geospatial
   396      operations are required. This is similar to PostGIS where users
   397      would require extra steps by installing the PostGIS extension
   398      dependencies to perform Geospatial operations.
   399  
   400  Example changed installation instructions:
   401  
   402  ```sh
   403  $ wget -qO- https://binaries.cockroachdb.com/cockroach-v20.2.0.linux-amd64.tgz | tar  xvz
   404  $ cp -i cockroach-v20.2.0.linux-amd64/cockroach /usr/local/bin/
   405  # Optional step.
   406  $ cp -i cockroach-v20.2.0.linux-amd64/libgeos_c.so /usr/local/lib/
   407  ```
   408  
   409  If a user attempts to use a geospatial function without having the library installed,
   410  they should see an error similar to the following:
   411  
   412  ```sql
   413  $ SELECT ST_Area('POINT(1.0 1.0)'::geometry)
   414  ERROR: geospatial functions are only supported if the GEOS module is installed
   415  HINT: See http://cockroachdb.com/link/to/installation/instructions
   416  ```
   417  
   418  If a user loads geospatial data then later does not have the module installed,
   419  the above error will still appear.
   420  
   421  
   422  #### Geometry (Curves)
   423  
   424  Curves are implemented natively in PostGIS, with an [approximation of
   425  ~32 points per
   426  curve](https://github.com/postgis/postgis/blob/master/liblwgeom/lwgeom_geos.c#L412) 
   427  done when input into the GEOS library. We can look to use some of the
   428  same math after v20.2.
   429  
   430  
   431  #### Geometry (2.5D)
   432  
   433  These are implemented natively in PostGIS with some GEOS support. We
   434  can look at doing this after v20.2.
   435  
   436  
   437  #### Geometry (3D)
   438  
   439  PostGIS uses [SFCGAL](http://www.sfcgal.org/) to compute 3D
   440  geometry. We will employ similar CGO techniques described in Geometry
   441  2D, with the same downsides.
   442  
   443  
   444  #### Geography
   445  
   446  In PostGIS, spherical calculations are done within PostGIS, whilst
   447  values requiring spheroid values (e.g. Distance, Area) are done using
   448  [GeographicLib](https://geographiclib.sourceforge.io/).
   449  
   450  Most spherical calculations are supported by the S2 Geometry library
   451  in C++, most of which are ported over to Golang. Combining the S2
   452  Geometry library with using CGO/GeographicLib (which does not require
   453  dlopen/dlsym as we can link it directly in our binaries), we will be
   454  able to support all relevant Geography functions.
   455  
   456  
   457  ##### Default Args for Geography functions
   458  
   459  Most geography functions that can optionally operate on a spheroid
   460  (e.g. [ST_Area](https://postgis.net/docs/ST_Area.html)) have a
   461  default argument of "use_spheroid = True". We do not support default
   462  arguments in Cockroach. As such, we will have to define two functions
   463  for each of these - one taking no bool and one taking a bool
   464  value. This will add an extra "row" for this builtin in our docs
   465  compared to PostGIS.
   466  
   467  Users and tools specifying explicitly default args (e.g. `SELECT
   468  ST_Area(geom, use_spheroid := false)`) will be unable to do so at v20.2
   469  unless we implement the default args feature before then.
   470  
   471  
   472  #### Comparison Ops
   473  
   474  Direct comparison using the `>`/`>=`/`=`/`<`/`<=`/`!=` comparators in
   475  PostGIS involve hashing the internal structure of the PostGIS
   476  object. Comparators (besides =, which should probably use ST_Equal
   477  anyway) are not expected to be commonly used with geospatial data
   478  types.
   479  
   480  As such, we will define our own comparator operations to use raw byte
   481  EWKB comparisons, which will be incompatible with PostGIS for base
   482  comparator operators only.
   483  
   484  
   485  #### bbox Ops
   486  
   487  PostGIS has a set of [bounding box
   488  operators](https://postgis.net/docs/reference.html#idm9874) that are
   489  not part of the OGC or SQL/MM specs. As mentioned above, our different
   490  indexing approach prevents these from being index
   491  accelerated. Therefore we have no plans to support these in the
   492  initial implementation.
   493  
   494  
   495  #### Distance Operators
   496  
   497  PostGIS has a set of [distance
   498  operators](https://postgis.net/docs/reference.html#idm10964) that are
   499  useful for accelerating k-nearest-neighbor search as outlined in
   500  [https://postgis.net/workshops/postgis-intro/knn.html](https://postgis.net/workshops/postgis-intro/knn.html). We
   501  plan to add support for k-nearest-neighbor in later versions, and may
   502  support a subset of these distance operators.
   503  
   504  
   505  #### Index Usage
   506  
   507  Certain function builtins in PostGIS can utilize an index which we
   508  will match with some optimizer work around the indexing (see Indexing
   509  section).
   510  
   511  For 2D geometry and geography, these are:
   512  
   513  * ST_Covers
   514  * ST_CoveredBy
   515  * ST_Contains (geometry only)
   516  * ST_ContainsProperly (geometry only)
   517  * ST_Crosses (geometry only)
   518  * ST_DFullyWithin (geometry only)
   519  * ST_DWithin
   520  * ST_Equals (geometry only)
   521  * ST_Intersects
   522  * ST_Overlaps (geometry only)
   523  * ST_Touches (geometry only)
   524  * ST_Within (geometry only)
   525  
   526  These functions have an equivalent with an `_` prefix
   527  (e.g.. _ST_Within) that avoids using the indexes. We will similarly
   528  have such functions, in which the optimizer will know not to use the
   529  indexes when doing these operations.
   530  
   531  3D geometry support includes 2D and 1D shapes embedded in 3 coordinate
   532  dimensions, and some support for 3D shapes (solids). This uses the
   533  SFCGAL library. We will not support 3D coordinate dimensions since we
   534  do not yet have a comprehensive solution for indexing 3D space. For
   535  reference, the 3D functions that need such indexing support are:
   536  
   537  * ST_3DDFullyWithin
   538  * ST_3DDWithin
   539  * ST_3DIntersects
   540  
   541  
   542  #### Builtin Result Caching
   543  
   544  In PostGIS, certain builtin operations are cached, using the hashed data
   545  from the shape as keys. This is especially useful for ST_Distance/ST_DWithin
   546  for repeated use of these operations. We may similarly look into
   547  caching the results of certain operations, but this is out of scope for
   548  v20.2. If we decide to go along this route, a future RFC will be
   549  published.
   550  
   551  #### Documentation
   552  
   553  PostGIS has documentation for each of the builtins they support, which
   554  gives details such as which libraries are used, common gotchas and
   555  links to other useful information (click on any builtin as defined in
   556  [Section
   557  14.11](https://postgis.net/docs/PostGIS_Special_Functions_Index.html#PostGIS_TypeFunctionMatrix)). We
   558  should be similarly detailed and rigorous with our
   559  explanations. However, we will need to build the necessary framework
   560  to support referencing (valid) docs within our builtins for this to
   561  work.
   562  
   563  
   564  ### External Formats
   565  
   566  We will support the following external formats (and the associated
   567  builtins for parsing and encoding them):
   568  
   569  * Well Known Text (WKT) and Well Known Bytes (WKB), as defined in the
   570    SQL/MM Section 5.1.45 and 5.1.46. These are all supported by
   571    twpayne/geom library, except for WKT decoding. For the WKT decoding,
   572    we can use the GEOS library (which blocks Geography adoption for
   573    non-GEOS installs) in the interim, but as GEOS library does not
   574    support curves/3D types, we may wish to write our own parser in the
   575    future.
   576  
   577  * Extended Well Known Text (EWKT) and Extended Well Known Bytes
   578    (EWKB), as defined by PostGIS for cross-compatibility. This contains
   579    an extra reserved bit for specifying whether an SRID is included, as
   580    well as extra bytes used for encoding the SRID itself. These are all
   581    supported by twpayne/geom library.
   582  
   583  * GeoJSON, as defined by
   584    [RFC7946](https://tools.ietf.org/html/rfc7946). These are all
   585    supported by twpayne/go-geom libraries.
   586  
   587  
   588  ### Spatial Reference Identifiers (SRIDs) and Projections
   589  
   590  A SRID is a number that identifies a spatial reference system. A
   591  spatial reference system provides meaning to the point coordinates
   592  used in the representation of shapes. When the geometry type is
   593  used to represent the Earth, the SRID represents a projection on
   594  a plane, for example the
   595  [web mercator](https://en.wikipedia.org/wiki/Web_Mercator_projection)
   596  projection is represented using SRID 3857. The geography type is not
   597  projected, and the default is SRID 4326, which uses lat/lng
   598  coordinates. 
   599  
   600  SRID support is heavily inbuilt into the SQL type system (see SQL
   601  Types/Column Definitions).
   602  
   603  
   604  #### spatial_ref_sys
   605  
   606  Upon loading the PostGIS extension, a spatial_ref_sys table is
   607  imported into the public schema of the given database, with a default
   608  set of SRIDs pre-loaded into the database. This table exists in line
   609  with the OGC Spec, with the addition of a `proj4text` column which
   610  defines the WKT projection which has been transformed into something
   611  the [PROJ](https://proj.org/) library can use. This table is similarly
   612  insertable and deletable. Any Postgres user can modify this table.
   613  
   614  PostGIS INSERT statements for custom defined SRIDs are available from
   615  entries on the[ Spatial Reference
   616  website](https://spatialreference.org/). If users are defining custom
   617  SRIDs with WKT but do not have access to proj4text, we can provide a
   618  builtin CGO wrapper around the [GDAL](https://gdal.org/) library
   619  (X/MIT) which can perform this translation - but there are external
   620  cli tools already available to perform this operation.
   621  
   622  We need to be able to support some notion of a spatial_ref_sys table,
   623  which allows management of user-defined insertable spatial_ref_sys
   624  data.
   625  
   626  For v20.2, we will not be looking to support custom SRIDs, and as such
   627  SRIDs will be a hardcoded map and viewable from a `pg_extension`
   628  virtual schema, discussed below in the "`pg_extension` schema" section.
   629  
   630  In future versions, we can look at supporting user-defined
   631  `spatial_ref_sys` schemas by allowing the user to "copy" the
   632  `spatial_ref_sys` table into the public schema and using that
   633  as the source of truth, analogous to PostGIS. This will appear
   634  as the following:
   635  
   636  ```
   637  $ CREATE EXTENSION 'postgis';
   638  -- copy over spatial_ref_sys table, maybe even geometry_column/geography_column views
   639  $ SELECT * FROM spatial_ref_sys
   640  -- .... results copied from default spatial_ref_sys
   641  $ INSERT INTO geom_table VALUES ('SRID=9999;POINT(1.0 1.0)')
   642  -- lookup is based on what is available on the public schema as opposed
   643  -- to a hardcoded map.
   644  ```
   645  
   646  #### geometry_columns / geography_columns
   647  
   648  The `geometry_columns` and `geography_columns` views in PostGIS are a
   649  wrapper around pg_catalog which allow the user to view which tables
   650  currently involve geospatial data. Due to PostGIS's extension nature,
   651  they are available on the public schema upon extension registration.
   652  
   653  This will also be available on the `pg_extension` schema
   654  mentioned below.
   655  
   656  ##### `pg_extension` schema
   657  
   658  For cross compatibility with PostGIS and its nature as an extension
   659  which injects tables into the public schema, we need this table to be
   660  resolvable from anywhere by simply requesting "geometry_columns",
   661  "geography_columns" or "spatial_ref_sys".
   662  
   663  For v20.2, this will be available on a virtual schema called `pg_extension`.
   664  This will be on the search_path similar to how pg_catalog is today. This
   665  will differentiate these views from the default virtual schema views 
   666  available on other tables.
   667  
   668  This schema should also be resolvable from any search path, and
   669  thus available from any catalog. Any custom defined schemas should not
   670  be named `pg_extension`.
   671  
   672  
   673  #### spatial_ref_sys lookup caching
   674  
   675  New column definitions (e.g. `ADD COLUMN &lt;colname> geography(point,
   676  4326)`) as well as importing new geospatial datums with SRIDs defined
   677  (e.g. `ST_FromEWKT('SRID=4326;POINT(1.0 2.0)')`) require a lookup of
   678  whether the SRID is valid. It does not need to apply the
   679  transformations, as it is expected that the input is already defined
   680  as the given SRID.
   681  
   682  To speed up having to look at the spatial_ref_sys for valid SRIDs, we
   683  will need to cache successful SRID lookups. We can keep an
   684  (potentially bounded) in-memory mapping of valid SRIDs with a
   685  "timeout" to assist with the caching. This will be stale upon deletion
   686  by an admin user, which the timeout will look to resolve.
   687  
   688  
   689  #### ST_Transform
   690  
   691  ST_Transform is the builtin that projects a given geospatial datum
   692  from one SRID projection to another. This can be done as a CGO wrapper
   693  around the [PROJ](https://proj.org/) library by transforming from one
   694  proj4text to another. The library has a lot of extra additions that
   695  may be unnecessary for our use case (e.g. requiring the installation of
   696  sqlite3 for storing metadata from certain tools we don't use) and as
   697  such we may decide to also ship this as a library which requires
   698  dlopen/dlsym.
   699  
   700  It is worth noting there is a Go port of the PROJ library named
   701  [go-spatial/proj](https://github.com/go-spatial/proj), but is missing
   702  some essential functionality such as the ability to define custom
   703  proj4 transformations.
   704  
   705  
   706  ### Other Supported Operations
   707  
   708  
   709  #### AddGeomColumn
   710  
   711  This is a builtin provided by PostGIS that adds a new Geometry column
   712  to the database given a set of parameters. This is used a lot by tools
   713  that import geometry data (at least from tutorials), so we should look
   714  to support the same builtin.
   715  
   716  The AddGeomColumn primitive also adds a couple of CHECK expressions
   717  that ensures that the SRIDs/shape in the column match the datums,
   718  however, this is expected to work even when using the vanilla `ADD
   719  COLUMN geometry(shape, srid)` syntax. This is probably a relic of old
   720  PostGIS, and we will look to add the same CHECK expressions.
   721  
   722  
   723  ### BACKUP/RESTORE/cockroach dump
   724  
   725  Backup/Restore/cockroach dump operations will continue to "work out of
   726  the box". We will furthermore be able to ingest postgres dumps of
   727  PostGIS data as we support I/O of EWKT-as-hex format when parsing and
   728  displaying geospatial datums.
   729  
   730  
   731  ### CHANGEFEED
   732  
   733  We will use EWKB for CHANGEFEED results for Geography and Geometry
   734  types. This may be out of line with the OGC spec, but follows the
   735  principle that we will prioritize PostGIS behaviour.
   736  
   737  
   738  ### Out of Scope for v20.2
   739  
   740  The following has been mentioned as out of scope:
   741  * 2.5D, 3D shapes
   742  * PRIMARY KEY definitions for Geometry/Geography types
   743  * Default Arguments for Builtins
   744  * Caching certain builtin operations
   745  * User defined SRIDs
   746  
   747  Furthermore, these will be out of scope:
   748  * Supporting the default geometric data types in
   749    [PostgreSQL](https://www.postgresql.org/docs/current/datatype-geometric.html).
   750  
   751  
   752  ## Indexing
   753  
   754  
   755  ### Background
   756  
   757  The current approaches to geospatial indexes divide cleanly into two
   758  buckets. One approach is to "divide the objects". This works by
   759  inserting the objects into a tree (usually a balanced tree) whose
   760  shape depends on the data being indexed. The other approach is to
   761  "divide the space". This works by creating a decomposition of the
   762  space being indexed into buckets of various sizes.
   763  
   764  When an object is indexed, a "covering" shape (e.g. a bounding box) is
   765  constructed that completely encompasses the indexed object. Index
   766  queries work by looking for containment or intersection between the
   767  covering shape for the query object and the indexed covering
   768  shapes. This retrieves false positives but no false negatives.
   769  
   770  
   771  #### Divide the Objects
   772  
   773  PostGIS is the notable implementation of divide the objects. It
   774  maintains an "R tree" (rectangle tree) which is implemented as a
   775  Postgres "GiST" index. The GiST index is a generalization of data
   776  types that can be naturally ordered into a hierarchy of supersets
   777  (e.g.B+ trees, R trees, etc).
   778  
   779  The covering shape used by PostGIS is a "bounding box" which is the
   780  minimal rectangle that encompasses the indexed shape. The same is done
   781  for the query shape.
   782  
   783  Lucene uses [BKD
   784  trees](https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf)
   785  with a triangle tessellation of shapes. BKD trees permit a
   786  log-structured multi file approach, but compactions would need to read
   787  all the input files into memory to redivide the objects. More
   788  importantly, they are not easily compatible with the normal horizontal
   789  scaling approach of splitting and merging lexicographic ranges.
   790  
   791  Advantages of Divide the Objects:
   792  
   793  * Each object is present in the index once
   794  * Can index 3D GEOMETRY space
   795  * Can index GEOGRAPHY with altitude
   796  * Can index infinite GEOMETRY space
   797  
   798  Disadvantages of Divide the Objects:
   799  
   800  * The tree needs a periodic balancing operation to have predictable
   801    read latencies
   802  * Object insertions and balancing require locking
   803  * How to effectively horizontally distribute an R tree is an [open
   804    question](https://www.anand-iyer.com/papers/sift-socc2017.pdf).
   805  * Bulk ingest (IMPORT) requires coordination between nodes as the
   806    tree's shape depends on its contents
   807  * A bounding box can return many false positives
   808  
   809  #### Divide the Space
   810  
   811  Recent geospatial index implementations tend to prefer to divide the
   812  space, because being able to horizontally distribute data is
   813  increasingly important. The relevant implementations here are
   814  Microsoft SQL Server and MongoDB.
   815  
   816  The space is divided into a quadtree (or a set of quadtrees) with a
   817  set number of levels and a data-independent shape. Each node in the
   818  quad tree (a "cell") represents some part of the indexed space and is
   819  divided once horizontally and once vertically to produce 4 children in
   820  the next level. Each node in the quadtree is "content addressable",
   821  meaning that mapping from the node to its unique ID and back is
   822  possible without external information.
   823  
   824  Implementations tend to use clever strategies for the unique IDs with
   825  important guarantees:
   826  
   827  * The IDs of all ancestors of a cell are enumerable
   828  * The IDs of all descendants of a cell is a range query
   829  * The cells of nearby IDs are spatially near
   830  
   831  MongoDB uses the [S2 library](https://s2geometry.io/) from Google for
   832  Geography and their own implementation for 2D planar geometry. The
   833  latter uses a
   834  [geohash](http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves)
   835  based numbering of cells which has worse locality than the Hilbert
   836  curve used by S2. SQL Server uses an implementation of their own
   837  devising.
   838  
   839  When indexing an object, a covering is computed, often using some
   840  number of the predefined cells. The same is done for the query object
   841  when using the index. Ancestors and descendants can be retrieved by
   842  using the ID properties above.
   843  
   844  The number of covering cells can vary per indexed object. There is an
   845  important tradeoff in the number of cells used to represent an object
   846  in the index: fewer cells use less space but create a looser
   847  covering. A looser covering retrieves more false positives from the
   848  index, which is expensive because the exact answer computation that's
   849  run after the index query is expensive. However, at some point the
   850  benefits of retrieving fewer false positives is outweighed by how long
   851  it takes to scan a large index.
   852  
   853  Because the space is divided beforehand, it must be finite. This means
   854  that Divide the Space works for (spherical) GEOGRAPHY and for finite
   855  GEOMETRY (planar) but not for infinite GEOMETRY. SQL Server requires
   856  that finite bounds be declared when creating a GEOMETRY index.
   857  
   858  Current divide the space approaches (both S2 and SQL Server) assume
   859  flat space, so cannot be used for indexing 3D GEOMETRY or altitude
   860  GEOGRAPHY. Conceptually, the approach is extendable to 3D, but would
   861  require our own implementation, and the performance characteristics of
   862  indexing a larger space in this manner are unknown.
   863  
   864  Advantages of Divide the Space:
   865  
   866  * Easy to scale horizontally
   867  * No balancing
   868  * Inserts require no locking
   869  * Bulk ingest is simple
   870  * Allow a per-object tradeoff between index size and false positives
   871  
   872  Disadvantages of Divide the Space:
   873  
   874  * Do not easily support indexing infinite GEOMETRY (but we discuss a
   875    way to extend to infinite geometries in future work).
   876  * Performance characteristics of indexing 3D GEOMETRY space and
   877    indexing altitude GEOGRAPHY are unknown.
   878  
   879  
   880  ### Approach
   881  
   882  For geography we will use the S2 library that models a sphere using a
   883  quad-tree divide-the-space approach and numbers the cells using a
   884  Hilbert Curve. The cell ids are 64 bit integers and the leaf cells
   885  measure 1cm across on the Earth’s surface. For 2D planar geometry, we
   886  will initially reuse the S2 library by mapping the rectangular bounds
   887  of the index to one face of the unit cube in S2, and then mapping it
   888  to the unit sphere. This distorts the original shape, but index
   889  lookups (discussed in more detail below) look for cell overlaps, and
   890  overlaps are preserved by this distortion.
   891  
   892  In future versions we may write our own divide-the-space library for
   893  2D planar geometry that avoids any distortions, and would permit
   894  removal of the rectangular index bounds (see future work section).
   895  
   896  
   897  ### Index storage
   898  
   899  An index is configured with parameters that control aspects of the
   900  covering produced for a shape: the number of cells, the lowest and
   901  highest levels of the cells. The cell-ids are represented as
   902  uint64. The stored index is an inverted index consisting of this
   903  cell-id column followed by the primary key columns in the indexed
   904  table. Note that each cell-id can be used in the covering for multiple
   905  table rows, and each table row can have multiple cell-ids in the
   906  covering, that is, it is a many-to-many relationship.
   907  
   908  The following picture shows a nyc census block in purple with a 4 cell
   909  covering in teal. Note that the cells are of different sizes. The
   910  cell-id for one of the cells in the covering is also shown. 4 cells
   911  were used in this case since that was the maximum number of cells
   912  configured in the index configuration, stored in the IndexDescriptor.
   913  
   914  ![Census Block Covering](20200421_geospatial/census_block1.png?raw=true "Census Block Covering")
   915  
   916  This on-disk representation of the inverted index splits the “posting
   917  list” for each cell-id, where the list is the table rows that are
   918  indexed by that cell-id, into one index row per list element. This is
   919  simple but not very efficient for reads, since each index row is
   920  potentially small and queries always need to read all the rows for a
   921  cell-id. However, it is convenient for additions and deletions, since
   922  a single row needs to be modified. Preliminary experiments indicate
   923  that the simple approach can be competitive in performance with
   924  PostGIS, so we will adopt this simple approach until otherwise
   925  necessary.
   926  
   927  
   928  ### Index queries
   929  
   930  We first consider how indexes are used to accelerate filtering and
   931  then extend it to join queries.
   932  
   933  
   934  #### Filtering
   935  
   936  Consider a boolean function op(g, x) where g is a given shape, and x
   937  is an indexed shape for which this function is true. The index is used
   938  to find the set of all x for which this function could return
   939  true. All functions are mapped to the following for index
   940  acceleration:
   941  
   942  * `contains(g, x)`: Every point in shape x is also in g, i.e., g
   943    contains x.
   944  * `contained-by(g, x)`: This is equivalent to contains(x, g), but is
   945    considered separately since x represents the shapes that are
   946    indexed.
   947  * `intersects(g, x)` or `intersects(x, g)`: g and x have at least 1 point
   948    that is in both.
   949  
   950  Consider the covering(s) for any shape s to be the set of cellids that
   951  cover the shape s. PostGIS uses bounding boxes instead of cell
   952  coverings. Bounding boxes have the following property:
   953  
   954  * contains(s1, s2) => contains(bounding-box(s1), bounding-box(s2))
   955  * intersects(s1, s2) => intersects(bounding-box(s1), bounding-box(s2))
   956  
   957  We assume the same property for cell coverings:
   958  
   959  * contains(s1, s2) => contains(covering(s1), covering(s2))
   960  * intersects(s1, s2) => intersects(covering(s1), covering(s2))
   961  
   962  The first property is not always true and we will discuss how to
   963  adjust to that reality in a later section.
   964  
   965  The following abstract example is used to illustrate the filtering
   966  algorithm, where c[i] is a cell number. Consider g has the cell
   967  covering c[213], c[61], c[64] in a quad-tree rooted at c[0]. For
   968  convenience of illustration, the numbering is not the Hilbert curve
   969  numbering used by S2. In the numbering here, the children of cell c[i]
   970  are numbered c[4*i+1]...c[4*i+4]. The following depicts this covering
   971  as a tree with the leaf cells being the covering cells. Note that the
   972  paths from the leaves to the root are not the same length.
   973  
   974  
   975  ```
   976                       c[0]
   977                        |
   978                       c[3]
   979                        |
   980                    +---+-----+
   981                    |         |
   982                  c[13]     c[15]
   983                    |         |
   984                  c[53]    +--+--+
   985                    |      |     |
   986                 C[213]  c[61] c[64]
   987  
   988  ```
   989  
   990  
   991  
   992  * contains(g, x): Using the covering property from earlier, all shapes
   993    contained by g must have coverings contained by {c[213], c[61],
   994    c[64]}. In the full quad-tree (not the partial one depicted above),
   995    these are all the shapes indexed in the subtrees rooted at c[213],
   996    c[61], c[64]. Due to the locality of the Hilbert curve numbering,
   997    each subtree is a single contiguous range of integers. The indexed
   998    shapes that could satisfy this function are:
   999  
  1000    ```
  1001      (⋃ \for-all c in subtree-range(c[213]) index(c)) ⋃
  1002      (⋃ \for-all c in subtree-range(c[61]) index(c)) ⋃
  1003      (⋃ \for-all c in subtree-range(c[64]) index(c))
  1004    ```  
  1005  
  1006      There can be duplicate shapes both within a subtree and across
  1007      subtrees. For example, a shape could be indexed under two children
  1008      of c[61], say c[245] and c[247]. The scan of the subtree rooted at
  1009      c[61] will find it twice.
  1010  
  1011  * contained-by(g, x): Using the covering property from earlier, all
  1012    shapes containing g must have coverings that contain c[213] and
  1013    contain c[61] and contain c[64]. The indexed shapes that could
  1014    satisfy this function are
  1015  
  1016    ```
  1017      (index(c[213]) ⋃ index(c[53]) ⋃ index(c[13]) ⋃ index(c[3]) ⋃ index(c[0])) ⋂
  1018      (index(c61) ⋃ index(c15) ⋃ index(c3) ⋃ index(c0)) ⋂
  1019      (index(c64) ⋃ index(c15) ⋃ index(c3) ⋃ index(c0))
  1020    ```
  1021    
  1022    One can factor out common sub-expressions.
  1023  
  1024  * intersects(g, x) (or (x, g)): This needs to retrieve the shapes that
  1025    satisfy contains(g, x) and shapes indexed using the ancestors of
  1026    c213, c61, c64. That is,
  1027  
  1028    ```
  1029      contains(g, x) ⋃ index(c0) ⋃ index(c3) ⋃ index(c13) ⋃ index(c53) ⋃ index(c15)
  1030    ```
  1031  
  1032  
  1033  #### Mapping to the index functions
  1034  
  1035  Functions map to the index functions:
  1036  
  1037  * ST_Covers(g, x), ST_Covers(x, g): use contains(g, x) or
  1038    contained-by(g, x)
  1039  * ST_CoveredBy(g, x), ST_CoveredBy(x, g): use contained-by(g, x) or
  1040    contains(g, x)
  1041  * ST_Contains(g, x), ST_Contains(x, g): use contains(g, x) or
  1042    contained-by(g, x)
  1043  * ST_ContainsProperly(g, x), ST_ContainsProperly(x, g):  use contains(g, x) or
  1044    contained-by(g, x)
  1045  * ST_Crosses: use intersects
  1046  * ST_DFullyWithin(g, x, d), ST_DFullyWithin(x, g, d): extend g by
  1047    distance d to produce a shape g’, and then use contains(g', x). The
  1048    S2 library has an S2ShapeIndexBufferedRegion class that can be used
  1049    for extending by distance d for the Geography type. This is not part
  1050    of the S2 Go library, so we will need to port it from C++. See a
  1051    later section for the Geometry type.
  1052  * ST_DWithin(g, x, d), ST_DWithin(x, g, d): extend g by distance to
  1053    produce a shape g’, and then use intersects.
  1054  * ST_Equals: use intersects (we may be able to do better later) 
  1055  * ST_Intersects: use intersects
  1056  * ST_Overlaps: use intersects
  1057  * ST_Touches: use intersects
  1058  * ST_Within(g, x), ST_Within(x, g): use contained-by(g, x) or contains(g, x)
  1059  
  1060  
  1061  #### Joins
  1062  
  1063  Queries can join between tables with geospatial indexes such as
  1064  
  1065  ```
  1066  SELECT blocks.blkid
  1067  FROM nyc_census_blocks blocks
  1068  JOIN nyc_subway_stations subways
  1069  ON ST_Contains(blocks.geom, subways.geom)
  1070  ```
  1071  
  1072  We cannot execute this by joining between two inverted indexes. This
  1073  is no different from PostGIS, which does not attempt to join two
  1074  R-tree indexes. This query will execute using a special geospatial
  1075  inverted-index “lookup join”:
  1076  
  1077  * Each table row of one of the tables (typically the smaller one)
  1078    produces the given shape g (from the previous section).
  1079  * The covering of g is computed, and is used to lookup cell-ids (and
  1080    cell-id ranges) from the index of the other table.
  1081  * The set expression (described in the previous section) is computed
  1082    to generate the matching shapes.
  1083  
  1084  We plan to generalize this geospatial “lookup join” to also be usable
  1085  for other inverted index cases, like ARRAY and JSON.
  1086  
  1087  
  1088  #### Query Planner/Optimizer
  1089  
  1090  The optimizer will need to perform three major tasks to support
  1091  geospatial queries:
  1092  
  1093  * _Detect functions with constant inputs and perform constant
  1094    folding_.
  1095  
  1096    * Since we already support constant folding for functions, this
  1097      should be as simple as adding the relevant function names to the
  1098      `FoldFunctionWhitelist`.
  1099  
  1100  * _Detect index accelerated functions (see the list above) in the
  1101    WHERE clause with one constant and one non-constant input, and
  1102    generate constrained index scans_.
  1103  
  1104    * This will be an exploration rule which detects primary index scans
  1105      wrapped in a `SELECT` with predicate `WHERE ST_Covers(g, x)` (or
  1106      any other index accelerated function in which `g` is constant and
  1107      there is an index on `x`). The rule will create an alternative
  1108      plan in which the primary index scan is replaced with an
  1109      expression containing constrained inverted index scans combined
  1110      with set operations. The specific index constraints and set
  1111      operators (`UNION` or `INTERSECT`) will depend on the values
  1112      returned by the `contains`, `contained-by` or `intersects` APIs
  1113      described above. Later on, we can make these set expressions more
  1114      efficient using other transformation rules (e.g., convert the
  1115      `UNION`s to `UNION ALL` + `DISTINCT`) or help from the execution
  1116      engine (e.g., to support scanning multiple inverted index ranges
  1117      as part of a single scan). Note that the `SELECT` operator must
  1118      still remain after the set expressions to filter out false
  1119      positives returned by the index scans.
  1120  
  1121    * In order for the optimizer to decide between the original primary
  1122      index scan, an alternate secondary index scan (e.g., if there is
  1123      another predicate besides `ST_Covers`), or the set expression with
  1124      inverted index scans, we will need to update the cost model and
  1125      statistics code. For the cost model, we will need to understand
  1126      how expensive the predicate `ST_Covers` is compared to the cost of
  1127      scanning a row. For the statistics code, we will need an estimate
  1128      of how much data must be scanned for each of the inverted index
  1129      scans, as well as how many rows are returned by the full set
  1130      expression. The statistics code for inverted indexes is currently
  1131      very rudimentary, so this should be improved whilst benefitting
  1132      JSONB and ARRAY.
  1133  
  1134    * To improve the statistics code for inverted indexes, there are a
  1135      couple of options: (1) Store the number of distinct keys in the
  1136      index, as well as the total number of values in the index
  1137      (counting duplicates). To estimate the selectivity of an equality
  1138      predicate (e.g., x = 5), we would use the formula (1/no. distinct
  1139      keys) * (no. values in the index/no. of rows). (2) A more accurate
  1140      selectivity estimate would use a histogram, storing the number of
  1141      values indexed by each key. This also includes duplicate entries,
  1142      so the total number of values in the histogram would add up to
  1143      more than the number of rows. Making these improvements to
  1144      inverted index statistics would not only help support geospatial
  1145      queries, but also improve our JSON support.
  1146    
  1147  * _Detect index accelerated functions (see the list above) in the
  1148    WHERE/ON clause with two non-constant inputs, and generate lookup
  1149    joins_.
  1150  
  1151    * Similar to the previous case, this will be an exploration rule
  1152      which detects primary index scans wrapped in a predicate
  1153      containing one of the functions that can be
  1154      index-accelerated. However, this rule will only apply if both
  1155      inputs to the function are variables, and at least one has an
  1156      inverted index. In this case, the optimizer will generate one or
  1157      more lookup joins, in which one of the variables is used to look
  1158      up into the index of the other variable. As with the case above,
  1159      the predicate must remain after the join to filter out false
  1160      positives.
  1161  
  1162    * To decide which index to use for the join (or whether to use one
  1163      of the indexes at all), we will need to make the same changes to
  1164      the cost model and statistics code described above. We will also
  1165      need an estimate of how expensive the API calls to `contains`,
  1166      `contained-by` or `intersects` are since they will need to be
  1167      called on each input row. Additionally, we will need to improve
  1168      the selectivity calculation for lookup joins in cases where the
  1169      index is inverted.
  1170  
  1171      Since the set operations required for each row make selectivity
  1172      calculation on geospatial lookup joins especially difficult, we
  1173      may consider adding a new type of histogram for geospatial data.
  1174      The histogram buckets would be ranges of S2 cell IDs, and the
  1175      counts would represent the number of objects in the geospatial
  1176      column that overlap that cell. By joining histograms on two
  1177      geospatial columns, we could estimate the selectivity of a real
  1178      join on those columns
  1179  
  1180  ### Weakness of Covering invariant
  1181  
  1182  The previous sections have assumed the invariant
  1183  
  1184  
  1185  ```
  1186  contains(s1, s2) => contains(covering(s1), covering(s2))
  1187  ```
  1188  
  1189  
  1190  This is not true in the general case even if the parameters used for
  1191  computing the covering are the same:
  1192  
  1193  * It *may* be true due to implementation artifacts of the S2 library
  1194    for the shapes we are using, but this needs very careful
  1195    verification. Specifically, for coverings produced using the same
  1196    settings, the region coverer code in S2 may achieve this because of
  1197    how it prioritizes which [cells to
  1198    expand](https://sourcegraph.com/github.com/google/s2geometry/-/blob/src/s2/s2region_coverer.cc#L187-192)
  1199    (using level and number of children). But the MayIntersect
  1200    computation used is allowed to have [false
  1201    positives](https://sourcegraph.com/github.com/google/s2geometry/-/blob/src/s2/s2region.h#L85-92),
  1202    though for the shapes we are using it claims not to have false
  1203    positives.
  1204  
  1205  * We have experimentally observed benefits in using more cells in the
  1206    covering of the query shape (the g parameter). One way to work
  1207    around this would be to use a finer cell covering for g for
  1208    contained-by(g, x) and one that matches the index for contains(g,
  1209    x).
  1210  
  1211  
  1212  ![Covering Invariant](20200421_geospatial/covering_invariant.png?raw=true "Covering Invariant")
  1213  
  1214  The illustration above shows an example: the blue shape has a covering
  1215  represented by the two blue cells and the orange shape has a covering
  1216  represented by the larger orange cell (which is the parent of the blue
  1217  cells).
  1218  
  1219  To begin with, we will work around this as follows:
  1220  
  1221  * For contained-by(g, x), use an “inner” covering for g. An inner
  1222    covering is analogous to an interior covering, i.e., cells that are
  1223    fully contained in g, but can additionally use leaf level cells that
  1224    overlap with g.
  1225  * Loosen the index computation for contains(g, x) to be identical to
  1226    intersects. This means a very small g, which is contained by many
  1227    large shapes will have an index lookup that will retrieve all these
  1228    large shapes that will be false positives.
  1229  
  1230  
  1231  #### Alternatives to Inner covering
  1232  
  1233  If we can’t strengthen the covering invariant another alternative
  1234  would be to store both the regular covering and the inner covering in
  1235  the inverted index. Cells that are in both coverings (e.g. a point
  1236  shape will have identical inner and regular covering) would need to be
  1237  written once and marked as part of both coverings.
  1238  
  1239  
  1240  ### Considerations for 2D Planar Geometry
  1241  
  1242  We will use S2 to index 2D planar geometry by mapping the rectangular
  1243  bounds of the index to the unit square, which is then mapped to the
  1244  unit cube. We consider the potential issues with this approach:
  1245  
  1246  * Shapes that exceed the index bounds: we clip the shape to the index
  1247    bounds and index the clipped shape in the usual
  1248    manner. Additionally, the shape will be indexed under a special
  1249    “spillover” cell-id (that is not used by S2). Queries where the
  1250    specified shape exceeds the bounds will need to read this spillover
  1251    cell-id in addition to reading the rest of the index using the
  1252    clipped shape.
  1253  
  1254  * The mapping of the rectangular bounds to the unit square and then
  1255    mapping to the unit sphere introduces distortions. This are ok for
  1256    most operations since both the query shape and the indexed shape
  1257    will continue to share the same cells (where they overlap) despite
  1258    the distortion. However this an issue for ST_DWithin and
  1259    ST_DFullyWithin which additionally specify a distance: the mapping
  1260    to the unit sphere uses a non-linear function so the distance on the
  1261    unit cube face translates differently based on where it is located
  1262    on the cube. Instead of using S2 to extend the shape by distance (as
  1263    we do for Geography), we will extend the shape by distance using the
  1264    `GEOSBuffer_r` function in the GEOS library, and then translate the
  1265    returned geometry to S2.
  1266  
  1267  
  1268  ### Future Work
  1269  
  1270  Indexing is an open-ended area and there is much potential for future
  1271  improvements. We will prioritize improvements based on user
  1272  experience, so it is hard to know what we will discover. Here are some
  1273  early ideas on where we could work on improvements:
  1274  
  1275  * K nearest neighbor search: This is a hard problem with a
  1276    divide-the-space approach. We have some preliminary ideas for how to
  1277    support this based on distances between the centroids of the
  1278    bounding box, but this will wait until we have more clarity on user
  1279    requirements.
  1280  
  1281  * Eliminate the bounding box in 2D planar indexes: This will
  1282    potentially go together with replacing our use of S2 for 2D planar
  1283    geometries with our own divide-the-space approach (the challenge
  1284    here is fast and robust algorithms for determining containment and
  1285    intersection of a cell with a shape). We have some early ideas on
  1286    how one can use a variable number of bits in the cell-id, and a 4
  1287    quadrant plane, and extend the 4 quadrants indefinitely starting
  1288    from an origin. This approach requires a different key comparator
  1289    than the current MVCC comparator for the geospatial
  1290    index. Additionally, it may require some storage engine hooks to
  1291    rewrite keys as part of compactions.
  1292  
  1293  * More efficient storage and retrieval of inverted index: Storing and
  1294    retrieving an inverted index with one row per (cell-id, shape) is
  1295    inefficient. The typical alternative is to use posting lists of the
  1296    form cell-id => set of shape ids, where the shape ids are numeric
  1297    and dense and can be packed efficiently using delta and run-length
  1298    encoding. This allows for efficient posting list union and
  1299    intersection (e.g. using roaring bitmaps). A big issue in an OLTP
  1300    setting is insertions and deletions to the set of shape ids. Reading
  1301    and rewriting the set for each update will introduce a hotspot in
  1302    the key space. An alternative we could explore is to use storage
  1303    engine level merges to incorporate changes to the set. This cannot
  1304    be done without storage engine changes since the engine would need
  1305    to know about (a) what is not committed, since uncommitted changes
  1306    should not be merged with committed changes, (b) merge across
  1307    different MVCC timestamps, (c) split long posting lists into
  1308    multiple key-value pairs. Some of these issues are common to what
  1309    would be needed to do GC of old MVCC versions within the engine (as
  1310    part of compactions). We do not expect this to happen in the near
  1311    future.
  1312  
  1313  * Working with the default geometric data types in
  1314    [PostgreSQL](https://www.postgresql.org/docs/current/datatype-geometric.html),
  1315    in particular, [the
  1316    indexing](https://www.postgresql.org/docs/current/indexes-types.html).
  1317  
  1318  
  1319  ## End-to-End Example
  1320  
  1321  We consider the following query adapted from
  1322  [https://postgis.net/workshops/postgis-intro/joins_exercises.html](https://postgis.net/workshops/postgis-intro/joins_exercises.html).
  1323  
  1324  
  1325  ### Table Definitions
  1326  
  1327  Consider the following table definitions:
  1328  
  1329  <table>
  1330    <tr>
  1331     <td>-- copied from output of shp2pgsql
  1332  <p>
  1333  CREATE TABLE "nyc_census_blocks" (gid serial,
  1334  <p>
  1335  "blkid" varchar(15),
  1336  <p>
  1337  "popn_total" float8,
  1338  <p>
  1339  "popn_white" float8,
  1340  <p>
  1341  "popn_black" float8,
  1342  <p>
  1343  "popn_nativ" float8,
  1344  <p>
  1345  "popn_asian" float8,
  1346  <p>
  1347  "popn_other" float8,
  1348  <p>
  1349  "boroname" varchar(32));
  1350  <p>
  1351  ALTER TABLE "nyc_census_blocks" ADD PRIMARY KEY (gid);
  1352  <p>
  1353  SELECT AddGeometryColumn('','nyc_census_blocks','geom','0','MULTIPOLYGON',2);
  1354  <p>
  1355  -- create an index
  1356  <p>
  1357  CREATE INDEX nyc_census_blocks_geo_idx ON nyc_census_blocks USING GIST(geom);
  1358  <p>
  1359  -- and add 38794 rows
  1360     </td>
  1361    </tr>
  1362    <tr>
  1363     <td>-- copied from output of shp2pgsql
  1364  <p>
  1365  CREATE TABLE "nyc_neighborhoods" (gid serial,
  1366  <p>
  1367  "boroname" varchar(43),
  1368  <p>
  1369  "name" varchar(64));
  1370  <p>
  1371  ALTER TABLE "nyc_neighborhoods" ADD PRIMARY KEY (gid);
  1372  <p>
  1373  SELECT AddGeometryColumn('','nyc_neighborhoods','geom','0','MULTIPOLYGON',2);
  1374  <p>
  1375  -- create an index
  1376  <p>
  1377  CREATE INDEX nyc_neighbourhoods_geo_idx ON nyc_neighbourhoods USING GIST(geom);
  1378  <p>
  1379  -- and add 129 rows
  1380     </td>
  1381    </tr>
  1382  </table>
  1383  
  1384  
  1385  Now we run the following query:
  1386  
  1387  
  1388  ```
  1389  SELECT
  1390    n.name,
  1391    Sum(c.popn_total) / (ST_Area(n.geom) / 1000000.0) AS popn_per_sqkm
  1392  FROM nyc_census_blocks AS c
  1393  JOIN nyc_neighborhoods AS n
  1394  ON ST_Intersects(c.geom, n.geom)  AND c.boroname = n.boroname
  1395  WHERE n.name = 'Upper West Side'
  1396  OR n.name = 'Upper East Side'
  1397  GROUP BY n.name, n.geom;
  1398  ```
  1399  
  1400  Note both `nyc_census_blocks` and `nyc_neighborhoods` have geospatial
  1401  indexes on the geom column.
  1402  
  1403  
  1404  ### Example Match with S2 Coverings
  1405  
  1406  Let's take an example match from the above query and look at the S2
  1407  generated index coverings.
  1408  
  1409  Census Block 360610138001000 is in purple, with a generated S2
  1410  covering coloured in teal. The covering completely highlights the
  1411  entire census area using a maximum of 4 cells (which we have specified
  1412  experimentally for the sake of this example).
  1413  
  1414  
  1415  ![Census Block Covering](20200421_geospatial/census_block2.png?raw=true "Census Block Covering")
  1416  ![Census Block Covering with Cells](20200421_geospatial/census_block3.png?raw=true "Census Block Covering with Cells")
  1417  
  1418  Now here is the Upper East Side, with S2 using a single cell as a
  1419  cover for it. Again, we have experimentally set the use of a maximum
  1420  of 4 cells, but S2 was happy just using the one.
  1421  
  1422  
  1423  ![Upper East Side](20200421_geospatial/ues1.png?raw=true "Upper East Side")
  1424  ![Upper East Side With Cells](20200421_geospatial/ues2.png?raw=true "Upper East Side With Cells")
  1425  
  1426  
  1427  Here is everything so far overlayed with each other. Here we can
  1428  explicitly see that the coverings of the Census Block are (some unit
  1429  of grand) children of the Upper East Side, and there is a clear
  1430  intersection of the Census Block and the Upper East Side.
  1431  
  1432  ![Upper East Side With Census Block](20200421_geospatial/ues_with_census_block.png?raw=true "Upper East Side With Census Block")
  1433  
  1434  Let us consider trying to see whether we should try and match them
  1435  together with S2 coverings.
  1436  
  1437  Consider the CellIDs in binary format:
  1438  
  1439  
  1440  ```
  1441  UES:  1000100111000010010110001100000000000000000000000000000000000000
  1442  Blk1: 1000100111000010010110001011110001000000000000000000000000000000
  1443  Blk2: 1000100111000010010110001011101110001010110000000000000000000000
  1444  Blk3: 1000100111000010010110001011101110001011010000000000000000000000
  1445  Blk4: 1000100111000010010110001011101111110000000000000000000000000000
  1446  ```
  1447  
  1448  ```
  1449               ^-----------------------------------------^
  1450  ```
  1451  
  1452  With the S2 encoding mechanism, we can tell which items
  1453  intersect with the single cell covering of the Upper East Side by
  1454  considering the range
  1455  
  1456  ```
  1457  [1000100111000010010110001000000000000000000000000000000000000000,1000100111000010010110010000000000000000000000000000000000000000)
  1458  ```
  1459  
  1460  as well as any parent coverings by constantly
  1461  1000100111000010010110001100000000000000000000000000000000000000
  1462  dividing by 4 until we have reached the root sub-tree. This would
  1463  match all 4 blocks as indicated above.
  1464  
  1465  
  1466  ### Planning
  1467  
  1468  The number of neighborhoods is 129 and the number of census blocks is
  1469  38794. Even though name is not the primary key of the
  1470  `nyc_neighborhoods` table, the cardinality of the name field is the
  1471  same, 129. So the optimizer can predict that the name filter will
  1472  match 2 rows. This filter will be pushed below the join, so the input
  1473  to the join from `nyc_neighborhoods` is 2 rows.
  1474  
  1475  The optimizer will consider several different types of joins when
  1476  joining `nyc_neighborhoods` and `nyc_census_blocks`. The two most
  1477  likely options are (1) a hash join on the predicate `c.boroname =
  1478  n.boroname`, and (2) a geospatial lookup join on the predicate
  1479  `ST_Intersects(c.geom, n.geom)`, using the inverted index on
  1480  `nyc_census_blocks`.
  1481  
  1482  In order to decide between these two options, the optimizer can use
  1483  statistics. The boronames have cardinality 5, so the number of rows
  1484  returned by the hash join is 38794*2/5 = 15518. The expensive
  1485  `ST_Intersects(c.geom, n.geom)` function must then be called on each
  1486  of those rows.
  1487  
  1488  The cost of the geospatial lookup join is more challenging to
  1489  estimate, but it can be calculated by first estimating the number of
  1490  CellIDs that must be scanned in `nyc_census_blocks_geo_idx` for each
  1491  of the two matching rows in `nyc_neighborhoods`. Suppose that
  1492  approximately 60 CellIDs must be scanned for a particular row, and the
  1493  cardinality of the inverted index (i.e., number of CellIDs with
  1494  entries) is 6000. Since each census block will exist in 4 index
  1495  entries, the total number of values indexed is 38794*4 =
  1496  155176. Therefore, the number of entries scanned is (60/6000)*155176*2
  1497  = 3104. After de-duplicating the results, the number of output rows is
  1498  3104/4 = 776. The expensive `ST_Intersects(c.geom, n.geom)` function
  1499  must then be called on each of those rows to eliminate false
  1500  positives, and the inexpensive predicate `c.boroname = n.boroname`
  1501  must also be applied.
  1502  
  1503  Based on the above cost analysis, the optimizer decides to use the
  1504  geospatial index to do a geospatial-lookup join, where each of the
  1505  selected rows from the `nyc_neighborhoods` table will be used to
  1506  construct and compute a set union expression on the inverted
  1507  geospatial index of `nyc_census_blocks`.
  1508  
  1509  Here is the expected query plan:
  1510  
  1511  
  1512  ```
  1513  QUERY PLAN                                                      
  1514  --------------------------------------------------------------------------------------
  1515  project
  1516   ├── group-by
  1517   │    ├── inner-join (lookup nyc_census_blocks AS c)
  1518   │    │    ├── lookup columns are key
  1519   │    │    ├── inner-join (geospatial-lookup nyc_census_blocks@nyc_census_blocks_geo_idx)
  1520   │    │    │    ├── select
  1521   │    │    │    │    ├── scan nyc_neighborhoods AS n
  1522   │    │    │    │    └── filters
  1523   │    │    │    │         └── (name = 'Upper West Side') OR (name = 'Upper East Side')
  1524   │    │    │    └── filters (true)
  1525   │    │    └── filters
  1526   │    │         ├── st_intersects(c.geom, n.geom)
  1527   │    │         └── c.boroname = n.boroname
  1528   │    └── aggregations
  1529   │         └── sum
  1530   │              └── popn_total
  1531   └── projections
  1532        └── sum / (st_area(n.geom) / 1e+06)
  1533  ```
  1534  
  1535  
  1536  
  1537  ### Execution
  1538  
  1539  This query will execute in the manner outlined by the query plan:
  1540  
  1541  
  1542  
  1543  * Since there is no index on nyc_neighborhoods.name, the
  1544    nyc_neighborhoods table is scanned and the name filter is
  1545    applied. For each of the two rows that pass the filter, the
  1546    following steps will be performed.
  1547  
  1548  * A covering of the geom field in the row in nyc_neighborhoods is
  1549    computed. This covering is used by the intersects() function
  1550    outlined in the previous [section](#filtering), to produce a set of
  1551    cell-ids that must be retrieved from the index and the corresponding
  1552    nyc_census_blocks.gid column, which is the primary key,
  1553    unioned. This is the candidate set of census_blocks keys that
  1554    satisfy ST_Intersects.
  1555  
  1556  * The primary key of nyc_census_blocks (nyc_census_blocks.gid) is then
  1557    used to perform a lookup join with the primary index of
  1558    nyc_census_blocks, which is needed to retrieve the remaining
  1559    columns. As part of this lookup join, the ST_Intersects predicate
  1560    and the boroname equality predicate are computed and if both are
  1561    true, the join condition is satisfied.
  1562  
  1563  * Next, the sum of the total population is calculated, grouped by
  1564    neighborhood.
  1565  
  1566  * Finally, the sum is divided by the area of the neighborhood, in
  1567    order to find the population per square meter. Area is calculated
  1568    using the built-in scalar function ST_Area.
  1569  
  1570  
  1571  ## Package Structure
  1572  
  1573  We will put geospatial calculation data in a new `pkg/geo` package, to
  1574  separate "SQL" from "Geospatial". We've decided to separate
  1575  "functions" and their complexity into subpackages to simplify import
  1576  complexity.
  1577  
  1578  Our package structure will look as follows:
  1579  
  1580  * pkg
  1581    * geo (contains basic data types, as well as transformations from/to
  1582      external data formats)
  1583      * geopb (contains protobuf storage for shapes, as well as typedefs
  1584        for SRIDs)
  1585      * geogfn (contains geography functions, as well as the CGO
  1586        wrappers for GeographicLib)
  1587      * geomfn (contains geometry functions, as well as the CGO wrappers
  1588        for GEOS / SFCGAL)
  1589      * geosrid (contains SRID transformation functionality)
  1590      * geoindex (contains indexing functionality)
  1591      * geographiclib (contains spheroid functionality from
  1592        GeographicLib)
  1593      * geos (contains 2D geometry from the GEOS library)
  1594      * geoproj (contains proj functionality from PROJ)
  1595  
  1596  ## Telemetry
  1597  
  1598  To gather insight, we should add telemetry on the following:
  1599  
  1600  * SQL Usage
  1601    * column type usage
  1602    * usage statistics of various builtins
  1603    * success rate of caching SRID
  1604    * success rate of caching builtins (if implemented)
  1605  * Indexing
  1606    * spatial index usage
  1607    * spatial index using out of bounds geometries
  1608    * filtering % done by using indexes
  1609  
  1610  ## Testing
  1611  
  1612  In addition to writing our own unit and integration tests, there are
  1613  several third-party tools we can look at utilizing to compare
  1614  correctness. These can be adapted to and used in the CockroachDB
  1615  framework, including:
  1616  
  1617  * [OGC Conformance Tests](https://cite.opengeospatial.org/teamengine/)
  1618    - tests to ensure OGC compatibility - adaptable into nightly
  1619    roachtests
  1620  * [monetdb-mcs](https://github.com/cswxu/monetdb-mcs/tree/544d87e9e67a926e88b17b171aa4251b20177be0/geom/)
  1621    has a few of their own integration tests - adaptable into nightly
  1622    roachtests
  1623  
  1624  Due to internal restrictions, we will be unable to utilize
  1625  the following directly from the codebase:
  1626  
  1627  * [PostGIS regression
  1628    tests](https://github.com/postgis/postgis/blob/master/regress )
  1629  
  1630  * [PostGIS "garden"
  1631    tests](https://trac.osgeo.org/postgis/wiki/DevWikiGardenTest) -
  1632    tests for crashing / running into edge cases. We can utilize
  1633    sqlsmith to write a similar version for ourselves.
  1634  
  1635  We will also look at correctness testing our own index
  1636  implementation. We can utilize the default GEOS and GeographicLib
  1637  calculations to determine whether our index implementation will
  1638  correctly fetch the same results as without using the index. To do
  1639  this, we can generate random test suites with indexed columns and
  1640  compare results with and without using the index. As an added bonus,
  1641  we can see the performance improvement when using indexing as well
  1642  with this approach.
  1643  
  1644  
  1645  ## Performance Benchmarking
  1646  
  1647  At the time of writing, we have struggled to find industry standard
  1648  benchmarks for comparing performance of spatial indexes. As such, we
  1649  are mainly left to our own devices.
  1650  
  1651  We will be looking at writing our own benchmarking suite, which will
  1652  benchmark on the following:
  1653  
  1654  * Various core geospatial builtins
  1655  * Indexed retrievals
  1656    * We should test indexed retrievals against numbers of rows, as well
  1657      as different sizes of geospatial shapes (e.g. lots of little
  1658      and/or big shapes) as well as sparseness of geospatial shapes
  1659      (e.g. lots of data in "New York" vs data sparsely populated around
  1660      different areas).
  1661    * We should also try to test under lock contention (to test the
  1662      theory that R-Trees do not behave well under conditions involving
  1663      heavy locking)
  1664  * IMPORT/EXPORT functionality
  1665  
  1666  We may decide to compare our implementation against other SQL-based
  1667  implementations such as SQL Server and Postgres/PostGIS. MongoDB is a
  1668  good stretch option as well.
  1669  
  1670  ## Unresolved questions
  1671  
  1672  None beyond what is already mentioned in earlier text.
  1673  
  1674  # Updates
  1675  * 2020-05-07:
  1676    * added ST_ContainsProperly as an indexable function.