github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20200421_geospatial.md

github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20200421_geospatial.md (about)

1 - Feature Name: Geospatial
2 - Status: in-progress
3 - Start Date: 2020-04-01
4 - Authors: danhhz otan rytaft sumeerbhola
5 - RFC PR: [#47762](https://github.com/cockroachdb/cockroach/pull/47762)
6 - Cockroach Issue: [#19313](https://github.com/cockroachdb/cockroach/issues/19313)
7
8 Table of contents:
9
10 - [Summary](#Summary)
11 - [Guide-level explanation](#Guide-level-explanation)
12 - [Reference-level explanation](#Reference-level-explanation)
13 * [SQL Types & Functions](#SQL-Types--Functions)
14 * [Indexing](#Indexing)
15 * [End-to-End Example](#End-to-End-Example)
16 * [Package Structure](#Package-Structure)
17 * [Telemetry](#Telemetry)
18 * [Testing](#Testing)
19 * [Performance Benchmarking](#Performance-Benchmarking)
20 - [Unresolved questions](#Unresolved-questions)
21
22 # Summary
23
24 This RFC proposes the addition of geospatial features to CockroachDB, similar to
25 what is supported by PostGIS. These features are often requested by users. At the time
26 of writing, support for geospatial features is the most requested feature in
27 CockroachDB.
28
29 The approach leverages open-source geometry libraries used by PostGIS whenever
30 possible. For indexing, the approach diverges from PostGIS by dividing the space
31 into cells of decreasing size, which fits into the totally ordered key-value
32 model of CockroachDB. Early experiments suggest that this is competitive in
33 performance with R-trees whilst allowing for horizontal scaling.
34
35 # Guide-level explanation
36
37 Our geospatial implementation will be compatible with PostGIS. There
38 are a number of SQL geospatial implementations that we could have
39 modeled ourselves after as well as the OGC SQL standard or we could
40 have developed our own new interface. We've selected PostGIS
41 compatibility largely for two reasons:
42
43 1. When polled directly, a large fraction of our users asked for
44 PostGIS compatibility.
45 2. We've found time and time again that being a drop in replacement
46 for Postgres is an important aid to adoption. Given that the
47 open-source geospatial community has built most of its tooling around
48 PostGIS, this has the potential to be an even larger impact than
49 usual.
50
51 Additionally, both PostGIS and SQL Server largely follow the OGC SQL
52 standard, so this is not really as opinionated a decision as it would
53 otherwise be. However, notably, where PostGIS behavior diverges from
54 the OGC SQL standard, we've opted to prioritize the PostGIS behavior.
55
56 As a result, our initial user-facing geospatial footprint will be a
57 subset of PostGIS's footprint. We'll certainly expand this footprint
58 over time, but given that some PostGIS features are rarely used
59 (raster) and others are deprecated, it's likely that we'll never cover
60 all of it. See below for the exact initial footprint we'll be aiming
61 for.
62
63 The specs we refer to in this document are:
64
65 * OGC: [v1.1](http://portal.opengeospatial.org/files/?artifact_id=13228)
66 ** NOTE: For 20.2, we will aim to support v1.1 as PostGIS does
67 * SQL/MM: [ISO 13249-3](http://jtc1sc32.org/doc/N1101-1150/32N1107-WD13249-3--spatial.pdf)
68
69 Our approach will leverage the open-source geometry libraries used by
70 PostGIS whenever possible. These libraries are widely used and tested,
71 and additionally allow us to match PostGIS behavior for geometric
72 corner cases. For indexing, we will diverge from PostGIS, which uses
73 R-trees. The totally ordered key-value model used in CockroachDB
74 indexes allows for horizontal scaling and continuous repartitioning,
75 and we need to preserve these fundamental characteristics of the
76 system. We will use a divide-the-space approach that divides space
77 into cells of decreasing size, and turns the cell space into a linear
78 ordering using a space-filling curve. Shapes are indexed by a set of
79 cells using an inverted index.
80
81 # Reference-level explanation
82
83 Geospatial is exposed through two new SQL types: GEOMETRY (planar) and
84 GEOGRAPHY (sphere/spheroid, which only works on latitude/longitude
85 SRID systems). SQL Datums can be either of these types, as can SQL table
86 columns.
87
88 An individual datum of type GEOMETRY or GEOGRAPHY will be one of
89 several "shapes". The initial set of these we'll support (and is all
90 that is supported by GEOGRAPHY) is:
91
92 * POINT
93 * LINESTRING
94 * POLYGON
95 * MULTIPOINT
96 * MULTILINESTRING
97 * MULTIPOLYGON
98 * GEOMETRYCOLLECTION
99 * GEOMETRY ("shape", which is different than the "type", which allows
100 any of the above types to be referenced - more later)
101
102 In addition, geospatial types and datums can have Spatial Reference
103 System Identifiers (SRIDs) which specify which projection was applied
104 to their coordinate values. At a base level, we will look to support
105 SRID 0 (undefined) and 4326 (vanilla lat/lng geography on a WGS84
106 spheroid). Supporting SRIDs outside of these two is a stretch task for
107 the 20.2 release which we will detail in this RFC.
108
109 A number of new SQL builtin functions are added which operate on
110 GEOMETRY and GEOGRAPHY. Some of these functions can be accelerated by
111 maintaining a spatial index over a table, which is exposed as a SQL
112 index. The optimizer will be aware of these indexes and use them in
113 planning.
114
115
116 ## SQL Types & Functions
117
118 ### SQL Types
119
120 We will introduce two new data types to support geospatial data:
121
122 * Geometry - for planar functionality compliant with the OGC spec. For
123 20.2, we will look at only supporting 2D geometries, but still
124 support ingesting Z- and M- coordinates. For 20.2, Geometry will be
125 a wrapper around the
126 [twpayne/go-geom](https://github.com/twpayne/go-geom) library
127 which allows us to easily transform the datum for use in
128 the GEOS/SFCGAL libraries that have the builtins we require (see the
129 builtins section), whilst also allowing some simple operations to be
130 done without converting to GEOS/SFCGAL. We will gradually
131 switch to our own abstraction as we look to support curves/3D
132 geometries in future versions, which is lacking in twpayne/go-geom.
133
134 * Geography - for a subset of functionality of geometry that does
135 calculations on a spheroid using geodesic lines, operating only on
136 lat/lng-based projections (see the "Spatial Reference Identifiers
137 (SRIDs) and Projections" section). This is not part of the OGC
138 spec. For 20.2, Geography will mostly wrap around the
139 [S2 Geometry](https://s2geometry.io/) Region interface and
140 GeographicLib. The S2 Geometry will give us sphere math, as well as
141 operations which do not require physical real world
142 values. GeographicLib will perform necessary spheroid math
143 operations which require real world values (i.e. in metres for
144 ST_Area, ST_DWithin, ST_Distance, ST_Perimeter, or in degrees for
145 ST_Azimuth).
146
147 Note we will not support box2d and box3d PostGIS types and operators,
148 as they are designed to work with the R-Tree index which we are not
149 supporting.
150
151
152 #### Supported Geometry Types (Shapes)
153
154 For each data type, we will be able to support the "geometry types" as
155 defined in Section 6.2.6 of the OGC spec. We will internally call
156 these geometry types "shapes". The full hierachy is from the OGC
157 spec is as follows.
158
159 ![OGC Shape Hierachy](20200421_geospatial/shape_hierachy.png?raw=true "OGC Spec Hierachy")
160
161 We are planning to support the following:
162
163 * Initially, we will look to support the following shapes that are
164 applicable to both Geometry and Geography in line with the OGC and
165 SQL/MM specifications:
166 * POINT
167 * LINESTRING,
168 * POLYGON
169 * MULTIPOINT
170 * MULTILINESTRING
171 * MULTIPOLYGON
172 * GEOMETRY (wildcard shape; can be any shape except another GEOMETRY)
173 * GEOMETRYCOLLECTION (can be a collection of all the MULTI- shapes)
174
175 * In the future, we will look to support the following shapes for the
176 Geometry data type only, in line with PostGIS support:
177 * 2.5D (all 2D with Z- and M- coordinates and 3D datatypes)
178 * 2D:
179 * CURVE
180 * MULTICURVE
181 * TRIANGLE
182 * 3D:
183 * TIN (triangulated irregular network)
184 * SURFACE
185 * MULTISURFACE
186 * POLYHEDRALSURFACE
187
188
189 #### OIDs
190
191 The OIDs for these data types do not exist in `lib/pq/oid` as PostGIS
192 is an extension which only adds this metadata when the extension is
193 installed in the CLI. We hence have to define our own OIDs for
194 geospatial data types.
195
196 We will hardcode them in a number range which fits in the range that
197 can fit within an int4. We should not use an OID range higher than
198 int4 as this would be incompatible with Postgres.
199
200
201 #### Casts
202
203 PostGIS allows Geography types to be implicitly cast into Geometry
204 types in addition to explicit casts. However, Geometry types can only
205 be explicitly cast to Geography types.
206
207 We will be able to support the explicit casts easily. However, as
208 implicit casts are not supported in Cockroach (yet), we may have to
209 "double define" a few definitions of functions that cast Geography to
210 Geometry to emulate PostGIS. We will aim to support this as
211 "pseudo-implicit casts" for commonly used functions that are only
212 defined for Geometry (e.g. ST_SRID). The user also has the option to
213 explicitly cast Geography to Geometry before the builtins if we miss
214 any.
215
216 Casts between other supported types (`text` and `bytea`) will also
217 be initially supported via explicit casts, with implicit casts or
218 pseudo-implicit casts mentioned above implemented at a later date.
219
220
221 #### Column Definitions
222
223 In line with PostGIS, we will support the following column definition
224 mechanisms for GEOMETRY:
225
226 * `ADD COLUMN <colname> GEOMETRY` - this allows any kind of geometry
227 to be added. As no SRID is specified in the column definition, the
228 geometry by default will not have a SRID (which is defined as SRID =
229 0).
230
231 * `ADD COLUMN <colname> GEOMETRY(shape)` - this allows only the shape
232 defined to be stored in this column. However, the GEOMETRY shape
233 allows any shape to be stored in the column, and the
234 GEOMETRYCOLLECTION shape allows any MULTI- prefixed shape to be
235 stored in the column. As above, any SRID is allowed in this column.
236
237 * `ADD COLUMN <colname> GEOMETRY(shape, srid)` - applies both the
238 shape restriction above, as well as only allowing a specific SRID to
239 be stored in the column. If an SRID is not defined with the Datum
240 (i.e. has an SRID of 0), it's SRID will implicitly change to the
241 SRID in the given column. Note having a column of GEOMETRY(shape, 0)
242 where the SRID is 0 allows any SRID to be stored.
243
244 We will mirror the same definitions for GEOGRAPHY as we do for
245 GEOMETRY with one key difference - the default SRID will be 4326
246 (representing vanilla lat/lng on a WGS84 sphere), instead of leaving
247 it with an SRID of 0.
248
249 For 20.2, operations such as `ALTER TABLE ... SET DATA TYPE ...` may not be
250 permitted, but will be extended in later versions.
251
252
253 #### Index Creation
254
255 To maintain compatibility with PostGIS and existing tools, the syntax
256 to create indexes will involve the same syntax as using GiST, which will
257 initialize the indexes with the S2-backed inverted index underneath.
258
259 This PostGIS-backed index will look like the following:
260
261 ```
262 CREATE INDEX [indexname] ON [tablename] USING GIST ( [geospatial_column] );
263 ```
264
265 If we decide to support user-adjustable S2 parameters, we could
266 support a syntax like the following:
267
268 ```
269 CREATE INVERTED INDEX [indexname] ON [tablename] ( [geospatial_column] ) WITH COVERING (min_level = 0, max_level = 30, ….)
270 ```
271
272 For 2D planar geometry there will be an axis-aligned rectangular bound
273 that all shapes that want index acceleration should fit in. The bound
274 will be specified at index creation time. We could adopt a syntax
275 similar to SQL Server, which also requires a rectangular [bounding
276 box](https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview?view=sql-server-ver15#the-bounding-box).
277 Shapes can exceed this bounding box, but those shapes will not benefit
278 from index acceleration. To support this syntax, we will introduce a
279 WITH:
280
281 ```
282 CREATE INVERTED INDEX [indexname] ON [tablename] ( [geospatial_column] ) WITH BOUNDING BOX (x_min, y_min, x_max, y_max)
283 ```
284
285 The default bounding box for 2D geometry will be decided after
286 consultation with users.
287
288 #### Disk Serialization
289
290 For Geometry and Geography types, we will serialize the data as
291 a protobuf that contains the EWKB. This allows us flexibility in
292 changing libraries later, as well as encoding the necessary type
293 and shape metadata. The protobuf will appear as follows:
294
295 ```
296 message SpatialObject {
297 // EWKB is always stored in littleEndian order.
298 bytes ewkb = 1;
299 // SRID is denormalized from the EWKB.
300 int32 srid = 2;
301 // Shape is the Shape that is stored in the database.
302 Shape shape = 3;
303 // BBox bounding box for optimizing certain operations
304 // depending on performance benchmarks.
305 BoundingBox bbox = 4;
306 }
307
308 enum Shape {
309 GEOMETRY = 1;
310 POINT = 2;
311 // ... etc.
312 };
313
314 message BoundingBox {
315 float64 min_x = 1;
316 float64 min_y = 2;
317 float64 max_x = 3;
318 float64 max_y = 4;
319 }
320 ```
321
322 It is worth noting we could serialize Geography types with the S2
323 library. The S2 library would be fastest to transform serialization
324 and deserialization into S2 types (compared to EWKB). However, it is
325 missing important metadata such as 2.5D/3D (Z-) and M- coordinates, as
326 well as original shape and SRID metadata. There is a "Compressed"
327 serialization option for S2 objects which further improves disk usage,
328 however it is not yet ported to Go.
329
330 We will not support indexing geospatial types in default primary keys
331 and unique secondary indexes (as we do not for JSONB and ARRAYs today). This is
332 because we will not be able to match the PostGIS definition as it's based
333 on a hash of its internal data structure, which means we will
334 not be able to be a "drop-in" replacement here. If we could decide
335 on our own ordering, it will be based on the raw EWKB byte ordering.
336 This may change subject to further user consultation before the upcoming
337 v20.2 release.
338
339 We will however support using inverted indexes for geospatial indexing
340 -- see the indexing section for details.
341
342 #### Display
343
344 In line with PostGIS, we will display values from SELECT as EWKB-hex
345 strings.
346
347
348 ### Builtins
349
350 PostGIS supported functionality is defined in [Section 14.11 of their
351 documentation](https://postgis.net/docs/PostGIS_Special_Functions_Index.html#PostGIS_TypeFunctionMatrix),
352 which is a superset of those available in section 7 of the OGC
353 spec. For 20.2, we will aim to support all the 2D geometry (except for
354 CURVE) and geography functionality. The priority will be for functions
355 available for use for indexing and functions already implemented by
356 GEOS/S2/GeographicLib.
357
358 Most operations involving multiple Geometries and/or Geographies can
359 only succeed if the SRIDs between the two data types are the same.
360
361
362 #### Geometry (2D)
363
364 PostGIS mainly defers difficult 2D geometry functionality to the
365 [GEOS](https://trac.osgeo.org/geos/) library - however, some operations
366 are done in PostGIS itself if it is not supported in GEOS.
367
368 We will aim to use CGO to interface with the GEOS library in
369 20.2. This is the most straightforward and quick way to set up as the
370 GEOS library specifically targets implementing the OGC spec. This
371 provides us with coverage for many of the mathematical predicates
372 required for 2D geometry and lets us build a lot of functionality
373 quickly. We will eat the CGO overhead to be able to ship quickly.
374
375 We will use
376 [dlopen](http://man7.org/linux/man-pages/man3/dlopen.3.html) and
377 [dlsym](https://pubs.opengroup.org/onlinepubs/009695399/functions/dlsym.html) to
378 interface with the GEOS library. This prevents us from statically
379 linking the GEOS library - instead, a CRDB user must install GEOS into
380 their environment. As a corollary of this limitation:
381
382 * We will initially not be supporting certain 2D Geometry features on
383 CRDB on Windows for the 20.2 (we can decide to use the
384 [LoadLibrary](https://docs.microsoft.com/en-gb/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibrarya?redirectedfrom=MSDN)
385 function in the future).
386
387 * Users will have to specify the path of their GEOS install when
388 running a cockroach binary. This should be set as an array of paths
389 which points to the location where the C bindings may be installed,
390 with a sensible default that covers most operating system default
391 locations. This is a potential source of confusion, which should be
392 well documented.
393 * We should include the GEOS library in the docker containers we ship.
394 * We will include GEOS in the tarball and include instructions
395 from the installation page to copy configure GEOS if Geospatial
396 operations are required. This is similar to PostGIS where users
397 would require extra steps by installing the PostGIS extension
398 dependencies to perform Geospatial operations.
399
400 Example changed installation instructions:
401
402 ```sh
403 $ wget -qO- https://binaries.cockroachdb.com/cockroach-v20.2.0.linux-amd64.tgz | tar xvz
404 $ cp -i cockroach-v20.2.0.linux-amd64/cockroach /usr/local/bin/
405 # Optional step.
406 $ cp -i cockroach-v20.2.0.linux-amd64/libgeos_c.so /usr/local/lib/
407 ```
408
409 If a user attempts to use a geospatial function without having the library installed,
410 they should see an error similar to the following:
411
412 ```sql
413 $ SELECT ST_Area('POINT(1.0 1.0)'::geometry)
414 ERROR: geospatial functions are only supported if the GEOS module is installed
415 HINT: See http://cockroachdb.com/link/to/installation/instructions
416 ```
417
418 If a user loads geospatial data then later does not have the module installed,
419 the above error will still appear.
420
421
422 #### Geometry (Curves)
423
424 Curves are implemented natively in PostGIS, with an [approximation of
425 ~32 points per
426 curve](https://github.com/postgis/postgis/blob/master/liblwgeom/lwgeom_geos.c#L412)
427 done when input into the GEOS library. We can look to use some of the
428 same math after v20.2.
429
430
431 #### Geometry (2.5D)
432
433 These are implemented natively in PostGIS with some GEOS support. We
434 can look at doing this after v20.2.
435
436
437 #### Geometry (3D)
438
439 PostGIS uses [SFCGAL](http://www.sfcgal.org/) to compute 3D
440 geometry. We will employ similar CGO techniques described in Geometry
441 2D, with the same downsides.
442
443
444 #### Geography
445
446 In PostGIS, spherical calculations are done within PostGIS, whilst
447 values requiring spheroid values (e.g. Distance, Area) are done using
448 [GeographicLib](https://geographiclib.sourceforge.io/).
449
450 Most spherical calculations are supported by the S2 Geometry library
451 in C++, most of which are ported over to Golang. Combining the S2
452 Geometry library with using CGO/GeographicLib (which does not require
453 dlopen/dlsym as we can link it directly in our binaries), we will be
454 able to support all relevant Geography functions.
455
456
457 ##### Default Args for Geography functions
458
459 Most geography functions that can optionally operate on a spheroid
460 (e.g. [ST_Area](https://postgis.net/docs/ST_Area.html)) have a
461 default argument of "use_spheroid = True". We do not support default
462 arguments in Cockroach. As such, we will have to define two functions
463 for each of these - one taking no bool and one taking a bool
464 value. This will add an extra "row" for this builtin in our docs
465 compared to PostGIS.
466
467 Users and tools specifying explicitly default args (e.g. `SELECT
468 ST_Area(geom, use_spheroid := false)`) will be unable to do so at v20.2
469 unless we implement the default args feature before then.
470
471
472 #### Comparison Ops
473
474 Direct comparison using the `>`/`>=`/`=`/`<`/`<=`/`!=` comparators in
475 PostGIS involve hashing the internal structure of the PostGIS
476 object. Comparators (besides =, which should probably use ST_Equal
477 anyway) are not expected to be commonly used with geospatial data
478 types.
479
480 As such, we will define our own comparator operations to use raw byte
481 EWKB comparisons, which will be incompatible with PostGIS for base
482 comparator operators only.
483
484
485 #### bbox Ops
486
487 PostGIS has a set of [bounding box
488 operators](https://postgis.net/docs/reference.html#idm9874) that are
489 not part of the OGC or SQL/MM specs. As mentioned above, our different
490 indexing approach prevents these from being index
491 accelerated. Therefore we have no plans to support these in the
492 initial implementation.
493
494
495 #### Distance Operators
496
497 PostGIS has a set of [distance
498 operators](https://postgis.net/docs/reference.html#idm10964) that are
499 useful for accelerating k-nearest-neighbor search as outlined in
500 [https://postgis.net/workshops/postgis-intro/knn.html](https://postgis.net/workshops/postgis-intro/knn.html). We
501 plan to add support for k-nearest-neighbor in later versions, and may
502 support a subset of these distance operators.
503
504
505 #### Index Usage
506
507 Certain function builtins in PostGIS can utilize an index which we
508 will match with some optimizer work around the indexing (see Indexing
509 section).
510
511 For 2D geometry and geography, these are:
512
513 * ST_Covers
514 * ST_CoveredBy
515 * ST_Contains (geometry only)
516 * ST_ContainsProperly (geometry only)
517 * ST_Crosses (geometry only)
518 * ST_DFullyWithin (geometry only)
519 * ST_DWithin
520 * ST_Equals (geometry only)
521 * ST_Intersects
522 * ST_Overlaps (geometry only)
523 * ST_Touches (geometry only)
524 * ST_Within (geometry only)
525
526 These functions have an equivalent with an `_` prefix
527 (e.g.. _ST_Within) that avoids using the indexes. We will similarly
528 have such functions, in which the optimizer will know not to use the
529 indexes when doing these operations.
530
531 3D geometry support includes 2D and 1D shapes embedded in 3 coordinate
532 dimensions, and some support for 3D shapes (solids). This uses the
533 SFCGAL library. We will not support 3D coordinate dimensions since we
534 do not yet have a comprehensive solution for indexing 3D space. For
535 reference, the 3D functions that need such indexing support are:
536
537 * ST_3DDFullyWithin
538 * ST_3DDWithin
539 * ST_3DIntersects
540
541
542 #### Builtin Result Caching
543
544 In PostGIS, certain builtin operations are cached, using the hashed data
545 from the shape as keys. This is especially useful for ST_Distance/ST_DWithin
546 for repeated use of these operations. We may similarly look into
547 caching the results of certain operations, but this is out of scope for
548 v20.2. If we decide to go along this route, a future RFC will be
549 published.
550
551 #### Documentation
552
553 PostGIS has documentation for each of the builtins they support, which
554 gives details such as which libraries are used, common gotchas and
555 links to other useful information (click on any builtin as defined in
556 [Section
557 14.11](https://postgis.net/docs/PostGIS_Special_Functions_Index.html#PostGIS_TypeFunctionMatrix)). We
558 should be similarly detailed and rigorous with our
559 explanations. However, we will need to build the necessary framework
560 to support referencing (valid) docs within our builtins for this to
561 work.
562
563
564 ### External Formats
565
566 We will support the following external formats (and the associated
567 builtins for parsing and encoding them):
568
569 * Well Known Text (WKT) and Well Known Bytes (WKB), as defined in the
570 SQL/MM Section 5.1.45 and 5.1.46. These are all supported by
571 twpayne/geom library, except for WKT decoding. For the WKT decoding,
572 we can use the GEOS library (which blocks Geography adoption for
573 non-GEOS installs) in the interim, but as GEOS library does not
574 support curves/3D types, we may wish to write our own parser in the
575 future.
576
577 * Extended Well Known Text (EWKT) and Extended Well Known Bytes
578 (EWKB), as defined by PostGIS for cross-compatibility. This contains
579 an extra reserved bit for specifying whether an SRID is included, as
580 well as extra bytes used for encoding the SRID itself. These are all
581 supported by twpayne/geom library.
582
583 * GeoJSON, as defined by
584 [RFC7946](https://tools.ietf.org/html/rfc7946). These are all
585 supported by twpayne/go-geom libraries.
586
587
588 ### Spatial Reference Identifiers (SRIDs) and Projections
589
590 A SRID is a number that identifies a spatial reference system. A
591 spatial reference system provides meaning to the point coordinates
592 used in the representation of shapes. When the geometry type is
593 used to represent the Earth, the SRID represents a projection on
594 a plane, for example the
595 [web mercator](https://en.wikipedia.org/wiki/Web_Mercator_projection)
596 projection is represented using SRID 3857. The geography type is not
597 projected, and the default is SRID 4326, which uses lat/lng
598 coordinates.
599
600 SRID support is heavily inbuilt into the SQL type system (see SQL
601 Types/Column Definitions).
602
603
604 #### spatial_ref_sys
605
606 Upon loading the PostGIS extension, a spatial_ref_sys table is
607 imported into the public schema of the given database, with a default
608 set of SRIDs pre-loaded into the database. This table exists in line
609 with the OGC Spec, with the addition of a `proj4text` column which
610 defines the WKT projection which has been transformed into something
611 the [PROJ](https://proj.org/) library can use. This table is similarly
612 insertable and deletable. Any Postgres user can modify this table.
613
614 PostGIS INSERT statements for custom defined SRIDs are available from
615 entries on the[ Spatial Reference
616 website](https://spatialreference.org/). If users are defining custom
617 SRIDs with WKT but do not have access to proj4text, we can provide a
618 builtin CGO wrapper around the [GDAL](https://gdal.org/) library
619 (X/MIT) which can perform this translation - but there are external
620 cli tools already available to perform this operation.
621
622 We need to be able to support some notion of a spatial_ref_sys table,
623 which allows management of user-defined insertable spatial_ref_sys
624 data.
625
626 For v20.2, we will not be looking to support custom SRIDs, and as such
627 SRIDs will be a hardcoded map and viewable from a `pg_extension`
628 virtual schema, discussed below in the "`pg_extension` schema" section.
629
630 In future versions, we can look at supporting user-defined
631 `spatial_ref_sys` schemas by allowing the user to "copy" the
632 `spatial_ref_sys` table into the public schema and using that
633 as the source of truth, analogous to PostGIS. This will appear
634 as the following:
635
636 ```
637 $ CREATE EXTENSION 'postgis';
638 -- copy over spatial_ref_sys table, maybe even geometry_column/geography_column views
639 $ SELECT * FROM spatial_ref_sys
640 -- .... results copied from default spatial_ref_sys
641 $ INSERT INTO geom_table VALUES ('SRID=9999;POINT(1.0 1.0)')
642 -- lookup is based on what is available on the public schema as opposed
643 -- to a hardcoded map.
644 ```
645
646 #### geometry_columns / geography_columns
647
648 The `geometry_columns` and `geography_columns` views in PostGIS are a
649 wrapper around pg_catalog which allow the user to view which tables
650 currently involve geospatial data. Due to PostGIS's extension nature,
651 they are available on the public schema upon extension registration.
652
653 This will also be available on the `pg_extension` schema
654 mentioned below.
655
656 ##### `pg_extension` schema
657
658 For cross compatibility with PostGIS and its nature as an extension
659 which injects tables into the public schema, we need this table to be
660 resolvable from anywhere by simply requesting "geometry_columns",
661 "geography_columns" or "spatial_ref_sys".
662
663 For v20.2, this will be available on a virtual schema called `pg_extension`.
664 This will be on the search_path similar to how pg_catalog is today. This
665 will differentiate these views from the default virtual schema views
666 available on other tables.
667
668 This schema should also be resolvable from any search path, and
669 thus available from any catalog. Any custom defined schemas should not
670 be named `pg_extension`.
671
672
673 #### spatial_ref_sys lookup caching
674
675 New column definitions (e.g. `ADD COLUMN &lt;colname> geography(point,
676 4326)`) as well as importing new geospatial datums with SRIDs defined
677 (e.g. `ST_FromEWKT('SRID=4326;POINT(1.0 2.0)')`) require a lookup of
678 whether the SRID is valid. It does not need to apply the
679 transformations, as it is expected that the input is already defined
680 as the given SRID.
681
682 To speed up having to look at the spatial_ref_sys for valid SRIDs, we
683 will need to cache successful SRID lookups. We can keep an
684 (potentially bounded) in-memory mapping of valid SRIDs with a
685 "timeout" to assist with the caching. This will be stale upon deletion
686 by an admin user, which the timeout will look to resolve.
687
688
689 #### ST_Transform
690
691 ST_Transform is the builtin that projects a given geospatial datum
692 from one SRID projection to another. This can be done as a CGO wrapper
693 around the [PROJ](https://proj.org/) library by transforming from one
694 proj4text to another. The library has a lot of extra additions that
695 may be unnecessary for our use case (e.g. requiring the installation of
696 sqlite3 for storing metadata from certain tools we don't use) and as
697 such we may decide to also ship this as a library which requires
698 dlopen/dlsym.
699
700 It is worth noting there is a Go port of the PROJ library named
701 [go-spatial/proj](https://github.com/go-spatial/proj), but is missing
702 some essential functionality such as the ability to define custom
703 proj4 transformations.
704
705
706 ### Other Supported Operations
707
708
709 #### AddGeomColumn
710
711 This is a builtin provided by PostGIS that adds a new Geometry column
712 to the database given a set of parameters. This is used a lot by tools
713 that import geometry data (at least from tutorials), so we should look
714 to support the same builtin.
715
716 The AddGeomColumn primitive also adds a couple of CHECK expressions
717 that ensures that the SRIDs/shape in the column match the datums,
718 however, this is expected to work even when using the vanilla `ADD
719 COLUMN geometry(shape, srid)` syntax. This is probably a relic of old
720 PostGIS, and we will look to add the same CHECK expressions.
721
722
723 ### BACKUP/RESTORE/cockroach dump
724
725 Backup/Restore/cockroach dump operations will continue to "work out of
726 the box". We will furthermore be able to ingest postgres dumps of
727 PostGIS data as we support I/O of EWKT-as-hex format when parsing and
728 displaying geospatial datums.
729
730
731 ### CHANGEFEED
732
733 We will use EWKB for CHANGEFEED results for Geography and Geometry
734 types. This may be out of line with the OGC spec, but follows the
735 principle that we will prioritize PostGIS behaviour.
736
737
738 ### Out of Scope for v20.2
739
740 The following has been mentioned as out of scope:
741 * 2.5D, 3D shapes
742 * PRIMARY KEY definitions for Geometry/Geography types
743 * Default Arguments for Builtins
744 * Caching certain builtin operations
745 * User defined SRIDs
746
747 Furthermore, these will be out of scope:
748 * Supporting the default geometric data types in
749 [PostgreSQL](https://www.postgresql.org/docs/current/datatype-geometric.html).
750
751
752 ## Indexing
753
754
755 ### Background
756
757 The current approaches to geospatial indexes divide cleanly into two
758 buckets. One approach is to "divide the objects". This works by
759 inserting the objects into a tree (usually a balanced tree) whose
760 shape depends on the data being indexed. The other approach is to
761 "divide the space". This works by creating a decomposition of the
762 space being indexed into buckets of various sizes.
763
764 When an object is indexed, a "covering" shape (e.g. a bounding box) is
765 constructed that completely encompasses the indexed object. Index
766 queries work by looking for containment or intersection between the
767 covering shape for the query object and the indexed covering
768 shapes. This retrieves false positives but no false negatives.
769
770
771 #### Divide the Objects
772
773 PostGIS is the notable implementation of divide the objects. It
774 maintains an "R tree" (rectangle tree) which is implemented as a
775 Postgres "GiST" index. The GiST index is a generalization of data
776 types that can be naturally ordered into a hierarchy of supersets
777 (e.g.B+ trees, R trees, etc).
778
779 The covering shape used by PostGIS is a "bounding box" which is the
780 minimal rectangle that encompasses the indexed shape. The same is done
781 for the query shape.
782
783 Lucene uses [BKD
784 trees](https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf)
785 with a triangle tessellation of shapes. BKD trees permit a
786 log-structured multi file approach, but compactions would need to read
787 all the input files into memory to redivide the objects. More
788 importantly, they are not easily compatible with the normal horizontal
789 scaling approach of splitting and merging lexicographic ranges.
790
791 Advantages of Divide the Objects:
792
793 * Each object is present in the index once
794 * Can index 3D GEOMETRY space
795 * Can index GEOGRAPHY with altitude
796 * Can index infinite GEOMETRY space
797
798 Disadvantages of Divide the Objects:
799
800 * The tree needs a periodic balancing operation to have predictable
801 read latencies
802 * Object insertions and balancing require locking
803 * How to effectively horizontally distribute an R tree is an [open
804 question](https://www.anand-iyer.com/papers/sift-socc2017.pdf).
805 * Bulk ingest (IMPORT) requires coordination between nodes as the
806 tree's shape depends on its contents
807 * A bounding box can return many false positives
808
809 #### Divide the Space
810
811 Recent geospatial index implementations tend to prefer to divide the
812 space, because being able to horizontally distribute data is
813 increasingly important. The relevant implementations here are
814 Microsoft SQL Server and MongoDB.
815
816 The space is divided into a quadtree (or a set of quadtrees) with a
817 set number of levels and a data-independent shape. Each node in the
818 quad tree (a "cell") represents some part of the indexed space and is
819 divided once horizontally and once vertically to produce 4 children in
820 the next level. Each node in the quadtree is "content addressable",
821 meaning that mapping from the node to its unique ID and back is
822 possible without external information.
823
824 Implementations tend to use clever strategies for the unique IDs with
825 important guarantees:
826
827 * The IDs of all ancestors of a cell are enumerable
828 * The IDs of all descendants of a cell is a range query
829 * The cells of nearby IDs are spatially near
830
831 MongoDB uses the [S2 library](https://s2geometry.io/) from Google for
832 Geography and their own implementation for 2D planar geometry. The
833 latter uses a
834 [geohash](http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves)
835 based numbering of cells which has worse locality than the Hilbert
836 curve used by S2. SQL Server uses an implementation of their own
837 devising.
838
839 When indexing an object, a covering is computed, often using some
840 number of the predefined cells. The same is done for the query object
841 when using the index. Ancestors and descendants can be retrieved by
842 using the ID properties above.
843
844 The number of covering cells can vary per indexed object. There is an
845 important tradeoff in the number of cells used to represent an object
846 in the index: fewer cells use less space but create a looser
847 covering. A looser covering retrieves more false positives from the
848 index, which is expensive because the exact answer computation that's
849 run after the index query is expensive. However, at some point the
850 benefits of retrieving fewer false positives is outweighed by how long
851 it takes to scan a large index.
852
853 Because the space is divided beforehand, it must be finite. This means
854 that Divide the Space works for (spherical) GEOGRAPHY and for finite
855 GEOMETRY (planar) but not for infinite GEOMETRY. SQL Server requires
856 that finite bounds be declared when creating a GEOMETRY index.
857
858 Current divide the space approaches (both S2 and SQL Server) assume
859 flat space, so cannot be used for indexing 3D GEOMETRY or altitude
860 GEOGRAPHY. Conceptually, the approach is extendable to 3D, but would
861 require our own implementation, and the performance characteristics of
862 indexing a larger space in this manner are unknown.
863
864 Advantages of Divide the Space:
865
866 * Easy to scale horizontally
867 * No balancing
868 * Inserts require no locking
869 * Bulk ingest is simple
870 * Allow a per-object tradeoff between index size and false positives
871
872 Disadvantages of Divide the Space:
873
874 * Do not easily support indexing infinite GEOMETRY (but we discuss a
875 way to extend to infinite geometries in future work).
876 * Performance characteristics of indexing 3D GEOMETRY space and
877 indexing altitude GEOGRAPHY are unknown.
878
879
880 ### Approach
881
882 For geography we will use the S2 library that models a sphere using a
883 quad-tree divide-the-space approach and numbers the cells using a
884 Hilbert Curve. The cell ids are 64 bit integers and the leaf cells
885 measure 1cm across on the Earth’s surface. For 2D planar geometry, we
886 will initially reuse the S2 library by mapping the rectangular bounds
887 of the index to one face of the unit cube in S2, and then mapping it
888 to the unit sphere. This distorts the original shape, but index
889 lookups (discussed in more detail below) look for cell overlaps, and
890 overlaps are preserved by this distortion.
891
892 In future versions we may write our own divide-the-space library for
893 2D planar geometry that avoids any distortions, and would permit
894 removal of the rectangular index bounds (see future work section).
895
896
897 ### Index storage
898
899 An index is configured with parameters that control aspects of the
900 covering produced for a shape: the number of cells, the lowest and
901 highest levels of the cells. The cell-ids are represented as
902 uint64. The stored index is an inverted index consisting of this
903 cell-id column followed by the primary key columns in the indexed
904 table. Note that each cell-id can be used in the covering for multiple
905 table rows, and each table row can have multiple cell-ids in the
906 covering, that is, it is a many-to-many relationship.
907
908 The following picture shows a nyc census block in purple with a 4 cell
909 covering in teal. Note that the cells are of different sizes. The
910 cell-id for one of the cells in the covering is also shown. 4 cells
911 were used in this case since that was the maximum number of cells
912 configured in the index configuration, stored in the IndexDescriptor.
913
914 ![Census Block Covering](20200421_geospatial/census_block1.png?raw=true "Census Block Covering")
915
916 This on-disk representation of the inverted index splits the “posting
917 list” for each cell-id, where the list is the table rows that are
918 indexed by that cell-id, into one index row per list element. This is
919 simple but not very efficient for reads, since each index row is
920 potentially small and queries always need to read all the rows for a
921 cell-id. However, it is convenient for additions and deletions, since
922 a single row needs to be modified. Preliminary experiments indicate
923 that the simple approach can be competitive in performance with
924 PostGIS, so we will adopt this simple approach until otherwise
925 necessary.
926
927
928 ### Index queries
929
930 We first consider how indexes are used to accelerate filtering and
931 then extend it to join queries.
932
933
934 #### Filtering
935
936 Consider a boolean function op(g, x) where g is a given shape, and x
937 is an indexed shape for which this function is true. The index is used
938 to find the set of all x for which this function could return
939 true. All functions are mapped to the following for index
940 acceleration:
941
942 * `contains(g, x)`: Every point in shape x is also in g, i.e., g
943 contains x.
944 * `contained-by(g, x)`: This is equivalent to contains(x, g), but is
945 considered separately since x represents the shapes that are
946 indexed.
947 * `intersects(g, x)` or `intersects(x, g)`: g and x have at least 1 point
948 that is in both.
949
950 Consider the covering(s) for any shape s to be the set of cellids that
951 cover the shape s. PostGIS uses bounding boxes instead of cell
952 coverings. Bounding boxes have the following property:
953
954 * contains(s1, s2) => contains(bounding-box(s1), bounding-box(s2))
955 * intersects(s1, s2) => intersects(bounding-box(s1), bounding-box(s2))
956
957 We assume the same property for cell coverings:
958
959 * contains(s1, s2) => contains(covering(s1), covering(s2))
960 * intersects(s1, s2) => intersects(covering(s1), covering(s2))
961
962 The first property is not always true and we will discuss how to
963 adjust to that reality in a later section.
964
965 The following abstract example is used to illustrate the filtering
966 algorithm, where c[i] is a cell number. Consider g has the cell
967 covering c[213], c[61], c[64] in a quad-tree rooted at c[0]. For
968 convenience of illustration, the numbering is not the Hilbert curve
969 numbering used by S2. In the numbering here, the children of cell c[i]
970 are numbered c[4*i+1]...c[4*i+4]. The following depicts this covering
971 as a tree with the leaf cells being the covering cells. Note that the
972 paths from the leaves to the root are not the same length.
973
974
975 ```
976 c[0]
977 |
978 c[3]
979 |
980 +---+-----+
981 | |
982 c[13] c[15]
983 | |
984 c[53] +--+--+
985 | | |
986 C[213] c[61] c[64]
987
988 ```
989
990
991
992 * contains(g, x): Using the covering property from earlier, all shapes
993 contained by g must have coverings contained by {c[213], c[61],
994 c[64]}. In the full quad-tree (not the partial one depicted above),
995 these are all the shapes indexed in the subtrees rooted at c[213],
996 c[61], c[64]. Due to the locality of the Hilbert curve numbering,
997 each subtree is a single contiguous range of integers. The indexed
998 shapes that could satisfy this function are:
999
1000 ```
1001 (⋃ \for-all c in subtree-range(c[213]) index(c)) ⋃
1002 (⋃ \for-all c in subtree-range(c[61]) index(c)) ⋃
1003 (⋃ \for-all c in subtree-range(c[64]) index(c))
1004 ```
1005
1006 There can be duplicate shapes both within a subtree and across
1007 subtrees. For example, a shape could be indexed under two children
1008 of c[61], say c[245] and c[247]. The scan of the subtree rooted at
1009 c[61] will find it twice.
1010
1011 * contained-by(g, x): Using the covering property from earlier, all
1012 shapes containing g must have coverings that contain c[213] and
1013 contain c[61] and contain c[64]. The indexed shapes that could
1014 satisfy this function are
1015
1016 ```
1017 (index(c[213]) ⋃ index(c[53]) ⋃ index(c[13]) ⋃ index(c[3]) ⋃ index(c[0])) ⋂
1018 (index(c61) ⋃ index(c15) ⋃ index(c3) ⋃ index(c0)) ⋂
1019 (index(c64) ⋃ index(c15) ⋃ index(c3) ⋃ index(c0))
1020 ```
1021
1022 One can factor out common sub-expressions.
1023
1024 * intersects(g, x) (or (x, g)): This needs to retrieve the shapes that
1025 satisfy contains(g, x) and shapes indexed using the ancestors of
1026 c213, c61, c64. That is,
1027
1028 ```
1029 contains(g, x) ⋃ index(c0) ⋃ index(c3) ⋃ index(c13) ⋃ index(c53) ⋃ index(c15)
1030 ```
1031
1032
1033 #### Mapping to the index functions
1034
1035 Functions map to the index functions:
1036
1037 * ST_Covers(g, x), ST_Covers(x, g): use contains(g, x) or
1038 contained-by(g, x)
1039 * ST_CoveredBy(g, x), ST_CoveredBy(x, g): use contained-by(g, x) or
1040 contains(g, x)
1041 * ST_Contains(g, x), ST_Contains(x, g): use contains(g, x) or
1042 contained-by(g, x)
1043 * ST_ContainsProperly(g, x), ST_ContainsProperly(x, g): use contains(g, x) or
1044 contained-by(g, x)
1045 * ST_Crosses: use intersects
1046 * ST_DFullyWithin(g, x, d), ST_DFullyWithin(x, g, d): extend g by
1047 distance d to produce a shape g’, and then use contains(g', x). The
1048 S2 library has an S2ShapeIndexBufferedRegion class that can be used
1049 for extending by distance d for the Geography type. This is not part
1050 of the S2 Go library, so we will need to port it from C++. See a
1051 later section for the Geometry type.
1052 * ST_DWithin(g, x, d), ST_DWithin(x, g, d): extend g by distance to
1053 produce a shape g’, and then use intersects.
1054 * ST_Equals: use intersects (we may be able to do better later)
1055 * ST_Intersects: use intersects
1056 * ST_Overlaps: use intersects
1057 * ST_Touches: use intersects
1058 * ST_Within(g, x), ST_Within(x, g): use contained-by(g, x) or contains(g, x)
1059
1060
1061 #### Joins
1062
1063 Queries can join between tables with geospatial indexes such as
1064
1065 ```
1066 SELECT blocks.blkid
1067 FROM nyc_census_blocks blocks
1068 JOIN nyc_subway_stations subways
1069 ON ST_Contains(blocks.geom, subways.geom)
1070 ```
1071
1072 We cannot execute this by joining between two inverted indexes. This
1073 is no different from PostGIS, which does not attempt to join two
1074 R-tree indexes. This query will execute using a special geospatial
1075 inverted-index “lookup join”:
1076
1077 * Each table row of one of the tables (typically the smaller one)
1078 produces the given shape g (from the previous section).
1079 * The covering of g is computed, and is used to lookup cell-ids (and
1080 cell-id ranges) from the index of the other table.
1081 * The set expression (described in the previous section) is computed
1082 to generate the matching shapes.
1083
1084 We plan to generalize this geospatial “lookup join” to also be usable
1085 for other inverted index cases, like ARRAY and JSON.
1086
1087
1088 #### Query Planner/Optimizer
1089
1090 The optimizer will need to perform three major tasks to support
1091 geospatial queries:
1092
1093 * _Detect functions with constant inputs and perform constant
1094 folding_.
1095
1096 * Since we already support constant folding for functions, this
1097 should be as simple as adding the relevant function names to the
1098 `FoldFunctionWhitelist`.
1099
1100 * _Detect index accelerated functions (see the list above) in the
1101 WHERE clause with one constant and one non-constant input, and
1102 generate constrained index scans_.
1103
1104 * This will be an exploration rule which detects primary index scans
1105 wrapped in a `SELECT` with predicate `WHERE ST_Covers(g, x)` (or
1106 any other index accelerated function in which `g` is constant and
1107 there is an index on `x`). The rule will create an alternative
1108 plan in which the primary index scan is replaced with an
1109 expression containing constrained inverted index scans combined
1110 with set operations. The specific index constraints and set
1111 operators (`UNION` or `INTERSECT`) will depend on the values
1112 returned by the `contains`, `contained-by` or `intersects` APIs
1113 described above. Later on, we can make these set expressions more
1114 efficient using other transformation rules (e.g., convert the
1115 `UNION`s to `UNION ALL` + `DISTINCT`) or help from the execution
1116 engine (e.g., to support scanning multiple inverted index ranges
1117 as part of a single scan). Note that the `SELECT` operator must
1118 still remain after the set expressions to filter out false
1119 positives returned by the index scans.
1120
1121 * In order for the optimizer to decide between the original primary
1122 index scan, an alternate secondary index scan (e.g., if there is
1123 another predicate besides `ST_Covers`), or the set expression with
1124 inverted index scans, we will need to update the cost model and
1125 statistics code. For the cost model, we will need to understand
1126 how expensive the predicate `ST_Covers` is compared to the cost of
1127 scanning a row. For the statistics code, we will need an estimate
1128 of how much data must be scanned for each of the inverted index
1129 scans, as well as how many rows are returned by the full set
1130 expression. The statistics code for inverted indexes is currently
1131 very rudimentary, so this should be improved whilst benefitting
1132 JSONB and ARRAY.
1133
1134 * To improve the statistics code for inverted indexes, there are a
1135 couple of options: (1) Store the number of distinct keys in the
1136 index, as well as the total number of values in the index
1137 (counting duplicates). To estimate the selectivity of an equality
1138 predicate (e.g., x = 5), we would use the formula (1/no. distinct
1139 keys) * (no. values in the index/no. of rows). (2) A more accurate
1140 selectivity estimate would use a histogram, storing the number of
1141 values indexed by each key. This also includes duplicate entries,
1142 so the total number of values in the histogram would add up to
1143 more than the number of rows. Making these improvements to
1144 inverted index statistics would not only help support geospatial
1145 queries, but also improve our JSON support.
1146
1147 * _Detect index accelerated functions (see the list above) in the
1148 WHERE/ON clause with two non-constant inputs, and generate lookup
1149 joins_.
1150
1151 * Similar to the previous case, this will be an exploration rule
1152 which detects primary index scans wrapped in a predicate
1153 containing one of the functions that can be
1154 index-accelerated. However, this rule will only apply if both
1155 inputs to the function are variables, and at least one has an
1156 inverted index. In this case, the optimizer will generate one or
1157 more lookup joins, in which one of the variables is used to look
1158 up into the index of the other variable. As with the case above,
1159 the predicate must remain after the join to filter out false
1160 positives.
1161
1162 * To decide which index to use for the join (or whether to use one
1163 of the indexes at all), we will need to make the same changes to
1164 the cost model and statistics code described above. We will also
1165 need an estimate of how expensive the API calls to `contains`,
1166 `contained-by` or `intersects` are since they will need to be
1167 called on each input row. Additionally, we will need to improve
1168 the selectivity calculation for lookup joins in cases where the
1169 index is inverted.
1170
1171 Since the set operations required for each row make selectivity
1172 calculation on geospatial lookup joins especially difficult, we
1173 may consider adding a new type of histogram for geospatial data.
1174 The histogram buckets would be ranges of S2 cell IDs, and the
1175 counts would represent the number of objects in the geospatial
1176 column that overlap that cell. By joining histograms on two
1177 geospatial columns, we could estimate the selectivity of a real
1178 join on those columns
1179
1180 ### Weakness of Covering invariant
1181
1182 The previous sections have assumed the invariant
1183
1184
1185 ```
1186 contains(s1, s2) => contains(covering(s1), covering(s2))
1187 ```
1188
1189
1190 This is not true in the general case even if the parameters used for
1191 computing the covering are the same:
1192
1193 * It *may* be true due to implementation artifacts of the S2 library
1194 for the shapes we are using, but this needs very careful
1195 verification. Specifically, for coverings produced using the same
1196 settings, the region coverer code in S2 may achieve this because of
1197 how it prioritizes which [cells to
1198 expand](https://sourcegraph.com/github.com/google/s2geometry/-/blob/src/s2/s2region_coverer.cc#L187-192)
1199 (using level and number of children). But the MayIntersect
1200 computation used is allowed to have [false
1201 positives](https://sourcegraph.com/github.com/google/s2geometry/-/blob/src/s2/s2region.h#L85-92),
1202 though for the shapes we are using it claims not to have false
1203 positives.
1204
1205 * We have experimentally observed benefits in using more cells in the
1206 covering of the query shape (the g parameter). One way to work
1207 around this would be to use a finer cell covering for g for
1208 contained-by(g, x) and one that matches the index for contains(g,
1209 x).
1210
1211
1212 ![Covering Invariant](20200421_geospatial/covering_invariant.png?raw=true "Covering Invariant")
1213
1214 The illustration above shows an example: the blue shape has a covering
1215 represented by the two blue cells and the orange shape has a covering
1216 represented by the larger orange cell (which is the parent of the blue
1217 cells).
1218
1219 To begin with, we will work around this as follows:
1220
1221 * For contained-by(g, x), use an “inner” covering for g. An inner
1222 covering is analogous to an interior covering, i.e., cells that are
1223 fully contained in g, but can additionally use leaf level cells that
1224 overlap with g.
1225 * Loosen the index computation for contains(g, x) to be identical to
1226 intersects. This means a very small g, which is contained by many
1227 large shapes will have an index lookup that will retrieve all these
1228 large shapes that will be false positives.
1229
1230
1231 #### Alternatives to Inner covering
1232
1233 If we can’t strengthen the covering invariant another alternative
1234 would be to store both the regular covering and the inner covering in
1235 the inverted index. Cells that are in both coverings (e.g. a point
1236 shape will have identical inner and regular covering) would need to be
1237 written once and marked as part of both coverings.
1238
1239
1240 ### Considerations for 2D Planar Geometry
1241
1242 We will use S2 to index 2D planar geometry by mapping the rectangular
1243 bounds of the index to the unit square, which is then mapped to the
1244 unit cube. We consider the potential issues with this approach:
1245
1246 * Shapes that exceed the index bounds: we clip the shape to the index
1247 bounds and index the clipped shape in the usual
1248 manner. Additionally, the shape will be indexed under a special
1249 “spillover” cell-id (that is not used by S2). Queries where the
1250 specified shape exceeds the bounds will need to read this spillover
1251 cell-id in addition to reading the rest of the index using the
1252 clipped shape.
1253
1254 * The mapping of the rectangular bounds to the unit square and then
1255 mapping to the unit sphere introduces distortions. This are ok for
1256 most operations since both the query shape and the indexed shape
1257 will continue to share the same cells (where they overlap) despite
1258 the distortion. However this an issue for ST_DWithin and
1259 ST_DFullyWithin which additionally specify a distance: the mapping
1260 to the unit sphere uses a non-linear function so the distance on the
1261 unit cube face translates differently based on where it is located
1262 on the cube. Instead of using S2 to extend the shape by distance (as
1263 we do for Geography), we will extend the shape by distance using the
1264 `GEOSBuffer_r` function in the GEOS library, and then translate the
1265 returned geometry to S2.
1266
1267
1268 ### Future Work
1269
1270 Indexing is an open-ended area and there is much potential for future
1271 improvements. We will prioritize improvements based on user
1272 experience, so it is hard to know what we will discover. Here are some
1273 early ideas on where we could work on improvements:
1274
1275 * K nearest neighbor search: This is a hard problem with a
1276 divide-the-space approach. We have some preliminary ideas for how to
1277 support this based on distances between the centroids of the
1278 bounding box, but this will wait until we have more clarity on user
1279 requirements.
1280
1281 * Eliminate the bounding box in 2D planar indexes: This will
1282 potentially go together with replacing our use of S2 for 2D planar
1283 geometries with our own divide-the-space approach (the challenge
1284 here is fast and robust algorithms for determining containment and
1285 intersection of a cell with a shape). We have some early ideas on
1286 how one can use a variable number of bits in the cell-id, and a 4
1287 quadrant plane, and extend the 4 quadrants indefinitely starting
1288 from an origin. This approach requires a different key comparator
1289 than the current MVCC comparator for the geospatial
1290 index. Additionally, it may require some storage engine hooks to
1291 rewrite keys as part of compactions.
1292
1293 * More efficient storage and retrieval of inverted index: Storing and
1294 retrieving an inverted index with one row per (cell-id, shape) is
1295 inefficient. The typical alternative is to use posting lists of the
1296 form cell-id => set of shape ids, where the shape ids are numeric
1297 and dense and can be packed efficiently using delta and run-length
1298 encoding. This allows for efficient posting list union and
1299 intersection (e.g. using roaring bitmaps). A big issue in an OLTP
1300 setting is insertions and deletions to the set of shape ids. Reading
1301 and rewriting the set for each update will introduce a hotspot in
1302 the key space. An alternative we could explore is to use storage
1303 engine level merges to incorporate changes to the set. This cannot
1304 be done without storage engine changes since the engine would need
1305 to know about (a) what is not committed, since uncommitted changes
1306 should not be merged with committed changes, (b) merge across
1307 different MVCC timestamps, (c) split long posting lists into
1308 multiple key-value pairs. Some of these issues are common to what
1309 would be needed to do GC of old MVCC versions within the engine (as
1310 part of compactions). We do not expect this to happen in the near
1311 future.
1312
1313 * Working with the default geometric data types in
1314 [PostgreSQL](https://www.postgresql.org/docs/current/datatype-geometric.html),
1315 in particular, [the
1316 indexing](https://www.postgresql.org/docs/current/indexes-types.html).
1317
1318
1319 ## End-to-End Example
1320
1321 We consider the following query adapted from
1322 [https://postgis.net/workshops/postgis-intro/joins_exercises.html](https://postgis.net/workshops/postgis-intro/joins_exercises.html).
1323
1324
1325 ### Table Definitions
1326
1327 Consider the following table definitions:
1328
1329 <table>
1330 <tr>
1331 <td>-- copied from output of shp2pgsql
1332 
1333 CREATE TABLE "nyc_census_blocks" (gid serial,
1334 
1335 "blkid" varchar(15),
1336 
1337 "popn_total" float8,
1338 
1339 "popn_white" float8,
1340 
1341 "popn_black" float8,
1342 
1343 "popn_nativ" float8,
1344 
1345 "popn_asian" float8,
1346 
1347 "popn_other" float8,
1348 
1349 "boroname" varchar(32));
1350 
1351 ALTER TABLE "nyc_census_blocks" ADD PRIMARY KEY (gid);
1352 
1353 SELECT AddGeometryColumn('','nyc_census_blocks','geom','0','MULTIPOLYGON',2);
1354 
1355 -- create an index
1356 
1357 CREATE INDEX nyc_census_blocks_geo_idx ON nyc_census_blocks USING GIST(geom);
1358 
1359 -- and add 38794 rows
1360 </td>
1361 </tr>
1362 <tr>
1363 <td>-- copied from output of shp2pgsql
1364 
1365 CREATE TABLE "nyc_neighborhoods" (gid serial,
1366 
1367 "boroname" varchar(43),
1368 
1369 "name" varchar(64));
1370 
1371 ALTER TABLE "nyc_neighborhoods" ADD PRIMARY KEY (gid);
1372 
1373 SELECT AddGeometryColumn('','nyc_neighborhoods','geom','0','MULTIPOLYGON',2);
1374 
1375 -- create an index
1376 
1377 CREATE INDEX nyc_neighbourhoods_geo_idx ON nyc_neighbourhoods USING GIST(geom);
1378 
1379 -- and add 129 rows
1380 </td>
1381 </tr>
1382 </table>
1383
1384
1385 Now we run the following query:
1386
1387
1388 ```
1389 SELECT
1390 n.name,
1391 Sum(c.popn_total) / (ST_Area(n.geom) / 1000000.0) AS popn_per_sqkm
1392 FROM nyc_census_blocks AS c
1393 JOIN nyc_neighborhoods AS n
1394 ON ST_Intersects(c.geom, n.geom) AND c.boroname = n.boroname
1395 WHERE n.name = 'Upper West Side'
1396 OR n.name = 'Upper East Side'
1397 GROUP BY n.name, n.geom;
1398 ```
1399
1400 Note both `nyc_census_blocks` and `nyc_neighborhoods` have geospatial
1401 indexes on the geom column.
1402
1403
1404 ### Example Match with S2 Coverings
1405
1406 Let's take an example match from the above query and look at the S2
1407 generated index coverings.
1408
1409 Census Block 360610138001000 is in purple, with a generated S2
1410 covering coloured in teal. The covering completely highlights the
1411 entire census area using a maximum of 4 cells (which we have specified
1412 experimentally for the sake of this example).
1413
1414
1415 ![Census Block Covering](20200421_geospatial/census_block2.png?raw=true "Census Block Covering")
1416 ![Census Block Covering with Cells](20200421_geospatial/census_block3.png?raw=true "Census Block Covering with Cells")
1417
1418 Now here is the Upper East Side, with S2 using a single cell as a
1419 cover for it. Again, we have experimentally set the use of a maximum
1420 of 4 cells, but S2 was happy just using the one.
1421
1422
1423 ![Upper East Side](20200421_geospatial/ues1.png?raw=true "Upper East Side")
1424 ![Upper East Side With Cells](20200421_geospatial/ues2.png?raw=true "Upper East Side With Cells")
1425
1426
1427 Here is everything so far overlayed with each other. Here we can
1428 explicitly see that the coverings of the Census Block are (some unit
1429 of grand) children of the Upper East Side, and there is a clear
1430 intersection of the Census Block and the Upper East Side.
1431
1432 ![Upper East Side With Census Block](20200421_geospatial/ues_with_census_block.png?raw=true "Upper East Side With Census Block")
1433
1434 Let us consider trying to see whether we should try and match them
1435 together with S2 coverings.
1436
1437 Consider the CellIDs in binary format:
1438
1439
1440 ```
1441 UES: 1000100111000010010110001100000000000000000000000000000000000000
1442 Blk1: 1000100111000010010110001011110001000000000000000000000000000000
1443 Blk2: 1000100111000010010110001011101110001010110000000000000000000000
1444 Blk3: 1000100111000010010110001011101110001011010000000000000000000000
1445 Blk4: 1000100111000010010110001011101111110000000000000000000000000000
1446 ```
1447
1448 ```
1449 ^-----------------------------------------^
1450 ```
1451
1452 With the S2 encoding mechanism, we can tell which items
1453 intersect with the single cell covering of the Upper East Side by
1454 considering the range
1455
1456 ```
1457 [1000100111000010010110001000000000000000000000000000000000000000,1000100111000010010110010000000000000000000000000000000000000000)
1458 ```
1459
1460 as well as any parent coverings by constantly
1461 1000100111000010010110001100000000000000000000000000000000000000
1462 dividing by 4 until we have reached the root sub-tree. This would
1463 match all 4 blocks as indicated above.
1464
1465
1466 ### Planning
1467
1468 The number of neighborhoods is 129 and the number of census blocks is
1469 38794. Even though name is not the primary key of the
1470 `nyc_neighborhoods` table, the cardinality of the name field is the
1471 same, 129. So the optimizer can predict that the name filter will
1472 match 2 rows. This filter will be pushed below the join, so the input
1473 to the join from `nyc_neighborhoods` is 2 rows.
1474
1475 The optimizer will consider several different types of joins when
1476 joining `nyc_neighborhoods` and `nyc_census_blocks`. The two most
1477 likely options are (1) a hash join on the predicate `c.boroname =
1478 n.boroname`, and (2) a geospatial lookup join on the predicate
1479 `ST_Intersects(c.geom, n.geom)`, using the inverted index on
1480 `nyc_census_blocks`.
1481
1482 In order to decide between these two options, the optimizer can use
1483 statistics. The boronames have cardinality 5, so the number of rows
1484 returned by the hash join is 38794*2/5 = 15518. The expensive
1485 `ST_Intersects(c.geom, n.geom)` function must then be called on each
1486 of those rows.
1487
1488 The cost of the geospatial lookup join is more challenging to
1489 estimate, but it can be calculated by first estimating the number of
1490 CellIDs that must be scanned in `nyc_census_blocks_geo_idx` for each
1491 of the two matching rows in `nyc_neighborhoods`. Suppose that
1492 approximately 60 CellIDs must be scanned for a particular row, and the
1493 cardinality of the inverted index (i.e., number of CellIDs with
1494 entries) is 6000. Since each census block will exist in 4 index
1495 entries, the total number of values indexed is 38794*4 =
1496 155176. Therefore, the number of entries scanned is (60/6000)*155176*2
1497 = 3104. After de-duplicating the results, the number of output rows is
1498 3104/4 = 776. The expensive `ST_Intersects(c.geom, n.geom)` function
1499 must then be called on each of those rows to eliminate false
1500 positives, and the inexpensive predicate `c.boroname = n.boroname`
1501 must also be applied.
1502
1503 Based on the above cost analysis, the optimizer decides to use the
1504 geospatial index to do a geospatial-lookup join, where each of the
1505 selected rows from the `nyc_neighborhoods` table will be used to
1506 construct and compute a set union expression on the inverted
1507 geospatial index of `nyc_census_blocks`.
1508
1509 Here is the expected query plan:
1510
1511
1512 ```
1513 QUERY PLAN
1514 --------------------------------------------------------------------------------------
1515 project
1516 ├── group-by
1517 │ ├── inner-join (lookup nyc_census_blocks AS c)
1518 │ │ ├── lookup columns are key
1519 │ │ ├── inner-join (geospatial-lookup nyc_census_blocks@nyc_census_blocks_geo_idx)
1520 │ │ │ ├── select
1521 │ │ │ │ ├── scan nyc_neighborhoods AS n
1522 │ │ │ │ └── filters
1523 │ │ │ │ └── (name = 'Upper West Side') OR (name = 'Upper East Side')
1524 │ │ │ └── filters (true)
1525 │ │ └── filters
1526 │ │ ├── st_intersects(c.geom, n.geom)
1527 │ │ └── c.boroname = n.boroname
1528 │ └── aggregations
1529 │ └── sum
1530 │ └── popn_total
1531 └── projections
1532 └── sum / (st_area(n.geom) / 1e+06)
1533 ```
1534
1535
1536
1537 ### Execution
1538
1539 This query will execute in the manner outlined by the query plan:
1540
1541
1542
1543 * Since there is no index on nyc_neighborhoods.name, the
1544 nyc_neighborhoods table is scanned and the name filter is
1545 applied. For each of the two rows that pass the filter, the
1546 following steps will be performed.
1547
1548 * A covering of the geom field in the row in nyc_neighborhoods is
1549 computed. This covering is used by the intersects() function
1550 outlined in the previous [section](#filtering), to produce a set of
1551 cell-ids that must be retrieved from the index and the corresponding
1552 nyc_census_blocks.gid column, which is the primary key,
1553 unioned. This is the candidate set of census_blocks keys that
1554 satisfy ST_Intersects.
1555
1556 * The primary key of nyc_census_blocks (nyc_census_blocks.gid) is then
1557 used to perform a lookup join with the primary index of
1558 nyc_census_blocks, which is needed to retrieve the remaining
1559 columns. As part of this lookup join, the ST_Intersects predicate
1560 and the boroname equality predicate are computed and if both are
1561 true, the join condition is satisfied.
1562
1563 * Next, the sum of the total population is calculated, grouped by
1564 neighborhood.
1565
1566 * Finally, the sum is divided by the area of the neighborhood, in
1567 order to find the population per square meter. Area is calculated
1568 using the built-in scalar function ST_Area.
1569
1570
1571 ## Package Structure
1572
1573 We will put geospatial calculation data in a new `pkg/geo` package, to
1574 separate "SQL" from "Geospatial". We've decided to separate
1575 "functions" and their complexity into subpackages to simplify import
1576 complexity.
1577
1578 Our package structure will look as follows:
1579
1580 * pkg
1581 * geo (contains basic data types, as well as transformations from/to
1582 external data formats)
1583 * geopb (contains protobuf storage for shapes, as well as typedefs
1584 for SRIDs)
1585 * geogfn (contains geography functions, as well as the CGO
1586 wrappers for GeographicLib)
1587 * geomfn (contains geometry functions, as well as the CGO wrappers
1588 for GEOS / SFCGAL)
1589 * geosrid (contains SRID transformation functionality)
1590 * geoindex (contains indexing functionality)
1591 * geographiclib (contains spheroid functionality from
1592 GeographicLib)
1593 * geos (contains 2D geometry from the GEOS library)
1594 * geoproj (contains proj functionality from PROJ)
1595
1596 ## Telemetry
1597
1598 To gather insight, we should add telemetry on the following:
1599
1600 * SQL Usage
1601 * column type usage
1602 * usage statistics of various builtins
1603 * success rate of caching SRID
1604 * success rate of caching builtins (if implemented)
1605 * Indexing
1606 * spatial index usage
1607 * spatial index using out of bounds geometries
1608 * filtering % done by using indexes
1609
1610 ## Testing
1611
1612 In addition to writing our own unit and integration tests, there are
1613 several third-party tools we can look at utilizing to compare
1614 correctness. These can be adapted to and used in the CockroachDB
1615 framework, including:
1616
1617 * [OGC Conformance Tests](https://cite.opengeospatial.org/teamengine/)
1618 - tests to ensure OGC compatibility - adaptable into nightly
1619 roachtests
1620 * [monetdb-mcs](https://github.com/cswxu/monetdb-mcs/tree/544d87e9e67a926e88b17b171aa4251b20177be0/geom/)
1621 has a few of their own integration tests - adaptable into nightly
1622 roachtests
1623
1624 Due to internal restrictions, we will be unable to utilize
1625 the following directly from the codebase:
1626
1627 * [PostGIS regression
1628 tests](https://github.com/postgis/postgis/blob/master/regress )
1629
1630 * [PostGIS "garden"
1631 tests](https://trac.osgeo.org/postgis/wiki/DevWikiGardenTest) -
1632 tests for crashing / running into edge cases. We can utilize
1633 sqlsmith to write a similar version for ourselves.
1634
1635 We will also look at correctness testing our own index
1636 implementation. We can utilize the default GEOS and GeographicLib
1637 calculations to determine whether our index implementation will
1638 correctly fetch the same results as without using the index. To do
1639 this, we can generate random test suites with indexed columns and
1640 compare results with and without using the index. As an added bonus,
1641 we can see the performance improvement when using indexing as well
1642 with this approach.
1643
1644
1645 ## Performance Benchmarking
1646
1647 At the time of writing, we have struggled to find industry standard
1648 benchmarks for comparing performance of spatial indexes. As such, we
1649 are mainly left to our own devices.
1650
1651 We will be looking at writing our own benchmarking suite, which will
1652 benchmark on the following:
1653
1654 * Various core geospatial builtins
1655 * Indexed retrievals
1656 * We should test indexed retrievals against numbers of rows, as well
1657 as different sizes of geospatial shapes (e.g. lots of little
1658 and/or big shapes) as well as sparseness of geospatial shapes
1659 (e.g. lots of data in "New York" vs data sparsely populated around
1660 different areas).
1661 * We should also try to test under lock contention (to test the
1662 theory that R-Trees do not behave well under conditions involving
1663 heavy locking)
1664 * IMPORT/EXPORT functionality
1665
1666 We may decide to compare our implementation against other SQL-based
1667 implementations such as SQL Server and Postgres/PostGIS. MongoDB is a
1668 good stretch option as well.
1669
1670 ## Unresolved questions
1671
1672 None beyond what is already mentioned in earlier text.
1673
1674 # Updates
1675 * 2020-05-07:
1676 * added ST_ContainsProperly as an indexable function.