github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20160721_information_schema.md

github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20160721_information_schema.md (about)

     1  - Feature Name: Information Schema
     2  - Status: completed
     3  - Start Date: 2016-07-19
     4  - Authors: Nathan VanBenschoten
     5  - RFC PR: [#7965](https://github.com/cockroachdb/cockroach/pull/7965)
     6  - Cockroach Issue: [#5194](https://github.com/cockroachdb/cockroach/issues/5194),
     7                     [#5583](https://github.com/cockroachdb/cockroach/issues/5583)
     8  
     9  
    10  # Summary
    11  
    12  `information_schema` allows users to perform database introspection, providing
    13  information about all tables, views, columns, procedures, and access privileges
    14  in a database. This proposal details initial plans to support the `information_schema`
    15  specification in CockroachDB.
    16  
    17  # Motivation
    18  
    19  Up until this point, CockroachDB has only supported limited schema reflection through
    20  `SHOW` statements. This is restricting in that it is inconsistent with other SQL
    21  statements, it forces users to learn new syntax, and it is inextensible. Unlike `SHOW`
    22  statements, `information_schema` exposes a table-like interface which feels natural in
    23  relational database and addresses all of these concerns. For other advantages of
    24  querying `information_schema` over using `SHOW` statements, see MySQL's
    25  [documentation on the topic.](http://dev.mysql.com/doc/refman/5.7/en/information-schema.html#idm140648692050944)
    26  
    27  More importantly, in order to perform schema reflection, a large majority of ORMs
    28  and other database tools query metadata tables within the `information_schema`. Without
    29  proper support for these meta table, many tools simply cannot be used with CockroachDB.
    30  This is an issue, as support for these tools is critical for user accessibility
    31  of CockroachDB. Because of this widespread usage of `information_schema`, adding support
    32  for the standard will be a major step towards supporting a group of ORMs and other tools,
    33  and by extension, expanding the ecosystem around CockroachDB.
    34  
    35  # Detailed design
    36  
    37  ### Relation to VIEWs
    38  
    39  `information_schema` tables are commonly referred to as VIEWs. This is because they are
    40  not base tables, so there are no files associated with them. On top of this, they are read-only
    41  tables, so one can only read their contents, and cannot perform `INSERT`, `UPDATE`, or `DELETE`
    42  operations on them.
    43  
    44  However, while `information_schema` tables are conceptually similar to read-only views, they
    45  are handled differently than standard user-level VIEWs in most databases. For instance, in MySQL
    46  `information_schema` tables are **temporary tables that are created and filled on demand**.
    47  The reason for this is that a `VIEW` is a mapping from an ordinary base table to a virtual table.
    48  However, the data stored in the `information_schema` is not stored in any underlying table,
    49  so the "tables" need to supply data themselves. All this goes to say that even though future
    50  implementation efforts towards adding support for VIEWs to CockroachDB may touch
    51  `information_schema` code, the features can and should be developed independently from each other.
    52  
    53  ### General Implementation Concerns
    54  
    55  There are two general implementation concerns that need to be considered for the
    56  introduction of `information_schema` into CockroachDB. The first concern is that the
    57  `information_schema` database and the tables it contains should **behave** exactly
    58  like all other databases and tables, respectively. Regardless of how they are implemented,
    59  they should be indistinguishable from persisted relations. The second concern is that the
    60  tables need to **produce** the correct information when queried by performing a form
    61  of database introspection.
    62  
    63  ### Virtual Database and Table Descriptors
    64  
    65  To address the first implementation concern, this RFC proposes that we hardcode a set of
    66  static database and table descriptors for the new database and all new tables needed
    67  for the `information_schema`. This will be similar to the approach taken for `SystemDatabase`,
    68  except there is no need to actually store the descriptors in an underlying Cockroach KV store
    69  during cluster bootstrap, because they will not need to be modified. These descriptors
    70  will all be given **read-only privileges**, just like the majority of system tables.
    71  
    72  Interactions with these "virtual" database and table descriptors will be handled by
    73  `sql.planner`'s implementation of
    74  [`DatabaseAccessor`](https://github.com/cockroachdb/cockroach/blob/4d45696ea776a5b912262e5eef9889eacb6abe41/sql/database.go#L90)
    75  and
    76  [`SchemaAccessor`](https://github.com/cockroachdb/cockroach/blob/4d45696ea776a5b912262e5eef9889eacb6abe41/sql/table.go#L89),
    77  respectively. By handling these descriptors at this level, we can avoid having to add
    78  special-cases for `information_schema` constructs at abstraction levels above this.
    79  Desired behavior which this would naturally provide is:
    80  
    81  - Preventing the creation of databases called "information_schema"
    82  - Preventing the creation of tables in the "information_schema" (read-only database privileges)
    83  - Preventing the mutation of tables in the "information_schema" (read-only table privileges)
    84  
    85  ### Virtual Data Sources
    86  
    87  To address the second implementation concern, the RFC proposes that we catch queries to
    88  these virtual descriptors in
    89  [`planner.getDataSource`](https://github.com/cockroachdb/cockroach/blob/4d45696ea776a5b912262e5eef9889eacb6abe41/sql/data_source.go#L191),
    90  and return a `valuesNode` instead of a `scanNode` with desired information populated. Using
    91  a `valueNode` in this way draws direct parallels to our current implementation of `SHOW`
    92  statements. It allows us to mock out the table scan, and populate the provided values using
    93  arbitrary code.
    94  
    95  ### pg_catalog
    96  
    97  `pg_catalog` is a PostgreSQL extension database that predates the introduction of
    98  `information_schema` to the SQL standard. PostgreSQL maps all `information_schema` "tables" to
    99  `pg_catalog` through the use of VIEWS. While it would be ideal to only support
   100  `information_schema`, as it is part of the SQL standard and present in most databases (as opposed
   101  to just PostgreSQL), there does seem to be a demand for an implementation of `pg_catalog` as well.
   102  This demand comes from:
   103  
   104  - PostgreSQL-specific tools and ORMs, which do not make an effort to be cross platform compatible
   105  - PostgreSQL-specific drivers, which do not make an effort to be cross platform compatible
   106  - Legacy cross-platform tools and ORMs, which special-case the use of `pg_catalog` for
   107    any database exposing the PostgreSQL wire protocol
   108  
   109  The RFC proposes to map all `pg_catalog` tables to queries on the `information_schema`, in
   110  a similar way that PostgreSQL does (but in the opposite direction). Ideally this would be
   111  completed with VIEWs, but there are a few downsides to this. These include that the eventual
   112  implementation of VIEWs will require `ViewDescriptors` to be distributed, which is unnecessary
   113  for a static VIEW. Instead, it would be simpler to map `pg_catalog` queries directly to
   114  `information_schema` queries in code. However, this RFC also proposes that we push off a more
   115  detailed discussion on `pg_catalog` until the implementation of `information_schema` is complete.
   116  
   117  ### Standards Considerations
   118  
   119  It is advised that we take a similar approach to MySQL in its representation of
   120  `information_schema`. The `information_schema` table structure will follow the ANSI/ISO
   121  SQL:2003 standard Part 11 Schemata. The intent is approximate compliance with SQL:2003 core
   122  feature F021 Basic information schema.
   123  
   124  Additionally, MySQL provides a number of extensions to their implementation of the
   125  `information_schema`. An example of this is the inclusion of an `ENGINE` column in their
   126  `INFORMATION_SCHEMA.TABLES` table. The database also excludes columns that do not
   127  make sense to include given their SQL implementation. Similarly, we will omit columns that
   128  are not relevant for our implementation, and add CockroachDB-specific extension columns
   129  if any specific need for them arises.
   130  
   131  CockroachDB-specific information which makes sense to add to our `information_schema`
   132  implementation includes:
   133  - column families
   134  - interleaved tables
   135  - index stored columns
   136  
   137  ### Privileges
   138  
   139  All users will have access to the `information_schema` tables, but they will only be able
   140  to see rows which they have access to. The same privileges will apply to selecting information
   141  from `information_schema` and viewing the same information through `SHOW` statements. This will
   142  be a change from our current implementation of `SHOW` statements, where we generally return all
   143  instances of a given object, even those that a user does not have privileges for.
   144  
   145  This approach to access control for the `information_schema` is identical to MySQL's.
   146  
   147  ### Other Concerns
   148  
   149  #### SHOW Statements
   150  
   151  Our inclusion of `SHOW` statements was inspired by MySQL. In MySQL, `SHOW` statements
   152  were added as the primary means of schema reflection before `information_schema` was
   153  introduced to the SQL standard. Shortly after it was introduced, MySQL added support for the
   154  meta tables, and began mapping `SHOW` statements directly onto `SELECT` statements from
   155  these virtual tables.
   156  
   157  This RFC proposes that sometime after support for `information_schema` is added, we
   158  follow MySQL's lead and turn `SHOW` statements into a wrapper around `information_schema`
   159  queries. This will help eliminate duplicate implementations and simplify SHOW statement
   160  handling significantly.
   161  
   162  #### Schema Search Path Extension (only needed for `pg_catalog` support)
   163  
   164  PostgreSQL has a notion of a "Schema Search Path". In much the same way that we support the
   165  `SET database = xyz`, which will search for unqualified table names in the `xyz` database, the
   166  schema search path allows unqualified table names to be associated with a specific
   167  database/table pair. However, unlike our session database setting, PostgreSQL allows multiple
   168  databases to be in the search path. This is important because a large percent of the the time
   169  that `pg_catalog` tables are referenced in ORMs and other external tools, they are
   170  referenced using unqualified table names. For instance, a number of the ORMs use
   171  `pg_catalog` like:
   172  
   173  ```sql
   174  SELECT * from pg_class
   175  ```
   176  
   177  instead of
   178  
   179  ```sql
   180  SELECT * from pg_catalog.pg_class
   181  ```
   182  
   183  The reason this is allowed is because PostgreSQL includes `pg_catalog` in its
   184  database search path by default. In order to support this use case, we should extend our
   185  notion of a single database search path to a set of search paths. Like support for
   186  `pg_catalog`, this does not need to be completed during the initial implementation of
   187  this RFC.
   188  
   189  # Drawbacks
   190  
   191  - Complicates some SQL descriptor logic
   192  - Complicates database introspection on itself (ie. `SHOW TABLES FROM information_schema`)
   193  - Adds a few special cases to the query engine, but this is much less intrusive than
   194    if we tried to mock out the database and table existence at a higher level than
   195    descriptors
   196  
   197  # Alternatives
   198  
   199  - Create and maintain real tables just like we do with SystemDB tables
   200  - Mock out database and table existence at a higher level than [virtual
   201    descriptors](#virtual-database-and-table-descriptors)
   202  - Begin building out VIEW infrastructure, and treat `information_schema` like
   203    a read-only view. As noted in [Relation to VIEWs](#relation-to-views), this
   204    wouldn't actually change very much, and would still require a lot of
   205    this work. For instance, we would still need to somehow mock out the existence
   206    of the `information_schema` database itself, and would still need to have the
   207    VIEWs source their own introspective data.
   208  
   209  # Unresolved questions
   210  
   211  ### Information_Schema Across Different Versions of CockroachDB
   212  
   213  Because this proposal suggests the use of static virtual descriptors instead of real
   214  descriptors persisted to an underlying CockroachDB store, the descriptors used for a
   215  given `information_schema` query will depend on the version of the CockroachDB instance
   216  fielding a SQL request. In a mixed-version cluster, this could result in two nodes returning
   217  `information_schema` tables with different schemas if the static schema was altered between
   218  the nodes' versions. Still, both nodes will issue replicated transactions internally to
   219  populate these tables, so both will return "correct" information.
   220  
   221  Inconsistencies where different nodes return different results for a given query can occur
   222  today if two nodes are running different versions of our SQL query engine. The only difference
   223  is that this change will introduce the possibility of schema inconsistencies because
   224  `information_schema` descriptors may not be consistent between nodes. This shouldn't be a
   225  serious issue, but it is something to note.