- Feature Name: SQL Backup
- Status: completed
- Start Date: 2016-04-14
- Authors: Matt Jibson
- RFC PR: [#6058](https://github.com/cockroachdb/cockroach/pull/6058)
- Cockroach Issue:

# Summary

In order to support backing up SQL data, we need a program that can recreate
structure and data. Add a `dump` subcommand to the CLI that produces a `.sql`
file suitable for being read by the `cockroach sql` shell.

# Motivation

We need a way to back up and restore SQL data for various reasons (simply to
have backups, to perform some kinds of version migrations or upgrades). Doing
this at any level below the SQL level (i.e., copying files, KV pairs, raft
logs (is this even a thing?)) may be faster in some cases, but will also
prevent some upgrades from being performed at all. For example, the recent
decimal sort encoding bug could only be fixed by re-encoding the decimal
values, which requires going through the SQL layer
([#5994](https://github.com/cockroachdb/cockroach/pull/5994)).

# Detailed design

The proposal is to add a `dump` subcommand to the CLI that can dump a
specified table, database, or entire cluster. This command will connect to
the SQL endpoint and produce structure and data SQL statements.

## Structure Statements

Structure statements (`CREATE TABLE`) will be produced by a new `SHOW CREATE
TABLE` statement. MySQL already supports a `SHOW CREATE TABLE` statement.
Postgres instead uses `pg_dump --schema-only`, an executable that queries the
`pg_*` catalog tables and constructs `CREATE` statements from them. We have
to do this work somewhere, so we might as well expose it as a new statement
that other users and programs can use.

## Data Statements

Data statements (`INSERT`) will be produced using a `SELECT *`. Since a
table could be very large, we will page over the data using the last fetched
primary key as the starting point:

```
SELECT * FROM t ORDER BY t.PK LIMIT 1000
-- fetches keys 1 - 1000
SELECT * FROM t WHERE t.PK > 1000 ORDER BY t.PK LIMIT 1000
-- fetches keys 1001 - 2000
…
```

The results of these SELECT statements will be formatted into INSERT
statements. Each INSERT statement will contain many rows to increase
performance, defaulting to 100 but configurable by the user.

## Transactions

Since the data must not change during the multiple SELECTs, all of this work
(structure + data) must be done in the same transaction, or at least at the
same timestamp. Issuing a `BEGIN TRANSACTION ISOLATION LEVEL SNAPSHOT` will
do this for a specific connection.

Since large backups take more time, they have a higher chance of losing
their connection for whatever reason, which ends the transaction. We
therefore need a way to resume a dump from the last fetched key at the
timestamp at which the dump started. This could be achieved using a time
travel query; [#5963](https://github.com/cockroachdb/cockroach/issues/5963)
discusses many possible solutions.
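Putting the pieces above together, the following is a minimal sketch of the
statements a dump of a hypothetical `bank.accounts` table might issue; the
table, column names, and page size are illustrative, not prescriptive.

```sql
-- Sketch only: one possible dump session for a hypothetical bank.accounts
-- table. Names and the page size are illustrative.
BEGIN TRANSACTION ISOLATION LEVEL SNAPSHOT;

-- Structure: the output of the proposed statement is copied into the dump.
SHOW CREATE TABLE bank.accounts;

-- Data: page over the primary key, using the last fetched key as the cursor.
SELECT * FROM bank.accounts ORDER BY id LIMIT 1000;
SELECT * FROM bank.accounts WHERE id > 1000 ORDER BY id LIMIT 1000;
-- ...repeat until a page returns fewer than 1000 rows...

COMMIT;
```

Each page of results would then be rewritten into multi-row INSERT statements
(100 rows per statement by default) in the emitted `.sql` file, for example
`INSERT INTO accounts (id, balance) VALUES (1, 100.50), (2, 0.00), ...;`.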
# Alternatives

Instead of connecting to the SQL endpoint, we could connect directly to the
KV layer and perform all the same work. This would require more work since
we'd be re-implementing a lot of what SQL does for us, but it *may* increase
performance because we can skip some things SQL does that we don't care
about. However, this trades extra work for only possible performance gains,
so it is likely a bad idea to start with, and it may not yield any actual
gains even if it were done.

Both Postgres and MySQL have machinery to have the server write a file from
a SELECT: `SELECT * FROM t INTO OUTFILE`, `COPY t TO 'filename'`. These seem
inappropriate for Cockroach, since the performance benefit of these
statements is that the data does not have to traverse the network when it is
dumped. In Cockroach's case, however, the data is already spread across
various nodes over the network, so the benefit of writing to a file on a
specific server is greatly reduced. These seem like optimizations for a
future version that can determine which ranges live on which nodes and
request that those nodes write a SQL dump file to their own disks, which
operators then aggregate and archive elsewhere.

# Unresolved questions

User accounts and privilege grants. Do we want them always, never, or
sometimes? Our current GRANT system can grant privileges to users that don't
exist, so it may make sense to always create GRANT statements for any level
of dump (table, database, cluster).
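For illustration only, if grants were always included, a table-level dump
might contain output like the following; the table and user names are
hypothetical.

```sql
-- Hypothetical dump output that always includes grants. Because GRANT can
-- name users that don't yet exist, maxroach need not be created beforehand.
CREATE TABLE accounts (
    id INT PRIMARY KEY,
    balance DECIMAL
);
GRANT SELECT, INSERT ON TABLE accounts TO maxroach;
INSERT INTO accounts (id, balance) VALUES (1, 100.50), (2, 0.00);
```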