     1  - Feature Name: certificate rotation
     2  - Status: completed
     3  - Start Date: 2017-03-18
     4  - Authors: @mberhault
     5  - RFC PR: [14254](https://github.com/cockroachdb/cockroach/pull/14254)
     6  - Cockroach Issue: [1674](https://github.com/cockroachdb/cockroach/issues/1674)
     7  
     8  # Summary
     9  
    10  This RFC proposes changes to certificate and key management.
    11  
    12  The main goals of this RFC are:
    13  * support addition and use of new certificates and keys without node restart
    14  * decouple CA from node certificate rotation
    15  * simplify generation and use of certificates and keys
    16  
    17  Out of scope for this RFC:
    18  * certificate revocation
    19  * certificate/key deployment (aka: getting the files onto the nodes/clients)
    20  * automatic certificate renewal
    21  * use of CSRs (certificate signing requests)
    22  
    23  # Motivation
    24  
    25  Anytime certificates and keys are used, we must allow for rotation,
    26  either due to security concerns or standard expiration.
    27  
    28  This RFC is concerned with rotation due to expiration, meaning we do not
    29  consider certificate revocation.
    30  
    31  Our current certificate use allows for a single CA certificate and a single
    32  server certificate/private key pair, with node restarts being required to
    33  update any of them.
    34  
    35  We wish to be able to push new certificates to nodes and use them
    36  without restart or connection termination.
    37  
    38  # Certificate expirations
    39  
    40  CA and node certificates need radically different lifetimes to allow new
    41  node certificates to be rolled out rapidly without waiting for CA propagation
    42  to all clients and nodes.
    43  
    44  We propose the following defaults:
    45  * 5 year expiration on CA certificates
    46  * 1 year expiration on node certificates
    47  
    48  This may not always be appropriate. See "unresolved questions".
    49  
    50  ## Expiration monitoring
    51  
    52  To provide enough warning of potential problems, each node should record
    53  and export the start/end validity timestamp for each type of certificate:
    54  * the CA certificate
    55  * the combined client/server certificate issued to user `node`
    56  
    57  If two certificates of the same type are present (eg: two CA certificates), report
    58  the timestamps for the certificate with the latest expiration date.
    59  
    60  With such metrics, we can now alert on expiring certificates, either with a fixed lifetime
    61  remaining or when a fraction of the lifetime has expired.
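
The two alerting rules above (fixed lifetime remaining, or fraction of lifetime elapsed) can be sketched as follows. This is only an illustration: the `validity` type, the function names, and the thresholds used in `main` are assumptions, not part of the RFC.

```go
package main

import (
	"fmt"
	"time"
)

// validity holds the two timestamps a node would export for one certificate type.
type validity struct {
	NotBefore, NotAfter time.Time
}

// latest returns the window with the latest expiration date, which is the
// one to report when several certificates of the same type are present.
func latest(vs []validity) (validity, bool) {
	if len(vs) == 0 {
		return validity{}, false
	}
	best := vs[0]
	for _, v := range vs[1:] {
		if v.NotAfter.After(best.NotAfter) {
			best = v
		}
	}
	return best, true
}

// expiringSoon reports whether an alert should fire at time now: either less
// than minRemaining is left, or more than maxFraction of the lifetime elapsed.
func expiringSoon(v validity, now time.Time, minRemaining time.Duration, maxFraction float64) bool {
	total := v.NotAfter.Sub(v.NotBefore)
	if total <= 0 {
		return true // malformed window: treat as expiring
	}
	remaining := v.NotAfter.Sub(now)
	elapsed := now.Sub(v.NotBefore)
	return remaining < minRemaining || float64(elapsed)/float64(total) > maxFraction
}

func main() {
	start := time.Date(2017, 3, 18, 0, 0, 0, 0, time.UTC)
	node := validity{NotBefore: start, NotAfter: start.AddDate(1, 0, 0)}
	// Hypothetical thresholds: 14 days remaining, or 90% of the lifetime used.
	fmt.Println(expiringSoon(node, start.AddDate(0, 11, 0), 14*24*time.Hour, 0.9))
}
```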
    62  
    63  It may be preferable to monitor certificate chains rather than individual certificates
    64  (see Known Issues below) and report validity dates for the latest valid chain.
    65  
    66  # Certificate and key files
    67  
    68  ## Storage location
    69  
    70  The location of certificates and keys can be specified using the `--certs-dir` command-line
    71  flag or the corresponding `COCKROACH_CERTS_DIR` environment variable.
    72  
    73  The flag value is a relative directory, with a default value of `~/.cockroach-certs`.
    74  
    75  We avoid using the `cockroach-data` directory for a few reasons:
    76  * the certs must exist before cockroach is run, making the directory less discoverable
    77  * wiping a node would wipe certs as well
    78  
    79  All files within the directory are examined, but sub-directories are not traversed.
    80  
    81  ## Naming scheme
    82  
    83  We propose the following naming scheme:
    84  `<prefix>[.<middle>].<extension>`
    85  
    86  `<prefix>` determines the role:
    87  * `ca`: certificate authority certificate/key.
    88  * `node`: node combined client/server certificate/key.
    89  * `client`: client certificate/key.
    90  
    91  `<middle>` is required for client certificates, where `<middle>` is the name
    92  of the client as specified in the certificate Common Name (eg: `client.marc.crt`).
    93  For other certificate types, this may be used to differentiate between multiple versions of a similar
    94  certificate/key. See "unresolved questions".
    95  
    96  `<extension>` determines the type of file:
    97  * `crt` files are certificates
    98  * `key` files are keys
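
A minimal sketch of how a filename could be split according to this scheme. The `certFile` type and `parseCertFilename` function are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// certFile describes one file found in the certs directory.
type certFile struct {
	Role  string // "ca", "node" or "client"
	Name  string // optional <middle> part; the client name for client files
	IsKey bool   // true for .key files, false for .crt files
}

// parseCertFilename splits a name of the form <prefix>[.<middle>].<extension>.
func parseCertFilename(filename string) (certFile, error) {
	parts := strings.Split(filename, ".")
	if len(parts) < 2 {
		return certFile{}, fmt.Errorf("%q: expected <prefix>[.<middle>].<extension>", filename)
	}
	prefix, ext := parts[0], parts[len(parts)-1]
	middle := strings.Join(parts[1:len(parts)-1], ".")

	var f certFile
	switch ext {
	case "crt":
		// certificate, IsKey stays false
	case "key":
		f.IsKey = true
	default:
		return certFile{}, fmt.Errorf("%q: unknown extension %q", filename, ext)
	}
	switch prefix {
	case "ca", "node":
		// <middle> is optional for these roles
	case "client":
		if middle == "" {
			return certFile{}, fmt.Errorf("%q: client files require a client name", filename)
		}
	default:
		return certFile{}, fmt.Errorf("%q: unknown prefix %q", filename, prefix)
	}
	f.Role, f.Name = prefix, middle
	return f, nil
}

func main() {
	for _, name := range []string{"ca.crt", "node.key", "client.marc.crt", "README.md"} {
		f, err := parseCertFilename(name)
		fmt.Println(name, f, err)
	}
}
```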
    99  
   100  ## Permissions and file types
   101  
   102  The only check is that the key file is accessible by the current user only (maximum permissions of `-rwx------`).
   103  
   104  We need to provide admins with a way to disable permissions checks, due both to incompatible
   105  certificate deployment methods, and incompatible filesystems/architectures. An environment variable
   106  `COCKROACH_SKIP_CERTS_PERMISSION_CHECKS` with a stern warning should be sufficient.
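
The permission check could look roughly like the sketch below. The function names are hypothetical, and treating any non-empty value of the environment variable as "skip" is an assumption; the RFC does not specify how the variable is interpreted.

```go
package main

import (
	"fmt"
	"os"
)

// skipEnv is the environment variable proposed in the RFC.
const skipEnv = "COCKROACH_SKIP_CERTS_PERMISSION_CHECKS"

// keyModeOK reports whether mode grants nothing to group or others,
// i.e. it is at most -rwx------ (0700).
func keyModeOK(mode os.FileMode) bool {
	return mode.Perm()&0o077 == 0
}

// checkKeyFile stats a key file and rejects it when group or others have
// any access, unless the check is disabled through the environment.
func checkKeyFile(path string) error {
	if os.Getenv(skipEnv) != "" { // assumption: any non-empty value disables the check
		fmt.Fprintf(os.Stderr, "WARNING: %s is set, not checking permissions on %s\n", skipEnv, path)
		return nil
	}
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if !keyModeOK(info.Mode()) {
		return fmt.Errorf("%s: permissions %s exceed -rwx------", path, info.Mode().Perm())
	}
	return nil
}

func main() {
	fmt.Println(keyModeOK(0o600), keyModeOK(0o700), keyModeOK(0o640))
}
```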
   107  
   108  # Certificate creation and renewal
   109  
   110  ## CA certificate
   111  
   112  Initial CA creation involves creating the certificate and private key.
   113  * `ca.crt`: the CA certificate, valid 5 years. Provided to all nodes and clients.
   114  * `ca.key`: the CA private key. Must be kept safe and **not** distributed to nodes and clients.
   115  
   116  CA renewal involves creating a new certificate using either the same, or a new private key.
   117  All valid CA certificates need to be kept as well as their corresponding keys:
   118  * append the new certificate to the existing `ca.crt` (optionally removing expired certificates along the way).
   119  * store the new key in a new file.
   120  
   121  The updated `ca.crt` file must be communicated to all nodes and clients.
   122  The `ca.key` must still be kept safe and **not** distributed to nodes and clients.
   123  
   124  When signing a node/client certificate, if multiple CA certificates are found inside `ca.crt`, the
   125  certificate matching the private key will be used. If multiple such certificates exist, the one with
   126  the latest expiration date will be used.
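
The selection rule above can be sketched by comparing public keys. This is an illustration under assumptions: `pickSigningCA` is a hypothetical name, `mustCA` and `demo` are scaffolding for the example only, and the `Equal` method on standard library public keys requires Go 1.15 or later.

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// comparableKey is implemented by the standard library public key types.
type comparableKey interface {
	Equal(x crypto.PublicKey) bool
}

// pickSigningCA returns the certificate whose public key matches key.
// If several match, the one with the latest expiration date wins.
func pickSigningCA(cas []*x509.Certificate, key crypto.Signer) (*x509.Certificate, error) {
	pub, ok := key.Public().(comparableKey)
	if !ok {
		return nil, errors.New("unsupported key type")
	}
	var best *x509.Certificate
	for _, c := range cas {
		if !pub.Equal(c.PublicKey) {
			continue
		}
		if best == nil || c.NotAfter.After(best.NotAfter) {
			best = c
		}
	}
	if best == nil {
		return nil, errors.New("no CA certificate matches the private key")
	}
	return best, nil
}

// mustCA is demo scaffolding: it self-signs a CA certificate for key.
func mustCA(key *ecdsa.PrivateKey, serial int64, notAfter time.Time) *x509.Certificate {
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: "Example CA"},
		NotBefore:             notAfter.AddDate(-10, 0, 0),
		NotAfter:              notAfter,
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

// demo builds two certificates for keyA and one for keyB, then returns the
// serial number of the certificate picked for keyA.
func demo() (int64, error) {
	keyA, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return 0, err
	}
	keyB, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return 0, err
	}
	now := time.Now()
	cas := []*x509.Certificate{
		mustCA(keyA, 1, now.AddDate(1, 0, 0)),
		mustCA(keyA, 2, now.AddDate(5, 0, 0)), // same key, later expiration
		mustCA(keyB, 3, now.AddDate(9, 0, 0)), // different key, must be ignored
	}
	ca, err := pickSigningCA(cas, keyA)
	if err != nil {
		return 0, err
	}
	return ca.SerialNumber.Int64(), nil
}

func main() {
	fmt.Println(demo())
}
```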
   127  
   128  ## Node/client certificate
   129  
   130  The trusted machine holding the CA private key generates node certificates and private keys, then
   131  pushes them to the appropriate nodes.
   132  Keys are not kept by the trusted machine once deployed on the nodes.
   133  
   134  Generated node/client certificates have a shorter default lifetime than CA certificates (see the "Certificate expirations" section). Furthermore, their expiration date cannot exceed the CA certificate expiration.
   135  
   136  Upon renewal, certificates and keys are fully re-generated, with no attempt to re-use the private node/client key.
   137  Filenames for node/client certificates and keys can remain the same as before, or be new files.
   138  
   139  # Reloading certificate/key files
   140  
   141  ## Triggering a reload
   142  
   143  Running nodes can be told to re-read the certificate directory by issuing a `SIGHUP` to the process.
   144  
   145  Since we cannot control when nodes may be restarted, it is important to keep the reload process
   146  identical to the initial load.
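
One way to keep the initial load and the reload identical is to route both through the same function, as in this sketch. `reloadOnSIGHUP` and `demo` are hypothetical names, and `demo` signals the current process with `syscall.Kill`, so the example is Unix-only.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// reloadOnSIGHUP runs load once immediately, then again on every SIGHUP,
// so the initial load and later reloads share the same code path.
func reloadOnSIGHUP(load func() error) (stop func(), err error) {
	if err := load(); err != nil {
		return nil, err
	}
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP)
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-sigs:
				if err := load(); err != nil {
					log.Printf("certificate reload failed: %v", err)
				}
			case <-done:
				return
			}
		}
	}()
	return func() { signal.Stop(sigs); close(done) }, nil
}

// demo sends SIGHUP to the current process and returns how many loads ran.
func demo() int {
	loads := make(chan struct{}, 8)
	stop, err := reloadOnSIGHUP(func() error { loads <- struct{}{}; return nil })
	if err != nil {
		return 0
	}
	defer stop()
	if err := syscall.Kill(os.Getpid(), syscall.SIGHUP); err != nil {
		return 0
	}
	count := 0
	for count < 2 {
		select {
		case <-loads:
			count++
		case <-time.After(5 * time.Second):
			return count
		}
	}
	return count
}

func main() {
	fmt.Println("loads:", demo())
}
```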
   147  
   148  ## Validating certificates
   149  
   150  Node certificates must be checked for validity. Specifically:
   151  * we must have a valid certificate/private key pair
   152  * the certificate must currently be valid (`Not Before < time.Now() < Not After`)
   153  * the certificate must be signed by one of the CA certificates on this node
   154  
   155  The last condition is an attempt to avoid loading a certificate that may not be verifiable
   156  by other nodes or clients. If we do not have the right CA, chances are someone else does not either.
   157  
   158  We may need to set a timer for certificates that have not reached their `Not Before` date, otherwise
   159  we would need to trigger a second refresh.
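
The three checks above map onto standard library calls roughly as follows. This is a sketch, not the implementation: `validateNodeCert` is a hypothetical name, and `issue` and `demo` exist only to generate throwaway certificates for the example.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// validateNodeCert applies the three checks listed above.
func validateNodeCert(certPEM, keyPEM []byte, cas *x509.CertPool, now time.Time) error {
	pair, err := tls.X509KeyPair(certPEM, keyPEM) // certificate and key must match
	if err != nil {
		return err
	}
	leaf, err := x509.ParseCertificate(pair.Certificate[0])
	if err != nil {
		return err
	}
	if now.Before(leaf.NotBefore) || now.After(leaf.NotAfter) {
		return fmt.Errorf("certificate not valid at %s", now)
	}
	// The certificate must chain to one of the CA certificates on this node.
	_, err = leaf.Verify(x509.VerifyOptions{
		Roots:       cas,
		CurrentTime: now,
		KeyUsages:   []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err
}

// issue is demo scaffolding: it creates a key and a certificate signed by
// parent/parentKey, or self-signed when parent is nil.
func issue(cn string, isCA bool, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	usage := x509.KeyUsageDigitalSignature
	if isCA {
		usage |= x509.KeyUsageCertSign
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  isCA,
		BasicConstraintsValid: true,
		KeyUsage:              usage,
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert, key
}

// demo validates one node certificate against its own CA and an unrelated CA.
func demo() (trusted, untrusted error) {
	ca, caKey := issue("Example CA", true, nil, nil)
	otherCA, _ := issue("Other CA", true, nil, nil)
	node, nodeKey := issue("node", false, ca, caKey)
	keyDER, err := x509.MarshalECPrivateKey(nodeKey)
	if err != nil {
		panic(err)
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: node.Raw})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	good, bad := x509.NewCertPool(), x509.NewCertPool()
	good.AddCert(ca)
	bad.AddCert(otherCA)
	now := time.Now()
	return validateNodeCert(certPEM, keyPEM, good, now), validateNodeCert(certPEM, keyPEM, bad, now)
}

func main() {
	fmt.Println(demo())
}
```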
   160  
   161  # Online certificate rotation
   162  
   163  A good description of online key rotation in Go can be found in
   164  [Hitless TLS Certificate Rotation in Go](https://diogomonica.com/2017/01/11/hitless-tls-certificate-rotation-in-go/)
   165  
   166  Adding or swapping certificates can be done in multiple ways:
   167  1. construct a new `tls.Config` object
   168  1. modify individual fields
   169  1. implement callbacks corresponding to individual fields
   170  
   171  The `tls.Config` object is specified at connection time and cannot be modified afterwards, as it
   172  is not safe for concurrent use.
   173  
   174  A node needs to maintain two `tls.Config` objects, one for server-side connections, one for client-side connections. A new config can be constructed upon reload, then reused for all subsequent connections.
   175  
   176  The server-side `tls.Config` object can be specified for each client connection by implementing
   177  the `tls.Config.GetConfigForClient` callback. This should return the most recent `tls.Config` object.
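
A minimal sketch of the `GetConfigForClient` approach, assuming an atomically swapped pointer to the latest config. The `configHolder` type is hypothetical; `ServerName` is used in `demo` purely as a label to tell the two configs apart.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"sync/atomic"
)

// configHolder stores the most recently loaded server-side tls.Config.
type configHolder struct {
	current atomic.Value // always holds a *tls.Config
}

// Set publishes cfg for all subsequent handshakes.
// cfg must not be modified after this call.
func (h *configHolder) Set(cfg *tls.Config) { h.current.Store(cfg) }

// ServerConfig returns the long-lived config to hand to the listener.
// Every new client connection is served with the latest published config.
func (h *configHolder) ServerConfig() *tls.Config {
	return &tls.Config{
		GetConfigForClient: func(*tls.ClientHelloInfo) (*tls.Config, error) {
			cfg, _ := h.current.Load().(*tls.Config)
			if cfg == nil {
				return nil, errors.New("no TLS config loaded yet")
			}
			return cfg, nil
		},
	}
}

// demo swaps the config between two simulated handshakes.
func demo() (first, second string, err error) {
	var h configHolder
	outer := h.ServerConfig()

	h.Set(&tls.Config{ServerName: "old"}) // ServerName is only a label here
	c1, err := outer.GetConfigForClient(nil)
	if err != nil {
		return "", "", err
	}
	h.Set(&tls.Config{ServerName: "new"}) // simulated reload
	c2, err := outer.GetConfigForClient(nil)
	if err != nil {
		return "", "", err
	}
	return c1.ServerName, c2.ServerName, nil
}

func main() {
	fmt.Println(demo())
}
```

Existing connections keep the config they were established with; only new handshakes observe the swap.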
   178  
   179  ## Adding a new CA certificate
   180  
   181  Root CAs for server and client certificate verification are in `tls.Config.RootCAs` and `tls.Config.ClientCAs` respectively. We should add all detected CA certificates to both pools.
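
Populating both pools from a multi-certificate `ca.crt` might look like this sketch. `withCAs` is a hypothetical helper; it clones the base config so a config already in use is never mutated.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// withCAs returns a copy of base that trusts every certificate found in
// caPEM for both server verification (RootCAs) and client verification
// (ClientCAs). base may be nil.
func withCAs(base *tls.Config, caPEM []byte) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no usable CA certificate found")
	}
	cfg := &tls.Config{}
	if base != nil {
		cfg = base.Clone()
	}
	cfg.RootCAs = pool
	cfg.ClientCAs = pool
	return cfg, nil
}

func main() {
	_, err := withCAs(nil, []byte("not a PEM bundle"))
	fmt.Println(err)
}
```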
   182  
   183  ## Rotating node/client certificate
   184  
   185  The node and client certificates are set in `tls.Config.Certificates`. 
   186  If more than one node certificate is present, the one matching the requested `ServerName` is presented.
   187  We should set only one certificate in `tls.Config.Certificates`.
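
As an alternative to filling `tls.Config.Certificates`, the callback option listed earlier can serve a single, swappable certificate. The sketch below assumes a hypothetical `certHolder` type; the one-byte certificates in `demo` are placeholders used only to tell the two values apart.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"sync/atomic"
)

// certHolder stores the single active node certificate.
type certHolder struct {
	current atomic.Value // always holds a *tls.Certificate
}

// Set publishes cert for all subsequent handshakes.
func (h *certHolder) Set(cert *tls.Certificate) { h.current.Store(cert) }

func (h *certHolder) get() (*tls.Certificate, error) {
	cert, _ := h.current.Load().(*tls.Certificate)
	if cert == nil {
		return nil, errors.New("no certificate loaded yet")
	}
	return cert, nil
}

// Install wires the holder into cfg for both roles of a node: server side
// (GetCertificate) and client side (GetClientCertificate).
func (h *certHolder) Install(cfg *tls.Config) {
	cfg.GetCertificate = func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		return h.get()
	}
	cfg.GetClientCertificate = func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
		return h.get()
	}
}

// demo rotates the certificate between two simulated handshakes.
func demo() (server, client byte, err error) {
	var h certHolder
	cfg := &tls.Config{}
	h.Install(cfg)

	h.Set(&tls.Certificate{Certificate: [][]byte{{1}}}) // placeholder bytes
	c1, err := cfg.GetCertificate(nil)
	if err != nil {
		return 0, 0, err
	}
	h.Set(&tls.Certificate{Certificate: [][]byte{{2}}}) // simulated rotation
	c2, err := cfg.GetClientCertificate(nil)
	if err != nil {
		return 0, 0, err
	}
	return c1.Certificate[0][0], c2.Certificate[0][0], nil
}

func main() {
	fmt.Println(demo())
}
```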
   188  
   189  # Additional interfaces
   190  
   191  ## Command line
   192  
   193  We will need to modify all commands that use certs to make use of the new directory structure.
   194  
   195  We will also need:
   196  * modification to `cert create-ca` to use an existing key.
   197  * `cert list` to list all CA certs and node/client cert/key pairs.
   198  
   199  ## Admin UI / debug pages
   200  
   201  We want at least barebones listing of all certs on a given node, including validity
   202  dates, certificate chain (corresponding CA for a node cert), and valid hosts.
   203  
   204  Soon-to-expire certificates (or chains) must be reported prominently on the admin UI and
   205  available through external metrics.
   206  
   207  # Future work
   208  
   209  ## Packaging certificates for other languages/libraries
   210  
   211  Separate `.crt` and `.key` files are expected by libpq, but other libraries/languages may have different ways of specifying/packaging them.
   212  
   213  We need to:
   214  * augment our per-language examples to include secure mode. See [docs/631](https://github.com/cockroachdb/docs/issues/631)
   215  * document how to use public tools (eg: openssl) to convert certificates.
   216  * provide multiple cert/key output modes for the `cockroach cert` commands.
   217  
   218  ## Alternate reload methods
   219  
   220  Some additional methods to trigger a reload can later be introduced:
   221  * a timer based on certificate expiration
   222  * regular timer
   223  * admin UI endpoint
   224  
   225  ## Client certificate monitoring
   226  
   227  We have no way of knowing which certificate authority a client has, so we cannot monitor for
   228  clients not yet aware of a new CA certificate.
   229  
   230  We could examine client certificates and report soon-to-expire ones. This will not help
   231  with CA knowledge, but would provide better visibility into user authentication issues.
   232  
   233  # Unresolved questions
   234  
   235  ## `tls.Config` fields
   236  
   237  We need to verify that the proposed CA and node cert rotation mechanisms work, especially through
   238  grpc.
   239  
   240  Since everything uses `tls.Config`, implementing `tls.Config.GetConfigForClient` to rotate
   241  the config on the server should be sufficient.
   242  
   243  However, we need to ensure that all client-side connections are able to use the new config when
   244  initiating a connection.
   245  
   246  ## Renegotiation and certificate rotation
   247  
   248  Renegotiation may cause new certificates to be presented. We need to make sure this will not
   249  cause issues.
   250  
   251  The `tls.Config` comments also mention this happening in TLS 1.3:
   252  ```
   253  GetClientCertificate may be called multiple times for the same
   254  connection if renegotiation occurs or if TLS 1.3 is in use.
   255  ```
   256  
   257  ## Multi-certificate PEM files and postgres clients
   258  
   259  The Go `lib/pq` will use all certificates found in `ca.crt`, but this may not be the
   260  case for other libraries.
   261  It may be safer to keep a single CA certificate per file.
   262  
   263  ## Allow use of multiple certs directories
   264  
   265  Instead of the `--certs-dir` flag being a single directory, we could allow specification of
   266  multiple directories. This would be useful to separate CA certificates from other certs.
   267  
   268  ## File permissions
   269  
   270  Is checking for `-rwx------` on keys sufficient? A more stringent check would be similar
   271  to what the ssh client does (strict directory/file permissions).
   272  
   273  ## Multiple versions of the same certificate
   274  
   275  Should we allow multiple versions of the same certificate? eg: multiple files matching `ca.*.crt` or `node.*.crt`? If so, how do we handle them?
   276  
   277  ## Certificate lifetime
   278  
   279  We need to pick some defaults for certificate lifetimes.
   280  
   281  The proposed ones may be inappropriate for most users: security-minded admins or those with
   282  a good certificate-rotation process in place may want much shorter periods while casual
   283  users may want to never be bothered by certificates.
   284  
   285  # Drawbacks
   286  
   287  ## Monitoring certificates
   288  
   289  Simply because both a CA certificate and a node certificate are valid does not mean they
   290  are both correct. Consider the following scenario:
   291  * we receive an alert for CA certificate expiring soon
   292  * we lost the CA private key so generate a new cert/key pair
   293  * we push the new CA certificate to all nodes
   294  * alerts no longer fire
   295  
   296  In this case, as long as the old CA is still valid, we can verify other node certificates.
   297  However, as soon as the old CA expires, we will be unable to verify node certificates due
   298  to the key change.
   299  
   300  We could improve monitoring by analyzing the lifetime data for each certificate chain
   301  on each node. This would notice that the old chain expires when the old CA expires, and the
   302  new chain contains only a CA certificate, no node certificate.
   303  
   304  If we record/export chain information by CA cert serial number (or public key), we can ensure
   305  that all nodes have certificate chains rooted at the same CA.
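
Identifying a chain by its CA public key could be sketched as below. Hashing the certificate's `RawSubjectPublicKeyInfo` with SHA-256 is an assumption made for the example, as are the function names; the RFC only says "serial number (or public key)".

```go
package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/hex"
	"fmt"
)

// spkiID returns a stable identifier for a DER-encoded SubjectPublicKeyInfo.
// Two CA certificates sharing a key pair get the same identifier.
func spkiID(spki []byte) string {
	sum := sha256.Sum256(spki)
	return hex.EncodeToString(sum[:])
}

// caID identifies the chain rooted at ca by its public key.
func caID(ca *x509.Certificate) string {
	return spkiID(ca.RawSubjectPublicKeyInfo)
}

// sameRoot reports whether every node exported the same CA identifier for
// its node certificate chain. An empty map counts as consistent.
func sameRoot(caByNode map[string]string) bool {
	first, seen := "", false
	for _, id := range caByNode {
		if !seen {
			first, seen = id, true
			continue
		}
		if id != first {
			return false
		}
	}
	return true
}

func main() {
	oldCA, newCA := spkiID([]byte("old key")), spkiID([]byte("new key"))
	fmt.Println(sameRoot(map[string]string{"n1": oldCA, "n2": oldCA}))
	fmt.Println(sameRoot(map[string]string{"n1": oldCA, "n2": newCA}))
}
```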
   306  
   307  ## Multiple viable CA certs when generating node/client certs
   308  
   309  When generating new node or client certs, we may automatically detect multiple
   310  valid CA certs in the certificate directory.
   311  If two certs have the same key pair, we can pick the one expiring the latest.
   312  If the keys differ, we want to throw an error and either ask the user to remove one, or
   313  add an additional flag to force the cert (this partially defeats the point of automatic
   314  file detection).
   315  
   316  ## Delegated users
   317  
   318  Dropping specific cert flags means that the postgres URL will be built automatically from the
   319  requested username.
   320  
   321  For example, running the following command: `cockroach sql -u foo --certs-dir=~/.cockroach-certs` will
   322  generate the URL: `postgresql://foo@localhost:26257/?sslcert=client.foo.crt&sslkey=client.foo.key&...`
   323  
   324  If delegation is allowed (user `root` can act as user `foo`), the command must be run with the
   325  fully-specified postgres URL `postgresql://foo@localhost:26257/?sslcert=client.root.crt&sslkey=client.root.key&...`
   326  
   327  Delegation remains doable, but with an extra hoop to jump through.
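
A sketch of how such a URL could be assembled, following the `client.<name>.crt` naming scheme proposed earlier. `clientURL` is a hypothetical helper, and the parameters elided in the examples above are deliberately left out.

```go
package main

import (
	"fmt"
	"net/url"
)

// clientURL builds a postgres URL for sqlUser, authenticating with the
// certificate and key of certUser. The two differ only under delegation.
// Only sslcert and sslkey are set; other parameters are omitted here.
func clientURL(sqlUser, certUser, host string) string {
	q := url.Values{}
	q.Set("sslcert", "client."+certUser+".crt")
	q.Set("sslkey", "client."+certUser+".key")
	u := url.URL{
		Scheme:   "postgresql",
		User:     url.User(sqlUser),
		Host:     host,
		Path:     "/",
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(clientURL("foo", "foo", "localhost:26257"))  // automatic case
	fmt.Println(clientURL("foo", "root", "localhost:26257")) // delegation
}
```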
   328  
   329  ## Determination of secure mode
   330  
   331  With default values for certificate and key locations, secure mode is now less explicit, relying
   332  on the detection of certificates in the default directory. This could be misleading to users.
   333  
   334  # Alternative solutions
   335  
   336  ## Certificate/key file discovery
   337  
   338  We have three major options to specify certificate and key files:
   339  
   340  ### Full file specification
   341  
   342  This is the current method, all files are specified by their own flags.
   343  
   344  Drawbacks:
   345  * tedious. this can be alleviated by having default file names
   346  * does not support multiple files. If renewal is done using new file pairs, these flags would need to be changed.
   347  * using multiple CA certificates inside a single `ca.crt` file requires finding the certificate matching the key. This can be partially avoided by always putting the newest certificate first.
   348  
   349  Advantages:
   350  * simple code: we use the files as specified, relying on the standard library for mismatches.
   351  * allows separate storage locations for certs (eg: CA in `/etc/...`, node certs in user directory).
   352  
   353  ### Globs
   354  
   355  Command-line flags for filename globs (either per pair, or per file). Optionally, allow specification
   356  of a certs directory, with globs matching files within the directory.
   357  
   358  Drawbacks:
   359  * how do we deal with multiple matching certificates? (eg: `node.old.crt` and `node.new.crt`)?
   360  * if a glob matches multiple types of certs (eg: `*.crt` glob matches CA/node/client certs), do we just fail?
   361  * same problem with multi-cert `ca.crt` files.
   362  * if using a shared directory, does not allow separate storage locations.
   363  
   364  Advantages:
   365  * reasonably easy to code, especially if requiring single matches.
   366  * if specifying per file-type globs, can handle separate storage locations.
   367  
   368  ### Automatically determined files
   369  
   370  Automatically determine file types (key vs cert) and cert usage (CA vs node vs client) by analyzing
   371  all files in the certs directory.
   372  Certificate usage can be determined by looking at `IsCA` or `ExtKeyUsage`. The keys can be matched
   373  to certificates by comparing algorithms and public keys.
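
A rough sketch of the classification step using the `IsCA` and `ExtKeyUsage` fields of `x509.Certificate`. The mapping from key usages to roles is an assumption for illustration: a certificate valid for both server and client authentication is treated as a node certificate, and one valid for client authentication only as a client certificate.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// certRole guesses how a certificate is meant to be used.
func certRole(c *x509.Certificate) string {
	if c.IsCA {
		return "ca"
	}
	var server, client bool
	for _, u := range c.ExtKeyUsage {
		switch u {
		case x509.ExtKeyUsageServerAuth:
			server = true
		case x509.ExtKeyUsageClientAuth:
			client = true
		}
	}
	switch {
	case server && client:
		return "node" // combined client/server certificate
	case client:
		return "client"
	default:
		return "unknown"
	}
}

// demo classifies three hand-built certificates (no real key material).
func demo() []string {
	certs := []*x509.Certificate{
		{IsCA: true},
		{ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth}},
		{ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth}},
		{},
	}
	roles := make([]string, 0, len(certs))
	for _, c := range certs {
		roles = append(roles, certRole(c))
	}
	return roles
}

func main() {
	fmt.Println(demo())
}
```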
   374  
   375  Drawbacks:
   376  * does not allow for separate storage location. We could allow specification of multiple certs directories.
   377  * complex code: parses/analyses all certificates and keys. Need to make sure we mimic the standard library behavior to avoid surprises. Also need to evolve with the standard library (eg: key type support).
   378  * unusual specification of certs/keys. Everyone else specifies files directly.
   379  * "too much magic"
   380  
   381  Advantages:
   382  * full validation can provide user-friendly error messages on improper files (still obscure without decent knowledge of certificates).
   383  * support for multiple ways of generating/deploying certificates and keys.