- Feature Name: certificate rotation
- Status: completed
- Start Date: 2017-03-18
- Authors: @mberhault
- RFC PR: [14254](https://github.com/cockroachdb/cockroach/pull/14254)
- Cockroach Issue: [1674](https://github.com/cockroachdb/cockroach/issues/1674)

# Summary

This RFC proposes changes to certificate and key management.

The main goals of this RFC are:
* support addition and use of new certificates and keys without node restart
* decouple CA from node certificate rotation
* simplify generation and use of certificates and keys

Out of scope for this RFC:
* certificate revocation
* certificate/key deployment (aka: getting the files onto the nodes/clients)
* automatic certificate renewal
* use of CSRs (certificate signing requests)

# Motivation

Anytime certificates and keys are used, we must allow for rotation,
either due to security concerns or standard expiration.

This RFC is concerned with rotation due to expiration, meaning we do not
consider certificate revocation.

Our current certificate use allows for a single CA certificate and a single
server certificate/private key pair, with node restarts being required to
update any of them.

We wish to be able to push new certificates to nodes and use them
without restart or connection termination.

# Certificate expirations

CA and node certificates need radically different lifetimes to allow new
node certificates to be rolled out rapidly without waiting for CA propagation
to all clients and nodes.

We propose the following defaults:
* 5 year expiration on CA certificates
* 1 year expiration on node certificates

This may not always be appropriate. See "unresolved questions".

## Expiration monitoring

To provide enough warning of potential problems, each node should record
and export the start/end validity timestamp for each type of certificate:
* the CA certificate
* the combined client/server certificate issued to user `node`

If two certificates of the same type are present (eg: two CA certificates), report
the timestamps for the certificate with the latest expiration date.

With such metrics, we can now alert on expiring certificates, either with a fixed lifetime
remaining or when a fraction of the lifetime has expired.

It may be preferable to monitor certificate chains rather than individual certificates
(see "Monitoring certificates" under Drawbacks below) and report validity dates for the latest valid chain.

# Certificate and key files

## Storage location

The location of certificates and keys can be specified using the `--certs-dir` command-line
flag or the corresponding `COCKROACH_CERTS_DIR` environment variable.

The flag value is a relative directory, with a default value of `~/.cockroach-certs`.

We avoid using the `cockroach-data` directory for a few reasons:
* the certs must exist before cockroach is run, making the directory less discoverable
* wiping a node would wipe certs as well

All files within the directory are examined, but sub-directories are not traversed.

## Naming scheme

We propose the following naming scheme:
`<prefix>[.<middle>].<extension>`

`<prefix>` determines the role:
* `ca`: certificate authority certificate/key.
* `node`: node combined client/server certificate/key.
* `client`: client certificate/key.

`<middle>` is required for client certificates, where `<middle>` is the name
of the client as specified in the certificate Common Name (eg: `client.marc.crt`).
For other certificate types, this may be used to differentiate between multiple versions of a similar
certificate/key. See "unresolved questions".

`<extension>` determines the type of file:
* `crt` files are certificates
* `key` files are keys

## Permissions and file types

The only check is for the key to be read-write by the current user only (maximum permissions of `-rwx------`).

We need to provide admins with a way to disable permissions checks, due both to incompatible
certificate deployment methods, and incompatible filesystems/architectures. An environment variable
`COCKROACH_SKIP_CERTS_PERMISSION_CHECKS` with a stern warning should be sufficient.

# Certificate creation and renewal

## CA certificate

Initial CA creation involves creating the certificate and private key.
* `ca.crt`: the CA certificate, valid 5 years. Provided to all nodes and clients.
* `ca.key`: the CA private key. Must be kept safe and **not** distributed to nodes and clients.

CA renewal involves creating a new certificate using either the same, or a new private key.
All valid CA certificates need to be kept as well as their corresponding keys:
* append the new certificate to the existing `ca.crt` (optionally removing expired certificates along the way).
* store the new key in a new file.

The updated `ca.crt` file must be communicated to all nodes and clients.
The `ca.key` must still be kept safe and **not** distributed to nodes and clients.

When signing a node/client certificate, if multiple CA certificates are found inside `ca.crt`, the
certificate matching the private key will be used. If multiple such certificates exist, the one with
the latest expiration date will be used.

## Node/client certificate

The trusted machine holding the CA private key generates node certificates and private keys, then
pushes them to the appropriate nodes.
Keys are not kept by the trusted machine once deployed on the nodes.

Generated node/client certificates have a shorter default lifetime than CA certificates (see the "Certificate expirations" section). Furthermore, their expiration date cannot exceed the CA certificate expiration.

Upon renewal, certificates and keys are fully re-generated, with no attempt to re-use the private node/client key.
Filenames for node/client certificates and keys can remain the same as before, or be new files.

# Reloading certificate/key files

## Triggering a reload

Running nodes can be told to re-read the certificate directory by issuing a `SIGHUP` to the process.

Since we cannot control when nodes may be restarted, it is important to keep the reload process
identical to the initial load.

## Validating certificates

Node certificates must be checked for validity. Specifically:
* we must have a valid certificate/private key pair
* the certificate must currently be valid (`Not Before < time.Now() < Not After`)
* the certificate must be signed by one of the CA certificates on this node

The last condition is an attempt to avoid loading a certificate that may not be verifiable
by other nodes or clients. If we do not have the right CA, chances are someone else does not either.

We may need to set a timer for certificates that have not reached their `Not Before` date, otherwise
we would need to trigger a second refresh.

# Online certificate rotation

A good description of online key rotation in Go can be found in
[Hitless TLS Certificate Rotation in Go](https://diogomonica.com/2017/01/11/hitless-tls-certificate-rotation-in-go/).

Adding or swapping certificates can be done in multiple ways:
1. construct a new `tls.Config` object
1. modify individual fields
1. implement callbacks corresponding to individual fields

The `tls.Config` object is specified at connection time and must not be modified afterwards, as
modifying it while in use is not safe.

A node needs to maintain two `tls.Config` objects, one for server-side connections, one for client-side connections. A new config can be constructed upon reload, then reused for all subsequent connections.

The server-side `tls.Config` object can be specified for each client connection by implementing
the `tls.Config.GetConfigForClient` callback. This should return the most recent `tls.Config` object.

## Adding a new CA certificate

Root CAs for server and client certificate verification are in `tls.Config.RootCAs` and `tls.Config.ClientCAs` respectively. We should add all detected CA certificates to both pools.

## Rotating node/client certificate

The node and client certificates are set in `tls.Config.Certificates`.
If more than one node certificate is present, the one matching the requested `ServerName` is presented.
We should set only one certificate in `tls.Config.Certificates`.

# Additional interfaces

## Command line

We will need to modify all commands that use certs to make use of the new directory structure.

We will also need:
* modification to `cert create-ca` to use an existing key.
* `cert list` to list all CA certs and node/client cert/key pairs.

## Admin UI / debug pages

We want at least barebones listing of all certs on a given node, including validity
dates, certificate chain (corresponding CA for a node cert), and valid hosts.

Soon-to-expire certificates (or chains) must be reported prominently on the admin UI and
available through external metrics.

# Future work

## Packaging certificates for other languages/libraries

Separate `.crt` and `.key` files are expected by libpq, but other libraries/languages may have different ways of specifying/packaging them.

We need to:
* augment our per-language examples to include secure mode. See [docs/631](https://github.com/cockroachdb/docs/issues/631)
* document how to use public tools (ie: openssl) to convert certificates.
* provide multiple cert/key output modes for the `cockroach cert` commands.

## Alternate reload methods

Some additional methods to trigger a reload can later be introduced:
* a timer based on certificate expiration
* regular timer
* admin UI endpoint

## Client certificate monitoring

We have no way of knowing which certificate authority a client has, so we cannot monitor for
clients not yet aware of a new CA certificate.

We could examine client certificates and report soon-to-expire ones. This will not help
with CA knowledge, but would provide better visibility into user authentication issues.

# Unresolved questions

## `tls.Config` fields

We need to verify that the proposed CA and node cert rotation mechanisms work, especially through
grpc.

Since everything uses `tls.Config`, implementing `tls.Config.GetConfigForClient` to rotate
the config on the server should be sufficient.

However, we need to ensure that all client-side connections are able to use the new config when
initiating a connection.

## Renegotiation and certificate rotation

Renegotiation may cause new certificates to be presented. We need to make sure this will not
cause issues.

The `tls.Config` comments also mention this happening in TLS 1.3:
```
GetClientCertificate may be called multiple times for the same
connection if renegotiation occurs or if TLS 1.3 is in use.
```

## Multi-certificate DER files and postgres clients

The Go `lib/pq` will use all certificates found in `ca.crt`, but this may not be the
case for other libraries.
It may be safer to keep a single CA certificate per file.

## Allow use of multiple certs directories

Instead of the `--certs-dir` flag being a single directory, we could allow specification of
multiple directories. This would be useful to separate CA certificates from other certs.

## File permissions

Is checking for `-rwx------` on keys sufficient? A more stringent check would be similar
to what the ssh client does (strict directory/file permissions).

## Multiple versions of the same certificate

Should we allow multiple versions of the same certificate? eg: multiple files matching `ca.*.crt` or `node.*.crt`? If so, how do we handle them?

## Certificate lifetime

We need to pick some defaults for certificate lifetimes.

The proposed ones may be inappropriate for most users: security-minded admins or those with
a good certificate-rotation process in place may want much shorter periods while casual
users may want to never be bothered by certificates.

# Drawbacks

## Monitoring certificates

Simply because both a CA certificate and a node certificate are valid does not mean they
are both correct. Consider the following scenario:
* we receive an alert for CA certificate expiring soon
* we lost the CA private key so generate a new cert/key pair
* we push the new CA certificate to all nodes
* alerts no longer fire

In this case, as long as the old CA is still valid, we can verify other node certificates.
However, as soon as the old CA expires, we will be unable to verify node certificates due
to the key change.

We could improve monitoring by analyzing the lifetime data for each certificate chain
on each node. This would notice that the old chain expires when the old CA expires, and the
new chain contains only a CA certificate, no node certificate.

If we record/export chain information by CA cert serial number (or public key), we can ensure
that all nodes have certificate chains rooted at the same CA.

## Multiple viable CA certs when generating node/client certs

When generating new node or client certs, we may automatically detect multiple
valid CA certs in the certificate directory.
If two certs have the same key pair, we can pick the one expiring the latest.
If the keys differ, we want to throw an error and either ask the user to remove one, or
add an additional flag to force the cert (this partially defeats the point of automatic
file detection).

## Delegated users

Dropping specific cert flags means that the postgres URL will be built automatically from the
requested username.

For example, running the following command: `cockroach sql -u foo --certs-dir=~/.cockroach-certs` will
generate the URL: `postgresql://foo@localhost:26257/?sslcert=foo.client.crt&sslkey=foo.client.key&...`

If delegation is allowed (user `root` can act as user `foo`), the command must be run with the
fully-specified postgres URL `postgresql://foo@localhost:26257/?sslcert=root.client.crt&sslkey=root.client.key&...`

Delegation remains doable, but with an extra hoop to jump through.

## Determination of secure mode

With default values for certificate and key locations, secure mode is now less explicit, relying
on the detection of certificates in the default directory. This could be misleading to users.

# Alternative solutions

## Certificate/key file discovery

We have three major options to specify certificate and key files:

### Full file specification

This is the current method: all files are specified by their own flags.

Drawbacks:
* tedious. This can be alleviated by having default file names
* does not support multiple files. If renewal is done using new file pairs, these flags would need to be changed.
* using multiple CA certificates inside a single `ca.crt` file requires finding the certificate matching the key. This can be partially avoided by always putting the newest certificate first.

Advantages:
* simple code: we use the files as specified, relying on the standard library for mismatches.
* allows separate storage locations for certs (eg: CA in `/etc/...`, node certs in user directory).

### Globs

Command-line flags for filename globs (either per pair, or per file). Optionally, allow specification
of a certs directory, with globs matching files within the directory.

Drawbacks:
* how do we deal with multiple matching certificates (eg: `node.old.crt` and `node.new.crt`)?
* if a glob matches multiple types of certs (eg: `*.crt` glob matches CA/node/client certs), do we just fail?
* same problem with multi-cert `ca.crt` files.
* if using a shared directory, does not allow separate storage locations.

Advantages:
* reasonably easy to code, especially if requiring single matches.
* if specifying per file-type globs, can handle separate storage locations.

### Automatically-determine files

Automatically determine file types (key vs cert) and cert usage (CA vs node vs client) by analyzing
all files in the certs directory.
Certificates can be determined by looking at `IsCA` or `ExtendedUsage`. The keys can be matched
to certificates by comparing algorithms and public keys.

Drawbacks:
* does not allow for separate storage location. We could allow specification of multiple certs directories.
* complex code: parses/analyses all certificates and keys. Need to make sure we mimic the standard library behavior to avoid surprises. Also need to evolve with the standard library (eg: key type support).
* unusual specification of certs/keys. Everyone else specifies files directly.
* "too much magic"

Advantages:
* full validation can provide user-friendly error messages on improper files (still obscure without decent knowledge of certificates).
* support for multiple ways of generating/deploying certificates and keys.