- Feature Name: cluster name
- Status: completed
- Start Date: 2019-07-31
- Authors: knz, ben, marc
- RFC PR: [#39196](https://github.com/cockroachdb/cockroach/pull/39196)
- Cockroach Issue: [#16784](https://github.com/cockroachdb/cockroach/issues/16784) [#15888](https://github.com/cockroachdb/cockroach/issues/15888) [#28408](https://github.com/cockroachdb/cockroach/issues/28408)

# Summary

New feature: a string value, called the "cluster name", that is checked for
equality when a newly started node joins a cluster.

Prototype implementation: https://github.com/cockroachdb/cockroach/pull/39270

This will prevent newly added nodes from joining the wrong cluster when a
user has multiple clusters running side by side.

This would be configured using another command-line parameter, `--cluster-name`,
alongside `--join`.

It will increase operational simplicity and support the overall goal to “make
data easy”.

Other impact:

- (nice to have) the value would be shown in the admin UI, so that two UI
  screens for different clusters side-by-side can be disambiguated at
  a glance.

- (to be further considered) the value would be used as an additional
  label annotation on exported Prometheus metrics. This would make it
  possible for Grafana and other monitoring solutions to more easily
  connect to multiple CockroachDB clusters simultaneously.

Out of scope: it would be good to check the cluster name for equality
in the node certificate properties. This would prevent a node from
trusting another node's certificate if not in the same cluster, even
though they may be signed by the same CA.
See https://github.com/cockroachdb/cockroach/issues/28408


# Motivation

The motivation for doing this has been detailed in
https://github.com/cockroachdb/cockroach/issues/16784 and
https://github.com/cockroachdb/cockroach/issues/15888:

- avoid mistaken node joins
- disambiguate UI screens
- disambiguate clusters in Prometheus metrics

# Guide-level explanation

The service manager in charge of starting the `cockroach` process
would activate this feature by providing the `--cluster-name` parameter
on the command line.

When the parameter is supplied, a newly created node will verify that
the other nodes that it connects to (via `--join`) have the same name
configured. If it detects a different name, the join fails and the
user is informed.

If the parameter is not supplied, we get the current behavior: the
join succeeds in any case.

There are two scenarios of interest:

- newly-created clusters. These would have `--cluster-name` set on
  every node from the beginning.

- previously-created clusters that don't have `--cluster-name`
  configured yet, and wish to "opt into" the system.
  For these the following process is required:

  1. restart (possibly with a version upgrade to 19.2) every node in a
     rolling fashion, specifying both `--cluster-name` and
     `--disable-cluster-name-verification` on the command line.
  2. perform another rolling restart, removing `--disable-cluster-name-verification`
     from every node.

# Reference-level explanation

- The parameters `--cluster-name` and
  `--disable-cluster-name-verification` are parsed for `cockroach start` and `start-single-node`
  only.

- A maximum of 256 characters (for now) is allowed for `--cluster-name`.

- Its lexical format is verified against the following regexp: `^[a-zA-Z](?:[-a-zA-Z0-9]*[a-zA-Z0-9]|)$`
  (allows `a123` and `a-b` but not `a.`, `123a` or `b-`.
  We choose to exclude "_" because we foresee this may want
  to become a valid hostname eventually. We choose to exclude "."
  to open the door for integration with mDNS).

- It is stored in the `server.Config` object, and propagated to the RPC
  context and the heartbeat service.

- The name is populated by the recipient of a PingRequest into the
  PingResponse object, and checked by the initiator of the heartbeat. We choose
  to check in the initiator so that the error condition can be
  reported clearly to the operator (via `log.Shout`).

- The flag `--disable-cluster-name-verification` disables the check.

- The effect of `--disable-cluster-name-verification` is activated if
  either side has it set on the command line. This is necessary
  because during the rolling upgrade where `--cluster-name` is added
  with `--disable-cluster-name-verification`, there is no technical
  constraint that the initiator of a heartbeat must always be a node
  that already has `--disable-cluster-name-verification` set. Consider
  the following scenario:

  1. nodes upgraded to 19.2 without `--cluster-name` set
  2. n1 restarted with `--cluster-name --disable-cluster-name-verification`
  3. n2 sends a heartbeat to n1. Receives a cluster name. Because
     at that point n2 does not have `--disable-cluster-name-verification` (yet)
     it will perform the check and that check will fail.

  To fix this, the value of `--disable-cluster-name-verification` is sent
  alongside the name in PingResponse, and combined (OR) with the local
  one on the initiator side. If either side has the flag set,
  the check is disabled.

- A new SQL built-in function `crdb_internal.cluster_name()` reports
  the configured value.

- (Optionally) reported in `statuspb.NodeStatus` and displayed in UI.
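
The validation and heartbeat rules above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code from the prototype PR: `validateClusterName`, `checkClusterName` and the `pingResponse` struct (with its field names) are hypothetical stand-ins. Only the regexp, the 256-character limit, and the OR-combination of the two "disable verification" flags come from the text.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxClusterNameLength mirrors the 256-character limit stated in the RFC.
const maxClusterNameLength = 256

// clusterNameRe is the lexical format given in the RFC: a leading letter,
// then letters, digits or hyphens, ending in a letter or digit.
var clusterNameRe = regexp.MustCompile(`^[a-zA-Z](?:[-a-zA-Z0-9]*[a-zA-Z0-9]|)$`)

// validateClusterName is a hypothetical helper that applies the length
// limit and the lexical check to a value supplied via --cluster-name.
func validateClusterName(name string) error {
	if len(name) > maxClusterNameLength {
		return fmt.Errorf("cluster name exceeds %d characters", maxClusterNameLength)
	}
	if !clusterNameRe.MatchString(name) {
		return fmt.Errorf("cluster name %q does not match the required format", name)
	}
	return nil
}

// pingResponse is a hypothetical stand-in for the two values the RFC says
// the recipient of a heartbeat sends back; the real message has more fields.
type pingResponse struct {
	ClusterName                    string
	DisableClusterNameVerification bool
}

// checkClusterName runs on the initiator of the heartbeat. The check is
// skipped if either side disabled verification (the OR described above);
// otherwise the local and remote names must be equal.
func checkClusterName(localName string, localDisable bool, resp pingResponse) error {
	if localDisable || resp.DisableClusterNameVerification {
		return nil
	}
	if localName != resp.ClusterName {
		return fmt.Errorf("cluster name mismatch: local %q, remote %q",
			localName, resp.ClusterName)
	}
	return nil
}

func main() {
	// The examples listed in the RFC: the first two are accepted,
	// the last three are rejected.
	for _, name := range []string{"a123", "a-b", "a.", "123a", "b-"} {
		fmt.Printf("%-5s valid=%t\n", name, validateClusterName(name) == nil)
	}

	// The rolling-upgrade scenario: n2 has no name and no disable flag yet,
	// n1 already runs with a name and verification disabled.
	fromN1 := pingResponse{ClusterName: "prod", DisableClusterNameVerification: true}
	fmt.Println("n2 -> n1 heartbeat error:", checkClusterName("", false, fromN1))
}
```

Without the OR-combination, the last call would compare the empty local name against `"prod"` and fail, which is the failure described in step 3 of the scenario.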

### Changing the cluster name in existing clusters

The design as proposed makes the cluster name consistent across all nodes.

Once the cluster name has been set to a new value, it becomes harder to change it.
The RFC as-is enables the following procedure to change the cluster name:

1. restart all nodes in rolling fashion, adding the parameter
   `--disable-cluster-name-verification` and changing `--cluster-name` to the new value.
   After the restart, the new name is known everywhere but name verification is disabled.
2. restart all nodes (a 2nd time), removing `--disable-cluster-name-verification`.

This manual procedure could be further automated, avoiding the node restarts, by adding a new cluster RPC which automatically:

1. sets the flag "disable verification" and erases the cluster name remotely in the `*rpc.Context` of every node,
2. changes the cluster name remotely in the `*rpc.Context` and `HeartbeatService` of every node,
3. sets the flag "disable verification" back to false in every node.

## Detailed design

How: new CLI flag, populates `server.Config`, used in heartbeat checks.

Optionally picked up by the `Nodes()` status RPC and `crdb_internal.gossip_nodes`.

## Drawbacks

None known at this time.

## Rationale and Alternatives

- Cluster setting *instead of* command line flag.

  Rejected because cluster settings are not yet available when setting
  up a fresh cluster (nodes can join each other but no storage is
  available until `init` has been issued).

- *Separate*, additional easy-to-configure "display name" used in
  admin UI.

  Rejected because the need for this has not been expressed strongly
  at this time.

- Separate names for join verification and Prometheus metric labels.

  Under discussion. Proposal is to keep them the same until there
  is a reason to make them separate.
  When that time comes, we can
  introduce a cluster setting for the Prometheus metrics, whose
  default value comes from the command line flag.

  Another alternative is to do nothing. A current way of setting a
  cluster name label on Prometheus metrics is to have a rule to attach
  the label at metric scraping time. See [example](https://github.com/cockroachdb/cockroach/blob/master/monitoring/prometheus.yml#L35).

- Avoiding the `--disable-cluster-name-verification` flag altogether:
  we have not been able to find a protocol that lets an existing
  cluster (without a name) "opt into" a new name.

## Unresolved questions

N/A