github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20190731_cluster_name.md

     1  - Feature Name: cluster name
     2  - Status: completed
     3  - Start Date: 2019-07-31
     4  - Authors: knz, ben, marc
     5  - RFC PR: [#39196](https://github.com/cockroachdb/cockroach/pull/39196)
     6  - Cockroach Issue: [#16784](https://github.com/cockroachdb/cockroach/issues/16784) [#15888](https://github.com/cockroachdb/cockroach/issues/15888) [#28408](https://github.com/cockroachdb/cockroach/issues/28408)
     7  
     8  # Summary
     9  
    10  New feature: a string value, called "cluster name", that is checked
    11  for equality when a newly started node joins a cluster.
    12  
    13  Prototype implementation:  https://github.com/cockroachdb/cockroach/pull/39270
    14  
    15  This will prevent newly added nodes from joining the wrong cluster
    16  when a user has multiple clusters running side by side.
    17  
    18  This would be configured using another command-line parameter `--cluster-name`
    19  alongside `--join`.
    20  
    21  It will increase operational simplicity and serve the overall goal to
    22  “make data easy”.
    23  
    24  Other impact:
    25  
    26  - (nice to have) the value would be shown in the admin UI, so that two UI
    27    screens for different clusters side-by-side can be disambiguated at
    28    a glance.
    29  
    30  - (to be further considered) the value would be used as an additional
    31    label annotation on exported prometheus metrics. This would make it
    32    possible for Grafana and other monitoring solutions to more easily
    33    connect to multiple CockroachDB clusters simultaneously.
    34  
    35  Out of scope: it would be good to check the cluster name for equality
    36  in the node certificate properties. This would prevent a node from
    37  trusting another node's certificate if not in the same cluster, even
    38  though they may be signed by the same CA. See https://github.com/cockroachdb/cockroach/issues/28408
    39  
    40  
    41  # Motivation
    42  
    43  The motivation for doing this has been detailed in
    44  https://github.com/cockroachdb/cockroach/issues/16784 and
    45  https://github.com/cockroachdb/cockroach/issues/15888:
    46  
    47  - avoid mistaken node joins
    48  - disambiguate UI screens
    49  - disambiguate clusters in prometheus metrics
    50  
    51  # Guide-level explanation
    52  
    53  The service manager in charge of starting the `cockroach` process
    54  would activate this feature by providing the `--cluster-name` parameter
    55  on the command line.
    56  
    57  When the parameter is supplied, a newly created node will verify that
    58  the other nodes that it connects to (via `--join`) have the same name
    59  configured. If it detects a different name, the join fails and the
    60  user is informed.
    61  
    62  If the parameter is not supplied, we get the current behavior - the
    63  join succeeds in any case.
    64  
    65  There are two scenarios of interest:
    66  
    67  - newly-created clusters. These would have `--cluster-name` set on
    68    every node from the beginning.
    69  
    70  - previously-created clusters that don't have `--cluster-name`
    71    configured yet, and wish to "opt into" the system.
    72    For these the following process is required:
    73  
    74    1. restart (possibly with a version upgrade to 19.2) every node in a
    75       rolling fashion, specifying both `--cluster-name` and
    76       `--disable-cluster-name-verification` on the command line.
    77    2. perform another rolling restart, removing `--disable-cluster-name-verification`
    78       from every node.
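
The two-step opt-in can be sanity-checked with a small simulation. This is a minimal sketch, not CockroachDB code: the `node` type and the `checkJoin`, `allPairsOK` and `optIn` helpers are hypothetical. It assumes that an unset name compares as the empty string, and that the check is skipped when either side has verification disabled (as the reference-level explanation describes).

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in for one node's command-line configuration.
type node struct {
	id            string
	clusterName   string // value of --cluster-name; "" when the flag is absent
	disableVerify bool   // true when --disable-cluster-name-verification is set
}

// checkJoin models the name check between two nodes. It is skipped when
// either side disables verification; otherwise the names must be equal.
func checkJoin(a, b node) error {
	if a.disableVerify || b.disableVerify {
		return nil
	}
	if a.clusterName != b.clusterName {
		return errors.New(a.id + " and " + b.id + " have different cluster names")
	}
	return nil
}

// allPairsOK reports whether every pair of nodes passes checkJoin.
func allPairsOK(nodes []node) bool {
	for i := range nodes {
		for j := i + 1; j < len(nodes); j++ {
			if checkJoin(nodes[i], nodes[j]) != nil {
				return false
			}
		}
	}
	return true
}

// optIn applies the two rolling restarts in place and reports whether the
// check held between all nodes after every single-node restart.
func optIn(nodes []node, name string) bool {
	for i := range nodes { // restart 1: add the name, verification disabled
		nodes[i].clusterName, nodes[i].disableVerify = name, true
		if !allPairsOK(nodes) {
			return false
		}
	}
	for i := range nodes { // restart 2: re-enable verification
		nodes[i].disableVerify = false
		if !allPairsOK(nodes) {
			return false
		}
	}
	return true
}

func main() {
	nodes := []node{{id: "n1"}, {id: "n2"}, {id: "n3"}}
	fmt.Println("two-step opt-in safe:", optIn(nodes, "crdb-prod"))

	// Adding the name in a single step breaks the first restarted node.
	naive := []node{{id: "n1", clusterName: "crdb-prod"}, {id: "n2"}}
	fmt.Println("single-step check error:", checkJoin(naive[0], naive[1]))
}
```

In this model, a single rolling restart that adds only `--cluster-name` fails the check between the first restarted node and its not-yet-restarted peers, which is why the intermediate step with verification disabled is needed.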
    79  
    80  # Reference-level explanation
    81  
    82  - The parameters `--cluster-name` and
    83    `--disable-cluster-name-verification` are parsed for `cockroach start` and `start-single-node`
    84    only.
    85  
    86  - A maximum of 256 characters (for now) is allowed for `--cluster-name`.
    87  
    88  - Its lexical format is verified against the following regexp: `^[a-zA-Z](?:[-a-zA-Z0-9]*[a-zA-Z0-9]|)$`
    89    (allows `a123` and `a-b` but not `a.`, `123a` or `b-`).
    90    We choose to exclude "_" because we foresee that the name may need
    91    to be a valid hostname eventually. We choose to exclude "."
    92    to open the door for integration with mDNS.
    93  
    94  - It is stored in the `server.Config` object, and propagated to the RPC
    95    context and the heartbeat service.
    96  
    97  - The name is populated by the recipient of a PingRequest into the
    98    PingResponse object, and checked by the initiator of the heartbeat. We choose
    99    to check in the initiator so that the error condition can be
   100    reported clearly to the operator (via `log.Shout`).
   101  
   102  - The flag `--disable-cluster-name-verification` disables the check.
   103  
   104  - The effect of `--disable-cluster-name-verification` is activated if
   105    either side has it set on the command line. This is necessary
   106    because during the rolling upgrade where `--cluster-name` is added
   107    with `--disable-cluster-name-verification`, there is no technical
   108    constraint that the initiator of a heartbeat must always be a node
   109    that already has `--disable-cluster-name-verification` set. Consider
   110    the following scenario:
   111  
   112    1. nodes upgraded to 19.2 without `--cluster-name` set
   113    2. n1 restarted with `--cluster-name --disable-cluster-name-verification`
   114    3. n2 sends a heartbeat to n1 and receives a cluster name. Because
   115       at that point n2 does not have `--disable-cluster-name-verification` (yet),
   116       it will perform the check and that check will fail.
   117  
   118    To fix this, the value of `--disable-cluster-name-verification` is sent
   119    alongside the name in PingResponse, and combined (OR) with the local
   120    one on the initiator side. If either side has the flag set,
   121    the check is disabled.
   122  
   123  - A new SQL built-in function `crdb_internal.cluster_name()` reports
   124    the configured value.
   125  
   126  - (Optionally) reported in `statuspb.NodeStatus` and displayed in UI.
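
The length limit and lexical format listed above for `--cluster-name` can be sketched in Go as follows. The names `maxClusterNameLen`, `clusterNameRe` and `validateClusterName` are hypothetical, and the pattern assumes the character class in the regexp is meant to read `a-zA-Z`. This illustrates the rules as stated and is not the actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxClusterNameLen is the 256-character limit stated in the RFC.
const maxClusterNameLen = 256

// clusterNameRe requires a leading letter, then letters, digits or hyphens,
// and forbids a trailing hyphen. "_" and "." are rejected entirely.
var clusterNameRe = regexp.MustCompile(`^[a-zA-Z](?:[-a-zA-Z0-9]*[a-zA-Z0-9]|)$`)

// validateClusterName returns nil if name satisfies both rules. It is meant
// to be called only when --cluster-name was actually supplied.
func validateClusterName(name string) error {
	if len(name) > maxClusterNameLen {
		return fmt.Errorf("cluster name is longer than %d characters", maxClusterNameLen)
	}
	if !clusterNameRe.MatchString(name) {
		return fmt.Errorf("cluster name %q has an invalid format", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"a123", "a-b", "a.", "123a", "b-"} {
		fmt.Printf("%q valid=%t\n", name, validateClusterName(name) == nil)
	}
}
```

Because the pattern only admits ASCII characters, the byte length checked by `len` equals the character count for any name that passes the format check.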
   127  
   128  ### Changing the cluster name in existing clusters
   129  
   130  The design as proposed makes the cluster name consistent across all nodes.
   131  
   132  Once the cluster name has been set to a new value, it becomes harder to change it.
   133  The RFC as-is enables the following procedure to change the cluster name:
   134  
   135  1. restart all nodes in rolling fashion, adding the parameter
   136     `--disable-cluster-name-verification` and changing `--cluster-name` to the new value.
   137     After the restart, the new name is known everywhere but name verification is disabled.
   138  2. restart all nodes (a 2nd time), removing `--disable-cluster-name-verification`.
   139  
   140  This manual procedure could be further automated to avoid restarting the nodes by adding a new cluster RPC, which automatically:
   141  
   142  1. sets the flag "disable verification" and erases the cluster name remotely in the `*rpc.Context` of every node,
   143  2. changes the cluster name remotely in the `*rpc.Context` and `HeartbeatService` of every node,
   144  3. sets the flag "disable verification" back to false in every node.
   145  
   146  ## Detailed design
   147  
   148  How: new CLI flag, populates `server.Config`, used in heartbeat checks.
   149  
   150  Optionally picked up by the `Nodes()` status RPC and `crdb_internal.gossip_nodes`.
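
The flag-to-config path can be illustrated with Go's standard `flag` package. This is a sketch under stated assumptions: the `config` struct is a reduced, hypothetical stand-in for `server.Config`, `parseStartFlags` is an invented helper, and the RFC does not show how the real CLI is wired.

```go
package main

import (
	"flag"
	"fmt"
)

// config is a hypothetical, reduced stand-in for server.Config.
type config struct {
	ClusterName                    string
	DisableClusterNameVerification bool
}

// parseStartFlags parses the two flags from args into a config value.
func parseStartFlags(args []string) (config, error) {
	var cfg config
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	fs.StringVar(&cfg.ClusterName, "cluster-name", "",
		"name checked against peers when joining")
	fs.BoolVar(&cfg.DisableClusterNameVerification,
		"disable-cluster-name-verification", false,
		"skip the cluster name check")
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseStartFlags([]string{
		"--cluster-name=crdb-prod",
		"--disable-cluster-name-verification",
	})
	fmt.Println(cfg, err)
}
```

The resulting values would then be handed to whatever component performs the heartbeat check.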
   151  
   152  ## Drawbacks
   153  
   154  None known at this time.
   155  
   156  ## Rationale and Alternatives
   157  
   158  - Cluster setting *instead of* command line flag.
   159  
   160    Rejected because cluster settings are not yet available when setting
   161    up a fresh cluster (nodes can join each other but no storage
   162    available until `init` has been issued).
   163  
   164  - *Separate*, additional easy-to-configure "display name" used in
   165    admin UI.
   166  
   167    Rejected because the need for this has not been expressed strongly
   168    at this time.
   169  
   170  - Separate names for join verification and Prometheus metric labels.
   171  
   172    Under discussion. Proposal is to keep them the same until there
   173    is a reason to make them separate. When that time comes, we can
   174    introduce a cluster setting for the Prometheus metrics, whose
   175    default value comes from the command line flag.
   176  
   177    Another alternative is to do nothing. A current way of setting a
   178    cluster name label on prometheus metrics is to have a rule to attach
   179    the label at metric scraping time. See [example](https://github.com/cockroachdb/cockroach/blob/master/monitoring/prometheus.yml#L35).
   180  
   181  - Avoiding the `--disable-cluster-name-verification` flag altogether:
   182    we have not been able to find a protocol that lets an existing
   183    cluster (without a name) "opt into" a new name.
   184  
   185  ## Unresolved questions
   186  
   187  N/A