## roachprod

⚠️ roachprod is an **internal** tool for creating and testing
CockroachDB clusters. Use at your own risk! ⚠️

## Setup

1. Make sure you have [gcloud installed] and configured (`gcloud auth list` to
   check, `gcloud auth login` to authenticate). You may want to update old
   installations (`gcloud components update`).
1. Build a local binary of `roachprod`: `make bin/roachprod`
1. Add `$PWD/bin` to your `PATH` so you can run `roachprod` from the root directory of `cockroach`.

## Summary

* By default, clusters are created in the [cockroach-ephemeral] GCE
  project. Use the `--gce-project` flag or `GCE_PROJECT` environment
  variable to create clusters in a different GCE project. Note that
  the `lifetime` functionality requires `roachprod gc
  --gce-project=<name>` to be run periodically (e.g. via a
  cronjob). This is only provided out-of-the-box for the
  [cockroach-ephemeral] project.
* Anyone can connect to any port on VMs in [cockroach-ephemeral].
  **DO NOT STORE SENSITIVE DATA**.
* Cluster names are prefixed with the user creating them. For example,
  `roachprod create test` creates the `marc-test` cluster.
* VMs have a default lifetime of 12 hours (changeable with the
  `--lifetime` flag).
* Default settings create 4 VMs (`-n 4`) with 4 CPUs, 15GB memory
  (`--machine-type=n1-standard-4`), and local SSDs (`--local-ssd`).

## Cluster quick-start using roachprod

```bash
# Create a cluster with 4 nodes and local SSD. The last node is used as a
# load generator for some tests. Note that the cluster name must always begin
# with your username.
export CLUSTER="${USER}-test"
roachprod create ${CLUSTER} -n 4 --local-ssd

# Add gcloud SSH key. Optional for most commands, but some require it.
ssh-add ~/.ssh/google_compute_engine

# Stage binaries.
roachprod stage ${CLUSTER} workload
roachprod stage ${CLUSTER} release v2.0.5

# ...or using roachprod directly (e.g., for your locally-built binary).
build/builder.sh mkrelease
roachprod put ${CLUSTER} cockroach-linux-2.6.32-gnu-amd64 cockroach

# Start a cluster.
roachprod start ${CLUSTER}

# Check the admin UI.
roachprod admin --open ${CLUSTER}:1

# Run a workload.
roachprod run ${CLUSTER}:4 -- ./workload init kv
roachprod run ${CLUSTER}:4 -- ./workload run kv --read-percent=0 --splits=1000 --concurrency=384 --duration=5m

# Open a SQL connection to the first node.
roachprod sql ${CLUSTER}:1

# Extend lifetime by another 6 hours.
roachprod extend ${CLUSTER} --lifetime=6h

# Destroy the cluster.
roachprod destroy ${CLUSTER}
```

## Command reference

Warning: this reference is incomplete. Be prepared to refer to the CLI help text
and the source code.

### Create a cluster

```
$ roachprod create foo
Creating cluster marc-foo with 3 nodes
OK
marc-foo: 23h59m42s remaining
  marc-foo-0000   [marc-foo-0000.us-east1-b.cockroach-ephemeral]
  marc-foo-0001   [marc-foo-0001.us-east1-b.cockroach-ephemeral]
  marc-foo-0002   [marc-foo-0002.us-east1-b.cockroach-ephemeral]
Syncing...
```

#### Choosing a Provider

Use the `--clouds` flag to set which cloud provider(s) to use. For example:

```
$ roachprod create foo --clouds gce,aws
```

#### Node Distribution Options

A couple of flags interact to create nodes in one zone or in
geographically distributed zones:

- `--geo`
- the `--[provider]-zones` flags (`--gce-zones`, `--aws-zones`, `--azure-locations`)

Here's what to expect when the options are combined:

- _If neither is set_: nodes are all placed within one of the provider's default zones
- _`--geo` only_: nodes are spread across the provider's default zones
- _`--[provider]-zones` or `--geo --[provider]-zones`_: nodes are spread across
  all the specified zones

### Interact using crl-prod tools

`roachprod` populates hosts files in `~/.roachprod/hosts`. These are used by
`crl-prod` tools to map clusters to node addresses.

```
$ crl-ssh marc-foo all df -h /
1: marc-foo-0000.us-east1-b.cockroach-ephemeral
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        49G  1.2G   48G   3% /

2: marc-foo-0001.us-east1-b.cockroach-ephemeral
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        49G  1.2G   48G   3% /

3: marc-foo-0002.us-east1-b.cockroach-ephemeral
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        49G  1.2G   48G   3% /
```

### Interact using `roachprod` directly

```
# Add ssh-key
$ ssh-add ~/.ssh/google_compute_engine

$ roachprod status marc-foo
marc-foo: status 3/3
   1: not running
   2: not running
   3: not running
```

### SSH into hosts

`roachprod` uses `gcloud` to sync the list of hostnames to `~/.ssh/config` and
set up keys.

```
$ ssh marc-foo-0000.us-east1-b.cockroach-ephemeral
```

### List clusters

```
$ roachprod list
marc-foo: 23h58m27s remaining
  marc-foo-0000
  marc-foo-0001
  marc-foo-0002
Syncing...
```

### Destroy cluster

```
$ roachprod destroy marc-foo
Destroying cluster marc-foo with 3 nodes
OK
```

See `roachprod help <command>` for further details.

## Return Codes

`roachprod` uses return codes to provide information about the exit status.
These are the codes and what they mean:

- 0: everything ran as expected
- 1: an unclassified roachprod error
- 10: a problem with an SSH connection to a server in the cluster
- 20: a problem running a non-cockroach command on a remote cluster server or on a local node
- 30: a problem running a cockroach command on a remote cluster server or a local node

Each of these codes has a corresponding easy-to-search-for string that is
emitted to output when an error of that type occurs. The strings are emitted
near the end of output and for each error that happens during an ssh
connection to a remote cluster node. The strings for each error code are:

- 1: `UNCLASSIFIED_PROBLEM`
- 10: `SSH_PROBLEM`
- 20: `COMMAND_PROBLEM`
- 30: `DEAD_ROACH_PROBLEM`

## Future improvements

* Bigger loadgen VM (last instance)

* Ease the creation of test metadata and then running a series of tests
  using `roachprod <cluster> test <dir1> <dir2> ...`. Perhaps something like
  `roachprod prepare <test> <binary>`.

* Automatically detect stalled tests and restart tests upon unexpected
  failures. Detection of stalled tests could be done by noticing zero output
  for a period of time.

* Detect crashed cockroach nodes.

[cockroach-ephemeral]: https://console.cloud.google.com/home/dashboard?project=cockroach-ephemeral
[gcloud installed]: https://cloud.google.com/sdk/downloads
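
As an illustration of the codes listed under "Return Codes", a wrapper script might translate an exit status into its searchable string. This is a minimal sketch, not part of `roachprod`: the function names `roachprod_exit_name` and `run_and_classify` are hypothetical, and the `OK` and `UNKNOWN(...)` labels are placeholders, since the list above defines no string for code 0 or for unlisted codes.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not shipped with roachprod): map an exit code to the
# marker string documented in the "Return Codes" section.
roachprod_exit_name() {
  case "$1" in
    0)  echo "OK" ;;                    # placeholder label; no string is documented for 0
    1)  echo "UNCLASSIFIED_PROBLEM" ;;
    10) echo "SSH_PROBLEM" ;;
    20) echo "COMMAND_PROBLEM" ;;
    30) echo "DEAD_ROACH_PROBLEM" ;;
    *)  echo "UNKNOWN($1)" ;;           # placeholder for codes not listed above
  esac
}

# Hypothetical wrapper: run any command, print its classification to stderr,
# and return the command's original exit code unchanged.
run_and_classify() {
  local rc=0
  "$@" || rc=$?
  echo "exit ${rc}: $(roachprod_exit_name "${rc}")" >&2
  return "${rc}"
}
```

With these functions sourced, `run_and_classify roachprod status "${CLUSTER}"` would print a line such as `exit 10: SSH_PROBLEM` to stderr if the SSH connection failed, while leaving the exit code intact for the calling script.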