---
layout: post
title: CLUSTER
permalink: /docs/cli/cluster
redirect_from:
 - /cli/cluster.md/
 - /docs/cli/cluster.md/
---

# `ais cluster` command

The command has the following subcommands:

```console
$ ais cluster <TAB-TAB>
show             remote-detach    set-primary      decommission     reset-stats
remote-attach    rebalance        shutdown         add-remove-nodes
```

> **Important:** with the single exception of [`add-remove-nodes`](#adding-removing-nodes), all the other commands listed above operate at the level of the **entire** cluster. Node-level operations (e.g., shutting down a selected node) can be found under `add-remove-nodes`.

Alternatively, use `--help` to show subcommands with brief descriptions:

```console
$ ais cluster --help
NAME:
   ais cluster - monitor and manage AIS cluster: add/remove nodes, change primary gateway, etc.

USAGE:
   ais cluster command [command options] [arguments...]

COMMANDS:
   show               show cluster nodes and utilization
   remote-attach      attach remote ais cluster
   remote-detach      detach remote ais cluster
   rebalance          administratively start and stop global rebalance; show global rebalance
   set-primary        select a new primary proxy/gateway
   shutdown           shut down entire cluster
   decommission       decommission entire cluster
   add-remove-nodes   manage cluster membership (add/remove nodes, temporarily or permanently)
   reset-stats        reset cluster or node stats (all cumulative metrics or only errors)
```

As always, each subcommand has its own help and usage examples (the latter possibly spread across multiple documents).

> For any keyword or text of any kind, you can easily look up examples and descriptions (if available) via a simple `find`, for instance:

```console
$ find . -type f -name "*.md" | xargs grep "ais.*mountpath"
```

Note that there is a single CLI command to [grow](#join-a-node) a cluster, and multiple commands to scale it down.

Scaling down can be done gracefully or forcefully, and also temporarily or permanently.

For background, usage examples, and details, please see [this document](/docs/leave_cluster.md).

# Adding/removing nodes

The corresponding functionality can be found under the subcommand called `add-remove-nodes`:

```console
$ ais cluster add-remove-nodes --help
NAME:
   ais cluster add-remove-nodes - manage cluster membership (add/remove nodes, temporarily or permanently)

USAGE:
   ais cluster add-remove-nodes command [command options] [arguments...]

COMMANDS:
   join               add a node to the cluster
   start-maintenance  put node in maintenance mode, temporarily suspend its operation
   stop-maintenance   activate node by taking it back from "maintenance"
   decommission       safely and permanently remove node from the cluster

   shutdown           shutdown a node, gracefully or immediately;
                      note: upon shutdown the node won't be decommissioned - it'll remain in the cluster map
                      and can be manually restarted to rejoin the cluster at any later time;
                      see also: 'ais advanced remove-from-smap'
```

## Table of Contents
- [Cluster and Node status](#cluster-and-node-status)
- [Show cluster map](#show-cluster-map)
- [Show cluster stats](#show-cluster-stats)
- [Show disk stats](#show-disk-stats)
- [Join a node](#join-a-node)
- [Remove a node](#remove-a-node)
- [Remote AIS cluster](#remote-ais-cluster)
  - [Attach remote cluster](#attach-remote-cluster)
  - [Detach remote cluster](#detach-remote-cluster)
  - [Show remote clusters](#show-remote-clusters)
- [Reset (i.e., zero out) stats counters and other metrics](#reset-ie-zero-out-stats-counters-and-other-metrics)

## Cluster and Node status

The
command has a rather long(ish) short description and multiple subcommands:

```console
$ ais show cluster --help
NAME:
   ais show cluster - show cluster nodes and utilization

USAGE:
   ais show cluster command [command options] [NODE_ID] | [target [NODE_ID]] | [proxy [NODE_ID]] |
                            [smap [NODE_ID]] | [bmd [NODE_ID]] | [config [NODE_ID]] | [stats [NODE_ID]]

COMMANDS:
   smap    show Smap (cluster map)
   bmd     show BMD (bucket metadata)
   config  show cluster and node configuration
   stats   (alias for "ais show performance") show performance counters, throughput, latency, and more (press <TAB-TAB> to select specific view)

OPTIONS:
   --refresh value   interval for continuous monitoring;
                     valid time units: ns, us (or µs), ms, s (default), m, h
   --count value     used together with '--refresh' to limit the number of generated reports (default: 0)
   --json, -j        json input/output
   --no-headers, -H  display tables without headers
   --help, -h        show help
```

To quickly exemplify, let's assume the cluster has a (target) node called `t[xyz]`. Then:


### show cluster: all nodes (including t[xyz]) and gateways, as well as deployed version and runtime stats
```console
$ ais show cluster
```

### show all target (nodes) and, again, runtime statistics, software version, deployment type, K8s pods, and more
```console
$ ais show cluster target
```

### show specific target
```console
$ ais show cluster target t[xyz]
```

### ask specific target to show its cluster map
```console
$ ais show cluster smap t[xyz]
```

and so on and so forth.

### Notes

> The last example (above) may make sense when troubleshooting. Otherwise, by design and implementation, cluster map (`Smap`), bucket metadata (`BMD`), and all other cluster-level metadata exist in identical protected and versioned replicas on all nodes at any given point in time.

> Still, to display cluster map in its (JSON) fullness, run:

```console
$ ais show cluster smap --json
```

> The `--json` option is supported almost universally across the CLI.

> Similar to all other `show` commands, `ais cluster show` is an alias for `ais show cluster`. Both can be used interchangeably.

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--json, -j` | `bool` | Output in JSON format | `false` |
| `--count` | `int` | Can be used in combination with `--refresh` option to limit the number of generated reports | `0` |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds) | ` ` |
| `--no-headers` | `bool` | Display tables without headers | `false` |

### Examples

```console
$ ais show cluster
PROXY            MEM USED %      MEM AVAIL       UPTIME
pufGp8080[P]     0.28%           15.43GiB        17m
ETURp8083        0.26%           15.43GiB        17m
sgahp8082        0.26%           15.43GiB        17m
WEQRp8084        0.27%           15.43GiB        17m
Watdp8081        0.26%           15.43GiB        17m

TARGET           MEM USED %      MEM AVAIL       CAP USED %      CAP AVAIL       CPU USED %      REBALANCE       UPTIME
iPbHt8088        0.28%           15.43GiB        14.00%          1.178TiB        0.13%           -               17m
Zgmlt8085        0.28%           15.43GiB        14.00%          1.178TiB        0.13%           -               17m
oQZCt8089        0.28%           15.43GiB        14.00%          1.178TiB        0.14%           -               17m
dIzMt8086        0.28%           15.43GiB        14.00%          1.178TiB        0.13%           -               17m
YodGt8087        0.28%           15.43GiB        14.00%          1.178TiB        0.14%           -               17m

Summary:
 Proxies:       5 (0 - unelectable)
 Targets:       5
 Primary Proxy: pufGp8080
 Smap Version:  14
 Deployment:    dev
```

## Show cluster map

`ais show cluster smap [NODE_ID]`

Show a copy of the cluster
map (Smap) stored on `NODE_ID`.

If `NODE_ID` is not given, show cluster map from (primary or secondary) proxy "pointed to" by your local CLI configuration (`ais config cli`) or the `AIS_ENDPOINT` environment variable.

> Note that cluster map (`Smap`), bucket metadata (`BMD`), and all other cluster-level metadata exist in identical protected and versioned replicas on all nodes at any given point in time.

Useful variations include `ais show cluster smap --json` (to see the unabridged version), and also:

```console
$ ais show cluster smap --refresh 5
```

The latter will periodically (until Ctrl-C) show the cluster map at 5-second intervals, which may be useful in the presence of membership changes (e.g., during cluster startup).

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--count` | `int` | Can be used in combination with `--refresh` option to limit the number of generated reports | `0` |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds) | ` ` |
| `--json, -j` | `bool` | Output in JSON format | `false` |

### Examples

#### Show smap from a given node

Ask a specific node for its cluster map (Smap) replica:

```console
$ ais show cluster smap <TAB-TAB>
... p[ETURp8083] ...

$ ais show cluster smap p[ETURp8083]
NODE             TYPE     PUBLIC URL
ETURp8083        proxy    http://127.0.0.1:8083
WEQRp8084        proxy    http://127.0.0.1:8084
Watdp8081        proxy    http://127.0.0.1:8081
pufGp8080[P]     proxy    http://127.0.0.1:8080
sgahp8082        proxy    http://127.0.0.1:8082

NODE             TYPE     PUBLIC URL
YodGt8087        target   http://127.0.0.1:8087
Zgmlt8085        target   http://127.0.0.1:8085
dIzMt8086        target   http://127.0.0.1:8086
iPbHt8088        target   http://127.0.0.1:8088
oQZCt8089        target   http://127.0.0.1:8089

Non-Electable:

Primary Proxy: pufGp8080
Proxies: 5       Targets: 5      Smap Version: 14
```

## Show cluster stats

`ais show cluster stats` is an alias for `ais show performance`.

The latter is the primary implementation and the preferred way to investigate cluster performance, while `ais show cluster stats` is retained in part for convenience and in part for backward compatibility.

```console
$ ais show cluster stats <TAB-TAB>
counters     throughput   latency      capacity     disk

$ ais show cluster stats --help
NAME:
   ais show cluster stats - (alias for "ais show performance") show performance counters, throughput, latency, and more (press <TAB-TAB> to select specific view)

USAGE:
   ais show cluster stats command [command options] [TARGET_ID]

COMMANDS:
   counters    show (GET, PUT, DELETE, RENAME, EVICT, APPEND) object counts, as well as:
               - numbers of list-objects requests;
               - (GET, PUT, etc.) cumulative and average sizes;
               - associated error counters, if any, and more.
   throughput  show GET and PUT throughput, associated (cumulative, average) sizes and counters
   latency     show GET, PUT, and APPEND latencies and average sizes
   capacity    show target mountpaths, disks, and used/available capacity
   disk        show disk utilization and read/write statistics

OPTIONS:
   --refresh value   interval for continuous monitoring;
                     valid time units: ns, us (or µs), ms, s (default), m, h
   --count value     used together with '--refresh' to limit the number of generated reports (default: 0)
   --all             when printing tables, show all columns including those that have only zero values
   --no-headers, -H  display tables without headers
   --regex value     regular expression to select table columns (case-insensitive), e.g.: --regex "put|err"
   --units value     show statistics and/or parse command-line specified sizes using one of the following _units of measurement_:
                     iec - IEC format, e.g.: KiB, MiB, GiB (default)
                     si  - SI (metric) format, e.g.: KB, MB, GB
                     raw - do not convert to (or from) human-readable format
   --average-size    show average GET, PUT, etc. request size
   --help, -h        show help
```

See also:

* [`ais show performance`](/docs/cli/show.md)

## Show disk stats

`ais show storage disk [TARGET_ID]` - show disk utilization and read/write statistics

```console
$ ais show storage disk --help
NAME:
   ais show storage disk - show disk utilization and read/write statistics

USAGE:
   ais show storage disk [command options] [TARGET_ID]

OPTIONS:
   --refresh value   interval for continuous monitoring;
                     valid time units: ns, us (or µs), ms, s (default), m, h
   --count value     used together with '--refresh' to limit the number of generated reports (default: 0)
   --no-headers, -H  display tables without headers
   --units value     show statistics and/or parse command-line specified sizes using one of the following _units of measurement_:
                     iec - IEC format, e.g.: KiB, MiB, GiB (default)
                     si  - SI (metric) format, e.g.: KB, MB, GB
                     raw - do not convert to (or from) human-readable format
   --regex value     regular expression to select table columns (case-insensitive), e.g.: --regex "put|err"
   --summary         tally up target disks to show per-target read/write summary stats and average utilizations
   --help, -h        show help
```

When `TARGET_ID` is not given, disk stats for all targets will be shown and aggregated.

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--json, -j` | `bool` | Output in JSON format | `false` |
| `--count` | `int` | Can be used in combination with `--refresh` option to limit the number of generated reports | `0` |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds) | ` ` |
| `--no-headers` | `bool` | Display tables without headers | `false` |

### Examples

#### Display disk stats reports N times every M seconds

Display 2 reports of all targets' disk statistics, with a 10s interval between reports.

```console
$ ais show storage disk --count 2 --refresh 10s
Target          Disk    Read            Write           %Util
163171t8088     sda     6.00KiB/s       171.00KiB/s     49
948212t8089     sda     6.00KiB/s       171.00KiB/s     49
41981t8085      sda     6.00KiB/s       171.00KiB/s     49
490062t8086     sda     6.00KiB/s       171.00KiB/s     49
164472t8087     sda     6.00KiB/s       171.00KiB/s     49

Target          Disk    Read            Write           %Util
163171t8088     sda     1.00KiB/s       4.26MiB/s       96
41981t8085      sda     1.00KiB/s       4.26MiB/s       96
948212t8089     sda     1.00KiB/s       4.26MiB/s       96
490062t8086     sda     1.00KiB/s       4.29MiB/s       96
164472t8087     sda     1.00KiB/s       4.26MiB/s       96
```

## Join a node

`ais cluster add-remove-nodes join --role=proxy IP:PORT`

Join a proxy to the cluster.

`ais cluster add-remove-nodes join --role=target IP:PORT`

Join a target to the cluster.

Note: The node will try to join the cluster using an ID it detects (either in the filesystem's xattrs or on disk) or that it generates for itself.
If you would like to specify an ID, you can do so while starting the [`aisnode` executable](/docs/command_line.md).

### Examples

#### Join node

Join a proxy node with socket address `192.168.0.185:8086`

```console
$ ais cluster add-remove-nodes join --role=proxy 192.168.0.185:8086
Proxy with ID "23kfa10f" successfully joined the cluster.
```

## Remove a node

**Temporarily remove an existing node from the cluster:**

`ais cluster add-remove-nodes start-maintenance NODE_ID`
`ais cluster add-remove-nodes stop-maintenance NODE_ID`

Starting maintenance puts the node in maintenance mode, and the cluster gradually transitions to
operating without the specified node (which is labeled `maintenance` in the cluster map). Stopping
maintenance will revert this.

`ais cluster add-remove-nodes shutdown NODE_ID`

Shutting down a node will put the node in maintenance mode first, and then shut down the `aisnode`
process on the node.


**Permanently remove an existing node from the cluster:**

`ais cluster add-remove-nodes decommission NODE_ID`

Decommissioning a node will safely remove a node from the cluster by triggering a cluster-wide
rebalance first. This can be avoided by specifying `--no-rebalance`.


### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--no-rebalance` | `bool` | By default, `ais cluster add-remove-nodes start-maintenance` and `ais cluster add-remove-nodes decommission` trigger a global cluster-wide rebalance. The `--no-rebalance` flag disables automatic rebalance, thus providing the administrative option to rebalance the cluster manually at a later time. BEWARE: advanced usage only! | `false` |

### Examples

#### Decommission node

**Permanently remove proxy p[omWp8083] from the cluster:**

```console
$ ais cluster add-remove-nodes decommission <TAB-TAB>
p[cFOp8082]   p[Hqhp8085]   p[omWp8083]   t[bFat8087]   t[Icjt8089]   t[ofPt8091]
p[dpKp8084]   p[NGVp8081]   p[Uerp8080]   t[erbt8086]   t[IDDt8090]   t[TKSt8088]

$ ais cluster add-remove-nodes decommission p[omWp8083]

Node "omWp8083" has been successfully removed from the cluster.
```

**To terminate `aisnode` on a given machine, use the `shutdown` command, e.g.:**

```console
$ ais cluster add-remove-nodes shutdown t[23kfa10f]
```

Similar to `start-maintenance`, `shutdown` triggers a global rebalance and then shuts down the corresponding `aisnode` process (target `t[23kfa10f]` in the example above).

#### Temporarily put node in maintenance

```console
$ ais show cluster
PROXY            MEM USED %      MEM AVAIL       UPTIME
202446p8082      0.09%           31.28GiB        70s
279128p8080[P]   0.11%           31.28GiB        80s

TARGET           MEM USED %      MEM AVAIL       CAP USED %      CAP AVAIL       CPU USED %      REBALANCE       UPTIME
147665t8084      0.10%           31.28GiB        16%             2.458TiB        0.12%           -               70s
165274t8087      0.10%           31.28GiB        16%             2.458TiB        0.12%           -               70s

$ ais cluster add-remove-nodes start-maintenance 147665t8084
$ ais show cluster
PROXY            MEM USED %      MEM AVAIL       UPTIME
202446p8082      0.09%           31.28GiB        70s
279128p8080[P]   0.11%           31.28GiB        80s

TARGET           MEM USED %      MEM AVAIL       CAP USED %      CAP AVAIL       CPU USED %      REBALANCE       UPTIME  STATUS
147665t8084      0.10%           31.28GiB        16%             2.458TiB        0.12%           -               71s     maintenance
165274t8087      0.10%           31.28GiB        16%             2.458TiB        0.12%           -               71s     online
```

#### Take a node out of maintenance

```console
$ ais cluster add-remove-nodes stop-maintenance t[147665t8084]
$ ais show cluster
PROXY            MEM USED %      MEM AVAIL       UPTIME
202446p8082      0.09%           31.28GiB        80s
279128p8080[P]   0.11%           31.28GiB        90s

TARGET           MEM USED %      MEM AVAIL       CAP USED %      CAP AVAIL       CPU USED %      REBALANCE       UPTIME
147665t8084      0.10%           31.28GiB        16%             2.458TiB        0.12%           -               80s
165274t8087      0.10%           31.28GiB        16%             2.458TiB        0.12%           -               80s
```

## Remote AIS cluster

Given an arbitrary pair of AIS clusters A and B, cluster B can be *attached* to cluster A, thus providing (to A) a fully-accessible (list-able, readable, writeable) *backend*.

For background, terminology, and definitions, and for many more usage examples, please see:

* [Remote AIS cluster](/docs/providers.md#remote-ais-cluster)
* [Usage examples and easy-to-use scripts for developers](/docs/development.md)

### Attach remote cluster

`ais cluster remote-attach UUID=URL [UUID=URL...]`

or

`ais cluster remote-attach ALIAS=URL [ALIAS=URL...]`

Attach a remote AIS cluster to a local one via the remote cluster's public URL. An alias (a user-defined name) can be used instead of the cluster UUID for convenience.
For more details and background on *remote clustering*, please refer to this [document](/docs/providers.md).

#### Examples

Attach two remote clusters: the first by its UUID, the second via a user-friendly alias (`two`).

```console
$ ais cluster remote-attach a345e890=http://one.remote:51080 two=http://two.remote:51080
```

### Detach remote cluster

`ais cluster remote-detach UUID|ALIAS`

Detach a remote cluster using its alias or UUID.

#### Examples

The example below assumes that the remote cluster has the user-given alias `two`:

```console
$ ais cluster remote-detach two
```

### Show remote clusters

`ais show remote-cluster`

Show details about attached remote clusters.

#### Examples
The following two commands attach and then show the remote cluster at the address `my.remote.ais:51080`:

```console
$ ais cluster remote-attach alias111=http://my.remote.ais:51080
Remote cluster (alias111=http://my.remote.ais:51080) successfully attached
$ ais show remote-cluster
UUID      URL                     Alias     Primary         Smap  Targets  Online
eKyvPyHr  my.remote.ais:51080     alias111  p[80381p11080]  v27   10       yes
```

Notice that:

* user can assign an arbitrary name (aka alias) to a given remote cluster
* the remote cluster does *not* have to be online at attachment time; offline or currently unreachable clusters are shown as follows:

```console
$ ais show remote-cluster
UUID        URL                       Alias     Primary      Smap  Targets  Online
eKyvPyHr    my.remote.ais:51080       alias111  p[primary1]  v27   10       no
<alias222>  <other.remote.ais:51080>            n/a          n/a   n/a      no
```

Notice the difference between the first and the second lines in the printout above: while both clusters appear to be currently offline (see the rightmost column), the first one was accessible at some earlier time and therefore we show that it has (in this example) 10 storage nodes and other details.
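
The offline rows in a printout like the one above can also be picked out mechanically. What follows is a minimal sketch, assuming the table layout shown in this section (one header row, the cluster UUID or `<alias>` placeholder in the first column, and the online status in the last column). The function name `offline_remotes` is hypothetical and is not part of the `ais` CLI.

```shell
#!/bin/sh
# Hypothetical helper (not part of the ais CLI): read an
# `ais show remote-cluster` table on stdin and print the first column
# (UUID or <alias> placeholder) of every row whose last column is "no".
# Assumes a single header row, as in the printouts above.
offline_remotes() {
  awk 'NR > 1 && $NF == "no" { print $1 }'
}

# Demo on the sample printout from this section. On a live cluster, one
# would pipe the output of `ais show remote-cluster` into the function.
offline_remotes <<'EOF'
UUID        URL                       Alias     Primary      Smap  Targets  Online
eKyvPyHr    my.remote.ais:51080       alias111  p[primary1]  v27   10       no
<alias222>  <other.remote.ais:51080>            n/a          n/a   n/a      no
EOF
```

Matching on the last field rather than a fixed column number keeps the filter working for rows where some columns are blank, as in the second row of the sample.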

To `detach` any of the previously configured associations, simply run:

```console
$ ais cluster remote-detach alias111
$ ais show remote-cluster
UUID        URL                       Alias     Primary      Smap  Targets  Online
<alias222>  <other.remote.ais:51080>            n/a          n/a   n/a      no
```

## Reset (i.e., zero out) stats counters and other metrics

`ais cluster reset-stats`

### Example and usage

```console
$ ais cluster reset-stats --help
NAME:
   ais cluster reset-stats - reset cluster or node stats (all cumulative metrics or only errors)

USAGE:
   ais cluster reset-stats [command options] [NODE_ID]

OPTIONS:
   --errors-only  reset only error counters
   --help, -h     show help
```

Let's go ahead and reset all error counters:

```console
$ ais cluster reset-stats --errors-only
Cluster error metrics successfully reset
```
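
Per the usage line above, `reset-stats` also accepts an optional `NODE_ID`. The following is a small sketch of a wrapper that covers both the cluster-wide and the per-node case. The function name `reset_error_counters` and the `AIS_BIN` override are assumptions made for illustration; only `ais cluster reset-stats --errors-only [NODE_ID]` comes from the help text, and the node ID `t[xyz]` is the placeholder used earlier in this document.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the ais CLI): reset error counters only,
# cluster-wide when called without arguments, or for a single node when
# given a NODE_ID. AIS_BIN is an assumed override that allows previewing
# the command line (e.g., AIS_BIN=echo) instead of running `ais`.
reset_error_counters() {
  "${AIS_BIN:-ais}" cluster reset-stats --errors-only "$@"
}

# Dry run: print the command line that would be executed for node t[xyz].
# The node ID is quoted so that the shell does not treat [xyz] as a glob.
AIS_BIN=echo reset_error_counters 't[xyz]'
```

Without `AIS_BIN` set, the function runs the real `ais` binary, so the dry run is a convenient way to check the arguments first.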