---
layout: post
title: DSORT
permalink: /docs/cli/dsort
redirect_from:
 - /cli/dsort.md/
 - /docs/cli/dsort.md/
---

# Start, Stop, and monitor distributed parallel sorting (dSort)

For background and an in-depth presentation, please see this [document](/docs/dsort.md).

- [Usage](#usage)
- [Example](#example)
- [Generate Shards](#generate-shards)
- [Start dSort job](#start-dsort-job)
- [Show dSort jobs and job status](#show-dsort-jobs-and-job-status)
- [Stop dSort job](#stop-dsort-job)
- [Remove dSort job](#remove-dsort-job)
- [Wait for dSort job](#wait-for-dsort-job)

## Usage

`ais dsort [command options] [JSON_SPECIFICATION|YAML_SPECIFICATION|-] [SRC_BUCKET] [DST_BUCKET]`

```console
$ ais dsort --help
NAME:
   ais dsort - (alias for "job start dsort") start dsort job
   Required parameters:
        - input_bck: source bucket (used as both source and destination if the latter not specified)
        - input_format: (see docs and examples below)
        - output_format: (ditto)
        - output_shard_size: (as the name implies)
   E.g. inline JSON spec:
        $ ais start dsort '{
          "extension": ".tar",
          "input_bck": {"name": "dsort-testing"},
          "input_format": {"template": "shard-{0..9}"},
          "output_shard_size": "200KB",
          "description": "pack records into categorized shards",
          "order_file": "http://website.web/static/order_file.txt",
          "order_file_sep": " "
        }'
   E.g. inline YAML spec:
        $ ais start dsort -f - <<EOM
          extension: .tar
          input_bck:
              name: dsort-testing
          input_format:
              template: shard-{0..9}
          output_format: new-shard-{0000..1000}
          output_shard_size: 10KB
          description: shuffle shards from 0 to 9
          algorithm:
              kind: shuffle
          EOM
   Tip: use '--dry-run' to see the results without making any changes
   Tip: use '--verbose' to print the spec (with all its parameters including applied defaults)
   See also: docs/dsort.md, docs/cli/dsort.md, and ais/test/scripts/dsort*

USAGE:
   ais dsort [command options] [JSON_SPECIFICATION|YAML_SPECIFICATION|-] [SRC_BUCKET] [DST_BUCKET]

OPTIONS:
   --file value, -f value  path to JSON or YAML job specification
   --verbose, -v           verbose
   --help, -h              show help
```

## Example

This example runs the [ais/test/scripts/dsort-ex1-spec.json](https://github.com/NVIDIA/aistore/blob/main/ais/test/scripts/dsort-ex1-spec.json) specification. The source and destination buckets - `ais://src` and `ais://dst`, respectively - must exist.

Further, the source bucket must have at least 10 shards with names that match `input_format` (see below).

Notice the `-v` (`--verbose`) switch as well.

```console
$ ais start dsort ais://src ais://dst -f ais/test/scripts/dsort-ex1-spec.json --verbose
PROPERTY                         VALUE
algorithm.content_key_type       -
algorithm.decreasing             false
algorithm.extension              -
algorithm.kind                   alphanumeric
algorithm.seed                   -
create_concurrency_max_limit     0
description                      sort shards alphanumerically
dry_run                          false
dsorter_type                     -
extension                        .tar
extract_concurrency_max_limit    0
input_bck                        ais://src
input_format.objnames            -
input_format.template            shard-{0..9}
max_mem_usage                    -
order_file                       -
order_file_sep                   \t
output_bck                       ais://dst
output_format                    new-shard-{0000..1000}
output_shard_size                10KB

Config override:                 none

srt-M8ld-VU_i
```

## Generate Shards

`ais archive gen-shards "BUCKET/TEMPLATE.EXT"`

Put randomly generated shards into a bucket. The main use case for this command is dSort testing; a brief example follows below.
[Further reference for this command can be found here.](archive.md#generate-shards)
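For instance, a minimal, illustrative invocation that would populate an `ais://dsort-testing` bucket (the bucket name and the template here are assumptions, chosen to match the specs used throughout this document) with the 10 input shards expected by the examples that follow; see [archive.md](archive.md#generate-shards) for the full set of supported options:

```console
# generate shard-0.tar through shard-9.tar, each filled with randomly generated files
$ ais archive gen-shards "ais://dsort-testing/shard-{0..9}.tar"
```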
## Start dSort job

`ais start dsort JOB_SPEC` or `ais start dsort -f <PATH_TO_JOB_SPEC>`

Start a new dSort job with the provided specification.
The specification should be provided either as an argument or via the `-f` flag; providing both an argument and the flag will result in an error.
Upon creation, the job's `JOB_ID` is returned; it can then be used to abort the job or retrieve its metrics.

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--file, -f` | `string` | Path to file containing JSON or YAML job specification. Providing `-` will result in reading from STDIN | `""` |

The following table describes the JSON/YAML keys which can be used in the specification.

| Key | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| `extension` | `string` | extension of input and output shards (either `.tar`, `.tgz` or `.zip`) | yes | |
| `input_format.template` | `string` | name template for input shard | yes | |
| `output_format` | `string` | name template for output shard | yes | |
| `input_bck.name` | `string` | bucket name where shard objects are stored | yes | |
| `input_bck.provider` | `string` | bucket backend provider, see [docs](/docs/providers.md) | no | `"ais"` |
| `output_bck.name` | `string` | bucket name where new output shards will be saved | no | same as `input_bck.name` |
| `output_bck.provider` | `string` | bucket backend provider, see [docs](/docs/providers.md) | no | same as `input_bck.provider` |
| `description` | `string` | description of dSort job | no | `""` |
| `output_shard_size` | `string` | size (in bytes) of the output shard; can be a raw number (`10240`) or suffixed (`10KB`) | yes | |
| `algorithm.kind` | `string` | determines which sorting algorithm the dSort job uses; available options are: `"alphanumeric"`, `"shuffle"`, `"content"` | no | `"alphanumeric"` |
| `algorithm.decreasing` | `bool` | determines if the algorithm should sort the records in decreasing or increasing order, used for `kind=alphanumeric` or `kind=content` | no | `false` |
| `algorithm.seed` | `string` | seed provided to the random generator, used when `kind=shuffle` | no | `""` - `time.Now()` is used |
| `algorithm.extension` | `string` | content of the file with the provided extension will be used as the sorting key, used when `kind=content` | yes (only when `kind=content`) | |
| `algorithm.content_key_type` | `string` | content key type; may have one of the following values: `"int"`, `"float"`, or `"string"`; used exclusively with `kind=content` sorting | yes (only when `kind=content`) | |
| `order_file` | `string` | URL to the file containing the external key map (it should contain lines in the format: `record_key[sep]shard-%d-fmt`) | yes (only when `output_format` not provided) | `""` |
| `order_file_sep` | `string` | separator used for splitting `record_key` and `shard-%d-fmt` in the lines of the external key map | no | `\t` (TAB) |
| `max_mem_usage` | `string` | limits the amount of total system memory allocated by both dSort and other running processes. If and when this threshold is crossed, dSort will continue extracting onto local drives. Can be in the form of 60% or 10GB | no | same as in `/deploy/dev/local/aisnode_config.sh` |
| `extract_concurrency_max_limit` | `int` | limits maximum number of concurrent shards extracted per disk | no | (calculated based on different factors) ~50 |
| `create_concurrency_max_limit` | `int` | limits maximum number of concurrent shards created per disk | no | (calculated based on different factors) ~50 |

It is also possible to override some of the values from the global `distributed_sort` config via the job specification (see the sketch after the table below).
All of these values are optional; if a value is left empty, the one from the global `distributed_sort` config is used.
For more information refer to [configuration](/docs/configuration.md).

| Key | Type | Description |
| --- | --- | --- |
| `duplicated_records` | `string` | what to do when duplicated records are found: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort dSort operation |
| `missing_shards` | `string` | what to do when missing shards are detected: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort dSort operation |
| `ekm_malformed_line` | `string` | what to do when the external key map contains a malformed line: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort dSort operation |
| `ekm_missing_key` | `string` | what to do when the external key map has a missing key: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort dSort operation |
| `dsorter_mem_threshold` | `string` | minimum free-memory threshold that activates the specialized dsorter type, which uses memory during the creation phase; benchmarks show that this dsorter type performs better than the general type |
### Examples

#### Sort records inside the shards

The command defined below starts an (alphanumeric) sorting job with extended metrics for **input** shards with names `shard-0.tar`, `shard-1.tar`, ..., `shard-9.tar`.
Each of the **output** shards will have at least `10240` bytes (`10KB`) and will be named `new-shard-0000.tar`, `new-shard-0001.tar`, ...

Assuming that `dsort_spec.json` contains:

```json
{
    "extension": ".tar",
    "input_bck": {"name": "dsort-testing"},
    "input_format": {
        "template": "shard-{0..9}"
    },
    "output_format": "new-shard-{0000..1000}",
    "output_shard_size": "10KB",
    "description": "sort shards from 0 to 9",
    "algorithm": {
        "kind": "alphanumeric"
    }
}
```

You can start the dSort job with:

```console
$ ais start dsort -f dsort_spec.json
JGHEoo89gg
```

#### Shuffle records

The command defined below starts a basic shuffle job for **input** shards with names `shard-0.tar`, `shard-1.tar`, ..., `shard-9.tar`.
Each of the **output** shards will have at least `10240` bytes (`10KB`) and will be named `new-shard-0000.tar`, `new-shard-0001.tar`, ...

```console
$ ais start dsort -f - <<EOM
extension: .tar
input_bck:
    name: dsort-testing
input_format:
    template: shard-{0..9}
output_format: new-shard-{0000..1000}
output_shard_size: 10KB
description: shuffle shards from 0 to 9
algorithm:
    kind: shuffle
EOM
JGHEoo89gg
```

#### Pack records into shards with different categories - EKM (External Key Map)

One of the key features of dSort is that the user can specify the exact mapping from record key to output shard.
To use this feature, `output_format` should be empty, and `order_file`, as well as `order_file_sep`, must be set.
The output shards will be created with the provided format, which must contain the mandatory `%d` used to enumerate the shards.

Assuming that `order_file` (URL: `http://website.web/static/order_file.txt`) has the content:

```
cat_0.txt shard-cats-%d
cat_1.txt shard-cats-%d
...
dog_0.txt shard-dogs-%d
dog_1.txt shard-dogs-%d
...
car_0.txt shard-car-%d
car_1.txt shard-car-%d
...
```

or, if `order_file` (URL: `http://website.web/static/order_file.json`, notice the `.json` extension) has the content:

```json
{
  "shard-cats-%d": [
    "cat_0.txt",
    "cat_1.txt",
    ...
  ],
  "shard-dogs-%d": [
    "dog_0.txt",
    "dog_1.txt",
    ...
  ],
  "shard-car-%d": [
    "car_0.txt",
    "car_1.txt",
    ...
  ],
  ...
}
```

and the content of the **input** shards looks more or less like this:

```
shard-0.tar:
- cat_0.txt
- dog_0.txt
- car_0.txt
...
shard-1.tar:
- cat_1.txt
- dog_1.txt
- car_1.txt
...
```

you can run:

```console
$ ais start dsort '{
    "extension": ".tar",
    "input_bck": {"name": "dsort-testing"},
    "input_format": {"template": "shard-{0..9}"},
    "output_shard_size": "200KB",
    "description": "pack records into categorized shards",
    "order_file": "http://website.web/static/order_file.txt",
    "order_file_sep": " "
}'
JGHEoo89gg
```

After the run, the **output** shards will look more or less like this (the number of records in a given shard depends on the provided `output_shard_size`):

```
shard-cats-0.tar:
- cat_1.txt
- cat_2.txt
shard-cats-1.tar:
- cat_3.txt
- cat_4.txt
...
shard-dogs-0.tar:
- dog_1.txt
- dog_2.txt
...
```

## Show dSort jobs and job status

`ais show job dsort [JOB_ID]`

Retrieve the status of the dSort job with the provided `JOB_ID`, which is returned upon creation.
Lists all dSort jobs if the `JOB_ID` argument is omitted.

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--regex` | `string` | Regex for the description of dSort jobs | `""` |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds). E.g.: `--refresh 2s` | ` ` |
| `--verbose, -v` | `bool` | Show detailed metrics | `false` |
| `--log` | `string` | Path to file where the metrics will be saved (does not work with progress bar) | `/tmp/dsort_run.txt` |
| `--json, -j` | `bool` | Show only JSON metrics | `false` |

### Examples

#### Show dSort jobs with description matching provided regex

Show all dSort jobs with descriptions starting with the `sort ` prefix.

```console
$ ais show job dsort --regex "^sort (.*)"
JOB ID       STATUS     START            FINISH           DESCRIPTION
nro_Y5h9n    Finished   03-16 11:39:07   03-16 11:39:07   sort shards from 0 to 9
Key_Y5h9n    Finished   03-16 11:39:23   03-16 11:39:23   sort shards from 10 to 19
enq9Y5Aqn    Finished   03-16 11:39:34   03-16 11:39:34   sort shards from 20 to 29
```

#### Save metrics to log file

Save newly fetched metrics of the dSort job with ID `5JjIuGemR` to the `/tmp/dsort_run.txt` file every `500` milliseconds:

```console
$ ais show job dsort 5JjIuGemR --refresh 500ms --log "/tmp/dsort_run.txt"
Dsort job has finished successfully in 21.948806ms:
  Longest extraction:   1.49907ms
  Longest sorting:      8.288299ms
  Longest creation:     4.553µs
```

#### Show only JSON metrics

```console
$ ais show job dsort 5JjIuGemR --json
{
  "825090t8089": {
    "local_extraction": {
      "started_time": "2020-05-28T09:53:42.466267891-04:00",
      "end_time": "2020-05-28T09:53:42.50773835-04:00",
      ....
    },
    ....
  },
  ....
}
```

#### Show only JSON metrics filtered by daemon id

```console
$ ais show job dsort 5JjIuGemR 766516t8087 --json
{
  "766516t8087": {
    "local_extraction": {
      "started_time": "2020-05-28T09:53:42.466267891-04:00",
      "end_time": "2020-05-28T09:53:42.50773835-04:00",
      ....
    },
    ....
  }
}
```

#### Using jq to filter out the JSON formatted metric output

Show the running status of the meta-sorting phase for all targets.

```console
$ ais show job dsort 5JjIuGemR --json | jq .[].meta_sorting.running
false
false
true
false
```

Show the number of shards created by each target, along with the target IDs.

```console
$ ais show job dsort 5JjIuGemR --json | jq 'to_entries[] | [.key, .value.shard_creation.created_count]'
[
  "766516t8087",
  "189"
]
[
  "710650t8086",
  "207"
]
[
  "825090t8089",
  "211"
]
[
  "743838t8088",
  "186"
]
[
  "354275t8085",
  "207"
]
```

## Stop dSort job

`ais stop dsort JOB_ID`

Stop the dSort job with the given `JOB_ID`.
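For example, to abort the (hypothetical) job started in the examples above:

```console
# abort the running dSort job by its JOB_ID
$ ais stop dsort JGHEoo89gg
```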
## Remove dSort job

`ais job rm dsort JOB_ID`

Remove the finished dSort job with the given `JOB_ID` from the job list.

## Wait for dSort job

`ais wait dsort JOB_ID`

or, same:

`ais wait JOB_ID`

Wait for the dSort job with the given `JOB_ID` to finish.

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds) | `1s` |
| `--progress` | `bool` | Displays progress bar | `false` |
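For example, a brief sketch that waits for the (hypothetical) job used in the examples above while displaying a progress bar, and then removes the finished job from the job list:

```console
# block until the job finishes, showing a progress bar
$ ais wait dsort JGHEoo89gg --progress

# once finished, drop the job from the job list
$ ais job rm dsort JGHEoo89gg
```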