---
layout: post
title: DSORT
permalink: /docs/cli/dsort
redirect_from:
 - /cli/dsort.md/
 - /docs/cli/dsort.md/
---

# Start, stop, and monitor distributed parallel sorting (dSort)

For background and an in-depth presentation, please see this [document](/docs/dsort.md).

- [Usage](#usage)
- [Example](#example)
- [Generate Shards](#generate-shards)
- [Start dSort job](#start-dsort-job)
- [Show dSort jobs and job status](#show-dsort-jobs-and-job-status)
- [Stop dSort job](#stop-dsort-job)
- [Remove dSort job](#remove-dsort-job)
- [Wait for dSort job](#wait-for-dsort-job)


## Usage

`ais dsort [command options] [JSON_SPECIFICATION|YAML_SPECIFICATION|-] [SRC_BUCKET] [DST_BUCKET]`

```console
$ ais dsort --help
NAME:
   ais dsort - (alias for "job start dsort") start dsort job
   Required parameters:
              - input_bck: source bucket (used as both source and destination if the latter not specified)
              - input_format: (see docs and examples below)
              - output_format: (ditto)
              - output_shard_size: (as the name implies)
   E.g. inline JSON spec:
                $ ais start dsort '{
                  "extension": ".tar",
                  "input_bck": {"name": "dsort-testing"},
                  "input_format": {"template": "shard-{0..9}"},
                  "output_shard_size": "200KB",
                  "description": "pack records into categorized shards",
                  "order_file": "http://website.web/static/order_file.txt",
                  "order_file_sep": " "
                }'
   E.g. inline YAML spec:
                $ ais start dsort -f - <<EOM
                  extension: .tar
                  input_bck:
                      name: dsort-testing
                  input_format:
                      template: shard-{0..9}
                  output_format: new-shard-{0000..1000}
                  output_shard_size: 10KB
                  description: shuffle shards from 0 to 9
                  algorithm:
                      kind: shuffle
                  EOM
   Tip: use '--dry-run' to see the results without making any changes
   Tip: use '--verbose' to print the spec (with all its parameters including applied defaults)
   See also: docs/dsort.md, docs/cli/dsort.md, and ais/test/scripts/dsort*

USAGE:
   ais dsort [command options] [JSON_SPECIFICATION|YAML_SPECIFICATION|-] [SRC_BUCKET] [DST_BUCKET]

OPTIONS:
   --file value, -f value  path to JSON or YAML job specification
   --verbose, -v           verbose
   --help, -h              show help
```

## Example

This example simply runs the [ais/test/scripts/dsort-ex1-spec.json](https://github.com/NVIDIA/aistore/blob/main/ais/test/scripts/dsort-ex1-spec.json) specification. The source and destination buckets - ais://src and ais://dst, respectively - must exist.

Further, the source bucket must have at least 10 shards with names that match `input_format` (see below).

Notice the `-v` (`--verbose`) switch as well.

```console
$ ais start dsort ais://src ais://dst -f ais/test/scripts/dsort-ex1-spec.json --verbose
PROPERTY                         VALUE
algorithm.content_key_type       -
algorithm.decreasing             false
algorithm.extension              -
algorithm.kind                   alphanumeric
algorithm.seed                   -
create_concurrency_max_limit     0
description                      sort shards alphanumerically
dry_run                          false
dsorter_type                     -
extension                        .tar
extract_concurrency_max_limit    0
input_bck                        ais://src
input_format.objnames            -
input_format.template            shard-{0..9}
max_mem_usage                    -
order_file                       -
order_file_sep                   \t
output_bck                       ais://dst
output_format                    new-shard-{0000..1000}
output_shard_size                10KB

Config override:                 none

srt-M8ld-VU_i
```

## Generate Shards

`ais archive gen-shards "BUCKET/TEMPLATE.EXT"`

Put randomly generated shards into a bucket. The main use case for this command is dSort testing.
[Further reference for this command can be found here.](archive.md#generate-shards)
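
For instance, to populate a test bucket with ten small `.tar` shards matching the `shard-{0..9}` input format used in the examples on this page (the bucket name is illustrative; see the linked reference for supported flags and defaults):

```console
$ ais archive gen-shards "ais://dsort-testing/shard-{0..9}.tar"
```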

## Start dSort job

`ais start dsort JOB_SPEC` or `ais start dsort -f <PATH_TO_JOB_SPEC>`

Start a new dSort job with the provided specification.
The specification should be provided either as an argument or via the `-f` flag - providing both at the same time results in an error.
Upon creation, the job's `JOB_ID` is returned - it can then be used to abort the job or retrieve its metrics.

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--file, -f` | `string` | Path to file containing JSON or YAML job specification. Providing `-` will result in reading from STDIN | `""` |

The following table describes the JSON/YAML keys that can be used in the specification (a minimal example follows the table).

| Key | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| `extension` | `string` | extension of input and output shards (either `.tar`, `.tgz` or `.zip`) | yes | |
| `input_format.template` | `string` | name template for input shard | yes | |
| `output_format` | `string` | name template for output shard | yes | |
| `input_bck.name` | `string` | bucket name where the input shards are stored | yes | |
| `input_bck.provider` | `string` | bucket backend provider, see [docs](/docs/providers.md) | no | `"ais"` |
| `output_bck.name` | `string` | bucket name where new output shards will be saved | no | same as `input_bck.name` |
| `output_bck.provider` | `string` | bucket backend provider, see [docs](/docs/providers.md) | no | same as `input_bck.provider` |
| `description` | `string` | description of dSort job | no | `""` |
| `output_shard_size` | `string` | size (in bytes) of the output shard, can be in form of raw numbers `10240` or suffixed `10KB` | yes | |
| `algorithm.kind` | `string` | determines which sorting algorithm dSort job uses, available are: `"alphanumeric"`, `"shuffle"`, `"content"` | no | `"alphanumeric"` |
| `algorithm.decreasing` | `bool` | determines if the algorithm should sort the records in decreasing or increasing order, used for `kind=alphanumeric` or `kind=content` | no | `false` |
| `algorithm.seed` | `string` | seed provided to random generator, used when `kind=shuffle` | no | `""` - `time.Now()` is used |
| `algorithm.extension` | `string` | content of the file with the provided extension will be used as the sorting key, used when `kind=content` | yes (only when `kind=content`) | |
| `algorithm.content_key_type` | `string` | content key type; may have one of the following values: "int", "float", or "string"; used exclusively with `kind=content` sorting | yes (only when `kind=content`) | |
| `order_file` | `string` | URL to the file containing the external key map (it should contain lines in the format: `record_key[sep]shard-%d-fmt`) | yes (only when `output_format` not provided) | `""` |
| `order_file_sep` | `string` | separator used for splitting `record_key` and `shard-%d-fmt` in the lines of the external key map | no | `\t` (TAB) |
| `max_mem_usage` | `string` | limits the amount of total system memory allocated by both dSort and other running processes. If and when this threshold is crossed, dSort will continue extracting onto local drives. Can be a percentage (e.g. `60%`) or an absolute size (e.g. `10GB`) | no | same as in `/deploy/dev/local/aisnode_config.sh` |
| `extract_concurrency_max_limit` | `int` | limits maximum number of concurrent shards extracted per disk | no | (calculated based on different factors) ~50 |
| `create_concurrency_max_limit` | `int` | limits maximum number of concurrent shards created per disk | no | (calculated based on different factors) ~50 |

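
For reference, here is a minimal specification that sets only the required keys; the bucket name and templates are reused from the examples below, and all remaining keys assume their defaults:

```json
{
    "extension": ".tar",
    "input_bck": {"name": "dsort-testing"},
    "input_format": {"template": "shard-{0..9}"},
    "output_format": "new-shard-{0000..1000}",
    "output_shard_size": "10KB"
}
```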
It is also possible to override some of the values from the global `distributed_sort` config via the job specification (see the sketch after the table below).
All such values are optional - if not set, the value from the global `distributed_sort` config is used.
For more information refer to [configuration](/docs/configuration.md).

| Key | Type | Description |
| --- | --- | --- |
| `duplicated_records` | `string` | what to do when duplicated records are found: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort dSort operation |
| `missing_shards` | `string` | what to do when missing shards are detected: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort dSort operation |
| `ekm_malformed_line` | `string` | what to do when the external key map contains a malformed line: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort dSort operation |
| `ekm_missing_key` | `string` | what to do when the external key map is missing a key: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort dSort operation |
| `dsorter_mem_threshold` | `string` | minimum free memory threshold which activates the specialized dsorter type that uses memory in the creation phase - benchmarks show that this type of dsorter behaves better than the general type |

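
As a sketch (assuming the override keys are specified at the top level of the job specification, alongside the keys from the previous table), a spec that aborts on missing shards and only warns about duplicated records might look like this:

```json
{
    "extension": ".tar",
    "input_bck": {"name": "dsort-testing"},
    "input_format": {"template": "shard-{0..9}"},
    "output_format": "new-shard-{0000..1000}",
    "output_shard_size": "10KB",
    "missing_shards": "abort",
    "duplicated_records": "warn"
}
```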
### Examples

#### Sort records inside the shards

The command defined below starts an (alphanumeric) sorting job with extended metrics for **input** shards with names `shard-0.tar`, `shard-1.tar`, ..., `shard-9.tar`.
Each of the **output** shards will have at least `10240` bytes (`10KB`) and will be named `new-shard-0000.tar`, `new-shard-0001.tar`, ...

Assuming that `dsort_spec.json` contains:

```json
{
    "extension": ".tar",
    "input_bck": {"name": "dsort-testing"},
    "input_format": {
        "template": "shard-{0..9}"
    },
    "output_format": "new-shard-{0000..1000}",
    "output_shard_size": "10KB",
    "description": "sort shards from 0 to 9",
    "algorithm": {
        "kind": "alphanumeric"
    }
}
```

You can start the dSort job with:

```console
$ ais start dsort -f dsort_spec.json
JGHEoo89gg
```

#### Shuffle records

The command defined below starts a basic shuffle job for **input** shards with names `shard-0.tar`, `shard-1.tar`, ..., `shard-9.tar`.
Each of the **output** shards will have at least `10240` bytes (`10KB`) and will be named `new-shard-0000.tar`, `new-shard-0001.tar`, ...

```console
$ ais start dsort -f - <<EOM
extension: .tar
input_bck:
    name: dsort-testing
input_format:
    template: shard-{0..9}
output_format: new-shard-{0000..1000}
output_shard_size: 10KB
description: shuffle shards from 0 to 9
algorithm:
    kind: shuffle
EOM
JGHEoo89gg
```

#### Pack records into shards with different categories - EKM (External Key Map)

One of the key features of dSort is that the user can specify the exact mapping from record key to output shard.
To use this feature, `output_format` should be left empty, and `order_file` (as well as `order_file_sep`) must be set.
The output shards will be created with the provided name format, which must contain a mandatory `%d` used to enumerate the shards.

Assuming that `order_file` (URL: `http://website.web/static/order_file.txt`) has the following content:

```
cat_0.txt shard-cats-%d
cat_1.txt shard-cats-%d
...
dog_0.txt shard-dogs-%d
dog_1.txt shard-dogs-%d
...
car_0.txt shard-car-%d
car_1.txt shard-car-%d
...
```

or, if `order_file` is a JSON file (URL: `http://website.web/static/order_file.json`, note the `.json` extension), it has the following content:

```json
{
  "shard-cats-%d": [
    "cat_0.txt",
    "cat_1.txt",
    ...
  ],
  "shard-dogs-%d": [
    "dog_0.txt",
    "dog_1.txt",
    ...
  ],
  "shard-car-%d": [
    "car_0.txt",
    "car_1.txt",
    ...
  ],
  ...
}
```

and the content of the **input** shards looks more or less like this:

```
shard-0.tar:
- cat_0.txt
- dog_0.txt
- car_0.txt
...
shard-1.tar:
- cat_1.txt
- dog_1.txt
- car_1.txt
...
```

You can run:

```console
$ ais start dsort '{
    "extension": ".tar",
    "input_bck": {"name": "dsort-testing"},
    "input_format": {"template": "shard-{0..9}"},
    "output_shard_size": "200KB",
    "description": "pack records into categorized shards",
    "order_file": "http://website.web/static/order_file.txt",
    "order_file_sep": " "
}'
JGHEoo89gg
```

After the run, the **output** shards will look more or less like this (the number of records in a given shard depends on the provided `output_shard_size`):

```
shard-cats-0.tar:
- cat_1.txt
- cat_2.txt
shard-cats-1.tar:
- cat_3.txt
- cat_4.txt
...
shard-dogs-0.tar:
- dog_1.txt
- dog_2.txt
...
```

## Show dSort jobs and job status

`ais show job dsort [JOB_ID]`

Retrieve the status of the dSort job with the provided `JOB_ID` (returned upon job creation).
List all dSort jobs if the `JOB_ID` argument is omitted.

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--regex` | `string` | Regex for the description of dSort jobs | `""` |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds). E.g.: `--refresh 2s` | ` ` |
| `--verbose, -v` | `bool` | Show detailed metrics | `false` |
| `--log` | `string` | Path to the file where the metrics will be saved (does not work with progress bar) | `/tmp/dsort_run.txt` |
| `--json, -j` | `bool` | Show only JSON metrics | `false` |

### Examples

#### Show dSort jobs with description matching a provided regex

Show all dSort jobs with descriptions starting with the `sort ` prefix.

```console
$ ais show job dsort --regex "^sort (.*)"
JOB ID       STATUS     START           FINISH          DESCRIPTION
nro_Y5h9n    Finished   03-16 11:39:07  03-16 11:39:07  sort shards from 0 to 9
Key_Y5h9n    Finished   03-16 11:39:23  03-16 11:39:23  sort shards from 10 to 19
enq9Y5Aqn    Finished   03-16 11:39:34  03-16 11:39:34  sort shards from 20 to 29
```

#### Save metrics to log file

Save newly fetched metrics of the dSort job with ID `5JjIuGemR` to the `/tmp/dsort_run.txt` file every `500` milliseconds:

```console
$ ais show job dsort 5JjIuGemR --refresh 500ms --log "/tmp/dsort_run.txt"
Dsort job has finished successfully in 21.948806ms:
  Longest extraction:  1.49907ms
  Longest sorting:     8.288299ms
  Longest creation:    4.553µs
```

#### Show only JSON metrics

```console
$ ais show job dsort 5JjIuGemR --json
{
  "825090t8089": {
    "local_extraction": {
      "started_time": "2020-05-28T09:53:42.466267891-04:00",
      "end_time": "2020-05-28T09:53:42.50773835-04:00",
      ....
     },
     ....
  },
  ....
}
```

#### Show only JSON metrics filtered by daemon ID

```console
$ ais show job dsort 5JjIuGemR 766516t8087 --json
{
  "766516t8087": {
    "local_extraction": {
      "started_time": "2020-05-28T09:53:42.466267891-04:00",
      "end_time": "2020-05-28T09:53:42.50773835-04:00",
      ....
     },
     ....
  }
}
```

#### Using jq to filter the JSON-formatted metrics output

Show the running status of the meta-sorting phase for all targets.

```console
$ ais show job dsort 5JjIuGemR --json | jq .[].meta_sorting.running
false
false
true
false
```

Show the number of shards created by each target, along with the target IDs.

```console
$ ais show job dsort 5JjIuGemR --json | jq 'to_entries[] | [.key, .value.shard_creation.created_count]'
[
  "766516t8087",
  "189"
]
[
  "710650t8086",
  "207"
]
[
  "825090t8089",
  "211"
]
[
  "743838t8088",
  "186"
]
[
  "354275t8085",
  "207"
]
```


## Stop dSort job

`ais stop dsort JOB_ID`

Stop the dSort job with the given `JOB_ID`.

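For example, to stop the (hypothetical) job started in the examples above:

```console
$ ais stop dsort JGHEoo89gg
```
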
## Remove dSort job

`ais job rm dsort JOB_ID`

Remove the finished dSort job with the given `JOB_ID` from the job list.

## Wait for dSort job

`ais wait dsort JOB_ID`

or, same:

`ais wait JOB_ID`

Wait for the dSort job with the given `JOB_ID` to finish (see the example following the options table below).

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds) | `1s` |
| `--progress` | `bool` | Displays progress bar | `false` |
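
For example, to block until the (hypothetical) job from the examples above finishes, displaying a progress bar and refreshing it every `500` milliseconds:

```console
$ ais wait dsort JGHEoo89gg --progress --refresh 500ms
```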