---
layout: post
title: DSORT
permalink: /docs/dsort
redirect_from:
 - /dsort.md/
 - /docs/dsort.md/
---

dSort is an extension of AIStore designed to perform map-reduce style
operations on terabytes and petabytes of AI datasets. As part of the system,
dSort can take advantage of objects already stored in AIStore without much
overhead.

AI datasets are usually stored as tarballs, zip archives, msgpack files, or
TFRecords. Focusing only on these file types and this specific workload allows
us to tune performance without too many tradeoffs.

## Capabilities

An example of a map-reduce style operation that dSort can perform is shuffling
(in particular, sorting) all objects across all shards by a given algorithm.

![dsort-overview](images/dsort_mapreduce.png)

Output shards may have a different size than the input shards, so a user can
also reshard the objects. This means that output shards can contain more or
fewer objects than the input shards, depending on the requested output shard
size.

The result of such an operation is a set of output shards, possibly of
different sizes, with objects shuffled across all shards and ready to be
processed by a machine learning script/model.

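For intuition only, here is a minimal Go sketch of the resharding arithmetic (not dSort code; `estimateOutputShards` is a hypothetical helper): the requested output shard size determines roughly how many output shards the same data will occupy.

```go
package main

import "fmt"

// Illustrative arithmetic only (not the dSort implementation): estimate how
// many output shards resharding will produce when the requested output shard
// size differs from the input shard size.
func estimateOutputShards(inputShards int, avgInputShardSize, outputShardSize int64) int64 {
	totalSize := int64(inputShards) * avgInputShardSize
	n := totalSize / outputShardSize
	if totalSize%outputShardSize != 0 {
		n++ // the last shard holds the remainder
	}
	return n
}

func main() {
	// 1000 input shards of ~1GiB resharded into ~512MiB output shards
	// yields roughly twice as many (smaller) output shards.
	fmt.Println(estimateOutputShards(1000, 1<<30, 512<<20)) // 2000
}
```
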
## Terms

**Object** - a single piece of data. In tarballs and zip files, an *object* is
a single file contained in the archive. In msgpack (assuming the msgpack file
is a stream of dictionaries), an *object* is a single dictionary.

**Shard** - a collection of objects. In tarballs and zip files, a *shard* is the
whole archive. In msgpack, it is the whole msgpack file.

We distinguish two kinds of shards: input and output. Input shards, as the name
says, are given as input to the dSort operation; output shards are the result
of the operation. Output shards can differ from input shards in many ways:
size, number of objects, names, etc.

Shards are assumed to already reside in the AIStore cluster or in a remote bucket
that AIStore can access. Output shards are always placed in the same bucket and
directory as the input shards - after dSort completes they are accessed the same
way as the input shards, just under different names.

**Record** - abstracts multiple objects with the same key name into a single
structure. Records are inseparable: if their objects come from a single shard,
they will also end up in the same output shard together.

E.g., if we have a tarball containing files named `file1.txt`, `file1.png`, and
`file2.png`, then we have 2 *records*: one for `file1` and one for `file2`.

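To make the object-to-record mapping concrete, here is a minimal Go sketch (illustrative only; not dSort's actual data structures) that groups object names into records by their shared key:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// recordKey derives a record key from an object name by stripping the
// extension, e.g. "file1.txt" and "file1.png" both map to "file1".
// (Illustrative only - a simplification of how keys are derived.)
func recordKey(objName string) string {
	return strings.TrimSuffix(objName, path.Ext(objName))
}

func main() {
	objects := []string{"file1.txt", "file1.png", "file2.png"}

	records := map[string][]string{}
	for _, o := range objects {
		k := recordKey(o)
		records[k] = append(records[k], o)
	}
	fmt.Println(len(records), records) // 2 records: file1 -> [file1.txt file1.png], file2 -> [file2.png]
}
```
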
**Extraction phase** - dSort performs the whole operation in multiple phases.
The first of them is **extraction**. In this phase, dSort reads the input
shards and looks inside them to get at the objects and their metadata. Objects
and metadata are then extracted to either disk or memory so that dSort won't
need another pass over the whole dataset. This way dSort can create **records**,
which serve as the main source of information (object locations, sizes, names,
etc.) for the rest of the operation. The extraction phase is critical because
it is I/O heavy. To make the following phases faster, we added support for
extraction to memory so that requests for the given objects are served from RAM
instead of disk. The user can specify how much memory the extraction phase may
use, either as a raw value like `1GB` or as a percentage like `60%`.

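As a rough illustration of how such a limit can be interpreted (this is not AIStore's parser; `memWatermark` is a hypothetical helper), a value given as a percentage is resolved against the node's total memory, while a raw value is used as-is:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// memWatermark converts a limit such as "60%" or "1073741824" (bytes) into an
// absolute byte count, given the node's total memory. Illustrative only - the
// real dSort also accepts human-readable sizes such as "1GB".
func memWatermark(limit string, totalMem int64) (int64, error) {
	if strings.HasSuffix(limit, "%") {
		p, err := strconv.ParseFloat(strings.TrimSuffix(limit, "%"), 64)
		if err != nil {
			return 0, err
		}
		return int64(p / 100 * float64(totalMem)), nil
	}
	return strconv.ParseInt(limit, 10, 64)
}

func main() {
	total := int64(64) << 30 // node with 64GiB of RAM
	w, _ := memWatermark("60%", total)
	fmt.Println(w) // 41231686041 (~38.4GiB)
}
```
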
As mentioned, this phase is I/O intensive. To give the user better control over
disk usage, we provide a concurrency parameter that limits the number of shards
that can be read at the same time.

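The effect of this parameter can be sketched with a standard buffered-channel semaphore; the snippet below is illustrative only, and `extractShard` is a hypothetical placeholder for reading one input shard:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// extractShard is a hypothetical placeholder for reading one input shard.
func extractShard(name string) {
	time.Sleep(10 * time.Millisecond)
	fmt.Println("extracted", name)
}

func main() {
	const concurrency = 4 // at most 4 shards are read at the same time
	sem := make(chan struct{}, concurrency)

	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		name := fmt.Sprintf("shard-%02d.tar", i)
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			extractShard(name)
		}()
	}
	wg.Wait()
}
```
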
**Sorting phase** - in this phase, the metadata is processed and aggregated on a
single machine. It can be processed in various ways: sorting, shuffling,
resizing, etc. This is usually the fastest phase, but it still uses a lot of CPU
to process the metadata.

The merging of metadata is performed in multiple steps to distribute the load
across machines.

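As an illustration (not dSort's internal representation), sorting or shuffling at this stage operates purely on record metadata, for example:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// record holds only metadata gathered during extraction - no object payloads.
type record struct {
	Key  string
	Size int64
}

func main() {
	records := []record{{"file3", 10}, {"file1", 20}, {"file2", 30}}

	// Sort records by key - one way to order objects across output shards.
	sort.Slice(records, func(i, j int) bool { return records[i].Key < records[j].Key })
	fmt.Println("sorted:", records)

	// Shuffle - another way to reorder records across output shards.
	rand.Shuffle(len(records), func(i, j int) { records[i], records[j] = records[j], records[i] })
	fmt.Println("shuffled:", records)
}
```
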
**Creation phase** - the last phase of dSort, in which the output shards are
created. Like the extraction phase, the creation phase is bottlenecked by disk
I/O. Additionally, this phase may use a lot of network bandwidth because
objects may have been extracted on a different machine.

Similarly to the extraction phase, we expose a concurrency parameter that lets
the user limit the number of shards created simultaneously.

Shards are created from local records or remote records. Local records are
records that were extracted on the machine where the shard is being created;
remote records were extracted on other machines. This means that a single
machine will typically see a lot of disk read/write operations coming from both
local and remote requests. This is why tuning the concurrency parameter is
really important and can have a great impact on performance. We strongly advise
running a couple of tests on a small load to see which value of this parameter
gives the best performance. E.g., tests showed that on a setup with 10 targets
and 10 disks per target, the best concurrency value was 60.

The other thing to remember is that when running multiple dSort operations at
once, it may be better to set the concurrency parameter lower, since the
operations may use the disks at the same time. A higher concurrency parameter
can then result in performance degradation.

![dsort-shard-creation](images/dsort_shard_creation.png)

**Metrics** - the user can monitor the whole operation thanks to metrics.
Metrics provide an overview of what is happening in the cluster, for example:
which phase is currently running, how much time has been spent on each phase,
etc. There are many metrics (numbers and stats) recorded for each of the phases.

## Metrics

dSort allows users to fetch the statistics of a given job (either running or
already finished). Each phase has its own specific metrics which can be
monitored. Description of the metrics returned for a *single node*:

* `local_extraction`
  * `started_time` - timestamp when the local extraction has started.
  * `end_time` - timestamp when the local extraction has finished.
  * `elapsed` - duration (in seconds) of the local extraction phase.
  * `running` - informs if the phase is currently running.
  * `finished` - informs if the phase has finished.
  * `total_count` - static number of shards that need to be scanned - i.e., the expected number of input shards.
  * `extracted_count` - number of shards extracted/processed by the given node. This number can differ from node to node since shards may not be equally distributed.
  * `extracted_size` - total size of the shards extracted/processed by the given node.
  * `extracted_record_count` - number of records extracted (in total) from all processed shards.
  * `extracted_to_disk_count` - number of records extracted (in total) and saved to disk (because there was not enough space to keep them in memory).
  * `extracted_to_disk_size` - size of the extracted records which were saved to disk.
  * `single_shard_stats` - statistics about single-shard processing.
    * `total_ms` - total number of milliseconds spent extracting all shards.
    * `count` - number of extracted shards.
    * `min_ms` - shortest duration of extracting a shard (in milliseconds).
    * `max_ms` - longest duration of extracting a shard (in milliseconds).
    * `avg_ms` - average duration of extracting a shard (in milliseconds).
    * `min_throughput` - minimum throughput of extracting a shard (in bytes per second).
    * `max_throughput` - maximum throughput of extracting a shard (in bytes per second).
    * `avg_throughput` - average throughput of extracting a shard (in bytes per second).
* `meta_sorting`
  * `started_time` - timestamp when the meta sorting has started.
  * `end_time` - timestamp when the meta sorting has finished.
  * `elapsed` - duration (in seconds) of the meta sorting phase.
  * `running` - informs if the phase is currently running.
  * `finished` - informs if the phase has finished.
  * `sent_stats` - statistics about sending records to other nodes.
    * `total_ms` - total number of milliseconds spent on sending the records.
    * `count` - number of records sent to other targets.
    * `min_ms` - shortest duration of sending the records (in milliseconds).
    * `max_ms` - longest duration of sending the records (in milliseconds).
    * `avg_ms` - average duration of sending the records (in milliseconds).
  * `recv_stats` - statistics about receiving records from other nodes.
    * `total_ms` - total number of milliseconds spent on receiving the records from other nodes.
    * `count` - number of records received from other targets.
    * `min_ms` - shortest duration of receiving the records (in milliseconds).
    * `max_ms` - longest duration of receiving the records (in milliseconds).
    * `avg_ms` - average duration of receiving the records (in milliseconds).
* `shard_creation`
  * `started_time` - timestamp when the shard creation has started.
  * `end_time` - timestamp when the shard creation has finished.
  * `elapsed` - duration (in seconds) of the shard creation phase.
  * `running` - informs if the phase is currently running.
  * `finished` - informs if the phase has finished.
  * `to_create` - number of shards that need to be created on the given node.
  * `created_count` - number of shards already created.
  * `moved_shard_count` - number of shards moved from this node to another one (it sometimes makes sense to create a shard locally and send it over the network).
  * `req_stats` - statistics about sending requests for records.
    * `total_ms` - total number of milliseconds spent on sending requests for records to other nodes.
    * `count` - number of requested records.
    * `min_ms` - shortest duration of sending a request (in milliseconds).
    * `max_ms` - longest duration of sending a request (in milliseconds).
    * `avg_ms` - average duration of sending a request (in milliseconds).
  * `resp_stats` - statistics about waiting for the records.
    * `total_ms` - total number of milliseconds spent on waiting for the records from other nodes.
    * `count` - number of records received from other nodes.
    * `min_ms` - shortest duration of waiting for a record (in milliseconds).
    * `max_ms` - longest duration of waiting for a record (in milliseconds).
    * `avg_ms` - average duration of waiting for a record (in milliseconds).
  * `local_send_stats` - statistics about sending record content to other targets.
    * `total_ms` - total number of milliseconds spent on writing record content to the wire.
    * `count` - number of records sent to other nodes.
    * `min_ms` - shortest duration of waiting for record content to be written to the wire (in milliseconds).
    * `max_ms` - longest duration of waiting for record content to be written to the wire (in milliseconds).
    * `avg_ms` - average duration of waiting for record content to be written to the wire (in milliseconds).
    * `min_throughput` - minimum throughput of writing record content to the wire (in bytes per second).
    * `max_throughput` - maximum throughput of writing record content to the wire (in bytes per second).
    * `avg_throughput` - average throughput of writing record content to the wire (in bytes per second).
  * `local_recv_stats` - statistics about receiving record content from other targets.
    * `total_ms` - total number of milliseconds spent on reading record content from the wire.
    * `count` - number of records received from other nodes.
    * `min_ms` - shortest duration of waiting for record content to be read from the wire (in milliseconds).
    * `max_ms` - longest duration of waiting for record content to be read from the wire (in milliseconds).
    * `avg_ms` - average duration of waiting for record content to be read from the wire (in milliseconds).
    * `min_throughput` - minimum throughput of reading record content from the wire (in bytes per second).
    * `max_throughput` - maximum throughput of reading record content from the wire (in bytes per second).
    * `avg_throughput` - average throughput of reading record content from the wire (in bytes per second).
  * `single_shard_stats` - statistics about single-shard creation.
    * `total_ms` - total number of milliseconds spent creating all shards.
    * `count` - number of created shards.
    * `min_ms` - shortest duration of creating a shard (in milliseconds).
    * `max_ms` - longest duration of creating a shard (in milliseconds).
    * `avg_ms` - average duration of creating a shard (in milliseconds).
    * `min_throughput` - minimum throughput of creating a shard (in bytes per second).
    * `max_throughput` - maximum throughput of creating a shard (in bytes per second).
    * `avg_throughput` - average throughput of creating a shard (in bytes per second).
* `aborted` - informs if the job has been aborted.
* `archived` - informs if the job has finished and was archived to the journal.
* `description` - description of the job.

Example output for a single node:
```json
{
  "local_extraction": {
    "started_time": "2019-06-17T12:27:25.102691781+02:00",
    "end_time": "2019-06-17T12:28:04.982017787+02:00",
    "elapsed": 39,
    "running": false,
    "finished": true,
    "total_count": 1000,
    "extracted_count": 182,
    "extracted_size": 4771020800,
    "extracted_record_count": 9100,
    "extracted_to_disk_count": 4,
    "extracted_to_disk_size": 104857600,
    "single_shard_stats": {
      "total_ms": 251417,
      "count": 182,
      "min_ms": 30,
      "max_ms": 2696,
      "avg_ms": 1381,
      "min_throughput": 9721724,
      "max_throughput": 847903603,
      "avg_throughput": 50169799
    }
  },
  "meta_sorting": {
    "started_time": "2019-06-17T12:28:04.982041542+02:00",
    "end_time": "2019-06-17T12:28:05.336979995+02:00",
    "elapsed": 0,
    "running": false,
    "finished": true,
    "sent_stats": {
      "total_ms": 99,
      "count": 1,
      "min_ms": 99,
      "max_ms": 99,
      "avg_ms": 99
    },
    "recv_stats": {
      "total_ms": 246,
      "count": 1,
      "min_ms": 246,
      "max_ms": 246,
      "avg_ms": 246
    }
  },
  "shard_creation": {
    "started_time": "2019-06-17T12:28:05.725630555+02:00",
    "end_time": "2019-06-17T12:29:19.108651924+02:00",
    "elapsed": 73,
    "running": false,
    "finished": true,
    "to_create": 9988,
    "created_count": 9988,
    "moved_shard_count": 0,
    "req_stats": {
      "total_ms": 160,
      "count": 8190,
      "min_ms": 0,
      "max_ms": 20,
      "avg_ms": 0
    },
    "resp_stats": {
      "total_ms": 4323665,
      "count": 8190,
      "min_ms": 0,
      "max_ms": 6829,
      "avg_ms": 527
    },
    "single_shard_stats": {
      "total_ms": 4487385,
      "count": 9988,
      "min_ms": 0,
      "max_ms": 6829,
      "avg_ms": 449,
      "min_throughput": 76989,
      "max_throughput": 709852568,
      "avg_throughput": 98584381
    }
  },
  "aborted": false,
  "archived": true
}
```

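For post-processing such output, the documented fields map naturally onto Go structs. The sketch below is illustrative only - the struct and field names are chosen for this example and are not taken from the AIStore codebase:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Minimal, illustrative structs covering a subset of the documented fields.
type throughputStats struct {
	TotalMs       int64 `json:"total_ms"`
	Count         int64 `json:"count"`
	AvgMs         int64 `json:"avg_ms"`
	AvgThroughput int64 `json:"avg_throughput"` // bytes per second
}

type phaseMetrics struct {
	Elapsed          int64            `json:"elapsed"`
	Finished         bool             `json:"finished"`
	SingleShardStats *throughputStats `json:"single_shard_stats"`
}

type nodeMetrics struct {
	LocalExtraction phaseMetrics `json:"local_extraction"`
	ShardCreation   phaseMetrics `json:"shard_creation"`
	Aborted         bool         `json:"aborted"`
}

func main() {
	raw := []byte(`{"local_extraction":{"elapsed":39,"finished":true,
		"single_shard_stats":{"total_ms":251417,"count":182,"avg_ms":1381,"avg_throughput":50169799}},
		"shard_creation":{"elapsed":73,"finished":true},"aborted":false}`)

	var m nodeMetrics
	if err := json.Unmarshal(raw, &m); err != nil {
		log.Fatal(err)
	}
	stats := m.LocalExtraction.SingleShardStats
	fmt.Printf("extraction: %d shards, avg %.1f MiB/s\n",
		stats.Count, float64(stats.AvgThroughput)/(1<<20))
}
```
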
## API

You can use the [AIS CLI](/docs/cli.md) to start and abort dSort jobs, retrieve their metrics, and list them.
It is also possible to generate a random dataset to test dSort's capabilities.

## Config

| Config value | Default value | Description |
|---|---|---|
| `duplicated_records` | "ignore" | what to do when duplicated records are found: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort the dSort operation |
| `missing_shards` | "ignore" | what to do when missing shards are detected: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort the dSort operation |
| `ekm_malformed_line` | "abort" | what to do when the extraction key map contains a malformed line: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort the dSort operation |
| `ekm_missing_key` | "abort" | what to do when the extraction key map has a missing key: "ignore" - ignore and continue, "warn" - notify a user and continue, "abort" - abort the dSort operation |
| `call_timeout` | "10m" | maximum time a target waits for another target to respond |
| `default_max_mem_usage` | "80%" | maximum amount of memory used by a running dSort. Can be set as a percentage of total memory (e.g. `80%`) or as a number of bytes (e.g. `12G`) |
| `dsorter_mem_threshold` | "100GB" | minimum free-memory threshold that activates the specialized dsorter type which uses memory in the creation phase - benchmarks show that this type of dsorter behaves better than the general type |
| `compression` | "never" | LZ4 compression parameters used when dSort sends its shards over the network. Values: "never" - disable compression, "always" - compress all data, or a set of rules for LZ4, e.g. "ratio=1.2" means enable compression from the start but disable it when the average compression ratio drops below 1.2, to save CPU resources |

To clarify what these values mean, here are a couple of examples showcasing certain scenarios.

### Examples

#### `default_max_mem_usage`

Let's assume that we have `N` targets, each with `Y` GB of RAM, and that `default_max_mem_usage` is set to `80%`.
dSort can then allocate memory as long as the total memory used in the system stays below `80% * Y` GB.
In other words, regardless of how much memory other subsystems or programs running on the same node use, dSort will not allocate any more memory once this watermark is reached.
For example, if some other program has already allocated `90% * Y` GB of memory (only `10%` is left), then dSort will not allocate any memory at all, since the watermark is already exceeded.

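A minimal sketch of this watermark logic (illustrative only; not how AIStore implements it):

```go
package main

import "fmt"

// canAllocate reports whether dSort may allocate more memory: the watermark is
// a fraction of total RAM, and the *total* memory currently used on the node
// (by all programs, not just dSort) must stay below it.
func canAllocate(totalMem, usedMem int64, maxUsage float64) bool {
	watermark := int64(maxUsage * float64(totalMem))
	return usedMem < watermark
}

func main() {
	total := int64(128) << 30 // Y = 128 GiB of RAM

	fmt.Println(canAllocate(total, 64<<30, 0.80))         // true: 64GiB used < 80% of 128GiB
	fmt.Println(canAllocate(total, int64(116)<<30, 0.80)) // false: ~90% of RAM already in use
}
```
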
#### `dsorter_mem_threshold`

dSort currently implements 2 different types of so-called "dsorter": `dsorter_mem` and `dsorter_general`.
These two implementations use memory, disks, and network a little differently and are intended for different use cases.

By default, `dsorter_general` is used, as it was implemented for all types of workloads.
It allocates memory during the first phase of dSort and uses it in the last phase.

`dsorter_mem` was designed to speed up the creation phase, which is usually the biggest bottleneck.
It has a specific way of building shards in memory and then persisting them to disk.
This makes this dsorter memory oriented, and it requires enough memory to build the shards.

To determine which dsorter to use, we have introduced a heuristic that tries to determine when it is best to use `dsorter_mem` instead of `dsorter_general`.
The config value `dsorter_mem_threshold` sets the threshold above which `dsorter_mem` is used.
If **all** targets have a max memory usage (see `default_max_mem_usage`) above `dsorter_mem_threshold`, then `dsorter_mem` is chosen for the dSort job.
For example, if each target has `Y` GB of RAM, `default_max_mem_usage` is set to `80%`, and `dsorter_mem_threshold` is set to `100GB`, then `dsorter_mem` is used as long as `80% * Y > 100GB` holds on all targets.
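
A minimal sketch of this selection heuristic (illustrative only, with made-up target sizes; not the AIStore implementation):

```go
package main

import "fmt"

// chooseDsorter returns "dsorter_mem" only when every target's allowed memory
// usage (default_max_mem_usage applied to its total RAM) exceeds the
// dsorter_mem_threshold; otherwise it falls back to "dsorter_general".
func chooseDsorter(targetTotalMem []int64, maxUsage float64, threshold int64) string {
	for _, total := range targetTotalMem {
		if int64(maxUsage*float64(total)) <= threshold {
			return "dsorter_general"
		}
	}
	return "dsorter_mem"
}

func main() {
	threshold := int64(100) << 30 // 100GB threshold (GiB used here for simplicity)
	targets := []int64{256 << 30, 192 << 30, 384 << 30}

	// 80% of the smallest target (192GiB) is ~153GiB > 100GiB, so dsorter_mem is picked.
	fmt.Println(chooseDsorter(targets, 0.80, threshold))
}
```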