---
layout: post
title: PERFORMANCE
permalink: /docs/performance
redirect_from:
 - /performance.md/
 - /docs/performance.md/
---

AIStore is all about performance. It's all about performance and reliability, to be precise. The assortment of tips and recommendations in this text will go a long way toward ensuring that AIS _does_ deliver.

But first, one important consideration:

Currently(**) AIS utilizes local filesystems - locally formatted drives of any kind. You could use HDDs and/or NVMes, and any Linux filesystem: xfs, zfs, ext4... you name it. In all cases, the resulting throughput of an AIS cluster will be the sum of the throughputs of its individual drives.

Usage of local filesystems has its pros and cons, its inevitable trade-offs. In particular, one immediate implication is called [`ulimit`](#maximum-number-of-open-files) - the limit on the maximum number of open file descriptors. We do recommend fixing that specific tunable right away - as described [below](#maximum-number-of-open-files).

The second related option is [`noatime`](#noatime) - and the same argument applies. Please make sure to review and tune up both.

> (**) The interface between AIS and a local filesystem is extremely limited and boils down to writing/reading key/value pairs (where values are, effectively, files) and checking the existence of keys (filenames). Moving AIS off of POSIX to some sort of (TBD) key/value storage engine should be a straightforward exercise. Related expectations/requirements include reliability and performance under stressful workloads that read and write "values" of 1MB and greater in size.

- [Operating System](#operating-system)
- [CPU](#cpu)
- [Network](#network)
- [Smoke test](#smoke-test)
- [Maximum number of open files](#maximum-number-of-open-files)
- [Storage](#storage)
- [Disk priority](#disk-priority)
- [Benchmarking disk](#benchmarking-disk)
- [Local filesystem](#local-filesystem)
- [`noatime`](#noatime)
- [Virtualization](#virtualization)
- [Metadata write policy](#metadata-write-policy)
- [PUT latency](#put-latency)
- [GET throughput](#get-throughput)
- [`aisloader`](#aisloader)

## Operating System

There are two types of nodes in an AIS cluster:

* targets (i.e., storage nodes)
* proxies (aka gateways)

> An AIS target must have a data disk, or disks. AIS proxies do not "see" a single byte of user data - they implement most of the control plane and provide access points to the cluster.

The question, then, is how to get the maximum out of the underlying hardware and improve datapath performance. This and similar questions have one simple answer: **tune up** the target's operating system.

Specifically, use `sysctl` to set selected system variables, such as `net.core.wmem_max`, `net.core.rmem_max`, `vm.swappiness`, and more - here's the approximate list:

* [host_config_sysctl.yml](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/host-config/vars/host_config_sysctl.yml)

The document is part of a separate [repository](https://github.com/NVIDIA/ais-k8s) that serves the (specific) purpose of deploying AIS on **bare-metal Kubernetes**. The repo includes a number of [playbooks](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/README.md) to assist in a full deployment of AIStore.
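For a rough, hedged illustration of what applying a couple of these settings looks like (the numbers below are placeholders, not recommendations - the maintained values live in the playbook referenced above):

```console
$ sysctl -w net.core.rmem_max=268435456   # illustrative value only
$ sysctl -w net.core.wmem_max=268435456   # illustrative value only
$ sysctl -w vm.swappiness=10              # illustrative value only

# to persist across reboots, place the same settings in /etc/sysctl.d/90-aistore.conf
# (hypothetical file name) and run `sysctl --system`
```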
In particular, that repository includes a set of pre-deployment playbooks to [prepare AIS nodes for deployment on bare-metal Kubernetes](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/host-config/README.md).

General references:

* [Ubuntu Performance Tuning (archived)](https://web.archive.org/web/20190811180754/https://wiki.mikejung.biz/Ubuntu_Performance_Tuning) <- a good guide on general optimizations (some of which are described below)
* [RHEL 7 Performance Tuning Guide](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/pdf/performance_tuning_guide/Red_Hat_Enterprise_Linux-7-Performance_Tuning_Guide-en-US.pdf) <- a detailed view on how to tune RHEL; a lot of the tips and tricks apply to other Linux distributions as well

## CPU

Setting the CPU governor (P-States) to `performance` may make a big difference and, in particular, result in much better network throughput:

* [How to tune your 100G host](https://fasterdata.es.net/assets/Papers-and-Publications/100G-Tuning-TechEx2016.tierney.pdf)

On `Linux`:

```console
$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

or using the `cpupower` package (on `Debian` and `Ubuntu`):

```console
$ apt-get install -y linux-tools-$(uname -r) # install `cpupower`
$ cpupower frequency-info                    # check current settings
$ cpupower frequency-set -r -g performance   # set the `performance` governor on all CPUs
$ cpupower frequency-info                    # check settings after the change
```

Once the packages are installed (a step that will depend on your Linux distribution), you can then follow the *tuning instructions* from the PDF referenced above.

## Network

AIStore supports 3 (three) logical networks:

* public (default port `51081`)
* intra-cluster control (`51082`), and
* intra-cluster data (`51083`)

Ideally, all 3 are provisioned (physically) separately - to reduce contention, avoid head-of-line (HoL) blocking, and ultimately optimize intra-cluster performance.

Separately, and in addition:

* MTU should be set to `9000` (Jumbo frames) - this is one of the most important configurations
* Optimize TCP send buffer sizes on the target side (`net.core.wmem_max`, `net.ipv4.tcp_wmem`)
* Optimize TCP receive buffer sizes on the client (reading) side (`net.core.rmem_max`, `net.ipv4.tcp_rmem`)
* Set `net.ipv4.tcp_mtu_probing = 2`

> The last setting is especially important when the client's MTU is greater than 1500.

> The list of tunables (above) cannot be considered _exhaustive_. Optimal (high-performance) choices always depend on the hardware, the Linux kernel, and a variety of factors outside the scope of this document.

## Smoke test

To ensure client ⇔ proxy, client ⇔ target, proxy ⇔ target, and target ⇔ target connectivity you can use `iperf` (make sure to use Jumbo frames and disable fragmentation).
Here is an example of `iperf` usage:

```console
$ iperf -P 20 -l 128K -i 1 -t 30 -w 512K -c <IP-address>
```

**NOTE**: `iperf` must show at least 95% of the bandwidth of a given physical interface. If it does not, try to find out why. There is little sense in running any further benchmarks before this is resolved.

## Maximum number of open files

This must be done before running benchmarks, let alone deploying AIS in production: the maximum number of open file descriptors must be increased.

The corresponding system configuration file is `/etc/security/limits.conf`.
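For containerized or systemd-managed deployments, the same limit is controlled elsewhere. As a hedged illustration (the value `999999` below simply matches the `limits.conf` example that follows):

```console
# Docker: raise the per-container limit for the container that runs aisnode
$ docker run --ulimit nofile=999999:999999 ...

# systemd: set the following in /etc/systemd/system.conf (or a drop-in),
# then re-execute systemd and restart the service
DefaultLimitNOFILE=999999
```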
> In Linux, the default per-process maximum is 1024. It is **strongly recommended** to raise it to at least `100,000`.

Here's a full replica of [/etc/security/limits.conf](https://github.com/NVIDIA/aistore/blob/main/deploy/conf/limits.conf) that we use for development _and_ production.

To check your current settings, run `ulimit -n` or `tail /etc/security/limits.conf`.

To increase the limits, copy the `nofile` lines below into `/etc/security/limits.conf`:

```console
$ tail /etc/security/limits.conf
#ftp             hard    nproc           0
#ftp             -       chroot          /ftp
#@student        -       maxlogins       4

root             hard    nofile          999999
root             soft    nofile          999999
*                hard    nofile          999999
*                soft    nofile          999999

# End of file
```

Once done, re-login and double-check that both *soft* and *hard* limits have indeed changed:

```console
$ ulimit -n
999999
$ ulimit -Hn
999999
```

For further references, google:

* `docker run --ulimit`
* `DefaultLimitNOFILE`

## Storage

Storage-wise, each local `ais_local.json` config must look as follows:

```json
{
    "net": {
        "hostname": "127.0.1.0",
        "port": "8080"
    },
    "fspaths": {
        "/tmp/ais/1": {},
        "/tmp/ais/2": {},
        "/tmp/ais/3": {}
    }
}
```

* Each local path from the `fspaths` section above must be (or contain as a prefix) a mountpath of a local filesystem.
* Each local filesystem (above) must utilize one or more data drives, whereby none of the data drives is shared between two or more local filesystems.
* Each filesystem must be fine-tuned for reading large files/blocks/xfersizes.

### Disk priority

AIStore can be sensitive to spikes in I/O latencies, especially when running bursty traffic that carries large numbers of small-size I/O requests (resulting from writing and/or reading small-size objects).
To ensure that latencies do not spike because of other processes running on the system, it is recommended to give the `aisnode` process the highest disk priority.
This can be done using the [`ionice`](https://linux.die.net/man/1/ionice) tool.

```console
$ # Best effort, highest priority
$ sudo ionice -c2 -n0 -p $(pgrep aisnode)
```

### Benchmarking disk

**TIP:** double-check that the rootfs/tmpfs of the AIStore target are _not_ used when reading and writing data.

When running a benchmark, make sure to run and collect the following on each and every target:

```console
$ iostat -cdxtm 10
```

For local drive read performance, the fastest block-level read smoke test is:

```console
$ hdparm -Tt /dev/<drive-name>
```

To read a block from a certain offset (in gigabytes), use the `--direct` argument to bypass the drive's buffer cache and read directly from the disk:

```console
$ hdparm -t --direct --offset 100 /dev/<drive-name>
```

More: [Tune hard disk with `hdparm`](http://www.linux-magazine.com/Online/Features/Tune-Your-Hard-Disk-with-hdparm).

### Local filesystem

Another way to increase storage performance is to benchmark different filesystems: `ext`, `xfs`, `openzfs`.
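One way to compare candidate filesystems is to run an identical large-block sequential-read job against each of them. A minimal `fio` sketch, assuming a mountpath `/ais/sda` (the path and sizes below are illustrative):

```console
# sequential 1MB reads, direct I/O, against a directory on the filesystem under test
$ fio --name=seqread --directory=/ais/sda --rw=read --bs=1m --size=10g \
      --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 --group_reporting
```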
Tuning the corresponding I/O scheduler can also prove to be important:

* [ais_enable_multiqueue](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/host-config/docs/ais_enable_multiqueue.md)

Other related references:

* [How to improve disk IO performance in Linux](https://www.golinuxcloud.com/how-to-improve-disk-io-performance-in-linux/)
* [Performance Tuning on Linux — Disk I/O](https://cromwell-intl.com/open-source/performance-tuning/disks.html)

### `noatime`

One of the most important performance improvements can be achieved by simply turning off `atime` (access time) updates on the filesystem.

Example: mount xfs with the following performance-optimized parameters (that also include `noatime`):

```console
noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier
```

> As an aside, it is worth noting that AIStore itself updates access times lazily. The complete story has several pieces:

1. On the one hand, AIS can be deployed as a fast cache tier in front of any of the multiple supported [backends](providers.md). To support this specific usage scenario, AIS tracks access times and the remaining capacity. The latter has 3 (three) configurable thresholds: low-watermark, high-watermark, and OOS (out of space) - by default, 75%, 90%, and 95%, respectively. Running low on free space automatically triggers the LRU job, which starts visiting evictable buckets (if any), sorting their content by access times, and yes - evicting. Bucket "evictability" is also configurable and, in turn, has two different defaults: true for buckets that have a remote backend, and false otherwise.
2. On the other hand, we certainly don't want any writes upon reads. That's why object metadata is cached and atime gets updated in memory to a) avoid extra reads and b) absorb multiple accesses and multiple atime updates. Further, if and only if we update atime in memory, we also mark the metadata as `dirty`. That way, if for whatever reason we later need to un-cache it, we flush it to disk (along with the most recently updated access time).
3. The actual mechanism of when and whether to flush the corresponding metadata is driven by two conditions: the time since last access and the amount of free memory. If there's plenty of memory, we flush `dirty` metadata after 1 hour of inactivity.
4. Finally, in AIS all configuration-related defaults (e.g., the default watermarks mentioned above) are also configurable - but that's a different story and a different scope...

External links:

* [The atime and noatime attribute](http://en.tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap6sec73.html)
* [Mount with noatime](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/global_file_system_2/s2-manage-mountnoatime)
* [Gain 30% Linux Disk Performance with noatime](https://lonesysadmin.net/2013/12/08/gain-30-linux-disk-performance-noatime-nodiratime-relatime)

## Virtualization

There must be no sharing of host resources between two or more VMs that are AIS nodes.

Even if there is a single virtual machine, the host may decide to swap it out when idle, or give it a single hyperthreaded vCPU instead of a full-blown physical core - this condition must be prevented.

A virtualized AIS node needs to be given its physical resources - memory, CPU, network, storage - in their entirety.
The hypervisor should be left with only the absolutely required minimum.

Make sure to use PCI passthrough to assign a device (NIC, HDD) directly to the AIS VM.

AIStore's primary goal is to scale with clustered drives. Therefore, the choice of a drive type and its capabilities remains very important.

Finally, when initializing virtualized disks it is advisable to set an optimal [block size](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/disk-performance.html).

## Metadata write policy

There will be times when it makes sense to keep the storage system's metadata strictly in memory or, maybe, cache it and flush it on a best-effort basis. The corresponding use cases include temporary datasets - transient data of any kind that can (and will) be discarded. There's also the case when AIS is used as a high-performance caching layer where the original data is already sufficiently replicated and protected.

Metadata write policy - json tag `write_policy.md` - was introduced specifically to support those and related scenarios. Configurable both globally and for each individual bucket (or dataset), the current set of supported policies includes:

| Policy | Description |
| --- | --- |
| `immediate` | write immediately upon updates (global default) |
| `delayed` | cache and flush when not accessed for a while (see also: [noatime](#noatime)) |
| `never` | never write but always keep metadata in memory (aka "transient") |

> For the most recently updated enumeration, please see the [source](/cmn/api_const.go).

## PUT latency

AIS provides checksumming and self-healing - capabilities that ensure that user data is end-to-end protected and that data corruption, if it ever happens, will be properly and timely detected and - in the presence of any type of data redundancy - resolved by the system.

There's a price, though, and there are scenarios where you could make an educated choice to trade checksumming for performance.

In particular, let's say that we are massively writing new content into a bucket.

> The type of the bucket doesn't matter - it may be an `ais://` bucket, or `s3://`, or any other supported [backend](/docs/bucket.md#backend-provider) including HDFS and HTTP.

What matters is that we *know* we'll be overwriting few objects, percentage-wise. It would then stand to reason that AIS, on its end, should refrain from trying to load the destination object's metadata - that is, skip loading the existing object's metadata to compare checksums (and thus maybe avoid writing altogether if the checksums match) and/or to update the object's version, etc.

* [API: PUT(object)](/api/object.go) - look for the `SkipVC` option
* [CLI: PUT(object)](/docs/cli/object.md#put-object) - the `--skip-vc` option

## GET throughput

AIS is an elastic cluster that can grow and shrink at runtime as you attach (or detach) disks and add (or remove) nodes. Drives are prone to sudden failures while nodes can experience unexpected crashes. At all times, though, and in the presence of any and all events, AIS tries to keep a [fair and balanced](/docs/rebalance.md) distribution of user data across all clustered nodes and drives.

The idea of keeping all drives **equally utilized** is the absolute cornerstone of the design.

There's one simple trick, though, to improve drive utilization even further: [n-way mirroring](/docs/storage_svcs.md#n-way-mirror).
In fact,

* if you observe <span style="color:red">90% and higher</span> utilizations,
* and if adding more drives (or more nodes) is not an option,
* and if you still have some spare capacity to create additional copies of data -

**if** all of the above is true, it would be a very good idea to *mirror* the corresponding bucket, e.g.:

```console
# reconfigure an existing bucket (`ais://abc` in the example) for 3 copies

$ ais bucket props set ais://abc mirror.enabled=true mirror.copies=3
Bucket props successfully updated
"mirror.copies" set to: "3" (was: "2")
"mirror.enabled" set to: "true" (was: "false")
```

In other words, under extreme loads n-way mirroring is especially, and strongly, recommended. (Data protection would then be an additional bonus, needless to say.) The way it works is simple:

* each AIS target constantly monitors drive utilizations
* given multiple copies, the target always selects a copy (and the drive that stores it) based on the drive's utilization

Ultimately, a drive that has fewer outstanding I/O requests and is less utilized will always win.

## `aisloader`

AIStore includes `aisloader` - a powerful benchmarking tool that can be used to generate a wide variety of workloads closely resembling those produced by AI apps.

For numerous command-line options and usage examples, please see:

* [Load Generator](/docs/aisloader.md)
* [How To Benchmark AIStore](howto_benchmark.md)

Or, simply run `aisloader` with no arguments to see its online help, including command-line options and numerous usage examples:

```console
$ make aisloader; aisloader

AIS loader (aisloader v1.3, build ...) is a tool to measure storage performance.
It's a load generator that has been developed to benchmark and stress-test AIStore
but can be easily extended to benchmark any S3-compatible backend.
For usage, run: `aisloader`, or `aisloader usage`, or `aisloader --help`.
Further details at https://github.com/NVIDIA/aistore/blob/main/docs/howto_benchmark.md

Command-line options
====================
...
...

```

Note as well that `aisloader` is fully StatsD-enabled - collected metrics can be forwarded to any StatsD-compliant backend for visualization and further analysis.
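As a quick, hedged illustration (the bucket name, object sizes, worker count, and durations below are arbitrary - see the [Load Generator](/docs/aisloader.md) docs for the authoritative set of options), a small write-then-read run might look along these lines:

```console
# PUT 1MB objects into ais://abc for 2 minutes with 8 workers, keeping the data afterwards
$ aisloader -bucket=ais://abc -duration=2m -numworkers=8 -pctput=100 \
    -minsize=1MB -maxsize=1MB -cleanup=false

# then read the same bucket back for 2 minutes (100% GET)
$ aisloader -bucket=ais://abc -duration=2m -numworkers=8 -pctput=0 -cleanup=false
```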