---
layout: post
title: PERFORMANCE
permalink: /docs/performance
redirect_from:
 - /performance.md/
 - /docs/performance.md/
---

AIStore is all about performance. It's all about performance and reliability, to be precise. The assortment of tips and recommendations in this text will go a long way toward ensuring that AIS _does_ deliver.

But first, one important consideration:

Currently(**) AIS utilizes local filesystems - locally formatted drives of any kind. You could use HDDs and/or NVMes, and any Linux filesystem: xfs, zfs, ext4... you name it. In all cases, the resulting throughput of an AIS cluster will be the sum of the throughputs of its individual drives.

Usage of local filesystems has its pros and cons, its inevitable trade-offs. In particular, one immediate implication is called [`ulimit`](#maximum-number-of-open-files) - the limit on the maximum number of open file descriptors. We do recommend fixing that specific tunable right away - as described [below](#maximum-number-of-open-files).

The second related option is [`noatime`](#noatime) - and the same argument applies. Please make sure to review and tune up both.

> (**) The interface between AIS and a local filesystem is extremely limited and boils down to writing/reading key/value pairs (where values are, effectively, files) and checking the existence of keys (filenames). Moving AIS off of POSIX to some sort of (TBD) key/value storage engine should be a straightforward exercise. Related expectations/requirements include reliability and performance under stressful workloads that read and write "values" of 1MB and greater in size.

- [Operating System](#operating-system)
- [CPU](#cpu)
- [Network](#network)
- [Smoke test](#smoke-test)
- [Maximum number of open files](#maximum-number-of-open-files)
- [Storage](#storage)
- [Disk priority](#disk-priority)
- [Benchmarking disk](#benchmarking-disk)
- [Local filesystem](#local-filesystem)
- [`noatime`](#noatime)
- [Virtualization](#virtualization)
- [Metadata write policy](#metadata-write-policy)
- [PUT latency](#put-latency)
- [GET throughput](#get-throughput)
- [`aisloader`](#aisloader)

## Operating System

There are two types of nodes in an AIS cluster:

* targets (i.e., storage nodes)
* proxies (aka gateways)

> An AIS target must have a data disk, or disks. AIS proxies do not "see" a single byte of user data - they implement most of the control plane and provide access points to the cluster.

The question, then, is how to get the maximum out of the underlying hardware and improve datapath performance. This and similar questions have one simple answer: **tune up** the target's operating system.

Specifically, use `sysctl` to set selected system variables, such as `net.core.wmem_max`, `net.core.rmem_max`, `vm.swappiness`, and more - here's the approximate list:

* [host_config_sysctl.yml](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/host-config/vars/host_config_sysctl.yml)

The document is part of a separate [repository](https://github.com/NVIDIA/ais-k8s) that serves the (specific) purpose of deploying AIS on **bare-metal Kubernetes**. The repo includes a number of [playbooks](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/README.md) to assist in a full deployment of AIStore.
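For a rough, hedged illustration of what applying a couple of these settings looks like (the numbers below are placeholders, not recommendations - the maintained values live in the playbook referenced above):

```console
$ sysctl -w net.core.rmem_max=268435456   # illustrative value only
$ sysctl -w net.core.wmem_max=268435456   # illustrative value only
$ sysctl -w vm.swappiness=10              # illustrative value only

# to persist across reboots, place the same settings in /etc/sysctl.d/90-aistore.conf
# (hypothetical file name) and run `sysctl --system`
```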
In particular, that repository includes a set of pre-deployment playbooks to [prepare AIS nodes for deployment on bare-metal Kubernetes](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/host-config/README.md).

General references:

* [Ubuntu Performance Tuning (archived)](https://web.archive.org/web/20190811180754/https://wiki.mikejung.biz/Ubuntu_Performance_Tuning) <- a good guide on general optimizations (some of which are described below)
* [RHEL 7 Performance Tuning Guide](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/pdf/performance_tuning_guide/Red_Hat_Enterprise_Linux-7-Performance_Tuning_Guide-en-US.pdf) <- a detailed view on how to tune RHEL; a lot of the tips and tricks apply to other Linux distributions as well

## CPU

Setting the CPU governor (P-States) to `performance` may make a big difference and, in particular, result in much better network throughput:

* [How to tune your 100G host](https://fasterdata.es.net/assets/Papers-and-Publications/100G-Tuning-TechEx2016.tierney.pdf)

On `Linux`:

```console
$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

or using the `cpupower` package (on `Debian` and `Ubuntu`):

```console
$ apt-get install -y linux-tools-$(uname -r) # install `cpupower`
$ cpupower frequency-info                    # check current settings
$ cpupower frequency-set -r -g performance   # set the `performance` governor on all CPUs
$ cpupower frequency-info                    # check settings after the change
```

Once the packages are installed (a step that will depend on your Linux distribution), you can then follow the *tuning instructions* from the PDF referenced above.

## Network

AIStore supports 3 (three) logical networks:

* public (default port `51081`)
* intra-cluster control (`51082`), and
* intra-cluster data (`51083`)

Ideally, all 3 are provisioned (physically) separately - to reduce contention, avoid head-of-line (HoL) blocking, and ultimately optimize intra-cluster performance.

Separately, and in addition:

* MTU should be set to `9000` (Jumbo frames) - this is one of the most important configurations
* Optimize TCP send buffer sizes on the target side (`net.core.wmem_max`, `net.ipv4.tcp_wmem`)
* Optimize TCP receive buffer sizes on the client (reading) side (`net.core.rmem_max`, `net.ipv4.tcp_rmem`)
* Set `net.ipv4.tcp_mtu_probing = 2`

> The last setting is especially important when the client's MTU is greater than 1500.

> The list of tunables (above) cannot be considered _exhaustive_. Optimal (high-performance) choices always depend on the hardware, the Linux kernel, and a variety of factors outside the scope of this document.

## Smoke test

To ensure client ⇔ proxy, client ⇔ target, proxy ⇔ target, and target ⇔ target connectivity you can use `iperf` (make sure to use Jumbo frames and disable fragmentation).
Here is an example of `iperf` usage:

```console
$ iperf -P 20 -l 128K -i 1 -t 30 -w 512K -c <IP-address>
```

**NOTE**: `iperf` must show at least 95% of the bandwidth of a given physical interface. If it does not, try to find out why. There is little sense in running any further benchmarks before this is resolved.

## Maximum number of open files

This must be done before running benchmarks, let alone deploying AIS in production: the maximum number of open file descriptors must be increased.

The corresponding system configuration file is `/etc/security/limits.conf`.
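For containerized or systemd-managed deployments, the same limit is controlled elsewhere. As a hedged illustration (the value `999999` below simply matches the `limits.conf` example that follows):

```console
# Docker: raise the per-container limit for the container that runs aisnode
$ docker run --ulimit nofile=999999:999999 ...

# systemd: set the following in /etc/systemd/system.conf (or a drop-in),
# then re-execute systemd and restart the service
DefaultLimitNOFILE=999999
```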
> In Linux, the default per-process maximum is 1024. It is **strongly recommended** to raise it to at least `100,000`.

Here's a full replica of [/etc/security/limits.conf](https://github.com/NVIDIA/aistore/blob/main/deploy/conf/limits.conf) that we use for development _and_ production.

To check your current settings, run `ulimit -n` or `tail /etc/security/limits.conf`.

To increase the limits, copy the `nofile` lines below into `/etc/security/limits.conf`:

```console
$ tail /etc/security/limits.conf
#ftp             hard    nproc           0
#ftp             -       chroot          /ftp
#@student        -       maxlogins       4

root             hard    nofile          999999
root             soft    nofile          999999
*                hard    nofile          999999
*                soft    nofile          999999

# End of file
```

Once done, re-login and double-check that both *soft* and *hard* limits have indeed changed:

```console
$ ulimit -n
999999
$ ulimit -Hn
999999
```

For further references, google:

* `docker run --ulimit`
* `DefaultLimitNOFILE`

## Storage

Storage-wise, each local `ais_local.json` config must look as follows:

```json
{
    "net": {
        "hostname": "127.0.1.0",
        "port": "8080"
    },
    "fspaths": {
        "/tmp/ais/1": {},
        "/tmp/ais/2": {},
        "/tmp/ais/3": {}
    }
}
```

* Each local path from the `fspaths` section above must be (or contain as a prefix) a mountpath of a local filesystem.
* Each local filesystem (above) must utilize one or more data drives, whereby none of the data drives is shared between two or more local filesystems.
* Each filesystem must be fine-tuned for reading large files/blocks/xfersizes.

### Disk priority

AIStore can be sensitive to spikes in I/O latencies, especially when running bursty traffic that carries large numbers of small-size I/O requests (resulting from writing and/or reading small-size objects).
To ensure that latencies do not spike because of other processes running on the system, it is recommended to give the `aisnode` process the highest disk priority.
This can be done using the [`ionice`](https://linux.die.net/man/1/ionice) tool.

```console
$ # Best effort, highest priority
$ sudo ionice -c2 -n0 -p $(pgrep aisnode)
```

### Benchmarking disk

**TIP:** double-check that the rootfs/tmpfs of the AIStore target are _not_ used when reading and writing data.

When running a benchmark, make sure to run and collect the following on each and every target:

```console
$ iostat -cdxtm 10
```

For local drive read performance, the fastest block-level read smoke test is:

```console
$ hdparm -Tt /dev/<drive-name>
```

To read a block from a certain offset (in gigabytes), use the `--direct` argument to bypass the drive's buffer cache and read directly from the disk:

```console
$ hdparm -t --direct --offset 100 /dev/<drive-name>
```

More: [Tune hard disk with `hdparm`](http://www.linux-magazine.com/Online/Features/Tune-Your-Hard-Disk-with-hdparm).

### Local filesystem

Another way to increase storage performance is to benchmark different filesystems: `ext`, `xfs`, `openzfs`.
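One way to compare candidate filesystems is to run an identical large-block sequential-read job against each of them. A minimal `fio` sketch, assuming a mountpath `/ais/sda` (the path and sizes below are illustrative):

```console
# sequential 1MB reads, direct I/O, against a directory on the filesystem under test
$ fio --name=seqread --directory=/ais/sda --rw=read --bs=1m --size=10g \
      --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 --group_reporting
```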
Tuning the corresponding I/O scheduler can also prove to be important:

* [ais_enable_multiqueue](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/host-config/docs/ais_enable_multiqueue.md)

Other related references:

* [How to improve disk IO performance in Linux](https://www.golinuxcloud.com/how-to-improve-disk-io-performance-in-linux/)
* [Performance Tuning on Linux — Disk I/O](https://cromwell-intl.com/open-source/performance-tuning/disks.html)

### `noatime`

One of the most important performance improvements can be achieved by simply turning off `atime` (access time) updates on the filesystem.

Example: mount xfs with the following performance-optimized parameters (that also include `noatime`):

```console
noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier
```

> As an aside, it is worth noting that AIStore itself updates access times lazily. The complete story has several pieces:

1. On the one hand, AIS can be deployed as a fast cache tier in front of any of the multiple supported [backends](providers.md). To support this specific usage scenario, AIS tracks access times and the remaining capacity. The latter has 3 (three) configurable thresholds: low-watermark, high-watermark, and OOS (out of space) - by default, 75%, 90%, and 95%, respectively. Running low on free space automatically triggers the LRU job, which starts visiting evictable buckets (if any), sorting their content by access times, and yes - evicting. Bucket "evictability" is also configurable and, in turn, has two different defaults: true for buckets that have a remote backend, and false otherwise.
2. On the other hand, we certainly don't want any writes upon reads. That's why object metadata is cached and atime gets updated in memory to a) avoid extra reads and b) absorb multiple accesses and multiple atime updates. Further, if and only if we update atime in memory, we also mark the metadata as `dirty`. That way, if for whatever reason we later need to un-cache it, we flush it to disk (along with the most recently updated access time).
3. The actual mechanism of when and whether to flush the corresponding metadata is driven by two conditions: the time since last access and the amount of free memory. If there's plenty of memory, we flush `dirty` metadata after 1 hour of inactivity.
4. Finally, in AIS all configuration-related defaults (e.g., the default watermarks mentioned above) are also configurable - but that's a different story and a different scope...

External links:

* [The atime and noatime attribute](http://en.tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap6sec73.html)
* [Mount with noatime](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/global_file_system_2/s2-manage-mountnoatime)
* [Gain 30% Linux Disk Performance with noatime](https://lonesysadmin.net/2013/12/08/gain-30-linux-disk-performance-noatime-nodiratime-relatime)

## Virtualization

There must be no sharing of host resources between two or more VMs that are AIS nodes.

Even if there is a single virtual machine, the host may decide to swap it out when idle, or give it a single hyperthreaded vCPU instead of a full-blown physical core - this condition must be prevented.

A virtualized AIS node needs to be given its physical resources - memory, CPU, network, storage - in their entirety.
The hypervisor should be left with only the absolutely required minimum.

Make sure to use PCI passthrough to assign a device (NIC, HDD) directly to the AIS VM.

AIStore's primary goal is to scale with clustered drives. Therefore, the choice of a drive type and its capabilities remains very important.

Finally, when initializing virtualized disks it is advisable to set an optimal [block size](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/disk-performance.html).

## Metadata write policy

There will be times when it makes sense to keep the storage system's metadata strictly in memory or, maybe, cache it and flush it on a best-effort basis. The corresponding use cases include temporary datasets - transient data of any kind that can (and will) be discarded. There's also the case when AIS is used as a high-performance caching layer where the original data is already sufficiently replicated and protected.

Metadata write policy - json tag `write_policy.md` - was introduced specifically to support those and related scenarios. Configurable both globally and for each individual bucket (or dataset), the current set of supported policies includes:

| Policy | Description |
| --- | --- |
| `immediate` | write immediately upon updates (global default) |
| `delayed` | cache and flush when not accessed for a while (see also: [noatime](#noatime)) |
| `never` | never write but always keep metadata in memory (aka "transient") |

> For the most recently updated enumeration, please see the [source](/cmn/api_const.go).

## PUT latency

AIS provides checksumming and self-healing - capabilities that ensure that user data is end-to-end protected and that data corruption, if it ever happens, will be properly and timely detected and - in the presence of any type of data redundancy - resolved by the system.

There's a price, though, and there are scenarios where you could make an educated choice to trade checksumming for performance.

In particular, let's say that we are massively writing new content into a bucket.

> The type of the bucket doesn't matter - it may be an `ais://` bucket, or `s3://`, or any other supported [backend](/docs/bucket.md#backend-provider) including HDFS and HTTP.

What matters is that we *know* we'll be overwriting few objects, percentage-wise. It would then stand to reason that AIS, on its end, should refrain from trying to load the destination object's metadata - that is, skip loading the existing object's metadata to compare checksums (and thus maybe avoid writing altogether if the checksums match) and/or to update the object's version, etc.

* [API: PUT(object)](/api/object.go) - look for the `SkipVC` option
* [CLI: PUT(object)](/docs/cli/object.md#put-object) - the `--skip-vc` option

## GET throughput

AIS is an elastic cluster that can grow and shrink at runtime as you attach (or detach) disks and add (or remove) nodes. Drives are prone to sudden failures while nodes can experience unexpected crashes. At all times, though, and in the presence of any and all events, AIS tries to keep a [fair and balanced](/docs/rebalance.md) distribution of user data across all clustered nodes and drives.

The idea of keeping all drives **equally utilized** is the absolute cornerstone of the design.

There's one simple trick, though, to improve drive utilization even further: [n-way mirroring](/docs/storage_svcs.md#n-way-mirror).
In fact,

* if you observe <span style="color:red">90% and higher</span> utilizations,
* and if adding more drives (or more nodes) is not an option,
* and if you still have some spare capacity to create additional copies of data -

**if** all of the above is true, it would be a very good idea to *mirror* the corresponding bucket, e.g.:

```console
# reconfigure an existing bucket (`ais://abc` in the example) for 3 copies

$ ais bucket props set ais://abc mirror.enabled=true mirror.copies=3
Bucket props successfully updated
"mirror.copies" set to: "3" (was: "2")
"mirror.enabled" set to: "true" (was: "false")
```

In other words, under extreme loads n-way mirroring is especially, and strongly, recommended. (Data protection would then be an additional bonus, needless to say.) The way it works is simple:

* each AIS target constantly monitors drive utilizations
* given multiple copies, the target always selects a copy (and the drive that stores it) based on the drive's utilization

Ultimately, a drive that has fewer outstanding I/O requests and is less utilized will always win.

## `aisloader`

AIStore includes `aisloader` - a powerful benchmarking tool that can be used to generate a wide variety of workloads closely resembling those produced by AI apps.

For numerous command-line options and usage examples, please see:

* [Load Generator](/docs/aisloader.md)
* [How To Benchmark AIStore](howto_benchmark.md)

Or, simply run `aisloader` with no arguments to see its online help, including command-line options and numerous usage examples:

```console
$ make aisloader; aisloader

AIS loader (aisloader v1.3, build ...) is a tool to measure storage performance.
It's a load generator that has been developed to benchmark and stress-test AIStore
but can be easily extended to benchmark any S3-compatible backend.
For usage, run: `aisloader`, or `aisloader usage`, or `aisloader --help`.
Further details at https://github.com/NVIDIA/aistore/blob/main/docs/howto_benchmark.md

Command-line options
====================
...
...

```

Note as well that `aisloader` is fully StatsD-enabled - collected metrics can be forwarded to any StatsD-compliant backend for visualization and further analysis.
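As a quick, hedged illustration (the bucket name, object sizes, worker count, and durations below are arbitrary - see the [Load Generator](/docs/aisloader.md) docs for the authoritative set of options), a small write-then-read run might look along these lines:

```console
# PUT 1MB objects into ais://abc for 2 minutes with 8 workers, keeping the data afterwards
$ aisloader -bucket=ais://abc -duration=2m -numworkers=8 -pctput=100 \
    -minsize=1MB -maxsize=1MB -cleanup=false

# then read the same bucket back for 2 minutes (100% GET)
$ aisloader -bucket=ais://abc -duration=2m -numworkers=8 -pctput=0 -cleanup=false
```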