---
layout: post
title: "AIStore as a Fast Tier Storage Solution: Enhancing Petascale Deep Learning Across Remote Cloud Backends"
date: November 27, 2023
author: Abhishek Gaikwad, Aaron Wilson, Alex Aizman
categories: aistore cloud petascale linear-scalability
---

The challenges associated with loading petascale datasets, crucial for training models in both vision and language processing, pose significant hurdles in the field of deep learning. These datasets, often hosted on various cloud backends, add complexity to the training process. Existing cloud storage solutions are increasingly seen as too expensive and/or too slow to handle petascale training.

A big part of the problem is that machine learning workloads (training in particular) require random access to a vast (and [growing](https://arxiv.org/abs/2211.04325)) amount of unstructured data. Random access to a petabyte of data without having a petabyte of RAM at your disposal (or some other fast cache) is difficult. As a result, storage often becomes the primary bottleneck for contemporary machine learning ([Ref.](https://opendatascience.com/data-storage-keeping-pace-for-ai-and-deep-learning/)).

To solve precisely this problem, we have built [AIStore](https://github.com/NVIDIA/aistore) (AIS). AIS is a lightweight, fully open-source, fully reliable storage system that can be deployed ad hoc anywhere from a single Linux machine to a bare-metal cluster of any size.

This blog provides AIS benchmarks and analysis. We compare performance metrics for accessing cloud datasets with and without AIStore in the data path.

AIS features linear scalability with each added storage node - in fact, with each added drive. Our testing indicates that deploying an AIS cluster within the same data center as compute servers not only delivers high per-GPU throughput but also ensures stable latencies with minimal jitter. Additionally, AIS significantly reduces total cost by eliminating data egress fees from cloud providers.

## Background and Requirements

AIStore's essential prerequisite is a Linux machine with disks. While not a requirement, a managed Kubernetes (K8s) environment is highly recommended to streamline [deployment](https://github.com/NVIDIA/ais-K8s/blob/master/docs/README.md) and management. Direct deployment on bare-metal instances is possible, but managed K8s is advised for efficiency and ease of use given the complexities associated with K8s management.

In an AIS cluster, proxies (gateways) and targets (storage nodes) efficiently manage data requests from clients. When a client issues a GET request, a proxy, chosen randomly or specifically for load balancing, directs the request to an appropriate target based on the current cluster map. If the target has the data, it is sent directly to the client — a 'warm GET'. For unavailable data, AIS executes a 'cold GET', involving a series of steps: remote GET through the vendor's SDK, local storage of the object, validation of checksums for end-to-end protection (if enabled), storage of metadata (both local and remote, such as ETag, version, checksums, custom), making the object visible (only at this stage), and finally, creating additional copies or slices as per bucket properties, such as n-way replication or erasure coding.

This architecture is particularly advantageous for multi-epoch training on large and super-large datasets. During the first epoch, data objects are cached onto the disks, enabling direct and swift access in subsequent epochs, thereby significantly enhancing overall performance.

![AIStore Architecture](/assets/aistore-fast-tier/ais-architecture.png)
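To make the warm/cold distinction concrete, here is a minimal sketch using the AIStore Python SDK (the `aistore` package). The endpoint, bucket, and object names below are placeholders, and the exact reader API may differ slightly between SDK versions:

```python
from aistore.sdk import Client

# Placeholder endpoint: any AIS gateway (proxy) in the cluster.
client = Client("http://ais-proxy.example.com:51080")

# An S3 bucket served through AIS; "aws" is the provider name AIS uses for
# Amazon S3. The bucket and object names here are hypothetical.
bucket = client.bucket("training-shards", provider="aws")
obj = bucket.object("shard-000001.tar")

# First access: cold GET -- AIS fetches the object from S3, writes it to local
# drives, validates/stores checksums and metadata, then serves it to the client.
data = obj.get().read_all()

# Any subsequent access (e.g., epochs 2..N): warm GET served straight from the
# cluster's NVMe drives, with no round trip to the cloud backend.
data = obj.get().read_all()
```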
## Benchmark: Setup and Configuration

Recently, we conducted benchmarks of AIS on Oracle Cloud Infrastructure ([OCI](https://www.oracle.com/cloud/)). Our setup utilized AIStore's Kubernetes deployment, managed by the Oracle Kubernetes Engine (OKE) running Kubernetes version `1.26.7`. The benchmarks involved two configurations: a 3-node and a 10-node cluster, each employing [BM.DenseIO.E4.128](https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-dense) instances on OCI.

| Specification | Details |
|--------------------------|----------------------|
| **OCPU Cores** | 128 |
| **Memory** | 2 TB |
| **NVMe Drives** | 8 drives (54.4 TB total) [Samsung PM1733](https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1733-pm1735/mzwlj7t6hala-00007/#:~:text=Sequential%20Read%207000%20MB%2Fs%20Sequential,by%20the%20user%27s%20system%20configuration) |
| **Network Interface Card** | 50 Gbps |
| **OS (Kernel)** | Linux (5.15.0-104.119.4.2.el8uek.x86_64) |

It's important to note that while AIS in this context doesn't demand extensive CPU and memory resources (in fact, during our benchmarks we never observed CPU utilization exceeding 5%), other scenarios like resharding or ETL may have different requirements.

In parallel, the clients utilized [`aisloader`](https://github.com/NVIDIA/aistore/blob/main/docs/aisloader.md), a load generator for benchmarking AIS and other S3-compatible backends. For the 3-node cluster, we deployed 6 clients (6 separate DenseIO.E4 machines), each running 50 workers, for a total of 300 workers. For the 10-node cluster, the setup included 10 clients with 100 workers each, for a total of 1000 workers. Each `aisloader` worker performed 1 MB and 10 MB read requests - from AIS or directly from S3. For testing, we utilized the [aisloader-composer](https://github.com/NVIDIA/aistore/tree/main/bench/tools/aisloader-composer), a suite of scripts and [Ansible](https://github.com/ansible/ansible) playbooks that orchestrates `aisloader` benchmarks of an AIS cluster across multiple hosts.
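`aisloader` itself is a Go tool, driven by Ansible in our setup, but the access pattern each worker generates is simple. Purely for illustration, here is a rough Python sketch of one benchmark client running a pool of workers that repeatedly GET fixed-size objects through the cluster; the endpoint, bucket, object naming scheme, and counts are hypothetical:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

from aistore.sdk import Client

WORKERS = 50            # workers per client machine in the 3-node benchmark
NUM_OBJECTS = 100_000   # hypothetical number of pre-populated 1 MB objects
DURATION_S = 60         # run each worker for one minute

client = Client("http://ais-proxy.example.com:51080")  # placeholder gateway
bucket = client.bucket("bench-1mb", provider="aws")    # hypothetical bucket

def worker(worker_id: int) -> int:
    """GET randomly chosen objects until the deadline; return bytes read."""
    nbytes, deadline = 0, time.time() + DURATION_S
    while time.time() < deadline:
        name = f"obj-{random.randrange(NUM_OBJECTS):06d}"  # hypothetical naming
        nbytes += len(bucket.object(name).get().read_all())
    return nbytes

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    total = sum(pool.map(worker, range(WORKERS)))

print(f"client throughput: {total / DURATION_S / 2**20:.1f} MiB/s")
```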
## Summary of Performance Metrics

Below, we present a detailed comparison of throughput and latency in different scenarios. We tested direct data retrieval from S3 without AIS (**'S3 DIRECT'**), as well as initial (**'S3 COLD through AIS'**) and subsequent (**'S3 WARM through AIS'**) accesses through AIS with S3 as the backend.

The tests were conducted with 1 MB and 10 MB object sizes. The tables below list throughput in GiB/s and average latency in milliseconds. For the AIS scenarios, they also include average disk utilization, shedding light on the efficiency of data retrieval and processing.

### 3-Node AIS Cluster Performance with 300 Client Workers

**Cluster:**
- 3 gateways, 3 targets
- 24 NVMe disks, total raw capacity of 149 TB

**Bench Client:**
- 6 `aisloader` clients (DenseIO.E4 machines), each with 50 workers, totaling 300 workers

**Results:**

| GET from/how | Object Size | Throughput (GiB/s) | Latency Avg (ms) | Avg. Disk Util. (%) |
|--------------|-------------|--------------------|------------------|---------------------|
| S3 COLD through AIS | 1 MB | 1.080 | 264.388 | 1 |
| S3 WARM through AIS | 1 MB | 16.450 | 17.793 | 20 |
| S3 DIRECT | 1 MB | 1.480 | 197.243 | n/a |
| | | | | |
| S3 COLD through AIS | 10 MB | 2.260 | 1291.5 | 1 |
| S3 WARM through AIS | 10 MB | 16.520 | 177.234 | 20 |
| S3 DIRECT | 10 MB | 2.640 | 1108.5 | n/a |

### 10-Node AIS Cluster Performance with 1000 Client Workers

**Cluster:**
- 10 gateways, 10 targets
- 80 NVMe disks, total raw capacity of 496 TB

**Bench Client:**
- 10 `aisloader` clients (DenseIO.E4 machines), each running 100 workers, totaling 1000 workers

**Results:**

| GET from/how | Object Size | Throughput (GiB/s) | Latency Avg (ms) | Avg. Disk Util. (%) |
|--------------|-------------|--------------------|------------------|---------------------|
| S3 COLD through AIS | 1 MB | 2.780 | 350.823 | 1 |
| **S3 WARM through AIS** | **1 MB** | **53.550** | **18.22** | **96** |
| S3 DIRECT | 1 MB | 2.750 | 353.979 | n/a |
| | | | | |
| S3 COLD through AIS | 10 MB | 2.980 | 3257.3 | 1 |
| **S3 WARM through AIS** | **10 MB** | **54.680** | **178.386** | **97** |
| S3 DIRECT | 10 MB | 2.610 | 3599.8 | n/a |

![Throughput and Network Comparison](/assets/aistore-fast-tier/comparison_chart.png)

## Performance Analysis and Network Insights

In our study, two key observations emerged, highlighting the performance and efficiency of AIStore.

### Linear Scalability Correlated with Storage Nodes

Our first observation _concerns_ the correlation between performance and the number of target nodes in a cluster.

In the 3-node setup with 300 workers and 24 NVMe drives, we achieved a throughput of 16.5 GiB/s, or 47.2 Gb/s per node. Similarly, in the 10-node setup with 1000 workers and 80 drives, the throughput was around 54.68 GiB/s, or 47.0 Gb/s per node. This consistency in per-node throughput across different configurations strongly suggests that AIStore's performance scales linearly with each additional target node. Disk reads sit around 704 MiB/s per drive for the 3-node cluster and 700 MiB/s for the 10-node cluster, again implying linear scaling but falling far short of what the NVMe drives are capable of. The limiting factor turns out to be the 50 Gb/s capacity of the network interface on each node -- more on that later.
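A back-of-the-envelope check of those per-node and per-drive figures, using the aggregate warm-GET throughput from the tables above (GiB/s converted to decimal gigabits for the NIC comparison):

```python
# Sanity-check the per-node and per-drive rates quoted above.
GiB = 2**30

def per_node_gbps(agg_gib_s: float, nodes: int) -> float:
    """Aggregate GiB/s -> decimal Gb/s per storage node."""
    return agg_gib_s * GiB * 8 / nodes / 1e9

def per_drive_mib_s(agg_gib_s: float, drives: int) -> float:
    """Aggregate GiB/s -> MiB/s per NVMe drive."""
    return agg_gib_s * 1024 / drives

print(per_node_gbps(16.5, 3))      # ~47.2 Gb/s per node  (3-node cluster)
print(per_node_gbps(54.68, 10))    # ~47.0 Gb/s per node  (10-node cluster)
print(per_drive_mib_s(16.5, 24))   # ~704 MiB/s per drive (3-node cluster)
print(per_drive_mib_s(54.68, 80))  # ~700 MiB/s per drive (10-node cluster)
```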
Initially, disk utilization for WARM GETs is high (97%), but it decreases as `aisloader` clients fetch more objects and those objects end up in [Linux's Page Cache](https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics). The accompanying graph of Throughput vs. Disk Read illustrates this trend: as objects are cached, disk reads diminish, leading to lower disk utilization.

![Throughput vs. Disk Read](/assets/aistore-fast-tier/throughput-vs-diskread.png)

We set up a Grafana dashboard for monitoring and also stored logs from both `aisloader` and AIS for reference. Below are the stats logs from one of the ten `aisloader` clients used in our tests.

![aisloader logs](/assets/aistore-fast-tier/aisloader-logs.png)

### Network Efficiency and Recommendations

Our second observation _pertains_ to network efficiency. The relationship between network bandwidth and data payload reveals minimal network overhead when considering MSS (Maximum Segment Size) and MTU (Maximum Transmission Unit).

![Throughput vs Network](/assets/aistore-fast-tier/throughput-vs-network.png)

When running the 'warm GET' benchmark with 10 MiB objects, we observed a delta of approximately 1.8 GiB/s between the payload `aisloader` received and the total network traffic.

Inspecting the packets revealed 66 bytes of headers:
- 14: fixed length of the Ethernet header
- 20: minimum size of the IP header
- 32: minimum size of the TCP header (20) plus 2 bytes of padding and 10 bytes for the `Timestamp` option (used for dealing with congestion in busy networks such as this one)

`aisloader` reported an object throughput of 54.68 GiB/s, indicating transmission of around 5,600 objects per second (54.68 GiB / 10 MiB).
At our current MTU of 9000, transferring a single object takes 1,056 packets, for a header overhead of roughly 70 KB per object.
With 70 KB of overhead per object and 5,600 objects per second, that accounts for roughly 372 MiB/s of the delta between actual object transfer and total network usage.
These findings explain the preference and recommendation for jumbo frames in network configuration.
With an even larger MTU, e.g. 64K, the 10 MiB object could be sent with only 165 packets, reducing the overhead per object from 70 KB to only 10.9 KB (see the arithmetic below).
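Spelling out that arithmetic, taking the per-object packet counts and the 66 bytes of headers per packet as given:

```python
# Header overhead implied by the packet counts above.
HEADER_BYTES = 66                    # Ethernet + IP + TCP (with timestamps)
objects_per_sec = 54.68 * 1024 / 10  # ~5,600 ten-MiB objects per second

def header_overhead(packets_per_object: int) -> tuple[float, float]:
    """Return (KB of headers per object, MiB/s of header traffic cluster-wide)."""
    per_object_kb = packets_per_object * HEADER_BYTES / 1e3
    cluster_mib_s = packets_per_object * HEADER_BYTES * objects_per_sec / 2**20
    return per_object_kb, cluster_mib_s

print(header_overhead(1056))  # MTU 9000: ~70 KB/object, ~372 MiB/s of headers
print(header_overhead(165))   # MTU ~64K: ~11 KB/object,  ~58 MiB/s of headers
```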
Looking at the packets again, we can see that each GET request made by `aisloader` to the cluster was 210 bytes (including HTTP headers). Since each GET request is redirected back to the client and then to the correct target, this also must be included in our measurement of "total network sent". But even at 5,600 objects per second, this comes out to only 1.12 MiB/s -- almost irrelevant at this scale.

To explain the remaining difference, we added a filter to look only at traffic outbound from the AIS cluster in K8s. As seen in the graph below, that reduced the total delta to around 1.3 GiB/s -- cutting out any additional traffic on the hosts, including traffic from metrics. However, that still leaves around 900 MiB/s of extra traffic.

![Network filtered by K8s](/assets/aistore-fast-tier/network-with-K8s.png)

At this point, we began experimenting with the number of `aisloader` worker threads and realized that 100 workers per client on 10 `aisloader` clients (1000 worker threads total) was excessive. Not only that, it also brought the unintended side effect of increased network congestion.

Reducing the number of worker threads to 20 per client resulted in slightly lower object throughput but yielded more stable results, less network overhead, and significantly lower latency. As we increased workers to push for maximum throughput, we began to hit the limits of our network interfaces, resulting in dropped, retransmitted, out-of-order, and duplicated packets. Without all of this extra congestion, the 20-worker-per-client benchmark showed the results we expected, with an extremely efficient total network overhead of <400 MiB/s.

![Network with 20-worker clients](/assets/aistore-fast-tier/network-20-threads.png)

Finally, each node in our setup was equipped with a 50 Gbps Ethernet network adapter.
Translated into GiB/s, this provides a maximum transfer speed of approximately 5.82 GiB/s per node.
For the 3-node cluster, we recorded an object throughput of 16.5 GiB/s, aligning closely with the theoretical maximum of 17.46 GiB/s.
Similarly, the 10-node cluster exhibited a throughput of around 54.68 GiB/s, nearing the collective capacity of approximately 58.2 GiB/s.
Accounting for network overhead, our results show that for this cluster AIS fully saturates the available network capacity.
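For reference, the line-rate conversion behind those ceilings (decimal gigabits on the wire, binary GiB on the application side):

```python
# 50 Gbps NICs expressed in GiB/s, compared with the measured warm-GET throughput.
def nic_gib_s(gbps: float) -> float:
    """Decimal gigabits per second -> GiB/s."""
    return gbps * 1e9 / 8 / 2**30

per_node = nic_gib_s(50)           # ~5.82 GiB/s per node
print(3 * per_node, "vs", 16.5)    # 3-node ceiling ~17.46 GiB/s vs measured 16.5
print(10 * per_node, "vs", 54.68)  # 10-node ceiling ~58.2 GiB/s vs measured 54.68
```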
## Discussion and Next Steps

As observed in the networking section above, the network interface in our benchmark configuration is the primary performance bottleneck. To further increase throughput, more nodes and/or higher-performance network adapters will be required. Alternatively, we could use LACP trunking or (static) NIC teaming.

Either way, our immediate next objective is removing the network bottleneck.

We also observed significantly better latency when reducing the number of client worker threads from 100 to 20 per client (~41 ms vs ~178 ms on average).

In a production environment, we would expect the number of AIS nodes to be tuned up or down to meet the clients' needs (and not the other way around). But for these benchmarks, we tuned the load generated by the clients to be appropriate for the AIS cluster.

This raises the question of how many worker threads are necessary to stress-test [this particular cluster](#10-node-ais-cluster-performance-with-1000-client-workers).

In our tests, we observed throughput of around 4.5 GiB/s with a single worker thread on each `aisloader` client machine, with latencies in the range of 21 ms.

As we increased the number of worker threads, throughput increased almost linearly while latency stayed fairly consistent.
Going up to 10 threads per `aisloader` client increased throughput to around 40 GiB/s with only a ~4 ms increase in latency.
However, as workers increased further, we saw diminishing returns.
Moving from 10 to 20 workers per client (100 to 200 total) increased throughput by 20% but increased latency by almost 100%.

Going further, from 200 to 1000 total workers, bandwidth increased by only 4% while latency increased by 334%.
As requests began to queue at the network and disk layers, latency spiked drastically with only a slight increase in throughput -- the targets are never idle, but each request must wait (longer) for its turn. This spike in latency signals that additional bench workers will start to have adverse effects on performance.

**In other words:**

If the AIS cluster is **not** overprovisioned for a given (bench or compute) client, then as the load increases we _may_ see the picture described above: increasing average latency and "diminishing returns" in terms of higher throughput.
Our future steps include determining appropriate thresholds for these metrics on a per-workload basis, as they will be crucial not only for generating benchmark load but also for scaling and sizing the cluster itself in production.

Running the full suite of benchmarks with more finely tuned client workers should result in better latency numbers, perhaps at a small cost to total throughput.

## Conclusion

In summary, AIS demonstrated linear scalability in this particular setup, effectively scaling from 24 to 80 drives. It showed efficient network usage with minimal overhead and saturated our available network capacity. We also observed the tradeoff between maximal throughput and latency and identified signs of excessive load on the cluster. While these findings are specific to our benchmark and setup, they underscore AIStore's architectural strengths in providing an efficient approach to handling large-scale data in machine learning applications.

## References

1. GitHub:
   - [AIStore](https://github.com/NVIDIA/aistore)
   - [AIS-K8s](https://github.com/NVIDIA/ais-K8s)
   - [Deploying AIStore on K8s](https://github.com/NVIDIA/ais-K8s/blob/master/docs/README.md)
2. Documentation, blogs, videos:
   - https://aiatscale.org
   - https://github.com/NVIDIA/aistore/tree/main/docs
   - [aisloader](https://github.com/NVIDIA/aistore/blob/main/docs/aisloader.md)