
     1  ---
     2  layout: post
     3  title: "Maximizing Cluster Bandwidth with AIS Multihoming"
     4  date: February 16, 2024
     5  author: Aaron Wilson
     6  categories: aistore cloud multihome benchmark network oci fast-tier cache
     7  --- 
     8  
Identifying bottlenecks in high-performance systems is critical to getting the most out of the hardware and its associated costs. 
    10  While [AIStore (AIS)](https://github.com/NVIDIA/aistore) can provide linear scaling of performance with additional drives, in our [previous article](https://aiatscale.org/blog/2023/11/27/aistore-fast-tier) we observed a hard limit to this scaling -- our total network bandwidth. 
    11  In our benchmark cluster, each AIS target node was provisioned with 8 [NVMe drives](https://semiconductor.samsung.com/us/ssd/enterprise-ssd/pm1733-pm1735/mzwlj7t6hala-00007/). These drives provided a significantly higher data throughput than the node's single 50 Gbps network interface could handle.
    12  
    13  We needed to expand our network capacity to increase our available read throughput and make the most of these powerful drives. 
One solution, called network bonding, network teaming, or link aggregation, allows multiple network interfaces to combine their throughput behind a single IP address. 
However, this is not always supported, especially in cloud environments where users may not have full control over the network. 
    16  This was the case for our benchmark cluster, deployed on Oracle Cloud Infrastructure (OCI). 
    17  The shape we used allowed us to attach additional virtual interfaces (VNICs) to the 2 physical interfaces on the host, but did not support bonding both of these interfaces to a single IP address, so we needed a different solution.
    18  
    19  An alternative, introduced in the upcoming AIS release 3.22, is multihoming. 
This feature works on any multi-interface host, with no link aggregation configuration required. 
    21  Rather than combining the interfaces onto a single external IP for each node, this approach communicates over them independently, allowing for full utilization of all interfaces. 
In our benchmarks below, we used multihoming to achieve nearly double the throughput of our previous tests and unlock the full read potential of our high-speed NVMe drives. 
    23  
    24  ## Configuring the hosts
    25  
    26  The test setup used for benchmarking our AIS cluster with multihoming is shown below:
    27  
    28  - AIS Hosts: 10 nodes running Oracle Linux 8 on the following OCI shape:
    29  
    30  | Shape          | OCPU | Memory (GB) | Local Disk          | Max Network Bandwidth |
    31  |----------------|------|-------------|---------------------|-----------------------|
    32  | BM.DenseIO.E4.128 | 128  | 2048        | 54.4 TB NVMe SSD Storage (8 drives) | 2x 50 Gbps |
    33  
    34  - AISLoader clients: 10 nodes running Ubuntu 22.04 on the following OCI shape:
    35  
    36  | Shape           | OCPU | Memory (GB) | Local Disk      | Max Network Bandwidth |
    37  |-----------------|------|-------------|-----------------|-----------------------|
| BM.Standard3.64 | 64   | 1024        | Block storage only | 2x 50 Gbps            |
    39  
    40  - Networking: AIS hosts and AISLoader clients were deployed in a single OCI VCN with 2 subnets to support a secondary VNIC on all client and AIS hosts
    41  
    42  ## Deploying with Multihoming
    43  
    44  For a full walkthrough of a multi-homed AIS deployment, check [the documentation in the AIS K8s repository](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/ais-deployment/docs/deploy_with_multihome.md). 
    45  
Before taking advantage of AIS multihoming, the systems themselves must be configured with multiple IPs on multiple interfaces. In our case, this involved adding a second VNIC to each host in OCI and configuring the OS routing rules with OCI's provided scripts, following this OCI [guide](https://docs.oracle.com/iaas/compute-cloud-at-customer/topics/network/creating-and-attaching-a-secondary-vnic.htm).
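
For a rough idea of the end state on an Ubuntu client, the sketch below shows a hand-written netplan stanza for a second interface, including a source-based routing rule so replies leave via the same interface. The interface name, addresses, gateway, and table number are placeholders; on OCI, the secondary-VNIC script handles all of this for you.

```yaml
# Illustrative only -- on OCI the provided secondary-VNIC script configures this.
# Interface name, addresses, gateway, and routing table are placeholders.
network:
  version: 2
  ethernets:
    ens5:
      addresses:
        - 10.51.249.33/24
      routes:
        - to: 0.0.0.0/0
          via: 10.51.249.1
          table: 101
      routing-policy:
        - from: 10.51.249.33/32
          table: 101
```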
    47  
**IMPORTANT: To use multihoming you _must_ use AIS K8s Operator version >= v0.97 and deploy AIS version >= 3.22 (or `latest`)**
    49   
Next, update the ansible host config to tell the deployment which additional IPs to use. Add `additional_hosts` to each ansible host entry as a comma-separated list of the extra IPs for that node. For example, the hosts file for a 3-node cluster might look like this:
    51  
    52  ```yaml
    53  [ais]
    54  controller_host ansible_host=10.51.248.1 additional_hosts=10.51.248.32
    55  worker1 ansible_host=10.51.248.2 additional_hosts=10.51.248.33
    56  worker2 ansible_host=10.51.248.3 additional_hosts=10.51.248.34
    57  ```
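
If a node has more than one extra interface, `additional_hosts` takes all of them as a comma-separated list, for example (host name and IPs are illustrative):

```yaml
worker3 ansible_host=10.51.248.4 additional_hosts=10.51.248.35,10.51.249.35
```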
    58  
By default, K8s pods are attached to only a single network interface. To add another, we'll need to use [Multus CNI](https://github.com/k8snetworkplumbingwg/multus-cni). The `create_network_definition` playbook (provided in the [ais-k8s repo](https://github.com/NVIDIA/ais-k8s/tree/master)) will automatically install the latest version of Multus in the cluster and create a network definition the pods can use.
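
As a rough sketch of what such a network attachment definition looks like, here is a macvlan-bridge example in the format Multus expects. The definition name, master interface, and IPAM block are assumptions for illustration; the playbook generates the actual definition for your cluster.

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ais-net                  # placeholder name
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens5",
      "mode": "bridge",
      "ipam": {
        "type": "static"
      }
    }
```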
    60  
    61  Once the additional hosts have been added to the hosts file and the network attachment definition has been created, all that is needed is a standard AIS deployment. The AIS [K8s operator](https://github.com/NVIDIA/ais-k8s/tree/master/operator) will take care of connecting each AIS pod to the specified additional hosts through Multus.
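
For reference, Multus selects additional networks through a pod annotation; a minimal sketch of an annotated pod is shown below. The pod name, image, and network name are placeholders, and in practice the operator applies the annotation to the AIS pods it manages.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multus-annotation-example      # placeholder; real pods are created by the operator
  annotations:
    # Attach this pod to the additional network definition created above
    k8s.v1.cni.cncf.io/networks: ais-net
spec:
  containers:
    - name: aisnode
      image: aistorage/aisnode:latest  # illustrative image reference
```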
    62  
    63  Below is a simple network diagram of how the AIS pods work with Multus in our cluster. We are using a macvlan bridge to connect the pod to the second interface. This is configured in the network attachment definition created by our `create_network_definition` playbook. AIS can also be configured to use other Multus network attachment definitions. See our [multihome deployment doc](https://github.com/NVIDIA/ais-k8s/blob/master/playbooks/ais-deployment/docs/deploy_with_multihome.md) and the Multus [usage guide](https://github.com/k8snetworkplumbingwg/multus-cni/blob/master/docs/how-to-use.md) for details on using this playbook and configuring network attachment definitions. 
    64  
    65  ![Multus Network Diagram](/assets/multihome_bench/multus_diagram.png)
    66  
    67  After deploying with multihoming, we configured our AIS test buckets with a PBSS backend (NVIDIA's SwiftStack implementation). 
This is an S3-compatible backend, so AIS can handle any S3 requests to PBSS without API changes. 
    69  When a GET request fetches a new object, AIS will store that object locally in the cluster for future use. 
    70  
    71  ## Benchmark results
    72  
    73  We ran 3 benchmarks to validate the performance gains of using AIS as a fast-tier with multihoming. 
    74  In these tests, "cold" GET refers to a GET request to AIS where the object does not exist locally in the AIS cluster and must be fetched from the remote backend.
    75  On the other hand, a "warm" GET is our terminology for a request where the AIS cluster has a local copy of the data and can simply return it directly to the client. 
    76  
    77  ### Test 1: Cold GET with PBSS backend
For the first benchmark, we ran a single epoch of GET requests against a remote 20 TiB PBSS bucket populated with 10 MiB objects. 
    79  Each [AISLoader client](https://github.com/NVIDIA/aistore/blob/main/docs/aisloader.md) reads every object in the dataset in random order. 
    80  Over time, as objects are cached in AIS, the next clients to fetch the same object are able to get it without reaching out to PBSS.
    81  
    82  ![Cold GET throughput](/assets/multihome_bench/cold_get_throughput.png)
    83  
    84  ![Cold GET latency](/assets/multihome_bench/cold_get_latency.png)
    85  
    86  ![Cold GET disk utilization](/assets/multihome_bench/cold_get_disk_util.png)
    87  
    88  ### Test 2: Warm GET
Next, we ran a single epoch against the same bucket once all the objects were stored locally in AIS. 
    90  Querying the data already cached in AIS showed a massive improvement over fetching the data remotely from PBSS. 
    91  We saw consistently high throughput and low latency, but because nearly the entire dataset could fit in the memory on the AIS nodes, disk usage was inconsistent.
    92  
    93  ![Warm GET throughput](/assets/multihome_bench/warm_get_throughput.png)
    94  
    95  ![Warm GET latency](/assets/multihome_bench/warm_get_latency.png)
    96  
    97  ![Warm GET disk utilization](/assets/multihome_bench/warm_get_disk_util.png)
    98  
> Note: these numbers depend heavily on the specific deployment and on the network bandwidth to PBSS. AIS has an intentional advantage here given its use as a fast-tier: it is both nearer to the client (same VCN) and uses higher-performance drives than the remote PBSS cluster.
   102  
   103  | Metric                    | Direct from PBSS | From AIS Warm GET |
   104  |---------------------------|------------------|-------------------|
   105  | Average Throughput        | 3.12 GiB/s       | 104.64 GiB/s      |
   106  | Average Latency           | 1570 ms          | 46.7 ms           |
   107  
   108  
   109  ### Test 3: Warm GET, 100 TiB local dataset
Finally, we populated a separate 100 TiB bucket of 10 MiB objects to ensure that each request would require a fresh read from disk, testing how disk performance scales. We also ran FIO tests against the drives themselves to check that AIS performance was in line with the drives' raw capabilities.
   111  
   112  ![100 TiB GET throughput](/assets/multihome_bench/large_warm_get_throughput.png)
   113  
   114  ![100 TiB GET latency](/assets/multihome_bench/large_warm_get_latency.png)
   115  
   116  ![100 TiB GET disk utilization](/assets/multihome_bench/large_warm_get_disk_util.png)
   117  
   118  ```
   119  Total average latency: 47.29 ms
   120  Total throughput: 103.247 GiB/s
   121  ```
   122  
   123  The throughput and latency numbers do not show much difference from the benchmark on the smaller dataset, but the disk utilization is much higher, consistently over 97%. 
   124  This indicates we have shifted the limiting factor of our cluster performance to the disks!
   125  
While this is the desired result for a stress test, keep in mind it is _not_ what you would want to see from your cluster in sustained production use. 
Keeping the disks this busy means I/O requests queue up at the device, and if that queue grows too long, latency increases. 
   128  We observed this in a separate test with even more clients. 
With 80 worker threads per client rather than 50, we saw an increase in latency from 47 ms to 72 ms along with lower per-client throughput. 
   130  A pattern of increased latency and increased disk queueing indicates a need for scaling up the AIS cluster.
   131  
This chart shows the individual disk throughput for all disks in the cluster. There is only a small variance, as each drive is kept busy and performs near its specified maximum.
   133  
   134  ![100 TiB GET all-disk throughput](/assets/multihome_bench/large_warm_get_all_disks.png)
   135  
#### FIO comparison
   137  
   138  To check that this is consistent with the expected performance from our drives, we ran a [FIO benchmark](https://fio.readthedocs.io/en/latest/fio_doc.html) against the drives on one of our machines to get an idea of their capabilities. 
   139  
   140  We ran the FIO benchmark with the following configuration: 
   141  ```yaml
   142  [global]
   143  ioengine=libaio
   144  direct=0
   145  verify=0
   146  time_based
   147  runtime=30s
   148  refill_buffers
   149  group_reporting
   150  size=10MiB
   151  iodepth=32
   152  
   153  [read-job]
   154  rw=read
   155  bs=4096
   156  numjobs=1
   157  ```
   158  
   159  We noticed some variation from run to run with FIO, but the total throughput for a direct read from all disks on the node we tested was around 10.99 GiB/s. 
In the AIS benchmark above, the average throughput per node was 10.32 GiB/s (103.247 GiB/s across the 10 target nodes). 
So compared to this raw disk test, AIS achieved roughly 94% of the total read capacity of the drives on this setup.
   162  
   163  ## Conclusion
   164  
By utilizing an additional network interface and thus removing the network bottleneck, we were able to achieve full drive scaling and maximize the use of the drives in our cluster for this workload. 
In our "cold GET" benchmark, we observed throughput climb and latency drop as more objects were cached in AIS. 
   167  With all the data stored in AIS, our "warm GET" benchmark showed high performance across the entire epoch, limited now by disks rather than network. 
   168  Moving to an even larger dataset, the cluster showed high disk utilization and disk read performance consistent with results from independent drive benchmarks. 
   169  Even with a very large dataset under heavy load, the cluster maintained low latency and consistently high throughput to all clients. 
   170  
   171  ## References
   172  
   173  - [Multus CNI](https://github.com/k8snetworkplumbingwg/multus-cni)
   174  - [Flexible I/O Tester (FIO)](https://fio.readthedocs.io/en/latest/fio_doc.html)
   175  - [AIStore](https://github.com/NVIDIA/aistore)
- [AIS K8s](https://github.com/NVIDIA/ais-k8s)
   177  - [AISLoader](https://github.com/NVIDIA/aistore/blob/main/docs/aisloader.md)
   178  - [AIS as a Fast-Tier](https://aiatscale.org/blog/2023/11/27/aistore-fast-tier)