---
layout: post
title: "AIStore: Data Analysis w/ DataFrames"
date: Aug 15, 2022
author: Ryan Koo
categories: aistore dask
---

# AIStore: Data Analysis w/ DataFrames

[Dask](https://www.dask.org/) is a flexible open-source Python library for *parallel/distributed computing* and *optimized memory usage*. Dask extends many of today's popular Python libraries, providing scalability with ease of use.

This technical blog will dive into [Dask `DataFrames`](https://examples.dask.org/dataframe.html), a data structure built on top of [Pandas `DataFrames`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), and show how it can be used in nearly identical ways to analyze and mutate tabular data while offering *better performance*.

## Why Dask?

1. **Python Popularity**

   ![Programming Language Growth](/images/language-popularity.png)

   Python's popularity has skyrocketed over the past few years, especially with data scientists and machine learning developers. This is largely due to Python's extensive and mature collection of libraries for data science and machine learning, such as `Pandas`, `NumPy`, `Scikit-Learn`, `MatPlotLib`, `PyTorch`, and more.

   Dask integrates these Python-based libraries, providing scalability with little to no change in usage.

2. **Scalability**

   Dask effectively scales Python code from a *single machine* up to a *distributed cluster*.

   Dask maintains a low memory footprint, loading data in chunks as required and discarding any chunks that are not immediately needed. This means that relatively low-powered laptops and desktops *can* load and handle datasets that would normally be considered too large. Additionally, Dask can leverage the multiple CPU cores found in most modern-day laptops and desktops, providing an added performance boost.

   For large distributed clusters consisting of many machines, Dask efficiently scales large, complex computations across those machines: it breaks the computations up and allocates the pieces across the distributed hardware.

3. **Familiar API**

   ![Python Library Popularity](/images/python-package-popularity.png)

   The above-mentioned Python libraries have grown immensely in popularity in recent years. However, most of them were designed neither to scale beyond a single machine nor to keep pace with the exponential growth of dataset sizes. Many of them were developed *before* big-data use cases became prevalent and, as a result, can't process today's larger datasets. Even Pandas, one of the most popular Python libraries available today, struggles to perform with larger datasets.

   Dask allows you to natively scale these familiar libraries and tools to larger datasets while limiting change in usage.

## Data Analysis w/ Dask DataFrames

The [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html#dask-dataframe) is a data structure based on `pandas.DataFrame`, representing two-dimensional, size-mutable tabular data. Dask DataFrames consist of many Pandas DataFrames arranged along the *index*. In fact, the Dask DataFrame API [copies](https://docs.dask.org/en/stable/dataframe.html#dask-dataframe-copies-the-pandas-dataframe-api) the Pandas DataFrame API, and should be very familiar to previous Pandas users.
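To make this structure concrete, here is a minimal sketch (using a small, made-up dataset rather than the Zillow CSV used below) showing that each partition of a Dask DataFrame is itself an ordinary Pandas DataFrame:

```python
import pandas as pd
import dask.dataframe as dd

# Small, made-up dataset for illustration purposes only
pdf = pd.DataFrame({"price": [100, 200, 300, 400], "beds": [1, 2, 3, 4]})

# Split the Pandas DataFrame into 2 partitions along the index
df = dd.from_pandas(pdf, npartitions=2)

print(df.npartitions)                    # 2
print(type(df.partitions[0].compute()))  # <class 'pandas.core.frame.DataFrame'>
```

Operations on the Dask DataFrame are ultimately carried out per partition by the corresponding Pandas routines.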
The `dask.dataframe` library, like most other Dask libraries, supports data access via [HTTP(S)](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html#http-s). AIStore, in turn, provides both a native and an Amazon S3-compatible [REST API](https://aiatscale.org/docs/http-api), which means that data stored in AIStore can be accessed and used directly by Dask clients.

We can instantiate a Dask DataFrame, loading a sample CSV residing in an AIStore bucket, as follows:

```python
import os

import dask.dataframe as dd

AIS_ENDPOINT = os.environ["AIS_ENDPOINT"]

def read_csv_ais(bck_name: str, obj_name: str):
    # Read the object via AIStore's native REST API
    return dd.read_csv(f"{AIS_ENDPOINT}/v1/objects/{bck_name}/{obj_name}")

# Load CSV from AIStore bucket
df = read_csv_ais(bck_name="dask-demo-bucket", obj_name="zillow.csv")
```

Dask DataFrames are *lazy*, meaning that the data is only loaded when needed. Dask DataFrames can automatically use data partitioned between RAM and disk, as well as data distributed across multiple nodes in a cluster. Dask decides how to compute the results and where best to run the actual computation based on resource availability.

When a Dask DataFrame is instantiated, only the first partition of data is loaded into memory (for preview):

```python
# Preview data (first few rows) in memory
df.head()
```

The rest of the data is only loaded into memory when a computation is made. The following computations do not execute until the `compute()` method is called, at which point *only* the necessary parts of the data are pulled and loaded into memory:

```python
# Simple statistics (lazy - nothing is computed yet)
mean_price = df[' "List Price ($)"'].mean()
mean_size = df[' "Living Space (sq ft)"'].mean()
mean_bed_count = df[' "Beds"'].mean()
std_price = df[' "List Price ($)"'].std()
std_size = df[' "Living Space (sq ft)"'].std()
std_bed_count = df[' "Beds"'].std()

# Computations are executed
dd.compute({
    "mean_price": mean_price,
    "mean_size": mean_size,
    "mean_bed_count": mean_bed_count,
    "std_price": std_price,
    "std_size": std_size,
    "std_bed_count": std_bed_count,
})
```

Dask DataFrames also support more complex computations familiar to previous Pandas users, such as calculating statistics by group and filtering rows:

```python
# Mean list price of homes grouped by bath count
df.groupby(' "Baths"')[' "List Price ($)"'].mean().compute()

# Filtering data to a subset of only homes built after 2000
filtered_df = df[df[' "Year"'] > 2000]
```

> For an interactive demonstration of the Dask `DataFrame` features shown in this article (and more), please refer to the [Dask AIStore Demo (Jupyter Notebook)](https://github.com/NVIDIA/aistore/blob/main/python/examples/dask/dask-aistore-demo.ipynb).

## References

* [Dask API](https://docs.dask.org/en/stable/dataframe-api.html)
* [Pandas API](https://pandas.pydata.org/docs/reference/index.html)
* [AIStore Python SDK](https://github.com/NVIDIA/aistore/blob/main/docs/python_sdk.md)