{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **AIStore w/ Dask**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Dask](https://www.dask.org/) is a new and popular open-source Python library for parallel computing. Dask scales this parallelism from a single machine up to a distributed cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Pandas v. Dask**\n",
    "\n",
    "For data scientists, [Pandas](https://pandas.pydata.org/) has long been the preferred tool for analyzing and manipulating data with Python. Dask provides similar capabilities, with one notable difference: it *scales* with dataset size.\n",
    "\n",
    "[Dask DataFrames](https://docs.dask.org/en/stable/dataframe.html) are essentially one large, virtual DataFrame divided along the index into multiple Pandas DataFrames. In fact, the `dask.dataframe.DataFrame` API is a [subset](https://docs.dask.org/en/stable/dataframe.html#dask-dataframe-copies-the-pandas-dataframe-api) of the `pd.DataFrame` API and should feel familiar to anyone who already knows [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).\n",
    "\n",
    "> For more information on Dask DataFrame API use, please refer to the [Dask DataFrame API reference](https://docs.dask.org/en/stable/dataframe-api.html).\n",
    "\n",
    "Dask `DataFrames` support data access over HTTP(s), and AIStore clusters expose their data through a [REST API](https://aiatscale.org/docs/http-api). The REST API allows data on AIStore to be read, written, and otherwise operated on via HTTP(s) verbs."
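    ,
    "\n",
    "\n",
    "As a minimal sketch of what such requests look like, the snippet below builds (but does not send) a `GET` and a `PUT` request for a single object using only the Python standard library. The `/v1/objects/<bucket>/<object>` path follows the URL used in the Extract step of this notebook; the helper name `object_url` is hypothetical, and the endpoint, bucket, and object names are simply the values assumed throughout this demo:\n",
    "\n",
    "```python\n",
    "from urllib.parse import quote\n",
    "from urllib.request import Request\n",
    "\n",
    "# Assumed endpoint of the demo cluster; replace with your own.\n",
    "AIS_ENDPOINT = \"http://192.168.49.2:8080\"\n",
    "\n",
    "\n",
    "def object_url(endpoint: str, bck_name: str, obj_name: str) -> str:\n",
    "    # Hypothetical helper (not part of the AIStore SDK): build an object's REST URL\n",
    "    return f\"{endpoint}/v1/objects/{quote(bck_name)}/{quote(obj_name)}\"\n",
    "\n",
    "\n",
    "url = object_url(AIS_ENDPOINT, \"dask-demo-bucket\", \"zillow.csv\")\n",
    "\n",
    "# The HTTP verb selects the operation: GET reads the object, PUT writes it\n",
    "read_request = Request(url, method=\"GET\")\n",
    "write_request = Request(url, data=b\"Index,Beds\", method=\"PUT\")\n",
    "\n",
    "print(read_request.get_method(), read_request.full_url)\n",
    "print(write_request.get_method(), write_request.full_url)\n",
    "```\n",
    "\n",
    "Nothing is sent over the network here; passing either request to `urllib.request.urlopen` would perform the call against a running cluster."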
30 ] 31 }, 32 { 33 "cell_type": "markdown", 34 "metadata": {}, 35 "source": [ 36 "### <ins>**Getting Started**" 37 ] 38 }, 39 { 40 "cell_type": "markdown", 41 "metadata": {}, 42 "source": [ 43 "Install Dask with `pip` as follows:\n", 44 "\n", 45 "```console\n", 46 "$ python -m pip install dask\n", 47 "```\n", 48 "\n", 49 "Install the latest AIStore release either with `conda` or `pip`:\n", 50 "\n", 51 "```console\n", 52 "$ conda install aistore\n", 53 "```\n", 54 "\n", 55 "```console\n", 56 "$ pip install aistore\n", 57 "```" 58 ] 59 }, 60 { 61 "cell_type": "markdown", 62 "metadata": {}, 63 "source": [ 64 "Start by deploying an AIStore cluster. The following demonstrations will be utilizing a [Minikube (Kubernetes) deployment](https://github.com/NVIDIA/aistore/blob/master/deploy/dev/k8s/README.md) of AIStore. \n", 65 "\n", 66 "> For information on AIStore deployment options, refer [here](https://github.com/NVIDIA/aistore/blob/master/deploy/README.md).\n", 67 "\n", 68 "Once deployed, import the `aistore` package and initialize a `Client`:" 69 ] 70 }, 71 { 72 "cell_type": "code", 73 "execution_count": 4, 74 "metadata": {}, 75 "outputs": [ 76 { 77 "data": { 78 "text/plain": [ 79 "[BucketEntry(name='zillow.csv', size=842, checksum='8b118b6e12d5b4b1', atime='15 Aug 22 15:13 UTC', version='', target_url='', copies=0, flags=64)]" 80 ] 81 }, 82 "execution_count": 4, 83 "metadata": {}, 84 "output_type": "execute_result" 85 } 86 ], 87 "source": [ 88 "from aistore import Client\n", 89 "\n", 90 "AIS_ENDPOINT = \"http://192.168.49.2:8080\"\n", 91 "\n", 92 "# Initialize AIStore client\n", 93 "client = Client(AIS_ENDPOINT)\n", 94 "\n", 95 "# Load sample data to AIStore for following demonstrations (Local to AIStore)\n", 96 "client.bucket(\"dask-demo-bucket\").create()\n", 97 "client.bucket(\"dask-demo-bucket\").object(\"zillow.csv\").put(\"./data/zillow.csv\")\n", 98 "\n", 99 "# Verify that the object is in bucket\n", 100 
"client.bucket(\"dask-demo-bucket\").list_objects().get_entries()" 101 ] 102 }, 103 { 104 "cell_type": "markdown", 105 "metadata": {}, 106 "source": [ 107 "### <ins>**ETL** (**E**xtract-**T**ransform-**L**oad)" 108 ] 109 }, 110 { 111 "cell_type": "markdown", 112 "metadata": {}, 113 "source": [ 114 "**Note:** ETL processes using Dask with data on AIStore *are* possible, but has limitations as Dask does *not* currently support AIStore as a recognized storage provider. Refer to the AIS-ETL service [here](https://github.com/NVIDIA/aistore/blob/master/docs/etl.md), which offers both *offline* and *inline* custom transformations, as well as flexibility to the scope of those transformations (bucket-specific or object(s)-specific), and more." 115 ] 116 }, 117 { 118 "cell_type": "markdown", 119 "metadata": {}, 120 "source": [ 121 "#### **Extract:**" 122 ] 123 }, 124 { 125 "cell_type": "markdown", 126 "metadata": {}, 127 "source": [ 128 "Initialize the Dask [`DataFrame`](https://docs.dask.org/en/stable/dataframe.html) with a direct HTTP(s) address to data residing on AIStore:" 129 ] 130 }, 131 { 132 "cell_type": "code", 133 "execution_count": 5, 134 "metadata": {}, 135 "outputs": [], 136 "source": [ 137 "import dask.dataframe as dd\n", 138 "\n", 139 "\n", 140 "def read_csv_ais(bck_name: str, obj_name: str):\n", 141 " return dd.read_csv(f\"{AIS_ENDPOINT}/v1/objects/{bck_name}/{obj_name}\")\n", 142 "\n", 143 "\n", 144 "# Initialize DataFrame\n", 145 "df = read_csv_ais(bck_name=\"dask-demo-bucket\", obj_name=\"zillow.csv\")" 146 ] 147 }, 148 { 149 "cell_type": "markdown", 150 "metadata": {}, 151 "source": [ 152 "Dask `DataFrames` are split into smaller partitions and partitions are loaded into memory *as needed*. 
The method `head()` allows us to see just the first few lines of the data:" 153 ] 154 }, 155 { 156 "cell_type": "code", 157 "execution_count": 6, 158 "metadata": {}, 159 "outputs": [ 160 { 161 "data": { 162 "text/html": [ 163 "<div>\n", 164 "<style scoped>\n", 165 " .dataframe tbody tr th:only-of-type {\n", 166 " vertical-align: middle;\n", 167 " }\n", 168 "\n", 169 " .dataframe tbody tr th {\n", 170 " vertical-align: top;\n", 171 " }\n", 172 "\n", 173 " .dataframe thead th {\n", 174 " text-align: right;\n", 175 " }\n", 176 "</style>\n", 177 "<table border=\"1\" class=\"dataframe\">\n", 178 " <thead>\n", 179 " <tr style=\"text-align: right;\">\n", 180 " <th></th>\n", 181 " <th>Index</th>\n", 182 " <th>\"Living Space (sq ft)\"</th>\n", 183 " <th>\"Beds\"</th>\n", 184 " <th>\"Baths\"</th>\n", 185 " <th>\"Zip\"</th>\n", 186 " <th>\"Year\"</th>\n", 187 " <th>\"List Price ($)\"</th>\n", 188 " </tr>\n", 189 " </thead>\n", 190 " <tbody>\n", 191 " <tr>\n", 192 " <th>0</th>\n", 193 " <td>1</td>\n", 194 " <td>2222</td>\n", 195 " <td>3</td>\n", 196 " <td>3.5</td>\n", 197 " <td>32312</td>\n", 198 " <td>1981</td>\n", 199 " <td>250000</td>\n", 200 " </tr>\n", 201 " <tr>\n", 202 " <th>1</th>\n", 203 " <td>2</td>\n", 204 " <td>1628</td>\n", 205 " <td>3</td>\n", 206 " <td>2.0</td>\n", 207 " <td>32308</td>\n", 208 " <td>2009</td>\n", 209 " <td>185000</td>\n", 210 " </tr>\n", 211 " <tr>\n", 212 " <th>2</th>\n", 213 " <td>3</td>\n", 214 " <td>3824</td>\n", 215 " <td>5</td>\n", 216 " <td>4.0</td>\n", 217 " <td>32312</td>\n", 218 " <td>1954</td>\n", 219 " <td>399000</td>\n", 220 " </tr>\n", 221 " <tr>\n", 222 " <th>3</th>\n", 223 " <td>4</td>\n", 224 " <td>1137</td>\n", 225 " <td>3</td>\n", 226 " <td>2.0</td>\n", 227 " <td>32309</td>\n", 228 " <td>1993</td>\n", 229 " <td>150000</td>\n", 230 " </tr>\n", 231 " <tr>\n", 232 " <th>4</th>\n", 233 " <td>5</td>\n", 234 " <td>3560</td>\n", 235 " <td>6</td>\n", 236 " <td>4.0</td>\n", 237 " <td>32309</td>\n", 238 " <td>1973</td>\n", 239 " 
<td>315000</td>\n", 240 " </tr>\n", 241 " </tbody>\n", 242 "</table>\n", 243 "</div>" 244 ], 245 "text/plain": [ 246 " Index \"Living Space (sq ft)\" \"Beds\" \"Baths\" \"Zip\" \"Year\" \\\n", 247 "0 1 2222 3 3.5 32312 1981 \n", 248 "1 2 1628 3 2.0 32308 2009 \n", 249 "2 3 3824 5 4.0 32312 1954 \n", 250 "3 4 1137 3 2.0 32309 1993 \n", 251 "4 5 3560 6 4.0 32309 1973 \n", 252 "\n", 253 " \"List Price ($)\" \n", 254 "0 250000 \n", 255 "1 185000 \n", 256 "2 399000 \n", 257 "3 150000 \n", 258 "4 315000 " 259 ] 260 }, 261 "execution_count": 6, 262 "metadata": {}, 263 "output_type": "execute_result" 264 } 265 ], 266 "source": [ 267 "# View first partition of data\n", 268 "df.head()" 269 ] 270 }, 271 { 272 "cell_type": "code", 273 "execution_count": 7, 274 "metadata": {}, 275 "outputs": [ 276 { 277 "data": { 278 "text/html": [ 279 "<div>\n", 280 "<style scoped>\n", 281 " .dataframe tbody tr th:only-of-type {\n", 282 " vertical-align: middle;\n", 283 " }\n", 284 "\n", 285 " .dataframe tbody tr th {\n", 286 " vertical-align: top;\n", 287 " }\n", 288 "\n", 289 " .dataframe thead th {\n", 290 " text-align: right;\n", 291 " }\n", 292 "</style>\n", 293 "<table border=\"1\" class=\"dataframe\">\n", 294 " <thead>\n", 295 " <tr style=\"text-align: right;\">\n", 296 " <th></th>\n", 297 " <th>Index</th>\n", 298 " <th>\"Living Space (sq ft)\"</th>\n", 299 " <th>\"Beds\"</th>\n", 300 " <th>\"Baths\"</th>\n", 301 " <th>\"Zip\"</th>\n", 302 " <th>\"Year\"</th>\n", 303 " <th>\"List Price ($)\"</th>\n", 304 " </tr>\n", 305 " </thead>\n", 306 " <tbody>\n", 307 " <tr>\n", 308 " <th>0</th>\n", 309 " <td>1</td>\n", 310 " <td>2222</td>\n", 311 " <td>3</td>\n", 312 " <td>3.5</td>\n", 313 " <td>32312</td>\n", 314 " <td>1981</td>\n", 315 " <td>250000</td>\n", 316 " </tr>\n", 317 " <tr>\n", 318 " <th>1</th>\n", 319 " <td>2</td>\n", 320 " <td>1628</td>\n", 321 " <td>3</td>\n", 322 " <td>2.0</td>\n", 323 " <td>32308</td>\n", 324 " <td>2009</td>\n", 325 " <td>185000</td>\n", 326 " </tr>\n", 327 " 
</tbody>\n", 328 "</table>\n", 329 "</div>" 330 ], 331 "text/plain": [ 332 " Index \"Living Space (sq ft)\" \"Beds\" \"Baths\" \"Zip\" \"Year\" \\\n", 333 "0 1 2222 3 3.5 32312 1981 \n", 334 "1 2 1628 3 2.0 32308 2009 \n", 335 "\n", 336 " \"List Price ($)\" \n", 337 "0 250000 \n", 338 "1 185000 " 339 ] 340 }, 341 "execution_count": 7, 342 "metadata": {}, 343 "output_type": "execute_result" 344 } 345 ], 346 "source": [ 347 "# View first two lines of data\n", 348 "df.head(2)" 349 ] 350 }, 351 { 352 "cell_type": "markdown", 353 "metadata": {}, 354 "source": [ 355 "#### **Transform:**" 356 ] 357 }, 358 { 359 "cell_type": "markdown", 360 "metadata": {}, 361 "source": [ 362 "As mentioned before, Dask `DataFrames` are built on and in parallel with the Pandas API. The `dask.dataframe` API is very similar to the [Pandas](https://docs.dask.org/en/stable/dataframe.html#dask-dataframe-copies-the-pandas-dataframe-api) API and should be very familiar to those already familiar with Pandas:" 363 ] 364 }, 365 { 366 "cell_type": "code", 367 "execution_count": 8, 368 "metadata": {}, 369 "outputs": [ 370 { 371 "name": "stdout", 372 "output_type": "stream", 373 "text": [ 374 "Index(['Index', ' \"Living Space (sq ft)\"', ' \"Beds\"', ' \"Baths\"', ' \"Zip\"',\n", 375 " ' \"Year\"', ' \"List Price ($)\"'],\n", 376 " dtype='object')\n" 377 ] 378 }, 379 { 380 "data": { 381 "text/html": [ 382 "<div>\n", 383 "<style scoped>\n", 384 " .dataframe tbody tr th:only-of-type {\n", 385 " vertical-align: middle;\n", 386 " }\n", 387 "\n", 388 " .dataframe tbody tr th {\n", 389 " vertical-align: top;\n", 390 " }\n", 391 "\n", 392 " .dataframe thead th {\n", 393 " text-align: right;\n", 394 " }\n", 395 "</style>\n", 396 "<table border=\"1\" class=\"dataframe\">\n", 397 " <thead>\n", 398 " <tr style=\"text-align: right;\">\n", 399 " <th></th>\n", 400 " <th>index</th>\n", 401 " <th>Index</th>\n", 402 " <th>\"Living Space (sq ft)\"</th>\n", 403 " <th>\"Beds\"</th>\n", 404 " <th>\"Baths\"</th>\n", 405 " 
<th>\"Zip\"</th>\n", 406 " <th>\"Year\"</th>\n", 407 " <th>\"List Price ($)\"</th>\n", 408 " </tr>\n", 409 " </thead>\n", 410 " <tbody>\n", 411 " <tr>\n", 412 " <th>0</th>\n", 413 " <td>1</td>\n", 414 " <td>2</td>\n", 415 " <td>1628</td>\n", 416 " <td>3</td>\n", 417 " <td>2.0</td>\n", 418 " <td>32308</td>\n", 419 " <td>2009</td>\n", 420 " <td>185000</td>\n", 421 " </tr>\n", 422 " <tr>\n", 423 " <th>1</th>\n", 424 " <td>9</td>\n", 425 " <td>10</td>\n", 426 " <td>1997</td>\n", 427 " <td>3</td>\n", 428 " <td>3.0</td>\n", 429 " <td>32311</td>\n", 430 " <td>2006</td>\n", 431 " <td>295000</td>\n", 432 " </tr>\n", 433 " <tr>\n", 434 " <th>2</th>\n", 435 " <td>10</td>\n", 436 " <td>11</td>\n", 437 " <td>2097</td>\n", 438 " <td>4</td>\n", 439 " <td>3.0</td>\n", 440 " <td>32311</td>\n", 441 " <td>2016</td>\n", 442 " <td>290000</td>\n", 443 " </tr>\n", 444 " <tr>\n", 445 " <th>3</th>\n", 446 " <td>14</td>\n", 447 " <td>15</td>\n", 448 " <td>1381</td>\n", 449 " <td>3</td>\n", 450 " <td>2.0</td>\n", 451 " <td>32301</td>\n", 452 " <td>2006</td>\n", 453 " <td>143000</td>\n", 454 " </tr>\n", 455 " </tbody>\n", 456 "</table>\n", 457 "</div>" 458 ], 459 "text/plain": [ 460 " index Index \"Living Space (sq ft)\" \"Beds\" \"Baths\" \"Zip\" \"Year\" \\\n", 461 "0 1 2 1628 3 2.0 32308 2009 \n", 462 "1 9 10 1997 3 3.0 32311 2006 \n", 463 "2 10 11 2097 4 3.0 32311 2016 \n", 464 "3 14 15 1381 3 2.0 32301 2006 \n", 465 "\n", 466 " \"List Price ($)\" \n", 467 "0 185000 \n", 468 "1 295000 \n", 469 "2 290000 \n", 470 "3 143000 " 471 ] 472 }, 473 "execution_count": 8, 474 "metadata": {}, 475 "output_type": "execute_result" 476 } 477 ], 478 "source": [ 479 "# Print columns\n", 480 "print(df.columns)\n", 481 "\n", 482 "# Filter datapoints for houses built AFTER the year 2000\n", 483 "df_new = df[df[' \"Year\"'] > 2000]\n", 484 "# Further filter for houses with a price less than $300,000\n", 485 "df_new_and_budget = df_new[df_new[' \"List Price ($)\"'] < 300000]\n", 486 "# Reindex\n", 487 
"df_final = df_new_and_budget.reset_index()\n", 488 "\n", 489 "# Verify that data has been correctly filtered\n", 490 "df_final.head()" 491 ] 492 }, 493 { 494 "cell_type": "markdown", 495 "metadata": {}, 496 "source": [ 497 "Dask also provides other tools that efficiently process a variety of different types of files, such as [images](https://examples.dask.org/applications/image-processing.html). The following snippet demonstrates image pre-processing capabilities with Dask using the [`dask-image`](https://github.com/dask/dask-image) module, similar to image transformations demonstrated with AIS-ETLs [here](https://aiatscale.org/blog/2021/10/22/ais-etl-2):" 498 ] 499 }, 500 { 501 "cell_type": "code", 502 "execution_count": 29, 503 "metadata": {}, 504 "outputs": [ 505 { 506 "data": { 507 "text/plain": [ 508 "<matplotlib.image.AxesImage at 0x7f0e38725ab0>" 509 ] 510 }, 511 "execution_count": 29, 512 "metadata": {}, 513 "output_type": "execute_result" 514 }, 515 { 516 "data": { 517 "image/png": "", 518 "text/plain": [ 519 "<Figure size 432x288 with 1 Axes>" 520 ] 521 }, 522 "metadata": { 523 "needs_background": "light" 524 }, 525 "output_type": "display_data" 526 } 527 ], 528 "source": [ 529 "from nturl2path import url2pathname\n", 530 "import imageio.v3\n", 531 "from io import BytesIO\n", 532 "import matplotlib.pyplot as plt\n", 533 "import os\n", 534 "import requests\n", 535 "\n", 536 "try:\n", 537 " from skimage.io import imread as sk_imread\n", 538 "except (AttributeError, ImportError):\n", 539 " pass\n", 540 "\n", 541 "from dask.array.core import Array\n", 542 "from dask.base import tokenize\n", 543 "\n", 544 "\n", 545 "def add_leading_dimension(x):\n", 546 " return x[None, ...]\n", 547 "\n", 548 "\n", 549 "def url_image_reader(url):\n", 550 " response = requests.get(url)\n", 551 " byte_content = BytesIO(response.content)\n", 552 " image = imageio.v3.imread(byte_content)\n", 553 " return image\n", 554 "\n", 555 "\n", 556 "# Modified version of 
dask.array.image.imread (doesn't use glob)\n", 557 "def custom_imread(filenames, imread=None, preprocess=None):\n", 558 " \"\"\"Read a stack of images into a dask array\n", 559 " Parameters\n", 560 " ----------\n", 561 " filenames: list of strings\n", 562 " A list of filename strings, eg: ['myfile._01.png', 'myfile_02.png']\n", 563 " imread: function (optional)\n", 564 " Optionally provide custom imread function.\n", 565 " Function should expect a filename and produce a numpy array.\n", 566 " Defaults to ``skimage.io.imread``.\n", 567 " preprocess: function (optional)\n", 568 " Optionally provide custom function to preprocess the image.\n", 569 " Function should expect a numpy array for a single image.\n", 570 " Examples\n", 571 " --------\n", 572 " >>> from dask.array.image import imread\n", 573 " >>> im = imread('2015-*-*.png') # doctest: +SKIP\n", 574 " >>> im.shape # doctest: +SKIP\n", 575 " (365, 1000, 1000, 3)\n", 576 " Returns\n", 577 " -------\n", 578 " Dask array of all images stacked along the first dimension.\n", 579 " Each separate image file will be treated as an individual chunk.\n", 580 " \"\"\"\n", 581 " imread = imread or sk_imread\n", 582 "\n", 583 " name = \"imread-%s\" % tokenize(filenames, map(os.path.getmtime, filenames))\n", 584 "\n", 585 " sample = imread(filenames[0])\n", 586 " if preprocess:\n", 587 " sample = preprocess(sample)\n", 588 "\n", 589 " keys = [(name, i) + (0,) * len(sample.shape) for i in range(len(filenames))]\n", 590 " if preprocess:\n", 591 " values = [\n", 592 " (add_leading_dimension, (preprocess, (imread, fn))) for fn in filenames\n", 593 " ]\n", 594 " else:\n", 595 " values = [(add_leading_dimension, (imread, fn)) for fn in filenames]\n", 596 " dsk = dict(zip(keys, values))\n", 597 "\n", 598 " chunks = ((1,) * len(filenames),) + tuple((d,) for d in sample.shape)\n", 599 "\n", 600 " return Array(dsk, name, chunks, sample.dtype)\n", 601 "\n", 602 "\n", 603 "filenames = 
[f\"{AIS_ENDPOINT}/v1/objects/dask-demo-bucket/sample-image.jpg\"]\n",
    "result = custom_imread(filenames, imread=url_image_reader)\n",
    "\n",
    "\n",
    "# Pre-Processing function for image\n",
    "def custom_filter(rgb):\n",
    "    result = (rgb[..., 0] * 0.2125) + (rgb[..., 1] * 0.7154) + (rgb[..., 2] * 0.0721)\n",
    "    return result\n",
    "\n",
    "\n",
    "flower = result[0]\n",
    "modified_flower = custom_filter(flower)\n",
    "\n",
    "plt.imshow(modified_flower)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Load:**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dask *does* support many of the popular cloud storage providers (e.g., S3, GCS, Azure), and supports writing DataFrames directly as files to those services:\n",
    "\n",
    "```python\n",
    "# Write DataFrame as CSV to S3\n",
    "df.to_csv(\"s3://dask-demo-bucket/sample.csv\")\n",
    "\n",
    "# Write DataFrame as JSON to GCS\n",
    "df.to_json(\"gcs://dask-demo-bucket/sample.json\")\n",
    "```\n",
    "\n",
    "> AIStore supports a subset of the S3 API, and Dask supports [S3-compatible storage services](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html#using-other-s3-compatible-services). However, Dask uses Boto3 for this, and Boto3 does not work with AIStore out of the box because it does not follow the HTTP(s) redirects that AIStore relies on. 
For more information on AIStore compatibility with S3, refer [here](https://github.com/NVIDIA/aistore/blob/master/docs/s3compat.md).\n",
    "\n",
    "While we cannot write directly to AIStore with the Dask API (`dd.to_csv(\"ais://bucket/file.csv\")` is not supported as of now), we can convert the DataFrame to bytes and move it using AIStore's own API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert to Pandas DataFrame (loads the filtered data into memory)\n",
    "pandas_df = df_new_and_budget.compute()\n",
    "# Serialize to CSV text, then encode it as bytes for upload\n",
    "files = pandas_df.to_csv().encode(\"utf-8\")\n",
    "\n",
    "client.bucket(\"dask-demo-bucket\").list_objects().get_entries()\n",
    "\n",
    "client.bucket(\"dask-demo-bucket\").object(\"formatted-zillow.csv\").put(content=files)\n",
    "\n",
    "print(type(files))\n",
    "\n",
    "# Verify that the transformed file is now in the AIStore bucket\n",
    "client.bucket(\"dask-demo-bucket\").list_objects().get_entries()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, converting a Dask DataFrame to a Pandas DataFrame as above may not be ideal, as it gives up some of the performance advantages of using Dask. The conversion only makes sense if the data fits fully into memory (e.g., the data has been filtered and is now much smaller).\n",
    "\n",
    "For much larger datasets, [AIS-ETL](https://github.com/NVIDIA/aistore/blob/master/docs/etl.md) may offer better performance while offering similar ETL capabilities to those demonstrated above. Please refer [here](https://github.com/NVIDIA/aistore/blob/master/docs/etl.md) for more information."
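    ,
    "\n",
    "\n",
    "One way to stay within memory limits without leaving Dask is to serialize and upload one partition at a time. The snippet below is only a sketch under stated assumptions: `upload_by_partition` and the object-naming scheme are hypothetical, and a plain dictionary stands in for the AIStore bucket so that the example runs without a cluster:\n",
    "\n",
    "```python\n",
    "import dask.dataframe as dd\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def upload_by_partition(ddf, upload, prefix):\n",
    "    # Hypothetical helper: encode and upload each partition separately\n",
    "    names = []\n",
    "    for i, part in enumerate(ddf.to_delayed()):\n",
    "        pandas_part = part.compute()  # only this partition is materialized\n",
    "        name = f\"{prefix}-{i:04d}.csv\"\n",
    "        upload(name, pandas_part.to_csv(index=False).encode(\"utf-8\"))\n",
    "        names.append(name)\n",
    "    return names\n",
    "\n",
    "\n",
    "# Stand-in for the cluster: maps object names to uploaded bytes\n",
    "store = {}\n",
    "sample = dd.from_pandas(pd.DataFrame({\"Year\": [2001, 2009, 2016, 1999]}), npartitions=2)\n",
    "print(upload_by_partition(sample, store.__setitem__, \"formatted-zillow\"))\n",
    "```\n",
    "\n",
    "Against a real cluster, the `upload` argument could wrap the same `put(content=...)` call used in the cell above; each partition then becomes its own object rather than a single combined file."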
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <ins>**Data Analysis** "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Dask is not the most efficient tool for every scenario. Refer [here](https://docs.dask.org/en/latest/best-practices.html#start-small) for information on the best use-cases for Dask.\n",
    "\n",
    "While Dask's use-cases with AIStore are quite limited as of now, it is still useful for performing data analysis with optimized memory usage. With Dask DataFrames, data can be analyzed **without** loading the entirety of the data into memory. More specifically, because Dask DataFrames (and most other Dask collections) are [lazy](https://saturncloud.io/blog/a-data-scientist-s-guide-to-lazy-evaluation-with-dask/), computations are executed only when `compute()` is called. As a result, only the parts of the data needed for a computation are loaded into memory, which optimizes memory usage:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Computations to be made (not computed until .compute() is called)\n",
    "mean_price = df[' \"List Price ($)\"'].mean()\n",
    "bed_sum = df[' \"Beds\"'].sum()\n",
    "mean_size = df[' \"Living Space (sq ft)\"'].mean()\n",
    "\n",
    "# Computes above in parallel\n",
    "dd.compute({\"price_mean\": mean_price, \"bed_sum\": bed_sum, \"mean_size\": mean_size})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More examples of data analysis with Dask DataFrames can be found [here](https://examples.dask.org/dataframe.html) in the Dask documentation."
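    ,
    "\n",
    "\n",
    "The lazy evaluation described in this section can also be tried without an AIStore cluster. The following sketch assumes only that `dask` and `pandas` are installed; it rebuilds the five sample rows shown by `df.head()` earlier in this notebook as an in-memory DataFrame, and the shortened column names are illustrative:\n",
    "\n",
    "```python\n",
    "import dask\n",
    "import dask.dataframe as dd\n",
    "import pandas as pd\n",
    "\n",
    "pdf = pd.DataFrame(\n",
    "    {\n",
    "        \"Beds\": [3, 3, 5, 3, 6],\n",
    "        \"Price\": [250000, 185000, 399000, 150000, 315000],\n",
    "    }\n",
    ")\n",
    "ddf = dd.from_pandas(pdf, npartitions=2)\n",
    "\n",
    "# Building the expressions does no work yet; each one is a lazy Dask object\n",
    "mean_price = ddf[\"Price\"].mean()\n",
    "bed_sum = ddf[\"Beds\"].sum()\n",
    "print(type(mean_price))\n",
    "\n",
    "# A single compute() call evaluates both results together\n",
    "price, beds = dask.compute(mean_price, bed_sum)\n",
    "print(price, beds)\n",
    "```\n",
    "\n",
    "Passing several lazy results to one `dask.compute` call lets Dask share any common work between them, instead of running each computation separately."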
710 ] 711 }, 712 { 713 "cell_type": "markdown", 714 "metadata": {}, 715 "source": [ 716 "### <ins>**References**" 717 ] 718 }, 719 { 720 "attachments": {}, 721 "cell_type": "markdown", 722 "metadata": {}, 723 "source": [ 724 "* [Dask API](https://docs.dask.org/en/stable/dataframe-api.html)\n", 725 "* [Pandas API](https://pandas.pydata.org/docs/reference/index.html)\n", 726 "* [AIStore Python SDK](https://github.com/NVIDIA/aistore/blob/master/docs/python_sdk.md)\n", 727 "* [AIS-ETL](https://github.com/NVIDIA/aistore/blob/master/docs/etl.md)" 728 ] 729 } 730 ], 731 "metadata": { 732 "kernelspec": { 733 "display_name": "Python 3.10.4 64-bit", 734 "language": "python", 735 "name": "python3" 736 }, 737 "language_info": { 738 "codemirror_mode": { 739 "name": "ipython", 740 "version": 3 741 }, 742 "file_extension": ".py", 743 "mimetype": "text/x-python", 744 "name": "python", 745 "nbconvert_exporter": "python", 746 "pygments_lexer": "ipython3", 747 "version": "3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]" 748 }, 749 "orig_nbformat": 4, 750 "vscode": { 751 "interpreter": { 752 "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" 753 } 754 } 755 }, 756 "nbformat": 4, 757 "nbformat_minor": 2 758 }