---
title: HuggingFace Datasets
description: Read, write and version your HuggingFace datasets with lakeFS
parent: Integrations

---
# Versioning HuggingFace Datasets with lakeFS


{% include toc_2-3.html %}


[HuggingFace 🤗 Datasets](https://huggingface.co/docs/datasets/index) is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

🤗 Datasets supports access to [cloud storage](https://huggingface.co/docs/datasets/en/filesystems) providers through [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) FileSystem implementations.

[lakefs-spec](https://lakefs-spec.org/) is a community implementation of an fsspec FileSystem that fully leverages lakeFS' capabilities. Let's start by installing it:

## Installation

```shell
pip install lakefs-spec
```

## Configuration

If you've already configured the lakeFS Python SDK and/or lakectl, you should have a `$HOME/.lakectl.yaml` file that contains your access credentials and the endpoint of your lakeFS environment.

Otherwise, install [`lakectl`](../reference/cli.html#installing-lakectl-locally) and run `lakectl config` to set up your access credentials.
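
If you'd rather not depend on `.lakectl.yaml`, fsspec-aware APIs like 🤗 Datasets also accept a `storage_options` mapping that is forwarded to the underlying filesystem. Here's a minimal sketch, assuming lakefs-spec's filesystem takes `host`, `username` and `password` keyword arguments (check the lakefs-spec documentation for the authoritative parameter names); the endpoint and keys below are placeholders:

```python
from datasets import load_dataset

# Placeholder endpoint and credentials -- replace with your own.
storage_options = {
    'host': 'https://lakefs.example.com',
    'username': 'AKIAIOSFODNN7EXAMPLE',
    'password': 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
}

# The mapping is handed to the lakefs-spec filesystem when the
# lakefs:// URI is resolved.
dataset = load_dataset(
    'csv',
    data_files='lakefs://example-repository/my-branch/data/example.csv',
    storage_options=storage_options,
)
```
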
## Reading a Dataset

To read a dataset, all we have to do is use a `lakefs://...` URI when calling [`load_dataset`](https://huggingface.co/docs/datasets/en/loading):

```python
>>> from datasets import load_dataset
>>>
>>> dataset = load_dataset('csv', data_files='lakefs://example-repository/my-branch/data/example.csv')
```

That's it! This automatically loads the lakefs-spec implementation we've installed, which reads its credentials from the `$HOME/.lakectl.yaml` file, so we don't need to pass any additional configuration.
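
Because we passed `data_files` as a single file, 🤗 Datasets places the data in a `train` split by default, with column names taken from the CSV header. A quick way to inspect what was loaded:

```python
>>> # Show the splits and features inferred from the CSV
>>> print(dataset)
>>> # Peek at the first record of the default 'train' split
>>> dataset['train'][0]
```
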
## Saving/Loading

Once we've loaded a Dataset, we can save it using the `save_to_disk` method as normal:

```python
>>> dataset.save_to_disk('lakefs://example-repository/my-branch/datasets/example/')
```

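Saving writes the dataset files to our branch as uncommitted changes. If you want to see exactly what the save produced before committing, here's a short sketch using the lakeFS Python SDK (assuming the high-level SDK's `Branch.uncommitted()` iterator):

```python
>>> import lakefs
>>>
>>> # List the uncommitted changes the save created on the branch
>>> for diff in lakefs.repository('example-repository').branch('my-branch').uncommitted():
...     print(diff)
```
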
At this point, we might want to commit that change to lakeFS and tag it, so we can share it with our colleagues.

We can do this through the UI or lakectl, but let's do it with the [lakeFS Python SDK](./python.md#using-the-lakefs-sdk):


```python
>>> import lakefs
>>>
>>> repo = lakefs.repository('example-repository')
>>> # Commit the uncommitted changes on my-branch, with optional metadata
>>> commit = repo.branch('my-branch').commit(
...     'saved my first huggingface Dataset!',
...     metadata={'using': '🤗'})
>>> # Tag the new commit with a stable, human-friendly name
>>> repo.tag('alice_experiment1').create(commit)
```

Now, others on our team can load our exact dataset by using the tag we created:

```python
>>> from datasets import load_from_disk
>>>
>>> dataset = load_from_disk('lakefs://example-repository/alice_experiment1/datasets/example/')
```
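
Since the tag resolves to an immutable commit, anyone who loads through it gets exactly the bytes we saved. A lakeFS URI accepts any ref (branch, tag, or commit ID), so the same pattern works for the raw files as well; for example, reading the original CSV pinned to the tagged commit:

```python
>>> from datasets import load_dataset
>>>
>>> # The same CSV we started from, pinned to the tag's commit
>>> dataset = load_dataset('csv', data_files='lakefs://example-repository/alice_experiment1/data/example.csv')
```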