---
title: HuggingFace Datasets
description: Read, write and version your HuggingFace datasets with lakeFS
parent: Integrations
---

# Versioning HuggingFace Datasets with lakeFS

{% include toc_2-3.html %}

[HuggingFace 🤗 Datasets](https://huggingface.co/docs/datasets) is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

🤗 Datasets supports access to [cloud storage](https://huggingface.co/docs/datasets/en/filesystems) providers through [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) FileSystem implementations.

[lakefs-spec](https://lakefs-spec.org/) is a community implementation of an fsspec filesystem that fully leverages lakeFS' capabilities. Let's start by installing it:

## Installation

```shell
pip install lakefs-spec
```

## Configuration

If you've already configured the lakeFS Python SDK and/or lakectl, you should have a `$HOME/.lakectl.yaml` file that contains your access credentials and the endpoint of your lakeFS environment.

Otherwise, install [`lakectl`](../reference/cli.html#installing-lakectl-locally) and run `lakectl config` to set up your access credentials.

## Reading a Dataset

To read a dataset, all we have to do is use a `lakefs://...` URI when calling [`load_dataset`](https://huggingface.co/docs/datasets/en/loading):

```python
>>> from datasets import load_dataset
>>>
>>> dataset = load_dataset('csv', data_files='lakefs://example-repository/my-branch/data/example.csv')
```

That's it! This automatically loads the lakefs-spec implementation we installed, which reads its credentials from the `$HOME/.lakectl.yaml` file, so we don't need to pass any additional configuration.
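For reference, the `.lakectl.yaml` file that `lakectl config` writes looks roughly like the sketch below; the endpoint and keys are placeholders, and your file may contain additional settings:

```yaml
# $HOME/.lakectl.yaml -- example values only, substitute your own
credentials:
  access_key_id: AKIAIOSFODNN7EXAMPLE
  secret_access_key: wJalrXUtnFIEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
server:
  endpoint_url: https://lakefs.example.com
```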
## Saving/Loading

Once we've loaded a dataset, we can save it using the `save_to_disk` method as usual:

```python
>>> dataset.save_to_disk('lakefs://example-repository/my-branch/datasets/example/')
```

At this point, we might want to commit that change to lakeFS and tag it, so we can share it with our colleagues.

We could do this through the UI or lakectl, but let's use the [lakeFS Python SDK](./python.md#using-the-lakefs-sdk):

```python
>>> import lakefs
>>>
>>> repo = lakefs.repository('example-repository')
>>> commit = repo.branch('my-branch').commit(
...     'saved my first huggingface Dataset!',
...     metadata={'using': '🤗'})
>>> repo.tag('alice_experiment1').create(commit)
```

Now, others on our team can load our exact dataset by using the tag we created:

```python
>>> from datasets import load_from_disk
>>>
>>> dataset = load_from_disk('lakefs://example-repository/alice_experiment1/datasets/example/')
```
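Whether the URI points at a branch (`my-branch`) or a tag (`alice_experiment1`), all the `lakefs://` URIs above share the same `repository/ref/path` layout. A small hypothetical helper (not part of lakefs-spec or the lakeFS SDK) shows how such a URI breaks down:

```python
from urllib.parse import urlparse

def parse_lakefs_uri(uri):
    """Split a lakefs://<repository>/<ref>/<path> URI into its three parts.

    The <ref> may be a branch name, a tag, or a commit ID -- which is why
    swapping 'my-branch' for 'alice_experiment1' in the examples above
    addresses the same data once the tag points at the commit.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "lakefs":
        raise ValueError(f"not a lakefs URI: {uri}")
    # netloc is the repository; the first path segment is the ref,
    # and everything after it is the object path inside the repository.
    ref, _, path = parsed.path.lstrip("/").partition("/")
    return parsed.netloc, ref, path

repo, ref, path = parse_lakefs_uri(
    "lakefs://example-repository/alice_experiment1/datasets/example/")
```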