---
title: Vertex AI
description: How to use Vertex Datasets and gcsfuse with lakeFS
parent: Integrations
---

# Using Vertex AI with lakeFS

Vertex AI lets Google Cloud users build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case.

lakeFS works with Vertex AI by allowing users to create repositories on [GCS Buckets](../howto/deploy/gcp.md), then either use the Dataset API to create managed Datasets on top of a lakeFS version, or automatically export lakeFS object versions in a way readable by [Cloud Storage Mounts](https://cloud.google.com/blog/products/ai-machine-learning/cloud-storage-file-system-ai-training).

{% include toc.html %}

## Using lakeFS with Vertex Managed Datasets

Vertex's [ImageDataset](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.ImageDataset) and [VideoDataset](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.VideoDataset) allow creating a dataset by importing a CSV file from GCS (see [`gcs_source`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.ImageDataset#google_cloud_aiplatform_ImageDataset_create)).

This CSV file contains GCS addresses of image files and their corresponding labels.

Since the lakeFS API supports exporting the underlying GCS address of versioned objects, we can generate such a CSV file when creating the dataset:

```python
#!/usr/bin/env python

# Requirements:
#   google-cloud-aiplatform>=1.31.0
#   lakefs-client>=0.107.0

import csv
from pathlib import PosixPath
from io import StringIO

import lakefs_client
from lakefs_client.client import LakeFSClient
from google.cloud import storage
from google.cloud import aiplatform

# lakeFS connection details
configuration = lakefs_client.Configuration()
configuration.username = 'AKIAIOSFODNN7EXAMPLE'
configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
configuration.host = 'https://lakefs.example.com/'
client = LakeFSClient(configuration)

# Dataset configuration
lakefs_repo = 'my-repository'
lakefs_ref = 'main'
img_dataset = 'datasets/my-images/'

# Vertex configuration
import_bucket = 'underlying-gcs-bucket'

# Produce an import file for Vertex's SDK: each row holds the physical GCS
# address of an object and its label (the name of its parent directory).
buf = StringIO()
csv_writer = csv.writer(buf)
has_more = True
next_offset = ""
while has_more:
    files = client.objects_api.list_objects(
        lakefs_repo, lakefs_ref, prefix=img_dataset, after=next_offset)
    for r in files.get('results'):
        p = PosixPath(r.path)
        csv_writer.writerow((r.physical_address, p.parent.name))
    has_more = files.get('pagination').get('has_more')
    next_offset = files.get('pagination').get('next_offset')

# Rewind the CSV buffer before uploading it
print('generated path and labels CSV')
buf.seek(0)

# Write it to storage
storage_client = storage.Client()
bucket = storage_client.bucket(import_bucket)
blob = bucket.blob(f'vertex/imports/{lakefs_repo}/{lakefs_ref}/labels.csv')
with blob.open('w') as out:
    out.write(buf.read())

print(f'wrote CSV to gs://{import_bucket}/vertex/imports/{lakefs_repo}/{lakefs_ref}/labels.csv')

# Import into Vertex as a managed dataset
print('Importing dataset...')
ds = aiplatform.ImageDataset.create(
    display_name=f'{lakefs_repo}_{lakefs_ref}_imgs',
    gcs_source=f'gs://{import_bucket}/vertex/imports/{lakefs_repo}/{lakefs_ref}/labels.csv',
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
    sync=True
)
ds.wait()
print(f'Done! {ds.display_name} ({ds.resource_name})')
```
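Because the CSV records the physical GCS addresses of the objects listed at `lakefs_ref`, the resulting dataset keeps pointing at that exact version even as the branch moves on. The dataset itself is a regular Vertex managed dataset, so it can be used in any Vertex workflow. As an illustration only (a minimal sketch with hypothetical display names, not part of the example above), it could be passed straight to an AutoML image classification job:

```python
# Minimal sketch (hypothetical names): train an AutoML image classification
# model on the dataset created above. Assumes aiplatform is configured for
# the appropriate project and region.
job = aiplatform.AutoMLImageTrainingJob(
    display_name=f'{lakefs_repo}_{lakefs_ref}_training',
    prediction_type='classification',
)
model = job.run(
    dataset=ds,
    model_display_name=f'{lakefs_repo}_{lakefs_ref}_model',
    budget_milli_node_hours=8000,
)
```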
## Using lakeFS with Cloud Storage Fuse

Vertex allows using Google Cloud Storage mounted as a [Fuse Filesystem](https://cloud.google.com/vertex-ai/docs/training/cloud-storage-file-system) as custom input for training jobs.

Instead of having to copy lakeFS files for each version we want to consume, we can create symlinks by using [gcsfuse](https://github.com/GoogleCloudPlatform/gcsfuse)'s native [symlink inodes](https://github.com/GoogleCloudPlatform/gcsfuse/blob/v1.0.0/docs/semantics.md#symlink-inodes).

This process can be fully automated by using the example [gcsfuse_symlink_exporter.lua](https://github.com/treeverse/lakeFS/blob/master/examples/hooks/gcsfuse_symlink_exporter.lua) Lua hook.

Here's what we need to do:

1. Upload the example `.lua` file into our lakeFS repository. For this example, we'll put it under `scripts/gcsfuse_symlink_exporter.lua`.
2. Create a new hook definition file and upload it to `_lakefs_actions/export_images.yaml`:

```yaml
---
# Example hook declaration: (_lakefs_actions/export_images.yaml):
name: export_images

on:
  post-commit:
    branches: ["main"]
  post-merge:
    branches: ["main"]
  post-create-tag:

hooks:
  - id: gcsfuse_export_images
    type: lua
    properties:
      script_path: scripts/gcsfuse_symlink_exporter.lua  # Path to the script we uploaded in the previous step
      args:
        prefix: "datasets/images/"                     # Path we want to export on every commit
        destination: "gs://my-bucket/exports/my-repo/" # Where should we create the symlinks?
        mount:
          from: "gs://my-bucket/repos/my-repo/"  # Rewrite symlink targets from this GCS prefix...
          to: "/gcs/my-bucket/repos/my-repo/"    # ...to this local mount path, so they resolve on a gcsfuse mount.

        # Should be the contents of a valid credentials.json file
        # See: https://developers.google.com/workspace/guides/create-credentials
        # Will be used to write the symlink files
        gcs_credentials_json_string: |
          {
            "client_id": "...",
            "client_secret": "...",
            "refresh_token": "...",
            "type": "..."
          }
```

Done! On the next commit or merge to the `main` branch, or on tag creation, we'll automatically export the lakeFS version of `datasets/images/` to a mountable location.

To consume the symlinked files, we can read them normally from the mount:

```python
with open('/gcs/my-bucket/exports/my-repo/branches/main/datasets/images/001.jpg', 'rb') as f:
    image_data = f.read()
```

Commits that were exported in the past remain readable as well:

```python
commit_id = 'abcdef123deadbeef567'
with open(f'/gcs/my-bucket/exports/my-repo/commits/{commit_id}/datasets/images/001.jpg', 'rb') as f:
    image_data = f.read()
```

### Considerations when using lakeFS with Cloud Storage Fuse

For lakeFS paths to be readable by gcsfuse, the mount option `--implicit-dirs` must be specified.
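If you manage the mount yourself (for example on a notebook or custom VM, rather than relying on a mount that Vertex manages for you), a minimal sketch of such a mount might look like this; the bucket name and mount point below are illustrative:

```sh
# Minimal sketch (illustrative names): mount the export bucket with implicit
# directory support so exported lakeFS paths like exports/my-repo/branches/main/
# are visible under the mount point.
mkdir -p /gcs/my-bucket
gcsfuse --implicit-dirs my-bucket /gcs/my-bucket
```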