---
title: Vertex AI
description: How to use Vertex Datasets and gcsfuse with lakeFS
parent: Integrations
---

# Using Vertex AI with lakeFS

Vertex AI lets Google Cloud users build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case.

lakeFS works with Vertex AI by letting users create repositories on [GCS Buckets](../howto/deploy/gcp.md), then either use the Dataset API to create managed Datasets on top of a lakeFS version, or automatically export lakeFS object versions in a form readable by [Cloud Storage Mounts](https://cloud.google.com/blog/products/ai-machine-learning/cloud-storage-file-system-ai-training).

{% include toc.html %}

## Using lakeFS with Vertex Managed Datasets

Vertex's [ImageDataset](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.ImageDataset) and [VideoDataset](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.VideoDataset) allow creating a dataset by importing a CSV file from GCS (see [`gcs_source`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.ImageDataset#google_cloud_aiplatform_ImageDataset_create)).

This CSV file contains the GCS addresses of image files and their corresponding labels.

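For single-label classification, each row is simply a GCS file path followed by its label. A small illustration of the format (the addresses and labels below are made up):

```csv
gs://underlying-gcs-bucket/data/abc123,cat
gs://underlying-gcs-bucket/data/def456,dog
```
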
Since the lakeFS API supports exporting the underlying GCS address of versioned objects, we can generate such a CSV file when creating the dataset:

```python
#!/usr/bin/env python

# Requirements:
# google-cloud-aiplatform>=1.31.0
# lakefs-client>=0.107.0

import csv
from pathlib import PosixPath
from io import StringIO

import lakefs_client
from lakefs_client.client import LakeFSClient
from google.cloud import storage
from google.cloud import aiplatform

# lakeFS connection details
configuration = lakefs_client.Configuration()
configuration.username = 'AKIAIOSFODNN7EXAMPLE'
configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
configuration.host = 'https://lakefs.example.com/'
client = LakeFSClient(configuration)

# Dataset configuration
lakefs_repo = 'my-repository'
lakefs_ref = 'main'
img_dataset = 'datasets/my-images/'

# Vertex configuration
import_bucket = 'underlying-gcs-bucket'

# Produce the import file for Vertex's SDK: each row maps an image's
# underlying GCS address to its label (the name of its parent directory)
buf = StringIO()
csv_writer = csv.writer(buf)
has_more = True
next_offset = ""
while has_more:
    files = client.objects_api.list_objects(
        lakefs_repo, lakefs_ref, prefix=img_dataset, after=next_offset)
    for r in files.get('results'):
        p = PosixPath(r.path)
        csv_writer.writerow((r.physical_address, p.parent.name))
    has_more = files.get('pagination').get('has_more')
    next_offset = files.get('pagination').get('next_offset')

# Rewind the in-memory CSV so it can be read back in full
print('generated path and labels CSV')
buf.seek(0)

# Write the CSV to Google Cloud Storage
storage_client = storage.Client()
bucket = storage_client.bucket(import_bucket)
blob = bucket.blob(f'vertex/imports/{lakefs_repo}/{lakefs_ref}/labels.csv')
with blob.open('w') as out:
    out.write(buf.read())

print(f'wrote CSV to gs://{import_bucket}/vertex/imports/{lakefs_repo}/{lakefs_ref}/labels.csv')

# Import into Vertex as a managed dataset
print('Importing dataset...')
ds = aiplatform.ImageDataset.create(
    display_name=f'{lakefs_repo}_{lakefs_ref}_imgs',
    gcs_source=f'gs://{import_bucket}/vertex/imports/{lakefs_repo}/{lakefs_ref}/labels.csv',
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
    sync=True
)
ds.wait()
print(f'Done! {ds.display_name} ({ds.resource_name})')
```

## Using lakeFS with Cloud Storage Fuse

Vertex allows using Google Cloud Storage mounted as a [FUSE filesystem](https://cloud.google.com/vertex-ai/docs/training/cloud-storage-file-system) as custom input for training jobs.

Instead of copying lakeFS files for each version we want to consume, we can create symlinks using [gcsfuse](https://github.com/GoogleCloudPlatform/gcsfuse)'s native [symlink inodes](https://github.com/GoogleCloudPlatform/gcsfuse/blob/v1.0.0/docs/semantics.md#symlink-inodes).

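Under gcsfuse's semantics, a symlink is an empty object whose `gcsfuse_symlink_target` metadata holds the link target; conceptually, the export creates one such object per versioned file. A minimal sketch of that mechanism, not of the hook itself (the bucket names and the physical-address suffix below are made up):

```python
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('my-bucket')  # hypothetical export bucket

# An empty object carrying the 'gcsfuse_symlink_target' metadata key is read
# by gcsfuse as a symlink. Here it points at the object's lakeFS physical
# address, as seen through the mount at /gcs/my-bucket/repos/my-repo/.
blob = bucket.blob('exports/my-repo/branches/main/datasets/images/001.jpg')
blob.metadata = {'gcsfuse_symlink_target': '/gcs/my-bucket/repos/my-repo/data/abc123'}
blob.upload_from_string(b'')
```
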
This process can be fully automated with the example [gcsfuse_symlink_exporter.lua](https://github.com/treeverse/lakeFS/blob/master/examples/hooks/gcsfuse_symlink_exporter.lua) Lua hook.

Here's what we need to do:

1. Upload the example `.lua` file to our lakeFS repository. For this example, we'll put it under `scripts/gcsfuse_symlink_exporter.lua`.
2. Create a new hook definition file and upload it to `_lakefs_actions/export_images.yaml`:

```yaml
---
# Example hook declaration (_lakefs_actions/export_images.yaml)
name: export_images

on:
  post-commit:
    branches: ["main"]
  post-merge:
    branches: ["main"]
  post-create-tag:

hooks:
- id: gcsfuse_export_images
  type: lua
  properties:
    script_path: scripts/gcsfuse_symlink_exporter.lua  # Path to the script we uploaded in the previous step
    args:
      prefix: "datasets/images/"  # Path we want to export every commit
      destination: "gs://my-bucket/exports/my-repo/"  # Where should we create the symlinks?
      mount:
        from: "gs://my-bucket/repos/my-repo/"  # Rewrite symlink targets from this GCS address prefix...
        to: "/gcs/my-bucket/repos/my-repo/"    # ...to this mounted path, so they point to a location that exists

      # Should be the contents of a valid credentials.json file
      # See: https://developers.google.com/workspace/guides/create-credentials
      # Will be used to write the symlink files
      gcs_credentials_json_string: |
        {
          "client_id": "...",
          "client_secret": "...",
          "refresh_token": "...",
          "type": "..."
        }
```

Done! On the next commit or merge to the `main` branch, or on tag creation, the lakeFS version of `datasets/images/` will automatically be exported to a mountable location.

To consume the symlinked files, we can read them normally from the mount:

```python
# Images are binary, so open the file in binary mode
with open('/gcs/my-bucket/exports/my-repo/branches/main/datasets/images/001.jpg', 'rb') as f:
    image_data = f.read()
```

Commits that were exported in the past remain readable as well:

```python
# Read the same file as it was at a specific, previously exported commit
commit_id = 'abcdef123deadbeef567'
with open(f'/gcs/my-bucket/exports/my-repo/commits/{commit_id}/datasets/images/001.jpg', 'rb') as f:
    image_data = f.read()
```

### Considerations when using lakeFS with Cloud Storage Fuse

For lakeFS paths to be readable by gcsfuse, the mount option `--implicit-dirs` must be specified.
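
For example, when mounting the export bucket yourself (the bucket name and mount point here are illustrative):

```sh
gcsfuse --implicit-dirs my-bucket /gcs/my-bucket
```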