
---
title: Amazon SageMaker
description: This section explains how to integrate your Amazon SageMaker installation to work with lakeFS.
parent: Integrations
redirect_from: /using/sagemaker.html
---

# Using lakeFS with Amazon SageMaker

[Amazon SageMaker](https://aws.amazon.com/sagemaker/) helps you prepare, build, train, and deploy ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

{% include toc.html %}

## Initializing session and client

Initialize a SageMaker session and an S3 client with lakeFS as the endpoint:
```python
import sagemaker
import boto3

# lakeFS endpoint and credentials (a lakeFS access key pair, not AWS credentials)
endpoint_url = '<LAKEFS_ENDPOINT>'
aws_access_key_id = '<LAKEFS_ACCESS_KEY_ID>'
aws_secret_access_key = '<LAKEFS_SECRET_ACCESS_KEY>'
repo = 'example-repo'

sm = boto3.client('sagemaker',
    endpoint_url=endpoint_url,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key)

s3_resource = boto3.resource('s3',
    endpoint_url=endpoint_url,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key)

# Use the lakeFS repository as the session's default bucket
session = sagemaker.Session(boto3.Session(), sagemaker_client=sm, default_bucket=repo)
session.s3_resource = s3_resource
```
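
Through the lakeFS S3 gateway, the repository is addressed as a bucket and object keys are prefixed with a branch name. As an optional sanity check, you can list a few objects on the `main` branch (a minimal sketch, reusing the `s3_resource` and `repo` defined above):

```python
# List up to five objects on the 'main' branch to verify connectivity.
# With the lakeFS S3 gateway, the repository acts as the bucket and
# keys are prefixed with the branch name.
bucket = s3_resource.Bucket(repo)
for obj in bucket.objects.filter(Prefix='main/').limit(5):
    print(obj.key)
```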

## Usage Examples

### Upload train and test data

Let's use the session we created to upload data to the 'main' branch:

```python
prefix = "/prefix-within-branch"
branch = 'main'

# train_data and test_data_no_target are pandas DataFrames prepared earlier
train_file = 'train_data.csv'
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=branch + prefix + "/train")

test_file = 'test_data.csv'
test_data_no_target.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=branch + prefix + "/test")
```
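
The returned values are plain `s3://` URIs in which the repository is the bucket and the branch is the first path component. A quick way to confirm the objects landed on the branch (a sketch, reusing `s3_resource` and the variables above):

```python
# Expected form: s3://example-repo/main/prefix-within-branch/train/train_data.csv
print(train_data_s3_path)
print(test_data_s3_path)

# Confirm the objects exist on the 'main' branch via the lakeFS S3 gateway
for obj in s3_resource.Bucket(repo).objects.filter(Prefix=branch + prefix):
    print(obj.key, obj.size)
```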

### Download objects

You can use the integration with lakeFS to download only the portion of the data you need:

```python
repo = 'example-repo'
prefix = "/prefix-to-download"
branch = 'main'
localpath = './' + branch

session.download_data(path=localpath, bucket=repo, key_prefix=branch + prefix)
```
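
The files are written under the local directory given by `path`. One way to see exactly what was downloaded (a minimal sketch using only the standard library):

```python
import os

# Walk the local download directory and print every file that was fetched
for root, _, files in os.walk(localpath):
    for name in files:
        print(os.path.join(root, name))
```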

**Note:**
Advanced AWS SageMaker features, like Autopilot jobs, are encapsulated and don't have the option to override the S3 endpoint.
However, it is possible to [export]({% link howto/export.md %}) the required inputs from lakeFS to S3.
<br/>If you're using SageMaker features that aren't supported by lakeFS, we'd love to [hear from you](https://lakefs.io/slack).
{: .note}
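
If you only need to hand a small, fixed set of inputs to such a feature, a simple alternative to the full export flow is to copy the objects from lakeFS to a regular S3 bucket yourself. Below is a hypothetical sketch: the target bucket name `my-sagemaker-inputs` is a placeholder, and the AWS client assumes credentials from the environment.

```python
import boto3

# One client pointed at lakeFS (S3 gateway), one pointed at AWS S3
lakefs_s3 = boto3.client('s3',
    endpoint_url=endpoint_url,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key)
aws_s3 = boto3.client('s3')  # uses regular AWS credentials

target_bucket = 'my-sagemaker-inputs'  # placeholder S3 bucket

# Copy every object under the branch prefix from lakeFS to S3
paginator = lakefs_s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=repo, Prefix='main/prefix-within-branch/'):
    for obj in page.get('Contents', []):
        body = lakefs_s3.get_object(Bucket=repo, Key=obj['Key'])['Body'].read()
        aws_s3.put_object(Bucket=target_bucket, Key=obj['Key'], Body=body)
```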