---
title: Python
description: Use Python to interact with your objects on lakeFS
parent: Integrations
redirect_from:
  - /using/python.html
  - /using/boto.html
  - /integrations/boto.html
---

# Use Python to interact with your objects on lakeFS

{% include toc_2-3.html %}


**High Level Python SDK**  <span class="badge mr-1">New</span>
We've just released a new High Level Python SDK library, and we're super excited to tell you about it! Continue reading to get the
full story!
Though our previous SDK client is still supported and maintained, we highly recommend using the new High Level SDK.
**For previous Python SDKs follow these links:**
[lakefs-sdk](https://pydocs-sdk.lakefs.io)
[legacy-sdk](https://pydocs.lakefs.io) (Deprecated)
{: .note }

There are three primary ways to work with lakeFS from Python:

* [Use Boto](#using-boto) to perform **object operations** through the **lakeFS S3 gateway**.
* [Use the High Level lakeFS SDK](#using-the-lakefs-sdk) to perform **object operations**, **versioning** and other **lakeFS-specific operations**.
* [Use lakefs-spec](#using-lakefs-spec-for-higher-level-file-operations) to
  perform high-level file operations through a file-system-like API.

## Using the lakeFS SDK

### Installing

Install the Python client using pip:

```shell
pip install lakefs
```

### Initializing

By default, the High Level SDK collects authentication parameters from the environment and attempts to create a default client.
When working in an environment where **lakectl** is configured, it is therefore not necessary to instantiate a lakeFS client or pass one when creating lakeFS objects.
If no authentication parameters exist in the environment, you can also create a lakeFS client explicitly.

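For example, when `~/.lakectl.yaml` is configured (or the `LAKECTL_SERVER_ENDPOINT_URL`, `LAKECTL_CREDENTIALS_ACCESS_KEY_ID` and `LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY` environment variables are set), a minimal sketch with a placeholder repository name looks like this:

```python
import lakefs

# No explicit client: the SDK discovers the endpoint and credentials
# from the lakectl configuration or from environment variables.
repo = lakefs.repository("example-repo")
for branch in repo.branches():
    print(branch)
```
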
Here's how to instantiate a client:

```python
from lakefs.client import Client

clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)
```

You can use TLS with a CA that is not trusted on the host by configuring the
client with a CA cert bundle file. It should contain concatenated CA
certificates in PEM format:

```python
clt = Client(
    host="https://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    # Customize the CA certificates used to verify the peer.
    ssl_ca_cert="path/to/concatenated_CA_certificates.PEM",
)
```

For testing SSL endpoints you may wish to use a self-signed certificate. If you do this and receive an `SSL: CERTIFICATE_VERIFY_FAILED` error message, you can add the following configuration to your client:

```python
clt = Client(
    host="https://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    verify_ssl=False,
)
```

{: .warning }
This setting allows well-known "man-in-the-middle",
impersonation, and credential stealing attacks. Never use this in any
production setting.

Optionally, to enable communication via proxies, add a proxy configuration:

```python
clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    ssl_ca_cert="(if needed)",
    proxy="<proxy server URL>",
)
```

### Usage Examples

Let's see how we can interact with lakeFS using the High Level SDK.

#### Creating a repository

```python
import lakefs

repo = lakefs.repository("example-repo").create(storage_namespace="s3://storage-bucket/repos/example-repo")
print(repo)
```

If using an explicit client, create the Repository object and pass the client to it (note the changed syntax):

```python
import lakefs
from lakefs.client import Client

clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

repo = lakefs.Repository("example-repo", client=clt).create(storage_namespace="s3://storage-bucket/repos/example-repo")
print(repo)
```

#### Output
```
{id: 'example-repo', creation_date: 1697815536, default_branch: 'main', storage_namespace: 's3://storage-bucket/repos/example-repo'}
```

#### List repositories

```python
import lakefs

print("Listing repositories:")
for repo in lakefs.repositories():
    print(repo)
```

#### Output
```
Listing repositories:
{id: 'example-repo', creation_date: 1697815536, default_branch: 'main', storage_namespace: 's3://storage-bucket/repos/example-repo'}
```

#### Creating a branch

```python
import lakefs

branch1 = lakefs.repository("example-repo").branch("experiment1").create(source_reference_id="main")
print("experiment1 ref:", branch1.get_commit().id)

branch2 = lakefs.repository("example-repo").branch("experiment2").create(source_reference_id="main")
print("experiment2 ref:", branch2.get_commit().id)
```

#### Output
```
experiment1 ref: 7a300b41a8e1ca666c653171a364c08f640549c24d7e82b401bf077c646f8859
experiment2 ref: 7a300b41a8e1ca666c653171a364c08f640549c24d7e82b401bf077c646f8859
```

#### List branches

```python
import lakefs

for branch in lakefs.repository("example-repo").branches():
    print(branch)
```

#### Output
```
experiment1
experiment2
main
```

## IO

Great, now let's see some IO operations in action!
The new High Level SDK provides IO semantics that let you work with lakeFS objects as if they were files in your
filesystem. This is extremely useful when working with data transformation packages that accept file descriptors and streams.

### Upload

A simple way to upload data is to use the `upload` method, which accepts contents as `str`/`bytes`:

```python
obj = branch1.object(path="text/sample_data.txt").upload(content_type="text/plain", data="This is my object data")
print(obj.stats())
```

#### Output
```
{'path': 'text/sample_data.txt', 'physical_address': 's3://storage-bucket/repos/example-repo/data/gke0ignnl531fa6k90p0/ckpfk4fnl531fa6k90pg', 'physical_address_expiry': None, 'checksum': '4a09d10820234a95bb548f14e4435bba', 'size_bytes': 15, 'mtime': 1701865289, 'metadata': {}, 'content_type': 'text/plain'}
```

Reading the data is just as simple:

```python
print(obj.reader(mode='r').read())
```

#### Output
```
This is my object data
```

Now let's generate a "sample_data.csv" file and write it directly to a lakeFS writer object:

```python
import csv

sample_data = [
    [1, "Alice", "alice@example.com"],
    [2, "Bob", "bob@example.com"],
    [3, "Carol", "carol@example.com"],
]

obj = branch1.object(path="csv/sample_data.csv")

with obj.writer(mode='w', pre_sign=True, content_type="text/csv") as fd:
    writer = csv.writer(fd)
    writer.writerow(["ID", "Name", "Email"])
    for row in sample_data:
        writer.writerow(row)
```

On context exit the object will be uploaded to lakeFS:

```python
print(obj.stats())
```

#### Output
```
{'path': 'csv/sample_data.csv', 'physical_address': 's3://storage-bucket/repos/example-repo/data/gke0ignnl531fa6k90p0/ckpfk4fnl531fa6k90pg', 'physical_address_expiry': None, 'checksum': 'f181262c138901a74d47652d5ea72295', 'size_bytes': 88, 'mtime': 1701865939, 'metadata': {}, 'content_type': 'text/csv'}
```

We can also upload raw byte contents:

```python
obj = branch1.object(path="raw/file1.data").upload(data=b"Hello Object World", pre_sign=True)
print(obj.stats())
```

#### Output
```
{'path': 'raw/file1.data', 'physical_address': 's3://storage-bucket/repos/example-repo/data/gke0ignnl531fa6k90p0/ckpfltvnl531fa6k90q0', 'physical_address_expiry': None, 'checksum': '0ef432f8eb0305f730b0c57bbd7a6b08', 'size_bytes': 18, 'mtime': 1701866323, 'metadata': {}, 'content_type': 'application/octet-stream'}
```

### Uncommitted changes

Using the branch `uncommitted` method will show all the uncommitted changes on that branch:

```python
for diff in branch1.uncommitted():
    print(diff)
```

#### Output

```
{'type': 'added', 'path': 'text/sample_data.txt', 'path_type': 'object', 'size_bytes': 15}
{'type': 'added', 'path': 'csv/sample_data.csv', 'path_type': 'object', 'size_bytes': 88}
{'type': 'added', 'path': 'raw/file1.data', 'path_type': 'object', 'size_bytes': 18}
```

As expected, our changes appear here. Let's commit them and attach some arbitrary metadata:

```python
ref = branch1.commit(message='Add some data!', metadata={'using': 'python_sdk'})
print(ref.get_commit())
```

#### Output
```
{'id': 'c4666db80d2a984b4eab8ce02b6a60830767eba53995c26350e0ad994e15fedb', 'parents': ['a7a092a5a32a2cd97f22abcc99414f6283d29f6b9dd2725ce89f90188c5901e5'], 'committer': 'admin', 'message': 'Add some data!', 'creation_date': 1701866838, 'meta_range_id': '999bedeab1b740f83d2cf8c52548d55446f9038c69724d399adc4438412cade2', 'metadata': {'using': 'python_sdk'}}
```

Calling `uncommitted` again on the same branch should now show no uncommitted changes:

```python
print(len(list(branch1.uncommitted())))
```

#### Output
```
0
```

### Merging changes from a branch into main

Let's diff between our branch and the main branch:

```python
main = repo.branch("main")  # repo from the "Creating a repository" example above
for diff in main.diff(other_ref=branch1):
    print(diff)
```

#### Output
```
{'type': 'added', 'path': 'text/sample_data.txt', 'path_type': 'object', 'size_bytes': 15}
{'type': 'added', 'path': 'csv/sample_data.csv', 'path_type': 'object', 'size_bytes': 88}
{'type': 'added', 'path': 'raw/file1.data', 'path_type': 'object', 'size_bytes': 18}
```

Looks like we have some changes. Let's merge them:

```python
res = branch1.merge_into(main)
print(res)
# output:
# cfddb68b7265ae0b17fafa1a2068f8414395e0a8b8bc0f8d741cbcce1e67e394
```

Let's diff again - there should be no changes, as all the changes are already on our main branch:

```python
print(len(list(main.diff(other_ref=branch1))))
```

#### Output
```
0
```

### Read data from main branch

```python
import csv

obj = main.object(path="csv/sample_data.csv")

for row in csv.reader(obj.reader(mode='r')):
    print(row)
```

#### Output
```
['ID', 'Name', 'Email']
['1', 'Alice', 'alice@example.com']
['2', 'Bob', 'bob@example.com']
['3', 'Carol', 'carol@example.com']
```

### Importing data into lakeFS

The new SDK makes it much easier to import existing data from the object store into lakeFS, using the new ImportManager:

```python
import lakefs

branch = lakefs.repository("example-repo").branch("experiment3")

# We can import data from multiple sources in a single import process
# The following example initializes a new ImportManager and adds two source types: a prefix and an object.
importer = branch.import_data(commit_message="added public S3 data") \
    .prefix("s3://example-bucket1/path1/", destination="datasets/path1/") \
    .object("s3://example-bucket1/path2/imported_obj", destination="datasets/path2/imported_obj")

# run() is a convenience method that blocks until the import is reported as done, raising an exception if it fails.
importer.run()
```

Alternatively, we can call `start()` and `status()` ourselves for an asynchronous version of the above:

```python
import time

# Async version
importer.start()
status = importer.status()

while not status.completed and status.error is None:
    time.sleep(3)  # or whatever interval you choose
    status = importer.status()

if status.error:
    # handle the error
    raise RuntimeError(status.error)

print(f"imported a total of {status.ingested_objects} objects!")
```

#### Output
```
imported a total of 25478 objects!
```

### Transactions

Transactions are a new feature of the High Level SDK. They allow performing a sequence of operations on a branch as an atomic unit, similarly to how database transactions work.
Under the hood, the transaction creates an ephemeral branch from the source branch, performs all the operations on that branch, and merges it back into the source branch once the transaction is completed.
Transactions are currently supported as a context manager only.

```python
import lakefs

branch = lakefs.repository("example-repo").branch("experiment3")

with branch.transact(commit_message="my transaction") as tx:
    for obj in tx.objects(prefix="prefix_to_delete/"):  # Delete some objects
        obj.delete()

    # Create new object
    tx.object("new_object").upload("new object data")

print(len(list(branch.objects(prefix="prefix_to_delete/"))))
print(branch.object("new_object").exists())
```

#### Output
```
0
True
```

### Python SDK documentation and API reference

For the documentation of lakeFS’s Python package and full API reference, see [https://pydocs-lakefs.lakefs.io](https://pydocs-lakefs.lakefs.io)

## Using lakefs-spec for higher-level file operations

The [lakefs-spec](https://lakefs-spec.org/latest/) project
provides higher-level file operations on lakeFS objects with a filesystem API,
built on the [fsspec](https://github.com/fsspec/filesystem_spec) project.

**Note** This library is a third-party package and is not maintained by the lakeFS developers; please file issues and bug reports directly
in the [lakefs-spec](https://github.com/aai-institute/lakefs-spec) repository.
{: .note}

### Installation

Install `lakefs-spec` directly with `pip`:

```
python -m pip install --upgrade lakefs-spec
```

### Interacting with lakeFS through a file system

To write a file directly to a branch in a lakeFS repository, consider the following example:

```python
from pathlib import Path

from lakefs_spec import LakeFSFileSystem

REPO, BRANCH = "example-repo", "main"

# Prepare a local example file.
lpath = Path("demo.txt")
lpath.write_text("Hello, lakeFS!")

fs = LakeFSFileSystem()  # will auto-discover credentials from ~/.lakectl.yaml
rpath = f"{REPO}/{BRANCH}/{lpath.name}"
fs.put(lpath, rpath)
```

Reading it again from remote is as easy as the following:

```python
f = fs.open(rpath, "rt")
print(f.readline())  # prints "Hello, lakeFS!"
```

Many more operations like retrieving an object's metadata or checking an
object's existence on the lakeFS server are also supported. For a full list,
see the [API reference](https://lakefs-spec.org/latest/reference/lakefs_spec/).
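
For instance, a short sketch using the standard fsspec filesystem methods, reusing `rpath` from the example above:

```python
# Check for existence and fetch metadata with standard fsspec methods.
print(fs.exists(rpath))  # True
info = fs.info(rpath)    # dict with path, size, and other object details
print(info["size"])
```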

### Integrations with popular data science packages

A number of Python data science projects support fsspec, with [pandas](https://pandas.pydata.org/) being a prominent example. Reading a Parquet file from a lakeFS repository into a pandas data frame for analysis is very easy, demonstrated on the quickstart repository sample data:

```python
import pandas as pd

# Read into pandas directly by supplying the lakeFS URI...
lakes = pd.read_parquet("lakefs://quickstart/main/lakes.parquet")
german_lakes = lakes.query('Country == "Germany"')
# ... and store directly, again with a raw lakeFS URI.
german_lakes.to_csv("lakefs://quickstart/main/german_lakes.csv")
```

A list of integrations with popular data science libraries can be found in the [lakefs-spec documentation](https://lakefs-spec.org/latest/guides/integrations/).

### Using transactions for atomic versioning operations

As with the High Level SDK (see above), lakefs-spec also supports transactions
for conducting versioning operations on newly modified files. The following is an example of creating a commit on the repository's main branch directly after a file upload:

```python
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()

# assumes you have a local train-test split as two text files:
# train-data.txt, and test-data.txt.
with fs.transaction("example-repo", "main") as tx:
    fs.put_file("train-data.txt", f"example-repo/{tx.branch.id}/train-data.txt")
    tx.commit(message="Add training data")
    fs.put_file("test-data.txt", f"example-repo/{tx.branch.id}/test-data.txt")
    sha = tx.commit(message="Add test data")
    tx.tag(sha, name="My train-test split")
```

Transactions are atomic: if an exception occurs at any point of the transaction, the repository remains unchanged.
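
For example, a sketch with a deliberately failing transaction (`bad-data.txt` is a hypothetical local file):

```python
try:
    with fs.transaction("example-repo", "main") as tx:
        fs.put_file("bad-data.txt", f"example-repo/{tx.branch.id}/bad-data.txt")
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass

# The failed transaction was never merged back into main:
print(fs.exists("example-repo/main/bad-data.txt"))  # False
```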

### Further information

For more user guides, tutorials on integrations with data science tools like pandas, and more, check out the [lakefs-spec documentation](https://lakefs-spec.org/latest/).

## Using Boto

💡 To use Boto with lakeFS alongside S3, check out [Boto S3 Router](https://github.com/treeverse/boto-s3-router){:target="_blank"}. It will route
requests to either S3 or lakeFS according to the provided bucket name.
{: .note }

lakeFS exposes an S3-compatible API, so you can use Boto to interact with your objects on lakeFS.

### Initializing

Create a Boto3 S3 client with your lakeFS endpoint and key-pair:

```python
import boto3
s3 = boto3.client('s3',
    endpoint_url='https://lakefs.example.com',
    aws_access_key_id='AKIAIOSFODNN7EXAMPLE',
    aws_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')
```

The client is now configured to operate on your lakeFS installation.
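
As a quick sanity check, a sketch that lists repositories, which the S3 gateway exposes as buckets:

```python
# Each lakeFS repository appears as a bucket through the S3 gateway.
resp = s3.list_buckets()
print([b['Name'] for b in resp['Buckets']])
```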

### Usage Examples

#### Put an object into lakeFS

Use a branch name and a path to put an object in lakeFS:

```python
with open('/local/path/to/file_0', 'rb') as f:
    s3.put_object(Body=f, Bucket='example-repo', Key='main/example-file.parquet')
```

You can now commit this change using the lakeFS UI or CLI.
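
Alternatively, a sketch of committing the change from Python with the High Level SDK shown earlier (the commit message is arbitrary):

```python
import lakefs

# Commit the change that was just written through the S3 gateway.
lakefs.repository("example-repo").branch("main").commit(message="Add example-file.parquet")
```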

#### List objects

List the branch objects starting with a prefix:

```python
list_resp = s3.list_objects_v2(Bucket='example-repo', Prefix='main/example-prefix')
for obj in list_resp['Contents']:
    print(obj['Key'])
```

Or, use a lakeFS commit ID to list objects for a specific commit:

```python
list_resp = s3.list_objects_v2(Bucket='example-repo', Prefix='c7a632d74f/example-prefix')
for obj in list_resp['Contents']:
    print(obj['Key'])
```

#### Get object metadata

Get object metadata using branch and path:

```python
s3.head_object(Bucket='example-repo', Key='main/example-file.parquet')
# output:
# {'ResponseMetadata': {'RequestId': '72A9EBD1210E90FA',
#  'HostId': '',
#  'HTTPStatusCode': 200,
#  'HTTPHeaders': {'accept-ranges': 'bytes',
#   'content-length': '1024',
#   'etag': '"2398bc5880e535c61f7624ad6f138d62"',
#   'last-modified': 'Sun, 24 May 2020 10:42:24 GMT',
#   'x-amz-request-id': '72A9EBD1210E90FA',
#   'date': 'Sun, 24 May 2020 10:45:42 GMT'},
#  'RetryAttempts': 0},
# 'AcceptRanges': 'bytes',
# 'LastModified': datetime.datetime(2020, 5, 24, 10, 42, 24, tzinfo=tzutc()),
# 'ContentLength': 1024,
# 'ETag': '"2398bc5880e535c61f7624ad6f138d62"',
# 'Metadata': {}}
```
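
Finally, objects can be downloaded with the usual `get_object` call (a sketch using the object uploaded above):

```python
# Read the object's contents from the main branch.
resp = s3.get_object(Bucket='example-repo', Key='main/example-file.parquet')
data = resp['Body'].read()
print(len(data))
```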