---
title: Python
description: Use Python to interact with your objects on lakeFS
parent: Integrations
redirect_from:
  - /using/python.html
  - /using/boto.html
  - /integrations/boto.html
---

# Use Python to interact with your objects on lakeFS

{% include toc_2-3.html %}

**High Level Python SDK** <span class="badge mr-1">New</span>
We've just released a new High Level Python SDK library, and we're super excited to tell you about it! Continue reading to get the full story!
Though our previous SDK client is still supported and maintained, we highly recommend using the new High Level SDK.
**For previous Python SDKs follow these links:**
[lakefs-sdk](https://pydocs-sdk.lakefs.io)
[legacy-sdk](https://pydocs.lakefs.io) (Deprecated)
{: .note }

There are three primary ways to work with lakeFS from Python:

* [Use Boto](#using-boto) to perform **object operations** through the **lakeFS S3 gateway**.
* [Use the High Level lakeFS SDK](#using-the-lakefs-sdk) to perform **object operations**, **versioning** and other **lakeFS-specific operations**.
* [Use lakefs-spec](#using-lakefs-spec-for-higher-level-file-operations) to perform high-level file operations through a file-system-like API.

## Using the lakeFS SDK

### Installing

Install the Python client using pip:

```shell
pip install lakefs
```

### Initializing

By default, the High Level SDK collects authentication parameters from the environment and creates a default client.
When working in an environment where **lakectl** is configured, there is no need to instantiate a lakeFS client or pass one when creating lakeFS objects.
If no authentication parameters are available in the environment, you can also create a lakeFS client explicitly.

Here's how to instantiate a client:

```python
from lakefs.client import Client

clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)
```

You can use TLS with a CA that is not trusted on the host by configuring the
client with a CA cert bundle file. It should contain concatenated CA
certificates in PEM format:

```python
clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    # Customize the CA certificates used to verify the peer.
    ssl_ca_cert="path/to/concatenated_CA_certificates.PEM",
)
```

For testing SSL endpoints you may wish to use a self-signed certificate. If you do this and receive an `SSL: CERTIFICATE_VERIFY_FAILED` error message, you can add the following configuration to your client:

```python
clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    verify_ssl=False,
)
```

{: .warning }
This setting allows well-known "man-in-the-middle",
impersonation, and credential stealing attacks. Never use this in any
production setting.
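
If **lakectl** is already configured on the machine, you do not need to construct a client at all; the SDK picks up the endpoint and credentials from its configuration or from the environment. Below is a minimal sketch of pointing the default client at lakeFS through lakectl-style environment variables; the variable names are shown for illustration, so check your lakectl configuration for the authoritative list:

```shell
# Illustrative only: configure the default client via lakectl-style
# environment variables (adjust names and values to your environment).
export LAKECTL_SERVER_ENDPOINT_URL=http://localhost:8000
export LAKECTL_CREDENTIALS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

With these set, the examples below that call `lakefs.repository(...)` without an explicit client authenticate automatically.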

Optionally, to enable communication via proxies, add a proxy configuration:

```python
clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    ssl_ca_cert="(if needed)",
    proxy="<proxy server URL>",
)
```

### Usage Examples

Let's see how we can interact with lakeFS using the High Level SDK.

#### Creating a repository

```python
import lakefs

repo = lakefs.repository("example-repo").create(storage_namespace="s3://storage-bucket/repos/example-repo")
print(repo)
```

If using an explicit client, create the Repository object and pass the client to it (note the changed syntax):

```python
import lakefs
from lakefs.client import Client

clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

repo = lakefs.Repository("example-repo", client=clt).create(storage_namespace="s3://storage-bucket/repos/example-repo")
print(repo)
```

#### Output
```
{id: 'example-repo', creation_date: 1697815536, default_branch: 'main', storage_namespace: 's3://storage-bucket/repos/example-repo'}
```

#### List repositories

```python
import lakefs

print("Listing repositories:")
for repo in lakefs.repositories():
    print(repo)
```

#### Output
```
Listing repositories:
{id: 'example-repo', creation_date: 1697815536, default_branch: 'main', storage_namespace: 's3://storage-bucket/repos/example-repo'}
```

#### Creating a branch

```python
import lakefs

branch1 = lakefs.repository("example-repo").branch("experiment1").create(source_reference_id="main")
print("experiment1 ref:", branch1.get_commit().id)

branch2 = lakefs.repository("example-repo").branch("experiment2").create(source_reference_id="main")
print("experiment2 ref:", branch2.get_commit().id)
```

#### Output
```
experiment1 ref: 7a300b41a8e1ca666c653171a364c08f640549c24d7e82b401bf077c646f8859
experiment2 ref: 7a300b41a8e1ca666c653171a364c08f640549c24d7e82b401bf077c646f8859
```

#### List branches

```python
import lakefs

for branch in lakefs.repository("example-repo").branches():
    print(branch)
```

#### Output
```
experiment1
experiment2
main
```

## IO

Great, now let's see some IO operations in action!
The new High Level SDK provides IO semantics that let you work with lakeFS objects as if they were files in your filesystem. This is extremely useful when working with data transformation packages that accept file descriptors and streams.
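
For example, because readers behave like ordinary file objects, they can be handed directly to libraries that expect a stream. Here is a minimal sketch, assuming a JSON object already exists at the hypothetical path `configs/settings.json` on `main`:

```python
import json

import lakefs

# Open a read stream on an existing object and parse it with the standard
# library's json module (the repository and path here are illustrative).
branch = lakefs.repository("example-repo").branch("main")
settings = json.load(branch.object(path="configs/settings.json").reader(mode="r"))
print(settings)
```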

### Upload

A simple way to upload data is to use the `upload` method, which accepts contents as `str`/`bytes`:

```python
obj = branch1.object(path="text/sample_data.txt").upload(content_type="text/plain", data="This is my object data")
print(obj.stats())
```

#### Output
```
{'path': 'text/sample_data.txt', 'physical_address': 's3://storage-bucket/repos/example-repo/data/gke0ignnl531fa6k90p0/ckpfk4fnl531fa6k90pg', 'physical_address_expiry': None, 'checksum': '4a09d10820234a95bb548f14e4435bba', 'size_bytes': 15, 'mtime': 1701865289, 'metadata': {}, 'content_type': 'text/plain'}
```

Reading the data is just as simple:

```python
print(obj.reader(mode='r').read())
```

#### Output
```
This is my object data
```

Now let's generate a "sample_data.csv" file and write it directly to a lakeFS writer object:

```python
import csv

sample_data = [
    [1, "Alice", "alice@example.com"],
    [2, "Bob", "bob@example.com"],
    [3, "Carol", "carol@example.com"],
]

obj = branch1.object(path="csv/sample_data.csv")

with obj.writer(mode='w', pre_sign=True, content_type="text/csv") as fd:
    writer = csv.writer(fd)
    writer.writerow(["ID", "Name", "Email"])
    for row in sample_data:
        writer.writerow(row)
```

On context exit, the object will be uploaded to lakeFS:

```python
print(obj.stats())
```

#### Output
```
{'path': 'csv/sample_data.csv', 'physical_address': 's3://storage-bucket/repos/example-repo/data/gke0ignnl531fa6k90p0/ckpfk4fnl531fa6k90pg', 'physical_address_expiry': None, 'checksum': 'f181262c138901a74d47652d5ea72295', 'size_bytes': 88, 'mtime': 1701865939, 'metadata': {}, 'content_type': 'text/csv'}
```

We can also upload raw byte contents:

```python
obj = branch1.object(path="raw/file1.data").upload(data=b"Hello Object World", pre_sign=True)
print(obj.stats())
```

#### Output
```
{'path': 'raw/file1.data', 'physical_address': 's3://storage-bucket/repos/example-repo/data/gke0ignnl531fa6k90p0/ckpfltvnl531fa6k90q0', 'physical_address_expiry': None, 'checksum': '0ef432f8eb0305f730b0c57bbd7a6b08', 'size_bytes': 18, 'mtime': 1701866323, 'metadata': {}, 'content_type': 'application/octet-stream'}
```

### Uncommitted changes

Using the branch `uncommitted` method will show all the uncommitted changes on that branch:

```python
for diff in branch1.uncommitted():
    print(diff)
```

#### Output

```
{'type': 'added', 'path': 'text/sample_data.txt', 'path_type': 'object', 'size_bytes': 15}
{'type': 'added', 'path': 'csv/sample_data.csv', 'path_type': 'object', 'size_bytes': 88}
{'type': 'added', 'path': 'raw/file1.data', 'path_type': 'object', 'size_bytes': 18}
```

As expected, our changes appear here.
Let's commit them and attach some arbitrary metadata:

```python
ref = branch1.commit(message='Add some data!', metadata={'using': 'python_sdk'})
print(ref.get_commit())
```

#### Output
```
{'id': 'c4666db80d2a984b4eab8ce02b6a60830767eba53995c26350e0ad994e15fedb', 'parents': ['a7a092a5a32a2cd97f22abcc99414f6283d29f6b9dd2725ce89f90188c5901e5'], 'committer': 'admin', 'message': 'Add some data!', 'creation_date': 1701866838, 'meta_range_id': '999bedeab1b740f83d2cf8c52548d55446f9038c69724d399adc4438412cade2', 'metadata': {'using': 'python_sdk'}}
```

Calling `uncommitted` again on the same branch should now show no uncommitted files:

```python
print(len(list(branch1.uncommitted())))
```

#### Output
```
0
```

### Merging changes from a branch into main

Let's diff between our branch and the main branch:

```python
main = repo.branch("main")
for diff in main.diff(other_ref=branch1):
    print(diff)
```

#### Output
```
{'type': 'added', 'path': 'text/sample_data.txt', 'path_type': 'object', 'size_bytes': 15}
{'type': 'added', 'path': 'csv/sample_data.csv', 'path_type': 'object', 'size_bytes': 88}
{'type': 'added', 'path': 'raw/file1.data', 'path_type': 'object', 'size_bytes': 18}
```

Looks like we have some changes. Let's merge them:

```python
res = branch1.merge_into(main)
print(res)
# output:
# cfddb68b7265ae0b17fafa1a2068f8414395e0a8b8bc0f8d741cbcce1e67e394
```

Let's diff again - there should be no changes as all changes are on our main branch already:

```python
print(len(list(main.diff(other_ref=branch1))))
```

#### Output
```
0
```

### Read data from main branch

```python
import csv

obj = main.object(path="csv/sample_data.csv")

for row in csv.reader(obj.reader(mode='r')):
    print(row)
```

#### Output
```
['ID', 'Name', 'Email']
['1', 'Alice', 'alice@example.com']
['2', 'Bob', 'bob@example.com']
['3', 'Carol', 'carol@example.com']
```

### Importing data into lakeFS

The new SDK makes it much easier to import existing data from the object store into lakeFS, using the new `ImportManager`:

```python
import lakefs

branch = lakefs.repository("example-repo").branch("experiment3")

# We can import data from multiple sources in a single import process.
# The following example initializes a new ImportManager and adds two source types: a prefix and an object.
importer = branch.import_data(commit_message="added public S3 data") \
    .prefix("s3://example-bucket1/path1/", destination="datasets/path1/") \
    .object("s3://example-bucket1/path2/imported_obj", destination="datasets/path2/imported_obj")

# run() is a convenience method that blocks until the import is reported as done, raising an exception if it fails.
importer.run()
```

Alternatively, we can call `start()` and `status()` ourselves for an asynchronous version of the above:

```python
import time

# Async version
importer.start()
status = importer.status()

while not status.completed and status.error is None:
    time.sleep(3)  # or whatever interval you choose
    status = importer.status()

if status.error:
    # The import failed; surface the error.
    raise RuntimeError(f"import failed: {status.error}")

print(f"imported a total of {status.ingested_objects} objects!")
```

#### Output
```
imported a total of 25478 objects!
```

### Transactions

Transactions are a new feature in the High Level SDK. They allow performing a sequence of operations on a branch as an atomic unit, similarly to how database transactions work.
Under the hood, the transaction creates an ephemeral branch from the source branch, performs all the operations on that branch, and merges it back into the source branch once the transaction is completed.
Transactions are currently supported as a context manager only.

```python
import lakefs

branch = lakefs.repository("example-repo").branch("experiment3")

with branch.transact(commit_message="my transaction") as tx:
    for obj in tx.objects(prefix="prefix_to_delete/"):  # Delete some objects
        obj.delete()

    # Create a new object
    tx.object("new_object").upload("new object data")

print(len(list(branch.objects(prefix="prefix_to_delete/"))))
print(branch.object("new_object").exists())
```

#### Output
```
0
True
```

### Python SDK documentation and API reference

For the documentation of lakeFS's Python package and a full API reference, see [https://pydocs-lakefs.lakefs.io](https://pydocs-lakefs.lakefs.io).

## Using lakefs-spec for higher-level file operations

The [lakefs-spec](https://lakefs-spec.org/latest/) project
provides higher-level file operations on lakeFS objects with a filesystem API,
built on the [fsspec](https://github.com/fsspec/filesystem_spec) project.

**Note** This library is a third-party package and not maintained by the lakeFS developers; please file issues and bug reports directly
in the [lakefs-spec](https://github.com/aai-institute/lakefs-spec) repository.
{: .note}

### Installation

Install `lakefs-spec` directly with `pip`:

```
python -m pip install --upgrade lakefs-spec
```

### Interacting with lakeFS through a file system

To write a file directly to a branch in a lakeFS repository, consider the following example:

```python
from pathlib import Path

from lakefs_spec import LakeFSFileSystem

REPO, BRANCH = "example-repo", "main"

# Prepare a local example file.
lpath = Path("demo.txt")
lpath.write_text("Hello, lakeFS!")

fs = LakeFSFileSystem()  # will auto-discover credentials from ~/.lakectl.yaml
rpath = f"{REPO}/{BRANCH}/{lpath.name}"
fs.put(lpath, rpath)
```

Reading it again from remote is as easy as the following:

```python
f = fs.open(rpath, "rt")
print(f.readline())  # prints "Hello, lakeFS!"
```

Many more operations like retrieving an object's metadata or checking an
object's existence on the lakeFS server are also supported. For a full list,
see the [API reference](https://lakefs-spec.org/latest/reference/lakefs_spec/).

### Integrations with popular data science packages

A number of Python data science projects support fsspec, with [pandas](https://pandas.pydata.org/) being a prominent example. Reading a Parquet file from a lakeFS repository into a Pandas data frame for analysis is very easy, as demonstrated on the quickstart repository sample data:

```python
import pandas as pd

# Read into pandas directly by supplying the lakeFS URI...
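# (lakefs-spec registers the "lakefs://" protocol with fsspec, so pandas
# resolves these URIs through LakeFSFileSystem behind the scenes.)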
lakes = pd.read_parquet("lakefs://quickstart/main/lakes.parquet")
german_lakes = lakes.query('Country == "Germany"')
# ... and store directly, again with a raw lakeFS URI.
german_lakes.to_csv("lakefs://quickstart/main/german_lakes.csv")
```

A list of integrations with popular data science libraries can be found in the [lakefs-spec documentation](https://lakefs-spec.org/latest/guides/integrations/).

### Using transactions for atomic versioning operations

As with the high-level SDK (see above), lakefs-spec also supports transactions
for conducting versioning operations on newly modified files. The following is an example of creating a commit on the repository's main branch directly after a file upload:

```python
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()

# Assumes you have a local train-test split as two text files:
# train-data.txt and test-data.txt.
with fs.transaction("example-repo", "main") as tx:
    fs.put_file("train-data.txt", f"example-repo/{tx.branch.id}/train-data.txt")
    tx.commit(message="Add training data")
    fs.put_file("test-data.txt", f"example-repo/{tx.branch.id}/test-data.txt")
    sha = tx.commit(message="Add test data")
    tx.tag(sha, name="My train-test split")
```

Transactions are atomic: if an exception occurs at any point of the transaction, the repository remains unchanged.

### Further information

For more user guides, tutorials on integrations with data science tools like pandas, and more, check out the [lakefs-spec documentation](https://lakefs-spec.org/latest/).

## Using Boto

💡 To use Boto with lakeFS alongside S3, check out [Boto S3 Router](https://github.com/treeverse/boto-s3-router){:target="_blank"}. It will route
requests to either S3 or lakeFS according to the provided bucket name.
{: .note }

lakeFS exposes an S3-compatible API, so you can use Boto to interact with your objects on lakeFS.

### Initializing

Create a Boto3 S3 client with your lakeFS endpoint and key-pair:

```python
import boto3

s3 = boto3.client('s3',
    endpoint_url='https://lakefs.example.com',
    aws_access_key_id='AKIAIOSFODNN7EXAMPLE',
    aws_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')
```

The client is now configured to operate on your lakeFS installation.

### Usage Examples

#### Put an object into lakeFS

Use a branch name and a path to put an object in lakeFS:

```python
with open('/local/path/to/file_0', 'rb') as f:
    s3.put_object(Body=f, Bucket='example-repo', Key='main/example-file.parquet')
```

You can now commit this change using the lakeFS UI or CLI.
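
For example, with `lakectl` (a sketch that assumes lakectl is installed and configured against the same lakeFS installation; the commit message is illustrative):

```shell
# Commit the object written through the S3 gateway.
lakectl commit lakefs://example-repo/main -m "Add example-file.parquet via the S3 gateway"
```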

#### List objects

List the branch objects starting with a prefix:

```python
list_resp = s3.list_objects_v2(Bucket='example-repo', Prefix='main/example-prefix')
for obj in list_resp['Contents']:
    print(obj['Key'])
```

Or, use a lakeFS commit ID to list objects for a specific commit:

```python
list_resp = s3.list_objects_v2(Bucket='example-repo', Prefix='c7a632d74f/example-prefix')
for obj in list_resp['Contents']:
    print(obj['Key'])
```

#### Get object metadata

Get object metadata using branch and path:

```python
s3.head_object(Bucket='example-repo', Key='main/example-file.parquet')
# output:
# {'ResponseMetadata': {'RequestId': '72A9EBD1210E90FA',
#   'HostId': '',
#   'HTTPStatusCode': 200,
#   'HTTPHeaders': {'accept-ranges': 'bytes',
#    'content-length': '1024',
#    'etag': '"2398bc5880e535c61f7624ad6f138d62"',
#    'last-modified': 'Sun, 24 May 2020 10:42:24 GMT',
#    'x-amz-request-id': '72A9EBD1210E90FA',
#    'date': 'Sun, 24 May 2020 10:45:42 GMT'},
#   'RetryAttempts': 0},
#  'AcceptRanges': 'bytes',
#  'LastModified': datetime.datetime(2020, 5, 24, 10, 42, 24, tzinfo=tzutc()),
#  'ContentLength': 1024,
#  'ETag': '"2398bc5880e535c61f7624ad6f138d62"',
#  'Metadata': {}}
```
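
You can also read an object back through the gateway. A minimal sketch using the same client (the local destination path is illustrative):

```python
# Download the object's contents via the S3 gateway.
resp = s3.get_object(Bucket='example-repo', Key='main/example-file.parquet')
with open('/local/path/to/example-file.parquet', 'wb') as f:
    f.write(resp['Body'].read())
```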