# lakeFS high-level Python SDK Wrapper

## Goals

1. Provide a simpler programming interface with less configuration
1. Behave closer to other related Python SDKs (Pinecone, OpenAI, HuggingFace, ...)
1. Allow inferring identity from the environment
1. Provide better abstractions for common, more complex operations (I/O, imports)

## Non-Goals

1. For now, we explicitly leave out environment administration: setting up the server, configuring GC rules, and all IAM/ACL related things.

## Authentication

Any operation that calls out to lakeFS will try to authenticate using the following chain:

1. All models receive an optional `client` kwarg with explicit credentials
1. Otherwise, if `lakefs.init(...)` is called with parameters (`access_key_id`, `jwt_token`, ...), these will be set on a `lakefs.DefaultClient` object
1. If `lakefs.init()` is called with no parameters, or `init` is not called:
    1. Use `LAKECTL_SERVER_ENDPOINT_URL`, `LAKECTL_ACCESS_KEY_ID` and `LAKECTL_ACCESS_SECRET_KEY` if set
    1. Otherwise, use `~/.lakectl.yaml` if it exists
    1. Otherwise, try to use the IAM role of the current machine (AWS IAM role; will only work with enterprise/cloud)
1. If `init` is not called, it will be lazily called on the first use of `DefaultClient`, deferring authentication to the first API call.

## API wrapper interface

The higher-level SDK will be resource-class based: performing API operations is
done by calling methods on their parent object. Examples:

```python
import lakefs

repo = lakefs.Repository('example')
branch = repo.Branch('main')

for item in branch.objects.list(prefix='foo/'):
    if item.path.endswith('.parquet'):
        print(item.path)

data: bytes = branch.Object('datasets/foo/1.parquet').open().read()
branch.Object('datasets/foo/1.parquet').create(data)

# this will work:
data: bytes = lakefs.Repository('example').Commit('abc123').Object('a/b.txt').open().read()

# since commits are immutable, create() will not exist, so this will fail:
lakefs.Repository('example').Commit('abc123').Object('a/b.txt').create(data)
```
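To tie the authentication chain and the resource classes together, here is a rough sketch of the three ways a caller could authenticate. It is only illustrative: the exact keyword arguments accepted by `init()` and `Client` (`host`, `secret_access_key`) are assumptions rather than a finalized signature, and `Client` itself is defined in the next section.

```python
import lakefs

# 1. Rely entirely on the environment (lakectl env vars, ~/.lakectl.yaml, IAM role):
print(lakefs.Repository('example').metadata())

# 2. Configure the global default client once, up front
#    (host / secret_access_key are illustrative kwargs):
lakefs.init(host='https://lakefs.example.com',
            access_key_id='AKIA...', secret_access_key='...')

# 3. Pass an explicit client to a single resource:
client = lakefs.Client(host='https://lakefs.example.com',
                       access_key_id='AKIA...', secret_access_key='...')
repo = lakefs.Repository('example', client=client)
```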
## Partial interface definition

### Authentication

```python
from typing import Optional


class Client:
    """
    Wrapper around lakefs_sdk's client object.
    Takes care of instantiating it from the environment.
    """
    def __init__(self, **kwargs):
        self._client = _infer_auth_chain(**kwargs)


# global default client
DefaultClient: Optional[Client] = None

try:
    DefaultClient = Client()
except NoAuthenticationFound:
    # must call init() explicitly
    DefaultClient = None


def init(**kwargs):
    global DefaultClient
    DefaultClient = Client(**kwargs)
```

### Model-driven interface

```python
from __future__ import annotations

import datetime
from datetime import timedelta
from typing import BinaryIO, Generator, Iterable, Literal, NamedTuple, Optional, TextIO


class Repository:
    def __init__(self, repository_id: str, client: Client = DefaultClient): ...
    def create(self, storage_namespace: str, default_branch_id: str = 'main', include_samples: bool = False, exist_ok: bool = False) -> Repository: ...
    def metadata(self) -> dict[str, str]: ...
    def Branch(self, branch_id: str) -> Branch: ...
    def Commit(self, commit_id: str) -> Reference: ...
    # Ref can take a branch, tag or commit ID, and returns only committed state (i.e. a
    # branch will be rev-parsed and its underlying commit returned).
    # This is actually how the GetCommit API operation behaves, so this is essentially an alias for Commit()!
    def Ref(self, ref_id: str) -> Reference: ...
    def Tag(self, tag_id: str) -> Tag: ...

    @property
    def branches(self) -> BranchManager: ...

    @property
    def tags(self) -> TagManager: ...


class BranchManager:
    def __init__(self, repository_id: str, client: Client = DefaultClient): ...
    def list(self, max_amount: Optional[int], after: str = '', prefix: str = '') -> Generator[Branch]: ...


class TagManager:
    def __init__(self, repository_id: str, client: Client = DefaultClient): ...
    def list(self, max_amount: Optional[int], after: str = '', prefix: str = '') -> Generator[Tag]: ...


class StoredObject:
    def __init__(self, repository_id: str, reference_id: str, path: str, client: Client = DefaultClient): ...
    def open(self, mode: Literal['r', 'rb'] = 'r', pre_signed: Optional[bool] = None) -> TextIO | BinaryIO: ...
    def stat(self) -> ObjectInfo: ...


class WritableObject(StoredObject):
    def create(self, data: bytes | str | TextIO | BinaryIO, pre_signed: Optional[bool] = None,
               content_type: Optional[str] = None, metadata: Optional[dict[str, str]] = None,
               mode: Literal['x', 'xb', 'w', 'wb'] = 'wb') -> ObjectInfo: ...
    def delete(self): ...
    def copy(self, to_reference: str, to_path: str): ...


class ObjectManager:
    def __init__(self, repository_id: str, reference_id: str, client: Client = DefaultClient): ...
    def list(self, max_amount: Optional[int], after: str = '', prefix: str = '', delimiter: str = '/') -> Generator[ObjectInfo | CommonPrefix]: ...


class WritableObjectManager(ObjectManager):
    def uncommitted(self, max_amount: Optional[int], after: str = '', prefix: str = '') -> Generator[Change]: ...
    def import_data(self, commit_message: str) -> ImportManager: ...  # named import_data since 'import' is a reserved word in Python
    def delete(self, object_paths: str | Iterable[str]): ...
    def transact(self, commit_message: str) -> Transaction: ...
    def reset_changes(self, path: Optional[str] = None): ...


class Reference:
    def __init__(self, repository_id: str, reference_id: str, client: Client = DefaultClient): ...
    def log(self, max_amount: Optional[int]) -> Generator[Reference]: ...
    def metadata(self) -> dict[str, str]: ...
    def commit_message(self) -> str: ...
    def diff(self, other_ref: str | Reference, max_amount: Optional[int], after: str = '', prefix: str = '', delimiter: str = '/') -> Generator[Change]: ...
    def merge_into(self, destination_branch_id: str | Branch): ...
    def Object(self, path: str) -> StoredObject: ...
    @property
    def objects(self) -> ObjectManager: ...


class Branch(Reference):
    def create(self, source_reference_id: str, exist_ok: bool = False) -> Branch: ...
    def head(self) -> Reference: ...
    def commit(self, message: str, metadata: dict[str, str]) -> Reference: ...
    def delete(self): ...
    def revert(self, reference_id: str): ...
    def Object(self, path: str) -> WritableObject: ...
    @property
    def objects(self) -> WritableObjectManager: ...


class Tag(Reference):
    def create(self, source_reference_id: str, exist_ok: bool = False) -> Tag: ...
    def delete(self, exist_ok: bool = False): ...


class CommonPrefix:
    def __init__(self, repository_id: str, reference_id: str, path: str, client: Client = DefaultClient): ...
    def exists(self) -> bool: ...


class ObjectInfo:
    def __init__(self, repository_id: str, reference_id: str, path: str, client: Client = DefaultClient): ...
    def path(self) -> str: ...
    def modified_time(self) -> datetime.datetime: ...
    def size_bytes(self) -> int: ...
    def content_type(self) -> Optional[str]: ...
    def metadata(self) -> dict[str, str]: ...
    def physical_address(self) -> str: ...
    def delete(self): ...


class Change(NamedTuple):
    type: Literal['added', 'removed', 'changed', 'conflict', 'prefix_changed']
    path: str
    path_type: Literal['common_prefix', 'object']
    size_bytes: Optional[int]


class ServerConfiguration:
    def __init__(self, client: Client = DefaultClient): ...
    def version(self) -> str: ...
    def storage_config(self) -> ServerStorageConfiguration: ...


class ServerStorageConfiguration(NamedTuple):
    blockstore_type: str
    pre_sign_support: bool
    import_support: bool


class ImportManager:
    def __init__(self, repository_id: str, reference_id: str, client: Client = DefaultClient): ...
    def prefix(self, object_store_uri: str, destination: str) -> ImportManager: ...
    def object(self, object_store_uri: str, destination: str) -> ImportManager: ...
    def start(self) -> str:
        'start import, reporting back (and storing) a process id'
        ...
    def wait(self, poll_interval: timedelta = timedelta(seconds=2)) -> ImportResult:
        'poll a started import task ID, blocking until completion'
        ...
    def run(self, poll_interval: Optional[timedelta] = None) -> ImportResult:
        'same as calling start() and then wait()'
        ...


class ImportResult(NamedTuple):
    commit: Commit
    ingested_objects: int


class Transaction(Branch):
    def __init__(self, repository_id: str, branch_id: str, client: Client = DefaultClient): ...
    def begin(self) -> None:
        'Create an ephemeral branch from the source branch (e.g. <source_branch_id>-txn-<uuid>)'
        ...

    def commit(self) -> Commit:
        'commit, merge, delete the ephemeral branch'
        ...

    def rollback(self, delete_temp_branch: bool = True) -> None:
        'if delete_temp_branch = True, delete the ephemeral branch created'
        ...

    def __enter__(self):
        'calls begin()'
        ...

    def __exit__(self, type, value, traceback):
        'if successful, commit(), otherwise rollback() and report a meaningful error'
        ...
```

While this list is fairly exhaustive, it might require a few additional tweaks and additions.
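To sanity-check how these classes compose, here is a rough end-to-end sketch that uses only methods defined above; the repository, branch names, object paths and commit messages are made up for illustration:

```python
import lakefs

repo = lakefs.Repository('example')
main = repo.Branch('main')

# create a feature branch off main (no-op if it already exists)
feature = repo.Branch('feature-1').create(source_reference_id='main', exist_ok=True)

# write an object, then inspect the uncommitted changes on the branch
feature.Object('datasets/foo/1.parquet').create(data=b'...')
for change in feature.objects.uncommitted(max_amount=100):
    print(change.type, change.path)

# commit, diff against main, then merge back
commit = feature.commit(message='add dataset', metadata={'author': 'me'})
for change in commit.diff(main, max_amount=100):
    print(change.type, change.path)
feature.merge_into(main)
```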
Additionally, we define the following exception hierarchy:

```python
# lakefs.exceptions
class LakeFSException(Exception):
    status_code: int
    message: str

# More specific "not found"s can inherit from this:
class NotFoundException(LakeFSException): ...
class NotAuthorizedException(LakeFSException): ...
class ServerException(LakeFSException): ...
class UnsupportedOperationException(LakeFSException): ...
class ObjectNotFoundException(NotFoundException, FileNotFoundError): ...

# raised when Object('...').create(mode='x') and the object already exists
class ObjectExistsException(LakeFSException, FileExistsError): ...

# raised by Object.open() and Object.create() for compatibility with Python's PermissionError
class PermissionException(NotAuthorizedException, PermissionError): ...
```

Other, more specific exceptions may subclass these, but all errors returned by the lakeFS server should subclass one of these to make error handling easier for developers.

Hierarchy:

```text
LakeFSException
├── NotFoundException
│   └── ObjectNotFoundException
├── NotAuthorizedException
│   └── PermissionException
├── ServerException
├── UnsupportedOperationException
└── ObjectExistsException
```
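Because the lakeFS-specific exceptions also inherit from the matching Python built-ins, callers can handle errors either way. A small sketch, assuming the `lakefs.exceptions` module name from the comment above; the object path and the handling itself are illustrative:

```python
import lakefs
from lakefs.exceptions import NotAuthorizedException

obj = lakefs.Repository('example').Branch('main').Object('maybe/missing.txt')

try:
    with obj.open() as f:
        data = f.read()
except FileNotFoundError:
    # ObjectNotFoundException also subclasses FileNotFoundError,
    # so plain Python error handling keeps working
    data = None
except NotAuthorizedException as e:
    raise RuntimeError(f'check your credentials: {e.message}') from e
```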
## Higher Level Utilities

### I/O - reading/writing objects

Provide a pythonic `open()` method that returns a "file-like object" (read-only):

```python
import lakefs

repo = lakefs.Repository('example')
branch = repo.Branch('main')

# Will check the underlying client for pre-signed URL support.
# If supported, will do get_physical_address -> HTTP upload -> link address.
# Otherwise, will try a direct upload.
# *In the future*, we can accept a stream/file-like object, sniff for its size/content type,
# opt for multi-part uploads, etc.
branch.Object('foo/bar.txt').create(data=b'hello world!\n')

with branch.Object('foo/bar.txt').open() as f:
    data = f.read()

with repo.Commit('abc123').Object('foo/bar.txt').open() as f:
    f.read()  # read all
    f.read(1024)  # or a range request
```

`open()` will also accept an explicit `pre_signed: Optional[bool] = None` argument:
if set, don't probe the client for this capability.

### Import Manager

Provide a utility to run and track imports from the object store:

```python
import lakefs

main = lakefs.Repository('example').Branch('main')
task = main.objects.import_data(commit_message='imported stuff!') \
    .prefix('s3://bucket/path', destination='some/path/') \
    .prefix('s3://bucket2/other/path', destination='other/path/')

task.start()  # will not block, just runs the import API call
task.wait()   # blocks, polling in the background

# or just run(), same as start() & wait()
main.objects.import_data('sync datasets').prefix('s3://bucket/path/', destination='datasets/').run()
```

### Transaction Manager

```python
import lakefs

dev = lakefs.Repository('example').Branch('dev')

# Will create an ephemeral branch from `dev` (e.g. `tx-dev-343829f89`).
# Uploads and downloads will apply to that ephemeral branch.
# On success, commit with the provided message, merge and delete the ephemeral branch.
# On exception or failure, leave the branch as is and report it in a wrapping exception
# for easy troubleshooting.
with dev.objects.transact('do things') as tx:
    tx.Object('foo').create(data=b'bar')
    with tx.Object('foo').open() as f:
        data = f.read()
```

### Creating repositories

Small helper for writing succinct examples/samples:

```python
import lakefs

repo = lakefs.Repository('example').create(storage_namespace='s3://bucket/path/', exist_ok=True)

# From here, proceed as usual...
main = repo.Branch('main')
```
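Continuing from the snippet above, a short sketch of what "proceed as usual" might look like, combining the transaction helper and the pre-sign capability check described earlier; the repository contents and the explicit `pre_signed` usage are illustrative:

```python
import lakefs

repo = lakefs.Repository('example').create(storage_namespace='s3://bucket/path/', exist_ok=True)
main = repo.Branch('main')

# seed the new repository atomically using the transaction helper
with main.objects.transact('seed repository') as tx:
    tx.Object('README.md').create(data=b'# example\n')
    tx.Object('datasets/foo/1.parquet').create(data=b'...')

# read back, using pre-signed URLs only if the server reports support for them
pre_sign = lakefs.ServerConfiguration().storage_config().pre_sign_support
with main.Object('README.md').open(mode='rb', pre_signed=pre_sign) as f:
    print(f.read())
```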