
# lakefs high level Python SDK Wrapper

## Goals

1. Provide a simpler programming interface with less configuration
1. Behave closer to other related Python SDKs (Pinecone, OpenAI, HuggingFace, ...)
1. Allow inferring identity from the environment
1. Provide better abstractions for common, more complex operations (I/O, imports)

## Non-Goals

1. For now, we explicitly leave out environment administration: setting up the server, configuring GC rules, and all IAM/ACL related things.

## Authentication

Any operation that calls out to lakeFS will try to authenticate using the following chain (a short usage sketch follows the list):

1. All models receive an optional `client` kwarg with explicit credentials
1. Otherwise, if `lakefs.init(...)` is called with parameters (`access_key_id`, `jwt_token`, ...) - these will be set on a `lakefs.DefaultClient` object
1. If `lakefs.init()` is called with no parameters, or `init` is not called:
    1. use `LAKECTL_SERVER_ENDPOINT_URL`, `LAKECTL_ACCESS_KEY_ID` and `LAKECTL_ACCESS_SECRET_KEY` if set
    1. Otherwise, use `~/.lakectl.yaml` if it exists
    1. Otherwise, try to use the IAM role of the current machine (AWS IAM; this will only work with enterprise/cloud)
1. If `init()` is not called, it will be lazily called on the first use of `DefaultClient`, deferring authentication to the first API call.
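
For example, the three entry points above could look roughly like this (a sketch of the proposed interface; the exact `Client` kwarg names are illustrative, not part of this design):

```python
import lakefs

# 1. Explicit client passed to a model object
#    (kwarg names here are illustrative, not final)
client = lakefs.Client(host='https://lakefs.example.com',
                       access_key_id='AKIA...', secret_access_key='...')
repo = lakefs.Repository('example', client=client)

# 2. Explicit global initialization
lakefs.init(access_key_id='AKIA...', secret_access_key='...')

# 3. No init() at all: fall back to LAKECTL_* environment variables,
#    ~/.lakectl.yaml, or the machine's IAM role; DefaultClient is
#    lazily initialized on the first API call
print(lakefs.Repository('example').metadata())
```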

## API wrapper interface

The higher-level SDK will be resource-class based. API operations are performed by calling methods on their parent object. Examples:

```python
import lakefs

repo = lakefs.Repository('example')
branch = repo.Branch('main')

for item in branch.objects.list(prefix='foo/'):
    if item.path.endswith('.parquet'):
        print(item.path)

data: bytes = branch.Object('datasets/foo/1.parquet').open().read()
branch.Object('datasets/foo/1.parquet').create(data)

# this will work:
data: bytes = lakefs.Repository('example').Commit('abc123').Object('a/b.txt').open().read()

# but since commits are immutable, create() will not exist:
lakefs.Repository('example').Commit('abc123').Object('a/b.txt').create(data)
```

## Partial interface definition

### Authentication

```python
class Client:
    """
    Wrapper around lakefs_sdk's client object
    Takes care of instantiating it from the environment
    """
    def __init__(self, **kwargs):
        self._client = _infer_auth_chain(**kwargs)

# global default client
DefaultClient: Optional[Client] = None

try:
    DefaultClient = Client()
except NoAuthenticationFound:
    # must call init() explicitly
    DefaultClient = None


def init(**kwargs):
    global DefaultClient
    DefaultClient = Client(**kwargs)

```

### Model-driven interface


```python
class Repository:
    def __init__(self, repository_id: str, client: Client = DefaultClient): ...
    def create(self, storage_namespace: str, default_branch_id: str = 'main', include_samples: bool = False, exist_ok: bool = False) -> Repository: ...
    def metadata(self) -> dict[str, str]: ...
    def Branch(self, branch_id: str) -> Branch: ...
    def Commit(self, commit_id: str) -> Reference: ...
    # Ref can take a branch, tag or commit ID, and returns only committed state (i.e. a branch will be rev-parsed and its underlying commit returned)
    # This is actually how the GetCommit API operation behaves, so this is essentially an alias for Commit()!
    def Ref(self, ref_id: str) -> Reference: ...
    def Tag(self, tag_id: str) -> Tag: ...

    @property
    def branches(self) -> BranchManager: ...

    @property
    def tags(self) -> TagManager: ...


class BranchManager:
    def __init__(self, repository_id: str, client: Client = DefaultClient): ...
    def list(self, max_amount: Optional[int], after: str = '', prefix: str = '') -> Generator[Branch]: ...


class TagManager:
    def __init__(self, repository_id: str, client: Client = DefaultClient): ...
    def list(self, max_amount: Optional[int], after: str = '', prefix: str = '') -> Generator[Tag]: ...


class StoredObject:
    def __init__(self, repository_id: str, reference_id: str, path: str, client: Client = DefaultClient): ...
    def open(self, mode: Literal['r', 'rb'] = 'r', pre_signed: Optional[bool] = None) -> TextIO | BinaryIO: ...
    def stat(self) -> ObjectInfo: ...


class WritableObject(StoredObject):
    def create(self, data: bytes | str | TextIO | BinaryIO, path: str, pre_signed: Optional[bool] = None,
        content_type: Optional[str] = None, metadata: Optional[dict[str, str]] = None,
        mode: Literal['x', 'xb', 'w', 'wb'] = 'wb') -> ObjectInfo: ...
    def delete(self): ...
    def copy(self, to_reference: str, to_path: str): ...


class ObjectManager:
    def __init__(self, repository_id: str, reference_id: str, client: Client = DefaultClient): ...
    def list(self, max_amount: Optional[int], after: str = '', prefix: str = '', delimiter: str = '/') -> Generator[ObjectInfo | CommonPrefix]: ...


class WritableObjectManager(ObjectManager):
    def uncommitted(self, max_amount: Optional[int], after: str = '', prefix: str = '') -> Generator[Change]: ...
    # `import` is a reserved keyword in Python, so the import entry point needs a different name
    def import_data(self, commit_message: str) -> ImportManager: ...
    def delete(self, object_paths: str | Iterable[str]): ...
    def transact(self, commit_message: str) -> Transaction: ...
    def reset_changes(self, path: Optional[str] = None): ...


class Reference:
    def __init__(self, repository_id: str, reference_id: str, client: Client = DefaultClient): ...
    def log(self, max_amount: Optional[int]) -> Generator[Reference]: ...
    def metadata(self) -> dict[str, str]: ...
    def commit_message(self) -> str: ...
    def diff(self, other_ref: str | Reference, max_amount: Optional[int], after: str = '', prefix: str = '', delimiter: str = '/') -> Generator[Change]: ...
    def merge_into(self, destination_branch_id: str | Branch): ...
    def Object(self, path: str) -> StoredObject: ...
    @property
    def objects(self) -> ObjectManager: ...


class Branch(Reference):
    def create(self, source_reference_id: str, exist_ok: bool = False) -> Branch: ...
    def head(self) -> Reference: ...
    def commit(self, message: str, metadata: dict[str, str]) -> Reference: ...
    def delete(self): ...
    def revert(self, reference_id: str): ...
    def Object(self, path: str) -> WritableObject: ...
    @property
    def objects(self) -> WritableObjectManager: ...


class Tag(Reference):
    def create(self, source_reference_id: str, exist_ok: bool = False) -> Tag: ...
    def delete(self, exist_ok: bool = False): ...


class CommonPrefix:
    def __init__(self, repository_id: str, reference_id: str, path: str, client: Client = DefaultClient): ...
    def exists(self) -> bool: ...


class ObjectInfo:
    def __init__(self, repository_id: str, reference_id: str, path: str, client: Client = DefaultClient): ...
    def path(self) -> str: ...
    def modified_time(self) -> datetime.datetime: ...
    def size_bytes(self) -> int: ...
    def content_type(self) -> Optional[str]: ...
    def metadata(self) -> dict[str, str]: ...
    def physical_address(self) -> str: ...
    def delete(self): ...


class Change(NamedTuple):
    type: Literal['added', 'removed', 'changed', 'conflict', 'prefix_changed']
    path: str
    path_type: Literal['common_prefix', 'object']
    size_bytes: Optional[int]


class ServerConfiguration:
    def __init__(self, client: Client = DefaultClient): ...
    def version(self) -> str: ...
    def storage_config(self) -> ServerStorageConfiguration: ...


class ServerStorageConfiguration(NamedTuple):
    blockstore_type: str
    pre_sign_support: bool
    import_support: bool


class ImportManager:
    def __init__(self, repository_id: str, reference_id: str, client: Client = DefaultClient): ...
    def prefix(self, object_store_uri: str, destination: str) -> ImportManager: ...
    def object(self, object_store_uri: str, destination: str) -> ImportManager: ...
    def start(self) -> str:
        'start import, reporting back (and storing) a process id'
        ...
    def wait(self, poll_interval: timedelta = timedelta(seconds=2)) -> ImportResult:
        'poll a started import task ID, blocking until completion'
        ...
    def run(self, poll_interval: Optional[timedelta] = None) -> ImportResult:
        'same as calling start() and then wait()'
        ...


class ImportResult(NamedTuple):
    commit: Commit
    ingested_objects: int

class Transaction(Branch):
    def __init__(self, repository_id: str, branch_id: str, client: Client = DefaultClient): ...
    def begin(self) -> None:
        'Create an ephemeral branch from the source branch (e.g. <source_branch_id>-txn-<uuid>)'
        ...

    def commit(self) -> Commit:
        'commit, merge, delete ephemeral branch'
        ...

    def rollback(self, delete_temp_branch: bool = True) -> None:
        'if delete_temp_branch = True, delete the ephemeral branch created'
        ...

    def __enter__(self):
        'calls begin()'
        ...

    def __exit__(self, type, value, traceback):
        'if successful, commit(), otherwise rollback() and report a meaningful error'
        ...

```

While this list is fairly exhaustive, it might require a few additional tweaks and additions.
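
To make the intended usage concrete, a typical branch-and-merge flow against the interface above might look like this (a sketch; names and parameter values are illustrative):

```python
import lakefs

repo = lakefs.Repository('example')

# create a feature branch off main, write to it and commit
feature = repo.Branch('feature-1').create(source_reference_id='main', exist_ok=True)
feature.Object('datasets/foo/1.parquet').create(data=b'...')
feature.commit(message='add dataset', metadata={'author': 'me'})

# inspect the changes relative to main, then merge back
for change in feature.diff('main', max_amount=100):
    print(change.type, change.path)
feature.merge_into('main')
```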

Additionally, we define the following exception hierarchy:

```python

# lakefs.exceptions
class LakeFSException(Exception):
    status_code: int
    message: str

# More specific "not found"s can inherit from this:
class NotFoundException(LakeFSException): ...
class NotAuthorizedException(LakeFSException): ...
class ServerException(LakeFSException): ...
class UnsupportedOperationException(LakeFSException): ...
class ObjectNotFoundException(NotFoundException, FileNotFoundError): ...

# raised when Object('...').create(mode='x') and the object already exists
class ObjectExistsException(LakeFSException, FileExistsError): ...

# Raised by Object.open() and Object.create() for compatibility with Python
class PermissionException(NotAuthorizedException, PermissionError): ...

```

Other, more specific exceptions may subclass these, but all errors returned by the lakeFS server should subclass one of these to make error handling easier for developers (see the sketch after the hierarchy below).

Hierarchy:

```text
LakeFSException
 ├── NotFoundException
 │    └── ObjectNotFoundException
 ├── NotAuthorizedException
 │    └── PermissionException
 ├── ServerException
 ├── UnsupportedOperationException
 └── ObjectExistsException

```
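
As a sketch of what this hierarchy buys developers, standard Python error handling should work against both the lakeFS-specific and the built-in exception types (assuming the classes are exposed under `lakefs.exceptions` as above):

```python
import lakefs
from lakefs.exceptions import NotFoundException, NotAuthorizedException

obj = lakefs.Repository('example').Branch('main').Object('datasets/foo/1.parquet')

try:
    with obj.open('rb') as f:
        data = f.read()
except FileNotFoundError:        # ObjectNotFoundException also subclasses the builtin
    data = None
except NotAuthorizedException:   # covers PermissionException as well
    raise
except NotFoundException:
    # repository or branch missing, not just the object
    data = None
```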


## Higher Level Utilities

### I/O - reading/writing objects

Provide a pythonic `open()` method that returns a "file-like object" (read-only):

```python
import lakefs

repo = lakefs.Repository('example')
branch = repo.Branch('main')

# Will check the underlying client for pre-signed URL support
# if supported, will do get_physical_address -> http upload -> link address
# Otherwise, will try a direct upload.
# *In the future*, we can accept a stream/file-like object, sniff for its size/content type,
# opt for multi-part, etc.
branch.Object('foo/bar.txt').create(data=b'hello world!\n')

with branch.Object('foo/bar.txt').open() as f:
    data = f.read()

with repo.Commit('abc123').Object('foo/bar.txt').open() as f:
    f.read()  # read all
    f.read(1024)  # or a range request

```

`open()` will also accept an explicit `pre_signed: Optional[bool] = None` argument.
If set, don't try to probe the client for this capability.
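
Continuing the snippet above, a caller could also decide this explicitly, e.g. by consulting `ServerConfiguration` once up front (a sketch):

```python
# Explicitly control pre-signed behavior instead of letting open() probe for it
use_presign = lakefs.ServerConfiguration().storage_config().pre_sign_support

with branch.Object('foo/bar.txt').open('rb', pre_signed=use_presign) as f:
    data = f.read()
```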

### Import Manager

Provide a utility to run imports from an object store:

```python
import lakefs

main = lakefs.Repository('example').Branch('main')
task = main.objects.import_data(commit_message='imported stuff!') \
    .prefix('s3://bucket/path', destination='some/path/') \
    .prefix('s3://bucket2/other/path', destination='other/path/')

task.start()  # will not block; runs the import API
task.wait()  # blocks, polling in the background

# or just run(), same as start() & wait()
main.objects.import_data('sync datasets').prefix('s3://bucket/path/', destination='datasets/').run()
```

### Transaction Manager

```python
import lakefs

dev = lakefs.Repository('example').Branch('dev')

# Will create an ephemeral branch from `dev` (e.g. `tx-dev-343829f89`)
# uploads and downloads will apply to that ephemeral branch
# on success, commit with the provided message, merge and delete the ephemeral branch
# on exception or failure, leave the branch as is and report it in a wrapping exception
# for easy troubleshooting
with dev.objects.transact('do things') as tx:
    tx.Object('foo').create(data=b'hello world!\n')
    with tx.Object('foo').open() as f:
        data = f.read()

```
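
The same `Transaction` object can also be driven explicitly, without the context manager (a sketch based on the `begin()`/`commit()`/`rollback()` methods defined above):

```python
from lakefs.exceptions import LakeFSException

tx = dev.objects.transact('do things')
tx.begin()  # creates the ephemeral branch
try:
    tx.Object('foo').create(data=b'hello world!\n')
    tx.commit()  # commit, merge into `dev`, delete the ephemeral branch
except LakeFSException:
    # keep the ephemeral branch around for troubleshooting
    tx.rollback(delete_temp_branch=False)
    raise
```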

### Creating repositories

Small helper for writing succinct examples/samples:

```python
import lakefs

repo = lakefs.Repository('example').create(storage_namespace='s3://bucket/path/', exist_ok=True)

# From here, proceed as usual.
main = repo.Branch('main')
   367  ```