github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/understand/model.md

     1  ---
     2  title: Concepts and Model
     3  description: The lakeFS object model blends the object models of Git and of object stores such as S3. Read this page to learn more.
     4  parent: Understanding lakeFS
     5  redirect_from: 
     6    - /reference/object-model.html
     7    - /understand/branching-model.html
     8    - /understand/object-model.html
     9  ---
    10  
    11  # lakeFS Concepts and Model
    12  
    13  {% include toc_2-3.html %}
    14  
    15  lakeFS blends concepts from object stores such as S3 with concepts from Git. This reference
    16  defines the common concepts of lakeFS.
    17  
    18  ## Objects
    19  
    20  lakeFS is an interface to manage objects in an object store.
    21  
    22  {: .tip }
    23  > The actual data itself is not stored inside lakeFS directly but in an [underlying object store](#concepts-unique-to-lakefs).
    24  > lakeFS manages pointers and additional metadata about these objects.
    25  
    26  ## Version Control
    27  
    28  lakeFS brings version control semantics to data. Most of these concepts will be familiar to Git users:
    29  
    30  ### Repository
    31  
    32  In lakeFS, a _repository_ is a set of related objects (or collections of objects). In many cases, these represent tables of [various formats](https://lakefs.io/blog/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/){:target="_blank"} for tabular data, semi-structured data such as JSON or log files, or a set of unstructured objects such as images, videos, or sensor data.
    33  
    34  lakeFS represents repositories as a logical namespace used to group together objects, branches, and commits - analogous to a repository in Git.
    35  
    36  A lakeFS repository name must:
    37  
    38  - Start with a lower case letter or number
    39  - Contain only lower case letters, numbers and hyphens
    40  - Be between 3 and 63 characters long
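
The three naming rules above can be checked mechanically. The following is a minimal sketch in Python; the regular expression and the function name `is_valid_repo_name` are derived from the rules as listed here and are not taken from the lakeFS source code.

```python
import re

# Assumption: this pattern encodes only the three rules listed above.
# First character: lower case letter or digit.
# Remaining 2 to 62 characters: lower case letters, digits, or hyphens.
_REPO_NAME = re.compile(r"[a-z0-9][a-z0-9-]{2,62}")


def is_valid_repo_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules listed above."""
    return _REPO_NAME.fullmatch(name) is not None
```

For example, `is_valid_repo_name("my-repo-1")` is true, while `"My-Repo"` (upper case), `"-repo"` (leading hyphen), and `"ab"` (too short) are all rejected.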
    41  
    42  ### Commits
    43  
    44  Using commits, you can view a [repository](#repository) at a certain point in its history and you're guaranteed that the data you see is exactly as it was at the point of committing it.
    45  
    46  These commits are immutable "checkpoints" containing all contents of a repository at a given point in the repository's history.
    47  
    48  Each commit contains metadata: the committer, a timestamp, a commit message, and any arbitrary key/value pairs you choose to add.
    49  
    50    **Identifying Commits**<br/><br/>
    51    A commit is identified by its _commit ID_, a digest of all contents of the commit. <br/>
    52    Commit IDs are by nature long, so you may use a unique prefix to abbreviate them. A commit may also be identified by using a textual definition, called a _ref_. <br/><br/>
    53    Examples of refs include tags, branch names, and expressions.
    54  {: .note }
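
To illustrate abbreviating a commit ID by a unique prefix, here is a small Python sketch. The helper `resolve_prefix` and the sample IDs are hypothetical; how lakeFS itself resolves prefixes (for instance, any minimum prefix length) is not described on this page.

```python
def resolve_prefix(commit_ids, prefix):
    """Return the single commit ID that starts with `prefix`.

    Raises ValueError if no ID matches or if the prefix is ambiguous.
    """
    matches = [cid for cid in commit_ids if cid.startswith(prefix)]
    if not matches:
        raise ValueError(f"no commit matches prefix {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"prefix {prefix!r} is ambiguous")
    return matches[0]


# Made-up commit IDs, shortened for readability.
ids = ["a1b2c3d4", "a1ffee00", "9c0de123"]
print(resolve_prefix(ids, "9c"))   # 9c0de123
print(resolve_prefix(ids, "a1b"))  # a1b2c3d4
```

The prefix `a1` would be rejected in this sketch because it matches two commits.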
    55  
    56  ### Branches
    57  
    58  Branches in lakeFS allow users to create their own "isolated" view of the repository.
    59  
    60  Changes on one branch do not appear on other branches. Users can take changes from one branch and apply them to another by [merging](#merge) them.
    61  
    62  #### Zero-copy branching
    63  
    64  Under the hood, a branch is simply a pointer to a [commit](#commits) along with a set of uncommitted changes.
    65  Creating a branch is a **zero-copy operation**; instead of duplicating data, it involves creating a pointer to the source commit for the branch.
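
The idea can be sketched with a toy in-memory model in Python. The `Branch` class and `create_branch` function below are illustrative names only and do not mirror lakeFS internals; the point is that creating a branch records a commit pointer and copies no object data.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    commit_id: str  # pointer to the commit the branch currently refers to
    uncommitted: dict = field(default_factory=dict)  # staged changes, keyed by path


def create_branch(branches: dict, name: str, source: str) -> Branch:
    """Create branch `name` pointing at the commit that `source` points to.

    Only the commit ID is copied; the new branch starts with no
    uncommitted changes of its own.
    """
    new_branch = Branch(commit_id=branches[source].commit_id)
    branches[name] = new_branch
    return new_branch


branches = {"main": Branch(commit_id="c1")}
dev = create_branch(branches, "dev", "main")
print(dev.commit_id)  # c1
```

In this model the cost of creating a branch does not depend on how much data the repository holds.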
    66  
    67  ### Tags
    68  
    69  Tags are a way to give a meaningful name to a specific commit.
    70  Tags allow users to reference specific releases, experiments, or versions by a human-friendly name.
    71  
    72  Example tags:
    73  
    74  - `v2.3` to mark a release.
    75  - `dev-jane-before-v2.3-merge` to mark Jane's private temporary point.
    76  
    77  Tag names adhere to the same rules as [git ref names](https://git-scm.com/docs/git-check-ref-format).
    78  
    79  ### History
    80  
    81  The _history_ of a branch is the list of commits reached by starting at the branch tip and
    82  repeatedly following the first parent of each commit. Histories go back in time.
    83  
    84  ### Merge
    85  
    86  _Merging_ is the way to integrate changes from a branch into another branch.
    87  The result of a merge is a new commit, with the destination as the first parent and the source as the second.
    88  
    89  To learn more about how merging works in lakeFS, see the [merge reference]({% link understand/how/merge.md %}).
    90  {: .note }
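
The definitions of history and merge above fit together as follows. This Python sketch uses a made-up commit graph (`PARENTS`) and a hypothetical `history` helper: a merge commit lists the destination first and the source second, and a history follows only first parents.

```python
# Toy commit graph: commit ID -> list of parent IDs, first parent first.
PARENTS = {
    "c0": [],
    "c1": ["c0"],        # commit made on the destination branch
    "f1": ["c0"],        # commit made on the source branch
    "m1": ["c1", "f1"],  # merge commit: destination first, source second
}


def history(tip, parents):
    """List commits from `tip` backwards, following only first parents."""
    commits = []
    current = tip
    while current is not None:
        commits.append(current)
        ps = parents[current]
        current = ps[0] if ps else None
    return commits


print(history("m1", PARENTS))  # ['m1', 'c1', 'c0']
```

The source commit `f1` is a parent of the merge commit but does not appear in the destination's history, because the traversal never follows a second parent.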
    91  
    92  ### Ref expressions
    93  
    94  lakeFS also supports _expressions_ for creating a ref. These are similar to [revisions in
    95  Git](https://git-scm.com/docs/gitrevisions#_specifying_revisions); indeed all `~` and `^`
    96  examples at the end of that section will work unchanged in lakeFS.
    97  
    98  - A branch or a tag is a ref expression.
    99  - If `<ref>` is a ref expression, then:
   100    - `<ref>^` is a ref expression referring to its first parent.
   101    - `<ref>^N` is a ref expression referring to its Nth parent; in particular `<ref>^1` is the
   102      same as `<ref>^`.
   103    - `<ref>~` is a ref expression referring to its first parent; in particular `<ref>~` is the
   104      same as `<ref>^` and `<ref>~1`.
   105    - `<ref>~N` is a ref expression referring to its Nth ancestor, always traversing to the first
   106      parent. So `<ref>~N` is the same as `<ref>^^...^` with N consecutive carets `^`.
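
The rules above can be exercised on a toy commit graph. The following Python sketch is an illustration only: `PARENTS`, `REFS`, and `resolve` are hypothetical names, the parsing is simplified (base names may not contain `~` or `^`, and `^0` is rejected), and it is not the lakeFS resolver.

```python
import re

# Toy commit graph: commit ID -> list of parent IDs, first parent first.
PARENTS = {
    "c0": [],
    "c1": ["c0"],
    "f1": ["c0"],
    "m1": ["c1", "f1"],  # merge commit with two parents
}
# Branch and tag names mapped to the commits they point at.
REFS = {"main": "m1", "feature": "f1", "v1.0": "c1"}

_EXPR = re.compile(r"([^~^]+)((?:[~^]\d*)*)")  # base name, then modifiers
_STEP = re.compile(r"([~^])(\d*)")             # one modifier, optional count


def resolve(expr):
    """Resolve a ref expression such as 'main~2' or 'main^2' to a commit ID."""
    match = _EXPR.fullmatch(expr)
    if match is None:
        raise ValueError(f"malformed ref expression: {expr!r}")
    base, steps = match.groups()
    commit = REFS[base]
    for op, digits in _STEP.findall(steps):
        n = int(digits) if digits else 1
        if op == "^":
            # <ref>^N selects the Nth parent (1-based) of the current commit.
            ps = PARENTS[commit]
            if not 1 <= n <= len(ps):
                raise ValueError(f"{commit} has no parent number {n}")
            commit = ps[n - 1]
        else:
            # <ref>~N follows the first parent N times.
            for _ in range(n):
                ps = PARENTS[commit]
                if not ps:
                    raise ValueError(f"{commit} has no parent")
                commit = ps[0]
    return commit


print(resolve("main^2"))  # f1
print(resolve("main~2"))  # c0
```

Here `main^`, `main^1`, `main~`, and `main~1` all resolve to `c1`, and `main~2` resolves to the same commit as `main^^`.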
   107  
   108  ## Concepts unique to lakeFS
   109  
   110  The _underlying storage_ is a location in an object store where lakeFS keeps your objects and some immutable metadata.
   111  
   112  When creating a lakeFS repository, you assign it a _storage namespace_. The repository's
   113  storage namespace is a location in the underlying storage where data for this repository
   114  will be stored.
   115  
   116  We sometimes refer to underlying storage as _physical_. The path used to store the contents of an object is then termed a _physical path_.
   117  Once lakeFS saves an object in the underlying storage it is never modified, except to remove it
   118  entirely during some cleanups.
   119  
   120  Much of what lakeFS does is manage how lakeFS paths translate to _physical paths_ on the
   121  object store. This mapping is generally **not** straightforward. Importantly (and contrary to
   122  many object stores), lakeFS may map multiple paths to the same object on backing storage, and
   123  always does this for objects that are unchanged across versions.
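
A toy model makes the many-to-one mapping concrete. In this Python sketch each version is a dictionary from lakeFS path to physical path; the physical addresses and the `commit_with_changes` helper are invented for illustration and say nothing about how lakeFS actually lays out or names objects.

```python
def commit_with_changes(previous, changes):
    """Build the path mapping of a new version from the previous one.

    Unchanged paths keep pointing at the same physical object; only
    the paths listed in `changes` point at newly written objects.
    """
    new_version = dict(previous)
    new_version.update(changes)
    return new_version


# Hypothetical physical addresses inside a storage namespace.
v1 = {
    "tables/users.parquet": "s3://example-bucket/ns/obj-001",
    "tables/orders.parquet": "s3://example-bucket/ns/obj-002",
}
v2 = commit_with_changes(v1, {"tables/orders.parquet": "s3://example-bucket/ns/obj-003"})

# The unchanged object is shared by both versions.
print(v1["tables/users.parquet"] == v2["tables/users.parquet"])  # True
```

Because stored objects are never modified in place, sharing a physical object between versions is safe in this model.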
   124  
   125  ### `lakefs` protocol URIs
   126  
   127  lakeFS uses a specific format for path URIs. The URI `lakefs://<REPO>/<REF>/<KEY>` is a path
   128  to objects in the given repo and ref expression under the given key. This is used both for path
   129  prefixes and for full paths. In similar fashion, `lakefs://<REPO>/<REF>` identifies the
   130  repository at a ref expression, and `lakefs://<REPO>` identifies a repo.
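
The three URI forms can be split into their parts with a few lines of Python. This is a sketch based only on the format described above; `LakeFSURI` and `parse_lakefs_uri` are hypothetical names, and the code validates nothing beyond the scheme and a non-empty repository.

```python
from typing import NamedTuple, Optional


class LakeFSURI(NamedTuple):
    repo: str
    ref: Optional[str]  # None when the URI names only a repository
    key: Optional[str]  # None when the URI stops at the ref


def parse_lakefs_uri(uri: str) -> LakeFSURI:
    """Split lakefs://<REPO>[/<REF>[/<KEY>]] into its components."""
    scheme = "lakefs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a lakefs URI: {uri!r}")
    repo, sep, rest = uri[len(scheme):].partition("/")
    if not repo:
        raise ValueError(f"missing repository in {uri!r}")
    if not sep or not rest:
        return LakeFSURI(repo, None, None)
    # Everything after the second slash is the key, which may itself contain slashes.
    ref, sep, key = rest.partition("/")
    return LakeFSURI(repo, ref, key if sep else None)


print(parse_lakefs_uri("lakefs://example-repo/main/path/to/file.csv"))
```

With this sketch, `lakefs://example-repo/main` yields a ref of `main` and no key, and `lakefs://example-repo` yields only the repository.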