---
title: Concepts and Model
description: The lakeFS object model blends the object models of Git and of object stores such as S3. Read this page to learn more.
parent: Understanding lakeFS
redirect_from:
  - /reference/object-model.html
  - /understand/branching-model.html
  - /understand/object-model.html
---

# lakeFS Concepts and Model

{% include toc_2-3.html %}

lakeFS blends concepts from object stores such as S3 with concepts from Git. This reference
defines the common concepts of lakeFS.

## Objects

lakeFS is an interface to manage objects in an object store.

{: .tip }
> The actual data itself is not stored inside lakeFS directly but in an [underlying object store](#concepts-unique-to-lakefs).
> lakeFS manages pointers and additional metadata about these objects.

## Version Control

lakeFS is spearheading version control semantics for data. Most of these concepts will be familiar to Git users:

### Repository

In lakeFS, a _repository_ is a set of related objects (or collections of objects). In many cases, these represent tables of [various formats](https://lakefs.io/blog/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/){:target="_blank"} for tabular data, semi-structured data such as JSON or log files - or a set of unstructured objects such as images, videos, sensor data, etc.

lakeFS represents repositories as a logical namespace used to group together objects, branches, and commits - analogous to a repository in Git.


lakeFS repository naming requirements are as follows:

- Start with a lower case letter or number
- Contain only lower case letters, numbers and hyphens
- Be between 3 and 63 characters long

### Commits

Using commits, you can view a [repository](#repository) at a certain point in its history and you're guaranteed that the data you see is exactly as it was at the point of committing it.

These commits are immutable "checkpoints" containing all contents of a repository at a given point in the repository's history.

Each commit contains metadata - the committer, timestamp, a commit message, as well as arbitrary key/value pairs you can choose to add.

**Identifying Commits**<br/><br/>
A commit is identified by its _commit ID_, a digest of all contents of the commit. <br/>
Commit IDs are by nature long, so you may use a unique prefix to abbreviate them. A commit may also be identified by using a textual definition, called a _ref_. <br/><br/>
Examples of refs include tags, branch names, and expressions.
{: .note }

### Branches

Branches in lakeFS allow users to create their own "isolated" view of the repository.

Changes on one branch do not appear on other branches. Users can take changes from one branch and apply them to another by [merging](#merge).

#### Zero-copy branching

Under the hood, a branch is simply a pointer to a [commit](#commits) along with a set of uncommitted changes.
Creating a branch is a **zero-copy operation**; instead of duplicating data, it involves creating a pointer to the source commit for the branch.

### Tags

Tags are a way to give a meaningful name to a specific commit.
Using tags allows users to reference specific releases, experiments, or versions by using a human friendly name.

Example tags:

- `v2.3` to mark a release.
- `dev-jane-before-v2.3-merge` to mark Jane's private temporary point.
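
The pointer model described for commits, branches, and tags can be illustrated with a toy in-memory structure. This is only a sketch and not how lakeFS is implemented: the `ToyRepo` and `Commit` names, the SHA-256 digest, and the lookup order in `resolve` (branch, then tag, then unique commit ID prefix) are all assumptions made for illustration, and uncommitted changes on a branch are left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """An immutable checkpoint: object pointers plus metadata."""
    objects: tuple       # sorted (path, physical address) pairs
    message: str
    parents: tuple = ()  # parent commit IDs, first parent first

    @property
    def commit_id(self) -> str:
        # Hypothetical digest of the commit contents; lakeFS computes its own IDs.
        payload = repr((self.objects, self.message, self.parents)).encode()
        return hashlib.sha256(payload).hexdigest()


class ToyRepo:
    def __init__(self):
        self.commits = {}   # commit ID -> Commit
        self.branches = {}  # branch name -> commit ID (movable pointer)
        self.tags = {}      # tag name -> commit ID (fixed pointer)

    def commit(self, branch, objects, message):
        parent = self.branches.get(branch)
        new = Commit(tuple(sorted(objects.items())), message,
                     (parent,) if parent else ())
        self.commits[new.commit_id] = new
        self.branches[branch] = new.commit_id  # advance the branch pointer
        return new.commit_id

    def create_branch(self, name, source):
        # Zero-copy: only a pointer is written, no objects are duplicated.
        self.branches[name] = self.resolve(source)

    def create_tag(self, name, source):
        self.tags[name] = self.resolve(source)

    def resolve(self, ref):
        # Assumed lookup order for this sketch: branch, tag, unique ID prefix.
        if ref in self.branches:
            return self.branches[ref]
        if ref in self.tags:
            return self.tags[ref]
        matches = [cid for cid in self.commits if ref and cid.startswith(ref)]
        if len(matches) == 1:
            return matches[0]
        raise KeyError(f"unknown or ambiguous ref: {ref!r}")


repo = ToyRepo()
first = repo.commit("main", {"data/a.csv": "physical-1"}, "initial load")
repo.create_branch("dev", "main")   # dev and main now point at the same commit
repo.create_tag("v2.3", "main")
second = repo.commit("dev", {"data/a.csv": "physical-1",
                             "data/b.csv": "physical-2"}, "add b")
```

After the second commit, `dev` points at the new commit while `main` and the `v2.3` tag still point at the first one, which is the isolation the Branches section describes.
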

Tag names adhere to the same rules as [git ref names](https://git-scm.com/docs/git-check-ref-format).

### History

The _history_ of the branch is the list of commits from the branch tip through the first
parent of each commit. Histories go back in time.

### Merge

_Merging_ is the way to integrate changes from a branch into another branch.
The result of a merge is a new commit, with the destination as the first parent and the source as the second.

To learn more about how merging works in lakeFS, see the [merge reference]({% link understand/how/merge.md %}).
{: .note }

### Ref expressions

lakeFS also supports _expressions_ for creating a ref. These are similar to [revisions in
Git](https://git-scm.com/docs/gitrevisions#_specifying_revisions); indeed all `~` and `^`
examples at the end of that section will work unchanged in lakeFS.

- A branch name or a tag name is a ref expression.
- If `<ref>` is a ref expression, then:
  - `<ref>^` is a ref expression referring to its first parent.
  - `<ref>^N` is a ref expression referring to its N'th parent; in particular `<ref>^1` is the
    same as `<ref>^`.
  - `<ref>~` is a ref expression referring to its first parent; in particular `<ref>~` is the
    same as `<ref>^` and `<ref>~1`.
  - `<ref>~N` is a ref expression referring to its N'th ancestor, always following the first
    parent. So `<ref>~N` is the same as `<ref>^^...^` with N consecutive carets `^`.

## Concepts unique to lakeFS

The _underlying storage_ is a location in an object store where lakeFS keeps your objects and some immutable metadata.

When creating a lakeFS repository, you assign it a _storage namespace_. The repository's
storage namespace is a location in the underlying storage where data for this repository
will be stored.

We sometimes refer to underlying storage as _physical_.
The path used to store the contents of an object is then termed a _physical path_.
Once lakeFS saves an object in the underlying storage it is never modified, except to remove it
entirely during some cleanups.

A lot of what lakeFS does is to manage how lakeFS paths translate to _physical paths_ on the
object store. This mapping is generally **not** straightforward. Importantly (and contrary to
many object stores), lakeFS may map multiple paths to the same object on backing storage, and
always does this for objects that are unchanged across versions.

### `lakefs` protocol URIs

lakeFS uses a specific format for path URIs. The URI `lakefs://<REPO>/<REF>/<KEY>` is a path
to objects in the given repo and ref expression under key. This is used both for path
prefixes and for full paths. In similar fashion, `lakefs://<REPO>/<REF>` identifies the
repository at a ref expression, and `lakefs://<REPO>` identifies a repo.
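
As a rough illustration of the three URI forms, the sketch below splits a `lakefs://` URI into its repository, ref, and key parts and checks the repository name against the naming requirements listed in the Repository section. The `parse_lakefs_uri` function and `LakeFSURI` type are hypothetical names and not part of any lakeFS client; the sketch also assumes the ref itself contains no `/`.

```python
import re
from typing import NamedTuple, Optional

# Naming rules from the Repository section: starts with a lower case letter
# or number, contains only lower case letters, numbers and hyphens, and is
# between 3 and 63 characters long.
REPO_NAME = re.compile(r"[a-z0-9][a-z0-9-]{2,62}")

SCHEME = "lakefs://"


class LakeFSURI(NamedTuple):
    repo: str
    ref: Optional[str] = None  # branch, tag, commit ID, or ref expression
    key: Optional[str] = None  # object path or path prefix


def parse_lakefs_uri(uri: str) -> LakeFSURI:
    """Split lakefs://<REPO>[/<REF>[/<KEY>]] into its parts (illustrative only)."""
    if not uri.startswith(SCHEME):
        raise ValueError(f"not a lakefs URI: {uri!r}")
    repo, _, rest = uri[len(SCHEME):].partition("/")
    if not REPO_NAME.fullmatch(repo):
        raise ValueError(f"invalid repository name: {repo!r}")
    if not rest:
        return LakeFSURI(repo)
    # Assumption: the ref has no '/', so the first '/' ends it.
    ref, sep, key = rest.partition("/")
    if not ref:
        raise ValueError(f"empty ref in URI: {uri!r}")
    return LakeFSURI(repo, ref, key if sep else None)


full = parse_lakefs_uri("lakefs://example-repo/main/tables/events/part-0.parquet")
at_ref = parse_lakefs_uri("lakefs://example-repo/main~2")
repo_only = parse_lakefs_uri("lakefs://example-repo")
```

Here `full` carries all three parts, `at_ref` has the ref expression `main~2` and no key, and `repo_only` has only the repository name.
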