---
title: Glossary
description: Glossary of all terms related to lakeFS technical internals and the architecture.
parent: Understanding lakeFS
redirect_from:
    - /reference/glossary.html
    - /glossary.html
---

# Glossary
This page provides definitions and explanations of all terms related to lakeFS technical internals and the architecture.

{% include toc.html %}

## Auditing
Data auditing is the assessment of data to ensure its accuracy, security, and fitness for a specific use. It also involves assessing data quality throughout its lifecycle and understanding the impact of poor-quality data on the organization's performance and revenue. Ensuring data reproducibility, auditability, and governance is one of the key concerns of data engineers today. The lakeFS commit history helps data teams keep track of all changes to the data, supporting data auditing.

## Branch

Branches in lakeFS allow users to create their own "isolated" view of the repository. [Read more][branches].
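
For illustration, here is a minimal sketch using the high-level `lakefs` Python SDK, assuming a repository named `example-repo` and credentials configured in the environment or `~/.lakectl.yaml` (all names are placeholders):

```python
import lakefs

repo = lakefs.repository("example-repo")

# Creating a branch is a zero-copy, metadata-only operation:
# "my-experiment" starts as an isolated view of main.
experiment = repo.branch("my-experiment").create(source_reference="main")
print(experiment.id)  # "my-experiment"
```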

## Collection
A collection, roughly speaking, is a set of data. Collections may be structured or unstructured; a structured collection is often referred to as a table.

## Commit

Using commits, you can view a [repository][repository] at a certain point in its history and you're guaranteed that the data you see is exactly as it was at the point of committing it. [Read More][commit].
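
As a hedged sketch with the Python SDK (repository, branch, and path names are illustrative):

```python
import lakefs

branch = lakefs.repository("example-repo").branch("my-experiment")

# Stage an object, then commit it. The resulting commit ID is an
# immutable reference: reading through it always returns this snapshot.
branch.object("tables/users/part-0.parquet").upload(data=b"...")
ref = branch.commit(message="Add users partition", metadata={"team": "data-eng"})
print(ref.get_commit().id)
```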

## Cross-Collection Consistency
It is unfortunate that the word 'consistency' has multiple meanings, at least four of them according to Martin Kleppmann. Consistency in the context of lakeFS and data versioning is the guarantee that operations in a transaction are performed accurately, correctly, and, most importantly, atomically.

A repository (and thus a branch) in lakeFS can span multiple tables or collections. By providing branch, commit, merge, and revert operations atomically on a branch, lakeFS achieves consistency guarantees across different logical collections. That is, data versioning is consistent across multiple collections within a repository.

This is sometimes referred to as multi-table transactions: lakeFS offers transactional guarantees across multiple tables.

<!---Learn more about cross-collection consistency here (link to CCC blog) -->
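
To make this concrete, a sketch with the Python SDK (all names are placeholders): updates to two tables are published in one atomic commit, so readers see both changes or neither.

```python
import lakefs

etl = lakefs.repository("example-repo").branch("nightly-etl")

# Write to two logical collections on the same branch...
etl.object("tables/orders/2024-05-20.parquet").upload(data=b"...")
etl.object("tables/customers/2024-05-20.parquet").upload(data=b"...")

# ...and publish both together. There is no window in which a reader
# of this branch sees one table updated but not the other.
etl.commit(message="Nightly load: orders and customers")
```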

## Data Lake Governance
The goal of data lake governance is to apply policies, standards, and processes to the data. This allows creating high-quality data and ensuring that it’s used appropriately across the organization. Data lake governance improves data quality and increases data usage for business decision-making, leading to operational improvements, better-informed business strategies, and stronger financial performance. lakeFS Cloud offers advanced data lake management features such as [Role-Based Access Control]({% link reference/security/rbac.md %}), [Branch-Aware Managed Garbage Collection]({% link howto/garbage-collection/gc.md %}), and [Data Lineage and Audit Log]({% link reference/auditing.md %}).

## Data Lifecycle Management
In data-intensive applications, data should be managed through its entire lifecycle, similar to how teams manage code. By doing so, we can leverage the best practices and tools from application lifecycle management (like CI/CD operations) and apply them to data. lakeFS offers data lifecycle management via [isolated data development environments]({% link understand/use_cases/etl_testing.md %}) instead of shared buckets.

## Data Pipeline Reproducibility
Reproducibility in data pipelines is the ability to repeat a process, for example, recreating an issue that occurred in the production pipeline. Reproducibility allows an error to be recreated in a controlled way so that it can be debugged and troubleshot at a later point in time. Reproducing a data pipeline issue is a challenge that most data engineers face on a daily basis. Learn more about how lakeFS supports data pipeline [reproducibility]({% link understand/use_cases/reproducibility.md %}). Other use cases include running ad-hoc queries (useful for data science), review, and backfill.

## Data Quality Testing
This term describes ways to test data for its accuracy, completeness, consistency, timeliness, validity, and integrity. lakeFS hooks can be used to implement and run data quality tests before promoting staging data into production.

## Data Versioning
To version data means creating a unique point-in-time reference to data that can be accessed later. This reference can take the form of a query, an ID, or, commonly, a DateTime identifier. Data versioning may also mean saving an entire copy of the data under a new name or file path every time you want to create a version of it. More advanced versioning solutions like lakeFS perform versioning through zero-copy data operations. lakeFS also optimizes storage usage between versions and exposes dedicated operations to manage them.
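
For example, with the Python SDK, a commit ID serves as such a point-in-time reference (the commit ID and path below are placeholders):

```python
import lakefs

repo = lakefs.repository("example-repo")

# A full commit ID, a tag, or an expression like "main~1" all name an
# immutable version of the repository.
ref = repo.ref("c2f6a3b")

# Read the object exactly as it was at that version; no copy was ever made.
with ref.object("tables/users/part-0.parquet").reader() as f:
    data = f.read()
```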

## Git-like Operations
lakeFS allows teams to treat their data lake as a Git repository. Git is used for code versioning, whereas lakeFS is used for data versioning. lakeFS provides Git-like operations such as branch, commit, merge, and revert.
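
A compact sketch of that workflow with the Python SDK (names are illustrative):

```python
import lakefs

repo = lakefs.repository("example-repo")
main = repo.branch("main")

# branch: create an isolated view of the lake
dev = repo.branch("dev").create(source_reference="main")

# commit: record changes on the branch
dev.object("data/events.json").upload(data=b"{}")
dev.commit(message="Add events dataset")

# merge: atomically promote the changes back into main
dev.merge_into(main)
```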

## Graveler
Graveler is the core versioning engine of lakeFS. It handles versioning by translating lakeFS addresses to the actual stored objects. See the [versioning internals section]({% link understand/how/versioning-internals.md %}) to learn how lakeFS stores metadata.

## Hooks
lakeFS hooks allow you to automate a given set of checks and validations and ensure they happen before important lifecycle events. They are conceptually similar to [Git Hooks](https://git-scm.com/docs/githooks), but in contrast, they run remotely on a server. Currently, lakeFS allows executing hooks when two types of events occur: pre-commit events, which run before a commit is acknowledged, and pre-merge events, which trigger right before a merge operation.
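
Hooks are configured declaratively through action files stored under the repository's `_lakefs_actions/` prefix. A hedged sketch, uploading a pre-merge webhook definition with the Python SDK (the endpoint URL and all names are placeholders):

```python
import lakefs

# Action definition: call a webhook before any merge into main; the merge
# is rejected if the webhook reports a failure.
action = """\
name: pre-merge data quality check
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: quality_check
    type: webhook
    properties:
      url: https://hooks.example.com/quality  # placeholder endpoint
"""

main = lakefs.repository("example-repo").branch("main")
main.object("_lakefs_actions/pre_merge_quality.yaml").upload(data=action)
main.commit(message="Add pre-merge quality hook")
```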

## Isolated Data Snapshot
Creating a branch in lakeFS provides an isolated environment containing a snapshot of your repository. While you work on your branch in isolation, all other data users will be looking at the repository's main branch; they won't see your changes, and you won't see the changes applied to the main branch. All of this happens without any data duplication, through metadata management alone.

## Main Branch
Every Git repository has a main branch (unless you take explicit steps to remove it), and it plays a key role in the software development process. In most projects, it represents the source of truth: all the code that works, has been tested, and is ready to be pushed to production. Similarly, the main branch in lakeFS can be used as the single source of truth. For example, the live production data can be on the main branch.

## Metadata Management
Where there is data, there is also metadata. lakeFS uses metadata to define schema, data types, data versions, relations to other datasets, etc. This helps to improve discoverability and manageability. lakeFS performs data versioning through metadata operations.

## Merge
The lakeFS merge command, similar to Git's merge functionality, allows you to merge data branches. Once you commit data, you can review it and then merge the committed data into the target branch. A merge generates a commit on the target branch with all your changes. lakeFS guarantees atomic merges that are fast, since they don't involve copying data. [Read More][merge].
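
As a sketch with the Python SDK (branch names are illustrative), committed changes can be reviewed and then atomically promoted:

```python
import lakefs

repo = lakefs.repository("example-repo")
staging = repo.branch("staging")
main = repo.branch("main")

# Optionally review the committed changes first...
for change in main.diff(staging):
    print(change.type, change.path)

# ...then merge. This creates a single commit on main and copies no data.
staging.merge_into(main)
```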

## Repository

In lakeFS, a _repository_ is a set of related objects (or collections of objects). [Read More][repository].

## Rollback
A rollback is an atomic operation reversing the effects of a previous commit. If a developer introduces a new code version to production and discovers that it has a critical bug, they can simply roll back to the previous version. In lakeFS, a rollback is an atomic action that prevents the data consumers from receiving low-quality data until the issue is resolved. Learn more about how lakeFS supports the [rollback]({% link understand/use_cases/rollback.md %}) operation.
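
A hedged sketch with the Python SDK, assuming a bad commit with ID `c2f6a3b` was just added to `main` (the exact `revert` signature may vary between SDK versions):

```python
import lakefs

main = lakefs.repository("example-repo").branch("main")

# Atomically add a new commit to main that undoes the bad commit's
# changes; consumers never observe a partially rolled-back state.
main.revert("c2f6a3b")
```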

## Storage Namespace
The storage namespace is a location in the underlying storage dedicated to a specific repository.
lakeFS uses it to store the repository's objects and some of its metadata.
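
For example, the storage namespace is chosen once, when the repository is created (the bucket and repository names below are placeholders):

```python
import lakefs

# Everything lakeFS stores for this repository, objects and much of its
# metadata, will live under this dedicated object-store location.
repo = lakefs.Repository("example-repo").create(
    storage_namespace="s3://my-bucket/repos/example-repo"
)
```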

## Underlying Storage
The underlying storage is a location in some object store where lakeFS keeps your objects and some metadata.

## Tag

Tags are a way to give a meaningful name to a specific commit. [Read More][tags].
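
A sketch with the Python SDK (names are illustrative): a tag is an immutable, human-readable pointer to a commit:

```python
import lakefs

repo = lakefs.repository("example-repo")

# Pin the current head of main under a meaningful name...
repo.tag("v1.0-training-data").create("main")

# ...and use it later anywhere a reference is expected.
obj = repo.ref("v1.0-training-data").object("tables/users/part-0.parquet")
```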

## Fluffy

Fluffy is the lakeFS Enterprise Single Sign-On service. lakeFS delegates authentication requests to Fluffy, which replies to lakeFS with the authentication response.


[branches]: {% link understand/model.md %}#branches
[commit]: {% link understand/model.md %}#commits
[repository]: {% link understand/model.md %}#repository
[merge]: {% link understand/model.md %}#merge
[tags]: {% link understand/model.md %}#tags