---
title: Glossary
description: Glossary of all terms related to lakeFS technical internals and the architecture.
parent: Understanding lakeFS
redirect_from:
    - /reference/glossary.html
    - /glossary.html
---

# Glossary
This page provides definitions and explanations of all terms related to lakeFS technical internals and the architecture.

{% include toc.html %}

## Auditing
Data auditing is the assessment of data to ensure its accuracy, security, and fitness for a specific use. It also involves assessing data quality throughout its lifecycle and understanding the impact of poor-quality data on the organization's performance and revenue. Ensuring data reproducibility, auditability, and governance is one of the key concerns of data engineers today. The lakeFS commit history helps data teams keep track of all changes to the data, supporting data auditing.

## Branch

Branches in lakeFS allow users to create their own "isolated" view of the repository. [Read more][branches].
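
For illustration, here is a minimal sketch using the high-level `lakefs` Python SDK, assuming a repository named `example-repo` and credentials configured in the environment or `~/.lakectl.yaml` (all names are placeholders):

```python
import lakefs

repo = lakefs.repository("example-repo")

# Creating a branch is a zero-copy, metadata-only operation:
# "my-experiment" starts as an isolated view of main.
experiment = repo.branch("my-experiment").create(source_reference="main")
print(experiment.id)  # "my-experiment"
```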

## Collection
A collection, roughly speaking, is a set of data. Collections may be structured or unstructured; a structured collection is often referred to as a table.

## Commit

Using commits, you can view a [repository][repository] at a certain point in its history and you're guaranteed that the data you see is exactly as it was at the point of committing it. [Read More][commit].
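
As a hedged sketch with the Python SDK (repository, branch, and path names are illustrative):

```python
import lakefs

branch = lakefs.repository("example-repo").branch("my-experiment")

# Stage an object, then commit it. The resulting commit ID is an
# immutable reference: reading through it always returns this snapshot.
branch.object("tables/users/part-0.parquet").upload(data=b"...")
ref = branch.commit(message="Add users partition", metadata={"team": "data-eng"})
print(ref.get_commit().id)
```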

## Cross-Collection Consistency
It is unfortunate that the word 'consistency' has multiple meanings, at least four of them according to Martin Kleppmann. Consistency in the context of lakeFS and data versioning is the guarantee that operations in a transaction are performed accurately, correctly, and, most importantly, atomically.

A repository (and thus a branch) in lakeFS can span multiple tables or collections. By providing branch, commit, merge, and revert operations atomically on a branch, lakeFS achieves consistency guarantees across different logical collections. That is, data versioning is consistent across multiple collections within a repository.

This is sometimes referred to as multi-table transactions: lakeFS offers transactional guarantees across multiple tables.

<!---Learn more about cross-collection consistency here (link to CCC blog) -->
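
To make this concrete, a sketch with the Python SDK (all names are placeholders): updates to two tables are published in one atomic commit, so readers see both changes or neither.

```python
import lakefs

etl = lakefs.repository("example-repo").branch("nightly-etl")

# Write to two logical collections on the same branch...
etl.object("tables/orders/2024-05-20.parquet").upload(data=b"...")
etl.object("tables/customers/2024-05-20.parquet").upload(data=b"...")

# ...and publish both together. There is no window in which a reader
# of this branch sees one table updated but not the other.
etl.commit(message="Nightly load: orders and customers")
```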

## Data Lake Governance
The goal of data lake governance is to apply policies, standards, and processes to the data. This allows creating high-quality data and ensuring that it’s used appropriately across the organization. Data lake governance improves data quality and increases data usage for business decision-making, leading to operational improvements, better-informed business strategies, and stronger financial performance. lakeFS Cloud offers advanced data lake management features such as [Role-Based Access Control]({% link reference/security/rbac.md %}), [Branch-Aware Managed Garbage Collection]({% link howto/garbage-collection/gc.md %}), and [Data Lineage and Audit Log]({% link reference/auditing.md %}).

## Data Lifecycle Management
In data-intensive applications, data should be managed through its entire lifecycle, similar to how teams manage code. By doing so, we can leverage the best practices and tools from application lifecycle management (like CI/CD operations) and apply them to data. lakeFS offers data lifecycle management via [isolated data development environments]({% link understand/use_cases/etl_testing.md %}) instead of shared buckets.

## Data Pipeline Reproducibility
Reproducibility in data pipelines is the ability to repeat a process, for example, recreating an issue that occurred in the production pipeline. Reproducibility allows an error to be recreated in a controlled way so that it can be debugged and troubleshot at a later point in time. Reproducing a data pipeline issue is a challenge that most data engineers face on a daily basis. Learn more about how lakeFS supports data pipeline [reproducibility]({% link understand/use_cases/reproducibility.md %}). Other use cases include running ad-hoc queries (useful for data science), review, and backfill.

## Data Quality Testing
This term describes ways to test data for its accuracy, completeness, consistency, timeliness, validity, and integrity. lakeFS hooks can be used to implement and run data quality tests before promoting staging data into production.

## Data Versioning
To version data means creating a unique point-in-time reference to data that can be accessed later. This reference can take the form of a query, an ID, or, commonly, a DateTime identifier. Data versioning may also mean saving an entire copy of the data under a new name or file path every time you want to create a version of it. More advanced versioning solutions like lakeFS perform versioning through zero-copy data operations. lakeFS also optimizes storage usage between versions and exposes dedicated operations to manage them.
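
For example, with the Python SDK, a commit ID serves as such a point-in-time reference (the commit ID and path below are placeholders):

```python
import lakefs

repo = lakefs.repository("example-repo")

# A full commit ID, a tag, or an expression like "main~1" all name an
# immutable version of the repository.
ref = repo.ref("c2f6a3b")

# Read the object exactly as it was at that version; no copy was ever made.
with ref.object("tables/users/part-0.parquet").reader() as f:
    data = f.read()
```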

## Git-like Operations
lakeFS allows teams to treat their data lake as a Git repository. Git is used for code versioning, whereas lakeFS is used for data versioning. lakeFS provides Git-like operations such as branch, commit, merge, and revert.
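
A compact sketch of that workflow with the Python SDK (names are illustrative):

```python
import lakefs

repo = lakefs.repository("example-repo")
main = repo.branch("main")

# branch: create an isolated view of the lake
dev = repo.branch("dev").create(source_reference="main")

# commit: record changes on the branch
dev.object("data/events.json").upload(data=b"{}")
dev.commit(message="Add events dataset")

# merge: atomically promote the changes back into main
dev.merge_into(main)
```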

## Graveler
Graveler is the core versioning engine of lakeFS. It handles versioning by translating lakeFS addresses to the actual stored objects. See the [versioning internals section]({% link understand/how/versioning-internals.md %}) to learn how lakeFS stores metadata.

## Hooks
lakeFS hooks allow you to automate a given set of checks and validations and ensure they happen before important lifecycle events. They are conceptually similar to [Git Hooks](https://git-scm.com/docs/githooks), but in contrast, they run remotely on a server. Currently, lakeFS allows executing hooks when two types of events occur: pre-commit events, which run before a commit is acknowledged, and pre-merge events, which trigger right before a merge operation.
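
Hooks are configured declaratively through action files stored under the repository's `_lakefs_actions/` prefix. A hedged sketch, uploading a pre-merge webhook definition with the Python SDK (the endpoint URL and all names are placeholders):

```python
import lakefs

# Action definition: call a webhook before any merge into main; the merge
# is rejected if the webhook reports a failure.
action = """\
name: pre-merge data quality check
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: quality_check
    type: webhook
    properties:
      url: https://hooks.example.com/quality  # placeholder endpoint
"""

main = lakefs.repository("example-repo").branch("main")
main.object("_lakefs_actions/pre_merge_quality.yaml").upload(data=action)
main.commit(message="Add pre-merge quality hook")
```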

## Isolated Data Snapshot
Creating a branch in lakeFS provides an isolated environment containing a snapshot of your repository. While you work on your branch in isolation, all other data users will be looking at the repository's main branch; they won't see your changes, and you won't see the changes applied to the main branch. All of this happens without any data duplication, through metadata management alone.

## Main Branch
Every Git repository has a main branch (unless you take explicit steps to remove it), and it plays a key role in the software development process. In most projects, it represents the source of truth: all the code that works, has been tested, and is ready to be pushed to production. Similarly, the main branch in lakeFS can be used as the single source of truth. For example, the live production data can be on the main branch.

## Metadata Management
Where there is data, there is also metadata. lakeFS uses metadata to define schema, data types, data versions, relations to other datasets, etc. This helps to improve discoverability and manageability. lakeFS performs data versioning through metadata operations.

## Merge
The lakeFS merge command, similar to Git's merge functionality, allows you to merge data branches. Once you commit data, you can review it and then merge the committed data into the target branch. A merge generates a commit on the target branch with all your changes. lakeFS guarantees atomic merges that are fast, since they don't involve copying data. [Read More][merge].
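
As a sketch with the Python SDK (branch names are illustrative), committed changes can be reviewed and then atomically promoted:

```python
import lakefs

repo = lakefs.repository("example-repo")
staging = repo.branch("staging")
main = repo.branch("main")

# Optionally review the committed changes first...
for change in main.diff(staging):
    print(change.type, change.path)

# ...then merge. This creates a single commit on main and copies no data.
staging.merge_into(main)
```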

## Repository

In lakeFS, a _repository_ is a set of related objects (or collections of objects). [Read More][repository].

## Rollback
A rollback is an atomic operation reversing the effects of a previous commit. If a developer introduces a new code version to production and discovers that it has a critical bug, they can simply roll back to the previous version. In lakeFS, a rollback is an atomic action that prevents the data consumers from receiving low-quality data until the issue is resolved. Learn more about how lakeFS supports the [rollback]({% link understand/use_cases/rollback.md %}) operation.
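
A hedged sketch with the Python SDK, assuming a bad commit with ID `c2f6a3b` was just added to `main` (the exact `revert` signature may vary between SDK versions):

```python
import lakefs

main = lakefs.repository("example-repo").branch("main")

# Atomically add a new commit to main that undoes the bad commit's
# changes; consumers never observe a partially rolled-back state.
main.revert("c2f6a3b")
```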

## Storage Namespace
The storage namespace is a location in the underlying storage dedicated to a specific repository.
lakeFS uses it to store the repository's objects and some of its metadata.
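
For example, the storage namespace is chosen once, when the repository is created (the bucket and repository names below are placeholders):

```python
import lakefs

# Everything lakeFS stores for this repository, objects and much of its
# metadata, will live under this dedicated object-store location.
repo = lakefs.Repository("example-repo").create(
    storage_namespace="s3://my-bucket/repos/example-repo"
)
```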

## Underlying Storage
The underlying storage is a location in some object store where lakeFS keeps your objects and some metadata.

## Tag

Tags are a way to give a meaningful name to a specific commit. [Read More][tags].
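
A sketch with the Python SDK (names are illustrative): a tag is an immutable, human-readable pointer to a commit:

```python
import lakefs

repo = lakefs.repository("example-repo")

# Pin the current head of main under a meaningful name...
repo.tag("v1.0-training-data").create("main")

# ...and use it later anywhere a reference is expected.
obj = repo.ref("v1.0-training-data").object("tables/users/part-0.parquet")
```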

## Fluffy

Fluffy is the lakeFS Enterprise Single Sign-On service. lakeFS delegates authentication requests to Fluffy, which replies to lakeFS with the authentication response.


[branches]: {% link understand/model.md %}#branches
[commit]: {% link understand/model.md %}#commits
[repository]: {% link understand/model.md %}#repository
[merge]: {% link understand/model.md %}#merge
[tags]: {% link understand/model.md %}#tags