# Spark: co-existing with existing underlying object store - Design proposal

This document includes a proposed solution for https://github.com/treeverse/lakeFS/issues/2625.

## Goals

### Must do

1. Provide a configurable set of overrides to translate paths at runtime from an old object store location to a new lakeFS location.
2. Support any object storage that lakeFS supports.
3. Add as little friction as possible for Spark users.

### Nice to have

Make the solution usable by non-lakeFS users by:
1. Supporting path translation from and to any type of path.
2. Making the solution a standalone component with no direct dependency on anything other than Spark itself.

## Non-goals

1. Convert users from [accessing lakeFS from S3 gateway](../docs/integrations/spark.md#access-lakefs-using-the-s3a-gateway) to [accessing lakeFS with lakeFS-specific Hadoop filesystem](../docs/integrations/spark.md#access-lakefs-using-the-lakefs-specific-hadoop-filesystem).

## Proposal: Introducing RouterFileSystem

We would like to implement a [HadoopFileSystem](https://github.com/apache/hadoop/blob/2960d83c255a00a549f8809882cd3b73a6266b6d/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java)
that translates object store URIs into lakeFS URIs according to a configurable mapping, and uses the relevant (configured) Hadoop file system to
perform any file system operation.

For simplicity, this document uses `s3a` as the only URI scheme used inside the Spark application code, but this solution holds for any other URI scheme.

### Handle any interaction with the underlying object store

To allow Spark users to integrate with lakeFS without changing their Spark code, we configure `RouterFileSystem` as the
file system for the URI scheme (or schemes) their Spark application is using. For example, for a Spark application that
reads and writes from S3 using `S3AFileSystem`, we configure `RouterFileSystem` to be the file system for URIs with `scheme=s3a`, as follows:
```properties
fs.s3a.impl=RouterFileSystem
```
This will force any file system operation performed on an object URI with `scheme=s3a` to go through `RouterFileSystem` before it
interacts with the underlying storage.
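
For illustration, here is a minimal sketch of how Hadoop resolves the file system for an `s3a` URI once this property is set. The fully-qualified class name `io.lakefs.routerfs.RouterFileSystem` is a hypothetical placeholder, and the class is assumed to be on the Spark driver/executor classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RouterFsResolutionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route every URI with scheme=s3a through RouterFileSystem
        // (hypothetical class name, assumed to be on the classpath).
        conf.set("fs.s3a.impl", "io.lakefs.routerfs.RouterFileSystem");

        // Hadoop resolves the file system by the URI scheme, so every
        // s3a:// path now resolves to RouterFileSystem first.
        FileSystem fs = new Path("s3a://bucket/prefix/foo.parquet").getFileSystem(conf);
        System.out.println(fs.getClass().getName());
    }
}
```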

### URI translation

`RouterFileSystem` has access to a configurable mapping that maps any object store URI to any type of URI. This proposal
uses S3 object store paths (with URI `scheme=s3a`) and lakeFS paths as examples.

#### Mapping configurations

The mapping configurations are Spark properties of the following form:
```properties
routerfs.mapping.${toFsScheme}.${mappingIdx}.replace='^${fromFsScheme}://bucket/prefix'
routerfs.mapping.${toFsScheme}.${mappingIdx}.with='${toFsScheme}://another-bucket/prefix'
```

Where `mappingIdx` is an unbounded running index initialized for each `toFsScheme`, and `toFsScheme` is a URI scheme whose
`fs.${toFsScheme}.impl` property points to the file system that handles the interaction with the underlying storage.
For example, the following mapping configurations
```properties
routerfs.mapping.lakefs.1.replace='^s3a://bucket/prefix'
routerfs.mapping.lakefs.1.with='lakefs://example-repo/dev/prefix'
```
together with
```properties
fs.lakefs.impl=S3AFileSystem
```
make `RouterFileSystem` translate `s3a://bucket/prefix/foo.parquet` into `lakefs://example-repo/dev/prefix/foo.parquet`,
and later use `S3AFileSystem` to interact with the underlying object storage, which is S3 in this case. This example uses
[S3 gateway](../docs/integrations/spark.md#access-lakefs-using-the-s3a-gateway) as the Spark-lakeFS integration method.
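
A minimal sketch of this translation step, assuming mappings are loaded from the `routerfs.mapping.*` properties in index order; the class and method names below are illustrative, not a committed API:

```java
import java.util.List;
import java.util.regex.Pattern;

// One parsed routerfs.mapping.* entry (illustrative).
class Mapping {
    final Pattern replace; // e.g. Pattern.compile("^s3a://bucket/prefix")
    final String with;     // e.g. "lakefs://example-repo/dev/prefix"

    Mapping(String replace, String with) {
        this.replace = Pattern.compile(replace);
        this.with = with;
    }
}

class UriTranslator {
    // The first matching mapping wins; the default mapping (see below)
    // guarantees that some mapping always matches.
    static String translate(String uri, List<Mapping> mappings) {
        for (Mapping m : mappings) {
            if (m.replace.matcher(uri).find()) {
                return m.replace.matcher(uri).replaceFirst(m.with);
            }
        }
        return uri;
    }
}
```

With the configuration above, `UriTranslator.translate("s3a://bucket/prefix/foo.parquet", mappings)` would return `lakefs://example-repo/dev/prefix/foo.parquet`.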

##### Multiple mapping configurations

Each `toFsScheme` can have any number of mapping configurations. E.g., below are two mapping configurations for `toFsScheme=lakefs`.
```properties
routerfs.mapping.lakefs.1.replace='^s3a://bucket/prefix'
routerfs.mapping.lakefs.1.with='lakefs://example-repo/dev/prefix'
routerfs.mapping.lakefs.2.replace='^s3a://another-bucket'
routerfs.mapping.lakefs.2.with='lakefs://another-repo/main'
```
Mapping configurations are applied in order; in case of a conflict between mappings, the earlier configuration
applies.
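
For example (a hypothetical configuration), when two patterns overlap, their index order decides the route:
```properties
# Both patterns match s3a://bucket/prefix/foo.parquet; mapping 1 is applied
# first, so the URI becomes lakefs://example-repo/dev/prefix/foo.parquet.
routerfs.mapping.lakefs.1.replace='^s3a://bucket/prefix'
routerfs.mapping.lakefs.1.with='lakefs://example-repo/dev/prefix'
routerfs.mapping.lakefs.2.replace='^s3a://bucket'
routerfs.mapping.lakefs.2.with='lakefs://another-repo/main'
```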

##### Multiple `toFsScheme`s (and multiple mapping configuration groups)

With `RouterFileSystem`, Spark users can define any number of `toFsScheme`s. Each forms its own mapping configuration group,
and allows applying a different set of Spark/Hadoop configurations, e.g. credentials, S3 endpoint, etc. Users would typically
define a new `toFsScheme` when migrating a collection from one storage space to another without changing their Spark code.

The example below demonstrates what routerFS mapping configurations and related Hadoop configurations look like for a
Spark application that accesses S3, MinIO, and lakeFS, but uses `s3a` as its sole URI scheme.
```properties
# Mapping configurations
routerfs.mapping.lakefs.1.replace='^s3a://bucket/prefix'
routerfs.mapping.lakefs.1.with='lakefs://example-repo/dev/prefix'
routerfs.mapping.s3a1.1.replace='^s3a://bucket'
routerfs.mapping.s3a1.1.with='s3a1://another-bucket'
routerfs.mapping.minio.1.replace='^s3a://minio-bucket'
routerfs.mapping.minio.1.with='minio://another-minio-bucket'

# File System configurations
# Implementation
fs.s3a.impl=RouterFileSystem
fs.lakefs.impl=S3AFileSystem
fs.s3a1.impl=S3AFileSystem
fs.minio.impl=S3AFileSystem

# Access keys, shown here for `toFsScheme=s3a1` but required for each configured `toFsScheme`
fs.s3a1.access.key=AKIAIOSFODNN7EXAMPLE
fs.s3a1.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

##### Default mapping configuration

`RouterFileSystem` requires a default mapping configuration for the case where none of the mappings matches a URI.
For example, the following default configuration states that if a URI with `scheme=s3a` did not match the
`^s3a://bucket/prefix` pattern, `RouterFileSystem` uses the default file system, which is configured to be `S3AFileSystem`.

```properties
routerfs.mapping.lakefs.1.replace='^s3a://bucket/prefix'
routerfs.mapping.lakefs.1.with='lakefs://example-repo/dev/prefix'

# Default mapping, should be applied last
routerfs.mapping.s3a-default.replace='^s3a://'
routerfs.mapping.s3a-default.with='s3a-default://'

# File System configurations
fs.s3a.impl=RouterFileSystem
fs.lakefs.impl=S3AFileSystem
fs.s3a-default.impl=S3AFileSystem
```

This configuration is required because otherwise `RouterFileSystem` would get stuck in an infinite loop, calling itself
as the file system for `scheme=s3a` in the example: a URI such as `s3a://other-bucket/foo` that matches no mapping would keep
its `s3a` scheme and resolve back to `RouterFileSystem`, whereas with the default mapping it is rewritten to
`s3a-default://other-bucket/foo` and handled by `S3AFileSystem`.

### Invoke file system operations

After translating URIs to their final form, `RouterFileSystem` will use the translated path and its relevant file system to
perform file system operations against the relevant object store. See the example in [Mapping configurations](#mapping-configurations).
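
As an illustration, a delegated operation might look like the sketch below, reusing the hypothetical `UriTranslator.translate()` helper from the earlier sketch; this is one possible shape, not a committed implementation:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative delegation flow; in RouterFileSystem itself, every overridden
// FileSystem method would follow this same translate-resolve-delegate pattern.
class RouterDelegation {
    static FSDataInputStream open(Path path, int bufferSize, List<Mapping> mappings,
                                  Configuration conf) throws IOException {
        // 1. Translate, e.g. s3a://bucket/prefix/foo.parquet
        //    -> lakefs://example-repo/dev/prefix/foo.parquet
        Path translated = new Path(UriTranslator.translate(path.toString(), mappings));
        // 2. Resolve the delegate file system via fs.<toFsScheme>.impl
        //    (see "Getting the relevant File System" below).
        FileSystem delegate = translated.getFileSystem(conf);
        // 3. Invoke the operation on the delegate with the translated path.
        return delegate.open(translated, bufferSize);
    }
}
```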

### Getting the relevant File System

`RouterFileSystem` uses the `fs.<toFsScheme>.impl` Hadoop configuration and the `Path` method
[path.getFileSystem()](https://github.com/apache/hadoop/blob/2960d83c255a00a549f8809882cd3b73a6266b6d/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java#L365)
to access the correct file system at runtime, according to user configurations.

### Integrating with lakeFS

`RouterFileSystem` does not change the existing [integration methods](../docs/integrations/spark.md#two-tiered-spark-support)
lakeFS and Spark have. That is, one can use either `S3AFileSystem` or `LakeFSFileSystem` to read and write objects from lakeFS
(see the configuration references below).

#### Access lakeFS using S3 gateway
```properties
routerfs.mapping.lakefs.1.replace='^s3a://bucket/prefix'
routerfs.mapping.lakefs.1.with='lakefs://example-repo/dev/prefix'
```
together with
```properties
fs.s3a.impl=RouterFileSystem
fs.lakefs.impl=S3AFileSystem
```
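
Presumably, the lakeFS S3-gateway endpoint and credentials would be set per scheme, following the same pattern as the `fs.s3a1.*` keys in the earlier example; the exact keys below are an assumption, and the values are placeholders:
```properties
fs.lakefs.endpoint=https://lakefs.example.com
fs.lakefs.access.key=AKIAIOSFODNN7EXAMPLE
fs.lakefs.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```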

#### Access lakeFS using lakeFS-specific Hadoop FileSystem

```properties
routerfs.mapping.lakefs.1.replace='^s3a://bucket/prefix'
routerfs.mapping.lakefs.1.with='lakefs://example-repo/dev/prefix'
```
together with
```properties
fs.s3a.impl=RouterFileSystem
fs.lakefs.impl=LakeFSFileSystem
```

**Note** There is a (solvable) open item here: during its initialization, LakeFSFileSystem [dynamically fetches](https://github.com/treeverse/lakeFS/blob/276ee87fe41841589d631aaeec1c4859308001c1/clients/hadoopfs/src/main/java/io/lakefs/LakeFSFileSystem.java#L93)
the underlying file system from the repository storage namespace. The file system configurations above will make `LakeFSFileSystem`
fetch `RouterFileSystem` for storage namespaces on S3, preventing `LakeFSFileSystem` from delegating file system operations to
the right underlying file system (`S3AFileSystem`).
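
Regardless of the integration method, the Spark application code itself stays unchanged; a minimal sketch (the app name and path are placeholders, and the routerfs/fs properties above are assumed to be supplied via `spark-submit --conf spark.hadoop.*` or `core-site.xml`):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnchangedSparkApp {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("routerfs-demo")
                .getOrCreate();

        // The application keeps its original s3a:// path; RouterFileSystem
        // transparently rewrites it to lakefs://example-repo/dev/prefix/foo.parquet.
        Dataset<Row> df = spark.read().parquet("s3a://bucket/prefix/foo.parquet");
        df.show();
    }
}
```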

### Examples

#### Read an object managed by lakeFS

![RouterFS with lakeFS URI](diagrams/routerFS-by_lakefs.png)

#### Read an object directly from the object store

![RouterFS with S3 URI](diagrams/routerFS_by_s3.png)

### Pros & Cons

#### Pros

1. We already have experience developing Hadoop file systems; therefore, the ramp-up should not be significant.
2. `RouterFileSystem` offers much simpler functionality than lakeFSFS (it only needs to receive calls, translate paths, and route to the right file system), which reduces the estimated number of unknown unknowns.
3. It does not change anything related to the existing Spark<>lakeFS integrations.
4. `RouterFileSystem` can support non-lakeFS use-cases because it does not rely on any lakeFS client.
5. `RouterFileSystem` can be developed in a separate repo and delivered as a standalone OSS product; we may be able to contribute it to Spark.
6. `RouterFileSystem` does not rely on per-bucket configurations, which are only supported for hadoop-aws (which includes the S3AFileSystem implementation) versions >= [2.8.0](https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets).

#### Cons

1. Based on our experience with lakeFSFS, we already know that supporting a Hadoop file system is difficult. There are many things that can go wrong in terms of dependency conflicts and unexpected behaviours when working with managed frameworks (e.g. Databricks, EMR).
2. It's complex.
3. `RouterFileSystem` is unaware of the number of mapping configurations each `toFsScheme` has, and needs to figure this out at runtime.
4. `toFsScheme` may be a confusing concept.
5. Requires adjustments to [LakeFSFileSystem](../clients/hadoopfs/src/main/java/io/lakefs/LakeFSFileSystem.java); see [this](#access-lakefs-using-lakefs-specific-hadoop-filesystem) for a reference.
6. Requires discovery and documentation of the limitations some file system operations have in case of overlapping configurations. For example, a recursive delete operation can map paths to different file systems: recursively deleting `/data` when `/data` as a whole belongs to one file system while `/data/lakefs` is mapped to lakeFS.