# Changelog

## v0.13.0 - 2024-01-21
* Shade scalapb for use in current Databricks LTS releases (#7289). :bricks:

This is a minor release: while it adds no new functionality, it changes how
users can use the Spark metadata client with Databricks (and possibly other
managed Spark providers).

## v0.12.0 - 2023-12-21
* Metaclient: Skip dummy file when scanning metadata prefix (#7192)

## v0.11.0 - 2023-10-23
* Bump lakeFS SDK to v1.0
* **Breaking:** drop support for lakeFS server < v0.108.0

## v0.10.0 - 2023-08-15
* Bug fixes and improvements.
* Separate flows for committed and uncommitted GC are no longer supported. Use the [new GC job](https://docs.lakefs.io/howto/garbage-collection/).

## v0.9.1 - 2023-07-31
* Fix bug when GC runs into the initial commit repository.
* Fix bug when GC runs on a bucket outside the provided region. (#6283)

## v0.9.0 - 2023-07-23
* A new and improved Garbage Collection job. See the [docs](https://docs.lakefs.io/howto/garbage-collection.html) for more information.

## v0.8.1 - 2023-05-22

Bug fixes:
* Read previous run ID in case of a sweep-only run (#5963)
* Perform run ID logging only after sweep completed successfully (#5963)

## v0.8.0 - 2023-05-18

What's new:
* Incremental committed GC implementation - supported starting from lakeFS 0.100.0

## v0.7.3 - 2023-05-07

Bug fixes:
* Fix uncommitted gc to handle no uncommitted location (#5817)

## v0.7.2 - 2023-05-02

Bug fixes:
* Fix uncommitted garbage collection bulk remove out of disk space (#5776)

## v0.7.1 - 2023-04-24

Performance improvements:

No user-visible changes, but some new parameters.

Deletion now retries S3 `deleteObjects` a **lot**.  The parameters
`lakefs.gc.s3.min_backoff_secs` and `lakefs.gc.s3.max_backoff_secs` control how
long it keeps trying.
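
As a rough illustration of how a minimum and a maximum backoff typically bound a retry schedule, the sketch below doubles the delay after each failed attempt and caps it at the maximum. The function name, the doubling factor, and the attempt count are assumptions made for illustration only; this changelog names the two parameters but does not describe the client's actual retry policy or defaults.

```python
def backoff_delays(min_backoff_secs, max_backoff_secs, attempts):
    """Return the delay (in seconds) to wait before each retry.

    Assumption: delays start at min_backoff_secs and double each time,
    never exceeding max_backoff_secs. The real client may behave differently.
    """
    if min_backoff_secs <= 0 or max_backoff_secs < min_backoff_secs:
        raise ValueError("need 0 < min_backoff_secs <= max_backoff_secs")
    delays = []
    delay = min_backoff_secs
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, max_backoff_secs)
    return delays


# Hypothetical values, not the client's defaults.
schedule = backoff_delays(min_backoff_secs=1, max_backoff_secs=8, attempts=5)
print(schedule)       # [1, 2, 4, 8, 8]
print(sum(schedule))  # 23 seconds of waiting in total
```

Under this model, raising the maximum lengthens the total time spent retrying, while raising the minimum makes the early retries less aggressive.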

## v0.7.0 - 2023-03-13

Performance improvements:

No user-visible changes, but some new parameters:

* Write expired addresses to fewer locations
  * Only write text-format expired addresses if the new Hadoop config option
    `lakefs.gc.address.write_as_text` is set.
  * Remove one (unused) expired address location.
* Handle huge metaranges: the new Hadoop config option
  `lakefs.gc.address.approx_num_ranges_to_spread_per_partition` can be set
  when there are very many ranges.  Values 20..100 are probably best.
* For debugging performance

  Guaranteed to produce incorrect results!  You probably **never want to set
  this** in production or on any repository you care about.
  * New Hadoop config option `lakefs.debug.gc.addresses_sample_fraction` can
    be set below 1.0 _to debug performance **only**_.
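
Hadoop config options like the ones above are commonly passed to a Spark job by prefixing them with `spark.hadoop.` in a `--conf` flag. The sketch below builds those flags. The helper name and the example values are assumptions for illustration, not recommended settings, and the debugging-only option is deliberately left out.

```python
def hadoop_conf_flags(options):
    """Turn Hadoop config options into spark-submit arguments.

    Spark copies any property named spark.hadoop.<key> into the job's
    Hadoop configuration as <key>.
    """
    flags = []
    for key, value in sorted(options.items()):
        # Render booleans as lowercase "true"/"false".
        if isinstance(value, bool):
            value = str(value).lower()
        flags += ["--conf", f"spark.hadoop.{key}={value}"]
    return flags


# Example values only; choose values that fit your repository.
flags = hadoop_conf_flags({
    "lakefs.gc.address.write_as_text": True,
    "lakefs.gc.address.approx_num_ranges_to_spread_per_partition": 50,
})
print(" ".join(flags))
```

The resulting list can be spliced into a `spark-submit` command line ahead of the application JAR.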

## v0.6.5 - 2023-03-14

Bug fix:
* UGC fix uncommitted exists change (#5467)
* UGC fix list mark addresses

## v0.6.4 - 2023-03-13

Bug fix:
* UGC handle no uncommitted data on repository (#5451)

## v0.6.3 - 2023-03-09

Bug fix:
* UGC repartition by addresses to handle large repositories
* UGC use task context to delete temporary files
* UGC copy metadata to local without crc files

## v0.6.2 - 2023-02-23

Bug fix:
* Add exponential backoff retry to the S3 client

## v0.6.1 - 2023-01-30

What's new:
* Upgrade lakeFS client to v0.91.0
* Add UGC cutoff time to the report

## v0.6.0 - 2022-12-14

What's new:
* Beta feature: Uncommitted garbage collection. [Documentation](https://docs.lakefs.io/howto/garbage-collection.html#beta-deleting-uncommitted-objects)

Bug fixes:
* Correctly write GC JSON summary object using UTF-8 (#4644)

## v0.5.2 - 2022-11-29
Bug fixes:
* Identify the region of the S3 bucket if it's not reachable and update the initialized S3 client using that region.

## v0.5.1 - 2022-10-20
Bug fixes:
* Make GC backup and restore support an expired-addresses list that includes objects not in the underlying object store (#4367)
* Don't package with hadoop-aws.  This removes many dependency failures and
  simplifies configuration.  But it also means that for plain Spark
  distributions, such as the one downloaded from the Apache Spark homepage,
  you will need to add `--packages org.apache.hadoop:hadoop-aws:2.7.7` or
  `--packages org.apache.hadoop:hadoop-aws:3.2.1` or similar, to add in this
  package. (#4399)
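
The `hadoop-aws` version generally needs to match the Hadoop version bundled with the Spark distribution. A minimal sketch of building the `--packages` argument from that version follows. The helper name is hypothetical, and the two versions shown are simply the ones mentioned above.

```python
def hadoop_aws_packages_args(hadoop_version):
    """Build the spark-submit --packages arguments for hadoop-aws.

    Uses the Maven coordinate format group:artifact:version.
    """
    parts = hadoop_version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like 3.2.1, got {hadoop_version!r}")
    return ["--packages", f"org.apache.hadoop:hadoop-aws:{hadoop_version}"]


print(hadoop_aws_packages_args("2.7.7"))
print(hadoop_aws_packages_args("3.2.1"))
```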

## v0.5.0 - 2022-10-06
What's new:
* A utility for GC backup and restore. It allows users to copy objects that GC plans to delete, or to restore objects
  from a previously created backup (#4318)

## v0.4.0 - 2022-09-30
What's new:
* Separate GC into mark and sweep phases and add configuration parameters to control which phases to run (#4264)

Bug fixes:
* Fix the failure to write an empty dataframe into GC reports when running in mark-only mode (#4239)
* Only clean up relative path names (#4222)

## v0.3.0 - 2022-09-21
What's new:
- Add retry mechanism (#4190)
- Improve performance (#4194)

## v0.2.3 - 2022-09-11
- Performance improvements (#4097, #4099, #4110)
- Fix bug: parsing problems in Azure (#4081)

## v0.2.2 - 2022-08-29
What's new:
- Added custom lakeFS client read timeout configuration (#3983)
- Rename custom lakeFS client timeout configuration keys (#4017)

Bug fixes:
- [GC] Re-use HTTP clients to limit the number of open connections and fix a resource leak (#3998)

## v0.2.1 - 2022-08-18
Bug fixes:
- [GC] Added a configuration flag for the lakeFS client timeout to the spark-submit command (#3905)

## v0.2.0 - 2022-08-01
What's new:
- Garbage Collection on Azure (#3733, #3654)

Bug fixes:
- [GC] Respect Hadoop AWS access key configuration in S3Client (#3762)
- Exit GC when no GC rules are configured for a repo (#3779)