# Changelog

## v0.13.0 - 2024-01-21
* Shade scalapb for use in current DataBricks LTS releases (#7289). :bricks:

This is a minor release: while it adds no new functionality, it changes how
users can use the Spark metadata client with DataBricks (and possibly other
managed Spark providers).

## v0.12.0 - 2023-12-21
* Metaclient: Skip dummy file when scanning metadata prefix (#7192)

## v0.11.0 - 2023-10-23
* Bump lakeFS SDK to v1.0
* **Breaking:** drop support for lakeFS server < v0.108.0

## v0.10.0 - 2023-08-15
* Bug fixes and improvements.
* Separate flows for committed and uncommitted GC are no longer supported. Use the [new GC job](https://docs.lakefs.io/howto/garbage-collection/).

## v0.9.1 - 2023-07-31
* Fix bug when GC runs on a repository containing only the initial commit.
* Fix bug when GC runs on a bucket outside the provided region. (#6283)

## v0.9.0 - 2023-07-23
* A new and improved Garbage Collection job. See the [docs](https://docs.lakefs.io/howto/garbage-collection.html) for more information.

## v0.8.1 - 2023-05-22

Bug fixes:
* Read previous run ID in case of a sweep-only run (#5963)
* Perform run ID logging only after sweep completed successfully (#5963)

## v0.8.0 - 2023-05-18

What's new:
* Incremental committed GC implementation - supported starting from lakeFS 0.100.0

## v0.7.3 - 2023-05-07

Bug fixes:
* Fix uncommitted GC to handle a missing uncommitted location (#5817)

## v0.7.2 - 2023-05-02

Bug fixes:
* Fix uncommitted garbage collection bulk remove running out of disk space (#5776)

## v0.7.1 - 2023-04-24

Performance improvements:

No user-visible changes, but some new configuration parameters.

Deletion now retries S3 deleteObjects, a **lot**.
Parameters `lakefs.gc.s3.min_backoff_secs` and `lakefs.gc.s3.max_backoff_secs`
control how long it will keep trying.

## v0.7.0 - 2023-03-13

Performance improvements:

No user-visible changes, but some new configuration parameters.

* Write expired addresses to fewer locations:
  * Only write text-format expired addresses if the new Hadoop config option
    `lakefs.gc.address.write_as_text` is set.
  * Remove one (unused) expired address location.
* Handle huge metaranges: the new Hadoop config option
  `lakefs.gc.address.approx_num_ranges_to_spread_per_partition` can be set
  when there are very many ranges. Values 20..100 are probably best.
* For debugging performance only:

  Guaranteed to produce incorrect results! You probably **never want to set
  this** in production or on any repository you care about.
  * New Hadoop config option `lakefs.debug.gc.addresses_sample_fraction` can
    be set below 1.0 _to debug performance **only**_.

## v0.6.5 - 2023-03-14

Bug fixes:
* UGC: fix uncommitted exists change (#5467)
* UGC: fix listing of mark addresses

## v0.6.4 - 2023-03-13

Bug fix:
* UGC: handle repositories with no uncommitted data (#5451)

## v0.6.3 - 2023-03-09

Bug fixes:
* UGC: repartition by addresses to handle large repositories
* UGC: use task context to delete temporary files
* UGC: copy metadata to local storage without CRC files

## v0.6.2 - 2023-02-23

Bug fix:
* Add exponential backoff retry to the S3 client

## v0.6.1 - 2023-01-30

What's new:
* Upgrade lakeFS client to v0.91.0
* Add UGC cutoff time to the report

## v0.6.0 - 2022-12-14

What's new:
* Beta feature: Uncommitted garbage collection.
  [Documentation](https://docs.lakefs.io/howto/garbage-collection.html#beta-deleting-uncommitted-objects)

Bug fixes:
* Correctly write the GC JSON summary object using UTF-8 (#4644)

## v0.5.2 - 2022-11-29
Bug fixes:
* Identify the region of the S3 bucket if it is not reachable, and update the initialized S3 client to use that region.

## v0.5.1 - 2022-10-20
Bug fixes:
* Make GC backup and restore support expired address lists that include objects not present in the underlying object store (#4367)
* Don't package with hadoop-aws. This removes many dependency failures and
  simplifies configuration. But it also means that for plain Spark
  distributions, such as the one downloaded from the Apache Spark homepage,
  you will need to add `--packages org.apache.hadoop:hadoop-aws:2.7.7`,
  `--packages org.apache.hadoop:hadoop-aws:3.2.1`, or similar, to pull in
  this package. (#4399)

## v0.5.0 - 2022-10-06
What's new:
* A utility for GC backup and restore.
  It allows users to copy objects that GC plans to delete, or to restore
  objects from a previously created backup (#4318)

## v0.4.0 - 2022-09-30
What's new:
* Separate GC into mark and sweep parts, and add configuration parameters to control which phases run (#4264)

Bug fixes:
* Fix the failure to write an empty dataframe into GC reports when running in mark-only mode (#4239)
* Only clean up relative path names (#4222)

## v0.3.0 - 2022-09-21
What's new:
- Add retries mechanism (#4190)
- Improve performance (#4194)

## v0.2.3 - 2022-09-11
- Performance improvements (#4097, #4099, #4110)
- Fix bug: parsing problems in Azure (#4081)

## v0.2.2 - 2022-08-29
What's new:
- Added custom lakeFS client read timeout configuration (#3983)
- Renamed custom lakeFS client timeout configuration keys (#4017)

Bug fixes:
- [GC] Re-use HTTP clients to limit the number of open connections and fix a resource leak (#3998)

## v0.2.1 - 2022-08-18
Bug fixes:
- [GC] Added a lakeFS client timeout configuration flag to the spark-submit command (#3905)

## v0.2.0 - 2022-08-01
What's new:
- Garbage Collection on Azure (#3733, #3654)

Bug fixes:
- [GC] Respect Hadoop AWS access key configuration in S3Client (#3762)
- Exit GC when no GC rules are configured for a repo (#3779)
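
---

The Hadoop configuration options mentioned throughout these notes are passed to the GC job as `spark.hadoop.*` properties on `spark-submit`, alongside the `--packages org.apache.hadoop:hadoop-aws:...` flag required since v0.5.1. The sketch below is illustrative only: the endpoint, credentials, jar path, repository name, and region are placeholders, the exact entry-point class and arguments may differ between versions, and the option values shown are arbitrary; see the linked garbage-collection documentation for the authoritative invocation.

```shell
# Hypothetical spark-submit invocation for the GC job, illustrating
# configuration options from these release notes. All values are placeholders.
spark-submit --class io.treeverse.clients.GarbageCollector \
  --packages org.apache.hadoop:hadoop-aws:3.2.1 \
  --conf spark.hadoop.lakefs.api.url=https://lakefs.example.com/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key="$LAKEFS_ACCESS_KEY" \
  --conf spark.hadoop.lakefs.api.secret_key="$LAKEFS_SECRET_KEY" \
  --conf spark.hadoop.lakefs.gc.s3.min_backoff_secs=1 \
  --conf spark.hadoop.lakefs.gc.s3.max_backoff_secs=120 \
  --conf spark.hadoop.lakefs.gc.address.approx_num_ranges_to_spread_per_partition=50 \
  path/to/lakefs-spark-client.jar \
  example-repo us-east-1
```

Here `lakefs.gc.s3.min_backoff_secs`/`max_backoff_secs` bound the S3 deleteObjects retry backoff added in v0.7.1, and `approx_num_ranges_to_spread_per_partition` is the v0.7.0 knob for repositories with very many ranges.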