github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/design/accepted/object-metadata-s3.md

# Object Metadata and S3

### Overview

lakeFS keeps the following metadata per object:

1. Physical address
1. Last modified time
1. Size
1. ETag
1. Metadata
1. Address type (relative / full)


### What is missing?

The following two issues are currently not addressed by our S3 interface:

1. Content-Type - The content-type header sent as part of put object and multipart upload is not kept. We expect get object to return the same content type.
   Related issues: [Support Content-Type](https://github.com/treeverse/lakeFS/issues/2296) [Support Trino AVRO format](https://github.com/treeverse/lakeFS/issues/2429)
1. User-defined metadata - When posting an object, AWS allows attaching additional metadata by passing `x-amz-meta-*` headers. We do not process these headers. By passing them to our metadata on put/get, we will enable better S3 compatibility and integration with tools like Rsync.
   Related issues: [Store some per-file metadata](https://github.com/treeverse/lakeFS/issues/2486)


### Solution

The catalog entity will include ContentType as an additional metadata field. The field will be added to the entry identity calculation (unless it is empty, for backward compatibility).
On read of a previously committed entry without ContentType, a default content-type will be returned.
On write, a new entry will be set with the ContentType used to post the data. In case nothing is set, a default will be set on the object.
Our API (OpenAPI) will pass the content-type as an additional field in any location where we pass the object metadata. When we get or upload a file, we use the standard content-type header to pass the entry content-type.
AWS metadata posted with an object will be added to the object's metadata. We will map "x-amz-meta-<name>" to name/value in our metadata. This is equivalent to creating entry metadata using our OpenAPI.
The S3 gateway get object will map the metadata key/value back to "x-amz-meta-<name>" headers.
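
The content-type rules above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Entry` class, the `entry_identity` hashing scheme, and the `application/octet-stream` default are assumptions made for illustration, since this document does not name the default value or the hash layout lakeFS uses.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed default; the design only says "a default" without naming it.
DEFAULT_CONTENT_TYPE = "application/octet-stream"


@dataclass
class Entry:
    """Hypothetical stand-in for a catalog entry, not the lakeFS type."""
    physical_address: str
    size: int
    etag: str
    metadata: Dict[str, str] = field(default_factory=dict)
    content_type: str = ""  # empty for entries committed before this change


def entry_identity(entry: Entry) -> str:
    """Hash the entry fields; content_type takes part only when it is set,
    so entries written without it keep the identity they had before."""
    parts = [entry.physical_address, str(entry.size), entry.etag]
    parts += [f"{k}={v}" for k, v in sorted(entry.metadata.items())]
    if entry.content_type:
        parts.append("content-type=" + entry.content_type)
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator between fields
    return h.hexdigest()


def content_type_for_read(entry: Entry) -> str:
    """Older entries have no content type, so fall back to the default."""
    return entry.content_type or DEFAULT_CONTENT_TYPE


def content_type_for_write(header_value: Optional[str]) -> str:
    """Store the request's Content-Type, or the default when none was sent."""
    return header_value or DEFAULT_CONTENT_TYPE
```

Skipping the empty field in the hash is one way to keep previously committed entries stable; other layouts would work as long as an empty ContentType does not change the result.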

An S3 head request on an object with a content-type will return a response like:

```json
{
    "AcceptRanges": "bytes",
    "LastModified": "Tue, 28 Sep 2021 22:34:44 GMT",
    "ContentLength": 10485760,
    "ETag": "\"f962bf40d19fed2e80bcbaa33bd1dfe7\"",
    "ContentType": "example/data",
    "Metadata": {
        "x-amz-meta-author": "barak"
    }
}
```

Support for AWS user-defined metadata will be implemented by passing the headers into our entry's `Metadata` field. More information about S3 object metadata can be found [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html#SysMetadata).
While using the S3 gateway, put or multipart upload will store the `x-amz-meta-*` request headers into the object metadata. On get, we will return the `x-amz-meta-*` metadata keys as headers.
Our current metadata field will continue to be used, so no changes to the OpenAPI will be required.
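
As a rough illustration of the gateway round trip, the Python sketch below stores `x-amz-meta-*` request headers in the entry metadata and returns them as headers on get. The function names are hypothetical, and keeping the full lower-cased header name as the metadata key is an assumption chosen to match the head-request example in this document; the actual key format is an implementation detail.

```python
from typing import Dict, Mapping

USER_METADATA_PREFIX = "x-amz-meta-"


def headers_to_metadata(headers: Mapping[str, str]) -> Dict[str, str]:
    """Collect x-amz-meta-* request headers (put / multipart upload)."""
    metadata = {}
    for name, value in headers.items():
        key = name.lower()  # HTTP header names are case-insensitive
        if key.startswith(USER_METADATA_PREFIX):
            # Assumption: the full header name is kept as the metadata key.
            metadata[key] = value
    return metadata


def metadata_to_headers(metadata: Mapping[str, str]) -> Dict[str, str]:
    """Return only the x-amz-meta-* metadata keys as response headers."""
    return {
        key: value
        for key, value in metadata.items()
        if key.lower().startswith(USER_METADATA_PREFIX)
    }
```

With these helpers, a put carrying `X-Amz-Meta-Author: barak` would be stored as `{"x-amz-meta-author": "barak"}`, and a later get would emit that pair as a header while leaving other metadata keys out of the response.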