---
title: Import Data
description: Import existing data into a lakeFS repository
parent: How-To
redirect_from:
  - /setup/import.html
---

_This section describes how to import existing data into a lakeFS repository, without copying it.
If you are interested in copying data into lakeFS, see [Copying data to/from lakeFS](./copying.md)._
{: .mt-5 .mb-1 }

# Importing data into lakeFS
{: .mt-2 }

{% include toc_2-3.html %}

## Prerequisites

* Importing is permitted for users in the Supers (open-source) group or the SuperUsers (Cloud/Enterprise) group.
  To learn how lakeFS Cloud and lakeFS Enterprise users can fine-tune import permissions, see [Fine-grained permissions](#fine-grained-permissions) below.
* The lakeFS _server_ must have permissions to list the objects in the source bucket.
* The source bucket must be on the same cloud provider and in the same region as your repository.

## Using the lakeFS UI

1. On your repository's main page, click the _Import_ button to open the import dialog.
2. Under _Import from_, fill in the location on your object store you would like to import from.
3. Fill in the import destination in lakeFS. This should be a path under the current branch.
4. Add a commit message, and optionally commit metadata.
5. Press _Import_.

Once the import is complete, a new commit containing the imported objects will be created in the destination branch.

## Using the CLI: _lakectl import_

The _lakectl import_ command acts the same as the UI import wizard. It commits the changes to the selected branch.

<div class="tabs">
<ul>
  <li><a href="#import-tabs-1">AWS S3 or S3 API Compatible storage</a></li>
  <li><a href="#import-tabs-2">Azure Blob</a></li>
  <li><a href="#import-tabs-3">Google Cloud Storage</a></li>
</ul>
<div markdown="1" id="import-tabs-1">
```shell
lakectl import \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/
```
</div>
<div markdown="1" id="import-tabs-2">
```shell
lakectl import \
  --from https://storageAccountName.blob.core.windows.net/container/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/
```
</div>
<div markdown="1" id="import-tabs-3">
```shell
lakectl import \
  --from gs://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/
```
</div>
</div>

## Notes
{:.no_toc}

1. Any previously existing objects under the destination prefix will be deleted.
1. The import duration depends on the number of imported objects, but is roughly a few thousand objects per second.
1. For security reasons, if you are using lakeFS on top of your local disk (`blockstore.type=local`), you need to enable the import feature explicitly.
   To do so, set `blockstore.local.import_enabled` to `true` and specify the allowed import paths in `blockstore.local.allowed_external_prefixes` (see the [configuration reference]({% link reference/configuration.md %})).
   When using lakectl or the lakeFS UI, you can currently only import directories from local storage. If you need to import a single file, use the [HTTP API](https://docs.lakefs.io/reference/api.html#/import/importStart) or the API clients with `type=object` in the request body and `destination=<full-path-to-file>`, as shown in the sketch after this list.
1. Making changes to data in the original bucket will not be reflected in lakeFS, and may cause inconsistencies.
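To illustrate the single-file case, here is a sketch of calling the `importStart` API with `curl`. The server URL, repository, branch, credentials, and object paths below are placeholders; the request body shape follows the [API reference](https://docs.lakefs.io/reference/api.html#/import/importStart):

```shell
# Sketch: import a single object via the importStart HTTP API.
# The server URL, repository (my-repo), branch (my-branch), credentials,
# and paths are placeholders - replace them with your own values.
curl -s -X POST "https://lakefs.example.com/api/v1/repositories/my-repo/branches/my-branch/import" \
  -u "$LAKEFS_ACCESS_KEY_ID:$LAKEFS_SECRET_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "paths": [
      {
        "type": "object",
        "path": "s3://bucket/optional/prefix/file.parquet",
        "destination": "optional/path/file.parquet"
      }
    ],
    "commit": {
      "message": "Import a single object"
    }
  }'
```

On success, the call returns an import ID, which can be used to track or cancel the import run.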
## Examples

To explore practical examples and real-world use cases of importing data into lakeFS,
we recommend checking out our comprehensive [blog post on the subject](https://lakefs.io/blog/import-data-lakefs/).

## Fine-grained permissions
{:.no_toc}
{: .d-inline-block }
lakeFS Cloud
{: .label .label-green }
lakeFS Enterprise
{: .label .label-purple }

With RBAC support, the lakeFS user running the import command should have the following permissions in lakeFS:
`fs:WriteObject`, `fs:CreateMetaRange`, `fs:CreateCommit`, `fs:ImportFromStorage` and `fs:ImportCancel`.

As mentioned above, all of these permissions are available by default to the Supers (open-source) group and the SuperUsers (Cloud/Enterprise) group. A policy that grants only these permissions is sketched at the end of this page.

## Provider-specific permissions
{:.no_toc}

In addition, the following provider-specific permissions may be required:

<div class="tabs">
<ul>
  <li><a href="#aws-s3">AWS S3</a></li>
  <li><a href="#azure-storage">Azure Storage</a></li>
  <li><a href="#gcs">Google Cloud Storage</a></li>
</ul>
<div markdown="1" id="aws-s3">

## AWS S3: Importing from public buckets
{:.no_toc}

lakeFS needs access to the imported location, first to list the files to import and later to read the files upon user request.

There are some use cases where you would like to import from a source bucket that isn't owned by the account running lakeFS,
for example when importing public datasets to experiment with lakeFS and Spark.

lakeFS requires additional permissions to read from public buckets. For example, the following policy can be attached to the lakeFS S3 service account to allow access to public buckets, while granting no additional access to buckets owned by your own account:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PubliclyAccessibleBuckets",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketVersioning",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["*"],
      "Condition": {
        "StringNotEquals": {
          "s3:ResourceAccount": "<YourAccountID>"
        }
      }
    }
  ]
}
```

</div>
<div markdown="1" id="azure-storage">

**Note:** The use of the `adls` hint for ADLS Gen2 storage accounts is deprecated; use the original source URL for the import.
{: .note}

See [Azure deployment][deploy-azure-storage-account-creds] for limitations when using account credentials.

</div>
<div markdown="1" id="gcs">
No specific prerequisites.
</div>
</div>

[deploy-azure-storage-account-creds]: {% link howto/deploy/azure.md %}#storage-account-credentials
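
## Example: a policy for import permissions
{:.no_toc}

To accompany the [Fine-grained permissions](#fine-grained-permissions) section above, here is a minimal sketch of a lakeFS RBAC policy granting only the permissions required for import. The policy `id` is hypothetical, and the `"resource": "*"` scope can be narrowed to specific repositories:

```json
{
  "id": "ImportOnlyPolicy",
  "statement": [
    {
      "action": [
        "fs:WriteObject",
        "fs:CreateMetaRange",
        "fs:CreateCommit",
        "fs:ImportFromStorage",
        "fs:ImportCancel"
      ],
      "effect": "allow",
      "resource": "*"
    }
  ]
}
```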