
---
title: Import Data
description: Import existing data into a lakeFS repository
parent: How-To
redirect_from:
  - /setup/import.html
---

_This section describes how to import existing data into a lakeFS repository, without copying it.
If you are interested in copying data into lakeFS, see [Copying data to/from lakeFS](./copying.md)._
{: .mt-5 .mb-1 }

# Importing data into lakeFS
{: .mt-2 }

{% include toc_2-3.html %}

## Prerequisites

* Importing is permitted for users in the Supers (open-source) group or the SuperUsers (Cloud/Enterprise) group.
  To learn how lakeFS Cloud and lakeFS Enterprise users can fine-tune import permissions, see [Fine-grained permissions](#fine-grained-permissions) below.
* The lakeFS _server_ must have permissions to list the objects in the source bucket.
* The source bucket must be on the same cloud provider and in the same region as your repository.

## Using the lakeFS UI

1. On your repository's main page, click the _Import_ button to open the import dialog.
2. Under _Import from_, fill in the location in your object store that you would like to import from.
3. Fill in the import destination in lakeFS. This should be a path under the current branch.
4. Add a commit message, and optionally add commit metadata.
5. Press _Import_.

Once the import is complete, a new commit containing the imported objects will be created in the destination branch.

![lakeFS UI import dialog]({% link assets/img/UI-Import-Dialog.png %})
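
You can also confirm that the import produced a commit from the command line. A minimal check, using the hypothetical repository and branch names from the examples on this page:

```shell
# Show the most recent commit on the destination branch (should be the import commit)
lakectl log lakefs://my-repo/my-branch --amount 1
```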

## Using the CLI: _lakectl import_
The _lakectl import_ command behaves the same as the UI import wizard: it imports the given source location and commits the changes to the selected branch.

<div class="tabs">
<ul>
  <li><a href="#import-tabs-1">AWS S3 or S3 API Compatible storage</a></li>
  <li><a href="#import-tabs-2">Azure Blob</a></li>
  <li><a href="#import-tabs-3">Google Cloud Storage</a></li>
</ul>
<div markdown="1" id="import-tabs-1">
```shell
lakectl import \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/
```
</div>
<div markdown="1" id="import-tabs-2">
```shell
lakectl import \
  --from https://storageAccountName.blob.core.windows.net/container/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/
```
</div>
<div markdown="1" id="import-tabs-3">
```shell
lakectl import \
  --from gs://bucket/optional/prefix/ \
  --to lakefs://my-repo/my-branch/optional/path/
```
</div>
</div>

## Notes
{:.no_toc}

1. Any previously existing objects under the destination prefix will be deleted.
1. The import duration depends on the number of imported objects; imports typically proceed at a rate of a few thousand objects per second.
1. For security reasons, if you are using lakeFS on top of your local disk (`blockstore.type=local`), you need to enable the import feature explicitly.
   To do so, set `blockstore.local.import_enabled` to `true` and specify the allowed import paths in `blockstore.local.allowed_external_prefixes` (see [configuration reference]({% link reference/configuration.md %}) and the configuration sketch below this list).
   When using lakectl or the lakeFS UI, you can currently import only directories locally. If you need to import a single file, use the [HTTP API](https://docs.lakefs.io/reference/api.html#/import/importStart) or API Clients with `type=object` in the request body and `destination=<full-path-to-file>` (see the API call sketch below this list).
1. Making changes to data in the original bucket will not be reflected in lakeFS, and may cause inconsistencies.
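
For the local blockstore note above, a minimal configuration sketch with imports enabled might look like the following (the paths shown are placeholders for your own setup):

```yaml
blockstore:
  type: local
  local:
    path: ~/lakefs/data                # where lakeFS stores its own data
    import_enabled: true               # allow importing from the local filesystem
    allowed_external_prefixes:
      - /mnt/datasets                  # only paths under these prefixes may be imported
```

To import a single object through the API, a request along the following lines should work. The host, credentials, repository, branch, and paths are placeholders; the field names follow the importStart request body described in the API reference linked above:

```shell
curl -u "$LAKEFS_ACCESS_KEY_ID:$LAKEFS_SECRET_ACCESS_KEY" \
  -X POST "https://<lakefs-host>/api/v1/repositories/my-repo/branches/my-branch/import" \
  -H "Content-Type: application/json" \
  -d '{
        "paths": [
          {
            "type": "object",
            "path": "s3://bucket/path/to/file.parquet",
            "destination": "datasets/file.parquet"
          }
        ],
        "commit": {
          "message": "Import a single object"
        }
      }'
```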

## Examples
To explore practical examples and real-world use cases of importing data into lakeFS,
we recommend checking out our comprehensive [blog post on the subject](https://lakefs.io/blog/import-data-lakefs/).

## Fine-grained permissions
{:.no_toc}
{: .d-inline-block }
lakeFS Cloud
{: .label .label-green }
lakeFS Enterprise
{: .label .label-purple }

With RBAC support, the lakeFS user running the import command should have the following permissions in lakeFS:
`fs:WriteObject`, `fs:CreateMetaRange`, `fs:CreateCommit`, `fs:ImportFromStorage` and `fs:ImportCancel`.
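
For illustration only, a policy granting just these actions might look like the sketch below. The policy identifier is hypothetical, and you may want to narrow `resource` to specific repositories rather than using a wildcard; see the RBAC documentation for the exact policy format:

```json
{
  "id": "ImportDataPolicy",
  "statement": [
    {
      "action": [
        "fs:WriteObject",
        "fs:CreateMetaRange",
        "fs:CreateCommit",
        "fs:ImportFromStorage",
        "fs:ImportCancel"
      ],
      "effect": "allow",
      "resource": "*"
    }
  ]
}
```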

As mentioned above, all of these permissions are available by default to the Supers (open-source) group or the SuperUsers (Cloud/Enterprise) group.

## Provider-specific permissions
{:.no_toc}

In addition, the following provider-specific permissions may be required:

<div class="tabs">
<ul>
  <li><a href="#aws-s3">AWS S3</a></li>
  <li><a href="#azure-storage">Azure Storage</a></li>
  <li><a href="#gcs">Google Cloud Storage</a></li>
</ul>
<div markdown="1" id="aws-s3">

## AWS S3: Importing from public buckets
{:.no_toc}

lakeFS needs access to the imported location to first list the files to import and later to read the files when users request them.

There are use cases where you may want to import from a location that isn't owned by the account running lakeFS,
for example, importing public datasets to experiment with lakeFS and Spark.

lakeFS requires additional permissions to read from public buckets. For example, for S3 public buckets,
the following policy needs to be attached to the lakeFS S3 service account to allow access to public buckets, while blocking access to other owned buckets:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PubliclyAccessibleBuckets",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketVersioning",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["*"],
      "Condition": {
        "StringNotEquals": {
          "s3:ResourceAccount": "<YourAccountID>"
        }
      }
    }
  ]
}
```

</div>
<div markdown="1" id="azure-storage">

**Note:** The use of the `adls` hint for ADLS Gen2 storage accounts is deprecated; please use the original source URL for the import.
{: .note}

See [Azure deployment][deploy-azure-storage-account-creds] for limitations when using account credentials.

</div>
<div markdown="1" id="gcs">
No specific prerequisites.
</div>
</div>

[deploy-azure-storage-account-creds]: {% link howto/deploy/azure.md %}#storage-account-credentials