# Using the S3 Gateway

Pachyderm includes an S3 gateway that enables you to interact with PFS storage
through an HTTP application programming interface (API) that imitates the
Amazon S3 Storage API. With the Pachyderm S3 gateway, you can therefore interact
with Pachyderm through tools and libraries designed to work with object stores.
For example, you can use these tools:

* [MinIO](https://docs.min.io/docs/minio-client-complete-guide)
* [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)

When you deploy `pachd`, the S3 gateway starts automatically.

The S3 gateway has some limitations that are outlined below. If you need richer
access, use the PFS gRPC interface instead, or one of the
[client drivers](https://github.com/pachyderm/python-pachyderm).

## Authentication

If auth is enabled on the Pachyderm cluster, credentials must be passed with
each S3 gateway request using AWS signature v2 or v4 methods. Object store
tools and libraries provide built-in support for these methods, but they do
not work in the browser. When you use authentication, set the access key and
secret key to the same value; both are the Pachyderm auth token used
to issue the relevant PFS calls.

If auth is not enabled on the Pachyderm cluster, no credentials need to be
passed to S3 gateway requests.
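
As an illustration of the credential rule above, the following Python sketch builds the keyword arguments for a boto3 client. The function names `gateway_credentials` and `gateway_client` are hypothetical, and the `http://localhost:30600` endpoint assumes the port-forwarding setup described later on this page. Passing empty strings to boto3 when auth is disabled is an assumption carried over from the CLI guidance and has not been verified against a cluster.

```python
def gateway_credentials(auth_token=None):
    """Return access/secret key settings for the S3 gateway.

    With auth enabled, both keys carry the same Pachyderm auth token.
    Without a token, both keys are empty strings (assumption, see lead-in).
    """
    token = auth_token or ""
    return {"aws_access_key_id": token, "aws_secret_access_key": token}


def gateway_client(auth_token, endpoint_url="http://localhost:30600"):
    """Build a boto3 S3 client pointed at the gateway (requires boto3)."""
    import boto3  # imported lazily so gateway_credentials works without boto3

    return boto3.client(
        "s3", endpoint_url=endpoint_url, **gateway_credentials(auth_token)
    )
```

Keeping the credential mapping in its own function makes it easy to reuse the same token for other S3 libraries.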

## Buckets

The S3 gateway presents each branch from every Pachyderm repository as
an S3 bucket.
For example, if you have a `master` branch in the `images` repository,
an S3 tool sees `images@master` as the `master.images` S3 bucket.
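
The naming rule can be expressed as a pair of small helpers. This is a minimal sketch; `bucket_for` and `repo_and_branch` are hypothetical names, and splitting on the first dot assumes that branch names do not contain dots.

```python
def bucket_for(repo, branch="master"):
    """Map a Pachyderm repo and branch to the bucket name an S3 tool sees."""
    return f"{branch}.{repo}"


def repo_and_branch(bucket):
    """Split a gateway bucket name back into (repo, branch).

    Assumes the branch name is everything before the first dot.
    """
    branch, sep, repo = bucket.partition(".")
    if not sep or not branch or not repo:
        raise ValueError(f"not a <branch>.<repo> bucket name: {bucket!r}")
    return repo, branch
```

For example, `bucket_for("images")` yields `master.images`, matching the example above.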

## Versioning

Most operations act on the `HEAD` of the given branch. However, if your object
store library or tool supports versioning, you can get objects in non-HEAD
commits by using the commit ID as the version.
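
With boto3, the version is passed as the `VersionId` parameter of `get_object`. The sketch below only builds the request parameters; `get_object_params` is a hypothetical helper and the commit ID shown in the comment is a placeholder.

```python
def get_object_params(repo, branch, path, commit_id=None):
    """Build get_object parameters for a file on a branch.

    Without commit_id the gateway reads from the HEAD of the branch.
    With commit_id the commit is requested as the S3 object version.
    """
    params = {"Bucket": f"{branch}.{repo}", "Key": path}
    if commit_id is not None:
        params["VersionId"] = commit_id
    return params


# Usage with an existing boto3 client named `s3` (not created here):
#   body = s3.get_object(**get_object_params(
#       "raw_data", "master", "test.csv", commit_id="<commit-id>"))["Body"].read()
```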

## Port Forwarding

If you do not have direct access to the Kubernetes cluster, you can use port
forwarding instead. Run `pachctl port-forward`, which lets you
access the S3 gateway through `localhost:30600`.

However, the Kubernetes port forwarder incurs substantial overhead and
does not recover well from broken connections. Connecting to the cluster
directly is therefore faster and more reliable.
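
Because forwarded connections can drop, it may help to confirm that the port is accepting connections before you start a long transfer. The following is a rough sketch using only the Python standard library; `gateway_reachable` is a hypothetical helper, and it checks only that a TCP connection can be opened, not that the gateway answers S3 requests.

```python
import socket


def gateway_reachable(host="localhost", port=30600, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```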

## Configure the S3 Client

Before you can work with the S3 gateway, configure your S3 client
to access Pachyderm. Complete the steps in the section below that
corresponds to your S3 client.

### Configure MinIO

If you are not using the MinIO client, skip this section.

To install and configure MinIO, complete the following steps:

1. Install the MinIO client on your platform as
   described on the [MinIO download page](https://min.io/download#/macos).

1. Verify that MinIO components are successfully installed by running
   the following commands:

   ```shell
   $ minio version
   $ mc version
   Version: 2019-07-11T19:31:28Z
   Release-tag: RELEASE.2019-07-11T19-31-28Z
   Commit-id: 31e5ac02bdbdbaf20a87683925041f406307cfb9
   ```

1. Set up the MinIO configuration file to use the `30600` port for your host:

   ```shell
   vi ~/.mc/config.json
   ```

   You should see a configuration similar to the following:

   * For a minikube deployment, verify the
     `local` host configuration:

     ```json
     "local": {
         "url": "http://localhost:30600",
         "accessKey": "YOUR-PACHYDERM-AUTH-TOKEN",
         "secretKey": "YOUR-PACHYDERM-AUTH-TOKEN",
         "api": "S3v4",
         "lookup": "auto"
     },
     ```

     Set the access key and secret key to your
     Pachyderm authentication token. If authentication is not enabled
     on the cluster, both parameters must be empty strings.
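
If you prefer to generate the host entry rather than type it, a sketch along these lines produces the same fields as the fragment above. The helper names `mc_host_entry` and `render_host` are hypothetical; where the entry belongs inside `config.json` depends on your `mc` version, so the sketch only renders the entry itself.

```python
import json


def mc_host_entry(url="http://localhost:30600", auth_token=""):
    """Return an mc host entry; both keys carry the same Pachyderm token.

    Leave auth_token empty when authentication is not enabled.
    """
    return {
        "url": url,
        "accessKey": auth_token,
        "secretKey": auth_token,
        "api": "S3v4",
        "lookup": "auto",
    }


def render_host(alias, entry):
    """Render {alias: entry} as indented JSON for pasting into the config."""
    return json.dumps({alias: entry}, indent=4)


if __name__ == "__main__":
    print(render_host("local", mc_host_entry(auth_token="YOUR-PACHYDERM-AUTH-TOKEN")))
```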

### Configure the AWS CLI

If you are not using the AWS CLI, skip this section.

If you have not done so already, you need to install and
configure the AWS CLI client on your machine. To configure the AWS CLI,
complete the following steps:

1. Install the AWS CLI for your operating system as described
   in the [AWS documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html).

1. Verify that the AWS CLI is installed:

   ```shell
   $ aws --version
   aws-cli/1.16.204 Python/2.7.16 Darwin/17.7.0 botocore/1.12.194
   ```

1. Configure AWS CLI:

   ```shell
   $ aws configure
   AWS Access Key ID: YOUR-PACHYDERM-AUTH-TOKEN
   AWS Secret Access Key: YOUR-PACHYDERM-AUTH-TOKEN
   Default region name:
   Default output format [None]:
   ```

   Both the access key and secret key should be set to your
   Pachyderm authentication token. If authentication is not enabled
   on the cluster, both parameters must be empty strings.
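
When you script the AWS CLI, you can supply the token through the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables instead of `aws configure`. The sketch below assembles the command line and environment; the helper names are hypothetical and the default endpoint assumes port forwarding on `localhost:30600`.

```python
import os
import subprocess


def aws_s3_argv(*s3_args, endpoint_url="http://localhost:30600"):
    """Build an `aws s3 ...` command line that targets the S3 gateway."""
    return ["aws", "--endpoint-url", endpoint_url, "s3", *s3_args]


def aws_env(auth_token, base_env=None):
    """Copy the environment and set both AWS keys to the Pachyderm token."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_ACCESS_KEY_ID"] = auth_token
    env["AWS_SECRET_ACCESS_KEY"] = auth_token
    return env


def run_aws_s3(*s3_args, auth_token, endpoint_url="http://localhost:30600"):
    """Run the command and return its standard output (requires the AWS CLI)."""
    result = subprocess.run(
        aws_s3_argv(*s3_args, endpoint_url=endpoint_url),
        env=aws_env(auth_token),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `run_aws_s3("ls", auth_token=token)` runs the bucket listing used in the sections that follow.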

## Supported Operations

The Pachyderm S3 gateway supports the following operations:

* Create buckets: Creates a repo and branch.
* Delete buckets: Deletes a branch or a repo with all branches.
* List buckets: Lists all branches on all repos as S3 buckets.
* Write objects: Atomically overwrites a file on a branch.
* Remove objects: Atomically removes a file on a branch.
* List objects: Lists the files in the HEAD of a branch.
* Get objects: Gets file contents on a branch.

### List Filesystem Objects

If you have configured your S3 client correctly, you should be
able to see the list of filesystem objects in your Pachyderm
repository by running an S3 client `ls` command.
To list filesystem objects, complete the following steps:

1. Verify that your S3 client can access all of your Pachyderm repositories:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local
     [2019-07-12 15:09:50 PDT]      0B master.train/
     [2019-07-12 14:58:50 PDT]      0B master.pre_process/
     [2019-07-12 14:58:09 PDT]      0B master.split/
     [2019-07-12 14:58:09 PDT]      0B stats.split/
     [2019-07-12 14:36:27 PDT]      0B master.raw_data/
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600 s3 ls
     2019-07-12 15:09:50 master.train
     2019-07-12 14:58:50 master.pre_process
     2019-07-12 14:58:09 master.split
     2019-07-12 14:58:09 stats.split
     2019-07-12 14:36:27 master.raw_data
     ```

1. List the contents of a repository:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local/master.raw_data
     [2019-07-19 12:11:37 PDT]  2.6MiB github_issues_medium.csv
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 ls s3://master.raw_data
     2019-07-26 11:22:23    2685061 github_issues_medium.csv
     ```
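
The same two listings can be sketched with a boto3-style client. Both helpers below are hypothetical and take an already constructed client; the sketch uses `list_objects` and reads a single response page, so it assumes the branch holds few enough files to fit in one page.

```python
def list_branch_buckets(client):
    """Return the bucket names (one per branch) reported by the gateway."""
    return [bucket["Name"] for bucket in client.list_buckets().get("Buckets", [])]


def list_head_files(client, bucket):
    """Return (key, size) pairs for the files in the HEAD of a branch."""
    response = client.list_objects(Bucket=bucket)
    # "Contents" is absent when the branch has no files.
    return [(obj["Key"], obj["Size"]) for obj in response.get("Contents", [])]
```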

### Create an S3 Bucket

You can create an S3 bucket in Pachyderm by using the AWS CLI or
the MinIO client commands.
The S3 bucket that you create is a branch in a repository
in Pachyderm.

To create an S3 bucket, complete the following steps:

1. Use the `mb <host/branch.repo>` command to create a new
   S3 bucket, which is a repository with a branch in Pachyderm.

   * If you are using MinIO, type:

     ```shell
     $ mc mb local/master.test
     Bucket created successfully `local/master.test`.
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 mb s3://master.test
     make_bucket: master.test
     ```

1. Verify that the S3 bucket has been successfully created:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local
     [2019-07-18 13:32:44 PDT]      0B master.test/
     [2019-07-12 15:09:50 PDT]      0B master.train/
     [2019-07-12 14:58:50 PDT]      0B master.pre_process/
     [2019-07-12 14:58:09 PDT]      0B master.split/
     [2019-07-12 14:58:09 PDT]      0B stats.split/
     [2019-07-12 14:36:27 PDT]      0B master.raw_data/
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 ls
     2019-07-26 11:35:28 master.test
     2019-07-12 14:58:50 master.pre_process
     2019-07-12 14:58:09 master.split
     2019-07-12 14:58:09 stats.split
     2019-07-12 14:36:27 master.raw_data
     ```

   * You can also use the `pachctl list repo` command to view the
     list of repositories:

     ```shell
     $ pachctl list repo
     NAME               CREATED                    SIZE (MASTER)
     test               About an hour ago          0B
     train              6 days ago                 68.57MiB
     pre_process        6 days ago                 1.18MiB
     split              6 days ago                 1.019MiB
     raw_data           6 days ago                 2.561MiB
     ```

     You should see the newly created repository in this list.
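
The create-and-verify steps can be sketched against a boto3-style client as follows. `create_branch_bucket` and `bucket_exists` are hypothetical helpers; the client is assumed to be configured for the gateway already.

```python
def create_branch_bucket(client, repo, branch="master"):
    """Create the <branch>.<repo> bucket and return its name."""
    bucket = f"{branch}.{repo}"
    client.create_bucket(Bucket=bucket)
    return bucket


def bucket_exists(client, bucket):
    """Check the bucket listing for the given name."""
    names = (entry["Name"] for entry in client.list_buckets().get("Buckets", []))
    return bucket in names
```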

### Delete an S3 Bucket

You can delete an S3 bucket in Pachyderm from the AWS CLI or
MinIO client by running one of the following commands:

* If you are using MinIO, type:

  ```shell
  $ mc rb local/master.test
  Removed `local/master.test` successfully.
  ```

* If you are using AWS, type:

  ```shell
  $ aws --endpoint-url http://localhost:30600/ s3 rb s3://master.test
  remove_bucket: master.test
  ```

### Upload and Download File Objects

For input repositories at the top of your DAG, you can both add files
to and download files from the repository.
When you add files, Pachyderm automatically overwrites the previous
version of the file if it already exists.

Uploading new files is not supported for output repositories, that is,
repositories that are the output of a pipeline.
Not all the repositories that you see in the results of the `ls` command are
input repositories that can be written to. Some of them might be read-only
output repos. Check your pipeline specification to verify which
repositories are the input repos.

To add a file to a repository, complete the following steps:

1. Run the `cp` command for your S3 client:

   * If you are using MinIO, type:

     ```shell
     $ mc cp test.csv local/master.raw_data/test.csv
     test.csv:                  62 B / 62 B  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  100.00% 206 B/s 0s
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 cp test.csv s3://master.raw_data
     upload: ./test.csv to s3://master.raw_data/test.csv
     ```

   These commands add the `test.csv` file to the `master` branch in
   the `raw_data` repository. `raw_data` is an input repository.

1. Check that the file was added:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local/master.raw_data
     [2019-07-19 12:11:37 PDT]  2.6MiB github_issues_medium.csv
     [2019-07-19 12:11:37 PDT]     62B test.csv
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 ls s3://master.raw_data/
     2019-07-19 12:11:37  2685061 github_issues_medium.csv
     2019-07-19 12:11:37       62 test.csv
     ```

1. Download a file from Pachyderm to the
   current directory by running one of the following commands:

   * If you are using MinIO, type:

     ```shell
     $ mc cp local/master.raw_data/github_issues_medium.csv .
     ...hub_issues_medium.csv:  2.56 MiB / 2.56 MiB  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 100.00% 1.26 MiB/s 2s
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 cp s3://master.raw_data/test.csv .
     download: s3://master.raw_data/test.csv to ./test.csv
     ```
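
A library-based equivalent of the upload and download steps might look like the sketch below. The helpers are hypothetical and expect a boto3-style client. The sketch uses the single-request `put_object` and `get_object` calls and reads the whole file into memory, so it is only suitable for small files.

```python
def upload_bytes(client, bucket, key, data):
    """Write (or overwrite) a file on the branch behind the bucket."""
    client.put_object(Bucket=bucket, Key=key, Body=data)


def download_bytes(client, bucket, key):
    """Read the full contents of a file from the HEAD of the branch."""
    return client.get_object(Bucket=bucket, Key=key)["Body"].read()
```

As with the CLI commands, uploads are expected to succeed only for input repositories.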

### Remove a File Object

You can delete a file in the `HEAD` of a Pachyderm branch by using the
MinIO client or the AWS CLI:

1. List the files in the input repository:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local/master.raw_data/
     [2019-07-19 12:11:37 PDT]  2.6MiB github_issues_medium.csv
     [2019-07-19 12:11:37 PDT]     62B test.csv
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 ls s3://master.raw_data
     2019-07-19 12:11:37    2685061 github_issues_medium.csv
     2019-07-19 12:11:37         62 test.csv
     ```

1. Delete a file from a repository. Example:

   * If you are using MinIO, type:

     ```shell
     $ mc rm local/master.raw_data/test.csv
     Removing `local/master.raw_data/test.csv`.
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 rm s3://master.raw_data/test.csv
     delete: s3://master.raw_data/test.csv
     ```
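
For completeness, here is a sketch of the same removal through a boto3-style client, followed by a listing to confirm the result. `remove_file` is a hypothetical helper, and the confirmation reads only the first page of `list_objects` results.

```python
def remove_file(client, bucket, key):
    """Delete a file from the HEAD of a branch and confirm it is gone.

    Returns True when the key no longer appears in the bucket listing.
    """
    client.delete_object(Bucket=bucket, Key=key)
    remaining = client.list_objects(Bucket=bucket).get("Contents", [])
    return all(obj["Key"] != key for obj in remaining)
```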

## Unsupported Operations

Some S3 functionality is not yet supported by Pachyderm.
If you run any of these operations, Pachyderm returns a standard
S3 `NotImplemented` error.

The S3 gateway does not support the following S3 operations:

* Accelerate
* Analytics
* Object copying. PFS supports this functionality through gRPC.
* CORS configuration
* Encryption
* HTML form uploads
* Inventory
* Legal holds
* Lifecycles
* Logging
* Metrics
* Notifications
* Object locks
* Payment requests
* Policies
* Public access blocks
* Regions
* Replication
* Retention policies
* Tagging
* Torrents
* Website configuration
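
Client code can detect the `NotImplemented` response and fall back gracefully. The sketch below assumes a botocore-style exception that exposes the parsed error as `error.response["Error"]["Code"]`; `is_not_implemented` and `call_if_supported` are hypothetical helpers.

```python
def is_not_implemented(error):
    """Return True if the exception carries the S3 NotImplemented code."""
    response = getattr(error, "response", None) or {}
    return response.get("Error", {}).get("Code") == "NotImplemented"


def call_if_supported(operation, *args, **kwargs):
    """Run an S3 call and report whether the gateway supports it.

    Returns (True, result) on success and (False, None) when the gateway
    answers NotImplemented. Any other error is re-raised.
    """
    try:
        return True, operation(*args, **kwargs)
    except Exception as error:
        if is_not_implemented(error):
            return False, None
        raise
```

A caller could wrap an object copy request this way and fall back to the PFS gRPC interface when the first element of the result is `False`.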