# Using the S3Gateway

Pachyderm includes an S3 gateway that enables you to interact with PFS storage
through an HTTP application programming interface (API) that imitates the
Amazon S3 Storage API. Therefore, with the Pachyderm S3 gateway, you can interact
with Pachyderm through tools and libraries designed to work with object stores.
For example, you can use these tools:

* [MinIO](https://docs.min.io/docs/minio-client-complete-guide)
* [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)

When you deploy `pachd`, the S3 gateway starts automatically.

The S3 gateway has some limitations that are outlined below. If you need richer
access, use the PFS gRPC interface instead, or one of the
[client drivers](https://github.com/pachyderm/python-pachyderm).

## Authentication

If auth is enabled on the Pachyderm cluster, credentials must be passed with
each S3 gateway request by using AWS signature v2 or v4 methods. Object store
tools and libraries provide built-in support for these methods, but they do
not work in the browser. When you use authentication, set the access key and
secret key to the same value: both are the Pachyderm auth token that is used
to issue the relevant PFS calls.

If auth is not enabled on the Pachyderm cluster, no credentials need to be
passed with S3 gateway requests.

## Buckets

The S3 gateway presents each branch from every Pachyderm repository as
an S3 bucket.
For example, if you have a `master` branch in the `images` repository,
an S3 tool sees `images@master` as the `master.images` S3 bucket.

## Versioning

Most operations act on the `HEAD` of the given branch. However, if your object
store library or tool supports versioning, you can get objects in non-HEAD
commits by using the commit ID as the version.

## Port Forwarding

If you do not have direct access to the Kubernetes cluster, you can use port
forwarding instead. Run `pachctl port-forward`, which allows you
to access the S3 gateway through `localhost:30600`.

However, the Kubernetes port forwarder incurs substantial overhead and
does not recover well from broken connections. Connecting to the cluster
directly is therefore faster and more reliable.

## Configure the S3 client

Before you can work with the S3 gateway, configure your S3 client
to access Pachyderm. Complete the steps in the section below that
corresponds to your S3 client.

### Configure MinIO

If you are not using the MinIO client, skip this section.

To install and configure MinIO, complete the following steps:

1. Install the MinIO client on your platform as
   described on the [MinIO download page](https://min.io/download#/macos).

1. Verify that the MinIO components are successfully installed by running
   the following commands:

   ```shell
   $ minio version
   $ mc version
   Version: 2019-07-11T19:31:28Z
   Release-tag: RELEASE.2019-07-11T19-31-28Z
   Commit-id: 31e5ac02bdbdbaf20a87683925041f406307cfb9
   ```

1. Set up the MinIO configuration file to use port `30600` for your host:

   ```shell
   vi ~/.mc/config.json
   ```

   You should see a configuration similar to the following:

   * For a minikube deployment, verify the
     `local` host configuration:

     ```shell
     "local": {
        "url": "http://localhost:30600",
        "accessKey": "YOUR-PACHYDERM-AUTH-TOKEN",
        "secretKey": "YOUR-PACHYDERM-AUTH-TOKEN",
        "api": "S3v4",
        "lookup": "auto"
     },
     ```

     Set the access key and secret key to your
     Pachyderm authentication token. If authentication is not enabled
     on the cluster, both parameters must be empty strings.

### Configure the AWS CLI

If you are not using the AWS CLI, skip this section.

If you have not done so already, you need to install and
configure the AWS CLI client on your machine. To configure the AWS CLI,
complete the following steps:

1. Install the AWS CLI for your operating system as described
   in the [AWS documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html).

1. Verify that the AWS CLI is installed:

   ```shell
   $ aws --version
   aws-cli/1.16.204 Python/2.7.16 Darwin/17.7.0 botocore/1.12.194
   ```

1. Configure the AWS CLI:

   ```shell
   $ aws configure
   AWS Access Key ID: YOUR-PACHYDERM-AUTH-TOKEN
   AWS Secret Access Key: YOUR-PACHYDERM-AUTH-TOKEN
   Default region name:
   Default output format [None]:
   ```

   Both the access key and secret key should be set to your
   Pachyderm authentication token. If authentication is not enabled
   on the cluster, both parameters must be empty strings.
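### Configure boto3

If you are scripting against the S3 gateway from Python with boto3, the same
settings apply: point the client at the gateway endpoint and use your
Pachyderm auth token as both keys. The following is a minimal sketch, assuming
the gateway is reachable on `localhost:30600` (for example, through
`pachctl port-forward`) and that `YOUR-PACHYDERM-AUTH-TOKEN` is a placeholder
for your token:

```python
import boto3

# Create an S3 client that talks to the Pachyderm S3 gateway instead of AWS.
# If authentication is not enabled on the cluster, pass empty strings for
# both keys.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:30600",
    aws_access_key_id="YOUR-PACHYDERM-AUTH-TOKEN",
    aws_secret_access_key="YOUR-PACHYDERM-AUTH-TOKEN",
)

# Quick check: list all branch.repo buckets visible through the gateway.
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```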
## Supported Operations

The Pachyderm S3 gateway supports the following operations:

* Create buckets: Creates a repo and branch.
* Delete buckets: Deletes a branch or a repo with all branches.
* List buckets: Lists all branches on all repos as S3 buckets.
* Write objects: Atomically overwrites a file on a branch.
* Remove objects: Atomically removes a file on a branch.
* List objects: Lists the files in the HEAD of a branch.
* Get objects: Gets file contents on a branch.

### List Filesystem Objects

If you have configured your S3 client correctly, you should be
able to see the list of filesystem objects in your Pachyderm
repository by running an S3 client `ls` command.
To list filesystem objects, complete the following steps:

1. Verify that your S3 client can access all of your Pachyderm repositories:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local
     [2019-07-12 15:09:50 PDT]     0B master.train/
     [2019-07-12 14:58:50 PDT]     0B master.pre_process/
     [2019-07-12 14:58:09 PDT]     0B master.split/
     [2019-07-12 14:58:09 PDT]     0B stats.split/
     [2019-07-12 14:36:27 PDT]     0B master.raw_data/
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600 s3 ls
     2019-07-12 15:09:50 master.train
     2019-07-12 14:58:50 master.pre_process
     2019-07-12 14:58:09 master.split
     2019-07-12 14:58:09 stats.split
     2019-07-12 14:36:27 master.raw_data
     ```

1. List the contents of a repository:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local/master.raw_data
     [2019-07-19 12:11:37 PDT] 2.6MiB github_issues_medium.csv
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 ls s3://master.raw_data
     2019-07-26 11:22:23    2685061 github_issues_medium.csv
     ```
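The same listings are available programmatically. For example, a minimal
boto3 sketch, assuming the client configuration shown in the configuration
sections above and the example repositories used on this page:

```python
import boto3

# Client configured for the Pachyderm S3 gateway, as in the sections above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:30600",
    aws_access_key_id="YOUR-PACHYDERM-AUTH-TOKEN",
    aws_secret_access_key="YOUR-PACHYDERM-AUTH-TOKEN",
)

# Each bucket corresponds to a <branch>.<repo> pair.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# List the files in the HEAD of the master branch of the raw_data repo.
resp = s3.list_objects_v2(Bucket="master.raw_data")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```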
### Create an S3 Bucket

You can create an S3 bucket in Pachyderm by using the AWS CLI or
the MinIO client commands.
The S3 bucket that you create is a branch in a repository
in Pachyderm.

To create an S3 bucket, complete the following steps:

1. Use the `mb <host/branch.repo>` command to create a new
   S3 bucket, which is a repository with a branch in Pachyderm.

   * If you are using MinIO, type:

     ```shell
     $ mc mb local/master.test
     Bucket created successfully `local/master.test`.
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 mb s3://master.test
     make_bucket: master.test
     ```

1. Verify that the S3 bucket has been successfully created:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local
     [2019-07-18 13:32:44 PDT]     0B master.test/
     [2019-07-12 15:09:50 PDT]     0B master.train/
     [2019-07-12 14:58:50 PDT]     0B master.pre_process/
     [2019-07-12 14:58:09 PDT]     0B master.split/
     [2019-07-12 14:58:09 PDT]     0B stats.split/
     [2019-07-12 14:36:27 PDT]     0B master.raw_data/
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 ls
     2019-07-26 11:35:28 master.test
     2019-07-12 14:58:50 master.pre_process
     2019-07-12 14:58:09 master.split
     2019-07-12 14:58:09 stats.split
     2019-07-12 14:36:27 master.raw_data
     ```

   * You can also use the `pachctl list repo` command to view the
     list of repositories:

     ```shell
     $ pachctl list repo
     NAME          CREATED           SIZE (MASTER)
     test          About an hour ago 0B
     train         6 days ago        68.57MiB
     pre_process   6 days ago        1.18MiB
     split         6 days ago        1.019MiB
     raw_data      6 days ago        2.561MiB
     ```

     You should see the newly created repository in this list.

### Delete an S3 Bucket

You can delete an S3 bucket in Pachyderm from the AWS CLI or
the MinIO client by running the following command:

* If you are using MinIO, type:

  ```shell
  $ mc rb local/master.test
  Removed `local/master.test` successfully.
  ```

* If you are using AWS, type:

  ```shell
  $ aws --endpoint-url http://localhost:30600/ s3 rb s3://master.test
  remove_bucket: master.test
  ```
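With boto3, creating and removing a bucket (that is, a repo with a branch)
follows the same pattern. A minimal sketch, assuming the client configuration
shown in the configuration sections above:

```python
import boto3

# Client configured for the Pachyderm S3 gateway, as in the sections above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:30600",
    aws_access_key_id="YOUR-PACHYDERM-AUTH-TOKEN",
    aws_secret_access_key="YOUR-PACHYDERM-AUTH-TOKEN",
)

# Create a `test` repo with a `master` branch.
s3.create_bucket(Bucket="master.test")

# Remove the bucket (branch) again.
s3.delete_bucket(Bucket="master.test")
```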
### Upload and Download File Objects

For input repositories at the top of your DAG, you can both add files
to and download files from the repository.
When you add files, Pachyderm automatically overwrites the previous
version of the file if it already exists.

Uploading new files is not supported for output repositories, that is,
repositories that are the output of a pipeline.
Not all the repositories that you see in the results of the `ls` command are
input repositories that can be written to. Some of them might be read-only
output repos. Check your pipeline specification to verify which
repositories are the input repos.

To add a file to a repository, complete the following steps:

1. Run the `cp` command for your S3 client:

   * If you are using MinIO, type:

     ```shell
     $ mc cp test.csv local/master.raw_data/test.csv
     test.csv:    62 B / 62 B  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  100.00% 206 B/s 0s
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 cp test.csv s3://master.raw_data
     upload: ./test.csv to s3://master.raw_data/test.csv
     ```

   These commands add the `test.csv` file to the `master` branch in
   the `raw_data` repository. `raw_data` is an input repository.

1. Check that the file was added:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local/master.raw_data
     [2019-07-19 12:11:37 PDT] 2.6MiB github_issues_medium.csv
     [2019-07-19 12:11:37 PDT]    62B test.csv
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 ls s3://master.raw_data/
     2019-07-19 12:11:37    2685061 github_issues_medium.csv
     2019-07-19 12:11:37         62 test.csv
     ```

1. Download a file to the current directory by running one of the
   following commands:

   * If you are using MinIO, type:

     ```shell
     $ mc cp local/master.raw_data/github_issues_medium.csv .
     ...hub_issues_medium.csv:  2.56 MiB / 2.56 MiB  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  100.00% 1.26 MiB/s 2s
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 cp s3://master.raw_data/test.csv .
     download: s3://master.raw_data/test.csv to ./test.csv
     ```

### Remove a File Object

You can delete a file in the `HEAD` of a Pachyderm branch by using the
MinIO client or the AWS CLI:

1. List the files in the input repository:

   * If you are using MinIO, type:

     ```shell
     $ mc ls local/master.raw_data/
     [2019-07-19 12:11:37 PDT] 2.6MiB github_issues_medium.csv
     [2019-07-19 12:11:37 PDT]    62B test.csv
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 ls s3://master.raw_data
     2019-07-19 12:11:37    2685061 github_issues_medium.csv
     2019-07-19 12:11:37         62 test.csv
     ```

1. Delete a file from the repository:

   * If you are using MinIO, type:

     ```shell
     $ mc rm local/master.raw_data/test.csv
     Removing `local/master.raw_data/test.csv`.
     ```

   * If you are using AWS, type:

     ```shell
     $ aws --endpoint-url http://localhost:30600/ s3 rm s3://master.raw_data/test.csv
     delete: s3://master.raw_data/test.csv
     ```

The same upload, download, and delete operations can also be scripted with
boto3; see the sketch at the end of this page.

## Unsupported operations

Some S3 functionality is not yet supported by Pachyderm.
If you run any of these operations, Pachyderm returns a standard
S3 `NotImplemented` error.

The S3 Gateway does not support the following S3 operations:

* Accelerate
* Analytics
* Object copying. PFS supports this functionality through gRPC.
* CORS configuration
* Encryption
* HTML form uploads
* Inventory
* Legal holds
* Lifecycles
* Logging
* Metrics
* Notifications
* Object locks
* Payment requests
* Policies
* Public access blocks
* Regions
* Replication
* Retention policies
* Tagging
* Torrents
* Website configuration
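## Scripting object operations with boto3

As referenced above, the upload, download, and delete operations can also be
scripted against the gateway with boto3. The following is a minimal sketch,
assuming the client configuration shown in the configuration sections, the
`master.raw_data` input bucket used in the examples, and a local `test.csv`
file. The commented `VersionId` call illustrates the versioning behavior
described at the top of this page, where `<commit-id>` is a placeholder for a
Pachyderm commit ID:

```python
import boto3

# Client configured for the Pachyderm S3 gateway, as in the sections above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:30600",
    aws_access_key_id="YOUR-PACHYDERM-AUTH-TOKEN",
    aws_secret_access_key="YOUR-PACHYDERM-AUTH-TOKEN",
)

# Upload a local file to the HEAD of the master branch of raw_data.
s3.upload_file("test.csv", "master.raw_data", "test.csv")

# Download a file from the HEAD of the branch to the current directory.
s3.download_file("master.raw_data", "github_issues_medium.csv",
                 "github_issues_medium.csv")

# Read a file from a non-HEAD commit by passing the commit ID as the version.
# obj = s3.get_object(Bucket="master.raw_data", Key="test.csv",
#                     VersionId="<commit-id>")

# Remove the uploaded file from the HEAD of the branch.
s3.delete_object(Bucket="master.raw_data", Key="test.csv")
```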