>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our pre-2.0 and post-2.0 examples are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# Amazon SQS S3 Spout

This example describes how to create a simple spout
that listens for "object added" notifications on an
Amazon™ Simple Queue Service (SQS) queue, grabs the
files, and places them into a Pachyderm repository.

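Under the hood, a Pachyderm 1.x spout exposes `/pfs/out` inside the pipeline
container as a named pipe: the spout code writes a tar stream to it, and each
time the pipe is closed, Pachyderm turns the archive's contents into a commit
in the output repository. As a rough sketch of that mechanism only (the sample
Python script does the equivalent programmatically), committing a downloaded
file looks like this:

```shell
# Inside a Pachyderm 1.x spout container, /pfs/out is a named pipe.
# Writing a tar archive to it and closing it produces one output commit.
$ tar -cf /pfs/out 01-pipeline.png
```
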
## Prerequisites

You must have the following configured in your environment to
run this example:

* An AWS account
* Pachyderm 1.9.5 or later

## Configure AWS Prerequisites

Before you can run this spout, you need to configure
an S3 bucket, a Simple Notification Service (SNS) topic,
and an SQS queue in your AWS account.

Complete the following steps:

1. Create an S3 bucket.
2. Create an SNS topic and an SQS queue as described in
the [Amazon Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/ways-to-add-notification-config-to-bucket.html).
3. In your S3 bucket, add an event notification:

   1. Select your S3 bucket.
   2. Go to **Properties**.
   3. Click **Events > Add notification**.
   4. Select **All object create events**.
   5. In **Send to**, select **SQS Queue** and pick your
   SQS queue from the dropdown list.

4. Test that the SNS topic and SQS queue are working by adding a test
   file to your S3 bucket. If you subscribed an email endpoint to the
   SNS topic, you should receive an email notification about the new
   object; otherwise, the notification appears as a message in the SQS
   queue. You can also check from the command line, as shown below.
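
   Assuming you have the AWS CLI configured, a quick end-to-end check
   (`<your-bucket>` and `<your-queue-url>` are the resources you just
   created):

   ```shell
   # Upload a test object; S3 should emit an event notification.
   $ aws s3 cp ./test.txt s3://<your-bucket>/
   # Poll the queue to confirm that the notification arrived.
   $ aws sqs receive-message --queue-url <your-queue-url>
   ```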

## Create a Spout

Use [the SQS example pipeline specification](sqs-spout.json)
and [the sample Python script](sqs-spout.py)
to create a spout pipeline:

1. Clone the Pachyderm repository:

   ```shell
   $ git clone git@github.com:pachyderm/pachyderm.git
   ```
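
   This example lives in `examples/spouts/SQS-S3`; change into that
   directory so that the relative paths in the following steps resolve:

   ```shell
   $ cd pachyderm/examples/spouts/SQS-S3
   ```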

1. Add the following environment variables to `sqs-spout.json`:

   * `AWS_REGION`
   * `OUTPUT_PIPE`
   * `S3_BUCKET`
   * `SQS_QUEUE_URL`
   * `VERBOSE_LOGGING`

   For more information, see [Pipeline Environment Parameters](#pipeline-environment-parameters).
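
   You can verify what you set with `jq`. This assumes the variables live
   under `transform.env` in the specification; the values below are
   illustrative placeholders:

   ```shell
   $ jq '.transform.env' sqs-spout.json
   {
     "AWS_REGION": "us-east-1",
     "OUTPUT_PIPE": "/pfs/out",
     "S3_BUCKET": "s3://bucket-name/",
     "SQS_QUEUE_URL": "https://sqs.us-east-1.amazonaws.com/ID/Name",
     "VERBOSE_LOGGING": "false"
   }
   ```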

1. Add a secret with the following two keys:

   * `AWS_ACCESS_KEY_ID`
   * `AWS_SECRET_ACCESS_KEY`

   Replace `<account-name>` and `<your-password>` below with your AWS
   access key ID and secret access key. The values are enclosed in
   single quotes to prevent the shell from interpreting them.

   ```shell
   $ echo -n '<account-name>' > AWS_ACCESS_KEY_ID ; chmod 600 AWS_ACCESS_KEY_ID
   $ echo -n '<your-password>' > AWS_SECRET_ACCESS_KEY ; chmod 600 AWS_SECRET_ACCESS_KEY
   ```

1. Confirm that the values in these files are what you expect:

   ```shell
   $ cat AWS_ACCESS_KEY_ID
   $ cat AWS_SECRET_ACCESS_KEY
   ```

   The output from those two commands should be `<account-name>` and `<your-password>`, respectively.

   Creating the secret requires different steps depending on whether
   you have direct access to the Kubernetes cluster. Pachyderm Hub
   users do not have access to Kubernetes. If you have Kubernetes
   access, follow the two steps labeled "(Kubernetes)". If you do not,
   follow the three steps labeled "(Pachyderm Hub)".

1. (Kubernetes) If you have direct access to the Kubernetes cluster, create the secret with `kubectl`:

   ```shell
   $ kubectl create secret generic aws-credentials --from-file=./AWS_ACCESS_KEY_ID --from-file=./AWS_SECRET_ACCESS_KEY
   ```

1. (Kubernetes) Confirm that the secret was created correctly.
   Use `kubectl get secret` to output the secret and `jq` to decode the values:

   ```shell
   $ kubectl get secret aws-credentials -o json | jq '.data | map_values(@base64d)'
   {
       "AWS_ACCESS_KEY_ID": "<account-name>",
       "AWS_SECRET_ACCESS_KEY": "<your-password>"
   }
   ```

   If you are using Pachyderm Hub or do not have access to the
   Kubernetes cluster, use `pachctl` instead. The next three steps
   show how.

1. (Pachyderm Hub) Create a secrets file from the provided template:

   ```shell
   $ jq -n --arg AWS_ACCESS_KEY_ID "$(cat AWS_ACCESS_KEY_ID)" --arg AWS_SECRET_ACCESS_KEY "$(cat AWS_SECRET_ACCESS_KEY)" \
         -f aws-credentials-template.jq > aws-credentials-secret.json
   $ chmod 600 aws-credentials-secret.json
   ```

1. (Pachyderm Hub) Confirm the secrets file is correct by decoding the values:

   ```shell
   $ jq '.data | map_values(@base64d)' aws-credentials-secret.json
   {
       "AWS_ACCESS_KEY_ID": "<account-name>",
       "AWS_SECRET_ACCESS_KEY": "<your-password>"
   }
   ```

1. (Pachyderm Hub) Generate the secret by using `pachctl`:

   ```shell
   $ pachctl create secret -f aws-credentials-secret.json
   ```
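
   However you created it, the pipeline consumes the secret through the
   `transform.secrets` field of the specification, which maps each secret
   key to an environment variable. A sketch of what that stanza typically
   looks like (check `sqs-spout.json` for the exact layout):

   ```shell
   $ jq '.transform.secrets' sqs-spout.json
   [
     {
       "name": "aws-credentials",
       "env_var": "AWS_ACCESS_KEY_ID",
       "key": "AWS_ACCESS_KEY_ID"
     },
     {
       "name": "aws-credentials",
       "env_var": "AWS_SECRET_ACCESS_KEY",
       "key": "AWS_SECRET_ACCESS_KEY"
     }
   ]
   ```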

1. Create a pipeline from `sqs-spout.json`:

   ```shell
   $ pachctl create pipeline -f sqs-spout.json
   ```

1. Verify that the pipeline was created:

   ```shell
   $ pachctl list pipeline
   NAME       VERSION INPUT    CREATED        STATE / LAST JOB
   sqs-spout  1       none     2 minutes ago  running / starting
   ```

   You should also see that an output repository was created for your
   spout pipeline:

   ```shell
   $ pachctl list repo
   NAME       CREATED       SIZE
   sqs-spout  2 minutes ago 0B
   ```
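
   If the pipeline ends up in a failure state instead, the worker logs
   usually explain why (for example, bad credentials or a wrong queue URL):

   ```shell
   $ pachctl logs --pipeline=sqs-spout
   ```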

## Run the Spout

After you create an SQS spout, you can test it by uploading a file
into your S3 bucket and later finding it in the
SQS pipeline output repository.

To test the spout, complete the following steps:

1. In the AWS Management Console, go to S3 and find your bucket.

1. Upload a file into your bucket. For example, `01-pipeline.png`. Depending
on the size of the file, the upload might take some time.
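
   If you prefer the command line, the same upload with the AWS CLI
   (the bucket name is a placeholder):

   ```shell
   $ aws s3 cp ./01-pipeline.png s3://<your-bucket>/
   ```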

1. In your terminal, run:

   ```shell
   $ pachctl list commit sqs-spout
   REPO      BRANCH COMMIT                           PARENT    STARTED        DURATION           SIZE
   sqs-spout master 4ecc933d523d485b8a9cce6b1feeac95 none      6 minutes ago  Less than a second 37.44KiB
   ```

1. Verify that the file that you uploaded to the S3 bucket is
in the `sqs-spout` output repository. Example:

   ```shell
   $ pachctl list file sqs-spout@master
   NAME             TYPE SIZE
   /01-pipeline.png file 37.44KiB
   ```
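
   To double-check that the contents made it through intact, read the file
   back out of the repository and compare it with the original:

   ```shell
   $ pachctl get file sqs-spout@master:/01-pipeline.png > roundtrip.png
   $ diff 01-pipeline.png roundtrip.png && echo "files match"
   ```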

## Pipeline Environment Parameters

This table describes the parameters that you can set for the spout,
either as environment variables in the pipeline specification or as
command-line arguments to the script.

| Optional Parameter  | Description   |
| ------------------- | ------------- |
| `-i AWS_ACCESS_KEY_ID`, `--aws_access_key_id AWS_ACCESS_KEY_ID` | An AWS access key ID for accessing the SQS queue and the bucket. Overrides env var `AWS_ACCESS_KEY_ID`. The default value is `user-id`. You can view your AWS credentials in your AWS Management Console or, if you have set up the AWS CLI, in the `~/.aws/config` file. |
| `-k AWS_SECRET_ACCESS_KEY`, `--aws_secret_access_key AWS_SECRET_ACCESS_KEY` | An AWS secret access key for accessing the SQS queue and the bucket. Overrides env var `AWS_SECRET_ACCESS_KEY`. The default value is `secret-key`. You can view your AWS credentials in your AWS Management Console or, if you have set up the AWS CLI, in the `~/.aws/config` file. |
| `-r AWS_REGION`, `--aws_region AWS_REGION` | An AWS region. Overrides env var `AWS_REGION`. The default value is `us-east-1`. |
| `-o OUTPUT_PIPE`, `--output_pipe OUTPUT_PIPE` | The named pipe to which the tar stream containing the files is written. Overrides env var `OUTPUT_PIPE`. The default value is `/pfs/out`. |
| `-b S3_BUCKET`, `--s3_bucket S3_BUCKET` | The URL of the S3 bucket from which the files are fetched. Overrides env var `S3_BUCKET`. The default value is `s3://bucket-name/`. |
| `-q SQS_QUEUE_URL`, `--sqs_queue_url SQS_QUEUE_URL` | The URL of the SQS queue that receives the bucket notifications. Overrides env var `SQS_QUEUE_URL`. The default value is `https://sqs.us-west-1.amazonaws.com/ID/Name`. |
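
These flags mirror the environment variables above, which makes it possible to
debug the script outside of Pachyderm. A hypothetical local run, assuming
Python 3 and `boto3` are installed and a regular file stands in for the output
pipe; all values are placeholders:

```shell
# Run the spout script locally, writing the tar stream to a file instead of /pfs/out.
$ python3 sqs-spout.py \
    --aws_region us-east-1 \
    --s3_bucket s3://<your-bucket>/ \
    --sqs_queue_url https://sqs.us-east-1.amazonaws.com/<ID>/<Name> \
    --output_pipe ./spout-out.tar
```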