> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

**Note**: This is a Pachyderm pre version 1.4 tutorial. It needs to be updated for the latest versions of Pachyderm.

# Exporting Pachyderm Data with SQL

## This tutorial is incomplete

I've started committing the files used by this tutorial before the full
tutorial is ready so that users can see an outline of how to use command-line
utilities to interact with external systems from Pachyderm (this tutorial uses
the `psql` tool to write data to Amazon Redshift). However, the tutorial isn't
finished, many of the pieces haven't been tested, and this particular example
will soon be obsolete, as Pachyderm will soon provide native support for
writing output to SQL databases from pipelines.

That said, here are some basic notes on writing data to Redshift from Pachyderm:

- Amazon Redshift speaks the PostgreSQL wire protocol, so any Postgres client
  can be used to get data into Redshift. This example uses `psql`.

- Since getting `psql` into a container on its own seemed hard, I used
  `postgres:9.6.1-alpine` as the base container for my pipeline. In addition to
  the entire implementation of PostgreSQL, this container includes a copy of the
  `psql` client.

- Also, since `psql` can only execute SQL queries, I wrote a small Go script
  (in `json_to_sql`) that consumes arbitrary JSON records and outputs SQL
  commands. Since Go binaries are statically linked, it's possible to add
  the compiled binary to the pipeline container image (see `Dockerfile`) and
  run it in the pipeline command (see `transform.stdin` in `pipeline.json`).

- If you actually want to do this, you'll need to build the Docker container
  described by `Dockerfile`. That will look something like:
  ```shell
  $ DOCKER_IMAGE_NAME=msteffenpachyderm/to_sql
  $ cd json_to_sql && go build to_sql.go && cd .. && docker build ./ -t "${DOCKER_IMAGE_NAME}"
  $ docker push "${DOCKER_IMAGE_NAME}"
  ```

- Then, set the `transform.image` field in `pipeline.json` to the Docker
  image you just pushed.

- For `psql` to connect to Redshift, you need to give it your Redshift
  credentials. Fortunately, Pachyderm makes it easy to access [Kubernetes
  secrets](https://kubernetes.io/docs/user-guide/secrets/) from inside pipeline
  containers. You can use this to authenticate with Redshift by:
  - creating a [pgpass
    file](https://www.postgresql.org/docs/9.4/static/libpq-pgpass.html) with
    your Redshift credentials,
  - creating a Kubernetes secret containing that file, and then
  - setting the `PGPASSFILE` environment variable in the pipeline to point to
    the Kubernetes secret (see `pipeline.json` for an outline of how that
    looks; the `chmod` command at the beginning is necessary because `psql`
    won't use a pgpass file whose permissions are too open).

- The Redshift pipeline also needs connection information to find your Redshift
  cluster. See the `REDSHIFT_*` environment variables defined in
  `pipeline.json`.

- Finally, make sure you set up your network ingress/egress rules
  appropriately. EC2 nodes and Redshift clusters can't talk to each other by
  default.
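The JSON-to-SQL conversion described above can be sketched roughly as follows. This is a minimal illustration, not the actual `to_sql.go`; the table name `events`, the `jsonToInsert` helper, and the assumption that each input line is a flat JSON object are all hypothetical:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"sort"
	"strings"
)

// jsonToInsert converts one flat JSON object into an INSERT statement for
// the given table. Keys are sorted so the output is deterministic.
func jsonToInsert(table string, line []byte) (string, error) {
	var record map[string]interface{}
	if err := json.Unmarshal(line, &record); err != nil {
		return "", err
	}
	keys := make([]string, 0, len(record))
	for k := range record {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	cols := make([]string, 0, len(keys))
	vals := make([]string, 0, len(keys))
	for _, k := range keys {
		cols = append(cols, k)
		switch v := record[k].(type) {
		case string:
			// Escape single quotes for SQL string literals.
			vals = append(vals, "'"+strings.ReplaceAll(v, "'", "''")+"'")
		default:
			// Numbers, booleans, and null pass through as-is.
			vals = append(vals, fmt.Sprintf("%v", v))
		}
	}
	return fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s);",
		table, strings.Join(cols, ", "), strings.Join(vals, ", ")), nil
}

func main() {
	// Read one JSON record per line from stdin; write SQL to stdout,
	// which the pipeline can pipe into psql.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		stmt, err := jsonToInsert("events", scanner.Bytes())
		if err != nil {
			fmt.Fprintln(os.Stderr, "skipping bad record:", err)
			continue
		}
		fmt.Println(stmt)
	}
}
```

A real converter would also need to handle nested objects, type mapping, and batching inserts for Redshift performance; this sketch only shows the overall shape of the stdin-to-stdout transform.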
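For reference, a pgpass file is a single line per server in the format `hostname:port:database:username:password`. A sketch of creating one for Redshift might look like this; the cluster endpoint, database, user, and password below are placeholders, and the path `/tmp/pgpass` is illustrative:

```shell
# pgpass format: hostname:port:database:username:password
# All values here are placeholders for your own Redshift cluster.
echo 'examplecluster.abc123.us-west-2.redshift.amazonaws.com:5439:dev:awsuser:example-password' > /tmp/pgpass

# psql refuses to use a pgpass file readable by group/other,
# hence the chmod step mentioned above.
chmod 600 /tmp/pgpass
export PGPASSFILE=/tmp/pgpass
```

In the pipeline, the same file would come from a mounted Kubernetes secret rather than being written inline like this.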