> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

**Note**: This is a Pachyderm pre version 1.4 tutorial. It needs to be updated for the latest versions of Pachyderm.

# Exporting Pachyderm Data with SQL

## This tutorial is incomplete

I've started committing the files used by this tutorial before the full
tutorial is ready so that users can see an outline of how to use command-line
utilities to interact with external systems from Pachyderm (this tutorial uses
the `psql` tool to write data to Amazon Redshift). However, the tutorial isn't
finished, many of the pieces haven't been tested, and this particular example
will soon be obsolete, as Pachyderm will soon provide native support for
writing output to SQL databases from pipelines.

That said, here are some basic notes on writing data to Redshift from Pachyderm:

- Amazon Redshift speaks the PostgreSQL wire protocol, so any Postgres client
  can be used to get data into Redshift. This example uses `psql`.

- Since getting `psql` into a container on its own seemed hard, I used
  `postgres:9.6.1-alpine` as the base container for my pipeline. In addition to
  the entire implementation of PostgreSQL, this container includes a copy of the
  `psql` client.

- Also, since `psql` can only execute SQL queries, I wrote a small Go script
  (in `json_to_sql`) that consumes arbitrary JSON records and outputs SQL
  commands. Since Go binaries are statically linked, it's possible to add
  the compiled binary to the pipeline container image (see `Dockerfile`) and
  run it in the pipeline command (see `transform.stdin` in `pipeline.json`).

- If you actually want to do this, you'll need to build the Docker container
  described by `Dockerfile`. That will look something like:
  ```shell
  $ DOCKER_IMAGE_NAME=msteffenpachyderm/to_sql
  $ cd json_to_sql && go build to_sql.go && cd .. && docker build ./ -t "${DOCKER_IMAGE_NAME}"
  $ docker push "${DOCKER_IMAGE_NAME}"
  ```

- Then, set the `transform.image` field in `pipeline.json` to the Docker
  image you just pushed.

- For `psql` to connect to Redshift, you need to give it your Redshift
  credentials. Fortunately, Pachyderm makes it easy to access [Kubernetes
  secrets](https://kubernetes.io/docs/user-guide/secrets/) from inside pipeline
  containers. You can use this to authenticate with Redshift by:
  - creating a [pgpass
    file](https://www.postgresql.org/docs/9.4/static/libpq-pgpass.html) with
    your Redshift credentials,
  - creating a Kubernetes secret containing that file, and then
  - setting the `PGPASSFILE` environment variable in the pipeline to point to
    the Kubernetes secret (see `pipeline.json` for an outline of how that
    looks; the `chmod` command at the beginning is necessary because `psql`
    won't use a pgpass file whose permissions are too open).

- The Redshift pipeline also needs connection information to find your Redshift
  cluster. See the `REDSHIFT_*` environment variables defined in
  `pipeline.json`.

- Finally, make sure you set up your network ingress/egress rules
  appropriately. EC2 nodes and Redshift clusters can't talk to each other by
  default.
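The JSON-to-SQL conversion described above can be sketched roughly as follows. This is a minimal illustration, not the actual `to_sql.go`; the table name `events`, the `jsonToInsert` helper, and the assumption that each input line is a flat JSON object are all hypothetical:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"sort"
	"strings"
)

// jsonToInsert converts one flat JSON object into an INSERT statement for
// the given table. Keys are sorted so the output is deterministic.
func jsonToInsert(table string, line []byte) (string, error) {
	var record map[string]interface{}
	if err := json.Unmarshal(line, &record); err != nil {
		return "", err
	}
	keys := make([]string, 0, len(record))
	for k := range record {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	cols := make([]string, 0, len(keys))
	vals := make([]string, 0, len(keys))
	for _, k := range keys {
		cols = append(cols, k)
		switch v := record[k].(type) {
		case string:
			// Escape single quotes for SQL string literals.
			vals = append(vals, "'"+strings.ReplaceAll(v, "'", "''")+"'")
		default:
			// Numbers, booleans, and null pass through as-is.
			vals = append(vals, fmt.Sprintf("%v", v))
		}
	}
	return fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s);",
		table, strings.Join(cols, ", "), strings.Join(vals, ", ")), nil
}

func main() {
	// Read one JSON record per line from stdin; write SQL to stdout,
	// which the pipeline can pipe into psql.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		stmt, err := jsonToInsert("events", scanner.Bytes())
		if err != nil {
			fmt.Fprintln(os.Stderr, "skipping bad record:", err)
			continue
		}
		fmt.Println(stmt)
	}
}
```

A real converter would also need to handle nested objects, type mapping, and batching inserts for Redshift performance; this sketch only shows the overall shape of the stdin-to-stdout transform.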
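For reference, a pgpass file is a single line per server in the format `hostname:port:database:username:password`. A sketch of creating one for Redshift might look like this; the cluster endpoint, database, user, and password below are placeholders, and the path `/tmp/pgpass` is illustrative:

```shell
# pgpass format: hostname:port:database:username:password
# All values here are placeholders for your own Redshift cluster.
echo 'examplecluster.abc123.us-west-2.redshift.amazonaws.com:5439:dev:awsuser:example-password' > /tmp/pgpass

# psql refuses to use a pgpass file readable by group/other,
# hence the chmod step mentioned above.
chmod 600 /tmp/pgpass
export PGPASSFILE=/tmp/pgpass
```

In the pipeline, the same file would come from a mounted Kubernetes secret rather than being written inline like this.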