> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Email Sentiment Analysis

## Background

This example connects to an IMAP mail account,
collects all the incoming mail and analyzes it for positive or negative sentiment,
sorting the emails into directories in its output repo with scoring information added to the email header "X-Sentiment-Rating".

It is inspired by the [email sentiment analysis bot](https://github.com/shanglun/SentimentAnalyzer) documented in [this article](https://www.toptal.com/java/email-sentiment-analysis-bot) by Shanglun Wang.

It uses [Python-based VADER](https://github.com/cjhutto/vaderSentiment) from CJ Hutto at Georgia Tech.

## Introduction

In this example, we will connect a spout called imap_spout to an email account using IMAP.
That spout's repo will be the input to a pipeline, sentimentalist, which will score each email's positive, negative, neutral, and compound sentiment,
adding a header to each with a detailed sentiment score and sorting them into two folders,
positive and negative,
in its output repo based on the compound score.

This demo will process emails from an account you configure, moving them from the Inbox to a mailbox called "Processed",
which it will create if it doesn't exist.
The emails will be scored and then sorted.
You'll see them in the sentimentalist output repo, named by their unique identifier from the Inbox,
which ensures the file names don't collide.
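
To make the score-then-sort step concrete, here is a minimal sketch using only the Python standard library.
It is not the code shipped with this example: the function name `label_email`, the header value format, and the threshold of zero on the compound score are all assumptions made for illustration.
The `scores` dictionary mirrors the shape that VADER's `polarity_scores` returns (`neg`, `neu`, `pos`, `compound`).

```python
from email import message_from_bytes
from typing import Dict, Tuple


def label_email(raw: bytes, scores: Dict[str, float]) -> Tuple[str, bytes]:
    """Add a sentiment header and choose an output folder (hypothetical helper)."""
    msg = message_from_bytes(raw)
    # Assumed header format: the real pipeline may encode the scores differently.
    msg["X-Sentiment-Rating"] = "; ".join(
        f"{key}={scores[key]}" for key in ("neg", "neu", "pos", "compound")
    )
    # Assumed threshold: a compound score of zero or above counts as positive.
    folder = "positive" if scores["compound"] >= 0 else "negative"
    return folder, msg.as_bytes()
```

In a real pipeline the returned bytes would be written under `/pfs/out/<folder>/`, and `scores` would come from the sentiment engine rather than being passed in by hand.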

## Setup

This guide assumes that you already have a Pachyderm cluster running and have configured `pachctl` to talk to the cluster and `kubectl` to talk to Kubernetes.
[Installation instructions can be found here](http://pachyderm.readthedocs.io/en/stable/getting_started/local_installation.html).

1. Create an email account you want to use.
   Keep the email address (which is usually the account name) and the password handy.

1. Enable IMAP on that account.
   In Gmail, click the gear for "settings" and then click "Forwarding and POP/IMAP" to get to the IMAP settings.
   In this example, we're assuming you're using Gmail.
   Look in the source code for [./imap_spout.py](imap_spout.py) for environment variables you may need to add to the pipeline spec for the spout to use another email service or other default IMAP folders.

1. The next few steps show you how to add a secret with the following two keys:

   * `IMAP_LOGIN`
   * `IMAP_PASSWORD`

   First, we'll save some values to files.
   The values `<account-name>` and `<your-password>` are enclosed in single quotes to prevent the shell from interpreting them.

   ```shell
   $ echo -n '<account-name>' > IMAP_LOGIN ; chmod 600 IMAP_LOGIN
   $ echo -n '<your-password>' > IMAP_PASSWORD ; chmod 600 IMAP_PASSWORD
   ```

1. Confirm the values in these files are what you expect.

   ```shell
   $ cat IMAP_LOGIN
   $ cat IMAP_PASSWORD
   ```

   The output from those two commands should be `<account-name>` and `<your-password>`, respectively.

   Creating the secret will require different steps,
   depending on whether you have Kubernetes access or not.
   Pachyderm Hub users don't have access to Kubernetes.
   If you have Kubernetes access
   and want to use `kubectl` to create secrets,
   follow the two steps prefixed with "(Kubernetes)".
   If you don't have access to Kubernetes
   or want to use `pachctl` to create secrets,
   follow the three steps labeled "(Pachyderm Hub)".

1. (Kubernetes) If you have direct access to the Kubernetes cluster, you can create a secret using `kubectl`.

   ```shell
   $ kubectl create secret generic imap-credentials --from-file=./IMAP_LOGIN --from-file=./IMAP_PASSWORD
   ```

1. (Kubernetes) Confirm that the secrets got set correctly.
   Use `kubectl get secret` to output the secrets, and then decode them using `jq` to confirm they're correct.

   ```shell
   $ kubectl get secret imap-credentials -o json | jq '.data | map_values(@base64d)'
   {
     "IMAP_LOGIN": "<account-name>",
     "IMAP_PASSWORD": "<your-password>"
   }
   ```

   You will have to use `pachctl` if you're using Pachyderm Hub
   or don't have access to the Kubernetes cluster.
   The next three steps show how to do that.

1. (Pachyderm Hub) Create a secrets file from the provided template.
   The command substitutions are double-quoted so that a login or password containing spaces or shell metacharacters is passed to `jq` intact.

   ```shell
   $ jq -n --arg IMAP_LOGIN "$(cat IMAP_LOGIN)" --arg IMAP_PASSWORD "$(cat IMAP_PASSWORD)" \
       -f imap-credentials-template.jq > imap-credentials-secret.json
   $ chmod 600 imap-credentials-secret.json
   ```

1. (Pachyderm Hub) Confirm the secrets file is correct by decoding the values.

   ```shell
   $ jq '.data | map_values(@base64d)' imap-credentials-secret.json
   {
     "IMAP_LOGIN": "<account-name>",
     "IMAP_PASSWORD": "<your-password>"
   }
   ```

1. (Pachyderm Hub) Generate a secret using `pachctl`.

   ```shell
   $ pachctl create secret -f imap-credentials-secret.json
   ```

1. Build the docker image for the imap_spout.
   Put your own docker account name in for `<docker-account-name>`.
   There is a prebuilt image in the Pachyderm DockerHub registry account, if you want to use it.

   ```shell
   $ docker login
   $ docker build -t <docker-account-name>/imap_spout:1.11 -f ./Dockerfile.imap_spout .
   $ docker push <docker-account-name>/imap_spout:1.11
   ```

1. Build the docker image for the sentimentalist.
   Put your own docker account name in for `<docker-account-name>`.
   There is a prebuilt image in the Pachyderm DockerHub registry account, if you want to use it.

   ```shell
   $ docker build -t <docker-account-name>/sentimentalist:1.11 -f ./Dockerfile.sentimentalist .
   $ docker push <docker-account-name>/sentimentalist:1.11
   ```

1. Edit the pipeline definition files to refer to your own docker repo,
   if you don't use the prebuilt images.

   ```shell
   $ sed 's/pachyderm/<docker-account-name>/g' < sentimentalist.json > my_sentimentalist.json
   $ sed 's/pachyderm/<docker-account-name>/g' < imap_spout.json > my_imap_spout.json
   ```

1. Confirm the pipeline definition files are correct:
   the image names should now refer to your docker account.

1. Create the pipelines.

   ```shell
   $ pachctl create pipeline -f my_imap_spout.json
   $ pachctl create pipeline -f my_sentimentalist.json
   ```

1. Start sending plain-text emails to the account you created.
   Every few seconds, the imap_spout pipeline will fetch emails from that account via IMAP and send them to its output repo,
   where the sentimentalist pipeline will score them as positive or negative and sort them into directories in its own output repo accordingly.
   Have fun!
   Try tricking the VADER sentiment engine with vague and ironic statements.
   Try emojis!

## Pipelines

### imap_spout

The imap_spout pipeline is an implementation of a [Pachyderm spout](http://docs.pachyderm.com/en/latest/fundamentals/spouts.html) in Python.
It's configurable with environment variables that can be populated by [Kubernetes secrets](https://kubernetes.io/docs/concepts/configuration/secret/).

The spout connects to an IMAP account via SSL,
creates a "Processed" mailbox for storing already-scored emails,
and every five seconds checks for new emails.

It then puts each email as a separate file in the spout's output repo.

A couple of things to note, to expand on the [Pachyderm spout](http://docs.pachyderm.com/en/latest/fundamentals/spouts.html) documentation.

1. Look in the source code for [./imap_spout.py](imap_spout.py) for environment variables you may need to add to the pipeline spec for the spout to use another email service or other default IMAP folders.
1. The function `open_pipe` opens `/pfs/out`,
   the named pipe that's the gateway to the spout's output repo.
   Note that it must open that pipe as _write only_ and in _binary_ mode.
   If you open it in text mode instead, you're likely to see errors like `TypeError: a bytes-like object is required, not 'str'` in your `pachctl logs` for the pipeline.
1. The files are not written directly to `/pfs/out`;
   they're written as part of a `tarfile` object.
   Pachyderm uses the Unix `tar` format to ensure that multiple files can be written to `/pfs/out` and appear correctly in the output repo of your spout.
1. In Python, the `tarfile.open()` call must use the `mode="w|"` argument,
   along with the named pipe's file object,
   to ensure that the `tarfile` object won't try to `seek` on the named pipe `/pfs/out`.
   If you forget this argument, you're likely to see errors like `file stream is not seekable` in your `pachctl logs` for the pipeline.
1. Every time you `close()` the `tarfile` object, it's a commit.
1. Note that `open_pipe` backs off and retries opening `/pfs/out` if any errors happen.
   Sometimes it'll take the spout a little bit of time to reopen `/pfs/out` after our code closes it for a commit;
   the backoff is insurance.
1. It saves each email in a file with the `mbox` extension, which is the standard extension for Unix emails.
   `eml` is also commonly used, but is a slightly different format than what we use here.
   Each `mbox` file contains one email.
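
The notes above can be sketched as follows.
This is a simplified illustration and not a copy of `imap_spout.py`: the `commit_emails` helper, the retry count, the delay, and the dictionary-of-bytes interface are assumptions made for the example.
It shows the three points that matter: the pipe is opened write-only in binary mode, `tarfile` is opened in streaming mode (`"w|"`) so it never seeks, and closing the tar stream ends one commit.

```python
import io
import tarfile
import time


def open_pipe(path="/pfs/out", retries=5, delay=1.0):
    """Open the spout's output pipe write-only in binary mode, backing off on errors."""
    for attempt in range(retries):
        try:
            return open(path, "wb")
        except OSError:
            # Assumed policy: exponential backoff; the real spout may differ.
            time.sleep(delay * (2 ** attempt))
    raise RuntimeError(f"could not open {path} after {retries} attempts")


def commit_emails(emails, path="/pfs/out"):
    """Write a {filename: bytes} mapping as one tar stream (hypothetical helper)."""
    with open_pipe(path) as pipe:
        # "w|" selects uncompressed streaming mode, so tarfile never calls seek().
        with tarfile.open(fileobj=pipe, mode="w|") as tar:
            for name, body in emails.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(body)
                tar.addfile(info, io.BytesIO(body))
        # Leaving the inner "with" block closes the tar stream, which ends the commit.
```

Because nothing in the sketch depends on the target being a named pipe, you can point `path` at an ordinary file to inspect the tar stream locally before running it in a cluster.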

### sentimentalist

Sentimentalist is a thin wrapper around the [Python-based VADER](https://github.com/cjhutto/vaderSentiment) from CJ Hutto at Georgia Tech.

It looks in its input repo for individual email files, loads them into a Python email object, and extracts the body and subject as plain text for scoring.

It uses the "compound" score to sort the emails into different directories, and adds a header to each email with detailed scoring information for use by subsequent pipelines.

## Citations

```
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
```