     1  >![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
     2  > - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
     3  > - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
     4  
     5  # Email Sentiment Analysis
     6  
     7  ## Background
     8  
     9  This example connects to an IMAP mail account, 
    10  collects all the incoming mail and analyzes it for positive or negative sentiment,
    11  sorting the emails into directories in its output repo with scoring information added to the email header "X-Sentiment-Rating".
    12  
It is inspired by the [email sentiment analysis bot](https://github.com/shanglun/SentimentAnalyzer) documented in [this article](https://www.toptal.com/java/email-sentiment-analysis-bot) by Shanglun Wang.
    14  
    15  It uses [Python-based VADER](https://github.com/cjhutto/vaderSentiment) from CJ Hutto at Georgia Tech.
    16  
    17  ## Introduction
In this example, we will connect a spout called imap_spout to an email account using IMAP.
That spout's repo will be the input to a pipeline, sentimentalist, which will score each email's positive, negative, neutral, and compound sentiment.
It adds a header with the detailed sentiment score to each email and,
based on the compound score,
sorts the emails into two folders, positive and negative, in its output repo.
    23  
This demo will process emails from an account you configure, moving them from the Inbox to a mailbox called "Processed",
which it will create if it doesn't exist.
The emails will be scored and then sorted.
You'll see them in the sentimentalist output repo, named by their unique identifier from the Inbox,
which keeps the file names from colliding.
    29  
    30  ## Setup
    31  
    32  This guide assumes that you already have a Pachyderm cluster running and have configured `pachctl` to talk to the cluster and `kubectl` to talk to Kubernetes.
    33  [Installation instructions can be found here](http://pachyderm.readthedocs.io/en/stable/getting_started/local_installation.html).
    34  
    35  1. Create an email account you want to use.  
   Keep the email address (which is usually the account name) and the password handy.
    37  
    38  1. Enable IMAP on that account. 
    39     In Gmail, click the gear for "settings" and then click "Forwarding and POP/IMAP" to get to the IMAP settings. 
    40     In this example, we're assuming you're using Gmail.
    41     Look in the source code for [./imap_spout.py](imap_spout.py) for environment variables you may need to add to the pipeline spec for the spout to use another email service or other default IMAP folders.
    42  
1. The next few steps show you how to add a secret with the following two keys:
    44  
    45     * `IMAP_LOGIN`
    46     * `IMAP_PASSWORD`
    47  
    48     First, we'll save some values to files. 
   The values `<account-name>` and `<your-password>` are enclosed in single quotes to prevent the shell from interpreting them.
    50     
    51     ```shell
    52     $ echo -n '<account-name>' > IMAP_LOGIN ; chmod 600 IMAP_LOGIN
    53     $ echo -n '<your-password>' > IMAP_PASSWORD ; chmod 600 IMAP_PASSWORD
    54     ```
    55     
    56  1. Confirm the values in these files are what you expect.
    57  
    58     ```shell
    59     $ cat IMAP_LOGIN
    60     $ cat IMAP_PASSWORD
    61     ```
    62     
    63     The output from those two commands should be `<account-name>` and `<your-password>`, respectively.
    64     
   Creating the secret requires different steps,
   depending on whether you have Kubernetes access.
   Pachyderm Hub users don't have access to Kubernetes.
   If you have Kubernetes access
   and want to use `kubectl` to create secrets,
   follow the two steps labeled "(Kubernetes)".
   If you don't have access to Kubernetes
   or want to use `pachctl` to create secrets,
   follow the three steps labeled "(Pachyderm Hub)".
    74  
    75  1. (Kubernetes) If you have direct access to the Kubernetes cluster, you can create a secret using `kubectl`.
    76     
    77     ```shell
    78     $ kubectl create secret generic imap-credentials --from-file=./IMAP_LOGIN --from-file=./IMAP_PASSWORD
    79     ```
    80     
    81  1. (Kubernetes) Confirm that the secrets got set correctly.
    82     You use `kubectl get secret` to output the secrets, and then decode them using `jq` to confirm they're correct.
    83     
    84     ```shell
    85     $ kubectl get secret imap-credentials -o json | jq '.data | map_values(@base64d)'
    86     {
    87         "IMAP_LOGIN": "<account-name>",
    88         "IMAP_PASSWORD": "<your-password>"
    89     }
    90     ```
    91  
   You will have to use `pachctl` if you're using Pachyderm Hub
   or don't have access to the Kubernetes cluster.
   The next three steps show how to do that.
    95  
    96  1. (Pachyderm Hub) Create a secrets file from the provided template.
    97     
    98     ```shell
   $ jq -n --arg IMAP_LOGIN "$(cat IMAP_LOGIN)" --arg IMAP_PASSWORD "$(cat IMAP_PASSWORD)" \
         -f imap-credentials-template.jq > imap-credentials-secret.json
   101     $ chmod 600 imap-credentials-secret.json
   102     ```
   103  
   104  1. (Pachyderm Hub) Confirm the secrets file is correct by decoding the values.
   105     
   106     ```shell
   107     $ jq '.data | map_values(@base64d)' imap-credentials-secret.json
   108     {
   109         "IMAP_LOGIN": "<account-name>",
   110         "IMAP_PASSWORD": "<your-password>"
   111     }
   112     ```
   113  
1. (Pachyderm Hub) Generate a secret using `pachctl`.
   115  
   116     ```shell
   117     $ pachctl create secret -f imap-credentials-secret.json
   118     ```
   119  
1. Build the Docker image for the imap_spout.
   Substitute your own Docker account name for `<docker-account-name>`.
   There is a prebuilt image in the Pachyderm DockerHub registry account, if you want to use it.
   123     
   124     ```shell
   125     $ docker login
   126     $ docker build -t <docker-account-name>/imap_spout:1.11 -f ./Dockerfile.imap_spout .
   127     $ docker push <docker-account-name>/imap_spout:1.11
   128     ```
   129     
1. Build the Docker image for the sentimentalist.
   Substitute your own Docker account name for `<docker-account-name>`.
   There is a prebuilt image in the Pachyderm DockerHub registry account, if you want to use it.
   133     
   134     ```shell
   135     $ docker build -t <docker-account-name>/sentimentalist:1.11 -f ./Dockerfile.sentimentalist .
   136     $ docker push <docker-account-name>/sentimentalist:1.11
   137     ```
   138     
1. Edit the pipeline definition files to refer to your own Docker repo,
   if you don't use the prebuilt images.
   141     
   142     ```shell
   $ sed 's/pachyderm/<docker-account-name>/g' < sentimentalist.json > my_sentimentalist.json
   $ sed 's/pachyderm/<docker-account-name>/g' < imap_spout.json > my_imap_spout.json
   145     ```
   146     
   147  1. Confirm the pipeline definition files are correct.
   148  
1. Create the pipelines.

   ```shell
   $ pachctl create pipeline -f my_imap_spout.json
   $ pachctl create pipeline -f my_sentimentalist.json
   154     ```
   155     
   156  1. Start sending plain-text emails to the account you created. 
   Every few seconds, the imap_spout pipeline will fetch emails from that account via IMAP and send them to its output repo,
   where the sentimentalist pipeline will score them as positive or negative and sort them into directories in its output repo accordingly.
   159     Have fun! 
   160     Try tricking the VADER sentiment engine with vague and ironic statements.
   161     Try emojis!
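
For readers who want to see what the secret from the earlier steps contains, here is a minimal Python sketch. It assumes the secret is a standard Kubernetes `Secret` manifest named `imap-credentials` whose `data` values are base64-encoded, which matches the decoded output shown above. The contents of `imap-credentials-template.jq` are not reproduced here, and the helper names `build_secret` and `decode_secret` are hypothetical.

```python
import base64
import json


def build_secret(name, values):
    """Return a Kubernetes Secret manifest with base64-encoded values."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": data,
    }


def decode_secret(manifest):
    """Decode the data section, like jq's map_values(@base64d)."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in manifest["data"].items()
    }


if __name__ == "__main__":
    secret = build_secret(
        "imap-credentials",
        {"IMAP_LOGIN": "<account-name>", "IMAP_PASSWORD": "<your-password>"},
    )
    print(json.dumps(secret, indent=4))
    print(decode_secret(secret))
```

Decoding the manifest should give back exactly the values you stored in the `IMAP_LOGIN` and `IMAP_PASSWORD` files.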
   162  
   163  ## Pipelines
   164  
   165  ### imap_spout
   166  
   167  The imap_spout pipeline is an implementation of a [Pachyderm spout](http://docs.pachyderm.com/en/latest/fundamentals/spouts.html) in Python. 
   168  It's configurable with environment variables that can be populated by [Kubernetes secrets](https://kubernetes.io/docs/concepts/configuration/secret/).
   169  
   170  The spout connects to an IMAP account via SSL, 
   171  creates a "Processed" mailbox for storing already-scored emails, 
   172  and every five seconds checks for new emails.
   173  
   174  It then puts each email as a separate file in the spout's output repo.
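
The polling behavior described above can be sketched with Python's standard `imaplib` module. This is an illustration under stated assumptions, not the code in `imap_spout.py`: the function names `fetch_and_archive` and `poll_forever`, the `handle` callback, and the choice to fetch every message in the Inbox are all hypothetical.

```python
import imaplib
import time


def fetch_and_archive(conn, processed="Processed"):
    """Fetch every message in the Inbox, then move it to `processed`.

    `conn` is a logged-in imaplib.IMAP4_SSL connection (or any object with
    the same create/select/uid/expunge methods). Returns {uid: raw_bytes}.
    """
    # If the mailbox already exists the server answers NO; that is ignored.
    conn.create(processed)
    conn.select("INBOX")
    _, data = conn.uid("search", None, "ALL")
    emails = {}
    for uid in data[0].split():
        _, parts = conn.uid("fetch", uid, "(RFC822)")
        # The message arrives as an (envelope, payload) tuple.
        raw = next(p[1] for p in parts if isinstance(p, tuple))
        emails[uid.decode()] = raw
        # IMAP has no basic "move": copy, flag as deleted, then expunge.
        conn.uid("copy", uid, processed)
        conn.uid("store", uid, "+FLAGS", "(\\Deleted)")
    conn.expunge()
    return emails


def poll_forever(host, login, password, handle, interval=5):
    """Connect over SSL and pass new emails to `handle` every `interval` seconds."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(login, password)
    while True:
        emails = fetch_and_archive(conn)
        if emails:
            handle(emails)
        time.sleep(interval)
```

In the spout, `login` and `password` would come from the `IMAP_LOGIN` and `IMAP_PASSWORD` environment variables populated by the secret, and `handle` would write the emails to the output repo.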
   175  
A few things to note that expand on the [Pachyderm spout](http://docs.pachyderm.com/en/latest/fundamentals/spouts.html) documentation:
   177  
   178  1. Look in the source code for [./imap_spout.py](imap_spout.py) for environment variables you may need to add to the pipeline spec for the spout to use another email service or other default IMAP folders.
   179  1. The function `open_pipe` opens `/pfs/out`, 
   180     the named pipe that's the gateway to the spout's output repo. 
   181     Note that it must open that pipe as _write only_ and in _binary_ mode. 
   182     If you omit this, you're likely to see errors like `TypeError: a bytes-like object is required, not 'str'` in your `pachctl logs` for the pipeline.
1. The files are not written directly to `/pfs/out`;
   184     they're written as part of a `tarfile` object.  
   185     Pachyderm uses the Unix `tar` format to ensure that multiple files can be written to `/pfs/out` and appear correctly in the output repo of your spout.
   186  1. In Python, the `tarfile.open()` command must use the `mode="w|"` argument,
   187     along with the named pipe's file object,
   188     to ensure that the `tarfile` object won't try to `seek` on the named pipe `/pfs/out`.
   If you forget this argument, you're likely to see errors like `file stream is not seekable` in your `pachctl logs` for the pipeline.
1. Every time you `close()` the `tarfile` object, it's a commit.
1. Note that `open_pipe` backs off and retries opening `/pfs/out` if any errors happen.
   Sometimes it'll take the spout a little bit of time to reopen `/pfs/out` after our code closes it for a commit;
   the backoff is insurance.
   194  1. It saves each email in a file with the `mbox` extension, which is the standard extension for Unix emails. 
   195     `eml` is also commonly used, but is a slightly different format than what we use here.
   196     Each `mbox` file contains one email.
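
The notes above can be put together in a short sketch. It is not the code from `imap_spout.py`: the retry count, the delay, and the helper `write_commit` are assumptions made for illustration, and only the name `open_pipe`, the `/pfs/out` path, the binary write mode, the `mode="w|"` argument, and the `mbox` extension come from the notes.

```python
import io
import tarfile
import time


def open_pipe(path="/pfs/out", attempts=5, delay=1.0):
    """Open the spout's output pipe write-only, in binary mode, with backoff."""
    for attempt in range(attempts):
        try:
            return open(path, "wb")
        except OSError:
            if attempt == attempts - 1:
                raise
            # Wait a little longer after each failed attempt.
            time.sleep(delay * 2 ** attempt)


def write_commit(pipe, emails):
    """Stream {uid: raw_bytes} as one tar archive of .mbox files."""
    # mode="w|" writes a pure stream, so tarfile never calls seek() on the pipe.
    with tarfile.open(fileobj=pipe, mode="w|") as tar:
        for uid, raw in emails.items():
            info = tarfile.TarInfo(name=f"{uid}.mbox")
            info.size = len(raw)
            info.mtime = int(time.time())
            tar.addfile(info, io.BytesIO(raw))
    # Leaving the "with" block closes the archive, which ends the commit.


# Inside the spout container, usage might look like:
#     with open_pipe() as pipe:
#         write_commit(pipe, {"42": b"Subject: hi\n\nhello"})
```

Because `write_commit` accepts any binary file object, you can try it locally with an `io.BytesIO` buffer instead of the named pipe.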
   197  
   198  ### sentimentalist
   199  
   200  Sentimentalist is a thin wrapper around the [Python-based VADER](https://github.com/cjhutto/vaderSentiment) from CJ Hutto at Georgia Tech.
   201  
   202  It looks in its input repo for individual email files, loads them into a Python email object, and extracts the body and subject as plain text for scoring.  
   203  
   204  It uses the "compound" score to sort the emails into different directories, and adds a header to each email with detailed scoring information for use by subsequent pipelines.
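
As a rough sketch of that flow, the function below parses one email, scores its subject and body, adds the `X-Sentiment-Rating` header, and picks an output directory. The scorer is passed in as `score_fn` so the sketch runs without VADER installed. The names `extract_text` and `score_email`, the header value format, and the threshold of `0.0` on the compound score are assumptions for illustration, not details taken from the sentimentalist source.

```python
import email
from email import policy


def extract_text(msg):
    """Return the subject plus the plain-text body of an email message."""
    subject = msg.get("Subject", "")
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return f"{subject}\n{body}"


def score_email(raw, score_fn, threshold=0.0):
    """Score one raw email and return (directory, annotated_email_bytes).

    `score_fn` takes text and returns a dict that includes a "compound" key.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    scores = score_fn(extract_text(msg))
    # Record every score in the header so later pipelines can reuse them.
    msg["X-Sentiment-Rating"] = ", ".join(
        f"{key}={value}" for key, value in sorted(scores.items())
    )
    folder = "positive" if scores["compound"] >= threshold else "negative"
    return folder, msg.as_bytes()
```

With VADER installed, `score_fn` could be `SentimentIntensityAnalyzer().polarity_scores` from `vaderSentiment.vaderSentiment`, which returns `neg`, `neu`, `pos`, and `compound` scores for a piece of text.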
   205  
   206  ## Citations
   207  ```
   208  Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
   209  Sentiment Analysis of Social Media Text. Eighth International Conference on
   210  Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
   211  ```