github.com/pachyderm/pachyderm@v1.13.4/examples/ml/gpt-2/README.md

>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our pre- and post-2.0 examples are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# ML Pipeline for Tweet Generation

In this example, we'll create a machine learning pipeline that generates tweets
using OpenAI's gpt-2 text generation model. This tutorial assumes that
you already have Pachyderm up and running and focuses only on the pipeline
creation. If that's not the case, head over to our [getting started
guide](http://docs.pachyderm.io/en/latest/getting_started/index.html).

The pipeline we're making has three steps:

- tweet scraping
- model training
- tweet generation

At the top of our DAG is a repo containing the Twitter queries we'd like to
run to get the tweets we'll train on.

## Tweet scraping

The first step in our pipeline is scraping tweets off of Twitter. We named this
step `tweets`, and the code for it is in [tweets.py](./tweets.py):

```python
#!/usr/local/bin/python3
import os

import twitterscraper as t

for query in os.listdir("/pfs/queries/"):
    with open(os.path.join("/pfs/queries", query)) as f:
        # Open the output file once per input file so that a file with
        # several queries doesn't overwrite its own earlier results.
        with open(os.path.join("/pfs/out", query), "w") as out:
            for q in f:
                q = q.strip()  # clean whitespace
                for tweet in t.query_tweets(q):
                    out.write("<|startoftext|> ")
                    out.write(tweet.text)
                    out.write(" <|endoftext|> ")
```

Most of this is fairly standard Pachyderm pipeline code. `"/pfs/queries"`
is the path where our input (a list of queries) is mounted. `query_tweets` is
where we actually send the query to Twitter; we then write the tweets
out to a file called `/pfs/out/<name-of-input-file>`. Notice that we
inject `"<|startoftext|>"` and `"<|endoftext|>"` at the beginning and end
of each tweet. These are special delimiters that gpt-2 has been trained on
and that we can use to generate one tweet at a time in our generation
step.

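Since every tweet is wrapped in these delimiters, downstream code can split a
block of delimited text back into individual tweets. Here's a minimal sketch of
that round trip; the `wrap` and `split_tweets` helpers are hypothetical, not
part of the example code:

```python
DELIM_START = "<|startoftext|>"
DELIM_END = "<|endoftext|>"

def wrap(tweet):
    # Mirrors what tweets.py writes for each scraped tweet.
    return f"{DELIM_START} {tweet} {DELIM_END} "

def split_tweets(text):
    # Recover the individual tweets from a delimited blob.
    tweets = []
    for chunk in text.split(DELIM_START):
        body = chunk.split(DELIM_END)[0].strip()
        if body:
            tweets.append(body)
    return tweets

corpus = wrap("hello world") + wrap("second tweet")
print(split_tweets(corpus))  # -> ['hello world', 'second tweet']
```
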
To deploy this as a Pachyderm pipeline, we'll need a pipeline spec, which we've created as [tweets.json](./tweets.json):

```json
{
    "pipeline": {
        "name": "tweets"
    },
    "transform": {
        "image": "pachyderm/gpt-2-example",
        "cmd": ["/tweets.py"]
    },
    "input": {
        "pfs": {
            "repo": "queries",
            "glob": "/*"
        }
    }
}
```

Notice that we take the `queries` repo as input with a glob
pattern of `"/*"` so that our pipeline can run in parallel over
several queries if we want. Before you can create this pipeline, you'll need to create
its input repo:

```shell
$ pachctl create repo queries
```

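As an aside, the `"/*"` glob is what enables that parallelism: each top-level
file in `queries` becomes its own datum and can be handed to a separate worker.
A rough, simplified illustration of the splitting (the real matching lives
inside Pachyderm; `datums_for_glob` is a hypothetical helper):

```python
import fnmatch

def datums_for_glob(paths, glob):
    # With "/", the whole repo is one datum; with "/*", every
    # top-level entry that matches becomes its own datum.
    if glob == "/":
        return [sorted(paths)]
    return [[p] for p in sorted(paths) if fnmatch.fnmatch(p, glob)]

files = ["/alice", "/bob", "/carol"]
print(datums_for_glob(files, "/*"))  # -> [['/alice'], ['/bob'], ['/carol']]
print(datums_for_glob(files, "/"))   # -> [['/alice', '/bob', '/carol']]
```
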
Now create the pipeline:

```shell
$ pachctl create pipeline -f tweets.json
```

The pipeline has now been created; let's test that it's working by giving
it a query:

```shell
$ echo "from:<username>" | pachctl put file queries@master:<username>
```

Note that the username should _not_ contain the `@`. This is a fairly simple
query that just gets all the tweets from a single user. If you'd like to
construct a more complicated query, check out [Twitter's advanced search
page](https://twitter.com/search-advanced).
(Hit the search button and the query string will appear along the top of the results page.)

After you run that `put file`, you will have a new commit in your `queries`
repo and a new output commit in `tweets`, along with a job that's scraping
the tweets. To see the running job:

```shell
$ pachctl list job
```

Once it's finished, you can view the scraped tweets with:

```shell
$ pachctl get file tweets@master:/<username>
```

Assuming those results look reasonable, let's move on to training a model.

## Model training

As mentioned, we'll be using OpenAI's gpt-2 text generation model; more
precisely, we'll be using a handy wrapper,
[gpt-2-simple](https://github.com/minimaxir/gpt-2-simple).

The code for this is in [train.py](./train.py):

```python
#!/usr/local/bin/python3
import os

import gpt_2_simple as gpt2

tweets = os.listdir("/pfs/tweets")

# chdir so that the training process outputs to the right place
out = os.path.join("/pfs/out", tweets[0])
os.mkdir(out)
os.chdir(out)

model_name = "117M"
gpt2.download_gpt2(model_name=model_name)

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              os.path.join("/pfs/tweets", tweets[0]),
              model_name=model_name,
              steps=25)  # steps is the max number of training steps
```

Again, most of this is standard Pachyderm pipeline code to grab our inputs
(this time our input is the `tweets` repo). We're also making a few choices in
this pipeline. First, we're using the 117M version of the model. For better
results you can use the 345M version, but expect it to take much more time to
train. Second, we're limiting our training process to 25 steps. This was a
more-or-less arbitrary choice that seems to get good results without taking
too long to run.

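If you want to experiment with the model size or step count without editing
the script, one option is to read them from environment variables, which a
pipeline spec can set via `transform.env`. A sketch of that pattern, with
made-up variable names `GPT2_MODEL` and `TRAIN_STEPS`:

```python
import os

def training_config(env=None):
    # Fall back to the values hard-coded in train.py.
    env = os.environ if env is None else env
    model_name = env.get("GPT2_MODEL", "117M")
    steps = int(env.get("TRAIN_STEPS", "25"))
    if model_name not in ("117M", "345M"):
        raise ValueError(f"unknown model size: {model_name}")
    return model_name, steps

print(training_config({}))  # -> ('117M', 25)
print(training_config({"GPT2_MODEL": "345M", "TRAIN_STEPS": "100"}))  # -> ('345M', 100)
```
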
The pipeline spec for training the model is very similar to the one above for
scraping tweets:

```json
{
    "pipeline": {
        "name": "train"
    },
    "transform": {
        "image": "pachyderm/gpt-2-example",
        "cmd": ["/train.py"]
    },
    "input": {
        "pfs": {
            "repo": "tweets",
            "glob": "/*"
        }
    },
    "resource_limits": {
        "gpu": {
            "type": "nvidia.com/gpu",
            "number": 1
        },
        "memory": "10G",
        "cpu": 1
    },
    "resource_requests": {
        "memory": "10G",
        "cpu": 1
    },
    "standby": true
}
```

A few things have changed from the `tweets` pipeline. First, we're taking the
`tweets` repo as input rather than `queries`, and we're running a different
script in our transform. We've also added a `resource_limits` section:
training is a much more computationally intensive task than scraping, so it
makes sense to give it a GPU and a large chunk of memory to train on. We also
enable `standby`, which prevents the pipeline from holding onto those
resources when it's not processing data. You can create this pipeline with:

```shell
$ pachctl create pipeline -f train.json
```

This will kick off a job immediately because there are already inputs to be
processed. Expect this job to take a while to run (~1hr on my laptop), but you
can make it run quicker by reducing the max steps and building your own Docker
image to use.

While that's running, let's set up the last step: generating text.

## Text Generation

The last step is to take our trained model(s) and make them tweet! The code
for this is in [generate.py](./generate.py) and looks like this:

```python
#!/usr/local/bin/python3
import os

import gpt_2_simple as gpt2

models = os.listdir("/pfs/train")

model_dir = os.path.join("/pfs/train", models[0])
# can't tell gpt2 where to read from, so we chdir
os.chdir(model_dir)

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

out = os.path.join("/pfs/out", models[0])
gpt2.generate_to_file(sess, destination_path=out, prefix="<|startoftext|>",
                      truncate="<|endoftext|>", include_prefix=False,
                      length=280, nsamples=30)
```

Again, this code includes some standard Pachyderm boilerplate to read the data
from the local filesystem. The interesting bit is the call to
`generate_to_file`, which actually generates the tweets. A few things to
mention: we set the prefix to `"<|startoftext|>"` and truncate
`"<|endoftext|>"` off the end. These are the tokens we added in the first step
(and that were used in the original training set) to delineate the beginning
and end of tweets; using them here tells gpt-2 to generate one hopefully
coherent piece of text at a time. We also set `include_prefix` to `False` so
that `"<|startoftext|>"` isn't prepended to every single tweet. We set
`length` to 280 to match Twitter's limit on tweet size, though note that
gpt-2 measures length in tokens rather than characters, so this is only an
approximation. In a future version, we may teach gpt-2 to post tweet storms.
Lastly, we tell it to give us 30 samples; in this case, a sample is a tweet.

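Before actually posting these samples anywhere, you'd typically normalize
whitespace and hard-enforce the 280-character limit. A small hypothetical
post-processing sketch (not part of the example code):

```python
TWEET_LIMIT = 280  # Twitter's character limit

def clean_sample(sample):
    # Collapse runs of whitespace/newlines, then hard-truncate.
    text = " ".join(sample.split())
    return text[:TWEET_LIMIT]

print(clean_sample("  a generated   tweet\nwith odd spacing  "))
print(len(clean_sample("x" * 500)))  # -> 280
```
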
The pipeline spec to run this on Pachyderm should look familiar by now:

```json
{
    "pipeline": {
        "name": "generate"
    },
    "transform": {
        "image": "pachyderm/gpt-2-example",
        "cmd": ["/generate.py"]
    },
    "input": {
        "pfs": {
            "repo": "train",
            "glob": "/*"
        }
    },
    "resource_limits": {
        "gpu": {
            "type": "nvidia.com/gpu",
            "number": 1
        },
        "memory": "10G",
        "cpu": 1
    },
    "resource_requests": {
        "memory": "10G",
        "cpu": 1
    },
    "standby": true
}
```

# Modifying and running this example

This example comes with a simple Makefile to build and deploy it.

To build the Docker images (after modifying the code):

```shell
$ make docker-build
```

To deploy the pipelines:

```shell
$ make deploy
```
   292