> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre- and post-2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# ML Pipeline for Tweet Generation

In this example we'll create a machine learning pipeline that generates tweets
using OpenAI's gpt-2 text generation model. This tutorial assumes that you
already have Pachyderm up and running and focuses only on the pipeline
creation. If that's not the case, head over to our [getting started
guide](http://docs.pachyderm.io/en/latest/getting_started/index.html).

The pipeline we're making has three steps:

- tweet scraping
- model training
- tweet generation

At the top of our DAG is a repo that contains the Twitter queries we'd like to
run to get the tweets we train on.

## Tweet scraping

The first step in our pipeline is scraping tweets off of Twitter. We named this
step `tweets`, and the code for it is in [tweets.py](./tweets.py):

```python
#!/usr/local/bin/python3
import os

import twitterscraper as t

for query in os.listdir("/pfs/queries/"):
    # Open the output file once per query file, so that files containing
    # several queries don't truncate the tweets written for earlier queries.
    with open(os.path.join("/pfs/queries", query)) as f, \
         open(os.path.join("/pfs/out", query), "w") as out:
        for q in f:
            q = q.strip()  # clean whitespace
            for tweet in t.query_tweets(q):
                out.write("<|startoftext|> ")
                out.write(tweet.text)
                out.write(" <|endoftext|> ")
```

Most of this is fairly standard Pachyderm pipeline code. `"/pfs/queries"`
is the path where our input (a list of queries) is mounted.
`query_tweets` is where we actually send the query to Twitter, and then we
write the tweets out to a file called `/pfs/out/<name-of-input-file>`. Notice
that we inject `"<|startoftext|>"` and `"<|endoftext|>"` at the beginning and
end of each tweet. These are special delimiters that gpt-2 has been trained on,
and we can use them to generate one tweet at a time in our generation step.

To deploy this as a Pachyderm pipeline, we'll need a Pachyderm pipeline spec,
which we've created as [tweets.json](./tweets.json):

```json
{
  "pipeline": {
    "name": "tweets"
  },
  "transform": {
    "image": "pachyderm/gpt-2-example",
    "cmd": ["/tweets.py"]
  },
  "input": {
    "pfs": {
      "repo": "queries",
      "glob": "/*"
    }
  }
}
```

Notice that we take the `queries` repo as input with a glob pattern of `"/*"`,
so that our pipeline can run in parallel over several queries if we want.
Before you can create this pipeline, you'll need to create its input repo:

```shell
$ pachctl create repo queries
```

Now create the pipeline:

```shell
$ pachctl create pipeline -f tweets.json
```

Now that the pipeline has been created, let's test it by giving it a query:

```shell
$ echo "from:<username>" | pachctl put file queries@master:<username>
```

Note that the username should _not_ contain the `@`. This is a fairly simple
query that just gets all the tweets from a single user. If you'd like to
construct a more complicated query, check out [Twitter's advanced search
page](https://twitter.com/search-advanced). (Hit the search button, and the
query string will appear along the top of the results page.)

After you run that `put file`, you will have a new commit in your `queries`
repo and a new output commit in `tweets`, along with a job that's scraping
the tweets.
To see the job running, run:

```shell
$ pachctl list job
```

Once it's finished, you can view the scraped tweets with:

```shell
$ pachctl get file tweets@master:/<username>
```

Assuming those results look reasonable, let's move on to training a model.

## Model training

As mentioned, we'll be using OpenAI's gpt-2 text generation model -- actually,
we'll be using a handy wrapper:
[gpt-2-simple](https://github.com/minimaxir/gpt-2-simple).

The code for this is in [train.py](./train.py):

```python
#!/usr/local/bin/python3
import os

import gpt_2_simple as gpt2

tweets = os.listdir("/pfs/tweets")

# chdir so that the training process outputs to the right place
out = os.path.join("/pfs/out", tweets[0])
os.mkdir(out)
os.chdir(out)

model_name = "117M"
gpt2.download_gpt2(model_name=model_name)

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              os.path.join("/pfs/tweets", tweets[0]),
              model_name=model_name,
              steps=25)  # steps is the max number of training steps
```

Again, most of this is standard Pachyderm pipeline code to grab our inputs
(this time from the `tweets` repo). We're also making a few choices in this
pipeline. First, we're using the 117M version of the model. For better results
you can use the 345M version, but expect it to take much longer to train.
Second, we're limiting our training process to 25 steps. This was a
more-or-less arbitrary choice that seems to get good results without taking
too long to run.
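If you want to experiment with these choices (model size, step count) without editing the script, one lightweight option is to read them from environment variables, which a Pachyderm pipeline spec can set on the transform. This is only a sketch of ours, not code from the example; the `GPT2_MODEL` and `GPT2_STEPS` variable names are made up for illustration:

```python
import os

# GPT2_MODEL and GPT2_STEPS are hypothetical names invented for this sketch;
# they would be set in the pipeline spec's "transform" env section.
model_name = os.environ.get("GPT2_MODEL", "117M")  # e.g. "345M" for better results
steps = int(os.environ.get("GPT2_STEPS", "25"))    # fewer steps -> faster jobs
```

These values would then replace the hard-coded `model_name` and `steps` in the `finetune` call, letting you tune a run without rebuilding the Docker image.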
The pipeline spec for training the model is very similar to the one above for
scraping tweets:

```json
{
  "pipeline": {
    "name": "train"
  },
  "transform": {
    "image": "pachyderm/gpt-2-example",
    "cmd": ["/train.py"]
  },
  "input": {
    "pfs": {
      "repo": "tweets",
      "glob": "/*"
    }
  },
  "resource_limits": {
    "gpu": {
      "type": "nvidia.com/gpu",
      "number": 1
    },
    "memory": "10G",
    "cpu": 1
  },
  "resource_requests": {
    "memory": "10G",
    "cpu": 1
  },
  "standby": true
}
```

A few things have changed from the `tweets` pipeline. First, we're taking the
`tweets` repo as input rather than `queries`, and we're running a different
script in our transform. We've also added `resource_limits` and
`resource_requests` sections: training is a much more computationally intensive
task than scraping, so it makes sense to give the pipeline a GPU and a large
chunk of memory to train with. We also enable `standby`, which prevents the
pipeline from holding onto those resources when it's not processing data. You
can create this pipeline with:

```shell
$ pachctl create pipeline -f train.json
```

This will kick off a job immediately because there are already inputs to be
processed. Expect this job to take a while to run (~1hr on my laptop), but you
can make it run quicker by reducing the max steps and building your own Docker
image to use.

While that's running, let's set up the last step: generating text.

## Text Generation

The last step is to take our trained model(s) and make them tweet!
The code for this is in [generate.py](./generate.py) and looks like this:

```python
#!/usr/local/bin/python3
import os

import gpt_2_simple as gpt2

models = os.listdir("/pfs/train")

model_dir = os.path.join("/pfs/train", models[0])
# we can't tell gpt2 where to read from, so we chdir
os.chdir(model_dir)

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

out = os.path.join("/pfs/out", models[0])
gpt2.generate_to_file(sess, destination_path=out, prefix="<|startoftext|>",
                      truncate="<|endoftext|>", include_prefix=False,
                      length=280, nsamples=30)
```

Again, this code includes some standard Pachyderm boilerplate to read the data
from the local filesystem. The interesting bit is the call to
`generate_to_file`, which actually generates the tweets. A few things to
mention: we set the prefix to `"<|startoftext|>"` and truncate
`"<|endoftext|>"` off the end. These are the tokens we added in the first step
(and that were added in the original training set) to delineate the beginning
and end of tweets; passing them here tells gpt-2 to generate a single,
hopefully coherent, piece of text per sample. We also set `include_prefix` to
`False` so that we don't have `"<|startoftext|>"` prepended to every single
tweet. We also set the length to 280, in the spirit of Twitter's 280-character
limit on tweet size. In a future version, we may teach gpt-2 to post tweet
storms. Lastly, we tell it to give us 30 samples; in this case, a sample is a
tweet.
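To make the delimiter handling concrete, here's a rough sketch of what that trimming amounts to for a single sample. This is our own illustration, not gpt-2-simple's actual implementation: `include_prefix=False` drops the start token, and `truncate` cuts everything from the end token onward:

```python
START, END = "<|startoftext|>", "<|endoftext|>"

def trim_sample(raw):
    """Mimic include_prefix=False plus truncate="<|endoftext|>" on one sample."""
    if raw.startswith(START):
        raw = raw[len(START):]   # drop the start delimiter
    end = raw.find(END)
    if end != -1:
        raw = raw[:end]          # cut everything from the end delimiter on
    return raw.strip()

print(trim_sample("<|startoftext|> just setting up my twttr <|endoftext|> extra"))
# -> just setting up my twttr
```

Because the model learned these delimiters during fine-tuning, trimming at them is what turns a raw generation into exactly one tweet.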
The pipeline spec to run this on Pachyderm should look familiar by now:

```json
{
  "pipeline": {
    "name": "generate"
  },
  "transform": {
    "image": "pachyderm/gpt-2-example",
    "cmd": ["/generate.py"]
  },
  "input": {
    "pfs": {
      "repo": "train",
      "glob": "/*"
    }
  },
  "resource_limits": {
    "gpu": {
      "type": "nvidia.com/gpu",
      "number": 1
    },
    "memory": "10G",
    "cpu": 1
  },
  "resource_requests": {
    "memory": "10G",
    "cpu": 1
  },
  "standby": true
}
```

## Modifying and running this example

This example comes with a simple Makefile to build and deploy it.

To build the Docker images (after modifying the code):

```shell
$ make docker-build
```

To deploy the pipelines:

```shell
$ make deploy
```
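Once the `generate` pipeline has run, `pachctl get file generate@master:/<model-name>` will print the generated samples. If you want to work with them programmatically, here's a minimal sketch of splitting that output into individual tweets. It assumes gpt-2-simple's default sample delimiter (a line of twenty `=` characters); that default is an assumption about the library, not something this example configures:

```python
# Split the blob written by generate_to_file(...) into individual tweets.
# Assumes samples are separated by gpt-2-simple's default sample delimiter
# (a line of twenty "=" characters); adjust if you pass sample_delim yourself.
SAMPLE_DELIM = "=" * 20

def split_samples(blob):
    tweets = [s.strip() for s in blob.split(SAMPLE_DELIM)]
    return [t for t in tweets if t]  # drop empty chunks

blob = "first fake tweet\n" + SAMPLE_DELIM + "\nsecond fake tweet\n"
print(split_samples(blob))
# -> ['first fake tweet', 'second fake tweet']
```

From there you could, for example, check each sample against Twitter's 280-character limit before posting it.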