> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

:warning: **Warning**: This is a Pachyderm pre version 1.4 tutorial. This example is deprecated until it has been updated for the latest versions of Pachyderm.

# Quick Start Guide: Web Scraper
In this guide you're going to create a Pachyderm pipeline to scrape web pages.
We'll use a standard unix tool, `wget`, to do our scraping.

## Setup

This guide assumes that you already have a Pachyderm cluster running and have configured `pachctl` to talk to the cluster. [Installation instructions can be found here](http://pachyderm.readthedocs.io/en/stable/getting_started/local_installation.html).

## Create a Repo

A `Repo` is the highest level primitive in `pfs`. Like all primitives in pfs, repos share
their name with a primitive in Git and are designed to behave analogously.
Generally, a `repo` should be dedicated to a single source of data, such as log
messages from a particular service. Repos are dirt cheap, so don't be shy about
making them very specific.

For this demo we'll simply create a `repo` called
“urls” to hold a list of urls that we want to scrape.

```shell
$ pachctl create repo urls
$ pachctl list repo
urls
```

## Start a Commit
Now that we've created a `repo`, we've got a place to add data.
If you try writing to the `repo` right away though, it will fail, because you can't write directly to a
`Repo`. In Pachyderm, you write data to an explicit `commit`. Commits are
immutable snapshots of your data, and they are what give Pachyderm its
version-control properties. Unlike Git though, commits in Pachyderm must be explicitly
started and finished.

Let's start a new commit in the “urls” repo:
```shell
$ pachctl start commit urls@master
master/0
```

This returns an identifier for the brand new commit. Yours may look different from mine.
Now if we take a look inside our repo, we can see the new commit:
```shell
$ pachctl list commit urls
master/0
```

With the commit open, we can start adding
data. Data for this example is just a single file with a list of urls. We've provided a sample file for you with just 3 urls: Google, Reddit, and Imgur.
We're going to write that data as a file called “urls” in pfs.

```shell
# Write sample data into pfs
$ cat examples/scraper/urls | pachctl put file urls@master/0:urls
```

## Finish a Commit

Pachyderm won't let you read data from a commit until the `commit` is `finished`.
This prevents reads from racing with writes. Furthermore, every write
to pfs is atomic. Now let's finish the commit:

```shell
$ pachctl finish commit urls@master/0
```

Now we can view the file:

```shell
$ pachctl get file urls@master/0:urls
www.google.com
www.reddit.com
www.imgur.com
```
However, we've lost the ability to write to this `commit` since finished
commits are immutable. In Pachyderm, a `commit` is always either _write-only_
while it's open and files are being added, or _read-only_ after it's
finished.
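If you're scripting this workflow, the commit identifier printed by `pachctl start commit` can be captured in a shell variable instead of typed by hand. Here's a minimal sketch, assuming (as the outputs above suggest) that the command prints only the new commit's identifier to stdout:

```shell
#!/bin/sh
# Sketch of the commit lifecycle above, assuming `pachctl start commit`
# prints only the new commit identifier on stdout.
set -e

# Start a commit on the master branch and capture its identifier.
COMMIT=$(pachctl start commit urls@master)

# Write the sample url list into the open commit.
cat examples/scraper/urls | pachctl put file "urls@${COMMIT}:urls"

# Finish the commit so the data becomes readable.
pachctl finish commit "urls@${COMMIT}"

# Read the file back from the finished commit.
pachctl get file "urls@${COMMIT}:urls"
```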
## Create a Pipeline

Now that we've got some data in our `repo`, it's time to do something with it.
Pipelines are the core primitive of Pachyderm's processing system (pps), and
they're specified with a JSON encoding. We're going to create a pipeline that simply scrapes each of the web pages in “urls.”

```
+----------+     +---------------+
|input data| --> |scrape pipeline|
+----------+     +---------------+
```

The `pipeline` we're creating can be found at [scraper.json](scraper.json). The full content is also below.
```json
{
  "pipeline": {
    "name": "scraper"
  },
  "transform": {
    "cmd": [ "wget",
        "--recursive",
        "--level", "1",
        "--accept", "jpg,jpeg,png,gif,bmp",
        "--page-requisites",
        "--adjust-extension",
        "--span-hosts",
        "--no-check-certificate",
        "--timestamping",
        "--directory-prefix",
        "/pfs/out",
        "--input-file", "/pfs/urls/urls"
    ],
    "acceptReturnCode": [4,5,6,7,8]
  },
  "parallelism": "1",
  "input": {
    "pfs": {
      "repo": "urls"
    }
  }
}
```

In this pipeline, we're just using `wget` to scrape the content of our input web pages. The `--level` flag controls how many links deep `wget` will recurse. We currently have it set to 1, which will only scrape the home page, but you can crank it up later if you want.

Another important detail is that we read data
from `/pfs/urls/urls` (i.e. `/pfs/[input_repo_name]/...`) and write data to `/pfs/out/`. The pipeline creates a directory for each url in “urls” with all of the relevant scrapes as files.

Now let's create the pipeline in Pachyderm:

```shell
$ pachctl create pipeline -f examples/scraper/scraper.json
```

## What Happens When You Create a Pipeline
Creating a `pipeline` tells Pachyderm to run your code on *every* finished
`commit` in a `repo` as well as *all future commits* that are finished after the pipeline is
created. Our `repo` already had a `commit` with the file “urls” in it, so Pachyderm will automatically
launch a `job` to scrape those webpages.

You can view the job with:

```shell
$ pachctl list job
ID                                 OUTPUT                                     STATE
09a7eb68995c43979cba2b0d29432073   scraper/2b43def9b52b4fdfadd95a70215e90c9   JOB_STATE_RUNNING
```

Depending on how quickly you run the above, you may see `running` or
`success`.

Pachyderm `job`s are implemented as Kubernetes jobs, so you can also see your job with:

```shell
$ kubectl get job
JOB                                CONTAINER(S)   IMAGE(S)       SELECTOR                                                         SUCCESSFUL
09a7eb68995c43979cba2b0d29432073   user           ubuntu:14.04   app in (09a7eb68995c43979cba2b0d29432073),suite in (pachyderm)   1
```

Every `pipeline` creates a corresponding `repo` with the same
name where it stores its output results. In our example, the pipeline was named “scraper”, so it created a `repo` called “scraper” which contains the final output.
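Rather than polling `pachctl list job` until the job succeeds, Pachyderm 1.x versions provide a `flush commit` command that blocks until every pipeline downstream of a commit has finished processing it. A hedged sketch (the exact syntax and output vary across 1.x releases):

```shell
# Block until all pipelines that consume this commit have finished
# processing it, then look at the resulting output commit.
# (Available in Pachyderm 1.x; exact output format varies by release.)
$ pachctl flush commit urls@master/0
$ pachctl list commit scraper
```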
## Reading the Output
There are a couple of different ways to retrieve the output. We can read a single output file from the “scraper” `repo` in the same fashion that we read the input data:

```shell
$ pachctl list file scraper@2b43def9b52b4fdfadd95a70215e90c9:urls
$ pachctl get file scraper@2b43def9b52b4fdfadd95a70215e90c9:urls/www.imgur.com/index.html
```

Using `get file` is good if you know exactly what file you're looking for, but for this example we just want to see all the scraped pages. One great way to do this is to mount the distributed file system locally and then poke around.

## Mount the Filesystem
First create the mount point:

```shell
$ mkdir ~/pfs
```

And then mount it:

```shell
# We background this process because it blocks.
$ pachctl mount ~/pfs &
```

This will mount pfs on `~/pfs`, and you can inspect it like you would any
other local filesystem. Try:

```shell
$ ls ~/pfs
urls
scraper
```
You should see both the “urls” repo that we created and the “scraper” repo that the pipeline created.

Now you can simply `ls` and `cd` around the file system. Try pointing your browser at the scraped output files!

## Processing More Data

Pipelines can be triggered manually, but they will also automatically process the data from new commits as they are
created. Think of pipelines as being subscribed to any new commits that are
finished on their input repo(s).

If we want to re-scrape our urls to see if the sites have changed, we can use the `pachctl update pipeline` command with the `--reprocess` flag:

```shell
$ pachctl update pipeline -f examples/scraper/scraper.json --reprocess
```

Next, let's add additional urls to our input data. We're going to append more urls from “urls2” to the file “urls.”

We first need to start a new commit to add more data. Similar to Git, commits have a parental
structure that tracks how files change over time. Specifying a parent is
optional when creating a commit (notice we didn't specify a parent when we
created the first commit), but in this case we're going to be adding
more data to the same file “urls,” so we want to build on the previous commit.

Let's create a new commit with our previous commit as the parent (starting a commit on the same branch automatically uses the branch head as the parent):

```shell
$ pachctl start commit urls@master
master/1
```

Append more data to our urls file in the new commit:
```shell
$ cat examples/scraper/urls2 | pachctl put file urls@master/1:urls
```
Finally, we'll want to finish our second commit. After it's finished, we can
read “scraper” from the latest commit to see all the scrapes.

```shell
$ pachctl finish commit urls@master/1
```
Finishing this commit will also automatically trigger the pipeline to run on
the new data we've added. We'll see a corresponding commit to the output
“scraper” repo with data from our newly added sites.

```shell
$ pachctl list commit scraper
```
## Next Steps
You've now got a working Pachyderm cluster with data and a pipeline! Here are a few ideas for next steps to expand on your working setup.
- Add a bunch more urls and crank up the “level” in the pipeline. You'll have to delete the old pipeline and re-create it, or give your pipeline a new name.
- Add a new pipeline that does something interesting with the scraper output. Image or text processing could be fun. Just create a pipeline with the “scraper” repo as an input; a sketch follows below.

We'd love to help and see what you come up with, so submit any issues/questions you come across, or email us at info@pachyderm.io if you want to show off anything nifty you've created!