> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Periodic ingress from MongoDB

This example pipeline periodically executes a query against a MongoDB database outside of Pachyderm. The results of the query are stored in a corresponding output repository. This repository could be used to drive additional pipeline stages periodically based on the results of the query.

The example assumes that you have:

- A Pachyderm cluster running - see [Local Installation](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/) to get up and running with a local Pachyderm cluster in just a few minutes.
- The `pachctl` CLI tool installed and connected to your Pachyderm cluster - see [deploy docs](https://docs.pachyderm.com/1.13.x/deploy-manage/) for instructions.

## Setup MongoDB

The easiest way to demonstrate this example is with a free hosted MongoDB cluster, such as the free tier of [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) or [MLab](https://mlab.com/) (although you could certainly do this with any MongoDB). Assuming that you are using MongoDB Atlas:

1. Deploy a new `Cluster0` MongoDB cluster with MongoDB Atlas (remember the admin username and password, as you will need these shortly). Once deployed, you should be able to see this cluster in the MongoDB Atlas dashboard.

1. Click on the "connect" button for your cluster and make sure that all IPs are whitelisted (or at least the k8s master IP where you have Pachyderm deployed).

1. Then click on "Connect with the MongoDB shell" to find the URI, DB name (`test` if you are using MongoDB Atlas `Cluster0`), username, and authentication DB for connecting to your cluster. You will need these to query MongoDB.

1. Make sure you have the MongoDB tools installed locally. You can follow [this guide](https://docs.mongodb.com/manual/administration/install-community/) to install them.

## Import example data

We are going to run this example with an example set of data about restaurants. This dataset comes directly from MongoDB and is used in many of their examples as well.

1. Download the dataset from [here](https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json). It is named `primer-dataset.json`. Each record in this dataset looks like the following:

    ```
    {
      "address": {
         "building": "1007",
         "coord": [ -73.856077, 40.848447 ],
         "street": "Morris Park Ave",
         "zipcode": "10462"
      },
      "borough": "Bronx",
      "cuisine": "Bakery",
      "grades": [
         { "date": { "$date": 1393804800000 }, "grade": "A", "score": 2 },
         { "date": { "$date": 1378857600000 }, "grade": "A", "score": 6 },
         { "date": { "$date": 1358985600000 }, "grade": "A", "score": 10 },
         { "date": { "$date": 1322006400000 }, "grade": "A", "score": 9 },
         { "date": { "$date": 1299715200000 }, "grade": "B", "score": 14 }
      ],
      "name": "Morris Park Bake Shop",
      "restaurant_id": "30075445"
    }
    ```

1. Import the dataset to the `restaurants` collection in MongoDB (in the `test` DB if you are using MongoDB Atlas) using the `mongoimport` command. You will need to specify the Mongo hosts, username, password, etc. from your MongoDB cluster.
    For example:

    ```shell
    $ mongoimport --host Cluster0-shard-0/cluster0-shard-00-00-cwehf.mongodb.net:27017,cluster0-shard-00-01-cwehf.mongodb.net:27017,cluster0-shard-00-02-cwehf.mongodb.net:27017 --ssl -u admin -p '<my password>' --authenticationDatabase admin --db test --collection restaurants --drop --file primer-dataset.json
    2017-08-28T13:40:38.983-0400 connected to: Cluster0-shard-0/cluster0-shard-00-00-cwehf.mongodb.net:27017,cluster0-shard-00-01-cwehf.mongodb.net:27017,cluster0-shard-00-02-cwehf.mongodb.net:27017
    2017-08-28T13:40:39.048-0400 dropping: test.restaurants
    2017-08-28T13:40:41.310-0400 [#.......................] test.restaurants 540KB/11.3MB (4.7%)
    2017-08-28T13:40:44.310-0400 [##......................] test.restaurants 1.04MB/11.3MB (9.2%)
    2017-08-28T13:40:47.310-0400 [##......................] test.restaurants 1.04MB/11.3MB (9.2%)
    2017-08-28T13:40:50.310-0400 [###.....................] test.restaurants 1.55MB/11.3MB (13.7%)
    2017-08-28T13:40:53.310-0400 [####....................] test.restaurants 2.07MB/11.3MB (18.2%)
    2017-08-28T13:40:56.310-0400 [#####...................] test.restaurants 2.58MB/11.3MB (22.8%)
    2017-08-28T13:40:59.310-0400 [######..................] test.restaurants 3.10MB/11.3MB (27.4%)
    2017-08-28T13:41:02.310-0400 [######..................] test.restaurants 3.10MB/11.3MB (27.4%)
    2017-08-28T13:41:05.310-0400 [#######.................] test.restaurants 3.61MB/11.3MB (31.9%)
    2017-08-28T13:41:08.310-0400 [########................] test.restaurants 4.12MB/11.3MB (36.4%)
    2017-08-28T13:41:11.310-0400 [#########...............] test.restaurants 4.64MB/11.3MB (41.0%)
    2017-08-28T13:41:14.310-0400 [#########...............] test.restaurants 4.64MB/11.3MB (41.0%)
    2017-08-28T13:41:17.310-0400 [##########..............] test.restaurants 5.15MB/11.3MB (45.5%)
    2017-08-28T13:41:20.310-0400 [############............] test.restaurants 5.67MB/11.3MB (50.0%)
    2017-08-28T13:41:23.310-0400 [#############...........] test.restaurants 6.19MB/11.3MB (54.6%)
    2017-08-28T13:41:26.310-0400 [#############...........] test.restaurants 6.19MB/11.3MB (54.6%)
    2017-08-28T13:41:29.310-0400 [##############..........] test.restaurants 6.70MB/11.3MB (59.2%)
    2017-08-28T13:41:32.310-0400 [###############.........] test.restaurants 7.21MB/11.3MB (63.7%)
    2017-08-28T13:41:35.310-0400 [################........] test.restaurants 7.71MB/11.3MB (68.1%)
    2017-08-28T13:41:38.310-0400 [################........] test.restaurants 7.71MB/11.3MB (68.1%)
    2017-08-28T13:41:41.310-0400 [#################.......] test.restaurants 8.18MB/11.3MB (72.3%)
    2017-08-28T13:41:44.310-0400 [##################......] test.restaurants 8.62MB/11.3MB (76.1%)
    2017-08-28T13:41:47.310-0400 [###################.....] test.restaurants 9.03MB/11.3MB (79.7%)
    2017-08-28T13:41:50.310-0400 [###################.....] test.restaurants 9.41MB/11.3MB (83.1%)
    2017-08-28T13:41:53.310-0400 [####################....] test.restaurants 9.77MB/11.3MB (86.3%)
    2017-08-28T13:41:56.310-0400 [#####################...] test.restaurants 10.1MB/11.3MB (89.2%)
    2017-08-28T13:41:59.310-0400 [######################..] test.restaurants 10.4MB/11.3MB (91.9%)
    2017-08-28T13:42:02.310-0400 [######################..] test.restaurants 10.7MB/11.3MB (94.5%)
    2017-08-28T13:42:05.310-0400 [#######################.] test.restaurants 11.0MB/11.3MB (97.0%)
    2017-08-28T13:42:08.310-0400 [########################] test.restaurants 11.3MB/11.3MB (100.0%)
    2017-08-28T13:42:08.449-0400 [########################] test.restaurants 11.3MB/11.3MB (100.0%)
    2017-08-28T13:42:08.449-0400 imported 25359 documents
    ```

## Create a Kubernetes secret with your Mongo creds

In order for your Pachyderm pipeline to talk with MongoDB, we need to tell Pachyderm about the MongoDB URI, username, password, etc.
We will do this via a [Kubernetes secret](https://kubernetes.io/docs/concepts/configuration/secret/),
created with `kubectl` or loaded via `pachctl create secret`.

1. The next few steps show you how to add a secret with the following five keys:

    * `uri`
    * `username`
    * `password`
    * `db`
    * `collection`

    First, we'll save some values to files.
    The values should all be enclosed in single quotes to prevent the shell from interpreting them.

    ```shell
    $ echo -n '<uri>' > uri ; chmod 600 uri
    $ echo -n '<username>' > username ; chmod 600 username
    $ echo -n '<password>' > password ; chmod 600 password
    $ echo -n '<db>' > db ; chmod 600 db
    $ echo -n '<collection>' > collection ; chmod 600 collection
    ```

1. Confirm the values in these files are what you expect.

    ```shell
    $ cat uri
    $ cat username
    $ cat password
    $ cat db
    $ cat collection
    ```

    Creating the secret requires different steps,
    depending on whether you have Kubernetes access or not.
    Pachyderm Hub users don't have access to Kubernetes.
    If you have Kubernetes access and want to use `kubectl`,
    follow the two steps prefixed with "(Kubernetes)".
    If you don't have access to Kubernetes or don't want to use `kubectl`,
    follow the three steps prefixed with "(Pachyderm Hub)".

1. (Kubernetes) If you have direct access to the Kubernetes cluster, you can create a secret using `kubectl`.

    ```shell
    $ kubectl create secret generic mongosecret --from-file=./uri \
        --from-file=./username \
        --from-file=./password \
        --from-file=./db \
        --from-file=./collection
    ```

1. (Kubernetes) Confirm that the secrets got set correctly.
    Use `kubectl get secret` to output the secret, and then decode its values using `jq` to confirm they're correct.

    ```shell
    $ kubectl get secret mongosecret -o json | jq '.data | map_values(@base64d)'
    {
      "uri": "<uri>",
      "username": "<username>",
      "password": "<password>",
      "db": "<db>",
      "collection": "<collection>"
    }
    ```

    You will have to use `pachctl` if you're using Pachyderm Hub,
    or don't have access to the Kubernetes cluster.
    The next three steps show how to do that.

1. (Pachyderm Hub) Create a secrets file from the provided template.
    The command substitutions are quoted so that values containing spaces or shell metacharacters are passed through intact.

    ```shell
    $ jq -n --arg uri "$(cat uri)" --arg username "$(cat username)" \
        --arg password "$(cat password)" --arg db "$(cat db)" --arg collection "$(cat collection)" \
        -f mongodb-credentials-template.jq > mongodb-credentials-secret.json
    $ chmod 600 mongodb-credentials-secret.json
    ```

1. (Pachyderm Hub) Confirm the secrets file is correct by decoding the values.

    ```shell
    $ jq '.data | map_values(@base64d)' mongodb-credentials-secret.json
    {
      "uri": "<uri>",
      "username": "<username>",
      "password": "<password>",
      "db": "<db>",
      "collection": "<collection>"
    }
    ```

1. (Pachyderm Hub) Create the secret using `pachctl`.

    ```shell
    $ pachctl create secret -f mongodb-credentials-secret.json
    ```

## Create the pipeline, view the results

In our [pipeline spec](query.json), we will do the following:

- Grab the Kubernetes secret that we defined above.
- Define a `cron` input that will cause the pipeline to be triggered every 10 seconds.
- Using the official `mongo` Docker image, query for a random document from the `restaurants` collection and output that to `/pfs/out`.

```
{
  "pipeline": {
    "name": "query"
  },
  "transform": {
    "image": "mongo",
    "cmd": [ "/bin/bash" ],
    "stdin": [
      "export uri=$(cat /tmp/mongosecret/uri)",
      "export db=$(cat /tmp/mongosecret/db)",
      "export collection=$(cat /tmp/mongosecret/collection)",
      "export username=$(cat /tmp/mongosecret/username)",
      "export password=$(cat /tmp/mongosecret/password)",
      "mongo \"$uri\" --authenticationDatabase admin --ssl --username $username --password $password --quiet --eval 'db.restaurants.aggregate({ $sample: { size: 1 } });' | tail -n1 | egrep -v \"^>|^bye\" > /pfs/out/output.json"
    ],
    "secrets": [
      {
        "name": "mongosecret",
        "mount_path": "/tmp/mongosecret"
      }
    ]
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 10s"
    }
  }
}
```

Because each run writes one randomly sampled document to `output.json`, viewing the head of the output repository over time shows a series of different documents queried out of MongoDB.

1. Create the pipeline.

    ```shell
    $ pachctl create pipeline -f query.json
    ```

1. Run the command `pachctl list pipeline` to make sure it's running:

    ```shell
    $ pachctl list pipeline
    NAME  VERSION INPUT           CREATED       STATE / LAST JOB   DESCRIPTION
    query 1       tick:@every 10s 6 seconds ago running / starting
    ```

1. After the pipeline is running, you should see jobs start to be triggered every 10 seconds.

    ```shell
    $ pachctl list job
    ID                                   OUTPUT COMMIT                          STARTED            DURATION  RESTART PROGRESS  DL  UL   STATE
    842e4e6c-4920-42c0-9c81-e5299b67e4a0 query/-                                1 second ago       -         0       0 + 0 / 1 0B  0B   running
    $ pachctl list job
    ID                                   OUTPUT COMMIT                          STARTED            DURATION  RESTART PROGRESS  DL  UL   STATE
    5938a0d0-9512-455f-a390-14adc3669e5f query/0f8a2ba1150a463299ee71961427bdcb 3 seconds ago      3 seconds 0       1 + 0 / 1 26B 617B success
    952427a6-c92d-4c98-a781-87616988d528 query/33776e4df3b24ab68d70b5185eb37661 13 seconds ago     1 second  0       1 + 0 / 1 26B 613B success
    1bc5f608-85fd-44eb-833e-562d15629706 query/6dd2a4da566f4d30ad9c66fc60244bab 23 seconds ago     1 second  0       1 + 0 / 1 26B 721B success
    efa677a4-7f83-424b-879d-70a0c5690bb2 query/f56b1f314030455c8bdf8a10b68ebd16 33 seconds ago     1 second  0       1 + 0 / 1 26B 529B success
    842e4e6c-4920-42c0-9c81-e5299b67e4a0 query/2a11bfc3e6d74af0a8d254d3ecf6f6af 43 seconds ago     1 second  0       1 + 0 / 1 26B 535B success
    $ pachctl list job
    ID                                   OUTPUT COMMIT                          STARTED            DURATION  RESTART PROGRESS  DL  UL   STATE
    7ab2b1cf-bd13-4aa7-bf7b-b06f2c29242a query/-                                1 second ago       -         0       0 + 0 / 1 0B  0B   running
    bc71d40b-5b1c-474d-a24d-f7487e037cee query/997587e3bd794e2ea48890e6022434f4 11 seconds ago     2 seconds 0       1 + 0 / 1 26B 836B success
    41d499ad-7ba2-4ba6-82b0-68a8d5e3e67a query/b8eae461132a49819e59b62c39e6b6eb 21 seconds ago     1 second  0       1 + 0 / 1 26B 669B success
    5938a0d0-9512-455f-a390-14adc3669e5f query/0f8a2ba1150a463299ee71961427bdcb 31 seconds ago     3 seconds 0       1 + 0 / 1 26B 617B success
    952427a6-c92d-4c98-a781-87616988d528 query/33776e4df3b24ab68d70b5185eb37661 41 seconds ago     1 second  0       1 + 0 / 1 26B 613B success
    1bc5f608-85fd-44eb-833e-562d15629706 query/6dd2a4da566f4d30ad9c66fc60244bab 51 seconds ago     1 second  0       1 + 0 / 1 26B 721B success
    efa677a4-7f83-424b-879d-70a0c5690bb2 query/f56b1f314030455c8bdf8a10b68ebd16 About a minute ago 1 second  0       1 + 0 / 1 26B 529B success
    842e4e6c-4920-42c0-9c81-e5299b67e4a0 query/2a11bfc3e6d74af0a8d254d3ecf6f6af About a minute ago 1 second  0       1 + 0 / 1 26B 535B success
    ```

1. You can then observe the output changing over time. You can watch it change with each query by executing:

    ```shell
    $ watch pachctl get file query@master:output.json
    ```

    Or you can look at individual results over time via the commit IDs:

    ```shell
    $ pachctl get file query@master:output.json
    { "_id" : ObjectId("59a455af69a077c0dc028410"), "address" : { "building" : "119", "coord" : [ -73.9784962, 40.6788476 ], "street" : "5 Avenue", "zipcode" : "11217" }, "borough" : "Brooklyn", "cuisine" : "Mexican", "grades" : [ { "date" : ISODate("2014-07-29T00:00:00Z"), "grade" : "B", "score" : 27 }, { "date" : ISODate("2014-03-10T00:00:00Z"), "grade" : "B", "score" : 15 }, { "date" : ISODate("2014-02-12T00:00:00Z"), "grade" : "P", "score" : 3 }, { "date" : ISODate("2013-09-05T00:00:00Z"), "grade" : "C", "score" : 35 }, { "date" : ISODate("2013-03-06T00:00:00Z"), "grade" : "A", "score" : 12 }, { "date" : ISODate("2012-09-12T00:00:00Z"), "grade" : "A", "score" : 13 }, { "date" : ISODate("2012-04-17T00:00:00Z"), "grade" : "A", "score" : 12 } ], "name" : "El Pollito Mexicano", "restaurant_id" : "41051406" }
    $ pachctl get file query@64ac2bd721d04212a3a0b90833f751e5:output.json
    { "_id" : ObjectId("59a455f069a077c0dc02e16e"), "address" : { "building" : "1650", "coord" : [ -73.928079, 40.856481 ], "street" : "Saint Nicholas Ave", "zipcode" : "10040" }, "borough" : "Manhattan", "cuisine" : "Spanish", "grades" : [ { "date" : ISODate("2015-01-20T00:00:00Z"), "grade" : "Not Yet Graded", "score" : 2 } ], "name" : "Angebienvendia", "restaurant_id" : "50018661" }
    $ pachctl get file query@74a6cf68de2047fe94ac7982065df03d:output.json
    { "_id" : ObjectId("59a455b669a077c0dc02904d"), "address" : { "building" : "14", "coord" : [ -73.990382, 40.741571 ], "street" : "West 23 Street", "zipcode" : "10010" }, "borough" : "Manhattan", "cuisine" : "Café/Coffee/Tea", "grades" : [ { "date" : ISODate("2014-05-02T00:00:00Z"), "grade" : "A", "score" : 11 }, { "date" : ISODate("2013-11-22T00:00:00Z"), "grade" : "A", "score" : 8 }, { "date" : ISODate("2012-11-20T00:00:00Z"), "grade" : "A", "score" : 9 }, { "date" : ISODate("2011-11-18T00:00:00Z"), "grade" : "A", "score" : 6 } ], "name" : "Starbucks Coffee (Store #13539)", "restaurant_id" : "41290548" }
    ```