>![pach_logo](../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Periodic ingress from MongoDB

This example pipeline executes a query periodically against a MongoDB database outside of Pachyderm.  The results of the query are stored in a corresponding output repository.  This repository could be used to drive additional pipeline stages periodically based on the results of the query.

The example assumes that you have:

- A Pachyderm cluster running - see [Local Installation](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/) to get up and running with a local Pachyderm cluster in just a few minutes.
- The `pachctl` CLI tool installed and connected to your Pachyderm cluster - see [deploy docs](https://docs.pachyderm.com/1.13.x/deploy-manage/) for instructions.

## Setup MongoDB

The easiest way to demonstrate this example is with a free hosted MongoDB cluster, such as the free tier of [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) or [MLab](https://mlab.com/) (although you could certainly do this with any MongoDB). Assuming that you are using MongoDB Atlas:

1. Deploy a new `Cluster0` MongoDB cluster with MongoDB Atlas (remember the admin username and password, as you will need these shortly).  Once deployed, you should be able to see this cluster in the MongoDB Atlas dashboard:

![alt text](mongo1.png)

1. Click on the "connect" button for your cluster and make sure that all IPs are whitelisted (or at least the k8s master IP where you have Pachyderm deployed):

![alt text](mongo2.png)

1. Then click on "Connect with the MongoDB shell" to find the URI, DB name (`test` if you are using MongoDB Atlas `Cluster0`), username, and authentication DB for connecting to your cluster.  You will need these to query MongoDB.

1. Make sure you have the MongoDB tools installed locally. You can follow [this guide](https://docs.mongodb.com/manual/administration/install-community/) to install them.
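
Before moving on, it can help to confirm that the command-line tools this walkthrough relies on are actually on your `PATH`. The sketch below is one way to do that; the helper name `missing_tools` is made up for illustration, and the tool list simply mirrors the commands used in this README (`mongo`, `mongoimport`, `pachctl`).

```shell
#!/usr/bin/env bash
# Print the name of every requested tool that cannot be found on PATH.
# `missing_tools` is a hypothetical helper, not part of MongoDB or Pachyderm.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools referenced in this example; adjust the list to your setup.
missing=$(missing_tools mongo mongoimport pachctl)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
fi
```

If nothing is printed, all three tools were found.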

## Import example data

We are going to run this example with an example set of data about restaurants.  This dataset comes directly from MongoDB and is used in many of their examples as well.

1. Download the dataset from [here](https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json).  It is named `primer-dataset.json`.  Each record in this dataset looks like the following:

    ```json
    {
      "address": {
         "building": "1007",
         "coord": [ -73.856077, 40.848447 ],
         "street": "Morris Park Ave",
         "zipcode": "10462"
      },
      "borough": "Bronx",
      "cuisine": "Bakery",
      "grades": [
         { "date": { "$date": 1393804800000 }, "grade": "A", "score": 2 },
         { "date": { "$date": 1378857600000 }, "grade": "A", "score": 6 },
         { "date": { "$date": 1358985600000 }, "grade": "A", "score": 10 },
         { "date": { "$date": 1322006400000 }, "grade": "A", "score": 9 },
         { "date": { "$date": 1299715200000 }, "grade": "B", "score": 14 }
      ],
      "name": "Morris Park Bake Shop",
      "restaurant_id": "30075445"
    }
    ```

1. Import the dataset to the `restaurants` collection in MongoDB (in the `test` DB if you are using MongoDB Atlas) using the `mongoimport` command.  You will need to specify the Mongo hosts, username, password, etc. from your MongoDB cluster.  For example:

    ```shell
    $ mongoimport --host Cluster0-shard-0/cluster0-shard-00-00-cwehf.mongodb.net:27017,cluster0-shard-00-01-cwehf.mongodb.net:27017,cluster0-shard-00-02-cwehf.mongodb.net:27017 --ssl -u admin -p '<my password>' --authenticationDatabase admin --db test --collection restaurants --drop --file primer-dataset.json
    2017-08-28T13:40:38.983-0400    connected to: Cluster0-shard-0/cluster0-shard-00-00-cwehf.mongodb.net:27017,cluster0-shard-00-01-cwehf.mongodb.net:27017,cluster0-shard-00-02-cwehf.mongodb.net:27017
    2017-08-28T13:40:39.048-0400    dropping: test.restaurants
    2017-08-28T13:40:41.310-0400    [#.......................] test.restaurants     540KB/11.3MB (4.7%)
    2017-08-28T13:40:44.310-0400    [##......................] test.restaurants     1.04MB/11.3MB (9.2%)
    2017-08-28T13:40:47.310-0400    [##......................] test.restaurants     1.04MB/11.3MB (9.2%)
    2017-08-28T13:40:50.310-0400    [###.....................] test.restaurants     1.55MB/11.3MB (13.7%)
    2017-08-28T13:40:53.310-0400    [####....................] test.restaurants     2.07MB/11.3MB (18.2%)
    2017-08-28T13:40:56.310-0400    [#####...................] test.restaurants     2.58MB/11.3MB (22.8%)
    2017-08-28T13:40:59.310-0400    [######..................] test.restaurants     3.10MB/11.3MB (27.4%)
    2017-08-28T13:41:02.310-0400    [######..................] test.restaurants     3.10MB/11.3MB (27.4%)
    2017-08-28T13:41:05.310-0400    [#######.................] test.restaurants     3.61MB/11.3MB (31.9%)
    2017-08-28T13:41:08.310-0400    [########................] test.restaurants     4.12MB/11.3MB (36.4%)
    2017-08-28T13:41:11.310-0400    [#########...............] test.restaurants     4.64MB/11.3MB (41.0%)
    2017-08-28T13:41:14.310-0400    [#########...............] test.restaurants     4.64MB/11.3MB (41.0%)
    2017-08-28T13:41:17.310-0400    [##########..............] test.restaurants     5.15MB/11.3MB (45.5%)
    2017-08-28T13:41:20.310-0400    [############............] test.restaurants     5.67MB/11.3MB (50.0%)
    2017-08-28T13:41:23.310-0400    [#############...........] test.restaurants     6.19MB/11.3MB (54.6%)
    2017-08-28T13:41:26.310-0400    [#############...........] test.restaurants     6.19MB/11.3MB (54.6%)
    2017-08-28T13:41:29.310-0400    [##############..........] test.restaurants     6.70MB/11.3MB (59.2%)
    2017-08-28T13:41:32.310-0400    [###############.........] test.restaurants     7.21MB/11.3MB (63.7%)
    2017-08-28T13:41:35.310-0400    [################........] test.restaurants     7.71MB/11.3MB (68.1%)
    2017-08-28T13:41:38.310-0400    [################........] test.restaurants     7.71MB/11.3MB (68.1%)
    2017-08-28T13:41:41.310-0400    [#################.......] test.restaurants     8.18MB/11.3MB (72.3%)
    2017-08-28T13:41:44.310-0400    [##################......] test.restaurants     8.62MB/11.3MB (76.1%)
    2017-08-28T13:41:47.310-0400    [###################.....] test.restaurants     9.03MB/11.3MB (79.7%)
    2017-08-28T13:41:50.310-0400    [###################.....] test.restaurants     9.41MB/11.3MB (83.1%)
    2017-08-28T13:41:53.310-0400    [####################....] test.restaurants     9.77MB/11.3MB (86.3%)
    2017-08-28T13:41:56.310-0400    [#####################...] test.restaurants     10.1MB/11.3MB (89.2%)
    2017-08-28T13:41:59.310-0400    [######################..] test.restaurants     10.4MB/11.3MB (91.9%)
    2017-08-28T13:42:02.310-0400    [######################..] test.restaurants     10.7MB/11.3MB (94.5%)
    2017-08-28T13:42:05.310-0400    [#######################.] test.restaurants     11.0MB/11.3MB (97.0%)
    2017-08-28T13:42:08.310-0400    [########################] test.restaurants     11.3MB/11.3MB (100.0%)
    2017-08-28T13:42:08.449-0400    [########################] test.restaurants     11.3MB/11.3MB (100.0%)
    2017-08-28T13:42:08.449-0400    imported 25359 documents
    ```
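
A quick way to double-check the import is to compare the number of records in the file with the count `mongoimport` reports on its last line. The sketch below assumes the dataset stores one JSON document per line (an assumption about the file layout), and both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Count the non-empty lines in a file. Assumes one JSON document per line.
count_records() {
  grep -c . "$1"
}

# Pull N out of an "imported N documents" line in mongoimport's log output,
# read from stdin.
imported_count() {
  sed -n 's/.*imported \([0-9][0-9]*\) documents.*/\1/p'
}

# Example: parse the final log line shown above.
echo '2017-08-28T13:42:08.449-0400    imported 25359 documents' | imported_count
```

If `count_records primer-dataset.json` and the imported count differ, it is worth re-checking the import log for rejected documents.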

## Create a Kubernetes secret with your Mongo creds

In order for your Pachyderm pipeline to talk with MongoDB, we need to tell Pachyderm about the MongoDB URI, username, password, etc.  We will do this via a [Kubernetes secret](https://kubernetes.io/docs/concepts/configuration/secret/), created with either `kubectl` or `pachctl create secret`.
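
Kubernetes keeps each value in a secret's `data` map base64-encoded, which is why the verification steps later in this section decode the values before comparing them. As a small illustration (the value `admin` is just a placeholder, and the helper names are hypothetical), the round trip looks like this:

```shell
#!/usr/bin/env bash
# Encode a value the way it appears in a secret's data map.
# printf '%s' avoids appending a newline, which would change the encoding.
encode_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Decode it again. `base64 -d` is the GNU coreutils flag; some older
# macOS releases expect `-D` instead.
decode_value() {
  printf '%s' "$1" | base64 -d
}

encoded=$(encode_value 'admin')
echo "$encoded"                  # prints YWRtaW4=
decode_value "$encoded"; echo    # prints admin
```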


1. The next few steps show you how to add a secret with the following five keys:

   * `uri`
   * `username`
   * `password`
   * `db`
   * `collection`

   First, we'll save some values to files.
   The values should all be enclosed in single quotes to prevent the shell from interpreting them.

   ```shell
   $ echo -n '<uri>' > uri ; chmod 600 uri
   $ echo -n '<username>' > username ; chmod 600 username
   $ echo -n '<password>' > password ; chmod 600 password
   $ echo -n '<db>' > db ; chmod 600 db
   $ echo -n '<collection>' > collection ; chmod 600 collection
   ```

1. Confirm the values in these files are what you expect.

   ```shell
   $ cat uri
   $ cat username
   $ cat password
   $ cat db
   $ cat collection
   ```

   Creating the secret requires different steps,
   depending on whether you have Kubernetes access or not.
   Pachyderm Hub users don't have access to Kubernetes.
   If you have Kubernetes access and want to use `kubectl`,
   follow the two steps prefixed with "(Kubernetes)".
   If you don't have access to Kubernetes or don't want to use `kubectl`,
   follow the three steps labeled "(Pachyderm Hub)".

1. (Kubernetes) If you have direct access to the Kubernetes cluster, you can create a secret using `kubectl`.

   ```shell
   $ kubectl create secret generic mongosecret --from-file=./uri \
       --from-file=./username \
       --from-file=./password \
       --from-file=./db \
       --from-file=./collection
   ```

1. (Kubernetes) Confirm that the secret was set correctly.
   Use `kubectl get secret` to output the secret, then decode its values using `jq` to confirm they're correct
   (the `@base64d` filter requires jq 1.6 or later).

   ```shell
   $ kubectl get secret mongosecret -o json | jq '.data | map_values(@base64d)'
   {
       "uri": "<uri>",
       "username": "<username>",
       "password": "<password>",
       "db": "<db>",
       "collection": "<collection>"
   }
   ```

   You will have to use `pachctl` if you're using Pachyderm Hub,
   or don't have access to the Kubernetes cluster.
   The next three steps show how to do that.

1. (Pachyderm Hub) Create a secrets file from the provided template.
   The command substitutions are quoted so that values containing spaces or shell metacharacters are passed to `jq` intact.

   ```shell
   $ jq -n --arg uri "$(cat uri)" --arg username "$(cat username)" \
       --arg password "$(cat password)" --arg db "$(cat db)" --arg collection "$(cat collection)" \
       -f mongodb-credentials-template.jq  > mongodb-credentials-secret.json
   $ chmod 600 mongodb-credentials-secret.json
   ```

1. (Pachyderm Hub) Confirm the secrets file is correct by decoding the values.

   ```shell
   $ jq '.data | map_values(@base64d)' mongodb-credentials-secret.json
   {
       "uri": "<uri>",
       "username": "<username>",
       "password": "<password>",
       "db": "<db>",
       "collection": "<collection>"
   }
   ```

1. (Pachyderm Hub) Generate a secret using `pachctl`.

   ```shell
   $ pachctl create secret -f mongodb-credentials-secret.json
   ```
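
The steps above write each credential with `echo -n` so that no trailing newline ends up in the file. When a file is loaded with `--from-file`, a stray newline would become part of the stored value and can cause authentication to fail. The following sketch checks the five files for that; `ends_with_newline` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if the file's last byte is a newline.
# Command substitution strips a trailing newline, so an empty result
# from `tail -c 1` on a non-empty file means the last byte was a newline.
ends_with_newline() {
  [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Check the five files created earlier; files that do not exist are skipped.
for f in uri username password db collection; do
  [ -f "$f" ] || continue
  if ends_with_newline "$f"; then
    echo "warning: $f ends with a newline"
  fi
done
```

No output means none of the files that were found end with a newline.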

## Create the pipeline, view the results

In our [pipeline spec](query.json), we will do the following:

- Grab the Kubernetes secret that we defined above.
- Define a `cron` input that will cause the pipeline to be triggered every 10 seconds.
- Using the official `mongo` Docker image, query for a random document from the `restaurants` collection and output that to `/pfs/out`.

```json
{
  "pipeline": {
    "name": "query"
  },
  "transform": {
    "image": "mongo",
    "cmd": [ "/bin/bash" ],
    "stdin": [
      "export uri=$(cat /tmp/mongosecret/uri)",
      "export db=$(cat /tmp/mongosecret/db)",
      "export collection=$(cat /tmp/mongosecret/collection)",
      "export username=$(cat /tmp/mongosecret/username)",
      "export password=$(cat /tmp/mongosecret/password)",
      "mongo \"$uri\" --authenticationDatabase admin --ssl --username $username --password $password --quiet --eval 'db.restaurants.aggregate({ $sample: { size: 1 } });' | tail -n1 | egrep -v \"^>|^bye\" > /pfs/out/output.json"
    ],
    "secrets": [
      {
        "name": "mongosecret",
        "mount_path": "/tmp/mongosecret"
      }
    ]
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 10s"
    }
  }
}
```

This will allow us to view the head of the output over time and see a series of random documents being queried out of MongoDB.
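
The `secrets` entry in the spec mounts `mongosecret` as a directory at `mount_path`, with one file per key, which is why the `stdin` commands read each value with `cat`. To illustrate that layout without a cluster, the sketch below builds a throwaway directory standing in for `/tmp/mongosecret` and reads the keys the same way. The directory, its placeholder values, and the `read_secret` helper are assumptions for demonstration only.

```shell
#!/usr/bin/env bash
# Stand-in for the mounted secret: one file per key, no trailing newline.
secret_dir=$(mktemp -d)
for key in uri username password db collection; do
  printf '%s' "example-$key" > "$secret_dir/$key"
done

# Read a single key from a secret directory, as the pipeline does with cat.
read_secret() {
  cat "$1/$2"
}

uri=$(read_secret "$secret_dir" uri)
collection=$(read_secret "$secret_dir" collection)
echo "uri=$uri collection=$collection"   # prints uri=example-uri collection=example-collection

rm -rf "$secret_dir"
```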

1. Create the pipeline.

    ```shell
    $ pachctl create pipeline -f query.json
    ```

1. Run the command `pachctl list pipeline` to make sure it's running:

   ```shell
   $ pachctl list pipeline
   NAME      VERSION INPUT                     CREATED       STATE / LAST JOB   DESCRIPTION
   query     1       tick:@every 10s           6 seconds ago running / starting
   ```

1. After the pipeline is running, you should see jobs start to be triggered every 10 seconds.

    ```shell
    $ pachctl list job
    ID                                   OUTPUT COMMIT STARTED      DURATION RESTART PROGRESS  DL UL STATE
    842e4e6c-4920-42c0-9c81-e5299b67e4a0 query/-       1 second ago -        0       0 + 0 / 1 0B 0B running
    $ pachctl list job
    ID                                   OUTPUT COMMIT                          STARTED        DURATION  RESTART PROGRESS  DL  UL   STATE
    5938a0d0-9512-455f-a390-14adc3669e5f query/0f8a2ba1150a463299ee71961427bdcb 3 seconds ago  3 seconds 0       1 + 0 / 1 26B 617B success
    952427a6-c92d-4c98-a781-87616988d528 query/33776e4df3b24ab68d70b5185eb37661 13 seconds ago 1 second  0       1 + 0 / 1 26B 613B success
    1bc5f608-85fd-44eb-833e-562d15629706 query/6dd2a4da566f4d30ad9c66fc60244bab 23 seconds ago 1 second  0       1 + 0 / 1 26B 721B success
    efa677a4-7f83-424b-879d-70a0c5690bb2 query/f56b1f314030455c8bdf8a10b68ebd16 33 seconds ago 1 second  0       1 + 0 / 1 26B 529B success
    842e4e6c-4920-42c0-9c81-e5299b67e4a0 query/2a11bfc3e6d74af0a8d254d3ecf6f6af 43 seconds ago 1 second  0       1 + 0 / 1 26B 535B success
    $ pachctl list job
    ID                                   OUTPUT COMMIT                          STARTED            DURATION  RESTART PROGRESS  DL  UL   STATE
    7ab2b1cf-bd13-4aa7-bf7b-b06f2c29242a query/-                                1 second ago       -         0       0 + 0 / 1 0B  0B   running
    bc71d40b-5b1c-474d-a24d-f7487e037cee query/997587e3bd794e2ea48890e6022434f4 11 seconds ago     2 seconds 0       1 + 0 / 1 26B 836B success
    41d499ad-7ba2-4ba6-82b0-68a8d5e3e67a query/b8eae461132a49819e59b62c39e6b6eb 21 seconds ago     1 second  0       1 + 0 / 1 26B 669B success
    5938a0d0-9512-455f-a390-14adc3669e5f query/0f8a2ba1150a463299ee71961427bdcb 31 seconds ago     3 seconds 0       1 + 0 / 1 26B 617B success
    952427a6-c92d-4c98-a781-87616988d528 query/33776e4df3b24ab68d70b5185eb37661 41 seconds ago     1 second  0       1 + 0 / 1 26B 613B success
    1bc5f608-85fd-44eb-833e-562d15629706 query/6dd2a4da566f4d30ad9c66fc60244bab 51 seconds ago     1 second  0       1 + 0 / 1 26B 721B success
    efa677a4-7f83-424b-879d-70a0c5690bb2 query/f56b1f314030455c8bdf8a10b68ebd16 About a minute ago 1 second  0       1 + 0 / 1 26B 529B success
    842e4e6c-4920-42c0-9c81-e5299b67e4a0 query/2a11bfc3e6d74af0a8d254d3ecf6f6af About a minute ago 1 second  0       1 + 0 / 1 26B 535B success
    ```

1. You can then observe the output changing over time.  You can watch it change with each query by executing:

    ```shell
    $ watch pachctl get file query@master:output.json
    ```

    Or you can look at individual results over time via the commit IDs:

    ```shell
    $ pachctl get file query@master:output.json
    { "_id" : ObjectId("59a455af69a077c0dc028410"), "address" : { "building" : "119", "coord" : [ -73.9784962, 40.6788476 ], "street" : "5 Avenue", "zipcode" : "11217" }, "borough" : "Brooklyn", "cuisine" : "Mexican", "grades" : [ { "date" : ISODate("2014-07-29T00:00:00Z"), "grade" : "B", "score" : 27 }, { "date" : ISODate("2014-03-10T00:00:00Z"), "grade" : "B", "score" : 15 }, { "date" : ISODate("2014-02-12T00:00:00Z"), "grade" : "P", "score" : 3 }, { "date" : ISODate("2013-09-05T00:00:00Z"), "grade" : "C", "score" : 35 }, { "date" : ISODate("2013-03-06T00:00:00Z"), "grade" : "A", "score" : 12 }, { "date" : ISODate("2012-09-12T00:00:00Z"), "grade" : "A", "score" : 13 }, { "date" : ISODate("2012-04-17T00:00:00Z"), "grade" : "A", "score" : 12 } ], "name" : "El Pollito Mexicano", "restaurant_id" : "41051406" }
    $ pachctl get file query@64ac2bd721d04212a3a0b90833f751e5:output.json
    { "_id" : ObjectId("59a455f069a077c0dc02e16e"), "address" : { "building" : "1650", "coord" : [ -73.928079, 40.856481 ], "street" : "Saint Nicholas Ave", "zipcode" : "10040" }, "borough" : "Manhattan", "cuisine" : "Spanish", "grades" : [ { "date" : ISODate("2015-01-20T00:00:00Z"), "grade" : "Not Yet Graded", "score" : 2 } ], "name" : "Angebienvendia", "restaurant_id" : "50018661" }
    $ pachctl get file query@74a6cf68de2047fe94ac7982065df03d:output.json
    { "_id" : ObjectId("59a455b669a077c0dc02904d"), "address" : { "building" : "14", "coord" : [ -73.990382, 40.741571 ], "street" : "West   23 Street", "zipcode" : "10010" }, "borough" : "Manhattan", "cuisine" : "Café/Coffee/Tea", "grades" : [ { "date" : ISODate("2014-05-02T00:00:00Z"), "grade" : "A", "score" : 11 }, { "date" : ISODate("2013-11-22T00:00:00Z"), "grade" : "A", "score" : 8 }, { "date" : ISODate("2012-11-20T00:00:00Z"), "grade" : "A", "score" : 9 }, { "date" : ISODate("2011-11-18T00:00:00Z"), "grade" : "A", "score" : 6 } ], "name" : "Starbucks Coffee (Store #13539)", "restaurant_id" : "41290548" }
    ```
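
Note that the output is in the mongo shell's extended format (`ObjectId(...)`, `ISODate(...)`) rather than strict JSON, so strict JSON parsers will reject it as-is. For quick inspection, plain text tools are enough. The sketch below pulls the `name` field out of one result line with `sed`; `extract_name` is a hypothetical helper, and the pattern assumes the `"name" : "..."` spacing shown above.

```shell
#!/usr/bin/env bash
# Print the value of the "name" field from mongo shell output on stdin.
# Assumes the field is formatted as: "name" : "value"
extract_name() {
  sed -n 's/.*"name" : "\([^"]*\)".*/\1/p'
}

# Shortened copy of a result shown above.
sample='{ "_id" : ObjectId("59a455af69a077c0dc028410"), "borough" : "Brooklyn", "cuisine" : "Mexican", "name" : "El Pollito Mexicano", "restaurant_id" : "41051406" }'
printf '%s\n' "$sample" | extract_name   # prints El Pollito Mexicano
```

Against a live pipeline, the same helper could be fed from `pachctl get file query@master:output.json | extract_name`.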