>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Boston Housing Prices

This example creates a simple machine learning pipeline in Pachyderm to train a regression model on the Boston Housing Dataset to predict the value of homes in Boston. The pipeline itself is written in Python, though a Pachyderm pipeline could be written in any language.

<p align="center">
  <img width="200" height="300" src="images/regression_pipeline.png">
</p>

The Pachyderm pipeline performs the following actions:

1. Imports the structured dataset (`.csv`) with `pandas`.
2. Performs data analysis with `scikit-learn`.
3. Trains a regression model to predict housing prices.
4. Generates a learning curve and performance metrics to estimate the quality of the model.

Table of Contents:

- [Boston Housing Prices](#boston-housing-prices)
  - [Housing Prices Dataset](#housing-prices-dataset)
  - [Prerequisites](#prerequisites)
  - [Python Code](#python-code)
    - [Data Analysis](#data-analysis)
    - [Train a regression model](#train-a-regression-model)
    - [Evaluate the model](#evaluate-the-model)
  - [Pachyderm Pipeline](#pachyderm-pipeline)
    - [TLDR; Just give me the code](#tldr-just-give-me-the-code)
    - [Step 1: Create an input data repository](#step-1-create-an-input-data-repository)
    - [Step 2: Create the regression pipeline](#step-2-create-the-regression-pipeline)
    - [Step 3: Add the housing dataset to the repo](#step-3-add-the-housing-dataset-to-the-repo)
    - [Step 4: Download files once the pipeline has finished](#step-4-download-files-once-the-pipeline-has-finished)
    - [Step 5: Update Dataset](#step-5-update-dataset)
    - [Step 6: Inspect the Pipeline Lineage](#step-6-inspect-the-pipeline-lineage)

## Housing Prices Dataset

The housing prices dataset used for this example is a reduced version of the original [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html), which was originally collected by the U.S. Census Service. We choose to focus on three features of the original dataset (RM, LSTAT, and PTRATIO) and the output, or target (MEDV), that we are learning to predict.

|Feature| Description|
|---|---|
|RM |       Average number of rooms per dwelling|
|LSTAT |    A measurement of the socioeconomic status of people living in the area|
|PTRATIO |  Pupil-teacher ratio by town - approximation of the local education system's quality|
|MEDV |     Median value of owner-occupied homes in $1000's|

Sample:

|RM   |LSTAT|PTRATIO|MEDV|
|-----|----|----|--------|
|6.575|4.98|15.3|504000.0|
|6.421|9.14|17.8|453600.0|
|7.185|4.03|17.8|728700.0|
|6.998|2.94|18.7|701400.0|

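To make the shape of the data concrete, here is a minimal sketch that parses the four sample rows above. It assumes the CSV file has a header row with these column names and uses commas as separators. The pipeline itself reads the file with `pandas`; this sketch uses only the standard library so that it runs without extra dependencies.

```python
import csv
import io

# The four sample rows from the table above, written as CSV text.
# The header row and comma separator are assumptions about the file layout.
SAMPLE_CSV = """RM,LSTAT,PTRATIO,MEDV
6.575,4.98,15.3,504000.0
6.421,9.14,17.8,453600.0
7.185,4.03,17.8,728700.0
6.998,2.94,18.7,701400.0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts mapping column name to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: float(value) for name, value in row.items()} for row in reader]


rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows with columns", list(rows[0]))
```
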
## Prerequisites

Before you can deploy this example, you need the following components:

1. A clone of this Pachyderm repository on your local computer.
2. A Pachyderm cluster. You can deploy a cluster on [PacHub](https://hub.pachyderm.com) or deploy locally as described [here](https://docs.pachyderm.com/1.13.x/getting_started/).

Verify that your environment is accessible by running `pachctl version`, which shows both the `pachctl` and `pachd` versions.

```shell
$ pachctl version
COMPONENT           VERSION
pachctl             1.11.0
pachd               1.11.0
```

## Python Code

The `regression.py` Python file contains the machine learning code for the example. We will give a brief description of it here, but full knowledge of it is not required for the example.

```shell
$ python regression.py --help

usage: regression.py [-h] [--input INPUT] [--target-col TARGET_COL]
                     [--output DIR]

Structured data regression

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         csv file with all examples
  --target-col TARGET_COL
                        column with target values
  --output DIR          output directory
```

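For readers who want to see how such a command-line interface could be declared, the following is a sketch of an `argparse` parser reconstructed from the help text above. It is not the actual source of `regression.py`; default values and any validation in the real script are unknown and omitted here.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the help text shown above."""
    parser = argparse.ArgumentParser(
        prog="regression.py", description="Structured data regression"
    )
    parser.add_argument("--input", help="csv file with all examples")
    parser.add_argument("--target-col", help="column with target values")
    # metavar="DIR" makes the help output read "--output DIR".
    parser.add_argument("--output", metavar="DIR", help="output directory")
    return parser


# Parse the same arguments that the pipeline's transform command passes.
args = build_parser().parse_args(
    ["--input", "/pfs/housing_data/", "--target-col", "MEDV", "--output", "/pfs/out/"]
)
print(args.input, args.target_col, args.output)
```
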
The regression code performs the following actions:

1. Analyses the data.
2. Trains a regressor.
3. Evaluates the model.

### Data Analysis
The first step in the pipeline creates a pairplot showing the relationship between features. By seeing which features are positively or negatively correlated with the target value (or each other), we can better understand which features may be valuable to the model.

<p align="center">
  <img width="500" height="400"  src="images/pairplot-1.png">
</p>

We can represent the same data in color form with a correlation matrix. The darker the color, the higher the correlation (+/-).

<p align="center">
  <img width="500" height="400"  src="images/corr_matrix-1.png">
</p>

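Each cell of a correlation matrix is typically a Pearson correlation coefficient between two columns. The sketch below computes that coefficient by hand for the four sample rows from the dataset section. It is only an illustration of the quantity being plotted: four rows are far too few to draw conclusions, and the pipeline's own plotting code may compute the matrix differently.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Columns of the four sample rows shown in the dataset section.
columns = {
    "RM": [6.575, 6.421, 7.185, 6.998],
    "LSTAT": [4.98, 9.14, 4.03, 2.94],
    "PTRATIO": [15.3, 17.8, 17.8, 18.7],
    "MEDV": [504000.0, 453600.0, 728700.0, 701400.0],
}

# One row of the correlation matrix: each feature against the target.
for name in ("RM", "LSTAT", "PTRATIO"):
    print(f"{name} vs MEDV: {pearson(columns[name], columns['MEDV']):+.2f}")
```
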
### Train a regression model
We train the regression model using scikit-learn. In our case, we train a Random Forest Regressor ensemble. After splitting the data into features and targets (`X` and `y`), we fit the model to the data.

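The split into `X` and `y` can be sketched in plain Python as follows. The helper name `split_features_target` is hypothetical and the rows are the first two sample rows from the dataset section; the real script most likely does this with `pandas`. With scikit-learn installed, the resulting `X` and `y` could then be passed to `RandomForestRegressor().fit(X, y)`.

```python
def split_features_target(rows, target_col):
    """Split row dicts into feature names, a feature matrix X and a target list y."""
    feature_names = [name for name in rows[0] if name != target_col]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[target_col] for row in rows]
    return feature_names, X, y


# First two sample rows from the dataset section.
rows = [
    {"RM": 6.575, "LSTAT": 4.98, "PTRATIO": 15.3, "MEDV": 504000.0},
    {"RM": 6.421, "LSTAT": 9.14, "PTRATIO": 17.8, "MEDV": 453600.0},
]

feature_names, X, y = split_features_target(rows, target_col="MEDV")
print(feature_names)
print(X)
print(y)
```
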
### Evaluate the model
After the model is trained, we output some visualizations to evaluate its effectiveness using the learning curve and other statistics.

<p align="center">
  <img src="images/cv_reg_output-1.png">
</p>

## Pachyderm Pipeline

Now we'll deploy this Python code with Pachyderm.

### TLDR; Just give me the code

```shell
# Step 1: Create input data repository
pachctl create repo housing_data

# Step 2: Create the regression pipeline
pachctl create pipeline -f regression.json

# Step 3: Add the housing dataset to the repo
pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-1.csv

# Step 4: Download files once the pipeline has finished
pachctl get file regression@master:/ --recursive --output .

# Step 5: Update dataset with more data
pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-2.csv --overwrite

# Step 6: Inspect the lineage of the pipeline
pachctl list commit regression@master
```

### Step 1: Create an input data repository

Once the Pachyderm cluster is running, create a data repository called `housing_data` where we will put our dataset.

```shell
$ pachctl create repo housing_data
$ pachctl list repo
NAME                CREATED             SIZE
housing_data        3 seconds ago       0 B
```

### Step 2: Create the regression pipeline

We can now connect a pipeline to watch the data repo. Pipelines are defined in JSON format. Here is the one that we'll use for the regression pipeline (`regression.json`):

```json
{
    "pipeline": {
        "name": "regression"
    },
    "description": "A pipeline that trains and produces a regression model for housing prices.",
    "input": {
        "pfs": {
            "glob": "/*",
            "repo": "housing_data"
        }
    },
    "transform": {
        "cmd": [
            "python", "regression.py",
            "--input", "/pfs/housing_data/",
            "--target-col", "MEDV",
            "--output", "/pfs/out/"
        ],
        "image": "pachyderm/housing-prices:1.11.0"
    }
}
```

For the **input** field in the pipeline definition, we define input data repo(s) and a [glob pattern](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/glob-pattern/). A glob pattern tells the pipeline how to split the input data into datums, the units of work that a job processes. Here, `/*` treats each top-level file or directory in the `housing_data` repository as its own datum.

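As a rough mental model of what `/*` does, the sketch below groups file paths by their first path component, so that each top-level file or directory becomes one datum. This is an illustration only and not Pachyderm's implementation; the function name and the example paths are hypothetical.

```python
from collections import defaultdict


def top_level_datums(file_paths):
    """Group absolute file paths by their first path component, mimicking "/*"."""
    datums = defaultdict(list)
    for path in file_paths:
        first = path.strip("/").split("/")[0]
        datums["/" + first].append(path)
    return dict(datums)


# Hypothetical repository contents, for illustration only.
paths = ["/housing-simplified.csv", "/extra/a.csv", "/extra/b.csv"]
for datum, files in top_level_datums(paths).items():
    print(datum, "->", files)
```

In this example the repository holds a single file, so each job processes a single datum.
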
Within the **transform** field, **image** defines which Docker image is used for the pipeline, and **cmd** is the command run once a pipeline job starts.

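The fields of the spec depend on one another: the input repo name determines the directory under `/pfs/` where the data is mounted, which in turn is what the command receives as `--input`. The following sketch builds the same spec from Python to make that relationship explicit. The helper `regression_spec` is hypothetical, and the optional `description` field is left out for brevity.

```python
import json


def regression_spec(repo="housing_data", target_col="MEDV",
                    image="pachyderm/housing-prices:1.11.0"):
    """Return a pipeline spec dict mirroring regression.json (minus the description)."""
    return {
        "pipeline": {"name": "regression"},
        "input": {"pfs": {"glob": "/*", "repo": repo}},
        "transform": {
            "cmd": [
                "python", "regression.py",
                # The input repo is mounted under /pfs/<repo name>/.
                "--input", f"/pfs/{repo}/",
                "--target-col", target_col,
                # Anything written to /pfs/out/ ends up in the output repo.
                "--output", "/pfs/out/",
            ],
            "image": image,
        },
    }


spec = regression_spec()
print(json.dumps(spec, indent=4))
```
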
Once this pipeline is created, it watches for any changes to its input and, when it detects one, starts a new job to train on the new dataset.

```shell
$ pachctl create pipeline -f regression.json
```

The pipeline writes its output to `/pfs/out/` (see the pipeline JSON), which Pachyderm stores in a PFS repo created with the same name as the pipeline.

### Step 3: Add the housing dataset to the repo
Now we can add the data, which will kick off the processing automatically. If we update the data with a new commit, then the pipeline will automatically re-run.

```shell
$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-1.csv
```

We can verify that the data is in the repository by listing its files.

```shell
$ pachctl list file housing_data@master
NAME                    TYPE SIZE
/housing-simplified.csv file 12.14KiB
```

We can see that the pipeline is running by looking at the status of the job(s).

```shell
$ pachctl list job
ID                               PIPELINE   STARTED        DURATION   RESTART PROGRESS  DL       UL      STATE
299b4f36535e47e399e7df7fc6ee2f7f regression 23 seconds ago 18 seconds 0       1 + 0 / 1 2.482KiB 1002KiB success
```

### Step 4: Download files once the pipeline has finished
Once the pipeline is completed, we can download the files that were created.

```shell
$ pachctl list file regression@master
NAME                                  TYPE SIZE
/housing-simplified_corr_matrix.png   file 18.66KiB
/housing-simplified_cv_reg_output.png file 62.19KiB
/housing-simplified_final_model.sav   file 1.007KiB
/housing-simplified_pairplot.png      file 207.5KiB

$ pachctl get file regression@master:/ --recursive --output .
```

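The output file names in the listing above all appear to be built from the input file's base name plus a fixed suffix. The sketch below reproduces that naming pattern. It is inferred from the listing only, and the helper `output_paths` is hypothetical rather than taken from `regression.py`.

```python
from pathlib import PurePosixPath

# Suffixes observed in the `pachctl list file regression@master` output above.
SUFFIXES = ["_corr_matrix.png", "_cv_reg_output.png", "_final_model.sav", "_pairplot.png"]


def output_paths(input_file, output_dir="/pfs/out"):
    """Derive the output paths for one input CSV from its base name."""
    stem = PurePosixPath(input_file).stem
    return [str(PurePosixPath(output_dir) / f"{stem}{suffix}") for suffix in SUFFIXES]


for path in output_paths("/pfs/housing_data/housing-simplified.csv"):
    print(path)
```
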
When we inspect the learning curve, we can see that there is a large gap between the training score and the validation score. This typically indicates that our model could benefit from the addition of more data.

<p align="center">
  <img width="400" height="350" src="images/learning_curve-1.png">
</p>

Now let's update our dataset with additional examples.

### Step 5: Update Dataset
Here's where Pachyderm truly starts to shine. To update our dataset we can run the following command (note that we could also append new examples to the existing file, but in this example we're simply overwriting our previous file with one that has more data):

```shell
$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-2.csv --overwrite
```

The new commit of data to the `housing_data` repository automatically kicks off a job on the `regression` pipeline without us having to do anything.

When the job is complete we can download the new files and see that our model has improved, given the new learning curve.

<p align="center">
  <img src="images/cv_reg_output-2.png">
</p>

### Step 6: Inspect the Pipeline Lineage

Because Pachyderm versions all of our input and output data automatically, we can continue to iterate on our data and code, and Pachyderm will track all of our experiments. For any given output commit, Pachyderm can tell us exactly which input commit of data it was produced from. In our simple example we have only run 2 experiments so far, but this becomes increasingly valuable as we run many more iterations.

We can list the commits in any repository by using the `list commit` command.

```shell
$ pachctl list commit housing_data@master
REPO         BRANCH COMMIT                           FINISHED       SIZE     PROGRESS DESCRIPTION
housing_data master a186886de0bf430ebf6fce4d538d4db7 3 minutes ago  12.14KiB ▇▇▇▇▇▇▇▇
housing_data master bbe5ce248aa44522a012f1967295ccdd 23 minutes ago 2.482KiB ▇▇▇▇▇▇▇▇

$ pachctl list commit regression@master
REPO       BRANCH COMMIT                           FINISHED       SIZE     PROGRESS DESCRIPTION
regression master f59a6663073b4e81a2d2ab3b4b7c68fc 2 minutes ago  4.028MiB -
regression master bc0ecea5a2cd43349a9db3e89933fb42 22 minutes ago 1001KiB  -
```

We can show exactly which version of the dataset and pipeline created the model by selecting the commit ID and using the `inspect commit` command.

```shell
$ pachctl inspect commit regression@f59a6663073b4e81a2d2ab3b4b7c68fc
Commit: regression@f59a6663073b4e81a2d2ab3b4b7c68fc
Original Branch: master
Parent: bc0ecea5a2cd43349a9db3e89933fb42
Started: 7 minutes ago
Finished: 7 minutes ago
Size: 4.028MiB
Provenance:  __spec__@5b17c425a8d54026a6daaeaf8721707a (regression)  housing_data@a186886de0bf430ebf6fce4d538d4db7 (master)
```

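If you want to record this lineage alongside an experiment, the `Provenance` line can be parsed into structured data. The sketch below does so with a regular expression. It assumes the text layout shown in the output above (`repo@commit (name)` with 32-character hexadecimal commit IDs), which may differ between `pachctl` versions, and it treats the name in parentheses as a branch.

```python
import re

# Matches entries such as "housing_data@a186...4db7 (master)".
PROVENANCE_RE = re.compile(r"(\S+)@([0-9a-f]{32}) \((\S+)\)")


def parse_provenance(line):
    """Extract repo, commit ID and branch for each entry on a Provenance line."""
    _, _, rest = line.partition("Provenance:")
    return [
        {"repo": repo, "commit": commit, "branch": branch}
        for repo, commit, branch in PROVENANCE_RE.findall(rest)
    ]


# The Provenance line copied from the `inspect commit` output above.
line = (
    "Provenance:  __spec__@5b17c425a8d54026a6daaeaf8721707a (regression)  "
    "housing_data@a186886de0bf430ebf6fce4d538d4db7 (master)"
)

for entry in parse_provenance(line):
    print(entry)
```
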
Additionally, we can show the downstream provenance of a commit by using the `flush commit` command, which shows everything that was run and produced from that commit.

```shell
$ pachctl flush commit housing_data@bbe5ce248aa44522a012f1967295ccdd
REPO       BRANCH COMMIT                           FINISHED       SIZE    PROGRESS DESCRIPTION
regression master bc0ecea5a2cd43349a9db3e89933fb42 31 minutes ago 1001KiB -
```