> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Boston Housing Prices

This example creates a simple machine learning pipeline in Pachyderm to train a regression model on the Boston Housing Dataset and predict the value of homes in Boston. The pipeline itself is written in Python, though a Pachyderm pipeline could be written in any language.

<p align="center">
  <img width="200" height="300" src="images/regression_pipeline.png">
</p>

The Pachyderm pipeline performs the following actions:

1. Imports the structured dataset (`.csv`) with `pandas`.
2. Performs data analysis with `scikit-learn`.
3. Trains a regression model to predict housing prices.
4. Generates a learning curve and performance metrics to estimate the quality of the model.
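Taken together, these stages can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual `regression.py`; the data rows are the sample rows shown later in this README.

```python
# Illustrative sketch of the pipeline's four stages -- the real logic
# lives in regression.py. Rows are the samples shown in this README.
import io

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

csv_data = io.StringIO(
    "RM,LSTAT,PTRATIO,MEDV\n"
    "6.575,4.98,15.3,504000.0\n"
    "6.421,9.14,17.8,453600.0\n"
    "7.185,4.03,17.8,728700.0\n"
    "6.998,2.94,18.7,701400.0\n"
)

df = pd.read_csv(csv_data)               # 1. import the structured dataset
print(df.corr())                         # 2. simple data analysis (correlations)

X, y = df.drop(columns=["MEDV"]), df["MEDV"]
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X, y)                          # 3. train a regression model
print("train R^2:", model.score(X, y))   # 4. a quick quality estimate
```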
Table of Contents:

- [Boston Housing Prices](#boston-housing-prices)
  - [Housing Prices Dataset](#housing-prices-dataset)
  - [Prerequisites](#prerequisites)
  - [Python Code](#python-code)
    - [Data Analysis](#data-analysis)
    - [Train a regression model](#train-a-regression-model)
    - [Evaluate the model](#evaluate-the-model)
  - [Pachyderm Pipeline](#pachyderm-pipeline)
    - [TLDR; Just give me the code](#tldr-just-give-me-the-code)
    - [Step 1: Create an input data repository](#step-1-create-an-input-data-repository)
    - [Step 2: Create the regression pipeline](#step-2-create-the-regression-pipeline)
    - [Step 3: Add the housing dataset to the repo](#step-3-add-the-housing-dataset-to-the-repo)
    - [Step 4: Download files once the pipeline has finished](#step-4-download-files-once-the-pipeline-has-finished)
    - [Step 5: Update Dataset](#step-5-update-dataset)
    - [Step 6: Inspect the Pipeline Lineage](#step-6-inspect-the-pipeline-lineage)

## Housing Prices Dataset

The housing prices dataset used for this example is a reduced version of the original [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html), which was originally collected by the U.S. Census Service. We choose to focus on three features of the original dataset (RM, LSTAT, and PTRATIO) and the output, or target (MEDV), that we are learning to predict.
| Feature | Description |
|---|---|
| RM | Average number of rooms per dwelling |
| LSTAT | A measurement of the socioeconomic status of people living in the area |
| PTRATIO | Pupil-teacher ratio by town - an approximation of the local education system's quality |
| MEDV | Median value of owner-occupied homes in $1000's |

Sample:

| RM | LSTAT | PTRATIO | MEDV |
|-----|----|----|--------|
| 6.575 | 4.98 | 15.3 | 504000.0 |
| 6.421 | 9.14 | 17.8 | 453600.0 |
| 7.185 | 4.03 | 17.8 | 728700.0 |
| 6.998 | 2.94 | 18.7 | 701400.0 |

## Prerequisites

Before you can deploy this example, you need the following components:

1. A clone of this Pachyderm repository on your local computer.
2. A Pachyderm cluster - you can deploy a cluster on [PacHub](https://hub.pachyderm.com) or deploy locally as described [here](https://docs.pachyderm.com/1.13.x/getting_started/).

Verify that your environment is accessible by running `pachctl version`, which shows both the `pachctl` and `pachd` versions.

```shell
$ pachctl version
COMPONENT           VERSION
pachctl             1.11.0
pachd               1.11.0
```

## Python Code

The `regression.py` Python file contains the machine learning code for the example. We give a brief description of it here, but full knowledge of it is not required for the example.

```
$ python regression.py --help

usage: regression.py [-h] [--input INPUT] [--target-col TARGET_COL]
                     [--output DIR]

Structured data regression

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         csv file with all examples
  --target-col TARGET_COL
                        column with target values
  --output DIR          output directory
```

The regression code performs the following actions:

1. Analyses the data.
2. Trains a regressor.
3. Evaluates the model.
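The command-line interface shown above could be implemented with `argparse`. A minimal sketch (the actual `regression.py` may differ):

```python
# Minimal sketch of the CLI shown above; the real regression.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Structured data regression")
parser.add_argument("--input", help="csv file with all examples")
parser.add_argument("--target-col", help="column with target values")
parser.add_argument("--output", metavar="DIR", help="output directory")

# Parse the same flags the Pachyderm pipeline passes to the container:
args = parser.parse_args(
    ["--input", "/pfs/housing_data/", "--target-col", "MEDV", "--output", "/pfs/out/"]
)
print(args.input, args.target_col, args.output)
```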
### Data Analysis

The first step in the pipeline creates a pairplot showing the relationships between features. Seeing which features are positively or negatively correlated with the target value (or with each other) helps us understand which features may be valuable to the model.

<p align="center">
  <img width="500" height="400" src="images/pairplot-1.png">
</p>

We can represent the same data in color form with a correlation matrix. The darker the color, the higher the correlation (+/-).

<p align="center">
  <img width="500" height="400" src="images/corr_matrix-1.png">
</p>

### Train a regression model

We train the regression model using scikit-learn. In our case, we train a Random Forest Regressor ensemble. After splitting the data into features and targets (`X` and `y`), we can fit the model to our data.

### Evaluate the model

After the model is trained, we output visualizations to evaluate its effectiveness using the learning curve and other statistics.

<p align="center">
  <img src="images/cv_reg_output-1.png">
</p>

## Pachyderm Pipeline

Now we'll deploy this Python code with Pachyderm.

### TLDR; Just give me the code

```shell
# Step 1: Create input data repository
pachctl create repo housing_data

# Step 2: Create the regression pipeline
pachctl create pipeline -f regression.json

# Step 3: Add the housing dataset to the repo
pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-1.csv

# Step 4: Download files once the pipeline has finished
pachctl get file regression@master:/ --recursive --output .

# Step 5: Update dataset with more data
pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-2.csv --overwrite

# Step 6: Inspect the lineage of the pipeline
pachctl list commit regression@master
```

### Step 1: Create an input data repository

Once the Pachyderm cluster is running, create a data repository called `housing_data` where we will put our dataset.

```shell
$ pachctl create repo housing_data
$ pachctl list repo
NAME                CREATED             SIZE
housing_data        3 seconds ago       0 B
```

### Step 2: Create the regression pipeline

We can now connect a pipeline to watch the data repo. Pipelines are defined in `json` format. Here is the one we'll use for the regression pipeline:

```json
# regression.json
{
    "pipeline": {
        "name": "regression"
    },
    "description": "A pipeline that trains a regression model for housing prices.",
    "input": {
        "pfs": {
            "glob": "/*",
            "repo": "housing_data"
        }
    },
    "transform": {
        "cmd": [
            "python", "regression.py",
            "--input", "/pfs/housing_data/",
            "--target-col", "MEDV",
            "--output", "/pfs/out/"
        ],
        "image": "pachyderm/housing-prices:1.11.0"
    }
}
```

For the **input** field in the pipeline definition, we define the input data repo(s) and a [glob pattern](https://docs.pachyderm.com/1.13.x/concepts/pipeline-concepts/datum/glob-pattern/). A glob pattern tells the pipeline how to map data into datums for processing; here, `/*` treats each top-level file in the `housing_data` repository as a separate datum.

The **image** defines which Docker image will be used for the pipeline, and the **transform** is the command run once a pipeline job starts.

Once this pipeline is created, it watches for any changes to its input; when a change is detected, it starts a new job to train on the new dataset.
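Inside the container, the transform simply reads from the mounted input repo and writes its artifacts under `/pfs/out/`. The following is a local simulation of that layout; the paths mimic Pachyderm's conventions, and the output filename mirrors the artifacts listed later, but the training step is just a placeholder:

```python
# Local simulation of the /pfs layout a Pachyderm job sees; in a real
# pipeline, /pfs/housing_data and /pfs/out are mounted by Pachyderm.
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())
input_dir = root / "pfs" / "housing_data"   # stands in for /pfs/housing_data
output_dir = root / "pfs" / "out"           # stands in for /pfs/out
input_dir.mkdir(parents=True)
output_dir.mkdir(parents=True)

# Pretend a commit added one CSV to the input repo:
(input_dir / "housing-simplified.csv").write_text("RM,LSTAT,PTRATIO,MEDV\n")

# The "/*" glob maps each top-level file to its own datum:
for csv_path in sorted(input_dir.glob("*")):
    out_file = output_dir / f"{csv_path.stem}_final_model.sav"
    out_file.write_text("(trained model bytes would go here)")

print([p.name for p in output_dir.iterdir()])
```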
```shell
$ pachctl create pipeline -f regression.json
```

The pipeline writes its output to a PFS repo (`/pfs/out/` in the pipeline json) created with the same name as the pipeline.

### Step 3: Add the housing dataset to the repo

Now we can add the data, which kicks off the processing automatically. If we update the data with a new commit, the pipeline automatically re-runs.

```shell
$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-1.csv
```

We can verify that the data is in the repository by listing its files.

```shell
$ pachctl list file housing_data@master
NAME                    TYPE SIZE
/housing-simplified.csv file 12.14KiB
```

We can see that the pipeline is running by looking at the status of the job(s).

```shell
$ pachctl list job
ID                               PIPELINE   STARTED        DURATION   RESTART PROGRESS  DL       UL      STATE
299b4f36535e47e399e7df7fc6ee2f7f regression 23 seconds ago 18 seconds 0       1 + 0 / 1 2.482KiB 1002KiB success
```

### Step 4: Download files once the pipeline has finished

Once the pipeline has finished, we can download the files it created.

```shell
$ pachctl list file regression@master
NAME                                  TYPE SIZE
/housing-simplified_corr_matrix.png   file 18.66KiB
/housing-simplified_cv_reg_output.png file 62.19KiB
/housing-simplified_final_model.sav   file 1.007KiB
/housing-simplified_pairplot.png      file 207.5KiB

$ pachctl get file regression@master:/ --recursive --output .
```

When we inspect the learning curve, we can see that there is a large gap between the training score and the validation score. This typically indicates that our model could benefit from more data.

<p align="center">
  <img width="400" height="350" src="images/learning_curve-1.png">
</p>

Now let's update our dataset with additional examples.
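As an aside, the downloaded `housing-simplified_final_model.sav` is presumably a pickled scikit-learn estimator (an assumption; check `regression.py` for the actual serialization format), so it could be loaded locally and used for predictions. A sketch, using a tiny in-memory model as a stand-in for the downloaded file:

```python
# Sketch: load a trained model artifact and predict. The .sav file is
# assumed to be a pickled scikit-learn estimator -- check regression.py
# for the actual save format. A tiny in-memory model stands in for the
# downloaded housing-simplified_final_model.sav.
import pickle

from sklearn.ensemble import RandomForestRegressor

stand_in = RandomForestRegressor(n_estimators=5, random_state=0)
stand_in.fit([[6.575, 4.98, 15.3]], [504000.0])  # one sample row from this README
blob = pickle.dumps(stand_in)                    # stand-in for the .sav bytes

loaded = pickle.loads(blob)                      # pickle.load(open(path, "rb")) for a real file
prediction = loaded.predict([[6.575, 4.98, 15.3]])
print(prediction)                                # predicted MEDV for the input row
```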
### Step 5: Update Dataset

Here's where Pachyderm truly starts to shine. To update our dataset, we can run the following command (note that we could also append new examples to the existing file, but in this example we simply overwrite our previous file with one containing more data):

```shell
$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-2.csv --overwrite
```

The new commit of data to the `housing_data` repository automatically kicks off a job on the `regression` pipeline without us having to do anything.

When the job is complete, we can download the new files and see from the new learning curve that our model has improved.

<p align="center">
  <img src="images/cv_reg_output-2.png">
</p>

### Step 6: Inspect the Pipeline Lineage

Note that because Pachyderm versions all of our input and output data automatically, we can continue to iterate on our data and code while Pachyderm tracks all of our experiments. For any given output commit, Pachyderm can tell us exactly which input commit of data was used. In our simple example we have only run 2 experiments so far, but this becomes incredibly important and valuable when we do many more iterations.

We can list the commits to any repository by using the `list commit` command.
```shell
$ pachctl list commit housing_data@master
REPO         BRANCH COMMIT                           FINISHED       SIZE     PROGRESS DESCRIPTION
housing_data master a186886de0bf430ebf6fce4d538d4db7 3 minutes ago  12.14KiB ▇▇▇▇▇▇▇▇
housing_data master bbe5ce248aa44522a012f1967295ccdd 23 minutes ago 2.482KiB ▇▇▇▇▇▇▇▇

$ pachctl list commit regression@master
REPO       BRANCH COMMIT                           FINISHED       SIZE     PROGRESS DESCRIPTION
regression master f59a6663073b4e81a2d2ab3b4b7c68fc 2 minutes ago  4.028MiB -
regression master bc0ecea5a2cd43349a9db3e89933fb42 22 minutes ago 1001KiB  -
```

We can show exactly which version of the dataset and pipeline created the model by selecting the commit ID and using the `inspect` command.

```shell
$ pachctl inspect commit regression@f59a6663073b4e81a2d2ab3b4b7c68fc
Commit: regression@f59a6663073b4e81a2d2ab3b4b7c68fc
Original Branch: master
Parent: bc0ecea5a2cd43349a9db3e89933fb42
Started: 7 minutes ago
Finished: 7 minutes ago
Size: 4.028MiB
Provenance:  __spec__@5b17c425a8d54026a6daaeaf8721707a (regression)  housing_data@a186886de0bf430ebf6fce4d538d4db7 (master)
```

Additionally, we can show the downstream provenance of a commit by using the `flush` command, which shows everything that was run and produced from a commit.

```shell
$ pachctl flush commit housing_data@bbe5ce248aa44522a012f1967295ccdd
REPO       BRANCH COMMIT                           FINISHED       SIZE    PROGRESS DESCRIPTION
regression master bc0ecea5a2cd43349a9db3e89933fb42 31 minutes ago 1001KiB -
```