>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# Distributed hyperparameter tuning

This example demonstrates how you can evaluate a model or function in a distributed manner on multiple sets of parameters.  In this particular case, we will evaluate many machine learning models, each configured with a different set of parameters (aka hyperparameters), and we will output only the best performing model or models.

The models trained and evaluated in the example will attempt to predict the species of iris flowers using the iris data set, which is often used to demonstrate ML methods.  The different sets of parameters used in the example are the *C* and *Gamma* parameters of an SVM machine learning model.  If you aren't familiar with that model or those parameters, don't worry about them too much.  The point here is that *C* and *Gamma* are parameters of this model, and we want to search over many combinations of *C* and *Gamma* to determine which combination best predicts iris flower species.
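
To make the search concrete, here is a minimal, stand-alone scikit-learn sketch (not part of this example's pipeline code) that evaluates a single *C*/*Gamma* combination on the classic iris data.  The pipelines below essentially repeat this kind of evaluation for every combination of parameters, in parallel:

```python
# Minimal stand-alone sketch: evaluate one (C, Gamma) combination with scikit-learn.
# Illustrative only; the example's pipelines run their own training code.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = SVC(C=0.5, gamma=0.016)        # one point in the C/Gamma grid
model.fit(X_train, y_train)
print(model.score(X_test, y_test))     # accuracy for this parameter combination
```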

The example assumes that you have:

- A Pachyderm cluster running - see [Local Installation](https://docs.pachyderm.com/1.13.x/getting_started/local_installation/) to get up and running with a local Pachyderm cluster in just a few minutes.
- The `pachctl` CLI tool installed and connected to your Pachyderm cluster - see [any of our deploy docs](https://docs.pachyderm.com/1.13.x/deploy-manage/deploy/) for instructions.

## The pipelines

The example uses four pipeline stages to accomplish this distributed hyperparameter tuning/search.  First, we will split our iris data set into training and test data sets.  The training set will be used to train, or fit, our models with the various sets of parameters, and the test set will be used later to evaluate each trained model.

![alt text](pipeline1.png)
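
The `split` stage's code lives in the pipeline's Docker image rather than in this README, but a rough sketch of what it might do looks like this (the train/test ratio, the `train.csv` file name, and the CSV layout are assumptions; only `test.csv` is referenced later by the `test` pipeline):

```python
# Rough sketch of a split step (assumptions noted above; not the example's actual code).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("/pfs/raw_data/iris.csv")                       # the committed iris data
train, test = train_test_split(df, test_size=0.3, shuffle=True)  # assumed split ratio
train.to_csv("/pfs/out/train.csv", index=False)                  # assumed file name
test.to_csv("/pfs/out/test.csv", index=False)                    # consumed by the test stage
```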

Next, we will train a model for each combination of *C* and *Gamma* parameters in a `parameters` repo.  The trained models will be serialized and output to the `model` repo.

![alt text](pipeline2.png)
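
Again as a sketch only (paths, file naming, CSV layout, and training details are assumptions), each datum in the `model` stage sees exactly one *C* file and one *Gamma* file from the cross input, trains an SVM, and pickles it to the output repo:

```python
# Rough sketch of one model-training datum (assumptions noted above).
import glob
import pickle
import pandas as pd
from sklearn.svm import SVC

# The cross input exposes exactly one C file and one Gamma file per datum.
c = float(open(glob.glob("/pfs/parameters/c_parameters.txt/*")[0]).read())
gamma = float(open(glob.glob("/pfs/parameters/gamma_parameters.txt/*")[0]).read())

# Assumed CSV layout: feature columns followed by a species label column.
df = pd.read_csv("/pfs/raw_data/iris.csv")
X, y = df.iloc[:, :-1], df.iloc[:, -1]

model = SVC(C=c, gamma=gamma).fit(X, y)
with open("/pfs/out/model_C{}_G{}.pkl".format(c, gamma), "wb") as f:
    pickle.dump(model, f)
```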

In a `test` stage, we will pair each trained/fit model in `model` with our test data set.  Using the test data set, we will generate an evaluation metric, or score, for each of the trained models.

![alt text](pipeline3.png)
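
Conceptually, each `test` datum pairs one pickled model with `test.csv` and writes out a single score (again a sketch; the score file naming and the metric are assumptions):

```python
# Rough sketch of one test-stage datum (assumptions noted above).
import glob
import os
import pickle
import pandas as pd

model_path = glob.glob("/pfs/model/*.pkl")[0]        # one model per datum
with open(model_path, "rb") as f:
    model = pickle.load(f)

test = pd.read_csv("/pfs/split/test.csv")
X, y = test.iloc[:, :-1], test.iloc[:, -1]
score = model.score(X, y)                            # mean accuracy on the test set

out_name = os.path.basename(model_path).replace(".pkl", ".txt")
with open(os.path.join("/pfs/out", out_name), "w") as f:
    f.write(str(score))
```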

Finally, in a `select` stage, we will determine which of the evaluation metrics in `test` are the best, select the models corresponding to those metrics, and output them to the `select` repo.

![alt text](pipeline4.png)
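
And the `select` stage can be sketched as reading every score produced by `test` and copying over only the models whose score ties the best one (the exact selection rule used by the example may differ):

```python
# Rough sketch of the select stage (assumed selection rule: keep every model
# that matches the best test score).
import glob
import os
import shutil

scores = {}
for path in glob.glob("/pfs/test/*.txt"):
    model_name = os.path.basename(path).replace(".txt", ".pkl")
    scores[model_name] = float(open(path).read())

best = max(scores.values())
for model_name, score in scores.items():
    if score >= best:
        shutil.copy(os.path.join("/pfs/model", model_name),
                    os.path.join("/pfs/out", model_name))
```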

## Preparing the input data

The two input data repositories for this example are `raw_data`, containing the raw iris data set, and `parameters`, containing all of our *C* and *Gamma* parameters.  First, let's create these repositories:

```shell
$ pachctl create repo raw_data
$ pachctl create repo parameters
$ pachctl list repo
NAME                CREATED             SIZE
parameters          47 seconds ago      0B
raw_data            52 seconds ago      0B
```

Then, we can put our iris data set into `raw_data`.  We are going to use a version of the iris data set that includes a little bit of noise to make the classification problem more difficult.  This data set is included under [data/noisy_iris.csv](data/noisy_iris.csv).  To commit this data set into Pachyderm:

```shell
$ cd data
$ pachctl put file raw_data@master:iris.csv -f noisy_iris.csv
$ pachctl list file raw_data@master
NAME                TYPE                SIZE
iris.csv            file                10.29KiB
```

The *C* and *Gamma* parameters that we will be searching over are included in [data/parameters](data/parameters) under two respective files.  In order to process each combination of these parameters in parallel, we are going to use Pachyderm's built-in splitting capability to split each parameter value into a separate file:

```shell
$ cd parameters
$ pachctl put file parameters@master -f c_parameters.txt --split line --target-file-datums 1
$ pachctl put file parameters@master -f gamma_parameters.txt --split line --target-file-datums 1
$ pachctl list file parameters@master
NAME                   TYPE                SIZE
c_parameters.txt       dir                 81B
gamma_parameters.txt   dir                 42B
$ pachctl list file parameters@master:c_parameters.txt
NAME                                TYPE                SIZE
c_parameters.txt/0000000000000000   file                6B
c_parameters.txt/0000000000000001   file                6B
c_parameters.txt/0000000000000002   file                6B
c_parameters.txt/0000000000000003   file                6B
c_parameters.txt/0000000000000004   file                6B
c_parameters.txt/0000000000000005   file                7B
c_parameters.txt/0000000000000006   file                8B
c_parameters.txt/0000000000000007   file                8B
c_parameters.txt/0000000000000008   file                9B
c_parameters.txt/0000000000000009   file                9B
c_parameters.txt/000000000000000a   file                10B
$ pachctl list file parameters@master:gamma_parameters.txt
NAME                                    TYPE                SIZE
gamma_parameters.txt/0000000000000000   file                6B
gamma_parameters.txt/0000000000000001   file                6B
gamma_parameters.txt/0000000000000002   file                6B
gamma_parameters.txt/0000000000000003   file                6B
gamma_parameters.txt/0000000000000004   file                6B
gamma_parameters.txt/0000000000000005   file                6B
gamma_parameters.txt/0000000000000006   file                6B
```

As you can see, each of the parameter files has been split into one file per line, and thus one file per parameter.  You can confirm this by looking at the file contents:

```shell
$ pachctl get file parameters@master:c_parameters.txt/0000000000000000
0.031
$ pachctl get file parameters@master:c_parameters.txt/0000000000000001
0.125
$ pachctl get file parameters@master:c_parameters.txt/0000000000000002
0.500
```

For more information on splitting data files, see our [splitting data for distributed processing](https://docs.pachyderm.com/1.13.x/how-tos/splitting-data/splitting/) documentation.

## Creating the pipelines

To create the four pipelines mentioned and illustrated above:

```shell
$ cd ../../
$ pachctl create pipeline -f split.json
$ pachctl create pipeline -f model.json
$ pachctl create pipeline -f test.json
$ pachctl create pipeline -f select.json
```

The pipelines should soon all be in the "running" state:

```shell
$ pachctl list pipeline
NAME   VERSION INPUT                                                                                      CREATED       STATE / LAST JOB   DESCRIPTION
select 1       (model:/ ⨯ test:/)                                                                         9 seconds ago running / starting A pipeline that selects the best evaluation metrics from the results of the `test` pipeline.
test   1       (model:/* ⨯ split:/test.csv)                                                               9 seconds ago running / starting A pipeline that scores each of the trained models.
model  1       (parameters:/c_parameters.txt/* ⨯ parameters:/gamma_parameters.txt/* ⨯ raw_data:/iris.csv) 9 seconds ago running / starting A pipeline that trains the model for each combination of C and Gamma parameters.
split  1       raw_data:/                                                                                 9 seconds ago running / starting A pipeline that splits the `iris` data set into the `training` and `test` data sets.
```

And, after waiting a few minutes, you should see the successful jobs that performed our distributed hyperparameter tuning:

```shell
$ pachctl list job
ID                                   OUTPUT COMMIT                           STARTED            DURATION           RESTART PROGRESS    DL       UL       STATE
e2b75a61-13e2-4067-88b7-adec4d32f830 select/f38eae7cea574fc6a90adda706d4714e 18 seconds ago     Less than a second 0       1 + 0 / 1   243.2KiB 82.3KiB  success
4116af2b-efa5-405e-ba04-f850a656e25d test/1e379911118c4492932a2dd9eb198e9a   About a minute ago About a minute     0       77 + 0 / 77 400.3KiB 924B     success
f628028e-2c88-439e-8738-823fe0441e1b model/6a877b93e3e2445e92a11af8bde6dddf  3 minutes ago      About a minute     0       77 + 0 / 77 635.1KiB 242.3KiB success
a2ba2024-db12-4a78-9383-82adba5a4c3d split/04955ad7fda64a66820db5578478c1d6  5 minutes ago      Less than a second 0       1 + 0 / 1   10.29KiB 10.29KiB success
```

## Looking at the results

If we look at the models that were trained based on our training data, we will see one model for each of the combinations of *C* and *Gamma* parameters:

```shell
$ pachctl list file model@master
NAME                        TYPE                SIZE
model_C0.031_G0.001.pkl     file                6.908KiB
model_C0.031_G0.004.pkl     file                6.908KiB
model_C0.031_G0.016.pkl     file                6.908KiB
model_C0.031_G0.063.pkl     file                6.908KiB
model_C0.031_G0.25.pkl      file                6.908KiB
model_C0.031_G1.0.pkl       file                6.908KiB
model_C0.031_G4.0.pkl       file                6.908KiB
model_C0.125_G0.001.pkl     file                4.85KiB
model_C0.125_G0.004.pkl     file                4.85KiB
model_C0.125_G0.016.pkl     file                4.85KiB
model_C0.125_G0.063.pkl     file                4.85KiB
model_C0.125_G0.25.pkl      file                4.85KiB
model_C0.125_G1.0.pkl       file                4.85KiB
etc...
```

There should be 77 of these models (11 *C* values × 7 *Gamma* values); the extra line counted below is the `NAME` header:

```shell
$ pachctl list file model@master | wc -l
78
```

But not all of these models are ideal for making our predictions. Our `select` pipeline stage automatically selected the best of these models (based on the evaluation metrics generated by the `test` stage). We can see which of the models are ideal for our predictions as follows:

```shell
$ pachctl list file select@master | wc -l
36
$ pachctl list file select@master
NAME                      TYPE                SIZE
model_C0.031_G0.001.pkl   file                5.713KiB
model_C0.031_G0.004.pkl   file                5.713KiB
model_C0.031_G0.016.pkl   file                5.713KiB
model_C0.031_G0.063.pkl   file                5.713KiB
model_C0.031_G0.25.pkl    file                5.713KiB
model_C0.031_G1.0.pkl     file                5.713KiB
model_C0.031_G4.0.pkl     file                5.713KiB
etc...
```

*Note* - Here, 35 of the 77 models were selected as ideal (the 36 lines counted above include the `NAME` header). Because we randomly shuffle our training/test data, your results may vary slightly.

*Note* - The pipeline we've built here is very easy to generalize to any sort of parameter-space exploration. As long as you break up your parameters into individual files (as shown above), you can test the whole parameter space in a massively distributed way and simply pick out the best results.
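
If you want to try one of the selected models locally, you could fetch it with `pachctl get file` and load it with Python's `pickle` (a sketch only; the feature layout must match whatever the training code actually used):

```python
# Sketch: load a model fetched with, e.g.,
#   pachctl get file select@master:model_C0.031_G0.001.pkl > model.pkl
# and make a prediction (the four-feature layout is an assumption).
import pickle

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Four iris measurements: sepal length/width, petal length/width (cm).
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```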