> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Use Transactions with Hyperparameter Tuning

!!! note "Summary"
    Transactions can help optimize the use of resources
    by postponing pipeline runs.

Hyperparameter tuning is a machine learning technique
for narrowing a set of candidate parameters down to
the optimal values for training a learning
algorithm. The Pachyderm documentation includes a
[hyperparameter tuning example](https://github.com/pachyderm/pachyderm/tree/1.13.x/examples/ml/hyperparameter)
that describes how this computation works in Pachyderm.

In the hyperparameter example, training data is submitted
to the `raw_data` repository, and the parameters are stored
in the `parameters` repository. In that example, the
data processing takes seconds, so you can
run it for every commit without worrying
about resource use. But if your
data processing takes significant time,
you might want Pachyderm to run the pipeline only against
specific pairs of commits in the `raw_data` and `parameters` repositories.
You can do so by using transactions.

## Set up the Hyperparameter Example

To demonstrate the benefits of using transactions, we
will use transactions on the `model` pipeline step from the
[hyperparameter tuning example](https://github.com/pachyderm/pachyderm/tree/1.13.x/examples/ml/hyperparameter).
In this transaction example, we omit the splitting step and
have just the `model` pipeline, which consumes commits from
the `raw_data` and `parameters` repositories and outputs the
result to the `model` repository. You can adapt this example
to your pipelines as needed.

The following diagram describes the pipeline structure:

To set up the pipeline, complete the following steps:

1. Create the `raw_data` repository:

   ```shell
   $ pachctl create repo raw_data
   ```

1. Create the `parameters` repository:

   ```shell
   $ pachctl create repo parameters
   ```

1. Verify that the repositories were successfully created:

   ```shell
   $ pachctl list repo
   NAME       CREATED        SIZE (MASTER)
   parameters 44 minutes ago 123B
   raw_data   44 minutes ago 6.858KiB
   ```

1. Clone the Pachyderm repository:

   ```shell
   $ git clone git@github.com:pachyderm/pachyderm.git
   ```

1. Change to the `examples/transactions` directory of the cloned repository:

   ```shell
   $ cd pachyderm/examples/transactions/
   ```

1. Create the `model` pipeline:

   ```shell
   $ pachctl create pipeline -f model.json
   ```

1. Verify that the pipeline has been created:

   ```shell
   $ pachctl list pipeline
   NAME  VERSION INPUT                                                                                        CREATED        STATE / LAST JOB
   model 1       (parameters:/c_parameters.txt/* ⨯ parameters:/gamma_parameters.txt/* ⨯ raw_data:/iris.csv) 12 seconds ago running / starting
   ```

## Run the Transaction

To match commits in a pipeline, create a transaction,
open two commits inside that transaction, and
close the transaction. Then add your files and finish both
commits. Pachyderm applies the changes to both repositories together,
and triggers the pipeline, only after every commit that you opened
within the transaction is finished.

To run the transaction, complete the following steps:

1. Start a transaction:

   ```shell
   $ pachctl start transaction
   Started new transaction: 854e8503-6e5d-4542-805c-a73a39200bf8
   ```

1. Open a commit into the `master` branch of the `raw_data` repository:

   ```shell
   $ pachctl start commit raw_data@master
   Added to transaction: 854e8503-6e5d-4542-805c-a73a39200bf8
   42b893e48e7d40f1bb5ed770526a9a07
   ```

1. Open a commit into the `master` branch of the `parameters` repository:

   ```shell
   $ pachctl start commit parameters@master
   Added to transaction: 854e8503-6e5d-4542-805c-a73a39200bf8
   c4dc446b25e54a938a67a5e913b3f9a4
   ```

1. Close the transaction:

   ```shell
   $ pachctl finish transaction
   Completed transaction with 2 requests: 854e8503-6e5d-4542-805c-a73a39200bf8
   ```

1. Add the data to the `parameters` repository by splitting each line
   into a separate file:

   ```shell
   $ pachctl put file parameters@master -f c_parameters.txt --split line --target-file-datums 1
   $ pachctl put file parameters@master -f gamma_parameters.txt --split line --target-file-datums 1
   ```

   ```shell
   $ pachctl list file parameters@master
   NAME                  TYPE SIZE
   /c_parameters.txt     dir  81B
   /gamma_parameters.txt dir  42B
   ```

   **Note:** Although the files are in the repository, no jobs were
   triggered for the `model` pipeline. You can verify that by running
   the following command:

   ```shell
   $ pachctl list job --pipeline=model --no-pager
   ID PIPELINE STARTED DURATION RESTART PROGRESS DL UL STATE
   ```

1. Add the data to the `raw_data` repository:

   ```shell
   $ pachctl put file raw_data@master:iris.csv -f noisy_iris.csv
   ```

   If you check the job list again, you
   can see that the pipeline has still not run:

   ```shell
   $ pachctl list job --pipeline=model --no-pager
   ID PIPELINE STARTED DURATION RESTART PROGRESS DL UL STATE
   ```

1. Close the commit to the `raw_data` repository that you
   started within the transaction:

   ```shell
   $ pachctl finish commit raw_data@master
   ```

   Still, no jobs run for the `model` pipeline. If we had not
   started those commits in a transaction, a job would
   be triggered here, because Pachyderm normally triggers a job
   whenever a commit is made on any input. In this case, because
   those commits were added to a transaction, Pachyderm waits
   for both input commits to be finished before it triggers a job.

1. Close the commit to the `parameters` repository that you
   started within the transaction:

   ```shell
   $ pachctl finish commit parameters@master
   ```

   Now that both commits are finished, Pachyderm creates a single
   job that runs your code against the two commits that you opened
   within the transaction:

   ```shell
   $ pachctl list job --pipeline=model --no-pager
   ID                               PIPELINE STARTED        DURATION RESTART PROGRESS    DL       UL      STATE
   6cdc80ae105f47b4a09f0ab8ce005003 model    37 seconds ago -        0       21 + 0 / 77 115.5KiB 62.2KiB running
   ```

1. View the contents of the `model` output repository:

   ```shell
   $ pachctl list file model@master
   NAME                     TYPE SIZE
   /model_C0.031_G0.001.pkl file 5.713KiB
   /model_C0.031_G0.004.pkl file 5.713KiB
   /model_C0.031_G0.016.pkl file 5.713KiB
   /model_C0.031_G0.063.pkl file 5.713KiB
   ...
   ```

In this example, we learned that if a pipeline
takes a long time to run, you can optimize it by using
transactions. Transactions enable you to accumulate your
changes in input repositories and postpone pipeline runs
until after the commits are closed.
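
The transaction steps above can also be collected into one script. The following is a minimal sketch, not part of the original example. It assumes that you run it from the `examples/transactions` directory, that `pachctl` is configured for your cluster, and that the repositories and the `model` pipeline already exist. The `PACHCTL` variable and the `run_transaction_example` function are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch of the transaction workflow from this example as one function.
# Assumptions: current directory is examples/transactions, and the
# raw_data and parameters repos and the model pipeline already exist.
# PACHCTL is a hypothetical override variable, not a pachctl feature.
set -eu

PACHCTL="${PACHCTL:-pachctl}"

run_transaction_example() {
    # Open both commits inside one transaction so that they are matched.
    $PACHCTL start transaction
    $PACHCTL start commit raw_data@master
    $PACHCTL start commit parameters@master
    $PACHCTL finish transaction

    # Add files while both commits are still open; no job runs yet.
    $PACHCTL put file parameters@master -f c_parameters.txt --split line --target-file-datums 1
    $PACHCTL put file parameters@master -f gamma_parameters.txt --split line --target-file-datums 1
    $PACHCTL put file raw_data@master:iris.csv -f noisy_iris.csv

    # Finish both commits; the job starts only after the second one closes.
    $PACHCTL finish commit raw_data@master
    $PACHCTL finish commit parameters@master

    # Show the job created for the matched pair of commits.
    $PACHCTL list job --pipeline=model --no-pager
}
```

To preview the commands without contacting a cluster, set `PACHCTL="echo pachctl"` before calling `run_transaction_example`. Call the function with the default setting to run the commands for real.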