# Deferred Processing Plus Transactions

This example,
which uses a simple DAG based on our [OpenCV example](https://github.com/pachyderm/pachyderm/tree/1.13.x/examples/opencv),
illustrates two Pachyderm usage patterns for fine-grained control over when pipelines trigger jobs.

[Deferred processing](https://docs.pachyderm.com/1.13.x/how-tos/deferred_processing/)
is a Pachyderm technique for controlling when data gets processed.
Deferred processing uses branches to prevent pipelines from triggering on every input commit.

[Transactions](https://docs.pachyderm.com/1.13.x/how-tos/use-transactions-to-run-multiple-commands/) are a Pachyderm feature
that allows you to batch multiple operations,
such as committing data to two different repos,
so that they trigger only a single job
and data from both repos gets processed together.

## Prerequisites

Before you begin, you need to have Pachyderm v1.9.8 or later installed on your computer or cloud platform.
See [Deploy Pachyderm](https://docs.pachyderm.com/1.13.x/deploy-manage/deploy/).

## Pipelines

The following diagram demonstrates the DAG that is used in this example.

The DAG shown is a simple elaboration on the OpenCV example,
with pipeline and repo names chosen to avoid collisions with that example
if installed in the same cluster.

The functionality is slightly different.
The `edges_dp` pipeline performs edge detection on images committed to `images_dp_1`.
The `montage_dp` pipeline creates a montage out of images committed to `images_dp_2` and images in the `master` branch of `edges_dp`.
This configuration enables you to verify the files being processed
by visually inspecting the montage.

The most significant change from the OpenCV example is the pipeline spec for `edges_dp`,
which has the `output_branch` attribute set to `dev`.
It also adds a `name` field to the `input` repo,
to avoid having to change the code in the example.

```json hl_lines="9,12"
{
  "pipeline": {
    "name": "edges_dp"
  },
  "input": {
    "pfs": {
      "glob": "/*",
      "repo": "images_dp_1",
      "name": "images"
    }
  },
  "output_branch": "dev",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  }
}
```

Therefore,
this pipeline puts the output in the `dev` branch
instead of putting it in the `master` branch.

Since `montage_dp` is subscribed to the `master` branch of `edges_dp`,
jobs will not trigger when the edges pipeline outputs files to the `dev` branch.
Instead, to trigger a montage job,
we create a `master` branch attached to any commit in `edges_dp`.

## Example run-through

This section provides steps that you can run to test this example.

### Deferred Processing

You should have a Pachyderm cluster set up
and access to it configured from your local computer
before you run this example.

1. Run the script `setup.sh` included in this repo.
   The script executes the following commands:

   ```shell
   pachctl create repo images_dp_1
   pachctl create repo images_dp_2
   pachctl create pipeline -f ./edges_dp.json
   pachctl create pipeline -f ./montage_dp.json
   pachctl put file images_dp_1@master -i ./images.txt
   pachctl put file images_dp_1@master -i ./images2.txt
   pachctl put file images_dp_2@master -i ./images3.txt
   ```

2. Once the demo is loaded,
   check the commits in `edges_dp`.
   You should see an output similar to this.
   Note that there are two commits in `edges_dp`,
   both in the `dev` branch, which is the output branch for the pipeline:

   ```shell
   $ pachctl list commit edges_dp
   REPO     BRANCH COMMIT                           FINISHED
   edges_dp dev    364f49663dd848098b60c1ac97a332af 36 seconds ago
   edges_dp dev    a07c857b91a14add9f8309a81d86dbe8 44 seconds ago
   ```

   Remember that the `edges_dp` pipeline outputs to the `dev` branch.
   Because the `montage_dp` pipeline subscribes to the `master` branch,
   it is not triggered when `edges_dp` jobs complete.

3. List the branches in `edges_dp`.

   ```shell
   $ pachctl list branch edges_dp
   BRANCH HEAD
   master -
   dev    364f49663dd848098b60c1ac97a332af
   ```

   Note that the `dev` branch has the most recent commit.
   Take note of the commit ID and match it to the ID above.
   The `master` branch does not have any commits.

4. View the list of jobs:

   ```shell
   $ pachctl list job
   ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE
   2288709b4d8044409c2232d673ec8f23 montage_dp 55 seconds ago     1 second  0       0 + 0 / 0 0B       0B       success
   6d9d4cf0f6524b0ca126fa97141303ea edges_dp   About a minute ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success
   fcaf537975554935b0f15d184d7a0984 edges_dp   About a minute ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
   ```

   You should see that there are three jobs:
   - one 0-datum job for `montage_dp` and
   - two jobs for `edges_dp` with the appropriate number of datums in each.

   This is what you should expect.
   There is no data in the `master` branch of `edges_dp`,
   so an empty job was created in `montage_dp`
   when data was committed to `images_dp_2`
   because of its `cross` input.

5. Commit a file to `images_dp_1`.

   ```shell
   $ pachctl put file images_dp_1@master:1VqcWw9.jpg -f http://imgur.com/1VqcWw9.jpg
   ```

6. View the list of jobs again.

   ```shell
   $ pachctl list job
   ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE
   c7e69e46e9954611ad8efc8aeac47f2a edges_dp   12 seconds ago     3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success
   2288709b4d8044409c2232d673ec8f23 montage_dp About a minute ago 1 second  0       0 + 0 / 0 0B       0B       success
   6d9d4cf0f6524b0ca126fa97141303ea edges_dp   About a minute ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success
   fcaf537975554935b0f15d184d7a0984 edges_dp   About a minute ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
   ```

   One job was triggered in `edges_dp`:
   it processed the one datum committed above
   and skipped the three existing datums.
   You can also confirm that no job was triggered for `montage_dp`.

7. To trigger `montage_dp` on the set of data in the `dev` branch,
   create a `master` branch with `dev` as its head.

   ```shell
   $ pachctl create branch edges_dp@master --head dev
   ```

8. Listing jobs shows that a job was triggered on `montage_dp`:

   ```shell
   $ pachctl list job
   ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE
   e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 10 seconds ago     4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success
   c7e69e46e9954611ad8efc8aeac47f2a edges_dp   42 seconds ago     3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success
   2288709b4d8044409c2232d673ec8f23 montage_dp 2 minutes ago      1 second  0       0 + 0 / 0 0B       0B       success
   6d9d4cf0f6524b0ca126fa97141303ea edges_dp   2 minutes ago      4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success
   fcaf537975554935b0f15d184d7a0984 edges_dp   2 minutes ago      3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
   ```

9. Commit more data to `images_dp_1`.
   It will only trigger a job in `edges_dp`:

   ```shell
   $ pachctl put file images_dp_1@master:2GI70mb.jpg -f http://imgur.com/2GI70mb.jpg
   $ pachctl list job
   ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE
   65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   8 seconds ago      3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success
   e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 13 minutes ago     4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success
   c7e69e46e9954611ad8efc8aeac47f2a edges_dp   13 minutes ago     3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success
   2288709b4d8044409c2232d673ec8f23 montage_dp 15 minutes ago     1 second  0       0 + 0 / 0 0B       0B       success
   6d9d4cf0f6524b0ca126fa97141303ea edges_dp   15 minutes ago     4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success
   fcaf537975554935b0f15d184d7a0984 edges_dp   15 minutes ago     3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
   ```

10. Move the `master` branch in `edges_dp` to point to `dev` again.
    It will trigger a job against the data currently in `dev`.

    ```shell
    $ pachctl create branch edges_dp@master --head dev
    $ pachctl list job
    ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE
    65eddcb60ae1475aa6d59b2baa69c78e montage_dp 8 seconds ago      5 seconds 0       1 + 0 / 1 938.5KiB 1.066MiB success
    65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   3 minutes ago      3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success
    e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 16 minutes ago     4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success
    c7e69e46e9954611ad8efc8aeac47f2a edges_dp   16 minutes ago     3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success
    2288709b4d8044409c2232d673ec8f23 montage_dp 18 minutes ago     1 second  0       0 + 0 / 0 0B       0B       success
    6d9d4cf0f6524b0ca126fa97141303ea edges_dp   18 minutes ago     4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success
    fcaf537975554935b0f15d184d7a0984 edges_dp   18 minutes ago     3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
    ```

### Transactions

After you test deferred processing,
you can explore how transactions work in combination with deferred processing.

1. If you want to run a particular set of data in `images_dp_2`
   against a particular branch of `edges_dp`,
   you need to perform two operations:
   - commit data to `images_dp_2` and
   - point `edges_dp@master` to the specific commit of interest.

   If you do not use a transaction, this results in two jobs being triggered,
   one for the new commit and a second when you move the `edges_dp@master` branch:
   - `images_dp_2@master` running against whatever is currently in `edges_dp@master`
   - `images_dp_2@master` running against whatever you set `edges_dp@master` to

   Remember that in step 10 above,
   we performed the `create branch` operation against `edges_dp`.
   Now we perform the commit to `images_dp_2`
   and see that another job got triggered.

   ```shell
   $ pachctl put file images_dp_2@master:3Kr6Mr6.jpg -f http://imgur.com/3Kr6Mr6.jpg
   $ pachctl list job
   ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE
   9c97578031544cab9cc5fb64e9d77153 montage_dp 9 seconds ago      5 seconds 0       1 + 0 / 1 1015KiB  1.292MiB success
   65eddcb60ae1475aa6d59b2baa69c78e montage_dp 28 seconds ago     5 seconds 0       1 + 0 / 1 938.5KiB 1.066MiB success
   65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   3 minutes ago      3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success
   e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 16 minutes ago     4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success
   c7e69e46e9954611ad8efc8aeac47f2a edges_dp   16 minutes ago     3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success
   2288709b4d8044409c2232d673ec8f23 montage_dp 18 minutes ago     1 second  0       0 + 0 / 0 0B       0B       success
   6d9d4cf0f6524b0ca126fa97141303ea edges_dp   18 minutes ago     4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success
   fcaf537975554935b0f15d184d7a0984 edges_dp   18 minutes ago     3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
   ```

2. If you want just one job,
   where `images_dp_2@master` runs against whatever you set `edges_dp@master` to,
   you can use Pachyderm transactions.
   The first step is to start a transaction.

   ```shell
   $ pachctl start transaction
   Started new transaction: 11fbbcbd-6cda-42fa-b1fe-cd63b292582e
   ```

3. Once the transaction is started,
   you start all commits and branch creations
   within the scope of the transaction.

   ```shell
   $ pachctl start commit images_dp_2@master
   Added to transaction: 11fbbcbd-6cda-42fa-b1fe-cd63b292582e
   de55d4856e814c41a65836321fe672fa
   $ pachctl create branch edges_dp@master --head dev
   Added to transaction: 11fbbcbd-6cda-42fa-b1fe-cd63b292582e
   ```

4. Before you put any files in a repo,
   you need to finish the transaction.
   When you run `pachctl finish transaction`, Pachyderm groups all the commits and branches together
   and triggers a job only when the last commit in the transaction is finished.

   ```shell
   $ pachctl finish transaction
   Completed transaction with 2 requests: 11fbbcbd-6cda-42fa-b1fe-cd63b292582e
   ```

5. Commit a file.
   The job list shows no new jobs.

   ```shell
   $ pachctl put file images_dp_2@master:9iIlokw.jpg -f http://imgur.com/9iIlokw.jpg
   $ pachctl list job
   ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE
   9c97578031544cab9cc5fb64e9d77153 montage_dp 18 minutes ago     5 seconds 0       1 + 0 / 1 1015KiB  1.292MiB success
   65eddcb60ae1475aa6d59b2baa69c78e montage_dp 19 minutes ago     5 seconds 0       1 + 0 / 1 938.5KiB 1.066MiB success
   65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   22 minutes ago     3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success
   e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 35 minutes ago     4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success
   c7e69e46e9954611ad8efc8aeac47f2a edges_dp   36 minutes ago     3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success
   2288709b4d8044409c2232d673ec8f23 montage_dp 37 minutes ago     1 second  0       0 + 0 / 0 0B       0B       success
   6d9d4cf0f6524b0ca126fa97141303ea edges_dp   38 minutes ago     4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success
   fcaf537975554935b0f15d184d7a0984 edges_dp   38 minutes ago     3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
   ```

6. Finish the commit that you started during the transaction.
   That will start the job.

   ```shell
   $ pachctl finish commit images_dp_2@master
   $ pachctl list job
   ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE
   76f1e7c311fd4529938653787b1d283a montage_dp 14 seconds ago     6 seconds 0       1 + 0 / 1 1.175MiB 1.587MiB success
   9c97578031544cab9cc5fb64e9d77153 montage_dp 19 minutes ago     5 seconds 0       1 + 0 / 1 1015KiB  1.292MiB success
   65eddcb60ae1475aa6d59b2baa69c78e montage_dp 20 minutes ago     5 seconds 0       1 + 0 / 1 938.5KiB 1.066MiB success
   65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   23 minutes ago     3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success
   e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 37 minutes ago     4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success
   c7e69e46e9954611ad8efc8aeac47f2a edges_dp   37 minutes ago     3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success
   2288709b4d8044409c2232d673ec8f23 montage_dp 38 minutes ago     1 second  0       0 + 0 / 0 0B       0B       success
   6d9d4cf0f6524b0ca126fa97141303ea edges_dp   39 minutes ago     4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success
   fcaf537975554935b0f15d184d7a0984 edges_dp   39 minutes ago     3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
   ```

## Summary

Deferred processing with transactions in Pachyderm
will give you fine-grained control of jobs and datums
while preserving Pachyderm's advantages of data lineage and incremental processing.
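
The two patterns in this example can be wrapped in a small script. The following is a minimal sketch, not part of the example repo: the `run`, `promote_dev`, and `trigger_once` helpers and the `DRY_RUN` variable are hypothetical names introduced here, while every `pachctl` command is taken from the steps above. By default the sketch only prints the commands; it assumes that setting `DRY_RUN=0` is how you opt in to running them against a cluster.

```shell
#!/bin/sh
# Hypothetical helpers; only the pachctl commands come from this README.
# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# promote_dev: deferred processing. Point edges_dp@master at dev,
# which triggers a montage_dp job on the data currently in dev.
promote_dev() {
    run pachctl create branch edges_dp@master --head dev
}

# trigger_once FILE URL: commit one file to images_dp_2 and move
# edges_dp@master inside a single transaction, so that one job is expected.
trigger_once() {
    file="$1"
    url="$2"
    run pachctl start transaction
    run pachctl start commit images_dp_2@master
    promote_dev
    run pachctl finish transaction
    # Files are added only after the transaction is finished.
    run pachctl put file "images_dp_2@master:$file" -f "$url"
    # Finishing the open commit is what starts the job.
    run pachctl finish commit images_dp_2@master
}
```

For example, `trigger_once 9iIlokw.jpg http://imgur.com/9iIlokw.jpg` prints the six commands from the transaction walkthrough in order. Keeping the dry-run default lets you review the sequence before touching a cluster.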