# Deferred Processing of Data

While a Pachyderm pipeline is running, it processes any new data that you
commit to its input branch. However, in some cases, you
want to commit data more frequently than you want to process it.

Because Pachyderm pipelines do not reprocess data that has
already been processed, in most cases, this is not an issue. But some
pipelines might need to process everything from scratch. For example,
you might want to commit data every hour, but only want to retrain a
machine learning model on that data daily because it needs to train
on all the data from scratch.

In these cases, deferred processing can yield a large performance
benefit. This section covers how to achieve that and control
what gets processed.

Pachyderm controls what gets processed at the _filesystem_ level
rather than at the pipeline level. Although pipelines are inflexible,
they are simple and always try to process the data at the heads of
their input branches. In contrast, the filesystem is very flexible and
gives you the ability to commit data in different places and then efficiently
move and rename the data so that it gets processed when you want.

## Configure a Staging Branch in an Input Repository

When you want to load data into Pachyderm without triggering a pipeline,
you can upload it to a staging branch and then submit accumulated
changes in one batch by re-pointing the `HEAD` of your `master` branch
to a commit in the staging branch.

Although, in this section, the branch in which you consolidate changes
is called `staging`, you can name it as you like. You can also have multiple
staging branches, for example, `dev1`, `dev2`, and so on.

In the example below, the repository is called `data`.


To configure a staging branch, complete the following steps:

1. Create a repository. For example, `data`.

    ```shell
    $ pachctl create repo data
    ```

1. Create a `master` branch.

    ```shell
    $ pachctl create branch data@master
    ```

1. View the created branch:

    ```shell
    $ pachctl list branch data

    BRANCH HEAD
    master -
    ```

    No `HEAD` means that nothing has yet been committed into this
    branch. When you commit data to the `master` branch, the pipeline
    immediately starts a job to process it.
    However, if you want to commit something without immediately
    processing it, you need to commit it to a different branch.

1. Commit a file to the staging branch:

    ```shell
    $ pachctl put file data@staging -f <file>
    ```

    Pachyderm automatically creates the `staging` branch.
    Your repo now has 2 branches, `staging` and `master`. In this
    example, the `staging` name is used, but you can
    name the branch as you want.

1. Verify that the branches were created:

    ```shell
    $ pachctl list branch data

    BRANCH  HEAD
    staging f3506f0fab6e483e8338754081109e69
    master  -
    ```

    The `master` branch still does not have a `HEAD` commit, but the
    new branch, `staging`, does. No jobs have run yet, because
    no pipeline takes `staging` as an input. You can
    continue to commit to `staging` to add new data to the branch, and the
    pipeline will not process anything.

1. When you are ready to process the data, update the `master` branch
   to point it to the head of the staging branch:

    ```shell
    $ pachctl create branch data@master --head staging
    ```

1. List your branches to verify that the master branch has a `HEAD`
   commit:

    ```shell
    $ pachctl list branch data

    staging f3506f0fab6e483e8338754081109e69
    master  f3506f0fab6e483e8338754081109e69
    ```

    The `master` and `staging` branches now have the same `HEAD` commit.
    This means that your pipeline has data to process.

1. Verify that the pipeline has new jobs:

    ```shell
    $ pachctl list job

    ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL   UL  STATE
    061b0ef8f44f41bab5247420b4e62ca2 test     32 seconds ago Less than a second 0       6 + 0 / 6 108B 24B success
    ```

    You should see one job that Pachyderm created for all the changes you
    have submitted to the `staging` branch. While the commits to the
    `staging` branch are ancestors of the current `HEAD` in `master`,
    they were never the actual `HEAD` of `master` themselves, so they
    do not get processed. This behavior works for most use cases
    because commits in Pachyderm are generally additive, so processing
    the `HEAD` commit also processes data from previous commits.

## Process Specific Commits

Sometimes you want to process specific intermediary commits
that are not at the `HEAD` of the branch.
To do this, you need to set `master` to have these commits as `HEAD`.
For example, if you submitted ten commits in the `staging` branch and you
want to process the commit seven steps behind the head, then the commit
three steps behind the head, and then the most recent commit, run
the following commands in that order:

```shell
$ pachctl create branch data@master --head staging^7
$ pachctl create branch data@master --head staging^3
$ pachctl create branch data@master --head staging
```

When you run the commands above, Pachyderm creates a job for each
of the commands one after another. Therefore, when one job is completed,
Pachyderm starts the next one.
To verify
that Pachyderm created jobs for these commands, run `pachctl list job`.

### Change the HEAD of your Branch

You can move backward to previous commits as easily as advancing to the
latest commits. For example, if you want to change the final output to be
the result of processing `staging^1`, you can *roll back* your `HEAD` commit
by running the following command:

```shell
$ pachctl create branch data@master --head staging^1
```

This command starts a new job to process `staging^1`. The `HEAD` commit on
your output repo will be the result of processing `staging^1` instead of
`staging`.

## Copy Files from One Branch to Another

Using a staging branch allows you to defer processing. To use
this functionality, you need to know your input commits in advance.
However, sometimes you want to be able to commit data in an ad-hoc,
disorganized manner and then organize it later. Instead of pointing
your `master` branch to a commit in a staging branch, you can copy
individual files from `staging` to `master`.
When you run `copy file`, Pachyderm only copies references to the files and
does not move the actual data for the files around.

To copy files from one branch to another, complete the following steps:

1. Start a commit:

    ```shell
    $ pachctl start commit data@master
    ```

1. Copy files:

    ```shell
    $ pachctl copy file data@staging:file1 data@master:file1
    $ pachctl copy file data@staging:file2 data@master:file2
    ...
    ```

1. Close the commit:

    ```shell
    $ pachctl finish commit data@master
    ```

    While the commit is open, you can run `pachctl delete file` if you want to remove something from
    the parent commit or `pachctl put file` if you want to upload something that is not in a repo yet.

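
The three steps above can be wrapped in a small helper when you copy several files at once. The following is only a sketch: the function name `promote_files` and the `PACHCTL` variable are hypothetical names introduced here, and the function assumes that each file keeps the same path in the destination branch. It uses only the `start commit`, `copy file`, and `finish commit` commands shown above.

```shell
#!/bin/sh
# PACHCTL is a hypothetical override so that the helper can be dry-run,
# for example with PACHCTL=echo. It defaults to the real pachctl binary.
PACHCTL="${PACHCTL:-pachctl}"

# promote_files SRC DST FILE...
# Copies each FILE from SRC (for example data@staging) to the same path
# in DST (for example data@master) inside a single commit.
# If a copy fails, the function returns early and leaves the commit open
# so that you can inspect it and finish or delete it manually.
promote_files() {
    src="$1"
    dst="$2"
    shift 2
    "$PACHCTL" start commit "$dst" || return 1
    for f in "$@"; do
        "$PACHCTL" copy file "$src:$f" "$dst:$f" || return 1
    done
    "$PACHCTL" finish commit "$dst"
}
```

For example, `promote_files data@staging data@master file1 file2` runs the same commands as the steps above. Setting `PACHCTL=echo` first prints the commands instead of running them.
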

## Deferred Processing in Output Repositories

You can perform the same deferred processing operations with data in output
repositories. To do so, rather than committing to a
`staging` branch, configure the `output_branch` field
in your pipeline specification.

To configure deferred processing in an output repository, complete the
following steps:

1. In the pipeline specification, add the `output_branch` field with
   the name of the branch in which you want to accumulate your data
   before processing:

    ```
    "output_branch": "staging"
    ```

1. When you want to process data, run:

    ```shell
    $ pachctl create branch pipeline@master --head staging
    ```

## Automate Deferred Processing With Branch Triggers

Typically, repointing from one branch to another happens when a certain
condition is met. For example, you might want to repoint your branch when you
have a specific number of commits, when the amount of unprocessed data
reaches a certain size, or at a specific time interval, such as daily.
This can be automated using branch triggers. A trigger is a relationship
between two branches, such as `master` and `staging` in the examples above,
that says: when the head commit of `staging` meets a certain condition, it
should trigger `master` to update its head to that same commit. In other words, it
runs `pachctl create branch data@master --head staging` automatically when the
trigger condition is met.


Building on the example above, to make `master` automatically trigger when
there is 1 megabyte of new data on `staging`, run:

```shell
$ pachctl create branch data@master --trigger staging --trigger-size 1MB
$ pachctl list branch data

BRANCH  HEAD                             TRIGGER
staging 8b5f3eb8dc4346dcbd1a547f537982a6 -
master  -                                staging on Size(1MB)
```

When you run that command, it may or may not set the head of `master`. It depends
on the difference between the size of the head of `staging` and the size of the
existing head of `master`, or `0` if `master` has no head. Notice that in the example above,
`staging` had an existing head with less than 1 MB of data in it, so `master`
still has no head. If you don't see `staging` when you run `list branch`, that's OK:
triggers can point to branches that don't exist yet. The head of `master` will
update if you add 1 MB of new data to `staging`:

```shell
$ dd if=/dev/urandom bs=1MiB count=1 | pachctl put file data@staging:/file
$ pachctl list branch data

BRANCH  HEAD                             TRIGGER
staging 64b70e6aeda84845858c42d755023673 -
master  64b70e6aeda84845858c42d755023673 staging on Size(1MB)
```

Triggers automate deferred processing, but they don't prevent manually updating
the head of a branch. If you ever want to trigger `master` even though the
trigger condition hasn't been met, you can run:

```shell
$ pachctl create branch data@master --head staging
```

Notice that you don't need to re-specify the trigger when you call `create
branch` to change the head. If you do want to clear the trigger, delete the
branch and recreate it.

There are three conditions on which you can trigger the repointing of a branch.


- time, using a cron specification (`--trigger-cron`)
- size (`--trigger-size`)
- number of commits (`--trigger-commits`)

When more than one is specified, a branch repoint is triggered when any of
the conditions is met. To require that all of them be met, add
`--trigger-all`.

To experiment further, see the full [triggers example](https://github.com/pachyderm/examples/tree/master/deferred_processing/triggers).

## Embed Triggers in Pipelines

Triggers can also be specified in the pipeline spec and automatically created
when the pipeline is created. For example, this is the edges pipeline from
our OpenCV demo, modified to only trigger when there is 1 megabyte of new images:

```
{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "input": {
    "pfs": {
      "glob": "/*",
      "repo": "images",
      "trigger": {
          "size": "1MB"
      }
    }
  },
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  }
}
```

When you create this pipeline, Pachyderm also creates a branch in the input
repo that specifies the trigger, and the pipeline uses that branch as its
input. The name of the branch is auto-generated with the form
`<pipeline-name>-trigger-n`. You can manually update the heads of these branches
to trigger processing just like in the previous example.

!!! note
    Deleting or updating a pipeline **will not clean up** the trigger branch that it has created.
    In fact, the trigger branch has a lifetime that is not tied to the pipeline's lifetime.
    There is no guarantee that other pipelines are not using that trigger branch.
    A trigger branch can, however, be deleted manually.

## More Advanced Automation

More advanced use cases might not be covered by the trigger methods above.
For
those, you need to create a Kubernetes application that uses Pachyderm APIs and
watches the repositories for the specified condition. When the condition is
met, the application repoints the `master` branch to the head of `staging`.
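
The overall shape of such an application can be illustrated with a much simpler stand-in. The sketch below is not a Kubernetes application and does not call the Pachyderm API: it uses `pachctl` from a shell function instead. The names `condition_met`, `promote_when_ready`, `PACHCTL`, and `READY_FLAG` are hypothetical, and the condition (a local flag file exists) is only a placeholder for whatever check your use case needs. The only Pachyderm command it runs is the `create branch ... --head staging` command shown earlier.

```shell
#!/bin/sh
# PACHCTL is a hypothetical override so that the sketch can be dry-run,
# for example with PACHCTL=echo. It defaults to the real pachctl binary.
PACHCTL="${PACHCTL:-pachctl}"

# Placeholder condition: succeeds when a local flag file exists.
# READY_FLAG is a hypothetical variable; replace this function body
# with the check that your use case requires.
condition_met() {
    [ -e "${READY_FLAG:-/tmp/promote-staging}" ]
}

# promote_when_ready REPO
# Repoints REPO@master to the head of staging if the condition holds,
# and otherwise leaves master untouched.
promote_when_ready() {
    repo="$1"
    if condition_met; then
        "$PACHCTL" create branch "$repo@master" --head staging
    else
        echo "condition not met; leaving $repo@master unchanged"
    fi
}
```

A watcher could then call the function periodically, for example `while true; do promote_when_ready data; sleep 60; done`. In a real deployment, the condition should become false again after a successful repoint so that the same commit is not promoted repeatedly.
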