# Deferred Processing of Data

While a Pachyderm pipeline is running, it processes any new data that you
commit to its input branch. However, in some cases, you
want to commit data more frequently than you want to process it.

Because Pachyderm pipelines do not reprocess data that has
already been processed, in most cases, this is not an issue. But some
pipelines might need to process everything from scratch. For example,
you might want to commit data every hour, but only retrain a
machine learning model on that data daily because it needs to train
on all the data from scratch.

In these cases, deferred processing can yield a significant performance
benefit, because the pipeline runs only when you choose rather than on
every commit. This section covers how to achieve that and control
what gets processed.

Pachyderm controls what gets processed at the _filesystem_ level
rather than at the pipeline level. Although pipelines are inflexible,
they are simple and always try to process the data at the heads of
their input branches. In contrast, the filesystem is very flexible and
gives you the ability to commit data in different places and then efficiently
move and rename the data so that it gets processed when you want.

## Configure a Staging Branch in an Input Repository

When you want to load data into Pachyderm without triggering a pipeline,
you can upload it to a staging branch and then submit the accumulated
changes in one batch by re-pointing the `HEAD` of your `master` branch
to a commit in the staging branch.

Although, in this section, the branch in which you consolidate changes
is called `staging`, you can name it as you like. You can also have multiple
staging branches. For example, `dev1`, `dev2`, and so on.

In the example below, the repository is called `data`.

To configure a staging branch, complete the following steps:

1. Create a repository. For example, `data`.

   ```shell
   $ pachctl create repo data
   ```

1. Create a `master` branch.

   ```shell
   $ pachctl create branch data@master
   ```

1. View the created branch:

   ```shell
   $ pachctl list branch data
   BRANCH HEAD
   master -
   ```

   No `HEAD` means that nothing has yet been committed into this
   branch. When you commit data to the `master` branch, the pipeline
   immediately starts a job to process it.
   However, if you want to commit something without immediately
   processing it, you need to commit it to a different branch.

1. Commit a file to the staging branch:

   ```shell
   $ pachctl put file data@staging -f <file>
   ```

   Pachyderm automatically creates the `staging` branch.
   Your repo now has two branches, `staging` and `master`. This
   example uses the name `staging`, but you can
   name the branch as you want.

1. Verify that the branches were created:

   ```shell
   $ pachctl list branch data
   BRANCH  HEAD
   staging f3506f0fab6e483e8338754081109e69
   master  -
   ```

   The `master` branch still does not have a `HEAD` commit, but the
   new branch, `staging`, does. No jobs have run yet, because
   no pipeline takes `staging` as an input. You can
   continue to commit to `staging` to add new data to the branch, and the
   pipeline will not process anything.

1. When you are ready to process the data, update the `master` branch
   to point it to the head of the staging branch:

   ```shell
   $ pachctl create branch data@master --head staging
   ```

1. List your branches to verify that the `master` branch has a `HEAD`
   commit:

   ```shell
   $ pachctl list branch data
   BRANCH  HEAD
   staging f3506f0fab6e483e8338754081109e69
   master  f3506f0fab6e483e8338754081109e69
   ```

   The `master` and `staging` branches now have the same `HEAD` commit.
   This means that your pipeline has data to process.

1. Verify that the pipeline has new jobs:

   ```shell
   $ pachctl list job
   ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL   UL  STATE
   061b0ef8f44f41bab5247420b4e62ca2 test     32 seconds ago Less than a second 0       6 + 0 / 6 108B 24B success
   ```

   You should see one job that Pachyderm created for all the changes you
   have submitted to the `staging` branch. While the commits to the
   `staging` branch are ancestors of the current `HEAD` in `master`,
   they were never the actual `HEAD` of `master` themselves, so they
   do not get processed. This behavior works for most use cases
   because commits in Pachyderm are generally additive, so processing
   the `HEAD` commit also processes data from previous commits.

## Process Specific Commits

Sometimes you want to process specific intermediate commits
that are not at the `HEAD` of the branch.
To do this, you need to set `master` to have each of these commits as its `HEAD`.
For example, if you submitted ten commits to the `staging` branch and you
want to process the commits that are seven and three commits behind the
head of `staging`, followed by the most recent commit, run the following
commands in order:

```shell
$ pachctl create branch data@master --head staging^7
$ pachctl create branch data@master --head staging^3
$ pachctl create branch data@master --head staging
```

When you run the commands above, Pachyderm creates a job for each
of the commands, one after another. When one job is completed,
Pachyderm starts the next one.
To verify that Pachyderm created jobs for these commands, run
`pachctl list job`.

### Change the HEAD of your Branch

You can move backward to previous commits as easily as advancing to the
latest commits. For example, if you want to change the final output to be
the result of processing `staging^1`, you can *roll back* your `HEAD` commit
by running the following command:

```shell
$ pachctl create branch data@master --head staging^1
```

This command starts a new job to process `staging^1`. The `HEAD` commit on
your output repo will be the result of processing `staging^1` instead of
`staging`.

## Copy Files from One Branch to Another

Using a staging branch allows you to defer processing. To use
this functionality, you need to know your input commits in advance.
However, sometimes you want to be able to commit data in an ad hoc,
disorganized manner and then organize it later. Instead of pointing
your `master` branch to a commit in a staging branch, you can copy
individual files from `staging` to `master`.
When you run `copy file`, Pachyderm copies only references to the files and
does not move the actual file data.

To copy files from one branch to another, complete the following steps:

1. Start a commit:

   ```shell
   $ pachctl start commit data@master
   ```

1. Copy files:

   ```shell
   $ pachctl copy file data@staging:file1 data@master:file1
   $ pachctl copy file data@staging:file2 data@master:file2
   ...
   ```

1. Close the commit:

   ```shell
   $ pachctl finish commit data@master
   ```

You can also run `pachctl delete file` and `pachctl put file`
while the commit is open if you want to remove something from
the parent commit or add something that is not stored anywhere else.
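
The copy steps above can be wrapped in a small shell function so that a batch of files lands in a single commit. The following is a minimal sketch, not part of Pachyderm: the function name `promote_files` and its argument order are hypothetical, and it uses only the `pachctl start commit`, `pachctl copy file`, and `pachctl finish commit` commands shown in this section.

```shell
#!/bin/sh
# Hypothetical helper (not a pachctl feature): copies the named files from
# one branch to another inside a single commit.
# Usage: promote_files <repo> <from-branch> <to-branch> <file>...
promote_files() {
    repo=$1
    from=$2
    to=$3
    shift 3

    # Open one commit on the destination branch for the whole batch.
    pachctl start commit "${repo}@${to}" || return 1

    for f in "$@"; do
        # Copies a reference to the file; the file data itself is not moved.
        # On failure the function returns early and leaves the commit open,
        # so you can inspect it before finishing or deleting it.
        pachctl copy file "${repo}@${from}:${f}" "${repo}@${to}:${f}" || return 1
    done

    # Closing the commit is what makes the batch visible to the pipeline.
    pachctl finish commit "${repo}@${to}"
}
```

For example, `promote_files data staging master file1 file2` would run the same four commands as the steps above.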

## Deferred Processing in Output Repositories

You can perform the same deferred processing operations with data in output
repositories. To do so, rather than committing to a
`staging` branch, configure the `output_branch` field
in your pipeline specification.

To configure deferred processing in an output repository, complete the
following steps:

1. In the pipeline specification, add the `output_branch` field with
   the name of the branch in which you want to accumulate your data
   before processing:

   ```json
   "output_branch": "staging"
   ```

1. When you want to process data, run the following command, where
   `pipeline` is the name of the pipeline's output repository:

   ```shell
   $ pachctl create branch pipeline@master --head staging
   ```

## Automate Branch Switching

Typically, repointing from one branch to another
happens when a certain condition is met. For example, you might
want to repoint your branch when you have a specific number of commits,
when the amount of unprocessed data reaches a certain size, or
at a specific time interval, such as daily.
To configure this functionality, you need to create a Kubernetes
application that uses Pachyderm APIs and watches the repositories for the
specified condition. When the condition is met, the application points
the `master` branch to the head of the `staging` branch.
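
As a rough illustration of the idea, the sketch below uses `pachctl` polling in place of a Kubernetes application that calls the Pachyderm APIs. It is a simplified stand-in, not the recommended implementation. The function names `branch_head` and `promote_if_changed` are hypothetical, and the parsing assumes that `pachctl list branch <repo>` prints the branch name and `HEAD` commit as its first two columns, as in the output shown earlier in this document.

```shell
#!/bin/sh
# Hypothetical helper: prints the HEAD commit ID of <branch> in <repo>,
# assuming the first two columns of `pachctl list branch` are BRANCH and HEAD.
# Usage: branch_head <repo> <branch>
branch_head() {
    pachctl list branch "$1" | awk -v b="$2" '$1 == b { print $2 }'
}

# Hypothetical helper: repoints master to the head of staging, but only when
# staging has a HEAD commit and it differs from the HEAD of master.
# Usage: promote_if_changed <repo>
promote_if_changed() {
    repo=$1
    staging_head=$(branch_head "$repo" staging)
    master_head=$(branch_head "$repo" master)

    if [ -n "$staging_head" ] && [ "$staging_head" != "-" ] \
        && [ "$staging_head" != "$master_head" ]; then
        pachctl create branch "${repo}@master" --head staging
    fi
}
```

A time-based condition could then be approximated by calling the function on a schedule, for example `while true; do promote_if_changed data; sleep 86400; done` for a daily interval. Conditions based on commit count or data size would need additional checks that are not shown here.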