# Deferred Processing of Data

While a Pachyderm pipeline is running, it processes any new data that you
commit to its input branch. In some cases, however, you want to commit
data more frequently than you want to process it.

Because Pachyderm pipelines do not reprocess data that has already been
processed, this is usually not an issue. Some pipelines, however, need to
process everything from scratch. For example, you might want to commit
data every hour but retrain a machine learning model only once a day,
because each training run must read all of the data from scratch.

In these cases, deferred processing can save a large amount of
computation. This section describes how to defer processing and how to
control what gets processed.

Pachyderm controls what is processed through the _filesystem_ rather than
at the pipeline level. Pipelines are simple but inflexible: they always
try to process the data at the heads of their input branches. The
filesystem, in contrast, is very flexible. It lets you commit data in
different places and then efficiently move and rename that data so that it
gets processed when you want.

## Configure a Staging Branch in an Input Repository

When you want to load data into Pachyderm without triggering a pipeline,
upload it to a staging branch. You can then submit the accumulated
changes in one batch by re-pointing the `HEAD` of your `master` branch
to a commit in the staging branch.

In this section, the branch in which you consolidate changes is called
`staging`, but you can give it any name. You can also have multiple
staging branches, for example, `dev1`, `dev2`, and so on.

In the example below, the repository is called `data`.

To configure a staging branch, complete the following steps:

1. Create a repository. For example, `data`.

    ```shell
    $ pachctl create repo data
    ```

1. Create a `master` branch.

    ```shell
    $ pachctl create branch data@master
    ```

1. View the created branch:

    ```shell
    $ pachctl list branch data

    BRANCH HEAD
    master -
    ```

    No `HEAD` means that nothing has been committed to this branch yet.
    When you commit data to the `master` branch, any pipeline that uses
    it as input immediately starts a job to process that data. To commit
    something without processing it immediately, commit it to a
    different branch.

1. Commit a file to the staging branch:

    ```shell
    $ pachctl put file data@staging -f <file>
    ```

    Pachyderm automatically creates the `staging` branch.
    Your repo now has two branches, `staging` and `master`.

1. Verify that the branches were created:

    ```shell
    $ pachctl list branch data

    BRANCH  HEAD
    staging f3506f0fab6e483e8338754081109e69
    master  -
    ```

    The `master` branch still does not have a `HEAD` commit, but the
    new `staging` branch does. No jobs have run yet, because no
    pipeline takes `staging` as input. You can continue to commit to
    `staging` to add new data to the branch, and the pipeline will not
    process anything.

1. When you are ready to process the data, update the `master` branch
   so that it points to the head of the staging branch:

    ```shell
    $ pachctl create branch data@master --head staging
    ```

1. List your branches to verify that the `master` branch has a `HEAD`
   commit:

    ```shell
    $ pachctl list branch data

    staging f3506f0fab6e483e8338754081109e69
    master  f3506f0fab6e483e8338754081109e69
    ```

    The `master` and `staging` branches now have the same `HEAD` commit.
    This means that your pipeline has data to process.

1. Verify that the pipeline has new jobs:

    ```shell
    $ pachctl list job

    ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL   UL  STATE
    061b0ef8f44f41bab5247420b4e62ca2 test     32 seconds ago Less than a second 0       6 + 0 / 6 108B 24B success
    ```

    You should see one job that Pachyderm created for all the changes you
    have submitted to the `staging` branch. Although the earlier commits
    to `staging` are ancestors of the current `HEAD` of `master`, they
    were never the `HEAD` of `master` themselves, so they are not
    processed individually. This behavior works for most use cases
    because commits in Pachyderm are generally additive, so processing
    the `HEAD` commit also processes the data from previous commits.

![deferred processing](../../assets/images/deferred_processing.gif)

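The staging workflow above can be wrapped in two small shell helpers. This is a minimal sketch, not part of Pachyderm: the function names `stage_file` and `promote` and the `PACHCTL` override variable are hypothetical, while the `pachctl` commands are the ones used in the steps above.

```shell
# Hypothetical helpers. Set PACHCTL=echo to preview the commands
# without contacting a cluster; by default the real pachctl is used.

# stage_file REPO FILE: commit FILE to the staging branch.
# No pipeline reads staging, so no job starts.
stage_file() {
  "${PACHCTL:-pachctl}" put file "$1@staging" -f "$2"
}

# promote REPO: point master at the current head of staging.
# Pipelines that read REPO@master then process the accumulated data.
promote() {
  "${PACHCTL:-pachctl}" create branch "$1@master" --head staging
}
```

For example, you might call `stage_file data <file>` several times during the day and `promote data` once, when you want a single job to process everything that has accumulated.
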
## Process Specific Commits

Sometimes you want to process specific intermediate commits
that are not at the `HEAD` of the branch.
To do this, set `master` to have each of these commits as its `HEAD` in turn.
For example, if you submitted ten commits to the `staging` branch and you
want to process the commits seven and three steps behind its `HEAD`
(`staging^7` and `staging^3`), followed by the most recent commit, run the
following commands in order:

```shell
$ pachctl create branch data@master --head staging^7
$ pachctl create branch data@master --head staging^3
$ pachctl create branch data@master --head staging
```

When you run the commands above, Pachyderm creates a job for each
command and runs the jobs one after another: when one job completes,
Pachyderm starts the next. To verify
that Pachyderm created jobs for these commands, run `pachctl list job`.

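When the list of commits is longer, the same sequence can be driven by a loop. The sketch below assumes a hypothetical helper named `process_commits` and a hypothetical `PACHCTL` override variable; it only issues the `create branch` command shown above once per reference.

```shell
# process_commits REPO REF...: point REPO@master at each REF in turn.
# Hypothetical helper; set PACHCTL=echo to preview the commands.
process_commits() {
  repo="$1"
  shift
  for ref in "$@"; do
    "${PACHCTL:-pachctl}" create branch "${repo}@master" --head "$ref"
  done
}
```

A call such as `process_commits data 'staging^7' 'staging^3' staging` reproduces the three commands above. The references are quoted because some shells treat `^` specially.
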
### Change the HEAD of Your Branch

You can move backward to previous commits as easily as you can advance to
the latest commits. For example, if you want to change the final output to be
the result of processing `staging^1`, you can *roll back* your `HEAD` commit
by running the following command:

```shell
$ pachctl create branch data@master --head staging^1
```

This command starts a new job to process `staging^1`. The `HEAD` commit on
your output repo will be the result of processing `staging^1` instead of
`staging`.

## Copy Files from One Branch to Another

Using a staging branch allows you to defer processing, but to use
this functionality you need to know your input commits in advance.
Sometimes you want to commit data in an ad hoc,
disorganized manner and organize it later. Instead of pointing
your `master` branch to a commit in a staging branch, you can copy
individual files from `staging` to `master`.
When you run `copy file`, Pachyderm copies only references to the files and
does not move the actual file data.

To copy files from one branch to another, complete the following steps:

1. Start a commit:

    ```shell
    $ pachctl start commit data@master
    ```

1. Copy files:

    ```shell
    $ pachctl copy file data@staging:file1 data@master:file1
    $ pachctl copy file data@staging:file2 data@master:file2
    ...
    ```

1. Close the commit:

    ```shell
    $ pachctl finish commit data@master
    ```

While the commit is open, you can run `pachctl delete file` to remove something from
the parent commit, or `pachctl put file` to upload something that is not in a repo yet.

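The three steps above can be combined into one function that copies any number of files inside a single commit. This is a sketch under stated assumptions: `copy_files` and the `PACHCTL` override are hypothetical names, and the sketch finishes the commit even if an individual copy fails, which may not be what you want in production.

```shell
# copy_files REPO FILE...: copy each FILE from staging to master in one commit.
# Hypothetical helper; set PACHCTL=echo to preview the commands.
copy_files() {
  repo="$1"
  shift
  # Open a commit on master; stop if that fails.
  "${PACHCTL:-pachctl}" start commit "${repo}@master" || return 1
  for f in "$@"; do
    # Only references are copied, not the file data itself.
    "${PACHCTL:-pachctl}" copy file "${repo}@staging:${f}" "${repo}@master:${f}"
  done
  # Closing the commit makes it the new head of master.
  "${PACHCTL:-pachctl}" finish commit "${repo}@master"
}
```

For example, `copy_files data file1 file2` issues the same commands as the steps above.
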
## Deferred Processing in Output Repositories

You can perform the same deferred processing operations with data in output
repositories. To do so, rather than committing to a
`staging` branch, configure the `output_branch` field
in your pipeline specification.

To configure deferred processing in an output repository, complete the
following steps:

1. In the pipeline specification, add the `output_branch` field with
   the name of the branch in which you want to accumulate your data
   before processing:

    ```
    "output_branch": "staging"
    ```

1. When you want to process data, run the following command, where
   `pipeline` is the name of the pipeline's output repository:

    ```shell
    $ pachctl create branch pipeline@master --head staging
    ```

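To show where the field sits in a complete specification, the sketch below prints a spec with `output_branch` set. It reuses the `edges` pipeline from the OpenCV example later on this page purely as an illustration; the function name `print_spec` and the file name `edges.json` are hypothetical.

```shell
# print_spec: emit a pipeline spec whose output accumulates on `staging`.
# Hypothetical helper; the spec body reuses the edges example as a placeholder.
print_spec() {
  cat <<'EOF'
{
  "pipeline": { "name": "edges" },
  "input": { "pfs": { "glob": "/*", "repo": "images" } },
  "transform": { "cmd": [ "python3", "/edges.py" ], "image": "pachyderm/opencv" },
  "output_branch": "staging"
}
EOF
}
```

You could then run `print_spec > edges.json`, create the pipeline with `pachctl create pipeline -f edges.json`, and later run `pachctl create branch edges@master --head staging` to release the accumulated output to downstream pipelines.
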
## Automate Deferred Processing With Branch Triggers

Typically, repointing from one branch to another happens when a certain
condition is met. For example, you might want to repoint your branch when you
have a specific number of commits, when the amount of unprocessed data
reaches a certain size, or at a specific time interval, such as daily.
You can automate this with branch triggers. A trigger is a relationship
between two branches, such as `master` and `staging` in the examples above,
that says: when the head commit of `staging` meets a certain condition, it
should trigger `master` to update its head to that same commit. In other words, it
runs `pachctl create branch data@master --head staging` automatically when the
trigger condition is met.

Building on the example above, to make `master` trigger automatically when
there is 1 megabyte of new data on `staging`, run:

```shell
$ pachctl create branch data@master --trigger staging --trigger-size 1MB
$ pachctl list branch data

BRANCH  HEAD                             TRIGGER
staging 8b5f3eb8dc4346dcbd1a547f537982a6 -
master  -                                staging on Size(1MB)
```

When you run that command, it may or may not set the head of `master`. The
outcome depends on the difference between the size of the head of `staging`
and the size of the existing head of `master`, which counts as `0` if `master`
has no head. In the example above, `staging` had an existing head with less
than 1 MB of data in it, so `master` still has no head. If you do not see
`staging` when you run `list branch`, that is fine: triggers can point to
branches that do not exist yet. The head of `master` updates when you add
1 MB of new data to `staging`:

```shell
$ dd if=/dev/urandom bs=1MiB count=1 | pachctl put file data@staging:/file
$ pachctl list branch data

BRANCH  HEAD                             TRIGGER
staging 64b70e6aeda84845858c42d755023673 -
master  64b70e6aeda84845858c42d755023673 staging on Size(1MB)
```

Triggers automate deferred processing, but they do not prevent you from
manually updating the head of a branch. If you ever want to trigger `master`
even though the trigger condition has not been met, you can run:

```shell
$ pachctl create branch data@master --head staging
```

Notice that you do not need to re-specify the trigger when you call `create
branch` to change the head. If you want to clear the trigger, delete the
branch and recreate it.

There are three conditions on which you can trigger the repointing of a branch:

- time, using a cron specification (`--trigger-cron`)
- size (`--trigger-size`)
- number of commits (`--trigger-commits`)

When more than one condition is specified, a branch repoint is triggered when
any of the conditions is met. To require that all of them are met, add
`--trigger-all`.

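As an illustration of combining conditions, the sketch below wraps the `create branch` command so that any mix of trigger flags can be passed through. The helper name `set_trigger` and the `PACHCTL` override are hypothetical, and the example values (`10` commits, a five-field cron string) are assumptions about the accepted formats rather than values taken from this page. For example, `set_trigger data --trigger-size 1MB --trigger-commits 10 --trigger-all` would repoint `master` only when both conditions hold, while `set_trigger data --trigger-cron '0 0 * * *' --trigger-size 1MB` would repoint when either one is met.

```shell
# set_trigger REPO FLAGS...: make REPO@master follow staging
# under the given trigger flags.
# Hypothetical helper; set PACHCTL=echo to preview the command.
set_trigger() {
  repo="$1"
  shift
  # "$@" carries the trigger flags through unchanged.
  "${PACHCTL:-pachctl}" create branch "${repo}@master" --trigger staging "$@"
}
```
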
To experiment further, see the full [triggers example](https://github.com/pachyderm/examples/tree/master/deferred_processing/triggers).

## Embed Triggers in Pipelines

Triggers can also be specified in the pipeline spec and created automatically
when the pipeline is created. For example, this is the edges pipeline from
our OpenCV demo, modified to trigger only when there is 1 megabyte of new images:

```
{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "input": {
    "pfs": {
      "glob": "/*",
      "repo": "images",
      "trigger": {
        "size": "1MB"
      }
    }
  },
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  }
}
```

When you create this pipeline, Pachyderm also creates a branch in the input
repo that specifies the trigger, and the pipeline uses that branch as its
input. The name of the branch is auto-generated in the form
`<pipeline-name>-trigger-n`. You can manually update the heads of these branches
to trigger processing, just as in the previous example.

!!! note
    Deleting or updating a pipeline **will not clean up** the trigger branch that it has created.
    In fact, the trigger branch has a lifetime that is not tied to the pipeline's lifetime.
    There is no guarantee that other pipelines are not using that trigger branch.
    A trigger branch can, however, be deleted manually.

## More Advanced Automation

More advanced use cases might not be covered by the trigger methods above. For
those, you need to create a Kubernetes application that uses Pachyderm APIs and
watches the repositories for the specified condition. When the condition is
met, the application repoints the `master` branch to the head of `staging`.
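
The watch-and-repoint logic can be prototyped in shell before building such an application. The following is only a sketch of the idea, not the Kubernetes application described above. The helper names `count_files` and `promote_if_ready`, the `PACHCTL` override, and the file-count condition are all hypothetical, and the sketch assumes that `pachctl list file` prints one header line followed by one line per file.

```shell
# count_files REPO BRANCH: number of entries listed at the root of REPO@BRANCH.
# Assumes the listing starts with a single header line.
count_files() {
  "${PACHCTL:-pachctl}" list file "$1@$2" | tail -n +2 | wc -l | tr -d ' '
}

# promote_if_ready REPO THRESHOLD: repoint master to the head of staging
# once staging holds at least THRESHOLD entries (an example condition only).
promote_if_ready() {
  n=$(count_files "$1" staging)
  if [ "$n" -ge "$2" ]; then
    "${PACHCTL:-pachctl}" create branch "$1@master" --head staging
  fi
}
```

A polling loop such as `while true; do promote_if_ready data 100; sleep 60; done` would then check the condition once a minute. A real application would more likely use the Pachyderm APIs directly, as described above.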