github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.10.x/concepts/advanced-concepts/deferred_processing.md

1 # Deferred Processing of Data
2
3 While a Pachyderm pipeline is running, it processes any new data that you
4 commit to its input branch. However, in some cases, you
5 want to commit data more frequently than you want to process it.
6
7 Because Pachyderm pipelines do not reprocess the data that has
8 already been processed, in most cases, this is not an issue. But, some
9 pipelines might need to process everything from scratch. For example,
10 you might want to commit data every hour, but only want to retrain a
11 machine learning model on that data daily because it needs to train
12 on all the data from scratch.
13
14 In these cases, deferred processing can save a substantial amount of
15 compute, because the pipeline runs once per batch instead of once per commit.
16 This section describes how to set it up and control what gets processed.
17
18 Pachyderm controls what gets processed through the _filesystem_
19 rather than at the pipeline level. Pipelines are simple but inflexible:
20 they always try to process the data at the heads of
21 their input branches. In contrast, the filesystem is very flexible and
22 gives you the ability to commit data in different places and then efficiently
23 move and rename the data so that it gets processed when you want.
24
25 ## Configure a Staging Branch in an Input Repository
26
27 When you want to load data into Pachyderm without triggering a pipeline,
28 you can upload it to a staging branch and then submit accumulated
29 changes in one batch by re-pointing the `HEAD` of your `master` branch
30 to a commit in the staging branch.
31
32 Although, in this section, the branch in which you consolidate changes
33 is called `staging`, you can name it as you like. Also, you can have multiple
34 staging branches. For example, `dev1`, `dev2`, and so on.
35
36 In the example below, the repository that is created is called `data`.
37
38 To configure a staging branch, complete the following steps:
39
40 1. Create a repository. For example, `data`.
41
42 ```shell
43 $ pachctl create repo data
44 ```
45
46 1. Create a `master` branch.
47
48 ```shell
49 $ pachctl create branch data@master
50 ```
51
52 1. View the created branch:
53
54 ```shell
55 $ pachctl list branch data
56 BRANCH HEAD
57 master -
58 ```
59
60 No `HEAD` means that nothing has yet been committed into this
61 branch. When you commit data to the `master` branch, the pipeline
62 immediately starts a job to process it.
63 However, if you want to commit something without immediately
64 processing it, you need to commit it to a different branch.
65
66 1. Commit a file to the staging branch:
67
68 ```shell
69 $ pachctl put file data@staging -f <file>
70 ```
71
72 Pachyderm automatically creates the `staging` branch.
73 Your repo now has 2 branches, `staging` and `master`. In this
74 example, the `staging` name is used, but you can
75 name the branch as you want.
76
77 1. Verify that the branches were created:
78
79 ```shell
80 $ pachctl list branch data
81 BRANCH HEAD
82 staging f3506f0fab6e483e8338754081109e69
83 master -
84 ```
85
86 The `master` branch still does not have a `HEAD` commit, but the
87 new branch, `staging`, does. No jobs have started, because
88 no pipeline takes `staging` as an input. You can
89 continue to commit to `staging` to add new data to the branch, and the
90 pipeline will not process anything.
91
92 1. When you are ready to process the data, update the `master` branch
93 to point it to the head of the staging branch:
94
95 ```shell
96 $ pachctl create branch data@master --head staging
97 ```
98
99 1. List your branches to verify that the master branch has a `HEAD`
100 commit:
101
102 ```shell
103 $ pachctl list branch data
104 staging f3506f0fab6e483e8338754081109e69
105 master f3506f0fab6e483e8338754081109e69
106 ```
107
108 The `master` and `staging` branches now have the same `HEAD` commit.
109 This means that your pipeline has data to process.
110
111 1. Verify that the pipeline has new jobs:
112
113 ```shell
114 $ pachctl list job
115 ID PIPELINE STARTED DURATION RESTART PROGRESS DL UL STATE
116 061b0ef8f44f41bab5247420b4e62ca2 test 32 seconds ago Less than a second 0 6 + 0 / 6 108B 24B success
117 ```
118
119 You should see one job that Pachyderm created for all the changes you
120 have submitted to the `staging` branch. Although the earlier commits to the
121 `staging` branch are ancestors of the current `HEAD` in `master`,
122 they were never the `HEAD` of `master` themselves, so they
123 are not processed individually. This behavior works for most use cases
124 because commits in Pachyderm are generally additive, so processing
125 the `HEAD` commit also processes data from previous commits.
126
127 ![deferred processing](../../assets/images/deferred_processing.gif)
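The steps above can be collected into a small script. This is a minimal sketch, not an official tool: the function names `stage_files` and `promote` and the `PACHCTL` override variable are assumptions made for illustration, and only the `pachctl` commands already shown in this section are used.

```shell
#!/bin/sh
# PACHCTL is a hypothetical override hook: set PACHCTL=echo to print the
# commands instead of running them against a cluster.
PACHCTL="${PACHCTL:-pachctl}"

# stage_files REPO FILE...: commit each file to the staging branch.
# No job starts, because no pipeline takes `staging` as an input.
stage_files() {
  sf_repo="$1"
  shift
  for sf_file in "$@"; do
    $PACHCTL put file "${sf_repo}@staging" -f "$sf_file" || return 1
  done
}

# promote REPO: point master at the current head of staging, so that the
# pipeline processes everything staged so far.
promote() {
  $PACHCTL create branch "${1}@master" --head staging
}
```

For example, `stage_files data a.csv b.csv` followed by `promote data` would stage two files and then release them to the pipeline together. The file names here are placeholders.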
128
129 ## Process Specific Commits
130
131 Sometimes you want to process specific intermediate commits
132 that are not at the `HEAD` of the branch.
133 To do this, set `master` to have each of these commits as its `HEAD` in turn.
134 For example, if you submitted ten commits to the `staging` branch and you
135 want to process the commits that are seven and three commits behind the
136 head, and then the most recent commit, run the following commands in order:
137
138 ```shell
139 $ pachctl create branch data@master --head staging^7
140 $ pachctl create branch data@master --head staging^3
141 $ pachctl create branch data@master --head staging
142 ```
143
144 Each of these commands creates a job, and Pachyderm runs the jobs
145 one after another: when one job completes,
146 Pachyderm starts the next one. To verify
147 that Pachyderm created jobs for these commands, run `pachctl list job`.
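The sequence of re-pointing commands can be generated from a list of ancestor offsets. The sketch below assumes hypothetical helper names (`ancestor_ref`, `process_ancestors`) and a `PACHCTL` override variable for dry runs. An offset of `0` is mapped to the bare branch name rather than `staging^0`.

```shell
#!/bin/sh
# PACHCTL is a hypothetical override hook; PACHCTL=echo gives a dry run.
PACHCTL="${PACHCTL:-pachctl}"

# ancestor_ref BRANCH N: print BRANCH when N is 0, otherwise BRANCH^N.
ancestor_ref() {
  if [ "$2" -eq 0 ]; then
    printf '%s\n' "$1"
  else
    printf '%s^%s\n' "$1" "$2"
  fi
}

# process_ancestors REPO N...: re-point master once per offset, in the
# order given. Each re-point makes a different commit the HEAD of master.
process_ancestors() {
  pa_repo="$1"
  shift
  for pa_n in "$@"; do
    $PACHCTL create branch "${pa_repo}@master" \
      --head "$(ancestor_ref staging "$pa_n")" || return 1
  done
}
```

Calling `process_ancestors data 7 3 0` issues the same three commands as the example above.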
148
149 ### Change the HEAD of your Branch
150
151 You can move backward to previous commits as easily as advancing to the
152 latest commits. For example, if you want to change the final output to be
153 the result of processing `staging^1`, you can *roll back* your HEAD commit
154 by running the following command:
155
156 ```shell
157 $ pachctl create branch data@master --head staging^1
158 ```
159
160 This command starts a new job to process `staging^1`. The `HEAD` commit on
161 your output repo will be the result of processing `staging^1` instead of
162 `staging`.
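Before rolling back, it can be useful to confirm that the target commit exists. The following sketch assumes that `pachctl inspect commit` exits with a non-zero status when the commit cannot be found. The function name `rollback_master` and the `PACHCTL` variable are hypothetical.

```shell
#!/bin/sh
# PACHCTL is a hypothetical override hook; PACHCTL=echo gives a dry run.
PACHCTL="${PACHCTL:-pachctl}"

# rollback_master REPO N: move master to the Nth ancestor of staging,
# but only after that commit has been inspected successfully.
rollback_master() {
  rb_target="staging^${2}"
  if ! $PACHCTL inspect commit "${1}@${rb_target}" >/dev/null 2>&1; then
    echo "no such commit: ${1}@${rb_target}" >&2
    return 1
  fi
  $PACHCTL create branch "${1}@master" --head "$rb_target"
}
```

With this helper, `rollback_master data 1` corresponds to the roll-back command shown above.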
163
164 ## Copy Files from One Branch to Another
165
166 Using a staging branch allows you to defer processing. To use
167 this functionality you need to know your input commits in advance.
168 However, sometimes you want to be able to commit data in an ad-hoc,
169 disorganized manner and then organize it later. Instead of pointing
170 your `master` branch to a commit in a staging branch, you can copy
171 individual files from `staging` to `master`.
172 When you run `copy file`, Pachyderm only copies references to the files and
173 does not move the actual data for the files around.
174
175 To copy files from one branch to another, complete the following steps:
176
177 1. Start a commit:
178
179 ```shell
180 $ pachctl start commit data@master
181 ```
182
183 1. Copy files:
184
185 ```shell
186 $ pachctl copy file data@staging:file1 data@master:file1
187 $ pachctl copy file data@staging:file2 data@master:file2
188 ...
189 ```
190
191 1. Close the commit:
192
193 ```shell
194 $ pachctl finish commit data@master
195 ```
196
197 Also, you can run `pachctl delete file` and `pachctl put file`
198 while the commit is open if you want to remove something from
199 the parent commit or add something that is not stored anywhere else.
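The three steps can be wrapped in one function so that a whole set of files lands in a single commit on `master`. This is a sketch under assumptions: `copy_to_master` and `PACHCTL` are hypothetical names, and the function deliberately stops at the first failed copy and leaves the commit open for inspection instead of finishing it.

```shell
#!/bin/sh
# PACHCTL is a hypothetical override hook; PACHCTL=echo gives a dry run.
PACHCTL="${PACHCTL:-pachctl}"

# copy_to_master REPO FILE...: copy the named files from staging to master
# inside one commit.
copy_to_master() {
  cp_repo="$1"
  shift
  $PACHCTL start commit "${cp_repo}@master" || return 1
  for cp_file in "$@"; do
    # On failure, stop and leave the commit open for the caller to inspect.
    $PACHCTL copy file "${cp_repo}@staging:${cp_file}" \
      "${cp_repo}@master:${cp_file}" || return 1
  done
  $PACHCTL finish commit "${cp_repo}@master"
}
```

For example, `copy_to_master data file1 file2` runs the same commands as the steps above.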
200
201 ## Deferred Processing in Output Repositories
202
203 You can perform the same deferred processing operations with data in output
204 repositories. To do so, rather than committing to a
205 `staging` branch, configure the `output_branch` field
206 in your pipeline specification.
207
208 To configure deferred processing in an output repository, complete the
209 following steps:
210
211 1. In the pipeline specification, add the `output_branch` field with
212 the name of the branch in which you want to accumulate your data
213 before processing:
214
215 ```json
216 "output_branch": "staging"
217 ```
218
219 1. When you want to process data, run:
220
221 ```shell
222 $ pachctl create branch pipeline@master --head staging
223 ```
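To show where `output_branch` sits in a complete specification, the sketch below writes a minimal pipeline spec to a file. The pipeline name `pipeline` and input repository `data` follow the examples in this section. The image, the transform command, and the helper names `write_spec` and `release_output` are placeholders chosen for illustration.

```shell
#!/bin/sh
# PACHCTL is a hypothetical override hook; PACHCTL=echo gives a dry run.
PACHCTL="${PACHCTL:-pachctl}"

# write_spec FILE: write a minimal pipeline spec whose results accumulate
# on the `staging` branch of the output repo instead of `master`.
write_spec() {
  cat > "$1" <<'EOF'
{
  "pipeline": { "name": "pipeline" },
  "transform": {
    "image": "alpine:3.12",
    "cmd": ["sh", "-c", "cp -r /pfs/data/. /pfs/out/"]
  },
  "input": {
    "pfs": { "repo": "data", "glob": "/*" }
  },
  "output_branch": "staging"
}
EOF
}

# release_output PIPELINE: point master of the output repo at staging.
release_output() {
  $PACHCTL create branch "${1}@master" --head staging
}
```

After `write_spec spec.json`, the pipeline could be created with `pachctl create pipeline -f spec.json`, and `release_output pipeline` would later expose the accumulated results on `master`.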
224
225 ## Automate Branch Switching
226
227 Typically, you repoint one branch to another
228 when a certain condition is met. For example, you might
229 want to repoint your branch when you have a specific number of commits,
230 when the amount of unprocessed data reaches a certain size, or
231 at a fixed time interval, such as daily.
232 To automate this, you need to create a Kubernetes
233 application that uses Pachyderm APIs and watches the repositories for the
234 specified condition. When the condition is met, the application points
235 the `master` branch at the head of the `staging` branch.
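As a lighter-weight stand-in for a full application, the condition check can be sketched as a shell script that a scheduler runs periodically. This is not the Kubernetes application described above. It assumes that the output of `pachctl list branch` has the two-column layout shown earlier in this section, and the names `branch_head`, `promote_if_new`, and `PACHCTL` are hypothetical.

```shell
#!/bin/sh
# PACHCTL is a hypothetical override hook for testing with a stub command.
PACHCTL="${PACHCTL:-pachctl}"

# branch_head REPO BRANCH: print the HEAD column for BRANCH, parsed from
# `pachctl list branch REPO` (assumed layout: BRANCH, then HEAD).
branch_head() {
  $PACHCTL list branch "$1" | awk -v b="$2" '$1 == b { print $2 }'
}

# promote_if_new REPO: re-point master only when staging has a HEAD commit
# that differs from the HEAD of master.
promote_if_new() {
  pn_staging="$(branch_head "$1" staging)"
  pn_master="$(branch_head "$1" master)"
  if [ -z "$pn_staging" ] || [ "$pn_staging" = "-" ]; then
    return 0  # nothing has been staged yet
  fi
  if [ "$pn_staging" != "$pn_master" ]; then
    $PACHCTL create branch "${1}@master" --head staging
  fi
}
```

Running `promote_if_new data` once a day would give the daily batching described above. Note that this simple check also undoes a deliberate roll-back of `master`, so a real implementation would need a more specific condition.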