github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/how-tos/use-transactions-to-run-multiple-commands.md

# Use Transactions

!!! note "TL;DR"
    Use transactions to run multiple Pachyderm commands
    simultaneously in one job run.

A transaction is a Pachyderm operation that enables you to create
a collection of Pachyderm commands and execute them concurrently.
Regular Pachyderm operations that are not in a transaction are
executed one after another. However, when you need
to run multiple commands at the same time, you can use transactions.
This functionality is particularly useful for pipelines with multiple
inputs. If you need to update two or more input repos, you might not want
a pipeline job for each state change. You can issue a transaction
to start commits in each of the input repos, which creates a single
downstream commit in the pipeline repo. After the transaction, you
can put files and finish the commits at will, and the pipeline job
runs once all the input commits have been finished.

## Use Cases

Pachyderm users apply transactions in their own workflows and find
unique ways to benefit from this feature, from small
research teams to enterprise-grade machine learning workflows.

Below are examples of the most common ways to use transactions.

### Commit to Separate Repositories Simultaneously

Suppose you have a Pachyderm pipeline with two input
repositories. One repository includes training data and the
other `parameters` for your machine learning pipeline. If you need
to run specific data against specific parameters, you need to
run your pipeline against specific commits in both repositories.
To achieve this, you need to commit to these repositories
simultaneously.

If you use a regular Pachyderm workflow, the data is uploaded sequentially,
and each upload triggers a separate job instead of one job with both commits
of new data. One `put file` operation commits changes to
the data repository and the other updates the parameters repository.
The following animation shows the standard Pachyderm workflow without
a transaction:

![Standard workflow](../assets/images/transaction_wrong.gif)

In Pachyderm, a pipeline starts as soon as a new commit lands in
a repository. In the diagram above, as soon as `commit 1` is added
to the `data` repository, Pachyderm runs a job for `commit 1` and
`commit 0` in the `parameters` repository. You can also see
that Pachyderm runs the second job and processes `commit 1`
from the `data` repository with `commit 1` in the `parameters`
repository. In some cases, this is a perfectly acceptable solution.
But if your job takes many hours and you are only interested in the
result of the pipeline run with `commit 1` from both repositories,
this approach does not work.

With transactions, you can ensure that only one job triggers with
both the new `data` and `parameters`. The following animation
demonstrates how transactions work:

![Transactions workflow](../assets/images/transaction_right.gif)

The transaction ensures that a single job runs for the two commits
that were started within the transaction.
While Pachyderm supports some workflows where you can get the
same effect by having both data and parameters in the same repo,
separating them and using transactions is often much more efficient for
organizational and performance reasons.

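The single-job workflow described in this section can be sketched as a small shell function. This is an illustration under assumptions rather than a definitive recipe: the repository names `data` and `parameters` come from the example in this section, while the function name `commit_inputs_together`, the local file arguments, and the destination paths `/data` and `/params.json` are hypothetical.

```shell
# Hypothetical helper: commit new data and new parameters so that the
# downstream pipeline runs a single job for both commits.
# $1 = local data file, $2 = local parameters file (illustrative).
commit_inputs_together() {
  data_file="$1"
  params_file="$2"

  # Start one commit per input repo inside a single transaction.
  pachctl start transaction || return 1
  pachctl start commit data@master || return 1
  pachctl start commit parameters@master || return 1
  pachctl finish transaction || return 1

  # `put file` is not supported inside a transaction, so files are
  # added to the open commits after the transaction has finished.
  pachctl put file data@master:/data -f "$data_file" || return 1
  pachctl put file parameters@master:/params.json -f "$params_file" || return 1

  # The pipeline job runs once both input commits are finished.
  pachctl finish commit data@master || return 1
  pachctl finish commit parameters@master
}
```

A call such as `commit_inputs_together train.csv params.json` (file names assumed) would then produce one downstream job for the pair of commits.
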
### Switching from Staging to Master Simultaneously

If you are using [deferred processing](../../concepts/advanced-concepts/deferred_processing/)
in your repositories because you want to commit your changes frequently
without triggering jobs every time, then transactions can help you
manage deferred processing with multiple inputs. You commit your
changes to the staging branch and,
when needed, switch the `HEAD` of your master branch to a commit in the
staging branch. To do this in several repositories simultaneously, you can
use transactions.

For example, you have two repositories, `data` and `parameters`, both
of which have a `master` and a `staging` branch. You commit your
changes to the staging branch while your pipeline is subscribed to the
master branch. To switch both `master` branches simultaneously, you can
use transactions like this:

```shell
pachctl start transaction
```

**System Response:**

```shell
Started new transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
pachctl create branch data@master --head staging
Added to transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
pachctl create branch parameters@master --head staging
Added to transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
pachctl finish transaction
Completed transaction with 2 requests: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
```

When you finish the transaction, the `master` branch in both
repositories moves to the `staging` commit at the same time, which
triggers one job to process those commits together.

### Updating Multiple Pipelines Simultaneously

If you want to change logic or intermediate data formats in your DAG, you
may need to change multiple pipelines. Performing these changes together
in a transaction can avoid creating jobs with mismatched pipeline versions
and potentially wasting work.

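As a rough sketch of this use case, the function below queues one `pachctl update pipeline` call per pipeline spec and applies them together. The function name `update_pipelines_together` and the spec file names in the usage note are hypothetical and stand in for your own pipeline specifications.

```shell
# Hypothetical helper: apply several pipeline specs in one transaction.
# Each argument is a path to a pipeline spec file (illustrative names).
update_pipelines_together() {
  pachctl start transaction || return 1
  for spec in "$@"; do
    # Each update is queued in the transaction, not applied yet.
    pachctl update pipeline -f "$spec" || return 1
  done
  # All queued updates are applied together.
  pachctl finish transaction
}
```

For example, `update_pipelines_together edges.json montage.json` would update both pipelines in a single step. If the function returns early, the transaction is still active in the local configuration and needs to be finished or deleted by hand.
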
## Start and Finish Transactions

To start a transaction, run the following command:

```shell
pachctl start transaction
```

**System Response:**

```shell
Started new transaction: 7a81eab5-e6c6-430a-a5c0-1deb06852ca5
```

This command generates a transaction object in the cluster and saves
its ID in the local Pachyderm configuration file. By default, this file
is stored at `~/.pachyderm/config.json`.

!!! example
    ```json hl_lines="9"
    {
      "user_id": "b4fe4317-be21-4836-824f-6661c68b8fba",
      "v2": {
        "active_context": "local-2",
        "contexts": {
          "default": {},
          "local-2": {
            "source": 3,
            "active_transaction": "7a81eab5-e6c6-430a-a5c0-1deb06852ca5",
            "cluster_name": "minikube",
            "auth_info": "minikube",
            "namespace": "default"
          },
    ```

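To check from a script whether a transaction is currently active, you could read the `active_transaction` field from the configuration file. The helper below is a minimal sketch: the function name `active_transaction_id` is hypothetical, and it matches on text instead of parsing JSON, so it assumes that each key sits on its own line as in the example above and that only one context has an active transaction. `pachctl inspect transaction` remains the supported way to look at the current transaction.

```shell
# Hypothetical helper: print the ID stored in "active_transaction".
# $1 = config file path (defaults to ~/.pachyderm/config.json).
active_transaction_id() {
  config_file="${1:-$HOME/.pachyderm/config.json}"
  # Print only the quoted value that follows the "active_transaction" key.
  sed -n 's/.*"active_transaction"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$config_file"
}
```

The function prints nothing when the file contains no `active_transaction` entry.
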
After you start a transaction, you can add supported commands, such
as `pachctl create repo`, `pachctl create branch`, and so on, to the
transaction. All commands that are performed in a transaction are
queued up and not executed against the actual cluster until you finish
the transaction. When you finish the transaction, all queued commands
are executed atomically.

To finish a transaction, run:

```shell
pachctl finish transaction
```

**System Response:**

```shell
Completed transaction with 1 requests: 7a81eab5-e6c6-430a-a5c0-1deb06852ca5
```

## Other Transaction Commands

Other supporting commands for transactions include the following:

| Command | Description |
| ------------ | ----------- |
| `pachctl list transaction` | List all unfinished transactions available in the Pachyderm cluster. |
| `pachctl stop transaction` | Remove the currently active transaction from the local Pachyderm config file. The transaction remains in the Pachyderm cluster and can be resumed later. |
| `pachctl resume transaction` | Set an already-existing transaction as the active transaction in the local Pachyderm config file. |
| `pachctl delete transaction` | Delete a transaction from the Pachyderm cluster. |
| `pachctl inspect transaction` | Provide detailed information about an existing transaction, including which operations it will perform. By default, display information about the current transaction. If you specify a transaction ID, display information about the corresponding transaction. |

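The `stop` and `resume` commands can be combined to run a command directly against the cluster while a transaction stays open. The following sketch assumes that `pachctl resume transaction` takes the transaction ID as its argument; the function name `run_outside_transaction` is hypothetical.

```shell
# Hypothetical helper: set a transaction aside, run one command
# directly against the cluster, then pick the transaction back up.
# $1 = transaction ID, remaining arguments = pachctl subcommand to run.
run_outside_transaction() {
  txn_id="$1"
  shift
  # Detach the active transaction from the local config file.
  pachctl stop transaction || return 1
  # This command is no longer queued; it runs immediately.
  pachctl "$@"
  status=$?
  # Make the saved transaction active again, even if the command failed.
  pachctl resume transaction "$txn_id" || return 1
  return "$status"
}
```

For example, `run_outside_transaction 7a81eab5-e6c6-430a-a5c0-1deb06852ca5 list repo` lists repositories immediately and leaves the transaction ready for more commands.
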
## Supported Operations

While there is a transaction object in the Pachyderm configuration
file, all supported API requests append the request to the
transaction instead of running directly. These supported commands include:

```shell
create repo
delete repo
start commit
finish commit
delete commit
create branch
delete branch
create pipeline
update pipeline
```

Each time you add a command to a transaction, Pachyderm validates the
transaction against the current state of the cluster metadata and obtains
any return values, which is important for commands such as
`start commit`. If validation fails for any reason, Pachyderm does
not add the operation to the transaction. If the transaction has been
invalidated by a change in the cluster state, you must delete the transaction
and start over, taking into account the new state of the cluster.
From a command-line perspective, these commands work identically within
a transaction as without. The only differences are that your changes are
not applied until you run `finish transaction`, and that Pachyderm logs
a message to `stderr` to indicate that the command was placed
in a transaction rather than run directly.

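The validate-on-add behavior suggests a simple pattern: queue each command, and if one is rejected, delete the transaction so that you can start over. The sketch below is one possible way to script that. The function name `run_in_transaction` is hypothetical, and it assumes that `pachctl delete transaction` without an ID targets the currently active transaction.

```shell
# Hypothetical helper: queue each command in a new transaction and
# finish it. If any command is rejected, delete the transaction so the
# caller can start over against the new cluster state.
# Each argument is one pachctl subcommand given as a single string,
# for example "create repo images".
run_in_transaction() {
  pachctl start transaction || return 1
  for cmd in "$@"; do
    # Word splitting of $cmd is intentional: it separates the arguments.
    if ! pachctl $cmd; then
      pachctl delete transaction
      return 1
    fi
  done
  pachctl finish transaction
}
```

A call such as `run_in_transaction "create repo images" "create branch images@staging"` (repository name assumed) either applies both operations or leaves no transaction behind.
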
## Multiple Opened Transactions

Some systems have a notion of *nested* transactions, in which you
open transactions within an already open transaction. In such systems, the
operations added to the subsequent transactions are not executed
until all the nested transactions and the main transaction are closed.

Pachyderm does not support such behavior. Instead, when you open a
transaction, the transaction ID is written to the Pachyderm configuration
file. If you begin another transaction while the first one is open, Pachyderm
returns an error.

Every time you add a command to a transaction,
Pachyderm creates a blueprint of the commit and verifies that the
command is valid. However, one transaction can invalidate another.
In this case, the transaction that is closed first takes precedence
over the other. For example, if two transactions create a repository
with the same name, the one that is executed first results in the
creation of the repository, and the other results in an error.

!!! tip
    While you cannot use `pachctl put file` in a transaction, you can
    start a commit within a transaction, finish the transaction,
    then put as many files as you need, and then finish your commit.
    Your changes will only be applied in one batch when you close
    the commit.

To get a better understanding of how transactions work in practice, try
[Use Transactions with Hyperparameter Tuning](https://github.com/pachyderm/pachyderm/tree/master/examples/transactions/).