github.com/pachyderm/pachyderm@v1.13.4/src/server/transaction/README.md

# Transactions in Pachyderm

Transactions were added to Pachyderm as a way to make multiple changes to the state of Pachyderm while only triggering jobs once. This is done by constructing a batch of operations to perform on the cluster state, then running the set of operations in a single etcd transaction.

The transaction framework provides a method for batching together commit propagation: changed branches are collected over the course of the transaction and all propagated in one batch at the end. This allows Pachyderm to dedupe the changed branches and the branches provenant on them, so that the minimum number of new commits is issued.

This is particularly useful for pipelines with multiple inputs. If you need to update two or more input repos, you might not want a pipeline job for each state change. You can issue a transaction to start commits in each of the input repos, which will create a single downstream commit in the pipeline repo. After the transaction, you can put files and finish the commits at will, and the pipeline job will run once all the input commits have been finished.
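The batching idea can be illustrated with a small, self-contained sketch. This is not Pachyderm's implementation: the `provenance` map, the `txn` type, and the repo names are all hypothetical, and the real propagation logic lives in PFS. The sketch only shows why collecting changed branches in a set and propagating once yields a single downstream commit when two inputs of the same pipeline change.

```go
package main

import (
	"fmt"
	"sort"
)

// provenance maps a branch to the branches directly provenant on it
// (its immediate downstream branches). Hypothetical data model.
type provenance map[string][]string

// txn collects changed branches instead of propagating each change at once.
type txn struct {
	changed map[string]bool
}

func newTxn() *txn { return &txn{changed: map[string]bool{}} }

// markChanged records a branch; using a set collapses repeated changes.
func (t *txn) markChanged(branch string) { t.changed[branch] = true }

// propagate runs once at the end of the transaction and returns the
// deduplicated, sorted set of downstream branches that need a new commit.
func (t *txn) propagate(p provenance) []string {
	seen := map[string]bool{}
	var visit func(b string)
	visit = func(b string) {
		for _, d := range p[b] {
			if !seen[d] {
				seen[d] = true
				visit(d)
			}
		}
	}
	for b := range t.changed {
		visit(b)
	}
	out := make([]string, 0, len(seen))
	for b := range seen {
		out = append(out, b)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Two input repos feed one pipeline output branch.
	p := provenance{
		"images@master": {"montage@master"},
		"edges@master":  {"montage@master"},
	}
	tx := newTxn()
	tx.markChanged("images@master")
	tx.markChanged("edges@master")
	fmt.Println(tx.propagate(p)) // prints [montage@master]
}
```

Without batching, each of the two changes would propagate separately and `montage@master` would receive two commits instead of one.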

## Pachctl

In `pachctl`, a transaction can be initiated through the `start transaction` command. This will generate a transaction object in the cluster and save its ID into the local Pachyderm config (`~/.pachyderm/config.json` by default).

While there is a transaction object in the config file, all transactionally-supported API requests will append the request to the transaction instead of running directly. These commands (as of v1.9.0) are:

* `create repo`
* `delete repo`
* `start commit`
* `finish commit`
* `delete commit`
* `create branch`
* `delete branch`

Each time a command is added to a transaction, the transaction is dry-run against the current state of the cluster metadata to make sure it is still valid and to obtain any return values (important for commands like `start commit`). If the dry-run fails for any reason, the operation will not be added to the transaction. If the transaction has been invalidated by a change in cluster state, it will need to be deleted and started over, taking into account the new state of the cluster.

From a command-line perspective, these commands should work identically inside and outside a transaction, with two exceptions: the changes will not be committed until `finish transaction` is run, and a message will be logged to `stderr` to indicate that the command was placed in a transaction rather than run directly.

There are several other supporting commands for transactions:

* `list transaction` - list all unfinished transactions available in the Pachyderm cluster
* `stop transaction` - remove the currently active transaction from the local Pachyderm config file; it remains in the Pachyderm cluster and may be resumed later
* `resume transaction` - set an already-existing transaction as the active transaction in the local Pachyderm config file
* `delete transaction` - delete a transaction from the Pachyderm cluster
* `inspect transaction` - provide detailed information about an existing transaction, including which operations it will perform
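The dry-run behavior described in this section (replay the whole transaction on every addition, and reject the new operation if the replay fails) can be modeled with a toy example. Everything here is an assumption made for illustration: `state`, `op`, `transaction`, and the two repo operations are hypothetical stand-ins, not the types used by `pachd`.

```go
package main

import (
	"errors"
	"fmt"
)

// state is a toy stand-in for cluster metadata: the set of existing repos.
type state map[string]bool

// op is one queued operation; it mutates the state or returns an error.
type op func(s state) error

func createRepo(name string) op {
	return func(s state) error {
		if s[name] {
			return errors.New("repo " + name + " already exists")
		}
		s[name] = true
		return nil
	}
}

func deleteRepo(name string) op {
	return func(s state) error {
		if !s[name] {
			return errors.New("repo " + name + " not found")
		}
		delete(s, name)
		return nil
	}
}

// transaction is a hypothetical container for queued operations.
type transaction struct{ ops []op }

// run applies every op in order to a copy of the cluster state and returns
// that copy, so the caller's state is never modified.
func run(ops []op, cluster state) (state, error) {
	scratch := state{}
	for k, v := range cluster {
		scratch[k] = v
	}
	for _, o := range ops {
		if err := o(scratch); err != nil {
			return nil, err
		}
	}
	return scratch, nil
}

// add dry-runs the existing ops plus the candidate and appends the
// candidate only if the whole replay succeeds.
func (t *transaction) add(o op, cluster state) error {
	candidate := append(append([]op{}, t.ops...), o)
	if _, err := run(candidate, cluster); err != nil {
		return err
	}
	t.ops = candidate
	return nil
}

// finish replays the queued ops and returns the resulting state.
func (t *transaction) finish(cluster state) (state, error) {
	return run(t.ops, cluster)
}

func main() {
	cluster := state{"images": true}
	tx := &transaction{}
	fmt.Println(tx.add(createRepo("edges"), cluster))  // accepted
	fmt.Println(tx.add(createRepo("images"), cluster)) // rejected by the dry-run
	fmt.Println(tx.add(deleteRepo("edges"), cluster))  // valid only because of the queued create
	fmt.Println(len(tx.ops))
}
```

The third call shows why the entire transaction is replayed rather than just the new operation: `deleteRepo("edges")` is valid only in the state produced by the operations already queued.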

## Implementation Details

Files and Packages:
* `src/client/transaction.go` - client helper functions for transactions
* `src/client/transaction/transaction.proto` - protobuf definitions for the API
* `src/server/transaction/cmds` - implementation of the `pachctl` transaction commands
* `src/server/transaction/pretty` - pretty-printing code used by `pachctl` commands
* `src/server/transaction/server` - implementation of the gRPC API defined in the protobuf file
* `src/server/pkg/transactiondb` - definition of transaction metadata in etcd
* `src/server/pkg/transactionenv` - an environment object passed to each API server in `pachd` that coordinates calls across package boundaries and provides common abstractions needed for transaction support

### TransactionEnv

The `transactionenv.TransactionEnv` object allows us to coordinate calls across the API-server objects in `pachd` without going through an RPC. This means we can include the state of an open STM transaction and guarantee consistent reads and writes efficiently. This interface introduces the `TransactionContext` object, which is a simple container with getters for the full suite of objects needed:

* `ClientContext()` - the client context from the API client which initiated the current request
* `Client()` - a Pachyderm API client for making RPC calls to other subsystems. Using this does _not_ result in consistent reads and writes.
* `Stm()` - the `col.STM` object associated with the current request
* `PfsDefer()` - the `pfs.TransactionDefer` object associated with the current request. Primarily for its `PropagateCommit` call.

Two main functions are provided for starting an etcd transaction with a `TransactionContext`. Each one takes a callback that will be provided with a `TransactionContext` that is valid for the duration of the callback. When the callback finishes, deferred tasks (i.e. the `PropagateCommit` calls) will be run in the STM before committing the changes. `WithReadContext` uses a dry-run STM so that all changes are discarded, and `WithWriteContext` uses a normal STM.
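The callback pattern above can be sketched in miniature. The code below is a simplified analogue under stated assumptions, not the `transactionenv` API: `store`, `txnContext`, `withReadContext`, and `withWriteContext` are hypothetical names, and a plain map copy stands in for the STM. It shows the two properties the text describes: deferred tasks run after the callback but before the commit, and the read variant discards all changes.

```go
package main

import "fmt"

// store is a toy key-value stand-in for etcd.
type store map[string]string

// txnContext is a simplified, hypothetical analogue of TransactionContext.
type txnContext struct {
	stm      store         // working copy visible to the callback
	deferred []func(store) // tasks to run before the commit
}

// Defer queues a task that runs in the same transaction after the callback.
func (c *txnContext) Defer(f func(store)) { c.deferred = append(c.deferred, f) }

// withContext runs cb against a copy of db, then the deferred tasks, and
// writes the copy back only when commit is true.
func withContext(db store, commit bool, cb func(*txnContext) error) error {
	ctx := &txnContext{stm: store{}}
	for k, v := range db {
		ctx.stm[k] = v
	}
	if err := cb(ctx); err != nil {
		return err // nothing is written when the callback fails
	}
	for _, f := range ctx.deferred {
		f(ctx.stm)
	}
	if commit {
		for k := range db {
			delete(db, k)
		}
		for k, v := range ctx.stm {
			db[k] = v
		}
	}
	return nil
}

// withWriteContext commits the callback's changes.
func withWriteContext(db store, cb func(*txnContext) error) error {
	return withContext(db, true, cb)
}

// withReadContext is the dry-run variant: all changes are discarded.
func withReadContext(db store, cb func(*txnContext) error) error {
	return withContext(db, false, cb)
}

func main() {
	db := store{}
	cb := func(c *txnContext) error {
		c.stm["commit"] = "started"
		c.Defer(func(s store) { s["propagated"] = "true" })
		return nil
	}
	_ = withReadContext(db, cb)
	fmt.Println(len(db)) // prints 0: the dry-run left db untouched
	_ = withWriteContext(db, cb)
	fmt.Println(len(db)) // prints 2: the write and the deferred task were committed
}
```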

### Auth considerations

The transaction API does not use auth at all. This is fine because each operation in a transaction checks auth as usual. When adding an operation to a transaction, the entire transaction is dry-run. If a user attempts to add an operation they cannot perform, it will be rejected when the dry-run fails. Similarly, if a user attempts to finish a transaction that contains operations they cannot perform, it will use their auth credentials and fail.

The most an adversarial user can do is delete or invalidate transactions that other users are building. Any other commands they might add to a transaction, they could run directly on the cluster.

### Limitations

The main limitation of transactions is the etcd operation limit, which means that a transaction may grow large enough that it will be rejected. There is no easy way to predict when this will happen at the moment, and the only workaround is to break up the operations into multiple transactions.

In addition, transactions (especially large transactions) may be very slow, because a dry-run is executed every time an operation is added to a transaction. This means that we end up performing `O(n)` dry-runs and `O(n^2)` operations for a transaction with `n` operations in it, since the dry-run for the `k`-th addition replays all `k` operations queued so far.

At the moment, there is no way to modify the files of a commit within a transaction. Primarily, this is because we do not have a good way to provide an STM interface that supports list operations. Modifying files on open commits involves merging the tree from the committed state with the tree stored in the open commits collection in etcd, which uses list operations. As such, this will likely not be available without major changes to the architecture. Luckily, due to how pipeline triggering works, it is not important to change files transactionally; starting commits transactionally is the important part.
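As a rough illustration of the first two limitations, the sketch below counts the work done by repeated dry-runs and shows the workaround of splitting operations into several transactions. `dryRunCost` and `chunk` are hypothetical helpers written for this example, and the batch size of 25 is arbitrary, since the point at which etcd rejects a transaction cannot be predicted here.

```go
package main

import "fmt"

// dryRunCost counts the work done when n operations are added one at a time
// and every addition replays the whole transaction.
func dryRunCost(n int) (dryRuns, opExecs int) {
	for k := 1; k <= n; k++ {
		dryRuns++    // one dry-run per added operation
		opExecs += k // the k-th dry-run replays k operations
	}
	return dryRuns, opExecs
}

// chunk splits ops into batches of at most size elements, one batch per
// transaction.
func chunk(ops []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for len(ops) > 0 {
		n := size
		if len(ops) < n {
			n = len(ops)
		}
		out = append(out, ops[:n])
		ops = ops[n:]
	}
	return out
}

func main() {
	runs, execs := dryRunCost(100)
	fmt.Println(runs, execs) // prints 100 5050

	// The same 100 operations split into transactions of 25.
	total := 0
	for _, batch := range chunk(make([]string, 100), 25) {
		_, e := dryRunCost(len(batch))
		total += e
	}
	fmt.Println(total) // prints 1300
}
```

Splitting reduces both the size of each etcd transaction and the replay cost, but the batches are then separate transactions and are no longer applied atomically as a group.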

### Future Work

* Support transactions in PPS calls
* Have `delete all` be performed transactionally
* Provide an API method for issuing a batch of operations as a transaction without round-trips