# Use Transactions

!!! note "TL;DR"
    Use transactions to run multiple Pachyderm commands
    simultaneously in one job run.

A transaction is a Pachyderm operation that enables you to create
a collection of Pachyderm commands and execute them concurrently.
Regular Pachyderm operations that are not in a transaction are
executed one after another. However, when you need
to run multiple commands at the same time, you can use transactions.
This functionality is particularly useful for pipelines with multiple
inputs. If you need to update two or more input repos, you might not want
a separate pipeline job for each state change. You can issue a transaction
to start commits in each of the input repos, which creates a single
downstream commit in the pipeline repo. After the transaction, you
can put files and finish the commits at will, and the pipeline job
runs once all the input commits have been finished.

## Use Cases

Pachyderm users find their own ways to benefit from transactions,
whether in a small research team or in an enterprise-grade
machine learning workflow.

Below are examples of the most common ways of using transactions.

### Commit to Separate Repositories Simultaneously

For example, you have a Pachyderm pipeline with two input
repositories. One repository, `data`, contains training data and the
other, `parameters`, contains parameters for your machine learning
pipeline. If you need to run specific data against specific parameters,
you need to run your pipeline against specific commits in both
repositories. To achieve this, you need to commit to these repositories
simultaneously.

If you use a regular Pachyderm workflow, the data is uploaded sequentially,
each time triggering a separate job instead of one job with both commits
of new data. One `put file` operation commits changes to
the data repository and the other updates the parameters repository.
The following animation shows the standard Pachyderm workflow without
a transaction:

![Standard workflow](../assets/images/transaction_wrong.gif)

In Pachyderm, a pipeline starts as soon as a new commit lands in
a repository. In the animation above, as soon as `commit 1` is added
to the `data` repository, Pachyderm runs a job for `commit 1` in the
`data` repository and `commit 0` in the `parameters` repository.
Pachyderm then runs a second job that processes `commit 1`
from the `data` repository with `commit 1` in the `parameters`
repository. In some cases, this is a perfectly acceptable solution.
But if your job takes many hours and you are only interested in the
result of the pipeline run with `commit 1` from both repositories,
this approach does not work.

With transactions, you can ensure that only one job triggers with
both the new `data` and `parameters`. The following animation
demonstrates how transactions work:

![Transactions workflow](../assets/images/transaction_right.gif)

The transaction ensures that a single job runs for the two commits
that were started within the transaction.
While Pachyderm supports some workflows where you can get the
same effect by having both data and parameters in the same repo,
separating them and using transactions is often much more efficient,
for both organizational and performance reasons.
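
The two-repository workflow described above can be sketched as a small shell function. This is a minimal sketch rather than an official Pachyderm script: the function name `commit_inputs_together` and its two file arguments are hypothetical, and it assumes that the `data` and `parameters` repositories from this example already exist.

```shell
# Hypothetical helper: open a commit in both input repositories inside one
# transaction, then upload one file to each and close the commits.
# Usage: commit_inputs_together <data-file> <parameters-file>
commit_inputs_together() {
  data_file="$1"
  params_file="$2"

  # Both `start commit` requests are queued and applied together when the
  # transaction finishes, so the pipeline sees them as one change.
  pachctl start transaction &&
    pachctl start commit data@master &&
    pachctl start commit parameters@master &&
    pachctl finish transaction || return 1

  # `put file` runs outside the transaction and writes to the open commits.
  pachctl put file "data@master:/$(basename "$data_file")" -f "$data_file" &&
    pachctl put file "parameters@master:/$(basename "$params_file")" -f "$params_file" || return 1

  # The pipeline job runs once both input commits are finished.
  pachctl finish commit data@master &&
    pachctl finish commit parameters@master
}
```

For example, `commit_inputs_together ./train.csv ./params.json` should open both commits in one transaction, upload one file to each repository, and then close both commits, so that a single job processes the pair.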

### Switching from Staging to Master Simultaneously

If you are using [deferred processing](../deferred_processing/)
in your repositories because you want to commit your changes frequently
without triggering jobs every time, then transactions can help you
manage deferred processing with multiple inputs. You commit your
changes to the staging branch and,
when needed, switch the `HEAD` of your master branch to a commit in the
staging branch. To do this simultaneously, you can use transactions.

For example, you have two repositories, `data` and `parameters`, both
of which have a `master` and a `staging` branch. You commit your
changes to the staging branch while your pipeline is subscribed to the
master branch. To switch both `master` branches simultaneously, you can
use transactions like this:

```shell
$ pachctl start transaction
Started new transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
$ pachctl create branch data@master --head staging
Added to transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
$ pachctl create branch parameters@master --head staging
Added to transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
$ pachctl finish transaction
Completed transaction with 2 requests: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
```

When you finish the transaction, the `master` branch in both repositories
switches to the `staging` commit at the same time, which triggers one job
to process those commits together.
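
The same pattern extends to any number of input repositories. The following is a minimal sketch under stated assumptions: the function name `promote_staging` is hypothetical, and every repository passed to it is assumed to have a `staging` branch, as in the example above.

```shell
# Hypothetical helper: point `master` at the head of `staging` in every
# repository passed as an argument, inside a single transaction.
# Usage: promote_staging <repo>...
promote_staging() {
  if [ "$#" -eq 0 ]; then
    echo "usage: promote_staging <repo>..." >&2
    return 2
  fi

  pachctl start transaction || return 1
  for repo in "$@"; do
    # Each command is queued in the active transaction, not run immediately.
    pachctl create branch "${repo}@master" --head staging || return 1
  done
  # All queued branch updates are applied together.
  pachctl finish transaction
}
```

Calling `promote_staging data parameters` issues the same four commands as the session above. If a `create branch` command is rejected, the function returns early and leaves the transaction open for you to inspect or delete.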

## Start and Finish Transactions

To start a transaction, run the following command:

```shell
$ pachctl start transaction
Started new transaction: 7a81eab5-e6c6-430a-a5c0-1deb06852ca5
```

This command generates a transaction object in the cluster and saves
its ID in the local Pachyderm configuration file. By default, this file
is stored at `~/.pachyderm/config.json`.

!!! example
    ```json hl_lines="9"
    {
       "user_id": "b4fe4317-be21-4836-824f-6661c68b8fba",
       "v2": {
         "active_context": "local-2",
         "contexts": {
           "default": {},
           "local-2": {
             "source": 3,
             "active_transaction": "7a81eab5-e6c6-430a-a5c0-1deb06852ca5",
             "cluster_name": "minikube",
             "auth_info": "minikube",
             "namespace": "default"
           },
    ```
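
To check from a script whether a transaction ID is stored locally, you can search the configuration file for the `active_transaction` key shown above. The following is a rough sketch: the function name `active_transaction` is hypothetical, and it does a plain text match that prints the first ID it finds without checking which context is active. A JSON-aware tool would be more robust.

```shell
# Hypothetical helper: print the first "active_transaction" ID found in a
# Pachyderm config file (default: ~/.pachyderm/config.json).
# Prints nothing if no transaction ID is stored in the file.
active_transaction() {
  config="${1:-$HOME/.pachyderm/config.json}"
  # Keep only the quoted value that follows the "active_transaction" key.
  sed -n 's/.*"active_transaction": *"\([^"]*\)".*/\1/p' "$config" | head -n 1
}
```

Running `active_transaction` with no arguments reads the default configuration file. Pass a path as the first argument to read a different file.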

After you start a transaction, you can add supported commands, such
as `pachctl create repo`, `pachctl create branch`, and so on, to the
transaction. All commands that are performed in a transaction are
queued up and not executed against the actual cluster until you finish
the transaction. When you finish the transaction, all queued commands
are executed atomically.

To finish a transaction, run:

```shell
$ pachctl finish transaction
Completed transaction with 1 requests: 7a81eab5-e6c6-430a-a5c0-1deb06852ca5
```

## Other Transaction Commands

Other supporting commands for transactions include the following:

| Command      | Description |
| ------------ | ----------- |
| `pachctl list transaction` | List all unfinished transactions available in the Pachyderm cluster. |
| `pachctl stop transaction` | Remove the currently active transaction from the local Pachyderm config file. The transaction remains in the Pachyderm cluster and can be resumed later. |
| `pachctl resume transaction` | Set an already-existing transaction as the active transaction in the local Pachyderm config file. |
| `pachctl delete transaction` | Delete a transaction from the Pachyderm cluster. |
| `pachctl inspect transaction` | Provide detailed information about an existing transaction, including which operations it will perform. By default, display information about the current transaction. If you specify a transaction ID, display information about the corresponding transaction. |
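
As an illustration of how `stop` and `resume` fit together, the following sketch runs one command directly against the cluster while a transaction is open, and then makes that transaction active again. The function name `run_outside_transaction` is hypothetical, and the sketch assumes that `pachctl resume transaction` accepts the transaction ID as its argument.

```shell
# Hypothetical helper: run one command outside the given transaction,
# then make that transaction active again.
# Usage: run_outside_transaction <transaction-id> <command> [args...]
run_outside_transaction() {
  txn_id="$1"
  shift

  # Detach from the active transaction; it stays in the cluster.
  pachctl stop transaction || return 1

  # This command now runs directly instead of being queued.
  "$@"
  status=$?

  # Reattach to the same transaction so later commands are queued again.
  pachctl resume transaction "$txn_id" || return 1
  return "$status"
}
```

For example, `run_outside_transaction 7a81eab5-e6c6-430a-a5c0-1deb06852ca5 pachctl list repo` lists repositories immediately and then resumes the transaction. The function returns the exit status of the command it ran.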

## Supported Operations

While there is a transaction object in the Pachyderm configuration
file, each supported API request is appended to the
transaction instead of running directly. These supported commands include:

```shell
create repo
delete repo
start commit
finish commit
delete commit
create branch
delete branch
```

Each time you add a command to a transaction, Pachyderm validates the
transaction against the current state of the cluster metadata and obtains
any return values, which is important for such commands as
`start commit`. If validation fails for any reason, Pachyderm does
not add the operation to the transaction. If the transaction has been
invalidated by a change in the cluster state, you must delete the transaction
and start over, taking into account the new state of the cluster.

From a command-line perspective, these commands work identically within
a transaction as without. The only differences are that your changes
are not applied until you run `finish transaction`, and that
Pachyderm logs a message to `stderr` to indicate that the command was placed
in a transaction rather than run directly.
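
The delete-and-start-over rule can be sketched as a wrapper that queues a list of supported commands and discards the transaction if any of them is rejected. This is a sketch under assumptions, not documented behavior: the function name `in_transaction` is hypothetical, and it assumes both that `pachctl` exits with a non-zero status when validation fails and that `pachctl delete transaction` without an ID deletes the active transaction.

```shell
# Hypothetical helper: queue each argument as one pachctl command in a new
# transaction. If a command is rejected, delete the transaction and stop.
# Usage: in_transaction "create repo data" "create repo parameters"
in_transaction() {
  pachctl start transaction || return 1
  for cmd in "$@"; do
    # $cmd is intentionally unquoted so "create repo data" splits into words.
    if ! pachctl $cmd; then
      echo "rejected: pachctl $cmd; deleting transaction" >&2
      pachctl delete transaction
      return 1
    fi
  done
  # Apply all queued commands together.
  pachctl finish transaction
}
```

Because each command is passed as a single string and split on whitespace, this sketch only suits simple arguments without spaces or wildcard characters.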

## Multiple Opened Transactions

Some systems have a notion of *nested* transactions, in which you
open transactions within an already open transaction. In such systems, the
operations added to the subsequent transactions are not executed
until all the nested transactions and the main transaction are closed.

Pachyderm does not support such behavior. Instead, when you open a
transaction, the transaction ID is written to the Pachyderm configuration
file. If you begin another transaction while the first one is open, Pachyderm
suspends the first transaction and overwrites the transaction ID in the
configuration file. All operations that you add to the ensuing
transactions are executed as soon as you close those
transactions. To resume the initial transaction, you need to run
`pachctl resume transaction`.

Every time you add a command to a transaction,
Pachyderm creates a blueprint of the commit and verifies that the
command is valid. However, one transaction can invalidate another.
In this case, the transaction that is closed first takes precedence
over the other. For example, if two transactions create a repository
with the same name, the one that is executed first results in the
creation of the repository, and the other results in an error.

!!! tip
    While you cannot use `pachctl put file` in a transaction, you can
    start a commit within a transaction, finish the transaction,
    then put as many files as you need, and then finish your commit.
    Your changes will only be applied in one batch when you close
    the commit.

To get a better understanding of how transactions work in practice, try
[Use Transactions with Hyperparameter Tuning](https://github.com/pachyderm/pachyderm/tree/master/examples/transactions/).