# Use Transactions

!!! note "TL;DR"
    Use transactions to run multiple Pachyderm commands
    simultaneously in one job run.

A transaction is a Pachyderm operation that enables you to create
a collection of Pachyderm commands and execute them concurrently.
Regular Pachyderm operations that are not in a transaction are
executed one after another. However, when you need
to run multiple commands at the same time, you can use transactions.
This functionality is particularly useful for pipelines with multiple
inputs. If you need to update two or more input repos, you might not want
a pipeline job for each state change. You can issue a transaction
to start commits in each of the input repos, which creates a single
downstream commit in the pipeline repo. After the transaction, you
can put files and finish the commits at will, and the pipeline job
will run once all the input commits have been finished.

## Use Cases

Pachyderm users apply transactions to their own workflows and find
unique ways to benefit from this feature, whether they are a small
research team or run an enterprise-grade machine learning workflow.

The following examples show the most common ways of using transactions.

### Commit to Separate Repositories Simultaneously

Suppose you have a Pachyderm pipeline with two input
repositories. One repository includes training data and the
other `parameters` for your machine learning pipeline. If you need
to run specific data against specific parameters, you need to
run your pipeline against specific commits in both repositories.
To achieve this, you need to commit to these repositories
simultaneously.

If you use a regular Pachyderm workflow, the data is uploaded sequentially,
each time triggering a separate job instead of one job with both commits
of new data. One `put file` operation commits changes to
the data repository and the other updates the parameters repository.
The following animation shows the standard Pachyderm workflow without
a transaction:

![Standard workflow](../assets/images/transaction_wrong.gif)

In Pachyderm, a pipeline starts as soon as a new commit lands in
a repository. In the diagram above, as soon as `commit 1` is added
to the `data` repository, Pachyderm runs a job for `commit 1` in `data`
together with `commit 0` in the `parameters` repository. You can also see
that Pachyderm then runs a second job that processes `commit 1`
from the `data` repository with `commit 1` in the `parameters`
repository. In some cases, this is a perfectly acceptable solution.
But if your job takes many hours and you are only interested in the
result of the pipeline run with `commit 1` from both repositories,
this approach does not work.

With transactions, you can ensure that only one job triggers with
both the new `data` and `parameters`. The following animation
demonstrates how transactions work:

![Transactions workflow](../assets/images/transaction_right.gif)

The transaction ensures that a single job runs for the two commits
that were started within the transaction.
While Pachyderm supports some workflows where you can get the
same effect by having both data and parameters in the same repo,
separating them and using transactions is often much more efficient for
organizational and performance reasons.

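The workflow described above can be sketched with `pachctl` roughly as follows. This is a minimal sketch under stated assumptions, not a definitive recipe: the repo names `data` and `parameters` follow the example in this section, while the local files `train.csv` and `params.json` are hypothetical. The `pc` helper is an illustrative wrapper, not part of Pachyderm; by default it only prints each command so that you can dry-run the script without a cluster.

```shell
#!/bin/sh
# Illustrative helper (not part of Pachyderm): prints each command by
# default; set DRY_RUN=0 to run pachctl against a real cluster.
pc() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "pachctl $*"
  else
    pachctl "$@"
  fi
}

commit_data_and_parameters() {
  # Start a commit in each input repo inside one transaction so that
  # the pipeline sees both new commits together.
  pc start transaction
  pc start commit data@master
  pc start commit parameters@master
  pc finish transaction

  # put file is not supported inside a transaction, so upload afterwards.
  # train.csv and params.json are hypothetical local files.
  pc put file data@master:/train.csv -f train.csv
  pc put file parameters@master:/params.json -f params.json

  # The pipeline job runs once both input commits are finished.
  pc finish commit data@master
  pc finish commit parameters@master
}

commit_data_and_parameters
```

Run the script once in dry-run mode to review the command sequence, then rerun it with `DRY_RUN=0` when the order looks right.
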
### Switching from Staging to Master Simultaneously

If you are using [deferred processing](../../concepts/advanced-concepts/deferred_processing/)
in your repositories because you want to commit your changes frequently
without triggering jobs every time, then transactions can help you
manage deferred processing with multiple inputs. You commit your
changes to the staging branch and,
when needed, switch the `HEAD` of your master branch to a commit in the
staging branch. To do this in several repositories simultaneously, you can use transactions.

For example, you have two repositories, `data` and `parameters`, both
of which have a `master` and a `staging` branch. You commit your
changes to the staging branch while your pipeline is subscribed to the
master branch. To switch both master branches simultaneously, you can
use transactions like this:

```shell
pachctl start transaction
```

**System Response:**

```shell
Started new transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
pachctl create branch data@master --head staging
Added to transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
pachctl create branch parameters@master --head staging
Added to transaction: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
pachctl finish transaction
Completed transaction with 2 requests: 0d6f0bc3-37a0-4936-96e3-82034a2a2055
```

When you finish the transaction, the `master` branch in both repositories
switches to the head of `staging` at the same time, which triggers one job
to process those commits together.

### Updating Multiple Pipelines Simultaneously

If you want to change logic or intermediate data formats in your DAG, you
may need to change multiple pipelines. Performing these changes together
in a transaction can avoid creating jobs with mismatched pipeline versions
and potentially wasting work.

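As a rough illustration, updating two pipelines in one transaction might look like the sketch below. The spec file names `edges.json` and `montage.json` are hypothetical placeholders for your own pipeline specifications, and the `pc` helper is an illustrative wrapper (not part of Pachyderm) that prints the commands unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Illustrative helper (not part of Pachyderm): prints each command by
# default; set DRY_RUN=0 to run pachctl against a real cluster.
pc() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "pachctl $*"
  else
    pachctl "$@"
  fi
}

update_pipelines_together() {
  pc start transaction
  # edges.json and montage.json are hypothetical pipeline spec files.
  pc update pipeline -f edges.json
  pc update pipeline -f montage.json
  # Both updates are applied together when the transaction finishes.
  pc finish transaction
}

update_pipelines_together
```
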
## Start and Finish Transactions

To start a transaction, run the following command:

```shell
pachctl start transaction
```

**System Response:**

```shell
Started new transaction: 7a81eab5-e6c6-430a-a5c0-1deb06852ca5
```

This command generates a transaction object in the cluster and saves
its ID in the local Pachyderm configuration file. By default, this file
is stored at `~/.pachyderm/config.json`.

!!! example
    ```json hl_lines="9"
    {
       "user_id": "b4fe4317-be21-4836-824f-6661c68b8fba",
       "v2": {
         "active_context": "local-2",
         "contexts": {
           "default": {},
           "local-2": {
             "source": 3,
             "active_transaction": "7a81eab5-e6c6-430a-a5c0-1deb06852ca5",
             "cluster_name": "minikube",
             "auth_info": "minikube",
             "namespace": "default"
           },
    ```

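If you want to check from a script whether a transaction is currently recorded in the configuration file, a simple text match is one option. The sketch below assumes that the `active_transaction` field appears on its own line, as in the example above. It uses `sed` rather than a JSON parser, so it prints one ID per context that has an active transaction. The `active_txn` function name and the sample file are illustrative only.

```shell
#!/bin/sh
# Write a small sample config to a temp file so that the sketch runs
# without touching the real ~/.pachyderm/config.json.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
{
  "v2": {
    "active_context": "local-2",
    "contexts": {
      "default": {},
      "local-2": {
        "active_transaction": "7a81eab5-e6c6-430a-a5c0-1deb06852ca5",
        "cluster_name": "minikube"
      }
    }
  }
}
EOF

# Print the ID stored in each "active_transaction" field of the given
# file, or nothing if no transaction is active.
active_txn() {
  sed -n 's/.*"active_transaction": *"\([^"]*\)".*/\1/p' "$1"
}

active_txn "$cfg"
```

To inspect your real configuration, call `active_txn ~/.pachyderm/config.json` instead of passing the sample file.
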
After you start a transaction, you can add supported commands, such
as `pachctl create repo`, `pachctl create branch`, and so on, to the
transaction. All commands that are performed in a transaction are
queued up and not executed against the actual cluster until you finish
the transaction. When you finish the transaction, all queued commands
are executed atomically.

To finish a transaction, run:

```shell
pachctl finish transaction
```

**System Response:**

```shell
Completed transaction with 1 requests: 7a81eab5-e6c6-430a-a5c0-1deb06852ca5
```

## Other Transaction Commands

Other supporting commands for transactions include the following:

| Command      | Description |
| ------------ | ----------- |
| `pachctl list transaction` | List all unfinished transactions available in the Pachyderm cluster. |
| `pachctl stop transaction` | Remove the currently active transaction from the local Pachyderm config file. The transaction remains in the Pachyderm cluster and can be resumed later. |
| `pachctl resume transaction` | Set an already-existing transaction as the active transaction in the local Pachyderm config file. |
| `pachctl delete transaction` | Delete a transaction from the Pachyderm cluster. |
| `pachctl inspect transaction` | Show detailed information about an existing transaction, including which operations it will perform. By default, the command displays information about the current transaction. If you specify a transaction ID, it displays information about that transaction instead. |

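The commands in the table can be combined to set a transaction aside and return to it later. The following is a minimal sketch rather than a prescribed workflow: the transaction ID is the one from the example output earlier on this page, so substitute your own. The `pc` helper and the `park_and_resume` function are illustrative names, not part of Pachyderm, and the helper only prints the commands unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Illustrative helper (not part of Pachyderm): prints each command by
# default; set DRY_RUN=0 to run pachctl against a real cluster.
pc() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "pachctl $*"
  else
    pachctl "$@"
  fi
}

# Example ID copied from the sample output on this page; use your own.
TXN_ID="7a81eab5-e6c6-430a-a5c0-1deb06852ca5"

park_and_resume() {
  pc list transaction              # show unfinished transactions
  pc inspect transaction           # details of the active transaction
  pc stop transaction              # detach locally; it stays in the cluster
  pc resume transaction "$TXN_ID"  # make it the active transaction again
  pc delete transaction "$TXN_ID"  # remove it from the cluster
}

park_and_resume
```
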
## Supported Operations

While there is an active transaction in the Pachyderm configuration
file, all supported API requests are appended to the
transaction instead of running directly. The supported commands are:

```shell
create repo
delete repo
start commit
finish commit
delete commit
create branch
delete branch
create pipeline
update pipeline
```

Each time you add a command to a transaction, Pachyderm validates the
transaction against the current state of the cluster metadata and obtains
any return values, which is important for such commands as
`start commit`. If validation fails for any reason, Pachyderm does
not add the operation to the transaction. If the transaction has been
invalidated by a change in the cluster state, you must delete the transaction
and start over, taking into account the new state of the cluster.
From a command-line perspective, these commands work the same way within
a transaction as without one. The only differences are that your changes
are not applied until you run `finish transaction`, and that Pachyderm
logs a message to `stderr` to indicate that the command was placed
in a transaction rather than run directly.

## Multiple Opened Transactions

Some systems have a notion of *nested* transactions, in which you
open transactions within an already open transaction. In such systems, the
operations added to the subsequent transactions are not executed
until all the nested transactions and the main transaction are closed.

Pachyderm does not support such behavior. Instead, when you open a
transaction, the transaction ID is written to the Pachyderm configuration
file. If you begin another transaction while the first one is open, Pachyderm
returns an error.

Every time you add a command to a transaction,
Pachyderm creates a blueprint of the commit and verifies that the
command is valid. However, one transaction can invalidate another.
In this case, the transaction that is closed first takes precedence
over the other. For example, if two transactions create a repository
with the same name, the one that is executed first results in the
creation of the repository, and the other results in an error.

!!! tip
    While you cannot use `pachctl put file` in a transaction, you can
    start a commit within a transaction, finish the transaction,
    then put as many files as you need, and then finish your commit.
    Your changes are applied in one batch only when you close
    the commit.

To get a better understanding of how transactions work in practice, try
[Use Transactions with Hyperparameter Tuning](https://github.com/pachyderm/pachyderm/tree/master/examples/transactions/).