>![pach_logo](../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Use Transactions with Hyperparameter Tuning

!!! note "Summary"
    Transactions can help optimize the use of resources
    by postponing pipeline runs.

Hyperparameter tuning is a machine learning technique
for searching a set of candidate parameter values to find
the combination that trains the best-performing model.
The Pachyderm documentation includes a
[hyperparameter tuning example](https://github.com/pachyderm/pachyderm/tree/1.13.x/examples/ml/hyperparameter)
that describes how this computation works in Pachyderm.

In the hyperparameter example, training data is submitted
to the `raw_data` repository, and the parameters are stored
in the `parameters` repository. In that example, the
data processing takes seconds, so you can
run it for every commit without worrying
about resource use. However, if your
data processing takes significant time,
you might want Pachyderm to run the pipeline only against
specific pairs of commits in the `raw_data` and `parameters` repositories.
You can do so by using transactions.

## Set up the Hyperparameter Example

To demonstrate the benefits of transactions, we
use them with the `model` pipeline step from the
[hyperparameter tuning example](https://github.com/pachyderm/pachyderm/tree/1.13.x/examples/ml/hyperparameter).
In this transaction example, we omit the splitting step and
keep only the `model` pipeline, which consumes commits from
the `raw_data` and `parameters` repositories and writes the
result to the `model` repository. You can adapt this example
to your pipelines as needed.

The following diagram describes the pipeline structure:

![transactions diagram](https://github.com/pachyderm/pachyderm/tree/1.13.x/doc/docs/master/assets/images/d_transactions_hyperparameter.svg)

To set up the pipeline, complete the following steps:

1. Create the `raw_data` repository:

   ```shell
   $ pachctl create repo raw_data
   ```

1. Create the `parameters` repository:

   ```shell
   $ pachctl create repo parameters
   ```

1. Verify that the repositories were successfully created:

   ```shell
   $ pachctl list repo
   NAME       CREATED        SIZE (MASTER)
   parameters 44 minutes ago 123B
   raw_data   44 minutes ago 6.858KiB
   ```

1. Clone the Pachyderm repository:

   ```shell
   $ git clone git@github.com:pachyderm/pachyderm.git
   ```

1. Change to the `examples/transactions` directory of the cloned repository:

   ```shell
   $ cd pachyderm/examples/transactions/
   ```

1. Create the `model` pipeline:

   ```shell
   $ pachctl create pipeline -f model.json
   ```

1. Verify that the pipeline has been created:

   ```shell
   $ pachctl list pipeline
   NAME       VERSION INPUT                                                                                      CREATED        STATE / LAST JOB
   model      1       (parameters:/c_parameters.txt/* ⨯ parameters:/gamma_parameters.txt/* ⨯ raw_data:/iris.csv) 12 seconds ago running / starting
   ```
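
The setup steps above can be collected into a small script. The following is a minimal sketch, not part of the original example: `setup_example` is a hypothetical helper name, and the sketch assumes that `pachctl` is on your `PATH` and connected to a cluster, and that the current directory is `examples/transactions` so that `model.json` is present.

```shell
#!/bin/sh
# Sketch: create the two input repositories and the model pipeline.
# setup_example is a hypothetical helper name, not a pachctl command.
# Assumes the current directory contains model.json.
setup_example() {
    # Create both input repositories; stop at the first failure.
    for repo in raw_data parameters; do
        pachctl create repo "$repo" || return 1
    done

    # Create the pipeline from the spec shipped with this example.
    pachctl create pipeline -f model.json || return 1

    # Print the pipeline list so that the result can be checked by eye.
    pachctl list pipeline
}
```

After sourcing the script, run `setup_example` once. The function stops at the first failing command, for example when a repository already exists.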

## Run the Transaction

To make a pipeline process a matching pair of commits, create
a transaction, open a commit in each input repository inside that
transaction, and then close the transaction. After that, add your
files and close both commits. Pachyderm applies the changes to both
repositories together, and triggers the pipeline, only when all commits
that you have opened within the transaction are closed.

To run the transaction, complete the following steps:

1. Start a transaction:

   ```shell
   $ pachctl start transaction
   Started new transaction: 854e8503-6e5d-4542-805c-a73a39200bf8
   ```

1. Open a commit into the `master` branch of the `raw_data` repository:

   ```shell
   $ pachctl start commit raw_data@master
   Added to transaction: 854e8503-6e5d-4542-805c-a73a39200bf8
   42b893e48e7d40f1bb5ed770526a9a07
   ```

1. Open a commit into the `master` branch of the `parameters` repository:

   ```shell
   $ pachctl start commit parameters@master
   Added to transaction: 854e8503-6e5d-4542-805c-a73a39200bf8
   c4dc446b25e54a938a67a5e913b3f9a4
   ```

1. Close the transaction:

   ```shell
   $ pachctl finish transaction
   Completed transaction with 2 requests: 854e8503-6e5d-4542-805c-a73a39200bf8
   ```

1. Add the data to the `parameters` repository, splitting each line
   into a separate file:

   ```shell
   $ pachctl put file parameters@master -f c_parameters.txt --split line --target-file-datums 1
   $ pachctl put file parameters@master -f gamma_parameters.txt --split line --target-file-datums 1
   ```

   ```shell
   $ pachctl list file parameters@master
   NAME                  TYPE SIZE
   /c_parameters.txt     dir  81B
   /gamma_parameters.txt dir  42B
   ```

   **Note:** Although the files are in the repository, no jobs have been
   triggered for the `model` pipeline. You can verify that by running
   the following command:

   ```shell
   $ pachctl list job --pipeline=model --no-pager
   ID                               PIPELINE STARTED      DURATION RESTART PROGRESS  DL UL STATE
   ```

1. Add the data to the `raw_data` repository:

   ```shell
   $ pachctl put file raw_data@master:iris.csv -f noisy_iris.csv
   ```

   If you check the job list again, you
   can see that the pipeline has still not run:

   ```shell
   $ pachctl list job --pipeline=model --no-pager
   ID PIPELINE STARTED DURATION RESTART PROGRESS DL UL STATE
   ```

1. Close the commit to the `raw_data` repository that you
   started within the transaction:

   ```shell
   $ pachctl finish commit raw_data@master
   ```

   Still, no job runs for the `model` pipeline. If we had not
   started those commits in a transaction, a job would
   be triggered here because Pachyderm normally triggers jobs
   whenever a commit is made on any input. In this case, because
   those commits were added to a transaction, Pachyderm waits
   for both input commits to be finished before a job triggers.

1. Close the commit to the `parameters` repository that you
   started within the transaction:

   ```shell
   $ pachctl finish commit parameters@master
   ```

   Now that both commits are finished, Pachyderm creates one
   job that takes the two commits that you opened within the
   transaction and runs your code against them:

   ```shell
   $ pachctl list job --pipeline=model --no-pager
   ID                               PIPELINE STARTED        DURATION RESTART PROGRESS    DL       UL      STATE
   6cdc80ae105f47b4a09f0ab8ce005003 model    37 seconds ago -        0       21 + 0 / 77 115.5KiB 62.2KiB running
   ```

1. View the contents of the `model` output repository:

   ```shell
   $ pachctl list file model@master
   NAME                      TYPE SIZE
   /model_C0.031_G0.001.pkl  file 5.713KiB
   /model_C0.031_G0.004.pkl  file 5.713KiB
   /model_C0.031_G0.016.pkl  file 5.713KiB
   /model_C0.031_G0.063.pkl  file 5.713KiB
   ...
   ```
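
The transaction steps above can also be sketched as one function. This is an illustration under stated assumptions rather than a supported workflow: `load_inputs_atomically` is a hypothetical helper name, the commands and file names are the ones used in the steps above, and the input files are expected in the current directory.

```shell
#!/bin/sh
# Sketch: open commits on both input repositories inside one transaction,
# add the files, then finish both commits so that a single job is triggered.
# load_inputs_atomically is a hypothetical helper name, not a pachctl command.
load_inputs_atomically() {
    # Open one commit per input repository inside a single transaction.
    pachctl start transaction || return 1
    pachctl start commit raw_data@master || return 1
    pachctl start commit parameters@master || return 1
    pachctl finish transaction || return 1

    # The commits stay open after the transaction is finished,
    # so files can still be added to them.
    for f in c_parameters.txt gamma_parameters.txt; do
        pachctl put file parameters@master -f "$f" \
            --split line --target-file-datums 1 || return 1
    done
    pachctl put file raw_data@master:iris.csv -f noisy_iris.csv || return 1

    # The job is triggered only after both commits are finished.
    pachctl finish commit raw_data@master || return 1
    pachctl finish commit parameters@master
}
```

The function stops at the first failing command. It does not clean up a transaction or commit that is left open by a failure, so a real script would need to handle that case.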

This example shows that if a pipeline
takes a long time to run, you can reduce resource use with
transactions. Transactions enable you to accumulate
changes in input repositories and postpone pipeline runs
until after the commits are closed.

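
To check from a script that pipeline runs really were postponed, you can count the rows that `pachctl list job` prints. The following is a minimal sketch: `count_jobs` is a hypothetical helper name, and it assumes the tabular output shown in this example, that is, one header line followed by one line per job.

```shell
#!/bin/sh
# Sketch: count the jobs listed for a pipeline.
# count_jobs is a hypothetical helper name, not a pachctl command.
# Assumes one header line followed by one line per job.
count_jobs() {
    # Skip the header row (NR > 1), ignore empty lines (NF > 0),
    # and print the number of remaining rows (0 if there are none).
    pachctl list job --pipeline="$1" --no-pager |
        awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}
```

For example, `count_jobs model` should print `0` while either commit from the transaction is still open, and a larger number after both commits are finished.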