     1  # Deferred Processing Plus Transactions
     2  
     3  This example, 
     4  which uses a simple DAG based on our [OpenCV example](https://github.com/pachyderm/pachyderm/tree/1.13.x/examples/opencv), 
     5  illustrates two Pachyderm usage patterns for fine-grained control over when pipelines trigger jobs.
     6  
     7  [Deferred processing](https://docs.pachyderm.com/1.13.x/how-tos/deferred_processing/) 
     8  is a Pachyderm technique for controlling when data gets processed.
     9  Deferred processing uses branches to prevent pipelines from triggering on every input commit.
    10  
    11  [Transactions](https://docs.pachyderm.com/1.13.x/how-tos/use-transactions-to-run-multiple-commands/) are a Pachyderm feature 
    12  that allows you to batch multiple operations together,
    13  such as committing data to two different repos, 
    14  but only trigger a single job, 
    15  so data from both repos gets processed together.
    16  
    17  ## Prerequisites
    18  
    19  Before you begin, you need to have Pachyderm v1.9.8 or later installed on your computer or cloud platform. 
    20  See [Deploy Pachyderm](https://docs.pachyderm.com/1.13.x/deploy-manage/deploy/).
    21  
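One way to check the version requirement before you start is to compare version strings in the shell. The helper below is only an illustrative sketch and is not part of this example: `version_at_least` is a hypothetical name, and it assumes plain dotted numeric versions such as `1.13.4`, without pre-release suffixes.

```shell
# version_at_least HAVE WANT
# Hypothetical helper: succeeds (exit status 0) when HAVE >= WANT.
# Assumes both arguments are dotted numeric versions, e.g. 1.13.4.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, ".")
        m = split(want, w, ".")
        max = (n > m) ? n : m
        # Compare field by field; missing fields count as 0.
        for (i = 1; i <= max; i++) {
            a = h[i] + 0
            b = w[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

# Compare two literal version strings.
if version_at_least 1.13.4 1.9.8; then
    echo "version is recent enough"
fi
```

With a configured client, you could pass the output of `pachctl version --client-only` as the first argument, assuming that command prints only the client version string on your release.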
    22  ## Pipelines
    23  
    24  The following diagram demonstrates the DAG that is used in this example.
    25  
    26  ![Example DAG](example_dag.png)
    27  
    28  The DAG shown is a simple elaboration on the OpenCV example,
    29  with pipeline and repo names chosen to avoid collisions with that example 
    30  if installed in the same cluster.
    31  
    32  The functionality is slightly different.
    33  The `edges_dp` pipeline performs edge detection on images committed to `images_dp_1`.
    34  The `montage_dp` pipeline creates a montage out of images committed to `images_dp_2` and images in the `master` branch of `edges_dp`. 
    35  This configuration enables you to verify the files being processed
    36  by visually inspecting the montage.
    37  
    38  The most significant change from the OpenCV example is the pipeline spec for `edges_dp`,
    39  which has the `output_branch` attribute set to `dev`.
    40  It also adds a `name` field to the `pfs` input, so that the repo is still mounted at `/pfs/images`
    41  and the code from the OpenCV example does not have to change.
    42  
    43  ```json hl_lines="9,12"
    44  {
    45      "pipeline": {
    46          "name": "edges_dp"
    47      },
    48      "input": {
    49          "pfs": {
    50              "glob": "/*",
    51              "repo": "images_dp_1",
    52              "name": "images"
    53          }
    54      },
    55      "output_branch": "dev",
    56      "transform": {
    57          "cmd": [ "python3", "/edges.py" ],
    58          "image": "pachyderm/opencv"
    59      }
    60  }
    61  ```
    62  
    63  Therefore, 
    64  this pipeline puts the output in the `dev` branch
    65  instead of putting it in the `master` branch.
    66  
    67  Since `montage_dp` is subscribed to the `master` branch of `edges_dp`, jobs will not trigger when the edges pipeline outputs files to the `dev` branch. 
    68  Instead, to trigger a montage job, we create a `master` branch attached to the commit in `edges_dp` that we want processed.
    69  
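To make the branch subscription concrete, the sketch below prints what the input section of a `montage_dp`-style spec might look like. It is an assumption rather than a copy of the shipped `montage_dp.json`: the function name `montage_dp_input_sketch` is hypothetical, the `glob` values are guesses modeled on the OpenCV example, and the `transform` section is left out, so the output is not a complete pipeline spec. The `branch` field is written out explicitly here, although `master` is the default for a `pfs` input.

```shell
# Hypothetical sketch: prints the name and input portion of a
# montage_dp-like spec. The transform section is intentionally omitted.
montage_dp_input_sketch() {
    cat <<'EOF'
{
    "pipeline": {
        "name": "montage_dp"
    },
    "input": {
        "cross": [
            {
                "pfs": {
                    "glob": "/",
                    "repo": "images_dp_2"
                }
            },
            {
                "pfs": {
                    "glob": "/",
                    "repo": "edges_dp",
                    "branch": "master"
                }
            }
        ]
    }
}
EOF
}

montage_dp_input_sketch
```

Because the second input tracks `edges_dp@master`, commits that land on `edges_dp@dev` do not start a montage job until `master` is pointed at them.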
    70  ## Example run-through
    71  
    72  This section provides steps that you can run to test this example.
    73  
    74  ### Deferred Processing
    75  
    76  You should have a Pachyderm cluster set up 
    77  and access to it configured from your local computer
    78  before you run this example.
    79  
    80  1. Run the script `setup.sh` included in this repo.
    81     The script executes the following commands:
    82     
    83     ```shell
    84     pachctl create repo images_dp_1
    85     pachctl create repo images_dp_2
    86     pachctl create pipeline -f ./edges_dp.json 
    87     pachctl create pipeline -f ./montage_dp.json 
    88     pachctl put file images_dp_1@master -i ./images.txt
    89     pachctl put file images_dp_1@master -i ./images2.txt
    90     pachctl put file images_dp_2@master -i ./images3.txt
    91     ```
    92     
    93  2. Once the demo is loaded, 
    94     check the commits in `edges_dp`.
    95     You should see an output similar to this.
    96     Note that there are two commits in `edges_dp`,
    97     both in the `dev` branch, which is the output branch for the pipeline:
    98     
    99     ```shell
   100     $ pachctl list commit edges_dp
   101     REPO     BRANCH COMMIT                           FINISHED       
   102     edges_dp dev    364f49663dd848098b60c1ac97a332af 36 seconds ago 
   103     edges_dp dev    a07c857b91a14add9f8309a81d86dbe8 44 seconds ago 
   104     ```
   105     
   106     Remember that the `edges_dp` pipeline outputs to the `dev` branch.
   107     Because the `montage_dp` pipeline subscribes to the `master` branch,
   108     it is not triggered when `edges_dp` jobs complete:
   109     their output goes into the `dev` branch.
   110  
   111  3. List the branches in `edges_dp`.
   112     
   113     ```shell
   114     $ pachctl list branch edges_dp
   115     BRANCH HEAD                             
   116     master -                                
   117     dev    364f49663dd848098b60c1ac97a332af 
   118     ```
   119     
   120     Note that the `dev` branch has the most recent commit.
   121     Take note of the commit id and match it to the id above.
   122     The `master` branch does not have any commits.
   123  
   124  4. View the list of jobs:
   125     
   126     ```shell
   127     $ pachctl list job
   128     ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE   
   129     2288709b4d8044409c2232d673ec8f23 montage_dp 55 seconds ago     1 second  0       0 + 0 / 0 0B       0B       success 
   130     6d9d4cf0f6524b0ca126fa97141303ea edges_dp   About a minute ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success 
   131     fcaf537975554935b0f15d184d7a0984 edges_dp   About a minute ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success 
   132  
   133     ```
   134     
   135     You should see that there are three jobs:
   136     - one 0-datum job for `montage_dp` and 
   137     - two jobs for `edges_dp` with the appropriate number of datums in each.
   138     
   139     This is what you should expect.
   140     There is no data in the master branch of `edges_dp`, 
   141     so an empty job was created in `montage_dp`
   142     when data was committed to `images_dp_2`
   143     because of its `cross` input.
   144     
   145  5. Commit a file to `images_dp_1`. 
   146  
   147     ```shell
   148     $ pachctl put file images_dp_1@master:1VqcWw9.jpg -f http://imgur.com/1VqcWw9.jpg
   149     ```
   150  
   151  6. View the list of jobs, again. 
   152  
   153     ```shell
   154     $ pachctl list job
   155     ID                               PIPELINE   STARTED            DURATION  RESTART PROGRESS  DL       UL       STATE   
   156     c7e69e46e9954611ad8efc8aeac47f2a edges_dp   12 seconds ago     3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success 
   157     2288709b4d8044409c2232d673ec8f23 montage_dp About a minute ago 1 second  0       0 + 0 / 0 0B       0B       success 
   158     6d9d4cf0f6524b0ca126fa97141303ea edges_dp   About a minute ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success 
   159     fcaf537975554935b0f15d184d7a0984 edges_dp   About a minute ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success 
   160     ```
   161  
   162     You see that one job was triggered in `edges_dp`
   163     with the one datum we committed, above, processed
   164     and the three existing datums skipped.
   165     You may also confirm that no job was triggered for `montage_dp`. 
   166     
   167  7. To trigger `montage_dp` on the set of data in our `dev` branch,
   168     you create a `master` branch with `dev` as its head.
   169     
   170     ```shell
   171     $ pachctl create branch edges_dp@master --head dev
   172     ```
   173     
   174  8. Listing jobs will show that a job got triggered on `montage_dp`:
   175  
   176      ```shell
   177      $ pachctl list job
   178      ID                               PIPELINE   STARTED        DURATION  RESTART PROGRESS  DL       UL       STATE   
   179      e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 10 seconds ago 4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success 
   180      c7e69e46e9954611ad8efc8aeac47f2a edges_dp   42 seconds ago 3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success 
   181      2288709b4d8044409c2232d673ec8f23 montage_dp 2 minutes ago  1 second  0       0 + 0 / 0 0B       0B       success 
   182      6d9d4cf0f6524b0ca126fa97141303ea edges_dp   2 minutes ago  4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success 
   183      fcaf537975554935b0f15d184d7a0984 edges_dp   2 minutes ago  3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success 
   184      ```
   185  
   186  9. Commit more data to `images_dp_1`.
   187     It will only trigger a job in `edges_dp`:
   188  
   189      ```shell
   190      $ pachctl put file images_dp_1@master:2GI70mb.jpg -f http://imgur.com/2GI70mb.jpg
   191      $ pachctl list job
   192      ID                               PIPELINE   STARTED        DURATION  RESTART PROGRESS  DL       UL       STATE   
   193      65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   8 seconds ago  3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success 
   194      e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 13 minutes ago 4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success 
   195      c7e69e46e9954611ad8efc8aeac47f2a edges_dp   13 minutes ago 3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success 
   196      2288709b4d8044409c2232d673ec8f23 montage_dp 15 minutes ago 1 second  0       0 + 0 / 0 0B       0B       success 
   197      6d9d4cf0f6524b0ca126fa97141303ea edges_dp   15 minutes ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success 
   198      fcaf537975554935b0f15d184d7a0984 edges_dp   15 minutes ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success
   199      ```
   200  
   201  10. Move the `master` branch in `edges_dp` to point to `dev` again.
   202      It will trigger a job against the data currently in `dev`.
   203  
   204      ```shell
   205      $ pachctl create branch edges_dp@master --head dev
   206      $ pachctl list job
   207      ID                               PIPELINE   STARTED        DURATION  RESTART PROGRESS  DL       UL       STATE   
   208      65eddcb60ae1475aa6d59b2baa69c78e montage_dp 8 seconds ago  5 seconds 0       1 + 0 / 1 938.5KiB 1.066MiB success 
   209      65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   3 minutes ago  3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success 
   210      e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 16 minutes ago 4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success 
   211      c7e69e46e9954611ad8efc8aeac47f2a edges_dp   16 minutes ago 3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success 
   212      2288709b4d8044409c2232d673ec8f23 montage_dp 18 minutes ago 1 second  0       0 + 0 / 0 0B       0B       success 
   213      6d9d4cf0f6524b0ca126fa97141303ea edges_dp   18 minutes ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success 
   214      fcaf537975554935b0f15d184d7a0984 edges_dp   18 minutes ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success 
   215      ```
   216  
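The steps above repeatedly compare job lists by eye to confirm whether `montage_dp` ran. A small helper can do the counting instead. This is a minimal sketch, assuming the column layout shown in the listings above (pipeline name in the second column); `count_jobs` is a hypothetical name, not a `pachctl` feature.

```shell
# count_jobs PIPELINE
# Reads `pachctl list job` output on stdin and prints how many rows
# belong to PIPELINE, matched against the second column.
count_jobs() {
    awk -v pipeline="$1" 'NR > 1 && $2 == pipeline { n++ } END { print n + 0 }'
}

# Job list captured in step 4, abridged to the first three columns.
sample_jobs='ID                               PIPELINE   STARTED
2288709b4d8044409c2232d673ec8f23 montage_dp 55 seconds ago
6d9d4cf0f6524b0ca126fa97141303ea edges_dp   About a minute ago
fcaf537975554935b0f15d184d7a0984 edges_dp   About a minute ago'

printf '%s\n' "$sample_jobs" | count_jobs montage_dp   # prints 1
printf '%s\n' "$sample_jobs" | count_jobs edges_dp     # prints 2
```

Against a live cluster, `pachctl list job | count_jobs montage_dp` should print the same number before and after step 9, and one more after step 10.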
   217  
   218  ### Transactions
   219  
   220  After you test deferred processing, 
   221  you can explore how transactions work in combination with deferred processing.
   222  
   223  1. If you want to run a particular set of data in `images_dp_2` 
   224      against a particular branch of `edges_dp`,
   225      you need to perform two operations
   226      - commit data to `images_dp_2` and
   227      - point `edges_dp@master` to the specific commit of interest.
   228      
   229      If you do not use a transaction, this results in two jobs being triggered: one for the new commit and a second when you move the `edges_dp@master` branch.
   230      - `images_dp_2@master` running against whatever is currently in `edges_dp@master`
   231      - `images_dp_2@master` running against whatever you set `edges_dp@master` to
   232      
   233      Remember that in step 10 above, 
   234      we performed the `create branch` operation against `edges_dp`.
   235      Now we perform the commit to `images_dp_2`
   236      and see that another job got triggered.
   237      
   238      ```shell
   239      $ pachctl put file images_dp_2@master:3Kr6Mr6.jpg  -f http://imgur.com/3Kr6Mr6.jpg
   240      $ pachctl list job
   242      ID                               PIPELINE   STARTED        DURATION  RESTART PROGRESS  DL       UL       STATE   
   243      9c97578031544cab9cc5fb64e9d77153 montage_dp 9 seconds ago  5 seconds 0       1 + 0 / 1 1015KiB  1.292MiB success 
   244      65eddcb60ae1475aa6d59b2baa69c78e montage_dp 28 seconds ago 5 seconds 0       1 + 0 / 1 938.5KiB 1.066MiB success 
   245      65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   3 minutes ago  3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success 
   246      e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 16 minutes ago 4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success 
   247      c7e69e46e9954611ad8efc8aeac47f2a edges_dp   16 minutes ago 3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success 
   248      2288709b4d8044409c2232d673ec8f23 montage_dp 18 minutes ago 1 second  0       0 + 0 / 0 0B       0B       success 
   249      6d9d4cf0f6524b0ca126fa97141303ea edges_dp   18 minutes ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success 
   250      fcaf537975554935b0f15d184d7a0984 edges_dp   18 minutes ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success 
   251      ```
   252      
   253  2. If you want to just have one job 
   254      where `images_dp_2@master` runs against whatever you set `edges_dp@master` to,
   255      you can use Pachyderm transactions.
   256      The first step is to start a transaction.
   257      
   258      ```shell
   259      $ pachctl start transaction
   260      Started new transaction: 11fbbcbd-6cda-42fa-b1fe-cd63b292582e
   261      ```
   262      
   263  3. Once the transaction is started,
   264      you start all commits and branch creations 
   265      within the scope of the transaction. 
   266      
   267      ```shell
   268      $ pachctl start commit  images_dp_2@master
   269      Added to transaction: 11fbbcbd-6cda-42fa-b1fe-cd63b292582e
   270      de55d4856e814c41a65836321fe672fa
   271      $ pachctl create branch edges_dp@master --head dev
   272      Added to transaction: 11fbbcbd-6cda-42fa-b1fe-cd63b292582e
   273      ```
   274  
   275  
   276  4.  Before you put any files in a repo, 
   277      you need to finish the transaction.
   278      When you run `pachctl finish transaction`, Pachyderm applies all the commits and branch changes in it together,
   279      so `montage_dp` triggers only once, when the last commit in the transaction is finished.
   280      
   281      ```shell
   282      $ pachctl finish transaction
   283      Completed transaction with 2 requests: 11fbbcbd-6cda-42fa-b1fe-cd63b292582e
   284      ```
   285      
   286  5.  Put a file into `images_dp_2`. 
   287      The job list shows no new jobs, because the commit started in the transaction is still open.
   288      
   289      ```shell
   290      $ pachctl put file images_dp_2@master:9iIlokw.jpg -f http://imgur.com/9iIlokw.jpg
   291      $ pachctl list job
   292      ID                               PIPELINE   STARTED        DURATION  RESTART PROGRESS  DL       UL       STATE   
   293      9c97578031544cab9cc5fb64e9d77153 montage_dp 18 minutes ago 5 seconds 0       1 + 0 / 1 1015KiB  1.292MiB success 
   294      65eddcb60ae1475aa6d59b2baa69c78e montage_dp 19 minutes ago 5 seconds 0       1 + 0 / 1 938.5KiB 1.066MiB success 
   295      65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   22 minutes ago 3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success 
   296      e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 35 minutes ago 4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success 
   297      c7e69e46e9954611ad8efc8aeac47f2a edges_dp   36 minutes ago 3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success 
   298      2288709b4d8044409c2232d673ec8f23 montage_dp 37 minutes ago 1 second  0       0 + 0 / 0 0B       0B       success 
   299      6d9d4cf0f6524b0ca126fa97141303ea edges_dp   38 minutes ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success 
   300      fcaf537975554935b0f15d184d7a0984 edges_dp   38 minutes ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success 
   301      ```
   302  
   303  6.  Finish the commit that you started during the transaction.
   304      That will start the job.
   305  
   306      ```shell
   307      $ pachctl finish commit images_dp_2@master
   308      $ pachctl list job
   309      ID                               PIPELINE   STARTED        DURATION  RESTART PROGRESS  DL       UL       STATE   
   310      76f1e7c311fd4529938653787b1d283a montage_dp 14 seconds ago 6 seconds 0       1 + 0 / 1 1.175MiB 1.587MiB success 
   311      9c97578031544cab9cc5fb64e9d77153 montage_dp 19 minutes ago 5 seconds 0       1 + 0 / 1 1015KiB  1.292MiB success 
   312      65eddcb60ae1475aa6d59b2baa69c78e montage_dp 20 minutes ago 5 seconds 0       1 + 0 / 1 938.5KiB 1.066MiB success 
   313      65eacaae2e63461bbfc1ed609e8b6f5e edges_dp   23 minutes ago 3 seconds 0       1 + 4 / 5 204KiB   18.89KiB success 
   314      e5a116fd9c2e4678a0f49fcb2f8c8331 montage_dp 37 minutes ago 4 seconds 0       1 + 0 / 1 919.6KiB 1.055MiB success 
   315      c7e69e46e9954611ad8efc8aeac47f2a edges_dp   37 minutes ago 3 seconds 0       1 + 3 / 4 175.1KiB 92.18KiB success 
   316      2288709b4d8044409c2232d673ec8f23 montage_dp 38 minutes ago 1 second  0       0 + 0 / 0 0B       0B       success 
   317      6d9d4cf0f6524b0ca126fa97141303ea edges_dp   39 minutes ago 4 seconds 0       2 + 1 / 3 181.1KiB 111.4KiB success 
   318      fcaf537975554935b0f15d184d7a0984 edges_dp   39 minutes ago 3 seconds 0       1 + 0 / 1 57.27KiB 22.22KiB success 
   319      ```
   320  
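The transaction steps above can be collected into one function. This is a sketch under stated assumptions rather than a script shipped with the example: `batched_montage_update` and the `PACHCTL` override variable are hypothetical names, and the repo and branch names are the ones used in this walkthrough. It uses only the `pachctl` commands shown in steps 2 through 6.

```shell
# PACHCTL is a hypothetical override so the sequence can be dry-run,
# for example with PACHCTL=echo; it defaults to the real client.
PACHCTL="${PACHCTL:-pachctl}"

# batched_montage_update FILE_NAME FILE_URL
# Opens a commit on images_dp_2@master and moves edges_dp@master to dev
# inside one transaction, then adds the file and finishes the commit,
# so that montage_dp runs once against both changes.
batched_montage_update() {
    file_name="$1"
    file_url="$2"
    "$PACHCTL" start transaction &&
    "$PACHCTL" start commit images_dp_2@master &&
    "$PACHCTL" create branch edges_dp@master --head dev &&
    "$PACHCTL" finish transaction &&
    "$PACHCTL" put file "images_dp_2@master:${file_name}" -f "$file_url" &&
    "$PACHCTL" finish commit images_dp_2@master
}
```

Setting `PACHCTL=echo` before calling `batched_montage_update 9iIlokw.jpg http://imgur.com/9iIlokw.jpg` prints the six commands instead of running them, which is a cheap way to review the order. The `&&` chain stops at the first failing command, so a failed step does not go on to finish the commit.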
   321  ## Summary
   322  
   323  Deferred processing with transactions in Pachyderm 
   324  gives you fine-grained control over jobs and datums
   325  while preserving Pachyderm's advantages of data lineage and incremental processing.