github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/concepts/advanced-concepts/deferred_processing.md (about)

     1  # Deferred Processing of Data
     2  
     3  While a Pachyderm pipeline is running, it processes any new data that you
     4  commit to its input branch. However, in some cases, you
     5  want to commit data more frequently than you want to process it.
     6  
     7  Because Pachyderm pipelines do not reprocess the data that has
     8  already been processed, in most cases, this is not an issue. But, some
     9  pipelines might need to process everything from scratch. For example,
    10  you might want to commit data every hour, but only want to retrain a
    11  machine learning model on that data daily because it needs to train
    12  on all the data from scratch.
    13  
    14  In these cases, deferred processing can save a substantial amount of
    15  compute time. This section explains how to defer processing and control
    16  what gets processed.
    17  
    18  Pachyderm controls what gets processed through the _filesystem_
    19  rather than at the pipeline level. Pipelines are inflexible but simple:
    20  they always try to process the data at the heads of
    21  their input branches. In contrast, the filesystem is very flexible and
    22  gives you the ability to commit data in different places and then efficiently
    23  move and rename the data so that it gets processed when you want.
    24  
    25  ## Configure a Staging Branch in an Input Repository
    26  
    27  When you want to load data into Pachyderm without triggering a pipeline,
    28  you can upload it to a staging branch and then submit accumulated
    29  changes in one batch by re-pointing the `HEAD` of your `master` branch
    30  to a commit in the staging branch.
    31  
    32  Although, in this section, the branch in which you consolidate changes
    33  is called `staging`, you can name it as you like. Also, you can have multiple
    34  staging branches. For example, `dev1`, `dev2`, and so on.
    35  
    36  In the example below, the repository is called `data`.
    37  
    38  To configure a staging branch, complete the following steps:
    39  
    40  1. Create a repository. For example, `data`.
    41  
    42     ```shell
    43     $ pachctl create repo data
    44     ```
    45  
    46  1. Create a `master` branch.
    47  
    48     ```shell
    49     $ pachctl create branch data@master
    50     ```
    51  
    52  1. View the created branch:
    53  
    54     ```shell
    55     $ pachctl list branch data
    56  
    57     BRANCH HEAD
    58     master -
    59     ```
    60  
    61     No `HEAD` means that nothing has yet been committed into this
    62     branch. When you commit data to the `master` branch, the pipeline
    63     immediately starts a job to process it.
    64     However, if you want to commit something without immediately
    65     processing it, you need to commit it to a different branch.
    66  
    67  1. Commit a file to the staging branch:
    68  
    69     ```shell
    70     $ pachctl put file data@staging -f <file>
    71     ```
    72  
    73     Pachyderm automatically creates the `staging` branch.
    74     Your repo now has 2 branches, `staging` and `master`. In this
    75     example, the `staging` name is used, but you can
    76     name the branch as you want.
    77  
    78  1. Verify that the branches were created:
    79  
    80     ```shell
    81     $ pachctl list branch data
    82  
    83     BRANCH  HEAD
    84     
    85     staging f3506f0fab6e483e8338754081109e69
    86     master  -
    87     ```
    88  
    89     The `master` branch still does not have a `HEAD` commit, but the
    90     new branch, `staging`, does. No jobs have started yet, because
    91     no pipeline takes `staging` as an input. You can
    92     continue to commit to `staging` to add new data to the branch, and the
    93     pipeline will not process anything.
    94  
    95  1. When you are ready to process the data, update the `master` branch
    96     to point it to the head of the staging branch:
    97  
    98     ```shell
    99     $ pachctl create branch data@master --head staging
   100     ```
   101  
   102  1. List your branches to verify that the master branch has a `HEAD`
   103     commit:
   104  
   105     ```shell
   106     $ pachctl list branch data
   107  
   108     staging f3506f0fab6e483e8338754081109e69
   109     master  f3506f0fab6e483e8338754081109e69
   110     ```
   111  
   112     The `master` and `staging` branches now have the same `HEAD` commit.
   113     This means that your pipeline has data to process.
   114  
   115  1. Verify that the pipeline has new jobs:
   116  
   117     ```shell
   118     $ pachctl list job
   119  
   120     ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL   UL  STATE
   121     061b0ef8f44f41bab5247420b4e62ca2 test     32 seconds ago Less than a second 0       6 + 0 / 6 108B 24B success
   122     ```
   123  
   124     You should see one job that Pachyderm created for all the changes you
   125     have submitted to the `staging` branch. While the commits to the
   126     `staging` branch are ancestors of the current `HEAD` in `master`,
   127     they were never the actual `HEAD` of `master` themselves, so they
   128     do not get processed. This behavior works for most use cases
   129     because commits in Pachyderm are generally additive, so processing
   130     the `HEAD` commit also processes data from previous commits.
   131  
   132  ![deferred processing](../../assets/images/deferred_processing.gif)
   133  
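The repointing step above can be wrapped in a small shell function. This is a minimal sketch: `promote_branch` is a hypothetical helper name, and it assumes that `pachctl` is on your `PATH` and already connected to your cluster. It only issues the `pachctl create branch ... --head ...` command shown in the steps above.

```shell
# promote_branch (hypothetical helper): repoint a target branch at the
# current head of a staging branch.
#   $1  repository name, for example: data
#   $2  branch that accumulates commits (default: staging)
#   $3  branch that the pipeline reads (default: master)
promote_branch() {
  repo="${1:-}"
  src="${2:-staging}"
  dst="${3:-master}"
  if [ -z "$repo" ]; then
    echo "usage: promote_branch <repo> [source-branch] [target-branch]" >&2
    return 2
  fi
  # Same command as in the steps above, with the names filled in.
  pachctl create branch "${repo}@${dst}" --head "${src}"
}
```

For example, `promote_branch data` runs `pachctl create branch data@master --head staging`, and `promote_branch data dev1` promotes the `dev1` branch instead.
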
   134  ## Process Specific Commits
   135  
   136  Sometimes you want to process specific intermediate commits
   137  that are not at the `HEAD` of the branch.
   138  To do this, set `master` to have each of these commits as its `HEAD` in turn.
   139  For example, if you submitted ten commits to the `staging` branch and you
   140  want to process the commits that are seven and three commits behind the
   141  head, and then the most recent commit, run the following commands:
   142  
   143  ```shell
   144  $ pachctl create branch data@master --head staging^7
   145  $ pachctl create branch data@master --head staging^3
   146  $ pachctl create branch data@master --head staging
   147  ```
   148  
   149  When you run the commands above, Pachyderm creates a job for each
   150  command and runs the jobs one after another: when one job completes,
   151  Pachyderm starts the next one. To verify
   152  that Pachyderm created jobs for these commands, run `pachctl list job`.
   153  
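If you process several intermediate commits regularly, the sequence of commands above can be generated from a list of ancestor offsets. The following is a sketch under the same assumptions as the example above: the source branch is named `staging`, the target is `master`, and `process_ancestors` is a hypothetical helper name.

```shell
# process_ancestors (hypothetical helper): repoint master at each
# requested ancestor of the staging head, in the order given.
#   $1      repository name, for example: data
#   $2...   ancestor offsets; 0 means the staging head itself
process_ancestors() {
  repo="$1"
  shift
  for offset in "$@"; do
    if [ "$offset" -eq 0 ]; then
      head="staging"
    else
      head="staging^${offset}"
    fi
    # Stop at the first failure so later commits are not promoted.
    pachctl create branch "${repo}@master" --head "${head}" || return 1
  done
}
```

Calling `process_ancestors data 7 3 0` issues the same three `create branch` commands as the example above, in the same order.
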
   154  ### Change the HEAD of Your Branch
   155  
   156  You can move backward to previous commits as easily as advancing to the
   157  latest commits. For example, if you want to change the final output to be
   158  the result of processing `staging^1`, you can *roll back* your HEAD commit
   159  by running the following command:
   160  
   161  ```shell
   162  $ pachctl create branch data@master --head staging^1
   163  ```
   164  
   165  This command starts a new job to process `staging^1`. The `HEAD` commit on
   166  your output repo will be the result of processing `staging^1` instead of
   167  `staging`.
   168  
   169  ## Copy Files from One Branch to Another
   170  
   171  Using a staging branch allows you to defer processing. To use
   172  this functionality you need to know your input commits in advance.
   173  However, sometimes you want to be able to commit data in an ad-hoc,
   174  disorganized manner and then organize it later. Instead of pointing
   175  your `master` branch to a commit in a staging branch, you can copy
   176  individual files from `staging` to `master`.
   177  When you run `copy file`, Pachyderm only copies references to the files and
   178  does not move the actual data for the files around.
   179  
   180  To copy files from one branch to another, complete the following steps:
   181  
   182  1. Start a commit:
   183  
   184     ```shell
   185     $ pachctl start commit data@master
   186     ```
   187  
   188  1. Copy files:
   189  
   190     ```shell
   191     $ pachctl copy file data@staging:file1 data@master:file1
   192     $ pachctl copy file data@staging:file2 data@master:file2
   193     ...
   194     ```
   195  
   196  1. Close the commit:
   197  
   198     ```shell
   199     $ pachctl finish commit data@master
   200     ```
   201  
   202  While the commit is open, you can run `pachctl delete file` if you want to remove something from
   203  the parent commit or `pachctl put file` if you want to upload something that is not in a repo yet.
   204  
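The three steps above can be combined into one function when you copy the same set of files repeatedly. This is a sketch rather than a definitive implementation: `copy_to_master` is a hypothetical helper name, and it assumes that each file keeps the same path on `master` as on `staging`.

```shell
# copy_to_master (hypothetical helper): copy the named files from
# staging to master inside a single commit.
#   $1      repository name, for example: data
#   $2...   file paths to copy
copy_to_master() {
  repo="$1"
  shift
  pachctl start commit "${repo}@master" || return 1
  for file in "$@"; do
    # Copies a reference to the file; the data itself is not moved.
    if ! pachctl copy file "${repo}@staging:${file}" "${repo}@master:${file}"; then
      # The commit is deliberately left open so that you can inspect it.
      echo "copy of ${file} failed; commit on ${repo}@master left open" >&2
      return 1
    fi
  done
  pachctl finish commit "${repo}@master"
}
```

For example, `copy_to_master data file1 file2` starts a commit on `data@master`, copies both files, and finishes the commit. If a copy fails, the function stops without finishing the commit, so a partial set of files is not sent to the pipeline.
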
   205  ## Deferred Processing in Output Repositories
   206  
   207  You can perform the same deferred processing operations with data in output
   208  repositories. To do so, rather than committing to a
   209  `staging` branch, configure the `output_branch` field
   210  in your pipeline specification.
   211  
   212  To configure deferred processing in an output repository, complete the
   213  following steps:
   214  
   215  1. In the pipeline specification, add the `output_branch` field with
   216     the name of the branch in which you want to accumulate your data
   217     before processing:
   218  
   219     ```
   220     "output_branch": "staging"
   221     ```
   222  
   223  1. When you want to process data, run:
   224  
   225     ```shell
   226     $ pachctl create branch pipeline@master --head staging
   227     ```
   228  
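To show where the `output_branch` field sits in a complete specification, the sketch below prints a pipeline spec from a shell function. It reuses the `edges` pipeline, the `images` repo, and the `pachyderm/opencv` image from the OpenCV example as stand-ins for your own pipeline. `edges_spec` is a hypothetical helper name.

```shell
# edges_spec (hypothetical helper): print a pipeline specification
# whose results land on the given branch instead of master.
#   $1  output branch name (default: staging)
edges_spec() {
  branch="${1:-staging}"
  cat <<EOF
{
  "pipeline": { "name": "edges" },
  "input": { "pfs": { "glob": "/*", "repo": "images" } },
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "output_branch": "${branch}"
}
EOF
}
```

You can save the output to a file, for example `edges_spec staging > edges.json`, and then create the pipeline with `pachctl create pipeline -f edges.json`. Results accumulate on `edges@staging` until you run `pachctl create branch edges@master --head staging`.
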
   229  ## Automate Deferred Processing With Branch Triggers
   230  
   231  Typically, repointing from one branch to another happens when a certain
   232  condition is met. For example, you might want to repoint your branch when you
   233  have a specific number of commits, when the amount of unprocessed data
   234  reaches a certain size, or at a specific time interval, such as daily.
   235  You can automate this with branch triggers. A trigger is a relationship
   236  between two branches, such as `master` and `staging` in the examples above,
   237  that says: when the head commit of `staging` meets a certain condition, it
   238  should trigger `master` to update its head to that same commit. In other words, it
   239  runs `pachctl create branch data@master --head staging` automatically when the
   240  trigger condition is met.
   241  
   242  Building on the example above, to make `master` automatically trigger when
   243  there's 1 Megabyte of new data on `staging`, run:
   244  
   245  ```shell
   246  $ pachctl create branch data@master --trigger staging --trigger-size 1MB
   247  $ pachctl list branch data
   248  
   249  BRANCH  HEAD                             TRIGGER
   250  staging 8b5f3eb8dc4346dcbd1a547f537982a6 -
   251  master  -                                staging on Size(1MB)
   252  ```
   253  
   254  Running that command may or may not set the head of `master`. It depends
   255  on the difference between the size of the head of `staging` and the size of
   256  the existing head of `master`, which counts as `0` if `master` has no head. In the example above,
   257  `staging` had an existing head with less than 1 MB of data in it, so `master`
   258  still has no head. If you don't see `staging` when you run `list branch`, that's OK:
   259  triggers can point to branches that don't exist yet. The head of `master`
   260  updates when you add 1 MB of new data to `staging`:
   261  
   262  ```shell
   263  $ dd if=/dev/urandom bs=1MiB count=1 | pachctl put file data@staging:/file
   264  $ pachctl list branch data
   265  
   266  BRANCH  HEAD                             TRIGGER
   267  staging 64b70e6aeda84845858c42d755023673 -
   268  master  64b70e6aeda84845858c42d755023673 staging on Size(1MB)
   269  ```
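
In a script, you might want to check whether `master` has caught up with `staging` instead of reading the table by eye. The sketch below parses the `pachctl list branch` output. It assumes the whitespace-separated `BRANCH`, `HEAD`, `TRIGGER` layout shown above, which may differ in other versions. `branch_head` and `heads_match` are hypothetical helper names.

```shell
# branch_head (hypothetical helper): print the HEAD column for one
# branch, or "-" when the branch has no head yet.
#   $1  repository name    $2  branch name
branch_head() {
  pachctl list branch "$1" | awk -v b="$2" '$1 == b { print $2 }'
}

# heads_match (hypothetical helper): succeed when master has a head
# and it is the same commit as the head of staging.
#   $1  repository name
heads_match() {
  staging_head="$(branch_head "$1" staging)"
  master_head="$(branch_head "$1" master)"
  [ -n "$master_head" ] && [ "$master_head" != "-" ] &&
    [ "$master_head" = "$staging_head" ]
}
```

For example, `heads_match data && echo "master is up to date"` prints the message only when both branches point at the same commit.
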
   270  
   271  Triggers automate deferred processing, but they don't prevent manually updating
   272  the head of a branch. If you ever want to trigger `master` even though the
   273  trigger condition hasn't been met you can run:
   274  
   275  ```shell
   276  $ pachctl create branch data@master --head staging
   277  ```
   278  
   279  Notice that you don't need to re-specify the trigger when you call `create
   280  branch` to change the head. If you do want to clear the trigger, delete the
   281  branch and recreate it.
   282  
   283  You can trigger the repointing of a branch on three conditions:
   284  
   285  - time, using a cron specification (`--trigger-cron`)
   286  - size (`--trigger-size`)
   287  - number of commits (`--trigger-commits`)
   288  
   289  When you specify more than one condition, a branch repoint is triggered when any of
   290  the conditions is met. To require that all of them are met, add
   291  `--trigger-all`.
   292  
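As an illustration of combining conditions, the sketch below builds a single `create branch` command that sets a size condition and a commit-count condition and requires both. `require_size_and_commits` is a hypothetical helper name, and the thresholds are example values. For instance, `require_size_and_commits data 1MB 10` repoints `master` only after `staging` has both 1 MB of new data and ten new commits.

```shell
# require_size_and_commits (hypothetical helper): make master follow
# staging only after BOTH thresholds are reached.
#   $1  repository name, for example: data
#   $2  size threshold, for example: 1MB
#   $3  commit-count threshold, for example: 10
require_size_and_commits() {
  repo="$1"
  size="$2"
  commits="$3"
  # Without --trigger-all, either condition alone would fire the trigger.
  pachctl create branch "${repo}@master" \
    --trigger staging \
    --trigger-size "${size}" \
    --trigger-commits "${commits}" \
    --trigger-all
}
```
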
   293  To experiment further, see the full [triggers example](https://github.com/pachyderm/examples/tree/master/deferred_processing/triggers).
   294  
   295  ## Embed Triggers in Pipelines
   296  
   297  Triggers can also be specified in the pipeline spec and automatically created
   298  when the pipeline is created. For example, this is the edges pipeline from
   299  our OpenCV demo, modified to trigger only when there is 1 megabyte of new images:
   300  
   301  ```
   302  {
   303    "pipeline": {
   304      "name": "edges"
   305    },
   306    "description": "A pipeline that performs image edge detection by using the OpenCV library.",
   307    "input": {
   308      "pfs": {
   309        "glob": "/*",
   310        "repo": "images",
   311        "trigger": {
   312            "size": "1MB"
   313        }
   314      }
   315    },
   316    "transform": {
   317      "cmd": [ "python3", "/edges.py" ],
   318      "image": "pachyderm/opencv"
   319    }
   320  }
   321  ```
   322  
   323  When you create this pipeline, Pachyderm will also create a branch in the input
   324  repo that specifies the trigger and the pipeline will use that branch as its
   325  input. The name of the branch is auto-generated with the form
   326  `<pipeline-name>-trigger-n`. You can manually update the heads of these branches
   327  to trigger processing just like in the previous example.
   328  
   329  !!! note
   330      Deleting or updating a pipeline **will not clean up** the trigger branch that it has created.
   331      In fact, the trigger branch has a lifetime that is not tied to the pipeline's lifetime.
   332      There is no guarantee that other pipelines are not using that trigger branch.
   333      A trigger branch can, however, be deleted manually.
   334  
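Because trigger branches outlive their pipelines, it can help to list them before you decide what to clean up. The sketch below filters the `pachctl list branch` output for names that start with `<pipeline-name>-trigger-`. It assumes the table layout shown earlier in this section, and `trigger_branches` is a hypothetical helper name.

```shell
# trigger_branches (hypothetical helper): print the branches in an
# input repo whose names start with "<pipeline-name>-trigger-".
#   $1  input repository name, for example: images
#   $2  pipeline name, for example: edges
trigger_branches() {
  pachctl list branch "$1" |
    awk -v prefix="$2-trigger-" 'index($1, prefix) == 1 { print $1 }'
}
```

For example, `trigger_branches images edges` prints one branch name per line. After you confirm that no other pipeline uses a branch, you can remove it with `pachctl delete branch images@<branch-name>`.
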
   335  ## More Advanced Automation
   336  
   337  More advanced use cases might not be covered by the trigger methods above. For
   338  those, you need to create a Kubernetes application that uses Pachyderm APIs and
   339  watches the repositories for the specified condition. When the condition is
   340  met, the application repoints `master` to the head of `staging`.
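
The check-and-repoint logic of such an application can be sketched in shell. This is a stand-in for illustration, not a Kubernetes application and not a use of the Pachyderm API: it calls `pachctl` directly, and both `promote_when` and the check command that you pass to it are hypothetical.

```shell
# promote_when (hypothetical helper): run a caller-supplied check and
# repoint master at the head of staging only when the check succeeds.
#   $1      repository name, for example: data
#   $2...   check command and its arguments
promote_when() {
  repo="$1"
  shift
  if "$@"; then
    pachctl create branch "${repo}@master" --head staging
  fi
}
```

For example, if `my_condition` is a script of yours that exits with status 0 when the data is ready, a loop that calls `promote_when data my_condition` and then sleeps for a minute promotes `staging` on the first pass where the check succeeds.
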