# Deferred Processing of Data

While a Pachyderm pipeline is running, it processes any new data that you
commit to its input branch. However, in some cases, you
want to commit data more frequently than you want to process it.

Because Pachyderm pipelines do not reprocess data that has
already been processed, in most cases, this is not an issue. However, some
pipelines need to process everything from scratch. For example,
you might want to commit data every hour, but retrain a
machine learning model on that data only once a day because it needs to train
on all the data from scratch.

In these cases, deferred processing can save a large amount of compute,
because the pipeline runs once per batch rather than once per commit.
This section covers how to achieve that and control
what gets processed.

Pachyderm controls what is processed through the _filesystem_
rather than at the pipeline level. Pipelines are simple but inflexible:
they always try to process the data at the heads of
their input branches. In contrast, the filesystem is very flexible and
lets you commit data in different places and then efficiently
move and rename the data so that it gets processed when you want.

## Configure a Staging Branch in an Input Repository

When you want to load data into Pachyderm without triggering a pipeline,
you can upload it to a staging branch and then submit the accumulated
changes in one batch by re-pointing the `HEAD` of your `master` branch
to a commit in the staging branch.

Although the branch in which you consolidate changes
is called `staging` in this section, you can name it as you like. You can
also have multiple staging branches, for example, `dev1`, `dev2`, and so on.

In the example below, the repository is called `data`.

To configure a staging branch, complete the following steps:

1. Create a repository. For example, `data`.

   ```shell
   $ pachctl create repo data
   ```

1. Create a `master` branch.

   ```shell
   $ pachctl create branch data@master
   ```

1. View the created branch:

   ```shell
   $ pachctl list branch data
   BRANCH HEAD
   master -
   ```

   No `HEAD` means that nothing has yet been committed into this
   branch. When you commit data to the `master` branch, the pipeline
   immediately starts a job to process it.
   However, if you want to commit something without immediately
   processing it, you need to commit it to a different branch.

1. Commit a file to the staging branch:

   ```shell
   $ pachctl put file data@staging -f <file>
   ```

   Pachyderm automatically creates the `staging` branch.
   Your repo now has two branches, `staging` and `master`. In this
   example, the `staging` name is used, but you can
   name the branch as you want.

1. Verify that the branches were created:

   ```shell
   $ pachctl list branch data
   BRANCH  HEAD
   staging f3506f0fab6e483e8338754081109e69
   master  -
   ```

   The `master` branch still does not have a `HEAD` commit, but the
   new branch, `staging`, does. No jobs have run yet, because
   no pipeline takes `staging` as an input. You can
   continue to commit to `staging` to add new data to the branch, and the
   pipeline will not process anything.

1. When you are ready to process the data, update the `master` branch
   to point it to the head of the staging branch:

   ```shell
   $ pachctl create branch data@master --head staging
   ```

1. List your branches to verify that the `master` branch has a `HEAD`
   commit:

   ```shell
   $ pachctl list branch data
   staging f3506f0fab6e483e8338754081109e69
   master  f3506f0fab6e483e8338754081109e69
   ```

   The `master` and `staging` branches now have the same `HEAD` commit.
   This means that your pipeline has data to process.

1. Verify that the pipeline has new jobs:

   ```shell
   $ pachctl list job
   ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL   UL  STATE
   061b0ef8f44f41bab5247420b4e62ca2 test     32 seconds ago Less than a second 0       6 + 0 / 6 108B 24B success
   ```

   You should see one job that Pachyderm created for all the changes you
   have submitted to the `staging` branch. Although the earlier commits to
   the `staging` branch are ancestors of the current `HEAD` in `master`,
   they were never the actual `HEAD` of `master` themselves, so they
   do not get processed individually. This behavior works for most use cases
   because commits in Pachyderm are generally additive, so processing
   the `HEAD` commit also processes data from previous commits.
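The steps above can be wrapped in a small helper. The following is a minimal sketch, not part of Pachyderm itself: the function names `promote_branch` and `stage_and_promote` are hypothetical, and the sketch assumes that `pachctl` is on your `PATH` and connected to a cluster. It uses only the `put file` and `create branch` commands shown above.

```shell
# Hypothetical helper: point <repo>@<target> at the head of <source>.
# Defaults mirror the walkthrough above: staging -> master.
# Usage: promote_branch <repo> [source-branch] [target-branch]
promote_branch() (
  repo="$1"
  source_branch="${2:-staging}"
  target_branch="${3:-master}"
  pachctl create branch "${repo}@${target_branch}" --head "${source_branch}"
)

# Hypothetical helper: load several files into staging, then release
# them to master as a single batch.
# Usage: stage_and_promote <repo> <file>...
stage_and_promote() (
  repo="$1"
  shift
  for f in "$@"; do
    # Each put file lands on staging, so no pipeline job is triggered here.
    pachctl put file "${repo}@staging" -f "$f" || exit 1
  done
  # Only this step moves master, which is what the pipeline reacts to.
  promote_branch "$repo"
)
```

For example, `stage_and_promote data a.csv b.csv` would add both files to `data@staging` and then move `data@master` once.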

![deferred processing](../../assets/images/deferred_processing.gif)

## Process Specific Commits

Sometimes you want to process specific intermediate commits
that are not at the `HEAD` of the branch.
To do this, you need to set `master` to have each of these commits as
its `HEAD` in turn.
For example, if you submitted ten commits to the `staging` branch and you
want to process the commits seven and three steps behind its head, followed
by the most recent commit, run the following commands in order:

```shell
$ pachctl create branch data@master --head staging^7
$ pachctl create branch data@master --head staging^3
$ pachctl create branch data@master --head staging
```

When you run the commands above, Pachyderm creates a job for each
of the commands, one after another: when one job completes,
Pachyderm starts the next one. To verify
that Pachyderm created jobs for these commands, run `pachctl list job`.
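If you do this often, a loop keeps the order explicit. This is a minimal sketch under the same assumptions as the commands above; the function name `process_commits` is hypothetical, and the commit references are quoted so that the shell never interprets the `^` character.

```shell
# Hypothetical helper: move <repo>@master through a list of commit
# references, in the order given. Each move creates one job.
# Usage: process_commits <repo> <ref>...
process_commits() (
  repo="$1"
  shift
  for ref in "$@"; do
    pachctl create branch "${repo}@master" --head "$ref" || exit 1
  done
)
```

Calling `process_commits data 'staging^7' 'staging^3' staging` issues the same three commands as the example above.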

### Change the HEAD of Your Branch

You can move backward to previous commits as easily as advancing to the
latest commits. For example, if you want the final output to be
the result of processing `staging^1`, you can *roll back* your `HEAD` commit
by running the following command:

```shell
$ pachctl create branch data@master --head staging^1
```

This command starts a new job to process `staging^1`. The `HEAD` commit on
your output repo will be the result of processing `staging^1` instead of
`staging`.

## Copy Files from One Branch to Another

Using a staging branch allows you to defer processing, but to use
this functionality you need to know your input commits in advance.
Sometimes you want to commit data in an ad hoc,
disorganized manner and then organize it later. Instead of pointing
your `master` branch to a commit in a staging branch, you can copy
individual files from `staging` to `master`.
When you run `copy file`, Pachyderm copies only references to the files and
does not move the actual file data.

To copy files from one branch to another, complete the following steps:

1. Start a commit:

   ```shell
   $ pachctl start commit data@master
   ```

1. Copy files:

   ```shell
   $ pachctl copy file data@staging:file1 data@master:file1
   $ pachctl copy file data@staging:file2 data@master:file2
   ...
   ```

1. Close the commit:

   ```shell
   $ pachctl finish commit data@master
   ```

You can also run `pachctl delete file` and `pachctl put file`
while the commit is open if you want to remove something from
the parent commit or add something that is not stored anywhere else.
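The three steps above can be combined so that all copies land in a single commit on `master`. The following is a minimal sketch; the function name `copy_to_master` is hypothetical, and it assumes each file keeps the same path on both branches.

```shell
# Hypothetical helper: copy the named files from <repo>@staging to
# <repo>@master inside one open commit.
# Usage: copy_to_master <repo> <path>...
copy_to_master() (
  repo="$1"
  shift
  pachctl start commit "${repo}@master" || exit 1
  for f in "$@"; do
    # Only file references are copied, as described above.
    pachctl copy file "${repo}@staging:${f}" "${repo}@master:${f}" || exit 1
  done
  # The pipeline sees one new HEAD on master once the commit is finished.
  pachctl finish commit "${repo}@master"
)
```

Note that this sketch stops at the first failed copy and leaves the commit open, so you would need to finish it or clean it up by hand in that case.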

## Deferred Processing in Output Repositories

You can perform the same deferred processing operations with data in output
repositories. To do so, rather than committing to a
`staging` branch, configure the `output_branch` field
in your pipeline specification.

To configure deferred processing in an output repository, complete the
following steps:

1. In the pipeline specification, add the `output_branch` field with
   the name of the branch in which you want to accumulate your data
   before processing:

   ```json
   "output_branch": "staging"
   ```

1. When you want to process data, point the `master` branch of the output
   repository (named `pipeline` in this example) to the head of `staging`:

   ```shell
   $ pachctl create branch pipeline@master --head staging
   ```
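To show where the `output_branch` field sits, the sketch below writes a complete but deliberately trivial pipeline specification to a file. Everything except `output_branch` is an assumption made for illustration: the function name `write_pipeline_spec`, the `ubuntu:18.04` image, the copy command, and the `/*` glob are placeholders, not values from this guide.

```shell
# Hypothetical helper: write a minimal pipeline spec whose results
# accumulate on the staging branch of the output repository.
# Usage: write_pipeline_spec <pipeline-name> <input-repo> <output-file>
write_pipeline_spec() (
  name="$1"
  input_repo="$2"
  out_file="$3"
  # The image, cmd, and glob below are placeholders for illustration.
  cat > "$out_file" <<EOF
{
  "pipeline": { "name": "${name}" },
  "transform": {
    "image": "ubuntu:18.04",
    "cmd": ["bash", "-c", "cp -r /pfs/${input_repo}/* /pfs/out/"]
  },
  "input": {
    "pfs": { "repo": "${input_repo}", "glob": "/*" }
  },
  "output_branch": "staging"
}
EOF
)
```

After running `write_pipeline_spec pipeline data pipeline.json`, you could create the pipeline with `pachctl create pipeline -f pipeline.json` and later release its results with the `create branch` command from the last step above.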

## Automate Branch Switching

Typically, repointing from one branch to another
happens when a certain condition is met. For example, you might
want to repoint your branch when you have a specific number of commits,
when the amount of unprocessed data reaches a certain size, or
at a specific time interval, such as daily.
To configure this functionality, you need to create a Kubernetes
application that uses Pachyderm APIs and watches the repositories for the
specified condition. When the condition is met, the application re-points
the `master` branch to the head of `staging`.
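As a rough illustration of the watch-and-repoint logic, the sketch below uses the `pachctl` CLI from a shell script. It is a stand-in for, not an implementation of, the Kubernetes application described above. The function names `branch_head` and `promote_if_new_data` are hypothetical, and the parsing assumes that `pachctl list branch` prints the branch name in the first column and the `HEAD` commit in the second, as in the output shown earlier in this section.

```shell
# Hypothetical helper: print the HEAD commit of <repo>@<branch>, or "-"
# if the branch has no HEAD or is not listed.
branch_head() (
  repo="$1"
  branch="$2"
  pachctl list branch "$repo" |
    awk -v b="$branch" '$1 == b { print $2; found = 1 }
                        END { if (!found) print "-" }'
)

# Hypothetical helper: move master to the head of staging only when
# staging holds a commit that master does not already point to.
promote_if_new_data() (
  repo="$1"
  staging_head="$(branch_head "$repo" staging)"
  master_head="$(branch_head "$repo" master)"
  if [ "$staging_head" != "-" ] && [ "$staging_head" != "$master_head" ]; then
    pachctl create branch "${repo}@master" --head staging
  fi
)

# Example schedule (commented out): check once a day.
# while true; do promote_if_new_data data; sleep 86400; done
```

The condition here is the simplest one, "staging has moved since the last promotion". Conditions based on commit counts or data size would need additional queries that this sketch does not attempt.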