# Automated Deferred Processing

[Deferred processing](https://docs.pachyderm.com/1.13.x/how-tos/deferred_processing/)
is a Pachyderm technique for controlling when data gets processed.
Deferred processing uses branches to prevent pipelines from triggering on every input commit.
This example shows how to automate the movement of those branches
by using a cron pipeline.
The Makefile in this example,
along with the explanations provided in this document,
should give you a good start on implementing this in your Pachyderm cluster,
with or without access controls activated.

In this example, we will cover:

1. Creating a cron pipeline that periodically moves the `master` branch of a repo to the head commit of its `dev` branch
1. Adding an authentication token so the pipeline works when access controls are activated

## Prerequisites

Before you start working on this example,
you should understand deferred processing by reading the documentation
and trying the [deferred processing example](../deferred_processing_plus_transactions).
That example is used extensively here.

To create and update branches,
the `branch-mover` pipeline uses `pachctl` to send commands to Pachyderm's `pachd`.
Because the `branch-mover` pipeline runs inside Pachyderm itself,
it needs a configuration to talk to `pachd`
and, if access controls are activated, credentials to authenticate itself.

For a Pachyderm cluster with activated access controls,
this example demonstrates how to create a Pachyderm authentication token,
load the token into a Kubernetes secret provisioned through `pachctl`,
and use `transform.secrets` in the pipeline spec,
which both mounts the secret as a Kubernetes volume
and creates an environment variable for use by the pipeline.
If you are unfamiliar with those things,
you might want to refer to the following documentation as you work through the example.

* [Pachyderm access controls and authentication documentation](https://docs.pachyderm.com/1.13.x/enterprise/auth/)
* [Kubernetes documentation on Secrets](https://kubernetes.io/docs/concepts/configuration/secret/)
* The [pachctl create secret](https://docs.pachyderm.com/1.13.x/reference/pachctl/pachctl_create_secret/) command
* [`transform.secrets` in the pipeline specification](https://docs.pachyderm.com/1.13.x/reference/pipeline_spec/)

Before you can start working on this example, make sure you have the following prerequisites:

* Pachyderm v1.9.8 or later, installed on your computer or cloud platform.
  See [Deploy Pachyderm](https://docs.pachyderm.com/1.13.x/deploy-manage/deploy/).
* Basic familiarity with Makefiles and Unix shell scripting
* The [jq utility](https://stedolan.github.io/jq/manual/) for transforming JSON files in shell scripts

## Pipelines

This example uses the same DAG as in the deferred processing example,
with the addition of a cron pipeline
that periodically moves the `master` branch to the head of the `dev` branch.

For details on the deferred processing example DAG,
see [the Deferred Processing example](../deferred_processing_plus_transactions).

### Branch mover without access controls

If you do not have access controls enabled in your Pachyderm cluster,
use the instructions in this section.
Otherwise, proceed to [Branch mover with access controls](#branch-mover-with-access-controls).

The cron pipeline is called `branch-mover`.
By default,
it is configured to run every minute,
per its tick input:

```
  "input": {
    "cron": {
        "name": "tick",
        "spec": "@every 1m",
        "overwrite": true
    }
  },
```

Using the official Pachyderm `pachctl` image,
the transform first updates the default `pachctl` config
so `pachctl` can talk directly to `pachd` in the cluster.
It uses the Kubernetes DNS name for `pachd`
and the internal Service port of `650`.

```
          "echo '{\"pachd_address\": \"grpc://pachd:650\"}' | pachctl config set context default --overwrite",
```

Similar to the deferred processing example,
the next command moves the `master` branch on `edges_dp` to point to `dev`:

```
          "pachctl create branch edges_dp@master --head dev"
```

This is all the cron pipeline needs to do
without access controls.
The `transform` section of the pipeline spec `branch-mover-no-auth.json`
looks like this:

```
  "transform": {
      "cmd": ["sh"],
      "stdin": [
          "echo '{\"pachd_address\": \"grpc://pachd:650\"}' | pachctl config set context default --overwrite",
          "pachctl create branch edges_dp@master --head dev"
      ],
      "image": "pachyderm/pachctl:1.11.0"
  }
```
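
To see how the `input` and `transform` fragments fit together, here is a minimal sketch that writes a complete spec to a scratch file. The top-level `pipeline.name` block is an assumption, since the fragments above do not show it; treat the `branch-mover-no-auth.json` file shipped with the example as the reference.

```shell
# Write an assembled spec to a scratch file (a sketch, not the example's real file).
spec_file="$(mktemp)"
cat > "$spec_file" <<'EOF'
{
  "pipeline": { "name": "branch-mover" },
  "input": {
    "cron": { "name": "tick", "spec": "@every 1m", "overwrite": true }
  },
  "transform": {
    "cmd": ["sh"],
    "stdin": [
      "echo '{\"pachd_address\": \"grpc://pachd:650\"}' | pachctl config set context default --overwrite",
      "pachctl create branch edges_dp@master --head dev"
    ],
    "image": "pachyderm/pachctl:1.11.0"
  }
}
EOF

# Show the assembled spec.
cat "$spec_file"
```

A file like this could then be passed to `pachctl create pipeline -f`.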

### Branch mover with access controls

Use the instructions in this section
if you have activated access controls in your Pachyderm cluster.
Otherwise, go back to [Branch mover without access controls](#branch-mover-without-access-controls).

Adding support for access controls to the `branch-mover` pipeline requires a few steps.

1. Creating a [Kubernetes Secret](https://kubernetes.io/docs/concepts/configuration/secret/)
   containing an authentication token.
2. Loading that secret into Kubernetes using `pachctl create secret`.
3. Adding a `transform.secrets` entry to the pipeline spec
   to create an environment variable from a key in the secret.
4. Adding a line to the pipeline transform to authenticate using the token prior to moving the branch.

Let's go through each of these steps in detail.

#### Creating the authentication token and the secret

Once Pachyderm access controls are activated,
log in as the user that the `branch-mover` pipeline
will authenticate as when it runs.

To test this example, you may want to use the `robot:admin` user
configured when access controls were activated,
or your own credentials.
See [Using this example in production](#using-this-example-in-production) below
for information regarding production-level security configuration.

Create a Pachyderm authentication token by running the following command:

```
pachctl auth get-auth-token --ttl <some-golang-formatted-duration>
```

A golang-formatted duration uses `h` for hours, `m` for minutes, and `s` for seconds;
it has no unit for days or weeks.
26 days would be `24 * 26` hours,
expressed as `624h`.
26 weeks would be `24 * 7 * 26` hours,
expressed as `4368h`.
The token will only be generated for this duration
if it is *shorter* than the lifetime of the session
for the user who is logged into the cluster
where the command is run.
Otherwise, it is generated for the duration of that user's current session.
The expiration of a user's current session can be determined
by running `pachctl auth whoami`.
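
Because the duration has to be given in hours, it can help to let the shell do the arithmetic. The following is a small sketch; the variable names are illustrative only.

```shell
# Go durations have no day or week unit, so convert to hours first.
days=26
ttl_days="$((24 * days))h"

weeks=26
ttl_weeks="$((24 * 7 * weeks))h"

echo "26 days  -> ${ttl_days}"
echo "26 weeks -> ${ttl_weeks}"
```

Either value can be passed to the `--ttl` flag shown above.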

The duration of the token
determines how long the cron pipeline may run
before the secret needs to be refreshed
and the pipeline restarted.

Here is a Unix command
for generating a token using `pachctl`
and outputting only the value of the token:

```
pachctl auth get-auth-token --ttl "624h" | \
    grep Token | awk '{print $2}'
```

The command is then extended to encode the token with the `base64` encoding scheme,
so it can be used in a Kubernetes secret,
and to strip any line breaks from the encoded output.

```
pachctl auth get-auth-token --ttl "624h" | \
    grep Token | awk '{print $2}' | \
    base64 | tr -d '\r\n'
```
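
To try the filter without a cluster, the same pipeline can be run against canned text. This is only a sketch: the sample output below is hypothetical and stands in for whatever `pachctl auth get-auth-token` prints; the only thing the filter relies on is a line containing `Token` followed by the value.

```shell
# Hypothetical pachctl output; only the "Token" line matters to the filter.
sample_output='New credentials:
  Subject: robot:branch-mover
  Token: abc123'

# Keep the Token line and print its second field.
token="$(printf '%s\n' "$sample_output" | grep Token | awk '{print $2}')"

# Encode and strip line breaks. As in the full command, the newline that
# awk emits after the token is part of what gets encoded.
encoded="$(printf '%s\n' "$token" | base64 | tr -d '\r\n')"

echo "$token"
echo "$encoded"
```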

Next, that data must be placed into a secret.
The template for an appropriate secret
is in the file `pachyderm-user-secret.clear`.
The `jq` utility places the encoded token
in the `data.auth_token` field of the secret.
A subshell runs the token command,
and the output of `jq` is directed into a JSON file,
which we'll give the `secret` extension.

```
jq ".data.auth_token=\"$(pachctl auth get-auth-token --ttl "624h" | \
    grep Token | awk '{print $2}' | \
    base64 | tr -d '\r\n')\"" \
    < pachyderm-user-secret.clear \
    > pachyderm-user-secret.secret
```
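
The contents of `pachyderm-user-secret.clear` are not reproduced in this document. As a rough sketch, assuming it is an ordinary Kubernetes Secret manifest in JSON, it might resemble the file written below. The secret name `pachyderm-user-secret` and the key `auth_token` match what the pipeline spec refers to; `apiVersion`, `kind`, and `type` are assumptions, so check the file shipped with the example.

```shell
# Write a hypothetical template to a scratch file (not the example's real file).
template="$(mktemp)"
cat > "$template" <<'EOF'
{
  "apiVersion": "v1",
  "kind": "Secret",
  "metadata": { "name": "pachyderm-user-secret" },
  "type": "Opaque",
  "data": { "auth_token": "" }
}
EOF

# The empty data.auth_token field is what the jq command fills in.
cat "$template"
```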

#### Loading the secret into Kubernetes

Next, let us load the secret into Kubernetes by running the following command:

```
pachctl create secret -f pachyderm-user-secret.secret
```

!!! note
    You can run the two previous steps by running
    `make pachyderm-user-secret.secret`.

#### Mounting the secret in the pipeline

To add the secret to our pipeline,
we can use the `transform.secrets` field
to expose the `auth_token` key as an environment variable.
This is `transform.secrets` in the file `branch-mover.json`:

```
      "secrets": [ {
          "name": "pachyderm-user-secret",
          "env_var": "PACHYDERM_AUTH_TOKEN",
          "key": "auth_token"
      } ]
```

#### Authenticating to Pachyderm

The `branch-mover.json` file includes one line
that uses the `PACHYDERM_AUTH_TOKEN` environment variable
to authenticate to Pachyderm.

```
          "echo ${PACHYDERM_AUTH_TOKEN} | pachctl auth use-auth-token"
```

That line is inserted prior to creating the branch,
making the pipeline transform in `branch-mover.json`
look like this:

```
  "transform": {
      "cmd": ["sh"],
      "stdin": [
          "echo '{\"pachd_address\": \"grpc://pachd:650\"}' | pachctl config set context default --overwrite",
          "echo ${PACHYDERM_AUTH_TOKEN} | pachctl auth use-auth-token",
          "pachctl create branch edges_dp@master --head dev"
      ],
      "image": "pachyderm/pachctl:1.11.0"
  }
```
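
If the secret is missing or the key name is misspelled, the environment variable may be empty when the transform runs. The example's spec does not check for this; the following is a hypothetical guard, with `require_auth_token` an illustrative name, that could be adapted into an extra `stdin` line to fail early with a clear message.

```shell
# Hypothetical guard: succeed only when the token variable is non-empty.
require_auth_token() {
    if [ -z "${PACHYDERM_AUTH_TOKEN:-}" ]; then
        echo "PACHYDERM_AUTH_TOKEN is not set; check transform.secrets" >&2
        return 1
    fi
}

# Placeholder value for demonstration; in the pipeline the secret provides it.
PACHYDERM_AUTH_TOKEN="example-token"
require_auth_token && echo "token present"
```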

#### Creating the pipeline

Finally, create the pipeline using that spec:

```
pachctl create pipeline -f branch-mover.json
```

!!! note
    You can run this step with the command `make create-branch-mover`.

## Example run-through

This example can be used with access controls activated or not.
The only difference is the command that you use to create the pipeline
in the second step, below.

1. If the DAG
   used by the deferred processing example
   hasn't yet been created,
   create that starting DAG
   by running this command
   from inside this directory.

   ```
   make create-deferred-processing-cluster
   ```

1. If your Pachyderm cluster does not have access controls activated,
   create the branch-mover cron pipeline
   using the `create-branch-mover-no-auth` Makefile target.

   ```
   make create-branch-mover-no-auth
   ```

   If you have access controls activated,
   create the branch-mover cron pipeline
   using the `create-branch-mover` Makefile target.

   ```
   make create-branch-mover
   ```

1. Watch the job list in another terminal window
   by using this command:

   ```
   watch -cn 2 pachctl list job --no-pager
   ```

   !!! note
       On macOS, you may need to install `watch`,
       which may be installed via [Homebrew](https://brew.sh/)
       using the command `brew install watch`.

1. Every minute, you should see a job triggered on `branch-mover`.
   The very first job will be immediately followed
   by a job for `montage_dp`,
   as the existing files become available on the `edges_dp@master` branch.
   Subsequent ticks will trigger no jobs in `montage_dp`.

1. Commit data to the `images_dp_1` repo.

   ```
   pachctl put file images_dp_1@master:1VqcWw9.jpg -f http://imgur.com/1VqcWw9.jpg
   ```

   A job will be triggered on `edges_dp`,
   but no jobs will be triggered on `montage_dp`
   until after `branch-mover` runs
   and moves `edges_dp@master` to the head of `edges_dp@dev`.

## Using this example in production

When you implement this example on production pipelines with access controls activated,
you will have to periodically renew the token
by either running the appropriate make target
to update the pipeline with a new secret
or manually updating the secret
and deleting and recreating the pipeline.

It is a best security practice in production
to create a Pachyderm user
with the [least privilege](https://en.wikipedia.org/wiki/Principle_of_least_privilege) required to do this pipeline's tasks.

This is a periodic maintenance task
with security implications,
and its automation should be reviewed
by appropriate engineering personnel.