# Automated Deferred Processing

[Deferred processing](https://docs.pachyderm.com/1.13.x/how-tos/deferred_processing/)
is a Pachyderm technique for controlling when data gets processed.
Deferred processing uses branches to prevent pipelines from triggering on every input commit.
This example shows how to automate the movement of those branches
by using a cron pipeline.
The Makefile in this example,
along with the explanations provided in this document,
should give you a good start on implementing this in your Pachyderm cluster,
with or without access controls activated.

In this example, we will cover:

1. Creating a cron pipeline that periodically moves the `master` branch of a repo to the head commit of another branch
1. Adding an authentication token to allow it to work when access controls are activated


## Prerequisites

Before you start working on this example,
you should understand deferred processing by reading the documentation
and trying the [deferred processing example](../deferred_processing_plus_transactions).
That example is used extensively here.

To create and update branch labels,
the `branch-mover` pipeline uses `pachctl` to send commands to Pachyderm's `pachd`.
Because the `branch-mover` pipeline runs inside Pachyderm itself,
it needs a configuration to talk to `pachd`
and, if access controls are activated, credentials to authenticate itself.

For a Pachyderm cluster with activated access controls,
this example demonstrates how to create a Pachyderm authentication token,
load the token into a Kubernetes secret provisioned through `pachctl`,
and use `transform.secrets` in the pipeline spec,
which both mounts the secret as a Kubernetes volume
and creates an environment variable for use by the pipeline.
If you are unfamiliar with those things,
you might want to refer to the following documentation as you work through the example.

* [Pachyderm access controls and authentication documentation](https://docs.pachyderm.com/1.13.x/enterprise/auth/)
* [Kubernetes documentation on Secrets](https://kubernetes.io/docs/concepts/configuration/secret/)
* The [pachctl create secret](https://docs.pachyderm.com/1.13.x/reference/pachctl/pachctl_create_secret/) command
* [transform.secrets in the pipeline specification](https://docs.pachyderm.com/1.13.x/reference/pipeline_spec/)

Before you can start working on this example, make sure you have the following prerequisites:

* Pachyderm v1.9.8 or later installed on your computer or cloud platform.
  See [Deploy Pachyderm](https://docs.pachyderm.com/1.13.x/deploy-manage/deploy/).
* Basic familiarity with Makefiles and Unix shell scripting
* The [jq utility](https://stedolan.github.io/jq/manual/) for transforming JSON files in shell scripts

## Pipelines

This example uses the same DAG as the deferred processing example,
with the addition of a cron pipeline
that periodically moves the `master` branch to the head of `dev`.

For details on the deferred processing example DAG,
see [the Deferred Processing example](../deferred_processing_plus_transactions).

### Branch mover without access controls

If you do not have access controls enabled in your Pachyderm cluster,
use the instructions in this section.
Otherwise, proceed to [Branch mover with access controls](#branch-mover-with-access-controls).

The cron pipeline is called `branch-mover`.
By default,
it is configured to run every minute,
per its tick input:

```
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 1m",
      "overwrite": true
    }
  },
```

Using the official Pachyderm `pachctl` image,
the transform first updates the default `pachctl` config
so `pachctl` can talk directly to `pachd` in the cluster.
It uses the `kubedns` name for `pachd`
and the internal Service port of `650`.

```
"echo '{\"pachd_address\": \"grpc://pachd:650\"}' | pachctl config set context default --overwrite",
```

Similar to the deferred processing example,
the next command moves the `master` branch on `edges_dp` to point to `dev`:

```
"pachctl create branch edges_dp@master --head dev"
```

This is all the cron pipeline needs to do
without access controls.
The `transform` section of the pipeline spec `branch-mover-no-auth.json`
will look like this:

```
  "transform": {
    "cmd": ["sh" ],
    "stdin": [
      "echo '{\"pachd_address\": \"grpc://pachd:650\"}' | pachctl config set context default --overwrite",
      "pachctl create branch edges_dp@master --head dev"
    ],
    "image": "pachyderm/pachctl:1.11.0"
  }
```

### Branch mover with access controls

Use the instructions in this section
if you have activated access controls in your Pachyderm cluster.
Otherwise, go back to [Branch mover without access controls](#branch-mover-without-access-controls).

Adding support for access controls to the `branch-mover` pipeline requires a few steps.

1. Creating a [Kubernetes Secret](https://kubernetes.io/docs/concepts/configuration/secret/)
   containing an authentication token.
2. Loading that secret into Kubernetes using `pachctl create secret`.
3. Adding a `transform.secrets` entry to the pipeline spec
   to create an environment variable from a key value in the secret.
4. Adding a line to the pipeline transform to authenticate using the token prior to moving the branch.

Let's go through each of these steps in detail.

#### Creating the authentication token and the secret

Once Pachyderm access controls are activated,
log in as the user that the `branch-mover` pipeline
will authenticate as when running this example.

You may want to test this with the `robot:admin` user
configured when access controls were activated,
or with your own credentials.
Please see [Using this example in production](#using-this-example-in-production) below
for information regarding production-level security configuration.

Create a Pachyderm authentication token by running the following command:

```
pachctl auth get-auth-token --ttl <some-golang-formatted-duration>
```

A golang-formatted duration uses `h` for hours, `m` for minutes, `s` for seconds.
26 weeks would be `24 * 7 * 26` hours,
expressed as `4368h`.
The commands in this example use `624h`,
which is 26 days (`24 * 26` hours).
The token will only be generated for this duration
if it is *shorter* than the lifetime of the session
for the user who is logged into the cluster
where the command is run.
Otherwise, it is generated for the duration of that user's current session.
The expiration of a user's current session can be determined
by running `pachctl auth whoami`.

The duration of the token
determines how long the cron pipeline may run
before the secret needs to be refreshed
and the pipeline restarted.

Here is a Unix command
for generating a token using `pachctl`
and only outputting the value of the token:

```
pachctl auth get-auth-token --ttl "624h" | \
    grep Token | awk '{print $2}'
```

The next command extends this to encode the token with the `base64` encoding scheme,
so it can be used in a Kubernetes secret,
and to trim off unnecessary characters.

```
pachctl auth get-auth-token --ttl "624h" | \
    grep Token | awk '{print $2}' | \
    base64 -e | tr -d '\r\n'
```

Next, that data must be placed into a secret.
The template for an appropriate secret
is in the file `pachyderm-user-secret.clear`.
The `jq` utility places the encoded token
in the `data.auth_token` field
of the secret.
A subshell runs the token command,
and the output is directed into a JSON file,
which we'll give the `secret` extension.

```
jq ".data.auth_token=\"$(pachctl auth get-auth-token --ttl "624h" | \
    grep Token | awk '{print $2}' | \
    base64 -e | tr -d '\r\n')\"" \
    < pachyderm-user-secret.clear \
    > pachyderm-user-secret.secret
```

#### Loading the secret into Kubernetes

Next, load the secret into Kubernetes by running the following command:

```
pachctl create secret -f pachyderm-user-secret.secret
```

!!! note
    You can run the two previous steps by running
    `make pachyderm-user-secret.secret`.

#### Mounting the secret in the pipeline

To add the secret to our pipeline,
we use the `transform.secrets` field
to expose the `auth_token` key as an environment variable.
This is `transform.secrets` in the file `branch-mover.json`:

```
    "secrets": [ {
        "name": "pachyderm-user-secret",
        "env_var": "PACHYDERM_AUTH_TOKEN",
        "key": "auth_token"
    } ]
```

#### Authenticating to Pachyderm

The `branch-mover.json` file includes one line
that uses the `PACHYDERM_AUTH_TOKEN` environment variable
to authenticate to Pachyderm.

```
"echo ${PACHYDERM_AUTH_TOKEN} | pachctl auth use-auth-token"
```

That line is inserted prior to creating the branch,
making the pipeline transform in `branch-mover.json`
look like this:

```
  "transform": {
    "cmd": ["sh" ],
    "stdin": [
      "echo '{\"pachd_address\": \"grpc://pachd:650\"}' | pachctl config set context default --overwrite",
      "echo ${PACHYDERM_AUTH_TOKEN} | pachctl auth use-auth-token",
      "pachctl create branch edges_dp@master --head dev"
    ],
    "image": "pachyderm/pachctl:1.11.0"
  }
```

#### Creating the pipeline

Finally, create the pipeline using that spec:

```
pachctl create pipeline -f branch-mover.json
```

!!! note
    You can run this step with the command `make create-branch-mover`.

## Example run-through

This example can be used with access controls activated or not.
The only difference is the command that you use to create the pipeline
in the second step, below.

1. If the DAG
   used by the deferred processing example
   hasn't yet been created,
   create that starting DAG
   by running this command
   from inside this directory.

   ```
   make create-deferred-processing-cluster
   ```

1. If your Pachyderm cluster does not have access controls activated,
   create the branch-mover cron pipeline
   using the `create-branch-mover-no-auth` Makefile target.

   ```
   make create-branch-mover-no-auth
   ```

   If you have access controls activated,
   create the branch-mover cron pipeline
   using the `create-branch-mover` Makefile target.

   ```
   make create-branch-mover
   ```

1. Watch the job list in another terminal window
   by using this command:

   ```
   watch -cn 2 pachctl list job --no-pager
   ```

   !!! note
       On macOS, you may need to install `watch`,
       which may be installed via [Homebrew](https://brew.sh/)
       using the command `brew install watch`.

1. Every minute, you should see a job triggered on `branch-mover`.
   The very first job will be immediately followed
   by a job for `montage_dp`,
   as existing files are moved to the `edges_dp@master` branch.
   Subsequent ticks will trigger no jobs in `montage_dp`.

1. Commit data to the `images_dp_1` repo.

   ```
   pachctl put file images_dp_1@master:1VqcWw9.jpg -f http://imgur.com/1VqcWw9.jpg
   ```

   A job will be triggered on `edges_dp`,
   but no jobs will be triggered on `montage_dp`
   until after `branch-mover` runs,
   moving the `edges_dp@master` branch to point to `edges_dp@dev`.

## Using this example in production

When you implement this example on production pipelines with access controls activated,
you will have to periodically renew the token
by either running the appropriate make target
to update the pipeline with a new secret
or manually updating the secret
and deleting and recreating the pipeline.

It is a best security practice in production
to create a Pachyderm user
with the [least privilege](https://en.wikipedia.org/wiki/Principle_of_least_privilege) required to do this pipeline's tasks.

Renewing the token is a periodic maintenance task
with security implications,
and its automation should be reviewed
by appropriate engineering personnel.
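
Because the token's `--ttl` bounds how long the pipeline can run before the token must be renewed, it can help to compute the value rather than multiply hours by hand. The following is a minimal sketch, not part of this example's Makefile. `days_to_ttl` and `weeks_to_ttl` are hypothetical helper names introduced here for illustration, and the sketch assumes whole-number inputs and a POSIX shell.

```shell
#!/bin/sh
# Sketch: build a golang-formatted duration in hours for use with --ttl.
# days_to_ttl and weeks_to_ttl are hypothetical helpers, not pachctl commands.

# Convert a whole number of days to hours, e.g. 26 days is 26 * 24 hours.
days_to_ttl() {
    echo "$(( $1 * 24 ))h"
}

# Convert a whole number of weeks to hours by way of days (7 days per week).
weeks_to_ttl() {
    days_to_ttl "$(( $1 * 7 ))"
}
```

For instance, `pachctl auth get-auth-token --ttl "$(weeks_to_ttl 26)"` would request a 26-week token, which is still capped by the lifetime of the requesting user's session as described above.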