# Pipeline Specification

This document discusses each of the fields present in a pipeline specification.
To see how to use a pipeline spec to create a pipeline, refer to the [pachctl
create pipeline](pachctl/pachctl_create_pipeline.md) section.

## JSON Manifest Format

```json
{
  "pipeline": {
    "name": string
  },
  "description": string,
  "metadata": {
    "annotations": {
        "annotation": string
    },
    "labels": {
        "label": string
    }
  },
  "transform": {
    "image": string,
    "cmd": [ string ],
    "stdin": [ string ],
    "err_cmd": [ string ],
    "err_stdin": [ string ],
    "env": {
        string: string
    },
    "secrets": [ {
        "name": string,
        "mount_path": string
    },
    {
        "name": string,
        "env_var": string,
        "key": string
    } ],
    "image_pull_secrets": [ string ],
    "accept_return_code": [ int ],
    "debug": bool,
    "user": string,
    "working_dir": string
  },
  "parallelism_spec": {
    // Set at most one of the following:
    "constant": int,
    "coefficient": number
  },
  "hashtree_spec": {
    "constant": int
  },
  "resource_requests": {
    "memory": string,
    "cpu": number,
    "disk": string
  },
  "resource_limits": {
    "memory": string,
    "cpu": number,
    "gpu": {
      "type": string,
      "number": int
    },
    "disk": string
  },
  "sidecar_resource_limits": {
    "memory": string,
    "cpu": number
  },
  "datum_timeout": string,
  "datum_tries": int,
  "job_timeout": string,
  "input": {
    <"pfs", "cross", "union", "join", "cron", or "git" input -- see below>
  },
  "s3_out": bool,
  "output_branch": string,
  "egress": {
    "URL": "s3://bucket/dir"
  },
  "standby": bool,
  "cache_size": string,
  "enable_stats": bool,
  "service": {
    "internal_port": int,
    "external_port": int
  },
  "spout": {
    "overwrite": bool
    // Optionally, you can combine a spout with a service:
    "service": {
      "internal_port": int,
      "external_port": int
    }
  },
  "max_queue_size": int,
  "chunk_spec": {
    "number": int,
    "size_bytes": int
  },
  "scheduling_spec": {
    "node_selector": {string: string},
    "priority_class_name": string
  },
  "pod_spec": string,
  "pod_patch": string
}

------------------------------------
"pfs" input
------------------------------------

"pfs": {
  "name": string,
  "repo": string,
  "branch": string,
  "glob": string,
  "lazy": bool,
  "empty_files": bool,
  "s3": bool
}

------------------------------------
"cross" or "union" input
------------------------------------

"cross" or "union": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  }
  ...
]

------------------------------------
"cron" input
------------------------------------

"cron": {
  "name": string,
  "spec": string,
  "repo": string,
  "start": time,
  "overwrite": bool
}

------------------------------------
"join" input
------------------------------------

"join": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  }
]

------------------------------------
"git" input
------------------------------------

"git": {
  "URL": string,
  "name": string,
  "branch": string
}

```

In practice, you rarely need to specify all the fields.
Most fields either come with sensible defaults or can be empty.
The following text is an example of a minimum spec:

```json
{
  "pipeline": {
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["/binary", "/pfs/data", "/pfs/out"]
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
```

### Name (required)

`pipeline.name` is the name of the pipeline that you are creating. Each
pipeline needs to have a unique name. Pipeline names must meet the following
requirements:

- Include only alphanumeric characters, `_` and `-`.
- Begin or end with only alphanumeric characters (not `_` or `-`).
- Not exceed 63 characters in length.

### Description (optional)

`description` is an optional text field where you can add information
about the pipeline.

### Metadata

This parameter enables you to add metadata to your pipeline pods by using Kubernetes' `labels` and `annotations`. Labels help you to organize and keep track of your cluster objects by creating groups of pods based on the application they run, resources they use, or other parameters. Labels simplify the querying of Kubernetes objects and are handy in operations.

Similarly to labels, you can add metadata through annotations. The difference is that you can specify any arbitrary metadata through annotations.

Both parameters require a key-value pair. Do not confuse this parameter with `pod_patch`, which adds metadata to the user container of the pipeline pod. For more information, see [Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) and [Kubernetes Annotations](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/) in the Kubernetes documentation.

### Transform (required)

`transform.image` is the name of the Docker image that your jobs use.

`transform.cmd` is the command passed to the Docker run invocation. As with
Docker, `cmd` is not run inside a shell, which means that wildcard globbing
(`*`), pipes (`|`), and file redirects (`>` and `>>`) do not work. To use
these features, you can set `cmd` to be a shell of your choice, such as `sh`,
and pass a shell script to `stdin`.

`transform.stdin` is an array of lines that are sent to your command on
`stdin`. Lines do not have to end in newline characters.
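For example, a pipeline that needs shell features can set `cmd` to `sh` and pass its script through `stdin`. The sketch below assumes a hypothetical `ubuntu:18.04` image and the `data` input repo from the minimal spec above; adjust both to match your own pipeline:

```json
"transform": {
  "image": "ubuntu:18.04",
  "cmd": ["sh"],
  "stdin": [
    "wc -l /pfs/data/* > /pfs/out/line_counts.txt"
  ]
}
```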
`transform.err_cmd` is an optional command that is executed on failed datums.
If the `err_cmd` is successful and returns a 0 exit code, it does not prevent
the job from succeeding. This behavior means that `transform.err_cmd` can be
used to ignore failed datums while still writing successful datums to the
output repo, instead of failing the whole job when some datums fail. The
`transform.err_cmd` command has the same limitations as `transform.cmd`.

`transform.err_stdin` is an array of lines that are sent to your error command
on `stdin`. Lines do not have to end in newline characters.

`transform.env` is a key-value map of environment variables that
Pachyderm injects into the container. There are also environment variables
that are automatically injected into the container, such as:

* `PACH_JOB_ID` – the ID of the current job.
* `PACH_OUTPUT_COMMIT_ID` – the ID of the commit in the output repo for
the current job.
* `<input>_COMMIT` - the ID of the input commit. For example, if your
input is the `images` repo, this will be `images_COMMIT`.

For a complete list of variables and
descriptions see: [Configure Environment Variables](../../deploy-manage/deploy/environment-variables/).

`transform.secrets` is an array of secrets. You can use the secrets to
embed sensitive data, such as credentials. The secrets reference
Kubernetes secrets by name and specify either a path where the secret is
mounted or an environment variable (`env_var`) that the value should be
bound to. Secrets must set `name`, which should be the name of a secret in
Kubernetes. Secrets must also specify either `mount_path` or `env_var` and
`key`. See more information about Kubernetes secrets
[here](https://kubernetes.io/docs/concepts/configuration/secret/).

`transform.image_pull_secrets` is an array of image pull secrets. Image pull
secrets are similar to secrets except that they are mounted before the
containers are created, so they can be used to provide credentials for image
pulling. For example, if you are using a private Docker registry for your
images, you can specify it by running the following command:

```shell
kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
```

And then, notify your pipeline about it by using
`"image_pull_secrets": [ "myregistrykey" ]`. Read more about image pull secrets
[here](https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod).

`transform.accept_return_code` is an array of return codes, such as exit codes
from your Docker command, that are considered acceptable. If your Docker
command exits with one of the codes in this array, the run is still counted
as successful when the job status is set. `0` is always considered a
successful exit code.

`transform.debug` turns on added debug logging for the pipeline.

`transform.user` sets the user that your code runs as. This can also be
accomplished with a `USER` directive in your `Dockerfile`.

`transform.working_dir` sets the directory that your command runs from. You
can also specify the `WORKDIR` directive in your `Dockerfile`.

`transform.dockerfile` is the path to the `Dockerfile` used with the `--build`
flag. This defaults to `./Dockerfile`.
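To illustrate the two ways of referencing secrets, the following sketch mounts one hypothetical Kubernetes secret (`my-ssl-cert`) as files and binds a key of another hypothetical secret (`my-aws-creds`) to an environment variable; the names, key, and paths are placeholders:

```json
"transform": {
  "image": "my-pipeline-image",
  "cmd": ["/app/run.sh"],
  "secrets": [
    {
      "name": "my-ssl-cert",
      "mount_path": "/secrets/ssl"
    },
    {
      "name": "my-aws-creds",
      "env_var": "AWS_SECRET_ACCESS_KEY",
      "key": "secret_access_key"
    }
  ]
}
```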
### Parallelism Spec (optional)

`parallelism_spec` describes how Pachyderm parallelizes your pipeline.
Currently, Pachyderm has two parallelism strategies: `constant` and
`coefficient`.

If you set the `constant` field, Pachyderm starts the number of workers
that you specify. For example, set `"constant": 10` to use 10 workers.

If you set the `coefficient` field, Pachyderm starts a number of workers
that is a multiple of your Kubernetes cluster's size. For example, if your
Kubernetes cluster has 10 nodes, and you set `"coefficient": 0.5`, Pachyderm
starts five workers. If you set it to 2.0, Pachyderm starts 20 workers
(two per Kubernetes node).

The default value is `"constant": 1`.

Because spouts and services are designed to be single instances, do not
modify the default `parallelism_spec` value for these pipelines.

### Resource Requests (optional)

`resource_requests` describes the amount of resources that the pipeline
workers will consume. Knowing this in advance
enables Pachyderm to schedule big jobs on separate machines, so that they
do not conflict, slow down, or terminate.

This parameter is optional, and if you do not explicitly add it in
the pipeline spec, Pachyderm creates Kubernetes containers with the
following default resources:

- The user container requests 0 CPU, 0 disk space, and 64MB of memory.
- The init container requests the same amount of CPU, memory, and disk
  space that is set for the user container.
- The storage container requests 0 CPU and the amount of memory set by the
  [cache_size](#cache-size-optional) parameter.

The `resource_requests` parameter enables you to overwrite these default
values.

The `memory` field is a string that describes the amount of memory, in bytes,
that each worker needs. Allowed SI suffixes include M, K, G, Mi, Ki, Gi, and
others.

For example, a worker that needs to read a 1GB file into memory might set
`"memory": "1.2G"` with a little extra for the code to use in addition to the
file. Workers for this pipeline will be placed on machines with at least
1.2GB of free memory, and other large workers will be prevented from using it,
if they also set their `resource_requests`.

The `cpu` field is a number that describes the amount of CPU time in `cpu
seconds/real seconds` that each worker needs. Setting `"cpu": 0.5` indicates that
the worker should get 500ms of CPU time per second. Setting `"cpu": 2`
indicates that the worker gets 2000ms of CPU time per second. In other words,
it is using 2 CPUs, though worker threads might spend 500ms on four
physical CPUs instead of one second on two physical CPUs.

The `disk` field is a string that describes the amount of ephemeral disk space,
in bytes, that each worker needs. Allowed SI suffixes include M, K, G, Mi,
Ki, Gi, and others.

In both cases, the resource requests are not upper bounds. If the worker uses
more memory than it requested, it is not necessarily shut down. However, if
the whole node runs out of memory, Kubernetes starts deleting pods that have
been placed on it and exceeded their memory request, to reclaim memory. To
prevent deletion of your worker pods, you must set your `memory` request to
a sufficiently large value. However, if the total memory requested by all
workers in the system is too large, Kubernetes cannot schedule new
workers because no machine has enough unclaimed memory. `cpu` works
similarly, but for CPU time.
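As a sketch, a `resource_requests` block for a worker that loads roughly 1GB of data into memory might look like the following; the exact values, including the disk size, are illustrative:

```json
"resource_requests": {
  "memory": "1.2G",
  "cpu": 0.5,
  "disk": "2G"
}
```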
For more information about resource requests and limits see the
[Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/)
on the subject.

### Resource Limits (optional)

`resource_limits` describes the upper threshold of allowed resources a given
worker can consume. If a worker exceeds this value, it will be evicted.

The `gpu` field is a number that describes how many GPUs each worker needs.
Only whole numbers are supported; Kubernetes does not allow multiplexing of
GPUs. Unlike the other resource fields, GPUs only have meaning in limits: by
requesting a GPU, the worker gets sole access to that GPU while it is
running. It is recommended to enable `standby` if you are using GPUs, so that
other processes in the cluster have access to the GPUs while the pipeline has
nothing to process. For more information about scheduling GPUs see the
[Kubernetes docs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
on the subject.

### Sidecar Resource Limits (optional)

`sidecar_resource_limits` determines the upper threshold of resources
allocated to the sidecar containers.

This field can be useful in deployments where Kubernetes automatically
applies resource limits to containers, which might conflict with Pachyderm
pipelines' resource requests. Such a deployment might fail if Pachyderm
requests more than the default Kubernetes limit. The `sidecar_resource_limits`
parameter enables you to explicitly specify these resources to fix the issue.

### Datum Timeout (optional)

`datum_timeout` determines the maximum execution time allowed for each
datum. The value must be a string that represents a time value, such as
`1s`, `5m`, or `15h`. This parameter takes precedence over the parallelism
or number of datums; therefore, no single datum is allowed to exceed
this value. By default, `datum_timeout` is not set, and a datum continues to
be processed for as long as needed.

### Datum Tries (optional)

`datum_tries` is an integer, such as `1`, `2`, or `3`, that determines the
number of times a job attempts to run on a datum when a failure occurs.
Setting `datum_tries` to `1` attempts a job once with no retries.
Only failed datums are retried in a retry attempt. If the operation succeeds
in retry attempts, then the job is marked as successful. Otherwise, the job
is marked as failed.

### Job Timeout (optional)

`job_timeout` determines the maximum execution time allowed for a job. It
differs from `datum_timeout` in that the limit is applied across all
workers and all datums. This is the *wall time*, which means that if
you set `job_timeout` to one hour and the job does not finish the work
in one hour, it is interrupted. When you set this value, you need to
consider the parallelism, total number of datums, and execution time per
datum. The value must be a string that represents a time value, such as
`1s`, `5m`, or `15h`. In addition, the number of datums might change over
jobs. Some new commits might have more files, and therefore, more datums.
Similarly, other commits might have fewer files and datums. If this
parameter is not set, the job runs indefinitely until it succeeds or fails.
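For example, a hedged combination of these settings might give each datum up to five minutes and at most two attempts, while capping the whole job at one hour; the values are illustrative and should be tuned to your workload:

```json
"datum_timeout": "5m",
"datum_tries": 2,
"job_timeout": "1h"
```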
### S3 Output Repository

`s3_out` allows your pipeline code to write results out to an S3 gateway
endpoint instead of the typical `pfs/out` directory. When this parameter
is set to `true`, Pachyderm includes a sidecar S3 gateway instance
container in the same pod as the pipeline container. The address of the
output repository will be `s3://<output_repo>`. If you enable `s3_out`,
verify that the `enable_stats` parameter is disabled.

If you want to expose an input repository through an S3 gateway, see
`input.pfs.s3` in [PFS Input](#pfs-input).

!!! note "See Also:"
    [Environment Variables](../../deploy-manage/deploy/environment-variables/)

### Input

`input` specifies repos that will be visible to the jobs during runtime.
Commits to these repos automatically trigger the pipeline to create new
jobs to process them. Input is a recursive type: there are multiple kinds
of inputs, and they can be combined together. The `input` object is a
container for the different input types with a field for each; only one of
these fields may be set for any instantiation of the object. While most types
of pipeline specifications require an `input` repository, there are
exceptions, such as a spout, which does not need an `input`.

```json
{
    "pfs": pfs_input,
    "union": union_input,
    "cross": cross_input,
    "join": join_input,
    "cron": cron_input,
    "git": git_input
}
```

#### PFS Input

PFS inputs are the simplest inputs; they take input from a single branch on a
single repo.

```
{
    "name": string,
    "repo": string,
    "branch": string,
    "glob": string,
    "lazy": bool,
    "empty_files": bool,
    "s3": bool
}
```

`input.pfs.name` is the name of the input. An input with the name `XXX` is
visible under the path `/pfs/XXX` when a job runs. Input names must be unique
if the inputs are crossed, but they may be duplicated between `PFSInput`s that
are combined by using the `union` operator. This is because when
`PFSInput`s are combined, you only ever see a datum from one input
at a time. Overlapping the names of combined inputs allows
you to write simpler code since you no longer need to consider which
input directory a particular datum comes from. If an input's name is not
specified, it defaults to the name of the repo. Therefore, if you have two
crossed inputs from the same repo, you must give at least one of them a
unique name.

`input.pfs.repo` is the name of the Pachyderm repository with the data that
you want to join with other data.

`input.pfs.branch` is the `branch` to watch for commits. If left blank,
Pachyderm sets this value to `master`.

`input.pfs.glob` is a glob pattern that is used to determine how the
input data is partitioned.

`input.pfs.lazy` controls how the data is exposed to jobs. The default is
`false`, which means the job eagerly downloads the data it needs to process and
exposes it as normal files on disk. If lazy is set to `true`, data is
exposed as named pipes instead, and no data is downloaded until the job
opens the pipe and reads it. If the pipe is never opened, then no data is
downloaded.
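For reference, a minimal sketch of a `pfs` input that lazily reads the `master` branch of a hypothetical `images` repo, with one top-level file or directory per datum, might look like this:

```json
"input": {
  "pfs": {
    "repo": "images",
    "branch": "master",
    "glob": "/*",
    "lazy": true
  }
}
```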
Some applications do not work with pipes. For example, pipes do not support
applications that make syscalls such as `Seek`. Applications that can work
with pipes should use them because they are more performant. The difference
will be especially notable if the job only reads a subset of the files that
are available to it.

!!! note
    `lazy` does not support datums that contain more than 10000 files.

`input.pfs.empty_files` controls how files are exposed to jobs. If
set to `true`, it causes files from this PFS input to be presented as empty
files. This is useful in shuffle pipelines where you want to read the names
of files and reorganize them by using symlinks.

`input.pfs.s3` sets whether the pipeline worker pod
should include a sidecar S3 gateway instance. This option enables an S3 gateway
to serve on a pipeline-level basis and, therefore, ensure provenance tracking
for pipelines that integrate with external systems, such as Kubeflow. When
this option is set to `true`, Pachyderm deploys an S3 gateway instance
alongside the pipeline container and creates an S3 bucket for the pipeline
input repo. The address of the
input repository will be `s3://<input_repo>`. When you enable this
parameter, you cannot use glob patterns. All files will be processed
as one datum.

Another limitation for S3-enabled pipelines is that you can only use
either a single input or a cross input. Join and union inputs are not
supported.

If you want to expose an output repository through an S3
gateway, see [S3 Output Repository](#s3-output-repository).

#### Union Input

Union inputs take the union of other inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/`, each being the only file system object in
its repository.

```
| inputA | inputB | inputA ∪ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | foo             |
| bar    | buzz   | fizz            |
|        |        | bar             |
|        |        | buzz            |
```

The union inputs do not take a name and maintain the names of the
sub-inputs. In the example above, you would see files under
`/pfs/inputA/...` or `/pfs/inputB/...`, but never both at the same time.
When you write code to address this behavior, make sure that
your code first determines which input directory is present. Starting
with Pachyderm 1.5.3, we recommend that you give your inputs the
same `Name`. That way your code only needs to handle data being present
in that directory. This only works if your code does not need to be
aware of which of the underlying inputs the data comes from.

`input.union` is an array of inputs to combine. The inputs do not have to be
`pfs` inputs. They can also be `union` and `cross` inputs, although there is
no reason to take a union of unions because union is associative.
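A sketch of a union over the two hypothetical repos from the table above, `inputA` and `inputB`, following the recommendation to give both sub-inputs the same name so that the user code always reads from a single directory (here `/pfs/input`):

```json
"input": {
  "union": [
    {
      "pfs": {
        "name": "input",
        "repo": "inputA",
        "glob": "/*"
      }
    },
    {
      "pfs": {
        "name": "input",
        "repo": "inputB",
        "glob": "/*"
      }
    }
  ]
}
```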
#### Cross Input

Cross inputs create the cross product of other inputs. In other words,
a cross input creates tuples of the datums in the inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/`, each being the only file system object in
its repository.

```
| inputA | inputB | inputA ⨯ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | (foo, fizz)     |
| bar    | buzz   | (foo, buzz)     |
|        |        | (bar, fizz)     |
|        |        | (bar, buzz)     |
```

The cross inputs above do not take a name and maintain
the names of the sub-inputs.
In the example above, you would see files under `/pfs/inputA/...`
and `/pfs/inputB/...`.

`input.cross` is an array of inputs to cross.
The inputs do not have to be `pfs` inputs. They can also be
`union` and `cross` inputs, although there is
no reason to take a cross of crosses because cross is associative.

#### Cron Input

Cron inputs allow you to trigger pipelines based on time. A cron input is
based on the Unix utility called `cron`. When you create a pipeline with
one or more cron inputs, `pachd` creates a repo for each of them. The start
time for a cron input is specified in its spec.
When a cron input triggers, `pachd` commits a single file, named by the
current [RFC 3339 timestamp](https://www.ietf.org/rfc/rfc3339.txt), to the
repo; the file contains the time that satisfied the spec.

```
{
    "name": string,
    "spec": string,
    "repo": string,
    "start": time,
    "overwrite": bool
}
```

`input.cron.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`, except that it is not optional.

`input.cron.spec` is a cron expression which specifies the schedule on
which to trigger the pipeline. To learn more about how to write schedules,
see the [Wikipedia page on cron](https://en.wikipedia.org/wiki/Cron).
Pachyderm supports non-standard schedules, such as `"@daily"`.

`input.cron.repo` is the repo which Pachyderm creates for the input. This
parameter is optional. If you do not specify this parameter, then
`"<pipeline-name>_<input-name>"` is used by default.

`input.cron.start` is the time to start counting from for the input. This
parameter is optional. If you do not specify this parameter, then the
time when the pipeline was created is used by default. Specifying a
time enables you to run on matching times from the past or skip times
from the present and only start running
on matching times in the future. Format the time value according to [RFC
3339](https://www.ietf.org/rfc/rfc3339.txt).

`input.cron.overwrite` is a flag that specifies whether you want the timestamp
file to be overwritten on each tick. This parameter is optional and defaults
to `false`, which means that a new file is written on each tick and ticks
accumulate in the cron input repo. When `overwrite` is set to `true`,
Pachyderm erases the old ticks and adds a new tick with each commit; unless
you add manual ticks or run `pachctl run cron`, the input repo then contains
only one tick file, for the latest tick.
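For example, a hedged cron input that ticks every ten minutes using a standard cron expression and overwrites the previous timestamp file on each tick; the input name is arbitrary:

```json
"input": {
  "cron": {
    "name": "tick",
    "spec": "*/10 * * * *",
    "overwrite": true
  }
}
```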
#### Join Input

A join input enables you to join files that are stored in separate
Pachyderm repositories and that match a configured glob
pattern. A join input must have the `glob` and `join_on` parameters configured
to work properly. A join can combine multiple PFS inputs.

You can specify the following parameters for the `join` input.

* `input.pfs.name` — the name of the PFS input that appears in the
`INPUT` field when you run the `pachctl list job` command.
If an input name is not specified, it defaults to the name of the repo.

* `input.pfs.repo` — see the description in [PFS Input](#pfs-input);
the name of the Pachyderm repository with the data that
you want to join with other data.

* `input.pfs.branch` — see the description in [PFS Input](#pfs-input).

* `input.pfs.glob` — a wildcard pattern that defines how a dataset is broken
  up into datums for further processing. When you use a glob pattern in joins,
  it creates a naming convention that Pachyderm uses to join files. In other
  words, Pachyderm joins the files that are named according to the glob
  pattern and skips those that are not.

  You can wrap parts of the glob pattern in parentheses to create
  one or multiple capture groups, which the `join_on` parameter then
  references; see the example after this list. A capture group can include
  one or multiple characters. Use standard UNIX globbing characters to
  create capture groups, including the following:

  * `?` — matches a single character in a filepath. For example, if you
  have files named `file000.txt`, `file001.txt`, `file002.txt`, and so on,
  you can set the glob pattern to `/file(?)(?)(?)` and the `join_on` key to
  `$2`, so that Pachyderm matches only the files that have the same second
  captured character.

  * `*` — matches any number of characters in the filepath. For example, if
  you set your capture group to `/(*)`, Pachyderm matches all files in the
  root directory.

  If you do not specify a correct `glob` pattern, Pachyderm performs the
  `cross` input operation instead of `join`.

* `input.pfs.lazy` — see the description in [PFS Input](#pfs-input).
* `input.pfs.empty_files` — see the description in [PFS Input](#pfs-input).
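As a sketch, the following join pairs files from two hypothetical repos, `readings` and `parameters`, whenever their names share the same captured ID; the repo names and file naming scheme are placeholders:

```json
"input": {
  "join": [
    {
      "pfs": {
        "repo": "readings",
        "branch": "master",
        "glob": "/file-(*).txt",
        "join_on": "$1"
      }
    },
    {
      "pfs": {
        "repo": "parameters",
        "branch": "master",
        "glob": "/file-(*).txt",
        "join_on": "$1"
      }
    }
  ]
}
```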
#### Git Input (alpha feature)

Git inputs allow you to pull code from a public git URL and execute that code as part of your pipeline. A pipeline with a git input will get triggered (i.e. will see a new input commit and will spawn a job) whenever you commit to your git repository.

**Note:** This only works on cloud deployments, not local clusters.

`input.git.URL` must be a URL of the form: `https://github.com/foo/bar.git`

`input.git.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`. It is optional.

`input.git.branch` is the name of the git branch to use as input.

Git inputs also require some additional configuration. In order for new commits on your git repository to correspond to new commits on the Pachyderm git input repo, you need to set up a git webhook. At the moment, only GitHub is supported. (Though if you ask nicely, we can add support for GitLab or BitBucket).

1. Create your Pachyderm pipeline with the git input.

2. To get the URL of the webhook to your cluster, run `pachctl inspect pipeline` on your pipeline. You should see a `Githook URL` field with a URL set. Note that this only works if you have deployed to a cloud provider (e.g. AWS, GKE). If you see `pending` as the value (and you have deployed on a cloud provider), it is possible that the service is still being provisioned. You can check `kubectl get svc` to make sure you see the `githook` service running.

3. To set up the GitHub webhook, navigate to:

    ```
    https://github.com/<your_org>/<your_repo>/settings/hooks/new
    ```

    Or navigate to webhooks under settings. Then copy the `Githook URL` into the 'Payload URL' field.

### Output Branch (optional)

This is the branch where the pipeline outputs new commits. By default,
it is `master`.

### Egress (optional)

`egress` allows you to push the results of a pipeline to an external data
store such as S3, Google Cloud Storage, or Azure Storage. Data is pushed
after the user code has finished running but before the job is marked as
successful.

For more information, see [Exporting Data by using egress](../../how-tos/export-data-out-pachyderm/#export-your-data-with-egress).

### Standby (optional)

`standby` indicates that the pipeline should be put into "standby" when there is
no data for it to process. A pipeline in standby has no pods running and
thus consumes no resources; its state is displayed as "standby".

Standby replaces `scale_down_threshold` from releases prior to 1.7.1.

### Cache Size (optional)

`cache_size` controls how much cache a pipeline's sidecar containers use. In
general, your pipeline's performance will increase with the cache size, but
only up to a certain point depending on your workload.

Every worker in every pipeline has a limited-functionality `pachd` server
running adjacent to it, which proxies PFS reads and writes (this prevents
thundering herds when jobs start and end, which is when all of a pipeline's
workers are reading from and writing to PFS simultaneously). Part of what these
"sidecar" pachd servers do is cache PFS reads. If a pipeline has a cross input,
and a worker is downloading the same datum from one branch of the input
repeatedly, then the cache can speed up processing significantly.

### Enable Stats (optional)

The `enable_stats` parameter turns on statistics tracking for the pipeline.
When you enable statistics tracking, the pipeline automatically creates
and commits datum processing information to a special branch in its output
repo called `"stats"`. This branch stores information about each datum that
the pipeline processes, including timing information, size information, logs,
and `/pfs` snapshots. You can view these statistics by running the `pachctl
inspect datum` and `pachctl list datum` commands, as well as through the web UI.
Do not enable statistics tracking for S3-enabled pipelines.

Once enabled, statistics tracking cannot simply be switched off for an
existing pipeline. To turn it off, you must delete the pipeline, set
`enable_stats` to `false` or remove it from your pipeline spec entirely, and
recreate the pipeline from the updated spec file. While the pipeline that
collects the stats exists, the storage space used by the stats cannot be
released.

!!! note
    Enabling stats results in a slight storage use increase for logs and
    timing information.
    However, stats do not use as much extra storage as it might appear,
    because snapshots of the `/pfs` directory, which are the largest stored
    assets, do not require extra space.

### Service (alpha feature, optional)

`service` specifies that the pipeline should be treated as a long running
service rather than a data transformation.
This means that `transform.cmd` is
not expected to exit; if it does, it is restarted. Furthermore, the service
is exposed outside the container using a Kubernetes service.
`"internal_port"` should be a port that the user code binds to inside the
container, and `"external_port"` is the port on which it is exposed through the
`NodePorts` functionality of Kubernetes services. After a service has been
created, you should be able to access it at
`http://<kubernetes-host>:<external_port>`.

### Spout (optional)

`spout` is a type of pipeline that processes streaming data.
Unlike a union or cross pipeline, a spout pipeline does not have
a PFS input. Instead, it opens a Linux *named pipe* into the source of the
streaming data. Your pipeline
can be either a spout or a service, not both. Therefore, if you added
`service` as a top-level object in your pipeline, you cannot add `spout`.
However, you can expose a service from inside of a spout pipeline by
specifying it as a field in the `spout` spec. Then, Kubernetes creates
a service endpoint that you can expose externally. You can get the information
about the service by running `kubectl get services`.

For more information, see [Spouts](../concepts/pipeline-concepts/pipeline/spout.md).

### Max Queue Size (optional)

`max_queue_size` specifies the maximum number of datums that a worker should
hold in its processing queue at a given time (after processing its entire
queue, a worker "checkpoints" its progress by writing to persistent storage).
The default value is `1`, which means workers only hold onto the datum that
they are currently processing.

Increasing this value can improve pipeline performance because it allows
workers to simultaneously download, process, and upload different datums
(and reduces the total time spent on checkpointing). Decreasing this value
can make jobs more robust to failed workers because work gets checkpointed
more often, and a failing worker does not lose as much progress. Setting this
value too high can also cause problems if you have `lazy` inputs, because
there is a cap of 10,000 `lazy` files per worker, and all of the datums that
are running count against this limit.

### Chunk Spec (optional)

`chunk_spec` specifies how a pipeline should chunk its datums.
A chunk is the unit of work that workers claim. Each worker claims a chunk
of one or more datums and commits the full chunk once it has finished
processing it.

`chunk_spec.number`, if nonzero, specifies that each chunk should contain
`number` datums. Chunks may contain fewer datums if the total number of
datums does not divide evenly. If you lower the chunk number to `1`, progress
is checkpointed after every datum, at the cost of extra load on etcd, which
can slow down other operations. The default value is `2`.

`chunk_spec.size_bytes`, if nonzero, specifies a target size for each chunk
of datums. Chunks may be larger or smaller than `size_bytes`, but will usually
be pretty close to `size_bytes` in size.

### Scheduling Spec (optional)

`scheduling_spec` specifies how the pods for a pipeline should be scheduled.

`scheduling_spec.node_selector` allows you to select which nodes your pipeline
will run on. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
on node selectors for more information about how this works.
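For instance, a sketch of a `scheduling_spec` that pins pipeline pods to nodes carrying a hypothetical `disktype=ssd` label:

```json
"scheduling_spec": {
  "node_selector": {
    "disktype": "ssd"
  }
}
```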
`scheduling_spec.priority_class_name` allows you to select the priority class
for the pipeline, which affects how Kubernetes chooses to schedule and
deschedule the pipeline. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass)
on priority and preemption for more information about how this works.

### Pod Spec (optional)

`pod_spec` is an advanced option that allows you to set fields in the pod spec
that haven't been explicitly exposed in the rest of the pipeline spec. A good
way to figure out what JSON you should pass is to create a pod in Kubernetes
with the proper settings, then run:

```
kubectl get po/<pod-name> -o json | jq .spec
```

This gives you a correctly formatted piece of JSON; you should then remove
the extraneous fields that Kubernetes injects or that can be set elsewhere.

The JSON is applied after the other parameters for the `pod_spec` have already
been set as a [JSON Merge Patch](https://tools.ietf.org/html/rfc7386). This
means that you can modify things such as the storage and user containers.

### Pod Patch (optional)

`pod_patch` is similar to `pod_spec` above but is applied as a [JSON
Patch](https://tools.ietf.org/html/rfc6902). Note that this means that the
process outlined above of modifying an existing pod spec and then manually
blanking unchanged fields won't work; you'll need to create a correctly
formatted patch by diffing the two pod specs.

## The Input Glob Pattern

Each PFS input needs to specify a [glob pattern](../../concepts/pipeline-concepts/datum/glob-pattern/).

Pachyderm uses the glob pattern to determine how many "datums" an input
consists of. Datums are the unit of parallelism in Pachyderm. That is,
Pachyderm attempts to process datums in parallel whenever possible.

Intuitively, you may think of the input repo as a file system, and you are
applying the glob pattern to the root of the file system. The files and
directories that match the glob pattern are considered datums.

For instance, let's say your input repo has the following structure:

```
/foo-1
/foo-2
/bar
  /bar-1
  /bar-2
```

Now let's consider what the following glob patterns would match respectively:

* `/`: this pattern matches `/`, the root directory itself, meaning all the data would be a single large datum.
* `/*`: this pattern matches everything under the root directory, giving us three datums:
`/foo-1`, `/foo-2`, and everything under the directory `/bar`.
* `/bar/*`: this pattern matches files only under the `/bar` directory: `/bar-1` and `/bar-2`.
* `/foo*`: this pattern matches files under the root directory that start with the characters `foo`.
* `/*/*`: this pattern matches everything that's two levels deep relative
to the root: `/bar/bar-1` and `/bar/bar-2`.

The datums are defined as whichever files or directories are matched by the
glob pattern. For instance, if we used `/*`, then the job will process three
datums (potentially in parallel): `/foo-1`, `/foo-2`, and `/bar`. Both the
`bar-1` and `bar-2` files within the directory `bar` would be grouped together
and always processed by the same worker.

## PPS Mounts and File Access

### Mount Paths

The root mount point is at `/pfs`, which contains:

- `/pfs/input_name` which is where you would find the datum.
    - Each input will be found here by its name, which defaults to the repo
      name if not specified.
- `/pfs/out`, which is where you write any output.
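To illustrate the layout, a hedged transform for a hypothetical pipeline with two inputs named `images` and `labels` reads from the corresponding `/pfs` directories and writes its results to `/pfs/out`:

```json
"transform": {
  "cmd": ["sh"],
  "stdin": [
    "ls /pfs/images /pfs/labels > /pfs/out/manifest.txt"
  ]
}
```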