# Pipeline Specification

This document discusses each of the fields present in a pipeline specification.
To see how to use a pipeline spec to create a pipeline, refer to the [pachctl
create pipeline](pachctl/pachctl_create_pipeline.md) section.

## JSON Manifest Format

```json
{
  "pipeline": {
    "name": string
  },
  "description": string,
  "transform": {
    "image": string,
    "cmd": [ string ],
    "stdin": [ string ],
    "err_cmd": [ string ],
    "err_stdin": [ string ],
    "env": {
        string: string
    },
    "secrets": [ {
        "name": string,
        "mount_path": string
    },
    {
        "name": string,
        "env_var": string,
        "key": string
    } ],
    "image_pull_secrets": [ string ],
    "accept_return_code": [ int ],
    "debug": bool,
    "user": string,
    "working_dir": string,
  },
  "parallelism_spec": {
    // Set at most one of the following:
    "constant": int,
    "coefficient": number
  },
  "hashtree_spec": {
    "constant": int,
  },
  "resource_requests": {
    "memory": string,
    "cpu": number,
    "disk": string,
  },
  "resource_limits": {
    "memory": string,
    "cpu": number,
    "gpu": {
      "type": string,
      "number": int
    },
    "disk": string,
  },
  "datum_timeout": string,
  "datum_tries": int,
  "job_timeout": string,
  "input": {
    <"pfs", "cross", "union", "join", "cron", or "git" see below>
  },
  "output_branch": string,
  "egress": {
    "URL": "s3://bucket/dir"
  },
  "standby": bool,
  "cache_size": string,
  "enable_stats": bool,
  "service": {
    "internal_port": int,
    "external_port": int
  },
  "spout": {
    "overwrite": bool
    // Optionally, you can combine a spout with a service:
    "service": {
      "internal_port": int,
      "external_port": int,
      "annotations": {
        "foo": "bar"
      }
    }
  },
  "max_queue_size": int,
  "chunk_spec": {
    "number": int,
    "size_bytes": int
  },
  "scheduling_spec": {
    "node_selector": {string: string},
    "priority_class_name": string
  },
  "pod_spec": string,
  "pod_patch": string,
}

------------------------------------
"pfs" input
------------------------------------

"pfs": {
  "name": string,
  "repo": string,
  "branch": string,
  "glob": string,
  "lazy": bool,
  "empty_files": bool
}

------------------------------------
"cross" or "union" input
------------------------------------

"cross" or "union": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool
    }
  }
  ...
]

------------------------------------
"cron" input
------------------------------------

"cron": {
  "name": string,
  "spec": string,
  "repo": string,
  "start": time,
  "overwrite": bool
}

------------------------------------
"join" input
------------------------------------

"join": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool
    }
  }
]

------------------------------------
"git" input
------------------------------------

"git": {
  "URL": string,
  "name": string,
  "branch": string
}

```

In practice, you rarely need to specify all the fields.
Most fields either come with sensible defaults or can be empty.
The following text is an example of a minimal spec:

```json
{
  "pipeline": {
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["/binary", "/pfs/data", "/pfs/out"]
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
```

### Name (required)

`pipeline.name` is the name of the pipeline that you are creating. Each
pipeline needs to have a unique name. Pipeline names must meet the following
requirements:

- Include only alphanumeric characters, `_` and `-`.
- Begin or end with only alphanumeric characters (not `_` or `-`).
- Not exceed 63 characters in length.

### Description (optional)

`description` is an optional text field where you can add information
about the pipeline.

### Transform (required)

`transform.image` is the name of the Docker image that your jobs use.

`transform.cmd` is the command passed to the Docker run invocation. Similarly
to Docker, `cmd` is not run inside a shell, which means that
wildcard globbing (`*`), pipes (`|`), and file redirects (`>` and `>>`) do
not work. To use these features, you can set `cmd` to be a shell of your
choice, such as `sh`, and pass a shell script to `stdin`.

`transform.stdin` is an array of lines that are sent to your command on
`stdin`.
Lines do not have to end in newline characters.

`transform.err_cmd` is an optional command that is executed on failed datums.
If `err_cmd` succeeds and returns an exit code of 0, it does not prevent
the job from succeeding.
This behavior means that `transform.err_cmd` can be used to ignore
failed datums while still writing successful datums to the output repo,
instead of failing the whole job when some datums fail. The `transform.err_cmd`
command has the same limitations as `transform.cmd`.

`transform.err_stdin` is an array of lines that are sent to your error command
on `stdin`.
Lines do not have to end in newline characters.

`transform.env` is a key-value map of environment variables that
Pachyderm injects into the container.

**Note:** There are environment variables that are automatically injected
into the container. For a comprehensive list of them, see the [Environment
Variables](#environment-variables) section below.

`transform.secrets` is an array of secrets. You can use the secrets to
embed sensitive data, such as credentials. The secrets reference
Kubernetes secrets by name and specify a path to which the secrets are mounted or
an environment variable (`env_var`) that the value should be bound to. Secrets
must set `name`, which should be the name of a secret in Kubernetes. Secrets
must also specify either `mount_path` or `env_var` and `key`. See more
information about Kubernetes secrets [here](https://kubernetes.io/docs/concepts/configuration/secret/).
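For example, a hypothetical pipeline could mount one Kubernetes secret as files and expose a key from another as an environment variable. The secret names, mount path, and variable name below are placeholders for illustration only:

```json
"transform": {
  "image": "my-image",
  "cmd": ["/binary"],
  "secrets": [
    {
      "name": "my-tls-cert",
      "mount_path": "/etc/ssl/certs"
    },
    {
      "name": "my-s3-creds",
      "env_var": "AWS_SECRET_ACCESS_KEY",
      "key": "secret-access-key"
    }
  ]
}
```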
`transform.image_pull_secrets` is an array of image pull secrets. Image pull
secrets are similar to secrets except that they are mounted before the
containers are created, so they can be used to provide credentials for image
pulling. For example, if you are using a private Docker registry for your
images, you can specify it by running the following command:

```shell
$ kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
```

And then, notify your pipeline about it by using
`"image_pull_secrets": [ "myregistrykey" ]`. Read more about image pull secrets
[here](https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod).

`transform.accept_return_code` is an array of return codes, such as exit codes
from your Docker command, that are considered acceptable.
If your Docker command exits with one of the codes in this array, it is
considered a successful run for the purposes of setting job status. `0`
is always considered a successful exit code.

`transform.debug` turns on added debug logging for the pipeline.

`transform.user` sets the user that your code runs as. This can also be
accomplished with a `USER` directive in your `Dockerfile`.

`transform.working_dir` sets the directory that your command runs from. You
can also specify the `WORKDIR` directive in your `Dockerfile`.

`transform.dockerfile` is the path to the `Dockerfile` used with the `--build`
flag. This defaults to `./Dockerfile`.

### Parallelism Spec (optional)

`parallelism_spec` describes how Pachyderm parallelizes your pipeline.
Currently, Pachyderm has two parallelism strategies: `constant` and
`coefficient`.

If you set the `constant` field, Pachyderm starts the number of workers
that you specify. For example, set `"constant":10` to use 10 workers.

If you set the `coefficient` field, Pachyderm starts a number of workers
that is a multiple of your Kubernetes cluster’s size. For example, if your
Kubernetes cluster has 10 nodes, and you set `"coefficient": 0.5`, Pachyderm
starts five workers. If you set it to 2.0, Pachyderm starts 20 workers
(two per Kubernetes node).

The default value is `"constant": 1`.

Because spouts and services are designed to be single instances, do not
modify the default `parallelism_spec` value for these pipelines.
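For instance, a pipeline that should always run with ten workers would set:

```json
"parallelism_spec": {
  "constant": 10
}
```

To scale with the cluster size instead, replace `constant` with a `coefficient` value such as `0.5`.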
### Resource Requests (optional)

`resource_requests` describes the amount of resources that you expect the
workers for a given pipeline to consume. Knowing this in advance
lets Pachyderm schedule big jobs on separate machines, so that they do not
conflict and either slow down or die.

The `memory` field is a string that describes the amount of memory, in bytes,
that each worker needs, with allowed SI suffixes (M, K, G, Mi, Ki, Gi, and so on).
For example, a worker that needs to read a 1GB file into memory might set
`"memory": "1.2G"` with a little extra for the code to use in addition to the
file. Workers for this pipeline will be placed on machines with at least
1.2GB of free memory, and other large workers will be prevented from using it
(if they also set their `resource_requests`).

The `cpu` field is a number that describes the amount of CPU time in `cpu
seconds/real seconds` that each worker needs. Setting `"cpu": 0.5` indicates that
the worker should get 500ms of CPU time per second. Setting `"cpu": 2`
indicates that the worker gets 2000ms of CPU time per second. In other words,
it is using 2 CPUs, though worker threads might spend 500ms on four
physical CPUs instead of one second on two physical CPUs.

The `disk` field is a string that describes the amount of ephemeral disk space,
in bytes, that each worker needs, with allowed SI suffixes (M, K, G, Mi, Ki, Gi,
and so on).

In both cases, the resource requests are not upper bounds. If a worker uses
more memory than it requested, it is not necessarily shut down.
However, if the whole node runs out of memory, Kubernetes starts deleting
pods that have been placed on it and exceeded their memory request,
to reclaim memory.
To prevent deletion of your worker pod, you must set your `memory` request to
a sufficiently large value. However, if the total memory requested by all
workers in the system is too large, Kubernetes cannot schedule new
workers because no machine has enough unclaimed memory. `cpu` works
similarly, but for CPU time.

By default, workers are scheduled with an effective resource request of 0 (to
avoid scheduling problems that would otherwise prevent users from running
pipelines). This means that if a node runs out of memory, any such worker
might be terminated.

For more information about resource requests and limits see the
[Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/)
on the subject.

### Resource Limits (optional)

`resource_limits` describes the upper threshold of allowed resources a given
worker can consume. If a worker exceeds this value, it will be evicted.

The `gpu` field is a number that describes how many GPUs each worker needs.
Only whole numbers are supported; Kubernetes does not allow multiplexing of
GPUs. Unlike the other resource fields, GPUs only have meaning in limits: by
requesting a GPU, the worker will have sole access to that GPU while it is
running. It's recommended to enable `standby` if you are using GPUs so other
processes in the cluster will have access to the GPUs while the pipeline has
nothing to process. For more information about scheduling GPUs see the
[Kubernetes docs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
on the subject.
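As an illustrative sketch only (the amounts and the GPU `type` string are placeholders, not recommendations), a pipeline might combine requests and limits like this:

```json
"resource_requests": {
  "memory": "2G",
  "cpu": 1,
  "disk": "10G"
},
"resource_limits": {
  "memory": "4G",
  "cpu": 2,
  "gpu": {
    "type": "nvidia.com/gpu",
    "number": 1
  }
}
```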
### Datum Timeout (optional)

`datum_timeout` is a string (e.g. `1s`, `5m`, or `15h`) that determines the
maximum execution time allowed per datum. No matter what your parallelism
or number of datums is, no single datum is allowed to exceed this value.

### Datum Tries (optional)

`datum_tries` is an integer, such as `1`, `2`, or `3`, that determines the
number of times a job attempts to run on a datum when a failure occurs.
Setting `datum_tries` to `1` attempts a job once, with no retries.
Only failed datums are retried in a retry attempt. If the operation succeeds
in retry attempts, then the job is marked as successful. Otherwise, the job
is marked as failed.

### Job Timeout (optional)

`job_timeout` is a string (e.g. `1s`, `5m`, or `15h`) that determines the
maximum execution time allowed for a job. It differs from `datum_timeout`
in that the limit is applied across all workers and all datums. That
means that you'll need to keep in mind the parallelism, total number of
datums, and execution time per datum when setting this value. Keep in
mind that the number of datums may change over jobs. Some new commits may
have a bunch of new files (and so new datums). Some may have fewer.

### Input (required)

`input` specifies repos that will be visible to the jobs during runtime.
Commits to these repos automatically trigger the pipeline to create new
jobs to process them. Input is a recursive type: there are multiple different
kinds of inputs which can be combined together. The `input` object is a
container for the different input types with a field for each; only one of
these fields may be set for any instantiation of the object.

```
{
    "pfs": pfs_input,
    "union": union_input,
    "cross": cross_input,
    "cron": cron_input,
    "join": join_input,
    "git": git_input
}
```

#### PFS Input

PFS inputs are the simplest inputs: they take input from a single branch on a
single repo.

```
{
    "name": string,
    "repo": string,
    "branch": string,
    "glob": string,
    "lazy": bool,
    "empty_files": bool
}
```

`input.pfs.name` is the name of the input. An input with the name `XXX` is
visible under the path `/pfs/XXX` when a job runs. Input names must be unique
if the inputs are crossed, but they may be duplicated between `PFSInput`s that
are combined by using the `union` operator. This is because when
`PFSInput`s are combined, you only ever see a datum from one input
at a time. Overlapping the names of combined inputs allows
you to write simpler code since you no longer need to consider which
input directory a particular datum comes from. If an input's name is not
specified, it defaults to the name of the repo. Therefore, if you have two
crossed inputs from the same repo, you must give at least one of them a
unique name.

`input.pfs.repo` is the name of the Pachyderm repository with the data that
you want to join with other data.

`input.pfs.branch` is the `branch` to watch for commits. If left blank,
Pachyderm sets this value to `master`.

`input.pfs.glob` is a glob pattern that is used to determine how the
input data is partitioned.

`input.pfs.lazy` controls how the data is exposed to jobs. The default is
`false`, which means the job eagerly downloads the data it needs to process and
exposes it as normal files on disk. If lazy is set to `true`, data is
exposed as named pipes instead, and no data is downloaded until the job
opens the pipe and reads it. If the pipe is never opened, then no data is
downloaded.

Some applications do not work with pipes. For example, pipes do not support
applications that make `syscalls` such as `Seek`. Applications that can work
with pipes should use them because they are more performant. The difference will
be especially notable if the job only reads a subset of the files that are
available to it.

**Note:** `lazy` currently does not support datums that
contain more than 10000 files.

`input.pfs.empty_files` controls how files are exposed to jobs. If
set to `true`, it causes files from this PFS input to be presented as empty files.
This is useful in shuffle pipelines where you want to read the names of
files and reorganize them by using symlinks.
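For example, a hypothetical shuffle pipeline might mount a repo under a custom name and read only file names (the repo and input names below are illustrative):

```json
"input": {
  "pfs": {
    "name": "images",
    "repo": "raw-images",
    "branch": "master",
    "glob": "/*",
    "empty_files": true
  }
}
```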
#### Union Input

Union inputs take the union of other inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/` and being the only filesystem objects in these
repositories.

```
| inputA | inputB | inputA ∪ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | foo             |
| bar    | buzz   | fizz            |
|        |        | bar             |
|        |        | buzz            |
```

The union inputs do not take a name and maintain the names of the
sub-inputs. In the example above, you would see files under
`/pfs/inputA/...` or `/pfs/inputB/...`, but never both at the same time.
When you write code to address this behavior, make sure that
your code first determines which input directory is present. Starting
with Pachyderm 1.5.3, we recommend that you give your inputs the
same `Name`. That way your code only needs to handle data being present
in that directory. This only works if your code does not need to be
aware of which of the underlying inputs the data comes from.

`input.union` is an array of inputs to combine. The inputs do not have to be
`pfs` inputs. They can also be `union` and `cross` inputs. However, there is
no reason to take a union of unions because union is associative.

#### Cross Input

Cross inputs create the cross product of other inputs. In other words,
a cross input creates tuples of the datums in the inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/` and being the only filesystem objects in these
repositories.

```
| inputA | inputB | inputA ⨯ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | (foo, fizz)     |
| bar    | buzz   | (foo, buzz)     |
|        |        | (bar, fizz)     |
|        |        | (bar, buzz)     |
```

The cross inputs above do not take a name and maintain
the names of the sub-inputs.
In the example above, you would see files under `/pfs/inputA/...`
and `/pfs/inputB/...`.

`input.cross` is an array of inputs to cross.
The inputs do not have to be `pfs` inputs. They can also be
`union` and `cross` inputs. However, there is
no reason to take a cross of crosses because cross is associative.
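As a sketch, a pipeline that needs every combination of a parameter file and the full training set might cross two inputs (the repo names are placeholders):

```json
"input": {
  "cross": [
    {
      "pfs": {
        "repo": "parameters",
        "glob": "/*"
      }
    },
    {
      "pfs": {
        "repo": "training-data",
        "glob": "/"
      }
    }
  ]
}
```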
#### Cron Input

Cron inputs allow you to trigger pipelines based on time. A Cron input is
based on the Unix utility called `cron`. When you create a pipeline with
one or more Cron inputs, `pachd` creates a repo for each of them. The start
time for a Cron input is specified in its spec.
When a Cron input triggers,
`pachd` commits a single file, named by the current [RFC
3339 timestamp](https://www.ietf.org/rfc/rfc3339.txt), to the repo. The file
contains the time that satisfied the spec.

```
{
    "name": string,
    "spec": string,
    "repo": string,
    "start": time,
    "overwrite": bool
}
```

`input.cron.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`, except that it is not optional.

`input.cron.spec` is a cron expression which specifies the schedule on
which to trigger the pipeline. To learn more about how to write schedules,
see the [Wikipedia page on cron](https://en.wikipedia.org/wiki/Cron).
Pachyderm supports non-standard schedules, such as `"@daily"`.

`input.cron.repo` is the repo which Pachyderm creates for the input. This
parameter is optional. If you do not specify this parameter, then
`"<pipeline-name>_<input-name>"` is used by default.

`input.cron.start` is the time to start counting from for the input. This
parameter is optional. If you do not specify this parameter, then the
time when the pipeline was created is used by default. Specifying a
time enables you to run on matching times from the past or skip times
from the present and only start running
on matching times in the future. Format the time value according to [RFC
3339](https://www.ietf.org/rfc/rfc3339.txt).

`input.cron.overwrite` is a flag that specifies whether you want the timestamp file
to be overwritten on each tick. This parameter is optional, and if you do not
specify it, it defaults to simply writing new files on each tick. By default,
`pachd` expects only the new information to be written out on each tick
and combines that data with the data from the previous ticks. If `"overwrite"`
is set to `true`, it expects the full dataset to be written out for each tick and
replaces previous outputs with the new data written out.
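For example, a hypothetical pipeline that ticks every ten minutes and keeps only the latest timestamp file could use a Cron input like this (the name and schedule are illustrative):

```json
"input": {
  "cron": {
    "name": "tick",
    "spec": "*/10 * * * *",
    "overwrite": true
  }
}
```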
#### Join Input

A join input enables you to join files that are stored in separate
Pachyderm repositories and that match a configured glob
pattern. A join input must have the `glob` and `join_on` parameters configured
to work properly. A join can combine multiple PFS inputs.

You can specify the following parameters for the `join` input; a complete
example is sketched after this list.

* `input.pfs.name` — the name of the PFS input that appears in the
  `INPUT` field when you run the `pachctl list job` command.
  If an input name is not specified, it defaults to the name of the repo.

* `input.pfs.repo` — see the description in [PFS Input](#pfs-input).
  The name of the Pachyderm repository with the data that
  you want to join with other data.

* `input.pfs.branch` — see the description in [PFS Input](#pfs-input).

* `input.pfs.glob` — a wildcard pattern that defines how a dataset is broken
  up into datums for further processing. When you use a glob pattern in joins,
  it creates a naming convention that Pachyderm uses to join files. In other
  words, Pachyderm joins the files that are named according to the glob
  pattern and skips those that are not.

  You can specify the glob pattern for joins in parentheses to create
  one or multiple capture groups. A capture group can include one or multiple
  characters. Use standard UNIX globbing characters to create capture
  groups, including the following:

    * `?` — matches a single character in a filepath. For example, you
      have files named `file000.txt`, `file001.txt`, `file002.txt`, and so on.
      You can set the glob pattern to `/file(?)(?)(?)` and the `join_on` key to
      `$2` so that Pachyderm matches only the files that have the same second
      character.

    * `*` — matches any number of characters in the filepath. For example, if you set
      your capture group to `/(*)`, Pachyderm matches all files in the root
      directory.

  If you do not specify a correct `glob` pattern, Pachyderm performs the
  `cross` input operation instead of `join`.

* `input.pfs.lazy` — see the description in [PFS Input](#pfs-input).
* `input.pfs.empty_files` — see the description in [PFS Input](#pfs-input).
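As a sketch, the following join pairs files from two hypothetical repos whose file names share a component captured by the first group (the repos, globs, and capture groups are illustrative):

```json
"input": {
  "join": [
    {
      "pfs": {
        "repo": "readings",
        "glob": "/(*).txt",
        "join_on": "$1"
      }
    },
    {
      "pfs": {
        "repo": "parameters",
        "glob": "/(*).txt",
        "join_on": "$1"
      }
    }
  ]
}
```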
#### Git Input (alpha feature)

Git inputs allow you to pull code from a public git URL and execute that code as part of your pipeline. A pipeline with a Git Input will get triggered (i.e. will see a new input commit and will spawn a job) whenever you commit to your git repository.

**Note:** This only works on cloud deployments, not local clusters.

`input.git.URL` must be a URL of the form: `https://github.com/foo/bar.git`

`input.git.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`. It is optional.

`input.git.branch` is the name of the git branch to use as input.

Git inputs also require some additional configuration. In order for new commits on your git repository to correspond to new commits on the Pachyderm Git Input repo, we need to set up a git webhook. At the moment, only GitHub is supported. (Though if you ask nicely, we can add support for GitLab or BitBucket).

1. Create your Pachyderm pipeline with the Git Input.

2. To get the URL of the webhook to your cluster, run `pachctl inspect pipeline` on your pipeline. You should see a `Githook URL` field with a URL set. Note: this only works if you have deployed to a cloud provider (e.g. AWS, GKE). If you see `pending` as the value (and you have deployed on a cloud provider), it is possible that the service is still being provisioned. You can check `kubectl get svc` to make sure you see the `githook` service running.

3. To set up the GitHub webhook, navigate to:

```
https://github.com/<your_org>/<your_repo>/settings/hooks/new
```

Or navigate to webhooks under settings. Then you'll want to copy the `Githook URL` into the 'Payload URL' field.

### Output Branch (optional)

This is the branch where the pipeline outputs new commits. By default,
it's "master".

### Egress (optional)

`egress` allows you to push the results of a pipeline to an external data
store such as S3, Google Cloud Storage, or Azure Storage. Data will be pushed
after the user code has finished running but before the job is marked as
successful.

For more information, see [Exporting Data by using egress](../../how-tos/export-data-out-pachyderm/#export-your-data-with-egress)

### Standby (optional)

`standby` indicates that the pipeline should be put into "standby" when there is
no data for it to process. A pipeline in standby will have no pods running and
thus will consume no resources. Its state will be displayed as "standby".

Standby replaces `scale_down_threshold` from releases prior to 1.7.1.

### Cache Size (optional)

`cache_size` controls how much cache a pipeline's sidecar containers use. In
general, your pipeline's performance will increase with the cache size, but
only up to a certain point depending on your workload.

Every worker in every pipeline has a limited-functionality `pachd` server
running adjacent to it, which proxies PFS reads and writes (this prevents
thundering herds when jobs start and end, which is when all of a pipeline's
workers are reading from and writing to PFS simultaneously). Part of what these
"sidecar" pachd servers do is cache PFS reads. If a pipeline has a cross input,
and a worker is downloading the same datum from one branch of the input
repeatedly, then the cache can speed up processing significantly.

### Enable Stats (optional)

The `enable_stats` parameter turns on statistics tracking for the pipeline.
When you enable the statistics tracking, the pipeline automatically creates
and commits datum processing information to a special branch in its output
repo called `"stats"`. This branch stores information about each datum that
the pipeline processes, including timing information, size information, logs,
and `/pfs` snapshots. You can view these statistics by running the `pachctl
inspect datum` and `pachctl list datum` commands, as well as through the web UI.

Once turned on, statistics tracking cannot be disabled for the pipeline. You can
turn it off by deleting the pipeline, setting `enable_stats` to `false` or
completely removing it from your pipeline spec, and recreating the pipeline from
that updated spec file. While the pipeline that collects the stats
exists, the storage space used by the stats cannot be released.

!!! note
    Enabling stats results in a slight storage use increase for logs and timing
    information.
    However, stats do not use as much extra storage as it might appear because
    snapshots of the `/pfs` directory that are the largest stored assets
    do not require extra space.

### Service (alpha feature, optional)

`service` specifies that the pipeline should be treated as a long running
service rather than a data transformation. This means that `transform.cmd` is
not expected to exit; if it does, it will be restarted. Furthermore, the service
is exposed outside the container using a Kubernetes service.
`"internal_port"` should be a port that the user code binds to inside the
container, and `"external_port"` is the port on which it is exposed through the
`NodePorts` functionality of Kubernetes services. After a service has been
created, you should be able to access it at
`http://<kubernetes-host>:<external_port>`.

### Spout (optional)

`spout` is a type of pipeline that processes streaming data.
Unlike a union or cross pipeline, a spout pipeline does not have
a PFS input. Instead, it opens a Linux *named pipe* into the source of the
streaming data. Your pipeline
can be either a spout or a service, not both. Therefore, if you added
the `service` as a top-level object in your pipeline, you cannot add `spout`.
However, you can expose a service from inside of a spout pipeline by
specifying it as a field in the `spout` spec. Then, Kubernetes creates
a service endpoint that you can expose externally. You can get the information
about the service by running `kubectl get services`.
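For instance, a hypothetical spout that also exposes a monitoring endpoint might combine the two as follows (the ports and annotation are placeholders):

```json
"spout": {
  "overwrite": false,
  "service": {
    "internal_port": 8000,
    "external_port": 31800,
    "annotations": {
      "foo": "bar"
    }
  }
}
```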
For more information, see [Spouts](../concepts/pipeline-concepts/pipeline/spout.md).

### Max Queue Size (optional)
`max_queue_size` specifies the maximum number of datums that a worker should
hold in its processing queue at a given time (after processing its entire
queue, a worker "checkpoints" its progress by writing to persistent storage).
The default value is `1`, which means workers will only hold onto the datum that
they're currently processing.

Increasing this value can improve pipeline performance, as that allows workers
to simultaneously download, process, and upload different datums at the same
time (and reduces the total time spent on checkpointing). Decreasing this value
can make jobs more robust to failed workers, as work gets checkpointed more
often, and a failing worker will not lose as much progress. Setting this value
too high can also cause problems if you have `lazy` inputs, as there's a cap of
10,000 `lazy` files per worker and multiple datums that are running all count
against this limit.

### Chunk Spec (optional)
`chunk_spec` specifies how a pipeline should chunk its datums.

`chunk_spec.number`, if nonzero, specifies that each chunk should contain `number`
datums. Chunks may contain fewer datums if the total number of datums does not
divide evenly.

`chunk_spec.size_bytes`, if nonzero, specifies a target size for each chunk of datums.
Chunks may be larger or smaller than `size_bytes`, but will usually be
pretty close to `size_bytes` in size.

### Scheduling Spec (optional)
`scheduling_spec` specifies how the pods for a pipeline should be scheduled.

`scheduling_spec.node_selector` allows you to select which nodes your pipeline
will run on. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
on node selectors for more information about how this works.

`scheduling_spec.priority_class_name` allows you to select the priority class
for the pipeline, which affects how Kubernetes chooses to schedule and deschedule
the pipeline. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass)
on priority and preemption for more information about how this works.
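As a sketch, a pipeline could be pinned to labeled nodes and given an existing priority class (both the node label and the class name below are placeholders that must already exist in your cluster):

```json
"scheduling_spec": {
  "node_selector": {
    "disktype": "ssd"
  },
  "priority_class_name": "high-priority"
}
```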
### Pod Spec (optional)
`pod_spec` is an advanced option that allows you to set fields in the pod spec
that haven't been explicitly exposed in the rest of the pipeline spec. A good
way to figure out what JSON you should pass is to create a pod in Kubernetes
with the proper settings, then do:

```
kubectl get po/<pod-name> -o json | jq .spec
```

This gives you a correctly formatted piece of JSON; you should then remove
the extraneous fields that Kubernetes injects or that can be set elsewhere.

The JSON is applied after the other parameters for the `pod_spec` have already
been set as a [JSON Merge Patch](https://tools.ietf.org/html/rfc7386). This
means that you can modify things such as the storage and user containers.

### Pod Patch (optional)
`pod_patch` is similar to `pod_spec` above but is applied as a [JSON
Patch](https://tools.ietf.org/html/rfc6902). Note that this means that the
process outlined above of modifying an existing pod spec and then manually
blanking unchanged fields won't work; you'll need to create a correctly
formatted patch by diffing the two pod specs.

## The Input Glob Pattern

Each PFS input needs to specify a [glob pattern](../concepts/pipeline-concepts/distributed_computing.md).

Pachyderm uses the glob pattern to determine how many "datums" an input
consists of. Datums are the unit of parallelism in Pachyderm. That is,
Pachyderm attempts to process datums in parallel whenever possible.

Intuitively, you may think of the input repo as a file system, and you are
applying the glob pattern to the root of the file system. The files and
directories that match the glob pattern are considered datums.

For instance, let's say your input repo has the following structure:

```
/foo-1
/foo-2
/bar
  /bar-1
  /bar-2
```

Now let's consider what the following glob patterns would match respectively:

* `/`: this pattern matches `/`, the root directory itself, meaning all the data would be a single large datum.
* `/*`: this pattern matches everything under the root directory, giving us 3 datums:
`/foo-1`, `/foo-2`, and everything under the directory `/bar`.
* `/bar/*`: this pattern matches files only under the `/bar` directory: `/bar-1` and `/bar-2`
* `/foo*`: this pattern matches files under the root directory that start with the characters `foo`
* `/*/*`: this pattern matches everything that's two levels deep relative
to the root: `/bar/bar-1` and `/bar/bar-2`

The datums are defined as whichever files or directories match the glob pattern. For instance, if we used
`/*`, then the job will process three datums (potentially in parallel):
`/foo-1`, `/foo-2`, and `/bar`. Both the `bar-1` and `bar-2` files within the directory `bar` would be grouped together and always processed by the same worker.

## PPS Mounts and File Access

### Mount Paths

The root mount point is at `/pfs`, which contains:

- `/pfs/input_name` which is where you would find the datum.
  - Each input will be found here by its name, which defaults to the repo
    name if not specified.
- `/pfs/out` which is where you write any output.

## Environment Variables

There are several environment variables that get injected into the user code
before it runs. They are:

- `PACH_JOB_ID` the ID of the currently running job.
- `PACH_OUTPUT_COMMIT_ID` the ID of the commit that output is written to.
- For each input there will be an environment variable with the same name
  defined to the path of the file for that input. For example, if you are
  accessing an input called `foo` from the path `/pfs/foo`, which contains a
  file called `bar`, then the environment variable `foo` will have the value
  `/pfs/foo/bar`. The path in the environment variable is the path which
  matched the glob pattern, even if the file is a directory; i.e. if your glob
  pattern is `/*`, it would match a directory `/bar`, and the value of `$foo`
  would then be `/pfs/foo/bar`. With a glob pattern of `/*/*` you would match
  the files contained in `/bar`, and thus the value of `foo` would be
  `/pfs/foo/bar/quux`.
- For each input there will be an environment variable named `input_COMMIT`
  indicating the ID of the commit being used for that input.
In addition to these environment variables, Kubernetes also injects others for
services that are running inside the cluster. These allow you to connect to
those outside services, which can be powerful but also can be hard to reason
about, as processing might be retried multiple times. For example, if your code
writes a row to a database, that row may be written multiple times due to
retries. Interaction with outside services should be [idempotent](https://en.wikipedia.org/wiki/Idempotence) to prevent
unexpected behavior. Furthermore, one of the running services that your code
can connect to is Pachyderm itself. This is generally not recommended as very
little of the Pachyderm API is idempotent, but in some specific cases it can be
a viable approach.