# Pipeline Specification

This document discusses each of the fields present in a pipeline specification.
To see how to use a pipeline spec to create a pipeline, refer to the [pachctl
create pipeline](pachctl/pachctl_create_pipeline.md) section.

## JSON Manifest Format

```json
{
  "pipeline": {
    "name": string
  },
  "description": string,
  "transform": {
    "image": string,
    "cmd": [ string ],
    "stdin": [ string ],
    "err_cmd": [ string ],
    "err_stdin": [ string ],
    "env": {
        string: string
    },
    "secrets": [ {
        "name": string,
        "mount_path": string
    },
    {
        "name": string,
        "env_var": string,
        "key": string
    } ],
    "image_pull_secrets": [ string ],
    "accept_return_code": [ int ],
    "debug": bool,
    "user": string,
    "working_dir": string,
  },
  "parallelism_spec": {
    // Set at most one of the following:
    "constant": int,
    "coefficient": number
  },
  "hashtree_spec": {
    "constant": int,
  },
  "resource_requests": {
    "memory": string,
    "cpu": number,
    "disk": string,
  },
  "resource_limits": {
    "memory": string,
    "cpu": number,
    "gpu": {
      "type": string,
      "number": int
    },
    "disk": string,
  },
  "datum_timeout": string,
  "datum_tries": int,
  "job_timeout": string,
  "input": {
    <"pfs", "cross", "union", "join", "cron", or "git" see below>
  },
  "output_branch": string,
  "egress": {
    "URL": "s3://bucket/dir"
  },
  "standby": bool,
  "cache_size": string,
  "enable_stats": bool,
  "service": {
    "internal_port": int,
    "external_port": int
  },
  "spout": {
    "overwrite": bool
    // Optionally, you can combine a spout with a service:
    "service": {
      "internal_port": int,
      "external_port": int,
      "annotations": {
        "foo": "bar"
      }
    }
  },
  "max_queue_size": int,
  "chunk_spec": {
    "number": int,
    "size_bytes": int
  },
  "scheduling_spec": {
    "node_selector": {string: string},
    "priority_class_name": string
  },
  "pod_spec": string,
  "pod_patch": string,
}

------------------------------------
"pfs" input
------------------------------------

"pfs": {
  "name": string,
  "repo": string,
  "branch": string,
  "glob": string,
  "lazy": bool,
  "empty_files": bool
}

------------------------------------
"cross" or "union" input
------------------------------------

"cross" or "union": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool
    }
  }
  ...
]

------------------------------------
"cron" input
------------------------------------

"cron": {
  "name": string,
  "spec": string,
  "repo": string,
  "start": time,
  "overwrite": bool
}

------------------------------------
"join" input
------------------------------------

"join": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool
    }
  }
]

------------------------------------
"git" input
------------------------------------

"git": {
  "URL": string,
  "name": string,
  "branch": string
}

```

In practice, you rarely need to specify all the fields.
Most fields either come with sensible defaults or can be empty.
The following text is an example of a minimal spec:

```json
{
  "pipeline": {
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["/binary", "/pfs/data", "/pfs/out"]
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
```

### Name (required)

`pipeline.name` is the name of the pipeline that you are creating. Each
pipeline needs to have a unique name. Pipeline names must meet the following
requirements:

- Include only alphanumeric characters, `_` and `-`.
- Begin or end with only alphanumeric characters (not `_` or `-`).
- Not exceed 63 characters in length.

### Description (optional)

`description` is an optional text field where you can add information
about the pipeline.

### Transform (required)

`transform.image` is the name of the Docker image that your jobs use.

`transform.cmd` is the command passed to the Docker run invocation. Similarly
to Docker, `cmd` is not run inside a shell, which means that
wildcard globbing (`*`), pipes (`|`), and file redirects (`>` and `>>`) do
not work. To use these features, you can set `cmd` to be a shell of your
choice, such as `sh`, and pass a shell script to `stdin`.

`transform.stdin` is an array of lines that are sent to your command on
`stdin`.
Lines do not have to end in newline characters.

`transform.err_cmd` is an optional command that is executed on failed datums.
If `err_cmd` succeeds and returns an exit code of 0, it does not prevent
the job from succeeding.
This behavior means that `transform.err_cmd` can be used to ignore
failed datums while still writing successful datums to the output repo,
instead of failing the whole job when some datums fail. The `transform.err_cmd`
command has the same limitations as `transform.cmd`.

`transform.err_stdin` is an array of lines that are sent to your error command
on `stdin`.
Lines do not have to end in newline characters.

`transform.env` is a key-value map of environment variables that
Pachyderm injects into the container.

**Note:** There are environment variables that are automatically injected
into the container. For a comprehensive list of them, see the [Environment
Variables](#environment-variables) section below.

`transform.secrets` is an array of secrets. You can use the secrets to
embed sensitive data, such as credentials. The secrets reference
Kubernetes secrets by name and specify a path to which the secrets are mounted or
an environment variable (`env_var`) that the value should be bound to. Secrets
must set `name`, which should be the name of a secret in Kubernetes. Secrets
must also specify either `mount_path` or `env_var` and `key`. See more
information about Kubernetes secrets [here](https://kubernetes.io/docs/concepts/configuration/secret/).
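For example, a hypothetical pipeline could mount one Kubernetes secret as files and expose a key from another as an environment variable. The secret names, mount path, and variable name below are placeholders for illustration only:

```json
"transform": {
  "image": "my-image",
  "cmd": ["/binary"],
  "secrets": [
    {
      "name": "my-tls-cert",
      "mount_path": "/etc/ssl/certs"
    },
    {
      "name": "my-s3-creds",
      "env_var": "AWS_SECRET_ACCESS_KEY",
      "key": "secret-access-key"
    }
  ]
}
```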
`transform.image_pull_secrets` is an array of image pull secrets. Image pull
secrets are similar to secrets except that they are mounted before the
containers are created, so they can be used to provide credentials for image
pulling. For example, if you are using a private Docker registry for your
images, you can specify it by running the following command:

```shell
$ kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
```

And then, notify your pipeline about it by using
`"image_pull_secrets": [ "myregistrykey" ]`. Read more about image pull secrets
[here](https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod).

`transform.accept_return_code` is an array of return codes, such as exit codes
from your Docker command, that are considered acceptable.
If your Docker command exits with one of the codes in this array, it is
considered a successful run for the purposes of setting job status. `0`
is always considered a successful exit code.

`transform.debug` turns on added debug logging for the pipeline.

`transform.user` sets the user that your code runs as. This can also be
accomplished with a `USER` directive in your `Dockerfile`.

`transform.working_dir` sets the directory that your command runs from. You
can also specify the `WORKDIR` directive in your `Dockerfile`.

`transform.dockerfile` is the path to the `Dockerfile` used with the `--build`
flag. This defaults to `./Dockerfile`.

### Parallelism Spec (optional)

`parallelism_spec` describes how Pachyderm parallelizes your pipeline.
Currently, Pachyderm has two parallelism strategies: `constant` and
`coefficient`.

If you set the `constant` field, Pachyderm starts the number of workers
that you specify. For example, set `"constant":10` to use 10 workers.

If you set the `coefficient` field, Pachyderm starts a number of workers
that is a multiple of your Kubernetes cluster’s size. For example, if your
Kubernetes cluster has 10 nodes, and you set `"coefficient": 0.5`, Pachyderm
starts five workers. If you set it to 2.0, Pachyderm starts 20 workers
(two per Kubernetes node).

The default value is `"constant": 1`.

Because spouts and services are designed to be single instances, do not
modify the default `parallelism_spec` value for these pipelines.
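For instance, a pipeline that should always run with ten workers would set:

```json
"parallelism_spec": {
  "constant": 10
}
```

To scale with the cluster size instead, replace `constant` with a `coefficient` value such as `0.5`.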
### Resource Requests (optional)

`resource_requests` describes the amount of resources that you expect the
workers for a given pipeline to consume. Knowing this in advance
lets Pachyderm schedule big jobs on separate machines, so that they do not
conflict and either slow down or die.

The `memory` field is a string that describes the amount of memory, in bytes,
that each worker needs, with allowed SI suffixes (M, K, G, Mi, Ki, Gi, and so on).
For example, a worker that needs to read a 1GB file into memory might set
`"memory": "1.2G"` with a little extra for the code to use in addition to the
file. Workers for this pipeline will be placed on machines with at least
1.2GB of free memory, and other large workers will be prevented from using it
(if they also set their `resource_requests`).

The `cpu` field is a number that describes the amount of CPU time in `cpu
seconds/real seconds` that each worker needs. Setting `"cpu": 0.5` indicates that
the worker should get 500ms of CPU time per second. Setting `"cpu": 2`
indicates that the worker gets 2000ms of CPU time per second. In other words,
it is using 2 CPUs, though worker threads might spend 500ms on four
physical CPUs instead of one second on two physical CPUs.

The `disk` field is a string that describes the amount of ephemeral disk space,
in bytes, that each worker needs, with allowed SI suffixes (M, K, G, Mi, Ki, Gi,
and so on).

In both cases, the resource requests are not upper bounds. If a worker uses
more memory than it requested, it is not necessarily shut down.
However, if the whole node runs out of memory, Kubernetes starts deleting
pods that have been placed on it and exceeded their memory request,
to reclaim memory.
To prevent deletion of your worker pod, you must set your `memory` request to
a sufficiently large value. However, if the total memory requested by all
workers in the system is too large, Kubernetes cannot schedule new
workers because no machine has enough unclaimed memory. `cpu` works
similarly, but for CPU time.

By default, workers are scheduled with an effective resource request of 0 (to
avoid scheduling problems that would otherwise prevent users from running
pipelines). This means that if a node runs out of memory, any such worker
might be terminated.

For more information about resource requests and limits see the
[Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/)
on the subject.

### Resource Limits (optional)

`resource_limits` describes the upper threshold of allowed resources a given
worker can consume. If a worker exceeds this value, it will be evicted.

The `gpu` field is a number that describes how many GPUs each worker needs.
Only whole numbers are supported; Kubernetes does not allow multiplexing of
GPUs. Unlike the other resource fields, GPUs only have meaning in limits: by
requesting a GPU, the worker will have sole access to that GPU while it is
running. It's recommended to enable `standby` if you are using GPUs so other
processes in the cluster will have access to the GPUs while the pipeline has
nothing to process. For more information about scheduling GPUs see the
[Kubernetes docs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
on the subject.
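As an illustrative sketch only (the amounts and the GPU `type` string are placeholders, not recommendations), a pipeline might combine requests and limits like this:

```json
"resource_requests": {
  "memory": "2G",
  "cpu": 1,
  "disk": "10G"
},
"resource_limits": {
  "memory": "4G",
  "cpu": 2,
  "gpu": {
    "type": "nvidia.com/gpu",
    "number": 1
  }
}
```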
### Datum Timeout (optional)

`datum_timeout` is a string (e.g. `1s`, `5m`, or `15h`) that determines the
maximum execution time allowed per datum. No matter what your parallelism
or number of datums is, no single datum is allowed to exceed this value.

### Datum Tries (optional)

`datum_tries` is an integer, such as `1`, `2`, or `3`, that determines the
number of times a job attempts to run on a datum when a failure occurs.
Setting `datum_tries` to `1` attempts a job once, with no retries.
Only failed datums are retried in a retry attempt. If the operation succeeds
in retry attempts, then the job is marked as successful. Otherwise, the job
is marked as failed.

### Job Timeout (optional)

`job_timeout` is a string (e.g. `1s`, `5m`, or `15h`) that determines the
maximum execution time allowed for a job. It differs from `datum_timeout`
in that the limit is applied across all workers and all datums. That
means that you'll need to keep in mind the parallelism, total number of
datums, and execution time per datum when setting this value. Keep in
mind that the number of datums may change over jobs. Some new commits may
have a bunch of new files (and so new datums). Some may have fewer.

### Input (required)

`input` specifies repos that will be visible to the jobs during runtime.
Commits to these repos automatically trigger the pipeline to create new
jobs to process them. Input is a recursive type: there are multiple different
kinds of inputs which can be combined together. The `input` object is a
container for the different input types with a field for each; only one of
these fields may be set for any instantiation of the object.

```
{
    "pfs": pfs_input,
    "union": union_input,
    "cross": cross_input,
    "cron": cron_input,
    "join": join_input,
    "git": git_input
}
```

#### PFS Input

PFS inputs are the simplest inputs: they take input from a single branch on a
single repo.

```
{
    "name": string,
    "repo": string,
    "branch": string,
    "glob": string,
    "lazy": bool,
    "empty_files": bool
}
```

`input.pfs.name` is the name of the input. An input with the name `XXX` is
visible under the path `/pfs/XXX` when a job runs. Input names must be unique
if the inputs are crossed, but they may be duplicated between `PFSInput`s that
are combined by using the `union` operator. This is because when
`PFSInput`s are combined, you only ever see a datum from one input
at a time. Overlapping the names of combined inputs allows
you to write simpler code since you no longer need to consider which
input directory a particular datum comes from. If an input's name is not
specified, it defaults to the name of the repo. Therefore, if you have two
crossed inputs from the same repo, you must give at least one of them a
unique name.

`input.pfs.repo` is the name of the Pachyderm repository with the data that
you want to join with other data.

`input.pfs.branch` is the `branch` to watch for commits. If left blank,
Pachyderm sets this value to `master`.

`input.pfs.glob` is a glob pattern that is used to determine how the
input data is partitioned.

`input.pfs.lazy` controls how the data is exposed to jobs. The default is
`false`, which means the job eagerly downloads the data it needs to process and
exposes it as normal files on disk. If lazy is set to `true`, data is
exposed as named pipes instead, and no data is downloaded until the job
opens the pipe and reads it. If the pipe is never opened, then no data is
downloaded.

Some applications do not work with pipes. For example, pipes do not support
applications that make `syscalls` such as `Seek`. Applications that can work
with pipes should use them because they are more performant. The difference will
be especially notable if the job only reads a subset of the files that are
available to it.

**Note:** `lazy` currently does not support datums that
contain more than 10000 files.

`input.pfs.empty_files` controls how files are exposed to jobs. If
set to `true`, it causes files from this PFS input to be presented as empty files.
This is useful in shuffle pipelines where you want to read the names of
files and reorganize them by using symlinks.
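For example, a hypothetical shuffle pipeline might mount a repo under a custom name and read only file names (the repo and input names below are illustrative):

```json
"input": {
  "pfs": {
    "name": "images",
    "repo": "raw-images",
    "branch": "master",
    "glob": "/*",
    "empty_files": true
  }
}
```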
#### Union Input

Union inputs take the union of other inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/` and being the only filesystem objects in these
repositories.

```
| inputA | inputB | inputA ∪ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | foo             |
| bar    | buzz   | fizz            |
|        |        | bar             |
|        |        | buzz            |
```

The union inputs do not take a name and maintain the names of the
sub-inputs. In the example above, you would see files under
`/pfs/inputA/...` or `/pfs/inputB/...`, but never both at the same time.
When you write code to address this behavior, make sure that
your code first determines which input directory is present. Starting
with Pachyderm 1.5.3, we recommend that you give your inputs the
same `Name`. That way your code only needs to handle data being present
in that directory. This only works if your code does not need to be
aware of which of the underlying inputs the data comes from.

`input.union` is an array of inputs to combine. The inputs do not have to be
`pfs` inputs. They can also be `union` and `cross` inputs. However, there is
no reason to take a union of unions because union is associative.

#### Cross Input

Cross inputs create the cross product of other inputs. In other words,
a cross input creates tuples of the datums in the inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/` and being the only filesystem objects in these
repositories.

```
| inputA | inputB | inputA ⨯ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | (foo, fizz)     |
| bar    | buzz   | (foo, buzz)     |
|        |        | (bar, fizz)     |
|        |        | (bar, buzz)     |
```

The cross inputs above do not take a name and maintain
the names of the sub-inputs.
In the example above, you would see files under `/pfs/inputA/...`
and `/pfs/inputB/...`.

`input.cross` is an array of inputs to cross.
The inputs do not have to be `pfs` inputs. They can also be
`union` and `cross` inputs. However, there is
no reason to take a cross of crosses because cross is associative.
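As a sketch, a pipeline that needs every combination of a parameter file and the full training set might cross two inputs (the repo names are placeholders):

```json
"input": {
  "cross": [
    {
      "pfs": {
        "repo": "parameters",
        "glob": "/*"
      }
    },
    {
      "pfs": {
        "repo": "training-data",
        "glob": "/"
      }
    }
  ]
}
```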
#### Cron Input

Cron inputs allow you to trigger pipelines based on time. A Cron input is
based on the Unix utility called `cron`. When you create a pipeline with
one or more Cron inputs, `pachd` creates a repo for each of them. The start
time for a Cron input is specified in its spec.
When a Cron input triggers,
`pachd` commits a single file, named by the current [RFC
3339 timestamp](https://www.ietf.org/rfc/rfc3339.txt), to the repo. The file
contains the time that satisfied the spec.

```
{
    "name": string,
    "spec": string,
    "repo": string,
    "start": time,
    "overwrite": bool
}
```

`input.cron.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`, except that it is not optional.

`input.cron.spec` is a cron expression which specifies the schedule on
which to trigger the pipeline. To learn more about how to write schedules,
see the [Wikipedia page on cron](https://en.wikipedia.org/wiki/Cron).
Pachyderm supports non-standard schedules, such as `"@daily"`.

`input.cron.repo` is the repo which Pachyderm creates for the input. This
parameter is optional. If you do not specify this parameter, then
`"<pipeline-name>_<input-name>"` is used by default.

`input.cron.start` is the time to start counting from for the input. This
parameter is optional. If you do not specify this parameter, then the
time when the pipeline was created is used by default. Specifying a
time enables you to run on matching times from the past or skip times
from the present and only start running
on matching times in the future. Format the time value according to [RFC
3339](https://www.ietf.org/rfc/rfc3339.txt).

`input.cron.overwrite` is a flag that specifies whether you want the timestamp file
to be overwritten on each tick. This parameter is optional, and if you do not
specify it, it defaults to simply writing new files on each tick. By default,
`pachd` expects only the new information to be written out on each tick
and combines that data with the data from the previous ticks. If `"overwrite"`
is set to `true`, it expects the full dataset to be written out for each tick and
replaces previous outputs with the new data written out.
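For example, a hypothetical pipeline that ticks every ten minutes and keeps only the latest timestamp file could use a Cron input like this (the name and schedule are illustrative):

```json
"input": {
  "cron": {
    "name": "tick",
    "spec": "*/10 * * * *",
    "overwrite": true
  }
}
```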
#### Join Input

A join input enables you to join files that are stored in separate
Pachyderm repositories and that match a configured glob
pattern. A join input must have the `glob` and `join_on` parameters configured
to work properly. A join can combine multiple PFS inputs.

You can specify the following parameters for the `join` input; a complete
example is sketched after this list.

* `input.pfs.name` — the name of the PFS input that appears in the
  `INPUT` field when you run the `pachctl list job` command.
  If an input name is not specified, it defaults to the name of the repo.

* `input.pfs.repo` — see the description in [PFS Input](#pfs-input).
  The name of the Pachyderm repository with the data that
  you want to join with other data.

* `input.pfs.branch` — see the description in [PFS Input](#pfs-input).

* `input.pfs.glob` — a wildcard pattern that defines how a dataset is broken
  up into datums for further processing. When you use a glob pattern in joins,
  it creates a naming convention that Pachyderm uses to join files. In other
  words, Pachyderm joins the files that are named according to the glob
  pattern and skips those that are not.

  You can specify the glob pattern for joins in parentheses to create
  one or multiple capture groups. A capture group can include one or multiple
  characters. Use standard UNIX globbing characters to create capture
  groups, including the following:

    * `?` — matches a single character in a filepath. For example, you
      have files named `file000.txt`, `file001.txt`, `file002.txt`, and so on.
      You can set the glob pattern to `/file(?)(?)(?)` and the `join_on` key to
      `$2` so that Pachyderm matches only the files that have the same second
      character.

    * `*` — matches any number of characters in the filepath. For example, if you set
      your capture group to `/(*)`, Pachyderm matches all files in the root
      directory.

  If you do not specify a correct `glob` pattern, Pachyderm performs the
  `cross` input operation instead of `join`.

* `input.pfs.lazy` — see the description in [PFS Input](#pfs-input).
* `input.pfs.empty_files` — see the description in [PFS Input](#pfs-input).
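As a sketch, the following join pairs files from two hypothetical repos whose file names share a component captured by the first group (the repos, globs, and capture groups are illustrative):

```json
"input": {
  "join": [
    {
      "pfs": {
        "repo": "readings",
        "glob": "/(*).txt",
        "join_on": "$1"
      }
    },
    {
      "pfs": {
        "repo": "parameters",
        "glob": "/(*).txt",
        "join_on": "$1"
      }
    }
  ]
}
```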
#### Git Input (alpha feature)

Git inputs allow you to pull code from a public git URL and execute that code as part of your pipeline. A pipeline with a Git Input will get triggered (i.e. will see a new input commit and will spawn a job) whenever you commit to your git repository.

**Note:** This only works on cloud deployments, not local clusters.

`input.git.URL` must be a URL of the form: `https://github.com/foo/bar.git`

`input.git.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`. It is optional.

`input.git.branch` is the name of the git branch to use as input.

Git inputs also require some additional configuration. In order for new commits on your git repository to correspond to new commits on the Pachyderm Git Input repo, we need to set up a git webhook. At the moment, only GitHub is supported. (Though if you ask nicely, we can add support for GitLab or BitBucket).

1. Create your Pachyderm pipeline with the Git Input.

2. To get the URL of the webhook to your cluster, run `pachctl inspect pipeline` on your pipeline. You should see a `Githook URL` field with a URL set. Note: this only works if you have deployed to a cloud provider (e.g. AWS, GKE). If you see `pending` as the value (and you have deployed on a cloud provider), it is possible that the service is still being provisioned. You can check `kubectl get svc` to make sure you see the `githook` service running.

3. To set up the GitHub webhook, navigate to:

```
https://github.com/<your_org>/<your_repo>/settings/hooks/new
```

Or navigate to webhooks under settings. Then you'll want to copy the `Githook URL` into the 'Payload URL' field.

### Output Branch (optional)

This is the branch where the pipeline outputs new commits. By default,
it's "master".

### Egress (optional)

`egress` allows you to push the results of a pipeline to an external data
store such as S3, Google Cloud Storage, or Azure Storage. Data will be pushed
after the user code has finished running but before the job is marked as
successful.

For more information, see [Exporting Data by using egress](../../how-tos/export-data-out-pachyderm/#export-your-data-with-egress)

### Standby (optional)

`standby` indicates that the pipeline should be put into "standby" when there is
no data for it to process. A pipeline in standby will have no pods running and
thus will consume no resources. Its state will be displayed as "standby".

Standby replaces `scale_down_threshold` from releases prior to 1.7.1.

### Cache Size (optional)

`cache_size` controls how much cache a pipeline's sidecar containers use. In
general, your pipeline's performance will increase with the cache size, but
only up to a certain point depending on your workload.

Every worker in every pipeline has a limited-functionality `pachd` server
running adjacent to it, which proxies PFS reads and writes (this prevents
thundering herds when jobs start and end, which is when all of a pipeline's
workers are reading from and writing to PFS simultaneously). Part of what these
"sidecar" pachd servers do is cache PFS reads. If a pipeline has a cross input,
and a worker is downloading the same datum from one branch of the input
repeatedly, then the cache can speed up processing significantly.

### Enable Stats (optional)

The `enable_stats` parameter turns on statistics tracking for the pipeline.
When you enable the statistics tracking, the pipeline automatically creates
and commits datum processing information to a special branch in its output
repo called `"stats"`. This branch stores information about each datum that
the pipeline processes, including timing information, size information, logs,
and `/pfs` snapshots. You can view these statistics by running the `pachctl
inspect datum` and `pachctl list datum` commands, as well as through the web UI.

Once turned on, statistics tracking cannot be disabled for the pipeline. You can
turn it off by deleting the pipeline, setting `enable_stats` to `false` or
completely removing it from your pipeline spec, and recreating the pipeline from
that updated spec file. While the pipeline that collects the stats
exists, the storage space used by the stats cannot be released.

!!! note
    Enabling stats results in a slight storage use increase for logs and timing
    information.
    However, stats do not use as much extra storage as it might appear because
    snapshots of the `/pfs` directory that are the largest stored assets
    do not require extra space.

### Service (alpha feature, optional)

`service` specifies that the pipeline should be treated as a long running
service rather than a data transformation. This means that `transform.cmd` is
not expected to exit; if it does, it will be restarted. Furthermore, the service
is exposed outside the container using a Kubernetes service.
`"internal_port"` should be a port that the user code binds to inside the
container, and `"external_port"` is the port on which it is exposed through the
`NodePorts` functionality of Kubernetes services. After a service has been
created, you should be able to access it at
`http://<kubernetes-host>:<external_port>`.

### Spout (optional)

`spout` is a type of pipeline that processes streaming data.
Unlike a union or cross pipeline, a spout pipeline does not have
a PFS input. Instead, it opens a Linux *named pipe* into the source of the
streaming data. Your pipeline
can be either a spout or a service, not both. Therefore, if you added
the `service` as a top-level object in your pipeline, you cannot add `spout`.
However, you can expose a service from inside of a spout pipeline by
specifying it as a field in the `spout` spec. Then, Kubernetes creates
a service endpoint that you can expose externally. You can get the information
about the service by running `kubectl get services`.
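For instance, a hypothetical spout that also exposes a monitoring endpoint might combine the two as follows (the ports and annotation are placeholders):

```json
"spout": {
  "overwrite": false,
  "service": {
    "internal_port": 8000,
    "external_port": 31800,
    "annotations": {
      "foo": "bar"
    }
  }
}
```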
For more information, see [Spouts](../concepts/pipeline-concepts/pipeline/spout.md).

### Max Queue Size (optional)
`max_queue_size` specifies the maximum number of datums that a worker should
hold in its processing queue at a given time (after processing its entire
queue, a worker "checkpoints" its progress by writing to persistent storage).
The default value is `1`, which means workers will only hold onto the datum that
they're currently processing.

Increasing this value can improve pipeline performance, as that allows workers
to simultaneously download, process, and upload different datums at the same
time (and reduces the total time spent on checkpointing). Decreasing this value
can make jobs more robust to failed workers, as work gets checkpointed more
often, and a failing worker will not lose as much progress. Setting this value
too high can also cause problems if you have `lazy` inputs, as there's a cap of
10,000 `lazy` files per worker and multiple datums that are running all count
against this limit.

### Chunk Spec (optional)
`chunk_spec` specifies how a pipeline should chunk its datums.

`chunk_spec.number`, if nonzero, specifies that each chunk should contain `number`
datums. Chunks may contain fewer datums if the total number of datums does not
divide evenly.

`chunk_spec.size_bytes`, if nonzero, specifies a target size for each chunk of datums.
Chunks may be larger or smaller than `size_bytes`, but will usually be
pretty close to `size_bytes` in size.

### Scheduling Spec (optional)
`scheduling_spec` specifies how the pods for a pipeline should be scheduled.

`scheduling_spec.node_selector` allows you to select which nodes your pipeline
will run on. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
on node selectors for more information about how this works.

`scheduling_spec.priority_class_name` allows you to select the priority class
for the pipeline, which affects how Kubernetes chooses to schedule and deschedule
the pipeline. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass)
on priority and preemption for more information about how this works.
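As a sketch, a pipeline could be pinned to labeled nodes and given an existing priority class (both the node label and the class name below are placeholders that must already exist in your cluster):

```json
"scheduling_spec": {
  "node_selector": {
    "disktype": "ssd"
  },
  "priority_class_name": "high-priority"
}
```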
### Pod Spec (optional)
`pod_spec` is an advanced option that allows you to set fields in the pod spec
that haven't been explicitly exposed in the rest of the pipeline spec. A good
way to figure out what JSON you should pass is to create a pod in Kubernetes
with the proper settings, then do:

```
kubectl get po/<pod-name> -o json | jq .spec
```

This gives you a correctly formatted piece of JSON; you should then remove
the extraneous fields that Kubernetes injects or that can be set elsewhere.

The JSON is applied after the other parameters for the `pod_spec` have already
been set as a [JSON Merge Patch](https://tools.ietf.org/html/rfc7386). This
means that you can modify things such as the storage and user containers.

### Pod Patch (optional)
`pod_patch` is similar to `pod_spec` above but is applied as a [JSON
Patch](https://tools.ietf.org/html/rfc6902). Note that this means that the
process outlined above of modifying an existing pod spec and then manually
blanking unchanged fields won't work; you'll need to create a correctly
formatted patch by diffing the two pod specs.

## The Input Glob Pattern

Each PFS input needs to specify a [glob pattern](../concepts/pipeline-concepts/distributed_computing.md).

Pachyderm uses the glob pattern to determine how many "datums" an input
consists of. Datums are the unit of parallelism in Pachyderm. That is,
Pachyderm attempts to process datums in parallel whenever possible.

Intuitively, you may think of the input repo as a file system, and you are
applying the glob pattern to the root of the file system. The files and
directories that match the glob pattern are considered datums.

For instance, let's say your input repo has the following structure:

```
/foo-1
/foo-2
/bar
  /bar-1
  /bar-2
```

Now let's consider what the following glob patterns would match respectively:

* `/`: this pattern matches `/`, the root directory itself, meaning all the data would be a single large datum.
* `/*`: this pattern matches everything under the root directory, giving us 3 datums:
`/foo-1`, `/foo-2`, and everything under the directory `/bar`.
* `/bar/*`: this pattern matches files only under the `/bar` directory: `/bar-1` and `/bar-2`
* `/foo*`: this pattern matches files under the root directory that start with the characters `foo`
* `/*/*`: this pattern matches everything that's two levels deep relative
to the root: `/bar/bar-1` and `/bar/bar-2`

The datums are defined as whichever files or directories match the glob pattern. For instance, if we used
`/*`, then the job will process three datums (potentially in parallel):
`/foo-1`, `/foo-2`, and `/bar`. Both the `bar-1` and `bar-2` files within the directory `bar` would be grouped together and always processed by the same worker.

## PPS Mounts and File Access

### Mount Paths

The root mount point is at `/pfs`, which contains:

- `/pfs/input_name` which is where you would find the datum.
  - Each input will be found here by its name, which defaults to the repo
    name if not specified.
- `/pfs/out` which is where you write any output.

## Environment Variables

There are several environment variables that get injected into the user code
before it runs. They are:

- `PACH_JOB_ID` the ID of the currently running job.
- `PACH_OUTPUT_COMMIT_ID` the ID of the commit that output is written to.
- For each input there will be an environment variable with the same name
  defined to the path of the file for that input. For example, if you are
  accessing an input called `foo` from the path `/pfs/foo`, which contains a
  file called `bar`, then the environment variable `foo` will have the value
  `/pfs/foo/bar`. The path in the environment variable is the path which
  matched the glob pattern, even if the file is a directory; i.e. if your glob
  pattern is `/*`, it would match a directory `/bar`, and the value of `$foo`
  would then be `/pfs/foo/bar`. With a glob pattern of `/*/*` you would match
  the files contained in `/bar`, and thus the value of `foo` would be
  `/pfs/foo/bar/quux`.
- For each input there will be an environment variable named `input_COMMIT`
  indicating the ID of the commit being used for that input.
In addition to these environment variables, Kubernetes also injects others for
services that are running inside the cluster. These allow you to connect to
those outside services, which can be powerful but also can be hard to reason
about, as processing might be retried multiple times. For example, if your code
writes a row to a database, that row may be written multiple times due to
retries. Interaction with outside services should be [idempotent](https://en.wikipedia.org/wiki/Idempotence) to prevent
unexpected behavior. Furthermore, one of the running services that your code
can connect to is Pachyderm itself. This is generally not recommended as very
little of the Pachyderm API is idempotent, but in some specific cases it can be
a viable approach.