# Pipeline Specification

This document discusses each of the fields present in a pipeline specification.
To see how to use a pipeline spec to create a pipeline, refer to the [pachctl
create pipeline](pachctl/pachctl_create_pipeline.md) section.

## JSON Manifest Format

```json
{
  "pipeline": {
    "name": string
  },
  "description": string,
  "metadata": {
    "annotations": {
        "annotation": string
    },
    "labels": {
        "label": string
    }
  },
  "transform": {
    "image": string,
    "cmd": [ string ],
    "stdin": [ string ],
    "err_cmd": [ string ],
    "err_stdin": [ string ],
    "env": {
        string: string
    },
    "secrets": [ {
        "name": string,
        "mount_path": string
    },
    {
        "name": string,
        "env_var": string,
        "key": string
    } ],
    "image_pull_secrets": [ string ],
    "accept_return_code": [ int ],
    "debug": bool,
    "user": string,
    "working_dir": string,
  },
  "parallelism_spec": {
    // Set at most one of the following:
    "constant": int,
    "coefficient": number
  },
  "hashtree_spec": {
    "constant": int,
  },
  "resource_requests": {
    "memory": string,
    "cpu": number,
    "disk": string,
  },
  "resource_limits": {
    "memory": string,
    "cpu": number,
    "gpu": {
      "type": string,
      "number": int
    },
    "disk": string,
  },
  "sidecar_resource_limits": {
    "memory": string,
    "cpu": number
  },
  "datum_timeout": string,
  "datum_tries": int,
  "job_timeout": string,
  "input": {
    <"pfs", "cross", "union", "cron", or "git" see below>
  },
  "s3_out": bool,
  "output_branch": string,
  "egress": {
    "URL": "s3://bucket/dir"
  },
  "standby": bool,
  "cache_size": string,
  "enable_stats": bool,
  "service": {
    "internal_port": int,
    "external_port": int
  },
  "spout": {
    "overwrite": bool
    // Optionally, you can combine a spout with a service:
    "service": {
      "internal_port": int,
      "external_port": int
    }
  },
  "max_queue_size": int,
  "chunk_spec": {
    "number": int,
    "size_bytes": int
  },
  "scheduling_spec": {
    "node_selector": {string: string},
    "priority_class_name": string
  },
  "pod_spec": string,
  "pod_patch": string,
}

------------------------------------
"pfs" input
------------------------------------

"pfs": {
  "name": string,
  "repo": string,
  "branch": string,
  "glob": string,
  "lazy": bool,
  "empty_files": bool,
  "s3": bool
}

------------------------------------
"cross" or "union" input
------------------------------------

"cross" or "union": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  }
  ...
]

------------------------------------
"cron" input
------------------------------------

"cron": {
    "name": string,
    "spec": string,
    "repo": string,
    "start": time,
    "overwrite": bool
}

------------------------------------
"join" input
------------------------------------

"join": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  }
]

------------------------------------
"git" input
------------------------------------

"git": {
  "URL": string,
  "name": string,
  "branch": string
}

```

In practice, you rarely need to specify all the fields.
Most fields either come with sensible defaults or can be empty.
The following text is an example of a minimum spec:

```json
{
  "pipeline": {
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["/binary", "/pfs/data", "/pfs/out"]
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
```

### Name (required)

`pipeline.name` is the name of the pipeline that you are creating. Each
pipeline must have a unique name. Pipeline names must meet the following
requirements:

- Include only alphanumeric characters, `_`, and `-`.
- Begin and end with alphanumeric characters (not `_` or `-`).
- Not exceed 63 characters in length.

### Description (optional)

`description` is an optional text field where you can add information
about the pipeline.

### Metadata

This parameter enables you to add metadata to your pipeline pods by using Kubernetes `labels` and `annotations`. Labels help you to organize and keep track of your cluster objects by creating groups of pods based on the application they run, the resources they use, or other parameters. Labels simplify the querying of Kubernetes objects and are handy in operations.

Similarly to labels, you can add metadata through annotations. The difference is that you can specify any arbitrary metadata through annotations.

Both parameters require a key-value pair. Do not confuse this parameter with `pod_patch`, which adds metadata to the user container of the pipeline pod. For more information, see [Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) and [Kubernetes Annotations](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/) in the Kubernetes documentation.

### Transform (required)

`transform.image` is the name of the Docker image that your jobs use.

`transform.cmd` is the command passed to the Docker run invocation. Similarly
to Docker, `cmd` is not run inside a shell, which means that
wildcard globbing (`*`), pipes (`|`), and file redirects (`>` and `>>`) do
not work. To use these features, set `cmd` to be a shell of your
choice, such as `sh`, and pass a shell script to `stdin`.

`transform.stdin` is an array of lines that are sent to your command on
`stdin`.
Lines do not have to end in newline characters.
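For example, the following sketch of a `transform` fragment runs a small shell script by setting `cmd` to `sh` and passing the script through `stdin`, which makes globbing and redirects available; the image name and file paths are placeholders, not values from this documentation:

```json
{
  "transform": {
    "image": "alpine:3.12",
    "cmd": ["sh"],
    "stdin": [
      "set -e",
      "cp /pfs/data/*.txt /pfs/out/",
      "wc -l /pfs/data/*.txt > /pfs/out/line_counts.txt"
    ]
  }
}
```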
`transform.err_cmd` is an optional command that is executed on failed datums.
If `err_cmd` runs successfully (returns exit code `0`), the failure does not
prevent the job from succeeding.
This behavior means that `transform.err_cmd` can be used to ignore
failed datums while still writing successful datums to the output repo,
instead of failing the whole job when some datums fail. The `transform.err_cmd`
command has the same limitations as `transform.cmd`.

`transform.err_stdin` is an array of lines that are sent to your error command
on `stdin`.
Lines do not have to end in newline characters.

`transform.env` is a key-value map of environment variables that
Pachyderm injects into the container. There are also environment variables
that are automatically injected into the container, such as:

* `PACH_JOB_ID` – the ID of the current job.
* `PACH_OUTPUT_COMMIT_ID` – the ID of the commit in the output repo for
the current job.
* `<input>_COMMIT` - the ID of the input commit. For example, if your
input is the `images` repo, this will be `images_COMMIT`.

For a complete list of variables and
descriptions see: [Configure Environment Variables](../../deploy-manage/deploy/environment-variables/).

`transform.secrets` is an array of secrets. You can use the secrets to
embed sensitive data, such as credentials. The secrets reference
Kubernetes secrets by name and specify a path to mount the secrets at or
an environment variable (`env_var`) that the value should be bound to. Secrets
must set `name`, which should be the name of a secret in Kubernetes. Secrets
must also specify either `mount_path` or `env_var` and `key`. See more
information about Kubernetes secrets [here](https://kubernetes.io/docs/concepts/configuration/secret/).

`transform.image_pull_secrets` is an array of image pull secrets. Image pull
secrets are similar to secrets, except that they are mounted before the
containers are created, so they can be used to provide credentials for image
pulling. For example, if you are using a private Docker registry for your
images, you can specify it by running the following command:

```shell
kubectl create secret docker-registry myregistrykey \
    --docker-server=DOCKER_REGISTRY_SERVER \
    --docker-username=DOCKER_USER \
    --docker-password=DOCKER_PASSWORD \
    --docker-email=DOCKER_EMAIL
```

And then, notify your pipeline about it by using
`"image_pull_secrets": [ "myregistrykey" ]`. Read more about image pull secrets
[here](https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod).

`transform.accept_return_code` is an array of return codes, such as exit codes
from your Docker command, that are considered acceptable.
If your Docker command exits with one of the codes in this array, it is
considered a successful run for the purpose of setting the job status. `0`
is always considered a successful exit code.

`transform.debug` turns on added debug logging for the pipeline.

`transform.user` sets the user that your code runs as. This can also be
accomplished with a `USER` directive in your `Dockerfile`.

`transform.working_dir` sets the directory that your command runs from. You
can also specify the `WORKDIR` directive in your `Dockerfile`.

`transform.dockerfile` is the path to the `Dockerfile` used with the `--build`
flag. This defaults to `./Dockerfile`.
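To illustrate how several of these fields fit together, here is a hypothetical `transform` sketch; the image, script path, secret name `my-s3-creds`, key, and accepted return code are placeholders and not values defined by this documentation:

```json
{
  "transform": {
    "image": "my-registry.example.com/analysis:1.0",
    "cmd": ["python3", "/code/run.py"],
    "err_cmd": ["sh"],
    "err_stdin": ["echo \"skipping bad datum\" >&2"],
    "env": {
      "LOG_LEVEL": "info"
    },
    "secrets": [
      {
        "name": "my-s3-creds",
        "env_var": "AWS_SECRET_ACCESS_KEY",
        "key": "secret-key"
      }
    ],
    "image_pull_secrets": ["myregistrykey"],
    "accept_return_code": [4]
  }
}
```

In this sketch, a datum that fails the main command is passed to `err_cmd`, which exits 0 and therefore lets the job succeed without that datum's output.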
### Parallelism Spec (optional)

`parallelism_spec` describes how Pachyderm parallelizes your pipeline.
Currently, Pachyderm has two parallelism strategies: `constant` and
`coefficient`.

If you set the `constant` field, Pachyderm starts the number of workers
that you specify. For example, set `"constant": 10` to use 10 workers.

If you set the `coefficient` field, Pachyderm starts a number of workers
that is a multiple of your Kubernetes cluster's size. For example, if your
Kubernetes cluster has 10 nodes, and you set `"coefficient": 0.5`, Pachyderm
starts five workers. If you set it to 2.0, Pachyderm starts 20 workers
(two per Kubernetes node).

The default value is `"constant": 1`.

Because spouts and services are designed to be single instances, do not
modify the default `parallelism_spec` value for these pipelines.

### Resource Requests (optional)

`resource_requests` describes the amount of resources that the pipeline
workers will consume. Knowing this in advance
enables Pachyderm to schedule big jobs on separate machines, so that they
do not conflict, slow down, or terminate.

This parameter is optional, and if you do not explicitly add it in
the pipeline spec, Pachyderm creates Kubernetes containers with the
following default resources:

- The user container requests 0 CPU, 0 disk space, and 64MB of memory.
- The init container requests the same amount of CPU, memory, and disk
space that is set for the user container.
- The storage container requests 0 CPU and the amount of memory set by the
[cache_size](#cache-size-optional) parameter.

The `resource_requests` parameter enables you to overwrite these default
values.

The `memory` field is a string that describes the amount of memory, in bytes,
that each worker needs. Allowed SI suffixes include M, K, G, Mi, Ki, Gi, and
others.

For example, a worker that needs to read a 1GB file into memory might set
`"memory": "1.2G"` with a little extra for the code to use in addition to the
file. Workers for this pipeline will be placed on machines with at least
1.2GB of free memory, and other large workers will be prevented from using it,
if they also set their `resource_requests`.

The `cpu` field is a number that describes the amount of CPU time in `cpu
seconds/real seconds` that each worker needs. Setting `"cpu": 0.5` indicates that
the worker should get 500ms of CPU time per second. Setting `"cpu": 2`
indicates that the worker gets 2000ms of CPU time per second. In other words,
it is using 2 CPUs, though worker threads might spend 500ms on four
physical CPUs instead of one second on two physical CPUs.

The `disk` field is a string that describes the amount of ephemeral disk space,
in bytes, that each worker needs. Allowed SI suffixes include M, K, G, Mi,
Ki, Gi, and others.

In both cases, the resource requests are not upper bounds. If a worker uses
more memory than it requested, it is not necessarily shut down.
However, if the whole node runs out of memory, Kubernetes starts deleting
pods that have been placed on it and exceeded their memory request,
to reclaim memory.
To prevent deletion of your workers, you must set your `memory` request to
a sufficiently large value. However, if the total memory requested by all
workers in the system is too large, Kubernetes cannot schedule new
workers because no machine has enough unclaimed memory. `cpu` works
similarly, but for CPU time.

For more information about resource requests and limits see the
[Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/)
on the subject.
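As a sketch, a pipeline that wants ten workers, each guaranteed 1GB of memory, half a CPU, and 10GB of ephemeral disk, might include a fragment like the following; the numbers are illustrative only:

```json
{
  "parallelism_spec": {
    "constant": 10
  },
  "resource_requests": {
    "memory": "1G",
    "cpu": 0.5,
    "disk": "10G"
  }
}
```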
### Resource Limits (optional)

`resource_limits` describes the upper threshold of allowed resources that a given
worker can consume. If a worker exceeds this value, it will be evicted.

The `gpu` field is a number that describes how many GPUs each worker needs.
Only whole numbers are supported; Kubernetes does not allow multiplexing of
GPUs. Unlike the other resource fields, GPUs only have meaning in limits: by
requesting a GPU, the worker gets sole access to that GPU while it is
running. It is recommended to enable `standby` if you are using GPUs, so that other
processes in the cluster have access to the GPUs while the pipeline has
nothing to process. For more information about scheduling GPUs see the
[Kubernetes docs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
on the subject.

### Sidecar Resource Limits (optional)

`sidecar_resource_limits` determines the upper threshold of resources
allocated to the sidecar containers.

This field can be useful in deployments where Kubernetes automatically
applies resource limits to containers, which might conflict with Pachyderm
pipelines' resource requests. Such a deployment might fail if Pachyderm
requests more than the default Kubernetes limit. The `sidecar_resource_limits`
parameter enables you to explicitly specify these resources to fix the issue.

### Datum Timeout (optional)

`datum_timeout` determines the maximum execution time allowed for each
datum. The value must be a string that represents a time value, such as
`1s`, `5m`, or `15h`. This parameter takes precedence over the parallelism
or number of datums; therefore, no single datum is allowed to exceed
this value. By default, `datum_timeout` is not set, and the datum continues to
be processed for as long as needed.

### Datum Tries (optional)

`datum_tries` is an integer, such as `1`, `2`, or `3`, that determines the
number of times a job attempts to run on a datum when a failure occurs.
Setting `datum_tries` to `1` attempts a job once with no retries.
Only failed datums are retried in a retry attempt. If the operation succeeds
in retry attempts, then the job is marked as successful. Otherwise, the job
is marked as failed.

### Job Timeout (optional)

`job_timeout` determines the maximum execution time allowed for a job. It
differs from `datum_timeout` in that the limit is applied across all
workers and all datums. This is the *wall time*, which means that if
you set `job_timeout` to one hour and the job does not finish the work
in one hour, it will be interrupted.
When you set this value, you need to
consider the parallelism, total number of datums, and execution time per
datum. The value must be a string that represents a time value, such as
`1s`, `5m`, or `15h`. In addition, the number of datums might change over
jobs. Some new commits might have more files, and therefore, more datums.
Similarly, other commits might have fewer files and datums. If this
parameter is not set, the job will run indefinitely until it succeeds or fails.
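The following hypothetical fragment puts these three settings together: each datum may run for up to ten minutes, a failing datum is attempted up to three times, and the whole job is interrupted after twelve hours. The specific values are illustrative only:

```json
{
  "datum_timeout": "10m",
  "datum_tries": 3,
  "job_timeout": "12h"
}
```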
### S3 Output Repository

`s3_out` allows your pipeline code to write results out to an S3 gateway
endpoint instead of the typical `pfs/out` directory. When this parameter
is set to `true`, Pachyderm includes a sidecar S3 gateway instance
container in the same pod as the pipeline container. The address of the
output repository will be `s3://<output_repo>`. If you enable `s3_out`,
verify that the `enable_stats` parameter is disabled.

If you want to expose an input repository through an S3 gateway, see
`input.pfs.s3` in [PFS Input](#pfs-input).

!!! note "See Also:"
    [Environment Variables](../../deploy-manage/deploy/environment-variables/)

### Input (required)

`input` specifies the repos that will be visible to the jobs during runtime.
Commits to these repos automatically trigger the pipeline to create new
jobs to process them. Input is a recursive type: there are multiple
kinds of inputs, which can be combined together. The `input` object is a
container for the different input types with a field for each; only one of
these fields may be set for any instantiation of the object.

```
{
    "pfs": pfs_input,
    "union": union_input,
    "cross": cross_input,
    "cron": cron_input,
    "join": join_input,
    "git": git_input
}
```

#### PFS Input

PFS inputs are the simplest inputs: they take input from a single branch on a
single repo.

```
{
    "name": string,
    "repo": string,
    "branch": string,
    "glob": string,
    "lazy": bool,
    "empty_files": bool,
    "s3": bool
}
```

`input.pfs.name` is the name of the input. An input with the name `XXX` is
visible under the path `/pfs/XXX` when a job runs. Input names must be unique
if the inputs are crossed, but they may be duplicated between `PFSInput`s that
are combined by using the `union` operator. This is because when
`PFSInput`s are combined, you only ever see a datum from one input
at a time. Overlapping the names of combined inputs allows
you to write simpler code since you no longer need to consider which
input directory a particular datum comes from. If an input's name is not
specified, it defaults to the name of the repo. Therefore, if you have two
crossed inputs from the same repo, you must give at least one of them a
unique name.

`input.pfs.repo` is the name of the Pachyderm repository with the data that
you want to join with other data.

`input.pfs.branch` is the `branch` to watch for commits. If left blank,
Pachyderm sets this value to `master`.

`input.pfs.glob` is a glob pattern that is used to determine how the
input data is partitioned.

`input.pfs.lazy` controls how the data is exposed to jobs. The default is
`false`, which means the job eagerly downloads the data it needs to process and
exposes it as normal files on disk. If `lazy` is set to `true`, data is
exposed as named pipes instead, and no data is downloaded until the job
opens the pipe and reads it. If the pipe is never opened, then no data is
downloaded.

Some applications do not work with pipes. For example, pipes do not support
applications that make `syscalls` such as `Seek`. Applications that can work
with pipes should use them because they are more performant. The difference is
especially notable if the job only reads a subset of the files that are
available to it.

!!! note
    `lazy` does not support datums that
    contain more than 10000 files.
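As a sketch, the following hypothetical PFS input reads the `master` branch of a repo named `images`, splits it into one datum per top-level file system object, and exposes the files lazily; the repo name is a placeholder:

```json
{
  "input": {
    "pfs": {
      "name": "images",
      "repo": "images",
      "branch": "master",
      "glob": "/*",
      "lazy": true
    }
  }
}
```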
`input.pfs.empty_files` controls how files are exposed to jobs. If
set to `true`, it causes files from this PFS input to be presented as empty files.
This is useful in shuffle pipelines where you want to read the names of
files and reorganize them by using symlinks.

`input.pfs.s3` sets whether the pipeline worker pod
should include a sidecar S3 gateway instance. This option enables an S3 gateway
to serve on a pipeline-level basis and, therefore, ensures provenance tracking
for pipelines that integrate with external systems, such as Kubeflow. When
this option is set to `true`, Pachyderm deploys an S3 gateway instance
alongside the pipeline container and creates an S3 bucket for the pipeline
input repo. The address of the
input repository will be `s3://<input_repo>`. When you enable this
parameter, you cannot use glob patterns. All files will be processed
as one datum.

Another limitation for S3-enabled pipelines is that you can only use
either a single input or a cross input. Join and union inputs are not
supported.

If you want to expose an output repository through an S3
gateway, see [S3 Output Repository](#s3-output-repository).

#### Union Input

Union inputs take the union of other inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/` and being the only file system objects in these
repositories.

```
| inputA | inputB | inputA ∪ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | foo             |
| bar    | buzz   | fizz            |
|        |        | bar             |
|        |        | buzz            |
```

The union inputs do not take a name and maintain the names of the
sub-inputs. In the example above, you would see files under
`/pfs/inputA/...` or `/pfs/inputB/...`, but never both at the same time.
When you write code to address this behavior, make sure that
your code first determines which input directory is present. Starting
with Pachyderm 1.5.3, we recommend that you give your inputs the
same `Name`. That way your code only needs to handle data being present
in that directory. This only works if your code does not need to be
aware of which of the underlying inputs the data comes from.

`input.union` is an array of inputs to combine. The inputs do not have to be
`pfs` inputs. They can also be `union` and `cross` inputs. However, there is
no reason to take a union of unions because union is associative.

#### Cross Input

Cross inputs create the cross product of other inputs. In other words,
a cross input creates tuples of the datums in the inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/` and being the only file system objects in these
repositories.
```
| inputA | inputB | inputA ⨯ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | (foo, fizz)     |
| bar    | buzz   | (foo, buzz)     |
|        |        | (bar, fizz)     |
|        |        | (bar, buzz)     |
```

The cross inputs above do not take a name and maintain
the names of the sub-inputs.
In the example above, you would see files under `/pfs/inputA/...`
and `/pfs/inputB/...`.

`input.cross` is an array of inputs to cross.
The inputs do not have to be `pfs` inputs. They can also be
`union` and `cross` inputs. However, there is
no reason to take a cross of crosses because cross is associative.

#### Cron Input

Cron inputs allow you to trigger pipelines based on time. A Cron input is
based on the Unix utility called `cron`. When you create a pipeline with
one or more Cron inputs, `pachd` creates a repo for each of them. The start
time for a Cron input is specified in its spec.
When a Cron input triggers,
`pachd` commits a single file, named by the current [RFC
3339 timestamp](https://www.ietf.org/rfc/rfc3339.txt), to the repo; the file
contains the time that satisfied the spec.

```
{
    "name": string,
    "spec": string,
    "repo": string,
    "start": time,
    "overwrite": bool
}
```

`input.cron.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`, except that it is not optional.

`input.cron.spec` is a cron expression that specifies the schedule on
which to trigger the pipeline. To learn more about how to write schedules,
see the [Wikipedia page on cron](https://en.wikipedia.org/wiki/Cron).
Pachyderm supports non-standard schedules, such as `"@daily"`.

`input.cron.repo` is the repo which Pachyderm creates for the input. This
parameter is optional. If you do not specify this parameter, then
`"<pipeline-name>_<input-name>"` is used by default.

`input.cron.start` is the time to start counting from for the input. This
parameter is optional. If you do not specify this parameter, then the
time when the pipeline was created is used by default. Specifying a
time enables you to run on matching times from the past or skip times
from the present and only start running
on matching times in the future. Format the time value according to [RFC
3339](https://www.ietf.org/rfc/rfc3339.txt).

`input.cron.overwrite` is a flag that specifies whether you want the timestamp file
to be overwritten on each tick. This parameter is optional, and if you do not
specify it, it defaults to simply writing new files on each tick. By default,
when `"overwrite"` is disabled, ticks accumulate in the cron input repo. When
`"overwrite"` is enabled, Pachyderm erases the old ticks and adds new ticks
with each commit. If you do not add any manual ticks or run
`pachctl run cron`, only one tick file per commit (for the latest tick)
is added to the input repo.

#### Join Input

A join input enables you to join files that are stored in separate
Pachyderm repositories and that match a configured glob
pattern. A join input must have the `glob` and `join_on` parameters configured
to work properly. A join can combine multiple PFS inputs.

You can specify the following parameters for the `join` input.

* `input.pfs.name` — the name of the PFS input that appears in the
`INPUT` field when you run the `pachctl list job` command.
  If an input name is not specified, it defaults to the name of the repo.

* `input.pfs.repo` — see the description in [PFS Input](#pfs-input):
the name of the Pachyderm repository with the data that
you want to join with other data.

* `input.pfs.branch` — see the description in [PFS Input](#pfs-input).

* `input.pfs.glob` — a wildcard pattern that defines how a dataset is broken
up into datums for further processing. When you use a glob pattern in joins,
it creates a naming convention that Pachyderm uses to join files. In other
words, Pachyderm joins the files that are named according to the glob
pattern and skips those that are not.

    You can specify the glob pattern for joins in parentheses to create
one or multiple capture groups. A capture group can include one or multiple
characters. Use standard UNIX globbing characters to create capture
groups, including the following:

    * `?` — matches a single character in a filepath. For example, you
have files named `file000.txt`, `file001.txt`, `file002.txt`, and so on.
You can set the glob pattern to `/file(?)(?)(?)` and the `join_on` key to
`$2`, so that Pachyderm matches only the files that have the same second
character.

    * `*` — matches any number of characters in the filepath. For example,
if you set your capture group to `/(*)`, Pachyderm matches all files in the
root directory.

    If you do not specify a correct `glob` pattern, Pachyderm performs the
`cross` input operation instead of `join`.

* `input.pfs.lazy` — see the description in [PFS Input](#pfs-input).
* `input.pfs.empty_files` — see the description in [PFS Input](#pfs-input).

#### Git Input (alpha feature)

Git inputs allow you to pull code from a public git URL and execute that code as part of your pipeline. A pipeline with a Git input will get triggered (that is, it will see a new input commit and will spawn a job) whenever you commit to your git repository.

**Note:** This only works on cloud deployments, not local clusters.

`input.git.URL` must be a URL of the form: `https://github.com/foo/bar.git`

`input.git.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`, and it is optional.

`input.git.branch` is the name of the git branch to use as input.

Git inputs also require some additional configuration. In order for new commits on your git repository to correspond to new commits on the Pachyderm Git input repo, you need to set up a git webhook. At the moment, only GitHub is supported. (Though if you ask nicely, we can add support for GitLab or BitBucket.)

1. Create your Pachyderm pipeline with the Git input.

2. To get the URL of the webhook to your cluster, run `pachctl inspect pipeline` on your pipeline. You should see a `Githook URL` field with a URL set. Note: this only works if you have deployed to a cloud provider (for example, AWS or GKE). If you see `pending` as the value (and you have deployed on a cloud provider), it is possible that the service is still being provisioned. You can check `kubectl get svc` to make sure you see the `githook` service running.

3. To set up the GitHub webhook, navigate to:

    ```
    https://github.com/<your_org>/<your_repo>/settings/hooks/new
    ```

    Or navigate to webhooks under settings. Then copy the `Githook URL` into the 'Payload URL' field.
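To recap, a Git input in the pipeline spec might look like the following sketch; the repository URL reuses the placeholder form shown above, and the `name` and `branch` values are illustrative:

```json
{
  "input": {
    "git": {
      "URL": "https://github.com/foo/bar.git",
      "name": "bar",
      "branch": "master"
    }
  }
}
```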
### Output Branch (optional)

This is the branch where the pipeline outputs new commits. By default,
it is `master`.

### Egress (optional)

`egress` allows you to push the results of a pipeline to an external data
store such as S3, Google Cloud Storage, or Azure Storage. Data will be pushed
after the user code has finished running but before the job is marked as
successful.

For more information, see [Exporting Data by using egress](../../how-tos/export-data-out-pachyderm/#export-your-data-with-egress).

### Standby (optional)

`standby` indicates that the pipeline should be put into "standby" when there is
no data for it to process. A pipeline in standby has no pods running and
thus consumes no resources; its state is displayed as `standby`.

Standby replaces `scale_down_threshold` from releases prior to 1.7.1.

### Cache Size (optional)

`cache_size` controls how much cache a pipeline's sidecar containers use. In
general, your pipeline's performance will increase with the cache size, but
only up to a certain point depending on your workload.

Every worker in every pipeline has a limited-functionality `pachd` server
running adjacent to it, which proxies PFS reads and writes (this prevents
thundering herds when jobs start and end, which is when all of a pipeline's
workers are reading from and writing to PFS simultaneously). Part of what these
"sidecar" pachd servers do is cache PFS reads. If a pipeline has a cross input,
and a worker is downloading the same datum from one branch of the input
repeatedly, then the cache can speed up processing significantly.

### Enable Stats (optional)

The `enable_stats` parameter turns on statistics tracking for the pipeline.
When you enable statistics tracking, the pipeline automatically creates
and commits datum processing information to a special branch in its output
repo called `"stats"`. This branch stores information about each datum that
the pipeline processes, including timing information, size information, logs,
and `/pfs` snapshots. You can view these statistics by running the `pachctl
inspect datum` and `pachctl list datum` commands, as well as through the web UI.
Do not enable statistics tracking for S3-enabled pipelines.

Once turned on, statistics tracking cannot be disabled on a running pipeline.
You can turn it off by deleting the pipeline, setting `enable_stats` to `false` or
completely removing it from your pipeline spec, and recreating the pipeline from
that updated spec file. While the pipeline that collects the stats
exists, the storage space used by the stats cannot be released.

!!! note
    Enabling stats results in a slight increase in storage use for logs and
    timing information.
    However, stats do not use as much extra storage as it might appear because
    snapshots of the `/pfs` directory, which are the largest stored assets,
    do not require extra space.

### Service (alpha feature, optional)

`service` specifies that the pipeline should be treated as a long-running
service rather than a data transformation. This means that `transform.cmd` is
not expected to exit; if it does, it will be restarted. Furthermore, the service
is exposed outside the container using a Kubernetes service.
`"internal_port"` should be a port that the user code binds to inside the
container; `"external_port"` is the port on which it is exposed through the
`NodePort` functionality of Kubernetes services. After a service has been
created, you should be able to access it at
`http://<kubernetes-host>:<external_port>`.
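For instance, a hypothetical dashboard pipeline that listens on port 8080 inside the container and is exposed on node port 30080 might declare the following fragment; both port numbers are placeholders:

```json
{
  "service": {
    "internal_port": 8080,
    "external_port": 30080
  }
}
```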
### Spout (optional)

`spout` is a type of pipeline that processes streaming data.
Unlike a union or cross pipeline, a spout pipeline does not have
a PFS input. Instead, it opens a Linux *named pipe* into the source of the
streaming data. Your pipeline
can be either a spout or a service, not both. Therefore, if you added
`service` as a top-level object in your pipeline, you cannot add `spout`.
However, you can expose a service from inside of a spout pipeline by
specifying it as a field in the `spout` spec. Then, Kubernetes creates
a service endpoint that you can expose externally. You can get the information
about the service by running `kubectl get services`.

For more information, see [Spouts](../concepts/pipeline-concepts/pipeline/spout.md).

### Max Queue Size (optional)
`max_queue_size` specifies the maximum number of datums that a worker should
hold in its processing queue at a given time (after processing its entire
queue, a worker "checkpoints" its progress by writing to persistent storage).
The default value is `1`, which means that workers hold only the datum that
they are currently processing.

Increasing this value can improve pipeline performance because it allows workers
to download, process, and upload different datums at the same time
(and reduces the total time spent on checkpointing). Decreasing this value
can make jobs more robust to failed workers because work gets checkpointed more
often, and a failing worker does not lose as much progress. Setting this value
too high can also cause problems if you have `lazy` inputs, as there is a cap of
10,000 `lazy` files per worker, and all the datums that are running count
against this limit.

### Chunk Spec (optional)
`chunk_spec` specifies how a pipeline should chunk its datums.
A chunk is the unit of work that workers claim. Each worker claims one or more
datums, and it commits a full chunk once it has finished processing it.

`chunk_spec.number`, if nonzero, specifies that each chunk should contain `number`
datums. Chunks may contain fewer datums if the total number of datums does not
divide evenly. If you lower the chunk number to `1`, progress is recorded after
every datum; the cost is extra load on etcd, which can slow other things down.
The default value is `2`.

`chunk_spec.size_bytes`, if nonzero, specifies a target size for each chunk of datums.
Chunks may be larger or smaller than `size_bytes`, but will usually be
pretty close to `size_bytes` in size.

### Scheduling Spec (optional)
`scheduling_spec` specifies how the pods for a pipeline should be scheduled.

`scheduling_spec.node_selector` allows you to select which nodes your pipeline
will run on. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
on node selectors for more information about how this works.

`scheduling_spec.priority_class_name` allows you to select the priority class
for the pipeline, which affects how Kubernetes chooses to schedule and deschedule
the pipeline. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass)
on priority and preemption for more information about how this works.
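As a sketch, a pipeline that should run only on nodes labeled for GPU work and at a custom priority might include something like the following; the `gpu=true` node label and the `high-priority` class name are placeholders that must already exist in your cluster:

```json
{
  "scheduling_spec": {
    "node_selector": {
      "gpu": "true"
    },
    "priority_class_name": "high-priority"
  }
}
```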
### Pod Spec (optional)
`pod_spec` is an advanced option that allows you to set fields in the pod spec
that haven't been explicitly exposed in the rest of the pipeline spec. A good
way to figure out what JSON you should pass is to create a pod in Kubernetes
with the proper settings, and then run:

```
kubectl get po/<pod-name> -o json | jq .spec
```

This gives you a correctly formatted piece of JSON; you should then remove
the extraneous fields that Kubernetes injects or that can be set elsewhere.

The JSON is applied after the other parameters for the `pod_spec` have already
been set, as a [JSON Merge Patch](https://tools.ietf.org/html/rfc7386). This
means that you can modify things such as the storage and user containers.

### Pod Patch (optional)
`pod_patch` is similar to `pod_spec` above but is applied as a [JSON
Patch](https://tools.ietf.org/html/rfc6902). Note that this means the
process outlined above of modifying an existing pod spec and then manually
blanking unchanged fields does not work; you need to create a correctly
formatted patch by diffing the two pod specs.

## The Input Glob Pattern

Each PFS input needs to specify a [glob pattern](../../concepts/pipeline-concepts/datum/glob-pattern/).

Pachyderm uses the glob pattern to determine how many "datums" an input
consists of. Datums are the unit of parallelism in Pachyderm. That is,
Pachyderm attempts to process datums in parallel whenever possible.

Intuitively, you may think of the input repo as a file system, and you are
applying the glob pattern to the root of the file system. The files and
directories that match the glob pattern are considered datums.

For instance, let's say your input repo has the following structure:

```
/foo-1
/foo-2
/bar
  /bar-1
  /bar-2
```

Now let's consider what the following glob patterns would match respectively:

* `/`: this pattern matches `/`, the root directory itself, meaning all the data would be a single large datum.
* `/*`: this pattern matches everything under the root directory, giving us 3 datums:
`/foo-1`, `/foo-2`, and everything under the directory `/bar`.
* `/bar/*`: this pattern matches files only under the `/bar` directory: `/bar-1` and `/bar-2`.
* `/foo*`: this pattern matches files under the root directory that start with the characters `foo`.
* `/*/*`: this pattern matches everything that's two levels deep relative
to the root: `/bar/bar-1` and `/bar/bar-2`.

The datums are defined as whichever files or directories match the glob pattern. For instance, if we used
`/*`, then the job will process three datums (potentially in parallel):
`/foo-1`, `/foo-2`, and `/bar`. Both the `bar-1` and `bar-2` files within the directory `bar` would be grouped together and always processed by the same worker.

## PPS Mounts and File Access

### Mount Paths

The root mount point is at `/pfs`, which contains:

- `/pfs/input_name` which is where you would find the datum.
    - Each input will be found here by its name, which defaults to the repo
      name if not specified.
- `/pfs/out` which is where you write any output.