
     1  # Pipeline Specification
     2  
     3  This document discusses each of the fields present in a pipeline specification.
     4  To see how to use a pipeline spec to create a pipeline, refer to the [pachctl
     5  create pipeline](pachctl/pachctl_create_pipeline.md) section.
     6  
     7  ## JSON Manifest Format
     8  
     9  ```json
    10  {
    11    "pipeline": {
    12      "name": string
    13    },
    14    "description": string,
    15    "metadata": {
    16      "annotations": {
    17          "annotation": string
    18      },
    19      "labels": {
    20          "label": string
    21      }
    22    },
    23    "transform": {
    24      "image": string,
    25      "cmd": [ string ],
    26      "stdin": [ string ],
    27      "err_cmd": [ string ],
    28      "err_stdin": [ string ],
    29      "env": {
    30          string: string
    31      },
    32      "secrets": [ {
    33          "name": string,
    34          "mount_path": string
    35      },
    36      {
    37          "name": string,
    38          "env_var": string,
    39          "key": string
    40      } ],
    41      "image_pull_secrets": [ string ],
    42      "accept_return_code": [ int ],
    43      "debug": bool,
    44      "user": string,
    45      "working_dir": string,
    46    },
    47    "parallelism_spec": {
    48      // Set at most one of the following:
    49      "constant": int,
    50      "coefficient": number
    51    },
    52    "hashtree_spec": {
    53     "constant": int,
    54    },
    55    "resource_requests": {
    56      "memory": string,
    57      "cpu": number,
    58      "disk": string,
    59    },
    60    "resource_limits": {
    61      "memory": string,
    62      "cpu": number,
    63      "gpu": {
    64        "type": string,
    65        "number": int
    },
    "disk": string,
    68    },
    69    "sidecar_resource_limits": {
    70      "memory": string,
    71      "cpu": number
    72    },
    73    "datum_timeout": string,
    74    "datum_tries": int,
    75    "job_timeout": string,
    76    "input": {
    <"pfs", "cross", "union", "join", "cron", or "git" input; see below>
    78    },
    79    "s3_out": bool,
    80    "output_branch": string,
    81    "egress": {
    82      "URL": "s3://bucket/dir"
    83    },
    84    "standby": bool,
    85    "cache_size": string,
    86    "enable_stats": bool,
    87    "service": {
    88      "internal_port": int,
    89      "external_port": int
    90    },
  "spout": {
    "overwrite": bool,
    // Optionally, you can combine a spout with a service:
    "service": {
      "internal_port": int,
      "external_port": int
    }
  },
    99    "max_queue_size": int,
   100    "chunk_spec": {
   101      "number": int,
   102      "size_bytes": int
   103    },
   104    "scheduling_spec": {
   105      "node_selector": {string: string},
   106      "priority_class_name": string
   107    },
   108    "pod_spec": string,
   109    "pod_patch": string,
   110  }
   111  
   112  ------------------------------------
   113  "pfs" input
   114  ------------------------------------
   115  
   116  "pfs": {
   117    "name": string,
   118    "repo": string,
   119    "branch": string,
   120    "glob": string,
  "lazy": bool,
   122    "empty_files": bool,
   123    "s3": bool
   124  }
   125  
   126  ------------------------------------
   127  "cross" or "union" input
   128  ------------------------------------
   129  
   130  "cross" or "union": [
   131    {
   132      "pfs": {
   133        "name": string,
   134        "repo": string,
   135        "branch": string,
   136        "glob": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
   140      }
   141    },
   142    {
   143      "pfs": {
   144        "name": string,
   145        "repo": string,
   146        "branch": string,
   147        "glob": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
   151      }
   152    }
   153    ...
   154  ]
   155  
   156  
   157  
   158  ------------------------------------
   159  "cron" input
   160  ------------------------------------
   161  
   162  "cron": {
   163      "name": string,
   164      "spec": string,
   165      "repo": string,
   166      "start": time,
   167      "overwrite": bool
   168  }
   169  
   170  ------------------------------------
   171  "join" input
   172  ------------------------------------
   173  
   174  "join": [
   175    {
   176      "pfs": {
   177        "name": string,
   178        "repo": string,
   179        "branch": string,
   180        "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
   185      }
   186    },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool,
      "s3": bool
    }
  }
   199  ]
   200  
   201  ------------------------------------
   202  "git" input
   203  ------------------------------------
   204  
   205  "git": {
   206    "URL": string,
   207    "name": string,
   208    "branch": string
   209  }
   210  
   211  ```
   212  
   213  In practice, you rarely need to specify all the fields.
   214  Most fields either come with sensible defaults or can be empty.
The following text is an example of a minimal pipeline spec:
   216  
   217  ```json
   218  {
   219    "pipeline": {
   220      "name": "wordcount"
   221    },
   222    "transform": {
   223      "image": "wordcount-image",
   224      "cmd": ["/binary", "/pfs/data", "/pfs/out"]
   225    },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
   233  ```
   234  
   235  ### Name (required)
   236  
   237  `pipeline.name` is the name of the pipeline that you are creating. Each
   238  pipeline needs to have a unique name. Pipeline names must meet the following
   239  requirements:
   240  
   241  - Include only alphanumeric characters, `_` and `-`.
   242  - Begin or end with only alphanumeric characters (not `_` or `-`).
   243  - Not exceed 63 characters in length.
   244  
   245  ### Description (optional)
   246  
   247  `description` is an optional text field where you can add information
   248  about the pipeline.
   249  
   250  ### Metadata
   251  
   252  This parameter enables you to add metadata to your pipeline pods by using Kubernetes' `labels` and `annotations`. Labels help you to organize and keep track of your cluster objects by creating groups of pods based on the application they run, resources they use, or other parameters. Labels simplify the querying of Kubernetes objects and are handy in operations.
   253  
Similar to labels, you can add metadata through annotations. The difference is that annotations can hold arbitrary metadata.
   255  
Both parameters require key-value pairs. Do not confuse this parameter with `pod_patch`, which adds metadata to the user container of the pipeline pod. For more information, see [Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) and [Kubernetes Annotations](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/) in the Kubernetes documentation.
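
For illustration, a minimal sketch of a `metadata` block follows; the `owner` annotation and `project` label are hypothetical keys chosen for this example:

```json
"metadata": {
  "annotations": {
    "owner": "data-science-team"
  },
  "labels": {
    "project": "wordcount"
  }
}
```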
   257  
   258  ### Transform (required)
   259  
   260  `transform.image` is the name of the Docker image that your jobs use.
   261  
`transform.cmd` is the command passed to the Docker run invocation. As with
Docker, `cmd` is not run inside a shell, which means that
wildcard globbing (`*`), pipes (`|`), and file redirects (`>` and `>>`) do
not work. To use these features, set `cmd` to be a shell of your
choice, such as `sh`, and pass a shell script to `stdin`.
   267  
   268  `transform.stdin` is an array of lines that are sent to your command on
   269  `stdin`.
   270  Lines do not have to end in newline characters.
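
For example, here is a sketch of a `transform` that wraps its work in a shell so that globbing and redirects are available; the image name and file paths are illustrative:

```json
"transform": {
  "image": "ubuntu:18.04",
  "cmd": ["sh"],
  "stdin": [
    "wc -w /pfs/data/*.txt > /pfs/out/word_counts.txt"
  ]
}
```

Because the command is `sh`, the wildcard and the redirect in `stdin` are interpreted by the shell rather than passed literally to your binary.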
   271  
`transform.err_cmd` is an optional command that is executed on failed datums.
If `err_cmd` succeeds and returns a `0` exit code, the failed datum does not
prevent the job from succeeding.
This behavior means that `transform.err_cmd` can be used to ignore
failed datums while still writing successful datums to the output repo,
instead of failing the whole job when some datums fail. The `transform.err_cmd`
command has the same limitations as `transform.cmd`.
   279  
   280  `transform.err_stdin` is an array of lines that are sent to your error command
   281  on `stdin`.
   282  Lines do not have to end in newline characters.
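
As a sketch, one way to skip failed datums instead of failing the whole job is to set `err_cmd` to `true`, the Unix no-op command that always exits with `0`; the Python entrypoint shown is hypothetical and this pattern only makes sense if skipping datums is acceptable for your workload:

```json
"transform": {
  "cmd": ["python3", "/code/process.py"],
  "err_cmd": ["true"]
}
```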
   283  
   284  `transform.env` is a key-value map of environment variables that
   285  Pachyderm injects into the container. There are also environment variables
   286  that are automatically injected into the container, such as:
   287  
   288  * `PACH_JOB_ID` – the ID of the current job.
   289  * `PACH_OUTPUT_COMMIT_ID` – the ID of the commit in the output repo for 
   290  the current job.
   291  * `<input>_COMMIT` - the ID of the input commit. For example, if your
   292  input is the `images` repo, this will be `images_COMMIT`.
   293  
   294  For a complete list of variables and
   295  descriptions see: [Configure Environment Variables](../../deploy-manage/deploy/environment-variables/).
   296  
`transform.secrets` is an array of secrets. You can use the secrets to
embed sensitive data, such as credentials. Each secret references a
Kubernetes secret by name and specifies either a path where the secret is
mounted or an environment variable (`env_var`) that the value is bound to.
Each entry must set `name`, which must be the name of a secret in Kubernetes,
and must also specify either `mount_path` or both `env_var` and `key`. See more
information about Kubernetes secrets [here](https://kubernetes.io/docs/concepts/configuration/secret/).
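
A sketch of both styles, assuming Kubernetes secrets named `s3-credentials` and `api-token` already exist in the cluster:

```json
"transform": {
  "secrets": [
    {
      "name": "s3-credentials",
      "mount_path": "/secrets/s3"
    },
    {
      "name": "api-token",
      "env_var": "API_TOKEN",
      "key": "token"
    }
  ]
}
```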
   304  
`transform.image_pull_secrets` is an array of image pull secrets. Image pull
secrets are similar to secrets, except that they are mounted before the
containers are created, so they can be used to provide credentials for pulling
images. For example, if you are using a private Docker registry for your
images, you can specify it by running the following command:
   310  
   311  ```shell
   312  kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
   313  ```
   314  
   315  And then, notify your pipeline about it by using
   316  `"image_pull_secrets": [ "myregistrykey" ]`. Read more about image pull secrets
   317  [here](https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod).
   318  
`transform.accept_return_code` is an array of return codes (exit codes) from
your Docker command that are considered acceptable.
If your Docker command exits with one of the codes in this array, the run is
treated as successful for the purpose of setting job status. `0`
is always considered a successful exit code.
   324  
   325  `transform.debug` turns on added debug logging for the pipeline.
   326  
`transform.user` sets the user that your code runs as. This can also be
accomplished with a `USER` directive in your `Dockerfile`.
   329  
   330  `transform.working_dir` sets the directory that your command runs from. You
   331  can also specify the `WORKDIR` directive in your `Dockerfile`.
   332  
   333  `transform.dockerfile` is the path to the `Dockerfile` used with the `--build`
   334  flag. This defaults to `./Dockerfile`.
   335  
   336  ### Parallelism Spec (optional)
   337  
   338  `parallelism_spec` describes how Pachyderm parallelizes your pipeline.
   339  Currently, Pachyderm has two parallelism strategies: `constant` and
   340  `coefficient`.
   341  
   342  If you set the `constant` field, Pachyderm starts the number of workers
   343  that you specify. For example, set `"constant":10` to use 10 workers.
   344  
   345  If you set the `coefficient` field, Pachyderm starts a number of workers
   346  that is a multiple of your Kubernetes cluster’s size. For example, if your
   347  Kubernetes cluster has 10 nodes, and you set `"coefficient": 0.5`, Pachyderm
   348  starts five workers. If you set it to 2.0, Pachyderm starts 20 workers
   349  (two per Kubernetes node).
   350  
The default value is `"constant": 1`.
   352  
   353  Because spouts and services are designed to be single instances, do not
modify the default `parallelism_spec` value for these pipelines.
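
For example, to pin a transformation pipeline at ten workers, a spec might include the following; use `coefficient` instead if you want the worker count to track cluster size:

```json
"parallelism_spec": {
  "constant": 10
}
```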
   355  
   356  ### Resource Requests (optional)
   357  
   358  `resource_requests` describes the amount of resources that the pipeline
   359  workers will consume. Knowing this in advance
   360  enables Pachyderm to schedule big jobs on separate machines, so that they
   361  do not conflict, slow down, or terminate.
   362  
   363  This parameter is optional, and if you do not explicitly add it in
   364  the pipeline spec, Pachyderm creates Kubernetes containers with the
   365  following default resources: 
   366  
   367  - The user container requests 0 CPU, 0 disk space, and 64MB of memory. 
   368  - The init container requests the same amount of CPU, memory, and disk
   369  space that is set for the user container.
   370  - The storage container requests 0 CPU and the amount of memory set by the
   371  [cache_size](#cache-size-optional) parameter.
   372  
   373  The `resource_requests` parameter enables you to overwrite these default
   374  values.
   375  
   376  The `memory` field is a string that describes the amount of memory, in bytes,
that each worker needs. Allowed suffixes include M, K, G, Mi, Ki, Gi, and
others.
   379  
   380  For example, a worker that needs to read a 1GB file into memory might set
   381  `"memory": "1.2G"` with a little extra for the code to use in addition to the
   382  file. Workers for this pipeline will be placed on machines with at least
   383  1.2GB of free memory, and other large workers will be prevented from using it,
   384  if they also set their `resource_requests`.
   385  
   386  The `cpu` field is a number that describes the amount of CPU time in `cpu
   387  seconds/real seconds` that each worker needs. Setting `"cpu": 0.5` indicates that
   388  the worker should get 500ms of CPU time per second. Setting `"cpu": 2`
   389  indicates that the worker gets 2000ms of CPU time per second. In other words,
   390  it is using 2 CPUs, though worker threads might spend 500ms on four
   391  physical CPUs instead of one second on two physical CPUs.
   392  
   393  The `disk` field is a string that describes the amount of ephemeral disk space,
in bytes, that each worker needs. Allowed suffixes include M, K, G, Mi,
Ki, Gi, and others.
   396  
In both cases, the resource requests are not upper bounds. A worker that uses
more memory than it requested is not necessarily shut down. However, if the
whole node runs out of memory, Kubernetes starts evicting pods that have
exceeded their memory requests in order to reclaim memory.
To prevent your worker pods from being evicted, set your `memory` request to
a sufficiently large value. However, if the total memory requested by all
workers in the system is too large, Kubernetes cannot schedule new
workers because no machine has enough unclaimed memory. `cpu` works
similarly, but for CPU time.
   407  
   408  For more information about resource requests and limits see the
   409  [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/)
   410  on the subject.
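
As a sketch, a pipeline that reads roughly 1GB files into memory might request the following; the exact numbers depend on your workload and cluster:

```json
"resource_requests": {
  "memory": "1.2G",
  "cpu": 0.5,
  "disk": "10G"
}
```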
   411  
   412  ### Resource Limits (optional)
   413  
   414  `resource_limits` describes the upper threshold of allowed resources a given
   415  worker can consume. If a worker exceeds this value, it will be evicted.
   416  
The `gpu` field is a number that describes how many GPUs each worker needs.
Only whole numbers are supported; Kubernetes does not allow multiplexing of
GPUs. Unlike the other resource fields, GPUs only have meaning in limits: by
requesting a GPU, the worker gets sole access to that GPU while it is
running. It is recommended to enable `standby` if you are using GPUs so that
other processes in the cluster can access the GPUs while the pipeline has
nothing to process. For more information about scheduling GPUs see the
   424  [Kubernetes docs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
   425  on the subject.
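
A sketch of a limits block for a GPU pipeline; the GPU `type` shown is the standard NVIDIA resource name in Kubernetes, but your cluster may expose a different one:

```json
"resource_limits": {
  "memory": "2G",
  "cpu": 1,
  "gpu": {
    "type": "nvidia.com/gpu",
    "number": 1
  }
}
```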
   426  
   427  ### Sidecar Resource Limits (optional)
   428  
   429  `sidecar_resource_limits` determines the upper threshold of resources
   430  allocated to the sidecar containers.
   431  
   432  This field can be useful in deployments where Kubernetes automatically
   433  applies resource limits to containers, which might conflict with Pachyderm
   434  pipelines' resource requests. Such a deployment might fail if Pachyderm
   435  requests more than the default Kubernetes limit. The `sidecar_resource_limits`
parameter enables you to specify these limits explicitly to avoid the issue.
   437  
   438  ### Datum Timeout (optional)
   439  
   440  `datum_timeout` determines the maximum execution time allowed for each
   441  datum. The value must be a string that represents a time value, such as
   442  `1s`, `5m`, or `15h`. This parameter takes precedence over the parallelism
or number of datums; therefore, no single datum is allowed to exceed
   444  this value. By default, `datum_timeout` is not set, and the datum continues to
   445  be processed as long as needed.
   446  
   447  ### Datum Tries (optional)
   448  
   449  `datum_tries` is an integer, such as `1`, `2`, or `3`, that determines the
   450  number of times a job attempts to run on a datum when a failure occurs. 
   451  Setting `datum_tries` to `1` will attempt a job once with no retries. 
Only failed datums are retried in a retry attempt. If the operation succeeds
on a retry, the job is marked as successful. Otherwise, the job
is marked as failed.
   455  
   456  
   457  ### Job Timeout (optional)
   458  
   459  `job_timeout` determines the maximum execution time allowed for a job. It
   460  differs from `datum_timeout` in that the limit is applied across all
   461  workers and all datums. This is the *wall time*, which means that if
   462  you set `job_timeout` to one hour and the job does not finish the work
   463  in one hour, it will be interrupted.
   464  When you set this value, you need to
   465  consider the parallelism, total number of datums, and execution time per
   466  datum. The value must be a string that represents a time value, such as
   467  `1s`, `5m`, or `15h`. In addition, the number of datums might change over
   468  jobs. Some new commits might have more files, and therefore, more datums.
   469  Similarly, other commits might have fewer files and datums. If this
   470  parameter is not set, the job will run indefinitely until it succeeds or fails.
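
For reference, the three fields above sit at the top level of the pipeline spec; the values here are illustrative only:

```json
"datum_timeout": "10m",
"datum_tries": 3,
"job_timeout": "4h"
```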
   471  
   472  ### S3 Output Repository
   473  
   474  `s3_out` allows your pipeline code to write results out to an S3 gateway
   475  endpoint instead of the typical `pfs/out` directory. When this parameter
   476  is set to `true`, Pachyderm includes a sidecar S3 gateway instance
   477  container in the same pod as the pipeline container. The address of the
   478  output repository will be `s3://<output_repo>`. If you enable `s3_out`,
   479  verify that the `enable_stats` parameter is disabled.
   480  
   481  If you want to expose an input repository through an S3 gateway, see
   482  `input.pfs.s3` in [PFS Input](#pfs-input). 
   483  
   484  !!! note "See Also:"
   485      [Environment Variables](../../deploy-manage/deploy/environment-variables/)
   486  
   487  ### Input
   488  
`input` specifies repos that will be visible to the jobs during runtime.
Commits to these repos automatically trigger the pipeline to create new
jobs to process them. Input is a recursive type: there are multiple
kinds of inputs which can be combined together. The `input` object is a
container for the different input types, with a field for each; only one of
these fields can be set for any instantiation of the object. While most types
of pipeline specifications require an `input` repository, there are
exceptions, such as a spout, which does not need an `input`.
   497  
   498  ```json
{
    "pfs": pfs_input,
    "union": union_input,
    "cross": cross_input,
    "join": join_input,
    "cron": cron_input,
    "git": git_input
}
   505  ```
   506  
   507  #### PFS Input
   508  
PFS inputs are the simplest inputs: they take input from a single branch of a
single repo.
   511  
   512  ```
   513  {
   514      "name": string,
   515      "repo": string,
   516      "branch": string,
   517      "glob": string,
    "lazy": bool,
    "empty_files": bool,
    "s3": bool
   521  }
   522  ```
   523  
   524  `input.pfs.name` is the name of the input. An input with the name `XXX` is
   525  visible under the path `/pfs/XXX` when a job runs. Input names must be unique
   526  if the inputs are crossed, but they may be duplicated between `PFSInput`s that
   527  are combined by using the `union` operator. This is because when
   528  `PFSInput`s are combined, you only ever see a datum from one input
   529  at a time. Overlapping the names of combined inputs allows
   530  you to write simpler code since you no longer need to consider which
   531  input directory a particular datum comes from. If an input's name is not
   532  specified, it defaults to the name of the repo. Therefore, if you have two
   533  crossed inputs from the same repo, you must give at least one of them a
   534  unique name.
   535  
   536  `input.pfs.repo` is the name of the Pachyderm repository with the data that
   537  you want to join with other data.
   538  
   539  `input.pfs.branch` is the `branch` to watch for commits. If left blank,
   540  Pachyderm sets this value to `master`.
   541  
   542  `input.pfs.glob` is a glob pattern that is used to determine how the
   543  input data is partitioned.
   544  
   545  `input.pfs.lazy` controls how the data is exposed to jobs. The default is
   546  `false` which means the job eagerly downloads the data it needs to process and
   547  exposes it as normal files on disk. If lazy is set to `true`, data is
   548  exposed as named pipes instead, and no data is downloaded until the job
   549  opens the pipe and reads it. If the pipe is never opened, then no data is
   550  downloaded.
   551  
Some applications do not work with pipes. For example, pipes do not support
applications that make system calls such as `Seek`. Applications that can work
with pipes should use them because they are more performant. The difference is
especially notable if the job only reads a subset of the files that are
available to it.
   557  
   558  !!! note
   559      `lazy` does not support datums that
    contain more than 10,000 files.
   561  
   562  `input.pfs.empty_files` controls how files are exposed to jobs. If
   563  set to `true`, it causes files from this PFS to be presented as empty files.
   564  This is useful in shuffle pipelines where you want to read the names of
   565  files and reorganize them by using symlinks.
   566  
`input.pfs.s3` sets whether the pipeline worker pod should include a
sidecar S3 gateway instance. This option enables an S3 gateway
   569  to serve on a pipeline-level basis and, therefore, ensure provenance tracking
   570  for pipelines that integrate with external systems, such as Kubeflow. When
   571  this option is set to `true`, Pachyderm deploys an S3 gateway instance
   572  alongside the pipeline container and creates an S3 bucket for the pipeline
   573  input repo. The address of the
   574  input repository will be `s3://<input_repo>`. When you enable this
   575  parameter, you cannot use glob patterns. All files will be processed
   576  as one datum.
   577  
   578  Another limitation for S3-enabled pipelines is that you can only use
   579  either a single input or a cross input. Join and union inputs are not
   580  supported.
   581  
   582  If you want to expose an output repository through an S3
   583  gateway, see [S3 Output Repository](#s3-output-repository).
   584  
   585  #### Union Input
   586  
   587  Union inputs take the union of other inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
   589  were in the same repository with the glob pattern set to `/*`.
   590  Alternatively, each of these datums might have come from separate repositories
   591  with the glob pattern set to `/` and being the only file system objects in these
   592  repositories.
   593  
   594  ```
   595  | inputA | inputB | inputA ∪ inputB |
   596  | ------ | ------ | --------------- |
   597  | foo    | fizz   | foo             |
   598  | bar    | buzz   | fizz            |
   599  |        |        | bar             |
   600  |        |        | buzz            |
   601  ```
   602  
   603  The union inputs do not take a name and maintain the names of the
   604  sub-inputs. In the example above, you would see files under
   605  `/pfs/inputA/...` or `/pfs/inputB/...`, but never both at the same time.
   606  When you write code to address this behavior, make sure that
   607  your code first determines which input directory is present. Starting
   608  with Pachyderm 1.5.3, we recommend that you give your inputs the
   609  same `Name`. That way your code only needs to handle data being present
   610  in that directory. This only works if your code does not need to be
   611  aware of which of the underlying inputs the data comes from.
   612  
`input.union` is an array of inputs to combine. The inputs do not have to be
`pfs` inputs. They can also be `union` and `cross` inputs, although there is
no reason to take a union of unions because union is associative.
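
A sketch of a union that follows the same-name recommendation above, assuming hypothetical repos `logs-us` and `logs-eu`; your code then only needs to read from `/pfs/logs`:

```json
"input": {
  "union": [
    { "pfs": { "name": "logs", "repo": "logs-us", "glob": "/*" } },
    { "pfs": { "name": "logs", "repo": "logs-eu", "glob": "/*" } }
  ]
}
```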
   616  
   617  #### Cross Input
   618  
   619  Cross inputs create the cross product of other inputs. In other words,
   620  a cross input creates tuples of the datums in the inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
   622  were in the same repository with the glob pattern set to `/*`.
   623  Alternatively, each of these datums might have come from separate repositories
   624  with the glob pattern set to `/` and being the only file system objects in these
   625  repositories.
   626  
   627  ```
   628  | inputA | inputB | inputA ⨯ inputB |
   629  | ------ | ------ | --------------- |
   630  | foo    | fizz   | (foo, fizz)     |
   631  | bar    | buzz   | (foo, buzz)     |
   632  |        |        | (bar, fizz)     |
   633  |        |        | (bar, buzz)     |
   634  ```
   635  
   636  The cross inputs above do not take a name and maintain
   637  the names of the sub-inputs.
   638  In the example above, you would see files under `/pfs/inputA/...`
   639  and `/pfs/inputB/...`.
   640  
`input.cross` is an array of inputs to cross.
The inputs do not have to be `pfs` inputs. They can also be
`union` and `cross` inputs, although there is
no reason to take a cross of crosses because cross is associative.
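
A sketch of a cross, assuming hypothetical repos `images` and `models`; each datum is a pair of one image file and the whole `models` repo:

```json
"input": {
  "cross": [
    { "pfs": { "repo": "images", "glob": "/*" } },
    { "pfs": { "repo": "models", "glob": "/" } }
  ]
}
```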
   645  
   646  #### Cron Input
   647  
   648  Cron inputs allow you to trigger pipelines based on time. A Cron input is
   649  based on the Unix utility called `cron`. When you create a pipeline with
   650  one or more Cron inputs, `pachd` creates a repo for each of them. The start
   651  time for Cron input is specified in its spec.
When a Cron input triggers,
`pachd` commits a single file, named after the current [RFC
3339 timestamp](https://www.ietf.org/rfc/rfc3339.txt), to the repo. The file
contains the time that satisfied the spec.
   656  
   657  ```
   658  {
   659      "name": string,
   660      "spec": string,
   661      "repo": string,
   662      "start": time,
   663      "overwrite": bool
   664  }
   665  ```
   666  
`input.cron.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`, except that it is not optional.
   669  
   670  `input.cron.spec` is a cron expression which specifies the schedule on
   671  which to trigger the pipeline. To learn more about how to write schedules,
   672  see the [Wikipedia page on cron](https://en.wikipedia.org/wiki/Cron).
   673  Pachyderm supports non-standard schedules, such as `"@daily"`.
   674  
   675  `input.cron.repo` is the repo which Pachyderm creates for the input. This
   676  parameter is optional. If you do not specify this parameter, then
   677  `"<pipeline-name>_<input-name>"` is used by default.
   678  
   679  `input.cron.start` is the time to start counting from for the input. This
   680  parameter is optional. If you do not specify this parameter, then the
   681  time when the pipeline was created is used by default. Specifying a
   682  time enables you to run on matching times from the past or skip times
   683  from the present and only start running
   684  on matching times in the future. Format the time value according to [RFC
   685  3339](https://www.ietf.org/rfc/rfc3339.txt).
   686  
   687  `input.cron.overwrite` is a flag to specify whether you want the timestamp file
   688  to be overwritten on each tick. This parameter is optional, and if you do not
   689  specify it, it defaults to simply writing new files on each tick. By default,
   690  when `"overwrite"` is disabled, ticks accumulate in the cron input repo. When
   691  `"overwrite"` is enabled, Pachyderm erases the old ticks and adds new ticks
   692  with each commit. If you do not add any manual ticks or run
   693  `pachctl run cron`, only one tick file per commit (for the latest tick)
   694  is added to the input repo.
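
A sketch of a cron input that ticks once a day and overwrites the previous tick file; the input name is illustrative:

```json
"input": {
  "cron": {
    "name": "daily-tick",
    "spec": "@daily",
    "overwrite": true
  }
}
```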
   695  
   696  #### Join Input
   697  
   698  A join input enables you to join files that are stored in separate
   699  Pachyderm repositories and that match a configured glob
   700  pattern. A join input must have the `glob` and `join_on` parameters configured
   701  to work properly. A join can combine multiple PFS inputs.
   702  
You can specify the following parameters for the `join` input. A sketch of a complete join input follows the parameter list.
   704  
   705  * `input.pfs.name` — the name of the PFS input that appears in the
   706  `INPUT` field when you run the `pachctl list job` command.
   707  If an input name is not specified, it defaults to the name of the repo.
   708  
   709  * `input.pfs.repo` — see the description in [PFS Input](#pfs-input).
   710  the name of the Pachyderm repository with the data that
   711  you want to join with other data.
   712  
   713  * `input.pfs.branch` — see the description in [PFS Input](#pfs-input).
   714  
   715  * `input.pfs.glob` — a wildcard pattern that defines how a dataset is broken
   716    up into datums for further processing. When you use a glob pattern in joins,
   717    it creates a naming convention that Pachyderm uses to join files. In other
   718    words, Pachyderm joins the files that are named according to the glob
   719    pattern and skips those that are not.
   720  
  You can wrap parts of the glob pattern in parentheses to create
  one or multiple capture groups. A capture group can include one or multiple
  characters. Use standard UNIX globbing characters to create capture
  groups, including the following:
   725  
  * `?` — matches a single character in a filepath. For example, if you
  have files named `file000.txt`, `file001.txt`, `file002.txt`, and so on,
  you can set the glob pattern to `/file(?)(?)(?)` and the `join_on` key to
  `$2`, so that Pachyderm matches only the files that have the same second
  character.

  * `*` — matches any number of characters in the filepath. For example, if you set
  your capture group to `/(*)`, Pachyderm matches all files in the root
  directory.
   735  
   736    If you do not specify a correct `glob` pattern, Pachyderm performs the
   737    `cross` input operation instead of `join`.
   738  
   739  * `input.pfs.lazy` — see the description in [PFS Input](#pfs-input).
   740  * `input.pfs.empty_files` — see the description in [PFS Input](#pfs-input).
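
A sketch of a join that pairs files by their base name, assuming hypothetical repos `readings` and `metadata` whose files share names such as `sensor-01.csv` and `sensor-01.json`:

```json
"input": {
  "join": [
    {
      "pfs": {
        "repo": "readings",
        "glob": "/(*).csv",
        "join_on": "$1"
      }
    },
    {
      "pfs": {
        "repo": "metadata",
        "glob": "/(*).json",
        "join_on": "$1"
      }
    }
  ]
}
```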
   741  
   742  #### Git Input (alpha feature)
   743  
   744  Git inputs allow you to pull code from a public git URL and execute that code as part of your pipeline. A pipeline with a Git Input will get triggered (i.e. will see a new input commit and will spawn a job) whenever you commit to your git repository.
   745  
   746  **Note:** This only works on cloud deployments, not local clusters.
   747  
   748  `input.git.URL` must be a URL of the form: `https://github.com/foo/bar.git`
   749  
`input.git.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`. It is optional.
   752  
   753  `input.git.branch` is the name of the git branch to use as input.
   754  
Git inputs also require some additional configuration. In order for new commits on your git repository to correspond to new commits on the Pachyderm Git Input repo, we need to set up a git webhook. At the moment, only GitHub is supported. (Though if you ask nicely, we can add support for GitLab or BitBucket).
   756  
   757  1. Create your Pachyderm pipeline with the Git Input.
   758  
   759  2. To get the URL of the webhook to your cluster, do `pachctl inspect pipeline` on your pipeline. You should see a `Githook URL` field with a URL set. Note - this will only work if you've deployed to a cloud provider (e.g. AWS, GKE). If you see `pending` as the value (and you've deployed on a cloud provider), it's possible that the service is still being provisioned. You can check `kubectl get svc` to make sure you see the `githook` service running.
   760  
3. To set up the GitHub webhook, navigate to:
   762  
   763  ```
   764  https://github.com/<your_org>/<your_repo>/settings/hooks/new
   765  ```
   766  Or navigate to webhooks under settings. Then you'll want to copy the `Githook URL` into the 'Payload URL' field.
   767  
   768  ### Output Branch (optional)
   769  
This is the branch where the pipeline outputs new commits. By default,
it is `master`.
   772  
   773  ### Egress (optional)
   774  
`egress` allows you to push the results of a pipeline to an external data
store such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. Data is
pushed after the user code has finished running but before the job is marked
as successful.
   779  
   780  For more information, see [Exporting Data by using egress](../../how-tos/export-data-out-pachyderm/#export-your-data-with-egress)
   781  
   782  ### Standby (optional)
   783  
`standby` indicates that the pipeline should be put into standby mode when
there is no data for it to process. A pipeline in standby has no pods running
and thus consumes no resources; its state is displayed as `standby`.
   787  
   788  Standby replaces `scale_down_threshold` from releases prior to 1.7.1.
   789  
   790  ### Cache Size (optional)
   791  
   792  `cache_size` controls how much cache a pipeline's sidecar containers use. In
   793  general, your pipeline's performance will increase with the cache size, but
   794  only up to a certain point depending on your workload.
   795  
   796  Every worker in every pipeline has a limited-functionality `pachd` server
   797  running adjacent to it, which proxies PFS reads and writes (this prevents
   798  thundering herds when jobs start and end, which is when all of a pipeline's
   799  workers are reading from and writing to PFS simultaneously). Part of what these
   800  "sidecar" pachd servers do is cache PFS reads. If a pipeline has a cross input,
   801  and a worker is downloading the same datum from one branch of the input
   802  repeatedly, then the cache can speed up processing significantly.
   803  
   804  ### Enable Stats (optional)
   805  
   806  The `enable_stats` parameter turns on statistics tracking for the pipeline.
   807  When you enable the statistics tracking, the pipeline automatically creates
   808  and commits datum processing information to a special branch in its output
   809  repo called `"stats"`. This branch stores information about each datum that
   810  the pipeline processes, including timing information, size information, logs,
and `/pfs` snapshots. You can view these statistics by running the `pachctl
inspect datum` and `pachctl list datum` commands, as well as through the web UI.
   813  Do not enable statistics tracking for S3-enabled pipelines.
   814  
Once enabled, statistics tracking cannot simply be switched off for an existing
pipeline. To turn it off, you must delete the pipeline, set `enable_stats` to
`false` or remove it from your pipeline spec entirely, and recreate the pipeline
from that updated spec file. While the pipeline that collects the stats
exists, the storage space used by the stats cannot be released.
   820  
   821  !!! note
    Enabling stats results in a slight increase in storage use for logs and
    timing information.
    However, stats do not use as much extra storage as it might appear because
    snapshots of the `/pfs` directory, which are the largest stored assets,
    do not require extra space.
   827  
   828  ### Service (alpha feature, optional)
   829  
`service` specifies that the pipeline should be treated as a long-running
service rather than a data transformation. This means that `transform.cmd` is
not expected to exit; if it does, it is restarted. Furthermore, the service
is exposed outside the container using a Kubernetes service.
`"internal_port"` should be the port that the user code binds to inside the
container; `"external_port"` is the port on which it is exposed through the
`NodePort` functionality of Kubernetes services. After a service has been
created, you should be able to access it at
`http://<kubernetes-host>:<external_port>`.
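
A sketch of a service block; `external_port` must fall within your cluster's NodePort range (30000-32767 by default), and the values shown are only an example:

```json
"service": {
  "internal_port": 8080,
  "external_port": 30080
}
```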
   839  
   840  ### Spout (optional)
   841  
   842  `spout` is a type of pipeline that processes streaming data.
   843  Unlike a union or cross pipeline, a spout pipeline does not have
   844  a PFS input. Instead, it opens a Linux *named pipe* into the source of the
   845  streaming data. Your pipeline
   846  can be either a spout or a service and not both. Therefore, if you added
   847  the `service` as a top-level object in your pipeline, you cannot add `spout`.
   848  However, you can expose a service from inside of a spout pipeline by
   849  specifying it as a field in the `spout` spec. Then, Kubernetes creates
   850  a service endpoint that you can expose externally. You can get the information
   851  about the service by running `kubectl get services`.
   852  
   853  For more information, see [Spouts](../concepts/pipeline-concepts/pipeline/spout.md).
   854  
   855  ### Max Queue Size (optional)
`max_queue_size` specifies the maximum number of datums that a worker should
hold in its processing queue at a given time (after processing its entire
queue, a worker "checkpoints" its progress by writing to persistent storage).
The default value is `1`, which means that workers only hold onto the datum
that they are currently processing.
   861  
Increasing this value can improve pipeline performance, as it allows workers
to download, process, and upload different datums at the same
time (and reduces the total time spent on checkpointing). Decreasing this value
can make jobs more robust to failed workers, as work gets checkpointed more
often, and a failing worker does not lose as much progress. Setting this value
too high can also cause problems if you have `lazy` inputs, as there is a cap of
10,000 `lazy` files per worker and all concurrently running datums count
against this limit.
   870  
   871  ### Chunk Spec (optional)
`chunk_spec` specifies how a pipeline should chunk its datums.
A chunk is the unit of work that workers claim. Each worker claims one or more
datums and commits a full chunk once it has finished processing it.

`chunk_spec.number`, if nonzero, specifies that each chunk should contain `number`
datums. Chunks may contain fewer datums if the total number of datums does not
divide evenly. If you lower the chunk number to 1, progress is checkpointed
after every datum, at the cost of extra load on etcd, which can slow down other
operations. The default value is 2.

`chunk_spec.size_bytes`, if nonzero, specifies a target size for each chunk of
datums. Chunks may be larger or smaller than `size_bytes`, but will usually be
pretty close to `size_bytes` in size.
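
For example, a sketch that sets a fixed chunk size of ten datums; you would normally set either `number` or `size_bytes`, not both:

```json
"chunk_spec": {
  "number": 10
}
```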
   885  
   886  ### Scheduling Spec (optional)
   887  `scheduling_spec` specifies how the pods for a pipeline should be scheduled.
   888  
   889  `scheduling_spec.node_selector` allows you to select which nodes your pipeline
   890  will run on. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
   891  on node selectors for more information about how this works.
   892  
`scheduling_spec.priority_class_name` allows you to select the priority class
for the pipeline, which affects how Kubernetes chooses to schedule and deschedule
the pipeline. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass)
   896  on priority and preemption for more information about how this works.
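
A sketch of a scheduling block, assuming your nodes carry a hypothetical `disktype=ssd` label and that a `high-priority` PriorityClass exists in the cluster:

```json
"scheduling_spec": {
  "node_selector": {
    "disktype": "ssd"
  },
  "priority_class_name": "high-priority"
}
```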
   897  
   898  ### Pod Spec (optional)
   899  `pod_spec` is an advanced option that allows you to set fields in the pod spec
   900  that haven't been explicitly exposed in the rest of the pipeline spec. A good
   901  way to figure out what JSON you should pass is to create a pod in Kubernetes
   902  with the proper settings, then do:
   903  
   904  ```
   905  kubectl get po/<pod-name> -o json | jq .spec
   906  ```
   907  
This gives you a correctly formatted piece of JSON; you should then remove
the extraneous fields that Kubernetes injects or that can be set elsewhere.
   910  
   911  The JSON is applied after the other parameters for the `pod_spec` have already
   912  been set as a [JSON Merge Patch](https://tools.ietf.org/html/rfc7386). This
   913  means that you can modify things such as the storage and user containers.
   914  
   915  ### Pod Patch (optional)
`pod_patch` is similar to `pod_spec` above but is applied as a [JSON
Patch](https://tools.ietf.org/html/rfc6902). Note that this means the
process outlined above of modifying an existing pod spec and then manually
blanking unchanged fields won't work; you'll need to create a correctly
formatted patch by diffing the two pod specs.
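
Because `pod_patch` is a string field, the JSON Patch itself has to be embedded as an escaped string. A sketch that adds a hypothetical toleration, assuming the patch is applied against the worker pod's spec so `/tolerations` addresses the pod spec's tolerations list, might look like this:

```json
"pod_patch": "[{\"op\": \"add\", \"path\": \"/tolerations\", \"value\": [{\"key\": \"dedicated\", \"operator\": \"Equal\", \"value\": \"gpu\", \"effect\": \"NoSchedule\"}]}]"
```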
   921  
   922  ## The Input Glob Pattern
   923  
   924  Each PFS input needs to specify a [glob pattern](../../concepts/pipeline-concepts/datum/glob-pattern/).
   925  
   926  Pachyderm uses the glob pattern to determine how many "datums" an input
   927  consists of.  Datums are the unit of parallelism in Pachyderm.  That is,
   928  Pachyderm attempts to process datums in parallel whenever possible.
   929  
   930  Intuitively, you may think of the input repo as a file system, and you are
   931  applying the glob pattern to the root of the file system.  The files and
   932  directories that match the glob pattern are considered datums.
   933  
   934  For instance, let's say your input repo has the following structure:
   935  
   936  ```
   937  /foo-1
   938  /foo-2
   939  /bar
   940    /bar-1
   941    /bar-2
   942  ```
   943  
   944  Now let's consider what the following glob patterns would match respectively:
   945  
   946  * `/`: this pattern matches `/`, the root directory itself, meaning all the data would be a single large datum.
* `/*`: this pattern matches everything under the root directory, giving us 3 datums:
`/foo-1`, `/foo-2`, and everything under the directory `/bar`.
* `/bar/*`: this pattern matches files only under the `/bar` directory: `/bar/bar-1` and `/bar/bar-2`
* `/foo*`: this pattern matches files under the root directory that start with the characters `foo`
* `/*/*`: this pattern matches everything that's two levels deep relative
to the root: `/bar/bar-1` and `/bar/bar-2`
   953  
The datums are defined as whichever files or directories are matched by the glob pattern. For instance, if we used
`/*`, then the job will process three datums (potentially in parallel):
`/foo-1`, `/foo-2`, and `/bar`. Both the `bar-1` and `bar-2` files within the directory `bar` would be grouped together and always processed by the same worker.
   957  
   958  ## PPS Mounts and File Access
   959  
   960  ### Mount Paths
   961  
   962  The root mount point is at `/pfs`, which contains:
   963  
   964  - `/pfs/input_name` which is where you would find the datum.
   965    - Each input will be found here by its name, which defaults to the repo
   966    name if not specified.
   967  - `/pfs/out` which is where you write any output.