# Pipeline Specification

This document discusses each of the fields present in a pipeline specification.
To see how to use a pipeline spec to create a pipeline, refer to the [pachctl
create pipeline](pachctl/pachctl_create_pipeline.md) section.

## JSON Manifest Format

```json
{
  "pipeline": {
    "name": string
  },
  "description": string,
  "transform": {
    "image": string,
    "cmd": [ string ],
    "stdin": [ string ],
    "err_cmd": [ string ],
    "err_stdin": [ string ],
    "env": {
        string: string
    },
    "secrets": [ {
        "name": string,
        "mount_path": string
    },
    {
        "name": string,
        "env_var": string,
        "key": string
    } ],
    "image_pull_secrets": [ string ],
    "accept_return_code": [ int ],
    "debug": bool,
    "user": string,
    "working_dir": string,
  },
  "parallelism_spec": {
    // Set at most one of the following:
    "constant": int,
    "coefficient": number
  },
  "hashtree_spec": {
    "constant": int,
  },
  "resource_requests": {
    "memory": string,
    "cpu": number,
    "disk": string,
  },
  "resource_limits": {
    "memory": string,
    "cpu": number,
    "gpu": {
      "type": string,
      "number": int
    },
    "disk": string,
  },
  "datum_timeout": string,
  "datum_tries": int,
  "job_timeout": string,
  "input": {
    <"pfs", "cross", "union", "join", "cron", or "git" input; see below>
  },
  "output_branch": string,
  "egress": {
    "URL": "s3://bucket/dir"
  },
  "standby": bool,
  "cache_size": string,
  "enable_stats": bool,
  "service": {
    "internal_port": int,
    "external_port": int
  },
  "spout": {
    "overwrite": bool,
    // Optionally, you can combine a spout with a service:
    "service": {
        "internal_port": int,
        "external_port": int,
        "annotations": {
            "foo": "bar"
        }
    }
  },
  "max_queue_size": int,
  "chunk_spec": {
    "number": int,
    "size_bytes": int
  },
  "scheduling_spec": {
    "node_selector": {string: string},
    "priority_class_name": string
  },
  "pod_spec": string,
  "pod_patch": string,
}

------------------------------------
"pfs" input
------------------------------------

"pfs": {
  "name": string,
  "repo": string,
  "branch": string,
  "glob": string,
  "lazy": bool,
  "empty_files": bool
}

------------------------------------
"cross" or "union" input
------------------------------------

"cross" or "union": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy": bool,
      "empty_files": bool
    }
  }
  ...
]

------------------------------------
"cron" input
------------------------------------

"cron": {
    "name": string,
    "spec": string,
    "repo": string,
    "start": time,
    "overwrite": bool
}

------------------------------------
"join" input
------------------------------------

"join": [
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool
    }
  },
  {
    "pfs": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "join_on": string,
      "lazy": bool,
      "empty_files": bool
    }
  }
]

------------------------------------
"git" input
------------------------------------

"git": {
  "URL": string,
  "name": string,
  "branch": string
}

```

In practice, you rarely need to specify all of these fields.
Most fields either come with sensible defaults or can be left empty.
The following is an example of a minimal spec:

```json
{
  "pipeline": {
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["/binary", "/pfs/data", "/pfs/out"]
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
```

### Name (required)

`pipeline.name` is the name of the pipeline that you are creating. Each
pipeline needs to have a unique name. Pipeline names must meet the following
requirements:

- Include only alphanumeric characters, `_` and `-`.
- Begin or end with only alphanumeric characters (not `_` or `-`).
- Not exceed 63 characters in length.

### Description (optional)

`description` is an optional text field where you can add information
about the pipeline.

### Transform (required)

`transform.image` is the name of the Docker image that your jobs use.

`transform.cmd` is the command passed to the Docker run invocation. As with
Docker, `cmd` is not run inside a shell, which means that
wildcard globbing (`*`), pipes (`|`), and file redirects (`>` and `>>`) do
not work. To use these features, you can set `cmd` to be a shell of your
choice, such as `sh`, and pass a shell script to `stdin`.

`transform.stdin` is an array of lines that are sent to your command on
`stdin`.
Lines do not have to end in newline characters.
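
For example, a transform that needs shell features can run `sh` and pass the
script through `stdin` (the image name and commands here are hypothetical):

```json
"transform": {
  "image": "my-scripts-image",
  "cmd": [ "sh" ],
  "stdin": [
    "set -e",
    "wc -w /pfs/data/* > /pfs/out/counts.txt"
  ]
}
```

Because the script runs inside `sh`, the `*` wildcard and the `>` redirect work
as expected.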

`transform.err_cmd` is an optional command that is executed on failed datums.
If `err_cmd` succeeds and returns a 0 exit code, it does not prevent
the job from succeeding.
This behavior means that `transform.err_cmd` can be used to ignore
failed datums while still writing successful datums to the output repo,
instead of failing the whole job when some datums fail. The `transform.err_cmd`
command has the same limitations as `transform.cmd`.

`transform.err_stdin` is an array of lines that are sent to your error command
on `stdin`.
Lines do not have to end in newline characters.

`transform.env` is a key-value map of environment variables that
Pachyderm injects into the container.

**Note:** Some environment variables are automatically injected
into the container. For a comprehensive list of them, see the [Environment
Variables](#environment-variables) section below.

`transform.secrets` is an array of secrets. You can use the secrets to
embed sensitive data, such as credentials. The secrets reference
Kubernetes secrets by name and specify either a path to mount the secrets at or
an environment variable (`env_var`) that the value should be bound to. Secrets
must set `name`, which should be the name of a secret in Kubernetes. Secrets
must also specify either `mount_path` or `env_var` and `key`. See more
information about Kubernetes secrets [here](https://kubernetes.io/docs/concepts/configuration/secret/).
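
For illustration, the fragment below mounts one secret as files and binds a key
from another to an environment variable (the secret names, path, and key are
hypothetical):

```json
"transform": {
  "secrets": [ {
      "name": "my-tls-cert",
      "mount_path": "/secrets/tls"
  },
  {
      "name": "my-db-creds",
      "env_var": "DB_PASSWORD",
      "key": "password"
  } ]
}
```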

`transform.image_pull_secrets` is an array of image pull secrets. Image pull
secrets are similar to secrets, except that they are mounted before the
containers are created, so they can be used to provide credentials for image
pulling. For example, if you are using a private Docker registry for your
images, you can specify it by running the following command:

```shell
$ kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
```

And then, notify your pipeline about it by using
`"image_pull_secrets": [ "myregistrykey" ]`. Read more about image pull secrets
[here](https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod).

`transform.accept_return_code` is an array of return codes, such as exit codes
from your Docker command, that are considered acceptable.
If your Docker command exits with one of the codes in this array, it is
considered a successful run for the purpose of setting job status. `0`
is always considered a successful exit code.

`transform.debug` turns on added debug logging for the pipeline.

`transform.user` sets the user that your code runs as. This can also be
accomplished with a `USER` directive in your `Dockerfile`.

`transform.working_dir` sets the directory that your command runs from. You
can also specify the `WORKDIR` directive in your `Dockerfile`.

`transform.dockerfile` is the path to the `Dockerfile` used with the `--build`
flag. This defaults to `./Dockerfile`.

### Parallelism Spec (optional)

`parallelism_spec` describes how Pachyderm parallelizes your pipeline.
Currently, Pachyderm has two parallelism strategies: `constant` and
`coefficient`.

If you set the `constant` field, Pachyderm starts the number of workers
that you specify. For example, set `"constant": 10` to use 10 workers.

If you set the `coefficient` field, Pachyderm starts a number of workers
that is a multiple of your Kubernetes cluster’s size. For example, if your
Kubernetes cluster has 10 nodes, and you set `"coefficient": 0.5`, Pachyderm
starts five workers. If you set it to 2.0, Pachyderm starts 20 workers
(two per Kubernetes node).

The default value is `"constant": 1`.

Because spouts and services are designed to be single instances, do not
modify the default `parallelism_spec` value for these pipelines.
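
For example, to always run a pipeline with ten workers:

```json
"parallelism_spec": {
  "constant": 10
}
```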

### Resource Requests (optional)

`resource_requests` describes the amount of resources that you expect the
workers for a given pipeline to consume. Knowing this in advance
lets Pachyderm schedule big jobs on separate machines, so that they do not
conflict and either slow down or die.

The `memory` field is a string that describes the amount of memory, in bytes,
that each worker needs, with allowed SI suffixes (M, K, G, Mi, Ki, Gi, and so on).
For example, a worker that needs to read a 1GB file into memory might set
`"memory": "1.2G"` with a little extra for the code to use in addition to the
file. Workers for this pipeline will be placed on machines with at least
1.2GB of free memory, and other large workers will be prevented from using it
(if they also set their `resource_requests`).

The `cpu` field is a number that describes the amount of CPU time in `cpu
seconds/real seconds` that each worker needs. Setting `"cpu": 0.5` indicates that
the worker should get 500ms of CPU time per second. Setting `"cpu": 2`
indicates that the worker gets 2000ms of CPU time per second. In other words,
it uses two CPUs, although the worker's threads might spend 500ms on four
physical CPUs instead of one second on two physical CPUs.

The `disk` field is a string that describes the amount of ephemeral disk space,
in bytes, that each worker needs, with allowed SI suffixes (M, K, G, Mi, Ki, Gi,
and so on).

These resource requests are not upper bounds. If the worker uses
more memory than it requested, it is not necessarily shut down.
However, if the whole node runs out of memory, Kubernetes starts deleting
pods that have been placed on it and exceeded their memory request,
to reclaim memory.
To prevent deletion of your workers, you must set your `memory` request to
a sufficiently large value. However, if the total memory requested by all
workers in the system is too large, Kubernetes cannot schedule new
workers because no machine has enough unclaimed memory. `cpu` works
similarly, but for CPU time.

By default, workers are scheduled with an effective resource request of 0 (to
avoid scheduling problems that would prevent users from being able to run
pipelines). This means that if a node runs out of memory, any such worker
might be terminated.

For more information about resource requests and limits, see the
[Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/)
on the subject.
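
For example, a pipeline whose workers each need roughly 1.2GB of memory, half a
CPU, and 5GB of scratch space might request the following (the values are
illustrative):

```json
"resource_requests": {
  "memory": "1.2G",
  "cpu": 0.5,
  "disk": "5G"
}
```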

### Resource Limits (optional)

`resource_limits` describes the upper threshold of allowed resources a given
worker can consume. If a worker exceeds this value, it will be evicted.

The `gpu` field is a number that describes how many GPUs each worker needs.
Only whole numbers are supported; Kubernetes does not allow multiplexing of
GPUs. Unlike the other resource fields, GPUs only have meaning in limits: by
requesting a GPU, the worker gets sole access to that GPU while it is
running. It's recommended to enable `standby` if you are using GPUs so other
processes in the cluster have access to the GPUs while the pipeline has
nothing to process. For more information about scheduling GPUs, see the
[Kubernetes docs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
on the subject.
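
For illustration, a limits block that caps memory and CPU and claims a single
GPU might look like the following; the `type` string depends on your cluster,
and `nvidia.com/gpu` is only a common example:

```json
"resource_limits": {
  "memory": "2G",
  "cpu": 1,
  "gpu": {
    "type": "nvidia.com/gpu",
    "number": 1
  }
}
```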

### Datum Timeout (optional)

`datum_timeout` is a string (e.g. `1s`, `5m`, or `15h`) that determines the
maximum execution time allowed per datum. No matter what your parallelism
or number of datums, no single datum is allowed to exceed this value.

### Datum Tries (optional)

`datum_tries` is an integer, such as `1`, `2`, or `3`, that determines the
number of times a job attempts to run on a datum when a failure occurs.
Setting `datum_tries` to `1` attempts a job once with no retries.
Only failed datums are retried in a retry attempt. If the operation succeeds
on retry, then the job is marked as successful. Otherwise, the job
is marked as failed.

### Job Timeout (optional)

`job_timeout` is a string (e.g. `1s`, `5m`, or `15h`) that determines the
maximum execution time allowed for a job. It differs from `datum_timeout`
in that the limit is applied across all workers and all datums. That
means that you need to keep in mind the parallelism, total number of
datums, and execution time per datum when setting this value. Keep in
mind that the number of datums may change from job to job. Some new commits may
have a bunch of new files (and so new datums). Some may have fewer.
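
For example, to allow up to three attempts per datum, cap each datum at 10
minutes, and cap the whole job at two hours, you could combine these fields as
follows (the values are illustrative):

```json
"datum_tries": 3,
"datum_timeout": "10m",
"job_timeout": "2h"
```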

### Input (required)

`input` specifies repos that will be visible to the jobs during runtime.
Commits to these repos will automatically trigger the pipeline to create new
jobs to process them. Input is a recursive type: there are multiple different
kinds of inputs which can be combined together. The `input` object is a
container for the different input types, with a field for each; only one of
these fields may be set for any instantiation of the object.

```
{
    "pfs": pfs_input,
    "union": union_input,
    "cross": cross_input,
    "cron": cron_input,
    "join": join_input,
    "git": git_input
}
```

#### PFS Input

PFS inputs are the simplest inputs; they take input from a single branch on a
single repo.

```
{
    "name": string,
    "repo": string,
    "branch": string,
    "glob": string,
    "lazy": bool,
    "empty_files": bool
}
```

`input.pfs.name` is the name of the input. An input with the name `XXX` is
visible under the path `/pfs/XXX` when a job runs. Input names must be unique
if the inputs are crossed, but they may be duplicated between `PFSInput`s that
are combined by using the `union` operator. This is because when
`PFSInput`s are combined, you only ever see a datum from one input
at a time. Overlapping the names of combined inputs allows
you to write simpler code since you no longer need to consider which
input directory a particular datum comes from. If an input's name is not
specified, it defaults to the name of the repo. Therefore, if you have two
crossed inputs from the same repo, you must give at least one of them a
unique name.

`input.pfs.repo` is the name of the Pachyderm repository with the data that
you want to join with other data.

`input.pfs.branch` is the `branch` to watch for commits. If left blank,
Pachyderm sets this value to `master`.

`input.pfs.glob` is a glob pattern that is used to determine how the
input data is partitioned.

`input.pfs.lazy` controls how the data is exposed to jobs. The default is
`false`, which means the job eagerly downloads the data it needs to process and
exposes it as normal files on disk. If lazy is set to `true`, data is
exposed as named pipes instead, and no data is downloaded until the job
opens the pipe and reads it. If the pipe is never opened, then no data is
downloaded.

Some applications do not work with pipes. For example, pipes do not support
applications that make `syscalls` such as `Seek`. Applications that can work
with pipes should use them since they are more performant. The difference will
be especially notable if the job only reads a subset of the files that are
available to it.

**Note:** `lazy` currently does not support datums that
contain more than 10,000 files.

`input.pfs.empty_files` controls how files are exposed to jobs. If
set to `true`, it causes files from this PFS input to be presented as empty files.
This is useful in shuffle pipelines where you want to read the names of
files and reorganize them by using symlinks.
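
For example, a pipeline that processes each top-level file or directory of a
hypothetical `images` repo as its own datum might use:

```json
"input": {
  "pfs": {
    "repo": "images",
    "glob": "/*"
  }
}
```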

#### Union Input

Union inputs take the union of other inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/` and being the only filesystem objects in these
repositories.

```
| inputA | inputB | inputA ∪ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | foo             |
| bar    | buzz   | fizz            |
|        |        | bar             |
|        |        | buzz            |
```

The union inputs do not take a name and maintain the names of the
sub-inputs. In the example above, you would see files under
`/pfs/inputA/...` or `/pfs/inputB/...`, but never both at the same time.
When you write code to address this behavior, make sure that
your code first determines which input directory is present. Starting
with Pachyderm 1.5.3, we recommend that you give your inputs the
same `Name`. That way your code only needs to handle data being present
in that directory. This only works if your code does not need to be
aware of which of the underlying inputs the data comes from.

`input.union` is an array of inputs to combine. The inputs do not have to be
`pfs` inputs. They can also be `union` and `cross` inputs, although there is
no reason to take a union of unions because union is associative.
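
For example, the sketch below unions two hypothetical repos under the same
name, so the code only ever reads from `/pfs/logs` regardless of which repo a
datum came from:

```json
"input": {
  "union": [
    {
      "pfs": {
        "name": "logs",
        "repo": "server-logs",
        "glob": "/*"
      }
    },
    {
      "pfs": {
        "name": "logs",
        "repo": "client-logs",
        "glob": "/*"
      }
    }
  ]
}
```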

#### Cross Input

Cross inputs create the cross product of other inputs. In other words,
a cross input creates tuples of the datums in the inputs. In the example
below, each input includes individual datums, such as if `foo` and `bar`
were in the same repository with the glob pattern set to `/*`.
Alternatively, each of these datums might have come from separate repositories
with the glob pattern set to `/` and being the only filesystem objects in these
repositories.

```
| inputA | inputB | inputA ⨯ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | (foo, fizz)     |
| bar    | buzz   | (foo, buzz)     |
|        |        | (bar, fizz)     |
|        |        | (bar, buzz)     |
```

The cross inputs above do not take a name and maintain
the names of the sub-inputs.
In the example above, you would see files under `/pfs/inputA/...`
and `/pfs/inputB/...`.

`input.cross` is an array of inputs to cross.
The inputs do not have to be `pfs` inputs. They can also be
`union` and `cross` inputs, although there is
no reason to take a cross of crosses because cross is associative.
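
For example, the sketch below pairs every datum of a hypothetical `models` repo
with every datum of a hypothetical `test-sets` repo:

```json
"input": {
  "cross": [
    {
      "pfs": {
        "repo": "models",
        "glob": "/*"
      }
    },
    {
      "pfs": {
        "repo": "test-sets",
        "glob": "/*"
      }
    }
  ]
}
```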

#### Cron Input

Cron inputs allow you to trigger pipelines based on time. A Cron input is
based on the Unix utility called `cron`. When you create a pipeline with
one or more Cron inputs, `pachd` creates a repo for each of them. The start
time for a Cron input is specified in its `start` field.
When a Cron input triggers,
`pachd` commits a single file, named after the current [RFC
3339 timestamp](https://www.ietf.org/rfc/rfc3339.txt), to the repo. This file
contains the time which satisfied the spec.

```
{
    "name": string,
    "spec": string,
    "repo": string,
    "start": time,
    "overwrite": bool
}
```

`input.cron.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`, except that it is not optional.

`input.cron.spec` is a cron expression which specifies the schedule on
which to trigger the pipeline. To learn more about how to write schedules,
see the [Wikipedia page on cron](https://en.wikipedia.org/wiki/Cron).
Pachyderm supports non-standard schedules, such as `"@daily"`.

`input.cron.repo` is the repo which Pachyderm creates for the input. This
parameter is optional. If you do not specify this parameter, then
`"<pipeline-name>_<input-name>"` is used by default.

`input.cron.start` is the time to start counting from for the input. This
parameter is optional. If you do not specify this parameter, then the
time when the pipeline was created is used by default. Specifying a
time enables you to run on matching times from the past or skip times
from the present and only start running
on matching times in the future. Format the time value according to [RFC
3339](https://www.ietf.org/rfc/rfc3339.txt).

`input.cron.overwrite` is a flag to specify whether you want the timestamp file
to be overwritten on each tick. This parameter is optional, and if you do not
specify it, it defaults to simply writing new files on each tick. By default,
`pachd` expects only the new information to be written out on each tick
and combines that data with the data from the previous ticks. If `"overwrite"`
is set to `true`, it expects the full dataset to be written out for each tick and
replaces previous outputs with the new data written out.
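
For example, a Cron input that ticks once a day and overwrites the previous
timestamp file might look like this:

```json
"input": {
  "cron": {
    "name": "tick",
    "spec": "@daily",
    "overwrite": true
  }
}
```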

#### Join Input

A join input enables you to join files that are stored in separate
Pachyderm repositories and that match a configured glob
pattern. A join input must have the `glob` and `join_on` parameters configured
to work properly. A join can combine multiple PFS inputs.

You can specify the following parameters for the `join` input. A short example
follows the list.

* `input.pfs.name` — the name of the PFS input that appears in the
`INPUT` field when you run the `pachctl list job` command.
If an input name is not specified, it defaults to the name of the repo.

* `input.pfs.repo` — see the description in [PFS Input](#pfs-input).
It is the name of the Pachyderm repository with the data that
you want to join with other data.

* `input.pfs.branch` — see the description in [PFS Input](#pfs-input).

* `input.pfs.glob` — a wildcard pattern that defines how a dataset is broken
  up into datums for further processing. When you use a glob pattern in joins,
  it creates a naming convention that Pachyderm uses to join files. In other
  words, Pachyderm joins the files that are named according to the glob
  pattern and skips those that are not.

  You can specify the glob pattern for joins in parentheses to create
  one or multiple capture groups. A capture group can include one or multiple
  characters. Use standard UNIX globbing characters to create capture
  groups, including the following:

  * `?` — matches a single character in a filepath. For example, you
  have files named `file000.txt`, `file001.txt`, `file002.txt`, and so on.
  You can set the glob pattern to `/file(?)(?)(?)` and the `join_on` key to
  `$2`, so that Pachyderm matches only the files that have the same character
  in the second capture group.

  * `*` — matches any number of characters in the filepath. For example, if you set
  your capture group to `/(*)`, Pachyderm matches all files in the root
  directory.

  If you do not specify a correct `glob` pattern, Pachyderm performs the
  `cross` input operation instead of `join`.

* `input.pfs.lazy` — see the description in [PFS Input](#pfs-input).
* `input.pfs.empty_files` — see the description in [PFS Input](#pfs-input).
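
For illustration, the sketch below joins files from two hypothetical repos on
the first capture group of their respective glob patterns:

```json
"input": {
  "join": [
    {
      "pfs": {
        "repo": "readings",
        "glob": "/(*).csv",
        "join_on": "$1"
      }
    },
    {
      "pfs": {
        "repo": "metadata",
        "glob": "/(*).json",
        "join_on": "$1"
      }
    }
  ]
}
```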

#### Git Input (alpha feature)

Git inputs allow you to pull code from a public git URL and execute that code as part of your pipeline. A pipeline with a Git Input will get triggered (i.e. will see a new input commit and will spawn a job) whenever you commit to your git repository.

**Note:** This only works on cloud deployments, not local clusters.

`input.git.URL` must be a URL of the form: `https://github.com/foo/bar.git`

`input.git.name` is the name for the input. Its semantics are similar to
those of `input.pfs.name`. It is optional.

`input.git.branch` is the name of the git branch to use as input.
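
For example, a Git input using the URL form shown above:

```json
"input": {
  "git": {
    "URL": "https://github.com/foo/bar.git",
    "branch": "master"
  }
}
```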

Git inputs also require some additional configuration. In order for new commits on your git repository to correspond to new commits on the Pachyderm Git Input repo, you need to set up a git webhook. At the moment, only GitHub is supported. (Though if you ask nicely, we can add support for GitLab or BitBucket).

1. Create your Pachyderm pipeline with the Git Input.

2. To get the URL of the webhook to your cluster, run `pachctl inspect pipeline` on your pipeline. You should see a `Githook URL` field with a URL set. Note - this will only work if you've deployed to a cloud provider (e.g. AWS, GKE). If you see `pending` as the value (and you've deployed on a cloud provider), it's possible that the service is still being provisioned. You can check `kubectl get svc` to make sure you see the `githook` service running.

3. To set up the GitHub webhook, navigate to:

```
https://github.com/<your_org>/<your_repo>/settings/hooks/new
```

Or navigate to webhooks under settings. Then copy the `Githook URL` into the 'Payload URL' field.

### Output Branch (optional)

This is the branch where the pipeline outputs new commits. By default,
it is `master`.

### Egress (optional)

`egress` allows you to push the results of a pipeline to an external data
store, such as S3, Google Cloud Storage, or Azure Storage. Data is pushed
after the user code has finished running but before the job is marked as
successful.

For more information, see [Exporting Data by using egress](../../how-tos/export-data-out-pachyderm/#export-your-data-with-egress).

### Standby (optional)

`standby` indicates that the pipeline should be put into "standby" when there is
no data for it to process. A pipeline in standby will have no pods running and
thus will consume no resources; its state will be displayed as "standby".

Standby replaces `scale_down_threshold` from releases prior to 1.7.1.

### Cache Size (optional)

`cache_size` controls how much cache a pipeline's sidecar containers use. In
general, your pipeline's performance will increase with the cache size, but
only up to a certain point depending on your workload.

Every worker in every pipeline has a limited-functionality `pachd` server
running adjacent to it, which proxies PFS reads and writes (this prevents
thundering herds when jobs start and end, which is when all of a pipeline's
workers are reading from and writing to PFS simultaneously). Part of what these
"sidecar" pachd servers do is cache PFS reads. If a pipeline has a cross input,
and a worker is downloading the same datum from one branch of the input
repeatedly, then the cache can speed up processing significantly.

### Enable Stats (optional)

The `enable_stats` parameter turns on statistics tracking for the pipeline.
When you enable the statistics tracking, the pipeline automatically creates
and commits datum processing information to a special branch in its output
repo called `"stats"`. This branch stores information about each datum that
the pipeline processes, including timing information, size information, logs,
and `/pfs` snapshots. You can view these statistics by running the `pachctl
inspect datum` and `pachctl list datum` commands, as well as through the web UI.

Once turned on, statistics tracking cannot be disabled for an existing pipeline.
To turn it off, you must delete the pipeline, set `enable_stats` to `false` or
remove it from your pipeline spec entirely, and recreate the pipeline from
that updated spec file. While the pipeline that collects the stats
exists, the storage space used by the stats cannot be released.

!!! note
    Enabling stats results in a slight increase in storage use for logs and timing
    information.
    However, stats do not use as much extra storage as it might appear because
    snapshots of the `/pfs` directory, which are the largest stored assets,
    do not require extra space.

### Service (alpha feature, optional)

`service` specifies that the pipeline should be treated as a long-running
service rather than a data transformation. This means that `transform.cmd` is
not expected to exit; if it does, it will be restarted. Furthermore, the service
is exposed outside the container using a Kubernetes service.
`"internal_port"` should be a port that the user code binds to inside the
container, and `"external_port"` is the port on which it is exposed through the
`NodePorts` functionality of Kubernetes services. After a service has been
created, you should be able to access it at
`http://<kubernetes-host>:<external_port>`.
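
For example, if your user code listens on port 8080 inside the container, a
sketch like the following exposes it on node port 30080 (the port numbers are
illustrative):

```json
"service": {
  "internal_port": 8080,
  "external_port": 30080
}
```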

### Spout (optional)

`spout` is a type of pipeline that processes streaming data.
Unlike a union or cross pipeline, a spout pipeline does not have
a PFS input. Instead, it opens a Linux *named pipe* into the source of the
streaming data. Your pipeline
can be either a spout or a service and not both. Therefore, if you added
the `service` as a top-level object in your pipeline, you cannot add `spout`.
However, you can expose a service from inside of a spout pipeline by
specifying it as a field in the `spout` spec. Then, Kubernetes creates
a service endpoint that you can expose externally. You can get the information
about the service by running `kubectl get services`.

For more information, see [Spouts](../concepts/pipeline-concepts/pipeline/spout.md).

### Max Queue Size (optional)

`max_queue_size` specifies the maximum number of datums that a worker should
hold in its processing queue at a given time (after processing its entire
queue, a worker "checkpoints" its progress by writing to persistent storage).
The default value is `1`, which means workers only hold onto the datum that
they are currently processing.

Increasing this value can improve pipeline performance, as it allows workers
to download, process, and upload different datums at the same
time (and reduces the total time spent on checkpointing). Decreasing this value
can make jobs more robust to failed workers, as work gets checkpointed more
often, and a failing worker will not lose as much progress. Setting this value
too high can also cause problems if you have `lazy` inputs, as there's a cap of
10,000 `lazy` files per worker and multiple datums that are running all count
against this limit.

### Chunk Spec (optional)

`chunk_spec` specifies how a pipeline should chunk its datums.

`chunk_spec.number`, if nonzero, specifies that each chunk should contain `number`
datums. Chunks may contain fewer datums if the total number of datums does not
divide evenly.

`chunk_spec.size_bytes`, if nonzero, specifies a target size for each chunk of datums.
Chunks may be larger or smaller than `size_bytes`, but will usually be
pretty close to `size_bytes` in size.

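For example, to have workers claim work in chunks of roughly ten datums at a
time, you might set:

```json
"chunk_spec": {
  "number": 10
}
```
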
### Scheduling Spec (optional)

`scheduling_spec` specifies how the pods for a pipeline should be scheduled.

`scheduling_spec.node_selector` allows you to select which nodes your pipeline
will run on. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
on node selectors for more information about how this works.

`scheduling_spec.priority_class_name` allows you to select the priority class
for the pipeline, which affects how Kubernetes chooses to schedule and deschedule
the pipeline. Refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass)
on priority and preemption for more information about how this works.

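For illustration, the sketch below schedules the pipeline onto nodes carrying a
hypothetical label and assigns a priority class that is assumed to already
exist in your cluster:

```json
"scheduling_spec": {
  "node_selector": {
    "disktype": "ssd"
  },
  "priority_class_name": "high-priority"
}
```
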
### Pod Spec (optional)

`pod_spec` is an advanced option that allows you to set fields in the pod spec
that haven't been explicitly exposed in the rest of the pipeline spec. A good
way to figure out what JSON you should pass is to create a pod in Kubernetes
with the proper settings, then run:

```
kubectl get po/<pod-name> -o json | jq .spec
```

This gives you a correctly formatted piece of JSON; you should then remove
the extraneous fields that Kubernetes injects or that can be set elsewhere.

The JSON is applied after the other parameters for the `pod_spec` have already
been set, as a [JSON Merge Patch](https://tools.ietf.org/html/rfc7386). This
means that you can modify things such as the storage and user containers.

### Pod Patch (optional)

`pod_patch` is similar to `pod_spec` above but is applied as a [JSON
Patch](https://tools.ietf.org/html/rfc6902). Note that this means the
process outlined above of modifying an existing pod spec and then manually
blanking unchanged fields won't work; you'll need to create a correctly
formatted patch by diffing the two pod specs.

## The Input Glob Pattern

Each PFS input needs to specify a [glob pattern](../concepts/pipeline-concepts/distributed_computing.md).

Pachyderm uses the glob pattern to determine how many "datums" an input
consists of. Datums are the unit of parallelism in Pachyderm. That is,
Pachyderm attempts to process datums in parallel whenever possible.

Intuitively, you may think of the input repo as a file system, and you are
applying the glob pattern to the root of the file system. The files and
directories that match the glob pattern are considered datums.

For instance, let's say your input repo has the following structure:

```
/foo-1
/foo-2
/bar
  /bar-1
  /bar-2
```

Now let's consider what the following glob patterns would match respectively:

* `/`: this pattern matches `/`, the root directory itself, meaning all the data would be a single large datum.
* `/*`: this pattern matches everything under the root directory, giving us 3 datums:
`/foo-1`, `/foo-2`, and everything under the directory `/bar`.
* `/bar/*`: this pattern matches files only under the `/bar` directory: `/bar-1` and `/bar-2`
* `/foo*`: this pattern matches files under the root directory that start with the characters `foo`
* `/*/*`: this pattern matches everything that's two levels deep relative
to the root: `/bar/bar-1` and `/bar/bar-2`

The datums are defined as whichever files or directories match the glob pattern. For instance, if we used
`/*`, then the job will process three datums (potentially in parallel):
`/foo-1`, `/foo-2`, and `/bar`. Both the `bar-1` and `bar-2` files within the directory `bar` would be grouped together and always processed by the same worker.

## PPS Mounts and File Access

### Mount Paths

The root mount point is at `/pfs`, which contains:

- `/pfs/input_name` which is where you would find the datum.
  - Each input will be found here by its name, which defaults to the repo
  name if not specified.
- `/pfs/out` which is where you write any output.

## Environment Variables

There are several environment variables that get injected into the user code
before it runs. They are:

- `PACH_JOB_ID` the ID of the currently running job.
- `PACH_OUTPUT_COMMIT_ID` the ID of the commit being outputted to.
- For each input there will be an environment variable with the same name
    set to the path of the file for that input. For example, if you are
    accessing an input called `foo` from the path `/pfs/foo` which contains a
    file called `bar`, then the environment variable `foo` will have the value
    `/pfs/foo/bar`. The path in the environment variable is the path which
    matched the glob pattern, even if the file is a directory, i.e. if your glob
    pattern is `/*` it would match a directory `/bar`, and the value of `$foo`
    would then be `/pfs/foo/bar`. With a glob pattern of `/*/*` you would match
    the files contained in `/bar`, and thus the value of `foo` would be
    `/pfs/foo/bar/quux`.
- For each input there will be an environment variable named `input_COMMIT`
    indicating the ID of the commit being used for that input.

In addition to these environment variables, Kubernetes also injects others for
Services that are running inside the cluster. These allow you to connect to
those outside services, which can be powerful but also can be hard to reason
about, as processing might be retried multiple times. For example, if your code
writes a row to a database, that row may be written multiple times due to
retries. Interaction with outside services should be [idempotent](https://en.wikipedia.org/wiki/Idempotence) to prevent
unexpected behavior. Furthermore, one of the running services that your code
can connect to is Pachyderm itself. This is generally not recommended as very
little of the Pachyderm API is idempotent, but in some specific cases it can be
a viable approach.