github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.11.x/deploy-manage/deploy/environment-variables.md

     1  # Configure Environment Variables
     2  
     3  When you use Pachyderm, you can define environment variables that
     4  pass the required configuration directly to your application.
     5  
     6  In Pachyderm, you can define the following types of environment
     7  variables:
     8  
     9  * `pachd` environment variables that define parameters for your
    10  Pachyderm daemon container.
    11  
    12  * Pachyderm worker environment variables that define parameters
    13  on the Kubernetes pods that run your pipeline code.
    14  
    15  You can reference environment variables in your code. For example,
    16  if your code writes data to an external system and you want
    17  to know the current job ID, you can use the `PACH_JOB_ID`
    18  environment variable to refer to the current job ID.
    19  
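As a minimal sketch of this pattern, the following shell snippet prefixes a record with the current job ID. The `tag_record` helper is hypothetical and not part of Pachyderm. The `unknown` fallback is an assumption that keeps the snippet runnable outside a Pachyderm worker, where `PACH_JOB_ID` is not set.

```shell
#!/bin/sh
# Hypothetical helper: prefix a record with the current Pachyderm job ID.
# PACH_JOB_ID is set inside a Pachyderm worker; outside a worker it is
# unset, so a placeholder value is used instead.
tag_record() {
    printf '%s,%s\n' "${PACH_JOB_ID:-unknown}" "$1"
}

tag_record "row-1"
```

An external system that receives records in this form can trace each one back to the job that produced it.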
    20  You can find all of these variables in the Pachyderm manifest that
    21  is generated when you run `pachctl deploy` with the `--dry-run`
    22  flag.
    23  
    24  !!! note "See Also"
    25      [Deploy Pachyderm](../../../getting_started/local_installation/#deploy-pachyderm)
    26  
    27  ## `pachd` Environment Variables
    28  
    29  You can find the list of `pachd` environment variables in the
    30  `pachd` manifest by running the following command:
    31  
    32  ```shell
    33  kubectl get deploy pachd -o yaml
    34  ```
    35  
    36  The following tables list all the `pachd`
    37  environment variables.
    38  
    39  **Global Configuration**
    40  
    41  | Environment Variable   | Default Value     | Description |
    42  | ---------------------- | ----------------- | ----------- |
    43  | `ETCD_SERVICE_HOST`    | N/A               | The host on which the etcd service runs. |
    44  | `ETCD_SERVICE_PORT`    | N/A               | The etcd port number.                    |
    45  | `PPS_WORKER_GRPC_PORT` | `80`              | The gRPC port number.                    |
    46  | `PORT`                 | `650`             | The `pachd` port number. |
    47  | `HTTP_PORT`            | `652`             | The HTTP port number.   |
    48  | `PEER_PORT`            | `653`             | The port for pachd-to-pachd communication. |
    49  | `NAMESPACE`            | `default`         | The namespace in which Pachyderm is deployed. |
    50  
    51  **pachd Configuration**
    52  
    53  | Environment Variable       | Default Value | Description |
    54  | -------------------------- | ------------- | ----------- |
    55  | `NUM_SHARDS`               | `32`      | The maximum number of `pachd` pods that can run in a <br> single cluster. |
    56  | `STORAGE_BACKEND`          | `""`      | The storage backend defined for the Pachyderm cluster.|
    57  | `STORAGE_HOST_PATH`        | `""`      | The host path to storage. |
    58  | `KUBERNETES_PORT_443_TCP_ADDR` |`none` | An IP address that Kubernetes exports <br> automatically for your code to communicate with <br> the Kubernetes API. Read access only. Most variables <br> that use the `PORT_ADDRESS_TCP_ADDR` pattern <br> are Kubernetes environment variables. For more information,<br> see [Kubernetes environment variables](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables). |
    59  | `METRICS`                  | `true`   | Defines whether anonymous Pachyderm metrics are <br>collected. |
    60  | `BLOCK_CACHE_BYTES`        | `1G`     | The size of the block cache in `pachd`. |
    61  | `WORKER_IMAGE`             | `""`     | The base Docker image that is used to run your pipeline.|
    62  | `WORKER_SIDECAR_IMAGE`     | `""`     | The `pachd` image that is used as a worker sidecar. |
    63  | `WORKER_IMAGE_PULL_POLICY` | `IfNotPresent`| The pull policy that defines how Docker images are <br>pulled. You can set <br> a Kubernetes image pull policy as needed. |
    64  | `LOG_LEVEL`                | `info`   | Verbosity of the log output. If you want to disable <br> logging, set this variable to `0`. Valid options: <br>`debug` <br>`info` <br>`error`<br>For more information, see [Go logrus log levels](https://godoc.org/github.com/sirupsen/logrus#Level). |
    65  | `IAM_ROLE`                 |  `""`    | The role that defines permissions for Pachyderm in AWS.|
    66  | `IMAGE_PULL_SECRET`        |  `""`    | The Kubernetes secret for image pull credentials.|
    67  | `NO_EXPOSE_DOCKER_SOCKET`  |  `false` | Controls whether you can build images by using <br> the `--build` flag.|
    68  | `EXPOSE_OBJECT_API`        |  `false` | Controls access to internal Pachyderm API.|
    69  | `WORKER_USES_ROOT`         |  `true`  | Controls root access in the worker container.|
    70  | `S3GATEWAY_PORT`           |  `600`   | The S3 gateway port number.|
    71  | `DISABLE_COMMIT_PROGRESS_COUNTER` |`false`| A feature flag that disables commit propagation <br> progress counter. If you have a large DAG, <br> setting this parameter to `true` might help <br> improve etcd performance. You only need to set <br>this parameter on the `pachd` pod. Pachyderm passes <br> this parameter to worker containers automatically. |
    72  
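One way to change a `pachd` variable on a running cluster is `kubectl set env`. The sketch below wraps that command in a hypothetical `set_pachd_env` helper. It assumes that the deployment is named `pachd`, as in the `kubectl get deploy pachd` command above, and that it runs in the `default` namespace unless you pass another one. A change made this way might be overwritten the next time you redeploy from a generated manifest.

```shell
#!/bin/sh
# Hypothetical helper: set one environment variable on the pachd deployment.
#   $1 = variable name, $2 = value, $3 = namespace (assumed default: "default")
set_pachd_env() {
    kubectl set env deployment/pachd "$1=$2" --namespace "${3:-default}"
}

# Example (requires kubectl access to the cluster):
#   set_pachd_env LOG_LEVEL debug
```

Kubernetes rolls out a new `pachd` pod after the deployment spec changes, so expect a brief restart.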
    73  **Storage Configuration**
    74  
    75  | Environment Variable       | Default Value     | Description |
    76  | -------------------------- | ----------------- | ----------- |
    77  | `STORAGE_MEMORY_THRESHOLD` | N/A               | Defines the storage memory threshold. |
    78  | `STORAGE_SHARD_THRESHOLD`  | N/A               | Defines the storage shard threshold.  |
    79  
    80  ## Pipeline Worker Environment Variables
    81  
    82  Pachyderm defines many environment variables for each Pachyderm
    83  worker that runs your pipeline code. You can print the list
    84  of environment variables to your Pachyderm logs by including
    85  the `env` command in your pipeline specification. For example,
    86  if you have an `images` repository, you can configure your pipeline
    87  specification like this:
    88  
    89  ```json
    90  {
    91      "pipeline": {
    92          "name": "env"
    93      },
    94      "input": {
    95          "pfs": {
    96              "glob": "/",
    97              "repo": "images"
    98          }
    99      },
   100      "transform": {
   101          "cmd": ["sh"],
   102          "stdin": ["env"],
   103          "image": "ubuntu:14.04"
   104      },
   105      "enable_stats": true
   106  }
   107  ```
   108  
   109  Run this pipeline. When it completes, you can view the logged
   110  variables by running the following command:
   111  
   112  ```shell
   113  pachctl logs --pipeline=env
   114  PPS_WORKER_IP=172.17.0.7
   115  DASH_PORT_8081_TCP_PROTO=tcp
   116  PACHD_PORT_600_TCP_PORT=600
   117  KUBERNETES_SERVICE_PORT=443
   118  KUBERNETES_PORT=tcp://10.96.0.1:443
   119  ...
   120  ```
   121  
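Because the full list is long, it can help to filter it. The following sketch keeps only variables whose names start with `PACH_` or `PPS_`. The `filter_pachyderm_vars` helper is hypothetical, and the sketch assumes that `pachctl logs` prints each variable as a plain `NAME=value` line, as in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that look like Pachyderm-specific
# variables (names starting with PACH_ or PPS_) from text read on stdin.
filter_pachyderm_vars() {
    grep -E '^(PACH|PPS)_[A-Z0-9_]*='
}

# Usage against a real cluster (requires pachctl and the `env` pipeline):
#   pachctl logs --pipeline=env | filter_pachyderm_vars

# Self-contained demonstration that uses lines from the sample output.
printf '%s\n' \
    'PPS_WORKER_IP=172.17.0.7' \
    'DASH_PORT_8081_TCP_PROTO=tcp' \
    'KUBERNETES_SERVICE_PORT=443' | filter_pachyderm_vars
```

Kubernetes service variables such as `PACHD_PORT_600_TCP_PORT` do not match this pattern, because the pattern requires the underscore directly after `PACH` or `PPS`.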
   122  You should see a lengthy list of variables. Many of them define
   123  internal networking parameters that you will most likely not
   124  need to use.
   125  
   126  Most users find the following environment variables
   127  particularly useful:
   128  
   129  | Environment Variable       | Description |
   130  | -------------------------- | --------------------------------------------- |
   131  | `PACH_JOB_ID`              | The ID of the current job. For example, <br> `PACH_JOB_ID=8991d6e811554b2a8eccaff10ebfb341`. |
   132  | `PACH_OUTPUT_COMMIT_ID`    | The ID of the commit in the output repo for <br> the current job. For example, <br> `PACH_OUTPUT_COMMIT_ID=a974991ad44d4d37ba5cf33b9ff77394`. |
   133  | `PPS_NAMESPACE`            | The PPS namespace. For example, <br> `PPS_NAMESPACE=default`. |
   134  | `PPS_SPEC_COMMIT`          | The hash of the pipeline specification commit.<br> This value is tied to the pipeline version. Therefore, jobs that use <br> the same version of the same pipeline have the same spec commit. <br> For example, `PPS_SPEC_COMMIT=3596627865b24c4caea9565fcde29e7d`. |
   135  | `PPS_POD_NAME`             | The name of the pipeline pod. For example, <br>`pipeline-env-v1-zbwm2`. |
   136  | `PPS_PIPELINE_NAME`        | The name of the pipeline that this pod runs. <br> For example, `env`. |
   137  | `PIPELINE_SERVICE_PORT_PROMETHEUS_METRICS` | The port that you can use to <br> expose metrics to Prometheus from within your pipeline. The default value is 9090. |
   138  | `HOME`                     | The path to the home directory. The default value is `/root`. |
   139  | `<input-repo>=<path/to/input/repo>` | The path to the filesystem that is <br> defined in the `input` in your pipeline specification. Pachyderm defines <br> such a variable for each input. The path is defined by the `glob` pattern in the <br> spec. For example, if you have an input `images` and a glob pattern of `/`, <br> Pachyderm defines the `images=/pfs/images` variable. If you <br> have a glob pattern of `/*`, Pachyderm matches <br> the files in the `images` repository and, therefore, the path is <br> `images=/pfs/images/liberty.png`. |
   140  | `<input-repo>_COMMIT`      | The ID of the commit that is used for the input. Pachyderm defines <br> such a variable for each input. <br>For example, `images_COMMIT=fa765b5454e3475f902eadebf83eac34`. |
   141  | `S3_ENDPOINT`         | A Pachyderm S3 gateway sidecar container endpoint. <br> If you have an S3-enabled pipeline, this parameter specifies a URL that <br> you can use to access the state of the pipeline's repositories as of <br> the time when a particular job ran. The URL has the following format: <br> `http://<job-ID>-s3:600`. <br> An example of accessing the data by using the AWS CLI looks like this: <br>`echo foo_data | aws --endpoint=${S3_ENDPOINT} s3 cp - s3://out/foo_file`. |
   142  
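To illustrate how several of these variables fit together, here is a minimal sketch of a transform script for a pipeline whose input repository is `images`. The `describe_datum` helper and the fallback values are assumptions added so that the script also runs outside a cluster. Inside a worker, Pachyderm sets the real values.

```shell
#!/bin/sh
# Hypothetical helper: summarize the datum that this worker is processing.
# `images` and `images_COMMIT` exist because the input repo is named `images`;
# the values after `:-` are placeholders for running outside Pachyderm.
describe_datum() {
    printf 'pipeline=%s job=%s input=%s commit=%s\n' \
        "${PPS_PIPELINE_NAME:-unknown}" \
        "${PACH_JOB_ID:-unknown}" \
        "${images:-/pfs/images}" \
        "${images_COMMIT:-unknown}"
}

describe_datum
```

In a real pipeline, you might write this line to a file under `/pfs/out` rather than to standard output, so that the provenance information is stored with the results.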
   143  In addition to these environment variables, Kubernetes injects others for
   144  Services that run inside the cluster. These variables enable you to connect to
   145  those services from your pipeline code, which can be powerful. However, because
   146  Pachyderm might retry processing, the same call can be made multiple times. For
   147  example, if your code writes a row to a database, that row might be written
   148  multiple times because of retries. Interaction with outside services must be
   149  idempotent to prevent unexpected behavior. Furthermore, one of the running services that your code
   150  can connect to is Pachyderm itself. This is generally not recommended because very
   151  little of the Pachyderm API is idempotent, but in some specific cases it can be
   152  a viable approach.
   153  
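The idempotency requirement can be sketched as follows. In this example, a local file stands in for the external system, and each row is keyed by the job ID, so a retried job does not add a duplicate. The `write_once` helper and the file-based store are assumptions for illustration only. A real database would typically use an upsert or a unique key constraint instead.

```shell
#!/bin/sh
# Hypothetical helper: append "key,value" to a store file only if no row
# with that key exists yet.
#   $1 = store file, $2 = key, $3 = value
write_once() {
    touch "$1"
    # Exit status 0 from awk means that a row with this exact key was found.
    if awk -F, -v k="$2" '$1 == k { found = 1 } END { exit !found }' "$1"; then
        return 0
    fi
    printf '%s,%s\n' "$2" "$3" >> "$1"
}

store="$(mktemp)"
write_once "$store" "${PACH_JOB_ID:-job-1}" "processed"
write_once "$store" "${PACH_JOB_ID:-job-1}" "processed"   # simulated retry
cat "$store"
rm -f "$store"
```

After both calls, the store contains a single row for the job, which is the behavior that you want when a job is retried.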
   154  !!! note "See Also"
   155      - [transform.env](../../../reference/pipeline_spec/#transform-required)