# Deploy Pachyderm on AWS

After you deploy a Kubernetes cluster by using `kops` or `eksctl`,
you can deploy Pachyderm on top of that cluster.

You need to complete the following steps to deploy Pachyderm:

1. Install `pachctl` as described in [Install pachctl](../../../../getting_started/local_installation#install-pachctl).
1. Add stateful storage for Pachyderm as described in [Add Stateful Storage](#add-stateful-storage).
1. Deploy Pachyderm by using an [IAM role](#deploy-pachyderm-with-an-iam-role)
(recommended) or [an access key](#deploy-pachyderm-with-an-access-key).

## Add Stateful Storage

Pachyderm requires the following types of persistent storage:

* An S3 object store bucket for data. The S3 bucket name
  must be globally unique across all AWS accounts, not just within
  your region. Therefore, add a descriptive prefix to the S3 bucket
  name, such as your username.

* An Elastic Block Store (EBS) persistent volume (PV) for Pachyderm
  metadata. Pachyderm recommends that you assign at least 10 GB for this
  persistent EBS volume. If you expect your cluster to be very
  long running or to scale to thousands of jobs per commit, you might
  need to add more storage. However, you can easily increase the
  size of the persistent volume later.

To add stateful storage, complete the following steps:

1. Set the following environment variables:

   * `BUCKET_NAME`: A globally unique S3 bucket name.
   * `STORAGE_SIZE`: The size of the persistent volume in GB. For example, `10`.
   * `AWS_REGION`: The AWS region of your Kubernetes cluster. For example,
     `us-west-2`, not the availability zone `us-west-2a`.

1. Create an S3 bucket:

   * If you are creating an S3 bucket in the `us-east-1` region, run the following
     command:

     ```shell
     aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION}
     ```

   * If you are creating an S3 bucket in any region other than `us-east-1`,
     run the following command:

     ```shell
     aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION} --create-bucket-configuration LocationConstraint=${AWS_REGION}
     ```

1. Verify that the S3 bucket was created:

   ```shell
   aws s3api list-buckets --query 'Buckets[].Name'
   ```
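
The choice between the two `create-bucket` commands above depends only on the region, so it can be captured in a small helper. The following is a minimal sketch, not part of Pachyderm: the function name `create_bucket_args` and the default values are hypothetical, and the script prints the command instead of running it.

```shell
# Hypothetical example values; replace them with your own.
BUCKET_NAME="${BUCKET_NAME:-jdoe-pachyderm-store}"
STORAGE_SIZE="${STORAGE_SIZE:-10}"
AWS_REGION="${AWS_REGION:-us-west-2}"

# Print the arguments for `aws s3api create-bucket`.
# $1 is the bucket name, $2 is the region. As in the two commands above,
# us-east-1 gets no LocationConstraint and every other region gets one.
create_bucket_args() {
  if [ "$2" = "us-east-1" ]; then
    printf '%s\n' "--bucket $1 --region $2"
  else
    printf '%s\n' "--bucket $1 --region $2 --create-bucket-configuration LocationConstraint=$2"
  fi
}

# Dry run: show the command that would be executed.
echo "aws s3api create-bucket $(create_bucket_args "${BUCKET_NAME}" "${AWS_REGION}")"
```

Review the printed command, then run it yourself once it matches your bucket name and region.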

### (Optional) Set up Bucket Encryption

Amazon S3 supports two types of bucket encryption: server-side encryption
with Amazon S3 managed keys (SSE-S3) and server-side encryption with AWS
Key Management Service (AWS KMS), which stores customer master keys.
Pachyderm supports both methods, so you can set up either of them when
you create a bucket for your Pachyderm cluster. Because Pachyderm requests
to buckets do not include encryption information, the default encryption
method that you select for the bucket is applied. Setting up communication
between Pachyderm object storage clients and AWS KMS to append encryption
information to Pachyderm requests is not supported and not recommended.

To set up bucket encryption, see [Amazon S3 Default Encryption for S3 Buckets](https://docs.aws.amazon.com/AmazonS3/latest/dev/bucket-encryption.html).
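
Default encryption can also be configured from the AWS CLI with `aws s3api put-bucket-encryption`. The following is a minimal sketch under stated assumptions: the helper name `encryption_config` is hypothetical, and the KMS key ID is whatever key you have already created. The helper only prints the configuration JSON. The call that changes the bucket is left commented out.

```shell
# Print a default-encryption configuration for an S3 bucket.
# With no argument, select SSE-S3 (AES256).
# With one argument, select AWS KMS and use the argument as the key ID.
encryption_config() {
  if [ -z "${1:-}" ]; then
    printf '%s\n' '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
  else
    printf '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"%s"}}]}\n' "$1"
  fi
}

# Usage (requires AWS credentials, so it is left commented out):
# aws s3api put-bucket-encryption --bucket "${BUCKET_NAME}" \
#   --server-side-encryption-configuration "$(encryption_config)"
```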

## Deploy Pachyderm with an IAM Role

IAM roles provide better user management and security
capabilities compared to access keys. If a malicious user gains access to
an access key, your data might become compromised. Therefore, enterprises
often opt to use IAM roles rather than access keys for production
deployments.

You need to configure the following IAM settings:

* The worker nodes on which Pachyderm is deployed must be associated
  with the IAM role that is assigned to the Kubernetes cluster.
  If you created your cluster by using `kops` or `eksctl`,
  the nodes should already have a dedicated IAM role assigned.

* The IAM role must have access to the S3 bucket that you created for
  Pachyderm.

* The IAM role must have correct trust relationships.

You also need to set an environment variable `IAM_ROLE` to the name
of the IAM role that you will use to deploy the cluster.
This name is different from the Role ARN or the Instance
Profile ARN of the role. It is the actual role name.
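
The role name and the bucket permissions described above can be prepared on the command line. The sketch below is illustrative only: `role_name_from_arn` and `s3_policy_document` are hypothetical helper names, the ARN and the policy name are made-up examples, and the policy mirrors the `s3:ListBucket`, `s3:PutObject`, `s3:GetObject`, and `s3:DeleteObject` permissions that this page grants to the role.

```shell
# Derive the role name from a role ARN. The name is the final path segment.
# Do not pass an instance profile ARN: its name can differ from the role name.
role_name_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Print an IAM policy document that grants access to the bucket named in $1.
s3_policy_document() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::$1"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::$1/*"]
    }
  ]
}
EOF
}

# Hypothetical ARN, for illustration only.
IAM_ROLE="$(role_name_from_arn "arn:aws:iam::123456789012:role/nodes.example.k8s.local")"
echo "${IAM_ROLE}"

# Usage (requires AWS credentials, so it is left commented out;
# the policy name is a hypothetical example):
# aws iam put-role-policy --role-name "${IAM_ROLE}" \
#   --policy-name pachyderm-s3-access \
#   --policy-document "$(s3_policy_document "${BUCKET_NAME}")"
```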

To deploy Pachyderm with an IAM role, complete the following steps:

1. Find the IAM role assigned to the cluster:

   1. Go to the AWS Management Console.
   1. Select an EC2 instance in the Kubernetes cluster.
   1. Click **Description**.
   1. Find the **IAM Role** field.

1. Enable access to the S3 bucket for the IAM role:

   1. In the **IAM Role** field, click the IAM role.
   1. In the **Permissions** tab, click **Edit policy**.
   1. Select the **JSON** tab.
   1. Append the following statements to the end of the existing `Statement` list in the JSON:

      ```json
      {
          "Effect": "Allow",
          "Action": [
              "s3:ListBucket"
          ],
          "Resource": [
              "arn:aws:s3:::<your-bucket>"
          ]
      },
      {
          "Effect": "Allow",
          "Action": [
              "s3:PutObject",
              "s3:GetObject",
              "s3:DeleteObject"
          ],
          "Resource": [
              "arn:aws:s3:::<your-bucket>/*"
          ]
      }
      ```

      Replace `<your-bucket>` with the name of your S3 bucket.

      **Note:** For an EKS cluster, you might need to use the
      **Add inline policy** button and create a name for the new policy.
      Insert the JSON above between the square brackets of the `Statement` element.

1. Set up trust relationships for the IAM role:

   1. Click **Trust relationships > Edit trust relationship**.
   1. Ensure that you see a statement with `sts:AssumeRole`. Example:

      ```json
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
      ```

1. Set the environment variable `IAM_ROLE` to the IAM role name
   for the Pachyderm deployment.

1. Deploy Pachyderm:

   ```shell
   pachctl deploy amazon ${BUCKET_NAME} ${AWS_REGION} ${STORAGE_SIZE} --dynamic-etcd-nodes=1 --iam-role ${IAM_ROLE}
   ```

   The deployment takes some time. You can run `kubectl get pods` periodically
   to check the status of the deployment. When Pachyderm is deployed, the command
   shows all pods as `READY`:

   ```shell
   kubectl get pods
   ```

   **System Response:**

   ```shell
   NAME                     READY     STATUS    RESTARTS   AGE
   dash-6c9dc97d9c-89dv9    2/2       Running   0          1m
   etcd-0                   1/1       Running   0          4m
   pachd-65fd68d6d4-8vjq7   1/1       Running   0          4m
   ```

   **Note:** If you see a few restarts on the `pachd` pods, it means that
   Kubernetes tried to bring up those pods before `etcd` was ready. Therefore,
   Kubernetes restarted those pods. You can safely ignore these restarts.

1. Verify that the Pachyderm cluster is up and running:

   ```shell
   pachctl version
   ```

   **System Response:**

   ```shell
   COMPONENT           VERSION
   pachctl             {{ config.pach_latest_version }}
   pachd               {{ config.pach_latest_version }}
   ```

   * If you want to access the Pachyderm UI or use the S3 gateway, you need to
     forward Pachyderm ports. Open a new terminal window and run the
     following command:

     ```shell
     pachctl port-forward
     ```
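
The periodic `kubectl get pods` check in the procedure above can be automated by parsing the `READY` column. The following is a minimal sketch, assuming the default table output shown above: the helper name `all_pods_ready` is hypothetical, and it reads the table from standard input so that it can be tried without a cluster.

```shell
# Succeed only if there is at least one pod row and every row reports
# all of its containers as ready, that is, the READY column reads n/n.
# Expects the default `kubectl get pods` table on standard input.
all_pods_ready() {
  awk 'BEGIN { bad = 0; n = 0 }
       NR > 1 { n++; split($2, r, "/"); if (r[1] != r[2]) bad = 1 }
       END { exit (bad || n == 0) }'
}

# Usage (requires a running cluster, so it is left commented out):
# until kubectl get pods | all_pods_ready; do sleep 5; done
```

The helper treats empty output as not ready, so the loop keeps waiting while no pods exist yet.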

## Deploy Pachyderm with an Access Key

When you installed `kops`, you created a dedicated IAM
user with access credentials such as an access key and
secret key. You can deploy
Pachyderm by using the credentials of this IAM user
directly. However, deploying Pachyderm with an
access key might not satisfy your enterprise security
requirements. Therefore, deploying with an IAM role
is preferred.

To deploy Pachyderm with an access key, complete the following
steps:

1. Run the following command to deploy your Pachyderm cluster:

   ```shell
   pachctl deploy amazon ${BUCKET_NAME} ${AWS_REGION} ${STORAGE_SIZE} --dynamic-etcd-nodes=1 --credentials "${AWS_ACCESS_KEY_ID},${AWS_SECRET_ACCESS_KEY},"
   ```

   The `,` at the end of the `--credentials` value in the deploy
   command leaves room for an optional temporary AWS token. You might use
   such a token if you are just experimenting with
   Pachyderm. However, do not use this token in a
   production deployment.

   The deployment takes some time. You can run `kubectl get pods` periodically
   to check the status of the deployment. When Pachyderm is deployed, the command
   shows all pods as `READY`:

   ```shell
   kubectl get pods
   ```

   **System Response:**

   ```shell
   NAME                     READY     STATUS    RESTARTS   AGE
   dash-6c9dc97d9c-89dv9    2/2       Running   0          1m
   etcd-0                   1/1       Running   0          4m
   pachd-65fd68d6d4-8vjq7   1/1       Running   0          4m
   ```

   **Note:** If you see a few restarts on the `pachd` pods, it means that
   Kubernetes tried to bring up those pods before `etcd` was ready.
   Therefore, Kubernetes restarted those pods. You can safely ignore these
   restarts.

1. Verify that the Pachyderm cluster is up and running:

   ```shell
   pachctl version
   ```

   **System Response:**

   ```shell
   COMPONENT           VERSION
   pachctl             {{ config.pach_latest_version }}
   pachd               {{ config.pach_latest_version }}
   ```

   * If you want to access the Pachyderm UI or use the S3 gateway, you need to
     forward Pachyderm ports. Open a new terminal window and run the
     following command:

     ```shell
     pachctl port-forward
     ```
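
The value passed to the `--credentials` flag in the deploy command above is a comma-separated string with an optional third field. The following is a minimal sketch of how that string can be assembled: the helper name `pach_credentials` is hypothetical, and reading the temporary token from `AWS_SESSION_TOKEN` is an assumption based on the variable name that the AWS CLI uses.

```shell
# Build the --credentials value: "<access key ID>,<secret key>,<token>".
# The token ($3) is optional. When it is empty, the trailing comma
# remains, which matches the deploy command on this page.
pach_credentials() {
  printf '%s,%s,%s\n' "$1" "$2" "${3:-}"
}

# Usage (requires real credentials and a cluster, so it is left commented out):
# pachctl deploy amazon "${BUCKET_NAME}" "${AWS_REGION}" "${STORAGE_SIZE}" \
#   --dynamic-etcd-nodes=1 \
#   --credentials "$(pach_credentials "${AWS_ACCESS_KEY_ID}" "${AWS_SECRET_ACCESS_KEY}" "${AWS_SESSION_TOKEN:-}")"
```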
   285       pachctl port-forward
   286       ```