# Google Cloud Platform

Google Cloud Platform provides seamless support for Kubernetes.
Therefore, Pachyderm is fully supported on [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/) (GKE).
The following section walks you through deploying a Pachyderm cluster on GKE.

## Prerequisites

- [Google Cloud SDK](https://cloud.google.com/sdk/) >= 124.0.0
- [kubectl](https://kubernetes.io/docs/user-guide/prereqs/)
- [pachctl](#install-pachctl)

If this is the first time you are using the SDK, follow
the [Google SDK QuickStart Guide](https://cloud.google.com/sdk/docs/quickstarts).

!!! note
    When you follow the QuickStart Guide, you might update your `~/.bash_profile`
    and point your `$PATH` at the location where you extracted
    `google-cloud-sdk`. However, Pachyderm recommends that you extract
    the SDK to `~/bin`.

!!! tip
    You can install `kubectl` by using the Google Cloud SDK and
    running the following command:

    ```shell
    gcloud components install kubectl
    ```

## Deploy Kubernetes

To create a new Kubernetes cluster by using GKE, run:

```shell
CLUSTER_NAME=<any unique name, e.g. "pach-cluster">

GCP_ZONE=<a GCP availability zone, e.g. "us-west1-a">

gcloud config set compute/zone ${GCP_ZONE}

gcloud config set container/cluster ${CLUSTER_NAME}

MACHINE_TYPE=<machine type for the k8s nodes, we recommend "n1-standard-4" or larger>

# By default, the following command spins up a 3-node cluster. You can change the default with `--num-nodes VAL`.
gcloud container clusters create ${CLUSTER_NAME} --scopes storage-rw --machine-type ${MACHINE_TYPE}

# By default, GKE clusters have RBAC enabled. To allow 'pachctl deploy' to give the 'pachyderm' service account
# the requisite privileges via clusterrolebindings, you will need to grant *your user account* the privileges
# needed to create those clusterrolebindings.
#
# Note that this command is simple and concise, but gives your user account more privileges than necessary. See
# https://docs.pachyderm.io/en/latest/deployment/rbac.html for the complete list of privileges that the
# pachyderm serviceaccount needs.
kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value account)
```

!!! note "Important"
    You must create the Kubernetes cluster by using the `gcloud` command-line
    tool rather than the Google Cloud Console, as you can grant the
    `storage-rw` scope through the command-line tool only.

This might take a few minutes to start up. You can check the status on
the [GCP Console](https://console.cloud.google.com/compute/instances).
A `kubeconfig` entry is automatically generated and set as the current
context. As a sanity check, make sure your cluster is up and running
by running the following `kubectl` command:

```shell
# List all pods in the kube-system namespace.
kubectl get pods -n kube-system
```

**System Response:**

```shell
NAME                                                     READY     STATUS    RESTARTS   AGE
event-exporter-v0.1.7-5c4d9556cf-fd9j2                   2/2       Running   0          1m
fluentd-gcp-v2.0.9-68vhs                                 2/2       Running   0          1m
fluentd-gcp-v2.0.9-fzfpw                                 2/2       Running   0          1m
fluentd-gcp-v2.0.9-qvk8f                                 2/2       Running   0          1m
heapster-v1.4.3-5fbfb6bf55-xgdwx                         3/3       Running   0          55s
kube-dns-778977457c-7hbrv                                3/3       Running   0          1m
kube-dns-778977457c-dpff4                                3/3       Running   0          1m
kube-dns-autoscaler-7db47cb9b7-gp5ns                     1/1       Running   0          1m
kube-proxy-gke-pach-cluster-default-pool-9762dc84-bzcz   1/1       Running   0          1m
kube-proxy-gke-pach-cluster-default-pool-9762dc84-hqkr   1/1       Running   0          1m
kube-proxy-gke-pach-cluster-default-pool-9762dc84-jcbg   1/1       Running   0          1m
kubernetes-dashboard-768854d6dc-t75rp                    1/1       Running   0          1m
l7-default-backend-6497bcdb4d-w72k5                      1/1       Running   0          1m
```

If you *don't* see something similar to the above output,
you can point `kubectl` to the new cluster manually by running
the following command:

```shell
# Update your kubeconfig to point at your newly created cluster.
gcloud container clusters get-credentials ${CLUSTER_NAME}
```

## Deploy Pachyderm

To deploy Pachyderm, we need to:

1. [Create storage resources](#set-up-the-storage-resources),
2. [Install the Pachyderm CLI tool, `pachctl`](#install-pachctl), and
3. [Deploy Pachyderm on the Kubernetes cluster](#deploy-pachyderm-on-the-kubernetes-cluster)

### Set up the Storage Resources

Pachyderm needs a [GCS bucket](https://cloud.google.com/storage/docs/)
and a [persistent disk](https://cloud.google.com/compute/docs/disks/)
to function correctly. You can specify the size of the persistent
disk, specify the bucket name, and create the bucket by running the following
commands:

```shell
# For the persistent disk, 10GB is a good size to start with.
# This stores PFS metadata. For reference, 1GB
# should work fine for 1000 commits on 1000 files.
STORAGE_SIZE=<the size of the volume that you are going to create, in GBs, e.g. "10">

# The Pachyderm bucket name needs to be globally unique across the entire GCP region.
BUCKET_NAME=<The name of the GCS bucket where your data will be stored>

# Create the bucket.
gsutil mb gs://${BUCKET_NAME}
```

To check that everything has been set up correctly, run:

```shell
gsutil ls
# You should see the bucket you created.
```

### Install `pachctl`

`pachctl` is a command-line utility for interacting with a Pachyderm cluster. You can install it locally as follows:

```shell
# For macOS:
brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.11

# For Linux (64-bit) or Windows 10+ on WSL:
curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v{{ config.pach_latest_version }}/pachctl_{{ config.pach_latest_version }}_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
```

You can then run `pachctl version --client-only` to check that the installation was successful.

```shell
pachctl version --client-only
{{ config.pach_latest_version }}
```

### Deploy Pachyderm on the Kubernetes cluster

Now you can deploy a Pachyderm cluster by running this command:

```shell
pachctl deploy google ${BUCKET_NAME} ${STORAGE_SIZE} --dynamic-etcd-nodes=1
```

**System Response:**

```shell
serviceaccount "pachyderm" created
storageclass "etcd-storage-class" created
service "etcd-headless" created
statefulset "etcd" created
service "etcd" created
service "pachd" created
deployment "pachd" created
service "dash" created
deployment "dash" created
secret "pachyderm-storage-secret" created

Pachyderm is launching. Check its status with "kubectl get all"
Once launched, access the dashboard by running "pachctl port-forward"
```

!!! note
    Pachyderm uses one etcd node to manage Pachyderm metadata.

!!! note "Important"
    If RBAC authorization is a requirement or you run into any RBAC
    errors, see [Configure RBAC](rbac.md).

It may take a few minutes for the `pachd` pods to be running because Pachyderm
pulls containers from DockerHub. You can see the cluster status with
`kubectl`, which should output the following when Pachyderm is up and running:

```shell
kubectl get pods
```

**System Response:**

```shell
NAME                     READY     STATUS    RESTARTS   AGE
dash-482120938-np8cc     2/2       Running   0          4m
etcd-0                   1/1       Running   0          4m
pachd-3677268306-9sqm0   1/1       Running   0          4m
```

If you see a few restarts on the `pachd` pod, you can safely ignore them.
That simply means that Kubernetes tried to bring up those containers
before other components were ready, so it restarted them.

Finally, assuming your `pachd` is running as shown above, set up
port forwarding so that `pachctl` can talk to the cluster:

```shell
# Forward the ports. We background this process because it blocks.
pachctl port-forward &
```

And you're done! You can test to make sure the cluster is working
by running `pachctl version` or even creating a new repo.

```shell
pachctl version
```

**System Response:**

```shell
COMPONENT           VERSION
pachctl             {{ config.pach_latest_version }}
pachd               {{ config.pach_latest_version }}
```

### Increasing Ingress Throughput

One way to improve ingress performance is to restrict `pachd` to
a specific, more powerful node in the cluster. This is
accomplished by the use of [node taints](https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints)
in GKE.
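Creating the tainted node itself is not shown in this guide; as a rough sketch, you might add a dedicated node pool with a taint at cluster-creation time (the pool name `pachd-pool` and the machine type below are hypothetical placeholders, and flag availability can vary by `gcloud` version):

```shell
# Sketch only: add a dedicated node pool whose taint keeps ordinary pods off it.
# "pachd-pool" and "n1-standard-8" are illustrative; pick names and sizes for your cluster.
gcloud container node-pools create pachd-pool \
  --cluster=${CLUSTER_NAME} \
  --machine-type=n1-standard-8 \
  --num-nodes=1 \
  --node-taints=dedicated=pachd:NoSchedule
```

With the taint in place, only pods that carry a matching toleration are scheduled onto that pool.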
By creating a node taint for `pachd`, you configure the
Kubernetes scheduler to run only the `pachd` pod on that node. After
that's completed, you can deploy Pachyderm with the `--pachd-cpu-request`
and `--pachd-memory-request` flags set to match the resource limits of the
machine type. Finally, you need to modify the `pachd` deployment
so that it has an appropriate toleration:

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "pachd"
  effect: "NoSchedule"
```

### Increasing upload performance

The most straightforward approach to increasing upload performance is
to [leverage SSDs as the boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/custom-boot-disks) in
your cluster because SSDs provide higher throughput and lower latency than
HDD disks. Additionally, you can increase the size of the SSD for
further performance gains because the number of IOPS increases with
disk size.

### Increasing merge performance

Performance tweaks for merges can be made directly in
the [Pachyderm pipeline spec](../../../reference/pipeline_spec/).
More specifically, you can increase the number of hashtrees (hashtree spec)
in the pipeline spec. This number determines the number of shards for the
filesystem metadata. In general, this number should be lower than the number
of workers (parallelism spec) and should not be increased unless merge time
(the interval after the number of processed plus skipped datums equals the
total number of datums, but before the job finishes) is too slow.
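To illustrate how these two specs relate, a minimal 1.x pipeline spec with sharded metadata might look like the sketch below. The pipeline name, image, command, and input repo are hypothetical placeholders, and you should confirm that the hashtree spec field is supported in your Pachyderm version before relying on it:

```json
{
  "pipeline": { "name": "edges" },
  "transform": {
    "image": "pachyderm/opencv",
    "cmd": ["python3", "/edges.py"]
  },
  "input": { "pfs": { "repo": "images", "glob": "/*" } },
  "parallelism_spec": { "constant": 8 },
  "hashtree_spec": { "constant": 4 }
}
```

Here the hashtree count (4) stays below the worker count (8), matching the guidance above.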