github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.9.x/deploy-manage/deploy/google_cloud_platform.md

# Google Cloud Platform

Google Cloud Platform has built-in support for Kubernetes through
[Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/) (GKE),
and Pachyderm is fully supported on GKE.
This section walks you through deploying a Pachyderm cluster on GKE.

## Prerequisites

- [Google Cloud SDK](https://cloud.google.com/sdk/) >= 124.0.0
- [kubectl](https://kubernetes.io/docs/user-guide/prereqs/)
- [pachctl](#install-pachctl)

If this is your first time using the SDK, follow
the [Google SDK QuickStart Guide](https://cloud.google.com/sdk/docs/quickstarts).

!!! note
    When you follow the QuickStart Guide, you might update your `~/.bash_profile`
    and point your `$PATH` at the location where you extracted
    `google-cloud-sdk`. Pachyderm recommends that you extract
    the SDK to `~/bin`.

!!! tip
    You can install `kubectl` by using the Google Cloud SDK and
    running the following command:

    ```shell
    $ gcloud components install kubectl
    ```

## Deploy Kubernetes

To create a new Kubernetes cluster by using GKE, run:

```shell
$ CLUSTER_NAME=<any unique name, e.g. "pach-cluster">

$ GCP_ZONE=<a GCP availability zone, e.g. "us-west1-a">

$ gcloud config set compute/zone ${GCP_ZONE}

$ gcloud config set container/cluster ${CLUSTER_NAME}

$ MACHINE_TYPE=<machine type for the k8s nodes, we recommend "n1-standard-4" or larger>

# By default, the following command spins up a 3-node cluster. You can change the default with `--num-nodes VAL`.
$ gcloud container clusters create ${CLUSTER_NAME} --scopes storage-rw --machine-type ${MACHINE_TYPE}

# By default, GKE clusters have RBAC enabled. To allow 'pachctl deploy' to give the 'pachyderm' service account
# the requisite privileges via clusterrolebindings, you need to grant *your user account* the privileges
# needed to create those clusterrolebindings.
#
# Note that this command is simple and concise, but gives your user account more privileges than necessary. See
# https://docs.pachyderm.io/en/latest/deployment/rbac.html for the complete list of privileges that the
# pachyderm service account needs.
$ kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value account)
```

!!! note "Important"
    You must create the Kubernetes cluster by using the `gcloud` command-line
    tool rather than the Google Cloud Console, as you can grant the
    `storage-rw` scope through the command-line tool only.
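
The commands above depend on `CLUSTER_NAME`, `GCP_ZONE`, and `MACHINE_TYPE` being set. If you wrap them in a script, a small guard can stop the script before it calls `gcloud` with an empty value. The following is a minimal sketch; `require_vars` is a hypothetical helper name, not part of `gcloud` or `pachctl`.

```shell
# require_vars NAME...: return non-zero if any named variable is unset or empty.
# Hypothetical helper for illustration; it only inspects shell variables.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

For example, placing `require_vars CLUSTER_NAME GCP_ZONE MACHINE_TYPE || exit 1` before the `gcloud container clusters create` command stops the script before any resources are created.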

The cluster might take a few minutes to start. You can check the status on
the [GCP Console](https://console.cloud.google.com/compute/instances).
A `kubeconfig` entry is automatically generated and set as the current
context. As a sanity check, make sure your cluster is up and running
by running the following `kubectl` command:

```shell
# List all pods in the kube-system namespace.
$ kubectl get pods -n kube-system
NAME                                                     READY     STATUS    RESTARTS   AGE
event-exporter-v0.1.7-5c4d9556cf-fd9j2                   2/2       Running   0          1m
fluentd-gcp-v2.0.9-68vhs                                 2/2       Running   0          1m
fluentd-gcp-v2.0.9-fzfpw                                 2/2       Running   0          1m
fluentd-gcp-v2.0.9-qvk8f                                 2/2       Running   0          1m
heapster-v1.4.3-5fbfb6bf55-xgdwx                         3/3       Running   0          55s
kube-dns-778977457c-7hbrv                                3/3       Running   0          1m
kube-dns-778977457c-dpff4                                3/3       Running   0          1m
kube-dns-autoscaler-7db47cb9b7-gp5ns                     1/1       Running   0          1m
kube-proxy-gke-pach-cluster-default-pool-9762dc84-bzcz   1/1       Running   0          1m
kube-proxy-gke-pach-cluster-default-pool-9762dc84-hqkr   1/1       Running   0          1m
kube-proxy-gke-pach-cluster-default-pool-9762dc84-jcbg   1/1       Running   0          1m
kubernetes-dashboard-768854d6dc-t75rp                    1/1       Running   0          1m
l7-default-backend-6497bcdb4d-w72k5                      1/1       Running   0          1m
```

If you *don't* see something similar to the above output,
you can point `kubectl` to the new cluster manually by running
the following command:

```shell
# Update your kubeconfig to point at your newly created cluster.
$ gcloud container clusters get-credentials ${CLUSTER_NAME}
```
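
To confirm that `kubectl` points at the new cluster, you can compare the current context against the name that `gcloud` writes to your `kubeconfig`, which has the form `gke_<project>_<zone>_<cluster>`. The following is a minimal sketch; the function names are hypothetical helpers, and the only `kubectl` call is `kubectl config current-context`.

```shell
# expected_gke_context PROJECT ZONE CLUSTER: print the context name that
# gcloud generates for a zonal GKE cluster.
expected_gke_context() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}

# context_matches PROJECT ZONE CLUSTER: succeed if the current kubectl
# context is the one expected for that cluster.
context_matches() {
  [ "$(kubectl config current-context)" = "$(expected_gke_context "$1" "$2" "$3")" ]
}
```

For example, `context_matches "$(gcloud config get-value project)" "${GCP_ZONE}" "${CLUSTER_NAME}"` returns a non-zero status when you still need to run the `get-credentials` command.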

## Deploy Pachyderm

To deploy Pachyderm, complete the following steps:

1. [Create storage resources](#set-up-the-storage-resources).
2. [Install the Pachyderm CLI tool, `pachctl`](#install-pachctl).
3. [Deploy Pachyderm on the Kubernetes cluster](#deploy-pachyderm-on-the-kubernetes-cluster).

### Set up the Storage Resources

Pachyderm needs a [GCS bucket](https://cloud.google.com/storage/docs/)
and a [persistent disk](https://cloud.google.com/compute/docs/disks/)
to function correctly. Set the size of the persistent disk and the
bucket name, and then create the bucket, by running the following
commands:

```shell
# For the persistent disk, 10GB is a good size to start with.
# This stores PFS metadata. For reference, 1GB
# should work fine for 1000 commits on 1000 files.
$ STORAGE_SIZE=<the size of the volume that you are going to create, in GBs. e.g. "10">

# The Pachyderm bucket name needs to be globally unique across all of Google Cloud Storage.
$ BUCKET_NAME=<The name of the GCS bucket where your data will be stored>

# Create the bucket.
$ gsutil mb gs://${BUCKET_NAME}
```

To check that everything has been set up correctly, run:

```shell
$ gsutil ls
# You should see the bucket you created.
```
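
Bucket creation fails if the name breaks the GCS naming rules, so it can help to check `BUCKET_NAME` before you call `gsutil mb`. The following is a minimal sketch that covers only a subset of those rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; a letter or digit at each end; no `goog` prefix and no `google` substring). `valid_bucket_name` is a hypothetical helper, and it cannot tell you whether the name is already taken.

```shell
# valid_bucket_name NAME: simplified, local check of the GCS bucket naming rules.
# Hypothetical helper; it does not contact GCS.
valid_bucket_name() {
  name="$1"
  len=${#name}
  # Names without dots must be 3 to 63 characters long.
  if [ "${len}" -lt 3 ] || [ "${len}" -gt 63 ]; then
    return 1
  fi
  # Allowed characters only, starting and ending with a letter or digit.
  printf '%s' "${name}" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]*[a-z0-9]$' || return 1
  # Reserved prefix and substring.
  case "${name}" in
    goog*|*google*) return 1 ;;
  esac
  return 0
}
```

For example, `valid_bucket_name "${BUCKET_NAME}" && gsutil mb gs://${BUCKET_NAME}` skips the `gsutil` call when the name is malformed.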

### Install `pachctl`

`pachctl` is a command-line utility for interacting with a Pachyderm cluster. You can install it locally as follows:

```shell
# For macOS:
$ brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.10

# For Linux (64 bit) or Windows 10+ on WSL:
$ curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.10.0/pachctl_1.10.0_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
```

You can then run `pachctl version --client-only` to check that the installation was successful.
The command prints the version of the client that you installed, for example:

```shell
$ pachctl version --client-only
1.9.7
```

### Deploy Pachyderm on the Kubernetes cluster

Now you can deploy a Pachyderm cluster by running this command:

```shell
$ pachctl deploy google ${BUCKET_NAME} ${STORAGE_SIZE} --dynamic-etcd-nodes=1
serviceaccount "pachyderm" created
storageclass "etcd-storage-class" created
service "etcd-headless" created
statefulset "etcd" created
service "etcd" created
service "pachd" created
deployment "pachd" created
service "dash" created
deployment "dash" created
secret "pachyderm-storage-secret" created

Pachyderm is launching. Check its status with "kubectl get all"
Once launched, access the dashboard by running "pachctl port-forward"
```

!!! note
    Pachyderm uses one etcd node to manage Pachyderm metadata.

!!! note "Important"
    If RBAC authorization is a requirement or you run into any RBAC
    errors, see [Configure RBAC](rbac.md).

It might take a few minutes for the `pachd` pods to start running because
the container images are pulled from DockerHub. You can see the cluster status with
`kubectl`, which should output the following when Pachyderm is up and running:

```shell
$ kubectl get pods
NAME                     READY     STATUS    RESTARTS   AGE
dash-482120938-np8cc     2/2       Running   0          4m
etcd-0                   1/1       Running   0          4m
pachd-3677268306-9sqm0   1/1       Running   0          4m
```
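
If you script the deployment, you can poll this output instead of checking it by eye. The following is a minimal sketch that assumes the default column layout shown above (`NAME`, `READY`, `STATUS`, ...); `all_pods_running` is a hypothetical helper that only parses text on standard input.

```shell
# all_pods_running: read `kubectl get pods` output on stdin and succeed only
# if at least one pod is listed, every pod has STATUS "Running", and every
# READY value has the form n/n.
all_pods_running() {
  awk '
    NR == 1 { next }   # skip the header row
    {
      seen = 1
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}
```

For example, `until kubectl get pods | all_pods_running; do sleep 5; done` blocks until the `dash`, `etcd`, and `pachd` pods are all ready.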

If you see a few restarts on the `pachd` pod, you can safely ignore them.
They simply mean that Kubernetes tried to bring up those containers
before other components were ready, so it restarted them.

Finally, assuming your `pachd` is running as shown above,
forward a port so that `pachctl` can talk to the cluster.

```shell
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
```

And you're done! You can test to make sure the cluster is working
by running `pachctl version` or even creating a new repo.

```shell
$ pachctl version
COMPONENT           VERSION
pachctl             1.9.7
pachd               1.9.7
```
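
A quick scripted check is to confirm that the client and the server report the same version. The following is a minimal sketch that assumes the two-column table shown above; `versions_match` is a hypothetical helper that only parses text on standard input.

```shell
# versions_match: read `pachctl version` output on stdin and succeed only if
# the pachctl and pachd rows are both present and report the same version.
versions_match() {
  awk '
    $1 == "pachctl" { client = $2 }
    $1 == "pachd"   { server = $2 }
    END { if (client != "" && client == server) exit 0; exit 1 }
  '
}
```

For example, `pachctl version | versions_match || echo "client and server versions differ"` flags a mismatched installation.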

### Increasing Ingress Throughput

One way to improve ingress performance is to restrict `pachd` to
a specific, more powerful node in the cluster. You can do this
by using [node-taints](https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints)
in GKE. By creating a node-taint for `pachd`, you configure the
Kubernetes scheduler to run only the `pachd` pod on that node. After
you create the taint, deploy Pachyderm with the `--pachd-cpu-request`
and `--pachd-memory-request` flags set to match the resource limits of the
machine type. Finally, modify the `pachd` deployment
so that it has an appropriate toleration:

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "pachd"
  effect: "NoSchedule"
```
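
The taint and the toleration can be applied with `kubectl`. The following is a minimal sketch, not the only way to do it: it writes the toleration into a patch file and prints the two commands for you to review instead of running them. The file name `pachd-toleration-patch.yaml` and the `<node-name>` placeholder are assumptions for illustration.

```shell
# Hypothetical file name for the patch; override PATCH_FILE to change it.
PATCH_FILE="${PATCH_FILE:-pachd-toleration-patch.yaml}"

# Patch that adds the toleration to the pachd pod template.
cat > "${PATCH_FILE}" <<'EOF'
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "pachd"
        effect: "NoSchedule"
EOF

# Print the commands instead of running them so that you can review them.
# Replace <node-name> with the node that you reserve for pachd.
echo "kubectl taint nodes <node-name> dedicated=pachd:NoSchedule"
echo "kubectl patch deployment pachd --patch \"\$(cat ${PATCH_FILE})\""
```

The taint's key, value, and effect must match the toleration exactly; otherwise the scheduler keeps `pachd` off the reserved node.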

### Increasing upload performance

The most straightforward approach to increasing upload performance is
to [use SSDs as the boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/custom-boot-disks) in
your cluster because SSDs provide higher throughput and lower latency than
HDDs. Additionally, you can increase the size of the SSD for
further performance gains because the number of IOPS increases with
disk size.
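
With `gcloud`, the boot disk type and size are set when the cluster is created. The following is a minimal sketch that extends the earlier `gcloud container clusters create` command with the `--disk-type` and `--disk-size` flags and prints the result for review. The 200 GB size and the `BOOT_DISK_SIZE_GB` variable are illustrative assumptions, not recommendations from Pachyderm.

```shell
# Defaults are placeholders; set these variables to match your deployment.
CLUSTER_NAME="${CLUSTER_NAME:-pach-cluster}"
MACHINE_TYPE="${MACHINE_TYPE:-n1-standard-4}"
BOOT_DISK_SIZE_GB="${BOOT_DISK_SIZE_GB:-200}"   # assumed size, in GB

# Build the command as a string and print it instead of running it.
create_cmd="gcloud container clusters create ${CLUSTER_NAME} --scopes storage-rw --machine-type ${MACHINE_TYPE} --disk-type pd-ssd --disk-size ${BOOT_DISK_SIZE_GB}"
echo "${create_cmd}"
```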

### Increasing merge performance

You can tune merge performance directly in
the [Pachyderm pipeline spec](../../../reference/pipeline_spec/).
More specifically, you can increase the number of hashtrees (hashtree spec)
in the pipeline spec. This number determines the number of shards for the
filesystem metadata. In general, this number should be lower than the number
of workers (parallelism spec). Increase it only if the merge takes too long.
The merge time is the time between the point at which the number of
processed datums plus skipped datums equals the total number of datums and
the point at which the job finishes.
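
As an illustration, the following sketch writes a pipeline spec that sets both values, with fewer hashtrees than workers. It assumes the `hashtree_spec` and `parallelism_spec` field names from the pipeline spec reference; the pipeline name, repo, image, command, file name, and the values 8 and 4 are placeholders, so check them against the reference for your version.

```shell
# Write an example pipeline spec. All names and numbers are placeholders.
cat > example-pipeline.json <<'EOF'
{
  "pipeline": { "name": "example" },
  "transform": {
    "image": "ubuntu:18.04",
    "cmd": ["sh", "-c", "cp -r /pfs/data/* /pfs/out/"]
  },
  "input": { "pfs": { "repo": "data", "glob": "/*" } },
  "parallelism_spec": { "constant": 8 },
  "hashtree_spec": { "constant": 4 }
}
EOF

# Print the command that would create the pipeline from this file.
echo "pachctl create pipeline -f example-pipeline.json"
```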