# Troubleshooting Deployments

Here are some common deployment issues, organized by symptom.

- [General Pachyderm cluster deployment](#general-pachyderm-cluster-deployment)
- Environment-specific
  - [AWS](#aws-deployment)
    - [Can't connect to the Pachyderm cluster after a rolling update](#cant-connect-to-the-pachyderm-cluster-after-a-rolling-update)
    - [The one-shot deploy script, `aws.sh`, never completes](#one-shot-script-never-completes)
    - [VPC limit exceeded](#vpc-limit-exceeded)
    - [GPU node never appears](#gpu-node-never-appears)

<!--  - Google - coming soon...
  - Azure - coming soon...-->

## General Pachyderm cluster deployment

- [Pod stuck in `CrashLoopBackoff`](#pod-stuck-in-crashloopbackoff)
- [Pod stuck in `CrashLoopBackoff` - with error attaching volume](#pod-stuck-in-crashloopbackoff-with-error-attaching-volume)

### Pod stuck in `CrashLoopBackoff`

#### Symptoms

The pachd pod keeps crashing/restarting:

```
kubectl get all
NAME                        READY     STATUS             RESTARTS   AGE
po/etcd-281005231-qlkzw     1/1       Running            0          7m
po/pachd-1333950811-0sm1p   0/1       CrashLoopBackOff   6          7m

NAME             CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE
svc/etcd         100.70.40.162    <nodes>       2379:30938/TCP                7m
svc/kubernetes   100.64.0.1       <none>        443/TCP                       9m
svc/pachd        100.70.227.151   <nodes>       650:30650/TCP,651:30651/TCP   7m

NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/etcd    1         1         1            1           7m
deploy/pachd   1         1         1            0           7m

NAME                  DESIRED   CURRENT   READY     AGE
rs/etcd-281005231     1         1         1         7m
rs/pachd-1333950811   1         1         0         7m
```

#### Recourse

First describe the pod:

```
kubectl describe po/pachd-1333950811-0sm1p
```

If you see an error such as `Error attaching EBS volume` or similar, see the recourse for that error in the corresponding section below. If you don't see that error, but do see something like:

```
  1m    3s    9    {kubelet ip-172-20-48-123.us-west-2.compute.internal}                Warning    FailedSync    Error syncing pod, skipping: failed to "StartContainer" for "pachd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=pachd pod=pachd-1333950811-0sm1p_default(a92b6665-506a-11e7-8e07-02e3d74c49ac)"
```

it means Kubernetes tried running `pachd`, but `pachd` generated an internal error. To see the specifics of this internal error, check the logs for the `pachd` pod:

```
kubectl logs po/pachd-1333950811-0sm1p
```

!!! note
    If you're using a log aggregator service (e.g. the default in GKE), you won't see any logs when using `kubectl logs ...` in this way. You will need to look at your logs UI (e.g. in GKE's case, the Stackdriver console).

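If the container is crashing and restarting too quickly to produce output for the current attempt, you can also pull the logs of the previous (crashed) container, for example:

```
# Show the logs from the previously terminated pachd container
kubectl logs --previous po/pachd-1333950811-0sm1p
```
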
These logs will most likely reveal the issue directly, or at the very least provide a good indication of what's causing the problem. For example, you might see `BucketRegionError: incorrect region, the bucket is not in 'us-west-2' region`. In that case, your object store bucket is in a different region than your Pachyderm cluster, and the fix is to recreate the bucket in the same region as your Pachyderm cluster.

If the error / recourse isn't obvious from the error message, post the error along with the `pachd` logs in our [Slack channel](http://slack.pachyderm.io), or open a [GitHub issue](https://github.com/pachyderm/pachyderm/issues/new) and provide the details prompted by the issue template. Either way, please be sure to include these logs, as they are extremely helpful in resolving the issue.

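When filing an issue, it's often easiest to capture the logs and the pod description to files that you can attach; for example (file names are just illustrative):

```
# Save the pachd logs and pod description to files you can share
kubectl logs po/pachd-1333950811-0sm1p > pachd.log
kubectl describe po/pachd-1333950811-0sm1p > pachd-describe.txt
```
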
### Pod stuck in `CrashLoopBackoff` - with error attaching volume

#### Symptoms

A pod (it could be the `pachd` pod or a worker pod) fails to start up and is stuck in `CrashLoopBackoff`. If you execute `kubectl describe po/pachd-xxxx`, you'll see an error message like the following at the bottom of the output:

```
  30s        30s        1    {attachdetach }                Warning        FailedMount    Failed to attach volume "etcd-volume" on node "ip-172-20-44-17.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0c1d403ac05096dfe" to instance "i-0a12e00c0f3fb047d": VolumeInUse: vol-0c1d403ac05096dfe is already attached to an instance
```

This indicates that the [persistent volume claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) is failing to get attached to the node in your Kubernetes cluster.

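You can also confirm the volume's state from the Kubernetes side; a quick check (the volume and claim names will vary with your deployment):

```
# Check the STATUS column of persistent volumes and claims
kubectl get pv
kubectl get pvc
```
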
#### Recourse

Your best bet is to manually detach the volume and restart the pod.

For example, to resolve this issue when Pachyderm is deployed to AWS, pull up your AWS web console and look up the node mentioned in the error message (`ip-172-20-44-17.us-west-2.compute.internal` in our case). Then, in the bottom pane, find the attached volume. Follow the link to the attached volume and detach it. You may need to "Force Detach" it.

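If you prefer the CLI to the web console, here is a rough sketch of the same steps with the AWS CLI, using the volume ID from the error message above (`--force` is the CLI equivalent of "Force Detach", so use it with the same caution):

```
# Check which instance the volume is attached to and its current state
aws ec2 describe-volumes --volume-ids vol-0c1d403ac05096dfe

# Detach it (append --force only if a normal detach doesn't complete)
aws ec2 detach-volume --volume-id vol-0c1d403ac05096dfe
```
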
Once it's detached (and marked as available), restart the pod by killing it, e.g.:

```
kubectl delete po/pachd-xxx
```

It will take a moment for a new pod to get scheduled.

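You can watch for the replacement pod to come up with:

```
# Watch pod status until the new pachd pod reaches Running
kubectl get po --watch
```
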
---

## AWS Deployment

### Can't connect to the Pachyderm cluster after a rolling update

#### Symptom

After running `kops rolling-update`, all `kubectl` (and/or `pachctl`) requests hang and you can't connect to the cluster.

#### Recourse

First, get your cluster name. You can locate that information by running `kops get clusters`. If you used the [one-shot deployment](http://docs.pachyderm.io/en/latest/deployment/amazon_web_services.html#one-shot-script), you can also find this information in the deploy logs created by `aws.sh`.

Then you'll need to grab the new public IP address of your master node. The master node will be named something like `master-us-west-2a.masters.somerandomstring.kubernetes.com`.

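One way to look up that IP is the EC2 web console; with the AWS CLI, a sketch using the example master name above (adjust the name and the CLI's region to match your cluster):

```
# Find the public IP of the kops master instance by its Name tag
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=master-us-west-2a.masters.somerandomstring.kubernetes.com" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" \
  --output text
```
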
Update the entry in `/etc/hosts` so that the API endpoint reflects the new IP, e.g.:

```
54.178.87.68 api.somerandomstring.kubernetes.com
```

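Once `/etc/hosts` is updated, you can confirm that the API server is reachable again before moving on; these should respond instead of hanging:

```
kubectl get nodes
pachctl version
```
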
### One-shot script never completes

#### Symptom

The `aws.sh` one-shot deploy script hangs on the line:

```
Retrieving ec2 instance list to get k8s master domain name (may take a minute)
```

If it's been more than 10 minutes, there's likely an error.

#### Recourse

Check the AWS web console / Auto Scaling Groups / Activity History. You have probably hit an instance limit. To confirm, open the AWS web console for EC2 and check whether you have any instances with names like:

```
master-us-west-2a.masters.tfgpu.kubernetes.com
nodes.tfgpu.kubernetes.com
```

If you don't see instances similar to the ones above, the next thing to do is navigate to "Auto Scaling Groups" in the left-hand menu. Then find the ASG with your cluster name:

```
master-us-west-2a.masters.tfgpu.kubernetes.com
```

Look at the "Activity History" in the lower pane. More than likely, you'll see a "Failed" error message describing why it failed to provision the VM. You've probably run into an instance limit for your account in this region. If you're spinning up a GPU node, make sure that your region supports the instance type you're trying to spin up.

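The same activity history is available from the CLI if that's more convenient; a sketch using the example ASG name above:

```
# Show recent scaling activity (including failures) for the master ASG
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name master-us-west-2a.masters.tfgpu.kubernetes.com \
  --max-items 5
```
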
A successful provisioning message looks like:

```
Successful
Launching a new EC2 instance: i-03422f3d32658e90c
2017 June 13 10:19:29 UTC-7
2017 June 13 10:20:33 UTC-7
Description:DescriptionLaunching a new EC2 instance: i-03422f3d32658e90c
Cause:CauseAt 2017-06-13T17:19:15Z a user request created an AutoScalingGroup changing the desired capacity from 0 to 1. At 2017-06-13T17:19:28Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.
```

While a failed one looks like:

```
Failed
Launching a new EC2 instance
2017 June 12 13:21:49 UTC-7
2017 June 12 13:21:49 UTC-7
Description:DescriptionLaunching a new EC2 instance. Status Reason: You have requested more instances (1) than your current instance limit of 0 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.
Cause:CauseAt 2017-06-12T20:21:47Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.
```

### VPC Limit Exceeded

#### Symptom

When running `aws.sh` or otherwise deploying with `kops`, you will see:

```
W0426 17:28:10.435315   26463 executor.go:109] error running task "VPC/5120cf0c-pachydermcluster.kubernetes.com" (3s remaining to succeed): error creating VPC: VpcLimitExceeded: The  maximum number of VPCs has been reached.
```

#### Recourse

You'll need to increase your VPC limit or delete some existing VPCs that are not in use. On the AWS web console, navigate to the VPC service. Make sure you're in the same region where you're attempting to deploy.

It's not uncommon (depending on how you tear down clusters) for the VPCs not to be deleted. You'll see a list of VPCs here with cluster names, e.g. `aee6b566-pachydermcluster.kubernetes.com`. For clusters that you know are no longer in use, you can delete the VPC here.

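To get a quick inventory from the CLI instead (assuming the AWS CLI is configured for the same region), something like the following lists each VPC alongside its `Name` tag:

```
# List VPC IDs with their Name tags to spot leftover cluster VPCs
aws ec2 describe-vpcs \
  --query "Vpcs[].[VpcId,Tags[?Key=='Name']|[0].Value]" \
  --output table
```
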
### GPU Node Never Appears

#### Symptom

After running `kops edit ig gpunodes` and `kops update` (as outlined [here](http://docs.pachyderm.io/en/latest/cookbook/gpus.html)), the GPU node never appears, which can be confirmed via the AWS web console.

#### Recourse

It's likely you have hit an instance limit for the GPU instance type you're using, or it's possible that AWS doesn't support that instance type in the current region.

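With a reasonably recent AWS CLI, you can check whether the region offers the instance type at all; a sketch assuming `p2.xlarge` as the GPU instance type and `us-west-2` as the region:

```
# An empty result means the instance type isn't offered in this region
aws ec2 describe-instance-type-offerings \
  --location-type region \
  --region us-west-2 \
  --filters Name=instance-type,Values=p2.xlarge
```
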
[Follow these instructions to check for and update Instance Limits](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html). If this region doesn't support your instance type, you'll see an error message like:

```
Failed
Launching a new EC2 instance
2017 June 12 13:21:49 UTC-7
2017 June 12 13:21:49 UTC-7
Description:DescriptionLaunching a new EC2 instance. Status Reason: You have requested more instances (1) than your current instance limit of 0 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.
Cause:CauseAt 2017-06-12T20:21:47Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.
```