# Debugging a ClusterServiceVersion

We have a ClusterServiceVersion that is failing to report as available.

```sh
$ kubectl -n ci-olm-pr-188-gc-csvs get clusterserviceversions etcdoperator.v0.8.1 -o yaml
...
  lastTransitionTime: 2018-01-22T15:48:13Z
  lastUpdateTime: 2018-01-22T15:51:09Z
  message: |
    installing: Waiting: waiting for deployment etcd-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  phase: Installing
  reason: InstallWaiting
...
```

The message tells us the install can't complete because the etcd-operator deployment isn't available yet. Now we check on that deployment:

```sh
$ kubectl -n ci-olm-pr-188-gc-csvs get deployments etcd-operator -o yaml
...
spec:
  template:
    metadata:
      labels:
        name: etcd-operator-olm-owned
...
status:
  unavailableReplicas: 1
...
```

We see that 1 of the replicas is unavailable, and the spec tells us the label query to use to find the failing pods:

```sh
$ kubectl -n ci-olm-pr-188-gc-csvs get pods -l name=etcd-operator-olm-owned
NAME                             READY     STATUS             RESTARTS   AGE
etcd-operator-6c7c8ccb56-9scrz   2/3       CrashLoopBackOff   820        2d

$ kubectl -n ci-olm-pr-188-gc-csvs get pods etcd-operator-6c7c8ccb56-9scrz -o yaml
...
 containerStatuses:
  - containerID: docker://aa7ee0902228247c32b9198be13fc826dfaf4901a70ee84f31582c284721a110
    image: quay.io/coreos/etcd-operator@sha256:b85754eaeed0a684642b0886034742234d288132dc6439b8132e9abd7a199de0
    imageID: docker-pullable://quay.io/coreos/etcd-operator@sha256:b85754eaeed0a684642b0886034742234d288132dc6439b8132e9abd7a199de0
    lastState:
      terminated:
        containerID: docker://aa7ee0902228247c32b9198be13fc826dfaf4901a70ee84f31582c284721a110
        exitCode: 1
        finishedAt: 2018-01-22T15:55:16Z
        reason: Error
        startedAt: 2018-01-22T15:55:16Z
    name: etcd-backup-operator
    ready: false
    restartCount: 820
    state:
      waiting:
        message: Back-off 5m0s restarting failed container=etcd-backup-operator pod=etcd-operator-6c7c8ccb56-9scrz_ci-olm-pr-188-gc-csvs(3084f195-fd38-11e7-b3ea-0aae23d78648)
        reason: CrashLoopBackOff
...
```

One of the containers in the pod, `etcd-backup-operator`, is crash looping for some reason. Now we check the logs of that container:

```sh
$ kubectl -n ci-olm-pr-188-gc-csvs logs etcd-operator-6c7c8ccb56-9scrz etcd-backup-operator
time="2018-01-22T15:55:16Z" level=info msg="Go Version: go1.9.2"
time="2018-01-22T15:55:16Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-01-22T15:55:16Z" level=info msg="etcd-backup-operator Version: 0.8.1"
time="2018-01-22T15:55:16Z" level=info msg="Git SHA: b97d9305"
time="2018-01-22T15:55:16Z" level=info msg="Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"ci-olm-pr-188-gc-csvs", Name:"etcd-backup-operator", UID:"328b063e-fd38-11e7-b021-122952f9fac4", APIVersion:"v1", ResourceVersion:"11570590", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' etcd-operator-6c7c8ccb56-9scrz became leader"
time="2018-01-22T15:55:16Z" level=info msg="starting backup controller" pkg=controller
time="2018-01-22T15:55:16Z" level=fatal msg="unknown StorageType: "
```

The last log line shows the reason for the failure, and we can now craft a new CSV that doesn't cause this error.
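
If the cause isn't obvious from a single read of the logs, a couple of generic kubectl checks can help narrow things down before changing the CSV (standard kubectl commands, nothing OLM-specific):

```sh
# Events for the pod often surface scheduling, image pull, or probe problems.
$ kubectl -n ci-olm-pr-188-gc-csvs describe pod etcd-operator-6c7c8ccb56-9scrz

# Logs from the previous (crashed) instance of the container, useful when the
# current instance hasn't failed yet.
$ kubectl -n ci-olm-pr-188-gc-csvs logs etcd-operator-6c7c8ccb56-9scrz etcd-backup-operator --previous
```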

# Debugging an InstallPlan

The primary way an InstallPlan can fail is by not resolving the resources needed to install a CSV.

```yaml
apiVersion: app.coreos.com/v1alpha1
kind: InstallPlan
metadata:
  namespace: ci-olm-pr-188-gc-csvs
  name: olm-testing
spec:
  clusterServiceVersionNames:
  - etcdoperator123
  approval: Automatic
```
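
To reproduce this, the spec above can be saved to a file (the filename here is arbitrary) and created, and the resulting object watched while OLM attempts to resolve it:

```sh
$ kubectl -n ci-olm-pr-188-gc-csvs create -f installplan.yaml
$ kubectl -n ci-olm-pr-188-gc-csvs get installplans olm-testing -o yaml --watch
```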

This InstallPlan will fail because `etcdoperator123` is not in the catalog. We can see this in its status:

```sh
$ kubectl get -n ci-olm-pr-188-gc-csvs installplans olm-testing -o yaml
apiVersion: app.coreos.com/v1alpha1
kind: InstallPlan
metadata:
  ...
spec:
  approval: Automatic
  clusterServiceVersionNames:
  - etcdoperator123
status:
  catalogSources:
  - rh-operators
  conditions:
  - lastTransitionTime: 2018-01-22T16:05:09Z
    lastUpdateTime: 2018-01-22T16:06:59Z
    message: 'not found: ClusterServiceVersion etcdoperator123'
    reason: DependenciesConflict
    status: "False"
    type: Resolved
  phase: Planning
```

Error messages like this will be displayed for any other inconsistency in the catalog. They can be resolved either by updating the catalog or by choosing ClusterServiceVersions that resolve correctly.
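
As a sketch of the second option, the InstallPlan can be recreated with a CSV name that the catalog does serve. This assumes `etcdoperator.v0.8.1` (used earlier in this document) is actually present in your catalog, and the InstallPlan name below is made up:

```sh
$ cat <<EOF | kubectl -n ci-olm-pr-188-gc-csvs create -f -
apiVersion: app.coreos.com/v1alpha1
kind: InstallPlan
metadata:
  name: olm-testing-fixed
spec:
  clusterServiceVersionNames:
  - etcdoperator.v0.8.1
  approval: Automatic
EOF
```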

# Debugging ALM operators

Both the ALM and Catalog operators have `-debug` flags available that display much more useful information when diagnosing a problem. If necessary, add this flag to their deployments and perform the action that is showing undesired behavior.
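
A rough sketch of doing that with `kubectl patch` is below; the deployment names and namespace are assumptions and vary by install, and `kubectl edit deployment` works just as well:

```sh
# Append -debug to the first container's argument list. This assumes the
# container already declares an args list; if it doesn't, edit the deployment
# by hand instead.
$ kubectl -n operator-lifecycle-manager patch deployment olm-operator --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "-debug"}]'
$ kubectl -n operator-lifecycle-manager patch deployment catalog-operator --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "-debug"}]'
```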