     1  # Installer Troubleshooting
     2  
     3  Unfortunately, there will always be some cases where OpenShift fails to install properly. In these events, it is helpful to understand the likely failure modes as well as how to troubleshoot the failure.
     4  
     5  If you have a Red Hat subscription for OpenShift, see [here][access-article] for support.
     6  
     7  ## Common Failures
     8  
     9  ### No Worker Nodes Created
    10  
    11  The installer doesn't provision worker nodes directly the way it does master nodes. Instead, the cluster relies on the Machine API Operator, which can scale nodes up and down on supported platforms. If more than fifteen to twenty minutes (depending on the speed of the cluster's Internet connection) have elapsed without any workers, the Machine API Operator needs to be investigated.
    12  
    13  The status of the Machine API Operator can be checked by running the following command from the machine used to install the cluster:
    14  
    15  ```sh
    16  oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig --namespace=openshift-machine-api get deployments
    17  ```
    18  
    19  If the API is unavailable, that will need to be [investigated first](#kubernetes-api-is-unavailable).
    20  
    21  The previous command should yield output similar to the following:
    22  
    23  ```
    24  NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
    25  cluster-autoscaler-operator   1/1     1            1           86m
    26  machine-api-controllers       1/1     1            1           85m
    27  machine-api-operator          1/1     1            1           86m
    28  ```
    29  
    30  Check the machine controller logs with the following command.
    31  
    32  ```sh
    33  oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig --namespace=openshift-machine-api logs deployments/machine-api-controllers --container=machine-controller
    34  ```
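
If the controllers look healthy but no workers ever appear, it can also help to inspect the `Machine` objects themselves and any pending node certificate signing requests. This is a general check; the exact output depends on the platform:

```sh
# List the Machine objects created for workers and their current phase
oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig --namespace=openshift-machine-api get machines

# Workers can also be stuck behind unapproved node CSRs
oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get csr
```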
    35  
    36  ### Kubernetes API is Unavailable
    37  
    38  When the Kubernetes API is unavailable, the master nodes will need to be checked to ensure that they are running the correct components. This requires SSH access, so it is necessary to include an administrator's SSH key during the installation.
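
Before logging in to the masters, it can be worth confirming from the machine used to install the cluster that the API endpoint itself is the problem rather than, say, a stale kubeconfig. A quick, non-destructive check:

```sh
# If this fails to connect, the API endpoint itself is unavailable
oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get nodes
```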
    39  
    40  If SSH access to the master nodes isn't available, that will need to be [investigated next](#unable-to-ssh-into-master-nodes).
    41  
    42  The first thing to check is to make sure that etcd is running on each of the masters. The etcd logs can be viewed by running the following on each master node:
    43  
    44  ```sh
    45  sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=etcd-member --quiet) --quiet)
    46  ```
    47  
    48  If the previous command fails, ensure that the etcd pods have been created by the Kubelet:
    49  
    50  ```sh
    51  sudo crictl pods --name=etcd-member
    52  ```
    53  
    54  If no pods are shown, etcd will need to be [investigated](#etcd-is-not-running).
    55  
    56  ### Unable to SSH into Master Nodes
    57  
    58  For added security, SSH isn't available from the Internet by default. There are several options for enabling this functionality:
    59  
    60  - Create a bastion host that is accessible from the Internet and has access to the cluster. If the bootstrap machine hasn't been automatically destroyed yet, it can double as a temporary bastion host since it is given a public IP address (see the example after this list).
    61  - Configure network peering or a VPN to allow remote access to the private network.
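
For example, a minimal sketch of using the bootstrap machine as a jump host, where `<bootstrap_ip>` and `<master_ip>` are placeholders for the bootstrap machine's public IP and a master's private IP:

```sh
# ProxyJump through the bootstrap node to reach a master on the private network
ssh -J core@<bootstrap_ip> core@<master_ip>
```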
    62  
    63  In order to SSH into the master nodes as user `core`, it is necessary to include an administrator's SSH key during the installation.
    64  The selected key, if any, will be added to the `core` user's `~/.ssh/authorized_keys` via [Ignition](https://github.com/coreos/ignition) and is not configured via platform-specific approaches like [AWS key pairs][aws-key-pairs].
    65  See [here][machine-config-daemon-ssh-keys] for information about managing SSH keys via the machine-config daemon.
    66  
    67  If SSH isn't able to connect to the nodes, they may be waiting on the bootstrap node before they can boot. The initial set of master nodes fetch their boot configuration (the Ignition Config) from the bootstrap node and will not complete until they successfully do so. Check the console output of the nodes to determine if they have successfully booted or if they are waiting for Ignition to fetch the remote config.
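
One hedged way to check this from a host with access to the private network is to request the master Ignition config from the bootstrap node directly; this assumes the machine-config-server is listening on its usual port (22623) during bootstrap. Any HTTP response at all indicates the server is up, while a connection failure points at the bootstrap node or the network path:

```sh
curl --insecure https://<bootstrap_ip>:22623/config/master
```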
    68  
    69  Master nodes waiting for Ignition is indicative of problems on the bootstrap node. SSH into the bootstrap node to [investigate further](#troubleshooting-the-bootstrap-node).
    70  
    71  ### Troubleshooting the Bootstrap Node
    72  
    73  If the bootstrap node isn't available, first double-check that it hasn't been automatically removed by the installer. If it's not being created in the first place, the installer itself will need to be [investigated](#installer-fails-to-create-resources).
    74  
    75  The most important thing to look at on the bootstrap node is `bootkube.service`. The logs can be viewed in two different ways:
    76  
    77  1. If SSH is available, the following command can be run on the bootstrap node: `journalctl --unit=bootkube.service`
    78  2. Regardless of whether or not SSH is available, the following command can be run: `curl --insecure --cert ${INSTALL_DIR}/tls/journal-gatewayd.crt --key ${INSTALL_DIR}/tls/journal-gatewayd.key "https://${BOOTSTRAP_IP}:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service"`
    79  
    80  The installer can also gather a log bundle from the bootstrap host over SSH, as described in the [troubleshooting bootstrap](./troubleshootingbootstrap.md) document.
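
For example (a sketch; the IP addresses are placeholders for your bootstrap and master machines):

```sh
openshift-install gather bootstrap --dir=${INSTALL_DIR} --bootstrap <bootstrap_ip> --master <master_ip>
```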
    81  
    82  ### etcd Is Not Running
    83  
    84  During the bootstrap process, the Kubelet may emit errors like the following:
    85  
    86  ```
    87  Error signing CSR provided in request from agent: error parsing profile: invalid organization
    88  ```
    89  
    90  This is safe to ignore and merely indicates that the etcd bootstrapping is still in progress. etcd makes use of the CSR APIs provided by Kubernetes to issue and rotate its TLS assets, but these facilities aren't available before etcd has formed quorum. In order to break this dependency loop, a CSR service is run on the bootstrap node which only signs CSRs for etcd. When the Kubelet attempts to go through its TLS bootstrap, it is initially denied because the service it is communicating with only respects CSRs from etcd. After etcd starts and the control plane begins bootstrapping, an approver is scheduled and the Kubelet CSR requests will succeed.
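
Once the control plane is reachable, you can watch the CSRs being issued and approved, and approve any that remain pending. These are standard `oc` commands, shown here only as an illustration:

```sh
# List certificate signing requests and their approval state
oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get csr

# Approve a specific pending request if the automatic approver hasn't handled it
oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig adm certificate approve <csr-name>
```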
    91  
    92  ### Installer Fails to Create Resources
    93  
    94  The easiest way to get more debugging information from the installer is to check the log file (`.openshift_install.log`) in the install directory. Regardless of the logging level specified, the installer will write its logs in case they need to be inspected retroactively.
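
For example, to follow the log while the installer runs, or to re-run with more verbose console output (`debug` is an accepted `--log-level` value):

```sh
# Watch the installer log as it is written
tail -f ${INSTALL_DIR}/.openshift_install.log

# Emit more detail on the console for subsequent runs
openshift-install create cluster --dir=${INSTALL_DIR} --log-level=debug
```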
    95  
    96  ### Installer Fails to Initialize the Cluster
    97  
    98  The installer uses the [cluster-version-operator] to create all the components of an OpenShift cluster. When the installer fails to initialize the cluster, the most important information can be fetched by looking at the [ClusterVersion][clusterversion] and [ClusterOperator][clusteroperator] objects:
    99  
   100  1. Inspecting the `ClusterVersion` object.
   101  
   102      ```console
   103      $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get clusterversion -oyaml
   104      apiVersion: config.openshift.io/v1
   105      kind: ClusterVersion
   106      metadata:
   107        creationTimestamp: 2019-02-27T22:24:21Z
   108        generation: 1
   109        name: version
   110        resourceVersion: "19927"
   111        selfLink: /apis/config.openshift.io/v1/clusterversions/version
   112        uid: 6e0f4cf8-3ade-11e9-9034-0a923b47ded4
   113      spec:
   114        channel: stable-4.1
   115        clusterID: 5ec312f9-f729-429d-a454-61d4906896ca
   116      status:
   117        availableUpdates: null
   118        conditions:
   119        - lastTransitionTime: 2019-02-27T22:50:30Z
   120          message: Done applying 4.1.1
   121          status: "True"
   122          type: Available
   123        - lastTransitionTime: 2019-02-27T22:50:30Z
   124          status: "False"
   125          type: Failing
   126        - lastTransitionTime: 2019-02-27T22:50:30Z
   127          message: Cluster version is 4.1.1
   128          status: "False"
   129          type: Progressing
   130        - lastTransitionTime: 2019-02-27T22:24:31Z
   131          message: 'Unable to retrieve available updates: unknown version 4.1.1'
   132          reason: RemoteFailed
   133          status: "False"
   134          type: RetrievedUpdates
   135        desired:
   136          image: registry.svc.ci.openshift.org/openshift/origin-release@sha256:91e6f754975963e7db1a9958075eb609ad226968623939d262d1cf45e9dbc39a
   137          version: 4.1.1
   138        history:
   139        - completionTime: 2019-02-27T22:50:30Z
   140          image: registry.svc.ci.openshift.org/openshift/origin-release@sha256:91e6f754975963e7db1a9958075eb609ad226968623939d262d1cf45e9dbc39a
   141          startedTime: 2019-02-27T22:24:31Z
   142          state: Completed
   143          version: 4.1.1
   144        observedGeneration: 1
   145        versionHash: Wa7as_ik1qE=
   146      ```
   147  
   148      Some of the most important [conditions][cluster-operator-conditions] to take note of are `Failing`, `Available`, and `Progressing`. You can look at the conditions using:
   149  
   150      ```console
   151      $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get clusterversion version -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
   152      Available True Done applying 4.1.1
   153      Failing False
   154      Progressing False Cluster version is 4.0.0-0.alpha-2019-02-26-194020
   155      RetrievedUpdates False Unable to retrieve available updates: unknown version 4.1.1
   156      ```
   157  
   158  2. Inspecting the `ClusterOperator` object.
   159  
   160      You can get the status of all the cluster operators:
   161  
   162      ```console
   163      $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get clusteroperator
   164      NAME                                  VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
   165      cluster-autoscaler                              True        False         False     17m
   166      cluster-storage-operator                        True        False         False     10m
   167      console                                         True        False         False     7m21s
   168      dns                                             True        False         False     31m
   169      image-registry                                  True        False         False     9m58s
   170      ingress                                         True        False         False     10m
   171      kube-apiserver                                  True        False         False     28m
   172      kube-controller-manager                         True        False         False     21m
   173      kube-scheduler                                  True        False         False     25m
   174      machine-api                                     True        False         False     17m
   175      machine-config                                  True        False         False     17m
   176      marketplace-operator                            True        False         False     10m
   177      monitoring                                      True        False         False     8m23s
   178      network                                         True        False         False     13m
   179      node-tuning                                     True        False         False     11m
   180      openshift-apiserver                             True        False         False     15m
   181      openshift-authentication                        True        False         False     20m
   182      openshift-cloud-credential-operator             True        False         False     18m
   183      openshift-controller-manager                    True        False         False     10m
   184      openshift-samples                               True        False         False     8m42s
   185      operator-lifecycle-manager                      True        False         False     17m
   186      service-ca                                      True        False         False     30m
   187      ```
   188  
   189      To get detailed information on why an individual cluster operator is `Failing` or not yet `Available`, you can check the status of that individual operator, for example `monitoring`:
   190  
   191      ```console
   192      $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get clusteroperator monitoring -oyaml
   193      apiVersion: config.openshift.io/v1
   194      kind: ClusterOperator
   195      metadata:
   196        creationTimestamp: 2019-02-27T22:47:04Z
   197        generation: 1
   198        name: monitoring
   199        resourceVersion: "24677"
   200        selfLink: /apis/config.openshift.io/v1/clusteroperators/monitoring
   201        uid: 9a6a5ef9-3ae1-11e9-bad4-0a97b6ba9358
   202      spec: {}
   203      status:
   204        conditions:
   205        - lastTransitionTime: 2019-02-27T22:49:10Z
   206          message: Successfully rolled out the stack.
   207          status: "True"
   208          type: Available
   209        - lastTransitionTime: 2019-02-27T22:49:10Z
   210          status: "False"
   211          type: Progressing
   212        - lastTransitionTime: 2019-02-27T22:49:10Z
   213          status: "False"
   214          type: Failing
   215        extension: null
   216        relatedObjects: null
   217        version: ""
   218      ```
   219  
   220      Again, the cluster operators also publish [conditions][cluster-operator-conditions] like `Failing`, `Available`, and `Progressing` that provide information on the current state of the operator:
   221  
   222      ```console
   223      $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get clusteroperator monitoring -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
   224      Available True Successfully rolled out the stack
   225      Progressing False
   226      Failing False
   227      ```
   228  
   229      Each `ClusterOperator` also publishes the list of objects it owns. To get that information:
   230  
   231      ```console
   232      $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get clusteroperator kube-apiserver -o=jsonpath='{.status.relatedObjects}'
   233      [map[resource:kubeapiservers group:operator.openshift.io name:cluster] map[group: name:openshift-config resource:namespaces] map[group: name:openshift-config-managed resource:namespaces] map[group: name:openshift-kube-apiserver-operator resource:namespaces] map[group: name:openshift-kube-apiserver resource:namespaces]]
   234      ```
   235  
   236  **NOTE:** Failing to initialize the cluster is usually not fatal to cluster creation: you can use the `ClusterOperator` status to debug the failing cluster operator and take corrective action that allows the `cluster-version-operator` to make progress.
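
After taking corrective action, you can ask the installer to resume waiting for initialization to finish, for example:

```sh
openshift-install wait-for install-complete --dir=${INSTALL_DIR}
```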
   237  
   238  ### Installer Fails to Fetch Console URL
   239  
   240  The installer fetches the URL for the OpenShift console using the [route][route-object] in the `openshift-console` namespace. If the installer fails to fetch the URL for the console:
   241  
   242  1. Check whether the console operator is `Available` or `Failing`:
   243  
   244      ```console
   245      $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get clusteroperator console -oyaml
   246      apiVersion: config.openshift.io/v1
   247      kind: ClusterOperator
   248      metadata:
   249        creationTimestamp: 2019-02-27T22:46:57Z
   250        generation: 1
   251        name: console
   252        resourceVersion: "19682"
   253        selfLink: /apis/config.openshift.io/v1/clusteroperators/console
   254        uid: 960364aa-3ae1-11e9-bad4-0a97b6ba9358
   255      spec: {}
   256      status:
   257        conditions:
   258        - lastTransitionTime: 2019-02-27T22:46:58Z
   259          status: "False"
   260          type: Failing
   261        - lastTransitionTime: 2019-02-27T22:50:12Z
   262          status: "False"
   263          type: Progressing
   264        - lastTransitionTime: 2019-02-27T22:50:12Z
   265          status: "True"
   266          type: Available
   267        - lastTransitionTime: 2019-02-27T22:46:57Z
   268          status: "True"
   269          type: Upgradeable
   270        extension: null
   271        relatedObjects:
   272        - group: operator.openshift.io
   273          name: cluster
   274          resource: consoles
   275        - group: config.openshift.io
   276          name: cluster
   277          resource: consoles
   278        - group: oauth.openshift.io
   279          name: console
   280          resource: oauthclients
   281        - group: ""
   282          name: openshift-console-operator
   283          resource: namespaces
   284        - group: ""
   285          name: openshift-console
   286          resource: namespaces
   287        versions: null
   288      ```
   289  
   290  2. Manually get the URL for `console`:
   291  
   292      ```console
   293      $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get route console -n openshift-console -o=jsonpath='{.spec.host}'
   294      console-openshift-console.apps.adahiya-1.devcluster.openshift.com
   295      ```
   296  
   297  ### Installer Fails to Add Default Ingress Certificate to Kubeconfig
   298  
   299  The installer adds the default ingress certificate to the list of trusted client certificate authorities in `${INSTALL_DIR}/auth/kubeconfig`. If the installer fails to add the ingress certificate to `kubeconfig`, you can fetch the certificate from the cluster using the following command:
   300  
   301  ```console
   302  $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get configmaps default-ingress-cert -n openshift-config-managed -o=jsonpath='{.data.ca-bundle\.crt}'
   303  -----BEGIN CERTIFICATE-----
   304  MIIC/TCCAeWgAwIBAgIBATANBgkqhkiG9w0BAQsFADAuMSwwKgYDVQQDDCNjbHVz
   305  dGVyLWluZ3Jlc3Mtb3BlcmF0b3JAMTU1MTMwNzU4OTAeFw0xOTAyMjcyMjQ2Mjha
   306  Fw0yMTAyMjYyMjQ2MjlaMC4xLDAqBgNVBAMMI2NsdXN0ZXItaW5ncmVzcy1vcGVy
   307  YXRvckAxNTUxMzA3NTg5MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
   308  uCA4fQ+2YXoXSUL4h/mcvJfrgpBfKBW5hfB8NcgXeCYiQPnCKblH1sEQnI3VC5Pk
   309  2OfNCF3PUlfm4i8CHC95a7nCkRjmJNg1gVrWCvS/ohLgnO0BvszSiRLxIpuo3C4S
   310  EVqqvxValHcbdAXWgZLQoYZXV7RMz8yZjl5CfhDaaItyBFj3GtIJkXgUwp/5sUfI
   311  LDXW8MM6AXfuG+kweLdLCMm3g8WLLfLBLvVBKB+4IhIH7ll0buOz04RKhnYN+Ebw
   312  tcvFi55vwuUCWMnGhWHGEQ8sWm/wLnNlOwsUz7S1/sW8nj87GFHzgkaVM9EOnoNI
   313  gKhMBK9ItNzjrP6dgiKBCQIDAQABoyYwJDAOBgNVHQ8BAf8EBAMCAqQwEgYDVR0T
   314  AQH/BAgwBgEB/wIBADANBgkqhkiG9w0BAQsFAAOCAQEAq+vi0sFKudaZ9aUQMMha
   315  CeWx9CZvZBblnAWT/61UdpZKpFi4eJ2d33lGcfKwHOi2NP/iSKQBebfG0iNLVVPz
   316  vwLbSG1i9R9GLdAbnHpPT9UG6fLaDIoKpnKiBfGENfxeiq5vTln2bAgivxrVlyiq
   317  +MdDXFAWb6V4u2xh6RChI7akNsS3oU9PZ9YOs5e8vJp2YAEphht05X0swA+X8V8T
   318  C278FFifpo0h3Q0Dbv8Rfn4UpBEtN4KkLeS+JeT+0o2XOsFZp7Uhr9yFIodRsnNo
   319  H/Uwmab28ocNrGNiEVaVH6eTTQeeZuOdoQzUbClElpVmkrNGY0M42K0PvOQ/e7+y
   320  AQ==
   321  -----END CERTIFICATE-----
   322  ```
   323  
   324  You can then **prepend** that certificate to the `certificate-authority-data` field in your `${INSTALL_DIR}/auth/kubeconfig`.
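
A minimal sketch of one way to do that, assuming the certificate above was saved to `ingress-ca.crt` and the kubeconfig contains a single cluster entry; the helper file names are hypothetical:

```sh
# Decode the CA bundle currently trusted by the kubeconfig
oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig config view --raw \
  -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > existing-ca.crt

# Prepend the ingress certificate and re-encode the combined bundle
cat ingress-ca.crt existing-ca.crt | base64 -w0
# Paste the resulting string back into the certificate-authority-data field of the kubeconfig
```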
   325  
   326  ## Generic Troubleshooting
   327  
   328  Here are some ideas if none of the [common failures](#common-failures) match your symptoms.
   329  For other generic troubleshooting, see [the Kubernetes documentation][kubernetes-debug].
   330  
   331  ### Check for Pending or Crashing Pods
   332  
   333  This is the generic version of the [*No Worker Nodes Created*](#no-worker-nodes-created) troubleshooting procedure.
   334  
   335  ```console
   336  $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get pods --all-namespaces
   337  NAMESPACE                              NAME                                                              READY     STATUS              RESTARTS   AGE
   338  kube-system                            etcd-member-wking-master-0                                        1/1       Running             0          46s
   339  openshift-machine-api                  machine-api-operator-586bd5b6b9-bxq9s                             0/1       Pending             0          1m
   340  openshift-cluster-dns-operator         cluster-dns-operator-7f4f6866b9-kzth5                             0/1       Pending             0          2m
   341  ...
   342  ```
   343  
   344  You can investigate any pods listed as `Pending` with:
   345  
   346  ```sh
   347  oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig describe -n openshift-machine-api pod/machine-api-operator-586bd5b6b9-bxq9s
   348  ```
   349  
   350  which may show events with warnings like:
   351  
   352  ```
   353  Warning  FailedScheduling  1m (x10 over 1m)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
   354  ```
   355  
   356  You can get the image used for a crashing pod with:
   357  
   358  ```console
   359  $ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get pod -o "jsonpath={range .status.containerStatuses[*]}{.name}{'\t'}{.state}{'\t'}{.image}{'\n'}{end}" -n openshift-machine-api machine-api-operator-586bd5b6b9-bxq9s
   360  machine-api-operator	map[running:map[startedAt:2018-11-13T19:04:50Z]]	registry.svc.ci.openshift.org/openshift/origin-v4.0-20181113175638@sha256:c97d0b53b98d07053090f3c9563cfd8277587ce94f8c2400b33e246aa08332c7
   361  ```
   362  
   363  And you can see where that image comes from with:
   364  
   365  ```console
   366  $ oc adm release info registry.svc.ci.openshift.org/openshift/origin-release:v4.0-20181113175638 --commits
   367  Name:      v4.0-20181113175638
   368  Digest:    sha256:58196e73cc7bbc16346483d824fb694bf1a73d517fe13f6b5e589a7e0e1ccb5b
   369  Created:   2018-11-13 09:56:46 -0800 PST
   370  OS/Arch:   linux/amd64
   371  Manifests: 121
   372  
   373  Images:
   374    NAME                  REPO                                               COMMIT
   375    ...
   376    machine-api-operator  https://github.com/openshift/machine-api-operator  e681e121e15d2243739ad68978113a07aa35c6ae
   377    ...
   378  ```
   379  
   380  ### One or more nodes are never Ready (Network / CNI issues)
   381  
   382  You might see that one or more nodes never become Ready, e.g.:
   383  
   384  ```console
   385  $ kubectl get nodes
   386  NAME                           STATUS     ROLES     AGE       VERSION
   387  ip-10-0-27-9.ec2.internal      NotReady   master    29m       v1.11.0+d4cacc0
   388  ...
   389  ```
   390  
   391  This usually means that, for whatever reason, networking is not available on the node. You can confirm this by looking at the detailed output of the node:
   392  
   393  ```console
   394  $ kubectl describe node ip-10-0-27-9.ec2.internal
   395   ... (lots of output skipped)
   396  'runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized'
   397  ```
   398  
   399  The first thing to determine is the status of the SDN. The SDN deploys three daemonsets:
   400  - *sdn-controller*, a control-plane component
   401  - *sdn*, the node-level networking daemon
   402  - *ovs*, the Open vSwitch management daemon
   403  
   404  All 3 must be healthy (though only a single `sdn-controller` needs to be running). `sdn` and `ovs` must be running on every node, and DESIRED should equal AVAILABLE. On a healthy 2-node cluster you would see:
   405  
   406  ```console
   407  $ kubectl -n openshift-sdn get daemonsets
   408  NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
   409  ovs              2         2         2       2            2           beta.kubernetes.io/os=linux       2h
   410  sdn              2         2         2       2            2           beta.kubernetes.io/os=linux       2h
   411  sdn-controller   1         1         1       1            1           node-role.kubernetes.io/master=   2h
   412  ```
   413  
   414  If, instead, you get a different error message:
   415  
   416  ```console
   417  $ kubectl -n openshift-sdn get daemonsets
   418  No resources found.
   419  ```
   420  
   421  This means the network-operator didn't run. Skip ahead [to that section](#debugging-the-cluster-network-operator). Otherwise, let's debug the SDN.
   422  
   423  #### Debugging the openshift-sdn
   424  
   425  On the NotReady node, you need to find out which pods, if any, are in a bad state. Be sure to substitute in the correct `spec.nodeName` (or just remove it).
   426  
   427  ```console
   428  $ kubectl -n openshift-sdn get pod --field-selector "spec.nodeName=ip-10-0-27-9.ec2.internal"
   429  NAME                   READY   STATUS             RESTARTS   AGE
   430  ovs-dk8bh              1/1     Running            1          52m
   431  sdn-8nl47              1/1     CrashLoopBackoff   3          52m
   432  ```
   433  
   434  Then, retrieve the logs for the SDN pod (and the OVS pod, if that one is also failing):
   435  
   436  ```sh
   437  kubectl -n openshift-sdn logs sdn-8nl47
   438  ```
   439  
   440  Some common error messages:
   441  - `Cannot fetch default cluster network`:  This means the `sdn-controller` has failed to run to completion. Retrieve its logs with `kubectl -n openshift-sdn logs -l app=sdn-controller`.
   442  - `warning: Another process is currently listening on the CNI socket, waiting 15s`: Something has gone wrong, and multiple SDN processes are running. SSH to the node in question and capture the output of `ps -faux`. If you just need the cluster up, reboot the node.
   443  - Error messages about ovs or Open vSwitch: Check that the `ovs-*` pod on the same node is healthy. Retrieve its logs with `kubectl -n openshift-sdn logs ovs-<name>`. Rebooting the node should fix it.
   444  - Any indication that the control plane is unavailable: Check to make sure the apiserver is reachable from the node. You may be able to find useful information via `journalctl -f -u kubelet`.
   445  
   446  If you think it's a misconfiguration, file a [network operator](https://github.com/openshift/cluster-network-operator) issue. RH employees can also try #forum-sdn.
   447  
   448  #### Debugging the cluster-network-operator
   449  The cluster network operator is responsible for deploying the networking components. It does this in response to a special object created by the installer.
   450  
   451  From a deployment perspective, the network operator is often the "canary in the coal mine." It runs very early in the installation process, after the master nodes have come up but before the bootstrap control plane has been torn down. It can be indicative of more subtle installer issues, such as long delays in bringing up master nodes or apiserver communication issues. Nevertheless, it can have other bugs.
   452  
   453  First, determine that the network configuration exists:
   454  
   455  ```console
   456  $ kubectl get network.config.openshift.io cluster -oyaml
   457  apiVersion: config.openshift.io/v1
   458  kind: Network
   459  metadata:
   460    name: cluster
   461  spec:
   462    serviceNetwork:
   463    - 172.30.0.0/16
   464    clusterNetwork:
   465    - cidr: 10.128.0.0/14
   466      hostPrefix: 23
   467    networkType: OVNKubernetes
   468  ```
   469  
   470  If it doesn't exist, the installer didn't create it. You'll have to run `openshift-install create manifests` to determine why.
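
For example (a sketch; the generated file names can vary between releases):

```sh
# Render the manifests the installer would apply and look for the cluster network configuration
openshift-install create manifests --dir=${INSTALL_DIR}
ls ${INSTALL_DIR}/manifests/ | grep -i network
```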
   471  
   472  Next, check that the network-operator is running:
   473  
   474  ```sh
   475  kubectl -n openshift-network-operator get pods
   476  ```
   477  
   478  And retrieve the logs. Note that, on multi-master systems, the operator performs leader election and the non-leader replicas will sleep:
   479  
   480  ```sh
   481  kubectl -n openshift-network-operator logs -l "name=network-operator"
   482  ```
   483  
   484  If appropriate, file a [network operator](https://github.com/openshift/cluster-network-operator) issue. RH employees can also try #forum-sdn.
   485  
   486  [access-article]: https://access.redhat.com/articles/3780981#debugging-an-install-1
   487  [aws-key-pairs]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html
   488  [kubernetes-debug]: https://kubernetes.io/docs/tasks/debug-application-cluster/
   489  [machine-config-daemon-ssh-keys]: https://github.com/openshift/machine-config-operator/blob/master/docs/Update-SSHKeys.md
   490  [cluster-version-operator]: https://github.com/openshift/cluster-version-operator/blob/master/README.md
   491  [clusterversion]: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusterversion.md
   492  [clusteroperator]: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md
   493  [cluster-operator-conditions]: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions