
# Troubleshooting Bootstrap Failures

Unfortunately, there will always be some cases where OpenShift fails to install properly. In these events, it is helpful to understand the likely failure modes as well as how to troubleshoot the failure.

## Gathering bootstrap failure logs

### Using the installer provisioned workflow

When users use the installer to create the OpenShift cluster, the installer has all the information it needs to automatically capture the logs from the bootstrap host in case of failure.

#### Authenticating to bootstrap host for ipi

The installer will use the user's environment to discover the credentials to connect to the bootstrap host over SSH. One of the following methods is used:

1. Use the user's already running `ssh-agent`. If the user has an ssh-agent set up, the installer will use it for SSH authentication (see the sketch after this list).

2. Use the user's home directory, `~/.ssh` on Linux hosts, to load all the SSH private keys and use those for SSH authentication.
    a. The installer also configures the bootstrap host with a *generated* SSH key, and this private key will be used for SSH authentication if none of the user keys are trusted.
    The installer only configures the bootstrap host to trust the generated key, and therefore the log bundle will only contain the logs from the bootstrap host and not the control-plane hosts.

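For the ssh-agent method, here is a minimal sketch of preparing the environment before running the installer, assuming the relevant private key lives at `~/.ssh/id_ed25519` (substitute your own path):

```sh
# Start an agent for this shell session and load the key; the installer picks
# the agent up through the standard SSH_AUTH_SOCK environment variable.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

# Confirm the key is loaded before invoking the installer.
ssh-add -l
```
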
### Using the user provisioned workflow

When users create the infrastructure for the OpenShift cluster themselves and the cluster fails to bootstrap, they can use the `gather bootstrap` subcommand to gather the logs from the bootstrap host.

```console
$ openshift-install gather bootstrap --help
Gather debugging data for a failing-to-bootstrap control plane

Usage:
  openshift-install gather bootstrap [flags]

Flags:
      --bootstrap string     Hostname or IP of the bootstrap host
  -h, --help                 help for bootstrap
      --key stringArray      Path to SSH private keys that should be used for authentication. If no key was provided, SSH private keys from user's environment will be used
      --master stringArray   Hostnames or IPs of all control plane hosts
```

An example of an invocation for a cluster with three control-plane machines would be:

```sh
openshift-install gather bootstrap --bootstrap ${BOOTSTRAP_HOST_IP} --master ${CONTROL_PLANE_1_HOST_IP} --master ${CONTROL_PLANE_2_HOST_IP} --master ${CONTROL_PLANE_3_HOST_IP}
```

#### Authenticating to bootstrap host for upi

When explicitly using the `gather bootstrap` subcommand, users can either rely on the installer's discovery mechanism detailed [above](#authenticating-to-bootstrap-host-for-ipi) or provide the keys using the `--key` flag.

An example of an invocation for a cluster with three control-plane machines would be:

```sh
openshift-install gather bootstrap --key ${KEY_1} --key ${KEY_2} --bootstrap ${BOOTSTRAP_HOST_IP} --master ${CONTROL_PLANE_1_HOST_IP} --master ${CONTROL_PLANE_2_HOST_IP} --master ${CONTROL_PLANE_3_HOST_IP}
```

## Understanding the bootstrap failure log bundle

Here's what a log bundle looks like:

```console
.
├── bootstrap
├── control-plane
├── failed-units.txt
├── rendered-assets
├── resources
└── unit-status

5 directories, 1 file
```

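The `gather bootstrap` subcommand writes the bundle as a compressed tarball and prints its path when it finishes. A quick way to extract it and view this layout (the timestamped filename below is only an example; use the path printed by the installer):

```sh
# Unpack the gathered bundle and show its top-level layout.
tar -xzf log-bundle-20200424170815.tar.gz
tree -L 1 log-bundle-20200424170815
```
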
### file: failed-units.txt

The `failed-units.txt` file contains a list of all the **failed** systemd units on the bootstrap host.

### directory: unit-status

The unit-status directory contains the details of each failed systemd unit from [failed-units.txt](#file-failed-unitstxt).

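The same information can be reproduced on a live bootstrap host; a minimal sketch, assuming SSH access as the `core` user and that `${BOOTSTRAP_HOST_IP}` is set:

```sh
# List the failed units, then show the status of one of them
# (bootkube.service is only an example unit name).
ssh core@${BOOTSTRAP_HOST_IP} 'systemctl list-units --state=failed'
ssh core@${BOOTSTRAP_HOST_IP} 'systemctl status bootkube.service'
```
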
### directory: bootstrap

The bootstrap directory consists of all the important logs and files from the bootstrap host. There are three subdirectories for the bootstrap host:

```console
bootstrap
├── containers
├── journals
└── pods

3 directories, 0 files
```

#### directory: bootstrap/containers

The containers directory contains the descriptions and logs from all the containers created by the kubelet using CRI-O for the static pods.
This includes all the operators and their operands running on the bootstrap host in special bootstrap modes, for example the machine-config-server container or the bootstrap-kube-controlplane pods.

For each container, the directory has two files (see the example after this list):

* `<human readable id>.log`, which contains the log of the container.
* `<human readable id>.inspect`, which contains the information about the container like the image, volume mounts, arguments etc.

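A quick way to scan these files in an extracted bundle, assuming the current directory is the bundle root:

```sh
# List the gathered container logs, then find the ones that mention errors.
ls bootstrap/containers/
grep -il 'error' bootstrap/containers/*.log
```
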
#### directory: bootstrap/journals

The journals directory contains the logs for *important* systemd units. These units are:

* `release-image.log`, the release-image unit is responsible for pulling the Release Image to the bootstrap host.
* `crio-configure.log` and `crio.log`, these units are responsible for configuring CRI-O on the bootstrap host and for running the CRI-O daemon, respectively.
* `kubelet.log`, the kubelet service is responsible for running the kubelet on the bootstrap host. The kubelet on the bootstrap host is responsible for running the static pods for etcd, bootstrap-kube-controlplane and various other operators in bootstrap mode.
* `approve-csr.log`, the approve-csr unit is responsible for allowing control-plane machines to join the OpenShift cluster. This unit performs the job of the in-cluster approver while bootstrapping is in progress.
* `bootkube.log`, the bootkube service is the unit that performs the bootstrapping of OpenShift clusters using all the operators. This service is responsible for running all the required steps to bootstrap the API and then waiting for success.

There might also be other services that are important for some platforms, such as OpenStack, that will have logs in this directory.

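When triaging a failure, `release-image.log` and `bootkube.log` are usually the first journals to check. For example, from the bundle root:

```sh
# Look for problems downloading the release image, then inspect the tail of
# the bootkube log for the last bootstrapping steps that ran.
grep -i 'error' bootstrap/journals/release-image.log
tail -n 50 bootstrap/journals/bootkube.log
```
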
#### directory: bootstrap/pods

The pods directory contains the information and logs from all the render commands for the various operators run by the bootkube unit.

For each container, the directory has two files:

* `<human readable id>.log`, which contains the log of the container.
* `<human readable id>.inspect`, which contains the information about the container like the image, volume mounts, arguments etc.

### directory: resources

The resources directory contains various Kubernetes objects that are present in the cluster. These resources are pulled using the bootstrap API running on the bootstrap host.

### directory: rendered-assets

The rendered-assets directory contains all the files and directories created by the bootkube unit using the various operator render commands. This directory is a snapshot of the `/opt/openshift` directory on the bootstrap host.

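Since this directory mirrors `/opt/openshift`, it can be browsed directly in the extracted bundle; for example:

```sh
# Show the top-level rendered assets captured from the bootstrap host.
tree -L 1 rendered-assets
```
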
### directory: control-plane

The control-plane directory contains logs for each control-plane host. It contains a subdirectory for each control-plane host, usually named after the host's IP address.

```console
control-plane
├── 10.0.128.114
│   ├── containers
│   ├── failed-units.txt
│   ├── journals
│   └── unit-status
├── 10.0.142.138
│   ├── containers
│   ├── failed-units.txt
│   ├── journals
│   └── unit-status
└── 10.0.148.48
    ├── containers
    ├── failed-units.txt
    ├── journals
    └── unit-status

12 directories, 3 files
```

#### directory: control-plane/name/containers

The containers directory contains the descriptions and logs from all the containers created by the kubelet using CRI-O on the control-plane host. The files are the same as in the [containers directory](#directory-bootstrapcontainers) on the bootstrap host.

#### directory: control-plane/name/journals

The journals directory contains the logs of **important** units on the control plane hosts (see the example after this list). These units are:

* `crio.log`
* `kubelet.log`
* `machine-config-daemon-host.log` and `pivot.log`, these files have logs for RHCOS pivot related actions on the control plane host.

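A quick scan of these journals across all control-plane hosts, from the bundle root (adjust the file name to the unit of interest):

```sh
# Search the kubelet journal on every control-plane host for errors.
grep -i 'error' control-plane/*/journals/kubelet.log
```
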
## Common Failures

Here are some common failures that users can troubleshoot using the bootstrap failure log bundle.

### Unable to pull the bootstrap failure logs

1. `Attempted to gather debug logs after installation failure: failed to create SSH client: failed to initialize the SSH agent: no keys found for SSH agent`
The installer tried to create a new SSH agent, but no keys were found in the user's home directory, usually `~/.ssh` on Linux. The user can use the `--key` flag to provide the private key for SSH to gather the bootstrap failure logs.

2. `failed to create SSH client: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain`
The keys provided to the installer by the SSH agent, or the keys loaded from the user's home directory, do not have permission to SSH to the bootstrap host. The user can use the `--key` flag to provide the private key for SSH to gather the bootstrap failure logs.

### Unable to pull Release Image

When the pull secret provided to the installer does not have the correct permissions to pull the Release Image, the `bootstrap/journals/release-image.log` should contain the debugging logs.

For example,

```txt
-- Logs begin at Fri 2020-04-24 17:08:15 UTC, end at Fri 2020-04-24 17:33:16 UTC. --
Apr 24 17:08:46 ci-op-2cbvx-bootstrap.c.openshift-gce-devel-ci.internal systemd[1]: Starting Download the OpenShift Release Image...
Apr 24 17:08:46 ci-op-2cbvx-bootstrap.c.openshift-gce-devel-ci.internal release-image-download.sh[1688]: Pulling registry.svc.ci.openshift.org/ci-op-8dv01g3m/release@sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc...
Apr 24 17:08:49 ci-op-2cbvx-bootstrap.c.openshift-gce-devel-ci.internal podman[1698]: 2020-04-24 17:08:49.307961668 +0000 UTC m=+1.119158273 system refresh
Apr 24 17:08:49 ci-op-2cbvx-bootstrap.c.openshift-gce-devel-ci.internal release-image-download.sh[1688]: Error: error pulling image "registry.svc.ci.openshift.org/ci-op-8dv01g3m/release@sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc": unable to pull registry.svc.ci.openshift.org/ci-op-8dv01g3m/release@sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ci-op-8dv01g3m/release@sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc: Error reading manifest sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc in registry.svc.ci.openshift.org/ci-op-8dv01g3m/release: unauthorized: authentication required
```

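To confirm whether the pull secret itself is at fault, the same pull can be attempted manually; a sketch, assuming `pull-secret.json` is the pull secret given to the installer and `${RELEASE_IMAGE}` is the image reference from the log above:

```sh
# Try pulling the release image with the same credentials the installer used.
podman pull --authfile pull-secret.json "${RELEASE_IMAGE}"
```
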
### Bootkube logs are empty

For cases where the bootkube logs in `bootstrap/journals/bootkube.log` are empty, like:

```txt
-- Logs begin at Fri 2020-04-24 17:08:15 UTC, end at Fri 2020-04-24 17:33:16 UTC. --
-- No entries --
```

there is a high likelihood that the Release Image could not be downloaded; more details can be found in [release-image.log](#unable-to-pull-release-image).

### Control-plane logs missing from log bundle

When the control-plane logs are missing from the log bundle, for example:

```console
$ tree control-plane -L 2
control-plane
├── 10.0.0.4
├── 10.0.0.5
└── 10.0.0.6

3 directories, 0 files
```

Troubleshooting requires the logs from the installer's own attempt to gather the log bundle, which are available in `.openshift_install.log`.
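
Those installer-side logs may show which SSH step failed for each control-plane host; for example:

```sh
# Show the most recent gather- and SSH-related messages from the installer log.
grep -iE 'gather|ssh' .openshift_install.log | tail -n 20
```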