:title: Troubleshooting Deis
:description: Resolutions for common issues encountered when running Deis.

.. _troubleshooting_deis:

Troubleshooting Deis
====================

Common issues that users have run into when provisioning Deis are detailed below.

A deis-store component fails to start
-------------------------------------

The store component is the most complex component of Deis. As such, there are many ways for it to fail.
Recall that the store components represent Ceph services as follows:

* ``store-monitor``: http://ceph.com/docs/giant/man/8/ceph-mon/
* ``store-daemon``: http://ceph.com/docs/giant/man/8/ceph-osd/
* ``store-gateway``: http://ceph.com/docs/giant/radosgw/
* ``store-metadata``: http://ceph.com/docs/giant/man/8/ceph-mds/
* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components

Log output for store components can be viewed with ``deisctl status store-<component>`` (such as
``deisctl status store-volume``). Additionally, the Ceph health can be queried by entering
a store container with ``nse deis-store-monitor`` and then issuing a ``ceph -s``. This should output the
health of the cluster like:

.. code-block:: console

    core@deis-1 ~ $ nse deis-store-monitor
    root@deis-1:/# ceph -s
        cluster 20038e38-4108-4e79-95d4-291d0eef2949
         health HEALTH_OK
         monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 16, quorum 0,1,2 deis-1,deis-2,deis-3
         mdsmap e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby
         osdmap e36: 3 osds: 3 up, 3 in
          pgmap v2096: 1344 pgs, 12 pools, 369 MB data, 448 objects
                24198 MB used, 23659 MB / 49206 MB avail
                1344 active+clean

If you see ``HEALTH_OK``, this means everything is working as it should.
Note also ``monmap e3: 3 mons at...``, which means all three monitor containers are up and responding;
``mdsmap e10: 1/1/1 up...``, which means all three metadata containers are up (one active, two standby);
and ``osdmap e36: 3 osds: 3 up, 3 in``, which means all three daemon containers are up and running.

We can also see from the ``pgmap`` that we have 1344 placement groups, all of which are ``active+clean``.

For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with
specific store components are detailed below.

store-monitor
~~~~~~~~~~~~~

The monitor is the first store component to start, and is required for any of the other store
components to function properly. If a ``deisctl list`` indicates that any of the monitors are failing,
it is likely due to a host issue. A common failure scenario is inadequate free storage on the host
node; in that case, monitors will fail with errors similar to:

.. code-block:: console

  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700  0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700  0 ** Shutdown via Data Health Service **
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt ***
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700  1 mon.deis-staging-node1@0(leader) e3 shutdown
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700  0 quorum service shutdown
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700  0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700  0 quorum service shutdown

This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately
large volumes.
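
If you suspect this failure mode, a quick check is the free space on the affected host's root
filesystem. This is only a sketch: the hostname is illustrative, and the ``deis-store-monitor``
unit name is assumed to follow the same naming as ``deis-store-daemon`` shown below.

.. code-block:: console

    core@deis-staging-node1 ~ $ df -h /
    core@deis-staging-node1 ~ $ sudo systemctl restart deis-store-monitor

If the root filesystem is at or near 100% use, free up space (or attach a larger volume) before
restarting; otherwise the monitor will shut itself down again.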

store-daemon
~~~~~~~~~~~~

The daemons are responsible for actually storing the data on the filesystem. The cluster is configured
to allow writes with just one daemon running, but the cluster will be running in a degraded state, so
restoring all daemons to a running state as quickly as possible is paramount.

Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons,
resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a
``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just
that daemon.
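
For example, to recover a single failing daemon, restart it on its host and then follow the
cluster's recovery from a monitor container (the hostname here is illustrative):

.. code-block:: console

    core@deis-2 ~ $ sudo systemctl restart deis-store-daemon
    core@deis-2 ~ $ nse deis-store-monitor
    root@deis-2:/# ceph -w

``ceph -w`` streams cluster status updates; once the placement groups report ``active+clean`` and
health returns to ``HEALTH_OK``, the storage cluster has fully recovered.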

store-gateway
~~~~~~~~~~~~~

The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway
will result in a short downtime for the registry component (and will prevent the database from
backing up), but those components should recover as soon as the gateway comes back up.
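
A rough sketch of restarting the gateway and confirming that it and a dependent component recover
(``registry`` is used here only as an example):

.. code-block:: console

    $ deisctl restart store-gateway
    $ deisctl status store-gateway
    $ deisctl status registry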

store-metadata
~~~~~~~~~~~~~~

The metadata servers are required for the **volume** to function properly. Only one is active at
any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata
server should the active one fail.
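
To see which metadata server is currently active and how many standbys are available, query the
MDS map from inside a monitor container (the hostnames and epoch shown are illustrative):

.. code-block:: console

    core@deis-1 ~ $ nse deis-store-monitor
    root@deis-1:/# ceph mds stat
    e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby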

store-volume
~~~~~~~~~~~~

Without functioning monitors, daemons, and metadata servers, the volume service will likely hang
indefinitely (or restart constantly). If the controller or logger happens to be running on a host with a
failing store-volume, application logs will be lost until the volume recovers.

Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module.
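
Two quick checks on an affected host are the CoreOS version and whether the CephFS mount is
actually present. This is only a sketch; ``mount -t ceph`` lists any mounted CephFS filesystems,
and an empty result while store-volume claims to be active means the mount failed.

.. code-block:: console

    core@deis-1 ~ $ grep ^VERSION= /etc/os-release
    core@deis-1 ~ $ mount -t ceph

If the reported version is older than 471.1.0, the CephFS kernel module is unavailable and the
volume will never mount.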

Any component fails to start
----------------------------

Use ``deisctl status <component>`` to view the status of the component.
You can also use ``deisctl journal <component>`` to tail logs for a component, or ``deisctl list``
to list all components.
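
For example, to investigate a failing controller (substitute any component name from ``deisctl list``):

.. code-block:: console

    $ deisctl list
    $ deisctl status controller
    $ deisctl journal controller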

Failed initializing SSH client
------------------------------

A ``deisctl`` command fails with: 'Failed initializing SSH client: ssh: handshake failed: ssh: unable to authenticate'.
Did you remember to add your SSH key to the ssh-agent? ``ssh-add -L`` should list the key you used
to provision the servers. If it's not there, add it with ``ssh-add -K /path/to/your/key``.
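
A typical session looks like the following; the key path is a placeholder, and plain ``ssh-add``
works on Linux while ``-K`` additionally stores the passphrase in the OS X keychain:

.. code-block:: console

    $ eval $(ssh-agent -s)
    $ ssh-add /path/to/your/key
    $ ssh-add -L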

All the given peers are not reachable
-------------------------------------

A ``deisctl`` command fails with: 'All the given peers are not reachable (Tried to connect to each peer twice and failed)'.
The most common cause of this issue is that a `new discovery URL <https://discovery.etcd.io/new>`_
wasn't generated and updated in ``contrib/coreos/user-data`` before the cluster was launched.
Each Deis cluster must have a unique discovery URL, or else ``etcd`` will try and fail to connect to old hosts.
Try destroying the cluster and relaunching it with a fresh discovery URL.

You can use ``make discovery-url`` to automatically fetch a new discovery URL.
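
A sketch of the recovery steps, run from the root of the Deis repository (re-provisioning depends
on your provider and is shown only as a placeholder comment):

.. code-block:: console

    $ make discovery-url
    $ grep discovery contrib/coreos/user-data
    # destroy the old cluster, then re-provision with the updated user-data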

Other issues
------------

Running into something not detailed here? Please `open an issue`_ or hop into #deis on Freenode IRC and we'll help!

.. _`Ceph FS`: https://ceph.com/docs/giant/cephfs/
.. _`open an issue`: https://github.com/deis/deis/issues/new
.. _`troubleshooting`: http://docs.ceph.com/docs/giant/rados/troubleshooting/