:title: Troubleshooting deis-store
:description: Resolutions for common issues with deis-store and Ceph.

.. _troubleshooting-store:

Troubleshooting deis-store
==========================

The store component is the most complex component of Deis. As such, there are many ways for it to fail.
Recall that the store components represent Ceph services as follows:

* ``store-monitor``: http://ceph.com/docs/giant/man/8/ceph-mon/
* ``store-daemon``: http://ceph.com/docs/giant/man/8/ceph-osd/
* ``store-gateway``: http://ceph.com/docs/giant/radosgw/
* ``store-metadata``: http://ceph.com/docs/giant/man/8/ceph-mds/
* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components

Log output for store components can be viewed with ``deisctl status store-<component>`` (such as
``deisctl status store-volume``). Additionally, the Ceph health can be queried by using the ``deis-store-admin``
administrative container to access the cluster.
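
For example, a quick first pass over the store units might look like the following (the unit shown
here is illustrative; any ``store-<component>`` unit can be substituted):

.. code-block:: console

    $ deisctl list | grep store           # confirm every store unit reports an active state
    $ deisctl status store-volume         # systemd status and recent log output for one unit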

.. _using-store-admin:

Using store-admin
-----------------

``deis-store-admin`` is an optional component that is helpful when diagnosing problems with ``deis-store``.
It contains the ``ceph`` client and writes the necessary Ceph configuration files so it always has the
most up-to-date configuration for the cluster.

To use ``deis-store-admin``, install and start it with ``deisctl``:

.. code-block:: console

    $ deisctl install store-admin
    $ deisctl start store-admin

The container will now be running on all hosts in the cluster. Log into any of the hosts, enter
the container with ``nse deis-store-admin``, and then issue a ``ceph -s`` to query the cluster's health.

The output should be similar to the following:

.. code-block:: console

    core@deis-1 ~ $ nse deis-store-admin
    root@deis-1:/# ceph -s
        cluster 20038e38-4108-4e79-95d4-291d0eef2949
         health HEALTH_OK
         monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 16, quorum 0,1,2 deis-1,deis-2,deis-3
         mdsmap e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby
         osdmap e36: 3 osds: 3 up, 3 in
          pgmap v2096: 1344 pgs, 12 pools, 369 MB data, 448 objects
                24198 MB used, 23659 MB / 49206 MB avail
                1344 active+clean

If you see ``HEALTH_OK``, this means everything is working as it should.
Note also ``monmap e3: 3 mons at...``, which means all three monitor containers are up and responding,
``mdsmap e10: 1/1/1 up...``, which means all three metadata containers are up (one active and two standby),
and ``osdmap e36: 3 osds: 3 up, 3 in``, which means all three daemon containers are up and running.

We can also see from the ``pgmap`` that we have 1344 placement groups, all of which are ``active+clean``.
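
Beyond ``ceph -s``, a few other read-only Ceph client commands can be run from the same
``deis-store-admin`` shell to narrow down which component is unhealthy:

.. code-block:: console

    root@deis-1:/# ceph health detail    # expands HEALTH_WARN/HEALTH_ERR into per-OSD and per-PG detail
    root@deis-1:/# ceph osd tree         # shows which daemons are up or down, and on which host
    root@deis-1:/# ceph mds stat         # shows the active and standby metadata servers
    root@deis-1:/# ceph df               # per-pool storage utilization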

For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with
specific store components are detailed below.

.. note::

    If all of the ``ceph`` client commands seem to be hanging and the output is solely monitor
    faults, the cluster may have lost quorum and manual intervention is necessary to recover.
    For more information, see :ref:`recovering-ceph-quorum`.

store-monitor
-------------

The monitor is the first store component to start, and is required for any of the other store
components to function properly. If ``deisctl list`` indicates that any of the monitors are failing,
it is likely due to a host issue. A common failure scenario is inadequate free storage on the host
node; in that case, monitors will fail with errors similar to:

.. code-block:: console

  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700  0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700  0 ** Shutdown via Data Health Service **
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt ***
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700  1 mon.deis-staging-node1@0(leader) e3 shutdown
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700  0 quorum service shutdown
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700  0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services
  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700  0 quorum service shutdown

This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately
large volumes.
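
If a monitor has shut itself down this way, a reasonable first step is to free up space on the affected
host and restart the monitor there. The following is a sketch, assuming the monitor's systemd unit follows
the ``deis-store-<component>`` naming used for the other store units:

.. code-block:: console

    core@deis-staging-node1 ~ $ df -h /                                    # verify free space has been reclaimed
    core@deis-staging-node1 ~ $ sudo systemctl restart deis-store-monitor  # restart only the local monitor
    core@deis-staging-node1 ~ $ deisctl status store-monitor               # confirm the unit is active again

Once the monitor is back, re-check ``ceph -s`` from ``deis-store-admin`` to confirm it has rejoined the quorum.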

store-daemon
------------

The daemons are responsible for actually storing the data on the filesystem. The cluster is configured
to allow writes with just one daemon running, but it will operate in a degraded state, so restoring all
daemons to a running state as quickly as possible is paramount.

Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons,
resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a
``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just
that daemon.
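
One possible sequence for recovering a single failed daemon without restarting the whole set is sketched
below: identify the down daemon from ``deis-store-admin``, restart it on its host, and then watch the
cluster heal:

.. code-block:: console

    root@deis-1:/# ceph osd tree                              # identify which daemon (OSD) is down and on which host
    core@deis-2 ~ $ sudo systemctl restart deis-store-daemon  # on the affected host only
    root@deis-1:/# ceph -s                                    # wait for "3 osds: 3 up, 3 in" and active+clean PGs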

store-gateway
-------------

The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway
will result in a short downtime for the registry component (and will prevent the database from
backing up), but those components should recover as soon as the gateway comes back up.
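
Restarting and checking the gateway follows the same pattern as the other store units; the following is a
sketch that assumes the gateway unit is addressed as ``store-gateway``, matching the component name above:

.. code-block:: console

    $ deisctl restart store-gateway
    $ deisctl status store-gateway    # the registry and database backups should recover once this is active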

store-metadata
--------------

The metadata servers are required for the **volume** to function properly. Only one is active at
any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata
server should the active one fail.
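
The active/standby state of the metadata servers can be checked from ``deis-store-admin``; in a healthy
cluster the output resembles the ``mdsmap`` line shown earlier, with one ``up:active`` entry and the rest
listed as standbys:

.. code-block:: console

    root@deis-1:/# ceph mds stat
    e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby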

store-volume
------------

Without functioning monitors, daemons, and metadata servers, the volume service will likely hang
indefinitely (or restart constantly). If the controller or logger happens to be running on a host with a
failing store-volume, application logs will be lost until the volume recovers.

Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module.
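
To check whether the volume is healthy on a particular host, a minimal sketch (assuming the CephFS mount
is managed by the ``deis-store-volume`` unit) is to verify the unit state, that the Ceph kernel module is
loaded, and that a ``ceph``-type filesystem is actually mounted:

.. code-block:: console

    core@deis-1 ~ $ deisctl status store-volume    # unit state and recent log output
    core@deis-1 ~ $ lsmod | grep ceph              # the CephFS kernel module should be listed
    core@deis-1 ~ $ mount -t ceph                  # lists any mounted CephFS filesystems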

.. _`Ceph FS`: https://ceph.com/docs/giant/cephfs/
.. _`troubleshooting`: http://docs.ceph.com/docs/giant/rados/troubleshooting/