---
title: Failure simulation and automatic recovery
description: Automatic recovery of cluster
keywords: [mysql, high availability, failure simulation, automatic recovery]
sidebar_position: 1
---

# Failure simulation and automatic recovery

As an open-source data management platform, KubeBlocks supports two database forms, ReplicationSet and ConsensusSet. ReplicationSet can be used for databases with a single source, multiple replicas, and no automatic switching, such as MySQL and Redis. ConsensusSet can be used for databases with multiple replicas and automatic switching capabilities, such as an ApeCloud MySQL RaftGroup with multiple replicas and MongoDB. The ConsensusSet database management capability was released in KubeBlocks v0.3.0, and ReplicationSet is under development.

This guide takes ApeCloud MySQL as an example to introduce the high-availability capability of a database in the ConsensusSet form. This capability is also applicable to other database engines.

## Recovery simulation

:::note

The faults here are all simulated by deleting a pod. When there are sufficient resources, a fault can also be simulated by machine downtime or container deletion, and the automatic recovery is the same as described here.

:::

### Before you start

* Install KubeBlocks: You can install KubeBlocks by [kbcli](./../../installation/install-with-kbcli/install-kubeblocks-with-kbcli.md) or by [Helm](./../../installation/install-with-helm/install-kubeblocks-with-helm.md).
* Create an ApeCloud MySQL RaftGroup. Refer to [Create a MySQL cluster](./../cluster-management/create-and-connect-a-mysql-cluster.md).
* Run `kubectl get cd apecloud-mysql -o yaml` to check whether _rolechangedprobe_ is enabled in the ApeCloud MySQL RaftGroup (it is enabled by default). If the following configuration exists, it indicates that it is enabled:

  ```yaml
  probes:
    roleProbe:
      failureThreshold: 3
      periodSeconds: 2
      timeoutSeconds: 1
  ```
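
  To avoid scrolling through the full ClusterDefinition manifest, you can filter the output down to the probe settings. The following is a minimal sketch that simply greps the YAML shown above; the context length may need adjusting for other KubeBlocks versions.

  ```bash
  # Print only the role probe settings from the ApeCloud MySQL ClusterDefinition.
  # -A 3 prints the three lines that follow the match, i.e. the block shown above.
  kubectl get cd apecloud-mysql -o yaml | grep -A 3 "roleProbe"
  ```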

### Leader pod fault

***Steps:***

1. View the ApeCloud MySQL RaftGroup information. View the leader pod name in `Topology`. In this example, the leader pod's name is `mysql-cluster-mysql-1`.

    ```bash
    kbcli cluster describe mysql-cluster
    ```

    ![describe_cluster](./../../../img/failure_simulation_describe_cluster.png)
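
    If you prefer plain `kubectl`, you can also read the roles from the pod labels. This is a sketch that assumes KubeBlocks labels the cluster pods with `app.kubernetes.io/instance` and exposes the consensus role through the `kubeblocks.io/role` label; adjust the label keys if your version differs.

    ```bash
    # List the cluster pods and show each pod's consensus role (leader/follower)
    # via the assumed kubeblocks.io/role label.
    kubectl get pods -l app.kubernetes.io/instance=mysql-cluster -L kubeblocks.io/role
    ```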
2. Delete the leader pod `mysql-cluster-mysql-1` to simulate a pod fault.

    ```bash
    kubectl delete pod mysql-cluster-mysql-1
    ```

    ![delete_pod](./../../../img/failure_simulation_delete_pod.png)
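
    To follow the fault and the rebuild in real time, you can keep a watch on the pods while the leader is deleted. A minimal sketch, assuming the same `app.kubernetes.io/instance` label as above:

    ```bash
    # In a second terminal, watch the pods of the cluster while the leader is deleted.
    # You should see mysql-cluster-mysql-1 terminate and a replacement pod come back up.
    kubectl get pods -l app.kubernetes.io/instance=mysql-cluster -w
    ```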
3. Run `kbcli cluster describe` and `kbcli cluster connect` to check the pod status and the RaftGroup connection.

    ***Results***

    The following example shows that the roles of the pods have changed after the old leader pod was deleted and `mysql-cluster-mysql-2` has been elected as the new leader pod.

    ```bash
    kbcli cluster describe mysql-cluster
    ```

    ![describe_cluster_after](./../../../img/failure_simulation_describe_cluster_after.png)

    It shows that this ApeCloud MySQL RaftGroup can be connected within seconds.

    ```bash
    kbcli cluster connect mysql-cluster
    ```

    ![connect_cluster_after](./../../../img/failure_simulation_connect_cluster_after.png)

    ***How the automatic recovery works***

    After the leader pod is deleted, the ApeCloud MySQL RaftGroup elects a new leader. In this example, `mysql-cluster-mysql-2` is elected as the new leader. KubeBlocks detects that the leader has changed and sends a notification to update the access link. The original exception node is automatically rebuilt and recovers to the normal RaftGroup state. It normally takes 30 seconds from exception to recovery.
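
    If you want to confirm that the access link now routes to the new leader, you can inspect the Services that KubeBlocks created for the cluster and check their endpoints. This is only a sketch: the Service name (assumed here to be `mysql-cluster-mysql`) may differ depending on your cluster and component names.

    ```bash
    # List the Services created for the cluster.
    kubectl get svc -l app.kubernetes.io/instance=mysql-cluster

    # Check which pod IP the (assumed) read/write Service currently points to;
    # after the switch it should be the IP of the new leader pod.
    kubectl get endpoints mysql-cluster-mysql -o wide
    ```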

### Single follower pod exception

***Steps:***

1. View the ApeCloud MySQL RaftGroup information and view the follower pod names in `Topology`. In this example, the follower pods are `mysql-cluster-mysql-0` and `mysql-cluster-mysql-2`.

    ```bash
    kbcli cluster describe mysql-cluster
    ```

    ![describe_cluster](./../../../img/failure_simulation_describe_cluster.png)
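
    Alternatively, you can list only the followers by filtering on the role label. A sketch that assumes the `kubeblocks.io/role` label mentioned earlier:

    ```bash
    # List only the follower pods of the cluster.
    kubectl get pods -l app.kubernetes.io/instance=mysql-cluster,kubeblocks.io/role=follower
    ```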
2. Delete the follower pod `mysql-cluster-mysql-0`.

    ```bash
    kubectl delete pod mysql-cluster-mysql-0
    ```

    ![delete_follower_pod](./../../../img/failure_simulation_delete_follower_pod.png)
3. View the RaftGroup status. In `Component.Instance`, you can see that the follower pod is being terminated.

    ```bash
    kbcli cluster describe mysql-cluster
    ```

    ![describe_cluster_follower](./../../../img/failure_simulation_describe_cluster_follower.png)
4. Connect to the RaftGroup and you can find that this single follower exception does not affect the R/W of the cluster.

    ```bash
    kbcli cluster connect mysql-cluster
    ```

    ![connect_cluster_follower](./../../../img/failure_simulation_connect_cluster_follower.png)

    ***How the automatic recovery works***

    A single follower exception does not trigger leader re-election or an access link switch, so the R/W of the cluster is not affected. The exceptional follower is recreated and recovers automatically. The process takes no more than 30 seconds.
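
    If you want to verify R/W yourself while the follower is down, open a connection and run a quick write and read. The SQL below is only an illustrative check against a hypothetical `ha_check` database; run it inside the MySQL shell opened by `kbcli cluster connect`.

    ```bash
    # Open a MySQL shell to the cluster, then run an illustrative write/read, e.g.:
    #   CREATE DATABASE IF NOT EXISTS ha_check;
    #   CREATE TABLE IF NOT EXISTS ha_check.t (id INT PRIMARY KEY AUTO_INCREMENT, note VARCHAR(64));
    #   INSERT INTO ha_check.t (note) VALUES ('write during follower outage');
    #   SELECT * FROM ha_check.t;
    kbcli cluster connect mysql-cluster
    ```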

### Two pods exception

The availability of the cluster generally requires a majority of its pods to be in a normal state. When most pods are exceptional, the original leader is automatically demoted to a follower. Therefore, any two exceptional pods result in only one follower pod remaining.

In this way, whether the exceptions occur to one leader and one follower or to two followers, the failure behavior and the automatic recovery are the same.

***Steps:***

1. View the ApeCloud MySQL RaftGroup information and view the follower pod names in `Topology`. In this example, the follower pods are `mysql-cluster-mysql-1` and `mysql-cluster-mysql-0`.

    ```bash
    kbcli cluster describe mysql-cluster
    ```

    ![describe_cluster](./../../../img/failure_simulation_describe_cluster_2.png)
2. Delete these two follower pods.

    ```bash
    kubectl delete pod mysql-cluster-mysql-1 mysql-cluster-mysql-0
    ```

    ![delete_two_pods](./../../../img/failure_simulation_delete_two_pods.png)
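
    To follow how the two pods are recreated, you can tail the recent events in the namespace. A minimal sketch; the grep pattern assumes the default pod names used in this example:

    ```bash
    # Show recent events sorted by time and keep only those related to the cluster pods.
    kubectl get events --sort-by=.lastTimestamp | grep mysql-cluster-mysql
    ```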
3. View the RaftGroup status and you can find that the deleted follower pods are pending and a new leader pod has been elected.

    ```bash
    kbcli cluster describe mysql-cluster
    ```

    ![describe_two_pods](./../../../img/failure_simulation_describe_two_pods.png)
4. Run `kbcli cluster connect mysql-cluster` again after a few seconds and you can find that the pods in the RaftGroup work normally again.

    ```bash
    kbcli cluster connect mysql-cluster
    ```

    ![connect_two_pods](./../../../img/failure_simulation_connect_two_pods.png)

    ***How the automatic recovery works***

    When two pods of the ApeCloud MySQL RaftGroup are exceptional, the majority is lost, so the cluster becomes unavailable for R/W. After the pods are recreated, a new leader is elected and the cluster recovers to the R/W state. The process takes less than 30 seconds.
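
    To check when the cluster is back to a healthy state without opening a connection, you can also poll its status with `kbcli`. A sketch; the status should return to `Running` once recovery completes:

    ```bash
    # Show the high-level status of the cluster; repeat until it reports Running.
    kbcli cluster list mysql-cluster
    ```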

### All pods exception

***Steps:***

1. Run the command below to view the ApeCloud MySQL RaftGroup information and view the pods' names in `Topology`.

    ```bash
    kbcli cluster describe mysql-cluster
    ```

    ![describe_cluster](./../../../img/failure_simulation_describe_cluster.png)
2. Delete all pods.

    ```bash
    kubectl delete pod mysql-cluster-mysql-1 mysql-cluster-mysql-0 mysql-cluster-mysql-2
    ```

    ![delete_three_pods](./../../../img/failure_simulation_delete_three_pods.png)
3. Run the command below to view the deletion process. You can find that the pods are pending.

    ```bash
    kbcli cluster describe mysql-cluster
    ```

    ![describe_three_clusters](./../../../img/failure_simulation_describe_three_pods.png)
4. Run `kbcli cluster connect mysql-cluster` again after a few seconds and you can find that the pods in the RaftGroup work normally again.

    ```bash
    kbcli cluster connect mysql-cluster
    ```

    ![connect_three_clusters](./../../../img/failure_simulation_connect_three_pods.png)

    ***How the automatic recovery works***

    Every time a pod is deleted, recreation is triggered. ApeCloud MySQL then automatically completes the cluster recovery and the election of a new leader. After the leader election is completed, KubeBlocks detects the new leader and updates the access link. This process takes less than 30 seconds.
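
    If you prefer to watch the recovery through the Kubernetes API instead of `kbcli`, you can read the status of the Cluster object directly. A sketch, assuming the KubeBlocks Cluster CRD is registered as `clusters.apps.kubeblocks.io`; the `status` section should report the cluster as running again once the new leader is elected.

    ```bash
    # Inspect the KubeBlocks Cluster object; its status section reflects the recovery progress.
    kubectl get clusters.apps.kubeblocks.io mysql-cluster -o yaml
    ```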