# Backup and Restore for etcd

## Backup

### Manually taking an etcd snapshot

SSH into one of the etcd nodes and run the following commands:

```bash
set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm save
```

The command takes an etcd snapshot by running an appropriate `etcdctl snapshot save` command.
The snapshot is then exported to the S3 URI: `s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db`.
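
Under the hood this boils down to an `etcdctl snapshot save` followed by an upload to S3. Below is a minimal sketch of the equivalent manual steps, assuming `etcdctl` v3 and the AWS CLI are available on the node; the endpoint, certificate paths, and local snapshot path are placeholders:

```bash
# Take a snapshot locally (placeholder endpoint and TLS paths; adjust to your node's configuration).
ETCDCTL_API=3 etcdctl \
  --endpoints "https://127.0.0.1:2379" \
  --cacert /etc/ssl/certs/etcd-trusted-ca.pem \
  --cert /etc/ssl/certs/etcd-client.pem \
  --key /etc/ssl/certs/etcd-client-key.pem \
  snapshot save /var/run/etcd/snapshot.db

# Upload the snapshot to the S3 location kube-aws restores from.
aws s3 cp /var/run/etcd/snapshot.db \
  "s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db"
```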

### Automatically taking an etcd snapshot

A feature to periodically take a snapshot of an etcd cluster can be enabled by specifying the following in `cluster.yaml`:

```yaml
etcd:
  snapshot:
    automated: true
```

When enabled, the command `etcdadm save` is called periodically (every minute by default) via a systemd timer.
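
To confirm the timer is active on an etcd node, you can inspect the systemd units (a sketch; the unit name `etcdadm-save` is an assumption and may differ in your kube-aws version):

```bash
# List timers and check the assumed etcdadm-save units (names may differ).
systemctl list-timers | grep etcdadm
systemctl status etcdadm-save.timer
journalctl -u etcdadm-save.service --since "1 hour ago"
```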

## Restore

Please be aware that you must have taken an etcd snapshot beforehand in order to restore your cluster.
An etcd snapshot can be taken manually or automatically according to the steps described above.

### Manually restoring a permanently failed etcd node from etcd snapshot

Restoring a single permanently failed etcd node directly from an etcd snapshot is not possible.
However, you can recover a permanently failed etcd node, without losing data, by "resetting" the node.
More concretely, you can run the following commands to remove the etcd member from the cluster, wipe its etcd data, and then re-add the member to the cluster:

```bash
sudo systemctl stop etcd-member.service

set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm replace

sudo systemctl start etcd-member.service
```

The reset member eventually catches up with the rest of the etcd cluster, so the recovery completes without losing data.
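
To confirm the member has rejoined and is healthy, you can query the cluster from the node (a sketch; the endpoint and TLS paths are placeholders, as in the snapshot example above):

```bash
# List members and check endpoint health after the reset (placeholder endpoint and TLS paths).
ETCDCTL_API=3 etcdctl \
  --endpoints "https://127.0.0.1:2379" \
  --cacert /etc/ssl/certs/etcd-trusted-ca.pem \
  --cert /etc/ssl/certs/etcd-client.pem \
  --key /etc/ssl/certs/etcd-client-key.pem \
  member list

ETCDCTL_API=3 etcdctl \
  --endpoints "https://127.0.0.1:2379" \
  --cacert /etc/ssl/certs/etcd-trusted-ca.pem \
  --cert /etc/ssl/certs/etcd-client.pem \
  --key /etc/ssl/certs/etcd-client-key.pem \
  endpoint health
```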

For more details, see [the relevant upstream issue](https://github.com/kubernetes/kubernetes/issues/40027#issuecomment-283501556).

### Manually restoring a cluster from etcd snapshot

SSH into every etcd node and stop the etcd3 process. The loops below assume a `hosts` variable listing the addresses of your etcd nodes, as sketched next.

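A minimal example of setting `hosts` (the addresses are placeholders for your actual etcd nodes):

```bash
# Placeholder addresses; substitute your etcd nodes' hostnames or IPs.
hosts="etcd0.example.internal etcd1.example.internal etcd2.example.internal"
```

First, stop the etcd3 process on each node:
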
```bash
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl stop etcd-member.service
done
```

and then start the etcd3 process:

```bash
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl start etcd-member.service
done
```

Doing this triggers the automated disaster recovery process across etcd nodes by running `etcdadm-reconfigure.service`,
and your cluster will eventually be restored from the snapshot stored at `s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db`.
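
To follow the recovery as it happens, you can tail the reconfigure unit mentioned above on an etcd node (a sketch):

```bash
# Watch the automated disaster recovery progress (Ctrl-C to stop following).
journalctl -u etcdadm-reconfigure.service -f
```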

### Automatic recovery

A feature to automatically restore a permanently failed etcd member or a cluster can be enabled by specifying the following in `cluster.yaml`:

```yaml
etcd:
  disasterRecovery:
    automated: true
```

When enabled:
- The command `etcdadm check` is called periodically by a systemd timer (the same command can also be run manually, as sketched below)
  - The etcd cluster and each etcd node (=member) is checked by running the `etcdctl endpoint health` command
- When up to `1/N` of the etcd nodes fail successive health checks, each failed node is removed as an etcd member and then added again as a new member
  - The new member eventually catches up with the data from the etcd cluster
- When more than `1/N` of the etcd nodes fail successive health checks, a disaster recovery process is executed to recover all the etcd nodes from the latest etcd snapshot
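
If you want to run the same health check by hand, `etcdadm check` can be invoked on an etcd node following the same pattern as `etcdadm save` above (a sketch, assuming the environment file provides all required settings):

```bash
# Manually run the health check that the systemd timer performs periodically.
set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm check
```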