# Backup and Restore for etcd

## Backup

### Manually taking an etcd snapshot

SSH into one of the etcd nodes and run the following commands:

```bash
set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm save
```

The command takes an etcd snapshot by running an appropriate `etcdctl snapshot save` command.
The snapshot is then exported to the S3 URI: `s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db`.

### Automatically taking an etcd snapshot

A feature to periodically take a snapshot of an etcd cluster can be enabled by specifying the following in `cluster.yaml`:

```yaml
etcd:
  snapshot:
    automated: true
```

When enabled, the command `etcdadm save` is called periodically (every 1 minute by default) via a systemd timer.

## Restore

Please be aware that you must have taken an etcd snapshot beforehand in order to restore your cluster.
An etcd snapshot can be taken manually or automatically according to the steps described above.

### Manually restoring a permanently failed etcd node from etcd snapshot

It is impossible!
However, you can recover a permanently failed etcd node, without losing data, by "resetting" the node.
More concretely, you can run the following commands to remove the etcd member from the cluster, wipe its etcd data, and then re-add it to the cluster:

```bash
sudo systemctl stop etcd-member.service

set -a; source /var/run/coreos/etcdadm-environment; set +a
/opt/bin/etcdadm replace

sudo systemctl start etcd-member.service
```

The reset member eventually catches up with the data in the etcd cluster, hence the recovery is done without losing data.
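After restarting `etcd-member.service`, it can take a while for the replaced member to report healthy again. The following is a hypothetical helper, not part of etcdadm, that sketches how you might poll for this. The health command and retry count are parameters so the logic is easy to exercise; in practice you would call it with the real `etcdctl endpoint health` after sourcing `/var/run/coreos/etcdadm-environment`:

```shell
# Hypothetical helper (not part of etcdadm): poll until the replaced
# member reports healthy again, or give up after a number of retries.
wait_until_healthy() {
  local check_cmd=${1:-"etcdctl endpoint health"}
  local retries=${2:-30}
  local i
  for i in $(seq 1 "$retries"); do
    # the check command succeeds once the member answers health checks
    if $check_cmd >/dev/null 2>&1; then
      echo "member is healthy"
      return 0
    fi
    sleep 2
  done
  echo "member did not become healthy in time" >&2
  return 1
}

# Exercise the logic with `true` standing in for the real health check:
wait_until_healthy true 1
```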
For more details, I'd suggest you read [the relevant upstream issue](https://github.com/kubernetes/kubernetes/issues/40027#issuecomment-283501556).

### Manually restoring a cluster from etcd snapshot

SSH into every etcd node and stop the etcd3 process:

```bash
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl stop etcd-member.service
done
```

and then start the etcd3 process:

```bash
for h in $hosts; do
  ssh -i path/to/your/key core@$h sudo systemctl start etcd-member.service
done
```

Doing this triggers the automated disaster recovery process across the etcd nodes by running `etcdadm-reconfigure.service`,
and your cluster will eventually be restored from the snapshot stored at `s3://<your-bucket-name>/.../<your-cluster-name>/exported/etcd-snapshots/snapshot.db`.

### Automatic recovery

A feature to automatically restore a permanently failed etcd member or a whole cluster can be enabled by specifying the following in `cluster.yaml`:

```yaml
etcd:
  disasterRecovery:
    automated: true
```

When enabled:

- The command `etcdadm check` is called periodically by a systemd timer.
- The etcd cluster and each etcd node (= member) are checked by running the `etcdctl endpoint health` command.
- When up to `1/N` of the etcd nodes fail successive health checks, each failed node is removed as an etcd member and then added again as a new member.
- The new member eventually catches up with the data in the etcd cluster.
- When more than `1/N` of the etcd nodes fail successive health checks, a disaster recovery process is executed to recover all the etcd nodes from the latest etcd snapshot.
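The decision between replacing a single member and running a full snapshot restore can be sketched as a small shell function. This is a hypothetical illustration of the rule described above, not the actual etcdadm implementation; it assumes the `1/N` threshold means at most one failed member out of `N`:

```shell
# Hypothetical sketch of the recovery decision (not actual etcdadm code).
# n      = number of etcd members in the cluster
# failed = number of members failing successive health checks
recovery_action() {
  local n=$1 failed=$2
  if [ "$failed" -eq 0 ]; then
    echo "healthy"
  elif [ "$failed" -le 1 ]; then
    # a single failed member is removed and re-added as a new member
    echo "replace-member"
  else
    # too many failures: restore every node from the latest snapshot
    echo "disaster-recovery"
  fi
}

recovery_action 3 0  # prints "healthy"
recovery_action 3 1  # prints "replace-member"
recovery_action 3 2  # prints "disaster-recovery"
```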