Backup and Restore of ETCD Database

How to Back Up and Restore the ETCD Database Using Kublr

Backup and restore of a Kubernetes cluster involves the following components: cluster spec, secrets, etcd database content, and persistent volumes content.

Kublr implements the full cluster backup procedure described in the Backup section. However, this solution might not fit every customer environment.

To help customers implement the backup procedure that best suits their requirements, we provide low-level tools for application-level etcd snapshot and restore.

Making a Backup (Snapshot)

To create a snapshot, run the following command as root on any of the master nodes:

/opt/kublr/bin/kublr etcd backup --file file.db

This will create an application-level snapshot of the etcd database and write it to the file file.db. This command is intended to be run as part of a script that generates a timestamped name for the file and/or uploads it to the final destination.

As with any backup, it is advisable to store it somewhere outside of the node file system.
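
For example, a minimal wrapper script along the following lines could be scheduled via cron on several master nodes. The S3 bucket name and the use of the AWS CLI are assumptions for illustration; substitute your own destination and upload tool:

#!/bin/bash
# Sketch only: creates a timestamped etcd snapshot and uploads it off the node.
set -e
SNAPSHOT=/tmp/etcd-$(hostname)-$(date +%Y%m%d-%H%M%S).db

# Create the application-level etcd snapshot
/opt/kublr/bin/kublr etcd backup --file "$SNAPSHOT"

# Store it outside of the node file system (hypothetical S3 bucket)
aws s3 cp "$SNAPSHOT" "s3://example-etcd-backups/$(basename "$SNAPSHOT")"

# Remove the local copy
rm -f "$SNAPSHOT"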

The file is a standard application-level etcd snapshot, as created by the etcdctl snapshot save command or the equivalent etcd API call.

The snapshot contains consensus data, so it does not matter which master node is used for the snapshot. However, to avoid a single point of failure, you might want to schedule snapshotting on several nodes.

Manual Restoration from Snapshot

The snapshot made in the previous step can be restored manually according to the procedure described in the etcd disaster recovery document.

Kublr uses etcdX (where X is the master node ordinal) for etcd instance names and https://etcdX.kublr.local:2380 for peer URLs. You can find the etcd data volume location and other aspects of the Kublr etcd environment in the file /etc/kubernetes/manifests/etcd.manifest. See section Addendum: locating etcd data volume for more information.

Note that etcdctl requires peer URLs to be resolvable during the restore. The names etcdX.kublr.local are not part of the Kubernetes DNS (which will not be operational while etcd is down). You must use /etc/hosts or other means to make them resolvable.
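
For example, on a three-master cluster you could add entries like the following to /etc/hosts on each master node (the IP addresses are placeholders; use the actual addresses of your master nodes):

10.0.1.10 etcd0.kublr.local
10.0.1.11 etcd1.kublr.local
10.0.1.12 etcd2.kublr.local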

Also note that Kublr marks etcd as a critical pod, so you cannot stop the etcd instance manually (Kublr will forcefully start it again).

Even if you somehow manage to run the restore without stopping Kubernetes (perhaps during the time window of an etcd restart), replacing the etcd database under an active Kubernetes API server will render Kubernetes inoperable, so a cold restart will be needed anyway.

You must stop all Kublr/Kubernetes services by issuing the following commands as root:

service kublr stop
service kublr-kubelet stop
service docker stop
# at this point you can perform the restore
# wait until all other master nodes reach this point
service kublr start
# no need to manually start kublr-kubelet and docker,
# the kublr service will start them automatically
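
For reference, the restore step itself follows the etcd disaster recovery procedure. Below is a sketch for master node 0 of a hypothetical three-master cluster with the default data volume location; adjust --name, --initial-advertise-peer-urls, and --data-dir for each node (see Addendum: locating etcd data volume):

# move the old data directory aside first: etcdctl refuses to restore
# into an existing non-empty directory
mv /mnt/master-pd/etcd /mnt/master-pd/etcd.bak

# restore the snapshot into a fresh data directory for instance etcd0
ETCDCTL_API=3 etcdctl snapshot restore file.db \
  --name etcd0 \
  --initial-cluster etcd0=https://etcd0.kublr.local:2380,etcd1=https://etcd1.kublr.local:2380,etcd2=https://etcd2.kublr.local:2380 \
  --initial-advertise-peer-urls https://etcd0.kublr.local:2380 \
  --data-dir /mnt/master-pd/etcd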

After a successful restore, worker nodes will also need to be restarted.

Warning: etcd database restore by itself does not restore the content of persistent volumes. This must be done separately, preferably before attempting to start the node.

Aided Restoration from Snapshot

To avoid the tedious task of finding node ordinals and constructing the correct etcdctl environment and command arguments for every node, we provide the kublr etcd restore subcommand.

To restore the etcd database using this command, issue the following commands as root on every master node:

# distribute the snapshot file to every master node
service kublr stop
service kublr-kubelet stop
service docker stop
/opt/kublr/bin/kublr etcd restore --file file.db
# wait until all other master nodes reach this point
service kublr start
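
The distribution step might look like the following when run from the node holding the snapshot (the host names are placeholders for your other master nodes):

# copy the snapshot to the other master nodes (hypothetical host names)
scp file.db root@master-1:/root/file.db
scp file.db root@master-2:/root/file.db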

As with manual restore, all master nodes must be restored from the same snapshot file.

After a successful restore, worker nodes will also need to be restarted.

Warning: etcd database restore by itself does not restore the content of persistent volumes. This must be done separately, preferably before attempting to start the node.

The command kublr etcd restore does not perform the actual restore; it schedules the restore to be performed on etcd pod startup. To find the output of the actual restore operation, check the logs of the etcd container using the docker logs command. (The equivalent kubectl command will be available only if the restore was successful.) The etcd container has a name starting with k8s_etcd_k8s-etcd-. The restore operation output will most likely be at the top of the log, before the output of the etcd process.
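
For example, assuming a single matching container on the node, the restore output can be located like this:

# find the etcd container name
docker ps --format '{{.Names}}' | grep '^k8s_etcd_k8s-etcd-'

# show the beginning of its log, where the restore output appears
docker logs "$(docker ps --format '{{.Names}}' | grep '^k8s_etcd_k8s-etcd-')" 2>&1 | head -n 50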

To abort the scheduled restore, remove the file /mnt/master-pd/etcd/restore.db (see section Addendum: locating etcd data volume to find the etcd data volume location for your instance).

Warning: The etcd restore is a destructive operation, so avoid dry-running kublr etcd restore.

Restoration of Single ETCD Instance

When only one of the etcd instances has failed, there is no need to restore the entire cluster database from a backup. In etcd 3.2 and higher, a single failed node can be restored by replicating the data from the cluster quorum.

The procedure for restoring a single node is also described in the etcd disaster recovery document. In the Kublr environment, this procedure is inconvenient because it requires stopping the etcd instance and reproducing the etcd environment for the etcdctl command.

To aid in the recovery of a single etcd instance, Kublr 1.11.2 and higher implements a control mechanism for scheduling commands to be performed by the running etcd pod.

To schedule the restore of a single node by replication from the cluster quorum, create a file named command in the root directory of the etcd data volume. This file must contain the string reinitialize. Example:

echo reinitialize > /mnt/master-pd/etcd/command

See section Addendum: locating etcd data volume to find the etcd data volume location for your instance.

The command will be performed by the etcd pod several seconds after the file is created, or on the next restart if the pod is in a crash loop. The command file is removed after execution. The results of the execution can be checked in the pod/container logs; some information is also available in the command-result file in the same directory.
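
As a sketch, the full reinitialization flow on a node with the default data volume location could look like this:

# schedule reinitialization of this etcd instance
echo reinitialize > /mnt/master-pd/etcd/command

# wait until the etcd pod picks up and removes the command file
while [ -f /mnt/master-pd/etcd/command ]; do sleep 1; done

# inspect the result of the execution
cat /mnt/master-pd/etcd/command-result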

This procedure does not involve replacing the content of the etcd database, so it does not require restarting Kublr and Kubernetes services.

Warning: In the future, additional commands may be added to the pod control mechanism, so avoid creating the command file with any other content.

Addendum: Locating ETCD Data Volume

By default, Kublr starts the etcd container using the host directory /mnt/master-pd/etcd for the etcd data volume. However, this path can be overridden by a custom cluster spec or by platform-specific defaults.

The actual host path of the etcd data volume can be found by two methods (a shell sketch of both follows the list):

  1. Consulting the Kublr configuration. The actual Kublr configuration, obtained by merging /etc/kublr/daemon.yaml with Kublr defaults, can be read by issuing the command /opt/kublr/bin/kublr validate. The output of this command is a YAML data stream. The host path of the etcd data volume is controlled by the parameter etcd_storage.path.
  2. Consulting the etcd pod configuration. The configuration can be obtained using the kubectl describe pod command or from the YAML file /etc/kubernetes/manifests/etcd.manifest on the master node.
    The host path of the etcd data volume is controlled by the parameter hostPath.path of the volume named data. Volume parameters are located in the spec.volumes section of the manifest.
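
A quick shell sketch of both methods (the grep context widths are rough; inspect the full output if the value does not show up):

# Method 1: the merged Kublr configuration; look for etcd_storage.path
/opt/kublr/bin/kublr validate | grep -A 2 etcd_storage

# Method 2: the etcd pod manifest; look for hostPath.path of the "data" volume
grep -A 2 hostPath /etc/kubernetes/manifests/etcd.manifest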