Database Space Exceeded

Overview

The default etcd database size limit in a Kublr environment is 4 GB. The size of the etcd data volume depends on the cloud provider's configuration and is typically the smallest available SSD storage option.

Note that 4 GB is the limit for the main database file only (by default, /mnt/master-pd/etcd/data/member/snap/db). The data volume also contains WAL and snapshot files, so total disk usage is typically several times larger than the database itself.

The size of the main database file is limited by the etcd parameter --quota-backend-bytes, which can be set through the etcd_flag section of the custom spec. The default is 4 GB. The quota cannot be set to unlimited; you must specify a concrete value.
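To illustrate the kind of value --quota-backend-bytes expects, the sketch below converts a size in gigabytes to bytes. It assumes that the 4 GB default is a binary quantity (4 x 1024^3 bytes); the helper name gib_to_bytes is made up for this example.

```shell
#!/bin/sh
# Convert a whole number of GiB to bytes (hypothetical helper).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the flag as it would look for a 4 GiB quota.
echo "--quota-backend-bytes=$(gib_to_bytes 4)"
# prints: --quota-backend-bytes=4294967296
```

Pass a different number to gib_to_bytes to compute the value for another quota.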

Several factors contribute to the etcd database file size in a Kubernetes cluster:

  1. Useful cluster data: live deployments, pods, config maps, etc.
  2. Deleted cluster data. Normally, the cluster garbage collector takes care of it. For details, view this article.
  3. Old versions of modified cluster data. etcd stores old versions of every modified key. The Kubernetes API server runs keyspace compaction (which deletes old key values) every 5 minutes.
  4. File space fragmentation.

During normal cluster operation, the garbage collector and scheduled compaction keep the etcd database size under control. However, we have seen cases of rapidly repeating errors, for example pods being evicted and rescheduled every second or so. In such a case, the database can reach its size limit before the next scheduled compaction runs. etcd then raises a disk space alarm and rejects all write operations on the cluster database. You cannot modify cluster data, including deleting cluster objects, and scheduled compaction stops working as well.
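One way to notice the problem before the limit is reached is to compare the current database size with the quota. The following is a minimal sketch, assuming that the JSON output of etcdctl endpoint status contains a dbSize field in bytes (verify this against your etcd version). The helper name db_usage_percent is hypothetical.

```shell
#!/bin/sh
# Print the database size as a percentage of the quota.
# stdin: JSON from `etcdctl endpoint status --write-out=json`
# $1:    quota in bytes (your quota-backend-bytes value)
db_usage_percent() {
    quota_bytes=$1
    # Assumption: the status JSON carries the size as "dbSize":<bytes>.
    db_size=$(grep -Eo '"dbSize":[0-9]+' | grep -Eo '[0-9]+' | head -n 1)
    [ -n "$db_size" ] || return 1
    echo $(( db_size * 100 / quota_bytes ))
}
```

In the etcd pod shell it could be used as: ETCDCTL_API=3 etcdctl endpoint status --write-out=json | db_usage_percent 4294967296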

This condition can be detected by log messages in apiserver.log such as:

compact.go:124] etcd: endpoint ([https://172.16.4.149:2379]) compact failed: etcdserver: mvcc: database space exceeded
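A quick way to check for this message is to count its occurrences in the log. This is only a sketch: the location of apiserver.log depends on your installation, so the path is passed as an argument, and count_space_exceeded is a hypothetical helper name.

```shell
#!/bin/sh
# Count "database space exceeded" errors in a log file.
# $1: path to the API server log (location depends on your installation).
count_space_exceeded() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'etcdserver: mvcc: database space exceeded' "$1"
}
```

A non-zero count suggests that the quota has been hit and the procedure below is needed.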

Fixing this condition requires operator intervention.

Step-by-Step via etcd Container

  1. Set up kubectl to access your cluster.

  2. Find the name of the etcd pod:

     $ kubectl get pods -n kube-system | grep etcd
     k8s-etcd-09e9eb5f99507d780db842afa1158ef3f77d9586d354031038ae34e63f2025d9-ip-172-16-4-149.ec2.internal  2/2 Running 0 3h41m
    
  3. Launch an interactive shell in the etcd pod (replace k8s-etcd-09… with the actual pod name found in the previous step):

     $ kubectl exec -n kube-system k8s-etcd-09e9eb5f99507d780db842afa1158ef3f77d9586d354031038ae34e63f2025d9-ip-172-16-4-149.ec2.internal -c etcd -it -- /bin/sh
     / #
    
  4. In the etcd pod shell, check the etcd disk space usage (in this example we lowered the disk quota; your DB SIZE value would match your quota-backend-bytes setting):

     / # ETCDCTL_API=3 etcdctl --write-out=table endpoint status
    
     ENDPOINT        ID                VERSION  DB SIZE  IS LEADER  RAFT TERM  RAFT INDEX
     127.0.0.1:2379  aadebc09b69225c2  3.2.24   271MB    true       2          33778

  5. In the etcd pod shell, follow the procedure described here: compact the keyspace, defragment the database, and disarm the alarm.

    Note: In our setup you do not need to specify the --endpoints option.

     # get current revision
     / # rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
     # compact away all old revisions
     / # ETCDCTL_API=3 etcdctl compact $rev
     compacted revision 1516
     # defragment away excessive space
     / # ETCDCTL_API=3 etcdctl defrag
     Finished defragmenting etcd member[127.0.0.1:2379]
     # disarm alarm
     / # ETCDCTL_API=3 etcdctl alarm disarm
     memberID:13803658152347727308 alarm:NOSPACE
    
  6. Leave the etcd pod shell by typing exit or pressing Ctrl-D.

  7. You do not need to restart etcd or other Kubernetes or Kublr pods or services.

  8. Repeat the procedure for all master nodes.
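The steps above can be combined into a script run from the machine where kubectl is configured. This is a sketch under stated assumptions, not a tested Kublr tool: it assumes that etcd pods can be recognized by "etcd" in their name (as in step 2) and that the container is named etcd (as in step 3). The names etcd_pod_names, REMOTE_SCRIPT and reclaim_all are made up for this example.

```shell
#!/bin/sh
# Sketch: run compact / defrag / alarm disarm in every etcd pod.

# Print etcd pod names, given `kubectl get pods --no-headers` output on stdin.
etcd_pod_names() {
    awk '$1 ~ /etcd/ { print $1 }'
}

# Commands executed inside each etcd container (same sequence as step 5).
# Compaction may fail on later members if the revision was already
# compacted from another member, so its failure does not stop the sequence.
REMOTE_SCRIPT='
export ETCDCTL_API=3
rev=$(etcdctl endpoint status --write-out=json | grep -Eo "\"revision\":[0-9]+" | grep -Eo "[0-9]+" | head -n 1)
[ -n "$rev" ] || { echo "cannot determine current revision" >&2; exit 1; }
etcdctl compact "$rev" || echo "compact failed, continuing with defrag" >&2
etcdctl defrag && etcdctl alarm disarm
'

# Iterate over all etcd pods in kube-system and run the sequence in each.
reclaim_all() {
    kubectl get pods -n kube-system --no-headers | etcd_pod_names |
    while read -r pod; do
        echo "== $pod"
        # stdin is redirected so kubectl cannot consume the pod list.
        kubectl exec -n kube-system "$pod" -c etcd -- /bin/sh -c "$REMOTE_SCRIPT" </dev/null
    done
}
```

Calling reclaim_all after sourcing the script processes the master nodes one after another, which corresponds to step 8. Review the output of each pod before relying on it.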

Step-by-Step via Kublr etcd Subcommand

  1. Obtain a root shell on a master node.

  2. Check etcd disk space usage (the output should be similar to that of the previous procedure):

     #  /opt/kublr/bin/kublr etcd ctl -- --write-out=table endpoint status
    
  3. Compact the keyspace and defragment the database. Note that you must put a double-dash argument (--) before the first etcdctl argument that starts with a double dash.

     # get current revision
     / # rev=$(/opt/kublr/bin/kublr etcd ctl -- --write-out="json" endpoint status | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
     # compact away all old revisions
     / # /opt/kublr/bin/kublr etcd ctl compact $rev
     compacted revision 1516
     # defragment away excessive space
     / # /opt/kublr/bin/kublr etcd ctl defrag
     Finished defragmenting etcd member[127.0.0.1:2379]
     # disarm alarm
     / # /opt/kublr/bin/kublr etcd ctl alarm disarm
     memberID:13803658152347727308 alarm:NOSPACE
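The same sequence can be wrapped in a small script for the root shell on a master node. This is a sketch only: the KUBLR variable and the function names extract_revision and reclaim_space are hypothetical, and the script assumes that the kublr etcd ctl subcommand passes its arguments to etcdctl as described in step 3.

```shell
#!/bin/sh
# Path to the Kublr binary; KUBLR is a hypothetical override variable.
KUBLR=${KUBLR:-/opt/kublr/bin/kublr}

# Print the current revision, given endpoint status JSON on stdin.
extract_revision() {
    grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Compact, defragment, and disarm the alarm on this master node.
reclaim_space() {
    # Note the "--" before the first etcdctl argument starting with "--".
    rev=$("$KUBLR" etcd ctl -- --write-out=json endpoint status | extract_revision)
    if [ -z "$rev" ]; then
        echo "cannot determine current revision" >&2
        return 1
    fi
    # Compaction may fail if the revision was already compacted elsewhere.
    "$KUBLR" etcd ctl compact "$rev" || echo "compact failed, continuing with defrag" >&2
    "$KUBLR" etcd ctl defrag && "$KUBLR" etcd ctl alarm disarm
}
```

Run reclaim_space on each master node in turn, as in the container-based procedure.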