Version 3.3.13 home Download and build Libraries and tools Metrics Branch management Demo Discovery service protocol etcd release guide Frequently Asked Questions (FAQ) Logging conventions Production users Reporting bugs Tuning Benchmarks Benchmarking etcd v2.1.0 Benchmarking etcd v2.2.0 Benchmarking etcd v2.2.0-rc Benchmarking etcd v2.2.0-rc-memory Benchmarking etcd v3 Storage Memory Usage Benchmark Watch Memory Usage Benchmark Developer guide etcd API Reference etcd concurrency API Reference Experimental APIs and features gRPC naming and discovery Interacting with etcd Set up a local cluster System limits Why gRPC gateway etcd v3 API Learning etcd client architecture Client feature matrix Data model etcd v3 authentication design etcd versus other key-value stores etcd3 API Glossary KV API guarantees Learner Operations guide Clustering Guide Configuration flags Design of runtime reconfiguration Disaster recovery etcd gateway Failure modes gRPC proxy Hardware recommendations Maintenance Migrate applications from using API v2 to API v3 Monitoring etcd Performance Role-based access control Run etcd clusters inside containers Runtime reconfiguration Supported systems Transport security model Versioning Platforms Amazon Web Services Container Linux with systemd FreeBSD Upgrading Upgrade etcd from 2.3 to 3.0 Upgrade etcd from 3.0 to 3.1 Upgrade etcd from 3.1 to 3.2 Upgrade etcd from 3.2 to 3.3 Upgrade etcd from 3.3 to 3.4 Upgrade etcd from 3.4 to 3.5 Upgrading etcd clusters and applications

Design of runtime reconfiguration

You are viewing documentation for etcd version: v3.3.13

etcd v3.3.13 documentation is no longer actively maintained. The version you are currently viewing is a static snapshot. For up-to-date documentation, see the latest release, v3.4.0, or the current documentation.

Runtime reconfiguration is one of the hardest and most error prone features in a distributed system, especially in a consensus based system like etcd.

Read on to learn about the design of etcd’s runtime reconfiguration commands and how we tackled these problems.

Two phase config changes keep the cluster safe

In etcd, every runtime reconfiguration has to go through two phases for safety reasons. For example, to add a member, first inform the cluster of the new configuration and then start the new member.

Phase 1 - Inform cluster of new configuration

To add a member into an etcd cluster, make an API call to request a new member to be added to the cluster. This is the only way to add a new member into an existing cluster. The API call returns when the cluster agrees on the configuration change.

Phase 2 - Start new member

To join the new etcd member into the existing cluster, specify the correct initial-cluster and set initial-cluster-state to existing. When the member starts, it will contact the existing cluster first and verify the current cluster configuration matches the expected one specified in initial-cluster. When the new member successfully starts, the cluster has reached the expected configuration.

By splitting the process into two discrete phases users are forced to be explicit regarding cluster membership changes. This actually gives users more flexibility and makes things easier to reason about. For example, if there is an attempt to add a new member with the same ID as an existing member in an etcd cluster, the action will fail immediately during phase one without impacting the running cluster. Similar protection is provided to prevent adding new members by mistake. If a new etcd member attempts to join the cluster before the cluster has accepted the configuration change, it will not be accepted by the cluster.

Without the explicit workflow around cluster membership etcd would be vulnerable to unexpected cluster membership changes. For example, if etcd is running under an init system such as systemd, etcd would be restarted after being removed via the membership API, and attempt to rejoin the cluster on startup. This cycle would continue every time a member is removed via the API and systemd is set to restart etcd after failing, which is unexpected.

We expect runtime reconfiguration to be an infrequent operation. We decided to keep it explicit and user-driven to ensure configuration safety and keep the cluster always running smoothly under explicit control.

Permanent loss of quorum requires new cluster

If a cluster permanently loses a majority of its members, a new cluster will need to be started from an old data directory to recover the previous state.

It is entirely possible to force removing the failed members from the existing cluster to recover. However, we decided not to support this method since it bypasses the normal consensus committing phase, which is unsafe. If the member to remove is not actually dead or force removed through different members in the same cluster, etcd will end up with a diverged cluster with same clusterID. This is very dangerous and hard to debug/fix afterwards.

With a correct deployment, the possibility of permanent majority loss is very low. But it is a severe enough problem that is worth special care. We strongly suggest reading the disaster recovery documentation and preparing for permanent majority loss before putting etcd into production.

Do not use public discovery service for runtime reconfiguration

The public discovery service should only be used for bootstrapping a cluster. To join member into an existing cluster, use the runtime reconfiguration API.

The discovery service is designed for bootstrapping an etcd cluster in a cloud environment, when the IP addresses of all the members are not known beforehand. After successfully bootstrapping a cluster, the IP addresses of all the members are known. Technically, the discovery service should no longer be needed.

It seems that using public discovery service is a convenient way to do runtime reconfiguration, after all discovery service already has all the cluster configuration information. However relying on public discovery service brings troubles:

  1. it introduces external dependencies for the entire life-cycle of the cluster, not just bootstrap time. If there is a network issue between the cluster and public discovery service, the cluster will suffer from it.

  2. public discovery service must reflect correct runtime configuration of the cluster during its life-cycle. It has to provide security mechanisms to avoid bad actions, and it is hard.

  3. public discovery service has to keep tens of thousands of cluster configurations. Our public discovery service backend is not ready for that workload.

To have a discovery service that supports runtime reconfiguration, the best choice is to build a private one.

Design of runtime reconfiguration