
You are viewing documentation for etcd version: v3.3.13

etcd v3.3.13 documentation is no longer actively maintained. The version you are currently viewing is a static snapshot. For up-to-date documentation, see the latest release, v3.4.0, or the current documentation.

etcd uses Prometheus for metrics reporting. The metrics can be used for real-time monitoring and debugging. etcd does not persist its metrics; if a member restarts, the metrics will be reset.

The simplest way to see the available metrics is to cURL the metrics endpoint /metrics. The output uses the Prometheus text-based exposition format.
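As a rough illustration of reading the endpoint from a script rather than with cURL, the sketch below parses metrics text into a dictionary. It is a simplified parser, not a full implementation of the format: it assumes label values contain no spaces and it ignores optional timestamps. The `fetch_metrics` helper and the URL `http://127.0.0.1:2379/metrics` are assumptions for a local member listening on its default client address; adjust them for a real deployment.

```python
from urllib.request import urlopen


def fetch_metrics(url="http://127.0.0.1:2379/metrics"):
    """Download the raw metrics text (the URL is an assumed local default)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Map each sample name (including any {labels}) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Simplification: assumes label values contain no spaces.
        parts = line.split()
        samples[parts[0]] = float(parts[1])
    return samples


# Hand-written sample input, so the example runs without a live member.
sample_text = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
etcd_server_proposals_pending 0
"""
metrics = parse_metrics(sample_text)
print(metrics["etcd_server_has_leader"])
```

Against a running member, `parse_metrics(fetch_metrics())` would return the same kind of dictionary for the live values.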

Follow the Prometheus getting started doc to spin up a Prometheus server to collect etcd metrics.

The naming of metrics follows the suggested Prometheus best practices. A metric name has an etcd or etcd_debugging prefix as its namespace and a subsystem prefix (for example wal and etcdserver).
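The naming scheme can be illustrated with a small helper that splits a full metric name into its parts. This is only a sketch: `split_metric_name` is a hypothetical function, and it assumes the subsystem is a single token without underscores, which holds for the metric names listed on this page.

```python
def split_metric_name(full_name):
    """Split an etcd metric name into (namespace, subsystem, name)."""
    # Check the longer prefix first, because "etcd_debugging_" also
    # starts with "etcd_".
    for namespace in ("etcd_debugging", "etcd"):
        prefix = namespace + "_"
        if full_name.startswith(prefix):
            # Assumption: the subsystem is the first underscore-free token.
            subsystem, _, name = full_name[len(prefix):].partition("_")
            return namespace, subsystem, name
    raise ValueError("not an etcd metric: " + full_name)


print(split_metric_name("etcd_server_has_leader"))
print(split_metric_name("etcd_disk_wal_fsync_duration_seconds"))
```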

etcd namespace metrics

The metrics under the etcd prefix are for monitoring and alerting. They are stable, high-level metrics. Any change to these metrics will be noted in the release notes.

Metrics that are etcd2 related are documented in the v2 metrics guide.


Server

These metrics describe the status of the etcd server. In order to detect outages or problems for troubleshooting, the server metrics of every production etcd cluster should be closely monitored.

All these metrics are prefixed with etcd_server_.

| Name | Description | Type |
|------|-------------|------|
| has_leader | Whether or not a leader exists. 1 means a leader exists, 0 means it does not. | Gauge |
| leader_changes_seen_total | The number of leader changes seen. | Counter |
| proposals_committed_total | The total number of consensus proposals committed. | Gauge |
| proposals_applied_total | The total number of consensus proposals applied. | Gauge |
| proposals_pending | The current number of pending proposals. | Gauge |
| proposals_failed_total | The total number of failed proposals seen. | Counter |

has_leader indicates whether the member has a leader. If a member does not have a leader, it is totally unavailable. If no member in the cluster has a leader, the entire cluster is totally unavailable.
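The distinction between an unavailable member and an unavailable cluster can be sketched as a small classification over the has_leader value scraped from each member. The function name, the status strings, and the member names below are hypothetical and chosen only for illustration.

```python
def cluster_availability(has_leader_by_member):
    """Classify availability from each member's etcd_server_has_leader value."""
    without_leader = sorted(
        member for member, value in has_leader_by_member.items() if value != 1
    )
    if not without_leader:
        return "ok", without_leader
    if len(without_leader) == len(has_leader_by_member):
        # No member reports a leader: the whole cluster is unavailable.
        return "cluster unavailable", without_leader
    # Only some members lack a leader: those members are unavailable.
    return "members unavailable", without_leader


print(cluster_availability({"infra0": 1, "infra1": 0, "infra2": 1}))
```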

leader_changes_seen_total counts the number of leader changes the member has seen since its start. Rapid leadership changes impact the performance of etcd significantly. They also signal that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.
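Because the counter restarts from zero when a member restarts, a script that watches for rapid leader changes has to tolerate resets. The following is a minimal sketch of that idea; the `max_changes` threshold is a hypothetical value, not a recommendation from etcd.

```python
def counter_increase(samples):
    """Total increase of a counter across ordered samples, tolerating resets."""
    increase = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter went backwards, so the member restarted and the
            # counter began again from zero; count the new value as growth.
            increase += current
    return increase


def leader_is_unstable(samples, max_changes=3):
    """Flag a window of leader_changes_seen_total samples (threshold assumed)."""
    return counter_increase(samples) > max_changes


print(counter_increase([2, 2, 3, 5]))    # steady growth within the window
print(counter_increase([4, 6, 1, 2]))    # includes a restart between 6 and 1
```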

proposals_committed_total records the total number of consensus proposals committed. This gauge should increase over time if the cluster is healthy. Several healthy members of an etcd cluster may have different total committed proposals at once. This discrepancy may be due to recovering from peers after starting, lagging behind the leader, or being the leader and therefore having the most commits. It is important to monitor this metric across all the members in the cluster; a consistently large lag between a single member and its leader indicates that member is slow or unhealthy.
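One way to compare members is to measure how far each one trails the highest committed count in the cluster. The sketch below uses the highest value as a stand-in for the leader's, and only flags a member when it lags on every scrape in a window. Function names, member names, and the threshold are hypothetical.

```python
def commit_lag(committed_by_member):
    """Return how far each member trails the highest committed count."""
    highest = max(committed_by_member.values())
    return {
        member: highest - value
        for member, value in committed_by_member.items()
    }


def consistently_lagging(snapshots, member, threshold):
    """True if `member` exceeds `threshold` lag in every scrape snapshot."""
    return all(commit_lag(snapshot)[member] > threshold for snapshot in snapshots)


scrapes = [
    {"infra0": 1000, "infra1": 990, "infra2": 400},
    {"infra0": 2000, "infra1": 1995, "infra2": 900},
]
print(commit_lag(scrapes[0]))
print(consistently_lagging(scrapes, "infra2", threshold=500))
```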

proposals_applied_total records the total number of consensus proposals applied. The etcd server applies every committed proposal asynchronously. The difference between proposals_committed_total and proposals_applied_total should usually be small (within a few thousand even under high load). If the difference between them continues to rise, it indicates that the etcd server is overloaded. This might happen when applying expensive queries like heavy range queries or large txn operations.
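The overload signal described here is a backlog that keeps growing, not a single large value. A minimal sketch of that check, using hypothetical helper names and made-up sample values:

```python
def apply_backlog(committed, applied):
    """Proposals committed but not yet applied on one member."""
    return committed - applied


def backlog_keeps_rising(backlogs):
    """True if each backlog sample is larger than the one before it."""
    return len(backlogs) >= 2 and all(
        later > earlier for earlier, later in zip(backlogs, backlogs[1:])
    )


# (committed, applied) pairs from successive scrapes of one member.
scrapes = [(10_000, 9_990), (20_000, 19_950), (30_000, 29_600)]
backlogs = [apply_backlog(c, a) for c, a in scrapes]
print(backlogs)
print(backlog_keeps_rising(backlogs))
```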

proposals_pending indicates how many proposals are queued to commit. A rising number of pending proposals suggests that client load is high or that the member cannot commit proposals.

Failed proposals (proposals_failed_total) are normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.


Disk

These metrics describe the status of the disk operations.

All these metrics are prefixed with etcd_disk_.

| Name | Description | Type |
|------|-------------|------|
| wal_fsync_duration_seconds | The latency distributions of fsync called by wal. | Histogram |
| backend_commit_duration_seconds | The latency distributions of commit called by backend. | Histogram |

A wal_fsync is called when etcd persists its log entries to disk before applying them.

A backend_commit is called when etcd commits an incremental snapshot of its most recent changes to disk.

High disk operation latencies (wal_fsync_duration_seconds or backend_commit_duration_seconds) often indicate disk issues. They may cause high request latency or make the cluster unstable.
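Both disk metrics are histograms, so judging whether latency is "high" usually means estimating a quantile from the cumulative buckets. The sketch below does this with linear interpolation inside the matching bucket, in the spirit of PromQL's histogram_quantile function. The bucket bounds and counts in the example are made up for illustration and are not etcd's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket,
    mirroring the `le` buckets Prometheus exposes for a histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Nothing to interpolate towards: fall back to the lower edge.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative buckets for a latency histogram, in seconds.
fsync_buckets = [
    (0.001, 0), (0.002, 50), (0.004, 90), (0.008, 99), (0.016, 100),
    (math.inf, 100),
]
p99 = histogram_quantile(0.99, fsync_buckets)
print(f"estimated p99 latency: {p99:.3f}s")
```

The estimate is only as precise as the bucket boundaries allow, so it is better suited to trend watching and alerting than to exact measurement.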


Network

These metrics describe the status of the network.

All these metrics are prefixed with etcd_network_.

| Name | Description | Type |
|------|-------------|------|
| peer_sent_bytes_total | The total number of bytes sent to the peer with ID To. | Counter(To) |
| peer_received_bytes_total | The total number of bytes received from the peer with ID From. | Counter(From) |
| peer_sent_failures_total | The total number of send failures to the peer with ID To. | Counter(To) |
| peer_received_failures_total | The total number of receive failures from the peer with ID From. | Counter(From) |
| peer_round_trip_time_seconds | Round-trip time histogram between peers. | Histogram(To) |
| client_grpc_sent_bytes_total | The total number of bytes sent to gRPC clients. | Counter |
| client_grpc_received_bytes_total | The total number of bytes received from gRPC clients. | Counter |

peer_sent_bytes_total counts the total number of bytes sent to a specific peer. Usually the leader member sends more data than other members since it is responsible for transmitting replicated data.

peer_received_bytes_total counts the total number of bytes received from a specific peer. Usually follower members receive data only from the leader member.
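The per-peer counters carry the peer ID as a label, so comparing traffic between peers means grouping samples by that label. A minimal sketch, assuming `To` is the only label on the sample line and using hypothetical peer IDs:

```python
import re

# Matches lines such as: etcd_network_peer_sent_bytes_total{To="peer-a"} 1024
# Assumption: "To" is the only label present on the sample.
SENT_BYTES_RE = re.compile(
    r'^etcd_network_peer_sent_bytes_total\{To="([^"]+)"\}\s+(\S+)$'
)


def sent_bytes_by_peer(metrics_text):
    """Collect peer_sent_bytes_total values keyed by the To label."""
    totals = {}
    for line in metrics_text.splitlines():
        match = SENT_BYTES_RE.match(line.strip())
        if match:
            peer_id, value = match.groups()
            totals[peer_id] = float(value)
    return totals


sample_text = """\
etcd_network_peer_sent_bytes_total{To="peer-a"} 1024
etcd_network_peer_sent_bytes_total{To="peer-b"} 2.5e+03
etcd_server_has_leader 1
"""
print(sent_bytes_by_peer(sample_text))
```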

gRPC requests

These metrics are exposed via go-grpc-prometheus.

etcd_debugging namespace metrics

The metrics under the etcd_debugging prefix are for debugging. They are very implementation dependent and volatile. They might be changed or removed without any warning in new etcd releases. Some of the metrics might be moved to the etcd prefix when they become more stable.


Snapshot

| Name | Description | Type |
|------|-------------|------|
| snapshot_save_total_duration_seconds | The total latency distributions of save called by snapshot. | Histogram |

Abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable.

Prometheus supplied metrics

The Prometheus client library provides a number of metrics under the go and process namespaces. There are a few that are particularly interesting.

| Name | Description | Type |
|------|-------------|------|
| process_open_fds | Number of open file descriptors. | Gauge |
| process_max_fds | Maximum number of open file descriptors. | Gauge |

Heavy file descriptor (process_open_fds) usage (i.e., near the process’s file descriptor limit, process_max_fds) indicates a potential file descriptor exhaustion issue. If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.
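How close a member is to its limit can be expressed as the ratio of the two gauges. The sketch below is illustrative only; the `warn_ratio` default of 0.8 is a hypothetical threshold, not a value taken from etcd or Prometheus.

```python
def fd_usage_ratio(open_fds, max_fds):
    """Fraction of the file descriptor limit currently in use."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return open_fds / max_fds


def fd_exhaustion_risk(open_fds, max_fds, warn_ratio=0.8):
    """True when usage reaches the (assumed) warning ratio."""
    return fd_usage_ratio(open_fds, max_fds) >= warn_ratio


print(fd_usage_ratio(512, 1024))       # 0.5
print(fd_exhaustion_risk(900, 1024))   # True
```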