«

etcd 故障排查之 `failed to send out heartbeat on time`

查看 etcd 运行日志,如果看到如下日志:

08:52:05.164847 W | etcdserver: failed to send out heartbeat on time  

说明 etcd 发送心跳有问题了,查看官方问答,解释如下:

etcd uses a leader-based consensus protocol for consistent data replication and log execution. Cluster members elect a single leader, all other members become followers. The elected leader must periodically send heartbeats to its followers to maintain its leadership. Followers infer leader failure if no heartbeats are received within an election interval and trigger an election. If a leader doesn’t send its heartbeats in time but is still running, the election is spurious and likely caused by insufficient resources. To catch these soft failures, if the leader skips two heartbeat intervals, etcd will warn it failed to send a heartbeat on time.

Usually this issue is caused by a slow disk. Before the leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention among etcd and other applications, or the disk is too simply slow (e.g., a shared virtualized disk). To rule out a slow disk from causing this warning, monitor walfsyncduration_seconds (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using faster disk will typically solve the problem.

The second most common cause is CPU starvation. If monitoring of the machine’s CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Moving etcd to dedicated machine, increasing process resource isolation with cgroups, or renicing the etcd server process into a higher priority can usually solve the problem.

A slow network can also cause this issue. If network metrics among the etcd machines shows long latencies or high drop rate, there may not be enough network capacity for etcd. Moving etcd members to a less congested network will typically solve the problem. However, if the etcd cluster is deployed across data centers, long latency between members is expected. For such deployments, tune the heartbeat-interval configuration to roughly match the round trip time between the machines, and the election-timeout configuration to be at least 5 * heartbeat-interval. See tuning documentation for detailed information.

If none of the above suggestions clear the warnings, please open an issue with detailed logging, monitoring, metrics and optionally workload information.

网上有文章大致说明如下:

etcd使用了raft算法,leader会定时地给每个follower发送心跳,如果leader连续两个心跳时间没有给follower发送心跳,etcd会打印这个log以给出告警。通常情况下这个issue是disk运行过慢导致的,leader一般会在心跳包里附带一些metadata,leader需要先把这些数据固化到磁盘上,然后才能发送。写磁盘过程可能要与其他应用竞争,或者因为磁盘是一个虚拟的或者是SATA类型的导致运行过慢,此时只有更好更快磁盘硬件才能解决问题。etcd暴露给Prometheus的metrics指标walfsyncduration_seconds就显示了wal日志的平均花费时间,通常这个指标应低于10ms。

第二种原因就是CPU计算能力不足。如果是通过监控系统发现CPU利用率确实很高,就应该把etcd移到更好的机器上,然后通过cgroups保证etcd进程独享某些核的计算能力,或者提高etcd的priority。

第三种原因就可能是网速过慢。如果Prometheus显示是网络服务质量不行,譬如延迟太高或者丢包率过高,那就把etcd移到网络不拥堵的情况下就能解决问题。但是如果etcd是跨机房部署的,长延迟就不可避免了,那就需要根据机房间的RTT调整heartbeat-interval,而参数election-timeout则至少是heartbeat-interval的5倍。

分享