
Troubleshooting etcd: `etcdserver apply entries took too long`

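If you check the etcd logs regularly, you will often see entries like the following when etcd is under heavy load or is running on under-provisioned nodes: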

08:52:05.164847 W | etcdserver: apply entries took too long [140.696147ms for 1 entries]  
08:52:05.164886 W | etcdserver: avoid queries with large range/delete range!  
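
To track how often and how badly applies are slow, it can help to pull the duration out of these warnings. The following is a minimal sketch that assumes the exact log format shown above and a single-unit duration (`µs`, `us`, `ms`, or `s`); other etcd versions may log in a different format, and the name `parse_apply_warning` is only illustrative.

```python
import re

# Matches the warning shown above, e.g.
# "... apply entries took too long [140.696147ms for 1 entries]"
APPLY_WARNING = re.compile(
    r"apply entries took too long "
    r"\[(?P<value>[0-9.]+)(?P<unit>µs|us|ms|s) for (?P<count>\d+) entries\]"
)

# Conversion factors to milliseconds for the units this sketch assumes.
TO_MS = {"µs": 0.001, "us": 0.001, "ms": 1.0, "s": 1000.0}


def parse_apply_warning(line):
    """Return (duration_ms, entry_count) for an apply warning, else None."""
    match = APPLY_WARNING.search(line)
    if match is None:
        return None
    duration_ms = float(match.group("value")) * TO_MS[match.group("unit")]
    return duration_ms, int(match.group("count"))


sample = ("08:52:05.164847 W | etcdserver: apply entries took too long "
          "[140.696147ms for 1 entries]")
print(parse_apply_warning(sample))
```

Lines that are not apply warnings, such as the second log line above, simply return `None`, so the function can be applied to a whole log file and the results filtered.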

This is a warning-level log entry. The official etcd documentation explains it as follows:

After a majority of etcd members agree to commit a request, each etcd server applies the request to its data store and persists the result to disk. Even with a slow mechanical disk or a virtualized network disk, such as Amazon’s EBS or Google’s PD, applying a request should normally take fewer than 50 milliseconds. If the average apply duration exceeds 100 milliseconds, etcd will warn that entries are taking too long to apply.

Usually this issue is caused by a slow disk. The disk could be experiencing contention among etcd and other applications, or the disk is too simply slow (e.g., a shared virtualized disk). To rule out a slow disk from causing this warning, monitor backend_commit_duration_seconds (p99 duration should be less than 25ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using faster disk will typically solve the problem.

The second most common cause is CPU starvation. If monitoring of the machine’s CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Moving etcd to dedicated machine, increasing process resource isolation cgroups, or renicing the etcd server process into a higher priority can usually solve the problem.

Expensive user requests which access too many keys (e.g., fetching the entire keyspace) can also cause long apply latencies. Accessing fewer than a several hundred keys per request, however, should always be performant.

If none of the above suggestions clear the warnings, please open an issue with detailed logging, monitoring, metrics and optionally workload information.

Based on that explanation, the key points are roughly as follows:

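Once a majority of members agree to commit a write request, every etcd member has to apply the request and persist the data to its coreos/bbolt backend. This should normally take less than 50 ms. If the average apply duration exceeds 100 ms, etcd prints this warning.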

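The usual cause is a slow disk: either disk contention with other applications, or a poor device such as a shared virtualized block disk. The backend_commit_duration_seconds metric that etcd exposes to Prometheus shows how long backend commits take; if its p99 stays below 25 ms, the disk can be considered fast enough. If the disk really is slow, giving etcd a dedicated disk or switching to SSD usually solves the problem.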
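
The p99 check can be sketched from the histogram buckets that etcd exposes (the full metric name is `etcd_disk_backend_commit_duration_seconds`, with cumulative `le` buckets). The code below is only an illustration: it estimates a quantile by linear interpolation inside the bucket that contains the target rank, which is the same idea Prometheus uses in `histogram_quantile`, and the bucket counts are made up for the example.

```python
def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with float("inf") as the last upper bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if bound == float("inf") or in_bucket == 0:
                # Cannot interpolate; fall back to the lower bound.
                return prev_bound
            # Linear interpolation inside the bucket holding the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up sample: cumulative counts per bucket upper bound, in seconds.
sample_buckets = [
    (0.001, 0), (0.002, 10), (0.004, 50), (0.008, 90),
    (0.016, 98), (0.032, 100), (float("inf"), 100),
]
p99 = estimate_quantile(0.99, sample_buckets)
disk_ok = p99 is not None and p99 < 0.025  # the 25 ms guideline from the text
print(p99, disk_ok)
```

With these sample counts the estimate lands at roughly 24 ms, just under the 25 ms guideline. In practice the same check is usually done in the monitoring system itself rather than in a script.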

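The second most common cause is insufficient CPU capacity. If monitoring shows that CPU utilization really is high, move etcd to a dedicated machine, use cgroups to strengthen resource isolation for the etcd process, or raise the priority of the etcd process.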

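Expensive requests that touch too many keys, such as someone fetching the entire keyspace, can also cause long apply latencies and slow down writes.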
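
One common way to avoid a single huge range request is to page through the keyspace in small batches. The sketch below shows only the paging logic: `make_fake_store` and `range_page` are hypothetical in-memory stand-ins, not an etcd client API, and the page size of 100 is an arbitrary choice for the example. With a real client, the equivalent would be a range request with a limit, restarted from just after the last key returned.

```python
from bisect import bisect_left


def make_fake_store(keys):
    """Hypothetical in-memory stand-in for a store with sorted byte keys."""
    ordered = sorted(keys)

    def range_page(start_key, limit):
        # Return up to `limit` keys that are >= start_key, in key order.
        i = bisect_left(ordered, start_key)
        return ordered[i:i + limit]

    return range_page


def iter_keys(range_page, page_size=100):
    """Yield every key using many small range requests instead of one large one."""
    start_key = b""
    while True:
        page = range_page(start_key, page_size)
        yield from page
        if len(page) < page_size:
            return  # a short page means the end of the keyspace was reached
        # Smallest key strictly greater than the last key returned.
        start_key = page[-1] + b"\x00"


range_page = make_fake_store([b"k%03d" % i for i in range(250)])
all_keys = list(iter_keys(range_page, page_size=100))
print(len(all_keys))
```

Each request here touches at most 100 keys, which stays within the "several hundred keys per request" range that the documentation describes as performant.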
