NVMe and PCIe SSD Monitoring in Hyperscale Data Centers

Paper Title

NVMe and PCIe SSD Monitoring in Hyperscale Data Centers

Authors

Nikhil Khatri, Shirshendu Chakrabarti

Abstract

With low latency, high throughput, and enterprise-grade reliability, SSDs have become the de facto choice for storage in the data center. As a result, SSDs are used in all online data stores at LinkedIn. These applications persist and serve critical user data with millisecond latencies. For the hosts serving these applications, SSD faults are the single largest cause of failure. Frequent SSD failures result in significant downtime for critical applications and generate a significant downstream root cause analysis (RCA) load for systems operations teams. A lack of insight into the runtime characteristics of these drives limits the ability to provide accurate RCAs and hinders credible, long-term fixes. In this paper, we describe the system developed at LinkedIn to facilitate real-time monitoring of SSDs and the insights we gained into failure characteristics. We describe how we used that insight to perform predictive maintenance and present the resulting reduction in man-hours spent on maintenance.
