Paper Title
CONTINUER: Maintaining Distributed DNN Services During Edge Failures
Paper Authors
Paper Abstract
Partitioning and deploying Deep Neural Networks (DNNs) across edge nodes may be used to meet the performance objectives of applications. However, the failure of a single node may result in cascading failures that adversely impact the delivery of the service and result in failure to meet specific objectives. The impact of these failures needs to be minimised at runtime. Three techniques are explored in this paper, namely repartitioning, early-exit and skip-connection. When an edge node fails, the repartitioning technique repartitions and redeploys the DNN, thus avoiding the failed nodes. The early-exit technique makes provision for a request to exit early, before the failed node. The skip-connection technique dynamically routes the request by skipping the failed nodes. This paper leverages trade-offs in accuracy, end-to-end latency and downtime to select the best technique given user-defined objectives (accuracy, latency and downtime thresholds) when an edge node fails. To this end, CONTINUER is developed. Two key activities of the framework are estimating the accuracy and latency when using the techniques for distributed DNNs and selecting the best technique. It is demonstrated on a lab-based experimental testbed that CONTINUER estimates accuracy and latency when using the techniques with an average error of no more than 0.28% and 13.06%, respectively, and selects a suitable technique with a low overhead of no more than 16.82 milliseconds and an accuracy of up to 99.86%.
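The selection step described in the abstract can be illustrated with a minimal sketch. This is not CONTINUER's actual algorithm, which the abstract does not specify. It assumes that each technique comes with an estimated accuracy, latency and downtime, that a technique is feasible only if it meets all three user-defined thresholds, and that ties among feasible techniques are broken by highest accuracy first. The names `Estimate`, `Objectives` and `select_technique`, as well as all numbers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Estimate:
    """Hypothetical per-technique estimate (field names are assumptions)."""
    technique: str
    accuracy: float      # estimated accuracy, in percent
    latency_ms: float    # estimated end-to-end latency, in milliseconds
    downtime_ms: float   # estimated downtime while switching, in milliseconds


@dataclass(frozen=True)
class Objectives:
    """User-defined thresholds for accuracy, latency and downtime."""
    min_accuracy: float
    max_latency_ms: float
    max_downtime_ms: float


def select_technique(estimates: List[Estimate],
                     objectives: Objectives) -> Optional[Estimate]:
    """Return a technique that meets every threshold, or None if none does."""
    feasible = [
        e for e in estimates
        if e.accuracy >= objectives.min_accuracy
        and e.latency_ms <= objectives.max_latency_ms
        and e.downtime_ms <= objectives.max_downtime_ms
    ]
    if not feasible:
        return None
    # Assumed preference order: higher accuracy, then lower latency,
    # then lower downtime. The paper may rank candidates differently.
    return max(feasible,
               key=lambda e: (e.accuracy, -e.latency_ms, -e.downtime_ms))


# Made-up estimates for the three techniques after a node failure.
estimates = [
    Estimate("repartitioning", accuracy=92.0, latency_ms=40.0, downtime_ms=900.0),
    Estimate("early-exit", accuracy=85.0, latency_ms=25.0, downtime_ms=5.0),
    Estimate("skip-connection", accuracy=89.0, latency_ms=30.0, downtime_ms=5.0),
]
objectives = Objectives(min_accuracy=88.0, max_latency_ms=50.0, max_downtime_ms=100.0)

chosen = select_technique(estimates, objectives)
print(chosen.technique if chosen else "no technique meets the objectives")
```

With these illustrative numbers, repartitioning exceeds the downtime threshold and early-exit falls below the accuracy threshold, so the sketch picks skip-connection. Returning `None` when no technique is feasible leaves the fallback policy to the caller, which is a design choice of this sketch and not something the abstract states.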