Paper Title
The Collection Virtual Machine: An Abstraction for Multi-Frontend Multi-Backend Data Analysis
Authors
Abstract
Getting the best performance from the ever-increasing number of hardware platforms has been a recurring challenge for data processing systems. In recent years, the advent of data science, with its increasingly numerous and complex types of analytics, has made this challenge even more difficult. In practice, system designers are overwhelmed by the number of combinations and typically implement only one analysis/platform combination, which leads to repeated implementation effort and a plethora of semi-compatible tools for data scientists. In this paper, we propose the "Collection Virtual Machine" (CVM), an extensible compiler framework designed to keep the specialization process of data analytics systems tractable. It captures both the essence of a wide range of low-level, hardware-specific implementation techniques and the high-level operations of different types of analyses. At its core lies a language for defining nested, collection-oriented intermediate representations (IRs). Frontends produce programs in IR flavors defined in that language; these programs are optimized through a series of rewritings (possibly changing the IR flavor multiple times) until the program is finally expressed in an IR of platform-specific operators. While reducing the overall implementation effort, this approach also improves the interoperability of both analyses and hardware platforms. We have used CVM successfully to build specialized backends for platforms as diverse as multi-core CPUs, RDMA clusters, and serverless computing infrastructure in the cloud, and we expect similar results for many more frontends and hardware platforms in the near future.
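The rewriting pipeline described in the abstract can be illustrated with a toy sketch. This is not CVM's actual language or API, which the abstract does not show. Every name here (`Op`, `rewrite`, `fuse_filters`, `lower_to_cpu`, and the `cpu.*` operator names) is a hypothetical stand-in, and nesting of collections is not modelled. The sketch assumes a frontend that emits generic `scan` and `filter` operators, one optimization rule that stays within the frontend flavor, and one lowering rule that switches to a platform-specific flavor.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Op:
    """One operator in a toy collection-oriented IR (hypothetical, not CVM's API)."""
    name: str                       # operator name, e.g. "filter" or "cpu.scan"
    inputs: Tuple["Op", ...] = ()   # upstream operators that produce the input collections
    arg: Optional[str] = None       # opaque parameter such as a table name or predicate


# A rule returns a replacement operator, or None if it does not apply.
Rule = Callable[[Op], Optional[Op]]


def rewrite(op: Op, rules: Sequence[Rule]) -> Op:
    """Rewrite the inputs first, then apply rules at this node until none matches."""
    op = Op(op.name, tuple(rewrite(i, rules) for i in op.inputs), op.arg)
    for rule in rules:
        replacement = rule(op)
        if replacement is not None:
            return rewrite(replacement, rules)
    return op


def fuse_filters(op: Op) -> Optional[Op]:
    """Optimization inside the frontend flavor: merge two stacked filters."""
    if op.name == "filter" and op.inputs and op.inputs[0].name == "filter":
        inner = op.inputs[0]
        return Op("filter", inner.inputs, f"({inner.arg}) and ({op.arg})")
    return None


# Assumed mapping from generic operators to made-up CPU-specific ones.
CPU_LOWERING = {"scan": "cpu.scan", "filter": "cpu.parallel_filter"}


def lower_to_cpu(op: Op) -> Optional[Op]:
    """Flavor change: replace a generic operator with its CPU-specific counterpart."""
    if op.name in CPU_LOWERING:
        return Op(CPU_LOWERING[op.name], op.inputs, op.arg)
    return None


def show(op: Op) -> str:
    """Render an operator tree as a nested call expression."""
    parts = [show(i) for i in op.inputs]
    if op.arg is not None:
        parts.append(repr(op.arg))
    return f"{op.name}({', '.join(parts)})"


# Program as a frontend might emit it: two filters over a table scan.
program = Op("filter", (Op("filter", (Op("scan", (), "t"),), "a > 1"),), "b < 2")

optimized = rewrite(program, [fuse_filters])   # still in the frontend flavor
lowered = rewrite(optimized, [lower_to_cpu])   # now in the platform-specific flavor
print(show(lowered))
```

Running the rules in separate phases mirrors the idea of changing the IR flavor step by step: a different backend could reuse `fuse_filters` unchanged and swap only the final lowering rule.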