Paper Title

Efficient Specialized Spreadsheet Parsing for Data Science

Authors

Felix Henze, Haralampos Gavriilidis, Eleni Tzirita Zacharatou, Volker Markl

Abstract

Spreadsheets are widely used for data exploration. Since spreadsheet systems have limited capabilities, users often need to load spreadsheets into other data science environments to perform advanced analytics. However, current approaches for spreadsheet loading suffer from either high runtime or high memory usage, which hinders data exploration on commodity systems. To make spreadsheet loading practical on commodity systems, we introduce a novel parser that minimizes memory usage by tightly coupling decompression and parsing. Furthermore, to reduce the runtime, we introduce optimized spreadsheet-specific parsing routines and employ parallelism. To evaluate our approach, we implement a prototype for loading Excel spreadsheets into R environments. Our evaluation shows that our novel approach is up to 3x faster while consuming up to 40x less memory than state-of-the-art approaches.
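
The idea of coupling decompression with parsing can be illustrated with a minimal Python sketch. It is not the authors' parser (their prototype targets R and uses optimized, parallel routines); it only shows the general principle of feeding decompressed chunks directly to an incremental XML parser, so the full decompressed worksheet never has to sit in memory. The helper names `make_demo_workbook` and `stream_rows`, the chunk size, and the simplified worksheet layout (numeric cells only, no shared strings) are assumptions made for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by worksheet parts in Office Open XML (.xlsx) files.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def make_demo_workbook(rows):
    """Hypothetical helper: build a tiny in-memory zip that mimics the
    worksheet part of an .xlsx file (numeric cells only)."""
    xml_rows = []
    for r, row in enumerate(rows, start=1):
        cells = "".join(
            f'<c r="{chr(65 + c)}{r}"><v>{v}</v></c>' for c, v in enumerate(row)
        )
        xml_rows.append(f'<row r="{r}">{cells}</row>')
    xml = (
        '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
        f'<sheetData>{"".join(xml_rows)}</sheetData></worksheet>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/worksheets/sheet1.xml", xml)
    buf.seek(0)
    return buf


def stream_rows(xlsx_file, member="xl/worksheets/sheet1.xml", chunk_size=64):
    """Yield one list of floats per row while decompressing chunk by chunk."""
    parser = ET.XMLPullParser(events=("end",))
    with zipfile.ZipFile(xlsx_file) as zf, zf.open(member) as stream:
        while True:
            # Each read inflates only a small piece of the compressed member.
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)
            for _event, elem in parser.read_events():
                if elem.tag == NS + "row":
                    yield [float(c.findtext(NS + "v")) for c in elem.findall(NS + "c")]
                    # Drop the row's cells once consumed to limit memory growth.
                    elem.clear()


if __name__ == "__main__":
    for row in stream_rows(make_demo_workbook([[1, 2, 3], [4, 5, 6]])):
        print(row)
```

A generic XML parser like the one used here still builds element objects for every cell, so this sketch shows only the streaming aspect; the spreadsheet-specific parsing routines and parallelism mentioned in the abstract are outside its scope.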