Paper Title

High-Level ETL for Semantic Data Warehouses -- Full Version

Authors

Rudra Pratap Deb Nath, Oscar Romero, Torben Bach Pedersen, Katja Hose

Abstract

The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements for Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. Traditional Extract-Transform-Load (ETL) tools do not support incorporating semantic data into a Data Warehouse (DW) because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Unlike other ETL tools, we automate the ETL data flows by creating metadata at the schema level, which relieves ETL developers of the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we use it to create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset, and we compare it with the previous programmable framework SETLPROG in terms of productivity, development time, and performance. The evaluation shows that 1) SETLCONSTRUCT requires 92% fewer typed characters (Number of Typed Characters, NOTC) than SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT that generates the ETL execution flow automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) with SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% with SETLAUTO; 3) SETLCONSTRUCT is scalable and performs similarly to SETLPROG.
