Paper Title
Graph integration of structured, semistructured and unstructured data for data journalism
Paper Authors
Paper Abstract
Digital data is a gold mine for modern journalism. However, the datasets that interest journalists are extremely heterogeneous, ranging from highly structured (relational databases) and semi-structured (JSON, XML, HTML) data to graphs (e.g., RDF) and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows, especially for dynamically varying sets of data sources. We describe a complete approach for integrating dynamic sets of heterogeneous datasets into graphs, along the lines described above: the challenges we faced in making such graphs useful and in allowing their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.