Paper title
A Framework for Fast Polarity Labelling of Massive Data Streams
Paper authors
Abstract
Many existing sentiment analysis techniques are based on supervised learning and therefore depend on the availability of valuable labelled datasets to train their models. When dataset freshness is critical, annotating high-speed unlabelled data streams becomes essential, yet it remains an open problem. In this paper, we propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, such as Twitter tweets or online product reviews. We address the associated implementation challenges and propose a set of techniques that includes both algorithmic improvements and system optimizations. A thorough empirical validation with two real-world workloads demonstrates that PLStream can generate high-quality labels (almost 80% accuracy) in the presence of high-speed continuous unlabelled data streams (almost 16,000 tuples/sec) without any manual effort.