Automatic Query Optimization for Retrieving Traffic Tweets

Paper Title

Automatic Query Optimization for Retrieving Traffic Tweets

Authors

Hufbauer, Emory; Khamfroush, Hana

Abstract

Twitter, like many social media and data brokering companies, makes its data available through a search API (application programming interface). In addition to filtering results by date and location, researchers can search for tweets with specific content using a boolean text query, in which AND, OR, and NOT operators select the combinations of phrases that must, or must not, appear in matching tweets. This boolean text search system is not unique to Twitter; it is found in many other contexts, including academic, legal, and medical databases. However, it is stretched to its limits in Twitter's use case because of the relative volume and brevity of tweets. In addition, the semi-automated use of such systems was well studied under the topic of Information Retrieval during the 1980s and 1990s, but the study of such systems has greatly declined since that time. As such, we propose updated methods for automatically selecting and refining complex boolean search queries that can isolate relevant results with greater specificity and completeness. Furthermore, we present preliminary results of using an optimized query to collect a sample of traffic-incident-related tweets, along with the results of manually classifying and analyzing them.
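
To make the boolean query model described in the abstract concrete, the following is a minimal Python sketch of how AND, OR, and NOT can be combined and evaluated against tweet text. It is not the authors' method and does not call the Twitter API. The class names (`Phrase`, `And`, `Or`, `Not`), the helper functions, and the four sample tweets with their labels are all hypothetical and invented for illustration. Reading "specificity and completeness" as precision and recall is likewise an assumption made here.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical query representation: a small tree of boolean operators.
@dataclass(frozen=True)
class Phrase:
    text: str


@dataclass(frozen=True)
class And:
    parts: tuple


@dataclass(frozen=True)
class Or:
    parts: tuple


@dataclass(frozen=True)
class Not:
    part: object


def matches(query, tweet: str) -> bool:
    """Return True if the tweet satisfies the boolean query."""
    if isinstance(query, Phrase):
        # Case-insensitive substring match; a real search API may tokenize instead.
        return query.text.lower() in tweet.lower()
    if isinstance(query, And):
        return all(matches(p, tweet) for p in query.parts)
    if isinstance(query, Or):
        return any(matches(p, tweet) for p in query.parts)
    if isinstance(query, Not):
        return not matches(query.part, tweet)
    raise TypeError(f"unsupported query node: {query!r}")


def precision_recall(query, labeled: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Score a query against manually labeled (tweet, is_relevant) pairs."""
    retrieved = [relevant for tweet, relevant in labeled if matches(query, tweet)]
    hits = sum(retrieved)
    total_relevant = sum(1 for _, relevant in labeled if relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Invented sample data, for illustration only.
LABELED = [
    ("Crash on I-75 north, two lanes blocked", True),
    ("Accident near Main St causing delays", True),
    ("Stock market crash incoming", False),
    ("Great traffic to my blog today", False),
]

# A broad query and a refined one that excludes an off-topic sense of "crash".
BROAD = Or((Phrase("crash"), Phrase("accident")))
REFINED = And((BROAD, Not(Phrase("stock"))))

if __name__ == "__main__":
    print("broad:  ", precision_recall(BROAD, LABELED))
    print("refined:", precision_recall(REFINED, LABELED))
```

On this toy sample, adding the NOT clause removes the off-topic match without losing either relevant tweet. An automated optimizer would search over such refinements using a much larger labeled set.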
