Paper title
Extracting Cultural Commonsense Knowledge at Scale
Paper authors
Abstract
Structured knowledge is important for many AI applications. Commonsense knowledge, which is crucial for robust human-centric AI, is covered by a small number of structured knowledge projects. However, they lack knowledge about human traits and behaviors conditioned on socio-cultural contexts, which is crucial for situative AI. This paper presents CANDLE, an end-to-end methodology for extracting high-quality cultural commonsense knowledge (CCSK) at scale. CANDLE extracts CCSK assertions from a huge web corpus and organizes them into coherent clusters, for 3 domains of subjects (geography, religion, occupation) and several cultural facets (food, drinks, clothing, traditions, rituals, behaviors). CANDLE includes judicious techniques for classification-based filtering and scoring of interestingness. Experimental evaluations show the superiority of the CANDLE CCSK collection over prior works, and an extrinsic use case demonstrates the benefits of CCSK for the GPT-3 language model. Code and data can be accessed at https://candle.mpi-inf.mpg.de/.