Paper Title

TyDiP: A Dataset for Politeness Classification in Nine Typologically Diverse Languages

Authors

Anirudh Srinivasan, Eunsol Choi

Abstract

We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be culture-specific, yet existing computational linguistic studies are limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples. We evaluate how well multilingual models can identify politeness levels: they show a fairly robust zero-shot transfer ability, yet fall significantly short of estimated human accuracy. We further study mapping the English politeness strategy lexicon into nine languages via automatic translation and lexicon induction, analyzing whether each strategy's impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.
