Normalizing Text using Language Modelling based on Phonetics and String Similarity

Paper title

Normalizing Text using Language Modelling based on Phonetics and String Similarity

Authors

Fenil Doshi, Jimit Gandhi, Deep Gosalia, Sudhir Bagul

Abstract

Social media networks and chat platforms often use an informal version of natural text. Adversarial spelling attacks also alter the input text by modifying its characters. Normalizing such text is an essential step for applications like language translation and text-to-speech synthesis, where the models are trained on clean, regular English. We propose a new robust model to perform text normalization. Our system uses the BERT language model to predict the masked words that correspond to the unnormalized words. We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form, using a unique score based on phonetic and string similarity metrics. We use human-centric evaluations in which volunteers were asked to rank the normalized text. Our strategies yield accuracies of 86.7% and 83.2%, which indicates the effectiveness of our system in dealing with text normalization.
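
The abstract does not state the scoring formula, so the following is only a minimal sketch of the general idea of ranking candidate replacements by a combined phonetic and string similarity score. Everything in it is an assumption made for illustration: Soundex as the phonetic code, `difflib.SequenceMatcher` as the string metric, the equal weighting, and the names `soundex`, `similarity_score`, and `pick_replacement`. The candidate list stands in for words that a masked language model such as BERT might propose; no model is called here.

```python
from difflib import SequenceMatcher

# Soundex digit table (assumed phonetic encoding; the paper may use another).
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            encoded += digit
        if ch not in "hw":  # 'h' and 'w' do not separate equal codes
            prev = digit
    return (encoded + "000")[:4]


def similarity_score(noisy, candidate, phonetic_weight=0.5):
    """Weighted mix of phonetic and string similarity (weights are assumed)."""
    string_sim = SequenceMatcher(None, noisy.lower(), candidate.lower()).ratio()
    phonetic_sim = SequenceMatcher(None, soundex(noisy), soundex(candidate)).ratio()
    return phonetic_weight * phonetic_sim + (1 - phonetic_weight) * string_sim


def pick_replacement(noisy, candidates):
    """Choose the candidate most similar to the unnormalized word."""
    return max(candidates, key=lambda c: similarity_score(noisy, c))


# Hypothetical candidates standing in for masked-language-model predictions.
candidates = ["today", "tomorrow", "later"]
print(pick_replacement("tomorow", candidates))
```

In this sketch the language model would supply context-appropriate candidates, and the similarity score would then select the one closest in sound and spelling to the original noisy token.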
