Paper title
An efficient automated data analytics approach to large scale computational comparative linguistics
Paper authors
Paper abstract
This research project aimed to overcome the challenge of analysing relationships between human languages by developing automated comparison techniques that facilitate the grouping of languages and the formation of genealogical relationships between them. The techniques were based on the phonetic representation of certain key words and concepts. Example word sets included numbers 1-10 (curated), a large database of numbers 1-10 and sheep counting numbers 1-10 (other sources), colours (curated) and basic words (curated). To enable comparison within the sets, an edit distance was calculated using the Levenshtein distance metric. The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one string into the other. Several data analytics techniques were used to explore which words exhibit more or less variation, which words are better preserved, and how languages could be grouped based on linguistic distances within sets. These included density evaluation, hierarchical clustering, silhouette, mean, standard deviation and Bhattacharyya coefficient calculations. These techniques led to the development of a workflow, which was later implemented by combining Unix shell scripts, an R package developed for the project, and SWI-Prolog. The workflow proved to be computationally efficient and permitted the fast exploration and analysis of large language sets.
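To make the two less familiar measures concrete, the following is a minimal Python sketch of the Levenshtein distance and the Bhattacharyya coefficient. It is an illustration only, not the project's implementation, which the abstract describes as Unix shell scripts, an R package and SWI Prolog. The function names `levenshtein` and `bhattacharyya_coefficient` are hypothetical, and treating the inputs to the coefficient as two samples of discrete values (for example, integer edit distances) is an assumption.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def bhattacharyya_coefficient(xs, ys) -> float:
    """Overlap of two empirical discrete distributions: sum of sqrt(p_k * q_k).

    Assumption: xs and ys are samples of discrete values such as edit distances.
    Returns 1.0 for identical distributions and 0.0 for disjoint ones.
    """
    p, q = Counter(xs), Counter(ys)
    n_p, n_q = len(xs), len(ys)
    return sum(sqrt((p[k] / n_p) * (q[k] / n_q)) for k in p.keys() & q.keys())


if __name__ == "__main__":
    # Textbook example: kitten -> sitten -> sittin -> sitting
    print(levenshtein("kitten", "sitting"))  # 3
```

In a workflow of the kind described, pairwise distances between the phonetic forms of the same word in different languages would feed the clustering and summary statistics. The coefficient could then compare, for instance, the distance distributions of two words or two word sets.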