Automatic Identification of Types of Alterations in Historical Manuscripts

Paper title

Automatic Identification of Types of Alterations in Historical Manuscripts

Authors

David Lassner, Anne Baillot, Sergej Dogadov, Klaus-Robert Müller, Shinichi Nakajima

Abstract

Alterations in historical manuscripts such as letters represent a promising field of research. On the one hand, they help us understand how a text was constructed. On the other hand, topics that were considered sensitive at the time of writing gain coherence and context when alterations, especially deletions, are taken into account. The analysis of alterations in manuscripts, however, has traditionally been very tedious work. In this paper, we present a machine learning-based approach to help categorize alterations in documents. In particular, we present a new probabilistic model (Alteration Latent Dirichlet Allocation, hereafter alterLDA) that categorizes content-related alterations. The method was developed through experiments on the digital scholarly edition Berlin Intellectuals, on which alterLDA achieves high performance in recognizing alterations in labelled data. On unlabelled data, applying alterLDA yields new insights into the alteration behavior of authors, editors and other manuscript contributors, as well as into sensitive topics in the correspondence of Berlin intellectuals around 1800. Beyond these findings, we present a general framework for the analysis of text genesis that can be used with other digital resources representing document variants. To that end, we describe in detail the methodological steps required to achieve such results, thereby providing a prime example of a machine learning application in the Digital Humanities.
