Paper Title

Experiments on Paraphrase Identification Using Quora Question Pairs Dataset

Authors

Chandra, Andreas; Stefanus, Ruben

Abstract

We model the Quora Question Pairs dataset to identify similar questions. The dataset is provided by Quora, and the task is binary classification. We tried several methods and algorithms, taking approaches that differ from previous work. For feature extraction, we used Bag of Words representations, namely Count Vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams, as input to XGBoost and CatBoost. We also experimented with a WordPiece tokenizer, which improved model performance significantly. We achieved up to 97 percent accuracy. The code and dataset are available.
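As a rough illustration of the unigram TF-IDF features mentioned in the abstract, the following is a minimal standard-library sketch. It is not the authors' pipeline: the toy question pairs, the helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`), the `log(N / df) + 1` weighting, and the use of cosine similarity as a pair feature are all assumptions made for illustration. In the paper's setup, features of this kind would be passed to a classifier such as XGBoost or CatBoost rather than compared directly.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a simple stand-in for a real tokenizer."""
    return text.lower().replace("?", " ").split()


def fit_idf(documents):
    """Compute unigram IDF weights as log(N / df) + 1 (an assumed weighting)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map a text to a sparse {term: tf * idf} dict, ignoring unseen terms."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy pairs, not taken from the Quora dataset.
pairs = [
    ("How do I learn Python?", "How can I learn Python?"),
    ("How do I learn Python?", "What is the capital of France?"),
]
corpus = [question for pair in pairs for question in pair]
idf = fit_idf(corpus)
similarities = [
    cosine(tfidf_vector(a, idf), tfidf_vector(b, idf)) for a, b in pairs
]
```

Here the first pair shares most of its unigrams and so scores higher than the second pair, which shares none. A single similarity score is only one possible pair feature; concatenating the two TF-IDF vectors is another common choice.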
