A Method for Curation of Web-Scraped Face Image Datasets

Paper Title

A Method for Curation of Web-Scraped Face Image Datasets

Authors

Zhang, Kai; Albiero, Vítor; Bowyer, Kevin W.

Abstract

Web-scraped, in-the-wild datasets have become the norm in face recognition research. The numbers of subjects and images in web-scraped datasets are usually very large, with the number of images in the millions. A variety of issues arise when collecting a dataset in the wild, including images with the wrong identity label, duplicate images, duplicate subjects, and variation in image quality. With images numbering in the millions, a manual cleaning procedure is not feasible, but the fully automated methods used to date leave datasets less clean than desired. We propose a semi-automated method whose goal is a clean dataset for testing face recognition methods, with similar quality across men and women, to support comparison of accuracy across gender. Our approach removes near-duplicate images, merges duplicate subjects, corrects mislabeled images, and removes images outside a defined range of pose and quality. We conduct the curation on the Asian Face Dataset (AFD) and the VGGFace2 test dataset. The experiments show that a state-of-the-art method achieves much higher accuracy on the datasets after they are curated. Finally, we release our cleaned versions of both datasets to the research community.
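
The abstract names near-duplicate removal as one curation step but does not say how near-duplicates are detected. As a rough illustration only, the sketch below assumes each image has already been mapped to a face embedding vector and treats two images as near-duplicates when their cosine similarity reaches a threshold. The function names, the greedy keep-first strategy, and the 0.95 threshold are all hypothetical choices for this example, not details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_near_duplicates(embeddings, threshold=0.95):
    """Greedy filter: keep an image only if it is not too similar to any kept image.

    `embeddings` is a list of vectors, one per image of a single subject.
    `threshold` is an assumed value for illustration; a real pipeline would tune it.
    Returns the indices of the images that are kept, in their original order.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy 2-D "embeddings": the second vector is almost identical to the first.
    toy = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
    print(remove_near_duplicates(toy))
```

The greedy pass compares each image only against images already kept, so the result depends on input order; the first image in each near-duplicate group is the one that survives. In a semi-automated workflow like the one the abstract describes, borderline pairs near the threshold would be natural candidates for manual review.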
