Paper Title

A Method for Curation of Web-Scraped Face Image Datasets

Authors

Kai Zhang, Vítor Albiero, Kevin W. Bowyer

Abstract

Web-scraped, in-the-wild datasets have become the norm in face recognition research. The numbers of subjects and images in web-scraped datasets are usually very large, with the number of images in the millions. A variety of issues arise when collecting a dataset in the wild, including images with the wrong identity label, duplicate images, duplicate subjects, and variation in image quality. With images numbering in the millions, a manual cleaning procedure is not feasible. However, the fully automated methods used to date yield datasets that are less clean than desired. We propose a semi-automated method whose goal is a clean dataset for testing face recognition methods, with similar quality across men and women, to support comparison of accuracy across gender. Our approach removes near-duplicate images, merges duplicate subjects, corrects mislabeled images, and removes images outside a defined range of pose and quality. We conduct the curation on the Asian Face Dataset (AFD) and the VGGFace2 test dataset. The experiments show that a state-of-the-art method achieves much higher accuracy on the datasets after they are curated. Finally, we release our cleaned versions of both datasets to the research community.
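
As an illustration of the near-duplicate removal step named in the abstract, the following is a minimal sketch only. The abstract does not say how near-duplicates are detected, so the use of cosine similarity over per-image feature vectors, the function names, and the 0.95 threshold are all assumptions made for this example and are not the authors' procedure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def drop_near_duplicates(features, threshold=0.95):
    """Greedily keep images that are not too similar to any image already kept.

    features:  dict mapping image name -> feature vector (hypothetical input;
               in practice these might come from a face matcher).
    threshold: assumed similarity cutoff, not a value from the paper.
    Returns the list of kept image names, in input order.
    """
    kept = []
    for name, vec in features.items():
        if all(cosine_similarity(vec, features[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy example: b.jpg is almost identical to a.jpg, c.jpg is different.
one_subject = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.999, 0.01],
    "c.jpg": [0.0, 1.0],
}
kept_images = drop_near_duplicates(one_subject)
```

In a semi-automated workflow of the kind the abstract describes, a cutoff like this would presumably flag candidates for a human to confirm instead of deleting them outright.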