Paper Title

Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation

Paper Authors

Hensel Donato Jahja, Novanto Yudistira, Sutrisno

Paper Abstract

The COVID-19 pandemic has disrupted every level of society. Identifying from an image whether a person is wearing a mask correctly is essential in preventing the spread of COVID-19. Since only 23.1% of people wear masks correctly, Artificial Neural Networks (ANNs) can help classify correct mask usage and thus slow the spread of the COVID-19 virus. However, training an ANN to classify mask usage correctly requires a large dataset. MaskedFace-Net is a suitable dataset consisting of 137,016 digital images with 4 class labels, namely Mask, Mask Chin, Mask Mouth Chin, and Mask Nose Mouth. Mask classification training uses the Vision Transformer (ViT) architecture with a transfer learning method based on weights pre-trained on ImageNet-21k, together with random augmentation. In addition, training hyper-parameters of 20 epochs, a Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.03, a batch size of 64, a Gaussian Error Linear Unit (GELU) activation function, and a Cross-Entropy loss function are applied to the training of three ViT architectures, namely Base-16, Large-16, and Huge-14. Furthermore, comparisons with and without augmentation and transfer learning are conducted. This study found that the best classification is achieved by transfer learning and augmentation with ViT Huge-14. Using this method on the MaskedFace-Net dataset, the research reaches an accuracy of 0.9601 on training data, 0.9412 on validation data, and 0.9534 on test data. This research shows that training the ViT model with data augmentation and transfer learning improves classification of mask usage, performing even better than the convolutional-based Residual Network (ResNet).
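The training setup described in the abstract (SGD with learning rate 0.03, batch size 64, GELU activation, cross-entropy loss, 4-class output) can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the authors' code: the stand-in classifier replaces the actual ViT backbone, which in the paper would be loaded with ImageNet-21k pre-trained weights, and the random tensors stand in for MaskedFace-Net batches.

```python
# Hypothetical sketch of the training configuration from the abstract.
# The tiny model below is a placeholder; the paper fine-tunes ViT
# Base-16 / Large-16 / Huge-14 backbones pre-trained on ImageNet-21k.
import torch
import torch.nn as nn

NUM_CLASSES = 4  # Mask, Mask Chin, Mask Mouth Chin, Mask Nose Mouth

# Stand-in for a ViT backbone with a new 4-way classification head.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 128),
    nn.GELU(),  # Gaussian Error Linear Unit activation, as in ViT
    nn.Linear(128, NUM_CLASSES),
)

# Hyper-parameters stated in the abstract.
EPOCHS, BATCH_SIZE = 20, 64
optimizer = torch.optim.SGD(model.parameters(), lr=0.03)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on random stand-in data.
images = torch.randn(BATCH_SIZE, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (BATCH_SIZE,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"step loss: {loss.item():.4f}")
```

In the actual study, the "random augmentation" step would transform each training image (e.g. random flips and crops) before the forward pass, and this step would run over the full dataset for all 20 epochs.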
