Paper Title
Fashion-IQ 2020 Challenge 2nd Place Team's Solution
Authors
Abstract
This paper describes the approach that team VAA submitted to the Fashion-IQ challenge at CVPR 2020. Given an image and text pair, we present a novel multimodal composition method, RTIC, that effectively combines the text and image modalities in a semantic space. We extract image and text features with CNNs and sequential models (e.g., LSTM or GRU), respectively. To emphasize the residual between the target and candidate features, RTIC is composed of N blocks with channel-wise attention modules. We then add the encoded residual to the candidate image feature to obtain a synthesized feature. We also explored an ensemble strategy over model variants and achieved a significant performance boost compared to the best single model. Finally, our approach achieved 2nd place in the Fashion-IQ 2020 Challenge with a test score of 48.02 on the leaderboard.
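The composition step summarized in the abstract (encode a residual through stacked blocks with channel-wise attention, then add it to the candidate image feature) can be illustrated with a minimal NumPy sketch. The abstract does not specify the block internals, so everything below is an assumption made for illustration: fusion by concatenation followed by a linear projection, a squeeze-and-excitation style sigmoid gate as the channel-wise attention, skip connections between blocks, and all function and parameter names (`channel_attention`, `compose`, `w_fuse`, and so on). It is not the authors' implementation.

```python
import numpy as np


def channel_attention(x, w_down, w_up):
    """Assumed gating: scale each channel of x by a weight in (0, 1)."""
    hidden = np.maximum(x @ w_down, 0.0)            # bottleneck + ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_up)))   # sigmoid, one weight per channel
    return x * gate


def compose(image_feat, text_feat, blocks, w_fuse):
    """Return the candidate image feature plus an encoded residual."""
    # Assumed fusion: concatenate both modalities, project to the image width.
    residual = np.concatenate([image_feat, text_feat], axis=-1) @ w_fuse
    for w_block, w_down, w_up in blocks:
        # Assumed block: linear + ReLU, channel-wise attention, skip connection.
        h = np.maximum(residual @ w_block, 0.0)
        residual = residual + channel_attention(h, w_down, w_up)
    # The synthesized feature is the candidate feature plus the residual.
    return image_feat + residual


def make_blocks(rng, dim, n_blocks, scale=0.1):
    """Random weights for n_blocks blocks (stand-ins for trained parameters)."""
    return [
        (
            rng.normal(scale=scale, size=(dim, dim)),
            rng.normal(scale=scale, size=(dim, dim // 2)),
            rng.normal(scale=scale, size=(dim // 2, dim)),
        )
        for _ in range(n_blocks)
    ]


rng = np.random.default_rng(0)
batch, dim, text_dim, n_blocks = 4, 8, 6, 3

# Stand-ins for CNN image features and LSTM/GRU text features.
image_feat = rng.normal(size=(batch, dim))
text_feat = rng.normal(size=(batch, text_dim))

w_fuse = rng.normal(scale=0.1, size=(dim + text_dim, dim))
blocks = make_blocks(rng, dim, n_blocks)

composed = compose(image_feat, text_feat, blocks, w_fuse)
```

Because the output is the candidate feature plus a learned offset, a model of this shape only has to encode what the text changes about the image; in the sketch, a zero fusion matrix leaves the candidate feature unchanged.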