Paper Title
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code
Paper Authors
Paper Abstract
People perceive the world with multiple senses (e.g., by hearing sounds, reading words and seeing objects). However, most existing AI systems process only a single modality. This paper presents an approach that excels at handling multiple modalities of information with a single model. In our SkillNet model, different parts of the parameters are specialized for processing different modalities. Unlike traditional dense models, which always activate all model parameters, our model sparsely activates only those parts of the parameters whose skills are relevant to the task. This design enables SkillNet to learn skills in a more interpretable way. We develop the model for five modalities: text, image, sound, video and code. Results show that SkillNet performs comparably to five modality-specific fine-tuned models. Moreover, our model supports self-supervised pretraining in the same sparsely activated manner, yielding better initialized parameters for the different modalities. We find that pretraining significantly improves the performance of SkillNet on all five modalities, on par with or even better than baselines with modality-specific pretraining. On the task of Chinese text-to-image retrieval, our final system achieves higher accuracy than existing leading systems, including Wukong ViT-B and Wenlan 2.0, while using fewer activated parameters.
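To make the sparse-activation idea concrete, the sketch below shows one way a modality-conditional layer could work: each feed-forward block holds one expert per modality, and a forward pass runs only the expert matching the input's modality, so the activated parameter count stays that of a single expert. This is a minimal illustration, not the authors' implementation; the class name, layer sizes, and hard-routing scheme are assumptions for exposition.

```python
# Minimal sketch (NOT the paper's code) of modality-conditional sparse
# activation in the spirit of SkillNet: one expert per modality, and only
# the expert relevant to the current input is activated.
import torch
import torch.nn as nn

MODALITIES = ["text", "image", "sound", "video", "code"]


class SparseModalityFFN(nn.Module):
    """Feed-forward block with one expert per modality. Only the expert
    for the given modality runs; the others stay idle. Sizes are
    illustrative assumptions, not values from the paper."""

    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.experts = nn.ModuleDict({
            m: nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for m in MODALITIES
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Hard routing: parameters of the other experts are never touched,
        # so per-input activated parameters match a single dense expert.
        return self.experts[modality](x)


if __name__ == "__main__":
    layer = SparseModalityFFN()
    tokens = torch.randn(2, 16, 768)      # (batch, sequence, d_model)
    out = layer(tokens, modality="text")  # only the "text" expert runs
    print(out.shape)                      # torch.Size([2, 16, 768])
```

Because routing is keyed on the input modality rather than learned per token, the same activation pattern can be reused unchanged for both pretraining and fine-tuning, which is consistent with the abstract's claim that pretraining uses the same sparsely activated scheme.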