Paper title
The xmuspeech system for multi-channel multi-party meeting transcription challenge
Paper authors
Paper abstract
This paper describes the system developed by the XMUSPEECH team for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). For the speaker diarization task, we propose a multi-channel speaker diarization system that obtains speakers' spatial information through Direction of Arrival (DOA) technology. A speaker-spatial embedding is formed from an x-vector and an s-vector derived from Filter-and-Sum Beamforming (FSB), which makes the embedding more robust. We also propose a novel multi-channel sequence-to-sequence neural network architecture named Discriminative Multi-stream Neural Network (DMSNet), which consists of an Attention Filter-and-Sum Block (AFSB) and a Conformer encoder. We use DMSNet for overlapped speech detection (OSD) on multi-channel audio. Compared with an LSTM-based OSD module, it achieves a decrease of 10.1% in Detection Error Rate (DetER). With the DMSNet-based OSD module, the diarization error rate (DER) of the cluster-based diarization system decreases significantly, from 13.44% to 7.63%. Our best fusion system achieves a DER of 7.09% on the evaluation set and 9.80% on the test set.
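To put the reported figures in context, the sketch below computes a diarization error rate and the relative size of the 13.44% to 7.63% improvement. The abstract does not define DER, so this assumes the commonly used definition: missed speech, false alarm and speaker confusion time, summed and divided by total reference speech time. The challenge's exact scoring rules (collars, overlap handling) are not covered. The function names and the example durations are hypothetical.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction: summed error durations over total reference speech.

    Assumes the common definition of DER; all arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(before, after):
    """Return the reduction from `before` to `after` as a fraction of `before`."""
    return (before - after) / before


# Hypothetical durations, chosen only to illustrate the formula.
der = diarization_error_rate(
    missed=30.0, false_alarm=12.0, confusion=18.0, total_speech=600.0
)
print(f"DER = {der:.2%}")

# DER values quoted in the abstract, before and after adding the
# DMSNet-based OSD module to the cluster-based system.
drop = relative_reduction(13.44, 7.63)
print(f"absolute drop = {13.44 - 7.63:.2f} points, relative drop = {drop:.1%}")
```

Under these assumptions, the change from 13.44% to 7.63% is an absolute drop of 5.81 points, or roughly a 43% relative reduction.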