
Interspeech Paper Interpretation | Multi-Task Multi-Network Joint-Learning of Deep Residual Networks and Cycle-Consistency Generative Adversarial Networks for Robust Speech Recognition

INTERSPEECH 2019, the 20th Annual Conference of the International Speech Communication Association, will be held in Graz, Austria, from September 15 to 19, 2019. Interspeech is the world's largest and most comprehensive top-tier conference in the speech field; nearly 2,000 front-line practitioners from industry and academia will take part in keynote speeches, tutorials, paper sessions, and the main exhibition. Alibaba has eight papers accepted this year. This article covers the paper "Multi-Task Multi-Network Joint-Learning of Deep Residual Networks and Cycle-Consistency Generative Adversarial Networks for Robust Speech Recognition" by Shengkui Zhao, Chongjia Ni, Rong Tong, and Bin Ma.

Download the paper

Paper Interpretation

Automatic speech recognition (ASR) systems have a wide range of real-world applications, but ambient noise and reverberation often make their output erroneous and unstable. Improving the robustness of ASR systems is therefore a key issue for their wider adoption. To address it, speech enhancement front-ends and model-adaptation training have been studied for a long time. Recently, multi-task joint-learning schemes that train noise reduction and speech recognition simultaneously in a unified modeling framework have shown encouraging progress, but model training still depends heavily on paired clean and noisy data. To overcome this limitation, researchers have introduced generative adversarial networks (GANs) and adversarial training into acoustic-model training; since no complex front-end design or paired training data is required, this greatly simplifies the training process and its requirements. Although GANs have developed rapidly in computer vision, so far only regular GANs have been introduced for robust ASR, with limited model-training experiments, and regular GANs suffer from mode collapse, which frequently causes training to fail.
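
For intuition, such a unified framework is usually trained with a weighted combination of the two criteria. As a rough illustrative sketch (our notation, not taken from the paper), the joint objective can be written as

$$\mathcal{L}_{\text{joint}} = \mathcal{L}_{\text{ASR}} + \lambda\,\mathcal{L}_{\text{enh}},$$

where $\mathcal{L}_{\text{ASR}}$ is the recognition loss (e.g., cross-entropy over acoustic-model targets), $\mathcal{L}_{\text{enh}}$ is the noise-reduction loss on the enhanced features, and $\lambda$ is a weighting hyperparameter.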

In this work, we adopt the more advanced cycle-consistency generative adversarial network (CycleGAN) to address the training failures caused by mode collapse in regular GANs. In addition, combining it with the recently popular deep residual networks (ResNets), we further extend the multi-task learning scheme into a multi-task multi-network joint-learning scheme, achieving stronger noise reduction and model-adaptation training. A sketch of the CycleGAN objective this builds on follows below.
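
The following is a minimal sketch of CycleGAN-style losses for mapping between noisy and clean feature domains, assuming generators G (noisy-to-clean) and F_gen (clean-to-noisy) with discriminators D_clean and D_noisy; all names, the least-squares adversarial form, and the default weight are illustrative assumptions, not the paper's released code:

```python
# Illustrative sketch only: a CycleGAN-style generator loss for mapping
# between noisy and clean feature domains. G, F_gen, D_clean, D_noisy,
# and lambda_cyc are hypothetical names, not from the paper's code.
import torch
import torch.nn.functional as nnF

def cyclegan_generator_loss(G, F_gen, D_clean, D_noisy,
                            noisy_feats, clean_feats, lambda_cyc=10.0):
    # Forward cycle: noisy -> pseudo-clean -> reconstructed noisy
    fake_clean = G(noisy_feats)
    rec_noisy = F_gen(fake_clean)
    # Backward cycle: clean -> pseudo-noisy -> reconstructed clean
    fake_noisy = F_gen(clean_feats)
    rec_clean = G(fake_noisy)

    # Adversarial terms (least-squares GAN form): push generated
    # features toward the target domain distribution.
    adv_g = torch.mean((D_clean(fake_clean) - 1.0) ** 2)
    adv_f = torch.mean((D_noisy(fake_noisy) - 1.0) ** 2)

    # Cycle-consistency (L1) terms: the two generators must remain
    # approximate inverses, which rules out the degenerate many-to-one
    # mappings behind mode collapse in regular GANs.
    cyc = (nnF.l1_loss(rec_noisy, noisy_feats)
           + nnF.l1_loss(rec_clean, clean_feats))

    return adv_g + adv_f + lambda_cyc * cyc
```

In the paper's multi-task multi-network scheme, terms like these would be optimized jointly with the ResNet acoustic-model objective rather than in isolation; the exact weighting and network details are given in the paper.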


Experimental results on monaural ASR with CHiME-4 show that, compared with state-of-the-art joint-learning approaches, our proposed method significantly improves the noise robustness of the ASR system, achieving much lower word error rates (WERs).


基于循環一緻性對抗性生成網絡,我們提出的多任務多網絡聯合學習方案較好的解決了模式崩潰問題。


Abstract

Robustness of automatic speech recognition (ASR) systems is a critical issue due to noise and reverberations. Speech enhancement and model adaptation have been studied for a long time to address this issue. Recently, the developments of multi-task joint-learning schemes that address noise reduction and ASR criteria in a unified modeling framework show promising improvements, but the model training highly relies on paired clean-noisy data. To overcome this limit, the generative adversarial networks (GANs) and the adversarial training method are deployed, which have greatly simplified the model training process without the requirements of complex front-end design and paired training data. Despite the fast developments of GANs for computer vision, only regular GANs have been adopted for robust ASR. In this work, we adopt a more advanced cycle-consistency GAN (CycleGAN) to address the training failure problem due to mode collapse of regular GANs. Using deep residual networks (ResNets), we further expand the multi-task scheme to a multi-task multi-network joint-learning scheme for more robust noise reduction and model adaptation. Experiment results on CHiME-4 show that our proposed approach significantly improves the noise robustness of the ASR system by achieving much lower word error rates (WERs) than the state-of-the-art joint-learning approaches.

Index Terms: Robust speech recognition, convolutional neural networks, acoustic model, generative adversarial networks

Compiled by the Alibaba Cloud Developer Community
