Thesis title (Chinese): 基於語者分群之廣播節目語音逐字稿自動轉寫系統效能改善 (Improving the Performance of an Automatic Transcription System for Broadcast Radio Speech Based on Speaker Diarization)
Thesis title (English): A Preliminary Study on Speaker Diarization for Automatic Transcription of Broadcast Radio Speech
University: National Taipei University of Technology (臺北科技大學)
College: College of Electrical Engineering and Computer Science (電資學院)
Department: Department of Electronic Engineering (電子工程系)
Academic year of graduation: 106 (ROC calendar; 2017-2018 academic year)
Semester of graduation: Second semester
Year of publication: 107 (ROC calendar; 2018)
Author (Chinese name): 許吳華
Author (English name): Wu-Hua Hsu
Student ID: 105368118
Degree: Master's
Language of thesis: Chinese
Oral defense date: 2018/07/25
Number of pages: 74
Advisor: 廖元甫
Oral defense committee: 廖元甫; 王逸如; 蔡偉和
Keywords (Chinese): 語者分群 (speaker diarization); 語音辨識系統 (speech recognition system); 信心度評估 (confidence measure)
Keywords (English): Speaker Diarization; Automatic Speech Recognition; Confidence Measure
Abstract (Chinese): The goal of this thesis is to automatically group the speakers in broadcast radio programs and automatically transcribe what each speaker says, so that verbatim transcripts can be generated automatically, radio programs can be organized into audio e-books, and listeners can easily search the text to find the most relevant parts of the spoken content.
We use a Time-delay Neural Network (TDNN) for automatic speaker diarization; it achieves an average DER (diarization error rate) of 27.74%, better than the 31.08% of the GMM system. With the trained speaker diarization system we label the speakers in the NER-210 corpus, which carries no speaker annotations, mark the resulting speaker segments on the timeline, and retrain the speech recognizer. The experiments show that the recognizer retrained with the speaker information produced by the diarization system reduces the average CER (character error rate) from 20.01% to 19.13%, a relative improvement of 4.4%. In addition, an LSTM acoustic model used as the baseline of the speech recognition system yields an average CER of 17.2%, while a multi-layer cascaded CNN-TDNN-LSTM model reduces the average CER to 13.12%, a 23.7% relative improvement over the baseline. Finally, applying Confidence Measure Data Selection and adding more word sequences to the language model reduces the average CER further to 9.2%, a 46.5% relative improvement over the baseline.
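For reference, DER and CER follow their standard definitions, and the relative-improvement figures quoted above can be reproduced with simple arithmetic; the formulas below are a sketch of those standard metrics, not excerpts from the thesis:

    \mathrm{DER} = \frac{T_{\mathrm{false\,alarm}} + T_{\mathrm{missed\,speech}} + T_{\mathrm{speaker\,confusion}}}{T_{\mathrm{total\,scored\,speech}}},
    \qquad
    \mathrm{CER} = \frac{S + D + I}{N}

    \text{relative improvement} = \frac{\mathrm{CER}_{\mathrm{baseline}} - \mathrm{CER}_{\mathrm{new}}}{\mathrm{CER}_{\mathrm{baseline}}}:
    \quad \frac{20.01 - 19.13}{20.01} \approx 4.4\%,
    \quad \frac{17.2 - 13.12}{17.2} \approx 23.7\%,
    \quad \frac{17.2 - 9.2}{17.2} \approx 46.5\%

where S, D, and I are character substitutions, deletions, and insertions, and N is the number of reference characters.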
Abstract (English): We use a Time-delay Neural Network for speaker diarization. The average DER is 27.74%, better than the 31.08% of the GMM system. We use the trained automatic speaker diarization system to label the speakers in the unannotated NER-210 corpus, mark the resulting speaker segments on the timeline, and retrain the ASR system. The experimental results show that, with the speaker information provided by the diarization system, the ASR system reduces the original CER from 20.01% to 19.13%. In addition, the average CER of the baseline LSTM model in the automatic speech recognition system is 17.2%; it can be reduced to 13.12% with the multi-layer cascaded CNN-TDNN-LSTM model. Then, using Confidence Measure data selection and adding more word sequences to the language model to improve the recognition rate, the average CER can be reduced to 9.2%.
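As a rough illustration of the Confidence Measure Data Selection step mentioned in both abstracts, the sketch below keeps only automatically transcribed utterances whose average word-level confidence clears a threshold before they are added to the training data. The input file format, file name, threshold value, and function names are assumptions made for illustration; they are not taken from the thesis.

    # confidence_select.py -- minimal sketch of confidence-based data selection.
    # Assumed input: a plain-text file with one recognized word per line,
    #   <utterance-id> <word> <confidence>
    # (this format, the file name, and the threshold are illustrative only).
    from collections import defaultdict

    THRESHOLD = 0.90  # hypothetical cutoff; the thesis may use a different value

    def select_utterances(path, threshold=THRESHOLD):
        """Return ids of utterances whose mean word confidence >= threshold."""
        scores = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                utt_id, _word, conf = line.split()
                scores[utt_id].append(float(conf))
        return [u for u, confs in scores.items()
                if sum(confs) / len(confs) >= threshold]

    if __name__ == "__main__":
        kept = select_utterances("hyp_with_conf.txt")
        print(f"selected {len(kept)} utterances for acoustic-model retraining")

In a Kaldi-style setup such per-word confidences would typically come from lattice posteriors; here they are simply read from a text file for brevity.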
Table of contents:
Abstract (Chinese) i
ABSTRACT ii
Acknowledgements iii
Table of Contents iv
List of Tables vii
List of Figures viii
Chapter 1 Introduction 1
1.1 Research Motivation and Objectives 1
1.2 Related Work 2
1.3 Research Methods 3
1.4 Chapter Overview 6
Chapter 2 Automatic Speaker Diarization System 7
2.1 Speaker Segmentation 7
2.1.1 Speaker Change Point Detection 7
2.2 Agglomerative Hierarchical Clustering 9
2.3 Techniques Related to Speaker Clustering 11
2.3.1 Dynamic Time Warping 11
2.3.2 Universal Background Model 12
2.3.3 Joint Factor Analysis 12
2.3.4 Total Variability Space 13
2.4 Kaldi GMM System 13
2.5 i-Vector Model 16
2.5.1 i-Vector Training Method 17
2.6 Probabilistic Linear Discriminant Analysis 19
Chapter 3 Neural Network Models 21
3.1 How Neural Networks Operate 21
3.2 Neural Network Training Methods 23
3.2.1 Gradient Descent 23
3.2.2 Backpropagation 24
3.3 Neural Network Optimization 26
3.3.1 Activation Function 26
3.3.2 Dropout 29
3.3.3 AdaGrad 31
Chapter 4 Automatic Transcription System for Broadcast Radio Programs 32
4.1 Speech Signal Features 32
4.1.1 Speech Signal Preprocessing and Feature Extraction 33
4.1.2 Cepstral Mean and Variance Normalization 34
4.1.3 Linear Discriminant Analysis 35
4.1.4 Maximum Likelihood Linear Transform 36
4.2 Large-Vocabulary Speech Recognizer 37
4.2.1 Basic Speech Recognition 37
4.2.2 Phone Sharing 38
4.3 Acoustic Models 38
4.3.1 CNN 38
4.3.2 TDNN 39
4.3.3 LSTM 40
4.4 Language Models 41
4.4.1 N-gram Language Model 41
4.4.2 RNNLM 43
4.5 Weighted Finite-State Transducers 45
4.5.1 Weighted Finite-State Transducers for Speech Recognition 45
Chapter 5 Experimental Setup and Results 47
5.1 Experiment 1: Automatic Speaker Diarization System 47
5.1.1 Manual Annotation Guidelines for Speaker Changes 47
5.1.2 Preparation of Training and Test Corpora 48
5.1.3 Experimental Setup of the Automatic Speaker Diarization System 50
5.1.4 Experimental Results of the Automatic Speaker Diarization System 53
5.2 Experiment 2: Multilingual Baseline Speech Recognition System 54
5.2.1 Training and Test Corpora for the Multilingual Recognizer 54
5.2.2 Language Model Training Corpus 57
5.2.3 Mixed Chinese-English Lexicon 58
5.2.4 Feature Parameters and Neural Network Configuration 58
5.2.5 Results of the Multilingual Baseline Speech Recognition System 59
5.3 Experiment 3: Results of Multi-Layer Neural Network Models 60
5.4 Experiment 4: Confidence Measure Data Selection 63
5.5 Experiment 5: Language Model Improvement 65
5.6 Experiment 6: Adding a Large Amount of Mainland Chinese Corpus Data 67
5.7 Experiment 7: Adding Speaker Diarization Information to the Speech Recognizer 68
Chapter 6 Conclusions and Future Work 70
References 71
References: [1] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, 2010.
[2] H.-S. Lee et al., "Clustering-based i-vector formulation for speaker recognition," in Proc. Interspeech, 2014.
[3] S. J. D. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Proc. ICCV’07, Rio de Janeiro, Brazil, Oct. 2007, pp. 1-8.
[4] Kenny, Patrick, et al. "Joint factor analysis versus eigenchannels in speaker recognition." IEEE Transactions on Audio, Speech, and Language Processing 15.4 (2007): 1435-1447.
[5] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted gaussian mixture models,” Dig. Sig. Proc., 2000.
[6] Tranter, Sue E., and Douglas A. Reynolds. "An overview of automatic speaker diarization systems." IEEE Transactions on audio, speech, and language processing 14.5 (2006): 1557-1565.
[7] Anguera, Xavier, et al. "Speaker diarization: A review of recent research." IEEE Transactions on Audio, Speech, and Language Processing 20.2 (2012): 356-370.
[8] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Big Island, Hawaii, 2011.
[9] Reynolds, Douglas A., Thomas F. Quatieri, and Robert B. Dunn. "Speaker verification using adapted Gaussian mixture models." Digital signal processing 10.1-3 (2000): 19-41.
[10] Prazak, Jan, and Jan Silovsky. "Speaker diarization using PLDA-based speaker clustering." Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on. Vol. 1. IEEE, 2011.
[11] Blut, Ahmet Emin, et al. "PLDA-based diarization of telephone conversations." Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
[12] Waibel, Alexander, et al. "Phoneme recognition using time-delay neural networks." Readings in speech recognition. 1990. 393-404.
[13] Wang, Dong, et al. "Deep speaker verification: Do we need end to end?." Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017. IEEE, 2017.
[14] Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM." Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013.
[15] LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images, speech, and time series." The handbook of brain theory and neural networks 3361.10 (1995): 1995.
[16] Jiang, Hui. "Confidence measures for speech recognition: A survey." Speech communication 45.4 (2005): 455-470.
[17] Chen, Scott, and Ponani Gopalakrishnan. "Speaker, environment and channel change detection and clustering via the bayesian information criterion." Proc. darpa broadcast news transcription and understanding workshop. Vol. 8. 1998.
[18] Dubes, Richard, and Anil K. Jain. "Validity studies in clustering methodologies." Pattern recognition 11.4 (1979): 235-254.
[19] Reynolds, Douglas A. "An overview of automatic speaker recognition technology." ICASSP. Vol. 4. No. 4. 2002.
[20] Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm." Journal of the royal statistical society. Series B (methodological) (1977): 1-38.
[21] Kenny, Patrick. "Joint factor analysis of speaker and session variability: Theory and algorithms." CRIM, Montreal, (Report) CRIM-06/08-13 14 (2005): 28-29.
[22] Sahraeian, Reza, Dirk Van Compernolle, and Febe De Wet. "Using generalized maxout networks and phoneme mapping for low resource ASR-a case study on Flemish-Afrikaans." Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), 2015. IEEE, 2015.
[23] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP- 35(3):400–401, March 1987.
[24] Understanding LSTM Networks http://colah.github.io/posts/2015-08-Understanding-LSTMs/ , 2016, July.
[25] Mikolov Tomáš, Kombrink Stefan, Deoras Anoop, Burget Lukáš, Černocký Jan: RNNLM - Recurrent Neural Network Language Modeling Toolkit, In: ASRU 2011 Demo Session.
[26] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, M. Karafiát, S. Kombrink, P. Motlíček, Y. Qian, N. T. Vu, K. Riedhammer, and K. Veselý, "Generating exact lattices in the WFST framework," 2011, submitted to ICASSP 2012.
[27] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Proc. Automatic Speech Recognition Workshop, pp. 97-106, 2000.
[28] M. Mohri, F. Pereira, and M. Riley, "Speech recognition with weighted finite-state transducers," in Handbook of Speech Processing, J. Benesty, M. Sondhi, and Y. Huang, Eds. Springer, ch. 28, pp. 559-582, 2008.
[29] NIST 2004 Speaker Recognition Evaluation plan, https://www.nist.gov/sites/default/files/documents/2017/09/26/sre-04_evalplan-v1a.pdf.
[30] NIST 2006 Speaker Recognition Evaluation plan, https://www.nist.gov/sites/default/files/documents/2017/09/26/sre-06_evalplan-v9.pdf.
[31] NIST 2008 Speaker Recognition Evaluation plan, https://www.nist.gov/sites/default/files/documents/2017/09/26/sre08_evalplan_release4.pdf.
[32] Van Leeuwen, David A., and Marijn Huijbregts. "The AMI speaker diarization system for NIST RT06s meeting data." International Workshop on Machine Learning for Multimodal Interaction. Springer, Berlin, Heidelberg, 2006.
[33] NIST, “2000 Speaker Recognition Evaluation- Evaluation Plan,” 2000. http://www.itl.nist.gov/iad/mig/tests/sre/2000/spk-2000-plan-v1.0.html.
[34] NIST: The NIST Rich Transcription 2009 (RT’09) evaluation. http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-planv2.pdf (2009).
[35] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition.,” in INTERSPEECH, 2015, pp. 3586–3589.
Full-text access rights: Not authorized (the author did not consent to release the full text).