© The Institution of Engineering and Technology
The task of single-channel speech enhancement is to restore clean speech from noisy speech. Recently, speech enhancement has improved greatly with the introduction of deep learning. Previous work has shown that using an ideal ratio mask or a phase-sensitive mask as an intermediate target for recovering clean speech yields better performance. In that setting, the mean square error is usually chosen as the loss function. However, the authors' experiments reveal a problem with the mean square error: because it considers absolute error values, the gradients of the network depend on the absolute differences between estimated and true values, so points in the magnitude spectra with smaller values contribute little to the gradients. To solve this problem, they propose a relative loss, which attends to the relative rather than the absolute differences between magnitude spectra and accords better with human auditory perception. The perceptual evaluation of speech quality (PESQ), the short-time objective intelligibility (STOI), the signal-to-distortion ratio (SDR), and the segmental signal-to-noise ratio (SSNR) are used to evaluate the performance of the relative loss. Experimental results show that it can greatly improve speech enhancement by focusing on smaller values.
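The gradient imbalance described above can be illustrated with a small numerical sketch. The paper's exact formulation of the relative loss is not reproduced in this abstract, so the `relative_loss` below is a hypothetical squared-relative-error variant (with a stabilising constant `eps` chosen for illustration), used only to show why small-magnitude spectral bins vanish from the MSE gradient but not from a relative one:

```python
import numpy as np

# Toy magnitude spectra: one large-valued bin and one small-valued bin,
# both estimated with the same 20% relative error.
true_mag = np.array([10.0, 0.1])
est_mag = np.array([12.0, 0.12])

eps = 1e-8  # guards against division by zero in silent bins

def mse_loss(est, ref):
    """Mean square error: penalises absolute differences."""
    return np.mean((est - ref) ** 2)

def relative_loss(est, ref, eps=1e-8):
    """Hypothetical relative loss: each bin's error is scaled by its
    true magnitude, so quiet bins are not drowned out by loud ones."""
    return np.mean(((est - ref) / (ref + eps)) ** 2)

# Per-bin gradient magnitudes with respect to the estimate.
mse_grad = np.abs(2 * (est_mag - true_mag)) / len(est_mag)
rel_grad = np.abs(2 * (est_mag - true_mag) / (true_mag + eps) ** 2) / len(est_mag)

# Under MSE the small bin's gradient is ~100x weaker than the large bin's,
# even though both bins carry the same relative error; under the relative
# loss the small bin contributes at least as strongly.
print(mse_grad)
print(rel_grad)
```

Running the sketch shows `mse_grad` dominated by the large bin (2.0 vs. 0.02), while `rel_grad` reverses the ordering, which is the behaviour the abstract attributes to the proposed loss.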
http://iet.metastore.ingenta.com/content/journals/10.1049/iet-spr.2019.0290