Improving speech enhancement by focusing on smaller values using relative loss

The task of single-channel speech enhancement is to restore clean speech from noisy speech. Recently, speech enhancement has improved greatly with the introduction of deep learning. Previous work has shown that using the ideal ratio mask or the phase-sensitive mask as an intermediate target for recovering clean speech yields better performance. In this setting, the mean square error is usually chosen as the loss function. However, the authors' experiments reveal a drawback of the mean square error: it considers only absolute errors, so the gradients of the network depend on the absolute differences between estimated and true values, and points in the magnitude spectra with smaller values contribute little to the gradients. To solve this problem, they propose relative loss, which pays more attention to the relative differences between magnitude spectra rather than the absolute differences and is more consistent with human perceptual characteristics. The perceptual evaluation of speech quality, the short-time objective intelligibility, the signal-to-distortion ratio, and the segmental signal-to-noise ratio are used to evaluate the performance of the relative loss. Experimental results show that it can greatly improve speech enhancement by focusing on smaller values.
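To make the contrast concrete, the sketch below (an illustrative assumption, not the authors' exact formulation of relative loss) compares the mean square error with a simple relative loss that normalises each error by the true magnitude, so that small spectral values contribute to the gradient on an equal footing with large ones.

import numpy as np

def mse_loss(estimated, target):
    # Mean square error over magnitude-spectrum points:
    # the contribution of each point scales with its absolute error.
    return np.mean((estimated - target) ** 2)

def relative_loss(estimated, target, eps=1e-8):
    # Mean squared *relative* error; eps avoids division by zero.
    # This exact normalisation and eps value are illustrative assumptions.
    return np.mean(((estimated - target) / (target + eps)) ** 2)

# Toy example: one large-magnitude and one small-magnitude spectral point,
# both estimated with the same 10% relative error.
target = np.array([10.0, 0.1])
estimated = target * 1.1

print(mse_loss(estimated, target))       # ~0.5, dominated by the large value
print(relative_loss(estimated, target))  # ~0.01, both points contribute equally

Under the mean square error, the small-magnitude point is almost invisible to the gradient, whereas the relative formulation weights both points according to their proportional error.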

Inspec keywords: learning (artificial intelligence); speech intelligibility; performance evaluation; speech enhancement; neural nets

Other keywords: segmental signal-to-noise ratio; loss function; performance evaluation; mean square error; noisy speech; clean speech recovery; deep learning; single-channel speech enhancement; phase-sensitive mask; speech quality; short-time objective intelligibility; signal-to-distortion ratio; absolute differences; ideal ratio mask; magnitude spectra; relative loss; absolute error values

Subjects: Neural computing techniques; Speech and audio signal processing; Speech processing techniques; Performance evaluation and testing
