Single-channel dereverberation and denoising based on lower band trained SA-LSTMs

Supervised single-channel speech enhancement presents a single mixture recording at the input of a neural network and updates the network parameters so that the output reconstructs the speech signal. However, current neural network-based single-channel speech enhancement methods cannot fully exploit the specific frequency range occupied by speech signals while keeping the computational complexity limited. In this study, the authors studied the power spectral density of mixtures of human speech and noise interferences. Based on the observation that speech energy is concentrated in the lower frequency band, they proposed a method that trains signal approximation (SA)-based neural networks on the lower frequency band of the speech mixture to improve performance. To realise the lower band approach for single-channel speech enhancement, the method uses a long short-term memory (LSTM) block that operates on the short-time Fourier transform of the desired frequency range. Furthermore, to improve speech enhancement performance in reverberant room environments, the dereverberation mask and the enhanced ratio mask are used as the training targets of two LSTM blocks, respectively. Detailed evaluations confirm that the proposed method outperforms state-of-the-art methods.
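The following is a minimal, hypothetical PyTorch sketch of the two-stage, lower-band idea described above: the STFT magnitude is truncated to the lower frequency bins, one LSTM block estimates a dereverberation mask, a second estimates an enhanced ratio mask, and both are trained with a signal-approximation (SA) loss. The sampling rate, FFT size, cut-off bin, layer sizes and the MaskLSTM/sa_loss helpers are illustrative assumptions, not values or code taken from the paper.

```python
# Hypothetical sketch of lower-band SA training with two LSTM blocks
# (dereverberation mask followed by enhanced ratio mask).
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128    # assumed STFT parameters for 16 kHz audio
LOWER_BINS = 129         # keep bins up to ~4 kHz (assumption)

def lower_band_mag(wave):
    """STFT magnitude restricted to the lower frequency band."""
    spec = torch.stft(wave, n_fft=N_FFT, hop_length=HOP,
                      window=torch.hann_window(N_FFT),
                      return_complex=True)
    # (batch, frames, lower bins)
    return spec.abs()[:, :LOWER_BINS, :].transpose(1, 2)

class MaskLSTM(nn.Module):
    """One LSTM block mapping lower-band magnitudes to a T-F mask."""
    def __init__(self, bins=LOWER_BINS, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(bins, hidden, num_layers=2, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, bins), nn.Sigmoid())

    def forward(self, mag):
        h, _ = self.lstm(mag)
        return self.out(h)

def sa_loss(mask, mix_mag, target_mag):
    """Signal-approximation loss: masked mixture vs. target magnitude."""
    return torch.mean((mask * mix_mag - target_mag) ** 2)

# Toy training step; random tensors stand in for a real corpus.
derevb_net, denoise_net = MaskLSTM(), MaskLSTM()
opt = torch.optim.Adam(list(derevb_net.parameters()) +
                       list(denoise_net.parameters()), lr=1e-3)

mix = torch.randn(4, 16000)          # reverberant, noisy mixtures
reverb_free = torch.randn(4, 16000)  # reverberation-free (noisy) reference
clean = torch.randn(4, 16000)        # clean speech reference

mix_mag = lower_band_mag(mix)
stage1 = derevb_net(mix_mag)               # dereverberation mask
stage2 = denoise_net(stage1 * mix_mag)     # enhanced ratio mask
loss = sa_loss(stage1, mix_mag, lower_band_mag(reverb_free)) + \
       sa_loss(stage2, stage1 * mix_mag, lower_band_mag(clean))
opt.zero_grad(); loss.backward(); opt.step()
```

Restricting the input to the lower bins shrinks the LSTM input dimension, which is how the lower-band approach keeps the computational complexity limited while focusing modelling capacity on the frequency range where speech energy is concentrated.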

Inspec keywords: speech enhancement; reverberation; computational complexity; supervised learning; approximation theory; recurrent neural nets; Fourier transforms

Other keywords: long short-term memory; dereverberation mask; computational complexity; signal approximation based neural networks; network parameters; reconstructed speech signal; lower band approach; short-time Fourier transform; noise interferences; human speech; enhanced ratio mask; single-channel dereverberation; lower band trained SA-LSTMs; reverberant room environments; speech mixture; mixture recording; power spectral density; speech enhancement performance; current neural network-based single-channel speech methods; supervised single-channel speech enhancement

Subjects: Integral transforms in numerical analysis; Speech and audio signal processing; Interpolation and function approximation (numerical analysis); Speech processing techniques; Computational complexity; Neural nets; Supervised learning
