
Block-online multi-channel speech enhancement using deep neural network-supported relative transfer function estimates

IET Signal Processing

This work addresses block-online processing for multi-channel speech enhancement. Such processing is vital in scenarios with moving speakers and/or when short utterances are processed, e.g. in voice assistant applications. We consider several variants of a system that performs beamforming supported by deep neural network-based voice activity detection, followed by post-filtering. The target speaker is localized by estimating relative transfer functions between microphones. Each block of the input signals is processed independently, which makes the method applicable in highly dynamic environments. Because the processed blocks are short, the statistics required by the beamformer are estimated less precisely; the influence of this inaccuracy is studied and compared to the batch processing regime, in which entire recordings are treated as a single block. The experimental evaluation is performed on the large CHiME-4 dataset and on another dataset featuring a moving target speaker. The experiments are evaluated in terms of objective and perceptual criteria, as well as the word error rate (WER) of a speech recognition system for which the method serves as a front-end. The results indicate that the proposed method is robust to short block lengths: significant improvements in the criteria and in WER are observed even for blocks as short as 250 ms.
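The per-block processing described above can be sketched as follows. This is a minimal, hypothetical illustration for a single STFT frequency bin, not the authors' implementation: it assumes the DNN-based voice activity detector supplies a per-frame speech/noise mask (`vad_mask`), takes the relative transfer function (RTF) as the principal eigenvector of the covariance difference (one of several RTF estimators in the literature), and omits the post-filter.

```python
import numpy as np

def estimate_rtf(R_x, R_n, ref=0):
    """Estimate the RTF as the principal eigenvector of the covariance
    difference (R_x - R_n), normalized to the reference microphone."""
    _, v = np.linalg.eigh(R_x - R_n)   # ascending eigenvalues
    h = v[:, -1]                       # principal eigenvector
    return h / h[ref]

def mvdr_weights(R_n, h):
    """MVDR beamformer steered by the RTF h:
    w = R_n^{-1} h / (h^H R_n^{-1} h)."""
    Rinv_h = np.linalg.solve(R_n, h)
    return Rinv_h / (h.conj() @ Rinv_h)

def process_block(X, vad_mask, ref=0, eps=1e-6):
    """Enhance one block X (mics x frames) of a single frequency bin.
    Statistics are estimated from this block only, so each block is
    processed independently of the rest of the recording."""
    M = X.shape[0]
    speech = X[:, vad_mask]            # frames flagged as speech-dominated
    noise = X[:, ~vad_mask]            # frames flagged as noise-dominated
    R_x = speech @ speech.conj().T / max(speech.shape[1], 1)
    R_n = noise @ noise.conj().T / max(noise.shape[1], 1) + eps * np.eye(M)
    h = estimate_rtf(R_x, R_n, ref)
    w = mvdr_weights(R_n, h)
    return w.conj() @ X                # beamformed output for the block
```

With short blocks, `R_x` and `R_n` are averaged over few frames, which is exactly the estimation inaccuracy the paper studies relative to batch processing; the diagonal loading `eps` is a common safeguard when the noise covariance is poorly conditioned.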

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-spr.2019.0304