Dual-channel VTS feature compensation for noise-robust speech recognition on mobile devices

One way to improve automatic speech recognition (ASR) performance on the latest mobile devices, which can be used in a variety of noisy environments, is to take advantage of the small microphone arrays embedded in them. Since the performance of classic beamforming techniques with small microphone arrays is rather limited, specific techniques are being developed to efficiently exploit this novel feature for noise-robust ASR. In this study, a novel dual-channel minimum mean square error (MMSE)-based feature compensation method is proposed that relies on a vector Taylor series (VTS) expansion of a dual-channel speech distortion model. In contrast to the single-channel VTS approach (which can be considered the state of the art for feature compensation), the authors’ technique particularly benefits from the spatial properties of speech and noise. Their proposal is assessed on a dual-microphone smartphone (a particular case of interest) by means of the AURORA2-2C synthetic corpus. Word recognition results, also validated with real noisy speech data, show that their method is more accurate, clearly outperforming minimum variance distortionless response (MVDR) beamforming and a single-channel VTS feature compensation approach, especially at low signal-to-noise ratios.
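
As background for the single-channel VTS baseline mentioned in the abstract, the following is a minimal sketch of the first-order Taylor linearisation that VTS methods are built on. It is not the authors' dual-channel model, which the abstract does not specify. It assumes the commonly used single-channel log-spectral distortion model y = x + log(1 + exp(n - x)) for one filterbank channel, ignoring channel distortion and phase; the function names `mismatch`, `vts_linearize` and `vts_approx` are hypothetical and chosen for illustration only.

```python
import math


def mismatch(x, n):
    """Assumed distortion model: noisy log-energy y from clean x and noise n."""
    return x + math.log1p(math.exp(n - x))


def vts_linearize(mu_x, mu_n):
    """First-order Taylor expansion of mismatch() around (mu_x, mu_n).

    Returns (y0, g) such that
        y ~= y0 + g * (x - mu_x) + (1 - g) * (n - mu_n),
    where g is dy/dx and (1 - g) is dy/dn at the expansion point.
    """
    y0 = mismatch(mu_x, mu_n)
    g = 1.0 / (1.0 + math.exp(mu_n - mu_x))
    return y0, g


def vts_approx(x, n, mu_x, mu_n):
    """Evaluate the linearised model at (x, n)."""
    y0, g = vts_linearize(mu_x, mu_n)
    return y0 + g * (x - mu_x) + (1.0 - g) * (n - mu_n)


if __name__ == "__main__":
    # Compare the exact and linearised models near the expansion point.
    exact = mismatch(1.1, 0.9)
    approx = vts_approx(1.1, 0.9, mu_x=1.0, mu_n=1.0)
    print(exact, approx)
```

Because the linearised model is affine in x and n, Gaussian priors on clean speech and noise yield a Gaussian approximation for the noisy features, which is what makes a closed-form MMSE estimate tractable in VTS-style compensation.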

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-spr.2016.0182