Adaptive spectral smoothening for development of robust keyword spotting system

It is well known that a keyword spotting (KWS) system performs significantly worse under mismatched training and test conditions. In this work, an approach is proposed for reducing the mismatch between the training and test speech caused by speaker-related variabilities and environmental noise. In the proposed approach, variational-mode decomposition is first performed on the short-term magnitude spectrum to decompose it into a number of variational mode functions (VMFs) in an adaptive manner. A sufficiently smoothed spectrum is then reconstructed by selecting only the two lower-frequency VMFs. When the KWS system is developed using Mel-frequency cepstral coefficients (MFCCs) extracted from the smoothed spectra, significantly improved performance is observed under pitch- and noise-mismatched test conditions. To further suppress the mismatches due to the pitch and speaking rate of the speakers, data-augmented training based on explicit prosody modification is performed. The experimental results presented in this study show that data-augmented training further enhances the performance of the developed KWS system.
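The smoothing step described above can be sketched in code: decompose each frame's magnitude spectrum with variational-mode decomposition (VMD) and rebuild it from the two lowest-frequency modes. The sketch below is a minimal NumPy implementation of the VMD update equations (Dragomiretskiy and Zosso, 2014) with the Lagrange-multiplier update omitted (i.e. tau = 0); the mode count K = 5, penalty alpha = 2000, and 512-point FFT are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def vmd(signal, K=5, alpha=2000, n_iter=200, tol=1e-7):
    """Minimal variational mode decomposition: ADMM-style updates in the
    frequency domain, with the Lagrange multiplier omitted (tau = 0).
    Returns (modes, centre_freqs) sorted by centre frequency."""
    N = len(signal)
    # Mirror the signal at both ends to reduce boundary effects.
    f = np.concatenate([signal[N//2 - 1::-1], signal, signal[:N//2 - 1:-1]])
    T = len(f)
    freqs = np.arange(T) / T - 0.5                  # index T//2 <-> 0 Hz
    f_hat = np.fft.fftshift(np.fft.fft(f))
    f_hat[:T//2] = 0                                # keep positive-frequency half

    u_hat = np.zeros((K, T), dtype=complex)
    omega = 0.5 * np.arange(K) / K                  # initial centre frequencies

    for _ in range(n_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Wiener-filter-like mode update centred on omega[k].
            u_hat[k] = (f_hat - others) / (1 + 2 * alpha * (freqs - omega[k])**2)
            # Centre frequency = spectral centroid of the mode's power.
            power = np.abs(u_hat[k, T//2:])**2
            omega[k] = np.sum(freqs[T//2:] * power) / (np.sum(power) + 1e-12)
        change = np.sum(np.abs(u_hat - u_prev)**2) / (np.sum(np.abs(u_prev)**2) + 1e-12)
        if change < tol:
            break

    # Rebuild Hermitian-symmetric spectra and return to the "time" domain.
    u_full = np.zeros((K, T), dtype=complex)
    u_full[:, T//2:] = u_hat[:, T//2:]
    u_full[:, 1:T//2] = np.conj(u_hat[:, :T//2:-1])
    modes = np.real(np.fft.ifft(np.fft.ifftshift(u_full, axes=-1), axis=-1))
    modes = modes[:, N//2:N//2 + N]                 # undo the mirroring
    order = np.argsort(omega)
    return modes[order], omega[order]

def smooth_magnitude_spectrum(frame, n_fft=512, n_modes=5):
    """Smooth one frame's magnitude spectrum by keeping only the two
    lowest-frequency VMFs; n_fft and n_modes are illustrative choices."""
    mag = np.abs(np.fft.rfft(frame, n_fft))
    modes, _ = vmd(mag, K=n_modes)
    return modes[0] + modes[1]
```

In a full front end, the smoothed spectrum of each frame would then be passed through a Mel filterbank, log compression, and DCT to obtain the MFCCs used for KWS training; those later stages are standard and are not shown here.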

Inspec keywords: cepstral analysis; speaker recognition; feature extraction; speech processing

Other keywords: short-term magnitude spectra; speaker-related variabilities; adaptive spectral smoothening; KWS system; test speech; environmental noises; noise mismatched test conditions; variational-mode decomposition; robust keyword spotting system; data-augmented training; speaker pitch; speaker speaking rate; Mel frequency cepstral coefficients; variational mode functions

Subjects: Signal processing and detection; Speech recognition and synthesis; Speech processing techniques

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-spr.2019.0027