Efficient harmonic peak detection of vowel sounds for enhanced voice activity detection

Voice activity detection (VAD) involves discriminating speech segments from background noise and is a critical step in numerous speech-related applications. However, distinguishing speech from noise based on the properties of the noise is unreliable, because the noise occurring in real life is difficult to predict and characterise. In this study, the authors instead focus on the intrinsic characteristics of speech. The harmonic peaks of vowel sounds have higher energies than the other spectral components of speech and are the speech features most likely to survive severe noise. Therefore, the energy differences between harmonic peaks and other spectral features show promise for enabling robust VAD. To exploit this feature, the harmonic peaks must be accurately located. For this purpose, this study proposes an efficient harmonic peak location detection (HPD) method. Based on extensive experiments conducted with various noise types and signal-to-noise ratios, the authors found that VAD with the proposed HPD approach outperforms existing VAD methods in accuracy and robustness at a reasonable computational cost.
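The idea that harmonic peaks stand out in energy from the rest of the spectrum can be illustrated with a small sketch. This is not the HPD method proposed in the study, whose details are not given in the abstract; it is a deliberately simple stand-in that picks the strongest local maxima of a power spectrum and measures their energy gap, in dB, over the remaining bins. The sample rate, frame length, fundamental frequency, noise level and the helper names `power_spectrum` and `peak_to_floor_db` are all assumptions made for illustration.

```python
import cmath
import math
import random

SAMPLE_RATE = 8000  # Hz (assumed for this sketch)
FRAME_LEN = 256     # samples per frame, so each DFT bin spans 31.25 Hz


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hann-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(w * cmath.exp(-2j * math.pi * k * t / n)
                    for t, w in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]


def peak_to_floor_db(spectrum, num_peaks=5):
    """Energy gap (dB) between the strongest local maxima and the other bins."""
    peaks = [k for k in range(1, len(spectrum) - 1)
             if spectrum[k] > spectrum[k - 1] and spectrum[k] >= spectrum[k + 1]]
    top = sorted(peaks, key=lambda k: spectrum[k], reverse=True)[:num_peaks]
    if not top:
        return 0.0
    # Exclude each selected peak and its two neighbours from the "floor".
    excluded = {j for k in top for j in (k - 1, k, k + 1)}
    floor = [p for k, p in enumerate(spectrum) if k not in excluded]
    peak_energy = sum(spectrum[k] for k in top) / len(top)
    floor_energy = sum(floor) / len(floor)
    eps = 1e-12  # guards against log of zero
    return 10.0 * math.log10((peak_energy + eps) / (floor_energy + eps))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "vowel": ten harmonics of an assumed 125 Hz fundamental with
# 1/h amplitude decay, plus white Gaussian noise.
vowel_frame = [
    sum(math.sin(2 * math.pi * 125.0 * h * t / SAMPLE_RATE) / h
        for h in range(1, 11)) + rng.gauss(0.0, 0.3)
    for t in range(FRAME_LEN)
]
# Noise-only frame with the same noise level.
noise_frame = [rng.gauss(0.0, 0.3) for _ in range(FRAME_LEN)]

vowel_spectrum = power_spectrum(vowel_frame)
vowel_gap = peak_to_floor_db(vowel_spectrum)
noise_gap = peak_to_floor_db(power_spectrum(noise_frame))
print(f"vowel frame gap: {vowel_gap:.1f} dB, noise frame gap: {noise_gap:.1f} dB")
```

A frame-level decision could then compare this gap against a threshold tuned on held-out data. The sketch says nothing about how the study locates peaks efficiently or how it sets its decision rule.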

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-spr.2017.0553