Robust speech recognition using harmonic features

In this study, the authors propose a speech recognition system that uses harmonic structure-related information to detect harmonic features in noisy environments. The proposed algorithm first extracts the harmonic components contained in the speech signal using sine function convolution. By setting the frequency of the sine function equal to the fundamental frequency of the speech signal, the harmonic components can be extracted. The reconstructed signal obtained by summing the extracted harmonic components is found to correlate highly with the original signal. The frame energy measure of the extracted harmonic components is further processed into dynamic harmonic features, which are then used together with the European Telecommunications Standards Institute (ETSI) front-end processed mel-frequency cepstral coefficient (MFCC) features or the perceptual linear prediction (PLP) features in the speech recognition system. The proposed enhanced system achieves a higher recognition rate than the ETSI front-end processed MFCC (or PLP)-based speech recognition system.
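The abstract does not give implementation details, but the described pipeline can be illustrated with a minimal Python/NumPy sketch: per-frame harmonic components are approximated by projecting a speech frame onto sinusoids at integer multiples of an assumed known fundamental frequency, the components are summed into a reconstructed frame, and the log energy of that reconstruction (plus its delta) is appended to an existing MFCC or PLP feature vector. All function and parameter names below are placeholders chosen for illustration; this is an assumption-laden approximation of "sine function convolution", not the authors' published code.

```python
import numpy as np

def harmonic_log_energy(frame, f0, fs, n_harmonics=10):
    """Sketch: project a speech frame onto sines/cosines at multiples of f0 (Hz),
    sum the extracted harmonic components into a reconstructed frame, and return
    the log frame energy of that reconstruction. Illustrative only."""
    t = np.arange(len(frame)) / fs
    recon = np.zeros(len(frame))
    for k in range(1, n_harmonics + 1):
        if k * f0 >= fs / 2:              # keep harmonics below the Nyquist frequency
            break
        s = np.sin(2 * np.pi * k * f0 * t)
        c = np.cos(2 * np.pi * k * f0 * t)
        a = 2 * np.dot(frame, s) / len(frame)   # least-squares amplitude, sine part
        b = 2 * np.dot(frame, c) / len(frame)   # least-squares amplitude, cosine part
        recon += a * s + b * c                  # accumulate the k-th harmonic component
    return np.log(np.sum(recon ** 2) + 1e-12)

def dynamic_harmonic_feature(log_energies):
    """First-order difference (delta) of the per-frame log harmonic energy,
    used here as a stand-in for the paper's 'dynamic harmonic features'."""
    return np.gradient(np.asarray(log_energies))

# Usage sketch: augment an existing MFCC (or PLP) matrix of shape (n_frames, n_coeffs).
# mfcc = ...                                   # e.g. from an ETSI-style front end
# logE = [harmonic_log_energy(fr, f0[i], fs) for i, fr in enumerate(frames)]
# augmented = np.hstack([mfcc,
#                        np.column_stack([logE, dynamic_harmonic_feature(logE)])])
```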

Inspec keywords: speech recognition; signal reconstruction

Other keywords: noisy environment; ETSI front-end processed MFCC; extracted frame energy; front-end processed mel-frequency cepstral coefficients; speech recognition system; extracted harmonic components; European Telecommunications Standards Institute; harmonic features; perceptual linear prediction; sine function convolution; speech signals; robust speech recognition; high degree of correlation; speech signal frequency; reconstructed signal; harmonic components; PLP-based speech recognition system; harmonic structure related information; dynamic harmonic features

Subjects: Speech processing techniques; Speech recognition and synthesis
