Voicing detection based on adaptive aperiodicity thresholding for speech enhancement in non-stationary noise

In this study, the authors present a novel voicing detection algorithm that employs the well-known aperiodicity measure to detect voiced speech in signals contaminated with non-stationary noise. The method computes a signal-adaptive decision threshold that accounts for the current noise level, enabling voicing detection by direct comparison with the extracted aperiodicity. This adaptive threshold is updated at each frame from a simple estimate of the current noise power, and thus tracks fluctuating noise conditions. Once the aperiodicity is computed, the method requires only a small number of additional operations, making it feasible to implement on resource-constrained devices (such as hearing aids) if an efficient approximation of the difference function is employed to extract the aperiodicity. Evaluation over a database of speech sentences degraded by several types of noise shows that the proposed voicing classifier is robust across different noises and signal-to-noise ratios. In addition, to evaluate the applicability of the method to speech enhancement, a simple F0-based speech enhancement algorithm integrating the proposed classifier is implemented. The system is shown to achieve competitive results, in terms of objective measures, when compared with other well-known speech enhancement approaches.
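The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes YIN's cumulative mean normalised difference (de Cheveigné and Kawahara, 2002) as the aperiodicity measure, and the threshold form, the constants `alpha` and `beta`, and all function names are hypothetical choices made here for illustration; the paper's actual threshold update rule and parameter values are not reproduced.

```python
import numpy as np

def aperiodicity(frame, tau_min=20, tau_max=320):
    """Aperiodicity via YIN's cumulative mean normalised difference d'(tau).
    Returns min over tau of d'(tau): near 0 for periodic (voiced) frames,
    near 1 for aperiodic (unvoiced/noise) frames."""
    n = len(frame)
    taus = np.arange(1, tau_max + 1)
    # difference function d(tau) = sum_j (x[j] - x[j+tau])^2
    d = np.array([np.sum((frame[:n - t] - frame[t:]) ** 2) for t in taus])
    # cumulative mean normalisation: d'(tau) = d(tau) * tau / sum_{j<=tau} d(j)
    cmnd = d * taus / np.maximum(np.cumsum(d), 1e-12)
    return float(np.min(cmnd[tau_min - 1:]))

def is_voiced(frame, noise_power, alpha=0.5, beta=1.0):
    """Hypothetical adaptive-threshold decision: the threshold rises with the
    estimated noise-to-frame power ratio, so voiced frames whose aperiodicity
    is inflated by noise can still be detected. alpha, beta and the cap at 0.9
    are illustrative constants, not values from the paper."""
    frame_power = np.mean(frame ** 2) + 1e-12
    threshold = min(alpha + beta * noise_power / frame_power, 0.9)
    return aperiodicity(frame) < threshold
```

In a full system, `noise_power` would itself be re-estimated every frame (e.g. from frames already classified as noise), which is what makes the threshold adapt to non-stationary conditions.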

Inspec keywords: hearing aids; speech enhancement

Other keywords: adaptive aperiodicity thresholding; signal-to-noise ratios; hearing aids; voicing detection; speech enhancement; voicing classifier; signal-adaptive decision; nonstationary noise; fluctuating noise; speech sentences database

Subjects: Speech processing techniques; Speech and audio signal processing

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-spr.2012.0224