Speaker identification using multimodal neural networks and wavelet analysis
Author(s): Noor Almaadeed (1, 2); Amar Aggoun (3); Abbes Amira (2, 4)
Affiliations:
1: Department of Computer Engineering, Brunel University, Kingston Lane, Uxbridge, Middlesex UB8 3PH, UK
2: Department of Computer Science and Engineering, College of Engineering, Qatar University, Doha, Qatar
3: Department of Computer Science and Technology, University of Bedfordshire, University Square, Luton LU1 3JU, UK
4: Department of Engineering and Computer Science, University of the West of Scotland, Paisley, UK
Source: Volume 4, Issue 1, March 2015, pp. 18–28
DOI: 10.1049/iet-bmt.2014.0011, Print ISSN 2047-4938, Online ISSN 2047-4946
License: Creative Commons Attribution 3.0 (http://creativecommons.org/licenses/by/3.0/)
Rapid technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from their voice regardless of what is being said. In this study, the authors designed and implemented a novel text-independent multimodal speaker identification system based on wavelet analysis and neural networks. The wavelet analysis stage comprises the discrete wavelet transform, the wavelet packet transform, wavelet sub-band coding and Mel-frequency cepstral coefficients (MFCCs). The learning module comprises general regression, probabilistic and radial basis function neural networks, whose decisions are combined through a majority voting scheme. The system was found to be competitive: it improved the identification rate by 15% compared with the classical MFCC approach and reduced the identification time by 40% compared with the back-propagation neural network, Gaussian mixture model and principal component analysis approaches. Performance tests conducted on the GRID corpus show that this approach achieves faster identification and greater accuracy than traditional approaches, and that it is applicable to real-time, text-independent speaker identification systems.
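The abstract names the components of the system but not how they are wired together. The following is a minimal, hypothetical sketch of that kind of pipeline, not the authors' implementation. It assumes a Haar wavelet, a three-level DWT, log sub-band energies as the feature vector, and simplified stand-ins for the three networks: a Parzen-window PNN, a GRNN that regresses one-hot class targets, and an RBF classifier with one Gaussian unit per class mean, fused by majority vote. The wavelet packet transform and MFCC features are omitted. All function names, the kernel width `sigma` and the synthetic `tone`/`buzz` signals (toy stand-ins, not speech) are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical sketch: wavelet family, depth, features and classifier details are
# assumptions, since the abstract does not specify them.

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT, returning (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    n = len(signal) - len(signal) % 2  # ignore a trailing odd sample
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, n, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, n, 2)]
    return approx, detail

def subband_energies(signal, levels=3):
    """Log-energy of each detail sub-band, then of the final approximation band."""
    feats, approx = [], list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        feats.append(math.log(sum(d * d for d in detail) + 1e-12))
    feats.append(math.log(sum(a * a for a in approx) + 1e-12))
    return feats

def _kernel(a, b, sigma):
    """Gaussian kernel on the squared Euclidean distance between feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def pnn(train, x, sigma=1.0):
    """PNN-style vote: class with the largest mean Parzen-window density at x."""
    scores = {}
    for label in sorted({lab for _, lab in train}):
        ks = [_kernel(f, x, sigma) for f, lab in train if lab == label]
        scores[label] = sum(ks) / len(ks)
    return max(scores, key=scores.get)

def grnn(train, x, sigma=1.0):
    """GRNN-style vote: kernel-weighted average of one-hot targets, then argmax."""
    weights = [_kernel(f, x, sigma) for f, _ in train]
    total = sum(weights) or 1.0  # avoid dividing by zero if every kernel underflows
    outputs = {}
    for label in sorted({lab for _, lab in train}):
        outputs[label] = sum(w for w, (_, lab) in zip(weights, train) if lab == label) / total
    return max(outputs, key=outputs.get)

def rbf(train, x, sigma=1.0):
    """RBF-style vote: one Gaussian unit per class, centred on the class mean."""
    acts = {}
    for label in sorted({lab for _, lab in train}):
        members = [f for f, lab in train if lab == label]
        centre = [sum(col) / len(members) for col in zip(*members)]
        acts[label] = _kernel(centre, x, sigma)
    return max(acts, key=acts.get)

def majority_vote(labels):
    """Most frequent label; ties go to the earliest classifier's choice."""
    counts = Counter(labels)
    top = max(counts.values())
    return next(lab for lab in labels if counts[lab] == top)

def identify(train, signal, levels=3, sigma=1.0):
    """Extract sub-band features, query the three classifiers, fuse by majority vote."""
    x = subband_energies(signal, levels)
    return majority_vote([pnn(train, x, sigma), grnn(train, x, sigma), rbf(train, x, sigma)])

def tone(amp):
    """Toy slowly varying signal standing in for 'speaker A'."""
    return [amp * math.sin(2 * math.pi * i / 32) for i in range(64)]

def buzz(amp):
    """Toy rapidly alternating signal standing in for 'speaker B'."""
    return [amp * (1 if i % 2 == 0 else -1) for i in range(64)]

if __name__ == "__main__":
    train = [(subband_energies(gen(a)), name)
             for gen, name in ((tone, "A"), (buzz, "B")) for a in (0.9, 1.0)]
    print(identify(train, tone(1.1)), identify(train, buzz(1.1)))  # prints: A B
```

With Gaussian kernels, a poorly chosen `sigma` can drive every kernel towards zero; the guard in `grnn` only prevents a division by zero, so a practical version would likely normalise the features and tune `sigma` on held-out data.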
Inspec keywords: principal component analysis; discrete wavelet transforms; text analysis; cepstral analysis; biometrics (access control); backpropagation; Gaussian processes; radial basis function networks; audio databases; mixture models; speaker recognition
Other keywords: Gaussian mixture model; principal component analysis; wavelet packet transform; back-propagation neural network; multimodal neural networks; radial basis function neural networks; biometric authentication systems; probabilistic neural networks; Mel-frequency cepstral coefficients; learning module; wavelet subband coding; general regressive neural networks; text-independent multimodal speaker identification system; MFCC; GRID database corpora; discrete wavelet transform; majority voting scheme; wavelet analysis
Subjects: Integral transforms; Speech recognition and synthesis; Other topics in statistics; Speech processing techniques; Natural language interfaces; Neural computing techniques