Sentence-HMM state-based i-vector/PLDA modelling for improved performance in text dependent single utterance speaker verification

Osman Büyük

Author(s): Osman Büyük ¹
- Affiliations: 1: Electronics and Telecommunications Engineering Department, Kocaeli University, Kocaeli, 41380, Turkey
Source: Volume 10, Issue 8, October 2016, pp. 918–923
DOI: 10.1049/iet-spr.2015.0288, Print ISSN 1751-9675, Online ISSN 1751-9683

Received 08/07/2015, Accepted 13/05/2016, Revised 11/05/2016, Published 31/05/2016

In this paper, we use hidden Markov model (HMM) state alignment information within the i-vector/probabilistic linear discriminant analysis (PLDA) framework to improve verification performance in a text-dependent single utterance (TDSU) task. In the TDSU task, speakers repeat a fixed utterance in both the enrollment and authentication sessions. Although Gaussian mixture models (GMMs) have been the dominant modeling technique for text-independent applications, an HMM-based method might be better suited to the TDSU task since it captures co-articulation information better. Recently, powerful channel compensation techniques such as joint factor analysis (JFA), i-vectors and PLDA have been proposed for GMM-based text-independent speaker verification. In this study, we train a separate i-vector/PLDA model for each sentence HMM state in order to exploit the alignment information of the HMM states in a TDSU task. The proposed method is tested on a multi-channel speaker verification database. In the experiments, the HMM state-based i-vector/PLDA (i-vector/PLDA-HMM) provides an approximately 67% relative reduction in equal error rate (EER) compared to i-vector/PLDA. The proposed method also outperforms the baseline GMM and sentence HMM methods, yielding an approximately 51% relative reduction in EER over the best-performing sentence HMM method.
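
As a rough illustration of two ideas mentioned in the abstract, the Python sketch below shows (a) how feature frames could be partitioned by the HMM state each frame is aligned to, so that a separate model can be trained per state, and (b) how an EER and a relative EER reduction can be computed from verification scores. The function names, the toy frames and scores, and the threshold-sweep EER approximation are all assumptions made for this example; they are not taken from the paper, and the i-vector/PLDA training itself is not shown.

```python
from collections import defaultdict


def group_frames_by_state(frames, alignment):
    """Partition feature frames by the HMM state index each frame is aligned to.

    `frames` and `alignment` are hypothetical inputs: one feature vector and
    one state index per frame.
    """
    if len(frames) != len(alignment):
        raise ValueError("frames and alignment must have the same length")
    by_state = defaultdict(list)
    for frame, state in zip(frames, alignment):
        by_state[state].append(frame)
    return dict(by_state)


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Returns the mean of the false acceptance and false rejection rates at the
    threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, proposed_eer):
    """Relative EER reduction of a proposed system over a baseline."""
    return (baseline_eer - proposed_eer) / baseline_eer


# Toy data only: four frames aligned to two states.
states = group_frames_by_state([[0.1], [0.2], [0.3], [0.4]], [0, 0, 1, 1])

# Toy scores only, not results from the paper.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
```

With this definition, a baseline EER of 3.0% and a proposed-system EER of 1.0% (hypothetical values) would correspond to a relative reduction of about 67%.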
