Automatic phoneme segmentation of a speech sequence is a basic problem in speech engineering. This study investigates unsupervised phoneme segmentation that uses no prior information about the linguistic content or the acoustic models of the input sequence. The authors formulate unsupervised segmentation as a maximum-likelihood optimisation problem, and show that the optimal segmentation corresponds to minimising the coding length of the input sequence. Under different assumptions, five objective functions are developed: log determinant, rate distortion (RD), Bayesian log determinant, Mahalanobis distance and Euclidean distance. The authors prove that the optimal segmentations have transformation-invariant properties, introduce a time-constrained agglomerative clustering algorithm to find the optimal segmentations, and propose an efficient implementation of the algorithm using integration functions. Experiments on the TIMIT database compare the five objective functions. The results show that RD achieves the best performance, and that the proposed method outperforms previous unsupervised segmentation methods.
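
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what time-constrained agglomerative clustering can look like, not the authors' implementation. It assumes the simplest of the five objectives (Euclidean distance, taken here as the sum of squared distances of frames to their segment mean), a greedy bottom-up merge of temporally adjacent segments, and a known target number of segments. Cumulative sums stand in for the integration functions, so that the cost of any segment is obtained without revisiting its frames. All function names (`prefix_sums`, `segment_cost`, `segment`) are hypothetical.

```python
def prefix_sums(frames):
    """Cumulative sums of the feature vectors and of their squared norms."""
    dim = len(frames[0])
    s1 = [[0.0] * dim]  # s1[t] = sum of frames[0:t], per dimension
    s2 = [0.0]          # s2[t] = sum of squared norms of frames[0:t]
    for x in frames:
        s1.append([a + b for a, b in zip(s1[-1], x)])
        s2.append(s2[-1] + sum(v * v for v in x))
    return s1, s2


def segment_cost(s1, s2, start, end):
    """Sum of squared Euclidean distances to the mean over frames[start:end]."""
    n = end - start
    total = [b - a for a, b in zip(s1[start], s1[end])]
    return (s2[end] - s2[start]) - sum(v * v for v in total) / n


def segment(frames, num_segments):
    """Greedy merging of adjacent segments (assumed stopping rule: a target count).

    Returns the boundary indices [0, b1, ..., T], where segment k covers
    frames[b_k:b_{k+1}].
    """
    s1, s2 = prefix_sums(frames)
    bounds = list(range(len(frames) + 1))  # start with one segment per frame
    while len(bounds) - 1 > num_segments:
        best_i, best_inc = None, None
        # Time constraint: only temporally adjacent segments may be merged.
        for i in range(1, len(bounds) - 1):
            a, b, c = bounds[i - 1], bounds[i], bounds[i + 1]
            inc = (segment_cost(s1, s2, a, c)
                   - segment_cost(s1, s2, a, b)
                   - segment_cost(s1, s2, b, c))
            if best_inc is None or inc < best_inc:
                best_i, best_inc = i, inc
        del bounds[best_i]  # removing a boundary merges its two neighbours
    return bounds


# Toy one-dimensional "features" with three clearly separated plateaus.
frames = [[0.0], [0.1], [0.0], [5.0], [5.1], [5.0], [10.0], [10.1]]
print(segment(frames, 3))  # [0, 3, 6, 8]
```

Replacing `segment_cost` with another per-segment cost, for example one based on the log determinant of the segment covariance, would change the objective while leaving the merging loop as it is; that would need cumulative sums of outer products as well.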

Unsupervised optimal phoneme segmentation: theory and experimental evaluation
