© The Institution of Engineering and Technology
The authors propose a fast speaker clustering method. A distance, the distance of feature matrix mean (DFMM), is first defined to characterise the similarity between any two clusters, and an adaptive convergence threshold is then introduced to terminate the clustering procedure. If the minimum DFMM over all pairs of clusters is smaller than the threshold, the corresponding two clusters are merged. This merging is repeated until the minimum DFMM between any two clusters is larger than the threshold. Experiments are conducted on both shorter voice segments (≤ 3 s) and longer voice segments (> 3 s) to compare the method with two state-of-the-art methods: agglomerative hierarchical clustering with Bayesian information criterion (AHC + BIC) and vector quantisation with spectral clustering. The experiments show that the proposed method achieves the best results for clustering shorter voice segments and obtains satisfactory results for clustering longer voice segments in comparison with the other two methods. Moreover, the proposed method is faster than the other methods in all experimental cases. Initial results also show that hybrid methods combining the proposed method with AHC + BIC obtain further improvement in terms of the F score.
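The merge-until-threshold loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the DFMM formula or the rule for adapting the threshold, so the sketch assumes (hypothetically) that the distance is the Euclidean distance between the column-wise means of two clusters' feature matrices, and it takes a fixed `threshold` argument in place of the adaptive one. The names `mean_vector`, `dfmm` and `cluster_segments` are illustrative only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def mean_vector(frames):
    """Column-wise mean of a feature matrix given as a list of equal-length rows."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def dfmm(cluster_a, cluster_b):
    """ASSUMED stand-in for DFMM: Euclidean distance between feature-matrix means."""
    return dist(mean_vector(cluster_a), mean_vector(cluster_b))


def cluster_segments(segments, threshold):
    """Merge the closest pair of clusters while their distance is below the threshold.

    segments:  list of feature matrices, one per voice segment
               (each a list of frame vectors).
    threshold: fixed stopping value (ASSUMPTION: the source uses an adaptive one).
    Returns a list of clusters, each a list of segment indices.
    """
    members = [[i] for i in range(len(segments))]  # segment indices per cluster
    frames = [list(seg) for seg in segments]       # pooled frames per cluster
    while len(frames) > 1:
        # Find the pair of clusters with the minimum distance.
        best = None
        for i in range(len(frames)):
            for j in range(i + 1, len(frames)):
                d = dfmm(frames[i], frames[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break  # the minimum distance is no longer below the threshold
        # Merge cluster j into cluster i and drop cluster j.
        frames[i] = frames[i] + frames[j]
        members[i] = members[i] + members[j]
        del frames[j]
        del members[j]
    return members


if __name__ == "__main__":
    # Toy data: two segments near the origin and two near (5, 5).
    toy = [
        [[0.0, 0.0], [0.2, 0.0]],
        [[0.1, 0.1]],
        [[5.0, 5.0], [5.2, 5.0]],
        [[5.1, 5.1]],
    ]
    print(cluster_segments(toy, threshold=1.0))
```

On the toy data the two nearby pairs are merged and the loop then stops, because the remaining two clusters are much farther apart than the assumed threshold. The exhaustive pair search recomputes every distance on each pass, which keeps the sketch short; it is not meant to reflect the speed claims made for the published method.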
http://iet.metastore.ingenta.com/content/journals/10.1049/iet-spr.2013.0340