Fast speaker clustering using distance of feature matrix mean and adaptive convergence threshold

The authors propose a fast speaker clustering method in which a distance, the distance of feature matrix mean (DFMM), is first defined to characterise the similarity between any two clusters, and an adaptive convergence threshold is then introduced to terminate the clustering procedure. If the minimum DFMM over all pairs of clusters is smaller than the threshold, the corresponding two clusters are merged. This merging is repeated until the minimum DFMM over all pairs of clusters is larger than the threshold. They conduct experiments on both shorter voice segments (≤ 3 s) and longer voice segments (> 3 s) to compare their method with two state-of-the-art methods: agglomerative hierarchical clustering with Bayesian information criterion (AHC + BIC) and vector quantisation with spectral clustering. The experiments show that their method achieves the best results for clustering shorter voice segments and obtains satisfactory results for clustering longer voice segments in comparison with the other two methods. Moreover, their method is faster than the other methods in all experimental cases. Initial results also show that hybrid methods combining their method with AHC + BIC further improve the F score.
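
The merge-until-threshold loop described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the DFMM formula or the rule for the adaptive threshold, so the code assumes that the DFMM is the Euclidean distance between the column-wise means of two feature matrices, and that the threshold is a hypothetical fraction `alpha` of the mean initial pairwise DFMM. The names `mean_vector`, `dfmm`, `cluster_segments` and `alpha` are all illustrative.

```python
import math


def mean_vector(frames):
    """Column-wise mean of a feature matrix given as a list of frame vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def dfmm(a, b):
    """ASSUMPTION: Euclidean distance between the mean vectors of two
    feature matrices. The paper's exact DFMM definition may differ."""
    return math.dist(mean_vector(a), mean_vector(b))


def cluster_segments(segments, alpha=0.5):
    """Greedy merging of voice segments until the minimum pairwise DFMM
    is no longer smaller than the threshold.

    ASSUMPTION: the threshold is alpha times the mean of the initial
    pairwise DFMMs; this rule is hypothetical, not taken from the paper.
    Returns a list of clusters, each a list of original segment indices.
    """
    clusters = [list(seg) for seg in segments]    # pooled frames per cluster
    members = [[i] for i in range(len(segments))]  # segment indices per cluster
    if len(clusters) < 2:
        return members

    initial = [
        dfmm(clusters[i], clusters[j])
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    ]
    threshold = alpha * sum(initial) / len(initial)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the assumed distance.
        d, i, j = min(
            (dfmm(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d >= threshold:
            break  # convergence: no pair is closer than the threshold
        # Merge cluster j into cluster i by pooling their frames.
        clusters[i].extend(clusters[j])
        members[i].extend(members[j])
        del clusters[j]
        del members[j]
    return members


if __name__ == "__main__":
    # Four toy "segments" with two-dimensional features, two per speaker.
    toy = [
        [[0.0, 0.0], [0.2, 0.0]],
        [[0.0, 0.2], [0.2, 0.2]],
        [[10.0, 10.0], [10.2, 10.0]],
        [[10.0, 10.2], [10.2, 10.2]],
    ]
    print(cluster_segments(toy))
```

Because each cluster is summarised by a single mean vector, a pairwise comparison in this sketch is cheap compared with fitting a model per cluster. Whether that matches the source of the speed-up reported by the authors is not stated in the abstract.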

Inspec keywords: vector quantisation; Bayes methods; convergence; matrix algebra; speaker recognition; pattern clustering

Other keywords: Fast Speaker Clustering; voice segments; AHC + BIC; F score; vector quantisation; speech corpora; agglomerative hierarchical clustering; Bayesian information criterion; distance of feature matrix mean; DFMM; adaptive convergence threshold

Subjects: Algebra; Speech recognition and synthesis; Speech processing techniques; Other topics in statistics

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-spr.2013.0340