
Text clustering algorithm based on deep representation learning

Text clustering is an important method for effectively organising, summarising, and navigating text information. However, because the text data to be clustered have no labels, they cannot be used directly to train a deep-learning text representation model. To address this problem, a text clustering algorithm based on deep representation learning is proposed; it combines transfer-learning domain adaptation with parameter updates during the clustering iterations. First, source-domain data are used to pre-train a deep learning classification model, which initialises the model parameters. Then, a domain discriminator is added to the model to predict which domain each input sample comes from. Once the discriminator can no longer tell which domain a sample belongs to, the model has obtained a feature space common to both domains, and the domain adaptation problem is solved. Finally, the text feature vectors produced by the model are clustered with the MCSKM++ algorithm. The algorithm not only resolves the model pre-training problem in unsupervised clustering, but also clusters well on the transfer problem that arises when the domains have different numbers of labels. Experiments suggest that its clustering accuracy is superior to that of other similar algorithms.
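The final step, clustering the learned feature vectors, can be illustrated with a small sketch. The abstract does not describe how MCSKM++ works, so the code below is not that algorithm. It is a generic spherical k-means (cosine similarity on unit-length vectors) with a k-means++-style seeding that samples new centres in proportion to cosine distance, used here as an assumed stand-in. The function names (`normalise`, `cosine`, `seed_centres`, `spherical_kmeans`) and the toy `features` are hypothetical and chosen only for illustration.

```python
import math
import random


def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def seed_centres(points, k, rng):
    """k-means++-style seeding using cosine distance (1 - cosine similarity)."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        # Distance from each point to its nearest already-chosen centre.
        dists = [max(0.0, 1.0 - max(cosine(p, c) for c in centres))
                 for p in points]
        total = sum(dists)
        if total == 0.0:  # every point coincides with an existing centre
            centres.append(rng.choice(points))
            continue
        # Sample the next centre with probability proportional to distance.
        threshold = rng.random() * total
        cumulative = 0.0
        for p, d in zip(points, dists):
            cumulative += d
            if cumulative > threshold:
                centres.append(p)
                break
    return centres


def spherical_kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors by cosine similarity; returns one label per vector."""
    rng = random.Random(seed)
    points = [normalise(v) for v in vectors]
    centres = seed_centres(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment: each point joins the centre it is most similar to.
        labels = [max(range(k), key=lambda j: cosine(p, centres[j]))
                  for p in points]
        # Update: each centre becomes the normalised sum of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = normalise([sum(col) for col in zip(*members)])
    return labels


# Toy "feature vectors": two point roughly along x, two roughly along y.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = spherical_kmeans(features, k=2)
```

In the pipeline the abstract describes, `features` would instead be the vectors produced by the domain-adapted model. The cluster numbering is arbitrary, so only which vectors share a label is meaningful.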

http://iet.metastore.ingenta.com/content/journals/10.1049/joe.2018.8282