Text clustering algorithm based on deep representation learning

Binyu Wang; Wenfen Liu; Zijie Lin; Xuexian Hu; Jianghong Wei; Chun Liu

Author(s): Binyu Wang¹; Wenfen Liu²; Zijie Lin¹; Xuexian Hu¹; Jianghong Wei¹; Chun Liu¹
- Affiliations: 1: State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, People's Republic of China;
  2: Guangxi Key Laboratory of Cryptography and Information Security, Guilin University of Electronic Technology, 541000 Guilin, People's Republic of China
Source: Volume 2018, Issue 16, November 2018, pp. 1407–1414
DOI: 10.1049/joe.2018.8282, Online ISSN 2051-3305

This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)

Received 18/07/2018, Accepted 26/07/2018, Published 16/08/2018

Text clustering is an important method for effectively organising, summarising, and navigating text information. However, because the text data to be clustered carry no labels, they cannot be used directly to train a text representation model based on deep learning. To address this problem, a text clustering algorithm based on deep representation learning is proposed, which combines transfer-learning domain adaptation with parameter updates during the clustering iterations. First, source-domain data are used to pre-train a deep learning classification model; this step initialises the model parameters. Then, a domain discriminator is added to the model to predict which domain each input sample comes from. When the discriminator can no longer distinguish which domain the data belong to, the model has obtained a feature space common to the two domains, which solves the domain adaptation problem. Finally, the text feature vectors produced by the model are clustered with the MCSKM++ algorithm. The algorithm not only resolves the model pre-training problem in unsupervised clustering, but also clusters well under the transfer problem that arises when the domains have different numbers of labels. Experiments suggest that the clustering accuracy of the algorithm is superior to that of other similar algorithms.
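
The final clustering step can be illustrated with a small sketch. The abstract does not describe MCSKM++ itself, so the code below is not the authors' algorithm: it is a generic spherical k-means (k-means on unit-length vectors with cosine similarity) with k-means++-style seeding, offered only as an assumption about the general family of method. The function names `normalize`, `cosine`, `seed_centers`, and `spherical_kmeans` are hypothetical, and the input vectors stand in for the feature vectors a trained model would produce.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def seed_centers(points, k, rng):
    """k-means++-style seeding using cosine distance (1 - cosine similarity).

    Assumption: a generic seeding rule, not the MCSKM++ initialisation.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Distance from each point to its closest already-chosen center.
        dists = [max(0.0, 1.0 - max(cosine(p, c) for c in centers)) for p in points]
        if sum(dists) == 0.0:
            # Every point coincides with an existing center; pick any point.
            centers.append(rng.choice(points))
        else:
            # Sample the next center with probability proportional to distance.
            centers.append(rng.choices(points, weights=dists)[0])
    return centers


def spherical_kmeans(vectors, k, iters=50, seed=0):
    """Cluster vectors by cosine similarity; returns (labels, unit-length centers)."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centers = seed_centers(points, k, rng)
    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its most similar center.
        new_labels = [max(range(k), key=lambda j: cosine(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not change
        labels = new_labels
        # Update step: each center becomes the re-normalised mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[j] = normalize(mean)
    return labels, centers
```

Calling `spherical_kmeans(features, k)` on a list of equal-length feature vectors returns one cluster index per vector. Normalising both the points and the centers means that only the direction of a feature vector matters, which is the usual motivation for cosine-based clustering of text representations.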

