Uniform attribute-content model

There is a growing need for text processing tasks such as classification, retrieval, and clustering. The foundation of such processing is extracting features that best describe the text. Great progress has been made in text modelling; however, most text modelling methods are based only on the content or only on the attributes. Although some combined models have been proposed in recent years, their lack of universality limits their applicability. In this study, the authors propose a uniform attribute-content model, which uses the attributes to influence the content feature extraction process. They design the attributes as a special filter applied to each feature extracted from the content, so that the mixed features contain both content information and attribute information and can describe the text more precisely. They also propose a Monte Carlo method to solve this model. Experimental results on the Enron email dataset demonstrate the effectiveness of the proposed models.
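The core idea of the abstract — an attribute-derived filter applied element-wise to each content feature to produce mixed features — can be sketched as follows. This is a minimal illustrative sketch, not the paper's formulation: the sigmoid filtering form, the random projection standing in for learned filter parameters, and all function names are assumptions.

```python
import numpy as np

def attribute_filter(attributes, n_features, rng=None):
    """Map a document's attribute vector to per-feature weights in (0, 1).

    A fixed random projection stands in here for the learned filter
    parameters; the sigmoid keeps each weight strictly between 0 and 1.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.normal(size=(len(attributes), n_features))  # hypothetical filter parameters
    return 1.0 / (1.0 + np.exp(-(attributes @ W)))      # sigmoid squashing

def mixed_features(content_features, attributes):
    """Modulate content features (e.g. TF-IDF values) by the attribute filter."""
    weights = attribute_filter(attributes, len(content_features))
    return content_features * weights

# Toy example: a 4-dimensional content feature vector for one email,
# with 3 binary attributes (e.g. sender group, time-of-day, has-attachment).
content = np.array([0.2, 0.0, 0.5, 0.3])
attrs = np.array([1.0, 0.0, 1.0])
print(mixed_features(content, attrs))  # mixed vector, same length as content
```

The mixed vector has the same dimensionality as the content features, so it can feed directly into any downstream classifier or clustering method; only the per-dimension scale now reflects the attributes.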

Inspec keywords: Monte Carlo methods; feature extraction; text analysis; information retrieval

Other keywords: content information; attribute information; content feature extraction process; uniform attribute-content model; text processing; text modelling methods; Monte Carlo method

Subjects: Information retrieval techniques; Document processing and analysis techniques; Monte Carlo methods; Knowledge engineering techniques

http://iet.metastore.ingenta.com/content/journals/10.1049/joe.2018.5135