Data-driven image captioning via salient region discovery
- Author(s): Mert Kilickaya 1; Burak Kerim Akkus 2; Ruket Cakici 2; Aykut Erdem 1; Erkut Erdem 1; Nazli Ikizler-Cinbis 1
- Affiliations:
  1: Department of Computer Engineering, Hacettepe University, Ankara, Turkey
  2: Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
- Source: IET Computer Vision, Volume 11, Issue 6, September 2017, pp. 398–406
- DOI: 10.1049/iet-cvi.2016.0286; Print ISSN 1751-9632; Online ISSN 1751-9640
In the past few years, automatically generating descriptions for images has attracted considerable attention in computer vision and natural language processing research. Among the existing approaches, data-driven methods have proven to be highly effective. These methods compare the given image against a large set of training images to determine a set of relevant images, then generate a description using the associated captions. In this study, the authors propose to integrate an object-based semantic image representation into a deep feature-based retrieval framework to select the relevant images. Moreover, they present a novel phrase selection paradigm and a sentence generation model that depends on a joint analysis of salient regions in the input and retrieved images within a clustering framework. The authors demonstrate the effectiveness of their proposed approach on the Flickr8K and Flickr30K benchmark datasets and show that their model gives highly competitive results compared with state-of-the-art models.
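The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that deep CNN features (e.g. VGG-style activations) and object-based semantic vectors have already been extracted for the query and training images, retrieves the nearest training images by a blended cosine similarity, and returns their captions as the candidate phrase pool. All names (deep_feats, sem_feats, alpha, etc.) are hypothetical.

```python
# Minimal sketch of the data-driven retrieval step (assumed precomputed features).
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Row-normalize feature vectors so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_relevant(query_deep, query_sem, deep_feats, sem_feats,
                      captions, k=5, alpha=0.5):
    """Return captions of the k training images most similar to the query.

    Similarity is a weighted blend of a deep-feature cue and an object-based
    semantic cue; `alpha` is an assumed mixing weight, not taken from the paper.
    """
    deep_sim = l2_normalize(deep_feats) @ (query_deep / np.linalg.norm(query_deep))
    sem_sim = l2_normalize(sem_feats) @ (query_sem / np.linalg.norm(query_sem))
    sim = alpha * deep_sim + (1 - alpha) * sem_sim
    top_k = np.argsort(-sim)[:k]
    return [captions[i] for i in top_k], top_k

# Toy usage with random features standing in for real CNN / semantic vectors.
rng = np.random.default_rng(0)
deep_feats = rng.normal(size=(100, 4096))   # e.g. fc7-like activations
sem_feats = rng.normal(size=(100, 200))     # e.g. object-category scores
captions = [f"caption of training image {i}" for i in range(100)]
relevant, idx = retrieve_relevant(deep_feats[3], sem_feats[3],
                                  deep_feats, sem_feats, captions, k=5)
print(relevant)
```

The retrieved captions would then feed the phrase selection and clustering-based sentence generation stages; those stages are not sketched here.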
Inspec keywords: image representation; pattern clustering; text analysis; visual databases; image retrieval; feature extraction
Other keywords: clustering framework; input images; salient region discovery; sentence generation model; training images; phrase selection paradigm; data-driven image captioning; object-based semantic image representation; Flickr30K benchmark dataset; retrieved images; Flickr8K benchmark dataset; deep feature-based retrieval framework
Subjects: Computer vision and image processing techniques; Information retrieval techniques; Spatial and pictorial databases; Document processing and analysis techniques; Image recognition; Natural language interfaces