Data-driven image captioning via salient region discovery
- Author(s): Mert Kilickaya 1; Burak Kerim Akkus 2; Ruket Cakici 2; Aykut Erdem 1; Erkut Erdem 1; Nazli Ikizler-Cinbis 1
- Affiliations:
  1: Department of Computer Engineering, Hacettepe University, Ankara, Turkey
  2: Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
- Source: IET Computer Vision, Volume 11, Issue 6, September 2017, pp. 398–406
- DOI: 10.1049/iet-cvi.2016.0286; Print ISSN 1751-9632; Online ISSN 1751-9640
In the past few years, automatically generating descriptions for images has attracted considerable attention in computer vision and natural language processing research. Among the existing approaches, data-driven methods have proven to be highly effective. These methods compare the given image against a large set of training images to determine a set of relevant images, then generate a description using the associated captions. In this study, the authors propose to integrate an object-based semantic image representation into a deep feature-based retrieval framework to select the relevant images. Moreover, they present a novel phrase selection paradigm and a sentence generation model that depends on a joint analysis of salient regions in the input and retrieved images within a clustering framework. The authors demonstrate the effectiveness of their proposed approach on the Flickr8K and Flickr30K benchmark datasets and show that their model gives highly competitive results compared with state-of-the-art models.
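The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that deep CNN features (e.g. VGG-style activations) and object-based semantic vectors have already been extracted for the query and training images, retrieves the nearest training images by a blended cosine similarity, and returns their captions as the candidate phrase pool. All names (deep_feats, sem_feats, alpha, etc.) are hypothetical.

```python
# Minimal sketch of the data-driven retrieval step (assumed precomputed features).
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Row-normalize feature vectors so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_relevant(query_deep, query_sem, deep_feats, sem_feats,
                      captions, k=5, alpha=0.5):
    """Return captions of the k training images most similar to the query.

    Similarity is a weighted blend of a deep-feature cue and an object-based
    semantic cue; `alpha` is an assumed mixing weight, not taken from the paper.
    """
    deep_sim = l2_normalize(deep_feats) @ (query_deep / np.linalg.norm(query_deep))
    sem_sim = l2_normalize(sem_feats) @ (query_sem / np.linalg.norm(query_sem))
    sim = alpha * deep_sim + (1 - alpha) * sem_sim
    top_k = np.argsort(-sim)[:k]
    return [captions[i] for i in top_k], top_k

# Toy usage with random features standing in for real CNN / semantic vectors.
rng = np.random.default_rng(0)
deep_feats = rng.normal(size=(100, 4096))   # e.g. fc7-like activations
sem_feats = rng.normal(size=(100, 200))     # e.g. object-category scores
captions = [f"caption of training image {i}" for i in range(100)]
relevant, idx = retrieve_relevant(deep_feats[3], sem_feats[3],
                                  deep_feats, sem_feats, captions, k=5)
print(relevant)
```

The retrieved captions would then feed the phrase selection and clustering-based sentence generation stages; those stages are not sketched here.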
Inspec keywords: image representation; pattern clustering; text analysis; visual databases; image retrieval; feature extraction
Other keywords: clustering framework; input images; salient region discovery; sentence generation model; training images; phrase selection paradigm; data-driven image captioning; object-based semantic image representation; Flickr30K benchmark dataset; retrieved images; Flickr8K benchmark dataset; deep feature-based retrieval framework
Subjects: Computer vision and image processing techniques; Information retrieval techniques; Spatial and pictorial databases; Document processing and analysis techniques; Image recognition; Natural language interfaces