Data-driven image captioning via salient region discovery

In the past few years, automatically generating descriptions for images has attracted considerable attention in computer vision and natural language processing research. Among existing approaches, data-driven methods have proven highly effective. These methods compare the given image against a large set of training images to determine a set of relevant images, and then generate a description using the associated captions. In this study, the authors propose to integrate an object-based semantic image representation into a deep features-based retrieval framework to select the relevant images. Moreover, they present a novel phrase selection paradigm and a sentence generation model that depends on a joint analysis of salient regions in the input and retrieved images within a clustering framework. The authors demonstrate the effectiveness of their proposed approach on the Flickr8K and Flickr30K benchmark datasets and show that their model gives highly competitive results compared with state-of-the-art models.
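To make the retrieval step concrete, the following is a minimal sketch of the nearest-neighbour search over deep image features on which data-driven captioning methods rely. It assumes CNN features have already been extracted offline and uses cosine similarity with an illustrative neighbourhood size k; these choices are assumptions for demonstration, not the authors' exact configuration.

    # Minimal sketch of the data-driven retrieval step: find the training images
    # whose deep features are closest to the query image and collect their captions.
    # Feature extraction is assumed to be done offline; the similarity measure and
    # the value of k are illustrative choices, not the authors' exact settings.
    import numpy as np

    def retrieve_relevant_captions(query_feat, train_feats, train_captions, k=5):
        # query_feat     : (d,)   deep feature vector of the input image
        # train_feats    : (n, d) deep feature vectors of the training images
        # train_captions : list of n caption lists, one per training image
        q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
        t = train_feats / (np.linalg.norm(train_feats, axis=1, keepdims=True) + 1e-12)
        sims = t @ q                    # cosine similarity to every training image
        top_k = np.argsort(-sims)[:k]   # indices of the k most similar images
        return [train_captions[i] for i in top_k]

The captions returned by such a retrieval step provide the raw material for the subsequent phrase selection and sentence generation stages, which in the authors' model are guided by the jointly analysed salient regions.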
