Data-driven image captioning via salient region discovery

IET Computer Vision


In the past few years, automatically generating descriptions for images has attracted considerable attention in computer vision and natural language processing research. Among the existing approaches, data-driven methods have proven highly effective. These methods compare the given image against a large set of training images to determine a set of relevant images, then generate a description using the associated captions. In this study, the authors propose to integrate an object-based semantic image representation into a deep feature-based retrieval framework to select the relevant images. Moreover, they present a novel phrase selection paradigm and a sentence generation model that relies on a joint analysis of salient regions in the input and retrieved images within a clustering framework. The authors demonstrate the effectiveness of their proposed approach on the Flickr8K and Flickr30K benchmark datasets and show that their model yields highly competitive results compared with state-of-the-art models.
