Multi-task learning for captioning images with novel words

Recent captioning models are limited in their ability to describe concepts that do not appear in paired image–sentence training data. This study presents a multi-task learning framework for describing novel words absent from existing image-captioning datasets. The authors' framework takes advantage of external sources: labelled images from image classification datasets, and semantic knowledge extracted from annotated text. They propose minimising a joint objective that learns from these diverse data sources and leverages distributional semantic embeddings. At inference, they modify the beam-search step to consider both the caption model and a language model, enabling the model to generalise to novel words outside the image-captioning datasets. They show that adding annotated text data within this framework helps the image captioning model describe images with the correct novel words. Extensive experiments on the AI Challenger and Microsoft COCO (MSCOCO) image-captioning datasets, in two different languages, demonstrate the framework's ability to describe novel words such as scenes and objects.

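For concreteness, the inference-time change described in the abstract can be read as a re-scored beam search. The Python sketch below is not the authors' implementation; it only assumes that the caption model and the language model each expose per-token log-probabilities (the names caption_log_probs, lm_log_probs and the mixing weight lm_weight are illustrative), and it blends the two scores when ranking beam expansions so that words known only to the language model can still surface in a caption.

import heapq
import math
from typing import Callable, List, Tuple

def beam_search_step(
    beams: List[Tuple[List[int], float]],                    # (token sequence, cumulative log-score)
    caption_log_probs: Callable[[List[int]], List[float]],   # per-vocabulary log P from the caption model
    lm_log_probs: Callable[[List[int]], List[float]],        # per-vocabulary log P from the language model
    beam_size: int = 5,
    lm_weight: float = 0.3,                                  # hypothetical mixing weight, not from the paper
) -> List[Tuple[List[int], float]]:
    # Expand every partial caption by one token; each candidate's score adds a
    # weighted blend of caption-model and language-model log-probabilities.
    candidates = []
    for seq, score in beams:
        cap_scores = caption_log_probs(seq)
        lm_scores = lm_log_probs(seq)
        for token, (cp, lp) in enumerate(zip(cap_scores, lm_scores)):
            blended = (1.0 - lm_weight) * cp + lm_weight * lp
            candidates.append((seq + [token], score + blended))
    # Keep only the top-scoring expansions for the next decoding step.
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[1])

if __name__ == "__main__":
    # Toy usage: uniform dummy models over a 4-word vocabulary, 3 decode steps.
    vocab = 4
    uniform = lambda seq: [math.log(1.0 / vocab)] * vocab
    beams = [([0], 0.0)]          # single beam starting from a <BOS> token id
    for _ in range(3):
        beams = beam_search_step(beams, uniform, uniform)
    print(beams[0])

In this reading, lm_weight trades off fluency learned from external text against fidelity to the paired captioning data; the paper's actual combination rule and weighting are not specified in the abstract.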
Inspec keywords: knowledge acquisition; learning (artificial intelligence); image segmentation; computer vision; natural language processing; text analysis; image classification

Other keywords: recent captioning models; image classification datasets; multitask learning; caption model; diverse data sources; image-captioning datasets; language model; external sources-labelled images; leverage distributional semantic embeddings; annotated text data; paired image-sentence pairs; image captioning model; novel words

Subjects: Image recognition; Knowledge engineering techniques; Document processing and analysis techniques; Computer vision and image processing techniques; Natural language interfaces

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cvi.2018.5005