Multi-task learning for captioning images with novel words


Recent captioning models are limited in their ability to describe concepts that never appear in paired image–sentence data. This study presents a multi-task learning framework for describing novel words that are not present in existing image-captioning datasets. The framework takes advantage of external sources: labelled images from image-classification datasets and semantic knowledge extracted from annotated text. The authors propose minimising a joint objective that learns from these diverse data sources and leverages distributional semantic embeddings. At inference time, they modify the beam-search step to consider both the caption model and a language model, enabling the model to generalise to novel words outside the image-captioning datasets. They demonstrate that adding annotated text data within the framework helps the image-captioning model describe images with the correct corresponding novel words. Extensive experiments are conducted on the AI Challenger and Microsoft COCO (MSCOCO) image-captioning datasets, which cover two different languages, demonstrating the framework's ability to describe novel words such as scenes and objects.
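The modified inference step described above can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes hypothetical, context-conditioned scoring tables standing in for the caption model and language model, and shows how beam search can rank candidates by a weighted sum of the two models' log-probabilities so that a novel word known only to the language model ("giraffe" in this toy example) can still enter the caption.

```python
# Toy vocabulary; "giraffe" stands in for a novel word absent from the
# paired captioning data but covered by the external language model.
VOCAB = ["a", "giraffe", "eating", "<eos>"]


def caption_logprob(seq, word):
    # Hypothetical stand-in for the caption model p(word | image, seq).
    # It assigns a low score to the novel word "giraffe".
    prev = seq[-1] if seq else "<bos>"
    table = {("<bos>", "a"): -0.4, ("a", "giraffe"): -4.0,
             ("giraffe", "eating"): -0.6, ("eating", "<eos>"): -0.5}
    return table.get((prev, word), -5.0)


def lm_logprob(seq, word):
    # Hypothetical stand-in for the language model p(word | seq),
    # trained on text that does contain "giraffe".
    prev = seq[-1] if seq else "<bos>"
    table = {("<bos>", "a"): -0.3, ("a", "giraffe"): -0.5,
             ("giraffe", "eating"): -0.4, ("eating", "<eos>"): -0.3}
    return table.get((prev, word), -5.0)


def beam_search(beam_size=2, max_len=4, alpha=0.5):
    """Beam search scoring each candidate by a weighted sum of the
    caption-model and language-model log-probabilities."""
    beams = [([], 0.0)]  # (token sequence, cumulative score)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished hypothesis
                continue
            for w in VOCAB:
                s = score + alpha * caption_logprob(seq, w) \
                          + (1 - alpha) * lm_logprob(seq, w)
                candidates.append((seq + [w], s))
        # Keep only the top-scoring hypotheses.
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams[0][0]


print(beam_search())  # the language model rescues the novel word "giraffe"
```

With the caption model alone, "giraffe" would be pruned early because of its low score; the interpolated language-model term keeps the hypothesis on the beam, which is the intuition behind combining the two models at inference time.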

