Cascade recurrent neural network for image caption generation


A new cascade recurrent neural network (CRNN) for image caption generation is proposed. Unlike the classical multimodal recurrent neural network, which uses only a single network to extract unidirectional syntactic features, the CRNN adopts a cascade network that learns visual-language interactions in both the forward and backward directions, thereby exploiting the deep semantic context contained in the image. In the proposed framework, two embedding layers for dense word representation are constructed, and a new stacked gated recurrent unit (GRU) is designed for learning image-word mappings. The effectiveness of the CRNN model is verified on the commonly used MSCOCO dataset, where the results indicate that the CRNN achieves better performance than state-of-the-art image captioning methods such as Google NIC and the multimodal recurrent neural network.
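The abstract gives no implementation details, so the following is only an illustrative NumPy sketch of the general idea it describes: a GRU run over dense word embeddings in the forward direction, a second GRU run in the backward direction, and the two hidden-state sequences fused into a bidirectional context. All layer sizes, initialisations, and function names here are hypothetical, not the authors' code.

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: update gate z, reset gate r, candidate state."""
    z = 1 / (1 + np.exp(-(x @ Wz + h @ Uz)))        # update gate
    r = 1 / (1 + np.exp(-(x @ Wr + h @ Ur)))        # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)        # candidate hidden state
    return (1 - z) * h + z * h_tilde                # interpolated new state

def run_gru(xs, d_hid, rng):
    """Run a GRU over a sequence of embeddings; return all hidden states."""
    d_in = xs.shape[1]
    W = [rng.standard_normal((d_in, d_hid)) * 0.1 for _ in range(3)]
    U = [rng.standard_normal((d_hid, d_hid)) * 0.1 for _ in range(3)]
    h = np.zeros(d_hid)
    states = []
    for x in xs:
        h = gru_cell(x, h, W[0], U[0], W[1], U[1], W[2], U[2])
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_emb, d_hid = 5, 8, 16                      # caption length, sizes (arbitrary)
words = rng.standard_normal((T, d_emb))         # dense word embeddings

h_fwd = run_gru(words, d_hid, rng)              # forward syntactic context
h_bwd = run_gru(words[::-1], d_hid, rng)[::-1]  # backward context, re-aligned
context = np.concatenate([h_fwd, h_bwd], axis=1)  # fused bidirectional features
print(context.shape)  # -> (5, 32)
```

In a full captioning model the fused context would be combined with CNN image features before predicting the next word; here the sketch only shows how the forward and backward recurrences are cascaded and concatenated per time step.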


1. Donahue, J., Hendricks, L.A., Guadarrama, S., et al.: 'Long-term recurrent convolutional networks for visual recognition and description'. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015
2. Mao, J., Xu, W., Yang, Y., et al.: 'Deep captioning with multimodal recurrent neural networks (m-RNN)'. Int. Conf. on Learning Representations (ICLR), 2015
3. Vinyals, O., Toshev, A., Bengio, S., et al.: 'Show and tell: a neural image caption generator'. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156-3164
4. Karpathy, A., Fei-Fei, L.: 'Deep visual-semantic alignments for generating image descriptions'. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015
5. Lin, T.-Y., Maire, M., Belongie, S., et al.: 'Microsoft COCO: common objects in context', 2014, arXiv preprint arXiv:1405.0312
6. Simonyan, K., Zisserman, A.: 'Very deep convolutional networks for large-scale image recognition', 2015, arXiv:1409.1556
7. Szegedy, C., Vanhoucke, V., Ioffe, S., et al.: 'Rethinking the inception architecture for computer vision', 2015, arXiv:1512.00567
8. Chung, J., Gulcehre, C., Cho, K., et al.: 'Empirical evaluation of gated recurrent neural networks on sequence modeling', 2014, arXiv preprint
9. Papineni, K., Roukos, S., Ward, T., et al.: 'BLEU: a method for automatic evaluation of machine translation'. Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), 2002, pp. 311-318
10. Chen, X., Fang, H., Lin, T.-Y., et al.: 'Microsoft COCO captions: data collection and evaluation server', 2015, arXiv:1504.00325
11. Jia, X., Gavves, E., Fernando, B., et al.: 'Guiding the long-short term memory model for image caption generation'. IEEE Int. Conf. on Computer Vision (ICCV), 2015
12. Xu, K., Ba, J.L., Kiros, R., et al.: 'Show, attend and tell: neural image caption generation with visual attention', 2016, arXiv:1502.03044v3 [cs.LG]
