A new cascade recurrent neural network (CRNN) for image caption generation is proposed. Unlike the classical multimodal recurrent neural network, which uses a single network to extract unidirectional syntactic features, CRNN adopts a cascade network that learns visual-language interactions in both the forward and backward directions and can thereby exploit the deep semantic context contained in the image. In the proposed framework, two embedding layers are constructed for dense word representation, and a new stacked gated recurrent unit (GRU) is designed for learning image-word mappings. The effectiveness of the CRNN model is verified on the commonly used MSCOCO dataset, where the results indicate that CRNN can achieve better performance than state-of-the-art image captioning methods such as Google NIC and the multimodal recurrent neural network.
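As a rough illustration of the ingredients named in the abstract (word embedding, a stacked GRU, and forward and backward passes over the caption), a minimal pure-Python sketch follows. It is not the authors' implementation. Every name and size in it is hypothetical, the two embedding layers are read here as a lookup table followed by a dense projection, biases are omitted, the weights are random rather than trained, and the image feature and the way the two directions are cascaded are left out because the abstract does not specify them.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Single GRU cell without biases (a simplification for brevity)."""
    def __init__(self, in_dim, hid_dim, rng):
        # Input (W) and recurrent (U) weights for the update gate z,
        # the reset gate r and the candidate state n.
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

def run_stacked(cells, inputs, hid_dim):
    # Each layer consumes the hidden state of the layer below it.
    states = [[0.0] * hid_dim for _ in cells]
    outputs = []
    for x in inputs:
        for i, cell in enumerate(cells):
            states[i] = cell.step(x, states[i])
            x = states[i]
        outputs.append(x)
    return outputs

def embed(word_ids, table, proj):
    # Assumed reading of "two embedding layers": lookup, then dense projection.
    return [[math.tanh(v) for v in matvec(proj, table[w])] for w in word_ids]

def encode_caption(word_ids, table, proj, fwd_cells, bwd_cells, hid_dim):
    xs = embed(word_ids, table, proj)
    fwd = run_stacked(fwd_cells, xs, hid_dim)
    # Run the backward stack on the reversed caption, then re-align its outputs.
    bwd = run_stacked(bwd_cells, xs[::-1], hid_dim)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]  # concatenate both directions per word

rng = random.Random(0)
VOCAB, EMB, HID = 10, 4, 3  # hypothetical toy sizes
table = rand_matrix(VOCAB, EMB, rng)
proj = rand_matrix(EMB, EMB, rng)
fwd_cells = [GRUCell(EMB, HID, rng), GRUCell(HID, HID, rng)]
bwd_cells = [GRUCell(EMB, HID, rng), GRUCell(HID, HID, rng)]
features = encode_caption([1, 5, 2, 7], table, proj, fwd_cells, bwd_cells, HID)
print(len(features), len(features[0]))
```

For the four-word toy caption, the sketch yields one feature vector per word whose length is twice the hidden size, since the forward and backward states are concatenated. A real captioning model would additionally condition on CNN image features and train the weights on caption data.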

