© The Institution of Engineering and Technology
A new cascade recurrent neural network (CRNN) for image caption generation is proposed. Unlike the classical multimodal recurrent neural network, which uses a single network to extract unidirectional syntactic features, CRNN adopts a cascade network that learns visual-language interactions in both the forward and backward directions, allowing it to exploit the deep semantic context contained in the image. The proposed framework constructs two embedding layers for dense word representation, and a new stacked Gated Recurrent Unit is designed to learn image-word mappings. The effectiveness of the CRNN model is verified on the commonly used MSCOCO dataset, where the results indicate that CRNN achieves better performance than state-of-the-art image captioning methods such as Google NIC and the multimodal recurrent neural network.
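The abstract gives no equations, so the following is only a minimal sketch of the general idea of reading a caption with stacked GRU layers in both the forward and backward directions. It is not the CRNN architecture itself: the image features, the two embedding layers and the cascade wiring are omitted, and every name (`GRUCell`, `run_stack`, `encode_both_directions`), the layer sizes, the random weights and the choice to concatenate the two directions are assumptions made for illustration. The gate equations follow the standard GRU update, with biases left out for brevity.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell with untrained random weights (illustrative only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One input and one recurrent matrix per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_size, hidden_size, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]


def run_stack(cells, inputs):
    """Feed a sequence through stacked GRU layers; return the top-layer states."""
    seq = inputs
    for cell in cells:
        h = [0.0] * cell.hidden_size
        out = []
        for x in seq:
            h = cell.step(x, h)
            out.append(h)
        seq = out  # the next layer reads this layer's hidden states
    return seq


def encode_both_directions(word_ids, embedding, fwd_cells, bwd_cells):
    xs = [embedding[i] for i in word_ids]
    fwd = run_stack(fwd_cells, xs)
    # Run the backward stack on the reversed sequence, then re-align to word order.
    bwd = run_stack(bwd_cells, xs[::-1])[::-1]
    # Assumption: the two directions are combined by concatenation.
    return [f + b for f, b in zip(fwd, bwd)]


rng = random.Random(0)
vocab_size, emb_dim, hidden = 10, 4, 3  # hypothetical toy sizes
embedding = rand_matrix(vocab_size, emb_dim, rng)
fwd_cells = [GRUCell(emb_dim, hidden, rng), GRUCell(hidden, hidden, rng)]
bwd_cells = [GRUCell(emb_dim, hidden, rng), GRUCell(hidden, hidden, rng)]
states = encode_both_directions([1, 5, 2, 7], embedding, fwd_cells, bwd_cells)
```

Each word position ends up with one state vector of length `2 * hidden`, so every position carries context from the words both before and after it. In a captioning model, such states would additionally be conditioned on image features, which this sketch leaves out.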
http://iet.metastore.ingenta.com/content/journals/10.1049/el.2017.3159