© The Institution of Engineering and Technology
A new multimodal object description network (MODN) model for dense captioning is proposed. The model comprises a vision module and a language module. In the vision module, a modified faster region-based convolutional neural network (Faster R-CNN) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. In contrast to existing methods, the proposed MODN framework adopts a multimodal layer that effectively extracts discriminative information from both object and semantic features. Moreover, MODN generates object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the well-known VOC2007 and Visual Genome datasets.
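The multimodal layer described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the dimensions, the tanh non-linearity, and all weight matrices are assumptions, and random weights stand in for learned parameters. The idea shown is the fusion step: object features (e.g. from a Faster R-CNN region) and semantic features (e.g. the language module's state) are projected into a shared multimodal space and mapped to a probability distribution over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the letter does not specify them.
D_OBJ, D_SEM, D_MULTI, V = 4096, 512, 1024, 10000

# Random weights stand in for learned projection parameters.
W_obj = rng.normal(0.0, 0.01, (D_OBJ, D_MULTI))   # object-feature projection
W_sem = rng.normal(0.0, 0.01, (D_SEM, D_MULTI))   # semantic-feature projection
W_out = rng.normal(0.0, 0.01, (D_MULTI, V))       # multimodal space -> vocabulary

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_word_distribution(obj_feat, sem_feat):
    """Fuse object and semantic features; return P(word) over the vocabulary."""
    fused = np.tanh(obj_feat @ W_obj + sem_feat @ W_sem)  # multimodal layer
    return softmax(fused @ W_out)

obj_feat = rng.normal(size=D_OBJ)   # stand-in for a Faster R-CNN region feature
sem_feat = rng.normal(size=D_SEM)   # stand-in for the language-module state
p = multimodal_word_distribution(obj_feat, sem_feat)
print(p.shape)                      # one probability per vocabulary word
```

At each decoding step the language module would pick (or sample) the next word from `p` and feed it back as part of the semantic features for the following step.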