Multimodal object description network for dense captioning

A new multimodal object description network (MODN) model for dense captioning is proposed. The proposed model consists of a vision module and a language module. In the vision module, a modified Faster region-based convolutional neural network (Faster R-CNN) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. Unlike existing methods, the proposed MODN framework adopts a multimodal layer that can effectively extract discriminative information from both object and semantic features. Moreover, MODN generates object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the well-known VOC2007 and Visual Genome datasets.
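The abstract does not give implementation details, so the following is only a minimal, hypothetical sketch of how a multimodal layer of this kind might fuse the two feature streams. It assumes a PyTorch setting with invented dimensions (4096-d region features, 512-d language-model state, a 10000-word vocabulary) and a simple project-add-tanh fusion; the class and parameter names are illustrative, not the authors' code.

```python
# Hypothetical sketch (not the authors' implementation): a minimal multimodal
# fusion layer that combines an object (visual) feature from the vision module
# with a semantic (language) feature and outputs per-word log-probabilities.
import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    """Fuses object features with semantic features and maps the result
    to a probability distribution over the vocabulary."""
    def __init__(self, obj_dim=4096, sem_dim=512, fused_dim=1024, vocab_size=10000):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, fused_dim)   # project region feature
        self.sem_proj = nn.Linear(sem_dim, fused_dim)   # project language state
        self.classifier = nn.Linear(fused_dim, vocab_size)

    def forward(self, obj_feat, sem_feat):
        # obj_feat: (batch, obj_dim) pooled feature of one detected region
        # sem_feat: (batch, sem_dim) language-model state at the current word
        fused = torch.tanh(self.obj_proj(obj_feat) + self.sem_proj(sem_feat))
        return torch.log_softmax(self.classifier(fused), dim=-1)

# Usage: one word-prediction step for a batch of two detected regions
layer = MultimodalLayer()
obj = torch.randn(2, 4096)
sem = torch.randn(2, 512)
log_probs = layer(obj, sem)   # shape (2, 10000)
```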

Inspec keywords: visual databases; computer vision; feature extraction; object detection; statistical distributions

Other keywords: R-CNN; feature extraction; vision module; probability distribution; object features; VOC2007 dataset; Visual Genome dataset; language module; discriminant information extraction; salient object detection; MODN model; multimodal layer; semantic features; multimodal object description network; dense captioning

Subjects: Other topics in statistics; Image recognition; Spatial and pictorial databases; Computer vision and image processing techniques
