© The Institution of Engineering and Technology
A new multimodal object description network (MODN) model for dense captioning is proposed. The model comprises a vision module and a language module. In the vision module, a modified faster region-based convolutional neural network (Faster R-CNN) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. In contrast to existing methods, the proposed MODN framework adopts a multimodal layer that effectively extracts discriminative information from both object and semantic features. Moreover, MODN generates object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the well-known VOC2007 and Visual Genome datasets.
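The multimodal layer described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the dimensions, the tanh non-linearity, and all weight matrices are assumptions, and random weights stand in for learned parameters. The idea shown is the fusion step: object features (e.g. from a Faster R-CNN region) and semantic features (e.g. the language module's state) are projected into a shared multimodal space and mapped to a probability distribution over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the letter does not specify them.
D_OBJ, D_SEM, D_MULTI, V = 4096, 512, 1024, 10000

# Random weights stand in for learned projection parameters.
W_obj = rng.normal(0.0, 0.01, (D_OBJ, D_MULTI))   # object-feature projection
W_sem = rng.normal(0.0, 0.01, (D_SEM, D_MULTI))   # semantic-feature projection
W_out = rng.normal(0.0, 0.01, (D_MULTI, V))       # multimodal space -> vocabulary

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_word_distribution(obj_feat, sem_feat):
    """Fuse object and semantic features; return P(word) over the vocabulary."""
    fused = np.tanh(obj_feat @ W_obj + sem_feat @ W_sem)  # multimodal layer
    return softmax(fused @ W_out)

obj_feat = rng.normal(size=D_OBJ)   # stand-in for a Faster R-CNN region feature
sem_feat = rng.normal(size=D_SEM)   # stand-in for the language-module state
p = multimodal_word_distribution(obj_feat, sem_feat)
print(p.shape)                      # one probability per vocabulary word
```

At each decoding step the language module would pick (or sample) the next word from `p` and feed it back as part of the semantic features for the following step.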