Subgraph and object context-masked network for scene graph generation

Scene graph generation recognises objects and their semantic relationships in an image, helping computers understand a visual scene. Geometry information is essential for relationship prediction and is usually incorporated into relationship features. Existing methods encode spatial layout from object coordinates, but in doing so they neglect the context of objects. In this study, to exploit spatial knowledge fully and efficiently, the authors propose a novel subgraph and object context-masked network (SOCNet), consisting of spatial mask relation inference (SMRI) and hierarchical message passing (HMP) modules, to address the scene graph generation task. Specifically, SMRI exploits spatial knowledge by masking part of the context of object features according to the spatial layout of the objects and their corresponding subgraph, which facilitates relationship recognition. To refine the features of objects and subgraphs, the authors also propose HMP, which passes highly correlated messages at both microcosmic and macroscopic levels through a triple-path structure comprising subgraph–subgraph, object–object, and subgraph–object paths. Finally, statistical co-occurrence probability is used to regularise relationship prediction. SOCNet integrates HMP and SMRI into a unified network, and comprehensive experiments on the Visual Relationship Detection and Visual Genome datasets indicate that SOCNet outperforms several state-of-the-art methods on two common tasks.
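The spatial-masking idea behind SMRI can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`union_box`, `spatial_mask`), the grid representation, and the hard binary mask are all illustrative assumptions. The sketch shows one plausible reading of the mechanism — keep object-context features only where they fall inside the subgraph (union) box of a subject–object pair, and suppress context elsewhere.

```python
# Hedged sketch (not the paper's code): spatial context masking in the
# spirit of SMRI. A subject box and an object box define a subgraph
# (union) box; feature cells outside that box are zeroed out, so the
# relationship classifier sees only spatially relevant context.

def union_box(box_a, box_b):
    """Smallest box (x1, y1, x2, y2) covering both input boxes."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def spatial_mask(feature_map, box, cell):
    """Zero every feature cell whose centre falls outside `box`.

    `feature_map` is an H x W grid of floats; `cell` is the stride of
    one grid cell in pixels (illustrative assumption).
    """
    x1, y1, x2, y2 = box
    masked = []
    for row_idx, row in enumerate(feature_map):
        cy = (row_idx + 0.5) * cell            # cell-centre y in pixels
        masked_row = []
        for col_idx, value in enumerate(row):
            cx = (col_idx + 0.5) * cell        # cell-centre x in pixels
            inside = x1 <= cx <= x2 and y1 <= cy <= y2
            masked_row.append(value if inside else 0.0)
        masked.append(masked_row)
    return masked

# Toy example: a 4x4 context-feature grid with 8-pixel cells.
features = [[1.0] * 4 for _ in range(4)]
subject_box, object_box = (0, 0, 12, 12), (8, 8, 24, 24)
subgraph_box = union_box(subject_box, object_box)      # (0, 0, 24, 24)
masked = spatial_mask(features, subgraph_box, cell=8)
```

In a real pipeline this mask would be applied per channel to convolutional features before relationship classification; the paper's actual masking is learned from the layout rather than a fixed binary crop, so this sketch only conveys the geometric intuition.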

Inspec keywords: genomics; data visualisation; bioinformatics; graph theory; probability; message passing

Other keywords: context-masked network; semantic relationships; relationship recognition; visual scene; spatial knowledge; unified network; relationship prediction; geometry information; spatial layout; SMRI mask partial context; SOCNet; visual relationship detection; object features; scene graph generation task; hierarchical message passing; relationship features; triple-path structure

Subjects: Combinatorial mathematics; Knowledge engineering techniques; Biology and medical computing; Other topics in statistics; Graphics techniques; Distributed systems software

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cvi.2019.0896