ResFusion: deeply fused scene parsing network for RGB-D images

Scene parsing is a challenging task for complex and diverse scenes. In this study, the authors address the problem of semantic segmentation of indoor scenes from red, green, blue-depth (RGB-D) images. Most existing works use only the colour (photometric) information for this problem. Here, the authors present an approach that fuses feature maps between a colour network branch and a depth network branch to integrate photometric and geometric information, which improves semantic segmentation performance. They propose a novel convolutional neural network that uses ResNet as its baseline. The proposed network adopts a spatial pyramid pooling module to make full use of different sub-region representations, and multiple feature-map fusion modules to integrate texture and structure information between the colour and depth branches. Moreover, it attaches multiple auxiliary loss branches alongside the main loss function to prevent the gradients of the early layers from vanishing and to accelerate training of the fusion part. Comprehensive experimental evaluations show that the proposed network, 'ResFusion', greatly improves performance over the baseline network and achieves competitive performance compared with other state-of-the-art methods on the challenging SUN RGB-D benchmark.
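The abstract names three architectural ingredients: a two-branch (colour/depth) encoder with feature-map fusion, a spatial pyramid pooling module, and auxiliary loss branches added to the main loss. The sketch below illustrates these ideas in PyTorch. It is a minimal illustration only: the element-wise-sum fusion operator, the pooling bin sizes (1, 2, 3, 6) and the auxiliary-loss weight of 0.4 are assumptions borrowed from common practice, not details confirmed by the paper.

```python
# Illustrative sketch only: the fusion operator (element-wise sum), the
# pooling bins and the auxiliary-loss weight are assumptions, not the
# paper's confirmed design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Fuse a depth feature map into the colour branch (assumed: element-wise sum)."""
    def forward(self, rgb_feat, depth_feat):
        return rgb_feat + depth_feat

class PyramidPooling(nn.Module):
    """Spatial pyramid pooling: average-pool the feature map over several grid
    sizes, project each with a 1x1 convolution, upsample and concatenate."""
    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),
                nn.Conv2d(in_channels, in_channels // len(bins),
                          kernel_size=1, bias=False),
            )
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [
            F.interpolate(stage(x), size=(h, w),
                          mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat([x] + pooled, dim=1)

def total_loss(main_logits, aux_logits_list, target, aux_weight=0.4):
    """Main cross-entropy loss plus weighted auxiliary losses on
    intermediate predictions (the 0.4 weight is an assumption)."""
    loss = F.cross_entropy(main_logits, target)
    for aux_logits in aux_logits_list:
        loss = loss + aux_weight * F.cross_entropy(aux_logits, target)
    return loss
```

Summation fusion keeps the channel count unchanged, so depth features can be injected at several ResNet stages without altering the colour branch's layer shapes, while the auxiliary losses give the fusion layers a shorter gradient path during training.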
