© The Institution of Engineering and Technology
The authors present a novel approach to improving three-dimensional (3D) structure estimation from an image stream in urban scenes. They consider a particular setup in which the camera is mounted on a moving vehicle. Applying a traditional structure from motion (SfM) technique in this case yields poor estimates of the 3D structure for several reasons, such as texture-less images, small baseline variations and dominant forward camera motion. The authors' idea is to introduce the monocular depth cues that exist in a single image and to add temporal constraints on the estimated 3D structure. The scene is modelled as a set of small planar patches obtained by over-segmentation, and the goal is to estimate the 3D positioning of these planes. The authors propose a fusion scheme that employs a Markov random field (MRF) model to integrate spatial and temporal depth features. Spatial depth is obtained by learning a set of global and local image features. Temporal depth is obtained via a sparse optical-flow-based SfM approach, which allows the estimation ambiguity to be reduced by imposing constraints on the camera motion. Finally, the authors apply the fusion scheme to produce a unique 3D structure estimate.
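The core idea — combining a per-patch spatial (single-image) depth estimate with a temporal (SfM) depth estimate under an MRF smoothness prior — can be illustrated with a minimal sketch. This is not the authors' implementation: the quadratic data and smoothness terms, the weights, the 1D chain adjacency between patches, and the `fuse_depths` helper are all illustrative assumptions; the paper's actual model operates on an over-segmentation graph with learned features.

```python
# Hypothetical sketch (not the paper's implementation): fuse a spatial
# (single-image) depth estimate and a temporal (SfM) depth estimate per
# planar patch by minimising a simple quadratic MRF energy with iterated
# conditional modes (ICM). Patch adjacency is simplified to a 1D chain.

def fuse_depths(spatial, temporal, w_spatial=1.0, w_temporal=1.0,
                w_smooth=0.5, iters=50):
    """Return per-patch depths d minimising
       sum_i [w_spatial*(d_i - s_i)^2 + w_temporal*(d_i - t_i)^2]   # data
     + sum_{i~j} w_smooth*(d_i - d_j)^2                             # smoothness
    """
    # Initialise each patch at the average of its two depth cues.
    d = [(s + t) / 2.0 for s, t in zip(spatial, temporal)]
    n = len(d)
    for _ in range(iters):
        for i in range(n):
            # Closed-form minimiser of the local quadratic energy in d_i,
            # holding the neighbouring depths fixed (the ICM update).
            num = w_spatial * spatial[i] + w_temporal * temporal[i]
            den = w_spatial + w_temporal
            for j in (i - 1, i + 1):       # chain neighbours
                if 0 <= j < n:
                    num += w_smooth * d[j]
                    den += w_smooth
            d[i] = num / den
    return d

# Example: the third patch has an outlying spatial estimate (5.0); the
# smoothness term pulls its fused depth back towards its neighbours.
fused = fuse_depths([2.0, 2.1, 5.0], [2.2, 2.0, 2.1])
```

With `w_smooth=0` each patch simply lands at the weighted average of its two cues; increasing `w_smooth` trades data fidelity for a smoother depth map, which is the role the MRF plays in the fusion scheme described above.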
http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cvi.2012.0270