Action recognition by discriminative EdgeBoxes

Given the huge number of online videos uploaded and viewed every day, there is a growing need for action recognition techniques. Applying these techniques to uncontrolled, realistic videos remains challenging owing to large variations in camera motion, viewpoint, cluttered background and so on; moreover, such techniques must be automated to cope with so many different actions. The goal of this study is to introduce a new technique for mining mid-level discriminative patches from videos, i.e. the most representative parts for describing an action. To achieve this goal, the authors generalise a technique borrowed from 2D images (EdgeBoxes) to generate bounding boxes with high motion and appearance saliency. An iterative clustering-classification procedure is then applied to the generated boxes, a discriminative score is calculated for each box, and the top-ranked boxes are selected to train exemplar-SVMs on low-level features extracted from them. The proposed approach has been evaluated on two challenging datasets, YouTube and JHMDB. The experimental results demonstrate that the approach achieves better average recognition accuracy than state-of-the-art techniques.
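To make the clustering-classification and exemplar-SVM stages concrete, the following is a minimal sketch of one pass of that loop, not the authors' implementation: every function name, parameter value, and feature placeholder here is an assumption, scikit-learn's LinearSVC stands in for the LIBSVM-style SVMs, single-link clustering is one plausible choice of grouping, and the EdgeBoxes generation and iterative refinement stages are omitted.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.svm import LinearSVC

def mine_discriminative_boxes(pos, neg, n_clusters=10, top_k=5):
    """Cluster the target action's box features, train one linear
    classifier per cluster against boxes of all other actions, and
    rank each box by its classifier margin (its discriminative score)."""
    clusters = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="single").fit_predict(pos)
    scores = np.full(len(pos), -np.inf)
    for c in range(n_clusters):
        members = np.where(clusters == c)[0]
        if len(members) < 2:          # skip degenerate clusters
            continue
        X = np.vstack([pos[members], neg])
        y = np.r_[np.ones(len(members)), -np.ones(len(neg))]
        clf = LinearSVC(C=0.1).fit(X, y)
        # discriminative score = signed distance from the decision boundary
        scores[members] = clf.decision_function(pos[members])
    return np.argsort(scores)[::-1][:top_k]   # indices of top-ranked boxes

def train_exemplar_svms(pos, neg, mined_idx):
    """One linear SVM per mined box, with that single box as the only
    positive example (the exemplar-SVM idea of Malisiewicz et al.)."""
    models = []
    for i in mined_idx:
        X = np.vstack([pos[i][None, :], neg])
        y = np.r_[1.0, -np.ones(len(neg))]
        models.append(LinearSVC(C=0.01,
                                class_weight={1: 50.0, -1: 1.0}).fit(X, y))
    return models

# Usage with random stand-in features (rows = low-level box descriptors):
rng = np.random.default_rng(0)
pos = rng.normal(size=(200, 64))   # boxes from the target action
neg = rng.normal(size=(800, 64))   # boxes from all other actions
idx = mine_discriminative_boxes(pos, neg)
svms = train_exemplar_svms(pos, neg, idx)
```

The heavy positive class weight in the exemplar stage reflects the standard exemplar-SVM recipe of balancing a single positive against many negatives; the specific values are illustrative only.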

Inspec keywords: image classification; image recognition; video signal processing; image motion analysis; support vector machines

Other keywords: online videos; JHMDB; cluttered background; action recognition techniques; discriminative EdgeBoxes; exemplar-SVM; YouTube; 2D images; mid-level discriminative patches; camera motion; clustering-classification iterative procedure

Subjects: Knowledge engineering techniques; Image recognition; Computer vision and image processing techniques; Video signal processing
