Spatio-temporal multi-scale motion descriptor from a spatially-constrained decomposition for online action recognition

This study presents a spatio-temporal motion descriptor, computed from a spatially-constrained decomposition, for online classification and recognition of human activities. The method starts by computing a dense optical flow without explicit spatial regularisation. Potential human actions are detected in each frame as spatially consistent moving regions of interest (RoIs). Each RoI is then sequentially partitioned into small overlapping subregions of different sizes, and each subregion is characterised by a set of flow orientation histograms. Over time, each RoI is described by a set of recursively calculated statistics that summarise the temporal history of its orientation histograms; together these form the action descriptor. At any time, the whole descriptor can be extracted and labelled by a previously trained support vector machine. The method was evaluated on three public datasets: (i) the ViSOR dataset was used for global classification, obtaining an average accuracy of 95%, and for recognition in long sequences, achieving an average per-frame accuracy of 92.3%; (ii) the KTH dataset was used for global classification; and (iii) the UT-datasets were used for the recognition task, obtaining an average accuracy of 80% (frame rate).
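To make the two core steps of the abstract concrete, the following is a minimal Python sketch of a flow orientation histogram for one subregion and of statistics updated recursively frame by frame. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which statistics are used, so an exponentially weighted mean and variance are assumed here, and the names `orientation_histogram`, `RecursiveStats`, `n_bins` and `alpha` are hypothetical.

```python
import numpy as np


def orientation_histogram(flow, n_bins=8):
    """Magnitude-weighted histogram of flow orientations for one subregion.

    flow: array of shape (H, W, 2) holding (dx, dy) per pixel.
    Returns an L1-normalised histogram with n_bins entries
    (all zeros if the subregion contains no motion).
    """
    dx = flow[..., 0].ravel()
    dy = flow[..., 1].ravel()
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % (2 * np.pi)  # map to [0, 2*pi)
    # Quantise angles into bins; clamp guards against rounding up to n_bins.
    bins = np.minimum((angle / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins, weights=magnitude, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist


class RecursiveStats:
    """Running per-bin statistics, updated once per frame in O(n_bins).

    ASSUMPTION: exponentially weighted mean and variance stand in for the
    recursively calculated statistics mentioned in the abstract.
    """

    def __init__(self, n_bins, alpha=0.1):
        self.alpha = alpha  # forgetting factor (hypothetical parameter)
        self.mean = np.zeros(n_bins)
        self.var = np.zeros(n_bins)
        self.initialised = False

    def update(self, hist):
        hist = np.asarray(hist, dtype=float)
        if not self.initialised:
            # The first observation initialises the mean; variance stays 0.
            self.mean = hist.copy()
            self.initialised = True
            return
        diff = hist - self.mean
        self.mean = self.mean + self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff ** 2)

    def descriptor(self):
        # Concatenated statistics form this subregion's part of the descriptor.
        return np.concatenate([self.mean, self.var])
```

Because each update depends only on the previous statistics and the current histogram, no frame history needs to be stored, which is what allows a descriptor to be read out at any time. In the method described above, the per-subregion descriptors of an RoI would be combined and passed to the trained support vector machine.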


