
Converting video classification problem to image classification with global descriptors and pre-trained network

The motion history image (MHI) is a spatio-temporal template in which temporal motion information is collapsed into a single image whose intensity is a function of the recency of motion; it also retains spatial information. The energy image (EI), computed from the magnitude of optical flow, is a temporal template that captures only the temporal information of motion. Each video can be described by these templates, and on this basis four new methods are introduced in this study. The first three are basic methods. In method 1, each video is split into N groups of consecutive frames and an MHI is computed for each group; transfer learning with fine-tuning is then used to classify these templates. Method 2 classifies EIs in the same way. Method 3 fuses the two streams of templates. Finally, method 4 adds spatial information. Method 4 outperforms the others and is referred to as the proposed method. It achieves recognition accuracies of 92.30% and 94.50% on the UCF Sports and UCF-11 action datasets, respectively. The proposed method is also compared with state-of-the-art approaches, and the results show that it performs best.
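To make the two templates concrete, the following is a minimal sketch of how an MHI and an EI could be computed for one group of grayscale frames using OpenCV and NumPy. The motion threshold, decay step, and Farneback optical-flow parameters here are illustrative assumptions, not the settings used in the paper.

    import cv2
    import numpy as np

    def motion_history_image(frames, tau=255, delta=32, thresh=30):
        """Collapse a group of grayscale frames into a single MHI: pixels
        that moved recently are bright; older motion decays towards zero."""
        mhi = np.zeros(frames[0].shape, dtype=np.float32)
        for prev, curr in zip(frames, frames[1:]):
            moving = cv2.absdiff(curr, prev) > thresh   # binary motion mask
            mhi = np.where(moving, tau, np.maximum(mhi - delta, 0))
        return mhi.astype(np.uint8)

    def energy_image(frames):
        """Accumulate dense optical-flow magnitude over a group of frames,
        keeping only the temporal information of motion."""
        ei = np.zeros(frames[0].shape, dtype=np.float32)
        for prev, curr in zip(frames, frames[1:]):
            flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            ei += np.linalg.norm(flow, axis=2)          # per-pixel magnitude
        return cv2.normalize(ei, None, 0, 255,
                             cv2.NORM_MINMAX).astype(np.uint8)

In the full pipeline, each video would be split into N groups of consecutive frames, each group reduced to such templates, and the resulting templates classified as ordinary images.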

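The classification stage is described only as transfer learning with fine-tuning of a pre-trained network. A minimal Keras sketch consistent with that description is given below; the VGG16 backbone, head layers, learning rates, and number of unfrozen layers are assumptions, and template_images/labels are hypothetical placeholders for the templates (replicated to three channels) and their action labels.

    import tensorflow as tf
    from tensorflow import keras

    NUM_CLASSES = 11  # e.g. UCF-11; placeholder

    # pre-trained ImageNet backbone without its classifier head
    base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                    input_shape=(224, 224, 3))
    base.trainable = False  # stage 1: train only the new head

    model = keras.Sequential([
        base,
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(template_images, labels, epochs=10)

    # stage 2: unfreeze the top of the backbone, fine-tune at a low rate
    base.trainable = True
    for layer in base.layers[:-4]:
        layer.trainable = False
    model.compile(optimizer=keras.optimizers.Adam(1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(template_images, labels, epochs=5)

Method 3's fusion could then be realised by, for instance, averaging the softmax outputs of the MHI and EI streams, and method 4 by adding a stream carrying spatial information, though the paper's exact fusion scheme is not reproduced here.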