Multi-mode neural network for human action recognition

Video data have two intrinsic modes: in-frame (spatial) and temporal. It is beneficial to exploit static in-frame features when learning dynamic features for video applications. However, some existing methods, such as recurrent neural networks, perform poorly, while others, such as 3D convolutional neural networks (CNNs), are both memory- and time-consuming. This study proposes an effective framework that takes advantage of deep learning for static image feature extraction to tackle video data. After extracting in-frame feature vectors using a pretrained deep network, the authors integrate them into a multi-mode feature matrix, which preserves both the multi-mode structure and the high-level representation. They propose two models for the follow-up classification. The authors first introduce a temporal CNN, which feeds the multi-mode feature matrix directly into a CNN. However, they show that the characteristics of the multi-mode features differ significantly between modes. The authors therefore further propose the multi-mode neural network (MMNN), in which different modes deploy different types of layers. They evaluate the algorithm on the task of human action recognition. The experimental results show that the MMNN achieves much better performance than existing long short-term memory-based methods and consumes far fewer resources than existing 3D end-to-end models.
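The first stage of the pipeline described above (per-frame features from a pretrained image network, stacked into a multi-mode feature matrix, then classified by a temporal CNN) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ResNet-50 backbone, feature dimension D = 2048, clip length T = 16, the layer sizes, and the 101-class output (as in UCF101) are all illustrative choices.

```python
# Minimal sketch of the temporal-CNN variant (NOT the authors' code):
# 1) a frozen pretrained image CNN extracts one feature vector per frame,
# 2) the T per-frame vectors are stacked into a T x D multi-mode matrix,
# 3) a small 1D CNN over the temporal mode classifies the matrix.
import torch
import torch.nn as nn
import torchvision.models as models


class TemporalCNN(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=101):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the 2048-d pooled features.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.encoder.parameters():
            p.requires_grad = False  # used as a fixed feature extractor
        # 1D convolution along the temporal mode of the feature matrix.
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(512, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)               # (B*T, 3, H, W)
        feats = self.encoder(frames).flatten(1)   # (B*T, D) per-frame vectors
        feats = feats.view(b, t, -1)              # multi-mode matrix (B, T, D)
        x = self.temporal(feats.transpose(1, 2))  # conv over time: (B, 512, 1)
        return self.fc(x.squeeze(-1))             # (B, num_classes)


logits = TemporalCNN()(torch.randn(2, 16, 3, 224, 224))
```

The MMNN variant the abstract goes on to describe differs in that it applies different layer types to the in-frame and temporal modes rather than a single CNN over the whole matrix; that per-mode design is not reproduced in this sketch.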
