Fully convolutional networks for action recognition

Human action recognition is an important and challenging topic in computer vision. Recently, convolutional neural networks (CNNs) have achieved impressive results on many image recognition tasks. However, CNNs usually contain millions of parameters and are prone to overfitting when trained on small datasets; as a result, they have not produced clearly superior performance over traditional methods for action recognition. In this study, the authors design a novel two-stream fully convolutional network architecture for action recognition that significantly reduces the number of parameters while preserving performance. To exploit spatial-temporal features, a linear weighted fusion method is used to fuse the two streams' feature maps, and a video pooling method is adopted to construct video-level features. The authors also demonstrate that improved dense trajectories have a significant impact on action recognition. The method achieves state-of-the-art performance on two challenging datasets, UCF101 (93.0%) and HMDB51 (70.2%).
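The fusion and pooling steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the fusion weights, the max-pooling choice for video-level aggregation, and the tensor shapes are all assumptions made here for clarity.

```python
import numpy as np

def linear_weighted_fusion(spatial_maps, temporal_maps,
                           w_spatial=0.5, w_temporal=0.5):
    """Linearly combine per-frame feature maps from the two streams.
    The equal weights are illustrative; the paper's weights may differ."""
    return w_spatial * spatial_maps + w_temporal * temporal_maps

def video_pooling(frame_features):
    """Aggregate frame-level features into one video-level descriptor.
    Max pooling over the time axis is one common choice (an assumption)."""
    return frame_features.max(axis=0)

# Toy example: 10 frames, 256-channel 7x7 fully convolutional feature maps.
spatial = np.random.rand(10, 256, 7, 7)   # appearance (RGB) stream
temporal = np.random.rand(10, 256, 7, 7)  # motion (optical flow) stream

fused = linear_weighted_fusion(spatial, temporal)
video_feat = video_pooling(fused.reshape(10, -1))
print(video_feat.shape)  # (12544,) — one fixed-length video-level feature
```

The resulting fixed-length vector is the kind of video-level feature that can then be fed to a linear classifier for action recognition.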
