Combination of temporal-channels correlation information and bilinear feature for action recognition

In this study, the authors focus on improving the spatio-temporal representation ability of three-dimensional (3D) convolutional neural networks (CNNs) in the video domain. They observe two unfavourable issues: (i) convolutional filters only learn local representations along the input channels, and they treat channel-wise features equally, without emphasising the important ones; (ii) the traditional global average pooling layer captures only first-order statistics, ignoring finer details that are useful for classification. To mitigate these problems, they propose two modules to boost the performance of 3D CNNs: a temporal-channel correlation (TCC) module and a bilinear pooling module. The TCC module captures inter-channel correlations over the temporal domain and generates channel-wise dependencies that adaptively re-weight the channel-wise features, so the network can focus on learning the important ones. The bilinear pooling module captures more complex second-order statistics of the deep features and produces a second-order classification vector; combining the first-order and second-order classification vectors yields more accurate classification results. Extensive experiments show that adding the proposed modules to the I3D network consistently improves its performance and outperforms state-of-the-art methods. The code and models are available at https://github.com/caijh33/I3D_TCC_Bilinear.
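To make the two ideas concrete, below is a minimal PyTorch-style sketch of (i) a temporal-channel gate that re-weights channels per frame, in the spirit of the TCC module, and (ii) second-order (bilinear) pooling of space-time features. The module name `TemporalChannelGate`, the reduction ratio, and the exact gating scheme are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
# Minimal sketch, assuming (N, C, T, H, W) activations from a 3D CNN stage.
# Not the authors' implementation; names and hyper-parameters are illustrative.
import torch
import torch.nn as nn


class TemporalChannelGate(nn.Module):
    """Re-weights channels using statistics pooled over space but kept per
    frame, so the gate can vary along the temporal axis (TCC-like block)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        s = x.mean(dim=(3, 4))              # (N, C, T): spatial squeeze per frame
        g = self.fc(s.transpose(1, 2))      # (N, T, C): per-frame channel gates
        g = g.transpose(1, 2).reshape(n, c, t, 1, 1)
        return x * g                        # adaptively re-weighted features


def bilinear_pool(x: torch.Tensor) -> torch.Tensor:
    """Second-order pooling: average outer product of channel descriptors
    over all space-time positions, then signed square-root and L2 norm."""
    n, c = x.shape[:2]
    f = x.reshape(n, c, -1)                            # (N, C, T*H*W)
    b = torch.bmm(f, f.transpose(1, 2)) / f.shape[-1]  # (N, C, C) 2nd-order stats
    b = b.reshape(n, c * c)
    b = torch.sign(b) * torch.sqrt(b.abs() + 1e-8)     # signed square-root
    return nn.functional.normalize(b)                  # L2 normalisation


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)              # toy (N, C, T, H, W) input
    gated = TemporalChannelGate(64)(feats)
    second_order = bilinear_pool(gated)                # (2, 64*64) bilinear vector
    print(gated.shape, second_order.shape)
```

In the setting the abstract describes, the first-order (global-average-pooled) vector and the second-order bilinear vector would each feed a classifier whose predictions are then combined; the sketch only shows how the two kinds of features are formed.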

Inspec keywords: correlation methods; image representation; vectors; convolutional neural nets; object recognition; learning (artificial intelligence); video signal processing; image classification; stereo image processing; higher order statistics

Other keywords: deep features; convolutional filters; channel-wise dependencies; temporal-channels correlation information; channel-wise features; first-order statistics; spatio–temporal representation; second-order classification vector; bilinear feature; bilinear pooling module; I3D network; video domain; temporal-channel correlation; three-dimensional convolutional neural networks; first-order classification vector; 3D CNN; TCC module; second-order statistics

Subjects: Neural computing techniques; Other topics in statistics; Image recognition; Algebra; Computer vision and image processing techniques; Video signal processing
