© The Institution of Engineering and Technology
In this study, the authors focus on improving the spatio–temporal representation ability of three-dimensional (3D) convolutional neural networks (CNNs) in the video domain. They observe two unfavourable issues: (i) the convolutional filters are dedicated to learning only local representations along the input channels, and they treat channel-wise features equally, without emphasising the important ones; (ii) the traditional global average pooling layer captures only first-order statistics, ignoring the finer details useful for classification. To mitigate these problems, they propose two modules to boost the performance of 3D CNNs: a temporal–channel correlation (TCC) module and a bilinear pooling module. The TCC module captures inter-channel correlations over the temporal domain. Moreover, it generates channel-wise dependencies that adaptively re-weight the channel-wise features, so the network can focus on learning the important ones. The bilinear pooling module captures more complex second-order statistics in the deep features and produces a second-order classification vector; combining the first-order and second-order classification vectors yields more accurate results. Extensive experiments show that adding the proposed modules to the I3D network consistently improves performance and outperforms state-of-the-art methods. The code and models are available at https://github.com/caijh33/I3D_TCC_Bilinear.
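The two ideas in the abstract can be illustrated compactly. Below is a minimal PyTorch sketch, not the authors' implementation (their code is at the linked repository): the class name `TCCSketch`, the `reduction` ratio, and the exact pooling and gating layout are assumptions. It shows (i) channel-wise re-weighting computed per time step, in the spirit of squeeze-and-excitation extended along the temporal axis, and (ii) bilinear pooling that forms a second-order (Gram-matrix) descriptor from a 3D feature map.

```python
# Hedged sketch of the two modules described in the abstract.
# Shapes follow the usual video-CNN convention (N, C, T, H, W).
import torch
import torch.nn as nn

class TCCSketch(nn.Module):
    """Illustrative temporal-channel re-weighting (an assumption,
    not the authors' exact TCC module): squeeze the spatial dims,
    compute channel-wise gates per time step, and re-weight the
    input features with those gates."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # Squeeze spatial dims -> per-frame channel descriptor (N, T, C).
        desc = x.mean(dim=(3, 4)).permute(0, 2, 1)
        # Channel-wise gates for each time step, broadcast back over H, W.
        gates = self.fc(desc).permute(0, 2, 1).reshape(n, c, t, 1, 1)
        return x * gates  # adaptively re-weighted features

def bilinear_pool(x: torch.Tensor) -> torch.Tensor:
    """Second-order pooling: average outer product of channel features
    over all spatio-temporal positions, flattened to a (N, C*C) vector."""
    n, c = x.shape[:2]
    feats = x.reshape(n, c, -1)                              # (N, C, T*H*W)
    gram = torch.bmm(feats, feats.transpose(1, 2)) / feats.shape[-1]
    return gram.reshape(n, c * c)

x = torch.randn(2, 16, 4, 7, 7)
y = TCCSketch(16)(x)      # same shape as x, channels re-weighted
z = bilinear_pool(x)      # second-order vector, shape (2, 256)
print(y.shape, z.shape)
```

In this layout the first-order branch (global average pooling plus a classifier) and the second-order branch (`bilinear_pool` plus a classifier) would each produce a classification vector, and the two are combined for the final prediction, as the abstract describes.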