© The Institution of Engineering and Technology
In this study, the authors focus on improving the spatio–temporal representation ability of three-dimensional (3D) convolutional neural networks (CNNs) in the video domain. They observe two unfavourable issues: (i) the convolutional filters are dedicated to learning only local representations along the input channels, and they treat channel-wise features equally, without emphasising the important ones; (ii) the traditional global average pooling layer captures only first-order statistics, ignoring the finer details useful for classification. To mitigate these problems, they propose two modules to boost the performance of 3D CNNs: a temporal–channel correlation (TCC) module and a bilinear pooling module. The TCC module captures inter-channel correlations over the temporal domain. Moreover, it generates channel-wise dependencies that adaptively re-weight the channel-wise features, so the network can focus on learning the important ones. The bilinear pooling module captures more complex second-order statistics in the deep features and produces a second-order classification vector; combining the first-order and second-order classification vectors yields more accurate results. Extensive experiments show that adding the proposed modules to the I3D network consistently improves performance and outperforms state-of-the-art methods. The code and models are available at https://github.com/caijh33/I3D_TCC_Bilinear.
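The two ideas in the abstract can be illustrated compactly. Below is a minimal PyTorch sketch, not the authors' implementation (their code is at the linked repository): the class name `TCCSketch`, the `reduction` ratio, and the exact pooling and gating layout are assumptions. It shows (i) channel-wise re-weighting computed per time step, in the spirit of squeeze-and-excitation extended along the temporal axis, and (ii) bilinear pooling that forms a second-order (Gram-matrix) descriptor from a 3D feature map.

```python
# Hedged sketch of the two modules described in the abstract.
# Shapes follow the usual video-CNN convention (N, C, T, H, W).
import torch
import torch.nn as nn

class TCCSketch(nn.Module):
    """Illustrative temporal-channel re-weighting (an assumption,
    not the authors' exact TCC module): squeeze the spatial dims,
    compute channel-wise gates per time step, and re-weight the
    input features with those gates."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # Squeeze spatial dims -> per-frame channel descriptor (N, T, C).
        desc = x.mean(dim=(3, 4)).permute(0, 2, 1)
        # Channel-wise gates for each time step, broadcast back over H, W.
        gates = self.fc(desc).permute(0, 2, 1).reshape(n, c, t, 1, 1)
        return x * gates  # adaptively re-weighted features

def bilinear_pool(x: torch.Tensor) -> torch.Tensor:
    """Second-order pooling: average outer product of channel features
    over all spatio-temporal positions, flattened to a (N, C*C) vector."""
    n, c = x.shape[:2]
    feats = x.reshape(n, c, -1)                              # (N, C, T*H*W)
    gram = torch.bmm(feats, feats.transpose(1, 2)) / feats.shape[-1]
    return gram.reshape(n, c * c)

x = torch.randn(2, 16, 4, 7, 7)
y = TCCSketch(16)(x)      # same shape as x, channels re-weighted
z = bilinear_pool(x)      # second-order vector, shape (2, 256)
print(y.shape, z.shape)
```

In this layout the first-order branch (global average pooling plus a classifier) and the second-order branch (`bilinear_pool` plus a classifier) would each produce a classification vector, and the two are combined for the final prediction, as the abstract describes.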