Structured RNN for human interaction

Understanding human activities has been an important research area in computer vision. Generally, human interactions can be modelled as a temporal sequence of transitions in the relationships between humans and objects. Moreover, many studies have demonstrated the effectiveness of long short-term memory (LSTM) networks on long-term temporal dependency problems. Here, the authors propose a novel structured recurrent neural network (S-RNN) to model the spatio-temporal relationships between human subjects and objects in daily human interactions. The evolution of the different components, and of the relationships between them over time, is represented by several subnets. The hidden representations of these relations are then fused and fed into later layers to obtain the final hidden representation, from which the final prediction is made by a single-layer perceptron. Experimental results on different tasks on the CAD-120, SBU-Kinect-Interaction, Multi-modal & Multi-view & Interactive, and NTU RGB+D data sets show the advantages of the proposed method over state-of-the-art methods.
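To make the described pipeline concrete, the following is a minimal PyTorch sketch of the idea, not the authors' exact configuration: the particular split into human, object, and human-object relation subnets, the concatenation-based fusion, the single later LSTM layer, and all dimensions are assumptions made for illustration.

    # Minimal sketch of the S-RNN idea from the abstract: per-component LSTM
    # subnets, fusion of their hidden states, and a single-layer perceptron
    # for the final prediction. Layer sizes and the three-subnet split are
    # illustrative assumptions, not the authors' reported architecture.
    import torch
    import torch.nn as nn

    class StructuredRNN(nn.Module):
        def __init__(self, human_dim, object_dim, hidden_dim, num_classes):
            super().__init__()
            # One LSTM subnet per component/relation (assumed split:
            # human dynamics, object dynamics, human-object relation).
            self.human_net = nn.LSTM(human_dim, hidden_dim, batch_first=True)
            self.object_net = nn.LSTM(object_dim, hidden_dim, batch_first=True)
            self.relation_net = nn.LSTM(human_dim + object_dim, hidden_dim,
                                        batch_first=True)
            # A later layer that consumes the fused hidden representations.
            self.fusion_net = nn.LSTM(3 * hidden_dim, hidden_dim, batch_first=True)
            # Final prediction by a single-layer perceptron.
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, human_seq, object_seq):
            # human_seq: (batch, time, human_dim); object_seq: (batch, time, object_dim)
            h_h, _ = self.human_net(human_seq)
            h_o, _ = self.object_net(object_seq)
            h_r, _ = self.relation_net(torch.cat([human_seq, object_seq], dim=-1))
            # Fuse the subnet hidden states at every time step (concatenation assumed).
            fused = torch.cat([h_h, h_o, h_r], dim=-1)
            h_f, _ = self.fusion_net(fused)
            # Classify from the final hidden representation of the last time step.
            return self.classifier(h_f[:, -1, :])

In such a setup, per-frame skeleton joint coordinates could serve as human_seq and object position features as object_seq; the output is one score per interaction class.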
