
Structured RNN for human interaction

Understanding human activities has been an important research area in computer vision. In general, human interactions can be modelled as temporal sequences of transitions in the relationships between humans and objects. Moreover, many studies have demonstrated the effectiveness of long short-term memory (LSTM) networks on long-term temporal dependency problems. Here, the authors propose a novel structured recurrent neural network (S-RNN) to model the spatio-temporal relationships between human subjects and objects in daily human interactions. The evolution of the different components, and of the relationships between them over time, is represented by several subnets. The hidden representations of those relations are then fused and fed into later layers to obtain the final hidden representation, and the final prediction is carried out by a single-layer perceptron. Experimental results on different tasks on the CAD-120, SBU Kinect Interaction, multi-modal & multi-view & interactive, and NTU RGB+D data sets show the advantages of the proposed method over state-of-the-art methods.
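
As a rough illustration of the architecture described in the abstract, the following PyTorch sketch models each component and relationship stream with its own LSTM subnet, fuses their hidden representations by concatenation, passes the fused sequence through a later recurrent layer, and makes the final prediction with a single-layer perceptron. The stream names, feature dimensions, and concatenation-based fusion are illustrative assumptions, not the authors' exact design.

# Minimal sketch of a structured RNN in the spirit of the abstract.
# All dimensions, stream names, and the fusion-by-concatenation choice
# are assumptions for illustration, not the authors' exact model.
import torch
import torch.nn as nn


class StructuredRNN(nn.Module):
    def __init__(self, human_dim, object_dim, relation_dim,
                 hidden_dim=128, num_classes=10):
        super().__init__()
        # One LSTM subnet per component / relationship stream.
        self.human_net = nn.LSTM(human_dim, hidden_dim, batch_first=True)
        self.object_net = nn.LSTM(object_dim, hidden_dim, batch_first=True)
        self.relation_net = nn.LSTM(relation_dim, hidden_dim, batch_first=True)
        # Later layer that fuses the per-stream hidden representations.
        self.fusion_net = nn.LSTM(3 * hidden_dim, hidden_dim, batch_first=True)
        # Single-layer perceptron for the final prediction.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, human_seq, object_seq, relation_seq):
        # Each input has shape (batch, time, feature_dim).
        h_human, _ = self.human_net(human_seq)
        h_object, _ = self.object_net(object_seq)
        h_relation, _ = self.relation_net(relation_seq)
        # Fuse the subnet outputs by concatenating them at each time step.
        fused = torch.cat([h_human, h_object, h_relation], dim=-1)
        h_final, _ = self.fusion_net(fused)
        # Classify from the hidden representation at the last time step.
        return self.classifier(h_final[:, -1, :])


if __name__ == "__main__":
    model = StructuredRNN(human_dim=75, object_dim=20, relation_dim=30)
    human = torch.randn(2, 40, 75)      # e.g. skeleton joints over 40 frames
    obj = torch.randn(2, 40, 20)        # object features
    rel = torch.randn(2, 40, 30)        # human-object relation features
    print(model(human, obj, rel).shape)  # torch.Size([2, 10])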
