Human-action recognition using a multi-layered fusion scheme of Kinect modalities

This study addresses the problem of efficiently combining the joint, RGB and depth modalities of the Kinect sensor in order to recognise human actions. For this purpose, a multi-layered fusion scheme concatenates different modality-specific features, builds specialised local and global SVM models and then iteratively fuses their scores. The authors contribute on two levels: (i) they combine the performance of local descriptors with the strength of global bag-of-visual-words representations, and are then able to generate improved local decisions that allow the handling of noisy frames; (ii) they study the performance of multiple fusion schemes guided by different feature concatenations, the concatenation of Fisher-vector representations, and late iterative score fusion. To demonstrate the efficiency of their approach, they evaluate it on two challenging public datasets: CAD-60 and CGC-2014. Competitive results are obtained on both benchmarks.

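As a rough illustration of the late score-fusion idea described above, the sketch below trains one linear SVM per Kinect modality (joints, RGB, depth) and fuses their class-probability scores by weighted averaging. This is a minimal sketch under stated assumptions, not the authors' exact pipeline: the upstream feature extraction (e.g. Fisher-vector encoding of local descriptors), the per-frame local models, and the fusion weights are all hypothetical here.

```python
# Minimal sketch of multi-modality late score fusion with SVMs.
# Assumes per-video feature vectors per modality are already extracted;
# the feature encodings and fusion weights are illustrative only.
import numpy as np
from sklearn.svm import SVC


def train_modality_svms(features_by_modality, labels):
    """Train one linear SVM per modality on its feature matrix."""
    models = {}
    for name, X in features_by_modality.items():
        clf = SVC(kernel="linear", probability=True)
        clf.fit(X, labels)
        models[name] = clf
    return models


def fuse_scores(models, test_features, weights=None):
    """Fuse per-modality class-probability scores by weighted averaging."""
    names = list(models)
    if weights is None:  # uniform weights if none are supplied
        weights = {n: 1.0 / len(names) for n in names}
    fused = None
    for n in names:
        scores = weights[n] * models[n].predict_proba(test_features[n])
        fused = scores if fused is None else fused + scores
    # All models were trained on the same labels, so classes_ coincide.
    classes = models[names[0]].classes_
    return classes[fused.argmax(axis=1)]
```

In the scheme described in the abstract, such score fusion would additionally be applied iteratively across the local (per-frame) and global (per-sequence) layers, rather than in the single pass shown here.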