Continuous action segmentation and recognition using hybrid convolutional neural network-hidden Markov model model

Continuous action recognition in video is more complicated than traditional isolated action recognition: besides the high variability of postures and appearances within each action, the complex temporal dynamics of continuous actions make the problem challenging. In this study, the authors propose a hierarchical framework combining a convolutional neural network (CNN) and a hidden Markov model (HMM), which recognises and segments continuous actions simultaneously. The authors exploit the CNN's powerful capacity for learning high-level features directly from raw data, and use it to extract effective and robust action features. The HMM models the statistical dependencies over adjacent sub-actions and infers the action sequence. To combine the advantages of the two models, a hybrid CNN-HMM architecture is built, in which the CNN replaces the Gaussian mixture model for modelling the emission distribution of the HMM. The CNN-HMM is trained using the embedded Viterbi algorithm, and the data used to train the CNN are labelled by forced alignment. The authors test their method on two public action datasets, Weizmann and KTH. Experimental results show that the method achieves improved recognition and segmentation accuracy compared with several other methods, and illustrate the superior quality of the features learnt by the CNN.
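The decoding step the abstract describes, i.e. replacing the GMM emission model with per-frame CNN scores and inferring the sub-action sequence with Viterbi, can be sketched as standard max-product decoding over a log-emission matrix. This is an illustrative sketch only, not the authors' implementation: the toy transition/emission numbers and the function name are assumptions, and in the paper's setting `log_emis` would hold CNN posteriors converted to scaled log-likelihoods.

```python
import numpy as np

def viterbi(log_emis, log_trans, log_prior):
    """Most likely HMM state sequence given per-frame emission scores.

    log_emis:  (T, S) log p(frame_t | state_s) -- here this would come
               from a CNN's softmax outputs divided by state priors.
    log_trans: (S, S) log transition matrix, rows indexed by from-state.
    log_prior: (S,)   log initial state distribution.
    """
    T, S = log_emis.shape
    delta = log_prior + log_emis[0]            # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)         # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # (from-state, to-state)
        back[t] = np.argmax(scores, axis=0)    # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_emis[t]
    # trace the best path backwards through the pointers
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: two sub-action states, frames drifting from state 0 to state 1.
log_emis = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
log_prior = np.log([0.5, 0.5])
print(viterbi(log_emis, log_trans, log_prior))  # [0, 0, 1, 1]
```

The decoded state path directly yields both outputs of the framework: the state labels give the recognised sub-actions, and the positions where the label changes give the segmentation boundaries.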

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cvi.2015.0408