Continuous action segmentation and recognition using a hybrid convolutional neural network–hidden Markov model


Continuous action recognition in video is more complicated than traditional isolated action recognition. Besides the high variability of postures and appearances within each action, the complex temporal dynamics of continuous action make this problem challenging. In this study, the authors propose a hierarchical framework combining a convolutional neural network (CNN) and a hidden Markov model (HMM), which recognises and segments continuous actions simultaneously. The authors exploit the CNN's capacity for learning high-level features directly from raw data and use it to extract effective and robust action features. The HMM models the statistical dependencies between adjacent sub-actions and infers the action sequences. To combine the advantages of these two models, a hybrid CNN-HMM architecture is built: the Gaussian mixture model conventionally used to model the emission distribution of the HMM is replaced by the CNN. The CNN-HMM model is trained using the embedded Viterbi algorithm, and the data used to train the CNN are labelled by forced alignment. The authors test their method on two public action datasets, Weizmann and KTH. Experimental results show that the method achieves improved recognition and segmentation accuracy compared with several other methods. The superior quality of the features learnt by the CNN is also illustrated.
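To make the decoding step concrete, the following is a minimal sketch of the inference described above: in a hybrid CNN-HMM, the CNN's per-frame softmax posteriors p(s|x_t) are converted into scaled likelihoods p(x_t|s) ∝ p(s|x_t)/p(s), and a Viterbi pass over these scores jointly segments and labels the sub-action sequence. All names and the toy matrices here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, state_prior, eps=1e-12):
    """Convert CNN softmax posteriors p(s|x_t) into scaled log-likelihoods
    log p(x_t|s) + const, by dividing out the state prior p(s)."""
    return np.log(posteriors + eps) - np.log(state_prior + eps)

def viterbi_decode(log_emissions, log_trans, log_prior):
    """Most-likely state (sub-action) sequence.

    log_emissions: (T, S) per-frame scaled log-likelihoods
    log_trans:     (S, S) log transition matrix, [prev, cur]
    log_prior:     (S,)   log initial-state probabilities
    """
    T, S = log_emissions.shape
    delta = np.full((T, S), -np.inf)   # best path score ending in state s at t
    back = np.zeros((T, S), dtype=int) # backpointers
    delta[0] = log_prior + log_emissions[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (prev, cur)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emissions[t]
    # Backtrack from the best final state
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

In embedded Viterbi training, the decoded path doubles as a forced alignment: the per-frame state labels it produces are fed back as training targets for the CNN, and the two steps alternate until convergence.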


