Fusing HOG and convolutional neural network spatial–temporal features for video-based facial expression recognition

Video-based facial expression recognition (VFER) is fundamental to a wide range of computer vision applications. Visual features are the key factor in facial expression recognition; however, the gap between low-level visual features and emotions is large. To bridge this gap, the proposed method combines convolutional neural networks (CNNs) and the histogram of oriented gradients (HOG) to obtain a more comprehensive feature for VFER. First, shallow features are extracted from each video frame through the convolutional kernels of a CNN; these features are invariant to displacement, scale and deformation. Then, HOG is applied to the CNN's shallow feature maps to extract descriptors that are strongly correlated with facial expressions. Finally, a support vector machine (SVM) carries out the facial expression classification. Extensive experiments on the RML, CK+ and AFEW 5.0 databases show that this framework achieves promising performance and outperforms the state of the art.
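A minimal sketch of the three-stage pipeline described above, assuming a pretrained VGG16 as a stand-in for the CNN; the abstract does not specify the network, the layer at which shallow features are taken, or the HOG parameters, so the cut point, HOG settings and helper names below are all illustrative:

    # Hypothetical sketch: shallow CNN features -> HOG on feature maps -> SVM.
    import numpy as np
    import torch
    from torchvision import models
    from skimage.feature import hog
    from sklearn.svm import SVC

    # Step 1: shallow CNN features. VGG16's first conv block (through the
    # first max-pool) is an assumed stand-in for the paper's shallow layer.
    features = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
    shallow = torch.nn.Sequential(*list(features.children())[:5])

    def frame_descriptor(frame):
        """frame: 224x224x3 float array in [0, 1] -> fused HOG-of-CNN vector."""
        x = torch.from_numpy(frame).permute(2, 0, 1).unsqueeze(0).float()
        with torch.no_grad():
            fmap = shallow(x)[0].numpy()  # (64, 112, 112) shallow feature maps
        # Step 2: HOG computed on each CNN feature channel, then concatenated.
        descs = [hog(ch, orientations=8, pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2)) for ch in fmap]
        return np.concatenate(descs)

    # Step 3: SVM classification over the fused descriptors, with labels
    # drawn from the expression annotations of RML / CK+ / AFEW.
    # X = np.stack([frame_descriptor(f) for f in training_frames])
    # clf = SVC(kernel='rbf').fit(X, train_labels)

Computing HOG per feature channel and concatenating is one plausible reading of "HOG features from CNN's shallow features"; channel averaging before HOG would be a cheaper alternative.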
