HGR-Net: a fusion network for hand gesture segmentation and recognition

We propose HGR-Net, a two-stage convolutional neural network (CNN) architecture for robust recognition of hand gestures. The first stage performs accurate semantic segmentation to determine hand regions, and the second stage identifies the gesture. The segmentation stage combines a fully convolutional residual network with atrous spatial pyramid pooling. Although the segmentation sub-network is trained without depth information, it is particularly robust against challenges such as illumination variations and complex backgrounds. The recognition stage deploys a two-stream CNN that fuses the information from the red–green–blue (RGB) and segmented images by combining their deep representations in a fully connected layer before classification. Extensive experiments on public datasets show that our architecture achieves performance close to the state of the art in segmentation and recognition of static hand gestures, at a fraction of the training time, run time, and model size. Our method runs at an average of 23 ms per frame.
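The fusion step in the recognition stage, where the deep representations of the two streams are combined in a fully connected layer before classification, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, feature sizes, weights, and bias values below are hypothetical toy values, and in the actual network each feature vector would come from a CNN stream rather than being written by hand.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(rgb_features, seg_features, weights, bias):
    """Concatenate the two stream representations, then classify.

    Assumption: fusion is modelled as plain concatenation followed by
    a single fully connected layer and a softmax.
    """
    fused = rgb_features + seg_features  # list concatenation
    return softmax(dense(fused, weights, bias))


# Hypothetical toy inputs: a 2-D feature vector from the RGB stream
# and a 3-D feature vector from the segmented-image stream.
rgb_features = [0.2, 0.8]
seg_features = [1.0, 0.0, 0.5]

# Hypothetical weights for three gesture classes (3 rows x 5 columns).
weights = [
    [0.1, 0.4, 0.3, -0.2, 0.5],
    [-0.3, 0.2, 0.6, 0.1, -0.4],
    [0.0, -0.1, 0.2, 0.3, 0.1],
]
bias = [0.0, 0.1, -0.1]

probs = fuse_and_classify(rgb_features, seg_features, weights, bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Because the fully connected layer sees both representations at once, its weights can relate colour cues from the RGB stream to shape cues from the segmented stream. In this sketch that interaction is limited to a single linear layer.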

