Supervised framework for automatic recognition and retrieval of interaction: a framework for classification and retrieving videos with similar human interactions


This study presents the supervised framework for automatic recognition and retrieval of interactions (SAFARRI), a supervised learning framework to recognise interactions such as pushing, punching, and hugging between a pair of human performers in a video shot. The primary contribution of the study is to extend the vector of locally aggregated descriptors (VLAD), a compact and discriminative video encoding representation, to solve the complex class-partitioning problem of recognising human interaction. An initial codebook is generated from the training set of video shots by extracting feature descriptors around the spatiotemporal interest points computed across frames. A bag of action words is generated by encoding the first-order statistics of the visual words using VLAD. Support vector machine classifiers (one-against-all) are trained using these codebooks. The authors have verified SAFARRI's accuracy for classification and retrieval (query by example). SAFARRI is free from tracking or recognition of body parts and is capable of identifying the region of interaction in video shots. It gives superior retrieval and classification performance over recently proposed methods on two publicly available human interaction datasets.
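The pipeline described in the abstract (local descriptors per shot → k-means codebook → VLAD encoding of first-order residual statistics → one-against-all SVM) can be sketched in a few lines. This is a minimal illustration on synthetic descriptors, not the authors' implementation: the descriptor dimensionality, codebook size, and signed-square-root normalisation step are assumptions common to VLAD pipelines, and the spatiotemporal interest-point extraction is replaced here by randomly generated local features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def vlad_encode(descriptors, codebook):
    """VLAD: assign each local descriptor to its nearest visual word,
    accumulate the residuals (descriptor - centroid) per word, then
    concatenate and normalise the result into one shot-level vector."""
    k, d = codebook.shape
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = np.argmin(dists, axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[assignments == i]
        if len(assigned):
            vlad[i] = (assigned - codebook[i]).sum(axis=0)
    vlad = vlad.ravel()
    # signed square-root + L2 normalisation (a common VLAD post-processing)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Synthetic "video shots": each shot is a bag of local descriptors,
# standing in for features extracted around spatiotemporal interest points.
def make_shot(label, n=50, d=16):
    return np.full(d, float(label)) + rng.normal(size=(n, d))

shots = [make_shot(lbl) for lbl in (0, 1) for _ in range(10)]
labels = [lbl for lbl in (0, 1) for _ in range(10)]

# Codebook from all training descriptors pooled together.
kmeans = KMeans(n_clusters=4, n_init=5, random_state=0).fit(np.vstack(shots))
codebook = kmeans.cluster_centers_

# Encode every shot as one VLAD vector and train a linear SVM
# (LinearSVC uses a one-vs-rest scheme for multiclass problems).
X = np.array([vlad_encode(s, codebook) for s in shots])
clf = LinearSVC().fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```

Retrieval (query by example) follows from the same representation: the VLAD vectors are L2-normalised, so ranking shots by the dot product of their encodings gives a cosine-similarity nearest-neighbour search.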


