© The Institution of Engineering and Technology
This study presents SAFARRI, a supervised framework for automatic recognition and retrieval of interactions, designed to recognise interactions such as pushing, punching, and hugging between a pair of human performers in a video shot. The primary contribution of the study is to extend vectors of locally aggregated descriptors (VLADs), a compact and discriminative video encoding representation, to solve the complex class-partitioning problem of recognising human interactions. An initial codebook is generated from the training set of video shots by extracting feature descriptors around spatiotemporal interest points computed across frames. A bag of action words is then generated by encoding the first-order statistics of the visual words using VLAD, and one-against-all support vector machine classifiers are trained on these encodings. The authors verify SAFARRI's accuracy for both classification and retrieval (query by example). SAFARRI requires no tracking or recognition of body parts and is capable of identifying the region of interaction in video shots. It gives superior retrieval and classification performance over recently proposed methods on two publicly available human interaction datasets.
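The VLAD encoding step at the core of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses k-means as the codebook and synthetic descriptors in place of the paper's spatiotemporal interest-point features, and all function and variable names are our own.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, kmeans):
    """Aggregate local descriptors into a VLAD vector: for each codeword,
    sum the residuals of its assigned descriptors, then apply power
    normalisation and L2 normalisation to the concatenated result."""
    k, d = kmeans.cluster_centers_.shape
    assignments = kmeans.predict(descriptors)
    vlad = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[assignments == i]
        if len(assigned):
            vlad[i] = (assigned - kmeans.cluster_centers_[i]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))  # power normalisation
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Toy example: a 16-word codebook over 64-dimensional descriptors.
# In the paper, descriptors would come from spatiotemporal interest points.
rng = np.random.default_rng(0)
train_descriptors = rng.standard_normal((2000, 64))
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(train_descriptors)
video_descriptors = rng.standard_normal((500, 64))
v = vlad_encode(video_descriptors, kmeans)
print(v.shape)  # one fixed-length vector per video shot: 16 words x 64 dims
```

The resulting fixed-length vectors (one per video shot) are what the one-against-all SVM classifiers would be trained on, and their compactness is what makes query-by-example retrieval practical.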