Object sequences: encoding categorical and spatial information for a yes/no visual question answering task
The task of visual question answering (VQA) has gained wide popularity in recent times. Effectively solving the VQA task requires understanding both the visual content of the image and the language information in the text-based question. In this study, the authors propose a novel method of encoding the visual information (categorical and spatial object information) of all the objects present in the image into a sequential format, called an object sequence. These object sequences can then be suitably processed by a neural network. The authors experiment with multiple techniques for obtaining a joint embedding from the visual features (in the form of object sequences) and the language-based features obtained from the question. They also provide a detailed analysis of the performance of a neural network architecture using object sequences on the Oracle task of the GuessWhat dataset (a Yes/No VQA task) and benchmark it against the baseline.
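To make the idea concrete, the encoding described above can be sketched as follows: each detected object is turned into a fixed-length vector combining its category (here a one-hot vector) and normalised bounding-box coordinates, and the per-object vectors are stacked into a sequence. This is an illustrative sketch only; the function names (`encode_object`, `build_object_sequence`) and the exact choice of spatial features are assumptions, not the paper's specification.

```python
# Illustrative sketch: encode each object's categorical and spatial
# information as one vector, then stack the vectors into an
# "object sequence" that a sequence model (e.g. an LSTM) can consume.
# All names and the feature layout here are hypothetical.

def encode_object(category_id, bbox, image_w, image_h, num_categories):
    """One-hot category vector concatenated with normalised spatial features."""
    x_min, y_min, x_max, y_max = bbox
    one_hot = [0.0] * num_categories
    one_hot[category_id] = 1.0
    # Normalise coordinates to [0, 1]; also include box width and height.
    spatial = [
        x_min / image_w, y_min / image_h,
        x_max / image_w, y_max / image_h,
        (x_max - x_min) / image_w,
        (y_max - y_min) / image_h,
    ]
    return one_hot + spatial

def build_object_sequence(objects, image_w, image_h, num_categories):
    """Stack per-object encodings into a sequence (list of vectors)."""
    return [
        encode_object(cat, box, image_w, image_h, num_categories)
        for cat, box in objects
    ]

# Example: two objects in a 640x480 image, with 3 object categories.
seq = build_object_sequence(
    [(0, (32, 48, 96, 144)), (2, (320, 240, 480, 360))],
    image_w=640, image_h=480, num_categories=3,
)
```

Each element of `seq` has the same length (`num_categories + 6` here), so the sequence can be padded to a common length across images and fed to a recurrent or attention-based encoder alongside the question embedding.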