Object sequences: encoding categorical and spatial information for a yes/no visual question answering task

IET Computer Vision, DOI: 10.1049/iet-cvi.2018.5226


The task of visual question answering (VQA) has gained wide popularity in recent years. Effectively solving the VQA task requires understanding both the visual content of the image and the language information carried by the text-based question. In this study, the authors propose a novel method of encoding the visual information (the categorical and spatial information of all the objects present in the image) into a sequential format, called an object sequence, which can then be suitably processed by a neural network. They experiment with multiple techniques for obtaining a joint embedding from the visual features (in the form of object sequences) and the language-based features obtained from the question. They also provide a detailed analysis of the performance of a neural network architecture that uses object sequences on the Oracle task of the GuessWhat?! dataset (a yes/no VQA task) and benchmark it against the baseline.
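As a rough illustration of the idea described in the abstract, the sketch below builds a per-object vector from a category embedding and normalised bounding-box coordinates, stacks these vectors into an object sequence, and fuses an LSTM summary of that sequence with an LSTM summary of the question to produce a yes/no prediction. This is a minimal PyTorch sketch, not the authors' implementation: the 8-dimensional spatial encoding, the concatenation-based fusion, and all names and layer sizes (ObjectSequenceEncoder, OracleModel, cat_dim, hidden, and so on) are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class ObjectSequenceEncoder(nn.Module):
    # Embed each object's category id and concatenate it with an
    # 8-d normalised spatial encoding of its bounding box. Stacking
    # the resulting vectors over all objects in the image yields the
    # "object sequence". The exact spatial layout is an assumption.
    def __init__(self, num_categories, cat_dim=64):
        super().__init__()
        self.cat_embed = nn.Embedding(num_categories, cat_dim)
        self.out_dim = cat_dim + 8

    def forward(self, categories, boxes, img_w, img_h):
        # categories: (N,) int64 ids; boxes: (N, 4) as (x, y, w, h) in pixels.
        x, y, w, h = boxes.unbind(dim=1)
        spatial = torch.stack([
            x / img_w, y / img_h,                          # top-left corner
            (x + w) / img_w, (y + h) / img_h,              # bottom-right corner
            (x + 0.5 * w) / img_w, (y + 0.5 * h) / img_h,  # centre
            w / img_w, h / img_h,                          # width / height
        ], dim=1)                                          # (N, 8)
        return torch.cat([self.cat_embed(categories), spatial], dim=1)


class OracleModel(nn.Module):
    # One fusion variant: run an LSTM over the object sequence and
    # another over the question tokens, concatenate the final hidden
    # states, and classify yes/no. Concatenation plus an MLP is only
    # one of several plausible joint embeddings.
    def __init__(self, obj_dim, vocab_size, word_dim=300, hidden=256):
        super().__init__()
        self.obj_rnn = nn.LSTM(obj_dim, hidden, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.q_rnn = nn.LSTM(word_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # yes / no logits
        )

    def forward(self, obj_seq, question):
        # obj_seq: (B, N, obj_dim); question: (B, T) token ids.
        _, (h_obj, _) = self.obj_rnn(obj_seq)
        _, (h_q, _) = self.q_rnn(self.word_embed(question))
        joint = torch.cat([h_obj[-1], h_q[-1]], dim=1)  # (B, 2 * hidden)
        return self.classifier(joint)                   # (B, 2)
```

The concatenation of final hidden states shown here is only one fusion strategy; since the abstract indicates the authors compare multiple joint-embedding techniques, alternatives such as a bidirectional encoder or elementwise-product fusion would slot into the same skeleton.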
