
Deep neural network with attention model for scene text recognition


The authors present a deep neural network (DNN) with an attention model for scene text recognition. The proposed model does not require any segmentation of the input text image. The framework is inspired by attention models recently proposed for speech recognition and image captioning. In the proposed framework, feature extraction, feature attention and sequence recognition are integrated into a single jointly trainable network. Compared with previous approaches, the main contributions are as follows. (i) The attention model is applied to a DNN for recognising scene text, which effectively handles the sequence recognition problem caused by variable-length labels. (ii) Rigorous experiments are performed on a number of challenging benchmarks, including the IIIT5K, SVT, ICDAR2003 and ICDAR2013 datasets; the results show that the proposed model is comparable to or better than the state-of-the-art methods. (iii) The model contains only 6.5 million parameters, the fewest of any DNN model for scene text recognition reported to date.
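A minimal sketch of the kind of pipeline the abstract describes, a convolutional feature extractor followed by an attention-based recurrent decoder trained end to end, is given below. The layer sizes, class count and module names are illustrative assumptions for this sketch, not the authors' exact architecture.

```python
# Illustrative sketch: CNN encoder + attention decoder for word images.
# All dimensions and the 37-class alphabet are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNEncoder(nn.Module):
    """Turns a 32-pixel-high grey-scale word image into a sequence of column features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),    # 32 -> 16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),  # 16 -> 8
            nn.Conv2d(128, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),                                  # collapse height
        )

    def forward(self, images):                    # images: (B, 1, 32, W)
        f = self.conv(images)                     # (B, C, 1, W')
        return f.squeeze(2).permute(0, 2, 1)      # (B, W', C): one vector per image column


class AttentionDecoder(nn.Module):
    """Emits one character per step, attending over the encoder feature sequence."""
    def __init__(self, num_classes, feat_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(num_classes, hidden)
        self.rnn = nn.LSTMCell(hidden + feat_dim, hidden)
        self.attn = nn.Linear(hidden + feat_dim, 1)
        self.out = nn.Linear(hidden, num_classes)
        self.hidden = hidden

    def forward(self, feats, targets):            # feats: (B, T, C); targets: (B, L)
        B, T, _ = feats.shape
        h = feats.new_zeros(B, self.hidden)
        c = feats.new_zeros(B, self.hidden)
        logits = []
        for t in range(targets.size(1)):
            # Score each feature column against the current decoder state.
            scores = self.attn(torch.cat(
                [h.unsqueeze(1).expand(B, T, -1), feats], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)                    # (B, T) attention weights
            context = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # (B, C) attended feature
            h, c = self.rnn(torch.cat([self.embed(targets[:, t]), context], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)          # (B, L, num_classes)


if __name__ == "__main__":
    # Toy forward pass with teacher forcing: two 32x100 images, 5 decoding steps.
    enc, dec = CNNEncoder(), AttentionDecoder(num_classes=37)  # 26 letters + 10 digits + EOS
    images = torch.randn(2, 1, 32, 100)
    targets = torch.randint(0, 37, (2, 5))
    print(dec(enc(images), targets).shape)         # torch.Size([2, 5, 37])
```

Because the decoder emits one character per step until an end-of-sequence symbol, variable-length labels are handled without any per-character segmentation of the input image, which is the property the abstract highlights.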
