Deep neural network with attention model for scene text recognition

The authors present a deep neural network (DNN) with an attention model for scene text recognition. The proposed model does not require any segmentation of the input text image. The framework is inspired by the attention models recently presented for speech recognition and image captioning. In the proposed framework, feature extraction, feature attention and sequence recognition are integrated into a single jointly trainable network. Compared with previous approaches, the main contributions are as follows. (i) The attention model is applied to a DNN to recognise scene text, and it effectively solves the sequence recognition problem caused by variable-length labels. (ii) Rigorous experiments are performed across a number of challenging benchmarks, including the IIIT5K, SVT, ICDAR2003 and ICDAR2013 datasets. The experimental results show that the proposed model is comparable to or better than the state-of-the-art methods. (iii) The model contains only 6.5 million parameters. Compared with other DNN models for scene text recognition, this model has the fewest parameters to date.
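
To make the idea of attention over extracted features concrete, the following is a minimal NumPy sketch of one soft-attention step and a greedy decoding loop that stops at an end-of-sequence label, which is how attention decoders typically cope with variable-length labels. It is an illustration under stated assumptions, not the authors' network: the additive scoring function, the toy tanh recurrent update (a stand-in for a trained recurrent unit), the random weights, the 37-class label set, and all names (`attend`, `greedy_decode`, `W_f`, `W_s`, `v`, and so on) are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attend(features, state, W_f, W_s, v):
    """One soft-attention step over T feature vectors.

    features: (T, D) array, e.g. one vector per horizontal image position.
    state:    (H,) decoder state.
    Returns the attention weights (T,) and the context vector (D,).
    """
    # Additive scoring (an assumption): score_t = v . tanh(W_f f_t + W_s s)
    scores = np.tanh(features @ W_f + state @ W_s) @ v
    weights = softmax(scores)
    context = weights @ features  # weighted sum of the feature vectors
    return weights, context

def greedy_decode(features, params, eos=0, max_len=25):
    """Emit labels until EOS, so the output length is not tied to T."""
    W_f, W_s, v, W_c, W_h, W_o = params
    state = np.zeros(W_s.shape[0])
    labels = []
    for _ in range(max_len):
        _, context = attend(features, state, W_f, W_s, v)
        # Toy recurrent update; a real model would use a trained RNN cell.
        state = np.tanh(context @ W_c + state @ W_h)
        label = int(np.argmax(state @ W_o))
        if label == eos:
            break
        labels.append(label)
    return labels

# Hypothetical sizes: T positions, D feature dims, H state dims,
# A attention dims, K classes (assumed: 36 characters plus EOS).
rng = np.random.default_rng(0)
T, D, H, A, K = 12, 8, 16, 10, 37
features = rng.normal(size=(T, D))
params = (
    rng.normal(size=(D, A)), rng.normal(size=(H, A)), rng.normal(size=A),
    rng.normal(size=(D, H)), rng.normal(size=(H, H)), rng.normal(size=(H, K)),
)
weights, context = attend(features, np.zeros(H), *params[:3])
labels = greedy_decode(features, params)
```

With untrained random weights the decoded labels are meaningless; the point is only that the attention weights form a distribution over the T feature positions and that the loop can emit any number of labels up to `max_len`.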

Inspec keywords: neural nets; feature extraction; text detection; image sequences

Other keywords: deep neural network; SVT dataset; scene text recognition; feature extraction; sequence recognition; DNN; ICDAR2003 dataset; attention model; feature attention; speech recognition; ICDAR2013 datasets; image captioning; variable length labels; IIIT5K dataset

Subjects: Computer vision and image processing techniques; Neural computing techniques; Image recognition
