Convolutional recurrent neural networks with hidden Markov model bootstrap for scene text recognition

Fenglei Wang; Qiang Guo; Jun Lei; Jun Zhang

Author(s): Fenglei Wang¹; Qiang Guo¹; Jun Lei¹; Jun Zhang¹
- Affiliations: 1: Department of Information System and Management, National University of Defense Technology, Changsha, Hunan Province, People's Republic of China
Source: Volume 11, Issue 6, September 2017, pp. 497–504
DOI: 10.1049/iet-cvi.2016.0417, Print ISSN 1751-9632, Online ISSN 1751-9640

Received 21/12/2016, Revised 23/04/2017, Accepted 07/06/2017, Published 22/06/2017

Text recognition in natural scenes remains a challenging problem because text appearance is highly variable under unconstrained conditions. The authors develop a system that transcribes scene text images directly to text without character segmentation, formulating the problem as sequence labelling. They build a convolutional recurrent neural network that uses deep convolutional neural networks (CNNs) to model text appearance and recurrent neural networks (RNNs) to model sequence dynamics. The two models are complementary in their modelling capabilities and are integrated to form a segmentation-free system. They train a Gaussian mixture model–hidden Markov model (GMM–HMM) to supervise the training of the CNN model. The system is data driven and needs no hand-labelled training data. Their method has several appealing properties: (i) it can recognise text images of arbitrary length; (ii) the recognition process does not involve sophisticated character segmentation; (iii) it is trained on scene text images with only word-level transcriptions; (iv) it can recognise both lexicon-based and lexicon-free text. The proposed system achieves performance competitive with the state of the art on several public scene text datasets, including both lexicon-based and lexicon-free ones.
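The abstract does not spell out how the GMM-HMM supervises the CNN. In hybrid HMM/neural systems, a common way to obtain frame-level training targets from word-level transcriptions is Viterbi forced alignment, and the sketch below illustrates that idea only; it is not the authors' implementation. The function name `forced_align`, the dictionary-based score layout, and the toy scores are all assumptions made for illustration, and the HMM is simplified to one state per character with no skips.

```python
import math


def forced_align(frame_scores, transcript):
    """Viterbi forced alignment for a left-to-right HMM (one state per character).

    frame_scores[t][c] is a log-score for character c at frame t (hypothetical
    layout); transcript is the known word as a list of characters.
    Returns one character label per frame, usable as a frame-level target.
    """
    n_frames, n_states = len(frame_scores), len(transcript)
    if n_frames < n_states:
        raise ValueError("need at least one frame per character")
    neg_inf = -math.inf
    # best[t][s]: best log-score of frames 0..t with frame t assigned to state s
    best = [[neg_inf] * n_states for _ in range(n_frames)]
    # advanced[t][s]: True if the best path entered state s from s-1 at frame t
    advanced = [[False] * n_states for _ in range(n_frames)]
    best[0][0] = frame_scores[0][transcript[0]]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = best[t - 1][s]
            move = best[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best[t][s] = move
                advanced[t][s] = True
            else:
                best[t][s] = stay
            if best[t][s] > neg_inf:
                best[t][s] += frame_scores[t][transcript[s]]
    # Backtrace from the last character at the last frame.
    labels = [None] * n_frames
    s = n_states - 1
    for t in range(n_frames - 1, -1, -1):
        labels[t] = transcript[s]
        if advanced[t][s]:
            s -= 1
    return labels


# Toy example: four image frames (columns) and the word-level transcription "ab".
scores = [
    {"a": -0.1, "b": -3.0},
    {"a": -0.2, "b": -2.0},
    {"a": -2.5, "b": -0.3},
    {"a": -3.0, "b": -0.1},
]
labels = forced_align(scores, list("ab"))
print(labels)
```

Under these assumptions, the per-frame labels returned by the alignment would play the role of the supervision signal that the abstract attributes to the GMM-HMM, which is why only word-level transcriptions are needed.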

