Printed Persian OCR system using deep learning
- Author(s): Marziye Rahmati 1 ; Mansoor Fateh 1 ; Mohsen Rezvani 1 ; Alireza Tajary 1 ; Vahid Abolghasemi 1, 2
Affiliations:
1: Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran;
2: School of Computer Science and Electronic Engineering, University of Essex, UK
- Source: Volume 14, Issue 15, 15 December 2020, p. 3920–3931
DOI: 10.1049/iet-ipr.2019.0728 , Print ISSN 1751-9659, Online ISSN 1751-9667
Optical character recognition (OCR) is widely used owing to high demand across different technologies. Most existing OCR systems focus on Latin-script languages. Recent studies have also introduced OCR systems for non-Latin, cursive scripts, although these pose additional challenges. In this study, the authors propose an OCR system for the Persian language based on long short-term memory neural networks, and they investigate the effects of varying the parameters involved in this approach. The proposed OCR system solves false recognition of sub-word ‘LA’ and ‘LA’. Moreover, the authors present a preprocessing algorithm that uses image processing to remove ‘justification’. A new, comprehensive data set is introduced, comprising five million images in eight popular Persian fonts and ten font sizes. The evaluations show that the proposed OCR system improves accuracy by 2% over the existing Persian OCR system. The experimental results indicate an average accuracy of 99.69% at the letter level. At the word level, the proposed system achieves an accuracy of 98.1% for ‘zero-width non-breaking space’ and 98.64% for ‘LA’.
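The abstract reports accuracy at the letter level and at the word level but does not define either metric. A common convention, assumed here and not confirmed by the source, is one minus the edit distance between the recognised text and the ground truth, divided by the ground-truth length. The sketch below follows that assumption; the function names `edit_distance`, `letter_accuracy` and `word_accuracy` are illustrative and do not come from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _accuracy(ref: Sequence, hyp: Sequence) -> float:
    """Assumed metric: 1 - distance / reference length, clipped at zero."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def letter_accuracy(ref_text: str, ocr_text: str) -> float:
    """Accuracy over individual characters."""
    return _accuracy(ref_text, ocr_text)


def word_accuracy(ref_text: str, ocr_text: str) -> float:
    """Accuracy over whitespace-separated words."""
    return _accuracy(ref_text.split(), ocr_text.split())
```

Under this definition, a single wrong character in a long line barely moves the letter-level score but costs a whole word at the word level, which is one reason the two levels are usually reported separately.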
Inspec keywords: feature extraction; text analysis; neural nets; optical character recognition; document image processing; natural languages; natural language processing; learning (artificial intelligence)
Other keywords: Latin languages; existing Persian OCR system; Persian language; nonLatin texts; popular Persian fonts; short-term memory neural networks; optical character recognition; printed Persian OCR system; existing OCR systems
Subjects: Neural computing techniques; Computer vision and image processing techniques; Document processing and analysis techniques; Knowledge engineering techniques; Natural language interfaces; Image recognition