A novel Arabic OCR post-processing using rule-based and word context techniques

Iyad Abu Doush, Faisal Alkhateeb, Anwaar Hamdi Gharaibeh

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

Optical character recognition (OCR) is the process of automatically recognizing characters in scanned documents so that the text can be edited, indexed, and searched, and so that storage space is reduced. The text produced by OCR usually does not exactly match the text in the original document. OCR post-processing approaches can be used to minimize the number of incorrect words in the output. Correcting OCR errors is more complicated for Arabic because of properties of its script: letters are connected, different letters may have the same shape, and the same letter may take different forms. This paper provides a statistical Arabic language model and post-processing techniques based on hybridizing the error model approach with the context approach. The proposed model is language independent and is not constrained by string length. To the best of our knowledge, this is the first end-to-end OCR post-processing model applied to the Arabic language. To train the proposed model, we built an Arabic OCR context database containing 9000 images of Arabic text. The evaluation of the post-processing results is automated using our novel alignment technique, called fast automatic hashing text alignment. Our experimental results show that, with a training set of 1000 images, the rule-based system reduces the word error rate from 24.02% to 20.26%. Applied after this training to a test set of 500 images, the rule-based system reduces the word error rate from 14.95% to 14.53%. Using 1000 training images, the proposed hybrid OCR post-processing system reduces the word error rate from 24.02% to 18.96%. After training, on 500 test images, the hybrid system reduces the word error rate from 14.95% to 14.42%. These results show that the proposed hybrid system outperforms the rule-based system.
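The word error rate figures quoted in the abstract can be made concrete with a minimal sketch. It assumes the common definition of word error rate (word-level edit distance divided by the number of reference words) and uses a plain Levenshtein computation; it is not the authors' fast automatic hashing text alignment, whose details are not given here. The function names `word_error_rate` and `relative_reduction` are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenization and unit costs for insertion,
    deletion, and substitution. The paper may define these differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its start value."""
    return (before - after) / before


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
    # Relative reductions for the hybrid system's reported rates.
    print(relative_reduction(24.02, 18.96))
    print(relative_reduction(14.95, 14.42))
```

Read this way, the hybrid system's reported drop from 24.02% to 18.96% on the training images is a relative reduction of roughly 21%, while the drop from 14.95% to 14.42% on the test images is roughly 3.5%.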

Original language: English
Pages (from-to): 77-89
Number of pages: 13
Journal: International Journal on Document Analysis and Recognition
Volume: 21
Issue number: 1-2
State: Published - 1 Jun 2018

Keywords

  • Alignment technique
  • Arabic OCR post-processing
  • Automatic post-processing
  • Error model
  • Language model
