TY - GEN
T1 - Yarmouk Arabic OCR Dataset
AU - Doush, Iyad Abu
AU - Alkhateeb, Faisal
AU - Gharibeh, Anwaar Hamdi
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/10/8
Y1 - 2018/10/8
AB - Optical Character Recognition (OCR) is the process of automatically recognizing characters in scanned documents or images. OCR software uses machine learning to recognize the characters in a document, so it must first go through a training phase in which it learns to recognize the letters of the text. Training requires a standard dataset, which can also be used to evaluate the results obtained. In this research, we propose a dataset for printed Arabic OCR. To the best of our knowledge, no Arabic OCR dataset is available to the research community that includes ground truth and is large enough to build a robust Arabic OCR system. The proposed dataset is extracted randomly from Wikipedia so that it covers different topics. It consists of 4,587 Arabic articles with a total of 8,994 images.
KW - Arabic OCR Dataset
KW - Image Dataset
KW - Optical Character Recognition
UR - http://www.scopus.com/inward/record.url?scp=85056727569&partnerID=8YFLogxK
U2 - 10.1109/CSIT.2018.8486162
DO - 10.1109/CSIT.2018.8486162
M3 - Conference contribution
AN - SCOPUS:85056727569
T3 - 2018 8th International Conference on Computer Science and Information Technology, CSIT 2018
SP - 150
EP - 154
BT - 2018 8th International Conference on Computer Science and Information Technology, CSIT 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 8th International Conference on Computer Science and Information Technology, CSIT 2018
Y2 - 11 July 2018 through 12 July 2018
ER -