Yarmouk Arabic OCR Dataset

Iyad Abu Doush, Faisal Alkhateeb, Anwaar Hamdi Gharibeh

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Scopus citations

Abstract

Optical Character Recognition (OCR) is the process of automatically recognizing characters in scanned documents or document images. OCR software uses machine learning to recognize the characters in a document. Such software must first go through a training phase to learn how to recognize the letters in the text. To carry out the training phase, the OCR system needs a standard dataset, which can also be used to evaluate the obtained results. In this research, we propose a dataset for printed Arabic OCR. To the best of our knowledge, there is no Arabic OCR dataset that is available to the research community together with its ground truth and that is large enough to build a robust Arabic OCR system. The proposed dataset is extracted randomly from Wikipedia so that it covers different topics. It consists of 4,587 Arabic articles with a total of 8,994 images.

Original language: English
Title of host publication: 2018 8th International Conference on Computer Science and Information Technology, CSIT 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 150-154
Number of pages: 5
ISBN (Electronic): 9781538641521
State: Published - 8 Oct 2018
Event: 8th International Conference on Computer Science and Information Technology, CSIT 2018 - Amman, Jordan
Duration: 11 Jul 2018 to 12 Jul 2018

Publication series

Name: 2018 8th International Conference on Computer Science and Information Technology, CSIT 2018

Conference

Conference: 8th International Conference on Computer Science and Information Technology, CSIT 2018
Country/Territory: Jordan
City: Amman
Period: 11/07/18 to 12/07/18

Keywords

  • Arabic OCR Dataset
  • Image Dataset
  • Optical Character Recognition
