Published January 1, 2018 | Version v1
Publication Open

Upcycle Your

  • 1. Indian Institute of Technology Kharagpur
  • 2. University of California, San Diego

Description

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources, our approach uses OCR models trained for other languages written in Roman script. Currently, no dataset is available for Romanised Sanskrit OCR, so we bootstrap a dataset of 430 images, scanned in two different settings, together with their corresponding ground truth. For training, we synthetically generate training images for both settings. We find that the use of a copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) over the current state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al., 2016). We find that our system is robust in combating OCR-prone errors, as it obtains a CRR of 87.01% from an OCR output with a CRR of 35.76% for one of the dataset settings. A human judgement survey performed on the models shows that our proposed model produces predictions that are faster for a human to comprehend and faster to improve than those of the other systems.
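The description reports results as Character Recognition Rate (CRR) without defining it. As a minimal sketch only, the snippet below assumes one common convention: CRR is the share of reference characters not lost to edit errors, computed from the Levenshtein distance between the ground truth and the system output. The function names `edit_distance` and `crr` are illustrative, and the exact formula used in the paper may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def crr(reference: str, hypothesis: str) -> float:
    """Assumed definition: 100 * (1 - edit errors / reference length)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    errors = edit_distance(reference, hypothesis)
    # Clamp at zero so a very long, wrong hypothesis cannot go negative.
    return max(0.0, 1.0 - errors / len(reference)) * 100.0


# Hypothetical example: the OCR drops the dot below the retroflex sibilant.
truth = "dharmak\u1e63etre"
ocr_output = "dharmaksetre"
print(f"CRR: {crr(truth, ocr_output):.2f}%")
```

Under this assumed definition, the figures quoted above (35.76% for the raw OCR output, 87.01% after correction) correspond to an absolute gain of 51.25 percentage points for that dataset setting.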

⚠️ This is an automatic machine translation with an accuracy of 90-95%

Translated Description (Arabic)

نقترح نهجًا لتصحيح النص بعد التعرف الضوئي على الحروف لرقمنة النصوص باللغة السنسكريتية الرومانية .بالنظر إلى نقص الموارد، يستخدم نهجنا نماذج التعرف الضوئي على الحروف المدربة على اللغات الأخرى المكتوبة باللغة الرومانية .حاليًا، لا توجد مجموعة بيانات متاحة للنموذج السنسكريتية الروماني للتعرف الضوئي على الحروف .لذلك، نقوم بتمهيد مجموعة بيانات من 430 صورة، تم مسحها ضوئيًا في إعدادين مختلفين والحقيقة الأرضية المقابلة لهما .للتدريب، نقوم بتوليد صور تدريبية بشكل صناعي لكل من الإعدادين .نجد أن استخدام آلية النسخ (Gu et al.، 2016) ينتج عنه زيادة بنسبة 7.69 ٪ في معدل التعرف على الحروف (CRR) مقارنة بالحالة الحالية للنموذج الفني في حل مهام تسلسل وتسل الحروف الرتيبة (Schnober et al.، 2016) .نجد أن نظامنا قوي في مكافحة أخطاء OCR المعرضة للخطر، حيث يحصل على CRR بنسبة 87.01 ٪ من ناتج OCR مع CRR بنسبة 35.76 ٪ لأحد مجموعة البيانات .تظهر إعدادات الحكم البشري التي أجريت على نماذجنا أن النتائج المقترحة أسرع في فهم وتحسين الأنظمة البشرية الأخرى.

Translated Description (English)

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources, our approach uses OCR models trained for other languages written in Roman script. Currently, no dataset is available for Romanised Sanskrit OCR, so we bootstrap a dataset of 430 images, scanned in two different settings, together with their corresponding ground truth. For training, we synthetically generate training images for both settings. We find that the use of a copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) over the current state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al., 2016). We find that our system is robust in combating OCR-prone errors, as it obtains a CRR of 87.01% from an OCR output with a CRR of 35.76% for one of the dataset settings. A human judgement survey performed on the models shows that our proposed model produces predictions that are faster for a human to understand and faster to improve than those of the other systems.

Translated Description (Spanish)

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources, our approach uses OCR models trained for other languages written in Roman script. Currently, no dataset is available for Romanised Sanskrit OCR, so we bootstrap a dataset of 430 images, scanned in two different settings, together with their corresponding ground truth. For training, we synthetically generate training images for both settings. We find that the use of a copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) over the current state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al., 2016).

Files

K18-1034.pdf.pdf

Files (226 Bytes)

md5:5360980bad11bf9723da89687501effc
226 Bytes

Additional details

Additional titles

Translated title (Arabic)
Upcycle Your
Translated title (English)
Upcycle Your
Translated title (Spanish)
Upcycle Your

Identifiers

Other
https://openalex.org/W2891526782
DOI
10.18653/v1/k18-1034

GreSIS Basics Section

Is Global South Knowledge
Yes
Country
India

References

  • https://openalex.org/W1538460311
  • https://openalex.org/W1652129060
  • https://openalex.org/W1775264392
  • https://openalex.org/W1902237438
  • https://openalex.org/W2001642682
  • https://openalex.org/W2015896468
  • https://openalex.org/W2035487336
  • https://openalex.org/W2055555960
  • https://openalex.org/W2079479194
  • https://openalex.org/W2108610317
  • https://openalex.org/W2118947254
  • https://openalex.org/W2121879602
  • https://openalex.org/W2124558257
  • https://openalex.org/W2128458196
  • https://openalex.org/W2130942839
  • https://openalex.org/W2133464112
  • https://openalex.org/W2147880316
  • https://openalex.org/W2155450843
  • https://openalex.org/W2164863177
  • https://openalex.org/W2251529809
  • https://openalex.org/W2296283641
  • https://openalex.org/W2427181987
  • https://openalex.org/W2468727120
  • https://openalex.org/W2564757993
  • https://openalex.org/W2574685885
  • https://openalex.org/W2789474247
  • https://openalex.org/W2962784628
  • https://openalex.org/W2962910830
  • https://openalex.org/W2963661177
  • https://openalex.org/W2964165364
  • https://openalex.org/W2964308564