Upcycle Your
- 1. Indian Institute of Technology Kharagpur
- 2. University of California, San Diego
Description
We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources, our approach uses OCR models trained for other languages written in Roman script. Currently, no dataset is available for Romanised Sanskrit OCR, so we bootstrap a dataset of 430 images, scanned in two different settings, together with their corresponding ground truth. For training, we synthetically generate training images for both settings. We find that the use of a copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) over the current state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al., 2016). Our system is robust to OCR-prone errors: it obtains a CRR of 87.01% from an OCR output with a CRR of 35.76% for one of the dataset settings. A human judgement survey performed on the models shows that our proposed model produces predictions that are faster for a human to comprehend and to improve than those of the other systems.
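The description reports results as Character Recognition Rate (CRR) but does not define the metric. The sketch below assumes one common definition, the percentage of ground-truth characters left after subtracting the character-level edit distance, which may differ from the exact formula used in the paper. The function names (`levenshtein`, `crr`, `percentage_increase`) and the sample strings are illustrative only and are not taken from the authors' code or dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def crr(prediction: str, truth: str) -> float:
    """Assumed CRR: 100 * (len(truth) - edit distance) / len(truth), floored at 0."""
    if not truth:
        raise ValueError("ground truth must be non-empty")
    errors = levenshtein(prediction, truth)
    return max(0.0, 100.0 * (len(truth) - errors) / len(truth))


def percentage_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Hypothetical example: an OCR engine that drops the dot-below diacritic.
truth = "dharmak\u1e63etre kuruk\u1e63etre"   # \u1e63 is the precomposed s-with-dot-below
ocr_output = "dharmaksetre kuruksetre"
print(f"CRR of raw OCR output: {crr(ocr_output, truth):.2f}%")

# Relative gain implied by the two CRR figures quoted in the description.
print(f"Relative increase: {percentage_increase(87.01, 35.76):.1f}%")
```

Whether the reported increase of 7.69 is a relative or an absolute difference is not stated in this record, so `percentage_increase` should be read only as one possible interpretation.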
Files
Name | Size |
---|---|
K18-1034.pdf.pdf (md5:5360980bad11bf9723da89687501effc) | 226 Bytes |
Additional details
Identifiers
- Other
- https://openalex.org/W2891526782
- DOI
- 10.18653/v1/k18-1034
References
- https://openalex.org/W1538460311
- https://openalex.org/W1652129060
- https://openalex.org/W1775264392
- https://openalex.org/W1902237438
- https://openalex.org/W2001642682
- https://openalex.org/W2015896468
- https://openalex.org/W2035487336
- https://openalex.org/W2055555960
- https://openalex.org/W2079479194
- https://openalex.org/W2108610317
- https://openalex.org/W2118947254
- https://openalex.org/W2121879602
- https://openalex.org/W2124558257
- https://openalex.org/W2128458196
- https://openalex.org/W2130942839
- https://openalex.org/W2133464112
- https://openalex.org/W2147880316
- https://openalex.org/W2155450843
- https://openalex.org/W2164863177
- https://openalex.org/W2251529809
- https://openalex.org/W2296283641
- https://openalex.org/W2427181987
- https://openalex.org/W2468727120
- https://openalex.org/W2564757993
- https://openalex.org/W2574685885
- https://openalex.org/W2789474247
- https://openalex.org/W2962784628
- https://openalex.org/W2962910830
- https://openalex.org/W2963661177
- https://openalex.org/W2964165364
- https://openalex.org/W2964308564