Upcycle Your

doi:10.60692/rknqh-1b264

Published January 1, 2018 | Version v1

Publication Open

1. Indian Institute of Technology Kharagpur
2. University of California, San Diego

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources, our approach uses OCR models trained for other languages written in Roman script. Currently, no dataset is available for Romanised Sanskrit OCR, so we bootstrap a dataset of 430 images, scanned in two different settings, along with their corresponding ground truth. For training, we synthetically generate training images for both settings. We find that the use of a copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) over the current state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al., 2016). We find that our system is robust in combating OCR-prone errors: it obtains a CRR of 87.01% from an OCR output with a CRR of 35.76% for one of the dataset settings. A human judgement survey performed on the models shows that our proposed model produces predictions that are faster for a human to comprehend and to improve than those of the other systems.
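The abstract reports results as Character Recognition Rate but does not define the metric. The sketch below assumes a common definition: the share of ground-truth characters left after subtracting the character-level edit distance between the prediction and the ground truth. The function names, the NFC normalisation step, and the clamping at zero are illustrative assumptions, not details taken from the paper.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_recognition_rate(predicted: str, truth: str) -> float:
    """CRR in percent under the assumed definition (N - edits) / N * 100."""
    # Assumption: compose diacritics (NFC) so each accented letter counts once.
    predicted = unicodedata.normalize("NFC", predicted)
    truth = unicodedata.normalize("NFC", truth)
    if not truth:
        raise ValueError("ground truth must be non-empty")
    errors = edit_distance(predicted, truth)
    # Clamp at zero: a very noisy prediction can need more edits than N.
    return max(0.0, 100.0 * (len(truth) - errors) / len(truth))
```

Under this definition, an OCR output that drops the diacritics from a Romanised Sanskrit word is penalised once per affected letter, which is the kind of error a post-OCR correction model would be expected to repair.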

Files

K18-1034.pdf.pdf

Files (226 Bytes)

Name	Size	Checksum
K18-1034.pdf.pdf	226 Bytes	md5:5360980bad11bf9723da89687501effc

Additional details

Other: https://openalex.org/W2891526782
DOI: 10.18653/v1/k18-1034

Is Global South Knowledge: Yes
Country: India

https://openalex.org/W1538460311
https://openalex.org/W1652129060
https://openalex.org/W1775264392
https://openalex.org/W1902237438
https://openalex.org/W2001642682
https://openalex.org/W2015896468
https://openalex.org/W2035487336
https://openalex.org/W2055555960
https://openalex.org/W2079479194
https://openalex.org/W2108610317
https://openalex.org/W2118947254
https://openalex.org/W2121879602
https://openalex.org/W2124558257
https://openalex.org/W2128458196
https://openalex.org/W2130942839
https://openalex.org/W2133464112
https://openalex.org/W2147880316
https://openalex.org/W2155450843
https://openalex.org/W2164863177
https://openalex.org/W2251529809
https://openalex.org/W2296283641
https://openalex.org/W2427181987
https://openalex.org/W2468727120
https://openalex.org/W2564757993
https://openalex.org/W2574685885
https://openalex.org/W2789474247
https://openalex.org/W2962784628
https://openalex.org/W2962910830
https://openalex.org/W2963661177
https://openalex.org/W2964165364
https://openalex.org/W2964308564
