Optical Character Recognition and Translation from Sinhala to Tamil for Printed Documents
Introduction
The problem domain is the language barrier between Sinhala and Tamil: few resources exist for translating printed documents from Sinhala to Tamil. Accurate translation is difficult because the two languages differ significantly, and the scarcity of tools and trained translators for this language pair makes the problem worse.
Moreover, the translation process is costly and time-consuming, requiring manual translation and attention to detail. Addressing this problem necessitates bridging the language gap, increasing translation resources, and exploring more efficient workflows.
The solution can be broken into the following steps:
- Preprocess the printed document for OCR by ensuring good image quality.
- Enhance the image to improve its quality.
- Extract the text using OCR algorithms.
- Translate the extracted Sinhala text to Tamil using a machine translation system.
- Perform post-processing to clean up the translated text.
- Generate the translated text in the desired output format.
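The steps above can be sketched as a small pipeline. This is a minimal illustration in Python, not the project's implementation: the `Pipeline` class, its stage names (`enhance`, `ocr`, `translate`), and `clean_text` are all hypothetical. The OCR and translation stages are passed in as plain callables so that a real engine can be plugged in later. Only post-processing is written out, and treating it as Unicode normalization plus whitespace clean-up is an assumption.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable


def clean_text(text: str) -> str:
    """Post-process translated text: normalize Unicode and tidy whitespace."""
    # NFC puts letters and combining signs into one canonical form.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of spaces and tabs, then strip each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Drop empty lines, which often come from OCR layout noise.
    return "\n".join(line for line in lines if line)


@dataclass
class Pipeline:
    """Hypothetical wiring of the steps; each stage is an injected callable."""
    enhance: Callable[[bytes], bytes]   # image enhancement
    ocr: Callable[[bytes], str]         # image -> Sinhala text
    translate: Callable[[str], str]     # Sinhala text -> Tamil text

    def run(self, image: bytes) -> str:
        enhanced = self.enhance(image)
        sinhala = self.ocr(enhanced)
        tamil = self.translate(sinhala)
        return clean_text(tamil)


# Placeholder stages stand in for the real OCR and translation services.
demo = Pipeline(
    enhance=lambda image: image,
    ocr=lambda image: "  sinhala text  \n\n",
    translate=lambda text: text.replace("sinhala", "tamil"),
)
print(demo.run(b"fake-image-bytes"))  # prints: tamil text
```

Keeping each stage behind a callable means the OCR engine or translation service can be swapped or mocked in tests without touching the rest of the flow.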
Our approach
What we used
- Tesseract OCR to convert printed documents into machine-readable text
- Google Translation API to translate Sinhala text to Tamil
- ReactJS for frontend development
- NodeJS for backend development
- Visual Studio Code as the code editor
- Postman for API testing
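As a rough illustration of how the OCR and translation tools are parameterized, the sketch below builds a Tesseract command line and a translation request body without running or sending anything. It rests on several assumptions: the Tesseract language codes `sin`, `tam`, and `eng`; the Google Cloud Translation codes `si` and `ta`; and the Basic (v2) REST endpoint, since the API version the project uses is not stated here. The helper names are hypothetical, and authentication is left out.

```python
import json
from typing import Dict, List

# Tesseract accepts several languages joined with "+" in its -l option.
TESSERACT_LANGS = "+".join(["sin", "tam", "eng"])

# Assumed endpoint: Cloud Translation Basic (v2).
TRANSLATE_URL = "https://translation.googleapis.com/language/translate/v2"


def tesseract_command(image_path: str) -> List[str]:
    """Argument list that would make Tesseract write recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", TESSERACT_LANGS]


def build_translate_request(text: str, source: str = "si", target: str = "ta") -> Dict[str, str]:
    """Build (but do not send) a translation request; API key or token is omitted."""
    body = {"q": text, "source": source, "target": target, "format": "text"}
    # ensure_ascii=False keeps Sinhala text readable in the JSON body.
    return {"url": TRANSLATE_URL, "body": json.dumps(body, ensure_ascii=False)}
```

A backend would pass the first result to a process runner and the second to an HTTP client; both are kept as plain data here so they can be inspected in a tool such as Postman.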
Issues we faced and how we overcame them
- Integrating the components was difficult - We followed a modular design pattern
- The Tesseract OCR engine accepts only images, not document files - We convert documents into images before passing them to the system
- OCR extraction errors caused by logos, images, and unrecognized characters - We enabled all three official languages in the OCR engine. A text identification and localization module still needs to be implemented
- OCR extraction takes a noticeable amount of time - Planned fix: improve image preprocessing and introduce multithreading
- Ensuring the accuracy of the OCR extraction and translation modules - Currently checked manually. An automated, reliable checking system is still needed
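Two of the open items above, automated accuracy checking and multithreaded OCR, can be sketched as follows. This is one possible approach under stated assumptions, not the project's method: accuracy is measured as character error rate (Levenshtein edit distance divided by reference length) against a hand-checked reference text, and pages are processed with a thread pool. All function names are hypothetical, and `ocr` is any callable that turns page image bytes into text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length; 0.0 means a perfect match."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def ocr_pages(pages: Sequence[bytes], ocr: Callable[[bytes], str],
              workers: int = 4) -> List[str]:
    """Run an OCR callable over page images in parallel, keeping page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr, pages))
```

Two caveats apply. The error rate counts Unicode code points, so a Sinhala or Tamil letter written with a vowel sign counts as more than one character. Threads in Python mainly help when the OCR call waits on an external process or on I/O; a CPU-bound in-process engine may need separate processes instead.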