Optical character recognition (OCR) is the process of converting document images into an editable electronic format. This has many advantages, such as data compression, enabling search and edit operations on the text in images, and creating data for other applications such as machine translation, speech recognition, and the enhancement of dictionaries and language models. OCR in Indian languages is quite challenging because these languages are rich in inflections.
In our experiments with open-source and commercial OCR systems, we observed word error rates (WER) of around 20-50% on typewriter-printed documents. Moreover, even a highly accurate OCR system, with an accuracy as high as 90%, is not useful unless it is aided by a mechanism to identify errors. We therefore set out to develop an end-to-end framework for error detection and correction in Indic OCR. In our ICDAR 2017 conference paper, we outperformed the state of the art in error detection in Indic OCR for languages with varied inflections and solved the out-of-vocabulary problem for error correction in Indic OCR.
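To make the WER figures above concrete, here is a minimal sketch of one common way to compute a word error rate. It assumes WER is the word-level edit distance (substitutions, insertions, deletions) between the OCR output and the ground-truth text, divided by the number of ground-truth words. The text above does not state which definition was used in the experiments, so the function name `word_error_rate`, the whitespace tokenization, and the definition itself are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace; this is a hypothetical
    helper, not the evaluation code used in the experiments.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (row-by-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick brovn fox"))  # 0.25
```

Under this definition, a WER of 20-50% means that roughly one in five to one in two words would need to be edited to recover the ground truth.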
Prof. Ganesh Ramakrishnan