Article | Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden | Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

Title:
Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing
Author:
Mika Koistinen: National Library of Finland, The Centre for Preservation and Digitisation, Finland
Kimmo Kettunen: National Library of Finland, The Centre for Preservation and Digitisation, Finland
Tuula Pääkkönen: National Library of Finland, The Centre for Preservation and Digitisation, Finland
Year:
2017
Conference:
Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
Issue:
131
Article no.:
038
Pages:
277-283
No. of pages:
7
Publication type:
Abstract and Fulltext
Published:
2017-05-08
ISBN:
978-91-7685-601-7
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press, Linköpings universitet



In this paper, we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First, we create a model for Finnish Fraktur fonts. Second, we test Tesseract with the created Fraktur model and an Antiqua model on single images and on combinations of images, using different image preprocessing methods. Compared with the commercial ABBYY FineReader toolkit, our method achieves an improvement of 27.48% (FineReader 7 or 8) and 9.16% (FineReader 11) at the word level.

Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents


