Article | Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden | A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora

Title:
A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora
Author:
Aleksi Vesanto: Turku NLP Group, Department of FT
Asko Nivala: Cultural History, Finland / Turku Institute for Advanced Studies, University of Turku, Finland
Tapio Salakoski: Turku NLP Group, Department of FT
Hannu Salmi: Cultural History, Finland
Filip Ginter: Turku NLP Group, Department of FT
Download:
Full text (pdf)
Year:
2017
Conference:
Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
Issue:
131
Article no.:
049
Pages:
330-333
No. of pages:
4
Publication type:
Abstract and Fulltext
Published:
2017-05-08
ISBN:
978-91-7685-601-7
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press, Linköpings universitet



We present software for retrieving and exploring duplicated text passages in historical text corpora with low-quality OCR. The system combines NCBI BLAST, software created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910.

References:
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, Oct.
Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.
Kimmo Kettunen, Tuula Pääkkönen, and Mika Koistinen. 2016. Between diachrony and synchrony: Evaluation of lexical quality of a digitized historical Finnish newspaper and journal collection with morphological analyzers. In Baltic HLT.
David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, Nick Stramp, and John Wilkerson. 2014. Detecting and modeling local text reuse. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’14, pages 183–192, Piscataway, NJ, USA. IEEE Press.


