Article | Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16 | Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting

Title:
Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting
Author:
Eva Pettersson: Department of Linguistics and Philology, Uppsala University, Sweden, and Swedish National Graduate School of Language Technology
Beáta Megyesi: Department of Linguistics and Philology, Uppsala University, Sweden
Joakim Nivre: Department of Linguistics and Philology, Uppsala University, Sweden
Download:
Full text (pdf)
Year:
2013
Conference:
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16
Issue:
085
Article no.:
017
Pages:
163-179
No. of pages:
17
Publication type:
Abstract and Fulltext
Published:
2013-05-17
ISBN:
978-91-7519-589-6
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press; Linköpings universitet



Natural language processing for historical text poses a variety of challenges, such as dealing with a high degree of spelling variation. Furthermore, there is often not enough linguistically annotated data available for training part-of-speech taggers and other tools aimed at handling this specific kind of text. In this paper we present a Levenshtein-based approach to normalising historical text to a modern spelling. This enables us to apply standard NLP tools trained on contemporary corpora to the normalised version of the historical input text. In its basic version, no annotated historical data is needed, since the only data used for the Levenshtein comparisons is a contemporary dictionary or corpus. In addition, a (small) corpus of manually normalised historical text can optionally be included to learn normalisations for frequent words and weights for edit operations in a supervised fashion, which improves precision. We show that this method is successful both in terms of normalisation accuracy and in terms of the performance of a standard modern tagger applied to the historical text. We also compare our method to a previously implemented approach using a set of hand-written normalisation rules, and find that the Levenshtein-based approach clearly outperforms the hand-crafted rules. The experiments were carried out on Swedish data with promising results, and we believe that our method could be successfully applied to analyse historical text in other languages, including those with fewer resources.
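The core idea of the abstract — matching a historical word form against a modern wordlist with a Levenshtein distance whose edit operations carry learned weights — can be sketched as follows. The substitution weights and the tiny dictionary below are invented for illustration; in the paper the weights are learned from a manually normalised corpus, and the lookup runs against a full contemporary dictionary such as SALDO.

```python
# Sketch of weighted Levenshtein normalisation. The weights and the
# mini-dictionary are hypothetical examples, not the paper's learned values.

# Cheap substitutions for common historical/modern Swedish spelling pairs
SUB_WEIGHTS = {("f", "v"): 0.1, ("w", "v"): 0.1, ("e", "ä"): 0.2}

def sub_cost(a, b):
    """Cost of substituting character a by b (0 if identical)."""
    if a == b:
        return 0.0
    return SUB_WEIGHTS.get((a, b), 1.0)

def weighted_levenshtein(source, target):
    """Dynamic-programming edit distance with weighted substitutions."""
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return d[m][n]

def normalise(word, dictionary):
    """Pick the modern dictionary entry with the lowest weighted distance."""
    return min(dictionary, key=lambda w: weighted_levenshtein(word, w))

# Historical "hwilken" normalises to modern "vilken":
# delete "h" (1.0) + substitute "w" -> "v" (0.1) beats all alternatives.
print(normalise("hwilken", ["vilken", "vilja", "verken"]))  # vilken
```

The paper's context-sensitive variant additionally conditions edit weights on surrounding characters, and falls back to compound splitting for out-of-dictionary compounds, neither of which this sketch attempts.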

Keywords: Digital Humanities; Natural Language Processing; Historical Text; Normalisation; Levenshtein Edit Distance; Compound Splitting; Part-of-Speech Tagging; Under-Resourced Languages; Less-Resourced Languages

References:

Ågren, M., Fiebranz, R., Lindberg, E., and Lindström, J. (2011). Making verbs count. The research project 'Gender and Work' and its methodology. Scandinavian Economic History Review, 59(3):271-291. Forthcoming.

Baron, A. and Rayson, P. (2008). VARD2: A tool for dealing with spelling variation in historical corpora. In Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham.

Black, A. W. and Taylor, P. (1997). Festival speech synthesis system: system documentation. Technical report, University of Edinburgh, Centre for Speech Technology Research.

Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2).

Bollmann, M., Petran, F., and Dipper, S. (2011). Rule-based normalization of historical texts. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage, pages 34-42, Hissar, Bulgaria.

Borin, L., Forsberg, M., and Lönngren, L. (2008). SALDO 1.0 (svenskt associationslexikon version 2). Språkbanken, University of Gothenburg.

Brants, T. (2000). TnT - a statistical part-of-speech tagger. In Proceedings of the 6th Applied Natural Language Processing Conference (ANLP), Seattle, Washington, USA.

Ejerhed, E. and Källgren, G. (1997). Stockholm Umeå Corpus. Version 1.0. Produced by Department of Linguistics, Umeå University and Department of Linguistics, Stockholm University. ISBN 91-7191-348-3.

Halácsy, P., Kornai, A., and Oravecz, C. (2007). HunPos - an open source trigram tagger. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 209-212, Prague, Czech Republic.

Jurish, B. (2008). Finding canonical forms for historical German text. In Storrer, A., Geyken, A., Siebert, A., and Würzner, K.-M., editors, Text Resources and Lexical Knowledge: Selected Papers from the 9th Conference on Natural Language Processing (KONVENS 2008), pages 27-37. Mouton de Gruyter, Berlin.

Jurish, B. (2010). More Than Words: Using Token Context to Improve Canonicalization of Historical German. Journal for Language Technology and Computational Linguistics, 25(1):23-39.

Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377-439.

Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10(8):707-710.

Pettersson, E., Megyesi, B., and Nivre, J. (2012). Rule-based normalisation of historical text - a diachronic study. In Proceedings of the First International Workshop on Language Technology for Historical Text(s), Vienna, Austria.

Rayson, P., Archer, D., and Nicholas, S. (2005). VARD versus Word - A comparison of the UCREL variant detector and modern spell checkers on English historical corpora. In Proceedings from the Corpus Linguistics Conference Series on-line e-journal, volume 1, Birmingham, UK.

Stymne, S. (2008). German compounds in factored statistical machine translation. In Ranta, A. and Nordström, B., editors, Proceedings of GoTAL, 6th International Conference on Natural Language Processing, volume 5221, pages 464-475, Gothenburg, Sweden. Springer LNCS/LNAI.

Stymne, S. and Holmqvist, M. (2008). Processing of Swedish Compounds for Phrase-Based Statistical Machine Translation. In Proceedings of the 12th EAMT Conference, Hamburg, Germany.



Responsible for this page: Peter Berkesand
Last updated: 2017-02-21