Article | Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18 | An SMT approach to automatic annotation of historical text
Title:
An SMT approach to automatic annotation of historical text
Author:
Eva Pettersson: Department of Linguistics and Philology, Uppsala University, Sweden and Swedish National Graduate School of Language Technology
Beáta Megyesi: Department of Linguistics and Philology, Uppsala University, Sweden
Jörg Tiedemann: Department of Linguistics and Philology, Uppsala University, Sweden
Year:
2013
Conference:
Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18
Issue:
087
Article no.:
005
Pages:
54-69
No. of pages:
16
Publication type:
Abstract and Fulltext
Published:
2013-05-17
ISBN:
978-91-7519-587-2
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press; Linköpings universitet



In this paper we propose an approach to tagging and parsing of historical text, using character-based SMT methods for translating the historical spelling to a modern spelling before applying the NLP tools. This way, existing modern taggers and parsers may be used to analyse historical text instead of training new tools specialised in historical language, which might be hard considering the lack of linguistically annotated historical corpora. We show that our approach to spelling normalisation is successful even with small amounts of training data, and that it is generalisable to several languages. For the two languages presented in this paper, the proportion of tokens with a spelling identical to the modern gold standard spelling increases from 64.8% to 83.9%, and from 64.6% to 92.3% respectively, which has a positive impact on subsequent tagging and parsing using modern tools.
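
As a rough illustration of the two ideas in the abstract, the sketch below shows how a word might be presented to a character-based SMT system (each character becomes one token) and how the proportion of tokens identical to a gold standard spelling could be computed. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, the example tokens are invented for illustration and are not taken from the paper's corpora, and the translation step itself is left out.

```python
from typing import Sequence


def to_char_sequence(word: str) -> str:
    """Turn a word into space-separated characters, e.g. "abc" -> "a b c".

    A character-based SMT system can then treat each character as a token.
    """
    return " ".join(word)


def from_char_sequence(chars: str) -> str:
    """Join space-separated characters back into a single word."""
    return "".join(chars.split(" "))


def identical_token_ratio(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Proportion of tokens whose spelling is identical to the gold standard."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Invented toy data (assumption, for illustration only).
historical = ["öfver", "hafva", "och"]
gold = ["över", "hava", "och"]

# Hypothetical output of a normaliser that fixed one of the two old spellings.
normalised = ["över", "hafva", "och"]

baseline = identical_token_ratio(historical, gold)
after = identical_token_ratio(normalised, gold)
```

Here `baseline` corresponds to the "before" figure in the abstract (historical spelling compared directly with the gold standard) and `after` to the figure obtained once normalisation has been applied.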

Keywords: Digital Humanities; Natural Language Processing; Historical Text; Normalisation; Under-resourced Languages; Less-Resourced Languages; SMT
