Article | Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18 | An SMT approach to automatic annotation of historical text

Title:
An SMT approach to automatic annotation of historical text
Author:
Eva Pettersson: Department of Linguistics and Philology, Uppsala University, Sweden, and Swedish National Graduate School of Language Technology
Beáta Megyesi: Department of Linguistics and Philology, Uppsala University, Sweden
Jörg Tiedemann: Department of Linguistics and Philology, Uppsala University, Sweden
Year:
2013
Conference:
Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18
Issue:
087
Article no.:
005
Pages:
54-69
No. of pages:
16
Publication type:
Abstract and Fulltext
Published:
2013-05-17
ISBN:
978-91-7519-587-2
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press; Linköpings universitet



In this paper we propose an approach to tagging and parsing historical text that uses character-based SMT methods to translate the historical spelling to a modern spelling before applying the NLP tools. This way, existing modern taggers and parsers can be used to analyse historical text instead of training new tools specialised in historical language, which might be hard given the lack of linguistically annotated historical corpora. We show that our approach to spelling normalisation is successful even with small amounts of training data, and that it generalises to several languages. For the two languages presented in this paper, the proportion of tokens with a spelling identical to the modern gold standard spelling increases from 64.8% to 83.9% and from 64.6% to 92.3%, respectively, which has a positive impact on subsequent tagging and parsing with modern tools.
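As a rough illustration of the two ideas in the abstract, the Python sketch below shows (1) how a token can be rewritten as a space-separated character sequence, so that a phrase-based SMT system treats each character as a "word", and (2) how the reported measure, the proportion of tokens whose spelling is identical to the modern gold standard, can be computed. This is a minimal sketch and not the authors' implementation: the function names, the `_` boundary symbol, and the example word forms are assumptions made for illustration and are not taken from the paper's data.

```python
def to_char_sequence(token, boundary="_"):
    """Rewrite a token as space-separated characters.

    Internal spaces are replaced by a boundary symbol (an assumed
    convention) so they survive the round trip.
    """
    return " ".join(token.replace(" ", boundary))


def from_char_sequence(sequence, boundary="_"):
    """Undo to_char_sequence: join characters, restore real spaces."""
    return sequence.replace(" ", "").replace(boundary, " ")


def identical_token_share(predicted, gold):
    """Share of tokens whose spelling equals the gold standard spelling."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Illustrative word forms only (hypothetical, not from the paper's corpora).
historical = ["effter", "och", "skal"]
gold = ["efter", "och", "skall"]

# Input side of a character-level translation pair.
source_side = [to_char_sequence(tok) for tok in historical]

# Baseline: how many historical tokens already match the modern spelling.
baseline = identical_token_share(historical, gold)
```

In a setup of this kind, the character sequences would serve as the source and target sides of the SMT training data, and `identical_token_share` would be applied once to the unnormalised text and once to the system output to obtain before-and-after figures comparable to those quoted in the abstract.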

Keywords: Digital Humanities; Natural Language Processing; Historical Text; Normalisation; Underresourced Languages; Less-Resource Languages; SMT



