Article | Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language | Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts

Title:
Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts
Author:
Gerold Schneider: Institute of Computational Linguistics and Department of English, University of Zurich, Switzerland
Eva Pettersson: Department of Linguistics and Philology, Uppsala University, Sweden
Michael Percillier: Department of English, University of Mannheim, Germany
Year:
2017
Conference:
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
Issue:
133
Article no.:
008
Pages:
40-46
No. of pages:
7
Publication type:
Abstract and Fulltext
Published:
2017-05-10
ISBN:
978-91-7685-503-4
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press, Linköpings universitet



Applying existing natural language processing tools, such as taggers and parsers, to historical text requires an important preprocessing step: spelling normalisation, which converts the original spelling to present-day spelling. In this paper, we compare a probabilistic, language-independent approach to spelling normalisation based on statistical machine translation (SMT) techniques with a rule-based system combining dictionary lookup with rules and non-probabilistic weights. The rule-based system reaches the best accuracy, up to 94% precision at 74% recall, while the SMT system yields an improvement for each tested period.
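
As a rough illustration of the rule-based idea described in the abstract (dictionary lookup combined with rewrite rules and non-probabilistic weights), here is a minimal Python sketch. It is not the system evaluated in the paper: the lexicon, the rules, the weights, and the function names are all hypothetical and chosen only to show the general shape of such a normaliser.

```python
# Toy modern-English lexicon. Illustrative assumption, not the paper's data.
LEXICON = {"love", "up", "have", "never", "very"}

# (historical pattern, modern replacement, weight). Hypothetical rules;
# a higher weight marks a rule whose output is preferred.
RULES = [("vp", "up", 0.9), ("ue", "ve", 0.8), ("ie", "y", 0.5)]


def candidates(word):
    """Yield (candidate, weight) pairs, one per rule whose pattern occurs."""
    for old, new, weight in RULES:
        if old in word:
            yield word.replace(old, new), weight


def normalise(word):
    """Return a lower-cased modern spelling, or the lower-cased input
    unchanged if neither the lexicon nor any rule produces a known form."""
    lower = word.lower()
    if lower in LEXICON:  # already a known modern form
        return lower
    # Keep only candidates confirmed by the lexicon, then take the one
    # produced by the highest-weighted rule.
    valid = [(weight, cand) for cand, weight in candidates(lower)
             if cand in LEXICON]
    return max(valid)[1] if valid else lower


if __name__ == "__main__":
    for token in ["loue", "vp", "neuer", "verie", "xyz"]:
        print(token, "->", normalise(token))
```

The lexicon check acts as a filter on rule output, so a rule never rewrites a word into a form the dictionary does not know. An SMT-based normaliser would instead learn character-level correspondences from pairs of historical and modern spellings.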

References:

Alistair Baron and Paul Rayson. 2008. VARD 2: A tool for dealing with spelling variation in historical corpora. In Proceedings of the Postgraduate Conference in Corpus Linguistics, Birmingham. Aston University.


Douglas Biber, Edward Finegan, and Dwight Atkinson. 1994. Archer and its challenges: Compiling and exploring a representative corpus of historical English registers. In Udo Fries, Peter Schneider, and Gunnel Tottie, editors, Creating and using English language corpora, Papers from the 14th International Conference on English Language Research on Computerized Corpora, Zurich 1993, pages 1–13. Rodopi, Amsterdam.


BNC Consortium. 2007. The British National Corpus, Version 3. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/.


Peter Brown, Vincent Della Pietra, Stephen Della Pietra, and Robert Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), pages 263–311.


Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech 2008, pages 1618–1621.


Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180.


Hrafn Loftsson. 2008. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics, 31(1).


Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), pages 19–51.


Eva Pettersson, Beáta Megyesi, and Jörg Tiedemann. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the NoDaLiDa 2013 workshop on Computational Historical Linguistics.


Eva Pettersson, Beáta Megyesi, and Joakim Nivre. 2014. A multilingual evaluation of three spelling normalisation methods for historical text. In Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH) @ EACL 2014, pages 32–41, Gothenburg, Sweden.


Paul Rayson, Dawn Archer, Alistair Baron, Jonathan Culpeper, and Nicholas Smith. 2007. Tagging the bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of Corpus Linguistics 2007. University of Birmingham, UK.


Silke Scheible, Richard J. Whitt, Martin Durrell, and Paul Bennett. 2011. Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text. In Proceedings of the ACL-HLT 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2011), Portland, Oregon.


Christer Samuelsson and Atro Voutilainen. 1997. Comparing a linguistic and a stochastic tagger. In Proceedings of the ACL/EACL Joint Conference, Madrid.


Yves Scherrer and Tomaž Erjavec. 2013. Modernizing historical Slovene words with character-based SMT. In Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing, pages 58–62.


Gerold Schneider, Hans Martin Lehmann, and Peter Schneider. 2014. Parsing Early Modern English corpora. Literary and Linguistic Computing, first published online February 6, 2014 doi:10.1093/llc/fqu001.


Jörg Tiedemann. 2009. Character-based PSMT for closely related languages. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT’09), pages 12–19.


