Article | Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16 | Building a Large Automatically Parsed Corpus of Finnish | Linköping University Electronic Press Conference Proceedings

Title:
Building a Large Automatically Parsed Corpus of Finnish
Author:
Filip Ginter: Department of IT, University of Turku, Finland
Jenna Nyblom: Department of IT, University of Turku, Finland
Veronika Laippala: Department of Languages and Translation Studies, University of Turku, Finland
Samuel Kohonen: Department of IT, University of Turku, Finland
Katri Haverinen: Department of IT, University of Turku, Finland and Turku Centre for Computer Science (TUCS), Turku, Finland
Simo Vihjanen: Lingsoft, Inc., Turku, Finland
Tapio Salakoski: Department of IT, University of Turku, Finland and Turku Centre for Computer Science (TUCS), Turku, Finland
Year:
2013
Conference:
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16
Issue:
085
Article no.:
026
Pages:
291-300
No. of pages:
10
Publication type:
Abstract and Fulltext
Published:
2013-05-17
ISBN:
978-91-7519-589-6
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press; Linköpings universitet



We describe the methods and resources used to build FinnTreeBank-3, a 76.4 million token corpus of Finnish with automatically produced morphological and dependency syntax analyses. Starting from a definition of the target dependency scheme, we show how existing resources are transformed to conform to this definition and subsequently used to develop a parsing pipeline capable of processing a large-scale corpus. An independent formal evaluation demonstrates high accuracy of both morphological and syntactic annotation layers. The parsed corpus is freely available within the FIN-CLARIN infrastructure project.

Keywords: Dependency parsing; Finnish; CLARIN; parsebank; treebank
