Article | Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands | Using Data Mining and the CLARIN Infrastructure to Extend Corpus-based Linguistic Research

Title:
Using Data Mining and the CLARIN Infrastructure to Extend Corpus-based Linguistic Research
Author:
Thomas Bartz: TU Dortmund University, Department of German Language and Literature, Dortmund, Germany
Christian Pölitz: TU Dortmund University, Artificial Intelligence Group, Dortmund, Germany
Katharina Morik: TU Dortmund University, Artificial Intelligence Group, Dortmund, Germany
Angelika Storrer: Mannheim University, Department of German Philology, Mannheim, Germany
Year:
2014
Conference:
Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands
Issue:
116
Article no.:
001
Pages:
1-13
No. of pages:
13
Publication type:
Abstract and Fulltext
Published:
2015-08-26
ISBN:
978-91-7685-954-4
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Publisher:
Linköping University Electronic Press, Linköpings universitet



Large digital corpora of written language, such as those held by the CLARIN-D centers, provide excellent opportunities for linguistic research on authentic language data. Nonetheless, the large number of hits that can be retrieved from corpora often poses challenges in concrete linguistic research settings. This is particularly the case if the queried word forms or constructions are (semantically) ambiguous. The joint project ‘Corpus-based Linguistic Research and Analysis Using Data Mining’ (“Korpus-basierte linguistische Recherche und Analyse mit Hilfe von Data-Mining”, ‘KobRA’) is therefore investigating the benefits and issues of using machine learning technologies to perform post-retrieval cleaning and disambiguation tasks automatically. This article gives an overview of the questions, methodologies and current results of the project, specifically in the scope of corpus-based lexicography and historical semantics. In this area, topic models were used to partition KWIC lists of search results, retrieved by querying various corpora for polysemous or homonymous words, according to the individual meanings of these words.
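
To make the partitioning step more concrete, the following is a minimal sketch, not the project's implementation. It assumes an LDA-style topic model fitted by collapsed Gibbs sampling, with each KWIC snippet treated as one short document and then assigned to its dominant topic. The abstract does not name the model, its parameters or its preprocessing, so the function name `partition_kwic` and the values of `num_senses`, `alpha`, `beta` and `iterations` are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def partition_kwic(snippets, num_senses=2, iterations=200,
                   alpha=0.1, beta=0.01, seed=0):
    """Group KWIC snippets by the dominant topic of an LDA-style model.

    snippets: list of token lists (context words around the queried word).
    Returns a dict mapping topic index -> list of snippet indices.
    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for s in snippets for w in s})

    # Count tables used by the collapsed Gibbs sampler.
    doc_topic = [[0] * num_senses for _ in snippets]
    topic_word = [defaultdict(int) for _ in range(num_senses)]
    topic_total = [0] * num_senses

    # Start with a random topic for every token.
    assignments = []
    for d, tokens in enumerate(snippets):
        z_doc = []
        for w in tokens:
            z = rng.randrange(num_senses)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, tokens in enumerate(snippets):
            for i, w in enumerate(tokens):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_senses)
                ]
                z = rng.choices(range(num_senses), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Assign each snippet to the topic most of its tokens ended up in.
    groups = defaultdict(list)
    for d, counts in enumerate(doc_topic):
        dominant = max(range(num_senses), key=counts.__getitem__)
        groups[dominant].append(d)
    return dict(groups)


if __name__ == "__main__":
    # Invented context words for an ambiguous query word such as "bank".
    hits = [
        ["river", "water", "shore"],
        ["money", "loan", "credit"],
        ["water", "shore", "fish"],
        ["loan", "credit", "interest"],
    ]
    print(partition_kwic(hits, num_senses=2))
```

The sampler is stochastic, so the grouping depends on the seed and on the amount of context per snippet. On very short snippets the partition can be unstable, and the resulting groups would still have to be checked against the word's actual senses by a human annotator.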

Keywords: corpus-based linguistic and lexicographic studies; data mining; disambiguation



Responsible for this page: Peter Berkesand
Last updated: 2017-02-21