
Title:
You Get what You Annotate: A Pedagogically Annotated Corpus of Coursebooks for Swedish as a Second Language
Author:
Elena Volodina: Swedish Language Bank, Department of Swedish, University of Gothenburg, Sweden
Ildikó Pilán: Swedish Language Bank, Department of Swedish, University of Gothenburg, Sweden
Stian Rødven Eide: Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, Sweden
Hannes Heidarsson: Department of Swedish, University of Gothenburg, Sweden
Year:
2014
Conference:
Proceedings of the third workshop on NLP for computer-assisted language learning at SLTC 2014, Uppsala University
Issue:
107
Article no.:
010
Pages:
128–144
No. of pages:
17
Publication type:
Abstract and Fulltext
Published:
2014-11-11
ISBN:
978-91-7519-175-1
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press, Linköpings universitet



We present the COCTAILL corpus, containing over 700,000 tokens of Swedish text from 12 coursebooks aimed at second/foreign language (L2) learning. Each text in the corpus is labelled with a proficiency level according to the CEFR proficiency scale. Genres, topics, associated activities, vocabulary lists and other types of information are annotated in the coursebooks to facilitate Second Language Acquisition (SLA)-aware studies and experiments aimed at Intelligent Computer-Assisted Language Learning (ICALL). Linguistic annotation in the form of parts of speech (POS; e.g. nouns, verbs), base forms (lemmas) and syntactic relations (e.g. subject, object) has also been added to the corpus. In the article we describe our annotation scheme and the editor we have developed for the content mark-up of the coursebooks, including the taxonomy of pedagogical activities and linguistic skills. Inter-annotator agreement has been computed on a subset of the corpus and is reported. Surprisingly, we have not found any other examples of pedagogically marked-up corpora based on L2 coursebooks whose experience we could draw on. Hence, our work may be viewed as "groping in the darkness" and eventually as a starting point for others. The paper also presents our first quantitative exploration of the corpus, where we focus on textually and pedagogically annotated features of the coursebooks to exemplify what types of studies can be performed using the presented annotation scheme. We explore trends in the use of topics and genres across proficiency levels and compare the pedagogical focus of exercises across levels. The final section of the paper summarises the potential this corpus holds for research within SLA and for various ICALL tasks.

Keywords: L2 coursebook corpus; annotation scheme; CEFR proficiency levels; SLA-aware ICALL; inter-annotator agreement




Responsible for this page: Peter Berkesand
Last updated: 2017-02-21