This paper reports on the status of learner corpus anonymization in the ongoing research infrastructure project SweLL. The main aim of the project is to deliver, and make available for research, a well-annotated corpus of essays written by second language (L2) learners of Swedish. As practice shows, annotating learner texts is a sensitive process that demands many compromises between ethical and legal demands on the one hand and research and technical demands on the other. Below is a concise description of the current status of pseudonymization of language learner data, which is intended to ensure the anonymity of the learners, with numerous examples of the above-mentioned compromises.
Keywords: learner corpus, anonymization, pseudonymization, legal issues, GDPR
Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning (NLP4CALL 2018) at SLTC, Stockholm, 7th November 2018