Article | Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26–28 October 2016, CLARIN Common Language Resources and Technology Infrastructure | Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries

Title:
Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries
Author:
Michael Beißwenger: University of Duisburg-Essen, Germany
Thierry Chanier: Université Clermont, Auvergne, France
Tomaž Erjavec: Jožef Stefan Institute, Ljubljana, Slovenia
Darja Fišer: University of Ljubljana, Ljubljana, Slovenia
Axel Herold: Berlin-Brandenburg Academy of Sciences, Berlin, Germany
Nikola Ljubešić: Jožef Stefan Institute, Ljubljana, Slovenia
Harald Lüngen: Institute for the German Language, Mannheim, Germany
Céline Poudat: Université de Nice, Sophia Antipolis, France
Egon Stemle: Eurac Research, Bolzano, Italy
Angelika Storrer: University of Mannheim, Mannheim, Germany
Ciara Wigham: Université Clermont, Auvergne, France
Download:
Full text (pdf)
Year:
2017
Conference:
Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26–28 October 2016, CLARIN Common Language Resources and Technology Infrastructure
Issue:
136
Article no.:
001
Pages:
1-18
No. of pages:
18
Publication type:
Abstract and Fulltext
Published:
2017-05-23
ISBN:
978-91-7685-499-0
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Publisher:
Linköping University Electronic Press, Linköpings universitet



The paper presents best practices and results from projects in four countries dedicated to creating corpora of computer-mediated communication and social media interactions (CMC). Although many issues in building and annotating corpora of this type remain open, a range of tested solutions already exists that may serve as a starting point for a comprehensive discussion of how future standards for CMC corpora could (and should) be shaped.

Keywords: CMC corpora, computer-mediated communication, social media corpora, corpus annotation, language resources, TEI, community building



