
Title:
The Swedish Culturomics Gigaword Corpus: A One Billion Word Swedish Reference Dataset for NLP
Author:
Stian Rødven Eide: Språkbanken, Dept. of Swedish, University of Gothenburg, Sweden
Nina Tahmasebi: Språkbanken, Dept. of Swedish, University of Gothenburg, Sweden
Lars Borin: Språkbanken, Dept. of Swedish, University of Gothenburg, Sweden
Year:
2016
Conference:
Digital Humanities 2016. From Digitization to Knowledge 2016: Resources and Methods for Semantic Processing of Digital Works/Texts, Proceedings of the Workshop, July 11, 2016, Krakow, Poland
Issue:
126
Article no.:
002
Pages:
8–12
No. of pages:
5
Publication type:
Abstract and Fulltext
Published:
2016-07-08
ISBN:
978-91-7685-733-5
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Publisher:
Linköping University Electronic Press, Linköpings universitet



In this paper, we present a dataset of contemporary Swedish containing one billion words. The dataset draws on a wide range of sources, all annotated using a state-of-the-art corpus annotation pipeline, and is intended to be a static and clearly versioned dataset. This will facilitate reproducibility of experiments across institutions and make it easier to compare NLP algorithms on contemporary Swedish. The dataset contains sentences from 1950 to 2015 and has been carefully designed to feature a good mix of genres, balanced over each included decade. The sources include literary, journalistic, academic and legal texts, as well as blogs and web forum entries.



