Article | NEAL Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland | Comparing the Performance of Feature Representations for the Categorization of the Easy-to-Read Variety vs Standard Language Linköping University Electronic Press Conference Proceedings

Title:
Comparing the Performance of Feature Representations for the Categorization of the Easy-to-Read Variety vs Standard Language
Author:
Marina Santini: RISE Research Institutes of Sweden (Division ICT - RISE SICS East), Stockholm, Sweden
Benjamin Danielsson: Department of Computer and Information Science, Linköping University, Linköping, Sweden
Arne Jönsson: RISE Research Institutes of Sweden, Stockholm, Sweden / Department of Computer and Information Science, Linköping University, Linköping, Sweden
Year:
2019
Conference:
NEAL Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland
Issue:
167
Article no.:
011
Pages:
105--114
No. of pages:
9
Publication type:
Abstract and Fulltext
Published:
2019-10-02
ISBN:
978-91-7929-995-8
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press, Linköpings universitet



We explore the effectiveness of four feature representations -- bag-of-words, word embeddings, principal components and autoencoders -- for the binary categorization of the easy-to-read variety vs standard language. Standard language refers to the ordinary language variety used by a population as a whole or by a community, while the "easy-to-read" variety is a simpler (or a simplified) version of the standard language. We test the efficiency of these feature representations on three corpora, which differ in size, class balance, unit of analysis, language and topic. We rely on supervised and unsupervised machine learning algorithms. Results show that bag-of-words is a robust and straightforward feature representation for this task and performs well in many experimental settings. Its performance is equivalent or equal to the performance achieved with principal components and autoencoders, whose preprocessing is, however, more time-consuming. Word embeddings are less accurate than the other feature representations for this classification task.
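To make the bag-of-words representation mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' setup (the keywords indicate the experiments were run in Weka). The toy sentences, the labels `easy` and `standard`, and the nearest-centroid cosine classifier are all assumptions chosen purely for illustration.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def train_centroids(labelled_texts):
    """Sum the bag-of-words counts of all texts in each class."""
    centroids = {}
    for text, label in labelled_texts:
        centroids.setdefault(label, Counter()).update(bag_of_words(text))
    return centroids

def classify(text, centroids):
    """Assign the class whose summed vector is most similar to the text."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Invented toy data (assumption): not taken from the paper's corpora.
TRAIN = [
    ("the cat sat on the mat", "easy"),
    ("the dog ran to the park", "easy"),
    ("the committee deliberated extensively regarding the legislative proposal", "standard"),
    ("notwithstanding considerable opposition the amendment was subsequently ratified", "standard"),
]

centroids = train_centroids(TRAIN)
prediction = classify("the dog sat on the mat", centroids)
print(prediction)
```

The sketch only shows how word counts can serve directly as features for a two-class decision; the other three representations compared in the paper would each add a further transformation step before classification.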

Keywords: feature representation, text classification, easy-to-read variety, standard language, Weka, supervised machine learning, deep learning, clustering, bag-of-words, principal components, autoencoders, word embeddings


