Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania | The Effect of Author Set Size in Authorship Attribution for Lithuanian
The Effect of Author Set Size in Authorship Attribution for Lithuanian
Jurgita Kapočiūtė-Dzikienė: Vytautas Magnus University, Kaunas, Lithuania; Ligita Šarkutė: Kaunas University of Technology, Kaunas, Lithuania; Andrius Utka: Vytautas Magnus University, Kaunas, Lithuania
This paper reports the first results on the effect of author set size on authorship attribution for the Lithuanian language, obtained with automatic computational methods. The aim is to determine how quickly attribution results deteriorate as the number of candidate authors gradually increases: starting from 3 and going up to 5, 10, 20, 50, and 100. Using supervised machine learning techniques, we also investigated the effect of balancing the dataset, the influence of different feature types (lexical, character, morphological, etc.), and the influence of language type (normative parliamentary speeches and non-normative forum posts). The experiments revealed that the effectiveness of the method and the feature type depends more on the language type than on the number of candidate authors. Content features based on word lemmas are the most useful type for normative texts, because Lithuanian is a highly inflective language with rich morphology and vocabulary. Character features are the most accurate type for forum posts, where the texts are too complicated to be processed effectively with external morphological tools.
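To make the notion of character features and a varying candidate author set concrete, the following is a minimal sketch rather than the method used in the paper. It assumes character trigram counts as features and swaps in a simple nearest-profile classifier with cosine similarity in place of the supervised learners used in the experiments. All function names (`char_ngrams`, `cosine`, `build_profiles`, `attribute`) and the toy English texts are hypothetical and serve illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(training_texts, n=3):
    """Merge each author's training texts into one n-gram profile."""
    profiles = {}
    for author, texts in training_texts.items():
        profile = Counter()
        for text in texts:
            profile.update(char_ngrams(text, n))
        profiles[author] = profile
    return profiles


def attribute(text, profiles, n=3):
    """Return the candidate author whose profile is most similar to the text."""
    features = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(features, profiles[author]))


# Invented toy data, for illustration only (not from the paper's corpora).
training_texts = {
    "author_a": ["the budget committee approved the budget proposal"],
    "author_b": ["lol that game was sooo good, gonna play again"],
    "author_c": ["weather forecasts predict heavy rain and strong wind"],
}
profiles = build_profiles(training_texts)

# A smaller candidate author set is simulated by restricting the profiles.
small_set = {a: profiles[a] for a in ("author_a", "author_b")}
predicted = attribute("the committee discussed the budget", small_set)
```

Because character n-grams need no morphological analysis, a sketch like this runs unchanged on non-normative text. Lemma-based content features would instead require an external lemmatizer, which is not shown here.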

