Article | Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16 | Exploring Features for Named Entity Recognition in Lithuanian Text Corpus

Title:
Exploring Features for Named Entity Recognition in Lithuanian Text Corpus
Author:
Jurgita Kapočūtė-Dzikienė (Kaunas University of Technology, Kaunas, Lithuania); Anders Nøklestad (University of Oslo, Norway); Janne Bondi Johannessen (University of Oslo, Norway); Algis Krupavičius (Kaunas University of Technology, Kaunas, Lithuania)
Year:
2013
Conference:
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16
Issue:
085
Article no.:
011
Pages:
73-88
No. of pages:
16
Publication type:
Abstract and Fulltext
Published:
2013-05-17
ISBN:
978-91-7519-589-6
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press; Linköpings universitet



Despite the existence of effective methods that solve named entity recognition tasks for such widely used languages as English, there is no clear answer as to which methods are the most suitable for languages that are substantially different. In this paper we attempt to solve a named entity recognition task for Lithuanian, using a supervised machine learning approach and exploring different sets of features in terms of orthographic and grammatical information, different windows, etc. Although the performance is significantly higher when language-dependent features based on gazetteer lookup and automatic grammatical tools (part-of-speech tagger, lemmatizer or stemmer) are taken into account, we demonstrate that the performance does not degrade when features based on grammatical tools are replaced with affix information only. The best results (micro-averaged F-score = 0.895) were obtained using all available features, but the results decreased by only 0.002 when features based on grammatical tools were omitted.
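
As an illustration of the kind of features the abstract mentions (orthographic cues, affix information, and a context window), the following is a minimal Python sketch. It is not the authors' feature set: the function name `token_features`, the feature names, the window size, the maximum affix length, and the example sentence are all assumptions made for this illustration.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 1,
                   max_affix: int = 3) -> Dict[str, object]:
    """Build orthographic and affix features for tokens[i] and its neighbours.

    Hypothetical helper: feature names and defaults are illustrative only.
    """
    feats: Dict[str, object] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0 or j >= len(tokens):
            # Mark window positions that fall outside the sentence.
            feats[f"{offset}:pad"] = True
            continue
        tok = tokens[j]
        # Orthographic features: capitalisation, all-caps, presence of digits.
        feats[f"{offset}:is_title"] = tok.istitle()
        feats[f"{offset}:is_upper"] = tok.isupper()
        feats[f"{offset}:has_digit"] = any(ch.isdigit() for ch in tok)
        # Affix features: prefixes and suffixes of length 1..max_affix,
        # used here in place of output from a tagger, lemmatizer or stemmer.
        low = tok.lower()
        for n in range(1, min(max_affix, len(low)) + 1):
            feats[f"{offset}:prefix{n}"] = low[:n]
            feats[f"{offset}:suffix{n}"] = low[-n:]
    return feats


# Made-up example sentence; features are computed for the token "Kaune".
sentence = ["Jonas", "gyvena", "Kaune", "."]
features = token_features(sentence, 2)
```

Each token's feature dictionary could then be passed to a supervised sequence classifier. The suffix features give the learner access to inflectional endings without requiring any external grammatical tool.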

Keywords: Named entity recognition and classification; supervised machine learning; Lithuanian



