Lexical recognition tests (LRTs) are widely used to assess learners’ vocabulary size. We investigate the role that diacritics play in increasing the difficulty of an Arabic LRT. We implement an NLP pipeline to reliably estimate the frequency of diacritized word forms. We then conduct a user study comparing Arabic LRTs in three settings: one without diacritics, and two in which each word is diacritized with its most frequent or its least frequent diacritized form, respectively. We find that using infrequent diacritized forms is more effective at increasing the difficulty of Arabic LRTs.
Keywords: Lexical Recognition Tests, Arabic LRTs, Vocabulary Size, Diacritics, Frequency Counts, Test Difficulty/Generation
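The three test settings described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a corpus that is already tokenized and diacritized, treats the Unicode range U+064B to U+0652 plus U+0670 as the set of diacritics, and uses hypothetical helper names (`strip_diacritics`, `diacritized_form_counts`, `lrt_variants`) chosen only for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: fathatan..sukun (U+064B-U+0652) plus dagger alif (U+0670)
# are the marks treated as diacritics in this sketch.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(word):
    """Return the word with all diacritic marks removed."""
    return DIACRITICS.sub("", word)


def diacritized_form_counts(tokens):
    """Map each bare word to a Counter of its diacritized forms."""
    counts = defaultdict(Counter)
    for token in tokens:
        bare = strip_diacritics(token)
        if bare != token:  # skip tokens that carry no diacritics
            counts[bare][token] += 1
    return counts


def lrt_variants(bare, counts):
    """Return the three test variants for one word, or None if unseen."""
    forms = counts.get(bare)
    if not forms:
        return None
    ranked = forms.most_common()  # sorted by descending frequency
    return {
        "none": bare,
        "most_frequent": ranked[0][0],
        "least_frequent": ranked[-1][0],
    }


# Toy corpus (hypothetical counts): two diacritizations of the same letters.
BARE = "\u0643\u062A\u0628"                      # k-t-b without diacritics
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # with fatha marks
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # with damma marks

counts = diacritized_form_counts([KATABA, KATABA, KATABA, KUTUB])
variants = lrt_variants(BARE, counts)
```

With this toy corpus, `variants["most_frequent"]` is the form that occurs three times and `variants["least_frequent"]` is the form that occurs once. Real frequency estimates would additionally depend on how the corpus was diacritized, which this sketch does not model.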
Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning (NLP4CALL 2018) at SLTC, Stockholm, 7th November 2018