Conference article

Enriching and Increasing the Usability of Lexicographical Data for Less-Resourced Languages

Dirk Goldhahn
Natural Language Processing Group, University of Leipzig, Germany. Saxon Academy of Sciences and Humanities, Leipzig, Germany

Thomas Eckart
Natural Language Processing Group, University of Leipzig, Germany. Saxon Academy of Sciences and Humanities, Leipzig, Germany

Sonja Bosch
Department of African Languages, University of South Africa, South Africa

Download article: https://doi.org/10.3384/ecp2020172004

Published in: Selected Papers from the CLARIN Annual Conference 2019

Linköping Electronic Conference Proceedings 172:4, p. 23-32

Published: 2020-07-03

ISBN: 978-91-7929-807-4

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This paper presents a use case for enriching lexicographical data for less-resourced languages using the CLARIN infrastructure. The work builds on newly prepared lexicographical datasets for under-resourced Bantu languages spoken in the southern regions of the African continent. These datasets have been made digitally available using well-established standards of the Linguistic Linked Open Data (LLOD) community. To compensate for the lack of freely available reference material, a crowdsourcing web portal for collecting textual data for less-resourced languages was created and incorporated into the CLARIN infrastructure. Through this portal, a community effort significantly increased the number of text resources available for the respective languages. The collected content is used to enrich the lexicographical data with real-world samples, increasing the usability of the entire resource.

Keywords

minority languages, lesser resourced languages, use case, lexical resources, Bantu languages
