The Somali Corpus, also known as Kaydka Af Soomaaliga (KAF), is a digital collection of texts in Somali, a language spoken in Greater Somalia, Ethiopia, and Kenya. It was developed by Jama Musse Jama in 2016[1][2] as part of his doctoral dissertation[3] and initially comprised 3 million words of Somali literature and language. The corpus currently contains over 7 million words, drawn mainly from literature, poetry, songs, news, essays, and political speeches,[4] making it one of the most extensive corpora among African languages in its range of text types and an important addition to the online materials available for under-resourced languages.[5][6][7][8] The words in the corpus are tagged for part of speech. The corpus can be used to derive frequency lists of Somali words,[9] and it also serves as the basis for an online Somali spell checker.[10]
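As a rough illustration of how a frequency list can be derived from corpus text, the following Python sketch counts word forms in a string. It does not reflect the corpus's own tooling or interface: the function name `frequency_list`, the tokenization rule, and the short sample sentence are all assumptions made for this example, and the sample is not taken from the corpus.

```python
import re
from collections import Counter


def frequency_list(text, top_n=None):
    """Return (word, count) pairs, most frequent first.

    Hypothetical helper for illustration only. Tokenization is a
    simplification: runs of letters, optionally joined by a
    word-internal apostrophe, compared case-insensitively.
    """
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(tokens).most_common(top_n)


# Toy sample text (an assumption, not drawn from the Somali Corpus).
sample = "Af Soomaali waa af. Af iyo dhaqan."

for word, count in frequency_list(sample, top_n=3):
    print(word, count)
```

A real frequency list would be computed over the full tokenized corpus and could be grouped by part-of-speech tag rather than by raw word form.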
Bendjaballah, Sabrina. 2024. Somali particle clusters: Complete paradigms, syncretism and corpus frequency. Brill’s Journal of Afroasiatic Languages and Linguistics. Brill 16(1). 102–136. https://doi.org/10.1163/18776930-01601003.
Mohammed, Siraj. 2020. Using machine learning to build POS tagger for under-resourced language: the case of Somali. International Journal of Information Technology 12(3). 717–729. https://doi.org/10.1007/s41870-020-00480-2.
Hashi, Awil. 2014. Developing a Model Corpus for Endangered Languages. Graduate Studies. University of Calgary. Doctoral thesis. https://doi.org/10.11575/PRISM/25614.
Nimaan, Abdillahi. 2014. Building and Evaluating Somali Language Corpora. In Jeff Good, Julia Hirschberg & Owen Rambow (eds.), Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 73–76. Baltimore, Maryland, USA: Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-2210.