A persistent and largely invisible problem in African artificial intelligence development has a new solution. Most AI systems cannot reliably identify which African language a piece of text is written in, a flaw that prevents those languages from being used to train more capable tools. A new open-source model released last week is designed to fix that foundational gap.
AI research company Pleias and the GSM Association (GSMA) launched CommonLingua on April 28. The language identification (LID) model covers 334 languages, including 61 African languages across eight language families. It is the first release under the GSMA’s “AI Language Models in Africa, by Africa, for Africa” initiative, a coalition focused on closing the language gap in AI development on the continent.
The problem CommonLingua addresses sits at the very start of the AI pipeline. Before a model for Swahili, Yoruba, or Wolof can be built, text in those languages must first be correctly classified by language. Existing identification tools, including fastText, GlotLID, and OpenLID, were built primarily around European and Asian languages and routinely misclassify African-language text as English or French. Even leading frontier AI models lose roughly 30 percentage points in accuracy on African languages compared to major world languages.
CommonLingua achieves 83 percent accuracy and a macro F1 score of 0.79 on the new CommonLID benchmark, outperforming leading language identification models by more than 10 percentage points under comparable conditions while using roughly one three-hundredth of their parameters. The model ships as an 8-megabyte file and can process approximately 20 texts per second on a standard central processing unit (CPU) and up to 3,000 texts per second on a single graphics processing unit (GPU), making it practical for deployment in low-resource settings.
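For readers unfamiliar with the two metrics, the short Python sketch below shows why accuracy and macro F1 can diverge: macro F1 averages the per-language F1 scores with equal weight, so errors on a small language count as much as errors on a large one. The labels and predictions are invented toy data, not CommonLID results, and the helper names `accuracy` and `macro_f1` are illustrative only.

```python
# Toy illustration of accuracy vs. macro F1. The data below is made up
# and has no connection to the CommonLID benchmark.

def accuracy(y_true, y_pred):
    """Fraction of texts whose predicted language matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-language F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Eight English texts and two Swahili texts; one Swahili text is mislabeled.
y_true = ["eng"] * 8 + ["swa"] * 2
y_pred = ["eng"] * 8 + ["eng", "swa"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case a single mistake on the smaller language pulls macro F1 well below accuracy, which is why the metric is commonly reported for benchmarks with many low-resource languages.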
The 61 African languages covered span Bantu, Niger-Congo and West African, Afro-Asiatic and Semitic, Cushitic and Chadic, Berber, Nilo-Saharan, and pidgin and creole families. The model operates on raw text byte sequences rather than language-specific tokenizers, allowing it to handle multiple scripts including Latin, Arabic, Ethiopic, N’Ko, and Tifinagh consistently.
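The idea of working on raw bytes can be illustrated with a few lines of Python. This is a minimal sketch of byte-level input encoding in general, not CommonLingua's actual preprocessing: the function name `to_byte_ids` and the sample strings are assumptions chosen for illustration. It shows that text in any script reduces to the same vocabulary of 256 byte values, so no per-language tokenizer is required.

```python
# Hypothetical sketch of byte-level input encoding; not taken from
# the CommonLingua codebase.

def to_byte_ids(text: str) -> list[int]:
    """Encode text as a sequence of UTF-8 byte values in the range 0-255."""
    return list(text.encode("utf-8"))

# Short sample strings in several of the scripts named in the article.
samples = {
    "Latin": "Habari ya asubuhi",
    "Arabic": "صباح الخير",
    "Ethiopic": "እንደምን አደርክ",
    "N'Ko": "ߒߞߏ",
    "Tifinagh": "ⵜⴰⵎⴰⵣⵉⵖⵜ",
}

for script, text in samples.items():
    ids = to_byte_ids(text)
    # Every script maps into the same fixed 256-symbol vocabulary.
    print(f"{script}: {len(text)} characters -> {len(ids)} bytes, starts {ids[:4]}")
```

The trade-off is sequence length: characters outside the ASCII range take two or three bytes each in UTF-8, so non-Latin text produces longer inputs than its character count suggests.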
Pierre-Carl Langlais, Co-founder and Chief Technology Officer of Pleias, described language identification as a prerequisite for everything that follows in African AI development. Louis Powell, Director of AI Initiatives at GSMA, said the release addresses a foundational infrastructure gap that has held back progress for years, and that shared tools of this kind are essential to building AI systems that reflect Africa’s linguistic reality at scale.
The model is trained exclusively on open-licensed and public-domain data, and all datasets are released under permissive licences. The GSMA and partners plan to continue the conversation at MWC26 Kigali in June.


