Researchers from across Africa are collaborating on an open source AI project to develop machine translation for African languages – facilitating communication, increasing accessibility and opening doors for the world’s youngest continent to play a stronger role in shaping the digital world.
Masakhane means “We Build Together” in isiZulu, one of 2,140 languages spoken across the African continent. It’s also the name of a cross-continent open source AI project to develop the neural machine translation systems needed to put African languages on the technological map and connect Africa’s diverse and numerous linguistic populations. Their philosophy? The 4th industrial revolution in Africa cannot take place in English. And yet, currently, many of the digital tools and services that are booming around the world are available primarily in English or other major Western languages. While Africa is home to millions of English, French and Portuguese speakers, there are thousands of other languages spoken around the continent that are excluded from the digital world and the opportunities and information that it brings.
Salomon Kabongo, who joined the Masakhane project in 2019 as a representative of the Tshiluba language (spoken across Central Africa and the DRC), says that a lot of Congolese people in his country don’t speak French or English, but instead the Congolese national languages Lingala, Tshiluba, Kikongo and Swahili. The phones they use include advanced technology like Siri, Google Talk and Alexa – but the speech recognition hasn’t been programmed to work for their native languages. The same is true when it comes to information on the internet. While Wikipedia prides itself on being a depository of open information, certain languages are dramatically underserved. For example, there are just 9.6 million Swedish speakers in the world, and over 3 million articles are written in Swedish on Wikipedia. The Oromo language, by contrast, has 34 million speakers in Ethiopia, but its Wikipedia contains only 786 articles. And Google Translate, the most popular automatic translation system available, currently translates 103 of the world’s 7000 languages – but only 13 are African ones. Salomon’s dream is to make these kinds of technologies and digital resources available in Congolese native languages, opening up a world of possibilities to those people facing linguistic exclusion. That’s where Masakhane comes in.
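To put the Wikipedia gap in comparable terms, the figures quoted above can be turned into articles per million speakers. The following is a minimal sketch in Python; the function name articles_per_million is purely illustrative, and the Swedish article count is treated as exactly 3 million even though the text says "over 3 million", so the real gap is at least this large.

```python
def articles_per_million(articles: int, speakers: int) -> float:
    """Return the number of Wikipedia articles per one million speakers."""
    if speakers <= 0:
        raise ValueError("speakers must be a positive number")
    return articles * 1_000_000 / speakers


# Figures as quoted in the article (the Swedish count is a lower bound).
swedish = articles_per_million(articles=3_000_000, speakers=9_600_000)
oromo = articles_per_million(articles=786, speakers=34_000_000)

print(f"Swedish: {swedish:,.0f} articles per million speakers")
print(f"Oromo:   {oromo:,.1f} articles per million speakers")
print(f"Swedish has about {swedish / oromo:,.0f} times the coverage per speaker")
```

On these figures, Swedish comes out at roughly 312,500 articles per million speakers and Oromo at roughly 23, a difference of more than ten thousand times.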
Seeking out language data across the African continent
The Masakhane project aims to build community and strengthen natural language processing (NLP) in native African languages. NLP is a field of artificial intelligence, where systems and computational algorithms are built that can automatically understand, analyse, manipulate, and potentially generate human language. Machine translation (MT) is just one example of an NLP-based system, while other applications include speech recognition, auto-prediction and correction and sentiment analysis, to name just a few.
Just like most machine-learning models, effective machine translation needs to be fed huge amounts of “training data” in order to produce decent results. One of the main challenges when it comes to African languages is that they are “low-resourced”, meaning that this essential language data is lacking, scattered, or not publicly available.
In the neural machine translation world, the documents that serve to create the datasets required are known as corpora. Parallel text corpora – large sets of texts that are equivalent, sentence-by-sentence, in multiple languages – are a huge benefit when it comes to training machine translation models. While parallel corpora are lacking for African languages, there is no shortage for major Western ones. Part of the reason is that the European Union’s policies and documents provide high-quality, human-translated “parallel corpora” in a huge variety of EU languages. The availability of such documents has real-life implications for the availability of information on the internet.
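To make the idea of a sentence-aligned parallel corpus concrete, here is a minimal sketch in Python. It is not Masakhane's actual tooling: the SentencePair class, the align function and the tiny English–French sample are all hypothetical and only illustrate the structure that translation models are trained on.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SentencePair:
    """One aligned sentence in the source and target language."""
    source: str
    target: str


def align(source_lines: List[str], target_lines: List[str]) -> List[SentencePair]:
    """Pair up two line-aligned texts, skipping pairs with an empty side."""
    if len(source_lines) != len(target_lines):
        raise ValueError("parallel texts must have the same number of lines")
    pairs = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # keep only pairs where both sides have text
            pairs.append(SentencePair(src, tgt))
    return pairs


# Hypothetical toy data: line i of one text translates line i of the other.
english = ["Good morning.", "Thank you.", ""]
french = ["Bonjour.", "Merci.", ""]

pairs = align(english, french)
for pair in pairs:
    print(f"{pair.source} -> {pair.target}")
```

A real corpus would hold many thousands of such pairs, and for a low-resourced language the hard part is finding the texts in the first place rather than pairing them up.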
To start to address those problems, the Masakhane team’s 103 (and counting) members are gathering their own corpora, working together with groups like Translators Without Borders to source publicly available datasets such as governmental documents, religious texts, literature, and news. They’re then using that data to develop machine translation models from English to their African mother tongues. All of the datasets and translation models that they create are open source, and anyone can use them or contribute to the project. This will allow anyone, regardless of the resources they have at hand, to build digital tools for Africa. In the words of founder Jade Abbott, “This research enables anyone from the smallest African startup to NGOs to large corporates to researchers in and outside of the continent to benefit from the datasets discovered, and the expertise being built.”
So far the Masakhane researchers have developed baseline models for 16 African languages and published them on the software development platform GitHub. They plan to publish 3 in-progress papers at the eighth ICLR in Addis Ababa, Ethiopia in April 2020. ICLR, the International Conference on Learning Representations, gathers professionals who work in the branch of artificial intelligence called representation learning, an aspect of machine learning.
Machine translation as a lever for more inclusion
African language translations are important for the local population for a number of reasons. Firstly, when crises happen in areas where a low-resource language is spoken, relief services face a language barrier when trying to provide aid. Machine translation tools could literally save lives. Secondly, people are shown to learn more effectively when educated in their native tongue. Thirdly, 63% of Sub-Saharan Africans cannot access global markets due to language barriers.
Being able to automatically translate, and thus include, African languages in more digital services and tools, will also open up possibilities for AI use cases in Africa and allow Africans to engage further in the digital economy.
Currently, AI is principally developed and researched outside of the African continent. This not only puts AI products and services at risk of biases and discrimination, but also limits the depth to which AI can advance to improve African lives. “Algorithms define the future and people forget that algorithms are not just technical, they are political and cultural,” says Tom Ilube, founder of the African Science Academy for girls in Ghana. Masakhane’s publications will open up more opportunities for the development of technology by Africans, in their own languages – African solutions for African challenges, rather than interventions that come from outside the local context.
Is the future of AI in Africa?
Masakhane is far from the only project working on translation solutions for African languages. In 2019, Mozilla and the GIZ initiated a collaboration with African startups to develop Mozilla’s “Common Voice” and “Deep Speech” projects, which will provide voice-enabled products and services in African languages. And in November 2019 the Artificial Intelligence for Development programme (AI4D) launched the African Language Dataset Challenge, in collaboration with the data science challenge website Zindi, as part of another bid to bridge the gap between those languages with plenty of data available on the Internet and those without.
The world’s largest tech giant has launched AI and machine learning projects in Africa too. In 2018, Google opened an AI research lab in Accra, Ghana and last year, Google’s African Launchpad Accelerator chose AI as the focus for its fourth cohort of startups. Google’s open source platform for machine learning, TensorFlow, provides code that can be put to use for a wide range of purposes and has already been used by Africans to create apps and digital services to address local issues across the continent. It’s been used in applications such as PlantVillage Nuru that African farmers use to diagnose crop diseases and improve their agricultural outputs. According to GitHub’s 2019 annual “Octoverse” report, African nations are already leading the way when it comes to growth of participation in open source projects around the world, with growth highest in Nigeria, Kenya, Tunisia, and Morocco. Across Africa, contributions are up 40%, more than on any other continent.
Moustapha Cisse, who leads Google’s AI research lab in Accra, points out that Africa has a strong advantage in AI due to its vast human resources: it is home to the youngest and fastest growing population in the world (Africa’s median age is 19, whereas in Europe it is 43). However, he and other African tech leaders agree that a pan-African strategy and financial investment are necessary next steps.
The Masakhane project is still in its early days, and while the team already includes researchers from across the continent, they’re still on the lookout for more. You can find out how to join them by visiting their website.