Inside a radical new project to democratize AI
Last updated: 2022-12-15 Thursday
BLOOM: a truly open Large Language Model
"Inside a radical new project to democratize AI" is a news article by Melissa Heikkilä and can be found at technologyreview.com.
BLOOM (which stands for BigScience Large Open-science Open-access Multilingual Language Model) is designed to be as transparent as possible, with researchers sharing details about the data it was trained on, the challenges in its development, and the way they evaluated its performance.
The model page at https://huggingface.co/bigscience/bloom has the model card and detailed training information. There is also an arXiv paper describing the model.
This model is far more inclusive: it supports many languages that other large language models do not cover:
It can handle 46 of them, including French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indic languages (such as Hindi), and 20 African languages. Just over 30% of its training data was in English. The model also understands 13 programming languages.
The reason BLOOM was able to improve on this situation is that the team rallied volunteers from around the world to build suitable data sets in other languages even if those languages weren’t as well represented online. For example, Hugging Face organized workshops with African AI researchers to try to find data sets such as records from local authorities or universities that could be used to train the model on African languages, says Chris Emezue, a Hugging Face intern and a researcher at Masakhane, an organization working on natural-language processing for African languages.
What do I think about it?
The training data is still biased, but at least this org is very open. I applaud Hugging Face for this work!