Collecting texts for the Slovenian Big Language Model (povejmo.si)
from 01/01/2024 until 01/01/2026
Under the national programme Adaptive Natural Language Processing with Large Language Models (PoVeJMo), we are developing a large language model for Slovene. We have estimated that we need 40 billion words of text. To this end, we are organising a national collection campaign of written and spoken texts in Slovene.
We have already asked institutions such as the National and University Library and media houses for texts. We also invite individuals to contribute texts. We collect general texts, such as notes, emails, requests, blog entries and social media posts, as well as specialised texts, such as articles, reports and essays. It doesn't matter whether the texts are standard or non-standard, proofread or unproofread: we accept them all. All that matters is that the contributors hold the copyright for the submitted texts.
The project is part of Slovenia's efforts to strengthen the development of language technologies and to ensure that Slovene remains relevant in the digital age. With the Big Language Model, we want to enable the development of safe, high-quality and openly accessible artificial intelligence in Slovene. The model will be useful for researchers, companies, developers and the general public.
Aim
The idea behind the Povejmo project is to develop a Slovenian generative large language model. Slovenian has a community of two million speakers and is considered a severely under-resourced language. Based on scaling laws, we estimated that training an LLM efficiently requires 40 billion words in Slovenian.
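As a rough illustration of how a word target of this kind relates to model size, the sketch below applies a commonly cited compute-optimal heuristic of about 20 training tokens per model parameter. Both that ratio and the tokens-per-word figure are assumptions made for this example; the project's own scaling-law calculation is not described here, so the resulting model size is illustrative only.

```python
# Back-of-the-envelope sketch. Every constant below is an assumption made
# for illustration, not a figure published by the PoVeJMo project.

TOKENS_PER_PARAMETER = 20.0  # assumed: commonly cited compute-optimal heuristic
TOKENS_PER_WORD = 1.5        # assumed: hypothetical subword tokenizer ratio for Slovene


def words_to_tokens(words: float, tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a word count into an approximate token count."""
    return words * tokens_per_word


def compute_optimal_parameters(
    tokens: float, tokens_per_parameter: float = TOKENS_PER_PARAMETER
) -> float:
    """Return the model size for which the token budget is roughly compute-optimal."""
    return tokens / tokens_per_parameter


target_words = 40e9  # the 40 billion words stated in the text
tokens = words_to_tokens(target_words)
params = compute_optimal_parameters(tokens)
print(f"{tokens / 1e9:.0f}B tokens, about {params / 1e9:.1f}B parameters")
```

Under these assumed ratios, 40 billion words correspond to a model in the low billions of parameters; a different tokenizer or scaling rule would shift that estimate.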
We are using a citizen science approach to address the scarcity of language data available for training a large language model. We have launched a national text collection campaign, inviting institutions such as libraries, universities, media outlets, and individuals to contribute their texts. Throughout this process, we ensure transparency and a clear methodology to protect providers' data.
Unlike existing LLMs, which are provided by foreign for-profit corporations and struggle with Slovene, our model will be openly accessible to the public, allowing users to customise it to their specific needs. It will be available for applications in medicine, law, museums, and everyday individual use. This project brings an under-resourced language into the digital age and ensures its digital sovereignty.
How to participate
Anyone wishing to participate can submit texts via the online form on the Povejmo.si website.
Needed equipment
Leonardo HPC System Pre-Exascale Supercomputer, Canva, Google Workspace, social media platforms (Instagram, Facebook, LinkedIn, YouTube), Microsoft suite
About funding
Funding bodies: ARIS – Slovenian Research and Innovation Agency; Ministry of Higher Education; European Union – NextGenerationEU
Funding program: European Union – NextGenerationEU: Recovery and Resilience Plan
Coordinator
Centre for Language Resources…
Academic