Collecting texts for the Slovenian Big Language Model (povejmo.si)

Active

from 01/01/2024 until 01/01/2026

Under the national programme Adaptive Natural Language Processing with Large Language Models (PoVeJMo), we are developing a large language model for Slovene. We have estimated that we need 40 billion words of text. To this end, we are organising a national collection campaign of written and spoken texts in Slovene.

We have already asked institutions such as the National and University Library and media houses for texts. We also invite individuals to contribute. We collect both general texts (notes, emails, requests, blog entries, social media posts) and specialised texts (articles, reports, essays, and so on). It doesn't matter whether the texts are standard or non-standard, proofread or not: we accept them all. The only requirement is that contributors hold the copyright for the texts they submit.

The project is part of Slovenia's efforts to strengthen the development of language technologies and to ensure that Slovene remains relevant in the digital age. With this large language model, we want to enable the development of safe, high-quality and openly accessible artificial intelligence in Slovene. The model will be useful for researchers, companies, developers and the general public.

Aim

The idea behind the Povejmo project is to develop a Slovenian generative large language model. Slovenian has a community of two million speakers and is considered a severely under-resourced language. Based on scaling laws, we estimated that training an LLM efficiently requires 40 billion words in Slovenian.

We are using a citizen science approach to address the scarcity of language data available for training a large language model. We have launched a national text collection campaign, inviting institutions such as libraries, universities, media outlets, and individuals to contribute their texts. Throughout this process, we ensure transparency and a clear methodology to protect providers' data.

Unlike existing LLMs, which are provided by foreign for-profit corporations and struggle with Slovene, our model will be openly accessible to the public, allowing users to customise it to their specific needs. It will be available for applications in medicine, law, museums, and everyday individual use. This project brings an under-resourced language into the digital age and ensures its digital sovereignty.

How to participate

Anyone wishing to participate can submit texts via the online form on the Povejmo.si website.

Needed equipment

Leonardo HPC System Pre-Exascale Supercomputer, Canva, Google Workspace, social media platforms (Instagram, Facebook, LinkedIn, YouTube), Microsoft suite

About funding

Funding bodies: ARIS – Slovenian Research and Innovation Agency; Ministry of Higher Education; European Union – NextGenerationEU

Funding program: European Union – NextGenerationEU: Recovery and Resilience Plan

Coordinator

Centre for Language Resources…
Academic


Keywords

Slovene Language, Artificial Intelligence, Large Language Models

Science Topics

Education, Information & Computing sciences

Difficulty Level

Medium

Participation tasks

DIY hacking/making, Data analysis, Data entry

Location

National

Most participant activities take place online in Slovenia, through the web portal for data donation that we have developed, available at povejmo.si. The second most common project location is Ljubljana, the capital of Slovenia, where roundtable discussions and presentations of project a

Contact

E-mail


Created Dec. 9, 2025, 11:29 a.m.

Updated Dec. 9, 2025, 12:06 p.m.
