Spoken Slovene

Active

from 01/09/2024 until 30/09/2027

Do you want to preserve recordings of your dialect as part of your cultural heritage for future generations? Do you want to contribute to the development of tools such as speech recognition or machine translation for Slovenian and its dialects as well? Are you interested in how the speech of people from different places differs, and do you notice peculiarities that you want to understand better? Join us and become part of our efforts as a citizen scientist!

Aim

The project aims to collect and preserve recordings of everyday spoken Slovenian through citizen participation. Its core objective is to build a diverse and ethically managed speech corpus that supports linguistic research, advances speech and language technology development, and safeguards the long-term preservation of spoken heritage. The project also supports research on child speech and contributes to improved treatment approaches for children with speech and language disorders.

By encouraging contributors to record spontaneous conversations or narratives, the project captures authentic contemporary language that would otherwise vanish without documentation. All recordings, enriched with metadata, will be stored in the national CLARIN.SI repository and made available under open licenses such as Creative Commons. Individual utterances will also be accessible through specialised linguistic concordancers.

The project fosters public engagement with science by involving individuals and communities in research, raising awareness of linguistic diversity, and enabling contributions to technologies such as speech recognition, speech synthesis, and AI chatbots.

An additional objective is educational: to familiarise the public with the characteristics of spoken language and to provide guidance on speech corpus creation, technical and legal requirements for speech data collection, transcription principles, and the use of appropriate transcription tools.

How to participate

See how we collect speech, and what kind, at spoken-slovenscina.um.si. Once you are familiar with the technical and content requirements, find a person who would like to participate and record them following the instructions on the website. You can record a conversation, interview, narration, set of instructions, or explanation. Then upload the recording to the portal together with the scanned consent of the speakers and information about the speakers and the circumstances of the recording. Citizen scientists who collect more than 150 minutes of speech will receive practical prizes as a thank-you.

Needed equipment

Participants can use any commonly available recording device to contribute spoken data to the project. A modern smartphone is typically sufficient, as most phones allow high-quality audio recording in everyday environments. Alternatively, contributors may use a laptop or desktop computer equipped with a built-in or external microphone. Dedicated audio recorders or other recording devices may also be used, but they are not required.

No specialised equipment is necessary. The platform is designed to be accessible to a broad public, allowing contributions with tools that participants already have at home. Contributors are encouraged to choose a quiet environment and to make sure the microphone is unobstructed and positioned to pick up the speakers clearly.

About funding

Funding bodies: Slovenian Research and Innovation Agency (ARIS)

Funding program: LLM4DH – Large Language Models for Digital Humanities (2024–2027)

Coordinator
Created Dec. 10, 2025, 12:24 p.m.
Updated Dec. 10, 2025, 12:35 p.m.