Spoken Slovene
from 01/09/2024 until 30/09/2027
Do you want to preserve recordings of your dialect as part of your cultural heritage for future generations? Do you want to contribute to the development of tools such as speech recognition or machine translation for Slovenian and its dialects? Are you interested in how speech differs from place to place? Do you notice peculiarities and want to understand them better? Join us and become part of our efforts as a citizen scientist!
Aim
The project aims to collect and preserve recordings of everyday spoken
Slovenian through citizen participation. Its core objective is to build a
diverse and ethically managed speech corpus that supports linguistic research,
advances speech and language technology development, and safeguards the
long-term preservation of spoken heritage. The project also supports research
on child speech and contributes to improved treatment approaches for children
with speech and language disorders.
By encouraging contributors to record spontaneous conversations or
narratives, the project captures authentic contemporary language that would
otherwise vanish without documentation. All recordings, enriched with metadata,
will be stored in the national CLARIN.SI repository and made available under
open licenses such as Creative Commons. Individual utterances will also be
accessible through specialised linguistic concordancers.
The project fosters public engagement with science by involving
individuals and communities in research, raising awareness of linguistic
diversity, and enabling contributions to technologies such as speech
recognition, speech synthesis, and AI chatbots.
An additional objective is educational: to familiarise the public with
the characteristics of spoken language and to provide guidance on speech corpus
creation, technical and legal requirements for speech data collection,
transcription principles, and the use of appropriate transcription tools.
How to participate
See how we collect speech, and what kind, at spoken-slovenscina.um.si. Once you are familiar with the technical and content requirements, find a person who would like to participate and record them following the instructions on the website. You can record a conversation, interview, narration, set of instructions or explanation. You can then upload the recording to the portal, together with a scanned copy of the speakers' consent and information about the speakers and the circumstances of the recording. Citizen scientists who collect more than 150 minutes of speech will receive practical prizes as a thank-you.
Needed equipment
Participants can use any commonly available recording device to
contribute spoken data to the project. A modern smartphone is typically
sufficient, as most phones allow high-quality audio recording in everyday
environments. Alternatively, contributors may use a laptop or desktop computer
equipped with a built-in or external microphone. Dedicated audio recorders or
other recording devices may also be used, but they are not required.
No specialised equipment is necessary. The platform is designed to be
accessible to a broad public, allowing contributions with tools that
participants already have at home. Contributors are encouraged to choose a
quiet environment and to ensure that the microphone is well positioned and
unobstructed to support good audio quality.
About funding
Funding bodies: Slovenian Research and Innovation Agency (ARIS)
Funding program: LLM4DH – Large Language Models for Digital Humanities (2024–2027)
Coordinator
Faculty of Electrical Enginee…
Academic