---
Nota Lyd- og tekstdata
Published by the Danish Agency for Digital Government
Contact info@sprogteknologi.dk for questions regarding this data
License: CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/
Language: da
Tags:
- speech
- text
---

# Background:
This data was created by the public institution Nota (https://nota.dk/), which is part of the Danish Ministry of Culture. Nota has a library of audiobooks and audiomagazines for people with reading or sight disabilities. Nota also produces a number of audiobooks and audiomagazines itself.

The dataset consists of .wav and .txt files from Nota's audiomagazines "Inspiration" and "Radio/TV".

The dataset has been published as part of the initiative sprogteknologi.dk, within the Danish Agency for Digital Government (www.digst.dk).

# Data summary
336 GB of available data, containing approximately 500 hours of voice recordings in .wav format with accompanying transcripts in .txt format.

For the .wav files, the sample rate is 44.1 kHz, mono (1 channel).

All files related to one reading of one edition of the magazine "Inspiration" or "Radio/TV" have been segmented into .wav files of 2 to 50 seconds, each with an accompanying transcription in .txt. Each reading has been compressed into a zip file containing all files related to the given publication. There is a total of 455 folders.

The 455 folders have further been distributed into 6 separate folders: 2 folders containing the readings of "Inspiration" and 4 folders containing the readings of "Radio- TV Program". Together with the "Nota-txt_only" folder, the top-level folders are:
- Inspiration 2008 - 2016
- Inspiration 2016 - 2021
- Nota-txt_only
- Radio- TV Program 2007 - 2012
- Radio- TV Program 2013 - 2015
- Radio- TV Program 2016 - 2018
- Radio- TV Program 2019 - 2022

The folder "Nota-txt_only" is a zip-compressed folder containing all available transcripts (98,600 .txt files). Unpacking this can take some time.

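Because unpacking the transcript archive can be slow, it may be convenient to read the transcripts directly from the zip file instead. The following is a minimal sketch using Python's standard `zipfile` module. The function name `iter_transcripts` is hypothetical, and the UTF-8 encoding is an assumption that is not stated anywhere in this dataset description; adjust it if the files turn out to use a different encoding.

```python
import zipfile
from typing import Iterator, Tuple


def iter_transcripts(zip_path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (member name, transcript text) for every .txt member of a zip file.

    The archive is read in place, so nothing is unpacked to disk.
    The default encoding is an assumption, not a documented property of the data.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and anything that is not a transcript.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            with archive.open(info) as handle:
                # Undecodable bytes are replaced rather than raising an error.
                yield info.filename, handle.read().decode(encoding, errors="replace")
```

Calling `iter_transcripts` with the path to the downloaded transcript archive yields one transcript at a time, so the whole collection never has to be held in memory.
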
The names of the .wav and .txt files correspond throughout the whole dataset; only the file extension differs.

Example: INSL20160005_000001.wav is the sound file that aligns with the transcription found in INSL20160005_000001.txt.

The dataset consists of 547 hours of female speech from 4 different voices and 37 hours of male speech from 2 different voices.

# Personal information:
The dataset is made public and free to use. Recorded individuals have, by written contract, accepted and agreed to the publication of their recordings.

Other names appearing in the dataset belong to individuals who are already publicly known (e.g. TV or radio host names). Their names are not to be treated as sensitive or personal data in the context of this dataset.

# Disclaimer:
There might be smaller discrepancies between the .wav and .txt files. Therefore, there might be issues in the alignment of timestamps, text and sound files.

There are no strict rules as to how readers read aloud non-letter characters (e.g. numbers, €, $, !, ?). These symbols can be read differently throughout the dataset.

# Contact:
Contact info@sprogteknologi.dk if you have questions regarding use of the data. We gladly receive input and ideas on how to distribute the data.
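
The file naming convention described in the data summary (a .wav file and a .txt file sharing the same name) can be used to pair recordings with their transcripts after a reading has been unpacked. The sketch below uses only the Python standard library. The function names `iter_pairs` and `wav_info` are hypothetical, and the code assumes the .wav files are plain PCM, which is what the standard `wave` module can read; this is not confirmed by the description above.

```python
import wave
from pathlib import Path
from typing import Iterator, Tuple


def iter_pairs(folder: str) -> Iterator[Tuple[Path, Path]]:
    """Yield (wav path, txt path) for every .wav file that has a matching .txt file."""
    for wav_path in sorted(Path(folder).rglob("*.wav")):
        # Same name, different extension, e.g. INSL20160005_000001.wav / .txt
        txt_path = wav_path.with_suffix(".txt")
        if txt_path.is_file():
            yield wav_path, txt_path


def wav_info(wav_path: Path) -> Tuple[int, int, float]:
    """Return (sample rate in Hz, channel count, duration in seconds) of a PCM .wav file."""
    with wave.open(str(wav_path), "rb") as recording:
        rate = recording.getframerate()
        channels = recording.getnchannels()
        duration = recording.getnframes() / rate
    return rate, channels, duration
```

Given the possible discrepancies mentioned in the disclaimer, a check such as `wav_info` can help flag segments whose sample rate, channel count or duration (expected to be 44.1 kHz, mono, 2 to 50 seconds) deviates from the description.
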