GALLU Welsh Speech Corpus

One of the main aims of the GALLU project is to collect a Welsh speech corpus through crowdsourcing, in order to develop an LVCSR (large vocabulary continuous speech recognition) system for Welsh.

The corpus will consist of recordings of a set of sentences that together contain all of the sounds of the language. These recordings will be used to train an HTK acoustic model to recognise phonemes. A grammar will then be used to translate the recognised phonemes into full words. The model and the speech recognition system (built within Julius) will be open source.
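As a rough illustration of what "containing all of the sounds of the language" means in practice, the sketch below checks which phonemes a set of prompts fails to cover. It is only a minimal sketch: the lexicon entries, the phoneme symbols, the inventory and the function names are all hypothetical placeholders, not taken from the GALLU lexicon or from a real Welsh phone set.

```python
# Toy pronunciation lexicon: word -> list of phonemes.
# ASSUMPTION: these entries and symbols are illustrative placeholders only.
TOY_LEXICON = {
    "agor": ["a", "g", "o", "r"],
    "cau": ["k", "ai"],
    "golau": ["g", "o", "l", "ai"],
}


def phonemes_in(sentence, lexicon):
    """Return the set of phonemes used by the words of one sentence."""
    found = set()
    for word in sentence.lower().split():
        # Raises KeyError if a word has no entry in the lexicon.
        found.update(lexicon[word])
    return found


def missing_phonemes(prompts, lexicon, inventory):
    """Return the phonemes in `inventory` that no prompt covers."""
    covered = set()
    for sentence in prompts:
        covered |= phonemes_in(sentence, lexicon)
    return set(inventory) - covered


# Hypothetical phoneme inventory for the toy lexicon above.
inventory = {"a", "g", "o", "r", "k", "ai", "l"}
prompts = ["agor", "cau"]
print(sorted(missing_phonemes(prompts, TOY_LEXICON, inventory)))  # prints ['l']
```

Adding a prompt that contains the missing sound (here, "golau") empties the result, which is the condition a prompt set would need to meet before recording starts.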

By the end of the project (the end of August 2014), the acoustic models and Julius will make it possible to control the movements of a robotic arm for the Raspberry Pi through Welsh speech commands.
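To give a feel for the last step, turning a recognised phrase into an arm movement, here is a minimal sketch. It assumes the recogniser has already produced the phrase as plain text. The command phrases, joint names and the `parse_command` helper are all hypothetical; the project's actual prompts and arm interface may differ.

```python
# Hypothetical command table: recognised Welsh phrase -> (joint, direction).
# ASSUMPTION: neither the phrases nor the joint names come from the GALLU project.
ARM_COMMANDS = {
    "i fyny": ("shoulder", +1),   # up
    "i lawr": ("shoulder", -1),   # down
    "chwith": ("base", -1),       # left
    "de": ("base", +1),           # right
    "agor": ("gripper", +1),      # open
    "cau": ("gripper", -1),       # close
}


def parse_command(recognised_text):
    """Return (joint, direction) for a recognised phrase, or None if unknown."""
    # Normalise case and collapse repeated whitespace before the lookup.
    phrase = " ".join(recognised_text.lower().split())
    return ARM_COMMANDS.get(phrase)


print(parse_command("I  Fyny"))  # prints ('shoulder', 1)
print(parse_command("helo"))     # prints None
```

Returning `None` for unknown phrases lets the calling code ignore misrecognitions instead of moving the arm unexpectedly.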

Because the outputs will be open source, the speech recognition system, the corpus used to train the acoustic models, and the code that makes the software respond to the Welsh speech commands will all be openly available by the end of the project.

The outputs can be incorporated into coding projects and classes for children in Wales.

At the moment there are 20 recordings of contributors saying the sample prompts, which have been written to train the robotic arm. Click here to download them.