Category Archives: Speech Recognition
Welsh Language Communications Infrastructure
The aim of the Welsh Language Communications Infrastructure project is to lay the foundations for a range of Welsh language communication technologies, including transcription, voice control, and speech-to-speech translation. The goal is to stimulate the development of Welsh language software packs and services, and to bring Welsh into the mainstream of communication in the “internet of things”, question-and-answer software, and multilingual environments.
The project will lay the groundwork for speaking Welsh to your television set, asking your smartphone questions in Welsh, and receiving replies in spoken Welsh.
The project has been sponsored by the Welsh Government, through its Welsh Language Technology and Digital Media Fund, and S4C.
Outputs
Report: “Towards a Welsh Language Intelligent Personal Assistant”
Prompts in the Paldaruo app
The prompts within the Paldaruo app were developed to contain the most common sounds of the language. The recordings of the prompts below have been used to develop acoustic models for a Welsh speech recognition system.
sample1: lleuad, melyn, aelodau, siarad, ffordd, ymlaen, cefnogaeth, Helen
sample2: gwraig, oren, diwrnod, gwaith, mewn, eisteddfod, disgownt, iddo
sample3: oherwydd, Elliw, awdurdod, blynyddoedd, gwlad, tywysog, llyw, uwch
sample4: rhybuddio, Elen, uwchraddio, hwnnw, beic, Cymru, rhoi, aelod
sample5: rhai, steroid, cefnogaeth, felen, cau, garej, angau, ymhlith
sample6: gwneud, iawn, un, dweud, llais, wedi, gyda, llyn
sample7: lliw, yng Nghymru, gwneud, rownd, ychydig, wy, yn, llaes
sample8: hyn, newyddion, ar, roedd, pan, llun, melin, sychu
sample9: ychydig, glin, wrth, Huw, at, nhw, bod, bydd
sample10: yn un, er mwyn, neu ddysgu, hyd yn oed, tan, ond fe aeth, ati
sample11: y gymdeithas, yno yn fuan, mawr, ganrif, amser, dechrau, cyfarfod
sample12: prif, rhaid bod, rheini, Sadwrn, sy'n cofio, cyntaf, rhaid cael
sample13: dros y ffordd, gwasanaeth, byddai'r rhestr, hyd, llygaid, Lloegr
sample14: cefn, teulu, enwedig, ond mae, y tu, y pryd, di-hid, peth, hefyd
sample15: morgan, eto, yma, ddefnyddio, bach, yn wir, diwedd, llenyddiaeth
sample16: ym Mryste, natur, ochr, mae hi, newid, dy gymorth, nes, gwahanol
sample17: i ddod, cyngor, athrawon, bychan, neu, digwydd, hud, mynd i weld
sample18: ei gilydd, cyffredin, hunain, lle, cymdeithasol, y lle, unwaith
sample19: i ti, newydd, ysgrifennu, y gwaith, darllen, fyddai, addysg, daeth
sample20: llywodraeth, ond, hynny, esgob, cyrraedd, a bod, gwrs, ceir
sample21: rhaid gweld, chwarae, nad oedd, wedyn, flwyddyn, ond nid, ardal
sample22: buasai, hanes, ddiweddar, wedi cael, o bobl, merched, ffilm, cafodd
sample23: awdur, na, oedd modd, dod, yr hen, gen i, olaf, ddechrau
sample24: dyna, ddigon, i beidio, bynnag, rhan, trwy, am y llyfr, y cyfnod
sample25: athro, anifeiliaid, pob, o fewn, yn gwneud, cartref, elfennau
sample26: er enghraifft, bron, yn fwy, ar gael, sylw, edrych arno, arall
sample27: cyhoeddus, un pryd, clywed, ohonom, ei fod, aros, gwyrdd golau
sample28: yn ei gwen, mai, dod o Gymru, personol, allan, wrth y ffenestr
sample29: ystyr, dda, arbennig, mae'n bwysig, oeddwn, farw, nifer o wyau, maer
sample30: America, ar gyfer, iaith, bellach, genedlaethol, ateb, at y bont
sample31: ar y cefn, ac roedd, nesaf, i gyd, doedd dim, cynnwys, amlwg
sample32: amgylchiadau, gweithwyr, fy mam, ac yn llogi, pethau, unrhyw, drws
sample33: Evans, yn mynd, corff, neb, eglwys, cafwyd, sef, ar ei
sample34: datblygu, ac ati, traddodiad, yn byw, ond hefyd, y dydd, Williams
sample35: dosbarth, yr un, fod yn fawr, ni, yr ysgol, ail ganrif, am, nid
sample36: gofynnodd, gwybod, llawer, rhywbeth, o rywle, chwilio am, oddi ar
sample37: cynllun, cychwyn, diolch, llyfr, yn y blaen, dan, i ddim, cyn
sample38: i'r dde, ddyletswydd, hi, mae'n hwyr, dros, megis, milltir, adeg
sample39: ambell, yr ogof, yna, Lerpwl, ysgolion, parc, dal, plant
sample40: mam, oedd hwn, ifanc, gellir, oesoedd canol, capel, ysgol, mlynedd
sample41: o gwmpas, hon, weithiau, erbyn hyn, stori, i fod, ganddo, yn cael
sample42: Sir Benfro, gweld, gilydd, ond doedd, oes, un o'ch ffrindiau, ystod
sample43: ddim, ond pan, edrych, wrth gwrs, a phan, ystyried, wedi bod
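As a rough illustration of checking how well a set of prompts covers the language, the sketch below counts Welsh letters (treating digraphs such as ll, dd and ch as single units) in the sample1 prompts. This is only an orthographic proxy for sounds, not the project's own method: the helper names are hypothetical, and the greedy matching treats every "ng" as the single letter, which is not always correct in Welsh.

```python
from collections import Counter

# Welsh digraphs that count as single letters; matched before single characters.
DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")

def welsh_letters(word):
    """Split a word into Welsh letters, treating digraphs as one unit."""
    word = word.lower()
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            letters.append(pair)
            i += 2
        elif word[i].isalpha():
            letters.append(word[i])
            i += 1
        else:
            i += 1  # skip spaces, apostrophes and hyphens
    return letters

def letter_counts(prompts):
    """Count how often each Welsh letter occurs across a list of prompts."""
    counts = Counter()
    for prompt in prompts:
        counts.update(welsh_letters(prompt))
    return counts

def missing_digraphs(prompts):
    """List the digraphs that never occur in the given prompts."""
    seen = letter_counts(prompts)
    return sorted(d for d in DIGRAPHS if seen[d] == 0)

sample1 = ["lleuad", "melyn", "aelodau", "siarad",
           "ffordd", "ymlaen", "cefnogaeth", "Helen"]
print(letter_counts(sample1).most_common(5))
print(missing_digraphs(sample1))
```

Running the same check over all 43 samples would show which letters are still under-represented; a real coverage check would work on phonemic transcriptions rather than spelling.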
Project GALLU Resources
Paldaruo App
Help us develop Welsh language Speech Recognition.
Contribute your voice through our ‘Paldaruo’ app
To collect the Welsh speech corpus, crowdsourcing will be used to recruit people of all ages and geographical backgrounds to read a script aloud in Welsh.
A specialised app (Paldaruo) has been developed for iOS and Android mobile phones and tablets to collect the data. The app is available in the respective app stores. Here’s a video to give an idea of how to use the app:
Here are some screenshots of the app from the iOS devices:
The Paldaruo app will collect metadata from the users including their sex, age, geographical background and accent.
To see the metadata questions that are in the app, click here.
The app guides users through the process of recording their voice by providing a recording script to them. The sound files are sent back to the project server automatically.
To see a list of the prompts currently in the Paldaruo app, click here.
The GALLU project gained ethical approval from Bangor University in March 2014. Within the Paldaruo app, users must agree to the terms and conditions.
GALLU Project: further speech recognition development
The aim of the GALLU project was to further develop the speech recognition resources available for the Welsh language. The project was funded by a grant from the Welsh Government and S4C. The project built on the foundations laid by the Basic Speech Recognition Project of 2008-9. The following aims were achieved during the project:
- design and develop a collection of prompts that contain all of the phonemes of the Welsh language
- collect, through crowdsourcing, recordings of these prompts being pronounced by a large and varied group of people in order to create a new Welsh speech corpus.
- use elements of the corpus to train open-source speech recognition software (Julius) and HTK to control the movement of a toy robot on a Raspberry Pi.
- prepare the corpus for future development of Welsh dictation systems, including creating a typology of language registers, with appropriate metadata, on a trained corpus tagged with the register characteristics.
- create a plug-in which detects and confirms the default language of the browser in order to Welshify the crowdsourcing pages and other webpages.
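The language-detection idea in the last aim can be sketched on the server side by reading the browser's `Accept-Language` header. This is a minimal sketch under stated assumptions, not the project's plug-in: the function names are hypothetical, only the detection step is covered (not the confirmation with the user), and the real plug-in may well run in the browser instead.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, most preferred first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # the default weight when no q-value is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as least preferred
        # Sort by descending quality, then by original position in the header.
        entries.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(entries)]

def prefers_welsh(header):
    """True if the most preferred language is Welsh (cy, cy-GB, ...)."""
    tags = parse_accept_language(header)
    return bool(tags) and tags[0].split("-")[0] == "cy"

print(prefers_welsh("cy-GB,cy;q=0.9,en;q=0.8"))
print(prefers_welsh("en-GB,en;q=0.9,cy;q=0.5"))
```

A page could use the result to offer the Welsh version first and then ask the user to confirm the choice.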
Participation
Although the project has formally ended, we continue to collect voices through the Paldaruo app for future use. Welsh speakers of any background or proficiency are invited to participate by downloading the app and reading aloud the displayed prompts so that speech recognition software can be trained to understand Welsh.
Language register typology matrix
In the table below, we have defined different registers found within the Welsh language. We show their distribution and related levels of the language including examples of specific terms or texts.
In collecting a corpus, it is useful to be able to recognise different types of text automatically and try to classify examples of one type in a unified way. It is possible to have several different models to classify and recognise registers. The aim of the table below is to propose a pattern to aid this work using a computer, rather than using written guidelines.
The table below is not a closed distribution, and a text usually contains a mixture of different characteristics. It is the frequency with which the different characteristics are used, rather than their mere presence in a specific text, that will help the machine to identify the register under consideration.
* denotes a form that has been found in the style guide of the Welsh Government Translators but does not necessarily correspond to the descriptive typology of the different registers.
** denotes a form that has been found in Cymraeg Clir.
When referring to vocabulary, the term “safonol” here means forms that are identified in the main Welsh dictionaries.
| Archaic | Classical | Formal | Technical | Neutral | Simplified language/ Clear Welsh | Informal | Very informal/ spoken | Regional | Slang |
Verb forms | Yr ydwyf… | Yr wyf… | Rwyf… | Rwy… | Rwy… | Rwy… | Dw i… [*Rydw i…] | Dw i…/Wi…/I fi… | Dw i…/Wi…/I fi… | Fi… |
Style dependent | √ | √ | √ | X | X | X | X | X | X | X |
Use of the impersonal | √ | √ | √ | √ | √ | X | More common in the past than the present | X | X | X |
Periphrastic and compact | Compact | Compact | Mainly compact | Mainly compact | Mixed. Uses ‘caiff’ to overcome | Periphrastic except for some very familiar ones | Mixed | Mixed, with the periphrastic much more common | Periphrastic in the north, informal compact | Various non-standard forms in |
3rd person plural ending | –nt hwy | –nt hwy | –nt hwy | –nt hwy | –nt hwy/-n nhw | -n nhw | -n nhw | -n nhw | -n nhw | -n nhw |
Preverbal particles | X | X | Occasional | X | X | √ | √ | √ | √ | √ |
Personal pronouns | Chwi, chwychwi | Chwi/chi [*chi] | chi | Use of personal forms is rare | chi | chi | chi | Chi/ti | Chi/ti/chdi/fe | Chi/ti/chdi/fe |
Negation | Nid ydwyf… | Nid wyf… | Nid wyf… | Nid wyf/ Dw i ddim… | Nid wyf/ Dw i ddim… | Dw i ddim… | Dw i ddim… [*Dydw i ddim…] | Dw i ddim… | Dw i’m/Sai’n…/Sana i…/Nagw i… | Fi ddim… |
Long sentences, multiple clauses | √ | √ | √ | X | X | X [**No more than 25 words in a sentence] | X | X | X | X |
Vocabulary | May include archaic words/ | May include archaic words but | Standard (safonol) contemporary vocabulary | Domain-specific technical terms | Standard (safonol) contemporary vocabulary | Simplified vocabulary | Simple standard (safonol) | Simple with elements | Prominent dialect markers: South: taw, ma’s/mâs, moyn, ffaelu; North: efo/hefo, lan, rŵan, ddaru | May include indecent words, |
Abbreviations | X | X | X | X | X | X | √ | √ | √ | √ |
Intrusive vowels | X | X | X | X | X | X | √ | √ | √ | √ |
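The frequency-based approach described before the table can be sketched as a simple marker count. This is a minimal illustration under stated assumptions, not the project's classifier: the marker list is a small subset taken from the "Verb forms" and "Negation" rows, each marker is assigned to a single register even though the table shows forms shared between registers, and the per-100-words scoring and function names are hypothetical.

```python
import re

# Illustrative subset of markers; each is assigned to one register only,
# which simplifies the overlapping distribution shown in the table.
MARKERS = {
    "archaic": ["yr ydwyf", "nid ydwyf"],
    "classical": ["yr wyf"],
    "formal": ["rwyf"],
    "informal": ["dw i"],
    "slang": ["fi ddim"],
}

def register_profile(text):
    """Return marker occurrences per 100 words for each register."""
    text = text.lower()
    n_words = max(len(text.split()), 1)
    profile = {}
    for register, forms in MARKERS.items():
        hits = sum(
            len(re.findall(r"\b" + re.escape(form) + r"\b", text))
            for form in forms
        )
        profile[register] = 100.0 * hits / n_words
    return profile

def likely_register(text):
    """Return the register with the highest marker frequency, or None."""
    profile = register_profile(text)
    best = max(profile, key=profile.get)
    return best if profile[best] > 0 else None

print(likely_register("Dw i ddim yn gwybod. Dw i wedi blino."))
```

Because the score is a frequency rather than a yes/no flag, a text that mixes registers yields a profile across several registers rather than a single hard label.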
Analogy of text and register
Note: creative literature, e.g. novels, can contain a number of different registers in order to convey different effects and is therefore not accommodated below.
Archaic | Classical | Formal | Technical | Neutral | Simplified language/ Cymraeg Clir | Informal | Very informal/ spoken | Regional | Slang |
Extracts of old speech etc. religious texts |
X |
|||||||||
Legislation, international contracts |
X |
|||||||||
Committee reports, public administration, classical journalism |
X |
|||||||||
Technical documents/ research papers |
X |
|||||||||
Children school essays, students, press statements |
X |
X |
||||||||
Forms, handouts, corporate websites, public campaigns |
X |
X |
||||||||
Forms, handouts, websites, etc prescriptive language |
X |
|||||||||
Popular journalism |
X |
X |
||||||||
Private letters |
X |
|||||||||
Transcriptions of spoken language, scripts which are written to be spoken |
X |
|||||||||
Corporate blogs |
X |
X |
||||||||
Private blogs |
X |
X |
X |
|||||||
Facebook and similar social media sites |
X |
X |
X |
|||||||
X |
X |
X |
GALLU Welsh Speech Corpus
One of the main aims of the GALLU project is to collect a Welsh speech corpus through crowdsourcing in order to develop an LVCSR (large vocabulary continuous speech recognition) system for Welsh.
The corpus will consist of recordings of a set of sentences that together contain all of the sounds of the language, and will be used to train an HTK acoustic model to recognise phonemes. A grammar will then be used to translate the recognised phonemes into full words. The model and speech recognition system will be open source (within Julius).
By the end of the project (the end of August 2014) the acoustic models and Julius will be able to control the movements of a robotic arm through Welsh speech commands for the Raspberry Pi.
Because the outputs are going to be open source, the speech recognition system, the corpus used to train the acoustic models, and the code that makes the software respond to the Welsh speech commands will be openly available by the end of the project.
The outputs can be incorporated into coding projects and classes for children in Wales.
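As an example of the kind of classroom coding exercise this enables, the sketch below maps a recognised Welsh utterance to a robot-arm action. It is a minimal sketch under stated assumptions: the command vocabulary, the part names and the `interpret` function are hypothetical and are not the project's actual prompts or code, and the input is assumed to be plain text already produced by the recogniser.

```python
# Hypothetical command vocabulary: Welsh phrase -> (arm part, action).
COMMANDS = {
    "i fyny": ("arm", "up"),
    "i lawr": ("arm", "down"),
    "chwith": ("base", "left"),
    "dde": ("base", "right"),
    "agor": ("gripper", "open"),
    "cau": ("gripper", "close"),
}

def interpret(transcript):
    """Map a recognised utterance to a (part, action) pair, or None if unknown."""
    words = transcript.lower().split()
    for phrase, action in COMMANDS.items():
        target = phrase.split()
        n = len(target)
        # Look for the command phrase as a contiguous run of words.
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            return action
    return None

print(interpret("i fyny"))
print(interpret("helo"))
```

Returning `None` for unrecognised input lets the calling code ignore stray speech instead of moving the arm.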
At the moment there are 20 recordings of contributors saying the sample prompts which have been written to train the robotic arm. Click here to download them.
Metadata Questions: Paldaruo App
1. Ym mha flwyddyn cawsoch chi eich geni? (In what year were you born?)
2. Beth yw’ch rhyw? (What is your sex?)
- Benyw (Female)
- Gwryw (Male)
3. Ym mha ranbarth treuliasoch chi’r rhan fwyaf o’ch plentyndod? (In which area did you spend most of your childhood?)
- De Ddwyrain Cymru (South East Wales)
- De Orllewin Cymru (South West Wales)
- Gogledd Ddwyrain Cymru (North East Wales)
- Gogledd Orllewin Cymru (North West Wales)
- Canolbarth Cymru (Central Wales)
- Gogledd Lloegr (North England)
- Canolbarth Lloegr (Central England)
- De Lloegr (South England)
- Gwlad arall (Another country)
- Nifer o ardaloedd (A number of areas)
4. Enwch eich ysgol uwchradd olaf. (Name your last secondary school.)
Os nad ydych chi wedi mynd i’r ysgol uwchradd, rhowch ‘dim’. (If you haven’t been to a secondary school, put “none”.)
5. Ble rydych chi’n byw ar hyn o bryd? (Where do you live at the moment?)
- De Ddwyrain Cymru (South East Wales)
- De Orllewin Cymru (South West Wales)
- Gogledd Ddwyrain Cymru (North East Wales)
- Gogledd Orllewin Cymru (North West Wales)
- Canolbarth Cymru (Central Wales)
- Gogledd Lloegr (North England)
- Canolbarth Lloegr (Central England)
- De Lloegr (South England)
- Gwlad arall (Another country)
- Nifer o ardaloedd (A number of areas)
6. Fel arfer, pa mor aml ydych chi’n siarad Cymraeg? (Usually, how often do you speak Welsh?)
- Llai nag awr y mis (Less than an hour a month)
- O leiaf awr y mis (Around an hour a month)
- O leiaf awr yr wythnos (Around an hour a week)
- O leiaf awr y dydd (Around an hour a day)
- Tua hanner yr amser (About half the time)
- Rhan fwyaf o’r amser (Most of the time)
- Bron yn ddieithriad (All of the time)
7. Ym mha gyd-destun rydych chi’n siarad Cymraeg? (In which context do you speak Welsh?)
Dewiswch y cyd-destunau ble rydych chi’n siarad Cymraeg unwaith yr wythnos neu fwy. (Pick the contexts where you speak Welsh once a week or more.)
- Ddim yn siarad Cymraeg yn rheolaidd (Don’t speak Welsh regularly)
- Gartref yn unig (Home only)
- Ysgol/coleg/gwaith yn unig (School/college/work only)
- Gyda ffrindiau yn unig (With friends only)
- Gartref + Ysgol/coleg/gwaith (Home + school/college/work)
- Gartref + Ffrindiau (Home + with friends)
- Ysgol/coleg/gwaith + Ffrindiau (School/college/work + with friends)
- Gartref + Ysgol/coleg/gwaith + Ffrindiau (Home + school/college/work + with friends)
- Arall (Other)
8. Ydych chi’n siarad Cymraeg gydag acen iaith gyntaf? (Do you speak Welsh with a first language accent?)
Atebwch ‘Iaith Gyntaf’ os oes gennych chi acen iaith gyntaf, neu ‘Dysgwr’ os oes gennych chi acen dysgwr. (Answer “first language” if you have a first language accent, or “learner” if you have a learner accent.)
- Acen Dysgwr (Learner accent)
- Acen Iaith Gyntaf (First language accent)
9. Acen pa ranbarth sydd gennych chi? (Which area’s accent do you have?)
Dewiswch yr ardal mae’ch acen yn dod ohoni (hyd yn oed os ydych chi’n byw yn rhywle arall). (Choose the area that your accent comes from, even if you live somewhere else.)
- De Ddwyrain (South East)
- De Orllewin (South West)
- Gogledd Ddwyrain (North East)
- Gogledd Orllewin (North West)
- Canolbarth (Central)
- Acen gymysg/Arall (Mixed accent/other)
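One way to hold the answers to these nine questions alongside each recording is a small record type with basic validation. This is a minimal sketch under stated assumptions, not the app's actual data format: the class, field names and the 1900 lower bound on birth year are hypothetical, and only a few of the closed-choice questions are validated here.

```python
from dataclasses import dataclass
from datetime import date

SEXES = {"Female", "Male"}
ACCENT_TYPES = {"Learner accent", "First language accent"}
ACCENT_REGIONS = {"South East", "South West", "North East",
                  "North West", "Central", "Mixed accent/other"}

@dataclass
class SpeakerMetadata:
    birth_year: int          # question 1
    sex: str                 # question 2
    childhood_region: str    # question 3
    secondary_school: str    # question 4 ("none" if not applicable)
    current_region: str      # question 5
    welsh_frequency: str     # question 6
    welsh_contexts: str      # question 7
    accent_type: str         # question 8
    accent_region: str       # question 9

    def problems(self):
        """Return a list of validation problems (empty if none were found)."""
        found = []
        # 1900 is an arbitrary lower bound chosen for this sketch.
        if not 1900 <= self.birth_year <= date.today().year:
            found.append("birth_year out of range")
        if self.sex not in SEXES:
            found.append("unknown sex option")
        if self.accent_type not in ACCENT_TYPES:
            found.append("unknown accent type")
        if self.accent_region not in ACCENT_REGIONS:
            found.append("unknown accent region")
        return found
```

Keeping the answers as fixed option strings makes it straightforward to group recordings later, for example by accent region when training or evaluating acoustic models.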
Basic Speech Recognition for Welsh Project
The Basic Speech Recognition for Welsh Project was a small pilot project funded by the Welsh Language Board. The project developed a speech-controlled calculator to highlight the potential for Welsh-language speech recognition. The resulting software was a laboratory prototype, rather than a product that was ready for the market. The research was incorporated into the GALLU and Seilwaith Cyfathrebu Cymraeg (Welsh Language Communications Infrastructure) projects.