Category Archives: Speech Recognition

Welsh Language Communications Infrastructure

The Welsh Language Communications Infrastructure project aims to lay the foundations for a range of Welsh language communication technologies, including transcription, voice control, and speech-to-speech translation. The goal is to stimulate the development of Welsh language software packages and services, and to bring Welsh into the mainstream of communication in the “internet of things”, question-and-answer software, and multilingual environments.

The project will lay the groundwork for speaking Welsh to your television set, asking your smartphone questions in Welsh, and receiving replies in spoken Welsh.

The project has been sponsored by the Welsh Government through their Welsh Language Technology and Digital Media Fund and S4C.

Outputs

Report: “Towards a Welsh Language Intelligent Personal Assistant”

Online Machine Translation API service

Machine Translation Demo

Welsh Language Speech Recognition with Julius

Prompts in the Paldaruo app

The prompts within the Paldaruo app were developed to contain the most common sounds of the language. The recordings of the prompts below have been used to develop acoustic models for a Welsh speech recognition system.

 sample1: lleuad, melyn, aelodau, siarad, ffordd, ymlaen, cefnogaeth, Helen
 sample2: gwraig, oren, diwrnod, gwaith, mewn, eisteddfod, disgownt, iddo
 sample3: oherwydd, Elliw, awdurdod, blynyddoedd, gwlad, tywysog, llyw, uwch
 sample4: rhybuddio, Elen, uwchraddio, hwnnw, beic, Cymru, rhoi, aelod
 sample5: rhai, steroid, cefnogaeth, felen, cau, garej, angau, ymhlith
 sample6: gwneud, iawn, un, dweud, llais, wedi, gyda, llyn
 sample7: lliw, yng Nghymru, gwneud, rownd, ychydig, wy, yn, llaes
 sample8: hyn, newyddion, ar, roedd, pan, llun, melin, sychu
 sample9: ychydig, glin, wrth, Huw, at, nhw, bod, bydd
 sample10: yn un, er mwyn, neu ddysgu, hyd yn oed, tan, ond fe aeth, ati
 sample11: y gymdeithas, yno yn fuan, mawr, ganrif, amser, dechrau, cyfarfod
 sample12: prif, rhaid bod, rheini, Sadwrn, sy'n cofio, cyntaf, rhaid cael
 sample13: dros y ffordd, gwasanaeth, byddai'r rhestr, hyd, llygaid, Lloegr
 sample14: cefn, teulu, enwedig, ond mae, y tu, y pryd, di-hid, peth, hefyd
 sample15: morgan, eto, yma, ddefnyddio, bach, yn wir, diwedd, llenyddiaeth
 sample16: ym Mryste, natur, ochr, mae hi, newid, dy gymorth, nes, gwahanol
 sample17: i ddod, cyngor, athrawon, bychan, neu, digwydd, hud, mynd i weld
 sample18: ei gilydd, cyffredin, hunain, lle, cymdeithasol, y lle, unwaith
 sample19: i ti, newydd, ysgrifennu, y gwaith, darllen, fyddai, addysg, daeth
 sample20: llywodraeth, ond, hynny, esgob, cyrraedd, a bod, gwrs, ceir
 sample21: rhaid gweld, chwarae, nad oedd, wedyn, flwyddyn, ond nid, ardal
 sample22: buasai, hanes, ddiweddar, wedi cael, o bobl, merched, ffilm, cafodd
 sample23: awdur, na, oedd modd, dod, yr hen, gen i, olaf, ddechrau
 sample24: dyna, ddigon, i beidio, bynnag, rhan, trwy, am y llyfr, y cyfnod
 sample25: athro, anifeiliaid, pob, o fewn, yn gwneud, cartref, elfennau
 sample26: er enghraifft, bron, yn fwy, ar gael, sylw, edrych arno, arall
 sample27: cyhoeddus, un pryd, clywed, ohonom, ei fod, aros, gwyrdd golau
 sample28: yn ei gwen, mai, dod o Gymru, personol, allan, wrth y ffenestr
 sample29: ystyr, dda, arbennig, mae'n bwysig, oeddwn, farw, nifer o wyau, maer
 sample30: America, ar gyfer, iaith, bellach, genedlaethol, ateb, at y bont
 sample31: ar y cefn, ac roedd, nesaf, i gyd, doedd dim, cynnwys, amlwg
 sample32: amgylchiadau, gweithwyr, fy mam, ac yn llogi, pethau, unrhyw, drws
 sample33: Evans, yn mynd, corff, neb, eglwys, cafwyd, sef, ar ei
 sample34: datblygu, ac ati, traddodiad, yn byw, ond hefyd, y dydd, Williams
 sample35: dosbarth, yr un, fod yn fawr, ni, yr ysgol, ail ganrif, am, nid
 sample36: gofynnodd, gwybod, llawer, rhywbeth, o rywle, chwilio am, oddi ar
 sample37: cynllun, cychwyn, diolch, llyfr, yn y blaen, dan, i ddim, cyn
 sample38: i'r dde, ddyletswydd, hi, mae'n hwyr, dros, megis, milltir, adeg
 sample39: ambell, yr ogof, yna, Lerpwl, ysgolion, parc, dal, plant
 sample40: mam, oedd hwn, ifanc, gellir, oesoedd canol, capel, ysgol, mlynedd
 sample41: o gwmpas, hon, weithiau, erbyn hyn, stori, i fod, ganddo, yn cael
 sample42: Sir Benfro, gweld, gilydd, ond doedd, oes, un o'ch ffrindiau, ystod
 sample43: ddim, ond pan, edrych, wrth gwrs, a phan, ystyried, wedi bod
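
As a rough illustration of how sound coverage in a prompt set could be checked, the sketch below counts the letters of the Welsh alphabet in the words of sample1, treating the eight digraphs as single letters. It works on spelling rather than on true phonemes, and the greedy digraph matching is a simplifying assumption (for example, it cannot tell the letter "ng" from "n" followed by "g"). The function names are hypothetical and are not taken from the project's own tooling.

```python
from collections import Counter

# The eight digraphs of the Welsh alphabet; each counts as a single letter.
DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")


def graphemes(word):
    """Split a word into Welsh letters, matching digraphs greedily."""
    word = word.lower()
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            # Skip apostrophes, hyphens and other non-letters.
            if word[i].isalpha():
                units.append(word[i])
            i += 1
    return units


def coverage(prompts):
    """Count how often each letter occurs across a list of prompts."""
    counts = Counter()
    for prompt in prompts:
        for word in prompt.split():
            counts.update(graphemes(word))
    return counts


sample1 = ["lleuad", "melyn", "aelodau", "siarad", "ffordd",
           "ymlaen", "cefnogaeth", "Helen"]
counts = coverage(sample1)
# Digraphs that this sample does not exercise at all.
missing = [d for d in DIGRAPHS if counts[d] == 0]
```

Running the same count over all 43 samples would show which letters are rare or absent across the whole prompt set.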

Paldaruo App

Help us develop Welsh language Speech Recognition.
Contribute your voice through our ‘Paldaruo’ app

iTunes  Google Play

To collect the Welsh speech corpus, crowdsourcing will be used to recruit people of all ages and geographical backgrounds to read a script aloud in Welsh.

A specialised app (Paldaruo) has been developed for iOS and Android mobile phones and tablets to collect the data. The app is available in the respective app stores. Here’s a video to give an idea of how to use the app:

Here are some screenshots of the app from the iOS devices:

Welcome screen in the Paldaruo app for iPad

Welcome screen for the Paldaruo app on the iPhone

The Paldaruo app will collect metadata from the users including their sex, age, geographical background and accent.

To see the metadata questions that are in the app, click here.

Asking about your sex on the iPad

Asking about your location on the iPad

Asking about your childhood on the iPhone

Asking when you speak Welsh on the iPhone

Asking about your accent on the iPhone

The app guides users through the process of recording their voice by presenting them with a recording script. The sound files are sent back to the project server automatically.

To see a list of the prompts that are in the Paldaruo app at the moment, click here.

Recording screen on the iPad

Recording screen on the iPhone

The GALLU project gained ethical approval from Bangor University in March 2014. Within the Paldaruo app, the user will need to agree to the terms and conditions.

Terms and Conditions screen on the iPad

Terms and Conditions screen on the iPhone

GALLU Project: further speech recognition development

The aim of the GALLU project was to further develop the speech recognition resources available for the Welsh language. The project was funded by a grant from the Welsh Government and S4C. The project built on the foundations laid by the Basic Speech Recognition Project of 2008-9. The following aims were achieved during the project:

  • design and develop a collection of prompts that contain all of the phonemes of the Welsh language
  • collect, through crowdsourcing, recordings of these prompts being pronounced by a large and varied group of people in order to create a new Welsh speech corpus
  • use elements of the corpus to train open-source speech recognition software (Julius and HTK) to control the movement of a toy robot on a Raspberry Pi
  • prepare the corpus for future development of Welsh dictation systems, including creating a typology of language registers with appropriate metadata on a trained corpus which has been tagged with the register characteristics
  • create a plug-in which detects and confirms the default language of the browser in order to Welshify the crowdsourcing pages and other webpages
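
One way to picture the language detection in the last aim is through the Accept-Language header that browsers send with each request. The sketch below is a server-side illustration under that assumption, not the project's plug-in: it parses the header, orders the languages by their q-values, and returns Welsh ("cy") when it is the most preferred supported language. The function name, the supported list and the default are hypothetical.

```python
def preferred_language(accept_language, supported=("cy", "en"), default="en"):
    """Pick the best supported language from an Accept-Language header value."""
    candidates = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Compare on the primary subtag only, so "cy-GB" counts as "cy".
        primary = tag.split("-")[0]
        candidates.append((-quality, position, primary))
    # Highest quality first; ties keep the order given in the header.
    for negative_quality, _, primary in sorted(candidates):
        if negative_quality < 0 and primary in supported:
            return primary
    return default


print(preferred_language("cy-GB,cy;q=0.9,en;q=0.8"))
```

A page could then be served in Welsh by default when the result is "cy", while still letting the user confirm or change the choice.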

Participation

Although the project has formally ended, we continue to collect voices through the Paldaruo app for future use. Welsh speakers of any background or proficiency are invited to participate by downloading the app and reading aloud the displayed prompts so that speech recognition software can be trained to understand Welsh.

iTunes  Google Play

Language register typology matrix

In the table below, we have defined different registers found within the Welsh language. We show their distribution and related levels of the language including examples of specific terms or texts.

In collecting a corpus, it is useful to be able to recognise different types of text automatically and try to classify examples of one type in a unified way. It is possible to have several different models to classify and recognise registers. The aim of the table below is to propose a pattern to aid this work using a computer, rather than using written guidelines.

The table below is not a closed classification, and a text usually contains a mixture of different characteristics. It is the frequency with which the different characteristics are used, rather than their mere presence in a specific text, that will help the machine to identify the register under consideration.

* denotes a form that has been found in the style guide of the Welsh Government Translators but does not necessarily correspond to the descriptive typology of the different registers.

** denotes a form that has been found in Cymraeg Clir.

When referring to vocabulary, the term “safonol” here means forms that are identified in the main Welsh dictionaries.



 

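Registers (table columns): Archaic; Classical; Formal; Technical; Neutral; Simplified language/Clear Welsh; Informal; Very informal/spoken; Regional; Slang

Verb forms
  Archaic: Yr ydwyf…
  Classical: Yr wyf…
  Formal: Rwyf…
  Technical: Rwy…
  Neutral: Rwy…
  Simplified language/Clear Welsh: Rwy…
  Informal: Dw i… [*Rydw i…]
  Very informal/spoken: Dw i…/Wi…/I fi…
  Regional: Dw i…/Wi…/I fi…
  Slang: Fi…

Style dependent
  Marked X for seven registers

Use of the impersonal
  Marked X for four registers; one register carries the note “More common in the past than in the present”

Periphrastic and compact
  Archaic: Compact
  Classical: Compact
  Formal: Mainly compact
  Technical: Mainly compact
  Neutral: Mixed. Uses ‘caiff’ to overcome the compact/periphrastic problem
  Simplified language/Clear Welsh: Periphrastic, except for some very familiar forms
  Informal: Mixed
  Very informal/spoken: Mixed, with the periphrastic much more common
  Regional: Periphrastic in the north, informal compact in the south (e.g. es i instead of euthum/ nes i fynd/ ddaru mi fynd)
  Slang: Various non-standard forms are common

3rd person plural ending
  Archaic: –nt hwy
  Classical: –nt hwy
  Formal: –nt hwy
  Technical: –nt hwy
  Neutral: –nt hwy/-n nhw
  Simplified language/Clear Welsh: -n nhw
  Informal: -n nhw
  Very informal/spoken: -n nhw
  Regional: -n nhw
  Slang: -n nhw

Preverbal particles
  Marked X for four registers; “Occasional” for one register

Personal pronouns
  Archaic: Chwi, chwychwi
  Classical: Chwi/chi [*chi]
  Formal: chi
  Technical: Use of personal forms is rare
  Neutral: chi
  Simplified language/Clear Welsh: chi
  Informal: chi
  Very informal/spoken: Chi/ti
  Regional: Chi/ti/chdi/fe
  Slang: Chi/ti/chdi/fe

Negation
  Archaic: Nid ydwyf…
  Classical: Nid wyf…
  Formal: Nid wyf…
  Technical: Nid wyf/ Dw i ddim…
  Neutral: Nid wyf/ Dw i ddim…
  Simplified language/Clear Welsh: Dw i ddim…
  Informal: Dw i ddim… [*Dydw i ddim…]
  Very informal/spoken: Dw i ddim…
  Regional: Dw i’m/Sai’n…/Sana i…/Nagw i…
  Slang: Fi ddim…

Long sentences, multiple clauses
  Marked X for seven registers [**No more than 25 words in a sentence]

Vocabulary
  Archaic: May include archaic/obsolete words
  Classical: May include archaic words that are still in use
  Formal: Standard contemporary vocabulary
  Technical: Domain-specific technical terms
  Neutral: Standard contemporary vocabulary
  Simplified language/Clear Welsh: Simplified vocabulary
  Informal: Simple, standard
  Very informal/spoken: Simple, with elements of abbreviation/contraction/intrusion
  Regional: Prominent dialect markers. South: taw, ma’s/mâs, moyn, ffaelu. North: efo/hefo, lan, rŵan, ddaru
  Slang: May include obscene words, swear words, and many English words

Abbreviations
  Marked X for six registers

Intrusive vowels
  Marked X for six registers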
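
The idea that the frequency of characteristics, rather than their mere presence, points to a register can be sketched as a simple marker count. The example below is illustrative only: the markers are taken from the verb-form and negation rows of the table above, but grouping them into three broad bands, and the function name, are assumptions made for this sketch rather than the project's classifier.

```python
import re
from collections import Counter

# Illustrative markers from the verb-form and negation rows of the table.
# The three broad bands are an assumption made for this sketch.
MARKERS = {
    "formal": ["nid wyf", "yr wyf", "rwyf"],
    "informal": ["dw i", "rydw i"],
    "regional": ["sai'n", "sana i", "nagw i", "dw i'm"],
}


def register_profile(text):
    """Return the share of marker hits that falls in each band."""
    # Lower-case, unify apostrophes and collapse whitespace before matching.
    normalised = re.sub(r"\s+", " ", text.lower().replace("\u2019", "'"))
    counts = Counter({band: 0 for band in MARKERS})
    for band, phrases in MARKERS.items():
        for phrase in phrases:
            # Require a boundary on both sides so that "dw i" does not
            # match inside "rydw i" or "dw i'm".
            pattern = r"(?<![\w'])" + re.escape(phrase) + r"(?![\w'])"
            counts[band] += len(re.findall(pattern, normalised))
    total = sum(counts.values())
    return {band: (counts[band] / total if total else 0.0) for band in MARKERS}


profile = register_profile("Dw i ddim yn mynd, sai'n gwybod")
```

A text with a handful of informal markers among many formal ones would still score as mostly formal, which is the behaviour the paragraph on frequency describes.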

Analogy of text and register

Note: creative literature e.g. novels can contain a number of different registers in order to convey different effects and thus are not accommodated below.

Registers (table columns): Archaic; Classical; Formal; Technical; Neutral; Simplified language/Cymraeg Clir; Informal; Very informal/spoken; Regional; Slang

Text type: registers marked
Extracts of old speech etc., religious texts: X
Legislation, international contracts: X
Committee reports, public administration, classical journalism: X
Technical documents/research papers: X
Children's school essays, students, press statements: X X
Forms, handouts, corporate websites, public campaigns: X X
Forms, handouts, websites, etc. (prescriptive language): X
Popular journalism: X X
Private letters: X
Transcriptions of spoken language, scripts which are written to be spoken: X
Corporate blogs: X X
Private blogs: X X X
Facebook and similar social media sites: X X X
Twitter: X X X

GALLU Welsh Speech Corpus

One of the main aims of the GALLU project is to collect a Welsh speech corpus through crowdsourcing in order to develop an LVCSR (large vocabulary continuous speech recognition) system for Welsh.

The corpus will consist of recordings of a set of sentences which together contain all of the sounds of the language, and will be used to train an HTK acoustic model to recognise phonemes. A grammar will then be used to map the recognised phonemes to full words. The model and speech recognition system will be open source (within Julius).
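
The step from recognised phonemes to full words can be pictured with a small pronunciation lexicon. The sketch below is an offline illustration, not how Julius performs decoding internally: it segments a phoneme sequence into words from a hypothetical lexicon using dynamic programming, preferring the segmentation with the fewest words. The phoneme symbols and lexicon entries are assumptions chosen for readability, not the project's phone set.

```python
# Hypothetical pronunciation lexicon: phoneme tuple -> word.
# The symbols are illustrative and are not the project's phone set.
LEXICON = {
    ("p", "a", "n"): "pan",
    ("p", "a"): "pa",
    ("n", "a"): "na",
    ("m", "a", "m"): "mam",
    ("t", "a", "d"): "tad",
}


def decode(phonemes, lexicon=LEXICON):
    """Segment a phoneme sequence into lexicon words, or return None."""
    n = len(phonemes)
    longest = max(len(key) for key in lexicon)
    # best[i] holds the shortest word list covering phonemes[:i], if any.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - longest), end):
            if best[start] is None:
                continue
            word = lexicon.get(tuple(phonemes[start:end]))
            if word is None:
                continue
            candidate = best[start] + [word]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


words = decode(["p", "a", "n", "m", "a", "m"])
```

In a real recogniser the grammar also restricts which word sequences are allowed, so the acoustic scores and the grammar are combined during the search rather than applied one after the other.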

By the end of the project (the end of August 2014), the acoustic models and Julius will be able to control the movements of a robotic arm on the Raspberry Pi through Welsh speech commands.
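
A minimal sketch of the glue between recogniser and robot is shown below. It assumes that the recogniser's text output contains result lines of the form "sentence1: <s> ... </s>", and it maps the recognised phrase to an action name. The Welsh command vocabulary and the action names are hypothetical and are not taken from the project's prompts or code.

```python
# Hypothetical command vocabulary; the project's actual prompts may differ.
COMMANDS = {
    "i fyny": "ARM_UP",        # up
    "i lawr": "ARM_DOWN",      # down
    "chwith": "ROTATE_LEFT",   # left
    "dde": "ROTATE_RIGHT",     # right
    "agor": "GRIP_OPEN",       # open
    "cau": "GRIP_CLOSE",       # close
}


def parse_sentence(line):
    """Return the recognised phrase from a 'sentence1:' line, else None."""
    prefix = "sentence1:"
    if not line.startswith(prefix):
        return None
    # Drop the sentence-boundary markers around the recognised words.
    words = [w for w in line[len(prefix):].split() if w not in ("<s>", "</s>")]
    return " ".join(words).lower()


def to_action(line, commands=COMMANDS):
    """Map one line of recogniser output to an action name, or None."""
    phrase = parse_sentence(line)
    if phrase is None:
        return None
    return commands.get(phrase)


action = to_action("sentence1: <s> i fyny </s>")
```

Returning None for unknown phrases keeps the arm still when the recogniser hears something outside the command set.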

Because the outputs are going to be open source, the speech recognition system, the corpus used to train the acoustic models, and the code that makes the software respond to the Welsh speech commands will all be openly available by the end of the project.

The outputs can be incorporated into coding projects and classes for children in Wales.

At the moment there are 20 recordings of contributors saying the sample prompts which have been written to train the robotic arm. Click here to download them.

Metadata Questions: Paldaruo App

1. Ym mha flwyddyn cawsoch chi eich geni?             In what year were you born?

2. Beth yw’ch rhyw?                                                    What is your sex?

Benyw                                Female

Gwryw                               Male

3. Ym mha ranbarth treuliasoch chi’r rhan fwyaf o’ch plentyndod?     In which area did you spend most of your childhood?

De Ddwyrain Cymru                South East Wales

De Orllewin Cymru                  South West Wales

Gogledd Ddwyrain Cymru      North East Wales

Gogledd Orllewin Cymru        North West Wales

Canolbarth Cymru                   Central Wales

Gogledd Lloegr                          North England

Canolbarth Lloegr                    Central England

De Lloegr                                   South England

Gwlad arall                               Another country

Nifer o ardaloedd                   A number of areas

4. Enwch eich ysgol uwchradd olaf.       Name your last secondary school.

Os nad ydych chi wedi mynd i’r ysgol uwchradd, rhowch ‘dim’     If you haven’t been to a secondary school, put “none”

5. Ble rydych chi’n byw ar hyn o bryd?     Where do you live at the moment?

De Ddwyrain Cymru                South East Wales

De Orllewin Cymru                  South West Wales

Gogledd Ddwyrain Cymru      North East Wales

Gogledd Orllewin Cymru        North West Wales

Canolbarth Cymru                   Central Wales

Gogledd Lloegr                         North England

Canolbarth Lloegr                    Central England

De Lloegr                                   South England

Gwlad arall                                Another Country

Nifer o ardaloedd                     A number of areas

6. Fel arfer, pa mor aml ydych chi’n siarad Cymraeg?   Usually, how often do you speak Welsh?

Llai nag awr y mis                    Less than an hour a month

O leiaf awr y mis                        At least an hour a month

O leiaf awr yr wythnos             At least an hour a week

O leiaf awr y dydd                     At least an hour a day

Tua hanner yr amser               About half the time

Rhan fwyaf o’r amser              Most of the time

Bron yn ddieithriad                 Almost always

7. Ym mha gyd-destun rydych chi’n siarad Cymraeg? In which context do you speak Welsh?

Dewiswch y cyd-destunau ble rydych chi’n siarad Cymraeg unwaith yr wythnos neu fwy.   Pick the contexts where you speak Welsh once a week or more.

Ddim yn siarad Cymraeg yn rheolaidd            Don’t speak Welsh regularly

Gartref yn unig                                                     Home only

Ysgol/coleg/gwaith yn unig                              School/college/work only

Gyda ffrindiau yn unig                                      With friends only

Gartref + Ysgol/coleg/gwaith                          Home + school/college/work

Gartref + Ffrindiau                                            Home + with friends

Ysgol/coleg/gwaith + Ffrindiau                      School/college/work + with friends

Gartref + Ysgol/coleg/gwaith + Ffrindiau   Home + school/college/work + with friends

Arall                                                                     Other

8. Ydych chi’n siarad Cymraeg gydag acen iaith gyntaf?   Do you speak Welsh with a first language accent?

Atebwch ‘Iaith Gyntaf’ os oes gennych chi acen iaith gyntaf, neu ‘Dysgwr’ os oes gennych chi acen dysgwr

Answer “first language” if you have a first language accent, or “learner” if you have a learner accent

Acen Dysgwr               Learner accent

Acen Iaith Gyntaf        First language accent

9. Acen pa ranbarth sydd gennych chi?  Which area's accent do you have?

Dewiswch yr ardal mae’ch acen yn dod ohoni (hyd yn oed os ydych chi’n byw yn rhywle arall)

Choose the area that your accent comes from (even if you live somewhere else)

De Ddwyrain               South East

De Orllewin                 South West

Gogledd Ddwyrain      North East

Gogledd Orllewin        North West

Canolbarth                   Central

Acen gymysg/Arall      Mixed accent/ other
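
The answers to these questions could be stored alongside each recording as a small validated record. The sketch below covers a subset of the questions (1 to 5, 8 and 9) using the English option labels. The class name, field names and validation rules are hypothetical and are not taken from the Paldaruo app.

```python
from dataclasses import dataclass

# English labels for the region options in questions 3 and 5.
REGIONS = frozenset({
    "South East Wales", "South West Wales", "North East Wales",
    "North West Wales", "Central Wales", "North England",
    "Central England", "South England", "Another country",
    "A number of areas",
})
ACCENT_TYPES = frozenset({"Learner accent", "First language accent"})
ACCENT_REGIONS = frozenset({
    "South East", "South West", "North East", "North West",
    "Central", "Mixed accent/other",
})


@dataclass(frozen=True)
class SpeakerMetadata:
    """Hypothetical record of one contributor's answers."""
    birth_year: int
    sex: str
    childhood_region: str
    secondary_school: str  # "none" if the speaker did not attend one
    current_region: str
    accent_type: str
    accent_region: str

    def __post_init__(self):
        if not isinstance(self.birth_year, int):
            raise ValueError("birth_year must be an integer")
        checks = [
            ("sex", self.sex, {"Female", "Male"}),
            ("childhood_region", self.childhood_region, REGIONS),
            ("current_region", self.current_region, REGIONS),
            ("accent_type", self.accent_type, ACCENT_TYPES),
            ("accent_region", self.accent_region, ACCENT_REGIONS),
        ]
        for name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{name}: unexpected value {value!r}")


speaker = SpeakerMetadata(
    birth_year=1980,
    sex="Female",
    childhood_region="North West Wales",
    secondary_school="none",
    current_region="South East Wales",
    accent_type="First language accent",
    accent_region="North West",
)
```

Rejecting unexpected values at creation time keeps the metadata consistent, which matters when recordings are later grouped by accent or region.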

Basic Speech Recognition for Welsh Project

The Basic Speech Recognition for Welsh Project was a small pilot project funded by the Welsh Language Board. The project developed a speech-controlled calculator to highlight the potential for Welsh-language speech recognition. The resulting software was a laboratory prototype rather than a market-ready product. The research was incorporated into the GALLU and Seilwaith Cyfathrebu Cymraeg (Welsh Language Communications Infrastructure) projects.