Electronic Corpus of Welsh (CEG)

The original project was financed during 1993/4 with a £21,000 grant awarded by the Higher Education Funding Council for Wales to Ellis, O’Dochartaigh & Hicks from the IT Unit, Welsh Department and School of Psychology, University of Wales, Bangor.

It included 1,079,032 words of written Welsh prose, mainly from 1970 onwards, based on 500 samples of around 2,000 words each. The data was tagged and analysed for various linguistic studies. The original files may still be accessed at http://www.bangor.ac.uk/canolfanbedwyr/ceg.php.en and are now maintained by the staff of the Language Technologies Unit.

Because of the demand for a user-friendly, searchable interface for the corpus, in 2012 the Unit developed another version, using the Cysefin and Hebog platform to display the data. The texts displayed are the same in both versions; the only difference is in the methods of showing and searching the data. The version using Cysefin and Hebog may be accessed from the Welsh National Corpus Portal.


A 1 million word lexical database and frequency count for Welsh

Ellis, N. C., O’Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001)

 


 


Brief Summary

This is a word frequency analysis of 1,079,032 words of written Welsh prose, based on 500 samples of approximately 2,000 words each, selected from a representative range of text types to illustrate modern (mainly post-1970) Welsh prose writing. It was conceived as providing a Welsh parallel to the Kucera and Francis analysis for American English, and the LOB corpus for British English, in the expectation that such an analysed corpus would provide research tools for a number of academic disciplines: psychology and psycholinguistics, child and second language acquisition, general linguistics, and the linguistics of Modern Welsh, including literary analysis.

The sample included materials from the fields of novels and short stories, religious writing, children’s literature both factual and fictional, non-fiction materials in the fields of education, science, business, leisure activities, etc., public lectures, newspapers and magazines, both national and local, reminiscences, academic writing, and general administrative materials (letters, reports, minutes of meetings).

The resultant corpus was analysed to produce frequency counts of words both in their raw form and as counts of lemmas where each token is demutated and tagged to its root. This analysis also derives basic information concerning the frequencies of different word classes, inflections, mutations, and other grammatical features.

Articles based on the use of the database should cite:

Ellis, N. C., O’Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]


Background

This project was funded for the academic year 1993-94 by a grant of £21K from the Higher Education Funding Council for Wales to Ellis, O’Dochartaigh & Hicks of the Welsh IT Unit and the School of Psychology, University of Wales, Bangor. The researchers began work on the project in October 1993, and after the sample range had been identified in collaboration with Professor Gwyn Thomas of the Department of Welsh, proceeded to collect the required range of texts. The original intention was that this range of materials would be acquired in an electronic form from Welsh language publishers and other bodies, such as local authorities, governmental organizations, and papurau bro (locally produced newspapers). However, it proved to be impossible to collect the necessary breadth of materials in an electronic form, primarily because at that time Welsh language publishers did not generally keep computer-based archive copies of books which they may have published using electronic means.

Under these circumstances, having acquired around 200 usable samples from various bodies, it was decided to input the remainder by using both typists and an OCR system. The task of checking such typed copy, and in particular of correcting the errors introduced by the OCR software, was carried out by the researcher, assisted by the on-going development of the Welsh spelling-checker, CySill. The additional costs of this work were borne by funding from the Welsh IT Unit at Bangor.

Where material was obtained directly from publishers or from individual authors, permission was sought for the data to be included in the project analysis, with the understanding that if they were ever to be made available to a wider audience, then a formal request would be made to the copyright holders for this use. Where samples were taken either by typing or by OCR from published works, formal permission for their use has not yet been requested, as it was considered that samples of 2,000 words could in most cases be regarded as “fair-dealing” for academic research purposes under the Copyright Acts. Any future public use of these materials will require the formal permission of their copyright holders.

It was decided to use the analytical software for Welsh which had been developed for a Welsh language spelling checker, then under way in the School of Psychology for Bwrdd yr Iaith Gymraeg / The Welsh Language Board. This spelling checker in its improved form involved a set of lemmatization algorithms for handling the language in a computer environment, and it was felt that these programs could be adapted for lemmatizing the CEG text samples. The basic program for the spelling checker was modified to allow it to process and analyze the texts in an interactive way. This required the ability to present the original text on screen for inspection by the researcher, and to offer interactive dialogue boxes to solve two fundamental problems with the software: the appearance of words or word forms which did not appear in the spelling checker’s own dictionary, and the possibility of homographs. The latter difficulty was solved by arranging for the software to identify a lemma by stripping off a particular ending and/or by demutating a word, then continuing to try possible endings and initial mutations in combination with other lemmas to check for possible homographs, effectively on the fly. Any such forms identified were presented on-screen to the researcher, with the original text still visible, to allow an informed choice to be made between the possibilities. In a similar way, the appearance of an unrecognized word or word form generated a dialogue box to allow the researcher to enter such words into a user dictionary, as well as allowing the forms to be incorporated into the tagged files which were produced from each separate text sample.
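The demutation step of the homograph check described above can be sketched roughly as follows. This is not the project's software: the rule table is a simplified list of standard Welsh initial mutations, the function name and the toy lexicon are hypothetical, and ending-stripping and h-provection are left out to keep the example short.

```python
# Reverse initial-mutation rules: (mutated onset, radical onset, mutation label).
# Labels follow the corpus tags: meddal (soft), llaes (aspirate), trwynol (nasal).
DEMUTATION_RULES = [
    ("ngh", "c", "trwynol"), ("mh", "p", "trwynol"), ("nh", "t", "trwynol"),
    ("ng", "g", "trwynol"), ("m", "b", "trwynol"), ("n", "d", "trwynol"),
    ("ch", "c", "llaes"), ("ph", "p", "llaes"), ("th", "t", "llaes"),
    ("g", "c", "meddal"), ("b", "p", "meddal"), ("d", "t", "meddal"),
    ("f", "b", "meddal"), ("f", "m", "meddal"), ("dd", "d", "meddal"),
    ("l", "ll", "meddal"), ("r", "rh", "meddal"),
    ("", "g", "meddal"),  # soft mutation deletes an initial g
]


def lemma_candidates(word, lexicon):
    """Return every (lemma, mutation) reading of `word` that `lexicon` allows."""
    readings = []
    if word in lexicon:
        readings.append((word, None))  # unmutated reading
    for mutated, radical, label in DEMUTATION_RULES:
        if word.startswith(mutated):
            candidate = radical + word[len(mutated):]
            if candidate in lexicon and (candidate, label) not in readings:
                readings.append((candidate, label))
    return readings


# Hypothetical toy lexicon; the real system used the spelling checker's dictionary.
TOY_LEXICON = {"ceg", "bod", "rhodio", "bedd", "medd"}

for form in ["geg", "ngheg", "fod", "fedd"]:
    print(form, lemma_candidates(form, TOY_LEXICON))
```

A form such as fedd yields two readings (soft-mutated bedd or soft-mutated medd); in the workflow described above, that is the kind of case that would be shown to the researcher for a decision.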

The main researcher worked on 350 out of the 500 samples, and a part-time researcher was employed through the Welsh IT Unit to analyze 150 of the samples. The average time for the analysis of each sample was around 1 hour, though the need to read over and correct typed or OCR-scanned text raised this to around 2 hours per sample.


File formats and character coding conventions

All files are Windows files with <CR><LF> used as line separator.

Accents are placed after the vowel (+ = circumflex, % = dieresis, / = acute accent, \ = grave accent).
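For readers who want accented Unicode text rather than the coded form, the convention above can be undone with a short script. The following is a minimal sketch, not a tool shipped with the corpus: the function name is hypothetical, and it assumes that the four marker characters only ever follow a vowel when they encode an accent (a literal slash after a vowel, for instance, would be misread).

```python
import re
import unicodedata

# Marker character -> Unicode combining accent.
COMBINING = {
    "+": "\u0302",   # circumflex
    "%": "\u0308",   # dieresis
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
}

# A Welsh vowel (w and y included) followed by one of the four markers.
_CODED_VOWEL = re.compile(r"([aeiouwyAEIOUWY])([+%/\\])")


def decode_accents(text):
    """Replace vowel+marker pairs with precomposed accented characters."""
    combined = _CODED_VOWEL.sub(
        lambda m: m.group(1) + COMBINING[m.group(2)], text
    )
    return unicodedata.normalize("NFC", combined)


print(decode_accents("o+l"))  # the coded form o+l appears in the frequency list
print(decode_accents("a+"))
```

Normalising to NFC at the end means each vowel and its accent come out as a single code point wherever Unicode provides one.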


Description of the text files

Details of the 500 text samples are provided in the files below which list file number, text category, title, author and date.

The description data can be downloaded in the following formats:

The text category codes are as follows:

 

Category (Welsh)                                 Code   Category (English)
                                                 Rh Ff
Gwasg – Gwyddonol                                G Gw   Press – Scientific
Gwasg – Adroddiad                                G A    Press – Report
Gwasg – Golygyddol                               G G    Press – Editorial
Gwasg – Adolygiad                                G Ad   Press – Review
Gwasg – Llythyrau                                G Ll   Press – Letters
Plant – Ffeithiol                                P Ff   Factual – Children
Ysgrythurol                                      Y      Scriptural
Bro a Bywyd Gwerin                               B      Community Life
Gweinyddol – Adroddiad                           Gw Ad  Administrative – Report
Gweinyddol – Llythyrau                           Gw Ll  Administrative – Letters
Gweinyddol – Cofnodion/cytundebau                Gw C   Administrative – Minutes/contracts
Academaidd                                       A      Academic
Hunangofiant / Cofiant / Dyddiaduron / Atgofion  H      Biography / Diaries / Memories
Sgyrsiau/pigion                                  S      Discussions / Highlights
Medrau a Diddordebau                             M      Skills and Interests
Rhyddiaith Ddychmygol                            Rh Dd  Fiction
Nofelau                                          N      Novels
Straeon Byrion                                   SB     Short Stories
Plant – Nofel                                    PN     Children’s Novel
Plant – Straeon                                  PS     Children’s Stories
Dyddiadur Dychmygol                              D      Fictitious Diaries
Ysgrifau                                         YS     Articles / Essays

 

 


The Raw and Tagged Datafiles

Most users will probably only want to access the processed results – the frequency counts of word forms or lemmas presented below. However, we also provide the original text samples as ASCII files along with the 500 tagged files for those who need to find words or constructions in their original context or for scholars who wish to correct or take forward the analyses presented here.

The 500 original text samples, each of approximately 2000 words:

The 500 tagged files have the following format

Lemma [tab] Raw word [tab] Part Of Speech [tab] Mutation – if present  [tab] Line Number

Each line shows the lemmatized form, the original word, the part of speech, type of mutation if present, and the location of the word (sample number, sentence number within sample, word number within sentence). For verbal forms, a number is used with the lemma to show the particular morphographemic form appearing.
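A line in this format can be split into its fields with a few lines of Python. The sketch below is only an illustration under stated assumptions: the `Token` type and the function name are hypothetical, the location is assumed to be written as `[sample.sentence.word]` as in the sample sentence in this document, and a missing mutation is assumed to appear either as an empty field or as no field at all.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    lemma: str
    verb_form: Optional[str]  # verb-form code after the colon, e.g. "3" in "bod:3"
    raw: str
    pos: str
    mutation: Optional[str]
    sample: int
    sentence: int
    word: int


def parse_tagged_line(line):
    """Parse one tab-separated line of a tagged file into a Token."""
    fields = [f.strip() for f in line.rstrip("\r\n").split("\t")]
    lemma_field, raw, pos = fields[0], fields[1], fields[2]
    # The mutation field is optional; the location is always the last field.
    mutation = fields[3] if len(fields) == 5 and fields[3] else None
    lemma, _, verb_form = lemma_field.partition(":")
    sample, sentence, word = (int(n) for n in fields[-1].strip("[]").split("."))
    return Token(lemma, verb_form or None, raw, pos, mutation,
                 sample, sentence, word)


print(parse_tagged_line("bod\tfod\tvb\tmeddal\t[74.2.6]"))
print(parse_tagged_line("bod:3\tydi\tvbf\t\t[74.2.2]"))
```

Keeping the verb-form code as a string rather than an integer allows for codes such as 4.1, which appear in the lemma analysis.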

An illustration of a tagged sample sentence from a text is given at the end of the Data quality section below.

We believe this text corpus is of value for an analysis of Welsh prose sentence patterns, for co-occurrence analyses of both individual lemmas and grammatical parts of speech in running texts, and for further linguistic analysis by specialist researchers in the fields of Welsh syntax and child language acquisition. However, researchers must take note of some limitations in data quality, particularly regarding the accuracy of some of the lemma tags, which were compromised by word form homography – these limitations are described below.


Data quality

We believe that the accuracy of the raw word forms in the database and their counts is quite high. Whatever errors (spelling or typographical) there were in the original samples will be carried over to the corpus. We must surely have introduced and failed to detect some additional errors in input, but we have tried hard to keep this number very low.

Tag quality is something of a different matter. The problems of high homography rates, a limited window template-matching lemmatiser with few rules, and the need for skilled linguistic analysis, compounded into a non-trivial number of tagging errors.  A preliminary analysis of 5% of the corpus indicates that there is an error rate of 4% +/- 3%.

These tagging errors are by no means distributed equally about the database. For example, inaccuracies in the tagging of yn, bod/fod, and a (that is, more generally, the high-frequency closed-class words) are much more common than inaccuracies with the open-class words. Thus while the token error rate is perhaps 4%, the type error rate is much less than that. We do not have the resources to correct these miscodings. As well as noting the errors on a print-out of the output files, it would be necessary for any corrections to be written back to the files, and we estimate that a detailed correction of the full set would require two years’ work. Having tried to raise these resources, and waited too long, we have decided to release the database as it now stands – it is certainly better than nothing.

Nonetheless, researchers must take note of these limitations in data quality, particularly regarding the accuracy of some of the lemma tags.

 

Tagged sample sentence (sample 74, sentence 2):

Lemma     Raw word  POS       Mutation  Location
a         a         part                [74.2.1]
bod:3     ydi       vbf                 [74.2.2]
hynny     hynny     DemPron             [74.2.3]
’n        ’n        vbadj               [74.2.4]
golygu    golygu    vb                  [74.2.5]
bod       fod       vb        meddal    [74.2.6]
y         y         DefArt              [74.2.7]
rhai      rhai      pron                [74.2.8]
dagreuol  dagreuol  adj                 [74.2.9]
yn        yn        prep                [74.2.10]
ein       ein       pron                [74.2.11]
plith     plith     nm                  [74.2.12]
yn        yn        YnPred              [74.2.13]
iach      iachach   CompAdj             [74.2.14]
na        na        conj                [74.2.15]
’r        ’r        DefArt              [74.2.16]
rhai      rhai      pron                [74.2.17]
sych      sych      adj                 [74.2.18]
?         ?         punct               [74.2.19]

 

We believe the Counts of raw word forms to be highly accurate.

The Lemma Counts with analysis of inflections and mutations run at about 96% accuracy, with most problems on the high-frequency closed-class words.


Processed Results: Counts of Raw Word Forms

The word counts are based on the actual word forms occurring.  These words include spellings which represent dialectal forms, informal spellings of Welsh forms (generally following the suggestions of Cymraeg Byw, though this is by no means a universally applied standard for informal writing), foreign words (particularly from English), as well as wrongly spelled Welsh words (that is, misprints in the original texts).

The total number of word form tokens in the corpus is 1,079,032.

The total number of separate word form types is 37,195.

The 50 most frequent raw word forms are:

Rank  Count  Word      Rank  Count  Word
   1  55588  yn          26   3821  cael
   2  45945  y           27   3754  yw
   3  33327  i           28   3546  wrth
   4  33231  a           29   3545  ni
   5  32573  ’r          30   3463  hyn
   6  26927  o           31   3023  na
   7  15888  ar          32   2870  o+l
   8  14990  ei          33   2721  hynny
   9  14845  ’n          34   2646  fe
  10  14523  yr          35   2613  er
  11  11785  ac          36   2594  neu
  12   9922  oedd        37   2585  nid
  13   9338  bod         38   2542  at
  14   9056  mae         39   2511  sy
  15   7751  am          40   2417  ’w
  16   7093  wedi        41   2401  hi
  17   6118  ond         42   2360  dim
  18   5568  un          43   2278  mynd
  19   5415  ’i          44   2240  byddai
  20   5294  eu          45   2160  gyda
  21   4991  gan         46   2137  yng
  22   4988  fel         47   2110  iawn
  23   4578  mewn        48   2066  pob
  24   4149  a+          49   2065  lle
  25   4142  roedd       50   2027  pan

At the other end of the frequency range, there is a very long tail of single occurrence forms, with 44% of the total entries falling into this group, and between them, the numbers of single, double and triple occurrence words make up 64% of the total number of separate words (37,195). As might be expected, a large number of these very low frequency words consist of foreign borrowings, mis-spellings, dialectal forms and other types of variant spellings, and numbers. In most cases, the analysis program does distinguish between several of these categories (mis-spellings, foreign words, informal spellings), but such entries would require further checking if 100% accuracy were essential.

16,316 words with a single occurrence: 44% of separate words
 5,013 words with two occurrences: 13% of separate words
 2,644 words with three occurrences: 7% of separate words
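The percentages above follow directly from the type counts and can be rechecked in a few lines. The figures in this sketch are taken from the text; the variable and function names are only illustrative.

```python
TOTAL_TYPES = 37_195  # separate word form types in the corpus

# Number of types occurring exactly once, twice and three times.
occurrence_bands = {1: 16_316, 2: 5_013, 3: 2_644}


def share_of_types(n_types, total=TOTAL_TYPES):
    """Percentage of all word form types represented by n_types."""
    return 100 * n_types / total


for occurrences, n_types in occurrence_bands.items():
    print(f"{occurrences} occurrence(s): {n_types:,} types, "
          f"{share_of_types(n_types):.1f}% of types")

combined = sum(occurrence_bands.values())
print(f"1 to 3 occurrences: {combined:,} types, "
      f"{share_of_types(combined):.1f}% of types")
```

Rounded to whole percentages, these reproduce the 44%, 13%, 7% and combined 64% quoted above.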

 

 Lemma Counts with analyses of inflections and mutations

The lemmatizing software was used to demutate and uninflect word forms in order to track them back to their lemma. Examples of the resulting lemma analysis are shown for illustration in the table below:

 

Lemma (count)  Lemma by POS (count)  Inflected form (count, POS)  Surface form (count, POS, verb-form code, mutation)
ceg 118        ceg n 118             ceg 109 nf                   ceg 22 nf
                                                                  cheg 21 nf llaes
                                                                  geg 56 nf meddal
                                                                  ngheg 10 nf trwynol
                                     cegau 9 npl                  cegau 9 npl
rhodio 16      rhodio vb 16          rhodia 2 vbf                 rhodia 1 vbf :3
                                                                  rodia 1 vbf :3 meddal
                                     rhodiai 1 vbf                rodiai 1 vbf :10 meddal
                                     rhodio 12 vb                 rhodio 7 vb
                                                                  rodio 5 vb meddal
                                     rhodiwn 1 vbf                rhodiwn 1 vbf :4.1

The lemma ceg appears 118 times, exclusively as a noun. 109 of these occurrences are as the feminine singular noun (ceg) and 9 as the plural noun (cegau). As the singular noun it appeared 22 times in unmutated form, 21 times with aspirate mutation, 56 times with soft mutation, and 10 times with nasal mutation.

The lemma rhodio appeared 16 times, always as a verb. Two of these occurrences were as the third person singular present (rhodia), once in unmutated form and once with soft mutation; 1 occurrence was as the third person singular imperfect in soft mutated form (rodiai); 12 occurrences were as the verb noun rhodio (7 times unmutated and 5 times with soft mutation); and 1 occurrence was as rhodiwn, tagged with verb-form code 4.1 (code 4 being the present tense first person plural). There are many verb forms for Welsh; the full list of verb form codes is shown below.

Verb-form Codes

The table of verb form codes is shown below:

1 af present tense first person singular
2 i present tense second person singular
3 a present tense third person singular
4 wn present tense first person plural
5 wch present tense second person plural
6 ant present tense third person plural
7 ir present tense impersonal
8 it imperfect tense second person singular
9 et imperfect tense second person singular
10 ai imperfect tense third person singular
11 em imperfect tense first person plural
12 ech imperfect tense second person plural
13 ent imperfect tense third person plural
14 id imperfect tense impersonal
15 ais past tense first person singular
16 aist past tense second person singular
17 odd past tense third person singular
18 asom past tense first person plural
19 asoch past tense second person plural
20 asant past tense third person plural
21 wyd past tense impersonal
22 aswn pluperfect first person singular
23 asit pluperfect second person singular
24 aset pluperfect second person singular
25 asai pluperfect third person singular
26 asem pluperfect first person plural
27 asech pluperfect second person plural
28 asent pluperfect third person plural
29 asid pluperfect impersonal
30 ed impersonal imperative
31 wyf subjunctive first person singular
32 ych subjunctive second person singular
33 o subjunctive third person singular
34 om subjunctive first person plural
35 och subjunctive second person plural
36 ont subjunctive third person plural
37 er subjunctive impersonal
38 es past tense first person singular
39 est past tense second person singular
40 ith Informal third person singular
41 iff Informal Future third person singular
42 on Informal Past third person plural
43 an Informal Future third person plural

The file, Lemma Counts with Analysis, downloadable below, is tab-separated and can be imported into Excel where it can be readily manipulated to provide a wide range of analyses. One example, based on a sort of the final field (mutation), generates the following results for initial mutations.

Initial mutations
Welsh words can exhibit one of four types of morphophonemic initial mutation, and the occurrences and relative frequencies of such forms in the sample are:

Soft mutation (Treiglad Meddal)      134,349   12.45%
Spirant mutation (Treiglad Llaes)      9,123    0.85%
Nasal mutation (Treiglad Trwynol)      5,667    0.53%
h-provection                           1,990    0.19%
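A tally of this kind can also be scripted rather than done by sorting in Excel. The sketch below makes some assumptions: it works on lines from the tagged files (the column layout of the Lemma Counts file is not described here), it treats an empty or absent fourth field as "no mutation", and it counts every tagged line as a token. The function name is hypothetical, and the label used in the files for h-provection is not shown in this document.

```python
from collections import Counter


def mutation_shares(tagged_lines):
    """Count mutation labels and express each as a percentage of all tokens."""
    counts = Counter()
    total = 0
    for line in tagged_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # not a tagged token line
        total += 1
        if len(fields) >= 5 and fields[3].strip():
            counts[fields[3].strip()] += 1
    shares = {label: (n, 100 * n / total) for label, n in counts.items()}
    return shares, total


# Four consecutive tokens from the sample sentence shown in this document.
sample = [
    "golygu\tgolygu\tvb\t\t[74.2.5]",
    "bod\tfod\tvb\tmeddal\t[74.2.6]",
    "y\ty\tDefArt\t\t[74.2.7]",
    "rhai\trhai\tpron\t\t[74.2.8]",
]
print(mutation_shares(sample))
```

Run over all 500 tagged files, a tally like this should give figures comparable to the table above, subject to how punctuation tokens are counted.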

Download Wordform Files

  • Word Counts (freq)  – Counts of raw word forms sorted in decreasing frequency
  • Word Counts (alpha)  – Counts of raw word forms sorted in alphabetic order
  • Lemma Counts with Analysis – Counts of lemmas, plus inflected forms, parts of speech and mutations

Use of these Materials

These materials have been produced on a small budget for academic research. You are welcome to use the materials for any non-commercial purpose. We have produced these analyses in good faith to the best of our abilities given the limited resources. As we have described above, you should be aware that there are some inaccuracies in the taggings. We bear no responsibility for any damaging consequences that may result from these.

We welcome further research to extend or correct these linguistic descriptions.

Articles based on the use of the database should cite:

Ellis, N. C., O’Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N.  (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh [On-line]