The original project was financed during 1993/4 with a £21,000 grant awarded by the Higher Education Funding Council for Wales to Ellis, O’Dochartaigh & Hicks of the IT Unit, Welsh Department and School of Psychology, University of Wales, Bangor.
It included 1,079,032 words of written Welsh prose, mainly from 1970 onwards, based on 500 samples of around 2,000 words each. The data was tagged and analysed for various linguistic studies. The original files may still be accessed at http://www.bangor.ac.uk/canolfanbedwyr/ceg.php.en and are now maintained by the staff of the Language Technologies Unit.
Because of the demand for a user-friendly, searchable interface for the corpus, in 2012 the Unit developed another version, using the Cysefin and Hebog platform to display the data. The texts displayed are the same in both versions; the only difference is in the methods of showing and searching the data. The version using Cysefin and Hebog may be accessed from the Welsh National Corpus Portal.
A 1 million word lexical database and frequency count for Welsh
Ellis, N. C., O’Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N.
- Brief Summary
- File formats and Character coding conventions
- Description of the text files
- The Raw and Tagged Datafiles
- Data quality
- Counts of Raw Word Forms
- Counts with analyses of inflections and mutations
- Download Word Form files
- Contact Information
- Use of these Materials
This is a word frequency analysis of 1,079,032 words of written Welsh prose, based on 500 samples of approximately 2,000 words each, selected from a representative range of text types to illustrate modern (mainly post 1970) Welsh prose writing. It was conceived as providing a Welsh parallel to the Kucera and Francis analysis for American English, and the LOB corpus for British English, in the expectation that such an analysed corpus would provide research tools for a number of academic disciplines: psychology and psycholinguistics, child and second language acquisition, general linguistics, and the linguistics of Modern Welsh, including literary analysis.
The sample included materials from the fields of novels and short stories, religious writing, children’s literature both factual and fiction, non-fiction materials in the fields of education, science, business, leisure activities, etc., public lectures, newspapers and magazines, both national and local, reminiscences, academic writing, and general administrative materials (letters, reports, minutes of meetings).
The resultant corpus was analysed to produce frequency counts of words both in their raw form and as counts of lemmas where each token is demutated and tagged to its root. This analysis also derives basic information concerning the frequencies of different word classes, inflections, mutations, and other grammatical features.
Articles based on the use of the database should cite:
Ellis, N. C., O’Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]
This project was funded for the academic year 1993-94 by a grant of £21K from the Higher Education Funding Council for Wales to Ellis, O’Dochartaigh & Hicks of the Welsh IT Unit and the School of Psychology, University of Wales, Bangor. The researchers began work on the project in October 1993, and after the sample range had been identified in collaboration with Professor Gwyn Thomas of the Department of Welsh, proceeded to collect the required range of texts. The original intention was that this range of materials would be acquired in an electronic form from Welsh language publishers and other bodies, such as local authorities, governmental organizations, and papurau bro (locally produced newspapers). However, it proved to be impossible to collect the necessary breadth of materials in an electronic form, primarily because at that time Welsh language publishers did not generally keep computer-based archive copies of books which they may have published using electronic means.
Under these circumstances, having acquired around 200 usable samples from various bodies, it was decided to input the remainder by using both typists and an OCR system. The task of checking such typed copy, and in particular of correcting the errors introduced by the OCR software, was carried out by the researcher, assisted by the on-going development of the Welsh spelling-checker, CySill. The additional costs of this work were borne by funding from the Welsh IT Unit at Bangor.
Where material was obtained directly from publishers or from individual authors, permission was sought for the data to be included in the project analysis, with the understanding that if the data were ever to be made available to a wider audience, then a formal request would be made to the copyright holders for this use. Where samples were taken either by typing or by OCR from published works, formal permission for their use has not yet been requested, as samples of 2,000 words were in most cases considered to fall under “fair dealing” for academic research purposes under the Copyright Acts. Any future public use of these materials will require the formal permission of their copyright holders.
It was decided to use the analytical software for Welsh which had been developed for a Welsh language spelling checker, then under way in the School of Psychology for Bwrdd yr Iaith Gymraeg / The Welsh Language Board. This spelling checker in its improved form involved a set of lemmatization algorithms for handling the language in a computer environment, and it was felt that these programs could be adapted for lemmatizing the CEG text samples. The basic program for the spelling checker was modified to allow it to process and analyze the texts interactively. This required the ability to present the original text on screen for inspection by the researcher, and to offer interactive dialogue boxes to solve two fundamental problems with the software: the appearance of words or word forms which were not in the spelling checker’s own dictionary, and the possibility of homographs. The latter difficulty was solved by arranging for the software to identify a lemma by stripping off a particular ending and/or by demutating a word, then continuing to try possible endings and initial mutations in combination with other lemmas to check for possible homographs, effectively on the fly. Any such forms identified were presented on-screen to the researcher, with the original text still visible, to allow an informed choice to be made between the possibilities. In a similar way, the appearance of an unrecognized word or word form generated a dialogue box to allow the researcher to enter such words into a user dictionary, as well as allowing the forms to be incorporated into the tagged files which were produced from each separate text sample.
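The homograph check described above can be illustrated with a minimal sketch. This is not the CEG or CySill code: the reverse-mutation rule table, the function names `demutation_candidates` and `analyses`, and the toy lexicon are all assumptions made for illustration. Only initial mutations are handled (the stripping of endings is omitted), and "aspirate" is used here for the mutation that is also called spirant.

```python
# Reverse mutation rules: (mutated initial, radical initial, mutation name).
# Illustrative rule table only; not taken from the project software.
REVERSE_RULES = [
    ("b", "p", "soft"), ("d", "t", "soft"), ("g", "c", "soft"),
    ("f", "b", "soft"), ("f", "m", "soft"), ("dd", "d", "soft"),
    ("l", "ll", "soft"), ("r", "rh", "soft"),
    ("", "g", "soft"),  # soft mutation deletes an initial g
    ("mh", "p", "nasal"), ("nh", "t", "nasal"), ("ngh", "c", "nasal"),
    ("m", "b", "nasal"), ("n", "d", "nasal"), ("ng", "g", "nasal"),
    ("ph", "p", "aspirate"), ("th", "t", "aspirate"), ("ch", "c", "aspirate"),
]


def demutation_candidates(form):
    """Return every (candidate radical form, mutation name) pair for a word form."""
    form = form.lower()
    candidates = [(form, "none")]  # the form may simply be unmutated
    for mutated, radical, name in REVERSE_RULES:
        if form.startswith(mutated):
            candidates.append((radical + form[len(mutated):], name))
    return candidates


def analyses(form, lexicon):
    """Keep only the candidates found in the lexicon.

    More than one surviving candidate signals a possible homograph that
    a human annotator would need to resolve in context.
    """
    return [(lemma, mutation)
            for lemma, mutation in demutation_candidates(form)
            if lemma in lexicon]


# Hypothetical toy lexicon.
lexicon = {"ceg", "ban", "man", "gardd"}
print(analyses("ngheg", lexicon))  # [('ceg', 'nasal')]
print(analyses("fan", lexicon))    # [('ban', 'soft'), ('man', 'soft')]
```

The second call returns two analyses for the same surface form, which is the kind of ambiguity the interactive dialogue boxes were designed to put in front of the researcher.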
The main researcher worked on 350 out of the 500 samples, and a part-time researcher was employed through the Welsh IT Unit to analyze 150 of the samples. The average time for the analysis of each sample was around 1 hour, though the need to read over and correct typed or OCR scanned text raised this to around 2 hours per sample.
All files are Windows files with <CR><LF> used as line separator.
Accents are placed after the vowel (+ = circumflex, % = dieresis, / = acute accent, \ = grave accent).
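As a rough illustration of the coding convention above, the following sketch converts the ASCII accent markers into accented Unicode characters. The function name `decode_accents` and the vowel set are assumptions, and the sketch treats any marker that directly follows a vowel as an accent, so a literal `/` or `+` after a vowel in running text would be misread.

```python
import unicodedata

# ASCII marker placed after a vowel -> Unicode combining character.
MARKERS = {
    "+": "\u0302",   # circumflex
    "%": "\u0308",   # dieresis
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
}

# Welsh vowels, including w and y (assumed set for this sketch).
VOWELS = set("aeiouwyAEIOUWY")


def decode_accents(text):
    """Replace vowel+marker pairs with the precomposed accented vowel."""
    out = []
    for ch in text:
        if ch in MARKERS and out and out[-1] in VOWELS:
            out.append(MARKERS[ch])  # attach the accent to the preceding vowel
        else:
            out.append(ch)           # ordinary character, or marker not after a vowel
    # NFC composes vowel + combining mark into a single code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))


print(decode_accents("dw+r"))   # dŵr
print(decode_accents("copi%o")) # copïo
```

When reading the files themselves, the `<CR><LF>` separators can be stripped with `line.rstrip("\r\n")` before decoding.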
Details of the 500 text samples are provided in the files below which list file number, text category, title, author and date.
The description data can be downloaded in the following formats:
The text category codes are as follows:
|Gwasg – Gwyddonol||G Gw||Press – Scientific|
|Gwasg – Adroddiad||G A||Press – Report|
|Gwasg – Golygyddol||G G||Press – Editorial|
|Gwasg – Adolygiad||G Ad||Press – Review|
|Gwasg – Llythyrau||G Ll||Press – Letters|
|Plant – Ffeithiol||P Ff||Factual – Children|
|Bro a Bywyd Gwerin||B||Community Life|
|Gweinyddol – Adroddiad||Gw Ad||Administrative – Report|
|Gweinyddol – Llythyrau||Gw Ll||Administrative – Letters|
|Gweinyddol – Cofnodion/cytundebau||Gw C||Administrative – Minutes/contracts|
|Hunangofiant / Cofiant/ Dyddiaduron / Atgofion||H||Biography/ Diaries/Memories|
|Medrau a Diddordebau||M||Skills and Interests|
|Rhyddiaith Ddychmygol||Rh Dd||Fiction|
|Straeon Byrion||SB||Short Stories|
|Plant – Nofel||PN||Children’s Novel|
|Plant – Straeon||PS||Children’s Stories|
|Dyddiadur Dychmygol||D||Fictitious Diaries|
Most users will probably only want to access the processed results – the frequency counts of word forms or lemmas presented below. However, we also provide the original text samples as ASCII files along with the 500 tagged files for those who need to find words or constructions in their original context or for scholars who wish to correct or take forward the analyses presented here.
The 500 original text samples, each of approximately 2000 words:
- Original ASCII files (zipped) (2.1Mb)
The 500 tagged files have the following format
Lemma [tab] Raw word [tab] Part Of Speech [tab] Mutation – if present [tab] Line Number
Each line shows the lemmatized form, the original word, the part of speech, type of mutation if present, and the location of the word (sample number, sentence number within sample, word number within sentence). For verbal forms, a number is used with the lemma to show the particular morphographemic form appearing.
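A minimal sketch of reading one line in the five-field format just described is given here. The class name `TaggedToken`, the function name `parse_tagged_line`, and the sample values are hypothetical. The location field is kept as an opaque string because its internal layout is not specified here, and an empty mutation field is assumed to mean that no mutation was recorded.

```python
from dataclasses import dataclass


@dataclass
class TaggedToken:
    lemma: str
    raw_word: str
    part_of_speech: str
    mutation: str   # assumed to be empty when no mutation is present
    location: str   # sample / sentence / word position, left unparsed


def parse_tagged_line(line):
    """Split one tab-delimited line of a tagged file into its five fields."""
    fields = line.rstrip("\r\n").split("\t")  # files use <CR><LF> line endings
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    return TaggedToken(*fields)


# Hypothetical line; the tag and location values are invented for illustration.
token = parse_tagged_line("ceg\tgeg\tN\tsoft\t1.1.1\r\n")
print(token.lemma, token.raw_word, token.mutation)
```

Raising on a wrong field count, instead of padding or truncating, makes malformed lines visible early, which matters if corrections are to be written back to the files.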
Illustration of a sample sentence from a text follows:
We believe this text corpus is of value for an analysis of Welsh prose sentence patterns, for co-occurrence analyses of both individual lemmas and grammatical parts of speech in running texts, and for further linguistic analysis by specialist researchers in the field of Welsh syntax and child language acquisition. However, researchers must take note of some limitations in data quality, particularly regarding the accuracy of some of the lemma tags which were prejudiced by word form homography – these limitations are described below.
- All Tagged Files (zipped) (All fields are tab delimited) – 8 Mb
We believe that the accuracy of the raw word forms in the database and their counts is quite high. Whatever errors (spelling or typographical) there were in the original samples will be carried over to the corpus. We must surely have introduced and failed to detect some additional errors in input, but we have tried hard to keep this number
Tag quality is something of a different matter. The problems of high homography rates, a limited window template-matching lemmatiser with few rules, and the need for skilled linguistic analysis, compounded into a non-trivial number of tagging errors. A preliminary analysis of 5% of the corpus indicates that there is an error rate of 4% +/- 3%.
These tagging errors are by no means distributed equally about the database. Thus, for example, inaccuracies in the tagging of yn, bod/fod, and a, that is more generally the high frequency closed class words, are much more common than inaccuracies with the open class words. Thus while the token error rate is perhaps 4%, the type error rate is much less than that. We do not have the resources to correct these miscodings.
As well as noting the errors on a print-out of the output files, it would be necessary for any corrections to be written back to the files, and we estimate that a detailed correction of the full set would require two years’ work. Having tried to raise these resources, and waited too long, we have decided to release the database as it now stands – it is certainly better than nothing.
Nonetheless, researchers must take note of these limitations in data quality, particularly regarding the accuracy of some of the lemma tags.
We believe the Counts of raw word forms to be highly accurate.
The Lemma Counts with analysis of inflections and mutations run at about 96% accuracy, with most problems on the high frequency closed class words.
The word counts are based on the actual word forms occurring. These words include spellings which represent dialectal forms, informal spellings of Welsh forms (generally following the suggestions of Cymraeg Byw, though this is by no means a universally applied standard for informal writing), foreign words (particularly from English), as well as wrongly spelled Welsh words (that is, misprints in the original texts).
The total number of word form tokens in the corpus is 1,079,032.
The total number of separate word form types is 37,195.
The 50 most frequent raw word forms are:
At the other end of the frequency range, there is a very long tail of single occurrence forms, with 44% of the total entries falling into this group. Between them, the single, double and triple occurrence words make up 64% of the total number of separate words (37,195). As might be expected, a large number of these very low frequency words consist of foreign borrowings, mis-spellings, dialectal forms and other types of variant spellings, and numbers. In most cases, the analysis program does distinguish between several of these categories (mis-spellings, foreign words, informal spellings), but such entries would require further checking if 100% accuracy were essential.
|16,316 words with a single occurrence :||44% of separate words|
|5,013 words with two occurrences :||13% of separate words|
|2,644 words with three occurrences :||7% of separate words|
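The low-frequency tail summarised above is a frequency spectrum: for each frequency, the number of distinct word forms that occur exactly that many times. A minimal sketch of the calculation follows. The function name `frequency_spectrum` and the toy token list are assumptions, and the second half only re-derives the rounded percentages from the counts reported in the table.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency to the number of distinct word forms with that frequency."""
    form_counts = Counter(tokens)         # word form -> number of occurrences
    return Counter(form_counts.values())  # number of occurrences -> number of types


# Toy data (hypothetical), not taken from the corpus.
tokens = "yn y ty+ ac yn yr ardd ac yn y cae".split()
spectrum = frequency_spectrum(tokens)
print(dict(spectrum))

# Figures reported for the corpus.
TYPES = 37_195
reported = {1: 16_316, 2: 5_013, 3: 2_644}
shares = {freq: round(100 * n / TYPES) for freq, n in reported.items()}
combined = round(100 * sum(reported.values()) / TYPES)
print(shares, combined)  # {1: 44, 2: 13, 3: 7} 64
```

The reported counts therefore agree with the 44%, 13%, 7% and combined 64% figures quoted in the text.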
The lemmatizing software was used to demutate and uninflect word forms in order to trace them back to their lemma. Examples of the resulting lemma analysis are shown for illustration in the table below:
The lemma ceg appears 118 times, exclusively as a noun. 109 of these occurrences are as the noun singular feminine (ceg) and 9 as the noun plural (cegau). As the singular noun it appears 22 times in unmutated form, 21 times with aspirate mutation, 56 times with soft mutation, and 10 times with nasal mutation.
The lemma rhodio appears 16 times, always as a verb. Two of these occurrences are as the third person singular present (rhodia) (once in unmutated form and once with soft mutation), 1 occurrence is as the third person singular imperfect in soft mutated form (rodia), 12 occurrences are as the verb noun rhodio (7 times unmutated and 5 times with soft mutation), and 1 is as the third person plural present tense (rhodiwn). There are many verb forms for Welsh; the full table of verb form codes is shown below:
|1||af||present tense first person singular|
|2||i||present tense second person singular|
|3||a||present tense third person singular|
|4||wn||present tense first person plural|
|5||wch||present tense second person plural|
|6||ant||present tense third person plural|
|7||ir||present tense impersonal|
|8||it||imperfect tense second person singular|
|9||et||imperfect tense second person singular|
|10||ai||imperfect tense third person singular|
|11||em||imperfect tense first person plural|
|12||ech||imperfect tense second person plural|
|13||ent||imperfect tense third person plural|
|14||id||imperfect tense impersonal|
|15||ais||past tense first person singular|
|16||aist||past tense second person singular|
|17||odd||past tense third person singular|
|18||asom||past tense first person plural|
|19||asoch||past tense second person plural|
|20||asant||past tense third person plural|
|21||wyd||past tense impersonal|
|22||aswn||pluperfect first person singular|
|23||asit||pluperfect second person singular|
|24||aset||pluperfect second person singular|
|25||asai||pluperfect third person singular|
|26||asem||pluperfect first person plural|
|27||asech||pluperfect second person plural|
|28||asent||pluperfect third person plural|
|31||wyf||subjunctive first person singular|
|32||ych||subjunctive second person singular|
|33||o||subjunctive third person singular|
|34||om||subjunctive first person plural|
|35||och||subjunctive second person plural|
|36||ont||subjunctive third person plural|
|37||er||subjunctive impersonal|
|38||es||past tense first person singular|
|39||est||past tense second person singular|
|40||ith||Informal third person singular|
|41||iff||Informal Future third person singular|
|42||on||Informal Past third person plural|
|43||an||Informal Future third person plural|
The file, Lemma Counts with Analysis, downloadable below, is tab-separated and can be imported into Excel where it can be readily manipulated to provide a wide range of analyses. One example, based on a sort of the final field (mutation), generates the following results for initial mutations.
Welsh words can exhibit one of four types of morphophonemic initial mutation, and the occurrences and relative frequencies of such forms in the sample are:
|Soft mutation (Treiglad Meddal)||134,349||12.45%|
|Spirant mutation (Treiglad Llaes)||9,123||0.85%|
|Nasal mutation (Treiglad Trwynol)||5,667||0.53%|
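The percentages in this table are consistent with shares of all 1,079,032 tokens, for example 134,349 / 1,079,032 ≈ 12.45%. A similar tally can be sketched in code as an alternative to sorting in a spreadsheet. The column layout of the Lemma Counts file is not described here, so this sketch works instead on rows in the tagged-file format (mutation as the fourth tab-separated field). The function name `mutation_shares`, the label values, and the treatment of an empty field as "none" are assumptions.

```python
from collections import Counter


def mutation_shares(lines, column=3):
    """Return {mutation label: percentage of all rows} for tab-separated rows."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\r\n").split("\t")
        counts[fields[column] or "none"] += 1  # empty field = no mutation (assumed)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


# Hypothetical rows in the tagged-file format.
rows = [
    "ceg\tgeg\tN\tsoft\t1.1.1",
    "ceg\tceg\tN\t\t1.1.2",
    "ceg\tngheg\tN\tnasal\t1.2.1",
    "ceg\tgeg\tN\tsoft\t1.2.2",
]
print(mutation_shares(rows))  # {'soft': 50.0, 'none': 25.0, 'nasal': 25.0}

# Re-deriving the reported percentages from the reported counts.
TOKENS = 1_079_032
for label, n in {"soft": 134_349, "spirant": 9_123, "nasal": 5_667}.items():
    print(label, round(100 * n / TOKENS, 2))
```

In practice `lines` could be an open file object for one of the tagged files, since iterating over a file yields one line at a time.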
Download Wordform Files
- Zip file containing: (890Kb)
- Word Counts (freq) – Counts of raw word forms sorted in decreasing frequency
- Word Counts (alpha) – Counts of raw word forms sorted in alphabetic order
- Lemma Counts with Analysis – Counts of lemmas, plus inflected forms, parts of speech and mutations
These materials have been produced on a small budget for academic research. You are welcome to use the materials for any non-commercial purpose. We have produced these analyses in good faith to the best of our abilities given the limited resources. As we have described above, you should be aware that there are some inaccuracies in the taggings. We bear no responsibility for any damaging consequences that may result from these.
We welcome further research to extend or correct these linguistic descriptions.
Articles based on the use of the database should cite:
Ellis, N. C., O’Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh [On-line]