The DP measure looks at the proportion of a term’s occurrences across various “slices” of a corpus, taking into account the size of each slice. Each word’s final calculation involves three steps: (i) counting the word’s occurrences in each slice and normalizing them against the word’s overall frequency in the whole corpus, called the “observed proportions”; (ii) normalizing the size of each corpus slice with respect to the size of the whole corpus, called the “expected proportions”; (iii) computing the absolute differences between the observed and expected proportions, summing them, and dividing by 2. The result is a measure between 0 and 1, where 0 means the word is distributed evenly across the corpus slices and 1 means it is restricted to narrow parts of the corpus.
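The three steps above can be sketched as follows. This is an illustrative reconstruction of the computation as described, not the actual code used for the dictionary; the function and variable names are our own.

```python
def dp(slice_freqs, slice_sizes):
    """Deviation of Proportions (DP) for one word.

    slice_freqs: occurrences of the word in each corpus slice.
    slice_sizes: total number of tokens in each corpus slice.
    """
    total_freq = sum(slice_freqs)
    total_size = sum(slice_sizes)
    # step (i): observed proportions of the word's occurrences per slice
    observed = [f / total_freq for f in slice_freqs]
    # step (ii): expected proportions, i.e. each slice's share of the corpus
    expected = [s / total_size for s in slice_sizes]
    # step (iii): sum of absolute differences, divided by 2
    return sum(abs(o - e) for o, e in zip(observed, expected)) / 2

# A word spread perfectly evenly across three equal slices scores 0:
assert dp([10, 10, 10], [1000, 1000, 1000]) == 0.0
```

A word confined to a single slice scores close to the upper end of the scale; for example, `dp([30, 0, 0], [1000, 1000, 1000])` yields 2/3, and the value approaches 1 as the number of slices grows.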
While helpful in describing word distribution across a corpus, the DP measure is only one metric, and for the purposes of this dictionary it was necessary to combine it with raw frequency. Thus we computed, for each lemma, its frequency divided by its DP. The result determined the ranking of each lemma and hence its final appearance and relative order in the top 5000 words in the vocabulary. For example, all forms of the word “avoir” sum to a frequency of 405,020, and its DP score is 0.115363. Its ranking score is thus 405,020/0.115363, or 3,510,831.029. This is the sixth highest score among all of the lemmas, so this word places sixth in the ranked list.
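The “avoir” example can be checked directly, using the frequency and DP figures quoted above:

```python
# Ranking score = summed frequency of all forms / DP (figures for "avoir")
freq = 405_020
dp_score = 0.115363

ranking = freq / dp_score
assert round(ranking) == 3_510_831  # matches the quoted 3,510,831.029
```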
Finally, the DP values, as long decimal fractions, are somewhat unwieldy. To solve this problem we mapped these values to a much more intuitive set of integers ranging from about 27 to 100. These numbers are called dispersion codes. The calculation for obtaining a dispersion code from its corresponding DP measure involves an exponential function: 100 · e^(−DP), rounded to an integer.
Values approaching 100 indicate that the word is quite evenly distributed across the corpus; values below 50 indicate words that are limited to only certain narrow portions of the corpus.
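The mapping from DP to dispersion code can be written as a one-line function. This is a sketch of the formula as stated; the rounding convention is our assumption, since the text specifies only that the codes are integers.

```python
import math

def dispersion_code(dp_value):
    """Map a DP measure (0..1) to an integer dispersion code (~27..100)."""
    return round(100 * math.exp(-dp_value))

assert dispersion_code(0.0) == 100       # perfectly even distribution
assert dispersion_code(0.115363) == 89   # the DP of "avoir" quoted above
```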
Though these computations are somewhat technical, the general intuition is that the words in this dictionary are ranked by the summed frequency of all of their variant forms, tempered by how well they are spread across various portions of the corpus.
Once the terms were identified, additional information had to be collected to construct the associated entries.
Developing associated information
Providing parts of speech was done through a combination of automatic and manual methods. The values were derived from (i) the part of speech tags produced by the lemmatization process described above; (ii) popular lexical databases of French (e.g. BDLEX1); and (iii) hand-editing of the merged and accumulated results.
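The combination of the three sources might look roughly like the following. This is purely illustrative: the actual merging logic and tag inventories are not described in the text, and the names here are hypothetical.

```python
def merge_pos(tagger_pos, lexicon_pos):
    """Combine a POS tag from the lemmatizer with one from a lexical
    database (e.g. BDLEX). Disagreements are flagged for hand-editing,
    the third step described in the text."""
    if tagger_pos == lexicon_pos:
        return tagger_pos, False            # sources agree: no review
    return tagger_pos or lexicon_pos, True  # conflict or gap: review

assert merge_pos("v", "v") == ("v", False)     # e.g. "avoir"
assert merge_pos("n", "adj") == ("n", True)    # ambiguous: hand-edit
```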
Glossing the terms was a completely manual effort. We aimed to convey as much of each word’s core meaning(s) as possible while at the same time avoiding the temptation to be exhaustive.
The next stage involved finding a suitable usage context for each word. In each case the usage context comes from the corpus itself, so that it represents an illustration of natural French, the way a French-speaking person would use the word. Equally important was the need to find contexts that were clear, short, self-contained, and indicative of the core meaning of the word. Ideally, the contexts should also contain as few words as possible that are not covered elsewhere in the dictionary. To find the contexts, a computer-generated list of possible contexts was prepared for each word and then scored automatically according to these criteria. We then manually chose the best context for each word from these lists.
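A toy version of such automatic scoring is sketched below. The criteria (brevity, and coverage by the dictionary’s own vocabulary) come from the text, but the weighting and tokenization are invented for illustration and are certainly simpler than what was actually used.

```python
def score_context(sentence, headword, dictionary_words):
    """Toy score for a candidate usage context: prefer short sentences
    whose words are mostly covered elsewhere in the dictionary."""
    tokens = sentence.lower().split()
    if headword not in tokens:
        return 0.0
    brevity = 1.0 / len(tokens)                        # shorter is better
    known = sum(t in dictionary_words for t in tokens)
    coverage = known / len(tokens)                     # covered elsewhere
    return brevity + coverage

candidates = ["le chat dort", "le chat de mon voisin dort souvent ici"]
vocab = {"le", "chat", "dort"}
best = max(candidates, key=lambda s: score_context(s, "chat", vocab))
assert best == "le chat dort"  # short and fully covered by the vocabulary
```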
Like glossing, generating English translations for the usage contexts was also a human effort. Each context was taken in isolation and, often using the English glosses that had been prepared, a translation was entered manually. Some texts already had English translations from previous work and hence could have been extracted automatically using word-alignment techniques, but we purposely chose not to use these techniques so as to ensure that the translations were “fresh” in each instance.
1 See http://www.irit.fr/PERSONNEL/SAMOVA/decalmes/IHMPT/ress_ling.v1/rbdlex_en.php.
Finally, we compiled the thematic lists. In each case the content of the list was compiled using a combination of automatic and manual techniques. For semantic subject areas (e.g. food and weather terms), hierarchical lexical databases (e.g. French WordNet2) were used to locate the terms’ positions in a taxonomy of semantic fields. A parallel effort of hand-selecting relevant terms was also carried out, and the results were merged together.
All of these results have been combined into a comprehensive database (we used both MySQL and Microsoft Access) that enables versatile retrieval of relevant information.