The dictionary, then, is designed as an instrument for helping students acquire a core vocabulary of French words in various ways, including on the basis of their observed frequency in recent French usage. The versatility of its organization should allow its use in a wide range of language-learning scenarios.
Previous frequency dictionaries for French
French dictionaries are plentiful and widely varied in content, so one might wonder whether another dictionary is necessary. A short survey of existing dictionaries should suffice to illustrate why this one was developed.
Two landmark frequency dictionaries have been produced for French. One (Henmon 1924) was based on 400,000 words of text, and the other (Juilland et al. 1970) derives from a study of 500,000 words. Information on the words contained in those lists, though, was minimal, and the ability to handle more sizable corpora has, of course, since been vastly improved by computer technology.
Other word reference lists have been developed largely for scholarly purposes and hence are not very accessible to the average learner. Brunet (1981) focuses on the development of French vocabulary over time, based on the superb Trésor de la Langue Française (Imbs 1971-1994). Beauchemin et al. (1992) focus only on the French spoken in Quebec. All of these resources require some effort to use effectively.
Some lexical resources are at the disposal of French language learners through the Internet, such as the ARTFL FRANTEXT and TLFi resources. The subscription costs and on-line access methods are sometimes less practical than having a reasonably sized dictionary like this one at one’s fingertips.
Finally, some helpful recent beginner dictionaries exist, though each has its own limitations. Recent ones by Oxford University Press (2006), Living Language (Lazare 1992), and Dover Publications (Buxbaum 2001) list from 1001 to 20,000 “most useful” words but give no rationale for how they were selected. Another venerable work by Gougenheim (1958) lists 3500 basic French words with related information, including definitions, but its entries are entirely in French and hence challenging for the beginner.
Our dictionary seeks to combine the best from this tradition of French lexical research while at the same time avoiding these shortcomings. Its presentation design and the rationale and methodology for selecting the contents reflect what we believe to be the state of the art in corpus research, text processing, and lexicography.
The corpus and its annotation
Our dictionary is derived from a corpus of some 23,000,000 French words that have been assembled from a wide variety of sources. As mentioned above, half of this total consists of transcriptions of spoken French, while the other half consists of French in its written form. Because we wanted our dictionary to be a modern representation of the French language, we have included no materials that date from before 1950.
We did not try to proportion our data by geographical region or demographics, but we did try to achieve some balance across genres; this balance, however, is not perfect. It is also important to note that content from some sources was used exhaustively, whereas content from others was selectively or randomly sampled. In the latter cases only part of the material was used, because including all of it would have risked skewing coverage toward particular areas.
The spoken text portion of the corpus was made up of approximately 11.5 million words. These words were drawn from various sources, such as transcripts of governmental debates and hearings, telephone calls, and face-to-face dialogues. There were also transcripts of interviews with writers, entertainment figures, business leaders, athletes, academicians and other media personnel. Finally, we made use of movie scripts/subtitles and theatrical plays.
The written text portion of the corpus was also made up of roughly 11.5 million words. This part of the corpus was assembled from newswire stories, daily and weekly newspapers, newsletters, bulletins, business correspondence, and technical manuals. Magazines, such as popular science and other technical publications, were also used. We also targeted different genres of literature, such as fiction and nonfiction essays, memoirs, novels and more.
Table 1 gives a more detailed listing of the composition of the corpus.
Corpus standardization and annotation