A Frequency Dictionary of French (2009)

Once the corpus was standardized, we needed to assign each word its part of speech, that is, whether it functions as a noun, a verb, an adjective, and so forth. About a dozen part-of-speech taggers currently exist for French, each with its own theoretical framework, implementation approach, and set of tag encodings for flagging the part of speech of each word. We installed and tested several of these taggers. We found that each had its own strengths and weaknesses, and that by combining several of them and merging the results in a postprocessing stage we could create our own tagging procedure and tagset that produced the best results for our purpose. We also corrected the most common tagging errors by hand, though a thorough examination of every word in the entire corpus would have been prohibitively time-consuming and costly.
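The text does not say how the outputs of the different taggers were merged. As one possible illustration only, the sketch below maps each tagger's native tags onto a shared tagset and takes a majority vote, flagging tokens without a clear majority for manual review. The tagger names, native tags, shared tagset, and the voting rule itself are all assumptions made for this example, not the authors' procedure.

```python
from collections import Counter

# Hypothetical mapping from each tagger's native tags to a shared tagset.
TAG_MAP = {
    "tagger_a": {"NOM": "noun", "VER": "verb", "ADJ": "adj"},
    "tagger_b": {"N": "noun", "V": "verb", "A": "adj"},
    "tagger_c": {"NC": "noun", "VB": "verb", "AJ": "adj"},
}


def merge_tags(predictions):
    """Merge the tags proposed for one token by several taggers.

    predictions maps a tagger name to that tagger's native tag.
    Returns (shared_tag, needs_review), where needs_review is True
    when no tag wins a strict majority of the votes.
    """
    normalized = [
        TAG_MAP[name].get(tag, "unknown") for name, tag in predictions.items()
    ]
    tag, votes = Counter(normalized).most_common(1)[0]
    needs_review = votes <= len(normalized) / 2
    return tag, needs_review


# Two of the three hypothetical taggers agree on "verb".
agreed = merge_tags({"tagger_a": "VER", "tagger_b": "V", "tagger_c": "NC"})
# All three disagree, so the token is flagged for manual review.
disputed = merge_tags({"tagger_a": "NOM", "tagger_b": "V", "tagger_c": "AJ"})
```

Flagging disputed tokens rather than silently picking a tag mirrors the manual correction step described above, where only the most common errors were fixed by hand.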

It was also necessary to perform a morphological analysis of each word in the corpus to find its base form, or lemma. For example, the second word in the sentence “Je suis heureux.” is a conjugated form of the verb “être”, which is its base form or lemma. Similarly, pronouns with regular inflections (e.g. “il” and “ils”), adjectives, and determiners with variant forms were combined. Lemmatization was necessary for our frequency computations, described below. Various lemmatization programs exist for French, and some of them perform part-of-speech tagging and lemmatization at the same time. This stage, too, posed challenges. For example, many words are morphologically ambiguous and have several possible lemmas: the verb form “suis” can belong to either “être” or “suivre”, depending on the particular instance. Another difficulty is deciding when non-finite forms (i.e. past and present participles and infinitives) function as verbs and when they function as other parts of speech (especially nouns and adjectives). Again we found that combining some of the most popular programs and postprocessing the results was the most helpful approach for our purposes.

Target vocabulary identification and description

With the whole corpus standardized and annotated, it was possible to compute word frequencies and identify the most-used words. Counting words in a corpus can be done in several ways. We chose to collapse all of the variant forms of the same word and sum their counts. For example, the word “pour” is a conjunction or preposition and occurs in two other forms across the corpus: “Pour” and “POUR”. Summing all occurrences of the variant forms of this word, we arrive at a total count of 151,709. Similarly, plural forms of nouns are normally reduced to their singular form, verb conjugations are reduced to their infinitive form, and inflected adjectives are reduced to the masculine singular form, as is done in other French dictionaries. For example, throughout the corpus there are 25 different forms of the verb “déterminer”, including inflections and variant forms such as “déterminerait”, “détermine”, “déterminons”, and “Déterminez”; all of these were combined, with their counts, into the infinitive form.
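The collapsing step can be sketched in a few lines. This is a minimal illustration that assumes the tagging and lemmatization stages have already produced (token, lemma) pairs; the sample pairs and the function name are made up for the example and are not taken from the corpus.

```python
from collections import Counter

# Hypothetical (token, lemma) pairs, as a lemmatizer might emit them.
tagged_tokens = [
    ("Pour", "pour"),
    ("POUR", "pour"),
    ("pour", "pour"),
    ("déterminerait", "déterminer"),
    ("détermine", "déterminer"),
    ("Déterminez", "déterminer"),
]


def lemma_frequencies(pairs):
    """Sum the occurrences of every surface form under its lemma."""
    counts = Counter()
    for _token, lemma in pairs:
        # Lowercase the lemma so case variants share one entry.
        counts[lemma.lower()] += 1
    return counts


freqs = lemma_frequencies(tagged_tokens)
top_lemmas = freqs.most_common(2)  # lemmas ranked by raw frequency
```

Ranking by these raw counts alone is only a first pass, since it ignores how evenly a lemma is spread across the corpus.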

Our target vocabulary list is thus formed from the top 5000 scoring lemmas in the corpus. In identifying these top 5000 lemmas, some items (such as proper nouns and punctuation) were rejected. However, one more refinement was necessary in identifying the top 5000 words.

Experience in corpus linguistics has shown that the raw frequency count for all variants of a word is not the best measure of its usefulness. It is also necessary to consider how widely a word is spread across the different parts of a corpus.

Exactly quantifying how widely a word is spread across a document or corpus has been a thorny problem in corpus linguistics. If a given word occurs very frequently in one part of the corpus (e.g. the spoken part) but not elsewhere, it might be desirable to discount that word’s raw frequency so that it becomes a little less “important” in comparison to other less-frequent words. Literally dozens of approaches have been taken over the last decades to come up with workable solutions. One of the most promising, and the one used in the compilation of this book, is called the “deviation of proportions”, or DP (Gries, 2008).
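The passage names DP without giving its formula. As a rough illustration, the sketch below follows the definition as it is usually stated from Gries (2008): for each corpus part, compare the share of the word's occurrences found in that part with the part's share of the whole corpus, sum the absolute differences, and halve the total. The function name and the sample counts are hypothetical, and the way DP is combined with raw frequency into a final score is not described here, so it is not modeled.

```python
def deviation_of_proportions(word_counts, part_sizes):
    """Compute DP for one word.

    word_counts[i] is the word's frequency in corpus part i;
    part_sizes[i] is the total number of tokens in corpus part i.
    """
    if len(word_counts) != len(part_sizes):
        raise ValueError("word_counts and part_sizes must have equal length")
    total_word = sum(word_counts)
    total_size = sum(part_sizes)
    if total_word == 0 or total_size == 0:
        raise ValueError("the word and the corpus must both be non-empty")
    return 0.5 * sum(
        abs(count / total_word - size / total_size)
        for count, size in zip(word_counts, part_sizes)
    )


# Hypothetical corpus with three equally sized parts.
sizes = [1000, 1000, 1000]
dp_even = deviation_of_proportions([10, 10, 10], sizes)   # spread evenly
dp_skewed = deviation_of_proportions([30, 0, 0], sizes)   # one part only
```

Values near 0 indicate a word distributed in proportion to the sizes of the corpus parts, while values approaching 1 indicate a word concentrated in a small portion of the corpus.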
