Automatic Extraction of Examples for Word Sense Disambiguation


Master's Thesis, 2009

100 Pages, Grade: 1.3


Excerpt


Contents

1 Introduction

2 Basic Approaches to Word Sense Disambiguation
2.1 Knowledge-Based
2.1.1 The Lesk Algorithm
2.1.2 Alternative Methods
2.2 Unsupervised Corpus-Based
2.2.1 Distributional Methods
2.2.2 Translational Equivalence Methods
2.3 Supervised Corpus-Based
2.3.1 Sense Inventories
2.3.2 Source Corpora
2.3.3 Data Preprocessing
2.3.4 Feature Vectors
2.3.5 Supervised WSD Algorithms
2.4 Semi-Supervised Corpus-Based

3 Comparability for WSD Systems
3.1 Differences between WSD Systems
3.2 Most Frequently Used Baselines
3.2.1 The Lesk Algorithm
3.2.2 Most Frequent Sense
3.2.3 Random Choice

4 Evaluation of WSD Systems
4.1 Fundamentals in Evaluation of WSD Systems
4.2 International Evaluation Exercise Senseval
4.3 Senseval-1
4.4 Senseval-2
4.5 Senseval-3
4.5.1 The All-Words Task
4.5.2 The Lexical Sample Task
4.5.3 Other Tasks
4.6 Semeval-1
4.7 Summary

5 TiMBL: Tilburg Memory-Based Learner
5.1 Overview
5.2 Application

6 Automatic Extraction of Examples for WSD
6.1 Overview of the System
6.2 Data Collection
6.2.1 Sense Inventory
6.2.2 Source Corpora
6.2.3 Automatic Annotation
6.3 Data Preprocessing
6.3.1 Basic Preprocessing
6.3.2 Part-of-Speech Tagging
6.4 Training and Test Sets
6.4.1 Feature Vectors
6.4.2 Training Set
6.4.3 Test Set
6.5 Algorithm Selection
6.6 Parameter Optimizations
6.6.1 General Parameter Optimizations
6.6.2 Automatic Feature Selection
6.7 Scoring
6.8 Experimental Results
6.8.1 Supervised WSD
6.8.2 Unsupervised WSD
6.8.3 Semi-supervised WSD
6.8.4 Discussion
6.9 Evaluation

7 Conclusion, Future and Related Work
7.1 Related Work
7.2 Future Work
7.3 Conclusion

References
A List of Abbreviations
B Pool of Features
C Training and Test Set (Senseval-3 English Lexical Sample Task)
D Graphs
E Tables

Declaration

I hereby affirm that I have written the submitted work independently and only with the indicated sources and aids, including the WWW and other electronic sources. All passages of the work that I have taken from other works, either verbatim or in substance, are marked as such.

Desislava Zhekova

Abstract

In the following thesis we present a memory-based word sense disambiguation system, which makes use of automatic feature selection and minimal parameter optimization. We show that the system performs competitively with other state-of-the-art systems and further use it for the evaluation of automatically acquired data for word sense disambiguation.

The goal of the thesis is to demonstrate that automatically extracted examples for word sense disambiguation can help increase the performance of supervised approaches. We conducted several experiments and discussed their results in order to illustrate the advantages and disadvantages of the automatically acquired data.

Acknowledgements

This thesis would not have been possible had Professor Dr. Walt Detmar Meurers not provided me with the opportunity to complete my work at the Department of Linguistics at the University of Tübingen. The valuable guidance and professional environment in the department, ensured also by my second supervisor Dr. Dale Gerdemann, contributed immensely to my final success.

I would also like to express my deep gratitude to my advisor Professor Dr. Sandra Kübler for her constant and exceptionally constructive remarks, which were often the key to my progress.

I am also grateful to all those who provided assistance in numerous ways: David Münzing and Julian Münzing for the manual sense-annotation of a subset of my final data collection - a very laborious endeavor; Franziska Gruber, Ramon Ziai, Dominikus Wetzel and my loving family and friends for their endless support, encouragement and advice.

List of Tables

2.1 Basic approaches to WSD as in (Agirre and Edmonds, 2007)

2.2 Performance and short description for the unsupervised systems participating in the SENSEVAL-3 English lexical sample task. Precision (P) and recall (R) (see Section 4.1) figures are provided for both fine-grained and coarse-grained scoring (Mihalcea et al., 2004a)

2.3 Performance and short description of the supervised systems participating in the SENSEVAL-3 English lexical sample Word Sense Disambiguation task. Precision (P) and recall (R) (see Section 4.1) figures are provided for both fine-grained and coarse-grained scoring (Mihalcea et al., 2004a)

2.4 Feature vectors for the sentences in (2)

2.5 Supervised word sense disambiguation algorithms

4.1 Summary of the Senseval-1 evaluation task

4.2 Summary of the Senseval-2 evaluation task

4.3 Summary of the sense inventory of the words present in the English lexical sample task (Mihalcea et al., 2004a)

4.4 Summary of the Senseval-3 Lexical Sample Task

6.1 The collection and size of lexicon examples used in our system

6.2 The collection and size of corpora examples used in our system

6.3 Features included in the feature vectors of our system and their corresponding values from our toy example

6.4 The total collection of examples used in our system

6.5 System performance on manually annotated data without optimization of the k parameter

6.6 System performance on manually annotated data with optimization of the k parameter

6.7 Comparison with the three best supervised systems in the Senseval-3 lexical sample task (Mihalcea et al., 2004a)

6.8 System performance on automatically annotated data without optimization of the k parameter

6.9 System performance on automatically annotated data with optimization of the k parameter

6.10 Comparison with the four best unsupervised systems in the Senseval-3 lexical sample task (Mihalcea et al., 2004a)

6.11 System performance on automatically annotated data without optimization of the k parameter and filtered for preservation of the distribution of the senses as in the original manually annotated training set

6.12 System performance on automatically annotated data with optimization of the k parameter and filtered for preservation of the distribution of the senses as in the original manually annotated training set

6.13 System performance on automatically annotated data without optimization of the k parameter and distance and distribution filtering mixed with the manually annotated training set

6.14 System performance on automatically annotated data with optimization of the k parameter and distance and distribution filtering mixed with the manually annotated training set

6.15 Summary of the results for all experiments

6.16 Comparison of our system's scores with the upper and lower bound for the experiment and also with the best performing system in the Senseval-3 English lexical sample task

7.1 The performance of our system on the Senseval-3 Lexical Sample Task data

7.2 The performance of our system trained on the automatically gathered examples and tested on the test set provided by the Senseval-3 Lexical Sample Task

7.3 System performance on automatically annotated data with optimization of the k parameter and filtered for preservation of the distribution of the senses as in the original manually annotated training set

7.4 System performance on automatically annotated data with optimization of the k parameter and distance filtering mixed with the manually annotated training set.

List of Figures

2.1 Supervised machine learning

2.2 General architecture of a memory-based learning system (Daelemans et al., 2007).

4.1 Screenshot from Open Mind Word Expert (Chklovski and Mihalcea, 2002)

6.1 Our semi-supervised WSD system

6.2 A screenshot of the result from the online interface of the Cambridge Advanced Learner’s Dictionary when searched for the word activate

6.3 A screenshot of the result from the online interface of WordNet when searched for the word activate

7.1 Accuracy change of the adjective hot

7.2 Accuracy change of the noun arm

7.3 Accuracy change of the verb appear

Chapter 1 Introduction

Ambiguity is a characteristic of all human languages and, at the same time, a problem to be solved in Computational Linguistics. It manifests as uncertainty of meaning (something that can be understood in two or more senses) and thus requires deeper linguistic processing of natural language.

Nowadays it is easy to answer the question "Who understands language better - computers or human beings?". One of the reasons why computers still cannot compete with human cognition is exactly the fact that ambiguity prevails in any natural language text. Thus, over the last few decades the problem has gained interest in the CL community and, as Ide and Véronis (1998b) mention, has proved to be a key to solutions in other important areas of Natural Language Processing - Information Retrieval, Machine Translation, Information Extraction, Parsing, Text Mining, Semantic Interpretation, Lexicography, Content and Thematic Analysis, Grammatical Analysis, Speech Processing, Knowledge Acquisition and many others.

Ambiguity can be present in many ways (e.g. structural - when it appears in a sentence or a clause; lexical - when it appears with respect to just a single word). Lexical ambiguity is the problem that Word Sense Disambiguation (WSD) is concerned with and has already tackled to a great extent. Human beings resolve lexical ambiguity by looking in the dictionary, while WSD tries to automate this process. Its task is to automatically assign the correct sense of an ambiguous word depending on the context in which it is found. However, no perfect solution to the problem has been found so far.

Ide and Véronis (1998b) describe many of the still open questions in WSD: How important is the context of the word that is disambiguated? Which is the optimal choice of possible senses for the target word? How can all the different systems and their results be compared? How much data is needed in order to achieve good results? How can data be provided for data-driven approaches? These and many other problems have been considered an issue of great importance by many computational linguists, as evidenced by the rapidly increasing number of papers in the ACL Anthology[1] in which the term word sense disambiguation is mentioned - approximately 850 times.

The advances in different aspects of WSD have resulted in a collection of expert views in the field (Agirre and Edmonds, 2007). The book reviews basic questions like the nature of word senses (Kilgarriff, 2007; Ide and Wilks, 2007); discusses different WSD methods (Mihalcea, 2007; Pedersen, 2007; Màrquez et al., 2007); examines evaluation techniques for automatic word sense disambiguation (Palmer et al., 2007), knowledge sources for WSD (Agirre and Stevenson, 2007) and domain-specific WSD (Buitelaar et al., 2007); and also discusses topics like the automatic acquisition of lexical information (Gonzalo and Verdejo, 2007) and the application of WSD in Natural Language Processing (Resnik, 2007).

The predicament that we aim to discuss in the following thesis is a still open question in WSD: the quantity and quality of automatically extracted data for the process of word sense disambiguation - we will approach questions like how much and what kind of data can be acquired without human labor. In our work we present a WSD system and its application to various types of data, which outlines an environment and basis for the evaluation of the automatic acquisition of examples for word sense disambiguation. We will not pursue an exhaustive empirical analysis of the latter but rather discuss the advantages and disadvantages of its employment that are most relevant to our approach.

Since we do not concentrate on a particular WSD approach, but rather compare the results of several of them in order to examine them in contrast, we consider it extremely important first to examine the fundamentals of word sense disambiguation and all the basic approaches to it (Chapter 2). Chapter 3 and Chapter 4 concentrate, respectively, on the ways to compare and evaluate WSD systems in general, depending on the approach used by the system. Before reviewing our work, however, in Chapter 5 we will shortly introduce the software which we employed. The following chapter (Chapter 6) will discuss the structure of the system we suggest, and Chapter 7 will give a brief overview of the research that has already been done in that area and the work which will be attempted further on, together with our concluding remarks.

Chapter 2 Basic Approaches to Word Sense Disambiguation

Word sense disambiguation is the process of identification of the correct sense of a word in the context in which it is used. WSD has mainly one aim, and it can be described in many analogous ways: to find out if two different words or phrases refer to the same entity, to be able to replace one with the other at the right place, to find a sense that is common to both, etc. In the process, however, there are innumerable unanswered questions and unsolved problems. One such question that turns out to be particularly essential for finding the right sense is the adequacy of the predefined set of senses a word can have (also called a sense inventory - see Section 2.3.1). Are there too many distinctions between the senses? Are there too few? All such problems thus lead to many distinct approaches in the attempt to find the correct solutions. In the following chapter we will give a short overview of the fundamental divisions among those attempts. In order to explain what kind of WSD technique we use in our work, it will be very helpful to first have a look at what kinds of approaches to the problem exist according to the current state of the art. A very clear overview of the basic approaches in WSD has already been given by Agirre and Edmonds (2007). The authors distinguish between knowledge-based (dictionary-based) and corpus-based (further divided into unsupervised, supervised and semi-supervised) approaches. In addition, they discuss combinations of the different approaches, which finally results in a variety that they summarize in a very precise manner in Table 2.1.

2.1 Knowledge-Based

Knowledge-based Word Sense Disambiguation is built upon knowledge acquired from sources other than corpora (e.g. lexical knowledge bases, dictionaries, thesauri).

[Table not included in this excerpt]

Table 2.1: Basic approaches to WSD as in (Agirre and Edmonds, 2007).

2.1.1 The Lesk Algorithm

One highly influential dictionary-based method, which to a great extent provided the basis for most of the research in the area, is the Lesk method presented in (Lesk, 1986). The algorithm is based on the assumption that words with similar surroundings will usually share a common topic. In other words, the contextual overlap (in this approach, among dictionary definitions) is used as a measure to pick the most likely sense for a given word. Lesk's method was also one of the first to work on any word in any text. This algorithm is classified as knowledge-based because it acquires knowledge only from a set of dictionary entries (a separate entry is needed for each sense of every word), concentrating on the immediate context of the target word. Here, by immediate context we mean the words that closely surround the target word. As an example we can look at the "evergreen" case of the disambiguation of pine cone that Lesk (1986) first suggested, as shown in (1).

[Example (1) not included in this excerpt]

As just mentioned, each of the separate senses has its own entry and is compared to the entries for the neighboring words of the target word. The aim of the comparison is to find overlapping words in the targeted entries. Thus, each of the entries of pine is compared to each of the entries of cone. As a result, the largest overlap appears to be between the first sense of pine and the third sense of cone, both marked with * in example (1); hence, those are the senses selected by the Lesk algorithm for both words. In this way the algorithm disambiguates both words, pine and cone, if they appear together in the same context. In cases where either of those words is seen in different surroundings, the same comparison is pursued between the dictionary entries of the surrounding words and those of pine or cone, respectively.
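To make the overlap computation concrete, here is a minimal sketch in Python of the gloss-overlap idea behind the Lesk algorithm. The toy glosses below are abridged paraphrases of the pine/cone entries from (Lesk, 1986), given for illustration only; a real system would compare complete dictionary entries, one per sense.

```python
# A minimal sketch of the gloss-overlap idea behind the Lesk algorithm.
# The toy glosses are abridged paraphrases for illustration only; a real
# system would use complete dictionary entries, one per word sense.

STOPWORDS = {"a", "an", "the", "of", "in", "or", "to", "with", "which"}

def overlap(gloss_a, gloss_b):
    """Count the content words shared by two sense glosses."""
    words_a = set(gloss_a.lower().split()) - STOPWORDS
    words_b = set(gloss_b.lower().split()) - STOPWORDS
    return len(words_a & words_b)

def lesk_pair(senses_x, senses_y):
    """Return the sense pair (one per word) with the largest gloss overlap."""
    return max(
        ((sx, sy) for sx in senses_x for sy in senses_y),
        key=lambda pair: overlap(senses_x[pair[0]], senses_y[pair[1]]),
    )

pine = {
    "pine#1": "kinds of evergreen tree with needle-shaped leaves",
    "pine#2": "waste away through sorrow or illness",
}
cone = {
    "cone#1": "solid body which narrows to a point",
    "cone#2": "something of this shape whether solid or hollow",
    "cone#3": "fruit of certain evergreen trees",
}

# The shared word "evergreen" selects the first sense of pine and the
# third sense of cone, exactly as in example (1).
print(lesk_pair(pine, cone))  # -> ('pine#1', 'cone#3')
```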

2.1.2 Alternative Methods

As Mihalcea (2007) notes, another type of knowledge-based method uses semantic similarity measures that are calculated from already existing semantic networks (e.g. WordNet[1], Lexipedia[2], etc.). This approach relies mostly on the fact that the smallest semantic distance between two words can be used as a trustworthy metric showing how closely related those words are. There are two different ways to acquire the semantic similarity: local and global. The only difference between these two approaches is that, in order to build threads of meaning, the method either looks only at a very small window of context around the target word and discards the rest (local) or considers the entire text/document (global).

Selectional preferences can be used both in knowledge-based as well as in corpus-based approaches to word sense disambiguation. This method tries to account for the fact that linguistic elements favor arguments of a specific semantic class, e.g. a verb like 'drink' selects drinkable things as its object, while a verb like 'eat' prefers edible things as its object, and both prefer animate entities as their subject, as in: "He ate some chocolate.". Selectional preferences, however, can get a lot more complex than it might seem - consider for example sentences such as: "The acid ate the table.", "This truck eats a lot of fuel." or "We ate our savings.", etc. Normally, however, the biggest problem with selectional preferences is their circular relation with WSD: to be able to figure out the correct semantic constraint one needs to have an idea about the senses of the target words, and at the same time WSD improves considerably if a large enough set of selectional relations is available.

Apart from the already discussed possibilities for knowledge-based methods, there are also some heuristic approaches - for example, most frequent sense (MFS) (see Section 3.2.2). This method is built around the idea that the sense of the target word with the highest frequency amongst all of its senses is always chosen, regardless of the context in which the word occurs. Unlike the other knowledge-based methods, the heuristic methods are relatively easy to implement and fast to use with data of a bigger size. It has often been noted (e.g. Mihalcea (2007)) that the Lesk algorithm as well as the selectional preferences algorithm can become excessively computationally expensive if more than a few words are being disambiguated. Moreover, the MFS heuristic is also usually used as a baseline in most evaluation exercises for WSD systems (see Chapter 4).
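In contrast to the context-sensitive methods above, the MFS heuristic can be stated in a few lines. The sketch below assumes sense frequency counts taken from some sense-annotated corpus; the sense labels and counts are invented for illustration.

```python
# A minimal sketch of the most frequent sense (MFS) heuristic: ignore the
# context entirely and always pick the sense with the highest frequency.
# The counts below are invented; in practice they would be estimated from
# a sense-annotated corpus such as SemCor.

from collections import Counter

sense_counts = {
    "bank": Counter({"financial_institution": 310, "river_bank": 74}),
}

def mfs(word):
    """Return the most frequent sense of `word`, regardless of context."""
    return sense_counts[word].most_common(1)[0][0]

print(mfs("bank"))  # -> 'financial_institution'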

One of the biggest advantages of knowledge-based methods compared to corpus-based methods is the fact that, despite their poor performance in terms of accuracy (the percentage of labeled words correctly disambiguated), they can be applied to unrestricted (not domain-specific) texts and an unlimited number of target words, regardless of the existence of manually sense-tagged corpora.

2.2 Unsupervised Corpus-Based

The nature of unsupervised WSD is considerably related to the problem of density estimation in statistics. In the process of disambiguation, the aim of an unsupervised method is to discover the different patterns and structures in a predefined data source that has not been manually annotated beforehand. It is exactly the possibility of working with data that does not require the extremely expensive human effort of manual annotation that makes this approach so appealing. Thus, there is already considerable research along these paths: (Schütze, 1998), (Ramakrishnan et al., 2004), (Niu et al., 2004a), (Preiss, 2004), (Litkowski, 2004a), (Buscaldi et al., 2004), (Seo et al., 2004), (Pedersen, 2004), etc.

Unsupervised corpus-based algorithms do not directly assign a sense to a target word but rather try to distinguish all possible senses of the given word based on the information they can gain from the unannotated corpora and then discriminate among those senses. Hence, the process does not depend on the existence of predefined sense inventories for the target words. Moreover, as a result, unsupervised corpus-based methods can provide us with sense inventories that are far better "tuned" to different domains, which is of great help for applications like Machine Translation, for example. There are two fundamental approaches to unsupervised corpus-based word sense disambiguation: distributional and translational equivalence approaches. Both are considered knowledge-lean approaches, since they depend only on existing but not necessarily annotated monolingual corpora or on word-aligned parallel corpora.

2.2.1 Distributional Methods

Distributional methods make more use of statistics than of lexicology. They do not search through the already predefined senses of the target word and choose the one that fits best, but rather cluster the words that have similar environments, without taking into consideration any of the already established sense inventories for those words. As a result, there are no limits to the number of created clusters, which in turn denote the granularity of the senses the target words acquire. Pedersen (2007) describes the distributional method as an automated and idealized view of the work lexicographers do in the attempt to find a good definition of a word. However, in this case a good definition is relative to the corpus from which it has been extracted. This means that the clusters will only correspond to the senses that are present in the corpus itself and not to all the senses that the target word actually has.

Apart from the discrimination among granularities of senses, there is another issue of significant importance to distributional methods: the automatic labeling of the clusters. It has often been discussed (e.g. Pedersen (2007)) that such a task is quite difficult as well as very significant. One possible solution to the problem, as Pedersen (2007) shows, is to extract separate sets of words that are related to the created clusters and use them instead of a normal definition. Such words, which capture the contextual similarity of the clusters, are acquired by type-based methods of discrimination. Since it is not our primary aim to describe the distributional methods for unsupervised WSD in depth, please refer to (Pedersen, 2007) for more information. However, to illustrate the process, let us consider the word line. If we have two separate clusters representing the rope-like sense and the one connected with communication, the type-based methods will give us words that could be used as labels of the clusters, like (cord, line, tie, cable) for the first, rope-like cluster and (telephone, line, call) for the communication cluster.
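To illustrate the idea in code, the following sketch clusters a handful of contexts of line by their bag-of-words representation and then prints the most frequent words of each cluster as type-based labels. The toy contexts are invented, the implementation (scikit-learn k-means) is our own simplification rather than the method of any cited system, and such a tiny sample will not always split as cleanly as real corpus data would.

```python
# A simplified sketch of distributional sense discrimination: represent each
# occurrence of the target word "line" by its bag-of-words context, cluster
# the contexts, and label each cluster with its most frequent content words.
# Toy data and k-means are illustrative choices only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

contexts = [
    "he tied the rope to the fishing line on the boat",
    "the cord and the line were coiled on the deck",
    "she waited for the telephone line to be free",
    "the call dropped because the line was busy",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(contexts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Type-based cluster labels: the highest-frequency words in each cluster.
vocab = np.array(vectorizer.get_feature_names_out())
for cluster in range(2):
    counts = np.asarray(X[labels == cluster].sum(axis=0)).ravel()
    print(cluster, vocab[counts.argsort()[::-1][:3]])
```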

2.2.2 Translational Equivalence Methods

Translational equivalence methods, as their name suggests, rely on the translation of the target word into another language. For this purpose, parallel corpora (collections of texts placed alongside their translations) come in very handy. Translational equivalence methods extract the translations of the target word from the parallel corpora and so create its sense inventory. In other words, the considered sense inventory of a target word is built up of all the translations of this target word in the parallel corpora. This approach is believed to be very useful for many purposes, e.g. the extraction of sense inventories that are more adequate for a specific application (since, as noted above, it is often the case that the sense inventories provided by dictionaries, WordNet, or other sources can be either too fine-grained or not fine-grained enough) or the automatic derivation of bilingual dictionaries from parallel corpora (which again allows the dictionaries to be more specific to the given domain).

With the rapidly increasing use of large corpora, which are nowadays permanently available and easily extracted from the World Wide Web, unsupervised corpus-based methods have become more and more interesting to the computational linguistics community. Their application does not necessarily require a lot of linguistic knowledge, which makes them flexible with respect to the variety of languages to which they can be applied. Linguistic knowledge, however, naturally contributes a lot to the power and robustness of the unsupervised methods, which is a direction in which they could be further developed, especially for the languages for which such knowledge already exists.

In general, unsupervised corpus-based algorithms, as Mihalcea et al. (2004a) report, perform worse than supervised or knowledge-based algorithms. However, as can be seen in Table 2.2, the results presented in evaluation exercises (e.g. Senseval - see Chapter 4) have become more competitive than they were in previous years.

[Table not included in this excerpt]

Table 2.2: Performance and short description for the unsupervised systems participating in the SENSEVAL-3 English lexical sample task. Precision (P) and recall (R) (see Section 4.1) figures are provided for both fine-grained and coarse-grained scoring (Mihalcea et al., 2004a).

2.3 Supervised Corpus-Based

A considerable part of our approach is based on supervised corpus-based methods, which is the reason why we will briefly discuss them as well. However, we will only present the information that is most relevant to our work.

Supervised corpus-based methods for word sense disambiguation are, compared to the unsupervised ones, considerably more expensive with regard to the human work that needs to be invested in them. This work consists basically of the semantic annotation of examples, of which, for a domain-independent application, as Ng (1997b) suggests, at least 1000 instances of 3200 different words are needed. In their earlier work, Ng and Lee (1996) also report that the necessary human effort for the creation of a corpus of this size is about 16 person-years. Unquestionably, this is quite a high price to pay, and of course it leads to consequences such as the knowledge acquisition bottleneck, in other words the lack of semantically annotated instances. This unequivocally illustrates the biggest predicament for supervised corpus-based methods but fails to show their effectiveness. In the last decade the methods of this family have proved to be more successful than unsupervised or knowledge-based ones, for which evaluation exercises (see Chapter 4) provide us with evidence. As can be seen in Table 2.3, supervised systems reach up to 79.3% accuracy (the accuracy of a system represents its overall performance; exactly how it is measured is described in Section 4.1), which is already a good distance from the MFS classifier baseline, which gives us only about 55.2% for the data in the given experiment.

The idea behind supervised methods for WSD is directly connected with the use of machine learning for classification. Those methods automatically learn to make correct predictions as long as they are given the possibility to make some observations in advance. There are several advantages to this automated approach to classification: it is most often much more accurate than human-crafted rules because it is data-driven; its flexibility enables its application to varied training data; no extra effort is needed for the creation of additional classifiers, etc. Along with the good aspects, there are some downsides to supervised machine learning classification - the biggest one, as we just mentioned, being the fact that it depends on huge amounts of annotated training data. The sequence of actions that supervised machine learning methods employ is visualized in Figure 2.1.

[Table not included in this excerpt]

Table 2.3: Performance and short description of the supervised systems participating in the SENSEVAL-3 English lexical sample Word Sense Disambiguation task. Precision (P) and recall (R) (see Section 4.1) figures are provided for both fine-grained and coarse-grained scoring (Mihalcea et al., 2004a).

[Figure not included in this excerpt]

Figure 2.1: Supervised machine learning.

The attempt to solve a problem with supervised machine learning methods starts with the identification of a data set that is feasible for the given task (see Section 2.3.2). Then this set is preprocessed (Section 2.3.3) and prepared to be divided into the required training and test sets (Section 2.3.4). Training, however, can only be carried out after a proper algorithm is selected, together with the parameters that are most likely to give good results (Section 2.3.5). Once this setup is ready, the test set is used to "assess" the future classifier. If the evaluation is satisfying, the classifier is finished, but in case the evaluation yields unacceptable results, the process can be reactivated at any of its previous stages. The main idea behind this machinery is that, once provided with a training set, the selected algorithm induces a specific classifier, which is normally described as a hypothesis of a function that is used to map the examples (in our case the test examples) to the different classes that have already been observed in the training cases. We already know that human annotators label the training data manually, but a good question at this point is how they know which set of senses they can use for classification. The answer is easy, and it is hidden behind the employment of the so-called sense inventories.

2.3.1 Sense Inventories

Most often, machine-readable dictionaries are the sources of the sense inventories that are used for the manual annotation of supervised WSD corpora and thus for the creation of the training and test sets. The most extensively used ones are nowadays WordNet (Miller, 1990; Fellbaum, 1998) for English (e.g. for the DSO corpus (Ng and Lee, 1996), SemCor (Miller et al., 1993), and Open Mind Word Expert (OMWE) (Mihalcea and Chklovski, 2003)) and EuroWordNet (Vossen, 1998) for languages such as Dutch, Italian, Spanish, German, French, Czech, and Estonian. In Section 2.3.2 we point out which sense inventories have been used for the corresponding sense-annotated corpora, but we do not explicitly keep track of their version (e.g. WordNet 1.6, 1.7, 1.7.1, 2.0), since automatic mapping to the most recent version is usually also provided. Other such resources are: Hector (Atkins, 1993), the Longman Dictionary of Contemporary English (Procter, 1978), BalkaNet (Stamou et al., 2002), etc.

Theoretically, each dictionary could be used as a sense inventory for WSD. However, several problems come along with this. First, dictionaries are not always freely available for research, which is the reason why WordNet has in fact become the standard sense inventory over the last decade. However, it is still being argued whether it is good as such. Since WordNet distinguishes between the senses of each word in an extremely fine-grained manner, it is often hard to use for WSD, as there are cases where a coarser distinction is desirable. Calzolari et al. (2002) even argue that the use of WordNet as a sense inventory in WSD yields worse results than using traditional dictionaries. However, it is not WordNet itself but the predefined sense inventory as such that appears to hinder supervised word sense disambiguation. There is a large number of attempts to solve the latter problem, and although none of them completely succeeds, WordNet will continue to be the standard sense inventory for WSD.

Another problem with respect to sense inventories is their compatibility. Each dictionary has its own granularity and representation of senses, which are normally extremely hard, if not impossible, to map against each other. Thus, systems that use different sense inventories are impossible to compare, since their performance is bound to the inventory they use. Of course, since this issue is a well-known problem, evaluation exercises (see Chapter 4) use a single sense inventory for all the participating systems, or it is required that inventories that differ from the standard one provide a mapping to it.

2.3.2 Source Corpora

One of the biggest problems for supervised word sense disambiguation (the knowledge acquisition bottleneck problem) is the fact that there are very few annotated corpora that can be used to train good and reliable broad-coverage systems. This is due to the fact that the creation of such corpora requires highly laborious human effort. The method's heavy dependence on the provided corpora is the reason why, for languages other than English, supervised WSD is extremely difficult if not impossible. Below follows a brief description of the main data sources for supervised WSD.

Senseval provides several corpora, not only for English but also for languages such as Italian, Basque, Catalan, Chinese, Romanian and Spanish. The most recent edition of Senseval (Senseval-3) resulted in the following annotated corpora:

- English all words - 5000 words were tagged from Penn Treebank text (Marcus et al., 1993) with WordNet senses.
- English lexical sample - 57 words collected via The Open Mind Word Expert interface (Mihalcea and Chklovski, 2003) with WordNet sense inventory.
- Italian all words - 5000 words from the Italian Treebank (Bosco et al., 2000) semantically tagged according to the sense repository of ItalWordNet (Roventini et al., 2000).
- Italian lexical sample - 45 words with a sense inventory specially developed for the task (Italian MultiWordNet for Senseval-3).
- Basque lexical sample - 40 words with sense inventory manually linked to WordNet.
- Catalan lexical sample - 45 words with sense inventory manually linked to WordNet.
- Spanish lexical sample - 45 words for which the sense inventory was again specially developed and was manually linked to WordNet.
- Chinese lexical sample - 20 words with sense inventory according to the HowNet knowledge base (Dong, 1998).
- Romanian lexical sample - 50 words for which senses are collected from the new Romanian WordNet, or DEX (a widely recognized Romanian dictionary). The data is collected via the OMWE (Romanian edition) (see Section 4.5.2).
- The Swedish lexical sample task was unfortunately cancelled, and thus no corpora were provided.

SemCor - (Miller et al., 1993) is a lot broader in coverage than Senseval. Around 23 346 words are gathered from the Brown Corpus (about 80%) and the novel The Red Badge of Courage (about 20%), and WordNet is used as the sense inventory.

The Open Mind Word Expert - (Mihalcea and Chklovski, 2003) consists of 230 words from the Penn Treebank, the Los Angeles Times collection, Open Mind Common Sense and others. Here, WordNet is used as the sense repository, too. This corpus, however, grows daily, since it is created mostly by volunteers who manually annotate examples on the Web. Further information about the OMWE can be found in Section 4.5.2.

The DSO corpus - Ng and Lee (1996) gathered 191 words from the Brown Corpus and the Wall Street Journal and annotated them with senses according to WordNet.

Hector - (Atkins, 1993) accounts for about 300 words from a 20M-word pilot of the British National Corpus[3] (BNC), for which the sense inventory is taken from Hector as well.

HKUST-Chinese - approximately 38 725 sentences, again from the HowNet knowledge base (Dong, 1998).

Swedish corpus - 179 151 instances with the Gothenburg lexical database as inventory and the SUC Corpus as its source corpus.

Image captions - 2 304 words from the captions of an image collection, with WordNet as the sense inventory.

All this shows how immense the efforts to solve this exceptionally important problem (the knowledge acquisition bottleneck) are, and how, despite this considerable attempt, the biggest obstacle for supervised WSD still remains in the way. Overcoming it will surely require a great deal of further effort, but until then we can continue discussing the process of WSD with a supervised system; the next step to be taken after deciding on a sense inventory and obtaining a semantically annotated corpus is data preprocessing.

2.3.3 Data Preprocessing

Having chosen the corpora and extracted the needed parts of them, however, does not yet mean that we are ready to go ahead and train a classifier. There is another intermediate step, which is not necessarily difficult but in most cases quite complex. It consists of data preprocessing and preparation for the extraction of the training and test examples. The former is believed to have a significant impact on the performance of supervised WSD systems: it conditions the quality of the discovered patterns and makes it possible to find useful information in the initial data. The most basic preparation of the data is the filtering of any unwanted special or meta characters, which are normally present in large corpora, or the removal of exceptional punctuation marks. Such "noise" in the data not only leads to a decrease in accuracy but, depending on the further processing and the software that is used, might even lead to other problems (e.g. a part-of-speech (POS) tagger, usually used later in the process, not handling special punctuation characters).
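The sketch below shows what such a basic cleanup step might look like; the concrete character whitelist is an assumption made for illustration, and any real system would tune it to its corpora and to the requirements of its POS tagger.

```python
# A minimal sketch of basic corpus cleanup: strip markup remnants and
# special/meta characters, then normalize whitespace. The whitelist of
# allowed characters is an illustrative assumption, not a fixed standard.

import re

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)                    # markup remnants
    text = re.sub(r"[^A-Za-z0-9.,;:!?()'\- ]", " ", text)   # special characters
    return re.sub(r"\s+", " ", text).strip()                # whitespace

print(clean("The <b>bank°</b>   was\x0c closed."))  # -> The bank was closed.
```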

The second part of this process is POS tagging. Depending on the decision that is made about the information which needs to be contained in the examples, multi-word detection (a multi-word expression is a lexeme built up of a sequence of at least two lexemes, with properties that are not predictable from the properties of the individual lexemes), lexical dependencies and n-gram detection (an n-gram is a sub-sequence of n items, in our case words, from a given sequence) might be necessary as well.

When the above-mentioned basic preprocessing and POS tagging are done, the data can finally be used for the extraction of the examples needed for training and testing the classifier (also called a word-expert). This step is further described in the following section.

2.3.4 Feature Vectors

The examples in the training set are constructed as feature vectors (FV). Feature vectors are fixed-length n-dimensional vectors of features (where n is the number of selected attributes/features) that represent the relevant information about the example they refer to. Those attributes are predefined and picked uniformly for each instance, so that each vector has the same number and type of features. There are three types of features from which the selection can be made: local features, global features and syntactic dependencies. Local features describe the narrow context around the target word - bigrams, trigrams (and other n-grams) that are generally constituted by lemmas, POS tags and word-forms, together with their positions with respect to the target word. Global features, on the other hand, consider not only the local context around the word that is being disambiguated, but the whole context (whole sentences, paragraphs, documents), aiming to best describe the semantic domain in which the target word is found. The third type of features that can be extracted are the syntactic dependencies. They are normally extracted at sentence level using heuristic patterns and regular expressions, in order to gain a better representation of the syntactic cues and argument-head relations such as subject, object, noun-modifier, preposition, etc.

Commonly, features are selected out of a predefined pool of features (PF), which consists of the most relevant features for the task at hand. We compiled such a PF (see Appendix B), our choice being mostly influenced by (Dinu and Kübler, 2007) and (Mihalcea, 2002).

Let us consider a simple example. Given the sentences in (2) we will attempt to create FVs that represent the occurrences of the noun bank in those sentences.

(2)

1. "He sat on the bank of the river and watched the currents."
2. "He operated a bank of switches."

For the sake of simplicity, at this point we chose a minimal number of features from the predefined pool of features. The corresponding set consists only of local features [illustration not visible in this excerpt]. The resulting vectors can be seen in Table 2.4 below. Since we do not explain in detail exactly how those vectors are put together, please refer to Chapter 6, where we follow the process of FV construction step by step.

[Table not included in this excerpt]

Table 2.4: Feature vectors for the sentences in (2).

As we have already noted, supervised learning is based on training data that has been manually annotated beforehand, while unsupervised learning does not need such a predefined sense inventory. Thus, a very important part of the FV is the class label that represents the classification of this vector (this is the value of the output space that is associated with a training example), which in our case is to be found in the class column. As its value we have selected the correct sense of the noun bank in the sentence, represented by the sense key that this sense has in WordNet (Fellbaum, 1998). Refer to Section 6.4.1 for further information on the final FVs that we constructed for our system.
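A small sketch of how such vectors could be assembled for the two sentences in (2) is given below; the two-word window and the placeholder sense labels are illustrative assumptions, while the full feature pool actually used in our system is listed in Appendix B.

```python
# A minimal sketch of local-feature extraction for the noun "bank" in the
# sentences in (2): two words to the left, the target word, two words to
# the right, plus the class label. Window size and sense labels are
# illustrative placeholders, not the thesis's final feature set.

def local_features(tokens, i, window=2):
    """Words at positions -window..+window around the target, with padding."""
    padded = ["<s>"] * window + tokens + ["</s>"] * window
    j = i + window
    return padded[j - window : j + window + 1]

sent1 = "he sat on the bank of the river and watched the currents".split()
sent2 = "he operated a bank of switches".split()

# The class label (in practice a WordNet sense key) comes from manual
# annotation; "bank_sense_1/2" are placeholders for the real keys.
vec1 = local_features(sent1, sent1.index("bank")) + ["bank_sense_1"]
vec2 = local_features(sent2, sent2.index("bank")) + ["bank_sense_2"]

print(vec1)  # ['on', 'the', 'bank', 'of', 'the', 'bank_sense_1']
print(vec2)  # ['operated', 'a', 'bank', 'of', 'switches', 'bank_sense_2']
```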

The test set is indeed very similar to the training set. However, since we need to evaluate the system, no class labels are included in its feature vectors. System performance is normally believed to increase with the amount of training data, and thus the training set is usually a relatively larger portion of the whole data than the test set. It is possible to divide the data into 5 portions, where 4 portions form the training set and 1 portion the test set, resulting in a ratio of 4:1. Other often used ratios are 2:1 and 9:1. According to Palmer et al. (2007), a division of 2:1 may provide a more realistic indication of a system's performance, since a larger test set is considered. Still, we know that labeled data is not plentiful, which is why it is preferably used for training rather than for testing. By dividing the data in any particular split, however, a bias is unquestionably introduced. Consequently, a better measurement of generalization accuracy has to be used in real experiments: n-fold cross-validation, or in particular 10-fold cross-validation (Weiss and Kulikowski, 1991).

In n-fold cross-validation the data is divided into n folds, which should preferably be of equal size. Accordingly, n separate experiments are performed, and in each experiment (also called a fold) n-1 portions of the data are used for training and 1 for testing, in such a way that each portion is used as a test set exactly once. If n equals the sample size (the size of the data set), the process is called leave-one-out cross-validation.
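The splitting scheme itself is simple, as the following sketch shows; the round-robin split is an illustrative choice, and real experiments would normally shuffle the data first or stratify the folds by sense.

```python
# A minimal sketch of n-fold cross-validation: divide the data into n folds
# of (roughly) equal size, then rotate which fold serves as the test set so
# that every example is tested exactly once.

def cross_validation_folds(data, n=10):
    """Yield (train, test) pairs for n-fold cross-validation."""
    folds = [data[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
for train, test in cross_validation_folds(data, n=5):
    print(len(train), len(test))  # prints "16 4" five times

# With n equal to len(data), this reduces to leave-one-out cross-validation.
```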

2.3.5 Supervised WSD Algorithms

One of the main decisions which needs to be made when designing a supervised WSD system is the choice of the algorithm that is to be employed. Table 2.5 gives a basic overview of the most often used alternatives, as well as some literature where more information about them can be found. A short description of the algorithms is provided as well, in order to give an outline of their usage and importance.

[Table not included in this excerpt]

Table 2.5: Supervised word sense disambiguation algorithms.

Probabilistic methods categorize each new example by using calculated probabilistic parameters. The latter convey the probability distributions of the categories and of the contexts that are described by the features in the feature vectors.

Naïve Bayes (Duda et al., 2001) is one of the simplest representatives of the probabilistic methods; it presupposes the conditional independence of the features given the class label. The main idea is that an example is classified by selecting the most probable sense for the instance, where each of its features is considered independently according to its individual distribution. The algorithm uses the Bayes inversion rule (Fienberg, 2006). The independence assumption is often considered a problem for Naïve Bayes, and thus alternative algorithms such as the decomposable model of (Bruce and Wiebe, 1994) have been developed.
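The decision rule itself is compact, as the following sketch shows: the score of a sense combines its prior probability with the per-feature conditional probabilities, multiplied (here summed in log space) under the independence assumption. All probabilities below are invented for illustration; in practice they are estimated from the training set, usually with smoothing for unseen features.

```python
# A minimal sketch of the Naïve Bayes decision rule for WSD: pick the sense
# maximizing log P(sense) + sum of log P(feature | sense), relying on the
# conditional independence assumption. All probabilities are invented for
# illustration; real ones are estimated (with smoothing) from training data.

from math import log

prior = {"financial": 0.7, "river": 0.3}           # P(sense)
cond = {                                           # P(feature | sense)
    "financial": {"money": 0.30, "water": 0.01, "account": 0.25},
    "river": {"money": 0.02, "water": 0.40, "account": 0.01},
}

def naive_bayes(features):
    """Return the sense with the highest posterior score in log space."""
    scores = {
        sense: log(prior[sense]) + sum(log(cond[sense][f]) for f in features)
        for sense in prior
    }
    return max(scores, key=scores.get)

print(naive_bayes(["water"]))             # -> 'river'
print(naive_bayes(["money", "account"]))  # -> 'financial'
```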

Maximum entropy (Berger et al., 1996) is another quite robust probabilistic approach that combines stochastic evidence from multiple different sources without the need for any prior knowledge of the data.

Discriminating rules assign a sense to an example by selecting one or more predefined rules that are satisfied by the features of the example, and hence selecting the sense that the predictions of those rules yield. Examples of such methods are decision lists and decision trees.

[...]


[1] http://www.aclweb.org/anthology-new/

[1] http://wordnet.princeton.edu/

[2] http://www.lexipedia.com/

[3] http://www.natcorp.ox.ac.uk/

