Enhancement to Selective Incremental Approach for Transductive Nearest Neighbour Classification

Master's Thesis, 2012, 133 pages

Computer Science - Didactics





List of Figures

List of Tables

List of Abbreviations

Chapter 1: Machine Learning - An Introduction
1.1 Introduction
1.2 Data mining and Machine Learning
1.3 What is Machine Learning?
1.3.1 What is Learning?
1.4 Learning Strategies
1.5 History of Machine Learning
1.5.1 Early enthusiasm (1955 - 1965)
1.5.2 Dark ages (1962 - 1976)
1.5.3 Renaissance (1976 - 1988)
1.5.4 Maturity (1988 - Present)
1.6 Applications of Machine Learning
1.7 Typical Taxonomy of Machine Learning
1.7.1 Supervised learning
1.7.2 Unsupervised Learning

Chapter 2: Semi-Supervised Learning
2.1 Introduction to Semi-Supervised Learning
2.2 Semi-Supervised Learning
2.2.1 Semi-Supervised Classification
2.2.2 Semi-Supervised Clustering
2.2.3 Semi-Supervised Feature Selection
2.3 Generative Models
2.3.1 Identifiability
2.3.2 Model Correctness
2.3.3 EM Local Maxima
2.3.4 Cluster-and-Label
2.3.5 Fisher kernel for discriminative learning
2.4 Self-Training
2.5 Co-Training and Multi-view Learning
2.5.1 Co-Training
2.5.2 Multi-view Learning
2.6 Semi-supervised Learning Techniques
2.6.1 Transduction
2.6.2 Induction
2.7 Avoiding Changes in Dense Regions
2.7.1 Transductive SVMs (S3VMs)
2.7.2 Gaussian Processes
2.7.3 Information Regularization
2.7.4 Entropy Minimization
2.7.5 A Connection to Graph-based Methods?
2.8 Graph-Based Methods
2.8.1 Regularization by Graph
2.8.2 Graph Construction
2.8.3 Fast Computation
2.9 Theoretical observations

Chapter 3: SITNNC
3.1 System Study
3.2 Traditional Existing Methods
3.2.1 Graph Mincut
3.2.2 Spectral Graph Partitioning
3.2.3 Standard Algorithm: 3-NNC
3.2.4 ID3 Decision Tree
3.3 What is Classification?
3.3.1 Model construction
3.3.2 Model usage
3.4 Proposed method
3.4.1 Algorithm SITNNC
3.4.2 Algorithm improved SITNNC
3.6 Using Leaders in SITNNC
3.6.1 Algorithm ModifiedLeaders
3.6.2 Algorithm K-means

Chapter 4: Design and Results of Enhanced SITNNC
4.1 Do Humans Do Semi-Supervised Learning?
4.1.1 Visual Object Recognition with Temporal Association
4.1.2 Infant Word-Meaning Mapping
4.1.3 Human Categorization Experiments
4.2 Properties of the Data Sets Used
4.2.1 UCI repository
4.2.2 Types of metadata for machine learning
4.2.3 Dynamic or usage metadata
4.2.4 Using metadata in ML research
4.2.5 Organizational and structural support for metadata acquisition and maintenance
4.3 Used Datasets
4.4 Results

Chapter 5: Conclusion

Appendices
A. C Code for SITNNC
B. C Code for Improved SITNNC


List of Figures

Fig 2.1: Basic idea of how unlabeled data can be useful in learning a classifier
Fig 2.2: A binary classification problem
Fig 2.3: An example of unidentifiable models
Fig 2.4: If the model is wrong, higher likelihood may lead to lower Classification accuracy
Fig 2.5: Co-Training: Conditional independent assumption on feature split
Fig 2.7: In TSVM, U helps to put the decision boundary in sparse regions
Fig 2.8: The TSVM loss function (1 − |f (xi)|)+
Fig 2.9: Utility of pairwise constraints in data clustering

Fig 3.1: 3-Nearest Neighbor Classifier
Fig 3.2: Example of k -NN classification
Fig 3.3: Distribution of (a) clustered, (b) random, (c) regular
Fig 3.4: Applying NNC for measuring the distance among the trees
Fig 3.5: An example for model construction
Fig 3.6: An example for model usage

Fig 4.1: As recently as 2005, semi-supervised learning methods have not addressed large-scale problems
Fig 4.2: Classify teapot images by its spout orientation. Some images within the same class are quite different, while some images from different classes are similar
Fig 4.3: Content independent catalog samples from the UCI ML dataset repository

Fig 5.1: Comparison of Results for Existing and Proposed Methods for voting dataset
Fig 5.2: Comparison of Results for Existing and Proposed Methods for Mush dataset
Fig 5.3: Comparison of Results for Existing and Proposed Methods for IONO dataset
Fig 5.4: Comparison of Results for Existing and Proposed Methods for BUPA dataset
Fig 5.5: Comparison of Results for Existing and Proposed Methods for PIMA dataset


Table 2.1: Examples of different Semi-supervised Settings

Table 3.1: Details of datasets used

Table 4.1: Data set Table

Table 5.1: Comparison of results of existing and proposed methods for five data sets
Table 5.2: Comparison of Time Complexities of SITNNC and Leaders methods




In recent years, semi-supervised learning has emerged as an exciting new direction in machine learning research. It is closely related to profound issues of how to do inference from data, as witnessed by its overlap with transductive inference. Semi-supervised learning lies halfway between supervised and unsupervised learning: the majority of the patterns are unlabelled and form the test set, while the known labelled patterns form the training set. Using this training set, we assign labels to the test set.

Our proposed method uses the Nearest Neighbour Classifier for semi-supervised learning: we label the unlabelled patterns using the labelled patterns, and then compare this method with traditional existing methods such as graph mincut, spectral graph partitioning, ID3 and the Nearest Neighbour Classifier. We show that our proposed method is more scalable than the existing methods, and we reduce the time complexity of SITNNC (Selective Incremental approach for Transductive Nearest Neighbour Classifier) using the Leaders algorithm.


I am deeply indebted to my Principal, Dr. K. Sreenivasula Reddy, M.Tech, Ph.D, garu, whose constant support and valuable guidance were a source of inspiration for me.

I am deeply indebted to our Management for providing good faculty and fine infrastructure at our college, Madanapalle Institute of Technology and Science, Madanapalle.

Finally, a word of gratitude to my parents, who have been a constant source of encouragement and love.

Chapter 1: Machine Learning - An Introduction


Human in vitro fertilization involves collecting several eggs from a woman’s ovaries, which, after fertilization with partner or donor sperm, produce several embryos. Some of these are selected and transferred to the woman’s uterus. The problem is to select the “best” embryos to use—the ones that are most likely to survive. Selection is based on around 60 recorded features of the embryos—characterizing their morphology, oocyte, follicle, and the sperm sample. The number of features is sufficiently large that it is difficult for an embryologist to assess them all simultaneously and correlate historical data with the crucial outcome of whether that embryo did or did not result in a live child. In a research project in England, machine learning is being investigated as a technique for making the selection, using as training data historical records of embryos and their outcome.

Every year, dairy farmers in New Zealand have to make a tough business decision: which cows to retain in their herd and which to sell off to an abattoir. Typically, one-fifth of the cows in a dairy herd are culled each year near the end of the milking season as feed reserves dwindle. Each cow’s breeding and milk production history influences this decision. Other factors include age (a cow is nearing the end of its productive life at 8 years), health problems, history of difficult calving, undesirable temperament traits (kicking or jumping fences), and not being in calf for the following season. About 700 attributes for each of several million cows have been recorded over the years. Machine learning is being investigated as a way of ascertaining what factors are taken into account by successful farmers—not to automate the decision but to propagate their skills and experience to others.

Life and death. From Europe to the antipodes. Family and business. Machine learning is a burgeoning new technology for mining knowledge from data, a technology that a lot of people are starting to take seriously.

1.2 Data mining and machine learning

We are overwhelmed with data. The amount of data in the world, in our lives, seems to go on and on increasing—and there’s no end in sight. Omnipresent personal computers make it too easy to save things that previously we would have trashed. Inexpensive multigigabyte disks make it too easy to postpone decisions about what to do with all this stuff—we simply buy another disk and keep it all. Ubiquitous electronics record our decisions, our choices in the supermarket, our financial habits, our comings and goings. We swipe our way through the world, every swipe a record in a database. The World Wide Web overwhelms us with information; meanwhile, every choice we make is recorded. And all these are just personal choices: they have countless counterparts in the world of commerce and industry. We would all testify to the growing gap between the generation of data and our understanding of it. As the volume of data increases, inexorably, the proportion of it that people understand decreases, alarmingly. Lying hidden in all this data is information, potentially useful information that is rarely made explicit or taken advantage of.

Actually we are looking for patterns in data. There is nothing new about this. People have been seeking patterns in data since human life began. Hunters seek patterns in animal migration behavior, farmers seek patterns in crop growth, politicians seek patterns in voter opinion, and lovers seek patterns in their partners’ responses. A scientist’s job (like a baby’s) is to make sense of data, to discover the patterns that govern how the physical world works and encapsulate them in theories that can be used for predicting what will happen in new situations. The entrepreneur’s job is to identify opportunities, that is, patterns in behavior that can be turned into a profitable business, and exploit them.

In data mining, the data is stored electronically and the search is automated—or at least augmented—by computer. Even this is not particularly new. Economists, statisticians, forecasters, and communication engineers have long worked with the idea that patterns in data can be sought automatically, identified, validated, and used for prediction.

1.3 What is Machine Learning?

Machine Learning [1][2] is the domain of Artificial Intelligence concerned with building adaptive computer systems that are able to improve their competence and/or efficiency through learning from input data or from their own problem-solving experience.

1.3.1 What is Learning?

Learning is a very general term denoting the way in which people and computers:

1. Acquire [3] and organize knowledge (by building, modifying and organizing internal representations of some external reality).
2. Discover new knowledge [4] and theories (by creating hypotheses that explain some data or phenomena).
3. Acquire skills (by gradually improving their motor or cognitive skills through repeated practice, sometimes involving little or no conscious thought).

Learning results in changes in the agent (or mind) that improve its competence and/or efficiency.

Two complementary dimensions for learning

Competence: - A system is improving its competence if it learns to solve a broader class of problems, and to make fewer mistakes in problem solving.

Efficiency: - A system is improving its efficiency, if it learns to solve the problems from its area of competence faster or by using fewer resources.

More directions of Research in Machine Learning:

1. Discovery of general principles, methods, and algorithms of learning.
2. Automation of the construction of knowledge-based systems.
3. Modeling human learning mechanisms.

1.4 Learning Strategies

A Learning [5] Strategy is a basic form of learning characterized by the employment of a certain type of inference (like deduction, induction or analogy) and a certain type of computational or representational mechanism (like rules, trees, neural networks, etc.).

- Rote learning.
- Learning from instruction.
- Learning from examples.
- Explanation-based learning.
- Conceptual clustering.
- Quantitative discovery.
- Abductive learning.
- Learning by analogy.
- Instance-based learning.
- Reinforcement learning.
- Neural networks [6].
- Genetic algorithms and evolutionary computation.
- Bayesian learning.
- Multistrategy learning [7][8].

1.5 History of Machine Learning

1.5.1 Early enthusiasm (1955 - 1965)

- Learning without knowledge.
- Neural modeling (self-organizing systems and decision space techniques).
- Evolutionary learning.
- Rote learning (Samuel’s checkers player).

1.5.2 Dark ages (1962 - 1976)

- To acquire knowledge one needs knowledge [3].
- Realization of the difficulty of the learning process and of the limitations of the explored methods (e.g. the perceptron cannot learn the XOR function).
- Symbolic concept learning (Winston’s influential thesis, 1972).

1.5.3 Renaissance (1976 - 1988)

- Exploration of different strategies (EBL, CBR, GA, NN, Abduction, Analogy, etc.).
- Knowledge-intensive learning.
- Successful applications.

Machine Learning conferences/workshops worldwide.

1.5.4 Maturity (1988 - present)

- Experimental comparisons.
- Revival of non-symbolic methods.
- Computational learning theory.
- Multistrategy learning [9].
- Integration of machine learning [10] and knowledge acquisition.
- Emphasis on practical application.

1.6 Applications of Machine Learning

1) Learning to recognize spoken words (all of the most successful systems use machine learning).
2) Learning to drive an autonomous vehicle on public highway.
3) Learning to classify new astronomical structures (by learning regularities in a very large data base of image data).
4) Learning to play games.
5) Automation of knowledge acquisition from domain experts.

1.7 Typical Taxonomy of Machine Learning

1.7.1 Supervised learning

The first task is supervised learning [11]. The goal is to learn a mapping from x to y, given a training set made of pairs (xi, yi). Here, the yi ∈ Y are called the labels or targets of the examples xi. If the labels are numbers, y = (y1, . . . , yn)⊤ denotes the column vector of labels. Again, a standard requirement is that the pairs (xi, yi) are sampled i.i.d. from some distribution which here ranges over X × Y. The task is well defined, since a mapping can be evaluated through its predictive performance on test examples. When Y = R (or more generally, when the labels are continuous), the task is called regression. Most of this work will focus on classification, i.e., the case where y takes values in a finite set (discrete labels), though there is some work on regression.

Supervised learning aims to learn a mapping function f: X → Y, where X and Y are the input and output spaces, respectively (e.g. classification and regression [12]). The process of learning the mapping function is called training, and the set of labelled objects used is called the training data or the training set. The mapping, once learned, can be used to predict the labels of objects that were not seen during the training phase. Several pattern recognition [13], [14] and machine learning [1] textbooks discuss supervised learning extensively. A brief overview of supervised learning algorithms is presented in this section.
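As a concrete toy illustration of learning such a mapping from labeled pairs, the sketch below implements a minimal 1-nearest-neighbour classifier. It is written in Python for brevity (the thesis appendices use C), and all names and the tiny training set are illustrative, not taken from the thesis.

```python
import math

def nn_classify(train, x):
    """Label x with the label of its nearest training example (1-NN)."""
    best_label, best_dist = None, float("inf")
    for xi, yi in train:
        d = math.dist(xi, x)  # Euclidean distance between feature vectors
        if d < best_dist:
            best_label, best_dist = yi, d
    return best_label

# Tiny illustrative training set: two classes in the plane
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.0), "B")]
print(nn_classify(train, (0.9, 0.8)))
```

The learned "mapping" here is implicit: prediction is deferred to query time, which is exactly what makes nearest-neighbour methods natural for the transductive setting studied later.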

Supervised learning methods can be broadly divided into generative and discriminative approaches. Generative models assume that the data are independently and identically distributed and are generated by a parameterized probability density function. The parameters are estimated using methods like Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP) estimation [12], Empirical Bayes and Variational Bayes [13]. Probabilistic methods can further be divided into frequentist and Bayesian. Frequentist methods estimate parameters based on the observed data alone, while Bayesian methods allow for the inclusion of prior knowledge about the unknown parameters. Examples of this approach include the Naive Bayes classifier and Bayesian linear and quadratic discriminants, to name a few.

Instead of modelling the data generation process, discriminative methods directly model the decision boundary between the classes. The decision boundary is represented as a parametric function of data, and the parameters are learned by minimizing the classification error on the training set [12]. Empirical Risk Minimization (ERM) is a widely adopted principle in discriminative supervised learning. This is largely the approach taken by Neural Networks [14] and Logistic Regression [13]. As opposed to probabilistic methods, these do not assume any specific distribution on the generation of data, but model the decision boundary directly.

Most methods following the ERM principle suffer from poor generalization performance. This was overcome by Vapnik’s Structural Risk Minimization (SRM) principle [25], which adds to the empirical risk a regularity criterion that selects a classifier with good generalization ability. This led to the development of Support Vector Machines (SVMs), which regularize the complexity of classifiers while simultaneously minimizing the empirical error. Methods following ERM, such as neural networks and logistic regression, have been extended to regularized versions that follow SRM [13].

1.7.2 Unsupervised learning

The second task is unsupervised learning. Let X = (x1, . . . , xn) be a set of n examples (or points), where xi ∈ X for all i ∈ {1, . . . , n}. Typically it is assumed that the points are drawn i.i.d. (independently and identically distributed) from a common distribution on X. It is often convenient to define the (n × d) matrix X that contains the data points as its rows. The goal of unsupervised learning is to find interesting structure in the data X. It has been argued that the problem of unsupervised learning is fundamentally that of estimating a density which is likely to have generated X. However, there are also weaker forms of unsupervised learning, such as quantile estimation, clustering, outlier detection, and dimensionality reduction.

Unsupervised learning, or clustering, is a significantly more difficult problem than classification because of the absence of labels on the training data. Given a set of objects, or a set of pairwise similarities between the objects, the goal of clustering is to find natural groupings (clusters) in the data. The mathematical definition of what is considered a natural grouping defines the clustering algorithm. A very large number of clustering algorithms have already been published, and new ones continue to appear [15,17]. We broadly divide the clustering algorithms into groups based on their fundamental assumptions, and discuss a few representative algorithms in each group, ignoring minor variations within the group.

K-means [15,17] is arguably the most popular and widely used clustering algorithm. K-means is an example of a sum-of-squared-error (SSE) minimization algorithm. Each cluster is represented by its centroid. The goal of K-means is to find the centroids and the cluster labels for the data points such that the sum of squared errors between each data point and its closest centroid is minimized. K-means is initialized with a set of random cluster centers, which are iteratively updated by assigning each data point to its closest center and recomputing the centroids. ISODATA [18] and Linear Vector Quantization [19] are closely related SSE minimization algorithms that were independently proposed in different disciplines.
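The assign-then-recompute loop described above (Lloyd's algorithm) can be sketched in a few lines; this hypothetical Python version is only an illustration of the SSE-minimization idea, not the thesis's implementation.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # random initial centers
    for _ in range(iters):
        # assignment step: each point joins its closest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # update step: each centroid becomes the mean of its cluster
        new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:               # converged
            break
        centroids = new
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```

On this toy data the two returned centroids settle at the means of the two well-separated groups; with less separated data, different random initializations can reach different local minima of the SSE objective.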

Parametric mixture models are well known in statistics and machine learning communities [20]. A mixture of parametric distributions, in particular, GMM [21,22], has been extensively used for clustering. GMMs are limited by the assumption that each component is homogeneous, unimodal, and generated using a Gaussian density. Latent Dirichlet Allocation[23] is a multinomial mixture model that has become the de facto standard for text clustering.

Several mixture models have been extended to their non-parametric form by taking the number of components to infinity in the limit [24,25,26]. A non-parametric prior is used in the generative process of these infinite models (e.g. Dirichlet Process) for clustering in [24]. One of the key advantages offered by the non-parametric prior based approaches is that they adjust their complexity to fit the data by choosing the appropriate number of parametric components. Hierarchical Topic Models [25] are clustering approaches that have seen huge success in clustering text data.

Spectral clustering algorithms [27,28,29] are popular non-parametric models that minimize an objective function of the form J(f) = f⊤∆f, where f is the function to be estimated and ∆ is the discrete graph Laplacian operator. Kernel K-means is a related kernel-based algorithm, which generalizes the Euclidean-distance-based K-means to arbitrary metrics in the feature space. Using the kernel trick, the data is first mapped into a higher dimensional space using a possibly non-linear map, and K-means clustering is performed in that higher dimensional space. In [30], the explicit relation (equivalence for a particular choice of normalization of the kernel) between Kernel K-means, Spectral Clustering and Normalized Cut was established.

Non-parametric density based methods are popular in the data mining community. Mean-shift clustering [31] is a widely used non-parametric density based clustering algorithm. The objective of Mean-shift is to identify the modes in the kernel density, seeking the nearest mode for each point in the input space. Several density based methods like DBSCAN also rely on empirical probability estimates, but their performance degrades heavily when the data is high dimensional. A recent segmentation algorithm [32] uses a hybrid mixture model, where each mixture component is a convex combination of parametric and non-parametric density estimates.
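The mode-seeking behaviour of Mean-shift is easy to see in one dimension: a point is repeatedly moved to the kernel-weighted mean of the data until it settles at a mode of the kernel density estimate. A hypothetical minimal Python sketch (data and parameters are illustrative):

```python
import math

def mean_shift_1d(data, x, bandwidth=1.0, iters=100):
    """Move x to the Gaussian-kernel-weighted mean of the data until it
    settles at the nearest mode of the kernel density estimate."""
    for _ in range(iters):
        w = [math.exp(-((x - d) / bandwidth) ** 2 / 2) for d in data]
        x = sum(wi * di for wi, di in zip(w, data)) / sum(w)
    return x

# Two well-separated 1-D groups; each start point drifts to its group's mode
data = [0.0, 0.2, 0.4, 5.0, 5.1, 5.3]
print(mean_shift_1d(data, 0.5, bandwidth=0.5))
print(mean_shift_1d(data, 4.0, bandwidth=0.5))
```

Points started near either group converge to that group's density mode; the bandwidth plays the same role as the kernel width in any kernel density estimate.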

Hierarchical clustering algorithms are popular non-parametric algorithms that iteratively build a cluster tree from a given pairwise similarity matrix. Agglomerative algorithms such as Single Link, Complete Link, Average Link [16,17] and Bayesian Hierarchical Clustering [33] start with each data point in a single cluster, and merge them successively into larger clusters based on different similarity criteria at each iteration. Divisive algorithms start with a single cluster, and successively divide the clusters at each iteration.
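The agglomerative idea can be sketched for the single-link criterion: start with singleton clusters and repeatedly merge the pair of clusters whose closest members are nearest. This is an illustrative Python sketch, not an efficient implementation (a production version would maintain a distance matrix instead of rescanning all pairs).

```python
import math

def single_link(points, num_clusters):
    """Agglomerative single-link clustering: repeatedly merge the two
    clusters whose closest members are nearest to each other."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None                        # (distance, i, j) of the best merge
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)     # merge cluster j into cluster i
    return clusters

print(single_link([(0,), (1,), (9,), (10,)], 2))
```

Stopping the merge loop at different cluster counts corresponds to cutting the cluster tree at different heights.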

A decision tree is one of the earliest classifiers; it can handle a variety of data with a mix of real, nominal and missing features and multiple classes. It also provides interpretable classifiers, which give the user insight into which features contribute to a particular class being predicted for a given input example. Decision trees can produce complex decision rules and are sensitive to noise in the data. Their complexity can be controlled by approaches like pruning; however, in practice classifiers like SVM or Nearest Neighbour have been shown to outperform decision trees on vector data.

Chapter 2: Semi-Supervised Learning

2.1 Introduction to Semi-supervised Learning

Supervised learning expects training data that is completely labeled. On the other extreme, unsupervised learning is applied to completely unlabeled data. Unsupervised learning is a more difficult problem than supervised learning due to the lack of a well-defined, user-independent objective [15,16]. For this reason, it is usually considered an ill-posed problem that is exploratory in nature; that is, the user is expected to validate the output of the unsupervised learning process. Devising a fully automatic unsupervised learning algorithm that is applicable in a variety of data settings is an extremely difficult problem, and possibly infeasible. On the other hand, supervised learning is a relatively easier task compared to unsupervised learning. The ease comes with the added cost of creating a labeled training set.

Labeling a large amount of data may be difficult in practice because data labeling:

1. Is expensive: human experts are needed to perform labeling, e.g. experts need to be paid to label, or tools such as Amazon’s Mechanical Turk [34] must be used.
2. Has uncertainty about the level of detail: the labels of objects change with the granularity at which the user looks at the object. As an example, a picture of a person can be labeled as “person”, or at a greater detail “face”, “eyes”, “torso”, etc.
3. Is difficult: sometimes objects must be sub-divided into coherent parts before they can be labeled. For example, speech signals and images have to be accurately segmented into syllables and objects, respectively, before labeling can be performed.
4. Can be ambiguous: objects might have non-unique labelings, or the labelings themselves may be unreliable due to disagreement among experts.
5. Uses a limited vocabulary: a typical labeling setting involves selecting a label from a list of pre-specified labels, which may not completely or precisely describe an object. As an example, labeled image collections usually come with a pre-specified vocabulary that can describe only the images that are already present in the training and testing data.

Unlabeled data is available in abundance, but it is difficult to learn the underlying structure of the data. Labeled data is scarce but easier to learn from. Semi-supervised learning is designed to alleviate the problems of supervised and unsupervised learning, and has gained significant interest in the machine learning research community.

2.2 Semi-supervised learning

Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some supervision information - but not necessarily for all examples. Often, this information will be the targets associated with some of the examples. In this case (the standard setting of SSL), the data set X = (x1, . . . , xn) can be divided into two parts: the points Xl = (x1, . . . , xl), for which labels Yl = (y1, . . . , yl) are provided, and the points Xu = (xl+1, . . . , xl+u), the labels of which are not known. SSL works in situations where the available information in the data is in between those considered by supervised and unsupervised learners; i.e. it can be approached from both the supervised and the unsupervised learning problem by augmenting their traditional inputs. Various sources of side-information considered in the literature are summarized in Table 2.1.


Table 2.1: Examples of different semi-supervised settings.

2.2.1 Semi-supervised classification

Semi-supervised classification algorithms train a classifier given both labeled and unlabeled data. A special case of this is the well-known transductive learning [35], where the goal is to label only the unlabeled data available during training. Semi-supervised classification can also be viewed as an unsupervised learning problem with only a small amount of labeled training data.

While semi-supervised classification is a relatively new field, the idea of using unlabeled samples to augment labeled examples for prediction was conceived several decades ago. The initial work in semi-supervised learning is attributed to Scudder for his work on “self-learning” [36]. An earlier work by Robbins and Monro [37] on sequential learning can also be viewed as related to semi-supervised learning. Vapnik’s Overall Risk Minimization (ORM) principle [38] advocates minimizing the risk over the labeled training data as well as the unlabeled data, as opposed to Empirical Risk Minimization, and resulted in transductive Support Vector Machines.


Fig 2.1: Utility of the unlabeled data in learning a classifier.

(a) Classifier learned using labelled data alone.


Fig 2.1: (b) Utility of unlabeled data. The filled dots show the unlabeled data. The gray region depicts the data distribution obtained from the unlabeled data.

Fig. 2.1 gives the basic idea of how unlabeled data could be useful in learning a classifier. Given a set of labeled data, a decision boundary may be learned using any of the supervised learning methods (Fig. 2.1(a)). When a large amount of unlabeled data is provided in addition to the labeled data, the true structure of each class is revealed through the distribution of the unlabeled data (Fig. 2.1(b)). The unlabeled data defines a “natural region” for each class, and the region is labeled by the labeled data. The task now is no longer just to separate the labeled data, but to separate the regions to which the labeled data belong. The definition of this “region” constitutes some of the fundamental assumptions in semi-supervised learning.

Existing semi-supervised classification algorithms may be classified into two categories based on their underlying assumptions. An algorithm is said to satisfy the manifold assumption if it utilizes the fact that the data lie on a low-dimensional manifold in the input space. Usually, the underlying geometry of the data is captured by representing the data as a graph, with samples as the vertices and the pairwise similarities between the samples as edge weights. Several graph-based algorithms such as Label propagation [39], Markov random walks [40], Graph cut algorithms [41], Spectral graph transducer [42], and Low density separation [43] proposed in the literature are based on this assumption. The second assumption is called the cluster assumption [44]. It states that the data points belonging to the same cluster are likely to have the same class label.
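The graph-based family can be illustrated with a small iterative label-propagation loop in the spirit of [39]: build a similarity graph over all points, repeatedly replace each node's class score with the weighted average of its neighbours' scores, and clamp the labeled nodes. The following Python sketch is hypothetical (the toy 1-D data, the Gaussian weights and all names are illustrative assumptions, not the thesis's method):

```python
import math

def label_propagation(X_l, y_l, X_u, sigma=1.0, iters=200):
    """Propagate labels over a fully connected similarity graph with
    Gaussian edge weights, clamping the labeled nodes each round."""
    X = X_l + X_u
    n, l = len(X), len(X_l)
    # edge weights w_ij = exp(-||xi - xj||^2 / sigma^2)
    W = [[math.exp(-sum((a - b) ** 2 for a, b in zip(X[i], X[j])) / sigma ** 2)
          for j in range(n)] for i in range(n)]
    # f[i] holds the current score for class 1 at node i
    f = [1.0 if i < l and y_l[i] == 1 else 0.0 for i in range(n)]
    for _ in range(iters):
        f = [sum(W[i][j] * f[j] for j in range(n)) / sum(W[i]) for i in range(n)]
        for i in range(l):                 # clamp the labeled points
            f[i] = 1.0 if y_l[i] == 1 else 0.0
    return [1 if fi > 0.5 else 0 for fi in f[l:]]

# Toy 1-D example: two labeled points anchor the two classes
X_l, y_l = [(0.0,), (4.0,)], [1, 0]
X_u = [(0.5,), (1.0,), (3.5,)]
print(label_propagation(X_l, y_l, X_u))
```

Each unlabeled point ends up with the label of the region it is connected to through high-weight edges, which is exactly the transductive behaviour discussed above.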

2.2.2 Semi-supervised clustering

Clustering is an ill-posed problem, and it is difficult to come up with a general purpose objective function that works satisfactorily with an arbitrary dataset [16]. If any side information is available, it must be exploited to obtain a more useful or relevant clustering of the data. Most often, side-information in the form of pairwise constraints (“a pair of objects belongs to the same cluster or different clusters”) is available. The pairwise constraints are of two types: must-link and cannot-link constraints. The clustering algorithm must try to assign the same label to the pair of points participating in a must-link constraint, and assign different labels to a pair of points participating in a cannot-link constraint. These pairwise constraints may be specified by a user to encode his preferred clustering. Pairwise constraints can also be automatically inferred from the structure of the data, without a user having to specify them. As an example, web pages that are linked to one another may be considered as participating in a must-link constraint.

2.2.3 Semi-supervised feature selection

Feature selection can be performed in both supervised and unsupervised settings depending on the data available. Unsupervised feature selection is difficult for the same reasons that make clustering difficult: the lack of a clear objective apart from the model assumptions. Supervised feature selection has the same limitations as classification, i.e. the scarcity of labeled data. Semi-supervised feature selection aims to utilize pairwise constraints in order to identify a possibly superior subset of features for the task.

Many other learning tasks, apart from classification and clustering, have their semi-supervised counterparts as well (e.g., semi-supervised ranking). For example, page-ranking algorithms used by search engines can utilize existing partial ranking information on the data to obtain a final ranking based on the query.

2.3 Generative Models

Generative models are perhaps the oldest semi-supervised learning method. They assume a model p(x, y) = p(y) p(x|y), where p(x|y) is an identifiable mixture distribution, for example a Gaussian mixture model. With a large amount of unlabeled data, the mixture components can be identified; then, ideally, only one labeled example per component is needed to fully determine the mixture distribution, see Figure 2.2. One can think of the mixture components as 'soft clusters'. Nigam et al. (2000) apply the EM algorithm to a mixture of multinomials for the task of text classification, and show that the resulting classifiers perform better than those trained only on the labeled data L. Baluja (1998) uses the same algorithm on a face-orientation discrimination task. Fujino et al. (2005) extend generative mixture models by including a 'bias correction' term and discriminative training using the maximum entropy principle. One has to pay attention to a few things:
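The idea that one labeled example per component can anchor an otherwise unsupervised mixture can be sketched as follows: a minimal semi-supervised EM for a two-component 1-D Gaussian mixture, where labeled points keep hard responsibilities and unlabeled points receive soft ones. This is illustrative code under simplifying assumptions (1-D data, two components); the function name is invented.

```python
import math

def ssl_em_gaussian(labeled, unlabeled, n_iter=50):
    """Semi-supervised EM for a two-component 1-D Gaussian mixture.

    labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.
    Labeled points keep hard responsibilities; unlabeled points get
    soft responsibilities in the E-step.  Returns (means, stds, priors).
    """
    # Initialise each component's mean from its labeled examples.
    mu = [sum(x for x, y in labeled if y == k) /
          max(1, sum(1 for _, y in labeled if y == k)) for k in (0, 1)]
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]

    def pdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

    for _ in range(n_iter):
        # E-step: responsibilities (hard for labeled, soft for unlabeled).
        resp = [(x, [1.0 if y == k else 0.0 for k in (0, 1)]) for x, y in labeled]
        for x in unlabeled:
            w = [pi[k] * pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            z = sum(w)
            resp.append((x, [wk / z for wk in w]))
        # M-step: re-estimate parameters from all responsibilities.
        for k in (0, 1):
            nk = sum(r[k] for _, r in resp)
            mu[k] = sum(x * r[k] for x, r in resp) / nk
            sigma[k] = max(1e-3, math.sqrt(sum(r[k] * (x - mu[k]) ** 2
                                               for x, r in resp) / nk))
            pi[k] = nk / len(resp)
    return mu, sigma, pi

# One labeled point per component plus unlabeled data around -2 and +2.
labeled = [(-2.0, 0), (2.0, 1)]
unlabeled = [-2.5, -2.2, -1.8, -1.5, 1.5, 1.8, 2.2, 2.5]
mu, sigma, pi = ssl_em_gaussian(labeled, unlabeled)
print(round(mu[0], 1), round(mu[1], 1))  # means recovered near -2 and +2
```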

2.3.1 Identifiability

The mixture model ideally should be identifiable. In general, let {p_θ} be a family of distributions indexed by a parameter vector θ. The family is identifiable if θ1 ≠ θ2 implies p_θ1 ≠ p_θ2, up to a permutation of the mixture components. If the model family is identifiable, then in theory, given infinite unlabeled data U, one can learn θ up to a permutation of the component indices. Here is an example showing the problem with unidentifiable models. Suppose the model p(x|y) is uniform for y ∈ {+1, −1}. From a large amount of unlabeled data U we know that p(x) is uniform on [0, 1]. We also have two labeled data points, for example (0.1, +1) and (0.9, −1). Can we determine the label for x = 0.5? No. Under our assumptions we cannot distinguish the following two models:

illustration not visible in this excerpt

Fig 2.2: In a binary classification problem, if we assume each class has a Gaussian distribution, then we can use unlabeled data to help parameter estimation.

These models give opposite labels at x = 0.5; see Figure 2.3. It is known that a mixture of Gaussians is identifiable, whereas the mixture of uniform distributions in this example is not.

illustration not visible in this excerpt

Fig 2.3: An example of unidentifiable models.

Even if we know that p(x) (top) is a mixture of two uniform distributions, we cannot uniquely identify the two components. For instance, the mixtures on the second and third lines give the same p(x), but they classify x = 0.5 differently.
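The unidentifiability argument can be checked numerically. The two concrete mixtures below are one consistent choice, assumed for illustration (they are not taken from the missing figure): both fit labeled points (0.1, +1) and (0.9, −1), both produce the same uniform marginal p(x) on [0, 1], yet they assign opposite labels at x = 0.5.

```python
def make_model(p_pos, split):
    """Mixture: p(y=+1)=p_pos, p(x|+1)=Uniform[0,split], p(x|-1)=Uniform[split,1].

    Choosing p_pos == split makes the marginal p(x) uniform on [0, 1].
    """
    def joint(x):
        # Unnormalised class scores p(y) * p(x|y) for y = +1 and y = -1.
        pos = p_pos * (1.0 / split if 0 <= x <= split else 0.0)
        neg = (1 - p_pos) * (1.0 / (1 - split) if split < x <= 1 else 0.0)
        return pos, neg
    def marginal(x):
        pos, neg = joint(x)
        return pos + neg
    def label(x):
        pos, neg = joint(x)
        return +1 if pos > neg else -1
    return marginal, label

# Two mixtures of uniforms; both fit labeled points (0.1, +1) and (0.9, -1).
m1, l1 = make_model(0.2, 0.2)
m2, l2 = make_model(0.6, 0.6)

xs = [0.05, 0.3, 0.5, 0.7, 0.95]
print([m1(x) for x in xs])  # marginal of model 1: density 1 everywhere
print([m2(x) for x in xs])  # identical marginal for model 2
print(l1(0.5), l2(0.5))     # but opposite labels at x = 0.5
```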

2.3.2 Model Correctness

If the mixture-model assumption is correct, unlabeled data is guaranteed to improve accuracy. However, if the model is wrong, unlabeled data may actually hurt accuracy; Figure 2.4 shows an example. This has been observed by multiple researchers, and Cozman et al. (2003) give a formal derivation of how it can happen. It is thus important to construct the mixture model carefully so that it reflects reality. For example, in text categorization a topic may contain several sub-topics, and is better modeled by multiple multinomials than by a single one. Another remedy is to down-weight the unlabeled data, as done by Nigam et al. (2000), and by Callison-Burch et al. (2004), who estimate word alignments for machine translation.

illustration not visible in this excerpt

Fig 2.4: If the model is wrong, higher likelihood may lead to lower classification accuracy.

For example, the data in (a) are clearly not generated from two Gaussians. If we insist that each class is a single Gaussian, (b) will have higher probability than (c), yet (b) has around 50% accuracy while (c)'s is much better.

2.3.3 EM Local Maxima

Even if the mixture-model assumption is correct, in practice the mixture components are identified by the Expectation-Maximization (EM) algorithm, which is prone to local maxima. If a local maximum is far from the global maximum, unlabeled data may again hurt learning. Remedies include a smart choice of starting point, e.g. via active learning.

2.3.4 Cluster-and-Label

We shall also mention that instead of using a probabilistic generative mixture model, some approaches employ various clustering algorithms to cluster the whole dataset, and then label each cluster with labeled data. Although they can perform well if the particular clustering algorithms match the true data distribution, these approaches are hard to analyze due to their algorithmic nature.
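A cluster-and-label pipeline can be sketched in a few lines: run a clustering algorithm over the whole dataset (a tiny 1-D k-means here, for brevity), then give every cluster the majority label among the labeled points that fall into it. The function names and toy data are illustrative assumptions, not from the thesis.

```python
import random

def cluster_and_label(points, labeled, k=2, n_iter=20, seed=0):
    """Cluster-and-label sketch: k-means on ALL points (1-D here),
    then each cluster takes the majority label of its labeled members.

    points: list of floats (the whole dataset); labeled: dict index -> class.
    Returns a predicted label for every point.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: abs(x - centers[c])) for x in points]
        # Update step: move each center to the mean of its cluster.
        for c in range(k):
            members = [x for x, a in zip(points, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    # Label each cluster by majority vote of its labeled members.
    cluster_label = {}
    for c in range(k):
        votes = [y for i, y in labeled.items() if assign[i] == c]
        cluster_label[c] = max(set(votes), key=votes.count) if votes else None
    return [cluster_label[a] for a in assign]

points = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
labeled = {0: 'A', 5: 'B'}          # one labeled point per cluster
print(cluster_and_label(points, labeled))
```

This performs well only when the clustering matches the true class structure, which is exactly the caveat raised above.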

2.3.5 Fisher kernel for discriminative learning

Another approach to semi-supervised learning with generative models is to convert the data into a feature representation determined by the generative model; the new representation is then fed into a standard discriminative classifier. Holub et al. (2005) used this approach for image categorization. First a generative mixture model is trained, one component per class. At this stage the unlabeled data can be incorporated via EM, as in the previous subsections. However, instead of using the generative model directly for classification, each labeled example is converted into a fixed-length Fisher score vector, i.e. the derivatives of the log likelihood w.r.t. the model parameters, for all component models (Jaakkola & Haussler, 1998). These Fisher score vectors are then used in a discriminative classifier such as an SVM, which empirically gives high accuracy.

2.4 Self-Training

Self-training is a commonly used technique for semi-supervised learning. In self-training a classifier is first trained with the small amount of labeled data. The classifier is then used to classify the unlabeled data. Typically the most confident unlabeled points, together with their predicted labels, are added to the training set; the classifier is re-trained and the procedure repeated. Note that the classifier uses its own predictions to teach itself. The procedure is also called self-teaching or bootstrapping (not to be confused with the statistical procedure of the same name). The generative model and EM approach of Section 2.3 can be viewed as a special case of 'soft' self-training. One can imagine that a classification mistake can reinforce itself; some algorithms try to avoid this by 'unlearning' unlabeled points whose prediction confidence drops below a threshold. Self-training has been applied to several natural language processing tasks. Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding whether the word 'plant' means a living organism or a factory in a given context. Riloff et al. (2003) use it to identify subjective nouns. Maeireizo et al. (2004) classify dialogues as 'emotional' or 'non-emotional' with a procedure involving two classifiers. Self-training has also been applied to parsing and machine translation. Rosenberg et al. (2005) apply self-training to object detection from images, and show that the semi-supervised technique compares favorably with a state-of-the-art detector. Self-training is a wrapper algorithm and is hard to analyze in general; however, for specific base learners there have been some analyses of convergence.
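The self-training loop described above can be sketched with a nearest-centroid base classifier on 1-D data (an illustrative stand-in, not any specific system from the literature): train, score the unlabeled pool, promote the highest-margin predictions into the training set, and repeat.

```python
def self_train(labeled, unlabeled, n_rounds=5, per_round=2):
    """Self-training sketch with a nearest-centroid base classifier (1-D data).

    labeled: list of (x, y); unlabeled: list of x.  Each round, classify the
    unlabeled pool, move the most confident predictions (largest margin between
    distances to the two centroids) into the training set, and retrain.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(n_rounds):
        if not pool:
            break
        # "Train": compute one centroid per class from the current labels.
        classes = sorted(set(y for _, y in labeled))
        centroid = {c: sum(x for x, y in labeled if y == c) /
                       sum(1 for _, y in labeled if y == c) for c in classes}
        # Score each pool point: predicted class and confidence margin.
        scored = []
        for x in pool:
            dists = sorted((abs(x - m), c) for c, m in centroid.items())
            margin = dists[1][0] - dists[0][0]   # gap between best and second best
            scored.append((margin, x, dists[0][1]))
        scored.sort(reverse=True)
        # The classifier teaches itself with its most confident predictions.
        for _, x, y in scored[:per_round]:
            labeled.append((x, y))
            pool.remove(x)
    return labeled

seed = [(0.0, 'neg'), (10.0, 'pos')]
unlabeled = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
final = self_train(seed, unlabeled)
print(sorted(final))
```

Note how easily a mistake would reinforce itself here: once a point is promoted with the wrong label, it shifts the centroid and biases every later round.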

2.5 Co-Training and Multi view Learning

2.5.1 Co-Training

Co-training (Blum & Mitchell, 1998; Mitchell, 1999) assumes that (i) the features can be split into two sets; (ii) each sub-feature set is sufficient to train a good classifier; and (iii) the two sets are conditionally independent given the class. Initially, two separate classifiers are trained with the labeled data, on the two sub-feature sets respectively. Each classifier then classifies the unlabeled data, and 'teaches' the other classifier with the few unlabeled examples (and their predicted labels) about which it is most confident. Each classifier is retrained with the additional training examples given by the other classifier, and the process repeats.

illustration not visible in this excerpt

Fig 2.5: Co-Training: Conditional independence assumption on the feature split.

With this assumption the high confident data points in x1 view, represented by circled labels, will be randomly scattered in x2 view. This is advantageous if they are to be used to teach the classifier in x2 view.

In co-training, unlabeled data helps by reducing the version space size. In other words, the two classifiers (or hypotheses) must agree on the much larger unlabeled data as well as the labeled data.

We need the assumption that the sub-features are sufficiently good, so that we can trust the labels produced by each learner on U. We need the sub-features to be conditionally independent, so that one classifier's high-confidence data points are iid samples for the other classifier; Figure 2.5 visualizes the assumption. Nigam and Ghani (2000) perform extensive empirical experiments comparing co-training with generative mixture models and EM. Their results show that co-training performs well if the conditional independence assumption indeed holds. In addition, it is better to probabilistically label the entire U instead of only a few most confident data points; they name this paradigm co-EM. Finally, if there is no natural feature split, the authors create an artificial split by randomly breaking the feature set into two subsets. They show that co-training with an artificial feature split still helps, though not as much as before. Collins and Singer (1999) and Jones (2005) used co-training, co-EM and other related methods for information extraction from text. Balcan and Blum (2006) show that co-training can be quite effective: in the extreme case, only one labeled point is needed to learn the classifier. Zhou et al. (2007) give a co-training algorithm using Canonical Correlation Analysis which also needs only one labeled point. Dasgupta et al. (2001) provide a PAC-style theoretical analysis.
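The co-training loop can be sketched with two 1-D views and a nearest-centroid classifier per view. The helper names and toy data are assumptions for illustration; real co-training would use stronger base learners such as naive Bayes.

```python
def centroid_fit(pairs):
    """Per-class mean of a 1-D feature."""
    classes = set(y for _, y in pairs)
    return {c: sum(x for x, y in pairs if y == c) /
               sum(1 for _, y in pairs if y == c) for c in classes}

def centroid_predict(model, x):
    """Return (predicted class, confidence margin) for one value."""
    dists = sorted((abs(x - m), c) for c, m in model.items())
    return dists[0][1], dists[1][0] - dists[0][0]

def co_train(labeled, unlabeled, n_rounds=4):
    """Co-training sketch: points are (x1, x2) with two views; each view trains
    its own nearest-centroid classifier and teaches the other view its single
    most confident unlabeled point per round.
    """
    train1 = [(x1, y) for (x1, x2), y in labeled]   # view-1 training set
    train2 = [(x2, y) for (x1, x2), y in labeled]   # view-2 training set
    pool = list(unlabeled)
    for _ in range(n_rounds):
        if not pool:
            break
        m1, m2 = centroid_fit(train1), centroid_fit(train2)
        # View 1 picks the pool point it is most confident about...
        best1 = max(pool, key=lambda p: centroid_predict(m1, p[0])[1])
        y1 = centroid_predict(m1, best1[0])[0]
        train2.append((best1[1], y1))               # ...and teaches view 2.
        pool.remove(best1)
        if pool:
            best2 = max(pool, key=lambda p: centroid_predict(m2, p[1])[1])
            y2 = centroid_predict(m2, best2[1])[0]
            train1.append((best2[0], y2))           # view 2 teaches view 1
            pool.remove(best2)
    return centroid_fit(train1), centroid_fit(train2)

labeled = [((0.0, 0.0), 'neg'), ((10.0, 10.0), 'pos')]
pool = [(1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
m1, m2 = co_train(labeled, pool)
print(sorted(m1), sorted(m2))  # each view ends with both class centroids
```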

Co-training makes strong assumptions on the splitting of features, and one might wonder whether these conditions can be relaxed. Goldman and Zhou (2000) use two learners of different types that both take the whole feature set, and essentially use one learner's high-confidence data points in U, identified with a set of statistical tests, to teach the other learner, and vice versa. Chawla and Karakoulas (2005) perform empirical studies on this version of co-training and compare it against several other methods, in particular for the case where labeled and unlabeled data do not follow the same distribution. Later, Zhou and Goldman (2004) propose a single-view, multiple-learner Democratic Co-learning algorithm: an ensemble of learners with different inductive biases is trained separately on the complete features of the labeled data, and the learners then make predictions on the unlabeled data. If a majority of learners confidently agree on the class of an unlabeled point xu, that classification is used as the label of xu; xu and its label are added to the training data, and all learners are retrained on the updated training set. The final prediction is made with a variant of a weighted majority vote among all the learners. Similarly, Zhou and Li (2005b) propose 'tri-training', which uses three learners: if two of them agree on the classification of an unlabeled point, that classification is used to teach the third classifier. This approach avoids the need to explicitly measure the label confidence of any learner, and it can be applied to datasets without different views, or with different types of classifiers. Balcan et al. (2005b) relax the conditional independence assumption to a much weaker expansion condition, and justify the iterative co-training procedure. Johnson and Zhang (2007) propose a two-view model that relaxes the conditional independence assumption.

2.5.2 Multiview Learning

More generally, we can define learning paradigms that utilize the agreement among different learners. The particular assumptions of Co-Training are in general not required by multi view learning models. Instead, multiple hypotheses (with different inductive biases, e.g., decision trees, SVMs, etc.) are trained from the same labeled data set, and are required to make similar predictions on any given unlabeled instance. Multi view learning has a long history (de Sa, 1993). It has been applied to semi-supervised regression (Sindhwani et al., 2005b; Brefeld et al., 2006), and the more challenging structured output spaces (Brefeld et al., 2005; Brefeld & Scheffer, 2006). Some theoretical analysis on the value of agreement among multiple learners can be found in (Leskes, 2005; Farquhar et al., 2006).

2.6 Semi-supervised Learning Techniques

‘Semi-supervised learning’ refers to the use of both labeled and unlabeled data for training. It contrasts with supervised learning (all data labeled) and unsupervised learning (all data unlabeled). Other names are ‘learning from labeled and unlabeled data’ and ‘learning from partially labeled/classified data’. Note that semi-supervised learning can be either transductive or inductive.

2.6.1 Transduction

‘Transductive learning’ will be used in contrast to inductive learning. A learner is transductive if it only works on the labeled and unlabeled training data and cannot handle unseen data. The early graph-based methods are often transductive, whereas inductive learners can naturally handle unseen data. Note that under this convention, transductive support vector machines (TSVMs) are in fact inductive learners, because the resulting classifiers are defined over the whole space. The name TSVM originates from the intention to work only on the observed data (though people use them for induction anyway), which according to Vapnik (1998) is solving a simpler problem. People sometimes use the analogy that transductive learning is a take-home exam, while inductive learning is an in-class exam.

Some algorithms naturally operate in a transductive setting. According to the philosophy put forward by Vapnik, high-dimensional estimation problems should attempt to follow the following principle:

Vapnik’s principle: When trying to solve some problem, one should not solve a more difficult problem as an intermediate step.

Consider as an example supervised learning, where predictions of labels y corresponding to some objects x are desired. Generative models estimate the density of x as an intermediate step, while discriminative methods directly estimate the labels.

In a similar way, if label predictions are only required for a given test set, transduction can be argued to be more direct than induction: while an inductive method infers a function f : X → Y on the entire space X and afterward returns the evaluations f(xi) at the test points, transduction consists of directly estimating the finite set of test labels, i.e. a function f : Xu → Y defined only on the test set. Note that transduction, as defined here, is not the same as semi-supervised learning: some semi-supervised algorithms are transductive, but others are inductive. Now suppose we are given a transductive algorithm which produces a solution superior to an inductive algorithm trained on the same labeled data (but discarding the unlabeled data). Then the performance difference might be due to one of the following two points, or a combination thereof.

2.6.2 Induction

Most graph-based semi-supervised learning algorithms are transductive, i.e. they cannot easily be extended to new test points outside of L ∪ U. Recently, induction has received increasing attention. One common practice is to ‘freeze’ the graph on L ∪ U: new points do not (although they should) alter the graph structure. This avoids expensive graph computation every time a new point is encountered.

Zhu et al. (2003c) propose that a new test point be classified by its nearest neighbour in L ∪ U, which is sensible when U is sufficiently large. In (Chapelle et al., 2002) the authors approximate a new point by a linear combination of labeled and unlabeled points. Similarly, in (Delalleau et al., 2005) the authors propose an induction scheme to classify a new point x. Yu et al. (2004) report an early attempt at semi-supervised induction using RBF basis functions in a regularization framework. In (Belkin et al., 2004b) the function f does not have to be restricted to the graph: these methods create inductive learners that naturally handle new test points, and the graph is merely used to regularize f, which can have a much larger support. This is essentially a combination of an inductive algorithm and graph regularization. The authors give graph-regularized versions of least squares and SVM. (Note that such an SVM is different from using graph kernels in a standard SVM as in (Zhu et al., 2005): the former is inductive with both a graph regularizer and an inductive kernel, while the latter is transductive with only the graph regularizer.) Following this work, Krishnapuram et al. (2005) use graph regularization on logistic regression. Sindhwani et al. (2005a) give a semi-supervised kernel that is defined over the whole space, not just on the training data points.
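The nearest-neighbour induction scheme attributed above to Zhu et al. (2003c) is simple enough to sketch directly; the code assumes the transductive step has already assigned a (given or inferred) label to every point in L ∪ U, and the helper name is invented for illustration.

```python
def induct_nearest(l_u, x_new):
    """Induction sketch: classify a new point by its nearest neighbour in L ∪ U.

    l_u: list of (x, label) for all labeled and unlabeled training points,
         where unlabeled points carry the labels inferred transductively.
    """
    nearest = min(l_u, key=lambda p: abs(p[0] - x_new))
    return nearest[1]

# L had two labeled points; the transductive step labeled the rest of U.
l_u = [(0.0, 'neg'), (1.0, 'neg'), (2.0, 'neg'), (8.0, 'pos'), (10.0, 'pos')]
print(induct_nearest(l_u, 2.5))  # → 'neg' (nearest point is 2.0)
print(induct_nearest(l_u, 7.0))  # → 'pos' (nearest point is 8.0)
```

This is exactly the sense in which the frozen graph is reused: new points borrow labels from L ∪ U without recomputing anything.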

The harmonic mixture model (Zhu & Lafferty, 2005) naturally handles new points as well. The idea is to model the labeled and unlabeled data with a mixture model, e.g. a mixture of Gaussians. In standard mixture models, the class probability p(y|i) for each mixture component i is optimized to maximize the label likelihood. In harmonic mixture models, however, p(y|i) is optimized differently, to minimize an underlying graph-based cost function. Under certain conditions, the harmonic mixture model converts the original graph on the unlabeled data into a ‘backbone graph’ whose components are ‘super nodes’. Harmonic mixture models handle induction naturally, just like standard mixture models.

2.7 Avoiding Changes in Dense Regions

2.7.1 Transductive SVMs (S3VMs)

Discriminative methods work on p(y|x) directly. This brings up the danger of leaving p(x) outside the parameter estimation loop if p(x) and p(y|x) do not share parameters; notice that p(x) is usually all we can get from unlabeled data. It is believed that if p(x) and p(y|x) do not share parameters, semi-supervised learning cannot help, a point emphasized in (Seeger, 2001). Transductive support vector machines (TSVMs) build the connection between p(x) and the discriminative decision boundary by keeping the boundary out of high-density regions. The TSVM is an extension of the standard support vector machine that also uses unlabeled data. In a standard SVM only the labeled data is used, and the goal is to find a maximum-margin linear boundary in the Reproducing Kernel Hilbert Space. In a TSVM the goal is to find a labeling of the unlabeled data such that a linear boundary has the maximum margin on both the original labeled data and the (now labeled) unlabeled data. The resulting decision boundary has the smallest generalization error bound on unlabeled data (Vapnik, 1998). Intuitively, unlabeled data guides the linear boundary away from dense regions.

illustration not visible in this excerpt

Fig 2.6: In TSVM, U helps to put the decision boundary in sparse regions.

With labeled data only, the maximum margin boundary is plotted with dotted lines. With unlabeled data (black dots), the maximum margin boundary would be the one with solid lines.

However, finding the exact transductive SVM solution is NP-hard, so major effort has focused on efficient approximation algorithms. Early algorithms (Bennett & Demiriz, 1999; Demirez & Bennett, 2000; Fung & Mangasarian, 1999) either cannot handle more than a few hundred unlabeled examples, or did not do so in experiments. The SVM-light TSVM implementation (Joachims, 1999) is the first widely used software. De Bie and Cristianini (De Bie & Cristianini, 2004; De Bie & Cristianini, 2006b) relax the TSVM training problem, and transductive learning problems in general, to semi-definite programming (SDP). The basic idea is to work with the binary label matrix of rank 1, and relax it to a positive semi-definite matrix without the rank constraint. The paper also includes a speed-up trick to solve medium-sized problems with around 1000 unlabeled points. Xu and Schuurmans (2005) present a similar multi-class SDP formulation, which results in a multi-class SVM for semi-supervised learning. The computational cost of SDP is still expensive, though.

TSVM can be viewed as an SVM with an additional regularization term on unlabeled data. Let f(x) = h(x) + b, where h ∈ HK. The optimization problem is

min_f Σ_{i=1..l} (1 − y_i f(x_i))+ + λ1 ||h||²_HK + λ2 Σ_{i=l+1..n} (1 − |f(x_i)|)+

where (z)+ = max(z, 0). The last term arises from assigning the label sign(f(x)) to an unlabeled point x; the margin on an unlabeled point is thus sign(f(x)) f(x) = |f(x)|. The loss function (1 − |f(x_i)|)+ has a non-convex hat shape, as shown in Figure 2.7, which is the root of the optimization difficulty.

illustration not visible in this excerpt

Fig 2.7: The TSVM loss function (1 − |f (xi)|)+
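The hat loss and its non-convexity are easy to verify numerically. This is illustrative code; the symbols follow the text above, and the function names are invented.

```python
def hinge(z):
    """(z)+ = max(z, 0)."""
    return max(z, 0.0)

def labeled_loss(y, f):
    """Standard hinge loss (1 - y f(x))+ on a labeled point."""
    return hinge(1.0 - y * f)

def unlabeled_loss(f):
    """TSVM 'hat' loss (1 - |f(x)|)+ on an unlabeled point: it is zero once
    |f(x)| >= 1 (the point lies outside the margin) and peaks at f(x) = 0,
    so it pushes the boundary away from unlabeled points, but it is non-convex.
    """
    return hinge(1.0 - abs(f))

fs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
print([unlabeled_loss(f) for f in fs])  # → [0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0]

# Non-convexity: the loss at the midpoint of f = -1 and f = +1 exceeds the
# average of the endpoint losses, which a convex function could never do.
print(unlabeled_loss(0.0) > 0.5 * (unlabeled_loss(-1.0) + unlabeled_loss(1.0)))
```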

Chapelle and Zien (2005) propose a TSVM variant that approximates the hat loss (1 − |f(x_i)|)+ with a Gaussian function and performs gradient search in the primal space. Sindhwani et al. (2006) use a deterministic annealing approach, which starts from an ‘easy’ problem and gradually deforms it into the TSVM objective. In a similar spirit, Chapelle et al.


