Rule Extraction from Support Vector Machine

Applications to Banking and Finance


Doctoral Thesis / Dissertation, 2010



Excerpt


Contents

Certificate

Declaration

Dedication

Acknowledgements

List of Tables

List of Figures

Abstract

Chapter 1: Introduction to Rule Extraction
1.1 Rule Extraction: Motivation
1.2 Rule Extraction: Significance
1.2.1 Provision of User Explanation Capability
1.2.2 Transparency
1.2.3 Data Exploration
1.3 Rule Extraction: A Taxonomy
1.3.1 Decompositional Approaches
1.3.2 Pedagogical Approaches
1.3.3 Eclectic/Hybrid Approaches
1.4 Rule Quality Criteria
1.5 Thesis Overview
1.6 Datasets Used
1.7 Experimental Setup
1.8 Thesis Outline

Chapter 2: Rule Extraction from SVM: An Introduction
2.1 Rule Extraction from SVM
2.1.1 Decompositional Approaches
2.1.2 Pedagogical Approaches
2.1.3 Eclectic/Hybrid Approaches
2.2 Gaps Identified in Literature
2.3 Outline of the Proposed Approaches

Part I: Decompositional Approaches

Chapter 3: Fuzzy Rule Extraction using SVM for Solving the Bankruptcy Prediction Problem in Banks
3.1 Motivation
3.2 Proposed Fuzzy Rule Extraction Technique
3.2.1 Extraction of Support Vectors from SVM
3.2.2 Rule Generation
3.3 Literature Review of Bankruptcy Prediction in Banks and Firms
3.4 Results and Discussions
3.5 Conclusions

Part II: Pedagogical Approaches

Chapter 4: Rule Extraction from SVM using Feature Selection for Solving Classification and Regression Problems
4.1 Motivation
4.2 Proposed Rule Extraction Approach
4.2.1 Feature Selection using SVM-RFE
4.2.2 Building SVM/SVR models
4.2.3 Rule Generation
4.3 Problems Analyzed
4.3.1 Auto MPG Dataset
4.3.2 Body Fat Dataset
4.3.3 Boston Housing Dataset
4.3.4 Forest Fires Dataset
4.3.5 Pollution Dataset
4.4 Results and Discussions
4.4.1 Classification Problems
4.4.2 Regression Problems
4.4.3 Overall Observations
4.5 Conclusions

Part III: Eclectic/Hybrid Approaches

Chapter 5: Rule Extraction from SVM for Data Mining on Unbalanced Datasets
5.1 Motivation
5.2 Customer Relationship Management (CRM)
5.3 Churn Prediction Problem
5.4 Proposed Eclectic/Hybrid Rule Extraction Approach
5.4.1 Feature Selection using SVM-RFE
5.4.2 Support Vector Extraction using SVM
5.4.3 Rule Generation using NBTree
5.5 Dataset Description
5.6 Data Imbalance Problem
5.6.1 Literature Review of Techniques Dealing with Unbalanced Data
5.6.2 Random Under-Sampling
5.6.3 Random Over-Sampling
5.6.4 SMOTE (Synthetic Minority Over-sampling Technique)
5.7 Results and Discussions
5.8 Conclusions

Chapter 6: Modified Active Learning Based Approach for Rule Extraction from SVM
6.1 Motivation
6.2 Modified Active Learning Based Approach for Rule Extraction
6.2.1 Feature Selection Phase
6.2.2 Active Learning Phase
6.2.3 Rule Generation Phase
6.3 Finance Application Analyzed
6.3.1 Fraud Detection in Automobile Insurance Dataset
6.3.2 Pre-Processing
6.3.3 Literature Survey of Fraud Detection Problems
6.4 Results and Discussions
6.4.1 Churn Prediction using SVM+NBTree
6.4.2 Insurance Fraud Detection using SVM+NBTree
6.4.3 Churn Prediction using SVM+DT
6.4.4 Insurance Fraud Detection using SVM+DT
6.4.5 Overall Observations
6.5 Conclusions

Chapter 7: Rule Extraction from SVR for Solving Regression Problems
7.1 Motivation
7.2 Proposed Eclectic/Hybrid Rule Extraction Technique
7.2.1 Extraction of Support Vectors and SVR Predictions
7.2.2 Rule Generation
7.3 Problems Analyzed
7.4 Results and Discussions
7.5 Conclusions

Chapter 8: Overall Conclusions and Future Directions

Appendix A: Overview of SVM/SVR and SVM-RFE
A.1 Support Vector Machine
A.1.1 VC Dimension
A.1.2 Structural Risk Minimization
A.1.3 Support Vector Classification
A.1.4 Linearly Separable Case
A.1.5 Linearly non-Separable Case
A.1.6 Feature Space
A.1.7 Kernel Functions
A.1.8 Kernel Selection
A.2 Support Vector Regression
A.2.1 Linear Regression
A.2.2 non-Linear Regression
A.3 SVM-RFE (SVM-Recursive Feature Elimination)

Appendix B: Overview of the Transparent Machine Learning Techniques
B.1 Fuzzy Rule Based System (FRBS)
B.1.1 Rule Generation
B.1.2 Fuzzy Reasoning
B.1.3 Coding of Fuzzy if-then Rules
B.1.4 Outline of Fuzzy Classifier System
B.1.5 Initial Population
B.1.6 Evaluation of Each Rule
B.1.7 Genetic Operations for Generating New Rules
B.1.8 Rule Replacement
B.2 Decision Tree (DT)
B.2.1 Classification by Decision Tree Algorithm
B.2.2 Algorithm
B.2.3 Splitting Rule
B.3 Classification and Regression Tree (CART)
B.3.1 Construction of Maximum Tree
B.3.2 Choice of Right Tree Size
B.3.3 Classification of New Data using Constructed Tree
B.3.4 Advantages and Disadvantages of CART
B.4 Adaptive Network based Fuzzy Inference Systems (ANFIS)
B.4.1 Fuzzy if-then Rules
B.4.2 Fuzzy Inference System
B.4.3 Adaptive Networks
B.4.4 ANFIS Architecture
B.5 Dynamic Evolving Fuzzy Inference System (DENFIS)
B.5.1 Evolving Clustering Method (ECM)
B.5.2 Online ECM
B.5.3 DENFIS: Dynamic Evolving Fuzzy Inference System
B.6 NBTree (Naïve-Bayes Tree)
B.6.1 NBTree: The Hybrid Algorithm

Appendix C: Description of Datasets

Appendix D: Rules Tables

References

Papers Published out of this Thesis

Brief Bio Data of the Author

CERTIFICATE

This is to certify that the thesis entitled “Rule Extraction from Support Vector Machine: Applications to Banking and Finance” submitted by Mohammed Abdul Haque Farquad bearing Reg. No 04MCPC03 in partial fulfillment of the requirements for the award of Doctor of Philosophy in Computer Science is a bonafide work carried out by him/her under my/our supervision and guidance.

The thesis has not been submitted previously in part or in full to this or any other University or Institution for the award of any degree or diploma.

(Prof. S. Bapi Raju) (Dr. V. Ravi)

Signature of the Supervisor/s

Head of the Department/Centre Dean of the School

DECLARATION

I, Mohammed Abdul Haque Farquad, hereby declare that this thesis entitled “Rule Extraction from Support Vector Machine: Applications to Banking and Finance” submitted by me under the guidance and supervision of Professor S. Bapi Raju, Department of Computer and Information Sciences, University of Hyderabad and Dr. V. Ravi, Associate Professor, Institute for Development and Research in Banking Technology, Hyderabad is a bonafide research work. I also declare that it has not been submitted previously in part or in full to this University or any other University or Institution for the award of any degree or diploma.

Date: 24.12.2010 Name: Mohammed Abdul Haque Farquad

Signature of the Student:

Regd. No. 04MCPC03

Dedicated to:

My proud parents, Rafiuddin and Nazneen

My tremendously inspiring and supportive elder brother, Zahid
My incredibly understanding guides, Dr. Ravi and Prof. Bapi

My wonderful gifted younger brothers and sisters, Salwa, Saad, Sana,
Khalid, Majid, Huda, Bushra and Tooba

My extremely, artistically entertaining friends.

Acknowledgements

First of all, I am very grateful to the Almighty for such a beautiful life and the energy to work hard. I would also like to thank my parents for their kind support towards achieving my goals; this thesis would not have been possible without their direct and indirect encouragement. I love you, mom and dad.

This thesis would not have been possible without the help and support of the kind people around me, to only some of whom it is possible to give particular mention here. Above all, I would like to thank those who made this thesis possible such as my supervisors who helped me with the research material and my brother who is an inspiration for me. I owe my greatest gratitude to my supervisors Dr. V. Ravi, Associate Professor, IDRBT, Hyderabad and Dr. S. Bapi Raju, Professor, Department of Computer and Information Sciences, University of Hyderabad, whose encouragement, supervision and support from the preliminary to the concluding level enabled me to develop an understanding of the subject.

At the outset of my research, I was a bit scared of Dr. Ravi’s very strict and demanding nature, but now I realize the importance of the kind of training I underwent under his supervision. The best thing, among many others, that I could learn from him is the importance of effective experimentation for solving real-life financial problems using my proposed methodologies. Throughout my research, he has been very particular about empirical analysis. My second supervisor, Prof. Bapi Raju, has always shown confidence in me, apart from scanning my research work and progress. Despite being busy, he spared time and made himself available for discussion and encouragement, for which I am very grateful. In all, both of my supervisors contributed a lot to building my confidence as a researcher.

It is an honour for me to thank Prof. Hrishikesh Mohanty and Dr. Durga Bhavani, for serving on my examining committee. I thank you for your precious time in reviewing my research progress and providing thoughtful comments and suggestions. I also thank Mr. B. Sambamurthy, Director, IDRBT, Mr. Krishna Mohan, Former Director IDRBT, Prof. T. Amaranath, Dean, School of Mathematics Computers and Information Sciences, Prof. Arun Agarwal, Head of the Department, Department of Computer and Information Sciences, University of Hyderabad for extending their cooperation.

I would like to express my sincere thanks to colleagues and friends at IDRBT, the University of Hyderabad and elsewhere; they have been generous with their support. Among them are Zahid (my brother), Azeez, Habeeb, Saleem, Aijaz, Pavan, Srikanth, Kishore, Manoj, Daniel, Bhavani, Clare, Amer, Rushdie, Khalif, Orif, Amit, Nikunj, Satish, Vinay, Anil, Shiva, Karthik, Sunil, Ravi, Shakeel, Naveen, Rohit, Vasu, Jagan, Suresh, Dilip, Jaikumar, Pramod, Amareshwar, Mahesh, Anki, Adi and Vijay, whom I thank for their entertaining friendship and moral support. If I have missed any name, I apologise; my friends’ support and encouragement helped me to complete this thesis.

I would like to acknowledge the financial, academic and technical support provided by IDRBT, Hyderabad, particularly the award of a Research Fellowship that provided the necessary financial support to carry out this research. I also thank the staff of the IDRBT library, Mr. Ratna Kumar, Mr. Prakash and Mr. Gyan, who helped me with literature updates. Furthermore, I would like to thank Mr. Murthy and Mr. Prakash, who provided me the technical resources whenever requested. Last but not least, I would like to thank the RKHS team as a whole for serving the most hygienic and energizing food.

I would also like to thank my colleagues and friends in Vivek Vardhini P.G. College, Hyderabad. Last, but by no means least, I thank my friends in my hometown and elsewhere for their support and encouragement throughout, some of whom have already been named. For any errors or inadequacies that may remain in this work, of course, the responsibility is entirely my own.

...Thank you all.

M.A.H. Farquad

List of Tables

Table 1.1: Various approaches proposed for Rule Extraction from SVM

Table 3.1: Average Results on Validation set

Table 3.2: Fuzzy Rules Extracted using Iris Dataset

Table 3.3: Sample Fuzzy Rules Extracted using Wine Dataset

Table 3.4: Bankruptcy prediction dataset division into Training and Validation

Table 3.5: Average results for Spanish Banks on Validation Set

Table 3.6: Fuzzy rules extracted for Spanish banks dataset

Table 3.7: Average results for Turkish Banks on Validation Set

Table 3.8: Fuzzy Rules Extracted for Turkish Banks dataset

Table 3.9: Average results for US Banks on Validation Set

Table 3.10: Fuzzy rules extracted for US banks dataset

Table 3.11: AUC of the classifiers

Table 3.12: Fidelity of various hybrids

Table 4.1: Classification dataset Information

Table 4.2: Bankruptcy prediction dataset division into Training and Validation

Table 4.3: Average results obtained using IRIS data

Table 4.4: Rule set extracted using SVM+CART (Case-P) for Iris data (all features)

Table 4.5: Average results using all and reduced features of Wine data

Table 4.6: Rule set extracted using SVM+CART (Case-P) for Wine data (8 features)

Table 4.7: Average results using all and reduced features of WBC data

Table 4.8: Rule set extracted using SVM+CART (Case-P) for WBC data (5 features)

Table 4.9: Average results using all and reduced features of Spanish banks data

Table 4.10: Rule set extracted using SVM+CART (Case-P) for Spanish banks data (6 features)

Table 4.11: Average results using all and reduced features of Turkish banks data

Table 4.12: Rule set extracted using SVM+DT (Case-P) for Turkish banks data (8 features)

Table 4.13: Average results of US banks dataset

Table 4.14: Rule set extracted using SVM+CART (Case-P) for US banks data (all features)

Table 4.15: Average results using all and reduced features of UK banks data

Table 4.16: Rule set extracted using SVM+CART (Case-P) for UK banks data (6 features)

Table 4.17: Average Fidelity of SVM+DT and SVM+CART hybrids

Table 4.18: Average RMSE obtained using all and reduced features of Auto MPG dataset

Table 4.19: Sample Rule set extracted using SVR+CART (Case-P) for Auto MPG dataset (3 features)

Table 4.20: Average RMSE obtained using all and reduced features of Body Fat dataset

Table 4.21: Rule set extracted using SVR+DENFIS (Case-P) for Body Fat dataset (5 features)

Table 4.22: Average RMSE obtained using all and reduced features of Boston Housing dataset

Table 4.23: Sample Rule set extracted using SVR+CART (Case-P) for Boston Housing dataset (all features)

Table 4.24: Average RMSE obtained using all and reduced features of Forest Fires dataset

Table 4.25: Sample Rule set extracted using SVR+CART (Case-P) for Forest Fires dataset (7 features)

Table 4.26: Average RMSE obtained using all and reduced features of Pollution dataset

Table 4.27: Rule set extracted using SVR+CART (Case-P) for Pollution dataset (all features)

Table 5.1: Average results obtained using original unbalanced data

Table 5.2: Average results obtained using 25% under-sampled data

Table 5.3: Average results obtained using 50% under-sampled data

Table 5.4: Average results obtained using 100% over-sampled data

Table 5.5: Average results obtained using 200% over-sampled data

Table 5.6: Average results obtained using 300% over-sampled data

Table 5.7: Average results obtained using 25% under+100% over sampled data

Table 5.8: Average results obtained using 50% under+200% over sampled data

Table 5.9: Average results obtained using SMOTE

Table 5.10: Rule set extracted using SMOTE data with reduced features

Table 5.11: Average fidelity of the proposed SVM+NBTree using Case-SP

Table 6.1: Feature Information of the Insurance data used (Pyle 1999)

Table 6.2: Feature Information of the pre-processed Insurance data

Table 6.3: Average Results of Churn Prediction SVM+NBTree (500 Extra Instances)

Table 6.4: Average Results of Churn Prediction SVM+NBTree (1000 Extra Instances)

Table 6.5: Average Results of Churn Prediction Feature Selection+SVM+NBTree (500 Extra Instances)

Table 6.6: Average Results of Churn Prediction Feature Selection+SVM+NBTree (1000 Extra Instances)

Table 6.7: Rule Extracted for Churn Prediction using NBTree (Reduced Features)

Table 6.8: Average Fidelity for Churn Prediction SVM+NBTree

Table 6.9: Average Results of Insurance Fraud Detection using SVM+NBTree (500 Extra Instances)

Table 6.10: Average Results of Insurance Fraud Detection using SVM+NBTree (1000 Extra Instances)

Table 6.11: Average Results of Insurance Fraud Detection Feature Selection+SVM+NBTree (500 Extra Instances)

Table 6.12: Average Results of Insurance Fraud Detection Feature Selection+SVM+NBTree (1000 Extra Instances)

Table 6.13: Rules Extracted for Insurance Fraud Detection using NBTree (reduced features)

Table 6.14: Average Fidelity for Insurance Fraud Detection SVM+NBTree

Table 6.15: Average Results of Churn Prediction using SVM+DT (500 Extra Instances)

Table 6.16: Average Results of Churn Prediction using SVM+DT (1000 Extra Instances)

Table 6.17: Average Results of Churn Prediction Feature Selection+SVM+DT (500 Extra Instances)

Table 6.18: Average Results of Churn Prediction Feature Selection+SVM+DT (1000 Extra Instances)

Table 6.19: Sample Rules Generated for Churn Prediction using DT with reduced features

Table 6.20: Average Fidelity for Churn Prediction SVM+DT

Table 6.21: Average Results of Insurance Fraud Detection using SVM+DT (500 Extra Instances)

Table 6.22: Average Results of Insurance Fraud Detection using SVM+DT (1000 Extra Instances)

Table 6.23: Average Results of Insurance Fraud Detection Feature Selection+SVM+DT (500 Extra Instances)

Table 6.24: Average Results of Insurance Fraud Detection Feature Selection+SVM+DT (1000 Extra Instances)

Table 6.25: Rules Extracted for Insurance Fraud Detection using Decision Tree

Table 6.26: Average Fidelity for Insurance Fraud Detection SVM+DT

Table 7.1: Data set Information

Table 7.2: Division of the datasets into Training and Validation

Table 7.3: Average RMSE values by SVR for UCI benchmark datasets

Table 7.4: Average RMSE values using Auto MPG dataset

Table 7.5: Sample Rules Set using SVR + CART for Auto MPG dataset

Table 7.6: Average RMSE values using Body Fat dataset

Table 7.7: Sample Rules Set using SVR + CART for Body Fat dataset

Table 7.8: Average RMSE values using Boston Housing dataset

Table 7.9: Sample Rules Set using SVR + CART for Boston Housing dataset

Table 7.10: Average RMSE using Forest Fires dataset

Table 7.11: Rules Set using SVR + CART for Forest Fires dataset

Table 7.12: Average RMSE values using Pollution dataset

Table 7.13: Sample Rules Set using SVR + CART for Pollution dataset

Table C.1: Financial Ratios of Spanish Banks dataset

Table C.2: Financial Ratios of Turkish Banks dataset

Table C.3: Financial Ratios of US Banks dataset


Table C.4: Financial Ratios of UK banks dataset

Table C.5: Feature description of Auto MPG dataset

Table C.6: Feature description of Body Fat dataset

Table C.7: Feature description of Boston Housing dataset

Table C.8: Feature description of Forest Fires dataset

Table C.9: Feature description of Pollution dataset

Table C.10: Feature description of churn prediction dataset

Table D.1: Fuzzy Rules Extracted using Wine Dataset

Table D.2: Rule set extracted using SVR+CART (Case-P) for Auto MPG dataset (3 features)

Table D.3: Rule set extracted using SVR+CART (Case-P) for Boston Housing dataset (all features)

Table D.4: Rule set extracted using SVR+CART (Case-P) for Forest Fires dataset (7 features)

Table D.5: Rules Set using SVR + CART for Auto MPG dataset

Table D.6: Rules Set using SVR+DENFIS for Body Fat dataset

Table D.7: Rules Set using SVR+CART for Boston Housing dataset

Table D.8: Rules Set using SVR+DENFIS for Pollution dataset

List of Figures

Figure 1.1: Various Categories of Rule Extraction Approaches

Figure 1.2: AUC - Area Under ROC Curve

Figure 1.3: Experimental Setup

Figure 2.1: Active Learning Based Approach by Martens et al. (2009)

Figure 2.2: RulExSVM phases in a two-dimensional space. Taken from Fu et al. (2004)

Figure 2.3: SVM+Prototype Phases. Taken from Nunez et al. (2002)

Figure 2.4: Hyper-rectangle region generated for each cluster per class. Taken from Zhang et al. (2005)

Figure 2.5: Taxonomy of the literature on Rule Extraction from SVM

Figure 2.6: Classification of the proposed approaches for Rule Extraction from SVM

Figure 3.1: Work Flow Diagram of the proposed Hybrid Approach

Figure 4.1: First Phase of the proposed hybrid (Predictions of SVM/SVR)

Figure 4.2: Rule generation phase

Figure 5.1: Rule extraction using selected features of data

Figure 6.1: Architecture of the proposed rule extraction approach

Figure 7.1: Phase 1 of the proposed hybrid (Extraction of Support Vectors and SVR Predictions)

Figure 7.2: Phase 2 of the proposed hybrid (Rule Generation)

Figure A.1: VC Dimension Illustration

Figure A.2: Optimal Separating Hyperplane

Figure A.3: Margin for the Hyperplane

Figure A.4: How does the SVM choose the margin?

Figure A.5: Mapping Low Dimension Input Space to High Dimensional Feature Space

Figure A.6: Mapping into Non-Linear Feature Space

Figure A.7: The loss functions in (a, b and c) produce no sparseness in the support vectors; to address this issue, Vapnik proposed the loss function in (d) as an approximation to Huber’s loss function, which enables a sparse set of support vectors to be obtained

Figure A.8: The soft margin loss setting of Linear SVR

Figure B.1: Example of fuzzy partition by simple fuzzy grid with five linguistic values for each axis of the 2-D pattern space [0, 1] x [0, 1]

Figure B.2: Example of fuzzy partition of the 2-D pattern space [0, 1] x [0, 1] with “don’t care” as an antecedent fuzzy set

Figure B.3: Uniform crossover for antecedent fuzzy sets (* denotes a crossover position)

Figure B.4: Mutation for antecedent fuzzy sets (* denotes a mutation position)

Figure B.5: Example Decision Tree for Bankruptcy prediction in Banks

Figure B.6: Three possibilities for partitioning objects based on the splitting criterion, shown with examples

Figure B.7: Example rule set for forest fires data

Figure B.8: Splitting Algorithm of CART

Figure B.9: Fuzzy Inference System

Figure B.10: Adaptive Networks

Figure B.11 (a): Type-3 Fuzzy Reasoning

Figure B.11 (b): Equivalent ANFIS (Type-3 ANFIS)

Figure B.12: Two input Type-3 ANFIS with nine rules

Figure B.13: A brief clustering process using ECM with samples x1 to x9 in a 2-D space

illustration not visible in this excerpt

ABSTRACT

Although Support Vector Machines (SVMs) have been used to develop highly accurate classification and regression models in various real-world problem domains, their most significant barrier is that the SVM generates a black box model that is difficult to understand. The procedure for converting such opaque models into transparent models is called rule extraction. This thesis investigates the task of extracting comprehensible models from trained SVMs, thereby alleviating this limitation. The primary contribution of the thesis is the proposal of various algorithms that take a novel approach to the task of extracting comprehensible models, thereby overcoming this significant limitation of SVM. The basic contributions of the thesis are a systematic review of the literature on rule extraction from SVM, the identification of gaps in the literature, and novel approaches for addressing those gaps. The contributions are grouped under three classes: decompositional, pedagogical and eclectic/hybrid approaches. The decompositional approach is closely intertwined with the internal workings of the SVM. The pedagogical approach uses the SVM as an oracle to re-label training examples as well as artificially generated examples. In the eclectic/hybrid approach, a combination of these two methods is adopted.

The thesis addresses various problems from the finance domain, such as bankruptcy prediction in banks/firms, churn prediction in analytical CRM and insurance fraud detection. Apart from these, various benchmark datasets, such as Iris, Wine and WBC for classification problems, and Auto MPG, Body Fat, Boston Housing, Forest Fires and Pollution for regression problems, are also tested using the proposed approaches. In addition, rule extraction from unbalanced datasets as well as via active learning based approaches has been explored. For classification problems, rule extraction methods such as FRBS, DT, ANFIS, CART and NBTree have been utilized, while for regression problems, rule extraction methods such as ANFIS, DENFIS and CART have been employed. Results are analyzed using accuracy, sensitivity, specificity, fidelity, AUC and t-test measures. The proposed approaches demonstrate their viability in extracting accurate, effective and comprehensible rule sets in various benchmark and real-world problem domains, across classification and regression problems. Future directions have been indicated for extending the approaches to newer variants of SVM as well as to other problem domains.

Chapter 1 Introduction to Rule Extraction

Over the last three decades, data mining and machine learning have been remarkably successful in extracting interesting knowledge and hidden patterns from ever-growing databases. The ability to learn from examples is an important aspect of intelligence and has been an area of study for researchers in artificial intelligence, statistics, cognitive science, and related fields. Algorithms that are able to learn inductively from examples have been applied to numerous difficult, real-world problems of practical interest (Widrow et al., 1994; Langley and Simon, 1995). Inductive learning with comprehensibility is a central activity in the growing field of knowledge discovery in databases and data mining (Fayyad et al., 1996). Predictive accuracy and comprehensibility are the two main criteria for evaluating any learning system, and it is often observed that the learning method which constructs the model with the best predictive accuracy is not the one that produces the most comprehensible model. Artificial neural networks (ANN) and support vector machines (SVM), for example, are amongst the most successful machine learning techniques applied in the area of data mining (Byun and Lee, 2002; Burbidge et al., 2001; Cai et al., 2000; 2004; El-Naqa et al., 2002; Guo and Li, 2003; Joachims, 1999; Kalatzis et al., 2003), but they produce black box models that are difficult for the end user to understand. This thesis explores the following question: can we take an arbitrary, incomprehensible model produced by a learning algorithm and re-represent it (or closely approximate it) in a language that better facilitates comprehensibility?

1.1 Rule Extraction: Motivation

Support Vector Machines (SVMs) have proved to be a good alternative to other machine learning techniques, specifically for solving classification and regression problems. However, just like artificial neural networks (ANN), SVMs produce black box models that cannot explain the knowledge learnt during training. Comprehensibility is crucial in applications like medical diagnosis, security and bankruptcy prediction. The process of converting opaque models into transparent models is often called rule extraction. Using the extracted rules, one can certainly understand better how a prediction is made. Rule extraction from support vector machines follows in the footsteps of the earlier effort to obtain human-comprehensible rules from ANN in order to explain the knowledge learnt by the ANN during training. Much attention has been paid during the last decades to finding effective ways of extracting rules from ANN, whereas much less work has been reported on representing the knowledge learnt by SVM during training.

1.2 Rule Extraction: Significance

Andrews et al. (1995) describe the motivation behind rule extraction from neural networks. A brief review of their arguments will help to establish the aims and significance of rule extraction from SVMs.

1.2.1 Provision of user explanation capability

In symbolic artificial intelligence (AI), the term “explanation” refers to an explicit structure which can be used internally for reasoning and learning, and externally for the explanation of results to the user. Gallant (1988) observes that an explanation capability enables a novice user to gain insights into the problem at hand. Davis et al. (1977) argue that even limited explanation can positively influence acceptance of the system by the user. Traditionally, researchers have experimented with various forms of user explanation, in particular rule traces. Explanations based on rule traces, however, are too rigid and inflexible (Gilbert, 1989), because rules may not be equally useful to the user. Further, the granularity of explanation offered by rule traces is often inappropriate (Gilbert, 1989; Andrews et al., 1995).

1.2.2 Transparency

The creation of a “user explanation” capability is the primary objective for extracting rules from neural networks and SVMs, together with the provision of “transparency” of the internal states of a system. Transparency means that the internal states of the machine learning system are both accessible and interpretable unambiguously. Such a capability is mandatory if neural network or SVM based solutions are to be accepted into “safety-critical” problem domains such as air traffic control, operation of power plants, medical diagnosis, etc. (Andrews et al., 1995).

1.2.3 Data exploration

A learning system might discover salient features in the input data whose importance was not previously recognized (Craven and Shavlik, 1994).

1.3 Rule Extraction: A Taxonomy

A broad taxonomy for rule extraction from ANN has been introduced (Andrews et al., 1995; Tickle et al., 1998) which includes five evaluation criteria: translucency, rule quality, expressive power, portability and algorithmic complexity. These evaluation criteria are now commonly used for rule extraction from SVMs, in particular the translucency and rule quality metrics (Martens et al., 2009; Martens et al., 2006). Rule extraction from neural networks has previously been used almost exclusively to generate propositional rule sets (Hayward et al., 2000). While this is sufficient for many applications where rule sets can be effectively used, it is clearly desirable to provide a more general explanation capability.

A significant research effort has been expended in the last few decades to address the deficiency in the understandability of ANN (Saito and Nakano, 1988; Thrun, 1995; Craven, 1996; Jackson and Craven, 1996). Craven (1996) presented a complete overview of this research. The generally used strategy to understand a model represented by a trained neural network is to translate the model into a more comprehensible language (such as a set of if-then rules or a decision tree). This strategy is investigated under the rubric of rule extraction.

Craven (1996) defines the task of rule extraction from neural network as follows:

“Given a trained neural network and the data on which it was trained, produce a description of the network’s hypothesis that is comprehensible yet closely approximates the network’s prediction behaviour.”

Although the task of rule extraction has only been formally formulated in the context of interpreting neural networks, this formulation can be generalized to any other opaque model such as SVM.

Over the last few years, a number of studies on rule extraction from SVM have been introduced. The research strategy in these projects is often based on this idea: develop an algorithm for rule extraction based on the perception (or “view”) of the underlying SVM which is either explicitly or implicitly assumed within the rule extraction technique. In the context of rule extraction from neural networks the notion of “translucency” describes the degree to which the internal representation of the ANN is accessible to the rule extraction technique (Andrews et al. 1995; Tickle et al. 1998). Therefore, rule extraction algorithms are classified into three types: Decompositional, Pedagogical and Eclectic. Figure 1.1 shows the categorization of rule extraction algorithms in general and the methodology based on which this categorization is proposed.

Illustration not visible in this excerpt

Figure 1.1: Various Categories of Rule Extraction Approaches

1.3.1 Decompositional Approaches

A decompositional approach is closely intertwined with the internal workings of the SVM, namely the support vectors and the constructed hyperplane (Nunez et al., 2002a; 2002b; 2002c; Fung et al., 2005; Barakat and Bradley, 2007; Martens et al., 2009). Towell and Shavlik (1993) focused on extracting symbolic rules from trained feedforward neural networks. The rules thus extracted are more general and yielded superior performance compared to earlier approaches, and the authors concluded that their approach is capable of producing more human-comprehensible rules. Arbatli and Akin (1997) extracted rules from neural networks using a genetic algorithm.

Fu (1994) analyzed the ability of a certainty-factor-based activation function to improve network generalization performance from a limited amount of training data. He applied the proposed rule extraction approach to molecular genetics and concluded that it outperformed the standard C4.5 decision tree algorithm. Later, Craven and Shavlik (1996) proposed a learning-based rule extraction approach to provide an explanation capability for trained neural networks. The proposed algorithm is able to extract both conjunctive and M-of-N rules, and they concluded that it is more efficient than conventional search-based approaches.

1.3.2 Pedagogical Approaches

Pedagogical algorithms consider the trained model as a black box (Clark and Niblett, 1989; Craven and Shavlik, 1996; Martens et al., 2006). Instead of looking at the internal structure, these algorithms directly extract rules which relate the inputs and outputs of the SVM. These techniques typically use the trained SVM model as an oracle to label or classify artificially generated training examples that are later used by a symbolic learning algorithm. The idea behind these techniques is the assumption that the trained model can represent the data better than the original dataset.

Many approaches to rule extraction have set up the task as a search problem. This search problem involves exploring a space of candidate rules and testing individual candidates against the network to see if they are valid rules. One of the first rule-extraction methods, developed by Saito and Nakano (1988), employs a breadth-first search process to extract conjunctive rules in binary problem domains. Gallant (1988) developed a similar rule-extraction technique, which, like the method of Saito and Nakano, manages the combinatorics of searching for rules by limiting the search depth. The principal difference between the two approaches is the procedure used to test rules against the network. Unlike Saito and Nakano’s method, Gallant’s rule-testing procedure is guaranteed to accept only rules that are valid.

Thrun (1995) developed a method called validity interval analysis (VIA) that is a generalized and more powerful version of Gallant’s technique. VIA tests rules by propagating activation intervals through a network after constraining some of the input and output units. Thrun frames the problem of determining validity intervals as a linear programming problem. Search methods for rule extraction from neural networks work by finding those combinations of inputs that make a neuron active. By sorting the input weights to a neuron and ordering them suitably, it is possible to prune the search space. Using this concept, Krishnan et al. (1999) proposed a rule extraction approach which extracts crisp rules from the neural network. McGarry et al. (1999) proposed an approach to extract rules from radial basis function networks. Based on the neurons of the network, Fan and Li (2002) later extracted diagnostic rules from a trained feedforward neural network. They applied the rule extraction procedure to detecting a high-pressure air compressor’s (HPAC) suction and discharge valve faults from static measurements including temperatures and pressures of various stages of the compressor.

1.3.3 Eclectic/Hybrid Approaches

Some authors also consider a third category: eclectic rule extraction techniques, which incorporate elements of both the decompositional and pedagogical approaches (Nunez et al., 2006; Barakat and Diederich, 2005). Sato and Tsukimoto (2001) proposed a hybrid rule extraction approach, in which a decision tree algorithm is employed together with trained neural networks to generate rules. Campos and Ludermir (2005) presented the Literal and ProRulext algorithms to extract rules from trained artificial neural networks. Aliev et al. (2008) used a fuzzy recurrent neural network to extract rules for battery charging.

1.4 Rule Quality Criteria

The quality of the extracted rules is a key measure of the success of a rule extraction algorithm. Four rule quality criteria were suggested for rule extraction algorithms (Andrews et al., 1995; Tickle et al., 1998): rule accuracy, fidelity, comprehensibility and portability. In this context, a rule set is considered to be accurate if it can correctly classify previously unseen examples.

illustration not visible in this excerpt

When we deal with a two-class classification problem, specifically in the finance domain, we need to consider Sensitivity, Specificity and AUC as well.

illustration not visible in this excerpt

Positive samples refer to the class that is the object of the study. For example, if we are solving the churn prediction problem, then predicting churn is the object of the study; hence, instances related to churn are positive samples. Likewise, the negative samples in this example are the samples of the non-churner class, i.e. loyal customers.
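The omitted equations presumably correspond to the standard confusion-matrix definitions, where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives respectively:

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]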

A Receiver Operating Characteristic (ROC) graph (Fawcett, 2006) has long been used in signal detection theory to depict the trade-off between hit accuracies and false alarm accuracies of classifiers. The ROC graph is a two-dimensional graph in which each classifier is represented, based on its output results, as a point in a region with FP rate (100 - Specificity) on the X-axis and TP rate (Sensitivity) on the Y-axis. ROC provides a richer measure of classification performance than scalar measures such as accuracy, error rate or error cost. The higher the area under the ROC curve, the better the classifier (refer to Figure 1.2 below). The area under the ROC curve (AUC) of a classifier A can be calculated as the sum of the areas of triangle OAD, rectangle DAEH and triangle AEG. Likewise, the AUC of classifier B is the sum of the areas of triangle OBC, rectangle CBFH and triangle BGF. Thus, classifier B is superior to classifier A on the AUC criterion.

illustration not visible in this excerpt

Figure 1.2: AUC - Area Under ROC Curve
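The triangle-and-rectangle decomposition above is just the trapezoidal rule applied to a classifier's ROC points. A minimal sketch, assuming each classifier contributes a single interior operating point (function names are illustrative):

```python
def auc_from_points(fpr, tpr):
    """Area under a ROC curve given operating points, via the trapezoidal
    rule. `fpr` and `tpr` are equal-length lists sorted by increasing FPR,
    spanning (0, 0) to (1, 1)."""
    area = 0.0
    for i in range(1, len(fpr)):
        # Each segment contributes a trapezoid: width times average height.
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# A single-threshold classifier with FP rate 0.2 and TP rate 0.8 gives the
# ROC polyline (0,0) -> (0.2,0.8) -> (1,1): one triangle, one rectangle and
# one more triangle, exactly as in Figure 1.2.
print(auc_from_points([0.0, 0.2, 1.0], [0.0, 0.8, 1.0]))  # prints 0.8
```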

Similarly, a rule set is considered to display a high level of fidelity if it can mimic the behaviour of the machine learning technique from which it was extracted.

illustration not visible in this excerpt
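The omitted fidelity equation presumably corresponds to the commonly used definition: the fraction of examples on which the rule set and the SVM agree,

\[
\text{Fidelity} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[ R(x_i) = S(x_i) \right],
\]

where \(R(x_i)\) is the rule set's prediction, \(S(x_i)\) is the SVM's prediction and \(N\) is the number of examples.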

An extracted rule set is deemed to be consistent if, under different training sessions, the machine learning technique generates rule sets that produce the same classifications of unseen examples. Finally, the comprehensibility of a rule set is determined by measuring the size of the rule set (in terms of the number of rules) and the number of antecedents per rule. Expressive power refers to the type or the language of the extracted rules. Propositional (if-then) rules are the most commonly extracted rules; however, other types are also extracted, such as fuzzy rule sets (Faifer et al., 1999) and finite state machines (Giles and Omlin, 1993). Portability refers to the independence of a rule-extraction method from the ANN architecture and/or training method.

1.5 Thesis Overview

In this thesis, novel algorithms are presented and evaluated for the task of extracting comprehensible descriptions from a hard-to-understand learning system, i.e. the support vector machine. The hypothesis advanced by this research is that it is possible to develop algorithms for extracting symbolic descriptions from trained SVMs that: (i) produce more comprehensible, high-fidelity descriptions of trained SVMs using fuzzy logic approaches; (ii) scale to analyze medium-scale unbalanced data; (iii) work with various modifications of the training data, namely the Case-SA dataset, which comprises the support vectors with their corresponding actual target values, the Case-SP dataset, which comprises the support vectors with their corresponding predicted target values, and the Case-P dataset, which comprises the training set with the corresponding predicted target values; and (iv) extend to applications such as bankruptcy prediction, churn prediction in bank credit cards, insurance fraud detection and regression analysis. During this research work, various rule learning algorithms have been analyzed, such as the Fuzzy Rule Based System (FRBS), Decision Tree (DT), Classification and Regression Tree (CART), Adaptive Network based Fuzzy Inference System (ANFIS), Dynamic Evolving Fuzzy Inference System (DENFIS) and Naive-Bayes Tree (NBTree). Descriptions of these transparent techniques are given in Appendix B, and an overview of SVM/SVR and SVM-RFE is given in Appendix A. Table 1.1 below presents the various approaches proposed by us for extracting if-then rules from SVM.

Table 1.1: Various approaches proposed for Rule Extraction from SVM

Illustration not visible in this excerpt
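The three training-set variants named above are straightforward to construct once an SVM is trained. A minimal sketch, assuming scikit-learn (function and variable names are illustrative, not from the thesis):

```python
from sklearn.svm import SVC

def build_cases(X_train, y_train):
    """Construct the three training-set variants used throughout the thesis:
    Case-SA: support vectors with their actual target values,
    Case-SP: support vectors with the SVM's predicted target values,
    Case-P : the full training set with the SVM's predicted target values."""
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    X_sv = X_train[svm.support_]               # the support vectors
    case_sa = (X_sv, y_train[svm.support_])    # actual labels
    case_sp = (X_sv, svm.predict(X_sv))        # SVM-predicted labels
    case_p = (X_train, svm.predict(X_train))   # predictions on the full set
    return svm, case_sa, case_sp, case_p
```

A transparent rule learner (e.g. a decision tree) trained on one of these variants then yields the if-then rules that approximate the SVM.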

1.6 Datasets Used

Various datasets are analyzed during this research work which includes benchmark datasets, medical diagnosis dataset, finance datasets and regression datasets.

Benchmark datasets (UCI Machine Learning Repository): Iris and Wine.

Medical Diagnosis dataset (UCI Machine Learning Repository): Wisconsin Breast Cancer (WBC).

Finance datasets:

Bankruptcy prediction datasets include Spanish banks, Turkish banks, US banks and UK banks. The Spanish banks’ dataset is obtained from Olmeda and Fernandez (1997). The Spanish banking industry suffered its worst crisis during 1977-85, resulting in a total cost of 12 billion dollars. The ratios for the failed banks were taken from the last financial statements before bankruptcy was declared, and the data for the non-failed banks were taken from 1982 statements; the financial ratios considered are presented in Table C.1 in Appendix C. This dataset contains 66 banks, of which 37 went bankrupt and 29 were healthy.

The Turkish banks’ dataset is obtained from Canbas and Kilic (2005) and is available at http://www.tbb.org.tr/english/bulten/yillik/2000/ratios.xls. The Banks Association of Turkey published 49 financial ratios. Canbas and Kilic (2005) applied a univariate analysis of variance (ANOVA) test on these 49 ratios of the previous year for predicting the health of a bank in the present year, and found that only 12 ratios (refer to Table C.2 in Appendix C) act as early warning indicators with discriminating ability (i.e. significance level < 5% in ANOVA) between healthy and failed banks one year in advance. Among these variables, the 12th has some missing values, meaning that the data for some of the banks are not given, so we filled those missing values with the mean value of the variable, following the general approach in data mining (Fayyad et al. 1996). The resulting dataset contains 40 banks, of which 22 went bankrupt and 18 were healthy.

The US banks’ dataset contains 129 banks from Moody’s Industrial Manual, covering the period 1975-1982 during which the failed banks went bankrupt (Rahimian et al. 1996). This dataset includes 65 bankrupt banks and 64 healthy banks. The financial ratios pertaining to the US banks data are presented in Table C.3 of Appendix C.

The UK banks’ dataset is obtained from Beynon and Peel (2001). This dataset consists of 10 features and contains data on 30 bankrupt and 30 healthy banks. The financial ratios pertaining to the UK banks data are presented in Table C.4 of Appendix C.

The churn prediction data on bank credit card customers is obtained from a Latin American bank that suffered an increasing number of churns among its credit card customers and decided to improve its retention system. Two groups of variables are available for each customer, sociodemographic and behavioural, which are described in Table C.10 in Appendix C. The dataset comprises 22 variables: 21 predictor variables and 1 class variable. It consists of 14814 records, of which 13812 are nonchurners and 1002 are churners, i.e. 93.24% nonchurners and 6.76% churners. Hence, the dataset is highly unbalanced in terms of the proportion of churners versus nonchurners (Business Intelligence Cup 2004).

The automobile insurance fraud detection data was provided with the Angoss Knowledge Seeker software and is the only publicly available fraud detection dataset. Originally named “carclaims.txt”, it can be found on the compact disc accompanying Pyle (1999). This dataset contains 11338 records from January 1994 to December 1995, and 4083 records from January 1996 to December 1996. It has a 6% fraudulent and 94% legitimate distribution, with an average of 430 claims per month. The original dataset has 6 numerical features and 25 categorical features, including the binary class label (fraud or legal). A description of the original fraud detection dataset’s features is presented in Table 6.1.

Regression datasets (UCI Machine Learning Repository and StatLib repository): Auto MPG, Body Fat, Boston Housing, Forest Fires and Pollution.

1.7 Experimental Setup

The experiments in this thesis are carried out in a slightly different fashion. We first divided each dataset into two parts in an 80:20 ratio. The 20% portion is named the validation set and stored aside for later use, and the 80% portion is used as the training set. 10-fold cross validation was then performed on the 80% portion, i.e. the training data, for building the model and extracting rules. Later, the efficiency of the rules is evaluated against the validation set, i.e. the 20% of the original data. The empirical results presented in this thesis are average results on the validation set; the intermediate average results obtained during 10-fold cross validation are not presented. Figure 1.3 presents the experimental setup followed throughout the research work presented in this thesis.

Illustration not visible in this excerpt

Figure 1.3: Experimental Setup
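A minimal sketch of this protocol, assuming scikit-learn and a decision tree standing in for the rule learner (names and model choices are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def run_experiment(X, y, seed=0):
    # Hold out 20% as the validation set; it is never used for model building.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    val_scores = []
    for fold_idx, _ in folds.split(X_tr, y_tr):
        svm = SVC(kernel="rbf").fit(X_tr[fold_idx], y_tr[fold_idx])
        # Rule generation stage: a tree trained on the SVM's predictions
        # (the Case-P variant) stands in for the rule learner.
        tree = DecisionTreeClassifier().fit(
            X_tr[fold_idx], svm.predict(X_tr[fold_idx]))
        val_scores.append(tree.score(X_val, y_val))
    return np.mean(val_scores)  # average result on the validation set
```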

1.8 Thesis Outline

The remaining chapters of this thesis are organized as follows:

Chapter 2 provides a literature survey of the approaches proposed for extracting rules from SVM. The first section describes in detail the rule extraction approaches proposed in the decompositional, pedagogical and eclectic/hybrid categories. The next section details the gaps observed during the literature survey, and the final section provides an outline of the proposed methods.

Chapter 3 presents the proposed decompositional rule extraction approach for SVM, which is one of the main contributions of this thesis. The first section of the chapter provides the motivation behind extracting fuzzy rules for solving bankruptcy prediction problems. The second section presents the details of the proposed approach. The third section provides the literature survey of bankruptcy prediction. The datasets used and the results and discussions are presented in the following sections, and the final section presents the conclusions of the chapter.

Chapter 4 presents the proposed pedagogical rule extraction approach, which uses SVM-RFE as the feature selection algorithm and extracts rules using the reduced feature set. After the motivation in the first section, the proposed approach is presented in detail in the next section. Applications of the proposed approach to solving classification and regression problems are discussed in the section after that. The following section provides detailed results and discussions, and the final section concludes the chapter.

Chapter 5 presents an eclectic rule extraction technique which analyzes a medium-scale unbalanced dataset pertaining to finance. At the outset, the motivation for the proposed approach is presented. An introduction to customer relationship management (CRM) is then given, and the literature on the churn prediction problem is reviewed in the next section. The subsequent section presents the proposed approach in detail; the proposed approach can also be considered an application of analytical CRM. The fourth section details the literature on dealing with unbalanced datasets. The dataset description and the results and discussions are presented in the following sections, and the final section concludes the chapter.

Chapter 6 presents an eclectic, active learning based rule extraction approach which involves active learning to modify the training data. The opening section of the chapter presents the motivation behind the proposed approach. Section two then provides the details of the proposed approach and of active learning. The next sections present the details of the problems analyzed, followed by results and discussions. The final section concludes the chapter.

Chapter 7 presents an eclectic rule extraction approach proposed for solving regression problems. The first section presents the motivation behind the proposed approach. The proposed approach is then presented in detail in the second section. A brief description of the datasets used for the empirical study is presented in the next section. The fourth section discusses the results and their implications, and the final section concludes the chapter. Finally, Chapter 8 presents the overall conclusions and contributions of this thesis and proposes future work.

Chapter 2 Rule Extraction from SVM: An Introduction

This chapter provides a literature survey of the approaches proposed for extracting rules from SVM. The first section describes in detail the rule extraction approaches proposed in the decompositional, pedagogical and eclectic/hybrid categories. The next section details the gaps observed during the literature survey, and the final section provides an outline of the proposed methods.

2.1 Rule Extraction from SVM

In this section, an overview of rule extraction from SVM is presented. The rule extraction approaches are grouped into three categories based on the SVM components utilized for rule extraction. The first category of algorithms uses the support vectors only, i.e. decompositional approaches. The second treats the SVM as a black box and uses the developed SVM as an oracle; these are called pedagogical approaches. The third category utilizes the support vectors, the decision function and the training data as well, i.e. eclectic or hybrid approaches.

2.1.1 Decompositional Approaches

Methods in this group extract rules utilizing the SVM’s support vectors and the separating hyperplane. Different methods have been used to extract rules from support vectors as summarized in the following paragraphs.

Most recently, Martens et al. (2009) proposed an active learning based approach (ALBA) for rule extraction from SVM. They first extract the support vectors, and the distance between the support vectors and the actual training set is calculated. Additional training samples are then generated “close to” randomly selected support vectors, based on this distance. Class labels for these samples and for the training samples are then obtained using the developed SVM model. The generated examples are then used together with the training data to train different rule learning algorithms that learn what the SVM has learned. The steps involved in ALBA are presented in Figure 2.1. Since decision trees work better with larger numbers of samples, the authors argued that generating more samples near the support vectors, and using these samples in combination with the training data whose target class is replaced by the prediction of the SVM, will increase the performance of the decision tree algorithm. They employed a decision tree and RIPPER to generate rules.

The main drawback of their approach is that ALBA is not feasible for real-life data mining problems, where the class distribution is unbalanced and the dataset size is medium or large scale. The approach also results in high complexity of the system and of the rule set, because the generated instances are used together with all the training instances, thereby increasing the number of instances from which the rule induction algorithm has to learn.

Illustration not visible in this excerpt

Figure 2.1: Active Learning Based Approach by Martens et al. (2009)
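A simplified sketch of the ALBA idea, assuming scikit-learn; the noise-based sampling below stands in for ALBA's distance-based example generation, and all names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def alba_like(X, y, n_extra=500, scale=0.1, seed=0):
    """Generate extra examples near randomly chosen support vectors, label
    everything with the trained SVM, and fit a tree on the result."""
    rng = np.random.default_rng(seed)
    svm = SVC(kernel="rbf").fit(X, y)
    sv = X[svm.support_]
    # New points: a randomly picked support vector plus small feature-wise noise.
    picks = sv[rng.integers(0, len(sv), size=n_extra)]
    X_extra = picks + rng.normal(0.0, scale * X.std(axis=0), picks.shape)
    X_all = np.vstack([X, X_extra])
    y_all = svm.predict(X_all)          # the SVM acts as the oracle
    return DecisionTreeClassifier().fit(X_all, y_all)
```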

One of the most recent methods in this group, termed SQRex-SVM (Barakat and Bradley, 2007), extracts rules directly from the support vectors of an SVM using a modified sequential covering algorithm, based on an ordered search of the most discriminative features as measured by inter-class separation. An initial set of rules is formed from these features and then pruned using user-defined criteria; the method utilizes the concept of coverage, or probabilistic normal (PN) space (Furnkranz and Flach, 2005), and its relationship to the area under the receiver operating characteristic curve (AUC).

Steps involved in SQRex-SVM algorithm:

SVM training

Train an SVM and get the SVM model.

Rule Generation:

- Use the SVM to predict the class of the training examples that become SVs.
- Get the True Positive and True Negative model SVs, then define a ranked list of the most discriminative features based on inter-class separation (a sketch of this step is given after these steps).
- Obtain best threshold for each of these features and rank them based on True Positive and False Positive rates over the support vectors.
- Formulate rules for the positive class from the ranked features and the corresponding thresholds in earlier step.
- Sequentially, learn/refine rules if possible and add them to rule set if their performance is above user defined criterion.

Rule Pruning

Prune the rules that do not lead to a significant increase in AUC of the whole rule set.
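A sketch of the feature-ranking step only, under the assumption that features are scored by a simple inter-class separation statistic over the support vectors; the score shown is illustrative, not necessarily the exact measure used by Barakat and Bradley (2007):

```python
import numpy as np

def rank_features_by_separation(X_sv, y_sv):
    """Rank features by how well they separate the two classes over the
    support vectors: absolute mean difference divided by the summed
    per-class standard deviations (an illustrative separation score)."""
    pos, neg = X_sv[y_sv == 1], X_sv[y_sv != 1]
    score = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / (
        pos.std(axis=0) + neg.std(axis=0) + 1e-12)
    return np.argsort(score)[::-1]   # most discriminative feature first
```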

The performance of the rules is evaluated in terms of accuracy, fidelity, comprehensibility, True Positive rate, False Positive rate and AUC. Rules produced by SQRex-SVM exhibit both good generalisation performance and comprehensibility; in addition, the extracted rules have high fidelity to the SVM from which they were extracted.

Chaves et al., (2005) proposed a method of extracting fuzzy rules from SVMs. The main idea of their approach is to project each feature, in each of the support vectors along its coordinate axes, forming a number of fuzzy sets with triangular membership functions of the same length (Kecman, 2001). Fuzzy membership degrees are then computed and each of the support vectors is then assigned to the fuzzy set with the highest membership degree. One fuzzy rule is then extracted from each support vector.

An example fuzzy rule is given below:

illustration not visible in this excerpt
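Although the original figure is not shown in this excerpt, such rules have the general form (an illustrative reconstruction based on the description above and the steps below):

IF x1 is Low AND x2 is Medium AND ... AND xn is High THEN class = C

where each antecedent fuzzy set is the one in which the corresponding feature projection of the support vector attains its maximum membership degree, and C is the class label of that support vector.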

The steps involved in the rule extraction approach of Chaves et al. (2005) are described below.

SVM training

Train an SVM and get the support vectors.

Rule Generation

- Project each feature in the support vectors along its coordinate axes.
- Define fuzzy sets for each feature using triangular membership function.
- Evaluate the membership degree for each of the support vector projections and assign it to the set with the maximum membership value (see the sketch after this list).
- Generate a rule from each support vector with each feature and fuzzy set of maximum membership value as antecedent and the support vector class label as the consequent.
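A minimal sketch of the membership-assignment step, assuming features are scaled to [0, 1] and evenly spaced triangular fuzzy sets (names and the number of sets are illustrative):

```python
import numpy as np

def triangular_memberships(v, n_sets=5):
    """Membership degrees of a scalar feature value v (scaled to [0, 1]) in
    n_sets evenly spaced triangular fuzzy sets, e.g. Very Low .. Very High."""
    centers = np.linspace(0.0, 1.0, n_sets)
    half_width = 1.0 / (n_sets - 1)
    return np.maximum(0.0, 1.0 - np.abs(v - centers) / half_width)

mu = triangular_memberships(0.62)
print(mu.argmax())  # index of the fuzzy set with maximum membership (here 2)
```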

One limitation of this algorithm is that it may not be suitable for categorical and binary features, as the evaluation of the membership degree value is non-trivial in these cases. The rules extracted by this method appear to be of poor quality based on the reported accuracy (53.2%) and fidelity (74.59%). Furthermore, the comprehensibility of the extracted rule set is low, as the number of extracted rules is large: as many as the number of support vectors. However, the authors argue that if the objective is explanation, then poor classification performance is acceptable.

Fu et al. (2004) suggested RulExSVM, a method for rule extraction from non-linear SVMs trained with a radial basis function (RBF) kernel. The idea behind RulExSVM is to find hyper-rectangles whose upper and lower corners are defined by the intersections of lines extended from each of the support vectors with the separating hyperplane.

RulExSVM proceeds in three phases: initial rule generation, tuning and rule set pruning. In the initial phase, hyper-rectangles are defined by the intersections of lines extended from each of the support vectors with the SVM decision boundary. In the tuning phase, these hyper-rectangles are resized to exclude outliers from other classes. Finally, redundant rules are removed in the pruning phase. One limitation of this approach is that it is only valid for rule extraction from RBF kernel SVMs; the authors argue that the method could be extended to other types of kernels, though they provide no framework for this. Another potential limitation is the requirement to normalize all features to lie between (0, 1) before training. Normalization should be handled with care in applications such as medical diagnosis, because some normalization methods may be susceptible to noise in the data. Furthermore, it adds an extra burden on the rule extraction process, as the features have to be transformed back to their original values for the extracted rules to make sense.

Figure 2.2 A shows the lines extended from support vectors A1 and A2, and the hyper-rectangles defined by their intersections with the SVM decision boundary. Figure 2.2 B shows the tuning phase, in which the two hyper-rectangles are trimmed along the coordinate axes to exclude outliers from other classes, while Figure 2.2 C shows the pruning phase, which removes the overlapping hyper-rectangle (redundant rule).

illustration not visible in this excerpt

Figure 2.2 A: Rule generation phase: lines extended and hyper-rectangles constructed for A1 and A2 SVs

illustration not visible in this excerpt

Figure 2.2 B: Tuning Phase: Excluding Outliers

Figure 2.2 C: Pruning Phase: Removing Overlapped Hyper-Rectangles

Figure 2.2: RulExSVM phases in a two dimensional space. Taken from Fu et al. (2004).
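The geometric core of the initial phase can be sketched as follows: from a support vector, a line is extended along one coordinate axis, and its intersection with the decision boundary is located by bisection on the sign of the SVM decision function. This is a minimal illustration under assumed toy data and kernel settings, not the authors' complete algorithm.

    import numpy as np
    from sklearn.svm import SVC

    # Toy RBF-kernel SVM on data in [0, 1]^2 (RulExSVM assumes normalized inputs).
    rng = np.random.default_rng(1)
    X = rng.random((80, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.8).astype(int)
    svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

    def boundary_crossing(sv, axis, direction, lo=0.0, hi=1.0, tol=1e-4):
        # Bisection along one coordinate axis from a support vector to the
        # point where the decision function changes sign, i.e. the
        # intersection with the decision boundary (None if never crossed).
        def f(t):
            p = sv.copy()
            p[axis] = t
            return svm.decision_function(p.reshape(1, -1))[0]
        a = sv[axis]
        b = hi if direction > 0 else lo
        if np.sign(f(a)) == np.sign(f(b)):
            return None
        while abs(b - a) > tol:
            m = 0.5 * (a + b)
            if np.sign(f(m)) == np.sign(f(a)):
                a = m
            else:
                b = m
        return 0.5 * (a + b)

    sv = svm.support_vectors_[0]
    crossing = boundary_crossing(sv, axis=0, direction=+1)
    print("support vector:", sv, "-> boundary crossing on axis 0:", crossing)

Two such crossings per axis bound one edge of a candidate hyper-rectangle; repeating over all axes yields the corners used in the initial rule generation phase.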

Nunez et al., (2002a, 2002b, 2002c; 2006) introduced the SVM+prototype method, the first rule extraction method of this kind. The main idea of this method is to use a clustering algorithm to determine prototype vectors for each class, which are then combined with the support vectors to define regions (ellipsoids and hyper-rectangles) in the input space. Each ellipsoid is then translated into a rule. Figure 2.3 A shows the first iteration of the algorithm, in which an initial ellipsoid (hyper-rectangle) is defined that includes outliers (a positive partition test). Figure 2.3 B shows a later iteration, after the division of the ellipsoid (hyper-rectangle) to exclude the outliers.

illustration not visible in this excerpt

Figure 2.3 A: One Region (Cluster) including Outliers Figure 2.3 B: Region Division to Exclude Outliers

Figure 2.3: SVM+Prototype Phases. Taken from Nunez et al., (2002)
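One simple way to realize the core idea is sketched below: a clustering prototype serves as the centre of a region, and the farthest support vector of the class serves as one of its vertices, yielding one interval rule per class. The toy data, the single-cluster setting and the hyper-rectangle construction are illustrative assumptions; the iterative partition test of Nunez et al. is omitted.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X = rng.random((100, 2))
    y = (X[:, 0] > X[:, 1]).astype(int)
    svm = SVC(kernel="rbf").fit(X, y)

    rules = []
    for cls in np.unique(y):
        Xc = X[y == cls]
        # One prototype per class via clustering (a single k-means centre here).
        prototype = KMeans(n_clusters=1, n_init=10, random_state=0).fit(Xc).cluster_centers_[0]
        # Support vectors of this class define the extent of the region.
        svs = svm.support_vectors_[y[svm.support_] == cls]
        far = svs[np.argmax(np.linalg.norm(svs - prototype, axis=1))]
        # Axis-aligned hyper-rectangle: prototype as centre, farthest support
        # vector as one vertex -> one interval rule per class.
        half = np.abs(far - prototype)
        lo, hi = prototype - half, prototype + half
        conds = " AND ".join(f"{lo[j]:.2f} <= x{j} <= {hi[j]:.2f}" for j in range(X.shape[1]))
        rules.append(f"IF {conds} THEN class = {cls}")

    for r in rules:
        print(r)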

However, the algorithm has two main limitations. The first is that the number and quality of the extracted rules depend upon the initial values of the prototype vectors, which define the centers of the clusters. The second is that the method does not scale well: with a larger number of input features and/or training examples, the comprehensibility of the extracted rule set deteriorates, as more clusters are likely to be formed while all features are present as rule antecedents (Martens et al., 2009).

In a similar study, Zhang et al., (2005) proposed hyper-rectangle rule extraction (HRE), which also utilizes a clustering algorithm. The algorithm first employs the support vector clustering algorithm (Ben-Hur et al., 2001) to find prototype vectors for each class. Small hyper-rectangles are then defined around these prototypes using a nested generalized exemplar algorithm, and are incrementally grown until a stopping criterion is met. The stopping criterion controls the size of the hyper-rectangles so that quality rules are generated.

illustration not visible in this excerpt

Figure 2.4 A shows the small hyper-rectangles initially defined around the prototype vectors, while Figure 2.4 B shows the grown hyper-rectangles at a later iteration. Given a user-defined Minimum Confidence Threshold (MCT), the algorithm stops growing a hyper-rectangle when any of the following criteria is met:

1. All examples of a cluster are covered by a hyper-rectangle and the confidence is less than the MCT, as in the r1 hyper-rectangle in Figure 2.4 B;
2. The hyper-rectangle covers a support vector of the cluster, as in the r4 hyper-rectangle in Figure 2.4 B;
3. The hyper-rectangle covers a training vector from a different class and the confidence is less than the MCT, as in the r2 hyper-rectangle in Figure 2.4 B;
4. The hyper-rectangle covers a prototype of a different cluster, as in the r3 hyper-rectangle in Figure 2.4 B.

illustration not visible in this excerpt

2.4 A: Iteration “1”

2.4 B: Iteration “n” showing different stopping criteria

Figure 2.4: Hyper-rectangle regions generated for each cluster per class. Taken from Zhang et al. (2005).
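The growth loop with a confidence-based stopping rule can be sketched as follows. This minimal illustration implements only the MCT-based criteria (criteria 1 and 3 above); the support vector and foreign-prototype checks are omitted, and the prototype is a simple class mean rather than the result of support vector clustering.

    import numpy as np

    def grow_hyper_rectangle(prototype, X, y, cls, mct=0.95, step=0.02, max_iter=50):
        # Grow an axis-aligned box around a class prototype, stopping as soon
        # as the confidence of the covered region falls below the MCT
        # (a simplified stand-in for the stopping criteria of Zhang et al.).
        half = np.full_like(prototype, step)
        for _ in range(max_iter):
            trial = half + step
            covered = np.all(np.abs(X - prototype) <= trial, axis=1)
            if covered.any():
                confidence = np.mean(y[covered] == cls)
                if confidence < mct:
                    break                 # confidence dropped below the MCT
            half = trial
        return prototype - half, prototype + half

    rng = np.random.default_rng(3)
    X = rng.random((200, 2))
    y = (X[:, 0] + X[:, 1] > 1).astype(int)
    prototype = X[y == 1].mean(axis=0)    # illustrative prototype for class 1
    lo, hi = grow_hyper_rectangle(prototype, X, y, cls=1)
    print("rule: IF", " AND ".join(f"{lo[j]:.2f} <= x{j} <= {hi[j]:.2f}" for j in range(2)),
          "THEN class = 1")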

In another study, Fung et al., (2005) suggested a different approach for rule extraction from linear SVMs, based on a linear programming formulation of the SVM. The approach handles rule extraction as multiple constrained optimization problems that generate a set of non-overlapping rules. Each extracted rule defines a non-empty hyper-cube with one vertex lying on the separating hyperplane. The rule extraction algorithm iterates over the training data in each half-space, using a depth-first search to find rules for the examples not yet covered by any previous rule.

One limitation of this approach is that the extracted rules cover only the training data, so they may not provide an explanation for new, unseen data. However, the authors suggest modifying the volume-maximisation rule extraction method to generate a single rule of maximum volume that may contain (generalize over) the test examples.

In a related study, Chen and Wei (2007) proposed a novel and interesting rule extraction algorithm for gene expression data to improve the comprehensibility of SVMs. The algorithm is one component of a multiple kernel SVM (MK-SVM) scheme consisting of feature selection (MK-SVM I), prediction modelling and rule extraction (MK-SVM II). In the feature selection module, a new single-feature kernel (MK I) is proposed, which transforms the computationally expensive feature selection problem into that of finding sparse coefficients representing the weights of the single-feature kernels. Features with zero coefficients have no impact on the output of the SVM and can be discarded.

The rule extraction method proposed for MK-SVM II is similar to the one addressed in Fung et al. (2005). However, in MK-SVM II, support vectors are used as vertices of hyper-cubes, where a series of hyper-cubes approximates the subspace of each class. In addition to the comprehensibility of the rules, the authors suggested the following rule quality measures: soundness, which measures the number of times a rule fires correctly; completeness, which measures the number of times a sample is correctly classified by a specific rule; and false-alarm, which measures the number of times a rule misfires (Chen and Wei, 2007). The rules extracted by this method show good comprehensibility and generalization performance, and compact gene subsets were found. Furthermore, the authors argue that the multiple good diagnostic gene subsets found may be useful in defining possible pathways of the genetic network.
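These measures can be computed for a single rule as sketched below. The counting definitions are one plausible reading of Chen and Wei (2007), and the rule and data are hypothetical.

    import numpy as np

    def rule_quality(fires, y_true, cls):
        # fires: boolean mask of samples on which the rule (predicting `cls`) fired.
        correct = fires & (y_true == cls)
        soundness = int(correct.sum())                      # times the rule fired correctly
        completeness = correct.sum() / max((y_true == cls).sum(), 1)
        false_alarm = int((fires & (y_true != cls)).sum())  # times the rule misfired
        return soundness, completeness, false_alarm

    # Hypothetical rule "IF x0 > 0.5 THEN class = 1" on toy data:
    rng = np.random.default_rng(4)
    X = rng.random((50, 2))
    y = (X[:, 0] + X[:, 1] > 1).astype(int)
    print(rule_quality(X[:, 0] > 0.5, y, cls=1))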

2.1.2 Pedagogical Approaches

Methods in this group treat the SVM as a black box and extract rules that describe the relationship between the model’s inputs and outputs (Barakat and Diederich, 2004a, 2004b; Torres and Rocco, 2005; Martens et al., 2006).

The idea is to create artificially labelled samples in which the target class of the training data is replaced by the class predicted by the SVM. The modified training set is then used with another machine learning technique with explanation capability, such as a decision tree learner, which thereby learns what the SVM has learned.

The main steps involved in such techniques are:

SVM Training

Train an SVM and get the SVM model.

Rule Generation

- Use the SVM to predict the class of the training set examples, thereby generating an artificial dataset (AD).
- Use the artificial dataset (AD) to train a C5 decision tree, or any other algorithm that generates a comprehensible model. Rules representing the concepts learned by the SVM are thus generated.

Rule Testing

Use the generated rules to classify another independent test set to compute rule accuracy and fidelity.
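The pipeline above can be sketched with scikit-learn as follows. CART (DecisionTreeClassifier) stands in for C5, which has no scikit-learn implementation, and the dataset is synthetic.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=400, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # SVM training: the black box whose behaviour is to be described.
    svm = SVC(kernel="rbf").fit(X_train, y_train)

    # Rule generation: relabel the training set with the SVM's predictions
    # (the artificial dataset AD) and fit a comprehensible learner on it.
    ad_labels = svm.predict(X_train)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, ad_labels)

    # Rule testing: accuracy against true labels, fidelity against the SVM.
    tree_preds = tree.predict(X_test)
    accuracy = np.mean(tree_preds == y_test)
    fidelity = np.mean(tree_preds == svm.predict(X_test))
    print(f"accuracy={accuracy:.3f}  fidelity={fidelity:.3f}")
    print(export_text(tree))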

The most commonly used decision tree learners are modified TREPAN (Browne et al., 2004), C5 (Quinlan, 1993), REX (Markowska-Kaczmar and Trelak, 2003) and CART (Breiman et al., 1994). The methods in this category are not classifier-specific, as they do not utilize any model-specific components in rule extraction; the aim is simply to model the output of the original classifier using another classifier with better comprehensibility. It is observed that rules of relatively high accuracy and high fidelity are obtained using these methods.

In a related study, He et al., (2006) suggested SVM_DT, a method for interpreting predictions of protein secondary structure. The proposed SVM_DT method is motivated by the need to combine the good generalization performance of SVMs with the comprehensibility of decision trees. In particular, the method utilizes an SVM as a preprocessing step for a decision tree. The authors argue that using an SVM as a preprocessing step can generate better and/or cleaner data than the original dataset, in which noisy and weak cases are reduced. This argument holds provided that optimal SVM training parameters can be found. The authors also argue that the extracted rules have biological meaning and can be interpreted (He et al., 2006).

2.1.3 Eclectic/Hybrid Approaches

Methods in this group extract rules by utilizing the internal workings of the SVM while also treating it as a black box. In a typical eclectic approach, an SVM model is first developed, its predictions for the support vector instances are obtained, and only these relabelled support vectors are then used to generate rules.

Barakat and Diederich, (2005) proposed an eclectic approach for rule extraction from SVMs. They used only the support vectors of the data and replaced the target values of the support vectors with the predictions of the SVM model. The modified data are then used to train a decision tree algorithm, from which the rules are generated. They reported results on benchmark datasets only and concluded that using support vector instances reduces both the dataset size and the number of rules extracted. Later, Barakat and Bradley (2006) reported AUC as one of the important measures for evaluating the efficiency of a rule extraction algorithm.
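A minimal sketch of this eclectic scheme is given below; CART stands in for the decision tree learner, and the data are synthetic.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=400, n_features=5, random_state=1)

    svm = SVC(kernel="rbf").fit(X, y)

    # Keep only the support vectors and replace their targets with the
    # SVM's own predictions, then learn rules from this reduced set.
    X_sv = X[svm.support_]
    y_sv = svm.predict(X_sv)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_sv, y_sv)

    print(f"{len(X_sv)} support vectors out of {len(X)} training examples")
    print(export_text(tree))

The reduction from the full training set to the support vectors is what shrinks both the dataset and the resulting rule set, as the authors report.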

[...]
