# XGBoost. The Extreme Gradient Boosting for Mining Applications

Technical Report 2017 52 Pages

## Excerpt

## Table of Contents

CHAPTER I

Theoretical Foundations

1.1 Outline

1.1.1 AdaBoost

1.1.2 Gradient boosting

1.1.3 XGBoost

1.1.5 Comparison of Boosting Algorithms

1.1.6 Loss Functions in Boosting Algorithms

1.2 Motivation

1.3 Problem Statement

1.4 Scope and Main Objectives

1.5 Impact to the Society

1.6 Organization of the Book

CHAPTER II

Literature Review

2.1 History

2.2 XGBoost

2.3 Random Forest

2.4 AdaBoost

2.5 Loss Function

CHAPTER III

Proposed Work

3.1 Outline

3.2 Proposed Approach

3.2.1 Objective of XGBoost

3.2.2 Parameters

3.2.3 Parameters for Tree Booster

3.2.4 Learning Task Parameters

3.2.5 Training Parameter tuning

3.2.6 What XGBoost Brings to the Table

3.2.7 Square Logistics Loss Function (SqLL)

CHAPTER IV

Results Discussions

4.1 Outline

4.2 Dataset

4.3 Tools and Platforms

4.4 Feature Construction

4.5 Feature Selection

4.6 Training the Model

4.7 Evaluation Techniques

4.8 Analysis of Results

CHAPTER V

Summary, Recommendations, and Future Directions

5.1 Overview

5.2 Summary

5.3 Recommendations

5.4 Future Research Directions

References

## List of Figures

Abbildung in dieser Leseprobe nicht enthalten

## List of Tables

Abbildung in dieser Leseprobe nicht enthalten

## List of Acronyms

Abbildung in dieser Leseprobe nicht enthalten

## Abstract

Tree boosting has empirically proven to be a highly effective and versatile approach for data-driven modelling. The core argument is that tree boosting can adaptively determine the local neighbourhoods of the model thereby taking the bias-variance trade-off into consideration during model tting. Recently, a tree boosting method known as XGBoost has gained popularity by providing higher accuracy. XGBoost further introduces some improvements which allow it to deal with the bias-variance trade-off even more carefully. In this research work, we propose to demonstrate the use of an adaptive procedure i.e. Learned Loss(LL) to update the loss function as the boosting proceeds. Accuracy of the proposed algorithm i.e. XGBoost with Learned Loss boosting function is evaluated using test/train method, K-fold cross validation, and Stratified cross validation method and compared with the state of the art algorithms viz. XGBoost, AdaBoost, AdaBoost-NN, Linear Regression(LR),Neural Network(NN), Decision Tree(DT), Support Vector Machine(SVM), bagging-DT, bagging-NN and Random Forest algorithms. The parameters evaluated are accuracy, Type 1 error and Type 2 error (in Percentages). This study uses total ten years of historical data from Jan 2007 to Aug 2017 of two stock market indices CNX Nifty and SP BSE Sensex which are highly voluminous.

Further, in this research work, we will investigate how XGBoost differs from the more traditional ensemble techniques. Moreover, we will discuss the regularization techniques that these methods offer and the effect these have on the models.

In addition to this, we will attempt to answer the question of why XGBoost seems to win so many competitions. To do this, we will provide some arguments for why tree boosting, and in particular XGBoost, seems to be such a highly effective and versatile approach to predictive modelling. The core argument is that tree boosting can be seen to adaptively determine the local neighbourhoods of the model. Tree boosting can thus be seen to take the bias-variance trade off into consideration during model fitting. XGBoost further introduces some improvements which allow it to deal with the bias-variance trade off even more carefully.

## Acknowledgement

I would like to thank first and foremost to God, who has been my strength during my whole life.

Then, I would like to thank my supervisor Prof (Dr) Ajay K Sharma for his guidance, inspiration, encouragement, and demonstrative support throughout this work. Without his continuous and extensive support, this work would never have gone this far.

I wish to thank my parents for their love and encouragement during my whole life. Special mention must be made to my sister and brother-in-law for their never ending support from starting till the end. I would like to express my gratefulness to my friend Akanksha for her invaluable support and encouragement.

Finally, I express my deepest gratitude and appreciation to my husband Neelabh and my daughters Shambhvi and Shreyanvi for their continuous support and understanding during my pursuit that made the completion of book possible. I gratefully dedicate this book to them for their love, care and support.

## CHAPTER I Theoretical Foundations

### 1.1 Outline

In the course of recent decades, machine learning and data mining have turned out to be one of the backbones of data innovation and with that, a somewhat central, although typically hidden, part of our life. With the constantly expanding amounts of data getting to be noticeably accessible there is justifiable reason to consider that information mining will turn out to be a significant element for technological advancements. Nowadays, machine learning and information mining have turned into a vital piece of human life. In all aspects of life, applications of these two are utilized. Example applications include fraud detection, e-mail protection, in recommender system it helps to find out the user taste product etc. (1).

Further, Boosting is the most widely used tool used in machine learning that improves the accuracy of prediction of various classification models. Boosting technique is an ensemble technique created by combing various weak learners to build a strong learner with higher precision. Weak learners are those indicators that give more precision than random guessing. However, strong learners are those classifiers that give maximum accuracy and hence coined as the base of machine learning. Boosting technique is employed when the dataset is large and high predictive power is the vital requisite of the application. Further, it is also used to reduce the bias and variance in the prediction models. However, the technique also solves the over-fitting problem for smaller dataset. Additionally, it has wide application area and applies on numerous classification techniques viz. feature selection, feature extraction, and multi-class categorization. The applications of boosting include medical area, text classification, page ranking and business and so on (2).

Furthermore, Boosting technique is a type of ensemble method, which is used when there is a collection of many weighted same or different type of predictors. However in this technique, a collection of several hypothesis is selected and eventually their prediction is combined. For example, if 50 decision trees are generated over same or different training data set then a new test dataset is created and voted for best classification.

To illustrate, a simple example of boosting is suppose we have to identify about insurance company i.e. fraud or not and following key points are there:

- Company has a high profit - Strong

- Queries are solved on the toll free number - Weak

- Company has a proper mail id and proper website - Strong

- It gives a proper receipt after paying the amount - Strong

- Large number of customers - Strong

- Behaviour with the customers - Weak

In the above query, there are four strong and two weak points and when we collect all the points. Then according to the majority it can be inferred that the company is not fraud and anyone can invest his money on this. Hence, boosting considers assigning weights to various points and then combining the results to predict the class. In addition, several boosting algorithms are already in place. The most widely used are:

1. Adaptive Boosting (AdaBoost)

2. Gradient Boosting

3. Extreme Gradient Boosting (XGBoost)

4. Random Forest

#### 1.1.1 AdaBoost

AdaBoost also known as Adaptive Boosting algorithm is proposed by Freund and Schapire (5). It is an ensemble learning technique where multiple weak learners are consolidated to create one strong learning algorithm. The algorithm starts by selecting a base classification algorithm (e.g. Naïve Bayes) and repetitively enhancing its prediction accuracy by draining the inaccurately classified samples in the training dataset. Initially, AdaBoost assigns same weights to all the training samples and selects a base classifier algorithm. Further, for every iteration, the base algorithm classifies the training samples and the weights of the inaccurate classified samples are increased. The algorithm iterates times, repeatedly applying base classification algorithm on the training dataset with new calculated weights. At the end, the final classification model is the weighted sum of the classifiers (6). Fig. 1 shows the AdaBoost algorithm, where *D* is the decision tree that is used to classify plus and minus point. Box 4 is the final model that correctly classifies the data points.

Abbildung in dieser Leseprobe nicht enthalten

Fig. 1.1 AdaBoost Example

#### 1.1.2 Gradient boosting

Gradient boosting is an effective off-the-shelf strategy for creating accurate models for classification problems. The technique has empirically proven itself to be highly effective for a vast array of classification and regression problems. As stated previously, gradient boosting is a variant of ensemble method, meaning that the prediction is consolidated from several simpler predictors. The aim of this method is to train a collection of decision tress, given the case that the training of single decision tree is known apriori. The technique is called “boosting” in light of the fact that we anticipate that combined results will provide more accuracy than a single learner. However, in this method the boosting is visualized as an optimization problem, where the objective of the technique is to minimize the loss of the classifier model by adding one weak learner at a time as done in a gradient descent. Gradient boosting is also called stage-wise additive classifier as a new weak learner is added at one time and the previously classified weak learners are left frozen i.e. unchanged for that iteration.

#### 1.1.3 XGBoost

XGBoost stands for extreme gradient boosting, developed by Tianqi Chen. Since its introduction in 2014, XGBoost has quickly become among the most popular methods used for classification in machine learning. It is an implementation over the gradient boosting. XGBoost is greedy in nature so it follows greedy approach. It has high performance and speed. XGBoost is a scalable system for learning Tree ensemble method. It is used for wide number of applications and it also supports outdoor memory (3). Due to parallel computation process it is faster than other boosting algorithms (4).The reason behind the higher performance of XGBoost is that it is scalable in nature. Additionally, it has following advantages over other algorithm.

- Due to parallel processing process it has faster performance than gradient boosting.

- It controls the over fitting problem.

- It gives a better performance result on many datasets.

- It is extreme gradient boosting.

- Basically it is a tree building algorithm.

- It is used for classification, regression and ranking with custom loss functions.

#### 1.1.4 Random Forest

Another ensemble technique that is widely used in Machine Learning is random forest. It is used in classification, regression and many more prediction problems. At training time, multiple decision trees are created and the output is the mean or average prediction of each trees. The algorithm is proposed by Tin Kam Ho (7). Random forest follows following steps:

- Using the bagging process sampling of training dataset takes place. It gives a

number of trees.

- Nodes are split according to some splitting criteria.

- Due to splitting criteria, data is divided into each node.

- Classification takes place on leaf node.

- After trained for trees, test data is sampled. Each sample is given to all trees.

- At the leaf node classification takes place.

- At last, the class of the test dataset is decided by majority voting or average process.

Furthermore, in random forest algorithm, the classifier shows low bias and high variance. Random Forest follows the parallel computation process. The algorithm that is used for training and testing process is bootstrapping. However, for very iteration, data is split into number of trees using bagging process. Bagging process divides the whole dataset and creates samples. Then, classification is done on these samples using decision trees. Further the classifier predicts the classes of samples and final class is predicted by the majority voting or it can be the simple average. In various situations, Random Forest gives more exact predictions when distinguished with simple Classification and Regression Tree (CART) models or regression models.

#### 1.1.5 Comparison of Boosting Algorithms

XGBoost and its entire variant exhibit the maximal performance among all the categories of boosting algorithms, however AdaBoost displays the minimal performance lower than random forest. This can be attributed to the computation process followed in the respective boosting algorithms. XGBoost and Random forest implement parallel computation process while Adaboost realize serial computation process. Hence, the performance also displays the same realization. In boosting, several weak learners are combined and prediction is given by one strong learner. The final prediction can hence, be made by weighted average or simple average of weak learners. In weighted average, more weight is given to a learner with higher classification accuracy. On the other hand, no weights are allocated to the weak learners in simple average. XGBoost and AdaBoost use the concept of weighted average while random forest consider the simple average of weak learners.

Further, in boosting various weak learners are consolidated and give a strong learner with higher accuracy. Therefore, bias and variance are considered important parameters to measure the accuracy of these algorithms. The better algorithm is the one which provides high bias and low variance. Both XGBoost and AdaBoost depict the same. But Random forest shows the opposite. Accuracy is also impacted by the cross validation of error. All the four algorithms implement the cross validation of error and hence are more accurate than single learner. To state the comparative analysis of the accuracy of all the four algorithms, accuracy of XGBoost is maximum and random forest shows the least accuracy amongst all. Over fitting of data occurs due to the branches involving noisy data or outliers. It is imperative to reduce the over fitting problem to enhance the accuracy of the learners. Pruning is done to remove the over fitting of data. The pruning can be done in two ways: Pre-pruning and post-pruning. Pre-pruning involves the avoidance of over fitting problem while post pruning removes the overfitted data after the learner is created. XGBoost and Random Forest avoids over fitting problem, Adaboost does not avoid the problem completely but it is less prone to over fitting.

In summary, XGBoost which is implemented on gradient boosting algorithm and follows the greedy approach is the best performing boosting algorithm in terms of computation performance and accuracy. A comparative analysis between the techniques is demonstrated in Table 1.1.

Table 1.1: Comparative Evaluation of Boosting Algorithms

Abbildung in dieser Leseprobe nicht enthalten

#### 1.1.6 Loss Functions in Boosting Algorithms

Every learning issue has a particular learning objective that have to achieve. A loss function is the cost of the error between the prediction g(b) and the observation a at the point b. Loss function is convex function. A convex function shows that there are no local minima. In every optimization problem, the main task is to minimize the loss or cost function. It may be its objective function also. And to minimize the conditional risk, a loss function is derived that is used for outliers.

Typically, a loss function is a value which is based on predicted and true value. And it is the difference between actual and predicted value. Loss function is the penalty for misclassified data points in any classification problem. It is used to analysis the performance of linear regression.

Furthermore, it classifies the data points with high accuracy and it is fit to outliers. Loss function always affects the accuracy. Loss function is convex in nature. Loss function gives two type of error one is positive part error and second is negative part error. Negative part error is always down the accuracy. Proof of its complexity is also explained in proposed methodology chapter.

In binary classification, loss function shows the minimum probability error but from the computational point it is difficult to tackle. Savageboost is faster than other and it is more robust on outliers than others. Moreover it has a faster convergence rate. (8)

Yet, Boosting is gradient decent algorithm. Loss function affects the boosting accuracy. Loss function is convex in nature. Loss function gives two type of error one is positive part error and second is negative part error. Negative part error is always down the accuracy. In this paper positive part truncation is proposed, positive part truncation makes strong to any boosting algorithm Many type of loss function is explained for example hinge loss is used in SVM, exponential loss is used in AdaBoost .It is valid only for global minimization. The loss function which gives local optimization is to be finding out. It increases the adaboost weights and selects the weight which gives minimum error and performance is better than other weights. Gross error sensitivity term is used in this paper Gross error sensitivity is related to outliers. Outliers are the error that is manually created in the dataset. Outliers always decrease the performance. The outliers are increased and performance is noticed on every time. For a given input parameter the conditional probability of class label is known as robust loss function. When sample size become infinite then it became the expected loss function. (9)

Further, output value that is based upon some input parameter is predicted for any classification problem. Relationship between these two is shown by some probability distribution. The main aim of this paper is to find out the classification error probability become very lesser. Classification error will be consistent under some condition and it has a fast convergence rate under less noise condition. Due to non-differentiability, its minimization problem is an N-P hard. Convex function is the base of the boosting algorithm.

In addition, It is based on greedy approach it would stop adding weak classifier when the error become minimum. A boosting algorithm is given for binary classification named direct boost. Ada is work sequential. Error that is given by ensemble classifier is confined by distribution of margin. In this relaxation term is used .Relaxation is used to overcome the problem when it is impossible to maximize the margin along coordinate direction.

After setting the positive and negative relaxation algorithm allow this on a single coordinate to change margin(10).

### 1.2 Motivation

Boosting is a representation on gradient decent algorithm for loss functions. Loss function maps a real time event to a number representing the cost associated with that event. The goal of any optimization problem is to minimize the loss function as much as possible. Loss function is the penalty for misclassified data points in any classification problem. The main objective of estimation is to find an objective function that models its input well i.e. it should be able to predict the values correctly. The loss function is a measure to quantify the amount of deviation between the predicted values and actual values.

Suppose that is observed, where is the value of input space and is the class label which takes 1 or -1 value. The collection of weak learners is denoted by ,where each learner gives class label for the input values. Strong learner is made by consolidated weak learners, where is given

by:

Abbildung in dieser Leseprobe nicht enthalten

(1.1)

The loss given by over sample is , where ! is continuous and differentiable function except finite points. It is also called growing functions, Loss function is given by:

Abbildung in dieser Leseprobe nicht enthalten

(1.2)

So, in simple term, a loss function is the cost of the error between the prediction $ % and the observation at the point %. Loss function is convex function. A convex function shows that there are no local minima. In every optimization problem, the main task is to minimize the loss or cost function. And to minimize the conditional risk, a loss function is derived that is used for outliers. When the main objective of the classifier model is to predict a real valued function, the squared loss error function lead to poor accuracy. However, the choice of a particular loss function plays an important role in generalizing the error in regression(11). The loss function is assumed to be directly proportional to the negative log of the residual density. A novel loss function named learned loss function considers this and uses a negative log of the residual density (12). This research work proposes the use of the learned loss function in extreme gradient boosting in modelling of stock markets.

### 1.3 Problem Statement

The aim of XGBoost is to give almost accurate result on any dataset. Moreover its performance is also much more than existing ones. A lot of research has been done in this context to develop the boosting algorithm. But the majority of research is focused on accuracy problem. Research on boosting algorithm is started in 1989s (11). Later with the development of boosting algorithm Yoav and Schapire give the Adaboost algorithm. In this algorithm, the final prediction is given by weighted average. Moreover the computation process followed by Adaboost is serial. Boosting is a type of ensemble method and another approach in ensemble method is random forest. But it follows the parallel computation process. In XGBoost both things are available one is parallel computation and other is weighted average. The capacity to do parallel computation in one single machine, as well as the implementation of a sparsity aware algorithm makes it 10 times faster than other implementations. XGBoost can costume objective functions and easily handle missing values. It allows to run cross-validation in each iteration of the algorithm. It has achieved state-of-the-art results in several data competitions and it is an end to end developed system. There is a community that improves the algorithm at Distributed (Deep) Machine Learning Community.

Also, we have to gather data with minimum error. And qualitative data is more useful than the quantitative data. Now, when combining the XGBoost with squared logistic loss function, it gives the minimum loss as compared to logistics loss.

### 1.4 Scope and Main Objectives

1. To increase the confidence of classification points using squared logistic loss (SqLL) function on Extreme Gradient Boosting for large dataset.

2. The main goal is to find out the ajthat minimize the

### 1.5 Impact to the Society

Gradient boosting is an important technique in the rapidly growing field known as predictive data modelling and is being applied in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, and remote sensing. This research work will be useful for those in the scientific community who gather data and seek tools for analyzing and interpreting data. It will be a valuable reference for scientists in a variety of disciplines and can serve as a reference for pattern recognition, image processing, and remote sensing.

### 1.6 Organization of the Book

We begin by describing the state-of-the-art in boosting framework, characteristics requirements, various applications, challenges, design, issues and classification of boosting in general for data mining applications in Chapter 1. The rest of the chapters are organized as under:

Chapter 2: In Chapter 2, we present our idea of boosting with the emphasis on literature review of prior work and present a requesting and basic scale of the publically done work that was published in literature concerned to point of the analysis.

Chapter 3: In this part detailed procedure of the proposed work is described along with the proposed algorithm to answer the picked problem. A detailed remark of a current work is likewise explained in this part

Chapter 4: Results are discussed in this chapter with a detailed description of the dataset, its analysis and sampling process.

Chapter 5: In this chapter conclusion of the whole work done for this research work is given and future work associated with the proposed algorithm is given.

## CHAPTER II Literature Review

### 2.1 History

Boosting is an algorithm that is widely used in machine learning field. Kearns and Valiant (1988, 1989) (12) (13) asked a question Can weak learners consolidated into a strong one? Boosting algorithm is a reply of this question. This reply is given by Robert Schapire's in a 1990 paper. (11) . So Kearns and Valiant has played an important role in the enlargement of boosting. (14).This algorithm fits a weak classifier to weighted versions of the data iteratively. At each iteration, the data is reweighted such that misclassified data points receive larger weights.

In boosting, several weak learners are combined and prediction is given by one strong learner. The final prediction can hence, be made by weighted average or simple average of weak learners. In weighted average, more weight is given to a learner with higher classification accuracy. On the other hand, no weights are allocated to the weak learners in simple average (12) (11).

In boosting system, various weak learners are consolidated and give a strong learner with higher accuracy. Therefore, bias and variance are considered important parameters to measure the accuracy of these algorithms. The better algorithm is the one which provides high bias and low variance.

In the first boosting hypothesis algorithm problem weak learners turns into a strong learner, which give more accuracy than the random guessing (12).

In this development of boosting algorithm is shown .When boosting come in machine learning field, then adaboost come and after that XGboost play an important role in this field. All related information regarding these algorithms is given in the Fig 1 .Finally; boosting became popular algorithm in this field.

The flow chart of history of boosting is given in the Figure 2.

Abbildung in dieser Leseprobe nicht enthalten

Fig. 1.1 History of Boosting

### 2.2 XGBoost

XGBoost is also well-known by extreme Gradient Boosting. XGBoost is an extension of gradient boosting by (Friedman, 2001) (Friedman et al., 2000). XGBoost has been successfully used in recent Kaggle competitions, usually as an integral part of the winning ensemble. It implements a variety of Gradient Boosting algorithms, including Generalized Linear Model(GLM) and Gradient Boosted Decision Tree (GBDT). The focus is on scalability. XGBoost differs from Random Forests mainly in the way it creates the tree ensemble. Trees do not have to be trained on a subset of the data or a subset of the features. The ensemble is build sequentially. In each round, k-trees are used to classify examples into classes. New trees focus on previously misclassified examples to improve the discriminative power of the ensemble.

**[...]**