Data mining techniques in financial fraud detection

Seminar Paper 2016 12 Pages

Computer Science - General



List of Abbreviations

List of tables

1. Introduction
1.1. Goals
1.2. Structure of seminar thesis

2. Terminology
2.1. Data Mining
2.2. Fraud
2.3. Financial Fraud
2.4. Insurance Fraud
2.5. Bank Fraud

3. Research methodology

4. Classification of Data Mining Applications

5. Literature Review

6. Conclusion


Print sources

Internet sources

List of tables

Table 1 - Research on data mining techniques in different fraud areas

1. Introduction

This seminar thesis gives an overview of data mining techniques in financial fraud detection. Financial fraud is a major and still growing economic problem. There is therefore great interest in detecting fraud, but the large amounts of data involved make this difficult. For this reason, data mining techniques are frequently used to detect fraudulent activities. The main fraud areas are Insurance, Banking, Health and Financial Statement Fraud. The most widely used data mining techniques include Support Vector Machines (SVM), Decision Trees (DT), Logistic Regression (LR), Naive Bayes, Bayesian Belief Networks, and Classification and Regression Trees (CART). These techniques have existed for many years and are used repeatedly to develop fraud detection systems or to analyze fraud.

1.1. Goals

The main objective of this study is to analyze the literature on data mining techniques for financial fraud detection, with a focus on the Insurance, Health, Banking and Financial Statement Fraud areas.

1.2. Structure of seminar thesis

First, an overview of the terminology relevant to this literature review is given. The next chapter presents the research methodology. Then the most common data mining applications used in the different fraud detection areas are classified and described. For this purpose, the literature was analyzed and the most common applications were chosen for this review. The review by Sharma, A., et al. [1], which presents the different data mining techniques used for financial statement fraud, serves as the basis for this literature review. This study extends the scope to Insurance, Banking and Health fraud in addition to Financial Statement fraud. After the classification of the most common data mining applications, the results of the literature review are presented. A framework is developed to present the main findings, including the fraud area, data mining application and data mining technique.

2. Terminology

2.1. Data Mining

A search for the term “Data Mining” shows that there is no single clear definition of it on the web. In fact, the term is not specific to a product, methodology, technology or practice. According to the Gartner Group [2], “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

2.2. Fraud

There is no universal definition of fraud. The authors of [2] describe different definitions. The first is from the Oxford English Dictionary, which defines fraud as “wrongful or criminal deception intended to result in financial or personal gain” [2]. The second is from Phua et al., who describe fraud as “leading to the abuse of a profit organization’s system without necessarily leading to direct legal consequences” [2].

2.3. Financial Fraud

As with fraud, there is no universal definition of financial fraud. Ngai et al. cite in their paper a definition from Wang et al., who define financial fraud as “a deliberate act that is contrary to law, rule, or policy with intent to obtain unauthorized financial benefit” [2].

2.4. Insurance Fraud

Insurance fraud is a type of financial fraud and is currently a popular research area. According to the Legal Dictionary, “Insurance Fraud occurs, when a person or entity makes false insurance claims in order to obtain compensation or benefits to which they are not entitled” [3]. Ngai et al. classify insurance fraud into the following three categories: healthcare insurance fraud, crop insurance fraud and automobile insurance fraud [2].

2.5. Bank Fraud

A search for the definition of bank fraud shows that this type of fraud is also defined in many different ways. The Legal Dictionary defines Bank Fraud as “The act of using illegal means to obtain money or other assets held by a financial institution.” [4]. Statistics on published works show that Bank Fraud is the most researched area [5]. The most common subcategory of Bank Fraud is credit card fraud.

3. Research methodology

The following procedure was used to create this literature review on data mining techniques in fraud areas. First, the topic was divided into individual keywords. The following keywords were used to find the relevant literature: Data Mining, Financial Fraud, Banking Fraud, Insurance Fraud, Healthcare Fraud, and Data mining techniques. Only journals from freely available online databases were used. The search was restricted to the following online databases: Google Scholar, ScienceDirect and Springer Link.

To find the most relevant literature, the search in the online databases was restricted to abstract, title and keywords. In addition, only articles that were published between 2010 and 2016 and that clearly described how the mentioned data mining techniques could be applied to assist in fraud detection were selected. The literature search yielded a total of 10 articles that are most relevant for this literature review.

4. Classification of Data Mining Applications

This chapter describes the most common classes of data mining applications. Each of the following application classes handles a different class of problems.

Classification: Classification is the most commonly applied data mining technique, which employs a set of pre-classified examples to develop a model that can classify the population of records at large [6]. According to [7], classification or prediction is the process of identifying a set of common features, and suggesting differentiating models that describe and distinguish data classes and concepts based on an example. The following example illustrates classification in simple terms:

A loan creditor must analyze the data to determine which applicants are “safe” and which can be classified as “risky”.

The most common classification techniques for fraud detection are Neural Networks (NN), Naive Bayes, decision trees (DT) and support vector machines (SVM).
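The loan example above can be sketched in code. The following is a minimal illustration, not a method taken from the reviewed literature: it fits a one-level decision tree (a "stump") on a single feature. The applicant data, the feature (monthly income) and the function names are hypothetical.

```python
def train_stump(records):
    """Fit a one-level decision tree on (feature, label) pairs.

    Returns the threshold with the fewest training errors for the rule
    "predict 'risky' if feature < threshold, otherwise 'safe'".
    """
    values = sorted({x for x, _ in records})
    # Candidate thresholds are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(classify(threshold, x) != y for x, y in records)

    return min(candidates, key=errors)


def classify(threshold, x):
    """Apply the learned rule to a new applicant."""
    return "risky" if x < threshold else "safe"


# Hypothetical pre-classified applicants: (monthly income, label).
applicants = [
    (1200, "risky"), (1500, "risky"), (1800, "risky"),
    (2600, "safe"), (3100, "safe"), (4000, "safe"),
]
threshold = train_stump(applicants)
print(threshold, classify(threshold, 1000), classify(threshold, 3000))
```

The pre-classified applicants play the role of the training examples, and the learned threshold is the model that classifies new records.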

Clustering: Clustering, also known as cluster analysis, identifies groups of objects that are similar to one another. Clustering is chosen because in some applications the class membership is not available or is costly to identify [7]. The task of clustering is thus to assign unclassified records to a certain number of clusters based on their features [7]. Objects that cannot be assigned to any cluster can be handled by the data mining class “outlier detection”.

The goal of cluster analysis is: “Classifying without knowing the classes beforehand”.

The most common clustering techniques are neural networks, the Naïve Bayes technique and K-nearest neighbor.
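To illustrate grouping without known classes, the sketch below uses k-means on one-dimensional data. Note that k-means is not one of the techniques named above; it is swapped in here only because it is compact. The transaction amounts and the starting centers are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data.

    Each value is assigned to its nearest center, then every center is
    moved to the mean of the values assigned to it.
    """
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster received no values.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical transaction amounts without any class labels.
amounts = [12, 15, 14, 980, 1010, 995]
centers, clusters = kmeans_1d(amounts, centers=[0.0, 500.0])
print(clusters)
```

No labels are supplied; the two groups emerge only from the similarity of the amounts.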

Regression: The goal of regression analysis is similar to that of classification. The only difference is that regression does not form classes; instead, a numeric value is estimated. According to DMG, this function is used to determine the relationship between a dependent variable and one or more independent variables [8].


Example: Data from a production facility shows that certain product parameters correlate very strongly with product quality; the task is now to find out how these parameters must be set to achieve a specific level of quality [9].

Common tools for regression are linear regression and logistic regression.
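The relationship between one independent and one dependent variable can be estimated with ordinary least squares. The following is a minimal sketch of simple linear regression; the data points (a machine parameter and a quality score) are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: machine parameter (x) and measured quality score (y).
parameter = [1, 2, 3, 4]
quality = [3, 5, 7, 9]
slope, intercept = fit_line(parameter, quality)

# Invert the fitted line to find the setting for a target quality level.
target_quality = 8
required_setting = (target_quality - intercept) / slope
print(slope, intercept, required_setting)
```

Inverting the fitted line mirrors the production example: the model is used to find the parameter setting that corresponds to a desired quality level.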

Prediction: Prediction is similar to classification. The difference is that in prediction the results lie in the future. For example, one possible question of prediction analysis would be: “How will the dollar exchange rate develop in the future?”

Neural networks and logistic model prediction are the most commonly used techniques in prediction analysis.

Visualization: Visualization refers to the presentation of data mining results so that users can view complex patterns in the data as visual objects in dimensions and colors [10]. This makes it easier for users to understand complicated data as clear patterns and to use it. “Visualization helps business and data analysts to quickly and intuitively discover interesting patterns and effectively communicate these insights to other business and data analysts, as well as, decision makers” [11]. This type of data mining provides the following visualization and presentation techniques: trees, tables, graphs, charts, matrices, crosstabs, curves or rules.

Outlier Detection: The aim of outlier detection is to identify data that are not compatible with the rest of the dataset. It is one of the most fundamental issues in data mining. A commonly used technique for outlier detection is the discounting learning algorithm [2].
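The idea of flagging data that do not fit the rest of the dataset can be shown with a simple robust score. This sketch does not implement the discounting learning algorithm mentioned above; it swaps in a modified z-score based on the median absolute deviation (MAD). The cutoff of 3.5 and the sample amounts are assumptions.

```python
import statistics


def find_outliers(values, cutoff=3.5):
    """Return the values whose modified z-score exceeds the cutoff.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # All values are (almost) identical; nothing can be scored.
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > cutoff]


# Hypothetical claim amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = find_outliers(amounts)
print(outliers)
```

The median-based score is used here because a single extreme value distorts the mean and standard deviation of such a small sample.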

5. Literature Review

The authors of the first article proposed a novel hybrid approach for undersampling the majority class in highly skewed, unbalanced datasets in order to improve the performance of classifiers [12]. They used different data mining techniques such as PNN, MLP, SVM, DT and GMDH to test the effectiveness of their approach [12]. The results show that DT and SVM achieved a fraudulent-claim detection rate (sensitivity) of about 91% on insurance fraud detection, whereas GMDH achieved a sensitivity of 81.3% [12]. The authors conclude that the proposed hybrid undersampling approach performed better than using the original unbalanced data [12].
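The general idea of undersampling the majority class can be sketched as follows. This is plain random undersampling, not the hybrid approach of [12], whose details are not reproduced here; the record layout, the labels and the function name are assumptions.

```python
import random


def undersample(records, majority_label, seed=0):
    """Randomly drop majority-class records until both classes are balanced.

    Each record is a (features, label) pair. The seed makes the random
    selection repeatable.
    """
    rng = random.Random(seed)
    majority = [r for r in records if r[1] == majority_label]
    minority = [r for r in records if r[1] != majority_label]
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return minority + kept


# Hypothetical skewed dataset: 95 legitimate claims and 5 fraudulent ones.
claims = [({"id": i}, "legit") for i in range(95)]
claims += [({"id": 95 + i}, "fraud") for i in range(5)]

balanced = undersample(claims, majority_label="legit")
counts = {label: sum(1 for _, l in balanced if l == label) for label in ("legit", "fraud")}
print(counts)
```

A classifier trained on the balanced set sees as many fraudulent as legitimate examples, at the cost of discarding most of the majority-class records.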

Bhowmik [13] presents a confusion matrix of a model applied to a test data set. In his paper he provides a matrix for two classes with the four possible outcomes of the classification used to identify fraud: true positive, false positive, true negative and false negative. To view and understand the output, he recommends visualizing it. For this, he proposes the following data mining techniques: Naïve Bayesian visualization to provide an interactive view of the prediction results [13], attribute column graphs in neural networks to find the significant attributes, and DT visualization to build trees by splitting attributes from C4.5 classifiers [13].
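The four outcomes of a two-class confusion matrix, together with the sensitivity (true positive rate) and false positive rate derived from them, can be computed as in the following sketch. The labels and predictions are invented and do not come from [13].

```python
def confusion_matrix(actual, predicted, positive="fraud"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical test set: 4 fraudulent and 6 legitimate cases.
actual = ["fraud"] * 4 + ["legit"] * 6
predicted = ["fraud", "fraud", "fraud", "legit",
             "fraud", "legit", "legit", "legit", "legit", "legit"]

tp, fp, tn, fn = confusion_matrix(actual, predicted)
sensitivity = tp / (tp + fn)          # share of fraud cases that were detected
false_positive_rate = fp / (fp + tn)  # share of legitimate cases flagged as fraud
print(tp, fp, tn, fn, sensitivity, false_positive_rate)
```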

This article [14] develops a model for detecting cases of prescription fraud. The proposed novel model uses a data mining approach that divides the six-dimensional feature space into several two-dimensional sub-domains [14]. The results show that the automated fraud detection methodology is largely consistent with human expert auditing [14]. With a true positive rate of 77.4% and a false positive rate of 6%, the developed system works well for detecting prescription fraud [14].

The author of this article [15] performs an analysis using data mining methods to detect fraud in healthcare insurance. For the anomaly detection analysis he used the data mining technique SVM, run on an Oracle system. For the results he describes data mining software that calculates the probability that each record is anomalous [15]. If the probability for a claim header record is greater than 50%, the software marks the record as anomalous. For the analysis, the author presents three criteria: first, rejected claims; second, excessive claims in health center types; and third, excessive claims in health centers.
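The marking rule described above (a record counts as anomalous when its probability is greater than 50%) can be written as a simple threshold filter. This sketch only illustrates that rule; it is not the software used in [15], and the claim identifiers and probabilities are hypothetical.

```python
def flag_anomalous(records, threshold=0.5):
    """Return the ids of records whose anomaly probability exceeds the threshold."""
    return [record_id for record_id, probability in records if probability > threshold]


# Hypothetical claim header records: (claim id, anomaly probability).
claim_headers = [("C-001", 0.12), ("C-002", 0.50), ("C-003", 0.87)]
flagged = flag_anomalous(claim_headers)
print(flagged)
```

A record with a probability of exactly 50% is not marked, because the rule requires the probability to be strictly greater than the threshold.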



Institution / College: Heilbronn University
Keywords: Fraud, Fraud Detection, Literature review, Data Mining techniques, financial fraud detection, Data mining techniques in financial fraud detection, Financial Fraud, Insurance Fraud, Bank Fraud, Classification of Data Mining Applications, Support Vector Machine, Logistic Regression, DM, Decision Tree


