Table of Contents
- List of Figures:
- List of Tables:
- Chapter 1
- 1.1 Background
- 1.2 Overview of Data Mining Techniques
- 1.2.1 Clustering
- a. Application of Clustering Analysis
- b. Prerequisites of Clustering in Data Mining
- 1.2.2 Classification
- a. Classifications Issues
- 1.2.3 Association rule mining
- 1.3 Challenges in accident
- 1.4 Objective
- 1.5 Organization of Thesis
- Chapter 2
- LITERATURE SURVEY
- 2.1 Introduction
- 2.2 Factors responsible for accident
- 2.3 Traditional Statistical approach for accident analysis
- 2.4 Data Mining approaches for Accident Analysis
- Chapter 3
- METHODOLOGY AND DATA COLLECTION
- 3.1 Introduction
- 3.2 Proposed Methodology
- 3.2.1 K-modes clustering
- 3.2.2 Self-Organizing Map (SOM)
- 3.2.3 Hierarchical Clustering
- a. Agglomerative Clustering
- b. Divisive Clustering
- 3.2.4 Latent Class Clustering (LCC)
- 3.2.5 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- 3.2.6 Support Vector Machine (SVM)
- 3.2.7 Naïve Bayes (NB)
- 3.2.8 Decision Tree
- 3.2.9 Multilayer Perceptron
- 3.2.10 Association Rule Mining
- Interestingness computation
- 3.2.11 Cluster Selection Criteria
- 3.2.12 Accuracy Measurement
- 3.3 Data Collection
- 3.3.1 Description of dataset for Result No. 1, 2, 3
- 3.3.2 Description of dataset for Result No. 4
- 3.3.3 Description of dataset for Result No. 5
- Chapter 4
- ANALYSIS AND RESULTS
- 4.1 Introduction
- 4.2 Result No. 1 (Road-user Specific Analysis of Traffic Accident using Data Mining Techniques)
- 4.2.1 Classification Analysis
- 4.2.2 Classification followed by clustering of accident
- a. Performance evaluation of SVM
- b. Performance evaluation of Naïve Bayes
- c. Performance evaluation of Decision Tree
- 4.2.3 Analysis
- 4.3 Result No. 2 (Performance Evaluation of Lazy, Decision Tree Classifier and Multilayer Perceptron on Traffic Accident Analysis)
- 4.3.1 Direct Classification Analysis
- 4.3.2 Classification followed by clustering techniques
- a. Lazy Classifier Output
- K-Star: The classification accuracy increased from 67.7324% to 82.352% after clustering, a sharp improvement.
- IBK: The classification accuracy increased from 68.5634% to 84.4729% after clustering, a sharp improvement.
- b. Decision Tree Output
- c. Multilayer Perceptron Output
- 4.3.3 Analysis
- 4.4 Result No. 3 (A Conjoint Analysis of Road Accident Data using K-modes clustering and Bayesian Networks)
- 4.4.1 Cluster Analysis
- 4.4.2 Performance Evaluation of Bayesian Network
- 4.4.3 Analysis
- 4.5 Result No. 4 (Augmenting Classifiers Performance through Clustering: A Comparative Study on Road Accident Data)
- 4.5.1 Cluster Analysis
- 4.5.2 Classification Analysis
- 4.5.3 Analysis
- 4.5 Result No. 5 (Analysis of Airplane crash by utilizing Text Mining Techniques)
- 4.5.1 Cluster Analysis
- 4.5.2 Association Rule Mining
- 4.5.3 Analysis
- Chapter 5
- CONCLUSION AND RECOMMENDATIONS
- Publications related to thesis:
- Web of Science and Scopus indexed Publications
Accident data analysis is an area of major current interest. Analyzing accidents is essential because it can expose the relationships between the different attributes that contribute to an accident. Road, traffic, and airplane accident data differ in nature from other real-world data because accidents are uncertain events. Analyzing diverse accident datasets can reveal how much each of these attributes contributes, and this information can be used to reduce the accident rate. Data mining is now a popular approach for examining accident datasets. In this study, association rule mining and several classification and clustering techniques were applied to datasets of road and traffic accidents and airplane crashes. The results show improved accuracy and reveal many hidden circumstances that could help reduce the accident rate in the near future.
Keywords: road and traffic accident, airplane crash, data mining, clustering techniques, classification techniques, association rule mining, accident rate
While writing my MSc dissertation I have been immensely fortunate to be surrounded by inspiring people, whose special contribution to this dissertation I would like to acknowledge.
Firstly, I am very thankful to my supervisor, Prof. D.V. Kalitin, for his assistance at NUST “MISIS”. I really appreciate his support in allowing me to work independently on my MSc dissertation.
I would like to give special thanks to the people closest to my heart: my family. My deepest love and gratitude go to my parents for being the most wonderful parents in the world. I can never thank them enough for their motivation and support, and for the many sacrifices they made so that I could achieve the best in my life.
I would especially like to thank my friends Dr. Sachin Kumar and Dr. Vijay Bhaskar Semwal, who are my mentors, for always inspiring and supporting me. Last but not least, I bow to God Almighty for making these studies a success.
NUST “MISIS”, Moscow
Accidents have been a major cause of untimely death, property damage, and economic loss around the world. Many people die every year in different types of accidents. Traffic authorities make considerable efforts to reduce accidents, yet there has been no corresponding reduction in the accident rate over the years analyzed. Accidents are unpredictable and uncertain. Hence, accident analysis requires an understanding of the circumstances that affect them. Data mining [1, 2, 28, 29, 30] has attracted a great deal of attention in the IT industry as well as in society at large because of the wide availability of vast quantities of data. It is therefore necessary to transform this data into useful knowledge and information, which can then be applied in areas such as marketing, road accident analysis, and fraud detection.
Road and traffic accidents are a critical issue throughout the world. Lowering the accident rate is the best way to improve traffic safety. Research on road and traffic accident analysis using different data mining approaches has been carried out in many countries. Many researchers have proposed work aimed at reducing the accident rate by identifying the risk factors that particularly influence accidents [3-7].
The transportation system itself is not responsible for these crashes; rather, several other circumstances are [12, 13]. These circumstances can be grouped into environmental factors, such as weather and temperature; road-specific factors, such as road type, road width, and road shoulder width; human factors, such as wrong-side driving and excessive driving speed; and other factors. Whenever an accident occurs on any road in the world, some of these circumstances are involved. Moreover, these factors and their influence on accidents are not the same in all countries; they affect accidents in different countries in different ways.
Several studies [14-18] have concentrated on identifying these factors so that the relationship between accident factors and accident severity can be established. This relationship can be used to lower the accident rate by suggesting preventive measures. Accident analysis is generally regarded as part of road and traffic safety, in which the results of accident investigation can be used for traffic accident prevention.
Data mining is an evolving method that has been used in the transportation domain. Barai noted that information retrieval has different applications in transportation engineering, for example pavement analysis and road surface investigation. Data mining involves numerous techniques, such as preprocessing, association rule mining, classification, and clustering.
Airplane crashes are uncertain and unpredictable events, and analyzing them requires knowledge of the factors that affect them. Airplane crashes are characterized by a set of variables that are mostly discrete in nature. The major problem in the analysis of accident data is its heterogeneous nature. Heterogeneity must therefore be considered during analysis of the data; otherwise, some relationships within the data may remain hidden. Although researchers have segmented the data to reduce this heterogeneity using measures such as expert knowledge, there is no guarantee that this leads to an optimal segmentation consisting of homogeneous groups of airplane crashes. Cluster analysis may therefore help in segmenting airplane crashes.
The airline industry is one such field: with the rapid growth in air travel, cancellations, flight delays, and incidents have also increased significantly in recent years. As a result, a large amount of data and information accumulates in the aviation industry. This data may be stored as pilot reports, support reports, incident reports, segment reports, or delay reports. Data mining applications have likewise been carried out in the aviation industry.
Data mining is defined as the process of extracting information from large sets of data. In other words, data mining is the mining of knowledge from data.
Clustering is the grouping of a set of objects on the basis of their features, so that objects are grouped according to their similarity. The difference between clustering and classification is that in classification every record is assigned a predefined class according to a model trained on pre-classified cases, whereas clustering does not rely on predefined classes.
a. Application of Clustering Analysis
Clustering is widely used in many applications, such as pattern recognition, image processing, market research, and data analysis.
Clustering can also help marketers discover distinct groups in their customer base and characterize those groups by their purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
Clustering also helps identify areas of similar land use in an earth observation database, and groups of houses in a city according to house type, value, and geographic location.
Clustering also helps in grouping documents on the web for information retrieval.
Clustering is also used in anomaly detection applications, such as credit card fraud detection.
b. Prerequisites of Clustering in Data Mining
Ability to handle various types of attributes
Ability to handle noisy data
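Because accident attributes are mostly discrete, a clustering method for this kind of data has to work on categorical values rather than numeric distances. The following is a minimal sketch of the k-modes idea in Python, given only as an illustration and not as the implementation used in this thesis. The function names, the toy records, and the choice of seeding the modes with the first k distinct records are all assumptions made for the example.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Most frequent value of each attribute across the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def k_modes(records, k, max_iter=20):
    """Toy k-modes: records are tuples of categorical values (hypothetical helper)."""
    # Assumption: seed the modes with the first k distinct records (deterministic).
    modes = list(dict.fromkeys(records))[:k]
    labels = []
    for _ in range(max_iter):
        # Assign each record to the nearest mode under the matching dissimilarity.
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatch(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Recompute each mode as the attribute-wise most frequent value of its members.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes

# Hypothetical accident records: (vehicle type, road surface, time of day)
records = [
    ("car", "wet", "night"),
    ("bike", "dry", "day"),
    ("car", "wet", "day"),
    ("bike", "dry", "night"),
    ("car", "wet", "night"),
    ("bike", "dry", "day"),
]
labels, modes = k_modes(records, k=2)
print(labels)
print(modes)
```

On this toy data the car/wet records and the bike/dry records end up in separate clusters. Real accident data would need a more careful initialization and a way to choose the number of clusters.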
Classification is a data mining approach that assigns the items in a collection to target classes or categories. The objective of classification is to accurately predict the target class for each case in the dataset. For instance, a classification model could be used to identify loan applicants as high, medium, or low credit risks.
a. Classifications Issues
Preparing the data is a major issue for classification and prediction. The following steps are involved in preparing the data:
Data transformation and reduction
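To make the credit-risk example above concrete, the following Python sketch shows a small Naive Bayes classifier for categorical attributes with add-one (Laplace) smoothing. It is an illustration only: the applicant attributes, the training rows, and the function names are hypothetical and are not taken from the datasets used in this thesis.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    domains = defaultdict(set)           # attribute index -> observed values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def predict(model, row):
    """Return the class with the highest log posterior for one record."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the class
        for i, value in enumerate(row):
            # Add-one smoothing so unseen values do not give zero probability.
            count = value_counts[(i, label)][value]
            score += math.log((count + 1) / (n + len(domains[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical loan applicants: (income level, credit history) -> credit risk
rows = [
    ("low", "bad"), ("low", "bad"),
    ("medium", "good"), ("medium", "bad"),
    ("high", "good"), ("high", "good"),
]
risk = ["high", "high", "medium", "medium", "low", "low"]

model = train_naive_bayes(rows, risk)
print(predict(model, ("high", "good")))
print(predict(model, ("low", "bad")))
```

The classifier assigns each record to the class with the largest product of the class prior and the smoothed per-attribute likelihoods, computed in log space to avoid numerical underflow.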
Association rule mining is primarily focused on finding frequently co-occurring associations among a collection of items. It is sometimes referred to as "Market Basket Analysis", since that was the first application area of association mining. The goal is to find associations between items that occur together more often than would be expected from a random sampling of all possibilities. The classic example is the well-known beer and diapers association that is often mentioned in data mining books. The story goes like this: men who go to the store to buy diapers also tend to buy beer at the same time.
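The notion of items occurring together "more often than expected" is usually quantified with support, confidence, and lift. The short Python sketch below computes these three measures for the rule diapers → beer on a made-up set of transactions. The transactions and function names are assumptions for illustration, and the code does not perform the frequent-itemset search that a full algorithm such as Apriori would.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    Assumes the antecedent and consequent each occur in at least one transaction.
    """
    s_both = support(transactions, set(antecedent) | set(consequent))
    s_ante = support(transactions, antecedent)
    s_cons = support(transactions, consequent)
    confidence = s_both / s_ante   # P(consequent | antecedent)
    lift = confidence / s_cons     # > 1 means more frequent than under independence
    return s_both, confidence, lift

# Hypothetical market baskets
transactions = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"milk"},
    {"beer", "chips"},
    {"bread"},
]

s, c, l = rule_metrics(transactions, {"diapers"}, {"beer"})
print(s, c, l)  # 0.375 0.75 1.5
```

A lift of 1.5 means that, in this toy data, beer appears in baskets containing diapers one and a half times as often as it would if the two items were independent.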
The main concern in accident data analysis is to identify the most influential features affecting accident frequency and accident severity. The major problem with accident dataset analysis is its heterogeneous nature. Heterogeneity in accident data is highly undesirable yet unavoidable. Its main drawback is that certain relationships may remain hidden: for example, certain accident features associated with a particular vehicle type may not be significant in the entire data set; the magnitude of the effect of certain accident-related variables may differ under different conditions; and the severity levels associated with a contributing circumstance may differ across accident types. This heterogeneity may lead to less accurate results. To obtain more accurate results, the heterogeneity of road accident data must be removed. To deal with it, some studies [21, 22] partitioned the data into groups based on exogenous attributes, such as accident location, road condition, and cause of accident, and analyzed each group separately to identify the influential factors associated with accidents in that group. However, this choice is questionable, because grouping the data on the basis of particular attributes may produce less meaningful groups.
The overall objective of this thesis is to achieve better accuracy and to identify the factors behind crashes and accidents, which could help reduce the accident rate in the near future, save many lives, and reduce the loss of property. The next section gives an overview of the research articles on which this thesis is based.
The results of this thesis are based on five research articles. The structure of the thesis is as follows, with a short description of each research article given in this section.
In the first research article, different clustering and classification techniques, as well as association rule mining, were used to find correlations between diverse factors. K-modes clustering and the Self-Organizing Map (SOM) technique were used to group the data into homogeneous segments, and then Naive Bayes (NB), Decision Tree, and Support Vector Machine (SVM) classifiers were applied to classify the dataset. Classification was performed on the data both with and without clustering. The results show that higher classification accuracy can be achieved after segmenting the data using clustering.
In the second research article, different classification and clustering techniques were proposed to analyze the data. The classification techniques implemented were the Lazy classifier, Decision Tree, and Multilayer Perceptron, which classified the dataset on the basis of casualty class; the clustering techniques were hierarchical and k-means clustering, which clustered the whole dataset. First, the dataset was analyzed using these classifiers alone, which gave a baseline level of accuracy. Then the clustering approach was applied, and the classification approaches were run on the clustered dataset. Accuracy improved when clustering was applied to the dataset, compared with the dataset classified without clustering.
In the third research article, a conjoint analysis using k-modes clustering and Bayesian Networks was presented on imbalanced road accident data from Leeds, UK. The motivation of this study was to validate the performance of classification before and after the clustering process.
In the fourth research article, a comparative study was performed using k-modes, LCC, and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clustering techniques on new multi-vehicle accident data from Muzaffarnagar, Uttar Pradesh, India. The Naïve Bayes (NB) algorithm was then applied to predict the severity of road and traffic accidents for each of the clusters obtained. Naïve Bayes was selected for classification because several data tuples contain the same attribute values for different class values (dependent attribute values). In such cases, the decision tree technique usually fails (Tan et al., 2006). The NB technique was therefore selected, as it classifies a data tuple based on probability values. The results revealed that the prediction accuracy of NB, SVM, and RF was higher for the clusters obtained from LCC, whereas for the clusters obtained from BIRCH the prediction accuracy was lower than for the LCC and k-modes clusters. On the other hand, the computation speed of k-modes was higher than that of both LCC and BIRCH. The rest of the paper is organized as follows: the next section covers the literature survey and the materials and methods used in the study, which comprise a brief overview of the data set and the various techniques applied.
In the fifth research article, a framework is proposed that is based on cluster analysis using hierarchical clustering and association rule mining using the Apriori technique. Using cluster analysis as a preliminary task groups the data into different homogeneous segments. Association rule mining is then applied to these clusters as well as to the whole dataset to generate association rules. To the best of the author's knowledge, this is the first time that both approaches have been used together to analyze a road accident dataset. The results of the analysis show that using cluster analysis as a preliminary task can help remove heterogeneity from the road accident dataset to some degree. Chapters 4 and 5 discuss the experiments and results, followed by the conclusion and recommendations.
Accident analysis is an essential area of study in the transportation domain (Kumar and Toshniwal, 2016a). Various studies have used statistical methods (Savolainen et al., 2010; Karlaftis and Tarko, 1998; Jones et al., 1991; Poch and Mannering, 1996; Maher and Summersgill, 1996) and data mining techniques (Kumar and Toshniwal, 2015a, 2016b, 2016c; Chang and Chen, 2005; Kashani et al., 2011; Prayag et al., 2017) to analyze traffic accident data and establish relationships between accident attributes and road accident severity. The results of these studies are very useful because they reveal the different circumstances affecting road accidents. Awareness of these accident factors is certainly helpful in taking preventive measures to reduce the accident rate in the area of study [37-41, 32, 7, 19, 42-43, 79, 80].
A study by Peng and Boyle sought insight into the effect of commercial driver factors on crash severity in run-off-road (ROR), single-vehicle crashes. The study found that seat belt use significantly reduced the likelihood of injury and fatal ROR crashes. Driver distraction and inattention increased the likelihood of an ROR crash. Fatigue, drowsiness, and speeding significantly increased the likelihood of injury and fatal ROR crashes. Commercial motor vehicle (CMV) drivers who drove a non-defective truck were associated with a lower likelihood of injury and fatal ROR crashes. An ROR crash was around 3.8 times more likely to be injurious or fatal if it occurred on rural roads or dry roads. No other explanatory variables were found to be significant. The results of this study suggest that several driver factors (fatigue and drowsiness, speeding, distraction, inattention, and seat belt use) affected the likelihood of an ROR crash being injurious. Hence, the study of Peng and Boyle gives insight into the magnitude of the effect of these driver factors on ROR crashes that involve large trucks. The results have implications for behavioral safety countermeasures that can help mitigate the effects of driver distraction and speeding.
Transient factors related to the failure to detect a vehicle may include alcohol, fatigue or lack of sleep, inattention, and information overload, while more permanent factors may include "cognitive" conspicuity and field dependence. The model with the best fit and highest predictive capability was used to identify the effect of roadway, environmental, vehicle, and driver-related factors on severity. Device usage, travel speed, point of impact, use of drugs and alcohol, personal condition, whether the driver is at fault, gender, the presence of a curve or grade at the crash location, and the rural or urban nature of the location were identified as the important factors affecting injury severity for older drivers involved in single-vehicle accidents. Logistic regression was applied to crash-related data collected from traffic police records in order to examine the contribution of several factors to accident severity. The ordered probit model was used to estimate the effect of roadway and area type factors on the injury severity of pedestrian crashes in a rural area. Failure to wear seat belts did not predict crashes but did significantly affect the severity of crashes that occurred; that is, those who had earlier reported using seat belts "always" were less likely than others to be injured when a crash occurred. Financial stress increased the likelihood of involvement in a more severe accident. These results can inform urban traffic police enforcement measures, which aim to change inappropriate driver behavior and protect the least experienced road users.
Many factors are responsible for accidents, such as driver alcohol and drug involvement, the age of the driver, improper driving education, the experience of the other vehicle's driver, urban or rural setting, speed, environmental issues, the runway, engine problems, and pilot error.
Statistical techniques, or “statistics”, are not data mining approaches. They were in use long before the term data mining was coined for business applications. Nevertheless, statistical approaches are driven by the data and are used to discover patterns and build predictive models.
Statistical approaches have also played an important role in road safety research. Karlaftis and Tarko (1998) studied the impact of age of riders on accident patterns. They used negative binomial models along with cluster analysis to analyze the road accident data. Several important studies (Savolainen et al., 2010; Karlaftis and Tarko, 1998; Jones et al., 1991, Poch and Mannering, 1996) using statistical techniques have been performed on road accident data.
Lord and Mannering (2010) provided a detailed survey of the key issues related to crash-frequency data and of the strengths and weaknesses of the various methodological approaches that researchers have used to address these issues. While the steady march of methodological innovation (including recent applications of finite mixture and random parameter models) has substantially improved our understanding of the factors that affect crash frequencies, it is the prospect of combining evolving methodologies with far more detailed vehicle crash data that holds the greatest promise for the future.
Poch and Mannering (1996) used a seven-year accident dataset from 63 intersections in Bellevue, Washington (all of which were targeted for operational improvements) to estimate a negative binomial regression of the frequency of crashes at intersection approaches. The estimation results reveal important interactions between traffic-related and geometric factors and crash frequencies. The paper provides exploratory methodological and empirical evidence that could lead to an approach for estimating the accident reduction benefits of various proposed improvements at operationally deficient intersections.
However, several studies (Ona et al., 2013; Kumar and Toshniwal, 2016d) have found that clustering improves the performance of classification or prediction. Latent class clustering has been widely used for cluster analysis of road and traffic accident datasets, whereas k-modes clustering has been used in only a few studies. Although every clustering technique has its own advantages and limitations, the selection of the clustering algorithm that best suits the data usually depends on the choice of the authors. Therefore, the performance of clustering techniques needs to be evaluated against explicit criteria such as computation speed and the quality of the clustering result [7, 32, 19, 42-43].
Kumar and Toshniwal (2015a) proposed a framework to remove heterogeneity from road accident data and suggested that clustering prior to analysis is very useful for dealing with the heterogeneity of road and traffic accident data. Ona et al. (2013) used the latent class clustering (LCC) technique to remove heterogeneity from the data. They suggested that LCC is a very useful clustering technique that also provides different cluster selection criteria for identifying the number of clusters present in the data set. Further, Kumar and Toshniwal (2016d) performed a comparative study on road accident data from Haridwar, Uttarakhand, India. In this study, they used LCC and K-modes (Chaturvedi et al., 2001; Kumar and Toshniwal, 2015b) clustering techniques to cluster the data prior to analysis. They then used the Frequent Pattern (FP) growth technique to extract association rules that described the accident pattern in each cluster. They concluded that both techniques have similar efficiency in cluster formation and are able to remove heterogeneity from the data. However, their findings were not sufficient to establish the superiority of one technique over the other.
Karlaftis and Tarko used cluster analysis to group the accident data into separate categories and then analyzed the clustered data with Negative Binomial (NB) models to identify the causes of accidents, focusing on the age of the driver.
Kwon OH used Naive Bayes and Decision Tree classification approaches to analyze factor dependencies related to road safety. Young Sohn used different algorithms to improve the accuracy of various classifiers for two severity categories of traffic accidents, with the classifiers based on neural networks and decision trees. Tibebe built a classification model for traffic officers at the Addis Ababa Traffic Office to support their decisions in managing traffic-related activities in Ethiopia. S. Kuznetsov et al. [48-50] used an algorithm based on FCA for numerical data mining and obtained more efficient results.
Christopher [51, 52] applied five classification techniques to an airplane crash dataset to measure the performance of each technique on different components of the aviation dataset. The primary contribution of this study is an evaluation of the performance of the classification techniques NB, SVM, DT, NN, and KNN on a component of the aviation data. The paper also examined the importance of feature selection techniques for improving the performance of classification techniques. It found that the different feature selection techniques reduce the number of redundant and irrelevant attributes and thereby improve the performance of the classifiers. The study found that principal component analysis based feature selection combined with a decision tree classifier is best suited for prediction on the aircraft crash dataset.
Bineid and Fielding used data mining techniques to describe the development of a dispatch reliability prediction method for a passenger aircraft. Nazeri and Zhang described the use of data mining to analyze severe weather impacts on the national airspace system; (CAA) was applied to carry out the study. Their approach takes into account the more complex relationships among relevant performance [53-54].