Prediction of Adverse Drug Reaction (ADR) Outcomes Using Machine Learning: A Feed-Forward Artificial Neural Network with Backpropagation


Master's Thesis, 2019

135 Pages, Grade: 88.2


Excerpt


Table of Contents

1. CHAPTER ONE: INTRODUCTION
1.1 Pharmaceutical Products and Side Effects
1.2 ADR Reporting and Their Importance for Patients and Big Pharma
1.3 Problems with ADRs and ADR Reporting
1.4 Economic Impact of ADRs
1.5 Current Statistical (non-machine learning) Methods Used for Identification of ADRs
1.6 Data Science and its Application in Healthcare Industry
1.7 Aims and Objectives

2. CHAPTER TWO: LITERATURE REVIEW
2.1 Introduction to Machine Learning
2.2 Supervised Learning
2.2.1 Logistic Regression
2.2.2 Decision Tree Classifier
2.2.3 Ensemble Models
2.2.3.1 Random Forest (RF)
2.2.3.2 The eXtreme Gradient Boosting (XGBoost) Classifier
2.2.4 Support Vector Machines (SVMs)
2.2.5 Artificial Neural Networks (ANN)
2.3 Use of Machine Learning for ADR Prediction (based on FDA database)
2.3.1 Standard Machine Learning and ADRs Prediction
2.3.2 Neural Networks and ADR Prediction
2.4 Selection of Machine Learning Model
2.5 Cross-Industry Standard Process for Data Mining (CRISP-DM)

3. CHAPTER THREE: RESEARCH DESIGN & METHODOLOGY
3.1 Research Design
3.1.2 Research Philosophy & Strategy
3.1.3 Ethics and Privacy
3.2 Methodology
3.2.1 Problem Understanding / Research Purpose Statement
3.2.2 Technologies and Hardware Used
3.2.3 Data Understanding
3.2.3.1 Data Accessibility and Data Accuracy
3.2.3.2 The ADR dataset
3.2.3.3 Data Acquisition and Assembly
3.2.3.4 Assessment of Data Quality and Initial Statistical Analysis
3.2.4 Data Cleaning
3.2.4.1 Weight Variable
3.2.4.2 Age Variable
3.2.4.3 Product Active Ingredient (prod_ai) variable
3.2.4.4 Commercial Drug Name variable
3.2.4.5 Data Filtering and Data Selection
3.2.5 Data Preparation
3.2.5.1 Encoding
3.2.5.2 Multicollinearity
3.2.5.3 Data Balancing by Over-sampling
3.2.5.4 Data Split
3.2.5.5 Feature Scaling
3.2.5.6 Modelling and Model Evaluation
3.2.6 Limitation of Research Design and Methodology

4. CHAPTER FOUR: RESULTS
4.1 Results of Preliminary Data Assessment
4.2 Results of Univariate Analysis of Cleaned Data
4.3 Results of Multicollinearity Analysis
4.4 Results of Optimizers for Multi-label Classification Models
4.5 Results of the Final (multiclass) ADR-Outcome Classification (Model 7)
4.6 Results of Binary Classification Model

5. CHAPTER FIVE: DISCUSSION
5.1 ADR Data Trends in This Research and Literature
5.2 Evaluation of Model Optimization Practices
5.3 Multiclass Prediction of ADR Outcomes
5.4 Binary Classification of ADR Outcomes

6. CHAPTER SIX: CONCLUSIONS & RECOMMENDATIONS
6.1 Project Objectives
6.2 Challenges and Recommendations
6.3 Further Research

7. References

Appendix A: Python Code

Appendix B: FDA ASC_NTS Description file

Appendix C: Initial Data Quality Assessment

Appendix D: WHO Age vs Age Relationship Diagrams

Appendix E: Specification of subsequent model parameters

Appendix F: Performance Metrics of designed models

Acknowledgements

First, I would like to express my gratitude to my supervisor, Dr. Joseph Kehoe, who provided me with guidance throughout this research project. Your experience in planning and conducting a research project, as well as in formulating a proper research question, was of great benefit to me.

Secondly, to my colleagues: Helen Tai, whose code troubleshooting skills saved me many hours spent in front of a screen, and Joyce Mahon, who passionately discussed the best approaches to tackling dissertation-related problems throughout the duration of this research. Your support and advice were most appreciated.

Finally, a big thank you to my family and friends, who motivated me and made an extra effort to release me from my professional and personal duties so that I had time to complete this research project.

Abstract

Adverse Drug Reactions (ADRs) account for an estimated $7.2 billion in medical expenses annually in the United States alone. Beyond their economic impact, ADRs are one of the leading causes of death in developed countries. The ability to predict ADR outcomes, or to identify patients likely to experience an ADR after taking a medicinal drug product, could reduce both the fatality rates and the financial burden associated with treating affected patients. To date, two studies have attempted to predict ADRs using machine learning algorithms and the Food and Drug Administration (FDA) database; however, one did not provide sufficient accuracy scores and the other was a small-scale proof-of-concept study. This research builds on these two studies, aiming to improve the predictive results and provide more specific ADR predictive labels. Two deep, fully connected neural networks were created to predict ADR outcomes: Model 7 and the Binary Model. Model 7 was a multiclass classification algorithm trained to classify patients who were hospitalised, died, or were hospitalised and then died. The Binary Model was trained to segregate the ADR outcomes into two classes: hospitalisation and death. The multiclass Model 7 achieved performance (74% accuracy) similar to the standard (binary) machine learning models generated in previous studies, but it did not outperform the binary proof-of-concept Artificial Neural Network model built by other researchers. The binary Artificial Neural Network model created in this study significantly outperformed the binary standard machine learning models designed in the past, achieving 83% accuracy, 8 percentage points higher than its competitors. Nevertheless, it did not outperform the small-scale Artificial Neural Network model, which scored 99% accuracy. The results of this study demonstrate that, provided there is sufficient access to computing power, Artificial Neural Networks are better suited than standard machine learning models for binary classification problems on large-scale, unbalanced data. Multiclass Artificial Neural Networks also achieve satisfactory results; however, a reduction in model performance must be acknowledged due to the nature of the multiclass output.

Keywords: Artificial Neural Network, Adverse Drug Reactions, Side Effects, Classification

Table of Figures

Figure 1. Visualisation of clinical trial phases and their size and length (EUPATI, 2015)

Figure 2. Showing the sigmoid function classifying two linearly separable categorical variables into 2 classes using binomial logistic regression. X-axis = independent variable, Y-axis = probability of the dependent variable. Graph adapted for the purpose of the review (Peng, Lee and Ingersoll, 2002)

Figure 3. Visualising the concept of classifying categorical variables using decision trees

Figure 4. RF classifier operational mechanism. Decision trees 2 and 3 outvoted tree no. 1; therefore, the output variable will be classified as class B (Koehrsen, 2017)

Figure 5. Showing Linear SVM model. Thick line is the hyperplane separating data points to different groups. The slope of the hyperplane determines the size of the margin to support vectors (Ahuja and Yadav, 2012)

Figure 6. Visualisation of deep feed forward neural network model with 2 hidden layers

Figure 7. Showing increase in Accuracy measures vs number of hidden layers added to ANN

Figure 8. Distribution of the two main predictor variables: Age (on left) and Weight (on right)

Figure 9. Schematic representation of CRISP-DM model

Figure 10. Visualization of research design development in the form of a research onion (Saunders et al., 2009)

Figure 11. Process flow diagram of research methodology

Figure 12. ADR database entity relationship diagram

Figure 13. Number of missing (NaN) values in the entire downloaded FDA data frame. White represents missing values, while black represents complete values

Figure 14. A screen capture of the FDA data for a single case ID =

Figure 15. Distribution of age value for patients used for ADR outcome classification

Figure 16. Distribution of weight variable for patients used for ADR outcome classification

Figure 17. Percentage of value counts of each independent ADR outcome (blue: hospitalised, yellow: died, red: hospitalised then died)

Figure 18. Ratio of Male to Female patients used for ADR outcome classification, represented in percentages

Figure 19. Value counts associated with top 10 drug-administration routes used for classification of ADR outcomes

Figure 20. Top 10 independent features, according to their predictive power

Figure 21. Multicollinearity matrix of independent variables and the dependent variable

Figure 22. Classification accuracy and loss function of the final model on test and train sets

Figure 23. Precision, Recall and F1-Score performance measures for every independent label classified by the model

Figure 24. Multiclass ROC curve, showing the probability of the model correctly classifying each label (0: hospitalisation, 1: hospitalisation + death, 2: death). False Positive Rate on the X-axis and True Positive Rate on the Y-axis

Figure 25. Accuracy and Loss of the binary model regarding classification of Hospitalisation and Death Outcomes

Figure 26. Precision, Recall and F1-Score performance measures for the binary model classification of hospitalisation and death ADR outcomes

Table of Tables

Table 1. Performance metrics of 4 machine learning models for death and hospitalisation test data sets (Chen, 2018)

Table 2. Conversion factors used for conversion of age to the year timescale

Table 3. Age ranges used for rounding of age values

Table 4. Explanation of the dummy-encoding process

Table 5. Iterations of classification model specifications

Table 6. Classification accuracy scores throughout the model optimization process. The table shows accuracy on the train set and the validation/test set, as well as the difference between accuracy scores and the associated loss.

Table of Abbreviations

[Table not included in this excerpt]

1. CHAPTER ONE: INTRODUCTION

1.1 Pharmaceutical Products and Side Effects

The prescription drugs produced by pharmaceutical companies must pass through three stages of clinical trials in order to acquire the marketing authorization that allows manufacturers to sell the medicinal products to the public (EMA, 2018).

The purpose of putting a drug substance through clinical trials is to verify its safety and efficacy for the treatment of a particular disease before it is made available to the general population. The three phases of clinical trials test the drug substance on a small number of subjects (Figure 1): phase 1 typically enrols 1-50 patients, phase 2 enrols 100-500 patients and phase 3 typically enrols 1,000-5,000 patients (EUPATI, 2015).

[Figure not included in this excerpt]

Figure 1. Visualisation of clinical trial phases and their size and length (EUPATI, 2015).

Because it already takes 12-15 years to bring a drug substance through clinical trials to approval, there is pressure to shorten clinical trials by reducing the number of subjects used in the trial series. Potential drugs are tested on a small sub-group of the global population that does not always resemble the genetic and physiological make-up of the general population (Berlin et al., 2008). As a result, side effects can occur in some ethnic groups even after a medicinal product has been approved by the regulatory authorities and deemed safe and effective to use.

The causes of side effects to approved medicinal products include lack of adherence to dosage and doctors' recommendations, drug-drug interactions, severity of the disease, the aforementioned mismatch between the drug and the physiological make-up of the individual (which can in turn depend on ethnicity and race) and, most importantly, the personal attributes of the patient, such as age (Yu et al., 2015), sex (Rademaker, 2001), weight (Alomar, 2014) and overall health, which are thought to be major factors contributing to the manifestation of drug side effects.

The side effects caused by medicinal drug products can be divided into non-serious and serious side effects. The serious side effects are also referred to as Adverse Drug Reactions (ADRs) and are characterised by instances of patient hospitalisation and/or death.

1.2 ADR Reporting and Their Importance for Patients and Big Pharma

Upon successful completion of the phase 3 clinical trial, a drug substance receives the marketing authorization to be sold to the general public. Pharmaceutical companies are legally obliged to report any ADR that occurs within the general population even after regulatory approval has been granted, in case the drug displays serious side effects that have not been observed before. This is informally known as the phase 4 clinical trial, and the ADR data collected during this phase can contribute to removal of the drug from the market if necessary (Suvarna, 2010).

Pharmaceutical companies report side effects directly to the relevant regulatory authority, such as the FDA or the EMA, or both in some instances. Healthcare practitioners as well as patients can also voluntarily submit reports of adverse drug reactions to these authorities (Center for Drug Evaluation and Research, 2018).

1.3 Problems with ADRs and ADR Reporting

The ability to gather potential ADRs from various sources is advantageous, as it increases the number of data points to be tested, thus improving the chances of identifying lethal ADRs before they occur. However, under-reporting of ADRs is still the biggest flaw of the current system (Hazell and Shakir, 2006), as it negatively impacts the ability to accurately assess ADR incidence rates. Collecting the data from multiple sources and through various routes also has its disadvantages, as it often results in duplication of the same ADR, which reduces data quality and infringes on the validity of FDA reports. The ADR reports are written and submitted by healthcare professionals, which introduces inconsistencies due to imputation/human error. This is evident when looking at the dataset, which consists of non-uniform data, missing values and multiple names for one entity.

Additionally, the data collected by the FDA suffers from the "Weber effect" (Hoffman et al., 2014) and "notoriety bias" (Pariente et al., 2007). These two phenomena refer to an unrealistic transposition of the frequency of ADR reports to actual ADRs, caused by over-reporting of ADRs for certain medicinal drug products. The Weber effect is associated with over-reporting of ADRs after a new drug is released: medical practitioners are cautious of any side effect caused by the new drug, and the data show a surge in ADR reporting for new drugs during the first 2 years after FDA approval, followed by a decline in ADR reports. Notoriety bias is associated with over-reporting of ADRs caused by drug alerts released by regulatory authorities, which aim to bring medicinal drug products under investigation, or products whose designated use is being changed, to the attention of medical practitioners. These two effects may impact statistical analysis of the data, but should have no effect on the predictive ability of the models or their results when the FDA database is used as the data source.

1.4 Economic Impact of ADRs

The occurrence of ADRs creates an economic burden for hospitals, patients and pharmaceutical companies. A clinical study comparing hospitalised patients experiencing ADRs with non-symptomatic patients determined that ADR-positive patients occupy hospital beds on average 4 days longer and cost $5,500 more per patient per stay than regular patients (Suh et al., 2000). In addition, the study identified cardiovascular drugs and anti-infective drugs as causing most of the serious side effects in hospitalised patients. A German study conducted by Dormann et al. (2004) established that 44.3% of ADRs are preventable and that 23% of patients experiencing ADRs are readmitted to hospital following the primary ADR intervention, thus increasing the cost associated with ADR management. Additionally, ADRs have been shown to extend hospital visits, adding extra costs to already expensive hospital admissions and treatments.

According to Classen et al. (1997), cited in Sultana et al. (2013), the cost of managing a single ADR was estimated at $2,262 per patient per ADR event in a general care unit. Instances where the ADR had to be treated in an intensive care unit (ICU) were associated with higher costs of $3,961 per patient per ADR. The study also estimated that the total national cost of ADR treatment, with 20.7 million ADR-positive patients in a given year, was between $5.8 billion and $7.2 billion, depending on care unit specification (Kaushal et al., 2007). A recent US/European study has also confirmed the high economic burden associated with ADR hospitalisation and treatment, estimating the costs to average €4,344 and €5,933 for the outpatient and inpatient settings respectively (Formica et al., 2018). It is believed that a reduction in ADR occurrence, or an improvement in ADR management, could reduce the associated costs.

1.5 Current Statistical (non-machine learning) Methods Used for Identification of ADRs

The current methods of identifying and predicting potential ADRs are based on statistical analysis of the databases and are referred to as "Quantitative" Signal Detection Methodologies (QSDMs). There are two distinct families of methodologies (based on frequentist and Bayesian concepts) that identify the ADRs for a specific medicinal drug product. The two most prominent frequentist methodologies are the Reporting Odds Ratio and the Proportional Reporting Ratio (Poluzzi et al., 2012), while the most widely used Bayesian approach is the Multi-item Gamma Poisson Shrinker method (Dumouchel, 1999), cited in (Poluzzi et al., 2012).

Although the processes involved in these approaches differ, in the end all of them compare the disproportionality of ADR reports between the suspected drug product and the other drug products in the database. It is important to highlight that a statistically significant QSDM result does not prove a relationship between an ADR and the suspected medicinal product, just as the lack of statistical correlation does not prove the absence of such a relationship (Hauben and Aronson, 2009). Currently there is no gold-standard technique in pharmacovigilance for ADR detection. However, of the mentioned techniques, the Reporting Odds Ratio is the most used methodology, due to its transparency and ease of application (Heijden et al., 2002).
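
For illustration (a standard formulation of the method, not reproduced from this thesis), the Reporting Odds Ratio is derived from a 2x2 contingency table of spontaneous reports, where a = reports of the suspected drug with the ADR of interest, b = reports of the suspected drug with all other ADRs, c = reports of all other drugs with the ADR of interest and d = reports of all other drugs with all other ADRs:

    ROR = (a/b) / (c/d) = (a × d) / (b × c)
    95% CI = exp( ln(ROR) ± 1.96 × √(1/a + 1/b + 1/c + 1/d) )

A signal is typically flagged when the lower bound of the 95% confidence interval exceeds 1.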

1.6 Data Science and its Application in Healthcare Industry

The flaws and limitations of simple statistical methodologies call for more advanced techniques for identifying potential ADRs with high confidence. Data Science is a rapidly growing field that can be utilised for analytical purposes, i.e. pattern recognition or risk assessment, as well as for prediction, i.e. machine learning tools that can predict events based on historic data. Data science can be applied, or is already being applied, to various aspects of the healthcare industry, such as drug discovery (Brown et al., 2018), diagnosis (Sajda, 2006), personalised disease treatment (Fröhlich et al., 2018) and optimisation of hospital operations (Agrawal, 2017).

1.7 Aims and Objectives

This research intends to use a machine learning model to predict serious ADR outcomes, such as hospitalisation and/or death, based on data obtained from the FDA Adverse Event Reporting System (FAERS) database. The main objective of this study is to create a predictive model using an Artificial Neural Network algorithm and to verify whether it is a suitable classification mechanism for large-scale data. This will be achieved by designing a multiclass model (and potentially a binary-class model) using 10,000+ ADR cases. Part of this process will include model optimisation and comparison of model accuracy to a previously conducted proof-of-concept/small-scale study that used 3,000 ADR cases, published by Yen et al. (2011).

The sub-objective of this research study is to compare the performance of the designed Artificial Neural Network to the standard machine learning models (Logistic Regression, Support Vector Machine, Random Forest and Gradient Boosted Tree) published by Chen (2018).

It is the author's hypothesis that Artificial Neural Networks will achieve better ADR prediction accuracy than standard machine learning models on a large dataset. It is expected that using 10,000 cases for modelling will reduce prediction accuracy compared to the ANN model that used 3,000 cases. However, it is expected that the prediction accuracy of the ANN model created as part of this project will exceed that of the standard machine learning models designed by Chen (2018).

2. CHAPTER TWO: LITERATURE REVIEW

Machine Learning is a branch of Artificial Intelligence that uses mathematical algorithms, statistics and computers to create predictive models, enabling machines to learn and produce valuable insights without human interference, based on patterns gathered from historic data (data mining). Machine Learning can be used for tasks such as pattern recognition, diagnosis, planning, automated machine control and prediction, among others.

This chapter will provide an introduction to machine learning and explain how it can be used for this project. Different supervised machine learning algorithms will be explored to assess their suitability for prediction of ADRs. Finally, the current research in the area of ADR outcome prediction with machine learning algorithms will be explored, and the author will identify gaps in the literature to be addressed as part of this research project.

2.1 Introduction to Machine Learning

The Machine Learning models used for the above tasks can be classified into 3 major groups: Supervised Machine Learning Models, Unsupervised Machine Learning Models and Semi-supervised Machine Learning Models.

The Supervised Machine Learning models are trained on sample data with available labels and categorise new data into the existing labels, based on the similarity of the data points to those labels or on the weights of the input variables. Supervised models do not produce new labels; they assign every output to one of the labels the model already has access to.

In unsupervised machine learning approaches, labels are not available. The algorithm is presented with raw data, which is clustered based on the similarity of the data points to each other, thus aiming to identify patterns in large or complex datasets. With this approach, the analysis starts with no available labels and new label clusters are identified. The number of clusters must often be pre-specified, which can be problematic: if the number of distinct clusters is too small, different outputs will be grouped together, whereas if it is too big, members of the same cluster will be spread across separate groups.

The semi-supervised machine learning models are a combination of the supervised and unsupervised approaches. In this instance some labels are available and the algorithm assigns matches to the existing labels, but it is also capable of creating new labels when necessary, thus increasing the label range (Libbrecht and Noble, 2015).

2.2 Supervised Learning

Supervised learning deals with two distinct areas: classification and prediction. These terms are often used interchangeably, but their outcomes differ. Classification models categorise the data into separate, pre-labelled classes/groups, while regression models, although they work similarly to classifiers, are used for prediction of values or of the likelihood of events happening. The output of a predictive model is a probability or a numeric value, as opposed to a classification model, whose output is a categorical variable. The main supervised models defined in the literature are: logistic regression, support vector machines, decision trees, random forest, XGBoost and neural networks (Akinsola, 2017). Even though these models are capable of providing similar outputs, their working principles differ and suit different data types. Therefore, the theory behind each of these models will be discussed further. It is believed this review of the literature will help determine which of the models might be best suited for analysis of the dataset used in this project.

2.2.1 Logistic Regression

Logistic regression is one of the most used machine learning algorithms and is based on the assumption that the data points are separable into 2 distinct regions by a linear discriminant (hyperplane). The logistic regression model uses the maximum likelihood estimation function to maximise the average probability that the data point of interest has been classified correctly. The output of logistic regression is the probability that the data point of interest belongs to one of the classes, and it can be visualised with a sigmoid. By default, if P > 0.5, the data point belongs to the measured class (Figure 2). The model assumes that the classes/events are independent of each other (Peng et al., 2002).

[Figure not included in this excerpt]

Figure 2. Showing the sigmoid function classifying two linearly separable categorical variables into 2 classes using binomial logistic regression. X-axis = independent variable, Y-axis = probability of the dependent variable. Graph adapted for the purpose of the review (Peng et al. 2002).
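
For reference (a textbook formulation rather than an extract from the thesis), the sigmoid in Figure 2 maps a linear combination of the input variables to a class probability:

    P(y = 1 | x) = 1 / (1 + e^(-z)),  where z = β0 + β1x1 + … + βnxn

The data point is assigned to the measured class when P > 0.5, which corresponds to z > 0.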

Logistic regression can be used for binary classification or for classification where the number of classes is greater than two. The model is suited to classification of nominal/categorical variables such as Gender (M/F). As explained by Ranganathan et al. (2017) and Yang and Loog (2018), this model works well for binary classification, but can become problematic where the number of classes or independent variables is high. The model assumes a linear relationship between the natural logarithm of the odds ratio (the odds of the event happening) and the variables used for regression analysis. In cases where these operational conditions are not met, a data transformation is essential for the model to work. According to McDonald (2014), the model does not assume normal distribution of the data. This is considered a major advantage, because most real-world datasets are not normally distributed. Additionally, the logistic regression algorithm requires the data to be structured; in instances where the dataset is unstructured, data extraction and tabularization are essential (Basharat et al., 2016).

2.2.2 Decision Tree Classifier

Decision Trees are series of sequential and hierarchical features that decide on optimal data splits/branches to minimise entropy and classify the variables into one of two or more classes at the bottom of the tree. Each branch of the tree represents a potential decision, occurrence or reaction. Patel (2012) visualised a decision tree algorithm classifying whether players should play tennis given the weather data (Figure 3).

[Figure not included in this excerpt]

Figure 3. Visualising the concept of classifying categorical variables using decision trees.

Decision trees are built using induction and pruning techniques. Induction builds the tree by setting the decision boundaries on hierarchical branches. Pruning aims to reduce the complexity of the decision tree by removing branches that have low importance for the model. This contributes to a reduction in model overfitting, to which decision trees are very prone.
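
As a minimal sketch of these ideas (using scikit-learn as a stand-in; neither the dataset nor the hyperparameter values come from this thesis), induction depth and pruning can be controlled directly on the classifier:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # max_depth limits induction depth (pre-pruning); ccp_alpha removes
    # low-importance branches via cost-complexity post-pruning
    tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=0)
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))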

Model overfitting, along with low model stability, is a major disadvantage of decision tree classifiers and tends to be magnified on large datasets (Sebban et al., 2000). However, this method is very transparent and easy to interpret, as the decision-making process can be followed from branch to branch of the tree. Additionally, decision trees handle missing values, discrete numerical data and categorical data very well, with no need for conversions as is the case in other models, e.g. Neural Networks (Song and Lu, 2015). However, decision trees have been shown to work effectively mostly in scenarios where the outcome is binary.

2.2.3 Ensemble Models

Ensemble Models are based on the concept of ensemble learning, where multiple small algorithms work together to create a more advanced machine learning model than if they worked on their own. Due to the use of multiple algorithms, ensemble models provide higher prediction/classification accuracy than their constituent sub-models.

Ensemble models can be divided into 2 classes based on their operational characteristics: models utilising the bagging concept and models utilising the boosting concept (Rokach, 2005). The bagging architecture modifies the basic decision tree model and makes it more complex; an example of an ensemble model that takes advantage of the bagging approach is the Random Forest classifier. In contrast to the bagging approach, boosting algorithms build a strong model from many simple (weak) learners; an example of a boosting algorithm is XGBoost.

2.2.3.1 Random Forest (RF)

The RF classifier is an ensemble algorithm based on the bagging concept. It uses the outcomes from multiple decision trees to predict/classify the dependent variable (Figure 4). The RF classifier selects random data points from the training set and constructs a decision tree around those points. The number of decision trees must be specified before the trees are created. Each of the trees produces a class prediction; the results are compared between the trees, and the outcome with the majority of "votes" is predicted.

[Figure not included in this excerpt]

Figure 4. RF classifier operational mechanism. Decision trees 2 and 3 outvoted tree no. 1; therefore, the output variable will be classified as class B (Koehrsen, 2017).
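
A minimal illustration of this majority-vote mechanism (a scikit-learn sketch on assumed toy data, not the thesis's own code) might look as follows:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # n_estimators fixes the number of decision trees before training begins;
    # each tree is grown on a random bootstrap sample of the training data
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X, y)

    # predict() returns, for each row, the class voted for by the majority of trees
    print(rf.predict(X[:5]))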

A comparative study between logistic regression and RF classifiers concluded that the RF model is a better binary classifier than logistic regression. The project investigated 243 distinct datasets, which allowed the model performance to be tested in depth. Overall, the RF model performed well on all datasets and was shown to be a better classifier than logistic regression. However, the researchers discovered that the RF classifier does not handle category-rich features well; the RF model failed to function if the dataset contained a feature with 53 or more categories (Couronné et al., 2018). It is believed this might be a major disadvantage here, because the ADR dataset has a high number of features with wide category ranges.

Another study, examining neurological impairment in 310 patients, investigated the possibility of automated disease classification based on medical records. The research conducted by Siddiqui et al. (2017) used a number of classifier models to segregate patients into 5 different disease classes: normal, Alzheimer's, AIDS, cerebral calcinosis, glioma or metastatic. Of the tested classifiers, RF achieved the best predictive accuracy of 96%. Additionally, it was shown that the accuracy of RF increased proportionally with the number of features added to the analysis; the number of features (independent variables) varied from 5 to 20 across 4 separate model training operations. Moreover, the other performance metrics of the RF classifier were optimal and better than those of all other models tested, which provides a solid argument that RF is one of the top multiclass classifiers currently available. It is important to mention that this study used a small sample size and that the results might differ in a large-scale project. The researchers also failed to clearly address the bias-variance trade-off, which is the main downside of using a random forest model (Ghosal and Hooker, 2018).

2.2.3.2 The eXtreme Gradient Boosting (XGBoost) Classifier

The XGBoost classifier uses an adaptation of the gradient boosting concept: it takes many weak learners (shallow decision trees) with low predictive power, but better than guessing, i.e. P > 0.5, and combines them to form a strong learner, a prediction/classification model with high accuracy. Normally, gradient boosting adds classifiers (decision trees) sequentially, one by one, correcting the errors introduced by each preceding decision tree. In the XGBoost model, however, instead of reassigning the weights to the sub-decision trees after every batch, the algorithm fits the new model to the residuals of the previous prediction and then minimizes the error when adding the latest prediction (Chen and Guestrin, 2016). It also uses a regularised loss function to decide on tree splitting during model creation and to reduce overfitting (Chen and Guestrin, 2016). This is a very effective approach to dealing with variance and bias, which are also the biggest flaws of standard decision tree models (Geurts et al., 2001).

XGBoost is a relatively new model, and in recent years it has become very popular within the data science community by repeatedly winning machine learning competitions (Yi Jin, 2017). XGBoost has been shown to outperform other machine learning models and, in most cases, other gradient boosted models. Santhanam et al. (2017) described 2 classification experiments conducted with XGBoost on different data types. The first experiment resembled the data structure of this project and consisted of structured numerical/categorical data of diabetic patients. XGBoost was shown to outperform the traditional gradient boosting model with respect to prediction accuracy, scoring 78%, which was 5% higher than the alternative models. Moreover, the time required to complete the classification task with XGBoost was 10x shorter than with the other gradient boosted model. No other performance metrics were listed by the researchers, nor did they provide details about model optimisation.

Because XGBoost is a new model, there is very little scientific literature available regarding its performance in multiclass scenarios. However, there are reports that multi-label classification is possible using the One-vs-Rest approach, also known as the binary relevance method, which joins multiple binary classifiers to produce the required output (Ziegler, 2019).
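
A hedged sketch of this binary relevance approach (the xgboost package, toy data and parameter values are illustrative assumptions, not drawn from the thesis):

    from sklearn.datasets import make_classification
    from sklearn.multiclass import OneVsRestClassifier
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                               n_informative=5, random_state=0)

    # One-vs-Rest trains one boosted binary classifier per class and combines
    # their scores to produce the multiclass output (binary relevance)
    clf = OneVsRestClassifier(XGBClassifier(n_estimators=200, max_depth=3,
                                            learning_rate=0.1))
    clf.fit(X, y)
    print(clf.predict(X[:5]))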

2.2.4 Support Vector Machines (SVMs)

SVMs classify the data by segregating it into 2 classes separated by a maximum-margin hyperplane. The classification of new data points depends on the side of the hyperplane on which each new data point appears. The maximum-margin hyperplane is created at the optimal distance between the support vectors of the distinct classes; a large margin reduces the possibility of misclassification (Figure 5). The support vectors are the points from each group closest to the hyperplane, i.e. the data points in the dataset that are most difficult to classify as A or B.

[Figure not included in this excerpt]

Figure 5. Showing Linear SVM model. Thick line is the hyperplane separating data points to different groups. The slope of the hyperplane determines the size of the margin to support vectors (Ahuja and Yadav, 2012).

In cases where the SVM is unable to separate the data points linearly, the data are mapped to a higher dimensional space, e.g. from 2D to 3D, and so on, until the data points can be separated (Crisci et al., 2012). Alternatively, the Gaussian kernel can be utilised to separate non-linearly separable data, which is a far less power-consuming method than mapping the data points to a higher dimensional space separable by 2 or more linear or non-linear hyperplanes (Ahuja and Yadav, 2012).
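
As a brief sketch of the kernel option (a scikit-learn stand-in with assumed toy data, not code from the thesis):

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Two concentric circles: not separable by any linear hyperplane in 2D
    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

    # The Gaussian (RBF) kernel separates the classes implicitly, without
    # explicitly mapping the points into a higher-dimensional space
    svm = SVC(kernel="rbf", C=1.0, gamma="scale")
    svm.fit(X, y)
    print(svm.score(X, y))          # training accuracy
    print(svm.n_support_)           # number of support vectors per class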

A study conducted by Cervantes et al. (2015) showed that the SVM algorithm performs better on large datasets than on small datasets and has no problems analysing multidimensional data. The researchers also noted that the performance accuracy of the SVM increased proportionally with the number of support vectors within the modelled data. However, this study, along with Cortes and Vapnik (1995) and Mertsalov and McCreary (2009), highlighted that a major disadvantage of SVMs is the power-intensive and time-consuming computation the model requires. Due to the model's complexity and nature, especially with non-linear kernels, it is difficult to determine which parameter weights are responsible for the classification outcome, making the results and model performance difficult to interpret and optimise. The interpretation of results is often limited to graphical representation and local linear approximation of the scores (Auria and Moro, 2008). SVMs were mainly designed for 2-class classification; multiclass classification using this model has presented a number of performance and accuracy issues over the years (Chih-Wei Hsu and Chih-Jen Lin, 2002).

2.2.5 Artificial Neural Networks (ANN)

The artificial neural network architecture is modelled on the human neuron. As with any machine learning model, the ANN is trained first and the trained model is then applied to a test dataset for predictions/classifications. In an ANN, the input values (columns in the dataset) are assigned random weights and passed through one or more hidden layers with prespecified activation functions (Figure 6). The output of the activation function is compared to the actual value, and a cost function is calculated from the difference to estimate the error in the prediction. This error is fed back into the network to readjust the weights on the input nodes as part of ANN backpropagation. The input data is then passed through the activation functions again, and the process is repeated for a specified number of epochs (Maglogiannis, 2007).

[Figure not included in this excerpt]

Figure 6. Visualisation of deep feed forward neural network model with 2 hidden layers.
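
A minimal sketch of this train-and-backpropagate loop, written with Keras purely as an example framework (the architecture, toy data and hyperparameters are illustrative assumptions, not the thesis's own model):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # Toy data: 1,000 rows, 20 input columns, binary target
    rng = np.random.default_rng(0)
    X = rng.random((1000, 20))
    y = (X.sum(axis=1) > 10).astype(int)

    # Two hidden layers, as in Figure 6; the weights start out random
    model = Sequential([
        Dense(16, activation="relu", input_shape=(20,)),
        Dense(16, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])

    # The loss (cost function) compares predictions to actual values; the error
    # is backpropagated to readjust the weights, and the full pass over the
    # data is repeated for the specified number of epochs
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=10, batch_size=32, verbose=0)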

Some data scientists argue that, when using the feed-forward architecture, neural networks are less robust at predicting binary outcomes than other machine learning models (Suhail Gupta, 2018). The research suggests that Neural Networks are difficult to optimise due to the number of optimisation variables and the complex architecture. Additionally, the inability to determine the operations happening within the hidden layers of an ANN makes it more difficult to understand and replicate. Even though the complexity of ANNs is widely known in the data science industry, the results of the study should be treated with caution, as they contradict a number of other studies that have demonstrated ANNs to be superior to standard machine learning models.

A study conducted by Zekić-Sušac et al. (2014), which focused on classification of the entrepreneurial aptitude of students, demonstrated ANNs to be significantly better at classifying the experiment outcomes than SVMs, KNN classifiers and decision tree classifiers, scoring an average of 78% prediction accuracy, approximately 10% better than the other machine learning models. Other performance metrics (Precision, Sensitivity and Specificity) were also satisfactory. The study used multiple hidden layers, with the number of nodes per hidden layer varying between 2 and 20. The outcomes were successfully classified into two categories.

Other research has used ANNs for a multiclass problem. The aim of that study was to classify patients into four classes representing different heart conditions, based on a number of tabulated independent variables (heart rate and a range of parameters from electrocardiogram signals). The study used the backpropagation technique, which loops the produced errors back through the network to readjust the previously randomly assigned weights, thus establishing the importance of the input variables. This mechanism allowed the strengths of connections between the independent variables (ANN nodes) in the hidden layers to be determined and reduced the overall mean squared error of the algorithm. Using backpropagation with gradient descent, the study achieved 85% accuracy at correctly classifying the patients into the four distinct classes: cardiomyopathy, atrial fibrillation, complete heart block and normal (Acharya et al., 2003). The information on model quality was limited, as the study failed to produce performance metrics other than accuracy. The lack of scores for sensitivity, specificity, F1 Score or recall made it impossible to determine whether the results were skewed by False Positive and False Negative outcomes.

A study conducted by Dencelin and Ramkumar (2016), which classified protein structures (helix, sheet or coil) using encoded amino acid sequences, also achieved much better results with a multi-layered perceptron (ANN) than with other machine learning approaches. Interestingly, the researchers noted a significant increase in accuracy measures as the number of hidden layers in the designed ANN increased (Figure 7). This is something to consider when designing a multi-layer feed-forward perceptron.

[Figure not included in this excerpt]

Figure 7. Showing increase in Accuracy measures vs number of hidden layers added to ANN.

On the contrary, Guang-Bin Huang (2003) has shown that sufficient model accuracy in a feed-forward perceptron can be achieved with as few as 2 hidden layers and a specifically selected number of hidden nodes in each of these layers, which depends on the number of input nodes and the dataset size. Another study suggested that increasing the number of hidden layers does not always increase the accuracy of the model and can lead to model overfitting and memorization, the model's tendency to discover unrelated patterns within the data, which add noise to the output (Neto, 2018).

Most of the described studies used a multi-layered perceptron model, which is a regular feed-forward neural network with more than 1 hidden layer. The literature also refers to this approach as Deep Supervised Learning, and in most cases it has been shown to outperform other models in classification tasks on tabular/structured data, in both binary and non-binary outcome scenarios. The review of the literature also confirmed that ANNs can be difficult, computing-power-intensive and labour-intensive to work with, which is why the accuracy gained with ANNs is often sacrificed in favour of more robust and less demanding standard machine learning models that can provide acceptable results while using fewer resources.

2.3 Use of Machine Learning for ADR Prediction (based on FDA database)

The pool of research studies that have used machine learning algorithms for prediction/classification of adverse drug reactions using FDA data is very limited. There are two research projects of interest: one using standard machine learning algorithms and the other a proof-of-concept study involving neural networks. These studies are discussed in the following two sections.

2.3.1 Standard Machine Learning and ADRs Prediction

A study conducted by Chen (2018) utilised multiple standard machine learning models to predict ADR outcomes from 4 million events reported to the FDA. The database consisted of reports of ADR occurrences between 2012 and 2017. The researchers performed univariate analysis on the obtained FDA data, covering some of the independent variables used for the prediction/classification model (Figure 8). The pool of patients consisted of 46.6% males and 53.4% females; the mean age was 59 and the mean weight was 72 kg.

[Figure not included in this excerpt]

Figure 8. Distribution of the two main predictor variables: Age (on left) and Weight (on right).

The ADR data is formatted in such a way that 1 ADR event can be split into multiple rows of data. Some rows relate to the patient being hospitalised after intake of the drug product, while other rows may relate to the death of the same patient after ingesting the medicinal drug product. Apart from these 2 ADR outcomes, other outcomes present in the database add noise to the data.

Chen's study split the dataset into 2 separate datasets: one containing the events where the patient was hospitalised and the other containing the events where the patient died after drug intake. This caused the ADR outcomes to be treated as mutually exclusive, which does not correspond to realistic events, because a patient can be hospitalised and then die. Multiple machine learning models were tested on each of the 2 datasets separately, for convenience and probably as an assurance mechanism for decent predictive accuracy. As previously discussed, standard machine learning models perform better with binary classification (hospitalised or died) than with multiclass output (hospitalised, died, or hospitalised then died). Nevertheless, the methodology used by Chen is considered a major disadvantage of that study, and it is believed the multiclass approach would be preferable.

The investigator collated a master dataset from the files provided in the FDA data directory. Data collection was followed by cleansing, where all reported events were merged for the same patient to determine the number of medicinal products used by each patient at the time. Continuous variables such as age, weight and drug dose were standardised using the following formula, where X = variable, x̅ = mean and σ = standard deviation:

Standardised X = (X – x̅) / σ
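
For illustration, this standardisation can be expressed in a few lines of Python (the values are hypothetical; Chen's actual code has not been published):

    import numpy as np

    age = np.array([25.0, 40.0, 59.0, 72.0])  # hypothetical continuous variable
    standardised_age = (age - age.mean()) / age.std()
    # sklearn.preprocessing.StandardScaler applies the same transformation column-wise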

Dummy variables for each categorical variable were created following the feature standardisation. The categorical variables included drug role, drug name and route of drug administration (Chen, 2018). The exact approach to creating dummy variables for the categorical data has not been published. However, it is known that the researchers used 2 of the 7 available FDA sub-datasets, relating to patient demographics and medicinal drug-product information.
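
One common way of producing such dummy variables is shown below (a pandas sketch with hypothetical column values, since the exact encoding approach was not published):

    import pandas as pd

    df = pd.DataFrame({"route": ["oral", "intravenous", "oral", "topical"]})

    # Each category becomes its own 0/1 indicator column
    dummies = pd.get_dummies(df["route"], prefix="route")
    print(dummies)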

The study tested 4 machine learning models to classify the ADR outcomes (logistic regression, SVM, random forest and gradient boosted tree). The initial results were affected by unbalanced data. The researchers addressed this issue by taking multiple random samples from the dataset with replacement, where each sample contained an equal number of serious and non-serious ADRs. The use of replacement in balancing the data might have skewed the results, thus impacting the realistic representation of the ADR outcome classification. The models were then trained on the subsamples using an ensemble approach, where the prediction/classification was obtained by majority vote of the classifiers. The same approach was used for classification of outcomes in the test set.
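
A sketch of one such balanced bootstrap sample (the "serious" flag and toy class proportions are assumptions for illustration, not taken from Chen's paper):

    import pandas as pd
    from sklearn.utils import resample

    # Toy unbalanced dataset: 900 non-serious vs 100 serious ADR rows
    df = pd.DataFrame({"serious": [0] * 900 + [1] * 100})
    non_serious = df[df["serious"] == 0]
    serious = df[df["serious"] == 1]

    # Draw equal-sized samples with replacement to build one balanced subsample;
    # repeating this yields the ensemble of subsamples the models were trained on
    balanced = pd.concat([
        resample(non_serious, n_samples=len(serious), replace=True, random_state=0),
        resample(serious, n_samples=len(serious), replace=True, random_state=0),
    ])
    print(balanced["serious"].value_counts())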

Chen's methodology seems labour intensive and may introduce errors and inaccuracies into the final classification scores. Perhaps this is the reason why the model performance metrics were shown only for the test sets and not for the training sets. If the performance metrics were available for both sets, it would be possible to determine whether any overfitting occurred by looking at the discrepancy between the accuracy scores of the test set and the training set. Nevertheless, balancing the data significantly improved the available performance metrics of the machine learning models and yielded overall good performance (Table 1).

Table 1. Performance metrics of 4 machine learning models for death and hospitalisation test data sets (Chen, 2018).

[Table not included in this excerpt]

In this study the machine learning models were used to separately classify the occurrences of death and hospitalisation. The logistic regression model performed best for the death outcome, followed by the random forest model, with prediction accuracies of 76% and 73% respectively. The Precision and Recall scores for these 2 models were also satisfactory, which suggests the models performed well at correctly classifying true positive outcomes in the actual and predicted results. The F1 Score is not discussed separately, as it combines the precision and recall scores into 1 metric.

Interestingly, the prediction/classification accuracy for the hospitalisation outcome was slightly lower than for the death outcome, but still good, with the top two models scoring 75% (logistic regression) and 74% (random forest). This might have been due to the random sampling with replacement approach Chen (2018) used, whose potential implications were discussed above.

2.3.2 Neural Networks and ADR Prediction

The investigated literature suggests that ANNs are a better fit for predictive analytics of ADRs with multiclass output. Research conducted by Xu et al. (2017), which investigated drug toxicity, utilised deep learning and ANNs for classification of a multiclass output on a dataset structured similarly to the FDA data. If ANNs had been used in Chen's (2018) study, the ADR outcomes would not have had to be separated or treated as mutually exclusive. This approach would allow the outcome to be classified into 3 classes simultaneously, thus generating more realistic results, i.e. it would be possible to predict who was hospitalised, who died, and who was hospitalised and then died.

A number of data mining studies have been completed in which the performance of predictive models was compared between standard machine learning models and deep learning algorithms on structured data.

This research showed that ANNs perform better than other machine learning models on large-scale datasets, such as the ADR dataset. The study achieved superior results with ANNs where the outcomes had multiple classes that were not mutually exclusive and the number of independent features was high (Teshnizi and Ayatollahi, 2015).

A proof-of-concept study examining the use of a feed-forward neural architecture for prediction of ADR outcomes was conducted by Yen et al. (2011). The research yielded an exceptional predictive accuracy in its classification, but used the binary outcome approach (hospitalised OR died) and a limited number (3,000) of ADR events. The study used personal attributes such as sex, weight and age, which have previously been identified as the main predictive factors in ADR occurrence. Yen et al. (2011) selected a part of the FDA dataset, but the details have not been specified. The data had to be extensively processed into forms accepted by the ANN algorithm.

The side effects and diseases were encoded using the MedDRA dictionary used by health organisations. The ADRs were categorised as non-serious or serious and encoded as 1 and 2 respectively. The research omitted the ADRs related to both hospitalisation and death, which is considered a disadvantage. However, a scale and cut-off points were proposed to estimate the severity of the ADR: an ADR was considered non-serious if the output was in the 0.5-1.5 range, while 1.5-2.5 was considered a serious ADR. The feed-forward, backpropagated neural network consisted of 4 layers (2 hidden layers) with 30, 30, 20 and 1 nodes in each layer respectively.
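
The described 30-30-20-1 architecture could be reproduced roughly as follows (a Keras sketch; the activation functions, optimiser and loss are assumptions, as Yen et al. (2011) do not fully specify them):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # Input: 30 features; hidden layers: 30 and 20 nodes; output: 1 node whose
    # value is judged against the 0.5-1.5 (non-serious) / 1.5-2.5 (serious) ranges
    model = Sequential([
        Dense(30, activation="sigmoid", input_shape=(30,)),
        Dense(20, activation="sigmoid"),
        Dense(1, activation="linear"),
    ])
    model.compile(optimizer="sgd", loss="mse")
    model.summary()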

The initial ANN scored a prediction accuracy of 44.2% for serious ADRs and 99.78% for non-serious ADRs. The big discrepancy in predictive ability was probably caused by the data being unbalanced. The parameters of the ANN were optimised and the accuracy scores recalculated; the optimised ANN yielded prediction accuracies of 99.87% and 100% for serious and non-serious ADRs respectively.

The research paper failed to provide detailed performance metrics apart from prediction accuracy. Given the 99% result, it is suspected that the model was over-trained and overfitted. It is widely known, and was confirmed by Chen (2018), that training a model on a small number of data points, as in this study, typically leads to overfitting. To confirm this claim it would be necessary to examine the full confusion matrix as well as the predictive accuracy scores on the training and test datasets. The researchers did mention that the false positive rate for serious ADRs was 55.8%, which is not ideal. However, without the full confusion matrix, the quality of the results can only be speculated upon.

Another flaw of the study is the lack of information on the selection process for the 3,000 cases/patients used in the analysis. The FDA database contains 2-4 million cases per year, and no information was provided on whether the events were selected at random, for a particular disease, or for a particular reporting year. The data is rarely normally distributed; therefore each of these factors could have a significant impact on the quality of the prediction/classification of ADRs.

Yen et al. (2011) suggested a further study should be conducted with 10,000+ cases to determine whether a larger dataset would impact the quality of ADR classification. Since the publication in 2011, no large-scale study has been conducted. Current research shows that for the sample size to yield significant results when using ANNs, it is necessary to follow a "factor 50 rule of thumb" (Alwosheel et al., 2018), which requires a minimum of 50 data rows for each adjustable parameter within the Neural Network.
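
Applied, for illustration, to the 30-30-20-1 architecture described above (a back-of-the-envelope calculation by the present author, not one taken from the cited papers), the rule implies:

    (30 × 30 + 30) + (30 × 20 + 20) + (20 × 1 + 1) = 1,571 adjustable weights and biases
    50 × 1,571 = 78,550 data rows minimum

This is far more than the 3,000 cases used by Yen et al. (2011), which further supports the overfitting concern raised above.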

2.4 Selection of Machine Learning Model

It was decided to further pursue the research of classifying ADRs using artificial neural networks and to build on top of the previously conducted proof-of-concept study published by Yen et al. (2011). Based on the findings of the literature review involving standard machine learning models and ANNs, it is believed that the latter will be the most suitable algorithm due to its versatile nature. ANNs can handle most types of data, and there are methodologies to convert non-accepted data types (categorical variables) into the numerical variables best suited to ANNs. Yen et al. (2011) demonstrated that an ANN can be used for binary classification of ADRs on a small scale; however, its effectiveness must be verified on large-scale data to ensure the claims made by the researchers hold.

Additionally, compared to other machine learning models, ANNs are exceptional at multiclass classification problems, which is the main aim of this project. Chen (2018) demonstrated on large-scale data that standard machine learning models can cope with binary classification of ADR outcomes relatively well; therefore, conducting this project with those models would not bring much benefit to the scientific community.

XGBoost, the current gold-standard model for classification of structured data, stood out from the other machine learning models and ANNs. However, it will not be included in the analysis of the ADR data. This is due to the novelty of the algorithm and the lack of a significant amount of scientific, peer-reviewed literature confirming that this model is suitable for large datasets with a variety of feature types and multiclass output.

2.5 Cross-Industry Standard Process for Data Mining (CRISP-DM)

Data Science projects consist of complex and multistage processes that are often labour intensive and time consuming. Therefore, it is important to follow a process flow that ensures all the necessary steps are completed during a data science project. A number of studies were identified that recommended the CRISP-DM model as the best framework for conducting data mining endeavours. The CRISP-DM model has been praised for being easy to implement (Wowczko, 2015), for being useful in planning and documenting the processes involved in data mining, allowing even an inexperienced data scientist to replicate advanced data-oriented tasks (Wirth and Hipp, 2000), and for helping to foresee potential issues with a project (Palacios et al., 2017).

CRISP-DM was described by Chapman et al. (2000) in 6 main steps (Figure 9): business understanding, data understanding, data preparation, modelling, evaluation and deployment. CRISP-DM is a cyclical process in which some of the steps are repeated until the desired results are achieved.

[Figure not included in this excerpt]

Figure 9. Schematic representation of CRISP-DM model

Chapman et al. (2000) explain that the business understanding stage of the CRISP-DM model is the initial process, which involves setting the purpose and objectives of the project (the research question). It is also at this stage that the initial requirements and the resources needed to complete the project are assessed. The data understanding stage is concerned with data collection and accessibility, exploration and statistical analysis of the dataset, as well as assessment of data quality.

This is followed by the data preparation stage, where relevant data is filtered/selected for further analysis, cleaned and transformed for the purpose of the project. This may include data encoding or formatting. The next stage is modelling: the appropriate modelling techniques are selected, test/validation approaches are designed and the machine learning models are built. The created models are evaluated and reconstructed where necessary. The modelling and data preparation stages are iterative steps that can be repeated multiple times until the desired outputs are achieved.

The final two stages of the CRISP-DM model are evaluation and deployment. The evaluation stage focuses on assessing model accuracy and generalisability, as well as overall project success with respect to the project's aims and objectives. It reviews the processes used to complete the project and determines what future work should involve. The deployment stage relates to a commercial setting and is outside the scope of this research project.

3. CHAPTER THREE: RESEARCH DESIGN & METHODOLOGY

3.1 Research Design

The research design can be guided by 3 contrasting research philosophies. These distinct research designs have been visualized, along with research approaches, strategies and techniques, by Saunders et al. (2009) in the form of a research onion (Figure 10). The philosophies of Positivism, Realism and Interpretivism each guide the research approach in a specific manner, based on the assumptions that each philosophy follows (Flick, 2015).

[Figure not included in this excerpt]

Figure 10. Visualization of research design development in the form of a research onion (Saunders et al., 2009).

The theory of Positivism accentuates the observational evidence used in quantitative research and depends on the values of objectivity and reproducibility, which can be confirmed in a logical and mathematical way (Bryman, 2003). Interpretivism contradicts the methodology of positivism and is used in qualitative research. Interpretivists incorporate human perception into the research study: the researchers examine the opinions and beliefs of people from different social groups or different fields on the same topic. This derives from the belief that research guided by the ideology of Positivism (as in the biomedical sciences) and research guided by Interpretivism (as in psychology) require different methods; thus the quantitative approaches of positivism might not be suitable or sufficient for collecting and analysing data from individuals with different viewpoints (Myers, 2013). The theory of Realism is somewhat aligned with the theory of positivism, as it draws conclusions from the same quantitative approaches and techniques, but differs at the stage of results interpretation. In contrast to positivism, realism relies on uncovering the non-observable mechanisms that cause the events observed by researchers in the biomedical sciences and psychology (Saunders et al., 2009).

3.1.2 Research Philosophy & Strategy

Due to the numeric nature of the data and the desired mathematical output, the author will align this research with the philosophy of positivism. This will be clearly observed at the primary data collection, processing and analysis stages, where the findings will be subject to statistical review. The results will be represented as numbers relative to each other, as opposed to the author's opinions and beliefs. This research will use deductive reasoning, also referred to as a “top-down” approach (Heit and Rotello, 2010); therefore an experiment involving classification with an ANN will be designed to test the hypothesis stated in pt. 1.7. The obtained results will test the validity of the hypothesis, and conclusions will be drawn from the results and their comparison with already available studies.

3.1.3 Ethics and Privacy

No ethical committee approval or subject consent is required to conduct this study; therefore, there are no direct ethical implications surrounding this research project. The ADR reports do not contain personal information about patients, which prevents the identification of individual patients. The data obtained from the FDA is public and can be used for the purpose of this project with no restrictions. The processing of the medical records will not infringe the General Data Protection Regulation 2016/679 (GDPR) or its equivalents. The study will be conducted in accordance with the ethical principles for quantitative research described by Drew et al. (2008), and the Python code used for processing and analysing the data will be made available (Appendix A). It is essential to highlight that the author is aware that data fabrication, omission of important information and plagiarism are considered serious academic misconduct and may result in cancellation of the project.

3.2 Methodology

Nine steps have been identified in the methodology process of this research project (Figure 11). The process flow has been aligned with the CRISP-DM model; some of the CRISP-DM stages have been split into separate tasks and some have been joined together. It is important that the research methodology follows the CRISP-DM model to ensure the quality of process documentation, and that the analysis is conducted on a dataset with at least 10,000 rows, as suggested by Yen et al. (2011), to ensure the obtained results are significant.

[Figure not included in this excerpt]

Figure 11. Process flow diagram of the research methodology.

3.2.1 Problem Understanding / Research Purpose Statement

The idea of predicting ADR outcomes with machine learning and deep learning models was investigated in the literature review. The secondary research revealed 2 studies of high importance. The first was a study conducted by Chen (2018) that used 4 million rows of data to predict events where the patient was hospitalized or died. The second was a proof-of-concept study conducted by Yen et al. (2011) that used 3,000 rows of data and a feed-forward neural network to predict ADR outcomes. Both studies were based on a binary outcome and omitted the ADR cases where the patient was hospitalized and then died.

The purpose of this primary research is to confirm or disprove the findings of the literature review regarding the ability of artificial neural networks to precisely predict ADR outcomes and to outperform the standard machine learning models described by Chen (2018). This project will investigate the predictive ability of an ANN on multiple classes, as opposed to the binary outcome prediction done by Chen (2018) and Yen et al. (2011).

3.2.2 Technologies and Hardware Used

Jupyter Notebook (Python 3.6) in the Anaconda suite will be used to perform all tasks associated with data processing and modelling. The major packages used in this project include Pandas, NumPy, Altair, Matplotlib, Scikit-Learn and TensorFlow with Keras. The hardware used in this project was a MacBook Pro 13” 2017 model with a 2.3 GHz Intel Core i5 processor, 8 GB 2133 MHz LPDDR3 RAM and an Intel Iris Plus Graphics 640 1536 MB graphics card, running macOS 10.14.3.

3.2.3 Data Understanding

The data understanding stage involved the establishment of data privacy principles and accessibility. Data assembly, exploration and assessment of data quality, as well as univariate analysis, were also part of this process, as discussed below.

3.2.3.1 Data Accessibility and Data Accuracy

The FDA Adverse Event Reporting System is an official open platform where a copy of the original data is made available to the public. The FDA dataset consists of all reported ADR cases in the US and serious ADR reports from other countries. The FDA database structure and content adhere to International Conference on Harmonisation (ICH) standards (FDA, 2019). The FDA reports a number of flaws/data limitations on its website that should be taken into consideration when analysing the data:

- Duplicate and incomplete reports are present in the ADR dataset.
- Information in the reports is not verified, as the reporters are not required to submit any confirmatory evidence.
- ADR reports reflect the opinions and observations of the reporters and are not proof that the suspect drug caused the ADR.
- The rate of ADR manifestation cannot be estimated from the reports, as only a fraction of ADRs are reported.

It is important that the author improves the data quality without impacting data accuracy. Therefore, missing data that can be reconstructed will be filled in, and redundant or unusable data will be omitted from the analysis. This will ensure the results are based on complete data that is as close to reality as possible.

3.2.3.2 The ADR dataset

The dataset was made available by the FDA as ASCII files, each containing the data for one quarter of a year. Each quarterly file is subdivided into 7 separate files:

- Patient Demographics
- Drug Information
- ADR outcome
- Reaction
- Reporting Source
- Therapy details
- Indication for drug use

The description and contents of each table are given in the ASC_NTS.doc file supplied by the FDA with the ADR dataset and are attached (Appendix B).

3.2.3.3 Data Acquisition and Assembly

The data was downloaded from the FDA database as ASCII files consisting of separate text files. The data files used for this project contained information corresponding to ADR reports from January 2015 until March 2019. Data older than 2015 was excluded from the analysis due to differences in dataset format and in the coding of columns and their values. The 7 text files contained within each ASCII folder consisted of continuous rows of strings separated by “$” symbols.

In total, 119 text files were loaded into Jupyter Notebook using built-in Python commands, and the relevant delimiter was set to separate the continuous strings into columns. Each of the 119 files was converted to a Pandas data frame for ease of data wrangling. The 7 tables (Figure 12) were joined in a logical manner to avoid errors in assembly, as sketched below. First, the Demographics and Drug files were joined on ‘primaryid’ to create the main table for a year quarter. Next, the Reaction, Outcome and Report Source tables were joined to the main table on the ‘primaryid’ and ‘caseid’ columns. This was followed by a join between the main table and the remaining 2 tables, Therapy and Indication, on ‘primaryid’, ‘caseid’ and ‘drug_seq’ (ADR drug suspect sequence number). The process was repeated for all 17 quarters investigated.
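A minimal sketch of this assembly for one quarter is given below (file names are illustrative and only four of the seven tables are shown; the full code is in Appendix A):

    import pandas as pd

    # Each FAERS quarterly text file is a "$"-delimited table.
    demo = pd.read_csv('DEMO19Q1.txt', sep='$', dtype=str)  # patient demographics
    drug = pd.read_csv('DRUG19Q1.txt', sep='$', dtype=str)  # drug information
    reac = pd.read_csv('REAC19Q1.txt', sep='$', dtype=str)  # reactions (side effects)
    outc = pd.read_csv('OUTC19Q1.txt', sep='$', dtype=str)  # ADR outcomes

    # Demographics and Drug form the main table for the quarter.
    quarter = demo.merge(drug, on='primaryid', suffixes=('', '_drug'))
    # Reaction and Outcome join on 'primaryid' and 'caseid'.
    quarter = (quarter.merge(reac, on=['primaryid', 'caseid'])
                      .merge(outc, on=['primaryid', 'caseid']))

    # Repeating this for all 17 quarters and stacking yields the master frame.
    quarter_tables = [quarter]  # in practice, one assembled table per quarter
    master = pd.concat(quarter_tables, ignore_index=True)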

[Figure not included in this excerpt]

Figure 12. ADR database entity relationship diagram.

The full tables for each quarter were stacked together, forming a master data frame with 53 columns and 703,727 rows of data. Taking into account the recommendations of Chen (2018), the following variables were identified as requirements for the final analysis: drug name, product active ingredient, weight, age, sex, side effect and ADR outcome. The other columns were kept for the initial analysis and to assist with the data cleaning operations, to make sure no duplicates or wrong values were included in the dataset.

3.2.3.4 Assessment of Data Quality

The quality of the data was assessed by verifying the number of missing values, the types of duplicates and their occurrence patterns, and inappropriate datatypes for each column/feature. The dataset was explored for other inconsistencies such as spelling mistakes, random symbols and punctuation signs at the ends of words, and values entered in the wrong columns, e.g. a dosage form in the commercial name of the drug. The exploration of the raw data was followed by a preliminary univariate analysis of the main variables of interest: age distribution, weight distribution, sex ratio and frequency of ADR outcomes (Appendix C).

3.2.4 Data Cleaning

Some of the independent variables contained mixed-case letters. As the first and most basic cleaning operation, the strings within the data were reformatted to uppercase. The column datatypes were changed to correctly represent the variable types. It was then decided to focus on the important independent variables/features, namely those previously identified by Chen (2018) as necessary for the prediction of ADR outcomes.

3.2.4.1 Weight Variable

The data related to weight was separated into 2 columns: “wt”, a numeric value, and “wt_cod”, the unit in which the weight was measured (KG or LBS). The dataset was filtered for all rows containing LBS values in the wt_cod column, and the corresponding “wt” values were converted into kg by dividing the numeric LBS value by 2.205. This resulted in a wt column with uniform kg values.

Next, the data was grouped by mean weight per age. This revealed outliers for patients who were 78 years old and 0-2 years old. Filtering the data frame for cases where individuals were 78 years old and above 150 kg in weight revealed 5 cases where the weight of the individual was 9,525 kg. It was impossible to determine whether this was a punctuation error or a misplaced value; therefore, rows in which the patient's weight was >1,000 kg were deleted from the data frame.

Next, the 0-2 age group was investigated. A number of instances were found for patients below 2 years of age where the weight was outside realistic measures, for example a 1 month old baby with a weight of 140 kg. In order to keep the rows of data associated with inappropriate weight values, the outliers were corrected using data provided by the World Health Organization (WHO). Every weight value outside the maximum weight recorded by the WHO for a given age was substituted with the average value for that age, in accordance with the age-weight relationship (WHO, 2019). Refer to Appendix D for the minimum, average and maximum weight values for each age.

The FDA data frame was identified as having approximately 210,000 missing weight values. To fill them in, the mean weight per age was calculated and the null weight values were filled with the mean weight corresponding to the age of the relevant patient. This reduced the number of missing values to 29,000. The studied WHO data showed a strong correlation between the age and weight of a patient; therefore, this is believed to be an acceptable method of filling in the missing data.
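A condensed sketch of the weight-cleaning steps described above, assuming df is the assembled master data frame (thresholds are as stated in the text):

    import pandas as pd

    # Convert LBS entries to kg so the "wt" column is uniform.
    df['wt'] = pd.to_numeric(df['wt'], errors='coerce')
    lbs = df['wt_cod'] == 'LBS'
    df.loc[lbs, 'wt'] = df.loc[lbs, 'wt'] / 2.205
    df.loc[lbs, 'wt_cod'] = 'KG'

    # Remove physically impossible weights (e.g. the 9,525 kg entries),
    # keeping NaN rows so they can be filled in below.
    df = df[df['wt'].isna() | (df['wt'] <= 1000)]

    # Fill missing weights with the mean weight of patients of the same age.
    df['wt'] = df['wt'].fillna(df.groupby('age')['wt'].transform('mean'))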

3.2.4.2 Age Variable

The age of the individuals was also separated into 2 columns: “age”, a numeric value, and “age_cod”, the unit in which the age was measured. The units of measure for the age variable were hours, days, weeks, months and years. Similarly to the weight variable, the age values were selected based on their unit and, for ease of analysis, converted to years by division by a specific factor (Table 2). For example, for an individual who was 730 days old, the age value was converted as follows: 730 days / 365 = 2 years old. All time conversions were based on the Gregorian calendar.

Table 2. Conversion factors used to convert age values to years

[Table not included in this excerpt]

The conversion of the age values resulted in new float values with many decimal places for the 0-10 age groups. To generalize the data, the ages were rounded according to the groups in Table 3: any fractional age in the data frame was converted to a natural number (a sketch of the conversion and rounding follows Table 3). The rows with missing age values were removed from the data frame, because the mean-based method previously used to estimate the weight values did not work for age.

Table 3. Age ranges used for rounding of age values

[Table not included in this excerpt]
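A minimal sketch of the conversion and rounding steps; the divisor for each unit code is an assumption, as Table 2 is not reproduced in this excerpt:

    import pandas as pd

    # Divisors converting each age unit code to years (assumed values).
    to_years = {'HR': 8766, 'DY': 365, 'WK': 52, 'MON': 12, 'YR': 1}

    df['age'] = pd.to_numeric(df['age'], errors='coerce') / df['age_cod'].map(to_years)

    # Drop rows with missing age, then round fractional ages to whole years.
    df = df.dropna(subset=['age'])
    df['age'] = df['age'].round().astype(int)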

3.2.4.3 Product Active Ingredient (prod_ai) variable

The 73 missing values for the active ingredient were substituted with the drug name values from the “drugname” column. Many drug name values also contained the product ingredient in the format Drugname[product_ai], for example Proventil[Salbutamol]. In these cases, the ingredient name within the brackets was transferred from the “drugname” column to the “prod_ai” column. The product active ingredient data was further cleaned by removing dosages, brackets, numbers, punctuation signs and other characters that were not part of the name.
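A sketch of these steps using Pandas string methods (the exact regular expressions are assumptions; the actual cleaning code is in Appendix A):

    # Extract 'SALBUTAMOL' from 'PROVENTIL[SALBUTAMOL]'; use it, or the plain
    # drug name, wherever prod_ai is missing.
    bracketed = df['drugname'].str.extract(r'\[([^\]]+)\]', expand=False)
    df['prod_ai'] = df['prod_ai'].fillna(bracketed).fillna(df['drugname'])

    # Strip dosages, digits, brackets and stray punctuation from the name.
    df['prod_ai'] = (df['prod_ai']
                     .str.replace(r'\d+\s*(?:MG|MCG|ML)\b', '', regex=True)
                     .str.replace(r'[\[\]().,;:]', '', regex=True)
                     .str.strip())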

3.2.4.4 Commercial Drug Name variable

Some of the “drugname” values were empty strings or one-letter strings. Where possible, the empty strings were substituted with the “prod_ai” name. In many instances, the commercial names of drugs included the dosage of the medicinal product, dosage forms, names of manufacturers or punctuation signs, for example MIFEPRISTONE TABLETS, 200 MG (DANCO LABS). All unnecessary information was removed using built-in Python functions so that the “drugname” column contained only the commercial drug name, for example MIFEPRISTONE.

3.2.4.5 Data Filtering and Data Selection

The data frame was filtered for rows with the outcome hospitalization or death. The remaining unknown, duplicate and “NaN” values were removed from the data frame, resulting in a new dataset of approximately 35,000 rows. The dataset was also reduced in column count. The 10 columns used for the final data preparation were:

- “drugname” – Commercial name of the drug
- “prod_ai” – The working substance of the drug (active ingredient)
- “route” – The route of drug administration, e.g. oral, intravenous, intramuscular
- “pt” – Observed side effects after ingestion of the medicinal product
- “indi_pt” – Indicative use of the medicinal drug product
- “age” – Age of the patient (years)
- “sex” – Gender of the patient
- “wt” – Weight of the patient (kg)
- “drug_seq” – Drug suspect number for a given ADR outcome (1-20)
- “outc_cod” – The adverse drug reaction outcome, e.g. hospitalization, death

The outcome variable was split into 3 temporary data frames, representing cases where the patient was:

- Data frame 1 - Hospitalized (H) or
- Data frame 2 - Died (D) or
- Data frame 3 - Hospitalized then died (HD)

Data frame 3 was created by filtering for occurrences of the event appearing in both data frames 1 and 2. The selected HD cases were then removed from data frames 1 and 2 to avoid data duplication. The data frame was assembled back together with 3 possible ADR outcome values.

Following the cleaning operations, the assessment of the data was performed again to explore the statistics of the variables used for the classification of ADR outcomes.

3.2.5 Data Preparation

In order to correctly perform classification with ANNs and obtain optimal results, the dataset had to be processed so that it matched the feature format required by the ANN algorithm. The data frame was encoded, the dependent variable was balanced, the independent features were scaled and multicollinearity was removed.

3.2.5.1 Encoding

ANNs operate on numbers; no categorical input is allowed. Therefore, all categorical variables were converted to numeric variables. The ADR outcome variable was a 3-class variable and was encoded as 0 = Hospitalized, 1 = Hospitalized, then died and 2 = Died. The sex variable was encoded as a binary column, where 0 = Female and 1 = Male.

Previous research studies used the MedDRA medical dictionary, which encodes similar diseases or side effects as one group; for example, cardiac fibrillation and cardiac arrhythmia would be classified together as heart disease. Access to the MedDRA dictionary in Ireland is restricted to the Health Service Executive (HSE) and the Health Products Regulatory Authority (HPRA), so the author was not able to use it.

Instead, the medical terms found in the dataset (commercial drug names, drug delivery routes, drug active ingredients and drug side effects) were encoded manually with a OneHotEncoder-like method. Dummy variables were created for every categorical column. Dummification created a column for every unique entry within a categorical feature (Table 4; see the sketch after the table). This means that a column with 4,000 unique values was converted to 4,000 binary columns containing 1s and 0s. The first dummy-encoded column of each categorical variable was removed to avoid the dummy variable trap (Suits, 1957).

Table 4. Explanation of the dummy-encoding process

[Table not included in this excerpt]
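A minimal sketch of the dummy encoding, assuming Pandas' get_dummies as the OneHotEncoder-like method (drop_first removes the first dummy column of each variable to avoid the dummy variable trap):

    import pandas as pd

    categorical = ['drugname', 'prod_ai', 'route', 'pt', 'indi_pt']
    encoded = pd.get_dummies(df, columns=categorical, drop_first=True)
    # Every unique value becomes its own 0/1 column, e.g. 'route_ORAL'.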

3.2.5.2 Multicollinearity

The collinearity analysis was done using the Pandas and NumPy packages. The analysis found a significant correlation between the commercial drug name and the product substance. This was expected, as there can be many different commercial drug names for every drug; for example, paracetamol (a drug substance) is marketed by different companies under the names Paracetamol, Panadol, Panamax, Adol, Anadin, Parapane, Paralief and others. Refer to the results section for the correlation matrix. Due to the collinearity, the drug name column was removed from further analysis (a sketch follows the list below). The rationale behind removing “drugname” rather than “prod_ai” is that:

- “drugname” has a high unique value count compared to “prod_ai”, which caused Jupyter Notebook to crash multiple times, and
- side effects as well as ADR outcomes are caused by the product substance, rather than by the commercial drug name.
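A sketch of the collinearity check; label-encoding the two categorical columns so that a Pearson correlation can be computed is an assumption about the method:

    import pandas as pd

    # Encode each category as an integer code, then correlate the pair.
    codes = pd.DataFrame({
        'drugname': df['drugname'].astype('category').cat.codes,
        'prod_ai':  df['prod_ai'].astype('category').cat.codes,
    })
    print(codes.corr())                 # Figure 21 plots the full matrix

    df = df.drop(columns=['drugname'])  # drop the collinear, high-cardinality feature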

3.2.5.3 Data Balancing by Over-sampling

The final data selection resulted in 15,000 Hospitalised (0) outcomes, 7,082 Hospitalized, then Died (1) outcomes and 12,734 Died (2) outcomes. The “Hospitalized, then Died (1)” outcome was under-represented and below the count of 10,000 recommended by previous studies. To maximize the classification accuracy of the model, the dataset was balanced using the Synthetic Minority Over-sampling Technique (SMOTE) discussed by Chawla et al. (2002), which artificially created data rows for the minority predictor variable (1) and randomly picked an equal number of samples of the over-represented predictor variables (0) and (2). In total, 11,310 rows of each predictor variable type were used for the analysis, just above the required threshold of 10,000, resulting in a total row count of 34,816 and a column/feature count of 4,330.
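A minimal sketch of the oversampling step, assuming the SMOTE implementation from the imbalanced-learn package (the random down-selection of the two majority classes to the same count is omitted for brevity):

    from imblearn.over_sampling import SMOTE

    # Synthesize rows for the minority class ("Hospitalized, then Died" = 1).
    X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, y)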

3.2.5.4 Data Split

The dataset was split into X = 4,329 independent features and Y = 1 multiclass dependent variable. The X and Y variables were further split at random in a 75:25 ratio into training and test sets, creating 4 arrays: X_train and Y_train, as well as X_test and Y_test.

3.2.5.5 Feature Scaling

To gain maximum accuracy and precision, as well as to improve the stability of the classification algorithm, the data containing the independent variables (X_train and X_test) was standardized to mean x̅ = 0 and standard deviation σ = 1. The feature scaling was achieved with the StandardScaler function of the Scikit-Learn package, which uses the standardization formula referenced by Chen (2018).
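A sketch of the split and scaling steps with Scikit-Learn (the random seed is an assumption for reproducibility). The scaler is fitted on the training set only and then applied to both sets, so no information from the test set leaks into training:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X_train, X_test, Y_train, Y_test = train_test_split(
        X_balanced, y_balanced, test_size=0.25, random_state=42)

    scaler = StandardScaler()           # z = (x - mean) / standard deviation
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)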

3.2.5.6 Modelling and Model Evaluation

The correctly formatted data was used to train a number of ANN models. The designed ANNs were feed-forward neural networks, built using TensorFlow with Keras. Mostly dense layers were used in the models, meaning the nodes were fully connected from layer to layer. Some models used deep learning for classification, as they included multiple hidden layers. Each model was trained on the training arrays (75% of the data) and, for validation purposes, tested on the test arrays (25% of the data). Standard machine learning performance metrics (precision, accuracy, sensitivity, specificity and F1 score) were used to verify the quality of the classifications. The difference between the accuracy on the training set and on the test set was used to assess model fitting. The models were evaluated after each round, and the model optimizers were adjusted so that the model produced the best possible results (Table 5; a sketch of the evaluation loop follows the table). For detailed information on the composition of the subsequent ANN models, refer to the Model Coding section (Appendix E).

Table 5. Iterations of classification model specifications

[Table not included in this excerpt]
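A condensed sketch of the train-and-evaluate loop, assuming a compiled Keras model and Scikit-Learn's classification_report for the per-class metrics:

    import numpy as np
    from sklearn.metrics import classification_report

    history = model.fit(X_train, Y_train, epochs=10, batch_size=32,
                        validation_data=(X_test, Y_test))

    # Fitting was assessed via the gap between training and validation accuracy.
    gap = history.history['accuracy'][-1] - history.history['val_accuracy'][-1]

    # Per-class precision, recall (sensitivity) and F1 score on the test set.
    y_pred = np.argmax(model.predict(X_test), axis=1)
    print(classification_report(Y_test, y_pred))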

3.2.6 Limitation of Research Design and Methodology

The main limitation of the research was the low number of data points used; however, this was the maximum amount of data that could be handled on the available hardware. A fully functional Hadoop cluster would be of great benefit to the project, as it would allow analysis of the entire FDA database. Another limitation of this study was the poor quality of the data used: the FDA database is full of missing values and rich in duplicate events.

4. CHAPTER FOUR: RESULTS

4.1 Results of Preliminary Data Assessment

The analysis of missing data revealed that the master data frame, containing 119 FDA .txt files and over 700,000 rows of ADR events, is missing approximately 40% of its data (Figure 13). It is important to highlight that values containing an empty string were classified as complete entries (coloured black). The missing value counts for the important variables are: drugname = 0, prod_ai = 73, age = 47,812, wt = 210,884, route = 118,401, pt = 0 and outc_cod = 0. More data is available in Appendix C.

[Figure not included in this excerpt]

Figure 13. Number of missing (NaN) values in the entire downloaded FDA data frame. White represents missing values; black represents complete values.

A close analysis of the data contained in the FDA data frame showed multiple inconsistencies, such as values misplaced between columns, out-of-range values, values containing inappropriate signs, and columns containing multiple variables that should exist in separate columns (Figure 14).

The exploration of the data also showed a complexity that had not been observed before (Figure 14). The captured image shows 22 ADR events for one patient. The events are characterised by the same primary ID and the same case ID, even though they differ in some aspects. The “drugname” and “prod_ai” columns contain information on the drug substance administered to the patient; some of the rows relate to the same drug and some relate to different drugs. The ADR outcome “outc_cod” for these events is uniform (HO: hospitalisation), but this is not the case for other case IDs in the dataset. The indicative drug use “indi_pt” is the same for multiple occurrences of the drug substance, but the side effects “pt” for the same drug are different, yet classified under the same primary ID and case ID. This creates multiple rows per ID, per drug.

[Figure not included in this excerpt]

Figure 14. Screen capture of the FDA data for a single case ID (15129680).

4.2 Results of Univariate Analysis of Cleaned Data

The analysis of the cleaned data (to be used for the classification model) was done before the oversampling of the minority dependent variable. The univariate analysis included visualisation of the unique value counts of the important features: age (Figure 15), weight (Figure 16), ADR outcome (Figure 17), sex (Figure 18) and drug route (Figure 19). The drug name and drug active ingredient were excluded from the univariate analysis, as they contained over 4,000 unique values each, making them impractical to plot. The top 10 predictor variables were plotted according to their predictive power (Figure 20).

[Figure not included in this excerpt]

Figure 15. Distribution of age values for the patients used for ADR outcome classification.

[Figure not included in this excerpt]

Figure 17. Percentage value counts of each independent ADR outcome (blue: hospitalised, yellow: died, red: hospitalised then died).

[Figure not included in this excerpt]

Figure 18. Ratio of male to female patients used for ADR outcome classification, represented in percentages.

[Figure not included in this excerpt]

Figure 19. Value counts of the top 10 drug-administration routes used for classification of ADR outcomes.

[Figure not included in this excerpt]

Figure 20. Top 10 independent features according to their predictive power.

4.3 Results of Multicollinearity Analysis

The analysis of collinearity identified the commercial drug name and the product active ingredient as moderately correlated, meaning there is a linear relationship between these 2 independent variables (Figure 21).

[Figure not included in this excerpt]

Figure 21. Multicollinearity matrix of the independent variables and the dependent variable.

4.4 Results of Optimizers for Multi-label Classification Models

Table 6 shows the results obtained after modification of the model optimizers. The initial model used the RMSprop optimizer, recommended for multiclass purposes by the software documentation (Keras, n.d.), and 10 epochs; the last model used the Adam optimizer and 22 epochs. The full cycle of iterations is described in Table 6. The goal was to design a model with maximum predictive accuracy but with the smallest difference in accuracy scores between the training set and the test set. Models 1 to 6 had relatively large differences in accuracy scores (12%-21%), meaning these models were overfitted to the training set. More information on the performance of the designed models is available in Appendix F. Model 7 had satisfactory accuracy scores (Figure 22), the best performance metrics (Figure 23), and was the least overfitted.

Table 6. Classification accuracy scores throughout the model optimization process, showing the accuracy on the training set and the validation/test set, the difference between the accuracy scores, and the associated loss

[Table not included in this excerpt]

4.5 Results of the Final (multiclass) ADR-Outcome Classification (Model 7)

The adjustment of various optimization parameters resulted in the model 7 classifier. The final model achieved an accuracy score of 82% on the training set and 74% on the test set while classifying the outcome variable into 3 independent classes. The model still experiences some overfitting, as there is an 8% difference in accuracy scores between the training and validation sets, but a very small error due to the low loss value (Figure 22). The performance metrics (precision, recall and F1-score) were also satisfactory (Figure 23). The ability of the model to classify the labels, overall and individually, has been visualised (Figure 24). The area under the curve was good, at 89% for the combined measure.

[Figure not included in this excerpt]

Figure 22. Classification accuracy and loss function of the final model on the training and test sets.

[Figure not included in this excerpt]

Figure 23. Precision, recall and F1-score performance measures for every independent label classified by model 7.

[Figure not included in this excerpt]

Figure 24. Multiclass ROC curve, showing the probability of the model correctly classifying each label (0: hospitalisation, 1: hospitalisation + death, 2: death). False positive rate on the x-axis and true positive rate on the y-axis.

4.6 Results of Binary Classification Model

The results of the binary classification relate to the classification of the Hospitalised and Died ADR outcomes. The classification accuracy of the binary model was 89% on the training set and 83% on the validation set (Figure 25). The loss for both sets was low (below 1); therefore, no significant error was present. The precision of classification was 82%, the recall was 82% and the F1-score was 81.5% on average. The performance metrics for each label were visualised (Figure 26).

[Figure not included in this excerpt]

Figure 25. Accuracy and loss of the binary model for classification of the hospitalisation and death outcomes.

[Figure not included in this excerpt]

Figure 26. Precision, recall and F1-score performance measures for the binary model's classification of the hospitalisation and death ADR outcomes.

5. CHAPTER FIVE: DISCUSSION

5.1 ADR Data Trends in This Research and Literature

The extensive data cleaning and preparation achieved satisfactory results, as the univariate analysis of the main variables resembled the distributions presented by Chen (2018). This discussion section evaluates the obtained results and compares them to previous research.

The distribution of the age variable (Figure 15) presented a negative (left) skew that resembled the distribution plotted by Chen (2018) (Figure 8). However, an increase in the number of individuals aged 35-40 was observed, something that was not present in Chen's age distribution plot. This could be due to the fact that the files used in this project included the recent FDA files for 2018 and 2019, whereas Chen's analysis was based on data for 2012-2017. It might be that in recent years there has been an increase in 35-40-year-old individuals experiencing ADRs, or simply that the data cleansing done as part of this project missed some outliers that skew the results. Overall, Chen reported that most of the patients were 50-90 years old with a mean age of 59. According to the analysis of the data used in this project, most of the patients were 55-85 years old, with a mean age of 58.4, which is close to Chen's estimate. The mean age of the patients also confirmed the finding of the literature review that drug side effects are more prevalent in the ageing population.

The distribution of the weight variable followed a close-to-normal distribution (Figure 16). This aligns with the weight data presented by Chen (Figure 8). However, there is a small discrepancy in the weight range. According to Chen, most patients in the FDA database weighed 52 kg - 100 kg with a mean weight of 72 kg. The analysis of the data used in this project revealed that most patients were heavier, with a range of 55 kg - 95 kg and a mean weight of 77.6 kg. Some of the weight increase could be explained by the general population getting heavier from year to year (Vásquez et al., 2018). However, according to the estimates of Lewis et al. (2000), this would explain only 0.5 kg - 1 kg of weight increase per year. Therefore, it is important to highlight that there must be a small error in the statistics computed by Chen (2018) or by the author of this research.

The cleaned and selected data consisted of the following ADR outcomes: 85.8% hospitalisations, 9.2% deaths and 5.1% events where the patient was hospitalised and died, which corresponded to 119,331, 12,734 and 7,082 cases respectively (Figure 17). The obtained data was balanced so that each category was represented by 11,310 randomly selected samples, approximately 10% more than the number of cases suggested as necessary for ANN generation by Yen et al. (2011). This approach improved the prediction accuracy of the models by approximately 20%. However, it should be noted that the data balancing had its disadvantages: it disregarded a significant amount of potentially useful hospitalisation data and some death data, and it duplicated part of the “hospitalised + died” cases to increase the event count from 7,082 to 11,310, thus increasing the overfitting on the training set. The study conducted by Chen (2018) used only 2 labels, hospitalisation and death, while the Yen et al. (2011) study used 2 other labels: serious ADRs (combined death and hospitalisation) and non-serious ADRs (all other ADR outcomes).

The route of drug administration (Figure 19) was one of the non-personal attributes used to predict the ADR outcome. The most abundant route of drug administration was the oral route, with approximately 22,000 cases, followed by intravenous (approximately 7,600 events) and subcutaneous (approximately 2,700 events).

It was expected that medicinal drug products administered orally would have the most predictive power, as their count was 3 times higher than that of drugs administered intravenously and 7 times higher than that of drugs administered subcutaneously. Nevertheless, it was the intravenously administered drugs that had the most predictive power, with oral drugs next (Figure 20). On reflection, the author considers this an understandable result, as intravenous drugs are administered directly into the bloodstream, bypassing all the protective barriers of the organism, and might therefore be more prone to causing ADRs than orally administered alternatives.

Among the other features with high predictive power were the personal attributes of the patient (age, weight and sex), the drug sequence (which describes the order of the primary suspect drugs causing the ADR, but also signifies the number of medicinal drug products ingested by the patient at the time of the event), a drug with the active ingredient lenalidomide, as well as descriptions of the ADR event: “Cardiac Arrest” and “Disease Progression” (Figure 20). The independent variable with the highest predictive power was the ADR event description (“pt”) “Death”.

The findings of the literature review acknowledged that the sex of the patient contributes to the occurrence of ADRs and that females are more likely to experience ADRs after ingesting a drug than males, due to a more complex hormonal system that can interfere with drug pharmacodynamics. The data used by Chen, as well as the initial assessment of the male-to-female ratio in this project, confirmed this claim, as there were more females than males in the datasets (Appendix C). However, after the data cleaning and data selection, the ratio of females to males inverted (Figure 18). It is important to note that Chen's research used 4 million events while this research uses a fraction of that data (11,000 events); thus some differences in variable distributions are to be expected.

The proof-of-concept neural network study that used 3,000 events did not provide descriptive statistics on the data used. Therefore, its results cannot be compared to the statistical analysis of the FDA data used in this research.

5.2 Evaluation of Model Optimization Practices

Removal of collinearity, standardisation of the features and calibration of the model parameters are the main factors that affect model performance and must be attended to in order to obtain optimal classification results. The multicollinearity matrix (Figure 21) revealed a 0.55 correlation between the drug name and the product active ingredient/substance. This is due to some of the product substances also being product names; for example, paracetamol is both a drug active substance and a commercial trading name.

The multicollinearity was eliminated by removing the “drug name” variable from further analysis. The “drug name” feature comprised 3,700 unique values, while the product active ingredient feature comprised 1,700 unique values. The feature with the lower unique value count was selected for machine learning, as it made the modelling process more robust. Moreover, it is the active substance that causes the patient to experience side effects, not the trading name of the product, and, as previously discussed, many different drug trading names can correspond to the same drug substance. The study conducted by Chen used the drug name for the machine learning tasks; however, it is believed the decision to keep the product active ingredient was more appropriate and aided this project in the feature minimisation process.

The features in the final data frame were measured on different scales; for example, age was measured in years (0-120), sex was binary (0-1) and weight ranged from 0-200 kg. With use of the standard scaler, the features were transformed to the same scale. This allowed the machine learning model to focus on feature importance without being influenced by differences in feature scale.

It took a high number of training rounds to optimise the neural network model; the training time was reduced from 35 minutes to 6 minutes. The main 9 rounds are summarised in Table 5. The optimisation activities were based on the TensorFlow/Keras documentation as well as the author's experience in optimising ANNs. The Grid Search package, which forms part of the Scikit-Learn library, was also trialled; it analyses the data and finds optimal parameters for the ANN. The use of the Grid Search package was quickly abandoned, as it required more power than the hardware could provide. The author tried multiple times to find optimal parameters for the batch size, epoch count, weight constraints and dropout rate, leaving the process to run for hours at a time, but no results were obtained. Therefore, the author reverted to the trial-and-error methodology used at the start.

All created models used a densely connected architecture, meaning the ANN was fully connected. The first model used the categorical cross-entropy loss function, one hidden layer with 2,145 hidden nodes (a number stemming from the rule of thumb whereby the number of nodes in the hidden layer equals the number of input nodes plus the number of output nodes, divided by two) and the RMSprop optimiser, recommended by the Keras documentation as a good choice for multi-label classification (Keras, n.d.). The performance results of the first model were very disappointing: the model returned 33%-55% classification accuracy (depending on the random selection of data).

This was followed by models 2a and 2b, where the author set the main parameters according to his knowledge. The loss function was changed to sparse categorical cross-entropy and the Adam optimiser was used. The Adam optimiser is a variation of the stochastic gradient descent algorithm that adapts the learning rate based on certain criteria, instead of maintaining one learning rate for all weight updates.

The difference between models 2a and 2b was the number of epochs. Once the first epoch of model 2a was completed and achieved decent results, model 2b was initiated for 10 epochs. Model 2b achieved outstanding accuracy on the training set, at 95%, and 74% on the validation/test set. The big difference (21%) in accuracy scores suggested that the model was unable to generalise and was overfitted to the data, making it not useful in a real-life scenario.

The subsequent optimisation processes (Table 5) were aimed at reducing the overfitting to under 10% and improving the classification accuracy. The results of the model modifications are tabulated in Table 6.

The impact of reducing and increasing the batch size on model performance was verified; however, the modifications did not result in any significant improvement to the model (model 3). Therefore, the batch size was left at the default (32).

Models 4 and 4b attempted to reduce the difference in prediction accuracy between the training set and the validation set by introducing L2 regularisation (which forces the feature weights to be small rather than dispersed across the scale) and weight constraints (which capped the maximum weight at 3). The reduction in model overfitting was minor (to 18%) and also reduced the overall model accuracy.

Regularisation is based on the idea that smaller weights lead to a simpler model and thus reduce overfitting. Therefore, model 5 limited the maximum weight by reducing the kernel constraint from 3 to 1. The model overfitting reduced significantly, to 11%; however, the accuracy of the model dropped to 85% on the training set and 71% on the validation set.

Model 6 added an extra hidden layer. The findings of the literature review suggested that the accuracy measures of a model increase exponentially with the number of hidden layers. This was not the case here, as the performance measures dropped; even though the training accuracy increased by 1%, the validation accuracy remained unchanged from model 5 (Appendix F).

Interestingly, model 7 performed better with 2 hidden layers. This could also be due to the fact that many other model optimisation parameters were changed and extra parameters were added. The main change was a move away from the rule of thumb regarding the number of nodes in the hidden layer: the number of nodes was reduced to 300 in the 1st hidden layer and 10 in the 2nd hidden layer. The number of epochs was increased to 22, and a dropout of 0.25 was added after the input layer and the 1st hidden layer. The biggest difference was seen when the type of regularisation was changed from L2 to L1. L1 regularisation reduces the coefficients of the non-important features and even removes some features from the equation if their predictive power is low. It is believed the L1 regularisation worked well because it acted as a feature selection mechanism on a dataset that had over 4,000 features, not all of which were important. Model 7 reached the author's objective by reducing the overfitting to under 10% while maintaining satisfactory accuracy on both the training and validation sets (Table 6). A sketch of an architecture matching this description follows.
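A minimal Keras sketch consistent with the model 7 description above; the hidden-layer activation and the L1 penalty strength are assumptions, and the actual specification is in Appendix E:

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    model = keras.Sequential([
        keras.Input(shape=(4329,)),            # one input node per feature
        layers.Dropout(0.25),                  # dropout after the input layer
        layers.Dense(300, activation='relu',   # 1st hidden layer, L1-regularised
                     kernel_regularizer=regularizers.l1(0.01)),
        layers.Dropout(0.25),                  # dropout after the 1st hidden layer
        layers.Dense(10, activation='relu'),   # 2nd hidden layer
        layers.Dense(3, activation='softmax')  # 3 ADR outcome classes
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(X_train, Y_train, epochs=22, batch_size=32,
              validation_data=(X_test, Y_test))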

5.3 Multiclass Prediction of ADR Outcomes

The final model (model 7) was designed to predict 3 classes (hospitalisation, death, or hospitalisation and death). The model achieved 82% prediction accuracy on the training set and 74% prediction accuracy on the test set, a difference in accuracy scores of 8% (Figure 22). The models designed by Chen were binary classification models and obtained prediction accuracies of 69%-76% on the test set (Table 1). To the author's surprise, some of Chen's models outperformed the accuracy of the multiclass neural network model designed as part of this research by 2%. However, the advantage of the designed model 7 was its ability to distinguish between 3 classes of events, which is a more realistic representation of the actual events than a conversion of the ADR outcomes into 2 classes.

The other performance metrics of the designed model were 74%, 73% and 73% for precision, recall and F1-score respectively. Precision is the ratio of correctly classified positive events to the total predicted positive events: 75% of predicted hospitalisation cases, 74% of hospitalisation + death cases and 72% of death cases were classified correctly out of all predictions per class outcome; in other words, 74% of the predicted ADR outcomes were actually true. Mathematically, the precision was measured as follows:

Precision = True Positives / (True Positives + False Positives)

The recall refers to the sensitivity of the model and measures how many of the ADRs that actually occurred per class were correctly labelled. It is important to note that the recall value for the “Hospitalisation + Death” outcome label was more than 10% lower than for the other two labels. This means that the model had minor issues with correctly assigning the “Hospitalisation + Death” outcome to the actual “Hospitalisation + Death” outcome labels present in the dataset. It is believed the reduction in the recall score for the “Hospitalisation + Death” outcome is the cost of the previously conducted oversampling of this label in favour of improving the accuracy score. The recall value for model 7 was calculated as follows:

Recall = True Positives / (True Positives + False Negatives)

The compiled F1-score for the multiclass model was 72%; it takes into account the false positives and false negatives, as it is the harmonic mean of the two preceding measures. It is an accuracy measure best suited to scenarios with an uneven class distribution. The studies identified in the literature review used this measure; therefore it was calculated for the performance of model 7 too, according to the formula:

F1-Score = 2*(Recall * Precision) / (Recall + Precision)

On average the precision, recall and F1-scores of model-7 were similar to the performance measures obtained by Chen (2018).

The proof-of-concept ANN study that used 3,000 cases claimed to achieve a prediction accuracy of 99.87% for serious side effects and a sensitivity of 99.11%. The study used a scale on which an ADR was considered serious if its value was 1.5-2.5; scores of less than 1.5 were considered non-serious ADRs. The paper described only vaguely how the ADRs were placed on the scale; however, according to the FDA, hospitalisation is associated with a serious side effect, so the author assumes Yen et al. (2011) followed this concept.

As expected, the classification results of the conducted study were lower than the classification results presented by Yen et al. (2011). It is important to note that this was also a binary classification study, distinguishing between serious ADRs (a mix of hospitalisation, death and hospitalisation + death events) and non-serious ADRs (all other ADR outcome categories). The study did not provide precision or F1-score values; therefore these measures could not be compared to the results of this study.

Overall, the author believes the designed ANN model for multi-label classification performs well on the FDA data. This can be seen in the ROC curve, where the area under the curve (AUC) is 0.89, meaning the model ranks a randomly chosen positive example above a randomly chosen negative one approximately 89% of the time (Figure 24). Looking more closely, there is some variance in the model's ability to distinguish between the 3 classes. Model 7 is best at explaining the oversampled label (“hospitalisation + death”), with an AUC score of 0.91. The AUC score for the “death” label is 0.89, while the lowest AUC score was achieved with the “hospitalisation” label (0.87).

5.4 Binary Classification of ADR Outcomes

The multiclass model performance was good; however, it did not meet the author's expectations. After a thorough analysis, it was concluded that the classification accuracy scores were lower than expected for two potential reasons: insufficient optimisation, or the fact that the model was predicting three labels instead of two (like Chen's and Yen's models). To verify this hypothesis, the author went one step further and created a binary-output ANN to predict a binary outcome (hospitalisation vs death). The binary model was a clone of model 7 apart from the output activation function, which was changed from “Softmax” (used for multiclass output only) to “Sigmoid” (used for binary class output), as sketched below. No model fine-tuning was done to improve the performance of the binary model.
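A sketch of the corresponding change, assuming the Sequential model 7 object from the sketch above is reused:

    # Binary variant: a single sigmoid output node replaces the 3-node softmax.
    model.pop()
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])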

The prediction accuracy of the binary model segregating the data into 2 independent classes (death vs hospitalisation) was 89% on the training set and 83% on the validation set (Figure 25). This is 9% higher than the prediction accuracy of the multi-label model compiled for the purpose of this research project and 8% higher than the best model (logistic regression) designed by Chen (2018). Most of the other performance metrics of the binary ANN model (Figure 26) were also better than those of Chen's logistic regression model (Table 1): precision was 3-4% higher; recall was 10% higher for the death outcome but 9% lower for the hospitalisation outcome; and the F1-score was the same for the hospitalisation outcome, at 83%, but 6% higher for the death outcome. It is unknown why there was such a big discrepancy between the recall for death and for hospitalisation in Chen's logistic regression model, but it seems his model was better than the author's binary ANN at predicting positive (hospitalisation) examples given a hospitalisation result.

The performance of the binary ANN model compiled by the author was better than that of Chen's models, but still worse than the results of the binary ANN model published by Yen et al. (2011), with a difference in prediction accuracy of 16%. It is necessary to bear in mind that Yen's ANN was based on only 3,000 events, while this binary model used over 11,000 events. Yen's model also generalised the results by classifying both hospitalisation and death as a serious event, while the binary model designed by the author classified these labels as separate classes, in accordance with the FDA instructions.

6. CHAPTER SIX: CONCLUSIONS & RECOMMENDATIONS

6.1 Project Objectives

The research objectives of this project were met, as fully functional ANN classification models were created. The study verified that the ANN is a suitable classification algorithm for large-scale data, provided the researchers have access to hardware with high computing power or to a fully functional Hadoop cluster. The sub-objective of the project was also met, as the performance of the designed models was compared to the predictive models designed by Chen (2018) and Yen et al. (2011). The author did not anticipate at the start how difficult it would be to compare binary classification performance to multiclass performance, as turned out to be the case.

Even though the project is considered a success, the author's hypothesis (pt. 1.7) did not fully stand. The hypothesis was that the multiclass ANN model would yield higher accuracy than the standard machine learning model (logistic regression) created by Chen (2018), but lower accuracy than the small-scale ANN compiled by Yen et al. (2011). To the author's dissatisfaction, the designed multiclass ANN model achieved lower performance than the models designed by Chen and, as expected, lower performance than Yen's small-scale ANN model. Knowing from experience how powerful ANNs can be, the author further hypothesised that the obtained result might be due not to the ANN architecture itself but to the ANN predicting a multi-label output.

A clone of the multiclass model was created and adapted for binary output classification on the same dataset. The new binary ANN model outperformed Chen's logistic regression model, indicating that ANNs are well suited to classification problems on large and complex datasets, and that the reduced accuracy previously observed in the multiclass model was due to the nature of multi-label classification and not to the ANN algorithm itself.

Furthermore, it is believed that if time had not been a constraint, the author could have further improved the accuracy of the models by at least 5%, thus making both models superior to the models proposed by Chen (2018). The project was a success, as it correctly identified patients who were hospitalised and then died. This was an element that had not been achieved by any previous study, and it is believed this result is of major importance to healthcare professionals, as it identifies the patients who are at most risk of a lethal event after they have been admitted to hospital due to an ADR. The project became more challenging than expected, and if the author were to do it again, even more time would be allocated to the programming and analytical parts of the study.

6.2 Challenges and Recommendations

The assembly and formatting of the FDA dataset turned out to be challenging. On top of that, the author had to deal with a significant amount of missing and mislabelled data, which took 80% of the time related to coding and methodology. The distributions of the independent variables resembled the data distributions presented in previous research studies. It is the author's opinion that, with more time, the data preparation could have been done better, which would lead to even better model performance. It was evident at the final stage of analysis that some values of the “pt” feature could have been omitted from the analysis. The “pt” feature relates to the medical terminology for the observed side effect, and some of the fields in the “pt” column were encoded by medical practitioners as “Death”, which is also equivalent to the dependent variable/ADR outcome. This was discovered in the feature importance visualisation (Figure 20) but was kept in for the predictive analytics, as it aided the model in the correct classification of ADR events.

The visualisations were completed using the Matplotlib and Seaborn libraries. The Altair visualisation package was not used as initially proposed: Altair renders each data point individually, which results in large files being generated that crash Jupyter Notebook. With 4,000+ columns and a minimum of 35,000 rows in the final data frame, it was impossible to use this technology. The limitations of Altair should be kept in mind for future research.

In order to improve model performance, it is the author's recommendation that the independent variables be normalised. The lack of the MedDRA encoding dictionary forced the author to use a OneHot-like encoding technique, which resulted in a very large feature count (4,000+). It is believed that normalisation of the features to the scale 0 to 1 or -1 to 1 would drastically improve the results of both the multi-label and binary ANN models.

Also, if the author were to complete the project again, a K-fold cross-validation technique would be used instead of the standard validation with a 75:25 data split into training and test sets. K-fold cross-validation is a much more power-intensive and time-consuming methodology; however, it has been shown in previous research to significantly reduce the overfitting that the models created as part of this project suffered from. Alternatively, an “early stopping” technique could be tested during the model training phase to see whether it improves the accuracy of the model classifications (a sketch of both techniques follows). If the results were positive, it would mean the models created by the author were over-trained (although this is unlikely, as the models ran for only 22 epochs).
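A sketch of how both recommendations could be combined, assuming Scikit-Learn's StratifiedKFold, the Keras EarlyStopping callback, NumPy arrays X and y, and a hypothetical build_model factory that recreates the model 7 architecture:

    from sklearn.model_selection import StratifiedKFold
    from tensorflow.keras.callbacks import EarlyStopping

    # Stop training when validation loss stops improving for 3 epochs.
    early_stop = EarlyStopping(monitor='val_loss', patience=3,
                               restore_best_weights=True)

    scores = []
    for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=42).split(X, y):
        model = build_model()  # hypothetical: rebuilds the model 7 architecture
        hist = model.fit(X[train_idx], y[train_idx], epochs=22, batch_size=32,
                         validation_data=(X[val_idx], y[val_idx]),
                         callbacks=[early_stop])
        scores.append(hist.history['val_accuracy'][-1])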

6.3 Further Research

It is clearly evident that more research is required in the area of machine learning and ADR prediction/classification, as only 2 studies have been identified that use machine learning on the FDA's ADR database.

If the author were to continue the research, it would be interesting to add new fields to the FDA data, such as the type of drug (steroidal, anti-inflammatory, antibiotic, small-molecule drug or other), and verify which drugs cause the most deaths in the affected patient population. Moreover, further research could look at the specific drug-drug interactions that lead to the occurrence of ADRs, as this study took into account only the number of medicinal products ingested by a patient at the time of the ADR event and not the type of ingested drug.

If the research were progressed to PhD level, the project scope could be increased and the chemical structures of the ADR-causing drugs could be included in the analysis. This would give considerable insight into which molecular structures give rise to ADRs in specific patient sub-populations. The results of such research could be used to screen new chemical compounds entering clinical trials for ADR-causing molecular structures, which would reduce the number of chemical substances entering clinical trials as well as identify potential ADR-causing drugs that have not yet been tested on human subjects.

7. References

Acharya, U.R. et al. (2003) 'Classification of Heart Rate Data Using Artificial Neural Network and Fuzzy Equivalence Relation'. Pattern Recognition, p. 8.

Agrawal, S. (2017) ‘Why Hospitals Need Better Data Science’. Harvard Business Review, 19 October. Available at: https://hbr.org/2017/10/why-hospitals-need-better-data-science (Accessed: 3 July 2019).

Ahuja, Y. and Yadav, S.K. (2012) ‘Multiclass Classification and Support Vector Machine’. p. 7.

Akinsola, J.E.T. (2017) ‘Supervised Machine Learning Algorithms: Classification and Comparison’. International Journal of Computer Trends and Technology (IJCTT), 48, pp. 128–138. DOI: 10.14445/22312803/IJCTT-V48P126.

Alomar, M.J. (2014) ‘Factors Affecting the Development of Adverse Drug Reactions (Review Article)’. Saudi Pharmaceutical Journal, 22(2), pp. 83–94. DOI: 10.1016/j.jsps.2013.02.003.

Alwosheel, A., van Cranenburgh, S. and Chorus, C.G. (2018) ‘Is Your Dataset Big Enough? Sample Size Requirements When Using Artificial Neural Networks for Discrete Choice Analysis’. Journal of Choice Modelling, 28, pp. 167–182. DOI: 10.1016/j.jocm.2018.07.002.

Auria, L. and Moro, R.A. (2008) ‘Support Vector Machines (SVM) as a Technique for Solvency Analysis’. SSRN Electronic Journal. DOI: 10.2139/ssrn.1424949.

Basharat, I. et al. (2016) ‘A Framework for Classifying Unstructured Data of Cardiac Patients: A Supervised Learning Approach’. International Journal of Advanced Computer Science and Applications, 7(2). DOI: 10.14569/IJACSA.2016.070218.

Berlin, J.A., Glasser, S.C. and Ellenberg, S.S. (2008) ‘Adverse Event Detection in Drug Development: Recommendations and Obligations Beyond Phase 3’. American Journal of Public Health, 98(8), pp. 1366–1371. DOI: 10.2105/AJPH.2007.124537.

Brown, N. et al. (2018) ‘Chapter Five - Big Data in Drug Discovery’. In Witty, D.R. and Cox, B. (eds.) Progress in Medicinal Chemistry. Elsevier, pp. 277–356. DOI: 10.1016/bs.pmch.2017.12.003.

Bryman, A. (2003) Quantity and Quality in Social Research. Routledge. DOI: 10.4324/9780203410028.

Center for Drug Evaluation and Research (2018) Questions and Answers on FDA’s Adverse Event Reporting System (FAERS). FDA. Available at: http://www.fda.gov/drugs/surveillance/fda-adverse-event-reporting-system-faers (Accessed: 24 July 2019).

Cervantes, J. et al. (2015) ‘Data Selection Based on Decision Tree for SVM Classification on Large Data Sets’. Applied Soft Computing, 37, pp. 787–798. DOI: 10.1016/j.asoc.2015.08.048.

Chapman, P. et al. (2000) ‘CRISP-DM 1.0: Step-by-Step Data Mining Guide’. In SPSS.

Chawla, N.V. et al. (2002) ‘SMOTE: Synthetic Minority Over-Sampling Technique’. Journal of Artificial Intelligence Research, 16, pp. 321–357. DOI: 10.1613/jair.953.

Chen, A.W. (2018) ‘Predicting Adverse Drug Reaction Outcomes with Machine Learning’. International Journal Of Community Medicine And Public Health, 5(3), p. 901. DOI: 10.18203/2394-6040.ijcmph20180744.

Chen, T. and Guestrin, C. (2016) ‘XGBoost: A Scalable Tree Boosting System’. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, pp. 785–794. DOI: 10.1145/2939672.2939785.

Hsu, C.-W. and Lin, C.-J. (2002) 'A Comparison of Methods for Multiclass Support Vector Machines'. IEEE Transactions on Neural Networks, 13(2), pp. 415–425. DOI: 10.1109/72.991427.

Classen, D.C. et al. (1997) ‘Adverse Drug Events in Hospitalized Patients. Excess Length of Stay, Extra Costs, and Attributable Mortality’. JAMA, 277(4), pp. 301–306.

Cortes, C. and Vapnik, V. (1995) ‘Support-Vector Networks’. Machine Learning, 20(3), pp. 273–297. DOI: 10.1007/BF00994018.

Couronné, R., Probst, P. and Boulesteix, A.-L. (2018) ‘Random Forest versus Logistic Regression: A Large-Scale Benchmark Experiment’. BMC Bioinformatics, 19. DOI: 10.1186/s12859-018-2264-5.

Crisci, C., Ghattas, B. and Perera, G. (2012) ‘A Review of Supervised Machine Learning Algorithms and Their Applications to Ecological Data’. Ecological Modelling, 240, pp. 113–122. DOI: 10.1016/j.ecolmodel.2012.03.001.

Dormann, H. et al. (2004) ‘Readmissions and Adverse Drug Reactions in Internal Medicine: The Economic Impact’. Journal of Internal Medicine, 255(6), pp. 653–663. DOI: 10.1111/j.1365-2796.2004.01326.x.

Drew, C., Hardman, M. and Hosp, J. (2008) Designing and Conducting Research in Education. Thousand Oaks, California: SAGE Publications, Inc. DOI: 10.4135/9781483385648.

Dumouchel, W. (1999) ‘Bayesian Data Mining in Large Frequency Tables, with an Application to the FDA Spontaneous Reporting System’. The American Statistician, 53(3), pp. 177–190. DOI: 10.1080/00031305.1999.10474456.

EMA (2018) Clinical Trials in Human Medicines. European Medicines Agency. Available at: https://www.ema.europa.eu/en/human-regulatory/research-development/clinical-trials-human-medicines (Accessed: 24 July 2019).

EUPATI. (2015) Making a Medicine. Step 7: Phase II - Proof of Concept. EUPATI. Available at: https://www.eupati.eu/clinical-development-and-trials/making-medicine-step-7-phase-ii-proof-concept/ (Accessed: 24 July 2019).

FDA. (2019) OpenFDA. Available at: https://open.fda.gov/data/faers/ (Accessed: 11 July 2019).

Flick, U. (2015) Introducing Research Methodology: A Beginner’s Guide to Doing a Research Project. second edition. Thousand Oaks, Calif: SAGE.

Formica, D. et al. (2018) ‘The Economic Burden of Preventable Adverse Drug Reactions: A Systematic Review of Observational Studies’. Expert Opinion on Drug Safety, 17(7), pp. 681–695. DOI: 10.1080/14740338.2018.1491547.

Fröhlich, H. et al. (2018) ‘From Hype to Reality: Data Science Enabling Personalized Medicine’. BMC Medicine, 16(1), p. 150. DOI: 10.1186/s12916-018-1122-7.

Geurts, P., Olaru, C. and Wehenkel, L. (2001) ‘Improving the Bias/Variance Tradeoff of Decision Trees: Towards Soft Tree Induction’. International Journal of Engineering Intelligent Systems for Electrical Engineering and Communications, 9, pp. 195–204.

Ghosal, I. and Hooker, G. (2018) ‘Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and Its Variance Estimate’. ArXiv:1803.08000 [Cs, Stat]. Available at: http://arxiv.org/abs/1803.08000 (Accessed: 24 July 2019).

Huang, G.-B. (2003) 'Learning Capability and Storage Capacity of Two-Hidden-Layer Feedforward Networks'. IEEE Transactions on Neural Networks, 14(2), pp. 274–281. DOI: 10.1109/TNN.2003.809401.

Hauben, M. and Aronson, J.K. (2009) ‘Defining “Signal” and Its Subtypes in Pharmacovigilance Based on a Systematic Review of Previous Definitions’. Drug Safety, 32(2), pp. 99–110. DOI: 10.2165/00002018-200932020-00003.

Hazell, L. and Shakir, S.A.W. (2006) ‘Under-Reporting of Adverse Drug Reactions : A Systematic Review’. Drug Safety, 29(5), pp. 385–396. DOI: 10.2165/00002018-200629050-00003.

Heijden, P.G.M. van der et al. (2002) 'On the Assessment of Adverse Drug Reactions from Spontaneous Reporting Systems: The Influence of Under-Reporting on Odds Ratios'. Statistics in Medicine, 21, pp. 2027–2044. DOI: 10.1002/sim.1157.

Heit, E. and Rotello, C.M. (2010) ‘Relations between Inductive Reasoning and Deductive Reasoning.’ Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(3), pp. 805–812. DOI: 10.1037/a0018784.

Hoffman, K.B. et al. (2014) ‘The Weber Effect and the United States Food and Drug Administration’s Adverse Event Reporting System (FAERS): Analysis of Sixty-Two Drugs Approved from 2006 to 2010’. Drug Safety, 37(4), pp. 283–294. DOI: 10.1007/s40264-014-0150-2.

McDonald, J.H. (2014) Handbook of Biological Statistics. 3rd edition. Baltimore, Maryland: Sparky House Publishing. Available at: http://www.biostathandbook.com/multiplelogistic.html (Accessed: 10 June 2019).

Kaushal, R. et al. (2007) ‘Costs of Adverse Events in Intensive Care Units’. Crit Care Med, 35(11), p. 5.

Keras (no date) Guide to the Sequential Model. Keras Documentation. Available at: https://keras.io/getting-started/sequential-model-guide/ (Accessed: 29 July 2019).

Koehrsen, W. (2017) Random Forest Simple Explanation. Will Koehrsen. Available at: https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d (Accessed: 13 June 2019).

Mertsalov, K. and McCreary, M. (2009) Document Classification with Support Vector Machines. Semantic Scholar. Available at: https://www.semanticscholar.org/paper/Document-Classification-with-Support-Vector-Mertsalov/10d70843a86918b19b941cc29d4dc8a9587d9245 (Accessed: 13 June 2019).

Dencelin, L. and Ramkumar, T. (2016) 'Analysis of Multilayer Perceptron Machine Learning Approach in Classifying Protein Secondary Structures'. Biomed Res, p. 8.

Lewis, C.E. et al. (2000) ‘Weight Gain Continues in the 1990s: 10-Year Trends in Weight and Overweight from the CARDIA Study. Coronary Artery Risk Development in Young Adults’. American Journal of Epidemiology, 151(12), pp. 1172–1181. DOI: 10.1093/oxfordjournals.aje.a010167.

Libbrecht, M.W. and Noble, W.S. (2015) ‘Machine Learning in Genetics and Genomics’. Nature Reviews. Genetics, 16(6), pp. 321–332. DOI: 10.1038/nrg3920.

Maglogiannis, I.G. (2007) Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in EHealth, HCI, Information Retrieval and Pervasive Technologies. IOS Press.

Myers, M.D. (2013) Qualitative Research in Business and Management. SAGE.

Neto, E.C. (2018) ‘Detecting Learning vs Memorization in Deep Neural Networks Using Shared Structure Validation Sets’. ArXiv:1802.07714 [Cs, Stat]. Available at: http://arxiv.org/abs/1802.07714 (Accessed: 27 May 2019).

Palacios, H. et al. (2017) ‘A Comparative between CRISP-DM and SEMMA through the Construction of a MODIS Repository for Studies of Land Use and Cover Change’. Advances in Science, Technology and Engineering Systems Journal, 2, pp. 598–604. DOI: 10.25046/aj020376.

Pariente, A. et al. (2007) ‘Impact of Safety Alerts on Measures of Disproportionality in Spontaneous Reporting Databases: The Notoriety Bias’. Drug Safety, 30(10), pp. 891–898. DOI: 10.2165/00002018-200730100-00007.

Patel, B.N. (2012) ‘Efficient Classification of Data Using Decision Tree’. Bonfring International Journal of Data Mining, 2(1), pp. 06–12. DOI: 10.9756/BIJDM.1098.

Peng, C.-Y.J., Lee, K.L. and Ingersoll, G.M. (2002) ‘An Introduction to Logistic Regression Analysis and Reporting’. The Journal of Educational Research, 96(1), pp. 3–14. DOI: 10.1080/00220670209598786.

Poluzzi, E. et al. (2012) ‘Data Mining Techniques in Pharmacovigilance: Analysis of the Publicly Accessible FDA Adverse Event Reporting System (AERS)’. Data Mining Applications in Engineering and Medicine. DOI: 10.5772/50095.

Rademaker, M. (2001) ‘Do Women Have More Adverse Drug Reactions?’ American Journal of Clinical Dermatology, 2(6), pp. 349–351. DOI: 10.2165/00128071-200102060-00001.

Ranganathan, P., Pramesh, C.S. and Aggarwal, R. (2017) ‘Common Pitfalls in Statistical Analysis: Logistic Regression’. Perspectives in Clinical Research, 8(3), pp. 148–151. DOI: 10.4103/picr.PICR_87_17.

Rokach, L. (2005) ‘Ensemble Methods for Classifiers’. In Maimon, O. and Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook. Boston, MA: Springer US, pp. 957–980. DOI: 10.1007/0-387-25465-X_45.

Sajda, P. (2006) ‘Machine Learning for Detection and Diagnosis of a Disease’. Annual Review of Biomedical Engineering, 8(1), pp. 537–565. DOI: 10.1146/annurev.bioeng.8.061505.095802.

Santhanam, R. et al. (2017) ‘Experimenting XGBoost Algorithm for Prediction and Classification of Different Datasets’.

Saunders, M.N.K., Lewis, P. and Thornhill, A. (2009) Research Methods for Business Students. 5th ed. New York: Prentice Hall.

Sebban, M. et al. (2000) 'Impact of Learning Set Quality and Size on Decision Tree Performances'. p. 21.

Siddiqui, M. et al. (2017) ‘Multi-Class Disease Classification in Brain MRIs Using a Computer-Aided Diagnostic System’. Symmetry, 9(3), p. 37. DOI: 10.3390/sym9030037.

Song, Y. and Lu, Y. (2015) ‘Decision Tree Methods: Applications for Classification and Prediction’. Shanghai Archives of Psychiatry, 27(2), p. 130. DOI: 10.11919/j.issn.1002-0829.215044.

Suh, D.-C. et al. (2000) ‘Clinical and Economic Impact of Adverse Drug Reactions in Hospitalized Patients’. Annals of Pharmacotherapy, 34(12), pp. 1373–1379. DOI: 10.1345/aph.10094.

Gupta, S. (2018) Machine Learning - Why Neural Networks Do Not Perform Well on Structured Data? Data Science Stack Exchange. Available at: https://datascience.stackexchange.com/questions/38392/why-neural-networks-do-not-perform-well-on-structured-data (Accessed: 31 July 2019).

Suits, D.B. (1957) ‘Use of Dummy Variables in Regression Equations’. Journal of the American Statistical Association, 52(280), pp. 548–551. DOI: 10.2307/2281705.

Sultana, J., Cutroneo, P. and Trifirò, G. (2013) ‘Clinical and Economic Burden of Adverse Drug Reactions’. Journal of Pharmacology & Pharmacotherapeutics, 4(Suppl1), pp. S73–S77. DOI: 10.4103/0976-500X.120957.

Suvarna, V. (2010) ‘Phase IV of Drug Development’. Perspectives in Clinical Research, 1(2), pp. 57–60.

Teshnizi, S.H. and Ayatollahi, S.M.T. (2015) ‘A Comparison of Logistic Regression Model and Artificial Neural Networks in Predicting of Student’s Academic Failure’. Acta Informatica Medica, 23(5), pp. 296–300. DOI: 10.5455/aim.2015.23.296-300.

Vásquez, F., Vita, G. and Müller, D.B. (2018) ‘Food Security for an Aging and Heavier Population’. Sustainability, 10(10), p. 3683. DOI: 10.3390/su10103683.

WHO (2019) Weight-for-Age. WHO. Available at: https://www.who.int/childgrowth/standards/chts_wfa_boys_p/en/ (Accessed: 24 July 2019).

Wirth, R. and Hipp, J. (2000) ‘Crisp-Dm: Towards a Standard Process Modell for Data Mining’. Available at: https://pdfs.semanticscholar.org/48b9/293cfd4297f855867ca278f7069abc6a9c24.pdf?_ga=2.96392220.123247452.1552240417-406863476.1543062810.

Wowczko, I. (2015) ‘A Case Study of Evaluating Job Readiness with Data Mining Tools and CRISP-DM Methodology’. International Journal for Infonomics, 8(3), pp. 1066–1070. DOI: 10.20533/iji.1742.4712.2015.0126.

Xu, Y., Pei, J. and Lai, L. (2017) ‘Deep Learning Based Regression and Multi-Class Models for Acute Oral Toxicity Prediction with Automatic Chemical Feature Extraction’. ArXiv:1704.04718 [Cs, q-Bio, Stat]. Available at: http://arxiv.org/abs/1704.04718 (Accessed: 11 January 2019).

Yang, Y. and Loog, M. (2018) ‘A Benchmark and Comparison of Active Learning for Logistic Regression’. Pattern Recognition, 83, pp. 401–415. DOI: 10.1016/j.patcog.2018.06.004.

Yen, P., P. Mital, D. and Srinivasan, S. (2011) ‘Prediction of the Serious Adverse Drug Reactions Using an Artificial Neural Network Model’. Int. J. of Medical Engineering and Informatics, 3, pp. 53–59. DOI: 10.1504/IJMEI.2011.039076.

Jin, Y. (2017) Tree Boosting With XGBoost — Why Does XGBoost Win ‘Every’ Machine Learning Competition? Medium. Available at: https://medium.com/syncedreview/tree-boosting-with-xgboost-why-does-xgboost-win-every-machine-learning-competition-ca8034c0b283 (Accessed: 14 June 2019).

Yu, Y.M. et al. (2015) ‘Patterns of Adverse Drug Reactions in Different Age Groups: Analysis of Spontaneous Reports by Community Pharmacists’. PLoS ONE, 10(7). DOI: 10.1371/journal.pone.0132916.

Zekić-Sušac, M., Pfeifer, S. and Šarlija, N. (2014) ‘A Comparison of Machine Learning Methods in a High-Dimensional Classification Problem’. Business Systems Research Journal, 5(3), pp. 82–96. DOI: 10.2478/bsrj-2014-0021.

Ziegler, G. (2019) Multiclass & Multilabel Classification with XGBoost. Medium. Available at: https://medium.com/@gabrielziegler3/multiclass-multilabel-classification-with-xgboost-66195e4d9f2d (Accessed: 14 June 2019).

Appendix A: Python Code

This appendix contains some of the data cleaning and preparation code from file 1 and file 2. The full code is available for viewing in the verification folder supplied to IT Carlow.

Load the data for each quarter of the years 2015 to Q1 2019 and join the tables for each quarter together

import pandas as pd
demo1 = pd.read_csv('DEMO18Q1_new.txt', sep="$", header=0)
drug1 = pd.read_csv('DRUG18Q1.txt', sep="$", header=0)
ther1 = pd.read_csv('THER18Q1.txt', sep="$", header=0)
outc1 = pd.read_csv('OUTC18Q1.txt', sep="$", header=0)
reac1 = pd.read_csv('REAC18Q1.txt', sep="$", header=0)
rpsr1 = pd.read_csv('RPSR18Q1.txt', sep="$", header=0)
indi1 = pd.read_csv('INDI18Q1.txt', sep="$", header=0)

Reconstruction of the table has to be done in a logical manner; caseid, primaryid, drugname and drug sequence will play a crucial role

master_table_b = pd.merge(drug1, demo1, on=['primaryid', 'caseid'], how='inner')
master_table_b2 = pd.merge(master_table_b, outc1, on=['primaryid', 'caseid'], how='inner')
master_table_b3 = pd.merge(master_table_b2, reac1, on=['primaryid', 'caseid'], how='inner')
master_table_b4 = pd.merge(master_table_b3, rpsr1, on=['primaryid', 'caseid'], how='inner')

ther1.isnull().sum()

## to join the remaining tables correctly, the column dsg_drug_seq from the therapy table and indi_drug_seq from the indication table must be renamed to drug_seq, as in the drug table

ther1 = ther1.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})

indi1 = indi1.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_b5 = pd.merge(master_table_b4, ther1, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_b6 = pd.merge(master_table_b5, indi1, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q1data = master_table_b6

q1data['quarter'] = 1
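(The load-and-merge steps above are repeated verbatim for every remaining quarter below. As a refactoring suggestion, and not the code that was actually run, they could be wrapped in a helper such as this hypothetical load_quarter():)

def load_quarter(tag, quarter_no):
    # tag is the FAERS file suffix, e.g. '18Q2'; quarter_no is the
    # sequential index assigned to that quarter in this project
    t = {name: pd.read_csv(name + tag + '.txt', sep='$', header=0)
         for name in ('DEMO', 'DRUG', 'THER', 'OUTC', 'REAC', 'RPSR', 'INDI')}
    merged = t['DRUG']
    for name in ('DEMO', 'OUTC', 'REAC', 'RPSR'):
        merged = pd.merge(merged, t[name], on=['primaryid', 'caseid'], how='inner')
    ther = t['THER'].rename(columns={'dsg_drug_seq': 'drug_seq'})
    indi = t['INDI'].rename(columns={'indi_drug_seq': 'drug_seq'})
    merged = pd.merge(merged, ther, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
    merged = pd.merge(merged, indi, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
    merged['quarter'] = quarter_no
    return merged

# e.g. q2data = load_quarter('18Q2', 2)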

Add Q2 of 2018

demo2 = pd.read_csv('DEMO18Q2.txt', sep="$", header=0)
drug2 = pd.read_csv('DRUG18Q2.txt', sep="$", header=0)
ther2 = pd.read_csv('THER18Q2.txt', sep="$", header=0)
outc2 = pd.read_csv('OUTC18Q2.txt', sep="$", header=0)
reac2 = pd.read_csv('REAC18Q2.txt', sep="$", header=0)
rpsr2 = pd.read_csv('RPSR18Q2.txt', sep="$", header=0)
indi2 = pd.read_csv('INDI18Q2.txt', sep="$", header=0)

master_table_c = pd.merge(drug2, demo2, on=['primaryid', 'caseid'], how='inner')
master_table_c2 = pd.merge(master_table_c, outc2, on=['primaryid', 'caseid'], how='inner')
master_table_c3 = pd.merge(master_table_c2, reac2, on=['primaryid', 'caseid'], how='inner')
master_table_c4 = pd.merge(master_table_c3, rpsr2, on=['primaryid', 'caseid'], how='inner')

ther2 = ther2.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi2 = indi2.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_c5 = pd.merge(master_table_c4, ther2, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_c6 = pd.merge(master_table_c5, indi2, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q2data = master_table_c6
q2data['quarter'] = 2

Add Q3 of 2018

demo3 = pd.read_csv('DEMO18Q3.txt', sep="$", header=0)
drug3 = pd.read_csv('DRUG18Q3.txt', sep="$", header=0)
ther3 = pd.read_csv('THER18Q3.txt', sep="$", header=0)
outc3 = pd.read_csv('OUTC18Q3.txt', sep="$", header=0)
reac3 = pd.read_csv('REAC18Q3.txt', sep="$", header=0)
rpsr3 = pd.read_csv('RPSR18Q3.txt', sep="$", header=0)
indi3 = pd.read_csv('INDI18Q3.txt', sep="$", header=0)

master_table_d = pd.merge(drug3, demo3, on=['primaryid', 'caseid'], how='inner')
master_table_d2 = pd.merge(master_table_d, outc3, on=['primaryid', 'caseid'], how='inner')
master_table_d3 = pd.merge(master_table_d2, reac3, on=['primaryid', 'caseid'], how='inner')
master_table_d4 = pd.merge(master_table_d3, rpsr3, on=['primaryid', 'caseid'], how='inner')

ther3 = ther3.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi3 = indi3.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_d5 = pd.merge(master_table_d4, ther3, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_d6 = pd.merge(master_table_d5, indi3, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q3data = master_table_d6
q3data['quarter'] = 3

Add Q4 of 2018

demo4 = pd.read_csv('DEMO18Q4.txt', sep="$", header=0)
drug4 = pd.read_csv('DRUG18Q4.txt', sep="$", header=0)
ther4 = pd.read_csv('THER18Q4.txt', sep="$", header=0)
outc4 = pd.read_csv('OUTC18Q4.txt', sep="$", header=0)
reac4 = pd.read_csv('REAC18Q4.txt', sep="$", header=0)
rpsr4 = pd.read_csv('RPSR18Q4.txt', sep="$", header=0)
indi4 = pd.read_csv('INDI18Q4.txt', sep="$", header=0)

master_table_e = pd.merge(drug4, demo4, on=['primaryid', 'caseid'], how='inner')
master_table_e2 = pd.merge(master_table_e, outc4, on=['primaryid', 'caseid'], how='inner')
master_table_e3 = pd.merge(master_table_e2, reac4, on=['primaryid', 'caseid'], how='inner')
master_table_e4 = pd.merge(master_table_e3, rpsr4, on=['primaryid', 'caseid'], how='inner')

ther4 = ther4.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi4 = indi4.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_e5 = pd.merge(master_table_e4, ther4, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_e6 = pd.merge(master_table_e5, indi4, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q4data = master_table_e6
q4data['quarter'] = 4

q4data.head()

Insert the first quarter of 2019 to create an unseen/unanalysed dataset

demo5 = pd.read_csv('DEMO19Q1.txt', sep="$", header=0)
drug5 = pd.read_csv('DRUG19Q1.txt', sep="$", header=0)
ther5 = pd.read_csv('THER19Q1.txt', sep="$", header=0)
outc5 = pd.read_csv('OUTC19Q1.txt', sep="$", header=0)
reac5 = pd.read_csv('REAC19Q1.txt', sep="$", header=0)
rpsr5 = pd.read_csv('RPSR19Q1.txt', sep="$", header=0)
indi5 = pd.read_csv('INDI19Q1.txt', sep="$", header=0)



master_table_f = pd.merge(drug5, demo5, on=['primaryid', 'caseid'], how='inner')
master_table_f2 = pd.merge(master_table_f, outc5, on=['primaryid', 'caseid'], how='inner')
master_table_f3 = pd.merge(master_table_f2, reac5, on=['primaryid', 'caseid'], how='inner')
master_table_f4 = pd.merge(master_table_f3, rpsr5, on=['primaryid', 'caseid'], how='inner')

ther5 = ther5.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi5 = indi5.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_f5 = pd.merge(master_table_f4, ther5, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_f6 = pd.merge(master_table_f5, indi5, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q5data = master_table_f6
q5data['quarter'] = 5

Insert 4 quarters of 2017

demo6 = pd.read_csv('DEMO17Q1.txt', sep="$", header=0)
drug6 = pd.read_csv('DRUG17Q1.txt', sep="$", header=0)
ther6 = pd.read_csv('THER17Q1.txt', sep="$", header=0)
outc6 = pd.read_csv('OUTC17Q1.txt', sep="$", header=0)
reac6 = pd.read_csv('REAC17Q1.txt', sep="$", header=0)
rpsr6 = pd.read_csv('RPSR17Q1.txt', sep="$", header=0)
indi6 = pd.read_csv('INDI17Q1.txt', sep="$", header=0)



master_table_g = pd.merge(drug6, demo6, on=['primaryid', 'caseid'], how='inner')
master_table_g2 = pd.merge(master_table_g, outc6, on=['primaryid', 'caseid'], how='inner')
master_table_g3 = pd.merge(master_table_g2, reac6, on=['primaryid', 'caseid'], how='inner')
master_table_g4 = pd.merge(master_table_g3, rpsr6, on=['primaryid', 'caseid'], how='inner')

ther6 = ther6.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi6 = indi6.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_g5 = pd.merge(master_table_g4, ther6, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_g6 = pd.merge(master_table_g5, indi6, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q6data = master_table_g6
q6data['quarter'] = 6

demo7 = pd.read_csv('DEMO17Q2.txt', sep="$", header=0)
drug7 = pd.read_csv('DRUG17Q2.txt', sep="$", header=0)
ther7 = pd.read_csv('THER17Q2.txt', sep="$", header=0)
outc7 = pd.read_csv('OUTC17Q2.txt', sep="$", header=0)
reac7 = pd.read_csv('REAC17Q2.txt', sep="$", header=0)
rpsr7 = pd.read_csv('RPSR17Q2.txt', sep="$", header=0)
indi7 = pd.read_csv('INDI17Q2.txt', sep="$", header=0)



master_table_h = pd.merge(drug7, demo7, on=['primaryid', 'caseid'], how='inner')
master_table_h2 = pd.merge(master_table_h, outc7, on=['primaryid', 'caseid'], how='inner')
master_table_h3 = pd.merge(master_table_h2, reac7, on=['primaryid', 'caseid'], how='inner')
master_table_h4 = pd.merge(master_table_h3, rpsr7, on=['primaryid', 'caseid'], how='inner')

ther7 = ther7.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi7 = indi7.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_h5 = pd.merge(master_table_h4, ther7, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_h6 = pd.merge(master_table_h5, indi7, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q7data = master_table_h6
q7data['quarter'] = 7

demo8 = pd.read_csv('DEMO17Q3.txt', sep="$", header=0)
drug8 = pd.read_csv('DRUG17Q3.txt', sep="$", header=0)
ther8 = pd.read_csv('THER17Q3.txt', sep="$", header=0)
outc8 = pd.read_csv('OUTC17Q3.txt', sep="$", header=0)
reac8 = pd.read_csv('REAC17Q3.txt', sep="$", header=0)
rpsr8 = pd.read_csv('RPSR17Q3.txt', sep="$", header=0)
indi8 = pd.read_csv('INDI17Q3.txt', sep="$", header=0)



master_table_i = pd.merge(drug8, demo8, on=['primaryid', 'caseid'], how='inner')
master_table_i2 = pd.merge(master_table_i, outc8, on=['primaryid', 'caseid'], how='inner')
master_table_i3 = pd.merge(master_table_i2, reac8, on=['primaryid', 'caseid'], how='inner')
master_table_i4 = pd.merge(master_table_i3, rpsr8, on=['primaryid', 'caseid'], how='inner')

ther8 = ther8.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi8 = indi8.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_i5 = pd.merge(master_table_i4, ther8, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_i6 = pd.merge(master_table_i5, indi8, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q8data = master_table_i6
q8data['quarter'] = 8

demo9 = pd.read_csv('DEMO17Q4.txt', sep="$", header=0)
drug9 = pd.read_csv('DRUG17Q4.txt', sep="$", header=0)
ther9 = pd.read_csv('THER17Q4.txt', sep="$", header=0)
outc9 = pd.read_csv('OUTC17Q4.txt', sep="$", header=0)
reac9 = pd.read_csv('REAC17Q4.txt', sep="$", header=0)
rpsr9 = pd.read_csv('RPSR17Q4.txt', sep="$", header=0)
indi9 = pd.read_csv('INDI17Q4.txt', sep="$", header=0)



master_table_j = pd.merge(drug9, demo9, on=['primaryid', 'caseid'], how='inner')
master_table_j2 = pd.merge(master_table_j, outc9, on=['primaryid', 'caseid'], how='inner')
master_table_j3 = pd.merge(master_table_j2, reac9, on=['primaryid', 'caseid'], how='inner')
master_table_j4 = pd.merge(master_table_j3, rpsr9, on=['primaryid', 'caseid'], how='inner')

ther9 = ther9.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi9 = indi9.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_j5 = pd.merge(master_table_j4, ther9, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_j6 = pd.merge(master_table_j5, indi9, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q9data = master_table_j6
q9data['quarter'] = 9

Insert 2016 data

demo6 = pd.read_csv('DEMO16Q1.txt', sep="$", header=0)
drug6 = pd.read_csv('DRUG16Q1.txt', sep="$", header=0)
ther6 = pd.read_csv('THER16Q1.txt', sep="$", header=0)
outc6 = pd.read_csv('OUTC16Q1.txt', sep="$", header=0)
reac6 = pd.read_csv('REAC16Q1.txt', sep="$", header=0)
rpsr6 = pd.read_csv('RPSR16Q1.txt', sep="$", header=0)
indi6 = pd.read_csv('INDI16Q1.txt', sep="$", header=0)



master_table_g = pd.merge(drug6, demo6, on=['primaryid', 'caseid'], how='inner')
master_table_g2 = pd.merge(master_table_g, outc6, on=['primaryid', 'caseid'], how='inner')
master_table_g3 = pd.merge(master_table_g2, reac6, on=['primaryid', 'caseid'], how='inner')
master_table_g4 = pd.merge(master_table_g3, rpsr6, on=['primaryid', 'caseid'], how='inner')

ther6 = ther6.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi6 = indi6.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_g5 = pd.merge(master_table_g4, ther6, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_g6 = pd.merge(master_table_g5, indi6, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q10data = master_table_g6
q10data['quarter'] = 10

demo7 = pd.read_csv('DEMO16Q2.txt', sep="$", header=0)
drug7 = pd.read_csv('DRUG16Q2.txt', sep="$", header=0)
ther7 = pd.read_csv('THER16Q2.txt', sep="$", header=0)
outc7 = pd.read_csv('OUTC16Q2.txt', sep="$", header=0)
reac7 = pd.read_csv('REAC16Q2.txt', sep="$", header=0)
rpsr7 = pd.read_csv('RPSR16Q2.txt', sep="$", header=0)
indi7 = pd.read_csv('INDI16Q2.txt', sep="$", header=0)



master_table_h = pd.merge(drug7, demo7, on=['primaryid', 'caseid'], how='inner')
master_table_h2 = pd.merge(master_table_h, outc7, on=['primaryid', 'caseid'], how='inner')
master_table_h3 = pd.merge(master_table_h2, reac7, on=['primaryid', 'caseid'], how='inner')
master_table_h4 = pd.merge(master_table_h3, rpsr7, on=['primaryid', 'caseid'], how='inner')

ther7 = ther7.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi7 = indi7.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_h5 = pd.merge(master_table_h4, ther7, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_h6 = pd.merge(master_table_h5, indi7, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q11data = master_table_h6
q11data['quarter'] = 11

demo8 = pd.read_csv('DEMO16Q3.txt', sep="$", header=0)
drug8 = pd.read_csv('DRUG16Q3.txt', sep="$", header=0)
ther8 = pd.read_csv('THER16Q3.txt', sep="$", header=0)
outc8 = pd.read_csv('OUTC16Q3.txt', sep="$", header=0)
reac8 = pd.read_csv('REAC16Q3.txt', sep="$", header=0)
rpsr8 = pd.read_csv('RPSR16Q3.txt', sep="$", header=0)
indi8 = pd.read_csv('INDI16Q3.txt', sep="$", header=0)



master_table_i = pd.merge(drug8, demo8, on=['primaryid', 'caseid'], how='inner')
master_table_i2 = pd.merge(master_table_i, outc8, on=['primaryid', 'caseid'], how='inner')
master_table_i3 = pd.merge(master_table_i2, reac8, on=['primaryid', 'caseid'], how='inner')
master_table_i4 = pd.merge(master_table_i3, rpsr8, on=['primaryid', 'caseid'], how='inner')

ther8 = ther8.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi8 = indi8.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_i5 = pd.merge(master_table_i4, ther8, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_i6 = pd.merge(master_table_i5, indi8, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q12data = master_table_i6
q12data['quarter'] = 12

demo9 = pd.read_csv('DEMO16Q4.txt', sep="$", header=0)
drug9 = pd.read_csv('DRUG16Q4.txt', sep="$", header=0)
ther9 = pd.read_csv('THER16Q4.txt', sep="$", header=0)
outc9 = pd.read_csv('OUTC16Q4.txt', sep="$", header=0)
reac9 = pd.read_csv('REAC16Q4.txt', sep="$", header=0)
rpsr9 = pd.read_csv('RPSR16Q4.txt', sep="$", header=0)
indi9 = pd.read_csv('INDI16Q4.txt', sep="$", header=0)



master_table_j = pd.merge(drug9, demo9, on=['primaryid', 'caseid'], how='inner')
master_table_j2 = pd.merge(master_table_j, outc9, on=['primaryid', 'caseid'], how='inner')
master_table_j3 = pd.merge(master_table_j2, reac9, on=['primaryid', 'caseid'], how='inner')
master_table_j4 = pd.merge(master_table_j3, rpsr9, on=['primaryid', 'caseid'], how='inner')

ther9 = ther9.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi9 = indi9.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_j5 = pd.merge(master_table_j4, ther9, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_j6 = pd.merge(master_table_j5, indi9, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q13data = master_table_j6
q13data['quarter'] = 13

Insert 2015 data

demo6 = pd.read_csv('DEMO15Q1.txt', sep="$", header=0)
drug6 = pd.read_csv('DRUG15Q1.txt', sep="$", header=0)
ther6 = pd.read_csv('THER15Q1.txt', sep="$", header=0)
outc6 = pd.read_csv('OUTC15Q1.txt', sep="$", header=0)
reac6 = pd.read_csv('REAC15Q1.txt', sep="$", header=0)
rpsr6 = pd.read_csv('RPSR15Q1.txt', sep="$", header=0)
indi6 = pd.read_csv('INDI15Q1.txt', sep="$", header=0)



master_table_g = pd.merge(drug6, demo6, on=['primaryid', 'caseid'], how='inner')
master_table_g2 = pd.merge(master_table_g, outc6, on=['primaryid', 'caseid'], how='inner')
master_table_g3 = pd.merge(master_table_g2, reac6, on=['primaryid', 'caseid'], how='inner')
master_table_g4 = pd.merge(master_table_g3, rpsr6, on=['primaryid', 'caseid'], how='inner')

ther6 = ther6.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi6 = indi6.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_g5 = pd.merge(master_table_g4, ther6, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_g6 = pd.merge(master_table_g5, indi6, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q14data = master_table_g6
q14data['quarter'] = 14

demo7 = pd.read_csv('DEMO15Q2.txt', sep="$", header=0)
drug7 = pd.read_csv('DRUG15Q2.txt', sep="$", header=0)
ther7 = pd.read_csv('THER15Q2.txt', sep="$", header=0)
outc7 = pd.read_csv('OUTC15Q2.txt', sep="$", header=0)
reac7 = pd.read_csv('REAC15Q2.txt', sep="$", header=0)
rpsr7 = pd.read_csv('RPSR15Q2.txt', sep="$", header=0)
indi7 = pd.read_csv('INDI15Q2.txt', sep="$", header=0)



master_table_h = pd.merge(drug7, demo7, on=['primaryid', 'caseid'], how='inner')
master_table_h2 = pd.merge(master_table_h, outc7, on=['primaryid', 'caseid'], how='inner')
master_table_h3 = pd.merge(master_table_h2, reac7, on=['primaryid', 'caseid'], how='inner')
master_table_h4 = pd.merge(master_table_h3, rpsr7, on=['primaryid', 'caseid'], how='inner')

ther7 = ther7.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi7 = indi7.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_h5 = pd.merge(master_table_h4, ther7, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_h6 = pd.merge(master_table_h5, indi7, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q15data = master_table_h6
q15data['quarter'] = 15

demo8 = pd.read_csv('DEMO15Q3.txt', sep="$", header=0)
drug8 = pd.read_csv('DRUG15Q3.txt', sep="$", header=0)
ther8 = pd.read_csv('THER15Q3.txt', sep="$", header=0)
outc8 = pd.read_csv('OUTC15Q3.txt', sep="$", header=0)
reac8 = pd.read_csv('REAC15Q3.txt', sep="$", header=0)
rpsr8 = pd.read_csv('RPSR15Q3.txt', sep="$", header=0)
indi8 = pd.read_csv('INDI15Q3.txt', sep="$", header=0)



master_table_i = pd.merge(drug8, demo8, on=['primaryid', 'caseid'], how='inner')
master_table_i2 = pd.merge(master_table_i, outc8, on=['primaryid', 'caseid'], how='inner')
master_table_i3 = pd.merge(master_table_i2, reac8, on=['primaryid', 'caseid'], how='inner')
master_table_i4 = pd.merge(master_table_i3, rpsr8, on=['primaryid', 'caseid'], how='inner')

ther8 = ther8.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi8 = indi8.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_i5 = pd.merge(master_table_i4, ther8, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_i6 = pd.merge(master_table_i5, indi8, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q16data = master_table_i6
q16data['quarter'] = 16

demo9 = pd.read_csv('DEMO15Q4.txt', sep="$", header=0)
drug9 = pd.read_csv('DRUG15Q4.txt', sep="$", header=0)
ther9 = pd.read_csv('THER15Q4.txt', sep="$", header=0)
outc9 = pd.read_csv('OUTC15Q4.txt', sep="$", header=0)
reac9 = pd.read_csv('REAC15Q4.txt', sep="$", header=0)
rpsr9 = pd.read_csv('RPSR15Q4.txt', sep="$", header=0)
indi9 = pd.read_csv('INDI15Q4.txt', sep="$", header=0)



master_table_j = pd.merge(drug9, demo9, on=['primaryid', 'caseid'], how='inner')
master_table_j2 = pd.merge(master_table_j, outc9, on=['primaryid', 'caseid'], how='inner')
master_table_j3 = pd.merge(master_table_j2, reac9, on=['primaryid', 'caseid'], how='inner')
master_table_j4 = pd.merge(master_table_j3, rpsr9, on=['primaryid', 'caseid'], how='inner')

ther9 = ther9.rename(index=str, columns={"dsg_drug_seq": "drug_seq"})
indi9 = indi9.rename(index=str, columns={'indi_drug_seq':'drug_seq'})

master_table_j5 = pd.merge(master_table_j4, ther9, on=['primaryid', 'caseid', 'drug_seq'], how='inner')
master_table_j6 = pd.merge(master_table_j5, indi9, on=['primaryid', 'caseid', 'drug_seq'], how='inner')

q17data = master_table_j6
q17data['quarter'] = 17

Join the 17 quarters covering 2015 to Q1 2019

frames = [q1data, q2data, q3data, q4data, q5data, q6data, q7data, q8data, q9data, q10data, q11data, q12data, q13data, q14data, q15data, q16data, q17data]

full_fda_data = pd.concat(frames)

full_fda_data.sex.value_counts()

full_fda_data.isnull().sum()

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Fill missing prod_ai (active ingredient) values with the drugname values, and vice versa where needed

full_fda_data.prod_ai = full_fda_data.prod_ai.fillna(value=full_fda_data.drugname)

full_fda_data.isnull().sum()

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import missingno as msno # if this package is not installed, run 'pip install missingno' in the terminal

msno.matrix(full_fda_data, labels=True)

msno.heatmap(full_fda_data, labels=True)

Exploration of the data: duplications, spelling mistakes, inappropriate values and univariate analysis of the main variables

pd.set_option('display.max_columns', None)

full_fda_data.loc[full_fda_data['caseid'] == 15129680]

#### potential ADR outcomes

full_fda_data.outc_cod.value_counts()

%matplotlib inline

full_fda_data.outc_cod.value_counts().sort_index().plot(kind='bar')

full_fda_data.wt.value_counts().sort_index().plot(kind='kde')

import matplotlib.pyplot as plt
plt.style.use('seaborn-white')

plt.hist(full_fda_data.age.value_counts().sort_index())

Proceed with df5: remove duplicate rows

df5 = full_fda_data.drop_duplicates()

len(df5)

Weight

df5['tmpwt']= df5.wt_cod == 'LBS'

df5[df5['wt_cod'] == 'LBS'].head()

df5.wt /= np.where(df5.tmpwt, 2.205, 1)
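# the division above converts pounds to kilograms only where wt_cod == 'LBS'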

df5[df5['wt_cod'] == 'LBS']

df5.head()

Age

df5[df5['age_cod'] == 'HR'].head()

# divide hr by 8760 to convert to years

df5['tmp_hr']= df5.age_cod == 'HR'

df5.age /= np.where(df5.tmp_hr, 8760, 1)

df5[df5['age_cod'] == 'HR'].head()

Fill in the age and wt values based on each other and on grouping

df5[df5['age_cod'] == 'DY'].head()

df5['tmp_dy']= df5.age_cod == 'DY'

df5.age /= np.where(df5.tmp_dy, 365, 1)

df5[df5['age_cod'] == 'DY'].head()

df5[df5['age_cod'] == 'DY'].head()

df5[df5['age_cod'] == 'WK'].head()

df5['tmp_wk']= df5.age_cod == 'WK'

df5.age /= np.where(df5.tmp_wk, 52, 1)

df5[df5['age_cod'] == 'WK'].head()

df5[df5['age_cod'] == 'MON'].head()

df5['tmp_mon']= df5.age_cod == 'MON'

df5.age /= np.where(df5.tmp_mon, 12, 1)

df5[df5['age_cod'] == 'MON'].head()

df5.head()

df5m = df5
df5m

df5m.info()

df5m['avrwt'] = df5m.wt.groupby(df5m.age).mean() ## this will be useful ## Weight 2

df5m.groupby('age')['wt'].mean() # there are outliers in the data->fix

df5m.loc[(df5m['age'] == 78.0) & (df5m['wt'] >150.0)]

xdel = df5m.loc[df5m['wt'] >1000.0]

xdel

Normalise weight for age ranges 0-10

df5m.loc[df5m['age'] <0.5]

df5m.loc[(df5m.age <= 0.0833) & (df5m.wt > 6), 'wt'] = 4.5
df5m.loc[(df5m.age > 0.0833) & (df5m.age <= 0.166) & (df5m.wt > 7), 'wt'] = 5.6
df5m.loc[(df5m.age > 0.166) & (df5m.age <= 0.254) & (df5m.wt > 8), 'wt'] = 6.4 #
df5m.loc[(df5m.age > 0.254) & (df5m.age <= 0.54) & (df5m.wt > 9.8), 'wt'] = 8 #
df5m.loc[(df5m.age > 0.54) & (df5m.age <= 0.754) & (df5m.wt > 11), 'wt'] = 9 #
df5m.loc[(df5m.age > 0.754) & (df5m.age <= 1.0) & (df5m.wt > 12), 'wt'] = 9.5 #

df5m = df5m.drop(df5m[df5m.wt > 1000].index)

df5m.loc[(df5m['age'] == 78.0) & (df5m['wt'] >150.0)]

Normalise the fractional age values for people aged 0-10. For newborns, use a finer scale (1 month, 2 months, 3 months, 6 months, 9 months, 1 year), because there is accelerated weight gain during the first year (WHO)

df5m.groupby('age')['wt'].mean()

df5m.loc[(df5m.age > 0.000001) & (df5m.age <= 0.0833), 'age'] = 0.0833 # simplify the newborn groups to 1, 2, 3, 6, 9 and 12 months, all in year units
df5m.loc[(df5m.age > 0.0833) & (df5m.age <= 0.166), 'age'] = 0.166
df5m.loc[(df5m.age > 0.166) & (df5m.age <= 0.254), 'age'] = 0.25
df5m.loc[(df5m.age > 0.254) & (df5m.age <= 0.54), 'age'] = 0.5
df5m.loc[(df5m.age > 0.54) & (df5m.age <= 0.754), 'age'] = 0.75
df5m.loc[(df5m.age > 0.754) & (df5m.age <= 1.0), 'age'] = 1.0

df5m.groupby('age')['wt'].mean().head(100)

# fix the remaining weight data

df5m.loc[(df5m.age > 1.0) & (df5m.age <= 1.5) & (df5m.wt > 13.5), 'wt'] = 11.0
df5m.loc[(df5m.age > 1.50) & (df5m.age <= 2) & (df5m.wt > 15.2), 'wt'] = 12.2

df5m.groupby('age')['wt'].mean().head(100)

df5m.loc[(df5m.age > 1.0) & (df5m.age <= 1.49999), 'age'] = 1 # fix the remaining age data
df5m.loc[(df5m.age > 1.49999) & (df5m.age <= 2.49999), 'age'] = 2.0
df5m.loc[(df5m.age > 2.49999) & (df5m.age <= 3.49999), 'age'] = 3.0
df5m.loc[(df5m.age > 3.49999) & (df5m.age <= 4.49999), 'age'] = 4.0
df5m.loc[(df5m.age > 4.49999) & (df5m.age <= 5.49999), 'age'] = 5.0
df5m.loc[(df5m.age > 5.49999) & (df5m.age <= 6.49999), 'age'] = 6.0
df5m.loc[(df5m.age > 6.49999) & (df5m.age <= 7.49999), 'age'] = 7.0
df5m.loc[(df5m.age > 7.49999) & (df5m.age <= 8.49999), 'age'] = 8.0
df5m.loc[(df5m.age > 8.49999) & (df5m.age <= 9.49999), 'age'] = 9.0
df5m.loc[(df5m.age > 9.49999) & (df5m.age <= 10.49999), 'age'] = 10.0

pd.set_option('display.max_rows', 500)

df5m.groupby('age')['wt'].mean()

df5m.loc[(df5m.age > 32.5) & (df5m.age <= 32.99), 'age'] = 33
df5m.loc[(df5m.age > 73.0) & (df5m.age <= 73.49), 'age'] = 73
df5m.loc[(df5m.age > 77.5) & (df5m.age <= 77.9), 'age'] = 78

df5m.isnull().sum()

mean_wtperage = df5m.groupby('age')['wt'].mean()

mean_wtperage

mean_wtperage = mean_wtperage.to_frame()
#mean_wtperage = pd.DataFrame(mean_wtperage, columns = ['age', 'meanweight'])

mean_wtperage

mean_wtperage['age'] = mean_wtperage.index

mean_wtperage

Fill in missing weight values with the average weight value for the given age

a= df5m.copy()
a.isnull().sum()

# note: fillna with a raw Series aligns on the row index rather than on age,
# so the per-age mean weights are mapped through the age column instead
a['wt'] = a['wt'].mask(a['wt'].eq(0)).fillna(a['age'].map(mean_wtperage.set_index('age').wt))

a.isnull().sum()

a[a['age'].isnull()]

# Trial of the opposite operation - fill in the average age for a given weight - this does not map cleanly because the same weight corresponds to multiple ages

meanwt2 = mean_wtperage.copy()
meanwt2.wt = round(meanwt2.wt)

meanwt2 = meanwt2.ffill()
meanwt2.wt = meanwt2.wt.astype(int)

meanwt2

b = pd.DataFrame(meanwt2.wt.unique(), columns = {'wt'})
b['age'] = np.NaN
b.head()

for i in b['wt']:
    mean_age = meanwt2[meanwt2.wt == i].age.mean()
    b.loc[b.wt == i, 'age'] = mean_age # assign the mean age for each unique weight value

b.head()

a['age'] = (a['age'].mask(a['age'].eq(0))
            .fillna(round(a['wt']).map(b.set_index('wt').age)))

a[a.age.isnull()].head(10)

a.isnull().sum()

Filter the dataframe for important variables and remove the duplicates

dfx = a[['primaryid', 'caseid', 'drug_seq', 'drugname', 'prod_ai', 'route', 'age', 'sex', 'wt', 'pt', 'indi_pt', 'outc_cod']]

dfx.prod_ai = dfx.prod_ai.fillna(value=dfx.drugname)

dfx1 = dfx.dropna().drop_duplicates()

dfx1

dfx1.isnull().sum()

Match records on caseid, split on outcome (DE vs HO) to create separate datasets, and compare them

total = dfx1

Data reformatting: outcome

death = total.loc[total['outc_cod'] == 'DE']

hospital = total.loc[total['outc_cod'] == 'HO']

hospital

death

death['caseid'].isin(hospital['caseid']).astype(int)

hospital['vp2p'] = hospital["caseid"].isin(death["caseid"])

hospital.vp2p.value_counts()

death['vp2p'] = death["caseid"].isin(hospital["caseid"])

death.info()

death.vp2p.value_counts()

hospitalised_then_died = hospital.loc[hospital['vp2p'] == True] # cases where patient was hospitalised then died

hospitalised_then_died

newhospital = hospital.loc[hospital['vp2p'] == False] # selects cases where the patient was only hospitalised but didn't die

newdied = death.loc[death['vp2p'] == False] # selects cases of death without hospitalisation

hospitalised_then_died

newdied

newhospital

newhospital.loc[(newhospital.outc_cod == 'HO'), 'outc_cod'] = 'H'

newhospital

newdied.loc[(newdied.outc_cod == 'DE'), 'outc_cod'] = 'D'

hospitalised_then_died.loc[(hospitalised_then_died.outc_cod == 'HO'), 'outc_cod'] = 'HD'

f1 = [newhospital, newdied, hospitalised_then_died]
newdf = pd.concat(f1)

newdf

Other cleaning

newdf['drugname'] = newdf['drugname'].str.upper()
newdf['prod_ai'] = newdf['prod_ai'].str.upper()
newdf['route'] = newdf['route'].str.upper()
newdf['pt'] = newdf['pt'].str.upper()
newdf['indi_pt'] = newdf['indi_pt'].str.upper()

newdf.head()

newdf.info()

newdf.nunique()

newdf.route.value_counts()

squeeze=False

%matplotlib inline

newdf.age.plot(kind='kde') # After the cleaning the age class has close to a normal distribution

Fill in the values for drugname and prod_ai; fix the punctuation and abnormal symbols in the names

newdf['route'] = newdf['route'].str.split(' ').str[0]

newdf.route.value_counts()

newdf.loc[newdf['drugname'] == '']

newdf.drugname.value_counts()

newdf2 = newdf.copy()

newdf2.drugname.value_counts()

newdf2['drugname'] = newdf2['drugname'].str.split('0').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('1').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('2').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('3').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('4').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('5').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('6').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('7').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('8').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('9').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('(').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('.').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split(',').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('P/O').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('- GRASS').str[0]
newdf2['drugname'] = newdf2['drugname'].str.split('/PFIZER').str[0]

newdf2.drugname.value_counts()

newdf2.loc[newdf2['drugname'] == 'N']

newdf2.loc[newdf2['drugname'] == 'D']

newdf2.loc[(newdf2.drugname == 'N'), 'drugname'] = 'NIVOLUMAB'

newdf2.loc[(newdf2.drugname == 'D') & (newdf2.prod_ai == "DEXTROSE\RINGER'S SOLUTION, LACTATED"), 'drugname'] = 'DEXTROSE'

newdf2.loc[(newdf2.drugname == 'D') & (newdf2.prod_ai == "DIVALPROEX SODIUM"), 'drugname'] = 'DIVALPROEX SODIUM'

newdf2.loc[(newdf2.drugname == 'D') & (newdf2.prod_ai == "DEXTROSE"), 'drugname'] = 'DEXTROSE'

newdf2.loc[newdf2['prod_ai'] == '']

newdf2.drugname = np.where(newdf2.drugname == '', newdf2.prod_ai, newdf2.drugname)

newdf2

newdf2.prod_ai.value_counts()

newdf2['prod_ai'] = newdf2['prod_ai'].str.split('0').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('1').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('2').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('3').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('4').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('5').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('6').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('7').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('8').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('9').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split('.').str[0]
newdf2['prod_ai'] = newdf2['prod_ai'].str.split(',').str[0]

newdf2.prod_ai.value_counts()

newdf2.loc[newdf2['prod_ai'] == '']

newdf2.prod_ai = np.where(newdf2.prod_ai == '', newdf2.drugname, newdf2.prod_ai)

newdf2.loc[newdf2['prod_ai'] == '']

newdf2.drugname.value_counts()

newdf2.drugname = np.where(newdf2.drugname == '', newdf2.prod_ai, newdf2.drugname)

newdf2.drugname.value_counts()

newdf2.loc[newdf2['drugname'] == 'R']

newdf2.drugname = np.where(newdf2.drugname == 'R', newdf2.prod_ai, newdf2.drugname)

newdf2.loc[newdf2['drugname'] == 'R']

newdf3 = newdf2.copy()

newdf3.drugname = newdf3.drugname.str.rstrip()

newdf3.drugname = newdf3.drugname.str.rstrip('/')

newdf3.drugname = newdf3.drugname.str.rstrip('?')

newdf3.drugname = newdf3.drugname.str.rstrip('+')

newdf3.drugname = newdf3.drugname.str.rstrip('-')

newdf3.drugname.value_counts()

newdf3.prod_ai.value_counts()

newdf3.prod_ai = newdf3.prod_ai.str.rstrip('-')

newdf3.prod_ai.value_counts()

newdf3

newdf3 = newdf3.drop(labels='vp2p', axis=1)

newdf3

newdf3.outc_cod.value_counts()

### Encoding of newdf3 for the purpose of multicollinearity validation. For model training the data will have to be encoded in a different way, with OneHotEncoder


from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df_encoded = newdf.apply(le.fit_transform)

fdf = newdf3[['drugname', 'prod_ai', 'route','pt', 'indi_pt', 'age', 'sex', 'wt', 'drug_seq', 'outc_cod']]

fdf = fdf[fdf.sex != 'UNK']

fdf

fdf['drugname'] = le.fit_transform(fdf['drugname'])

fdf['prod_ai'] = le.fit_transform(fdf['prod_ai'])
fdf['route'] = le.fit_transform(fdf['route'])
fdf['sex'] = le.fit_transform(fdf['sex'])
fdf['pt'] = le.fit_transform(fdf['pt'])
fdf['indi_pt'] = le.fit_transform(fdf['indi_pt'])
fdf['outc_cod'] = le.fit_transform(fdf['outc_cod'])

fdf

Check for multicollinearity 2

# the code adapted from https://stackoverflow.com/questions/29432629/plot-correlation-matrix-using-pandas

# the next two lines are leftover from the adapted example and are not used below
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(12, 12))
corr = fdf.corr()
corr_graph = corr.style.background_gradient(cmap='coolwarm')

corr_graph

from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing



# create function
enc = preprocessing.OneHotEncoder()

# fit the df to the function
enc.fit(fdf)

# Transform to array and check the shape
onehotlabels = enc.transform(fdf).toarray()
onehotlabels.shape

onehotlabels ## do not run - the Jupyter Notebook will crash; there are too many unique feature values to use all features with OneHotEncoder - dummify the relevant features instead

newdf3

Encode the sex

fdf.loc[(fdf.sex == 'F'), 'sex'] = '0'
fdf.loc[(fdf.sex == 'M'), 'sex'] = '1'

Export the dataframe as CSV and open it in a new Jupyter Notebook, because this one may crash due to the large amount of data being opened and rendered

fdf.to_csv('fdf.csv', sep='\t')

fdf2 = newdf3[['drugname', 'prod_ai', 'route','pt', 'indi_pt', 'age', 'sex', 'wt', 'drug_seq', 'outc_cod']]

fdf2.outc_cod.value_counts()

fdf2 = fdf2[fdf2.sex != 'UNK']
fdf2.to_csv('fdf.csv', sep='\t')

OPEN Jupyter Notebook File 2

import pandas as pd
fdf = pd.read_csv('fdf.csv', sep='\t', header=0)

fdf.info()

Do the remaining cleaning and processing operations

hosp1 = fdf.loc[fdf['outc_cod'] == 'H']

dead1 = fdf.loc[fdf['outc_cod'] == 'D']

hospdead1 = fdf.loc[fdf['outc_cod'] == 'HD']

newhosp = hosp1.sample(n=15000)

newhosp2 = hosp1.sample(n=12734)

newhosp.loc[(newhosp.outc_cod == 'H'), 'outc_cod'] = '0'
hospdead1.loc[(hospdead1.outc_cod == 'HD'), 'outc_cod'] = '1'
dead1.loc[(dead1.outc_cod == 'D'), 'outc_cod'] = '2'

frames = [newhosp, hospdead1, dead1]

data = pd.concat(frames)

data.outc_cod.value_counts()

data.info()

data["outc_cod"] = pd.to_numeric(data["outc_cod"])

data.loc[(data.sex == 'F'), 'sex'] = '0'
data.loc[(data.sex == 'M'), 'sex'] = '1'

data["sex"] = pd.to_numeric(data["sex"])

data.info()

Dummify

dummyai = pd.get_dummies(data['prod_ai'],prefix='d_ai', prefix_sep='_', drop_first=True)

dummyroute = pd.get_dummies(data['route'],prefix='route', prefix_sep='_', drop_first=True)

dummypt = pd.get_dummies(data['pt'],prefix='pt', prefix_sep='_', drop_first=True)

numandencoded_features = data[['drug_seq', 'age', 'sex', 'wt', 'outc_cod']]

numandencoded_features.outc_cod.value_counts()

mldata3 = dummyai.join(dummyroute)

mldata2 = mldata3.join(dummypt)

mldata = mldata2.join(numandencoded_features)

Split the data in a 75:25 ratio into training and test sets respectively

mldata.shape

n = int(mldata.shape[1] - 1)
n

X = mldata.iloc[:,0:n].values

Y = mldata.iloc[:,n].values

import numpy as np

np.unique(Y)


from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

Balance the data

from imblearn.over_sampling import SMOTE

import collections

collections.Counter(Y_train)

smt = SMOTE() # over-sampling done only on the training set
X_train, Y_train = smt.fit_sample(X_train, Y_train)

collections.Counter(Y_train)

Standardise the features

# Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

Fix the Y labels: one-hot encode the train and test labels

from keras.utils import to_categorical

Y_train = to_categorical(Y_train)

Y_train

Y_test = to_categorical(Y_test)

Appendix B: FDA ASC_NTS Description file

[Figures not included in this excerpt]

Appendix C: Initial Data Quality Assessment

ADR outcome counts

[Figure not included in this excerpt]

Weight distribution

[Figure not included in this excerpt]

Age Distribution

[Figure not included in this excerpt]

Data types and their counts

[Figure not included in this excerpt]

Female to Male ratio

[Figure not included in this excerpt]

List of missing values

[Figure not included in this excerpt]

Appendix D: WHO Age vs Age Relationship Diagrams

[Figures not included in this excerpt]

Appendix E: Specification of subsequent model parameters

Models 1-7

Binary Model

[Figures not included in this excerpt]

Binary Model

[Figure not included in this excerpt]

Appendix F: Performance Metrics of designed models

The accuracy scores are given in the format [Train Loss, Train Acc %] [Test Loss, Test Acc %].

Model 2

[Figure not included in this excerpt]

Model 2b

[Figure not included in this excerpt]

Model 3

[Figure not included in this excerpt]

Model 4

[Figure not included in this excerpt]

Model 4B

[Figure not included in this excerpt]

Model 5

[Figure not included in this excerpt]

Model 6

[Figure not included in this excerpt]

Model 7

[Figure not included in this excerpt]

Binary Model

[Figure not included in this excerpt]

[...]
