A Short Critical, Non-Technical, Non-Mathematical Paper about Regression Analysis

An Introduction for Beginners

by Matthias Zöphel (Author), Christian Egger (Author), Hansjakob Riedi (Author)

Research Paper (postgraduate), 2008, 18 Pages

Mathematics - Analysis






3.1 “The Effect of a Quality Management System on Supply Chain Performance: An Empirical Study in Taiwan” (Liu 2009)
3.2 “Applying the Theory of Planned Behavior (TPB) to Predict Internet Tax Filing Intentions” (Ramayah, Yusliza et al. 2009)




This report provides an insight into regression analysis in three sections. First, the technique is described in a non-mathematical way by following the six-stage procedure used in Hair’s multivariate analysis, 3rd edition. To aid understanding and extend the discussion, several other sources have been incorporated into this assignment. The second section identifies the limitations of regression analysis, indicating when it is appropriate to use and what limitations arise once it is used. Finally, the third section presents two research examples that are structured according to the six-stage procedure set out in the technique description section.


Regression analysis is one of the techniques of multivariate analysis and is used when the relationship between several predictor (independent) variables and a dependent (criterion) variable is to be analyzed (Pearson 1908). In contrast to simple regression, which incorporates only a single independent variable, multiple regression incorporates several independent variables, with the advantage of achieving a higher level of prediction of the dependent variable. Besides predicting the dependent variable, the second intended outcome of multiple regression is explanation: identifying how much each independent variable contributes to predicting the dependent variable. The purpose of these two aims is to enable organizations to create knowledge and thereby improve decision-making (Hair 2006). The regression analysis technique comprises six stages, described below, which lead to the prediction of the dependent variable.
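As a minimal sketch of what predicting a dependent variable from several independent variables can look like in practice, the following Python example fits a multiple regression by ordinary least squares. The data, the variable names and the "true" coefficients are invented for illustration and do not come from the paper; NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 observations of two independent variables.
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Assumed relationship, used only to generate example data.
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate of intercept and slopes.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared: share of the variation in y explained by the model.
residuals = y - X @ coef
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("intercept and slopes:", coef)
print("R-squared:", r_squared)
```

The estimated slopes indicate how much each independent variable contributes to the prediction, while R-squared summarizes the overall level of prediction.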

Stage 1: Stating the Research Problem

The first stage in multiple regression analysis is to determine the researcher’s objective, meaning that the researcher must select the dependent variable he wishes to predict. Once the researcher knows what he wants to predict, he selects the independent variables that he considers influential, that is, variables that help explain why the dependent (criterion) variable changes. Based on his theoretical knowledge, the researcher can form assumptions about how much weight each selected independent variable carries in predicting the dependent variable. Multiple regression analysis requires metric (quantitative) variables; non-metric (qualitative) variables can only be used once they have been transformed into dummy variables, which this report will refer to later on. When selecting the variables, the researcher must be aware of specification errors, which are either the inclusion of a non-essential variable or the omission of an essential one. Including a non-essential variable reduces model parsimony, and in both cases the outcome of the regression analysis can be distorted. Once the researcher knows what he would like to predict and by what means, he has to collect data, which is the second stage of regression analysis.

Stage 2: Designing the Research

The amount of data needed depends on the number of independent variables the researcher intends to incorporate into his regression model. Hair (2006) suggests that the ratio of observations to independent variables should not fall below 5:1. The desired ratio, however, is 15:1 to 20:1, meaning that 15 to 20 observations should be available for each independent variable. The higher the ratio, the greater the degrees of freedom and the statistical power, both of which support the generalizability of the results. The researcher can therefore improve generalizability either by excluding independent variables or by increasing the sample size, since both raise the ratio. However, the researcher must also be careful not to select a sample that is too large, as this can make the regression analysis overly sensitive. Data can be collected through surveys, questionnaires or other means. The researcher must take care to use questions that are appropriate and lead to the most reliable outcome. If this is not done properly, if collected data is used inappropriately, or if some data is missing, the result is referred to as measurement error, which leads to invalid and unreliable results. Measurement error cannot be eliminated entirely, as there will always be some problem when designing the research. Still, the researcher is required to minimize it and thus to reach the most acceptable level of validity. Once the researcher has collected the necessary data, he enters it into statistical software, which helps him identify whether the data meets the assumptions required by regression analysis, described next in stage three.
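The ratio rule described above is simple arithmetic and can be sketched in a few lines of Python. The function names and the wording of the returned labels are hypothetical; only the 5:1 minimum and the 15:1 lower bound of the desired range are taken from the text.

```python
def observations_per_variable(n_observations: int, n_independent: int) -> float:
    """Return the ratio of observations to independent variables."""
    if n_independent <= 0:
        raise ValueError("at least one independent variable is required")
    return n_observations / n_independent


def assess_ratio(ratio: float) -> str:
    """Classify a ratio against the 5:1 minimum and the 15:1 desired level."""
    if ratio < 5:
        return "below the 5:1 minimum"
    if ratio < 15:
        return "above the minimum but below the desired 15:1"
    return "at or above the desired 15:1"


# Example: 60 observations and 4 independent variables give a ratio of 15:1.
ratio = observations_per_variable(60, 4)
print(ratio, assess_ratio(ratio))
```

Dropping an independent variable or collecting more observations both increase the ratio returned by `observations_per_variable`.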

Stage 3: Meeting the Assumptions

The four assumptions in multivariate analysis are linearity, homoscedasticity, normality and independence of the error terms. A violation of these assumptions will lead to invalid outcomes. Therefore, the researcher is required to test the assumptions twice: first for each individual variable and second for the overall model (Hair 2006). When one of the assumptions is not met, the researcher needs to undertake data transformation.

i. Normality

Normality is the most fundamental assumption of regression analysis. It refers to the shape of the data distribution for an individual metric variable and its correspondence to the normal distribution (Hair 2006). Normality can be assessed with statistical tests, histograms or normal probability plots, the last of which are better suited to smaller sample sizes. Small sample sizes are also the main reason why non-normality occurs. Histograms are usually used for identification; non-normality shows as a curve that is too flat or too peaked, or as an unbalanced curve that is shifted to the right or to the left. Outliers are often the reason for a non-normal distribution. An outlier is an observation that deviates strongly from the other collected data. When a non-normal distribution has been identified from the diagrams, the researcher is required to apply remedies to restore a normal distribution pattern. The transformations that can be applied are taking the inverse, the square root or the logarithm, or squaring or cubing the variable. Which remedy to use depends on the non-normal pattern of the curve, but in practice it is often a trial-and-error process. In the case of outliers, it is often preferable to delete those observations from the analysis in order to achieve a normal distribution.
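To illustrate how a transformation can bring a shifted (skewed) distribution closer to the normal shape, the sketch below measures skewness before and after taking logarithms. The data are simulated and the `skewness` helper is a hypothetical illustration; the paper itself does not prescribe a particular skewness measure.

```python
import numpy as np


def skewness(values) -> float:
    """Sample skewness: about 0 for a symmetric curve, positive for a long right tail."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)


rng = np.random.default_rng(1)

# Simulated variable with a long right tail (assumed for illustration).
raw = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# Remedy: take logarithms, one of the transformations named in the text.
transformed = np.log(raw)

skew_raw = skewness(raw)
skew_log = skewness(transformed)

print("skewness before:", skew_raw)
print("skewness after: ", skew_log)
```

In a trial-and-error process, the same comparison could be repeated with the square root or the inverse to see which transformation leaves the least skewness.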

ii. Homoscedasticity

Homoscedasticity refers to the assumption that the dependent variable exhibits equal levels of variance across the range of predictor variables. Thus, if homoscedasticity is not met, the relationship is tested unfairly across the values of the independent variables (Foster 2006). To identify whether homoscedasticity exists, the researcher is best advised to use scatterplots. When most of the values are concentrated in the middle range of the scatterplot, or when one or more variables appear skewed, the researcher is confronted with heteroscedasticity and is required to apply remedies to restore homoscedasticity. Taking the inverse or the square root are again the transformations to apply. When the scatterplot shows a cone that opens to the right, the researcher takes the inverse; when it opens to the left, he takes the square root.
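A cone-shaped scatterplot can also be checked numerically. The following sketch, which is an informal illustration rather than a formal test from the paper, fits a straight line to simulated data and compares the spread of the residuals in the lower and upper halves of the predictor range. All names and data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data whose error spread grows with x (a cone opening to the right).
n = 400
x = rng.uniform(1.0, 10.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=n)

# Fit a straight line; np.polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Split the residuals into those for low and high values of x.
order = np.argsort(x)
low = residuals[order[: n // 2]]
high = residuals[order[n // 2:]]

# A ratio near 1 suggests equal variance; a ratio far from 1 suggests heteroscedasticity.
variance_ratio = high.var() / low.var()
print("variance ratio (high x / low x):", variance_ratio)
```

A ratio well above 1 corresponds to a cone that opens to the right, and a ratio well below 1 to a cone that opens to the left.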

iii. Linearity

Linearity means that a change in the independent variable is associated with a proportionate change in the dependent variable across the whole range of values. If this is not the case, the relationship is curvilinear, which leads to invalid results. Linearity can best be identified with scatterplots and is met when the points follow a straight line. When curvilinearity exists, the researcher is best advised to create polynomials. Polynomials are power transformations of an independent variable that add a non-linear component for each additional power of the independent variable (Stamatis 2003). Thus, by adding powers of the variables, the researcher is able to meet the assumption of linearity.
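The effect of adding a power of the independent variable can be sketched as follows. The example uses simulated curvilinear data and compares the fit of a straight line with the fit of a second-degree polynomial; the data and the `r_squared` helper are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated curvilinear relationship (assumed for illustration).
x = np.linspace(-3.0, 3.0, 200)
y = 1.0 + 0.5 * x + 1.5 * x ** 2 + rng.normal(scale=0.5, size=x.size)


def r_squared(observed, fitted) -> float:
    """Share of the variation in the observed values explained by the fitted values."""
    ss_res = np.sum((observed - fitted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Straight line (power 1) versus polynomial with an added squared term (power 2).
linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_fit = np.polyval(np.polyfit(x, y, deg=2), x)

r2_linear = r_squared(y, linear_fit)
r2_quadratic = r_squared(y, quadratic_fit)

print("R-squared, straight line:", r2_linear)
print("R-squared, with x squared:", r2_quadratic)
```

A large gain in R-squared from the added power is a sign that the straight line was missing a curved component.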

iv. Independence of the error terms

As the term already indicates, independence of the error terms is the absence of any correlation between prediction errors. When residual plots show a systematic pattern, for example all errors being positive for one set of observations and negative for the alternative set, the researcher can be fairly certain to have identified correlation between error terms (Gilmartin and Hartka 1992). Correlated error terms are related to multicollinearity, in which independent variables are correlated with each other. When errors occur while multicollinearity exists, it is difficult for the researcher to identify how much each independent variable contributes to the error, which leads him to either underestimate or overestimate the effect that one independent variable has on the dependent variable. To remedy a violation of this assumption, the researcher is often advised to add a moderator effect that captures the dependence. A moderator effect is most of the time a dummy variable stating that the dependence is caused by either a) or b). An example is gender, where men (a) have a different effect on the outcome than women (b).
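Correlation between successive prediction errors can be quantified as well as inspected visually. The sketch below uses the Durbin-Watson statistic, a common diagnostic that the paper itself does not mention and that is swapped in here purely for illustration. Values near 2 suggest uncorrelated errors, and values well below 2 suggest positive correlation between neighbouring errors. The residual series are simulated.

```python
import numpy as np


def durbin_watson(residuals) -> float:
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


rng = np.random.default_rng(4)
n = 300

# Case 1: independent errors.
independent = rng.normal(size=n)

# Case 2: each error carries over 80 percent of the previous one (assumed value).
shocks = rng.normal(size=n)
correlated = np.empty(n)
correlated[0] = shocks[0]
for t in range(1, n):
    correlated[t] = 0.8 * correlated[t - 1] + shocks[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

print("independent errors:", dw_independent)
print("correlated errors: ", dw_correlated)
```

The statistic only looks at errors in their given order, so it is most meaningful when the observations have a natural sequence, such as time.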

Once each variable and the overall variate have been tested against the assumptions and remedies have been successfully applied, the researcher proceeds to stage four, in which he estimates the regression model.

Stage 4: Estimating and Assessing the Model

The researcher has examined all variables in the first three stages and now decides which independent variables to include in the variate, with the intention of achieving the highest statistical and practical significance of the prediction. To do so, the researcher can either select the variables himself through the confirmatory approach, based on the results from the previous steps, or use a regression procedure, either sequential search methods or the combinatorial approach, both of which choose the independent variables for him based on the best regression model. The three approaches that intend to find the best regression model are discussed below.



Institution / College
University of Applied Sciences Chur – MSc Entrepreneurship



