Table of Contents
I. Introduction and Purpose of this Project
II. Project Related Basics in Statistics
III. Description of Selected Data Set
A. General Description
B. Box Plot
C. Histogram
D. Scatter Diagrams
E. Seasonal Index
IV. Regression Analyses
A. Simple Linear Regression Analysis
B. Multiple Regression Analysis – Linear Model
C. Analysis of Residuals
D. Multiple Regression Analysis – Natural Log Transformation
I. Introduction and Purpose of this Project
Statistical analyses play an important role today. In many fields, such as science and economics, they are used to support assumptions and to predict future data. In business administration, for instance, modern business statistics can inform decision making in finance, marketing, or production.
The scope of the current project is to analyze a data set “Ibell” of phone calls and to predict the future quantity of phone calls based on a regression analysis. The “Ibell” data set is related to the U.S. based company International Bell Communications (Ibell), which owns and operates direct routes throughout the world (International Bell Communications, 2008). Four variables are provided in the “Ibell” data set: three independent variables and one dependent (also called response) variable. The independent, or predictor, variables are “Quarter”, “Price” (price charged for long-distance calls in US$), and “Perinc” (reflecting the local average personal income in US$). The dependent variable is “Quantity”, the number of long-distance phone calls. The data set was provided by the professor of the QMB class. Thus, the data has not been personally collected, and the author of this report cannot personally vouch for the quality of the data set. However, the predictor variables “Quarter”, “Price”, and “Perinc” seem to be reasonable influences on the number of long-distance calls.
There are three major parts in this report. First, a general description of the data set will be presented, including the types of variables, the characteristics of the observations, and any peculiarities in the distribution. Second, regression analyses assess the validity of a modeled relationship between the dependent and the independent variables. Finally, the researcher will predict the future quantity of long-distance calls for the upcoming four quarters in order to support International Bell Communications in network capacity planning as well as in revenue forecasts.
II. Project Related Basics in Statistics
Since the current data set is only a sample of a population, some crucial properties of sample statistics have to be taken into account before starting with the report. Every sample statistic carries a sampling error, which results from the fact that the sample represents only an extract of the total population. Besides these inherent sampling errors, there are also non-sampling errors such as measurement errors, a mismatch between sample and population, or experimenter bias (Gayle Baugh lecture notes). As previously mentioned, the researcher did not personally collect the data and therefore can only assume that the data set is free of non-sampling errors.
Assuming a perfect sample data set, certain predictions about the total population are statistically valid. Although the sample coefficients are not the same as the population parameters, the distribution of the latter can be inferred from the sample. If the sample size is large enough (according to Anderson, Sweeney, and Williams, a size of 30, or 50 if the population is highly skewed), the sampling distribution of a variable can be approximated by a normal distribution (Central Limit Theorem). In addition, the larger the sample, the higher the probability that the sample result is representative of the population, and thus the higher the probability that the sample mean falls within a specified distance of the population mean (Anderson, Sweeney, Williams, 2006). However, “because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate.” (Anderson, Sweeney, Williams, 2006, p. 307)
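The margin-of-error idea from the quotation can be sketched in a few lines of Python, using the normal approximation that the Central Limit Theorem justifies for a sample of size 30. The sample values below are purely illustrative and are not taken from the “Ibell” data:

```python
import math
from statistics import NormalDist, mean, stdev

# Illustrative sample of 30 observations (not the actual Ibell data).
sample = [14, 15, 13, 16, 15, 14, 17, 15, 16, 14,
          15, 13, 16, 15, 14, 15, 16, 14, 15, 16,
          13, 15, 14, 16, 15, 14, 15, 16, 15, 14]

n = len(sample)
x_bar = mean(sample)   # point estimate of the population mean
s = stdev(sample)      # sample standard deviation

# For a 95% interval, find z such that 95% of a standard normal
# distribution lies within +/- z of the mean.
z = NormalDist().inv_cdf(0.975)   # approx. 1.96

# Margin of error, added to and subtracted from the point estimate.
margin_of_error = z * s / math.sqrt(n)
interval = (x_bar - margin_of_error, x_bar + margin_of_error)
print(interval)
```

For small samples a t-distribution multiplier would be slightly more conservative; the z value is used here because the sample size meets the threshold named above.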
Based on the findings in the sample, assumptions can be made. The tentative assumption is called the null hypothesis; its opposite is the alternative hypothesis. “The hypothesis testing uses data from a sample to test the two competing statements indicated by null hypothesis and alternative hypothesis.” (Anderson, Sweeney, Williams, 2006, p. 347) As a general guideline, research statements should be formulated as the alternative hypothesis. Hypothesis testing, and particularly hypothesis based decisions, can be critical. Therefore, the probability of Type I errors (rejecting the null hypothesis although it is true) and particularly of Type II errors (accepting the null hypothesis although the alternative hypothesis is true) has to be minimized as much as possible. Certain checks help to decrease the error probability. For instance, “the level of significance is the probability of making a Type I error when the null hypothesis is true as an equality.” (Anderson, Sweeney, Williams, 2006, p. 350) Researchers define a level of significance “alpha”, which represents the risk they are willing to take. This alpha value can be compared to a probability value, the “p-value”. “The p-value is a probability, computed using the test statistic, that measures the support (or lack of support) provided by the sample for the null hypothesis.” (Anderson, Sweeney, Williams, 2006, p. 354) That means it is recommended to reject the null hypothesis if the p-value is less than or equal to alpha.
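The decision rule of comparing the p-value to alpha can be illustrated with a minimal sketch. The hypothesized mean, sample figures, and known standard deviation below are invented for illustration and do not come from the Ibell data:

```python
from statistics import NormalDist

# Hypothetical two-tailed test of H0: mu = 15000 vs. Ha: mu != 15000.
mu_0 = 15000    # value claimed by the null hypothesis
x_bar = 15600   # observed sample mean (illustrative)
sigma = 2000    # assumed known population standard deviation
n = 64          # sample size
alpha = 0.05    # chosen level of significance

# Standardized test statistic.
z = (x_bar - mu_0) / (sigma / n ** 0.5)

# Two-tailed p-value: probability, under H0, of a result at least
# as extreme as the one observed.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Decision rule: reject H0 if the p-value is less than or equal to alpha.
reject_h0 = p_value <= alpha
print(z, p_value, reject_h0)
```

With these illustrative numbers the test statistic is 2.4 and the p-value falls below .05, so the null hypothesis would be rejected.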
The process of statistical inference, that is, using data obtained from a sample to make estimates or test hypotheses about the characteristics of a population, can also be applied to the Ibell project. The corresponding data set comprises 76 observations. This amount of data is sufficient for the statistical analyses presented in this project and hence allows predicting the future quantity of long-distance calls for the specific local region in which the data was collected. The alternative hypothesis for this project is an increased quantity of long-distance phone calls in the upcoming four quarters, 77 to 80. The opposite statement, no increased quantity of long-distance phone calls, is the null hypothesis. Since the predicted quantity is a very important number for International Bell Communications, the researcher defines an alpha value of .05 as the level of significance (see chapter IV B. “Multiple Regression Analysis – Linear Model”) and hence as the limit for the p-value checks.
III. Description of Selected Data Set
The following chapter provides a description of the “Ibell” data set. Starting with a general presentation of some properties of the data, additional graphical illustrations follow: a box plot, a histogram, a seasonal index, and some scatter diagrams will be presented as a first descriptive analysis of the “Ibell” data set. The main focus will be on the dependent variable “Quantity”.
A. General Description
The data set contains 76 observations – it is complete and not missing any information – and four variables, three independent and one dependent. All variables are based on quantitative data, and they are measured on a ratio scale. “The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.” (Anderson, Sweeney, Williams, 2006, p. 7) Moreover, the data set comprises time series data, which means that the observations were collected over several time periods. With respect to the source of the data, the researcher can only make assumptions, since the data was not personally collected. There is no denying that the variables “Quarter”, “Price”, and “Quantity” can be extracted from existing sources such as company records. The fourth variable, “Perinc”, is presumably derived from statistical studies.
Running a Microsoft Excel based “descriptive statistics” analysis on the variables yields the results presented in tables 1-3. Table 1 indicates the range as well as the minimum and maximum values of each variable. Based on this first analysis, there is no indication of outliers or extreme values so far. Table 2 lists measures of central location such as the mean and the median. The mean provides a measure of central location for the data; it is calculated as the sum of all values divided by the number of observations. Arranging the data in ascending or descending order, the median can be determined as the middle value of the observations. Finally, the mode is the value that occurs with the greatest frequency. The variables “Quantity” and “Quarter” do not have a mode since no value appears twice in the data set. In general, the values for the mean, the median, and the mode are plausible and can be an indicator (combined with the minimum and maximum values) of an approximately normal distribution. Table 3 contains derived values measuring the variability, or dispersion, of the data. The sample variance is a measure of the variability of the data set in relation to the mean; it is the sum of the squared differences between each observation and the mean, divided by the number of observations minus one. The standard deviation is derived from the variance: it is defined as the positive square root of the variance (Anderson, Sweeney, Williams, 2006). Its advantage over the variance is that it is measured in the same unit as the original data, since taking the square root removes the squaring effect of the variance calculation.
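The measures described above can be reproduced in a few lines of Python with the standard library’s statistics module; the small data set is illustrative, not the Ibell values:

```python
from statistics import mean, median, mode, variance, stdev

# Small illustrative data set (not the Ibell observations).
data = [2, 4, 4, 4, 5, 5, 7, 9]

m = mean(data)        # sum of all values / number of observations
med = median(data)    # middle value of the ordered data
mo = mode(data)       # value that occurs with the greatest frequency

# Sample variance: squared deviations from the mean, divided by n - 1.
var = variance(data)
# Standard deviation: positive square root of the variance, so it is
# measured in the same unit as the original data.
sd = stdev(data)
print(m, med, mo, var, sd)
```

For this data set the mean is 5, the median 4.5, the mode 4, and the sample variance 32/7 (the eight squared deviations sum to 32, divided by 7).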
With respect to the dependent variable “Quantity” this means that we have a range between 10,164.84 (mean minus one standard deviation) and 18,281.19 (mean plus one standard deviation). Assuming an approximately normal distribution, based on the considerations above, we can estimate that about 68.3 percent of the data (the standard share of a normal distribution within one standard deviation of the mean) fall within this range. In summary, the current analysis of the dependent variable “Quantity” indicates an approximately normal distribution and no sign of extreme values. Even if there were any unusual observation, the researcher would not be allowed to correct it, since the data has not been personally collected and there is no indication of non-sampling errors. Nonetheless, it makes sense to gather supporting evidence by creating boxplot diagrams with KADD.
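The 68.3 percent figure used here is the share of a standard normal distribution lying within one standard deviation of the mean; it can be verified numerically:

```python
from statistics import NormalDist

# Probability mass of a standard normal distribution between -1 and +1
# standard deviations from the mean.
within_one_sd = NormalDist().cdf(1) - NormalDist().cdf(-1)
print(round(within_one_sd, 4))   # approx. 0.6827
```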
B. Box Plot
A boxplot, invented in 1977 by the American statistician John Tukey, is a convenient way of visualizing certain statistical data. It is built from four observation categories. The lower, or first, quartile cuts off the lowest 25 percent of the data. As explained in the previous chapter, the median is the middle value of the observations. The upper, or third, quartile cuts off the highest 25 percent of the data. Finally, the interquartile range (IQR) is the range between the third and first quartiles and is a measure of statistical dispersion. It is a more robust statistic than the range presented in the previous chapter and is hence often preferred. Analogous to the range defined by the mean plus or minus one standard deviation in a normal distribution (representing about 68 percent of the data), the interquartile range defines a certain area (50 percent) of dispersion. This interquartile range is represented by the “box” pattern in the graphic in table 4. The box is bounded on the bottom by the first quartile (vertical line below the box) and on the top by the third quartile (vertical line above the box). The horizontal line dividing the box indicates the median value.
Moreover, a boxplot indicates moderate and extreme outliers. Any value that lies more than 1.5 times the IQR below the first quartile or more than 1.5 times the IQR above the third quartile is considered a moderate outlier. Extreme outliers lie more than three times the IQR below the first quartile or above the third quartile, respectively.
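The quartile and fence computations behind a boxplot can be sketched as follows; the data, with one artificially appended outlier, is illustrative (note that KADD or Excel may use a slightly different quartile interpolation method than the “exclusive” method used by Python’s statistics module):

```python
from statistics import quantiles

# Illustrative data with one artificial outlier appended.
data = list(range(1, 12)) + [30]

# Quartiles: q1 cuts off the lowest 25%, q3 the highest 25%.
q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1   # interquartile range, covering the middle 50%

# Tukey's fences: values beyond 1.5 * IQR from the quartiles are
# moderate outliers; beyond 3 * IQR they are extreme outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)
```

Here only the artificial value 30 lies above the upper fence and would be flagged in the boxplot.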
According to table 5, none of the variables shows any outliers or extreme values. Besides, the values indicate an approximately normal distribution of the dependent as well as the independent variables. This shape helps significantly in the further statistical analysis. If there were any extreme values, they might affect the regression analysis and the subsequent predictability of the dependent variable. Fortunately, this is not the case, and we can continue the statistical analysis without interference.
C. Histogram
A histogram is a common graphical presentation of quantitative data. It is an enhanced version of a frequency distribution and is constructed by placing the variable of interest on the horizontal axis and the frequency on the vertical axis. The latter is represented by rectangles, comparable to a bar graph; a histogram, however, does not contain separations between the rectangles of adjacent classes. The classes, or bins, are specified by the researcher as non-overlapping intervals. It often takes a couple of iterations until a meaningful interval specification is achieved. The purpose of the bin definition, and hence of the histogram, is to provide information about the shape of a distribution. Ideally, the shape indicates a normal distribution. The normal, or Gaussian, distribution is a probability distribution of great importance in many fields. The standard normal distribution is the normal distribution with a mean of zero and a variance of one (Wikipedia, 2008a).
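The binning logic described above can be sketched as a short function; the bin edges and values below are illustrative and simpler than the five bins used for the actual histogram:

```python
def bin_counts(data, edges):
    """Count how many values fall into each half-open, non-overlapping
    interval [edges[i], edges[i+1]) - the frequencies a histogram plots."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Illustrative bin edges and observations (not the Ibell values).
edges = [8000, 12000, 16000, 20000, 24000]
data = [9500, 13200, 15800, 16400, 17100, 18900, 19500, 21000]
print(bin_counts(data, edges))   # [1, 2, 4, 1]
```

The bar heights of the histogram are exactly these per-bin counts.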
According to table 6, the distribution of the dependent variable “Quantity” in the histogram looks reasonable. This histogram shows the frequency of “Quantity” per bin range; in other words, the height of each bar equals the number of observations whose long-distance call count falls into a certain interval. In total, five bin ranges from 8,000 to 24,000 have been defined by the researcher. The histogram indicates that most of the data lie in the interval between 16,000 and 20,000. The distribution is quite symmetric and resembles the shape of a normal distribution. Compared with a perfect Gaussian distribution, however, it is slightly skewed to the right. This slight skewness is not severe since “data from applications in business and economics often lead to histograms that are skewed to the right.” (Anderson, Sweeney, Williams, 2006, p. 44)
In summary, the distribution resembles a normal distribution fairly well, and there are no indications of violations. An analogous illustration could be produced for the other variables in the data set as well. However, the researcher is mainly interested in the shape of the distribution of the dependent variable, and the previous chapters already established that there are no extreme values in the predictor variables. With respect to the whole data set, the prerequisites for a regression analysis seem to be met. Nevertheless, to gather further evidence and to get an even better feeling for the characteristics of this data set, scatter diagrams are a reasonable next step of the analysis.
D. Scatter Diagrams
A scatter diagram is a visualization of the relationship between two variables, and the trendline is the corresponding approximation. Tables 7-12 contain six different scatter diagrams, representing the relationships between each pair of the four variables. Considering all scatter patterns and trendlines, strong, mostly linear relationships are noticeable. However, the relationships are not perfect, since not all points lie on a straight line. The trendline is described by an equation, and R² measures the share of the variance explained by it (see details in chapter IV “Regression Analyses”).
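The trendline equation and the R² value that accompany each scatter diagram can be computed from first principles with ordinary least squares; the (x, y) pairs below are illustrative, not the Ibell columns:

```python
def linear_fit(x, y):
    """Ordinary least-squares trendline y = b0 + b1 * x and R-squared."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope of the trendline
    b0 = y_bar - b1 * x_bar     # intercept
    # R-squared: share of the variation in y explained by the trendline.
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, 1 - ss_resid / ss_total

# Illustrative, nearly linear data (not the Ibell observations).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r_squared = linear_fit(x, y)
print(b0, b1, r_squared)
```

An R² close to 1 means the points lie almost on the trendline; an imperfect relationship, as in the Ibell diagrams, yields an R² below 1.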
Particularly the scatter diagram in table 7, showing the dependent variable “Quantity” over the independent time variable “Quarter”, is important to the researcher. The relationship is positive, which indicates that later quarters are associated with a higher number of long-distance phone calls. Further details on this particular relationship will be presented in chapter IV A. “Simple Linear Regression Analysis”.
Scatter diagrams are not only used for graphical presentation, but also for seasonality analysis. With regard to the present project, none of the time-related (“Quarter”) scatter diagrams leads to a strong suspicion of seasonality in the underlying data. To gather supporting evidence, the researcher will run an additional seasonal index analysis in KADD.
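One common form of seasonal index, which KADD may compute differently in detail, divides the average quantity for each quarter position by the overall average; an index near 1.0 for every quarter indicates little seasonality. The quarterly values below are invented for illustration:

```python
def seasonal_index(quantities, period=4):
    """Average value per season position divided by the overall average.
    quantities must be ordered in time, e.g. quarter 1, 2, 3, 4, 1, ..."""
    overall_mean = sum(quantities) / len(quantities)
    index = []
    for season in range(period):
        season_values = quantities[season::period]
        season_mean = sum(season_values) / len(season_values)
        index.append(season_mean / overall_mean)
    return index

# Two years of illustrative quarterly data with a clear seasonal pattern.
quantities = [10, 20, 30, 40, 12, 22, 32, 42]
index = seasonal_index(quantities)
print(index)
```

For this artificial series the indexes rise from below 1 in the first quarter to well above 1 in the fourth, signaling strong seasonality; the indexes of a non-seasonal series would all stay close to 1.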