# Least Squares Regressions with the Bootstrap

A Survey of their Performance

Diploma Thesis 2009 51 Pages

## Excerpt

## Contents

1 Introduction

2 The model

2.1 The basic form

2.2 The disturbance term

3 Regression techniques

3.1 The method of least squares

3.1.1 Ordinary Least Squares

3.1.2 Generalized Least Squares

3.2 Alternative regression methods

4 Classical measures of performance

4.1 Bias

4.2 Variances

4.2.1 The variance of OLS

4.2.2 The variance of GLS

4.2.3 A remark on the variances

4.3 Confidence intervals

4.3.1 A remark on the critical values

4.3.2 A confidence interval for OLS

4.3.3 A confidence interval for GLS

4.4 Rate of convergence

5 The bootstrap

5.1 How does the bootstrap work?

5.2 When does the bootstrap work?

5.3 The non-parametric bootstrap

5.4 The parametric bootstrap

5.5 Why does the bootstrap work?

5.6 How many bootstrap repetitions?

5.7 On the size of each repetition

6 Regressions with the bootstrap

6.1 Case resampling

6.2 Residual resampling

6.3 Wild bootstrap

6.4 When to use which method?

7 Inference with the bootstrap

7.1 Variances with the bootstrap

7.2 Confidence intervals with the bootstrap

7.2.1 The percentile interval

7.2.2 The bootstrap-t interval

7.2.3 Other bootstrap intervals

7.3 Convergence with the bootstrap

8 Classical or bootstrap inference?

8.1 Which variance estimate?

8.2 Which confidence interval?

8.3 When the bootstrap fails

9 A practical test for the bootstrap

9.1 The datasets

9.1.1 The homoscedastic data

9.1.2 The heteroscedastic data

9.2 The simulations

9.2.1 Results simulation one

9.2.2 Results simulation two

9.3 Resumé of simulations

10 Concluding remarks

A Tables simulation one

A.1 Table of coefficient β1

A.2 Table of coefficient β2

B Tables simulation two

B.1 Table of coefficient β1

B.2 Table of coefficient β2

C Friedman test values

C.1 Test values simulation one

C.2 Test values simulation two

D Some further formulae

E Bibliography

## 1 Introduction

Imagine the kids are in the living room, watching Family Feud on TV. The voice of the presenter cuts through the tense silence: "Name a working method of a statistician!" One second, two, thr... The titleholder hits the buzzer and shouts: "Regressions, of course!" So how big a score did he get for this answer? That is unknown, since this was just an imaginary scene in a game show. However, the scene gives a good cue to the content of this paper, because regression is indeed one of the most prominent tasks of statisticians.

They have to find a relationship between some explanatory variables and the response variable with the help of one of the various regression techniques. Unfortunately, no perfect technique exists, since none of them outperforms the rest in all possible settings. Which method is used therefore depends on the framework, which is described by a set of model assumptions.

But how can the performance of the different regression techniques be measured at all? And after choosing a method, what determines which parameter estimate is the best? Here, too, no definitive answers are available.

In the past one often relied on complicated formulae and on the asymptotic behavior of an estimator to measure its performance with the (only finitely available) sample of observations. This was done for many years, often satisfactorily. Quite often, however, it was not satisfactory, because for some estimators it was difficult to draw conclusions about their asymptotic distributions, or an inference formula simply could not be obtained. As a result, inferior regression methods regularly had to be chosen, although there were indications that another technique would probably be the more effective choice. The only reason was that the performance of the presumably inferior estimator was known, whereas the performance of the other estimation technique could not be judged conclusively.

That was not a satisfying state of affairs, and it lasted until the late seventies, when Bradley Efron devised a technique named the bootstrap and a solution to this dilemma was found.^{1} Like his predecessors, he saw that it is sometimes not possible to use a sample and an estimator of a parameter to draw a conclusion about the true parameter value and its relationship to the true population. In contrast to his forerunners, however, Efron also asked himself whether this relationship can be estimated, which it can. To do so, one treats the original sample as a new population and the original estimate as its parameter value. From this ‘new’ population one can draw new samples, compute new parameter estimates and even draw inference, because all information about the underlying population is known. With the help of the law of large numbers, these results can be used as an approximation of the behavior of the original population, which in the past was often unknown!
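The resampling idea described above can be sketched in a few lines. This is only an illustrative sketch for the sample mean, not the procedure used in the later chapters; the data values, the function name `bootstrap_estimates` and the number of repetitions are assumptions made for the example.

```python
import random
import statistics

def bootstrap_estimates(sample, statistic, n_boot=1000, seed=0):
    """Treat `sample` as the population: resample it with replacement
    and recompute `statistic` on every resample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]

# Hypothetical observations, chosen only for illustration.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]

boot_means = bootstrap_estimates(sample, statistics.mean)

# The spread of the recomputed estimates approximates the standard
# error of the original estimate.
boot_se = statistics.stdev(boot_means)
print(statistics.mean(sample), boot_se)
```

Resampling with replacement and keeping the resample size equal to the original sample size are the usual default choices; the later chapters discuss when other choices are appropriate.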

This technique, called the bootstrap, is applicable to many statistical problems and is the main topic of this paper. Since the bootstrap provides material for a whole series of books, it is essential to pick one particular aspect of it and investigate that in depth; otherwise the analysis would inevitably become too general. This aspect is regression. Hence, this paper will introduce the bootstrap and compare the performance of the new inference methods it provides with some classical methods of judging a regression, which were used in the years before the bootstrap.^{2} ^{3}

The remainder of this paper is organized as follows. Chapter two describes the basic model in which all of the following investigations are carried out. Chapter three describes the different regression techniques used to estimate the model. Chapter four shows the behavior of these regression techniques in large samples, i.e. it presents some classical methods of statistical inference. Chapter five gives an introduction to the bootstrap, followed in chapter six by a description of the bootstrap in regression problems. Chapter seven shows how inference is done with the help of the bootstrap. Chapter eight compares the performance of classical and bootstrap inference in regressions. Before the concluding remarks of chapter ten, chapter nine presents a practical application which tries to confirm some observations of the preceding chapters.

## 2 The model

As mentioned above, one task in statistics is to compute regressions. For this purpose one typically builds a model in order to depict a simplified version of the relationship under investigation. But is the relationship linear or non-linear? How many parameters should the regression have? What is the form of the disturbance term in the model? Which regression technique should be used?

These are some of the questions the statistician has to answer before starting to calculate. They are the topic of this chapter, which describes the basic model along with a few assumptions which are valid throughout this paper. Nearly all of them are standard assumptions used in many statistical papers, which is why they put the subsequent evaluation of the bootstrap in the later chapters into a broader context.

### 2.1 The basic form

The basic model used in this paper is the standard linear regression model in matrix form:

$$ y = X\beta + \varepsilon \tag{1} $$

Here the regressand $y$ depends linearly on the covariate matrix $X$ (the regressor) and a disturbance term $\varepsilon$. One also assumes that $X$ has full rank, so that no covariate is perfectly collinear with the others, in order to avoid complications while estimating the parameters of the model. The regression tries to find the best possible estimate for the parameter vector $\beta$.^{4} To do this without overly complex calculations, various assumptions are required concerning the form of the disturbance term.^{5}

### 2.2 The disturbance term

Regarding the error term there are a few assumptions. The first concerns the conditional expected value, which is assumed to be zero:

$$ E[\varepsilon \mid X] = 0 \tag{2} $$

That way the covariates do not convey any information concerning the form of the error term. This holds throughout the paper, which is why all conclusions are drawn conditional on $X$. It is possible to show that they also hold in the unconditional case, but that will not be done here.^{6}

The next assumption is about the variance-covariance matrix $\Omega$ of $\varepsilon$. The covariances (the off-diagonal elements of $\Omega$) of any two disturbances are zero:

$$ \mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0 \quad \text{for } i \neq j \tag{3} $$

That way all problems which arise because of autocorrelation can be discarded since they do not matter here.

Then an assumption is usually made on the variance of the disturbance, the elements on Ω’s diagonal. That will not be done here. The reason is that this paper uses different forms of variances. It assumes in one part that the error is homoscedastic, so that the variances are all constant and the same:

$$ \mathrm{Var}(\varepsilon_i \mid X) = \sigma^2 \quad \text{for all } i \tag{4} $$

In the other part the estimators will also have to deal with heteroscedasticity, so that:

$$ \mathrm{Var}(\varepsilon_i \mid X) = \sigma_i^2 \tag{5} $$

These two different characteristics of the disturbance make it necessary to introduce two different regression techniques in the next chapter. They also influence the choice of bootstrap methods in chapter 6.
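To make the two error structures concrete, the following sketch draws data from a model with an intercept and one covariate, once with constant error variance and once with an error standard deviation that grows with the covariate. The function name `simulate`, the coefficient values and the particular variance function are assumptions chosen for illustration; they are not the datasets of chapter nine.

```python
import random
import statistics

def simulate(n, beta1, beta2, heteroscedastic, seed=0):
    """Draw (x, y) from y_i = beta1 + beta2 * x_i + eps_i with E[eps_i | x_i] = 0."""
    rng = random.Random(seed)
    x = [rng.uniform(1.0, 10.0) for _ in range(n)]
    y = []
    for xi in x:
        # Homoscedastic case: constant standard deviation of 1.
        # Heteroscedastic case: standard deviation proportional to x_i
        # (an arbitrary illustrative choice).
        sd = 0.5 * xi if heteroscedastic else 1.0
        y.append(beta1 + beta2 * xi + rng.gauss(0.0, sd))
    return x, y

x_hom, y_hom = simulate(200, 1.0, 2.0, heteroscedastic=False)
x_het, y_het = simulate(200, 1.0, 2.0, heteroscedastic=True)

# Disturbances relative to the true regression line.
eps_hom = [yi - 1.0 - 2.0 * xi for xi, yi in zip(x_hom, y_hom)]
eps_het = [yi - 1.0 - 2.0 * xi for xi, yi in zip(x_het, y_het)]
print(statistics.stdev(eps_hom), statistics.stdev(eps_het))
```

Because the disturbances are drawn independently, the sketch also respects the assumption of zero covariances between any two disturbances.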

## 3 Regression techniques

The regression technique this paper concentrates on is least squares regression. Indeed, there are many other techniques, and some of them will be discussed at the end of this chapter. However, the bulk of the investigations will be done with the method of least squares. The reason is that a satisfactory analysis is difficult with some other methods, since their statistical properties have not yet been obtained by a classical approach, as for example the standard error of a least median of squares regression.^{7} Thus, in order to carry out one intensive investigation instead of an extensive but shallow one, this focus had to be chosen.

### 3.1 The method of least squares

The two regression methods discussed below are applicable to different settings and, depending on the setting, perform quite differently. In spite of this they have one thing in common which distinguishes them from other regression methods: they minimize the sum of squared residuals.^{8} These two methods are Ordinary Least Squares (OLS) and Generalized Least Squares (GLS), and their fundamental ideas are presented in sections 3.1.1 and 3.1.2 respectively.

#### 3.1.1 Ordinary Least Squares

Underlying the OLS minimization of the sum of squared residuals is the basic regression model, equation (1). Along with that model, the other important fact is that one assumes the error term to be homoscedastic:

[Equation not included in this excerpt]

The popularity of this solution depends crucially on the assumption of constant error variances, $\Omega = \sigma^2 I$.^{9} In all other cases OLS does not yield a good estimator, since it then fails to incorporate the form of the variance of the error term truthfully. However, among all linear unbiased estimators available in a linear model with homoscedasticity, the OLS estimate is the one with the smallest variance, a result known as the Gauss-Markov theorem.^{10}
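As an illustration of the least squares idea, the sketch below fits a model with an intercept and one covariate. In matrix form the textbook OLS solution is $\hat\beta_{OLS} = (X^T X)^{-1} X^T y$; for this two-coefficient case it reduces to the closed form used here. The function names and the data are assumptions made for the example.

```python
def ols_fit(x, y):
    """Return (b1, b2) minimizing sum((y_i - b1 - b2 * x_i) ** 2).

    Requires that x is not constant (the full rank assumption)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b2 = sxy / sxx           # slope
    b1 = y_bar - b2 * x_bar  # intercept
    return b1, b2

def sum_squared_residuals(x, y, b1, b2):
    """Objective function that OLS minimizes."""
    return sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b1, b2 = ols_fit(x, y)
print(b1, b2, sum_squared_residuals(x, y, b1, b2))
```

Any other pair of coefficients gives a larger sum of squared residuals on the same data, which is what the helper `sum_squared_residuals` can be used to check.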

#### 3.1.2 Generalized Least Squares

The other least squares technique discussed here is GLS. As the name indicates, it is a generalized version of OLS. In contrast to OLS, however, it does not depend on a homoscedastic error term but allows the variance of the disturbance to vary.^{11} A few matrix multiplications are necessary to transform model (1) and obtain an equation to which OLS is again applicable. Consequently the new minimization problem, which differs from the OLS equation (6), is the following:

[Equation not included in this excerpt]

Hence one sees that the GLS estimator differs from its OLS counterpart through the appearance of $\Omega^{-1}$, the inverse of the disturbance’s covariance matrix. In practice this matrix is usually unknown, which is why it is typically estimated from a prior OLS run. Thus in practice $\hat\beta_{GLS}$ is not a function of the known quantities $X$, $y$ and $\Omega$ but of $X$, $y$ and the estimated covariance matrix $\hat\Omega$. Of course this procedure entails more calculations than a simple OLS. However, it is worth the cost, since in this way the heteroscedasticity is dealt with, which OLS is unable to do. Using OLS in spite of heteroscedasticity would give inefficient estimates and misleading variance estimates.^{12} Among the estimators in the generalized model, $\hat\beta_{GLS}$ is the unbiased estimator with the smallest variance, a result known as Aitken’s theorem.^{13}
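With a diagonal $\Omega$, as assumed in this paper, GLS amounts to weighting each observation by the inverse of its error variance. The sketch below shows this for a model with an intercept and one covariate. It assumes the variances are known, which is a simplification: as noted above, in practice they would be estimated first. The function name `gls_fit` and the data are assumptions made for the example.

```python
def gls_fit(x, y, variances):
    """Fit y = b1 + b2 * x for Omega = diag(variances).

    Minimizes sum((y_i - b1 - b2 * x_i) ** 2 / variances[i])."""
    w = [1.0 / v for v in variances]  # diagonal of Omega^{-1}
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2

# Hypothetical data: the third observation is assumed to be very noisy.
x = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 4.0]

b1_equal, b2_equal = gls_fit(x, y, [1.0, 1.0, 1.0])   # same result as OLS
b1_gls, b2_gls = gls_fit(x, y, [1.0, 1.0, 100.0])     # noisy point downweighted
print(b2_equal, b2_gls)
```

With equal variances the weights cancel and the fit coincides with OLS; with a large variance on the third observation the fitted line is pulled towards the two precise observations.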

### 3.2 Alternative regression methods

These were two regression techniques for finding the parameters of a linear model, and they are the most popular ones. But there are other regression methods as well, and not every one of them relies on minimizing the squared residuals.

One could also, as Breidt et al. (2001) did, minimize the sum of absolute deviations.^{14} This has the advantage that outlying observations have less influence, which is a major problem with least squares, since a single observation is enough to influence the whole regression.^{15} But although Least Absolute Deviations (LAD) performs much better against outlying regressands, it still has, like least squares, the low breakdown point of 1/n, because outlying regressors manage to influence the regression very easily.^{16}

A second alternative would be to use Least Median of Squares (LMS), where those coefficients are chosen which yield the smallest median of the squared residuals.^{17} Rousseeuw and Leroy (1987) show that this technique is very robust against outlying observations, with a breakdown point of roughly 50%.^{18} The reason why it is not used more widely is not its low efficiency, although it is indeed not very efficient.^{19} The reason is rather that both techniques (LMS and LAD) lack information about their performance, since for a long time it was very difficult to base inference on them.
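The different sensitivity to outlying regressands can be illustrated in the simplest possible case, a model with an intercept only. There the least squares estimate is the sample mean and the LAD estimate is the sample median. The sketch below is only this special case with made-up numbers; it says nothing about outlying regressors, which are the reason for the low breakdown point of LAD mentioned above.

```python
import statistics

# Hypothetical responses; in the second list one value is grossly wrong.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
contaminated = [1.0, 2.0, 3.0, 4.0, 500.0]

# Intercept-only model: least squares -> mean, least absolute deviations -> median.
ls_clean, ls_contaminated = statistics.mean(clean), statistics.mean(contaminated)
lad_clean, lad_contaminated = statistics.median(clean), statistics.median(contaminated)

print(ls_clean, ls_contaminated)    # the single outlier moves the LS estimate
print(lad_clean, lad_contaminated)  # the LAD estimate stays where it was
```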

These two regression methods will not be investigated any further. Their only purpose is to show two alternatives which could be used whenever least squares regressions fail to give satisfying results, because it may be the case that one of them is able to achieve the desired performance.

## 4 Classical measures of performance

After introducing the OLS and GLS estimators, the next step is to investigate their performance. There are many possible ways to evaluate this performance, so it is necessary to consider more than one measure. Otherwise the conclusions would be conditioned by the performance under one criterion, while the estimator may perform quite differently under another. Thus, without a paramount criterion of performance, it is reasonable and necessary to use more than one measure.

### 4.1 Bias

One such criterion could be the bias. However, under the present circumstances that would not be a good choice. Since the disturbance is assumed to have an expected value of zero it yields regression coefficients which are unbiased under both least squares methods. Thus, evaluating a regression on the basis of its bias would be a futile effort since the two perform exactly the same, so that OLS and GLS are not distinguishable:^{20}

[Equation not included in this excerpt]

### 4.2 Variances

Contrary to the bias, the variances of the coefficients yield information. This is due to the form of the variance of the error term which not only leads to differing variances of the coefficients but also to a case where one special method is superior to the other one.

#### 4.2.1 The variance of OLS

The variance of the OLS estimate is obtained by:

$$ \mathrm{Var}(\hat\beta_{OLS} \mid X) = \sigma^2 (X^T X)^{-1} \tag{9} $$

As the form of the variance shows, this term depends substantially on the assumption of homoscedasticity, and only in that case does OLS yield an estimator with a satisfying variance.^{21} Otherwise the true variance becomes $(X^T X)^{-1} X^T \Omega X (X^T X)^{-1}$ and thus is no longer truthfully estimated by equation (9).^{22}

#### 4.2.2 The variance of GLS

In spite of the possible misspecification problem of OLS, the variance of the GLS estimator is obtained similarly:

$$ \mathrm{Var}(\hat\beta_{GLS} \mid X) = (X^T \Omega^{-1} X)^{-1} \tag{10} $$

To take account of the heteroscedasticity, the GLS version differs from the variance of the OLS estimate through the matrix $\Omega^{-1}$, which now also appears in the corresponding variance. $\Omega^{-1}$ originates from the model transformations mentioned in chapter 3.1.2 and is the inverse of the covariance matrix of the disturbance term.

#### 4.2.3 A remark on the variances

After investigating the variances of the two estimators, it can be seen that both versions have their special fields of application. If the disturbance’s variance is constant, then OLS yields the smallest variation, namely a variance of only $\sigma^2 (X^T X)^{-1}$. If on the other hand there are indications that the variance of the error varies, then one should use GLS, because now this version leads to the smallest variance, $(X^T \Omega^{-1} X)^{-1}$. However, the dominance of OLS in the homoscedastic case is only apparent, since with homoscedasticity $\Omega = \sigma^2 I$, so that both estimators are equivalent.^{23} But of course GLS is always somewhat more complex to calculate.
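The equivalence in the homoscedastic case can be verified in two lines. The derivation below uses the standard textbook form of the GLS estimator, $\hat\beta_{GLS} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$, which is an assumption here insofar as the corresponding equation is not part of this excerpt. Substituting $\Omega = \sigma^2 I$ gives:

```latex
\begin{align*}
\hat\beta_{GLS}
  &= \left( X^T (\sigma^2 I)^{-1} X \right)^{-1} X^T (\sigma^2 I)^{-1} y
   = \sigma^2 (X^T X)^{-1} \, \sigma^{-2} X^T y
   = (X^T X)^{-1} X^T y
   = \hat\beta_{OLS}, \\
\mathrm{Var}(\hat\beta_{GLS} \mid X)
  &= \left( X^T (\sigma^2 I)^{-1} X \right)^{-1}
   = \sigma^2 (X^T X)^{-1}
   = \mathrm{Var}(\hat\beta_{OLS} \mid X).
\end{align*}
```

The factors $\sigma^2$ and $\sigma^{-2}$ cancel in the estimator, so both the point estimates and their variances coincide.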

### 4.3 Confidence intervals

Whether the calculated estimators are satisfactory can be analyzed not only by looking at the bias or the variance but also by carrying out a hypothesis test. Equivalently, one can investigate whether the hypothesized parameter values are included in a confidence interval.^{24} Apart from the estimate itself, the only two other ingredients for constructing a confidence interval at the level $1 - \alpha$ are the standard error of the parameter estimate, $se(\hat\beta_i)$, and a critical value of the underlying distribution, $c_{1-\alpha/2}$.

The discussion covers only confidence intervals, not confidence regions. This corresponds to hypothesis tests which test one parameter at a time rather than all parameters simultaneously. The accumulation of type I errors and the need to apply a correction such as Bonferroni’s are thus not considered.^{25}

#### 4.3.1 A remark on the critical values

A major challenge in constructing an interval is the choice of a critical value, because this requires knowledge of the distribution underlying the parameter estimates. This knowledge is only rarely available, yet only with it is it possible to create confidence intervals with exact coverage probabilities.

In the absence of this knowledge one uses auxiliary distributions, which consequently fail to give the targeted coverage exactly. In large samples this is not very serious because of the central limit theorem, which implies that one can use the quantiles of the standard normal distribution, $z_{1-\alpha/2}$, which lead to approximately exact intervals.^{26} In small samples this approach is no longer reasonable. Often this is solved by assuming the disturbance to be Gaussian distributed, which leads to an estimate $\hat\beta$ which is also normal. In that case the quantiles of Student’s t-distribution with $(n - K)$ degrees of freedom, $t_{n-K,1-\alpha/2}$, are used.^{27}

However, choosing the Gaussian for the disturbance is often arbitrary, and of course the exactness of the interval depends crucially on the correctness of this assumption. Ignorance of the true distribution thus often leads to intervals which miss the targeted coverage. To sidestep this old statistical problem of not knowing a reasonable critical value, the intervals below are stated with a general critical value $c_{1-\alpha/2}$. Later chapters will show that the bootstrap now offers a solution to this problem.
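The ingredients named in this section (an estimate, its standard error and a critical value) combine into a symmetric interval of the generic form estimate ± critical value × standard error. The sketch below assumes this generic form and defaults to the standard normal quantile for large samples; the function name `confidence_interval` and the numbers are made up for illustration. A different critical value, for instance a t quantile, can be passed in explicitly.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, alpha=0.05, critical_value=None):
    """Symmetric interval estimate +/- c * std_error at level 1 - alpha.

    If no critical value c is given, the standard normal quantile
    z_{1 - alpha/2} is used (large-sample approximation)."""
    if critical_value is None:
        critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width

# Hypothetical coefficient estimate and standard error.
lower, upper = confidence_interval(2.0, 0.5)
print(lower, upper)
```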

#### 4.3.2 A confidence interval for OLS

[Equation not included in this excerpt]

#### 4.3.3 A confidence interval for GLS

The confidence interval for the generalized estimator is obtained similarly, except that the standard error of [formula not included in this excerpt]. So the corresponding confidence interval for GLS at the level $1 - \alpha$ has the form:

[Equation not included in this excerpt]

### 4.4 Rate of convergence

Another way to measure the accuracy of a statistic is to use an Edgeworth expansion to get a hold on its rate of convergence. The following expressions are variations of those of Navidi (1989). To this end one describes the distribution of the quantity [formula not included in this excerpt] via an Edgeworth expansion.^{29} If $G(x)$ denotes the distribution function of $T$, then the expansion takes the following form:^{30} ^{31}

[Equation not included in this excerpt]

It can be seen that it converges to the standard normal as the sample size increases just as predicted by the central limit theorem.^{32}

Now the values of $T$ are typically unknown, which is why $T$ has to be approximated by $\hat T$. For that reason the difference between the real distribution $G(x)$ and the approximation [formula not included in this excerpt] is a term which converges to a certain bound at the rate $n^{-1/2}$:

[Equation not included in this excerpt]

Thus the approximation of an unknown distribution by the standard normal entails an error which depends on the sample size: the smaller the sample, the larger the error.
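For orientation, a generic one-term Edgeworth expansion of the kind described here can be written as follows. This is the common textbook form and only an assumed stand-in, since the exact expressions of this section are not part of the excerpt; $\Phi$ denotes the standard normal distribution function, $\varphi$ its density and $q$ a polynomial, as in the footnotes.

```latex
\begin{align*}
G(x) &= \Phi(x) + n^{-1/2} \, q(x) \, \varphi(x) + O(n^{-1}), \\
G(x) - \Phi(x) &= O(n^{-1/2}).
\end{align*}
```

The first line shows why $G(x)$ approaches the standard normal as $n$ grows, and the second line states the resulting error of the normal approximation.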

**[...]**

^{1} Cp. Efron (1979a, pp.1-26)

^{2} Cp. Efron (1979b, pp.465-468)

^{3} In order to avoid confusion with bootstrap inference tools, all inference techniques different from the bootstraps are denoted as classical in this paper

^{4} Cp. Greene (2008, p.11)

^{5} Sometimes also named error term

^{6} Cp. Greene (2008, pp.49-50)

^{7} Cp. Efron/Tibshirani (1993, pp.119-121)

^{8} $\hat\varepsilon = y - X\hat\beta$, i.e. residuals = regressand − fitted model

^{9} This term is the same as equation (4) but in matrix notation

^{10} Cp. Tanizaki (2004, pp.62-63)

^{11} All other elements in the covariance matrix Ω remain zero (Ω having different values on the diagonal only), since the assumption of non-autocorrelation is not loosened

^{12} Cp. Davidson/MacKinnon (2004, pp.257-262)

^{13} Cp. Greene (2008, p.155)

^{14} $\min_\beta \sum_{i=1}^{n} |y_i - x_i \beta|$

^{15} Cp. Breidt et al. (2001, pp.919-946)

^{16} Cp. Rousseeuw/Leroy (1987, pp.10-12)

^{17} $\min_\beta \operatorname{median}_i \, (y_i - x_i \beta)^2$

^{18} Cp. Rousseeuw/Leroy (1987, p.183)

^{19} Cp. Rousseeuw (1984, pp.871-880)

^{20} Cp. Greene (2008, pp.46-47)

^{21} In practice $\sigma^2$ is rarely known, which is why it has to be estimated by $\hat\sigma^2$. This problem will not matter here and it will be assumed that $\sigma^2$ is known

^{22} Cp. Davidson (2000, p.25)

^{23} Cp. Davidson/MacKinnon (2004, p.260)

^{24} Cp. van Giersbergen/Kiviet (2002, p.133)

^{25} Cp. Jobson (1991, p.410)

^{26} Cp. van der Vaart (1998, pp.326-327)

^{27} Cp. Gourieroux/Monfort (1995, p.217)

^{28} Cp. Davidson/MacKinnon (2004, pp.178-185)

^{29} $c^T$ is a 1 × p vector and $\sigma$ the standard deviation

^{30} $\varphi(x)$ is the normal density and $q(x)$ a certain polynomial. The exact expression of $q(x)$ is in appendix D

^{31} Cp. Hall (1992, pp.83-84) and Navidi (1989, pp.1472-1474)

^{32} Cp. Yatchew (2003, p.155)

## Details

- Pages: 51
- Year: 2009
- ISBN (eBook): 9783640422418
- ISBN (Book): 9783640421831
- File size: 704 KB
- Language: English
- Catalog Number: v135688
- Institution / College: University of Bonn – Statistische Abteilung der Rechts- und Staatswissenschaftlichen Fakultät
- Grade: 1,6
- Tags: Least Squares Regressions Bootstrap Survey Performance