How to Bootstrap Anything

The Stata Way

Introduction: What is Bootstrapping?

One of the things econometricians and economists are quite interested in is obtaining point estimates. However, because we know point estimates contain (sampling) errors, we are also interested in estimating the precision of those estimates, at least given the information we have in hand (the sample). In other words, their standard errors.

In most introductory econometrics courses, we typically learn that one way to estimate the precision of those estimates would be to draw multiple samples, estimate the model of interest on each new sample, and summarize the estimated coefficients. The standard deviations of the estimated coefficients across samples are the coefficients' standard errors.

We know, however, that collecting multiple samples from the same population is usually infeasible in practice. So we need to rely on other approaches.

But What is Bootstrapping?

As a non-native speaker, the word bootstrapping meant nothing to me beyond a statistical technique for obtaining empirical standard errors. After a few years in grad school, however, I heard the expression a few times:

Pull yourself up by your own bootstraps

which is quite appropriate for describing what bootstrapping does. Since we do not have access to other samples, we pull up standard errors for our estimates by using, and reusing, the one sample we have in different ways.

The differences in how the sample information is reused are what distinguish the various bootstrapping methods.

Types of Bootstrapping

There are, in fact, many approaches that can be used to obtain bootstrap standard errors (BSE), depending on the assumptions we are willing to impose, and not all of them can be applied in every scenario. For the sake of simplicity, I will focus on the ones that can be applied to linear regressions. So, say you are interested in a linear regression model with the following functional form:

$y = X\beta + e$

To ground the principles of the different methodologies, we will use the dataset auto.dta.


Setup and Asymptotic SE

To set things up, we will estimate a very simple linear regression model using the auto.dta dataset.

sysuse auto, clear
reg price mpg foreign

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     14.07
       Model |   180261702         2  90130850.8   Prob > F        =    0.0000
    Residual |   454803695        71  6405685.84   R-squared       =    0.2838
-------------+----------------------------------   Adj R-squared   =    0.2637
       Total |   635065396        73  8699525.97   Root MSE        =    2530.9

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         mpg |  -294.1955   55.69172    -5.28   0.000    -405.2417   -183.1494
     foreign |   1767.292    700.158     2.52   0.014     371.2169    3163.368
       _cons |   11905.42   1158.634    10.28   0.000     9595.164    14215.67
------------------------------------------------------------------------------

Parametric Bootstrap

As the name suggests, the parametric bootstrap requires imposing some parametric assumptions on the source of the error in the model: $e$. The characteristics of that error are what we will learn from the original sample.

For example, we could assume $e$ follows a normal distribution, with variance equal to the variance of the observed residuals: $\sigma^2_e = Var(\hat{e})$.

What you would need to do, then, is draw/create different samples using the following rule:

$\tilde{y} = X\hat{\beta} + \tilde{e}, \text{ where } \tilde{e} \sim N(0, Var(\hat{e}))$

In this case, all $X$'s are fixed, but the new samples are given by the different $\tilde{y}$'s, which differ only because of the draws of the errors $\tilde{e}$.

Once you have multiple samples, and coefficient estimates for each, you can simply report the standard deviations of those estimates as the standard errors.

How to do it in Stata? Here I will cheat a bit and use Mata, because it will be faster. Note that you will need to copy the whole code into a do-file and run all of it, or type each line individually in the command window (once you activate Mata):

set seed 10
// First we start mata
mata:
// load all data
y = st_data(.,"price")
// You load X's and the constant
x = st_data(.,"mpg foreign"), J(rows(y),1,1)
// Estimate Betas:
b = invsym(x'*x)*x'*y
// Estimate errors:
e = y :- x*b
// Estimate STD of errors
std_e = sqrt(sum(e:^2)/73)
// Now we can do the bootstrap:
// We first create somewhere to store the different betas
bb = J(1000,3,.)
// and start a loop
for(i=1;i<=1000;i++){
    // each time we draw a different value for y..say ys
    ys = x*b + rnormal(74,1,0,std_e)
    // and estimate the new beta, storing it into bb
    bb[i,.] = (invsym(x'*x)*x'*ys)'
}
// Now just report SE
b, diagonal(sqrt(variance(bb)))
end

If everything goes well, it should give you the following:

                  1              2
    +-------------------------------+
  1 |  -294.1955331    55.59108401  |
  2 |   1767.292243    689.3703271  |
  3 |   11905.41528    1159.921784  |
    +-------------------------------+

Things to notice: we are explicitly imposing the assumptions of homoskedasticity and normality on the errors. This explains why these standard errors are almost identical to the simple asymptotic standard errors.
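For comparison, you can also compute the classical asymptotic standard errors by hand. This is a quick sketch using the Mata objects already defined above, with the conventional $N-K = 71$ degrees-of-freedom correction for this model:

// classical OLS variance: s^2 * (X'X)^{-1}, with s^2 = e'e/(N-K)
s2 = sum(e:^2)/(rows(y)-3)
sqrt(diagonal(s2 :* invsym(x'*x)))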


Residual Bootstrap

The residual bootstrap is very similar to the parametric bootstrap I described above. The main difference is that we no longer impose assumptions on the distribution of the errors, and instead use their empirical distribution.

What does this mean? Well, in the above example, we have 74 different values for the error $e$. Resampling means that you create a new $\tilde{y}$ by drawing 74 errors from this bag of errors, where each has the same probability of being chosen.

$\tilde{y} = X\hat{\beta} + \tilde{e}, \text{ where } \tilde{e} \sim [\hat{e}_1, \hat{e}_2, ..., \hat{e}_N]$

How would you implement it? Let's go to the Mata code:

set seed 10
// First we start mata
mata:
// This remains the same as before
y = st_data(.,"price")
x = st_data(.,"mpg foreign"), J(rows(y),1,1)
b = invsym(x'*x)*x'*y
e = y :- x*b
// Now we need to know how many observations we have
nobs = rows(y)
// Same as before
bb = J(1000,3,.)
for(i=1;i<=1000;i++){
    // Here is where we "draw" a different error every time.
    // runiformint(nobs,1,1,nobs) <- this draws nobs random integers between 1 and nobs,
    // which we use to pick (with replacement) the errors that create ys
    ys = x*b + e[runiformint(nobs,1,1,nobs)]
    bb[i,.] = (invsym(x'*x)*x'*ys)'
}
// Now just report SE
b, diagonal(sqrt(variance(bb)))
end

This will give us the following:

                  1              2
    +-------------------------------+
  1 |  -294.1955331    53.63708346  |
  2 |   1767.292243    703.1172879  |
  3 |   11905.41528    1118.818836  |
    +-------------------------------+

Once again, this method keeps the $X$'s fixed, and assumes the errors are fully homoskedastic.


Wild Bootstrap/Multiplicative Bootstrap

The wild bootstrap is another variant of the residual bootstrap. You start as always by estimating the original model and obtaining the residuals. For the resampling, however, rather than shuffling the errors, or making assumptions about their distribution, we reuse each error after disturbing it.

In other words:

$y^s = X\hat{\beta} + \hat{e} * v$

where $v$ is the source of the noise we need to add to the model. Technically, we can use any distribution for $v$, as long as $E(v)=0$ and $Var(v)=1$. The most common wild bootstrap implementations use the Mammen two-point distribution, but for simplicity I will use a normal distribution here (see the sketch of the Mammen draw right after the code below):

set seed 10
mata:
// This remains the same as before
y = st_data(.,"price")
x = st_data(.,"mpg foreign"), J(rows(y),1,1)
b = invsym(x'*x)*x'*y
e = y :- x*b
nobs = rows(y)
bb = J(1000,3,.)
for(i=1;i<=1000;i++){
    // Here is where we "draw" a different error, multiplying the original error by v ~ N(0,1)
    ys = x*b + e:*rnormal(nobs,1,0,1)
    bb[i,.] = (invsym(x'*x)*x'*ys)'
}
// Now just report SE
b, diagonal(sqrt(variance(bb)))
end
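And if you wanted the Mammen weights mentioned above instead, here is a minimal sketch of the draw (my own illustration; it would replace the rnormal() line inside the loop):

// Mammen two-point weights: v = -(sqrt(5)-1)/2 with probability (sqrt(5)+1)/(2*sqrt(5)),
// and v = (sqrt(5)+1)/2 otherwise. These satisfy E(v)=0 and Var(v)=1.
p = (sqrt(5)+1)/(2*sqrt(5))
u = runiform(nobs,1)
v = -(sqrt(5)-1)/2 :* (u :< p) + (sqrt(5)+1)/2 :* (u :>= p)
ys = x*b + e:*v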

Running the normal-weights version gives the following results:

                  1              2
    +-------------------------------+
  1 |  -294.1955331    59.62775803  |
  2 |   1767.292243     586.545592  |
  3 |   11905.41528    1357.383088  |
    +-------------------------------+

Although it may seem surprising, this approach accounts for heteroskedasticity, which is why the standard errors are quite similar to the ones you would get using reg, robust.
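If you want to verify that claim, you can obtain the asymptotic heteroskedasticity-robust standard errors directly and compare:

sysuse auto, clear
reg price mpg foreign, vce(robust)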

The other advantage of this method is that you do not really need an estimate of the error itself. Instead, you could obtain the influence functions of the estimated parameters, and disturb those to obtain standard errors. This makes the method feasible for a larger set of estimators, assuming you can derive the corresponding influence functions.
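As a rough illustration of that idea for OLS, here is a minimal Mata sketch. It uses the fact that the influence function of observation $i$ on $\hat{\beta}$ can be written as $N(X'X)^{-1}x_i'\hat{e}_i$; in the linear model, disturbing these is algebraically equivalent to disturbing the residuals, so it should closely reproduce the wild-bootstrap results above:

set seed 10
mata:
y = st_data(.,"price")
x = st_data(.,"mpg foreign"), J(rows(y),1,1)
b = invsym(x'*x)*x'*y
e = y :- x*b
nobs = rows(y)
// Influence functions: one row per observation, one column per coefficient
IF = nobs :* ((e :* x) * invsym(x'*x))
bb = J(1000,3,.)
for(i=1;i<=1000;i++){
    // disturb the influence functions with v ~ N(0,1) and average them
    v = rnormal(nobs,1,0,1)
    bb[i,.] = b' + colsum(IF :* v)/nobs
}
// Report point estimates and bootstrap SE
b, diagonal(sqrt(variance(bb)))
end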


Paired Bootstrap/Nonparametric Bootstrap

The paired bootstrap is perhaps the most commonly used method, even though it can also be the most computationally intensive approach.

The basic idea is the same one I started this post with. You use your original sample, and draw resamples with replacement of the same size as the original sample. Then, as in the previous approaches, you estimate the parameters of interest for each resample, and summarize the results.

What really differentiates this approach from the others is that whole observations, with all their characteristics, are used in the resampling (not only the residuals). Because of this, it behaves well under heteroskedasticity, and is relatively easy to implement even for complex estimators.

There is the added advantage that Stata has a prefix command dedicated to making the implementation of the paired bootstrap easy. However, before I go into that, let's implement it in Mata, as I have done before.

set seed 10
mata:
// This remains the same as before
y = st_data(.,"price")
x = st_data(.,"mpg foreign"), J(rows(y),1,1)
b = invsym(x'*x)*x'*y
e = y :- x*b
nobs = rows(y)
bb = J(1000,3,.)
for(i=1;i<=1000;i++){
    // What I do here is get a vector that will identify the resampling.
    r_smp = runiformint(nobs,1,1,nobs)
    // then use this resampling vector to reestimate the betas
    brs = invsym(x[r_smp,.]'*x[r_smp,.])*x[r_smp,.]'*y[r_smp,.]
    bb[i,.] = brs'
}
// Now just report SE
b, diagonal(sqrt(variance(bb)))
end

Result:

                  1              2
    +-------------------------------+
  1 |  -294.1955331    61.88438186  |
  2 |   1767.292243    575.3963704  |
  3 |   11905.41528    1374.021799  |
    +-------------------------------+

Easier Paired bootstrap

If you would rather not do this in Mata, or your estimator is a bit more complex than a simple OLS, a better approach for implementing the paired bootstrap in Stata is to simply use the bootstrap prefix.

For example:

bootstrap, seed(10) reps(1000): reg price mpg foreign
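A nice side benefit of the prefix: after it runs, estat bootstrap can report percentile and bias-corrected confidence intervals, which do not rely on the normal approximation:

estat bootstrap, all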

And of course, this approach can implement the bootstrap for any one-line estimation command:

bootstrap, seed(10) reps(1000): qreg price mpg foreign
bootstrap, seed(10) reps(1000): poisson price mpg foreign

It will, of course, be faster for some estimators than for others.

Often, however, you may want to bootstrap something more complex, for example a two/three/...K-step estimator. You can still use bootstrap, but it requires a bit more programming. So let's go with a simple two-step Heckman estimator. My recommendation: first implement the estimator for a single run:

webuse womenwk
** data prep
gen dwage = wage!=.
** estimation
probit dwage married children educ age
predict mill, score
reg wage educ age mill
** delete the variable that was created as an intermediate step
drop mill

Notice that mill was dropped at the end. This is important, because bootstrap will need to create it all over again in every replication.

Finally, we write our little bootstrap program

** I like to add eclass properties here
program two_heckman, eclass
    ** you implement your estimator:
    probit dwage married children educ age
    predict mill, score
    reg wage educ age mill
    ** Delete all variables that were created
    capture drop mill
    ** Finally, you will store all the coefficients into a matrix
    matrix b = e(b)
    ** and "post" them into e() so they can be read as an estimation output
    ereturn post b
end

and apply bootstrap:

. bootstrap, reps(250): two_heckman
(running two_heckman on estimation sample)

warning: two_heckman does not set e(sample), so no observations will be
         excluded from the resampling because of missing values or other
         reasons. To exclude observations, press Break, save the data, drop
         any observations that are to be excluded, and rerun bootstrap.

Bootstrap replications (250)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250

Bootstrap results                               Number of obs     =      2,000
                                                Replications      =        250

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   education |   .9825259   .0500337    19.64   0.000     .8844617     1.08059
         age |   .2118695   .0218391     9.70   0.000     .1690657    .2546734
        mill |   4.001615   .5787537     6.91   0.000     2.867279    5.135952
       _cons |   .7340391   1.154367     0.64   0.525    -1.528479    2.996557
------------------------------------------------------------------------------

A warning comes up because e(sample) was not set, but in this case that is not important. It can be "fixed" by adding the esample() option to the posting step:

** I like to add eclass properties here
program two_heckman2, eclass
    ** you implement your estimator:
    probit dwage married children educ age
    tempvar smp
    ** define sample, based on the first step.
    gen byte `smp' = e(sample)
    predict mill, score
    reg wage educ age mill
    ** Delete all variables that were created
    capture drop mill
    ** Finally, you will store all the coefficients into a matrix
    matrix b = e(b)
    ** and "post" them into e() so they can be read as an estimation output
    ereturn post b, esample(`smp')
end

bootstrap, reps(250): two_heckman2

Conclusions

There you have it: a very quick overview of how the bootstrap works, and how to program it in Stata.

Now, not all methods work in every situation. To fully account for clustering and weights, resampling methods require you to know, understand, and incorporate the original sampling structure (how the data was collected) when estimating standard errors.

Accounting for clustering is usually straightforward, but strata and weights may require more attention. I'll try to cover some of those modifications in a future post.
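Just to give a flavor of the clustered case: the bootstrap prefix can resample entire clusters rather than individual observations through its cluster() option. A sketch, where id, y, and x are hypothetical placeholders for your cluster identifier and variables:

bootstrap, reps(1000) cluster(id): reg y x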

Hope you find this useful. And if you have comments or questions, send me an email or leave a comment.