Analytical questions

Question 1 Show that the univariate ols model \(y_i = \beta x_i + \epsilon_i\) is identified when \(E(x\epsilon)=0\) and \(var(x)>0\).

Question 2 Derive the bias of the sample variance.

Computer Question

require(kableExtra)
require(ggplot2)
require(texreg)
require(readstata13) #to install, run install.packages("readstata13")
require(sandwich)
options(knitr.table.format = "html") 

Here is some relevant material:

To install these packages

install.packages("lmtest")
install.packages("sandwich")

We are going to reproduce an exercise similar to the example for the computation of standard error. Start by downloading the CPS data from here. We first load the data into R.

# replace this with the path to your download folder
data = read.dta13("~/Downloads/files to share/CPS_2012_micro.dta") 
data = data.table(data)
data$age = as.numeric(data$age)

Next generate a fictuous policy that you randomly assigned at the state times gender level. Run the regression and report standard errors given by R for one draw of the poilcy.

set.seed(60356548) # I fix the seed to make sure the draws are reproducible
data <- data[,fp := runif(1)>0.5, statefip]
fit1 = lm(lnwage ~fp,data)
htmlreg(fit1,single.row=TRUE)
Statistical models
Model 1
(Intercept) 2.68 (0.00)***
fpTRUE -0.02 (0.00)**
R2 0.00
Adj. R2 0.00
Num. obs. 65685
RMSE 0.63
p < 0.001, p < 0.01, p < 0.05

Note We do not control for state specific fixed effect as these would would be perfectly colinear with the policy.

Now this is surprising. We generated fp randomly across states and so we should have that when the number of states becomes very large \(E(\epsilon_i fp_i)=0\). To gain understanding on what is happening we will generate our own data in a way where we control exactly what is happening.

IID errors

Let’s start by reassuring ourselves. Let’s use an IID data generating process (DGP), run the regression and check the significance.

  1. compute the variance of lnwage in the sample. This is an estimate of our homoskedastic error.
  2. simulate a fictuous outcome y2 by adding to fp a normal error with the estimated variance, and truly independent across individuals. Use y2:=rnorm(.N)*var_est inside your data.table data.
  3. regress this outcome y2 on fp, our fictuous policy and collect the coefficient, also save if the coefficient is significant at 5%.
  4. run steps (2,3) 500 times.

Question 3 Follow the previous steps and report the rejection rate of the test on fp. You should find something close to 5% and you should feel better!

Heteroskedastic errors

Now we want to compute heteroskedastic robust standard errors which requires us to use some co-variates. We then want to repeat the previous procedure, but we are going to use a different test for the significance. We then want to construct our variance co-variance matrix using the following formula:

\[ V =(X'X)^{-1} X' \Omega X' (X'X)^{-1} \] where \(\Omega = diag \{ \epsilon_i^2 \}\). Using vcovHC with type type="const" and type="HC0" will do that for you!

We want to check this by simulating from a model with heteroskedesatic errors. To do so we are going to use linear model for the variance.

  1. use the following regression lnwage ~ yrseduc + age + I(age^2) and regress the square of the residual on the same co-variates formula to get an estimate of the heteroskedastic variance.
  2. predict the value of the square residual for each individual in the data and store this as new variable s.
  3. predict the value of the level and store it in pred.
  4. simulate data by drawing a normal, multiplying it by individual specific variancs s and adding the pred.
  5. replicate (4) this 500 times, evaluate the significance of fp using vcovHC with type type="const" and type="HC0".

Question 4 Follow the steps and report the rejection rate for each of the variance evaluation.

State clustered errors

We are again here going to try to simulate corrolated error within state. For this we pick a correlation parameter \(\rho\). Then, to simulate we are going to draw the first individual in an iid way, then using an auto-regressive structure to compute the error of the following people. Given \(\rho\) it can be done in the following way:

fit0  = lm(lnwage ~ yrseduc + age + I(age^2),data)
data <- data[,yhat := predict(fit0)]
rho = 0.8
data <- data[, res_hat := {
  r = rep(0,.N)
  r[1] = rnorm(1)
  for (i in 2:.N) {
    r[i] = rho*r[i-1] + rnorm(1)
  }
  r
},statefip]
data <- data[,y2:= yhat + res_hat]
data <- data[,fp := runif(1)>0.5, statefip]
fitn = lm(y2 ~ fp+yrseduc + age + I(age^2),data)
#summary(fitn)

htmlreg(fitn,single.row=TRUE,omit.coef="state")
Statistical models
Model 1
(Intercept) -0.50 (0.09)***
fpTRUE -0.03 (0.01)**
yrseduc 0.10 (0.00)***
age 0.07 (0.00)***
I(age^2) -0.00 (0.00)***
R2 0.04
Adj. R2 0.04
Num. obs. 65685
RMSE 1.66
p < 0.001, p < 0.01, p < 0.05

Question 5 Explain the expression that starts with data[, res_hat := {...

Question 6 For \(\rho=0.7,0.8,0.9\) run 500 replications and report the proportion at each value of replication for which the coefficient on our ficutous policy was significant at 5%.

State level bootstrap

We have not covered this in class yet, but one could instead try to resample the data.

Use the following procedure:

  1. Draw 51 states from the 51 states (at the state level) with replacement
  2. Create a dataset from the actual data, appending the observations for each of the state
    • when a state appears multiple times, attach the data of that state, but treat these states as different. In other words the names of the states in this synthetic data set should just be 1,2,3,4…51.
  3. compute the regression on this synthetic data set
  4. store the resulting regression coeffecient for each repetition, repeat 500 times.

Note do not redraw fp!

Question 7 Report the 0.05 and 0.095 quantiles for the regression coefficients. This is a test at 10%, does this interval include 0?

Fork me on GitHub