This homework builds on what we studied in class. We are going to simulate from the very simple model of labor supply we considered.

The agent problem is

\[ \max_{c,h,e} c - \beta \frac{h^{1+\gamma}}{1+\gamma}\\ \text{s.t. } c = e \cdot \rho \cdot w\cdot h +(1-e)\cdot r \cdot h \] The individual takes his wage \(w\) as given, he chooses hours of work \(h\) and consumption \(c\). He also chooses whether to work in the labor force or to work at home where he has an equivalent wage \(r\).

Simulating data

We are going to simulate a data set where agents will choose participation as well as the number of hours if they decide to work. This requires for us to specify how each of the individual specific variables are drawn. We then set the following:

\[ \begin{align*} \log W_i &= \eta X_i + Z_i + u_i \\ \log R_i &= \delta_0 + \log(W_i) + \delta Z_i + \xi_i \\ \log \beta_i &= X_i +\epsilon_i + a \xi_i \\ \end{align*} \]

and finally \((X_i,Z_i,\epsilon_i,u_i,\xi_i)\) are independent normal draws. Given all of this we can simulate our data.

Question 1 What does the \(a\) parameter capture here?

p  = list(gamma = 0.8,beta=1,a=1,rho=1,eta=0.2,delta=-0.2,delta0=-0.1,nu=0.5) # parameters
N=10000  # size of the simulation
simdata = data.table(i=1:N,X=rnorm(N))

# simulating variables
simdata[,X := rnorm(N)]
simdata[,Z := rnorm(N)]
simdata[,u := rnorm(N)]
simdata[,lw := p$eta*X  + Z + 0.2*u ]  # log wage

simdata[,xi := rnorm(N)*0.2]
simdata[,lr := lw + p$delta0+ p$delta*Z + xi]; # log home productivity

simdata[,eps:=rnorm(N)*0.2]
simdata[,beta := exp(p$nu*X  + p$a*xi + eps)]; # heterogenous beta coefficient

# compute decision variables
simdata[, lfp := log(p$rho) + lw >= lr] # labor force participation
simdata[, h   := (p$rho * exp(lw)/beta)^(1/p$gamma)] # hours
simdata[lfp==FALSE,h:=NA][lfp==FALSE,lw:=NA]
simdata[,mean(lfp)]

We have now our simulated data.

Question 2 Simulate data with \(a=0\) and \(a=1\). Comment on the value of the coefficient of the regression of log hours on log wage and X.

Heckman correction

As we have seen in class, Heckman (74) offers a way for us to correct the our regression in order to recover our structural parameters. We need to understand how the error term in the hour regression correlates with the labor participation decision.

Question 3 Following what we did in class, and using the class note, derive the expression for the Heckman correction term as a function of known parameters. In other words, derive \(E( a \xi_i + \epsilon_i | lfp=1)\).

Construction of this epxression requires us to recover the parameters \(\delta/\sigma_xi,\delta_0/\sigma_xi\). We can get these by running a probit of participation on \(Z_i\).

fit2 = glm(lfp ~ Z,simdata,family = binomial(link = "probit"))

Question 4 Check that the regression does recover correctly the coefficients. Use them to construct the inverse Mills ratio. Use the correction you created and show that the regression with this extra term delivers the correct estimates for \(\gamma\) even in the case where \(a\neq 0\).

Repeated cross-section

Lastly we want to replicate the approach of Blundell, Duncan and Meghir (1998) paper . The paper relies on the assumption that the endogeneity term can be expressed as a the sum of a term which is group specific and a term which is time specific. To achieve this structure we simply change the distribution of the \(\xi\) and make it an exponential, to be precise, assume that \(-\xi\) is distributed according to an exponential.

Question 5 Given that for the exponential distribution we have that \(E[ X | X > a] = 1+ a\), express the new conditional mean of \(\xi\) in the regression of log-hours on log-wages and X. Show that it can be writen at a linear term in \(\delta_0\) and \(Z_i\).

In estimation however we do not want to stricly rely on the linearity in \(Z\) and hence we are going to follow the procedure in the paper and bin the \(Z\) variable and rely on variation in time. The following code will simulate two cross-sections of data (not a panel). The difference from period 1 to 2 will be reflected in a change in the return to \(X\) and a change in the intercept of the labor force participation intercept.

simulate <- function(p,N,t) {
  simdata = data.table(i=1:N,X=rnorm(N))
  
  # simulating variables
  simdata[,X := rnorm(N)]
  simdata[,Z := rnorm(N)]
  simdata[,u := rnorm(N)]
  simdata[,lw := p$eta*X + Z + 0.2*u ]  # log wage
  
  simdata[,xi := -rexp(N)]
  simdata[,lr := lw + p$delta0 + p$delta*Z + xi]; # log home productivity
  
  simdata[,eps:=rnorm(N)*0.2]
  simdata[,beta := exp(p$nu*X  + p$a*xi +  eps)]; # heterogenous beta coefficient
  
  # compute decision variables
  simdata[, lfp := log(p$rho) + lw >= lr] # labor force participation
  simdata[, h   := (p$rho * exp(lw)/beta)^(1/p$gamma)] # hours
  
  # make hours and wages unobserved in case individual doesn't work
  simdata[lfp==FALSE,h:=NA][lfp==FALSE,lw:=NA]
  simdata[,mean(lfp)]
  
  # store time
  simdata[, t := t]
  return(simdata)
}

p  = list(gamma = 0.8,beta=1,a=1,rho=1,eta=0.2,delta=-1.0,delta0=-0.1,nu=0.5) # parameters
N=50000  # size of the simulation

# simulate period 1 
sim1  = simulate(p,N,1); 
p$eta = 0.4; p$delta0 = -1.1
# simulate period 2 with different eta and delta0 (incuding intercept shifts and variation in wages)
sim2 = simulate(p,N,2); 
simdata = rbind(sim1,sim2) # combine the two periods

Question 6 Run the regression of log-hours on log-wages and X how describe the bias in the recovered parameters. Similary compute the inverse Mills Ratio, include in the regression and comment on the value of the estimate parameters. Is it still biased?

Question 7 Finally slice the \(Z\) variables into deciles (or more splits) and regress log-hours on log-wage, X, dummies for each of the bins, and time \(t\). This mimics the procedure of Blundell Duncan Meghir, does this recovers a coefficient closer to the estimand of interest?