The IV-Probit: margins and ml

Story of marginal effects

Introduction

In previous articles, I have shown how to use the margins command after ml for the linear regression model (assuming normality) and the probit model. Today, I’ll provide another example for a more complex case: a two-equation model estimation, specifically the “IVprobit,” or instrumental variable probit.

This model has gained notoriety due to its inconsistent estimated marginal effects, which have been the subject of multiple threads on Statalist. The issue comes from different versions of “marginal effects” produced by margins in different Stata versions. Each version is correct from a computational standpoint, but not always consistent with common sense.

My assessment, backed up by discussions with colleagues, is that:

  • Stata 13 estimated the correct marginal effects for the IVprobit MLE but not for the two-step approach.

  • Stata 14 and 15 estimated the Full Information Marginal Effect, which is technically correct but contradicts common sense. Prof Wooldridge has extensively discussed this issue and advocates for either MLE or two-step marginal effects.

  • Stata 16 released an update that now produces something equivalent to the two-step marginal effect for MLE estimation, likely due to recent discussions.

No changes that I am aware came in Stata 17, except perhaps that we still do not have marginal effects for a two-step approach.

Without further ado, let me present my own version of how to estimate an IVprobit model and how to set up the predict program. I’ll also explain why there were so many types of marginal effects in use.

The Setup

First thing first. Unless you already have this program saved somewhere in your accesible ado files (most likely the “ado/personal” folder), make sure to have the following program in memory. It will allow you to add or modify information to e(), which is where all estimation commands store information.

clear all
program adde, eclass
    ereturn `0'
end

Let’s start with some Econometrics 101: The IVprobit model is a nonlinear model that is useful when you have a binary dependent variable (0-1) but one or more of your controls suffer from endogeneity. In this case, you want to estimate the “probability of success (y=1)” given a set of characteristics, but the characteristics of insterest is a continuous but endogenous explanatory.

Now, when a variable is endogenous, we cannot estimate the model and interpret the results as causal effects. This happens because changes in the endogenous variable can happen at the same time as changes in unobserved components. Therefore, if the outcome changes, we cannot tell if it is because the endogenous variable changed or because the unobservables changed (as they are, after all, correlated).

To deal unobserved confounders, we have two options:

  1. We can use instruments to isolate the exogenous variation of the variable of interest (using the 2SLS approach, for example)

  2. Use the instruments to obtain an approximation of the endogenous component that we can control for directly (Control function approach).

In fact, IV-probit is the application of the latter: a control function approach.

Formally, IVprobit model can be written as follows:

\[ y_2 = z_1 \delta_1 + z_2 \delta_2 + u_2 \tag{1}\]

\[ y_1^*= z_1 \beta_1 + y_2 \beta_2 + u_1 \tag{2}\]

\[ y_1 = 1(y_1^*>0) \]

where the errors \(u_1,u_2\) follow a bivariate normal distribution:

\[ \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim Normal \begin{pmatrix} \begin{matrix} 0 \\ 0 \end{matrix}, \begin{matrix} 1 & \rho \sigma_2\\ \rho \sigma_2 & \sigma^2_2 \end{matrix} \end{pmatrix} \]

In this model \(z_1\) and \(z_2\) are exogenous variables, \(z_2\) is a set of instruments, and \(y_2\) is a continuous but endogenous variable in the model. Finally we do not observe the latent variable \(y_1^*\), but instead observe \(y_1\) which only takes values of 0 or 1. So how do we estimate this model?? by parts!

The Equation 1 can be estimated directly, because it is a function of exogenous variables only. Thus, we could estimate that equation using standard OLS (as ivprobit-two-step does), or via MLE assuming the normality of the errors.

The one that requires more attention is Equation 2. We know that \(corr(y_2,u_1)\) is different from zero, which is the cause of the endogeneity of \(y_2\). We could, however, decompose \(u1\) into two parts. One that contains the endogenous component, and one that is exogenous and uncorrelated with all other variables.

To do this, we should first recall that if \(u_1, u_2\) follow a bivariate normal distribution, then, conditional on \(u_2\), \(u_1\) will have the following distribution:

\[ u_1 \sim N\left(\rho \frac{u_2}{\sigma_2}, {1-\rho^2} \right) \tag{3}\]

which implies, we could write \(u_1\) as follows:

\[ u_1 = \rho \frac{ u_2}{ \sigma_2} + v_1 \tag{4}\]

In this case, \(v_1\) will be, by construction, uncorrelated with \(u_2\) or with \(y_2\). So, we if substitute Equation 4 into Equation 1, we obtain:

\[ y_1^* = z_1 \beta_1 + y_2 \beta_2 + \rho \frac{ u_2}{ \sigma_2} + v_1 \tag{5}\]

This equation can now be estimated directly, assuming we observe \(u_2\). However, to be estimated with a probit model, we also need to rescale the equation so that the re-scaled error \(v_1\) has a variance of 1.

Based on Equation 3, we know \(v_1\) has a variance of \(1-\rho^2\), so we just need to divide all terms in Equation 5 by \(\sqrt{1-\rho^2}\) and estimating the following model (or its simplification) using standard probit model. \[ \frac{y^*_1}{\sqrt{1-\rho^2}} = z_1 \frac{\beta_1}{\sqrt{1-\rho^2}} + y_2 \frac{\beta_2}{\sqrt{1-\rho^2}} + \frac{\rho}{\sqrt{1-\rho^2}} \frac{u_2}{\sigma_2} + \frac{v_1}{\sqrt{1-\rho^2}} \tag{6}\] \[ y^{**}_1 = z_1 \beta^r_1 + y_2 \beta^r_2 + \theta \frac{u_2}{\sigma_2} + v_1 \tag{7}\]

What is the difference between using either equation?. I would argue none, as long as you know how to estimate the standard errors from the system.

The actual estimation

Let’s discuss the different methods that can be used to estimate the ivprobit model. There are at least three ways to do so.

The first method is the two-step approach. In this method, one estimates Equation 1 using OLS, obtains the predicted residuals, which are plugged into equation Equation 7. This can be estimated using a simple probit model.

This method has two problems:

  • It only provides estimates for the “rescaled” coefficients, not the structural coefficients.

  • It will not provide you with the correct estimation of standard errors, because it will not consider the residuals are carrying over errors from the first step.

Some textbooks suggest that doing so is a simple application of the delta method, or use bootstrap. But, the fact of the matter, is that you need to take into account that the residuals from Equation 1 are estimated not the true residuals \(u_2\).

The second method is to estimate Equation 1 and Equation 5 simultaneously using full information maximum likelihood. This imposes the assumption that the errors follow a bivariate normal distribution, and allows you to obtain estimates for the structural parameters, in addition to the “link” parameters \(\sigma_2\) and \(\rho\), providing correct standard errors.

Under this strategy, the contribution of a single observation to the likelihood function becomes:

\[ \begin{aligned} L_i &= L_i^1*L_i^2 \\ L_i^1 &= \phi(y_2,z_1\delta_1 + z_2 \delta_2, \sigma_2) \\ \hat P(y_1|.) &= \Phi \left( \frac{ z_1 \beta_1 + y_2 \beta_2 + \rho \frac{y_2 -z_1 \delta_1 - z_2 \delta_2}{\sigma_2}} {\sqrt{1-\rho^2}} \right) \\ L^2_i & = P(y_1|.)^{y_1} * (1-P(y_1|.))^{1-y_1} \end{aligned} \]

Notice that instead of plugging in \(\hat u_2\) in the probit equation, I explicitly add \(y_2 -z_1 \delta_1 - z_2 \delta_2\). This allows to explicilty account for the measurement errors of the first stage.

There is a third option, which I will call two-step-mle. I call it this way, because the ivprobit will be estimated using Equation 1 and Equation 7. However, I call it MLE, because both equations are estimated simultaneously using MLE:

\[ \begin{aligned} L_i &= L_i^1*L_i^2 \\ L_i^1 &= \phi(y_2,z_1\delta_1 + z_2 \delta_2, \sigma_2) \\ \hat P(y_1|.) &= \Phi \left( z_1 \beta^r_1 + y_2 \beta^r_2 + \theta \frac {y_2 -z_1 \delta_1 - z_2 \delta_2}{\sigma_2} \right) \\ L^2_i & = P(y_1|.)^{y_1} * (1-P(y_1|.))^{1-y_1} \end{aligned} \tag{8}\]

The difference with the standard FIML, is that only rescaled coefficients are estimated, and that the link between both equation is \(\theta\) not \(\rho\). Nevertheless, For this simplified example, both equations identify exactly the same model. If you are interested in this type of use of -ml- see my paper Rios-Avila and Canavire-Bacarreza (2018).

Compared to the usual two-step approach, however, because the model is estimated simultaneously, the standard errors of all coefficients are correctly estimated, without further calculations (no delta method nor bootstrap).

One last thing to notice in this model. First, there is a close relationship between \(\theta\) and \(\rho\), which will affect the rescaled parameters: \[ \theta = \frac{\rho}{\sqrt{1-\rho^2}} \rightarrow \rho = \frac{\theta}{\sqrt{1+\theta^2}} \] \[ \beta = \beta^r \times \sqrt{1-\rho^2} = \frac{\beta^r}{\sqrt{1+\theta^2}} \]

The Log Likelihood function

While I have shown that using FIML and two-step-ml will provide the same results, I’ll stick with the two-step approach, as it allows me to derive marginal effects telling the story of what happened to -margins- through different Stata versions.

The following program defines this the log-likelihood function for the IV probit, using the two-step approach (Equation 8), using the following walk-through for the specification:

  • xb will contain all the exogenous variables \(z_1\) plus the endogenous variable \(y_2\)
  • zb will contain all the exogenous variables \(z_1\) and the instruments \(z_2\)
program myivprobit_2sls
    args lnf xb theta zb lnsigma
    qui {
        local y1 $ML_y1
        local y2 $ML_y2
        local u2 (`y2'-`zb')
        tempvar xb_zb p1 p0
        gen double `xb_zb'= `xb'+`theta'*((`u2')/exp(`lnsigma')) 
        gen double `p1'   = normal( `xb_zb')
        gen double `p0'   = normal(-`xb_zb')
        tempvar lnf1 lnf2
        gen double `lnf1'  = log(normalden(`y2', `zb', exp(`lnsigma')))
        gen double `lnf2' = log(`p1') if `y1'==1
        replace    `lnf2' = log(`p0') if `y1'==0
        replace `lnf' = `lnf1' + `lnf2'
    }
end    

The predict program

So finally, the part that will be a bit more controversial. The prediction of the probability of success!.

The reason why this is controversial is because there are two candidates to identify this expression.

The first candidate relates to the structural Equation 2. Basically, if we can estimate the unscaled coefficients, the predicted outcome could be identified by:

\[ P(y_1=1| z_1 , y_2) = \Phi \left( z_1 \beta_1 + y_2 \beta_2 \right) \]

or if one prefers the version based on rescaled coefficients: \[ P(y_1=1| z_1 , y_2) = \Phi \left( z_1 \beta^r_1 * {\sqrt{1-\rho^2}} + y_2 \beta^r_2 * {\sqrt{1-\rho^2}} \right) \]

Thus, marginal effects can be obtained by analyzing either one of these equations alone. Standard errors for this expression can be identified directly only if we estimate the structural equation using FIML, or using the rescaled coefficients, making sure standard errors are calculated acounting for the estimation errors of the first stage.

The second option relates to estimate the marginal effects using Equation 7:

\[ P(y_1=1| z_1 , y_2, \hat{u}_2 ) = \Phi \left( z_1 \beta^r_1 + y_2 \beta^r_2 + \theta \hat{u}_2 \right) \tag{9}\]

However, because \(hat u_2\) is never observed, it is usually recommended to average (but not ignore) the impact of \(\hat u_2\) on the equation:

\[ P(y_1=1| z_1 , y_2) = E \left( \Phi \left( z_1 \beta^r_1 + y_2 \beta^r_2 + \theta \hat{u}_2 \right)| z_1, y_2 \right) \tag{10}\]

The bottom line: If one uses the two-step approach, marginal effects could be estimated assuming \(\hat u_2\) is just another exogenous variable in the model. The difficulty would be obtaining the correct estimation of standard errors.

So lets write these two options into a “predict” program.

program myivprobit_p
    syntax newvarname [if] [in] , [ pr1 pr2  *]
    if "`pr1'`pr2'" =="" {
        ml_p `0'
    }
    tokenize `e(depvar)'
    local y1  `1'
    local y2  `2'
    marksample touse, novarlist
    if "`pr1'" !=""  {
        tempvar xb zb theta lnsigma
        _predict double `xb'   , eq(#1)
        _predict double `theta', eq(#2)
        _predict double `zb'   , eq(#3) 
        _predict double `lnsigma', eq(#4)       
        gen `typlist' `varlist' = ///
            normal(`xb'+`theta'*(`y2'-`zb')/exp(`lnsigma')) if `touse'
        label var `varlist' "P(y=1|X) two-step"
    }       
    else if "`pr2'"!="" {
        tempvar xb zb theta lnsigma
        _predict double `xb' , eq(#1)
        _predict double `theta'  , eq(#2)
        gen `typlist' `varlist' = ///
            normal(`xb'/sqrt(1+`theta'^2)) if `touse'
        label var `varlist' "P(y=1|X) FIML"
    }
end

The first option pr1 will estimate the predicted probability as if the model were estimated using the two-step approach, whereas the second will estimate the predicted probability based on the structural equation.

Alright, so lets estimate the model and compare the results with the built-in ivprobit command:

clear  
webuse laborsup
global  y1   fem_work
global  z1   fem_educ   kids  
global  y2   other_inc
global  z2   male_educ   

*Built in command:
ivprobit $y1  $z1 ($y2 = $z2), two
Checking reduced-form model...

Two-step probit with endogenous regressors        Number of obs   =        500
                                                  Wald chi2(3)    =      93.97
                                                  Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   other_inc |   -.058473   .0093364    -6.26   0.000    -.0767719    -.040174
    fem_educ |    .227437   .0281628     8.08   0.000     .1722389     .282635
        kids |  -.1961748   .0496323    -3.95   0.000    -.2934522   -.0988973
       _cons |   .3956061   .4982649     0.79   0.427    -.5809752    1.372187
------------------------------------------------------------------------------
Wald test of exogeneity: chi2(1) = 6.50                   Prob > chi2 = 0.0108
Instrumented: other_inc
 Instruments: fem_educ kids male_educ
*my ivprobit two-step
ml model lf myivprobit_2sls ($y1 = $z1  $y2 ) ///
         (theta:) ($y2 = $z1 $z2  ) (lnsigma:) , ///
         technique(nr bhhh) init(lnsigma:_cons = 2.81 ) maximize nolog
ml display

                                                        Number of obs =    500
                                                        Wald chi2(3)  =  94.01
Log likelihood = -2368.2062                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
eq1          |
    fem_educ |    .227437   .0281561     8.08   0.000     .1722519     .282622
        kids |  -.1961748   .0496179    -3.95   0.000    -.2934242   -.0989254
   other_inc |  -.0584729   .0093339    -6.26   0.000    -.0767671   -.0401788
       _cons |   .3956051   .4981099     0.79   0.427    -.5806724    1.371883
-------------+----------------------------------------------------------------
theta        |
       _cons |   .4008081   .1626174     2.46   0.014     .0820838    .7195323
-------------+----------------------------------------------------------------
eq3          |
    fem_educ |   .3351867   .2825972     1.19   0.236    -.2186937     .889067
        kids |   .8329056   .5475666     1.52   0.128    -.2403052    1.906116
   male_educ |   2.845253    .282746    10.06   0.000     2.291081    3.399425
       _cons |   9.872559   5.029193     1.96   0.050     .0155214     19.7296
-------------+----------------------------------------------------------------
lnsigma      |
       _cons |   2.813383   .0316228    88.97   0.000     2.751404    2.875363
------------------------------------------------------------------------------

You can see right away that except for differences attributed to rounding errors and degrees of freedom, the results are virtually the same. It is also reasuring to see that the results are also the same when we compared ivprobit-mle and the rescaled coefficients:

*FIML
ivprobit $y1  $z1 ($y2 = $z2), ml nolog

Probit model with endogenous regressors                 Number of obs =    500
                                                        Wald chi2(3)  = 163.88
Log likelihood = -2368.2062                             Prob > chi2   = 0.0000

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
    other_inc |  -.0542756   .0060854    -8.92   0.000    -.0662028   -.0423485
     fem_educ |    .211111   .0268648     7.86   0.000     .1584569    .2637651
         kids |  -.1820929   .0478267    -3.81   0.000    -.2758315   -.0883542
        _cons |   .3672086   .4480724     0.82   0.412    -.5109971    1.245414
--------------+----------------------------------------------------------------
corr(e.othe~c,|
   e.fem_work)|   .3720375   .1300518                      .0946562    .5958136
sd(e.other_~c)|   16.66621   .5270318                      15.66461    17.73186
-------------------------------------------------------------------------------
Wald test of exogeneity (corr = 0): chi2(1) = 6.70        Prob > chi2 = 0.0096
Instrumented: other_inc
 Instruments: fem_educ kids male_educ
*my ivprobit two-step
ml model lf myivprobit_2sls ($y1 = $z1  $y2 )  (theta:) ($y2 = $z1 $z2  ) (lnsigma:) , ///
    technique(nr bhhh)   init(lnsigma:_cons = 2.81 ) maximize nolog
adde local predict myivprobit_p                 
est store myivp
*with rescaled coefficients:
nlcom   (other_inc: _b[other_inc]/sqrt(1+_b[theta:_cons]^2)) ///
        (fem_educ: _b[fem_educ]/sqrt(1+_b[theta:_cons]^2)) ///
        (kids: _b[kids]/sqrt(1+_b[theta:_cons]^2)) ///
        (cons: _b[_cons]/sqrt(1+_b[theta:_cons]^2)) 

   other_inc: _b[other_inc]/sqrt(1+_b[theta:_cons]^2)
    fem_educ: _b[fem_educ]/sqrt(1+_b[theta:_cons]^2)
        kids: _b[kids]/sqrt(1+_b[theta:_cons]^2)
        cons: _b[_cons]/sqrt(1+_b[theta:_cons]^2)

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   other_inc |  -.0542756   .0060854    -8.92   0.000    -.0662027   -.0423485
    fem_educ |    .211111   .0268648     7.86   0.000      .158457    .2637651
        kids |  -.1820929   .0478267    -3.81   0.000    -.2758316   -.0883543
        cons |   .3672077   .4480724     0.82   0.412    -.5109982    1.245414
------------------------------------------------------------------------------

Again, showing exactly the same results

A Story of marginal effects

Let me now walk you through the Story of marginal effects with ivprobit.

Stata 13

Back in Stata 13, marginal effects for IV probit were estimated using the structural equation coeffients: \[ P(y_1=1|z_1,y_2)=\Phi(z_1\beta_1+y_2\beta_2) \]

So that marginal effects were defined as: \[ \begin{aligned} \frac{\partial P(y_1=1|.)}{\partial z_1} = \phi ( z_1 \beta_1 + y_2 \beta_2 ) \beta_1 \\ \frac{\partial P(y_1=1|.)}{\partial y_2} = \phi ( z_1 \beta_1 + y_2 \beta_2 ) \beta_2 \end{aligned} \]

Note

As you may have noticed, I’m rewritting few of my older posts using Quarto. So, I can only use Stata17 dynamically. Because of that the code you will see below will not be reproducible, unless you have the same Stata version

If you have access to Stata 13, you will be able to reproduce the following output:

margins, dydx(*) predict(pr)

Average marginal effects                          Number of obs   =        500
Model VCE    : OIM

Expression   : Probability of positive outcome, predict(pr)
dy/dx w.r.t. : other_inc fem_educ kids male_educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   other_inc |   -.014015   .0009836   -14.25   0.000    -.0159428   -.0120872
    fem_educ |   .0545129   .0066007     8.26   0.000     .0415758      .06745
        kids |  -.0470199   .0123397    -3.81   0.000    -.0712052   -.0228346
   male_educ |          0  (omitted)
------------------------------------------------------------------------------

This marginal effects are emulated using pr2 after myivprobit:

est restore myivp
margins, dydx(*) predict(pr2) force
(results myivp are active now)
note: prediction is a function of possibly stochastic quantities other than
      e(b).

Average marginal effects                                   Number of obs = 500
Model VCE: OIM

Expression: P(y=1|X) FIML, predict(pr2)
dy/dx wrt:  fem_educ kids other_inc male_educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0545129   .0066003     8.26   0.000     .0415766    .0674493
        kids |  -.0470199   .0123394    -3.81   0.000    -.0712047   -.0228351
   other_inc |   -.014015   .0009837   -14.25   0.000    -.0159431   -.0120869
   male_educ |          0  (omitted)
------------------------------------------------------------------------------

You will see output from margins also include male_educ in the list of exogenous variables. This happens because it is an explanatory variable for at least one equation in the model (first). However, because this variable is not included in the second equation, it has a no effect on it.

Stata 14.1

When we reached Stata 14.1, a change was introduced in how probabilities were calculated after ivprobit. As it says in the “whatsnew” material, the new formulation would take into account endogeneity.

Specifically, they use what I call the 2sls predicted probabilities, following equation (5) wth the caveat that \(\hat u_2\) was substituted by \(y_2 - z_1 \delta_1 -z2\delta_2\):

\[ P(y_1=1|z_1,y_2,z_2) =\Phi \left( \frac{z_1 \beta_1 + y_2 \beta_2 + \rho \frac{y_2 - z_1 \delta_1 - z_2 \delta_2}{\sigma_2}} {\sqrt{1-\rho^2}} \right) \tag{11}\]

While the two equations above are basically the same, they have important differences when marginal effects are estimated by software. Specifically, the probability of sucess is now a function of \(z_2\)!.

So let me explain first what Stata 14.1, did. To estimate marginal effects, partial derivatives were based on Equation 11:

\[ \begin{aligned} \frac{\partial P(y=1|.)}{\partial z_1} &= \phi(.)*\left( \frac{\beta_1}{\sqrt{1-\rho^2}} -\frac{\rho}{\sqrt{1-\rho^2}} * \frac{\delta_1}{\sigma_2} \right) \\ \frac{\partial P(y=1|.)}{\partial z_2} &= \phi(.)*\left( 0 -\frac{\rho}{\sqrt{1-\rho^2}} * \frac{\delta_2}{\sigma_2} \right) \\ \frac{\partial P(y=1|.)}{\partial y_2} &= \phi(.)*\left( \frac{\beta_2}{\sqrt{1-\rho^2}} +\frac{1}{\sqrt{1-\rho^2}} * \frac{1}{\sigma_2} \right) \end{aligned} \]

From the technical point of view, these partial derivatives are correct, since they are capturing both the direct and indirect effects of all variables on the probability of success. Something similar to total, rather than partial, derivative.

The problem, however, is that this assumes we could actually observe how the unobserved component changes when other variables change. Standard regression analysis, however, would say that these unobserved components should be considered as fixed, and instead one should estimate marginal effects averaging over the unobserved factors. Thus, the second term on each one of the above derivatives should be zero.

Nevertheless, if you try estimating marginal effects with Stata 14.2, you will get the following result:

margins, dydx(*) predict(pr)

Average marginal effects                        Number of obs     =        500
Model VCE    : OIM

Expression   : Probability of positive outcome, predict(pr)
dy/dx w.r.t. : other_inc fem_educ kids male_educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   other_inc |  -.0097802   .0014994    -6.52   0.000     -.012719   -.0068414
    fem_educ |   .0623273    .007099     8.78   0.000     .0484135     .076241
        kids |  -.0614265   .0139446    -4.41   0.000    -.0887574   -.0340956
   male_educ |  -.0194406   .0022103    -8.80   0.000    -.0237728   -.0151084
------------------------------------------------------------------------------

To replicate this using myivprobit, I would estimate marginal effects using option pr1, requesting derivates to be estimated without the chain rule (nochain). This makes sure that one takes into account the effect of all changes in \(y_2, z_1\) and \(z_2\) on the predicted outcome \(P(y=1|.)\):

est restore myivp
margins, dydx(*) predict(pr1) force nochain
(results myivp are active now)
note: prediction is a function of possibly stochastic quantities other than
      e(b).

Average marginal effects                                   Number of obs = 500
Model VCE: OIM

Expression: P(y=1|X) two-step, predict(pr1)
dy/dx wrt:  fem_educ kids other_inc male_educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0623273   .0060316    10.33   0.000     .0505055     .074149
        kids |  -.0614265   .0128383    -4.78   0.000    -.0865891   -.0362639
   other_inc |  -.0097802   .0009828    -9.95   0.000    -.0117065   -.0078539
   male_educ |  -.0194406   .0074886    -2.60   0.009    -.0341181   -.0047631
------------------------------------------------------------------------------

Stata 16

The earlier version of Stata 16 came with very similar problems as the ones mentioned above. However, due in part to an earlier version of this article, in November of 2020, Stata made a correction in how marginal effects were estimated for ivprobit as well as other related commands (see update 19nov2020).

In this update, they change the default option and now produces the correct marginal effects, assuming the predicted errors are fixed.

Let’s take a closer look at the findings. First, it appears that Stata 14, 15, and early 16 versions were unintentionally estimating a partial effect that accounted for a second-order effect through the first stage regression. While this might have a negligible effect on the exogenous variables, it could have a considerable impact on the endogenous variable of interest, resulting in some people reporting negative marginal effects even when the estimated coefficient was positive. Additionally, the instrument, in this case male_educ, would also appear in the output, capturing only a second-order effect on the outcome of interest.

However, after a lively discussion on Statalist, including input from Prof. Wooldridge, it was revealed that Stata (and margins) was incorrectly estimating marginal effects. As shown here, partial derivatives were being estimated through the first and second equations, leading to incorrect results.

Prof. Wooldridge recommended a manual two-step approach for estimating marginal effects, with standard errors obtained via bootstrap, using Equation 9 to estimate the partial effects. This makes a difference because we will be making the explicit assumption that \(\hat u_2\) does not change when the other variables change. This will modify the partial effects to the following:

\[ \begin{aligned} \frac{\partial P(y=1|.)}{\partial z_1} &= \phi(.)*\left( \frac{\beta_1}{\sqrt{1-\rho^2}} \right) \\ \frac{\partial P(y=1|.)}{\partial z_2} &= \phi(.)* 0 \\ \frac{\partial P(y=1|.)}{\partial y_2} &= \phi(.)* \left( \frac{\beta_2}{\sqrt{1-\rho^2}} \right) \end{aligned} \]

The differences with the “structural” marginal effects are that the evaluation of \(\phi(.)\) includes the predicted values of the errors (\(\hat u_2\)), and that coefficients used correspond to the two-step procedure ones (rescaled).

To show empirically how this works, we can compare the builtin command, with the “two-step” procedure suggested by Prof. Wooldridge:

* two step procedure
* 1st
qui: reg $y2 $z1 $z2
predict double u2, resid
* 2nd
qui: probit $y1 $z1 $y2 u2, nolog

margins, dydx(*) predict(pr)

Average marginal effects                                   Number of obs = 500
Model VCE: OIM

Expression: Pr(fem_work), predict(pr)
dy/dx wrt:  fem_educ kids other_inc u2

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0646175   .0060271    10.72   0.000     .0528045    .0764304
        kids |  -.0557355   .0129302    -4.31   0.000    -.0810782   -.0303929
   other_inc |  -.0166128   .0022499    -7.38   0.000    -.0210225   -.0122032
          u2 |   .0068326    .002632     2.60   0.009     .0016741    .0119912
------------------------------------------------------------------------------

The standard errors here will not be correct, but bootstrap could be applied to obtain corrected standard errors.

With the correction to estimation of marginal effects pushed in Novenber of 2020, we can produce the correct point estimates for marginal effects, which follows Prof Wooldrige suggestion, and my discussion presented here.

** built-in command
qui:ivprobit $y1  $z1 ($y2 = $z2), ml
margins, dydx(*) predict(pr)

Average marginal effects                        Number of obs     =        500
Model VCE    : OIM

Expression   : Average structural function probabilities, predict(pr)
dy/dx w.r.t. : other_inc fem_educ kids

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   other_inc |  -.0166128   .0012889   -12.89   0.000     -.019139   -.0140867
    fem_educ |   .0646175   .0073529     8.79   0.000      .050206     .079029
        kids |  -.0557355   .0144233    -3.86   0.000    -.0840047   -.0274664
------------------------------------------------------------------------------

So you can see that the two-step approach and the built-in approach now provide the same marginal effects. And since the official command estimates all coefficients simultaneously, the standard errors can be taken as correct (more on that later).

So how can we correct for this with our predict program. Since there is nothing to prevent margins to obtain numerical derivatives across both equations, we need to modify the specification slighly. First, we create clone copies of all variables that enter the second stage: z_1 and y_2, and use them for the model estimation:

clonevar c_other_inc = other_inc
clonevar c_fem_educ  = fem_educ
clonevar c_kids      = kids
global  y2b c_other_inc
global  z1b c_fem_educ c_kids 

ml model lf myivprobit_2sls ($y1 = $z1  $y2 )  (theta:) ($y2b = $z1b $z2  ) (lnsigma:) , ///
            technique(nr bhhh)   init(lnsigma:_cons = 2.81 ) maximize nolog
adde local predict myivprobit_p                 
est sto myivp   

The idea of using “clones” of the exogenous variables and endogenous one is to have access to the same information as the original data, but making sure they do not change when the original data changes.

Marginal effects can be calculated as I did before, except that I now make it explicit to request marginal effects with respect to \(z_1\) and \(y_2\) only.

est restore myivp       
margins, dydx($z1 $y2) predict(pr1) force 
(results myivp are active now)
note: prediction is a function of possibly stochastic quantities other than
      e(b).

Average marginal effects                                   Number of obs = 500
Model VCE: OIM

Expression: P(y=1|X) two-step, predict(pr1)
dy/dx wrt:  fem_educ kids other_inc

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0646175    .006331    10.21   0.000     .0522089    .0770261
        kids |  -.0557355   .0134693    -4.14   0.000    -.0821349   -.0293362
   other_inc |  -.0166128   .0023499    -7.07   0.000    -.0212185   -.0120072
------------------------------------------------------------------------------

And done!. We have been able to reproduce the second version, two-step, marginal effects for the instrumental variable probit model, that follows the two-step approach advocated by Prof. Wooldridge, and officially included in Stata 16 and above.

There is only one last perky detail. If you look at the marginal effect standard errors I produce with the myivprobit command, and compare it with the marginal effects the ivprobit command produces, you will notice they are different.

The reason for this was that, based on unofficial words from the developers, at the time:

the current formulation assumes \(\rho\) and \(\sigma\) to be constant, when standard errors are obtained.

While this may seem incorrect, I understand the intuition behind this idea.

If you recall the estimation of marginal effects from the structural equation, it is not affected by \(\rho\) nor \(\sigma\). Perhaps this was one of the reasons why the estimated standard errors (Nov2020) are so similar to the ones based on the “old” structural marginal effects.

My own command, however, accounts for the uncertainty in these parameter. This also seems correct since two-step marginal procedures are expected to be less efficient than the Full Information counterparts.

Of course, if you prefer to have a tie-breaker on which one is correct, I can use a Bootstrap procedure to produce the elusive standard errors. Basically, I’ll use the manual two-step procedure, along with a 250 bootstrap repetitions, to report the results:

program bs_ivprobit, eclass
    reg $y2 $z1 $z2
    capture drop u2
    predict double u2, resid
    probit $y1 $z1 $y2 u2
    margins, dydx(*) predict(pr) nose post
end
bootstrap , reps(250) seed(1) nodots:bs_ivprobit


Average marginal effects                                   Number of obs = 500
                                                           Replications  = 250

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0646175   .0058983    10.96   0.000      .053057     .076178
        kids |  -.0557355   .0138358    -4.03   0.000    -.0828532   -.0286179
   other_inc |  -.0166128   .0023778    -6.99   0.000    -.0212732   -.0119525
          u2 |   .0068326   .0027569     2.48   0.013     .0014292    .0122361
------------------------------------------------------------------------------

In this case, it seems that the bootstrap estimates seem to favor my version of marginal effects and standard errors!

One last change.

Seems my guess was correct!. In March of 2021, Stata pushed another update to Stata16. They have now changed how SE are estimated after margins, which now coincides with the experiment I started with.

ivprobit $y1  $z1 ($y2 = $z2), ml nolog
margins, dydx(*) predict(pr)

Probit model with endogenous regressors                 Number of obs =    500
                                                        Wald chi2(3)  = 163.88
Log likelihood = -2368.2062                             Prob > chi2   = 0.0000

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
    other_inc |  -.0542756   .0060854    -8.92   0.000    -.0662028   -.0423485
     fem_educ |    .211111   .0268648     7.86   0.000     .1584569    .2637651
         kids |  -.1820929   .0478267    -3.81   0.000    -.2758315   -.0883542
        _cons |   .3672086   .4480724     0.82   0.412    -.5109971    1.245414
--------------+----------------------------------------------------------------
corr(e.othe~c,|
   e.fem_work)|   .3720375   .1300518                      .0946562    .5958136
sd(e.other_~c)|   16.66621   .5270318                      15.66461    17.73186
-------------------------------------------------------------------------------
Wald test of exogeneity (corr = 0): chi2(1) = 6.70        Prob > chi2 = 0.0096
Instrumented: other_inc
 Instruments: fem_educ kids male_educ

Average marginal effects                                   Number of obs = 500
Model VCE: OIM

Expression: Average structural function probabilities, predict(pr)
dy/dx wrt:  other_inc fem_educ kids

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   other_inc |  -.0166128   .0023501    -7.07   0.000    -.0212189   -.0120068
    fem_educ |   .0646175   .0063309    10.21   0.000     .0522091    .0770258
        kids |  -.0557355   .0134692    -4.14   0.000    -.0821347   -.0293364
------------------------------------------------------------------------------

So there is my small contribution to Stata!

Conclusion

The command -ml- is a powerful tool that can be used to estimate single or multiple equation models, as long as the loglikelihood functions (and their inter-relations) can be properly defined.

-margins- is also a very flexible command that can be easily combined with -ml- to expand the estimation of marginal effects for properly defined outcomes. While the command is flexible and relatively easy to use, these properties can also be double-edge swords, if one is not aware of the mechanics behind the actual estimation of partial effects.

In my view, the original estimation of marginal effects after iv-probit was correct, but the changes it received in Stata 14.1 introduced what we could call a bug, that was based on solid Math. However, unless you dig deeper into what ivprobit tries to estimate, it would be difficult to say why that change produced undesirable results.

The updates pushed in Stata 16 made the necesary corrections following my suggestions, and now produces correct partial effects (two-step like), even adopting my comment regarding standard errors.

References

Rios-Avila, Fernando, and Gustavo Canavire-Bacarreza. 2018. “Standard-Error Correction in Two-Stage Optimization Models: A Quasi–Maximum Likelihood Estimation Approach.” The Stata Journal 18 (1): 206–22. https://doi.org/10.1177/1536867X1801800113.